
PSB PARIS SCHOOL OF BUSINESS

Thesis

Submitted for the degree of

Master of Science

in Data Management

Presented by:
SHIN Woohyun

TOPIC

The impact of COVID-19: To predict the breaking point of the disease from Big Data by Neural Networks

Thesis Supervisor: OMRANI Nessrine

Academic Year: 2019-2020

Abstract

Weather data are generated every second and constitute Big Data that is difficult to process on a single home computer. The supercomputers used by national weather services are also far more expensive than a cluster that connects several commodity machines in parallel. To overcome the limitations of a single machine, a cluster environment was built with the Big Data frameworks Hadoop and Spark. A deep learning prediction model was then trained on temperature data to predict the reduction point of COVID-19. The model takes the maximum temperature recorded on each day of the past decade as input and predicts the weather for 2020, in the hope of an early end to COVID-19. The predicted reduction point of COVID-19 was consistent with the actual breaking point.

Keywords: BigData, Hadoop, Spark, Deep Learning


Table of Contents

I. INTRODUCTION

II. LITERATURE REVIEW

1. Coronavirus

1.1. SARS and MERS

1.2. Seasonal Virus

1.3. 2019 Novel Coronavirus

2. Weather Prediction Model

2.1. Numerical Weather Prediction (NWP)

2.2. Deep Learning Model

3. Applied Technologies

3.1. Hadoop

3.1.1. HDFS

3.1.2. MapReduce

3.1.3. Yarn

3.2. Spark

3.2.1. Low-Level API

3.2.2. Structured API

3.2.3. Machine Learning on Spark

3.3. Docker

3.3.1. Micro-Service

3.3.2. Image and Container

3.3.3. Networks

3.3.4. Kubernetes

4. Conclusion

III. METHOD AND DATA

1. Development Environment

2. Data

2.1. Daily COVID-19 confirmed cases

2.2. Max Temperature Data

3. EDA, Exploratory data analysis

4. Prediction Modeling

4.1. Data Preprocessing

4.2. Multi-Layer Perceptron

IV. Result

V. Discussion

VI. Conclusion


I. INTRODUCTION

The coronavirus, which began to infect humans through bats, has appeared in various forms since 2003: SARS in 2003, MERS in 2012 and COVID-19 in 2019. These outbreaks have social and economic implications as well as consequences for human health. COVID-19 in particular has proved deadly, with the WHO declaring a pandemic. Since the first confirmed case of COVID-19 appeared in Wuhan, China on December 8, 2019, data from the World Health Organization (WHO) show that more than 2,314,621 confirmed cases had been reported worldwide by April 20, 2020. Among them, 157,847 people had died, a fatality rate of around 7 percent [19]. Fortunately, in some countries with well-prepared COVID-19 medical systems, the number of new confirmed cases remains in the double digits each day and continues to decline. However, the impact of COVID-19 has set back global economic and social infrastructure, and a long recovery period is expected, as was the case after the 2008 global financial crisis. Economically, the U.S. stock market plunged due to the outbreak of the new coronavirus infection, resulting in massive unemployment within a week. In Europe, many countries that share borders, from Italy to Spain, France and Germany, were closed by COVID-19. This has slowed the growth of many countries tied to the euro this year.

In contrast to this disastrous reality, I believe that these negative effects will create many new opportunities from a long-term perspective. Historically, many developments and advances have been driven by the constraints of their times. For example, during World War II there were many advances in the communication technologies we still use, and the commercialization of penicillin was one of the medical inventions that changed the world. We are now incorporating and developing technologies that existed but were not widely used so that they fit the times. Among them are remote education and telecommuting, grocery shopping over the Internet, and food delivery. These technologies existed in the past, but there was little perceived need for them and they lacked maturity until commercialization. Now, however, they are growing and adapting to the new environment created by the current situation. Therefore, through temperature- and humidity-based neural network learning, we will predict when the spread of COVID-19 will come to an end. By predicting that timing, we try to seize the economic opportunities that the new social culture brought by COVID-19 will create.

I conducted this research to answer the question of when we will be freed from the coronavirus, and when the economic recovery will begin.

First, I will discuss what the coronavirus is and how the weather can be predicted. I will also describe the technology needed to make climate predictions. That technology and the data will then be used to predict the climate and to find the point in time when the spread of the coronavirus will decrease.


II. LITERATURE REVIEW

In a 2015 TED talk, Bill Gates, founder of Microsoft and co-chair of the Bill & Melinda Gates Foundation, predicted the number of infections and the economic decline a virus could cause. He warned that such a disease could kill more than 10 million people within a few decades and that infection could spread anywhere, not just on planes or in markets. The World Bank estimated that a worldwide flu epidemic would reduce global wealth by over three trillion dollars and cause millions and millions of deaths [1].

1. CORONAVIRUS

Coronavirus (CoV) is an RNA virus with a genome of 27 to 32 kb that can infect humans and various animals [8]. There are four genera of coronavirus (Alpha, Beta, Gamma, Delta): Alpha and Beta coronaviruses can infect humans and other mammals, while Gamma and Delta coronaviruses mainly infect animals. So far, a total of six human-infecting coronaviruses have been known. There are types that cause colds (229E, OC43, NL63, HKU1) and types that can cause severe pneumonia (SARS-CoV, MERS-CoV) [8].

1.1. SARS and MERS

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which emerged in China at the end of 2019, is a coronavirus of bat origin. Phylogenetic analysis revealed that SARS-CoV-2 is 79% similar to SARS-CoV, which occurred in China in 2003, and 50% similar to MERS-CoV [4].

1.2. Seasonal Virus

Influenza, commonly known as the "flu", is an acute respiratory disease caused by the influenza virus. It is highly contagious, causing large and small epidemics around the world every year, and 10 to 20 percent of the population is usually infected within two to three weeks of its onset [9]. The most obvious symptom is a sudden high fever of 38 to 40 degrees Celsius within 24 hours of infection, accompanied by systemic symptoms such as headache, muscle aches and fatigue, and respiratory symptoms such as sore throat, cough, sputum, and rhinitis. Epidemiologically, influenza usually occurs from late fall, when winter begins, to early spring [10]. Worldwide, it is mainly seasonal influenza, with 5 to 15 percent of the population infected each year.

1.3. 2019 Novel Coronavirus

SARS-CoV-2 spreads by human-to-human transmission via droplets or direct contact, and the infection has been estimated to have a mean incubation period of 6.4 days and a basic reproduction number of 2.24 to 3.58 [9]. Many highly infectious diseases, including COVID-19, exhibit seasonal patterns. For SARS-CoV, MERS-CoV and influenza, for example, humidity and temperature are known factors in virus survival. By mid-February 2020, the majority of confirmed cases had occurred in cities between 30 and 50 degrees latitude. These cities also share similar climatic characteristics, with an average temperature of 5 to 11 degrees Celsius and humidity of 47 to 79 percent [2].

2. WEATHER PREDICTION MODEL

In January 2020, Google developed a model called Nowcast, which uses artificial intelligence to predict weather conditions such as precipitation. Nowcast shortens the analysis work, which used to take hours, to 5-10 minutes and forecasts the weather up to six hours ahead. Google researchers compared the new weather forecast model to the existing one based on past weather data from 2017 to 2019. The results showed that the new and existing models performed at similar levels of accuracy [3].

2.1. Numerical Weather Prediction (NWP)

Synoptic weather predictions (computer models based on physical equations), which forecast the weather from a few days up to two weeks ahead, have been developed under the Numerical Weather Prediction approach. These models remain the backbone of all weather forecasts. However, this approach has a critical limitation: it takes a considerable amount of time to process the data [12].

2.2. Deep Learning Model

To define deep learning, a neural network must be defined first. A neural network is a graph of nodes with weights and activation functions. Nodes are stacked into layers, and each layer is partially or completely connected to the previous layer. These simple functions can learn to recognize increasingly complex input signals as layers are added one by one. The ultimate goal of this learning is to adjust the connections between the nodes on the network, their weights, and the value of each node so that the network associates an input with a specific output value. Deep learning stacks these layers many times into a wide variety of architectures. Neural networks themselves are algorithms that were created decades ago, and for a time attention shifted to other machine learning algorithms. However, the recent availability of large datasets (Big Data), the development of powerful hardware (clusters and GPUs), and the advent of new learning algorithms have made it possible to train large neural networks that go far beyond the performance of many traditional machine learning methods.

For typical machine learning techniques, the limits of computational performance prevented good results as data grew. Deep learning, by contrast, benefits as data and information grow, and in fact requires much larger datasets than conventional machine learning algorithms. The Deep Neural Network (DNN) has now become the standard for computer vision, speech processing and some natural language processing tasks, providing better performance than models whose parameters were tuned manually, and it has been actively introduced into other areas of machine learning.

The DNN model is very effective when a lot of nonlinear data needs to be processed, as in weather forecasting problems. According to papers on various DNN algorithm models, MLP models are used the most and are considered to have high accuracy [13].

3. APPLIED TECHNOLOGIES

3.1. Hadoop

Apache Hadoop is a Java-based open-source framework that can store and process big data. The two main components of Hadoop were created to better handle large amounts of data through parallel distributed storage (HDFS) and distributed processing (MapReduce) [11]. Weather sensors collecting data every hour across the globe generate large amounts of log data, which lend themselves to analysis with MapReduce because they are semi-structured and record-oriented. Because there are tens of thousands of weather stations, the entire dataset consists of a large number of relatively small files. Due to the nature of Hadoop, a small number of large files is easier and more efficient to process [5].

The difference between Hadoop and a traditional Relational Database Management System (RDBMS) is that Hadoop can store structured, semi-structured and unstructured data, whereas an RDBMS can only store structured data. In terms of speed, an RDBMS adopts the Schema On Write approach, which has the disadvantage of being slow when data is written. Hadoop, on the other hand, adopts the Schema On Read approach, which has the advantage of being very fast when data is written. Hadoop also has the advantage of being able to form a cluster from commodity hardware rather than specialized equipment, because it builds clusters by connecting machines in parallel. This is highly useful in economic terms, because the cost of upgrading a single computer's memory or hard disk grows exponentially compared to the performance gained, whereas a parallel architecture turns added machines into a roughly linear growth in computing performance, which is why many companies are adopting the technology. Hadoop also has an advantage in network efficiency because its philosophy is to send lightweight code to where the data resides and process it there.

3.1.1. HDFS

Hadoop works with two important concepts. One of them, HDFS, is a scalable, reliable, distributed file storage system written in Java; files are stored in block units, and the system operates in a master-worker structure. The block size is the maximum amount of data that can be read or written at a time. Introducing the concept of block abstraction into a distributed file system has several benefits. The first is that a single file can be larger than the capacity of a single disk: the blocks that make up a file need not be stored on the same disk, so they can be stored on any disk in the cluster. The second is that abstracting into blocks rather than files simplifies the storage subsystem. Furthermore, blocks are well suited to the replication needed to provide fault tolerance and availability. To cope with block corruption and disk and machine failures, each block is replicated to a number of physically separated machines. If one block becomes unavailable, the client can be directed to read a copy on another machine. In Hadoop version 2, the block size is 128 MB by default, and the cluster's block size can be adjusted in the hdfs-site.xml file. The number of replicas can also be set in the hdfs-site.xml file.

An HDFS cluster operates with two types of nodes in a master-worker pattern: one Namenode, the master, and several Datanodes, the workers. The Namenode manages the namespace of the file system. It maintains metadata about the file system tree and all of the files and directories contained in it. This information is stored persistently on the local disk as two types of files, the FSimage and the Editlog. The Namenode also knows which Datanodes hold all the blocks of each file. The Datanode is the real worker of the file system: it stores and retrieves blocks at the request of a client or the Namenode, and periodically reports the list of blocks it is storing to the Namenode. In addition, the Datanode reports its available disk space, data movement, and load information to the Namenode every three seconds through a 'heartbeat', and the Namenode builds its metadata from that information.

3.1.2. MapReduce

MapReduce is a programming model for data processing. It is also the execution framework for Hadoop clusters, written in Java. To take advantage of the parallel processing offered by Hadoop, the client's requirements need to be expressed as a MapReduce job. MapReduce operations are largely divided into Map and Reduce phases, with a simple step called Shuffling between them that helps process data faster. Each step takes key-value pairs as input and output, and their types are chosen by the programmer. A MapReduce program follows a series of development procedures. First, write a unit test to implement the Map and Reduce functions and verify that they work. Next, write a driver program that runs the job, and run it in an Integrated Development Environment (IDE) on part of the dataset to verify that it works properly. The Map process converts the input data into one key-value pair per record (row). Afterwards, several values are gathered into one list per common key through the Shuffling process. Finally, the data organized as key-list pairs is returned through the Reduce step. For example, to find the highest temperature in each year from a weather station's records for 1950 to 1959, the daily weather records are first mapped to key-value pairs, giving a total of 3,650 (365 days * 10 years) records such as (1950, 0), (1950, 2), and so on. Shuffling then gathers these into one list per year: (1950, [0, 2, ...]), (1951, [2, 6, ...]), ..., (1959, [3, 8, ..., 32]). Subsequently, if an algorithm is executed in the Reduce step to find the maximum value of each list, results such as (1950, 43), (1951, 46), ..., (1959, 32) are obtained.

Hadoop provides the MapReduce API, which allows the Map and Reduce functions to be written in languages other than Java. Hadoop Streaming uses UNIX standard streams as the interface between Hadoop and user programs. Thus, users can write MapReduce programs in any language that can read standard input and write to standard output, such as Python and Ruby.
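As a hedged illustration of the max-temperature example above, a Hadoop Streaming job can be written as two small Python scripts. The input layout (a date and a temperature separated by a comma) and the file names are assumptions for this sketch, not the format of an actual weather archive.

```python
#!/usr/bin/env python3
# max_temp_mapper.py -- assumed input lines look like "1950-01-01,0"
import sys

for line in sys.stdin:
    try:
        date, temp = line.strip().split(",")
        year = date[:4]
        print(f"{year}\t{temp}")      # emit (year, temperature) pairs
    except ValueError:
        continue                      # skip malformed records
```

```python
#!/usr/bin/env python3
# max_temp_reducer.py -- receives lines grouped and sorted by year by the shuffle phase
import sys

current_year, current_max = None, None
for line in sys.stdin:
    year, temp = line.strip().split("\t")
    temp = float(temp)
    if year != current_year:
        if current_year is not None:
            print(f"{current_year}\t{current_max}")   # e.g. (1950, 43)
        current_year, current_max = year, temp
    else:
        current_max = max(current_max, temp)
if current_year is not None:
    print(f"{current_year}\t{current_max}")
```

The two scripts would then be handed to the Hadoop Streaming jar through its -mapper and -reducer options, with -input and -output pointing at HDFS paths.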

3.1.3. Yarn

Apache Yarn (Yet Another Resource Negotiator) is Hadoop's cluster resource management system. Yarn was first introduced in Hadoop version 2 to improve MapReduce's performance. Today, however, it manages cluster resources for various distributed computing frameworks such as Spark and Tez. Yarn provides its key services through two types of long-running daemons: the ResourceManager and the NodeManager. The single ResourceManager in the cluster manages the usage of the cluster's entire resources, while the NodeManager, which runs on every machine, is responsible for launching and monitoring Containers. A client connects to the ResourceManager to run an application on Yarn and requests the launch of an ApplicationMaster process. For example, suppose a Spark job request is received while a Hadoop MapReduce job is running on 100 servers. Without Yarn, you would have to configure a separate cluster of 50 servers for MapReduce and a Spark cluster on the remaining 50. In that case, if the MapReduce job finishes earlier than the Spark job, resources are wasted because 50 servers do nothing. Yarn minimizes this waste by allowing tasks from different distributed computing platforms to be managed by a single resource manager. This is the purpose for which Yarn was created and its greatest advantage.

3.2. Spark

Spark is an integrated computing engine that houses libraries for processing data in parallel in a cluster environment. Spark's philosophy is to 'provide an integrated platform for developing Big Data applications'. From simple data reads to SparkSQL, Spark MLlib, and Spark Streaming, data analysis is designed to be performed with the same computation engine and consistent APIs. Spark does not store data internally for long periods and does not favor any specific storage system; it is designed to focus on processing regardless of where the data is stored. The fundamental background of Spark's emergence lies in changes in the economic factors underlying computer applications and hardware. Historically, computers became faster every year thanks to improved processor performance. Unfortunately, hardware performance improvements largely stopped around 2005 due to physical limits on heat dissipation. This shift made parallel processing necessary to improve application performance, giving rise to distributed computing platforms such as Spark and Hadoop [6].

A computer cluster brings together the resources of multiple computers so that they can be used as a single computer. However, configuring a cluster is not enough: a framework is needed to coordinate the work within the cluster, and Spark is such a framework. A Spark application consists of a Driver process and a number of Executor processes. The Driver process runs on one of the cluster nodes and is essential because it maintains application information, responds to user programs and input, and analyzes, distributes and schedules the work of the Executor processes. The Driver process is the heart of the application, keeping all relevant information in memory throughout its life cycle. The Executors, on the other hand, perform the tasks the Driver assigns: they execute the code assigned by the Driver and report progress back to the Driver node. Spark splits data into chunks called partitions so that all Executors can work in parallel. For example, with only one partition, parallelism is one even if there are thousands of Executors; likewise, with hundreds of partitions, parallelism is still one if there is only one Executor.
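A minimal PySpark sketch of the partition idea, assuming a local SparkSession stands in for a real cluster; the application name and numbers are illustrative only.

```python
from pyspark.sql import SparkSession

# Local session standing in for a real cluster; names and values are illustrative.
spark = SparkSession.builder.appName("partition-demo").getOrCreate()
sc = spark.sparkContext

# Split 1,000,000 records into 8 partitions so up to 8 executor cores can work in parallel.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
print(rdd.getNumPartitions())             # -> 8
print(rdd.map(lambda x: x * 2).count())   # the map runs partition by partition
```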

3.2.1. Low-Level API

Spark consists of the Low-Level API as its base element, the Structured API, and a set of standard libraries that provide additional functionality. In most situations after Spark version 2, the Structured API is recommended, and direct use of RDDs is expected to recede after version 3. The Low-Level API remains important, however, because it contains RDDs, the smallest units of execution in Spark, and the distributed shared variables. In the Low-Level API, Spark's distinctive concept for distributed data processing is the Resilient Distributed Dataset (RDD). Simply put, an RDD is an immutable, partitioned collection of records that can be processed in parallel. It is also what the Dataframe API mostly compiles down to when building an optimized physical execution plan.

There are two types of distributed shared variables: the Broadcast Variable and the Accumulator. These shared variables exist because applications do not run on a single machine; to do the same work on a cluster of multiple machines, certain values must be shared, and these variables are the means of exchanging them. A Broadcast Variable is an immutable shared variable that is cached on every machine in the cluster. This is very efficient, especially when a large value such as a machine learning model must be used several times, so most machine learning algorithms that run on Spark rely on Broadcast Variables. An Accumulator is used to update values across Transformations, and it can deliver values from the Executors to the Driver efficiently while ensuring fault tolerance. Like the Broadcast Variable, the Accumulator is frequently used when training a deep learning model: hyperparameters are delivered to the Driver, their values are updated several times, and a better model is searched for.
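A minimal sketch of both shared variables in PySpark; the station codes and temperatures are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-vars-demo").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a read-only lookup table cached on every executor.
station_names = sc.broadcast({"SEL": "Seoul", "MAD": "Madrid"})

# Accumulator: executors add to it, only the driver reads the total.
bad_records = sc.accumulator(0)

def parse(line):
    code, temp = line.split(",")
    if code not in station_names.value:
        bad_records.add(1)           # visible on the driver after an action runs
        return None
    return (station_names.value[code], float(temp))

rdd = sc.parallelize(["SEL,12.5", "MAD,16.0", "XXX,9.9"])
parsed = rdd.map(parse).filter(lambda x: x is not None).collect()
print(parsed)               # [('Seoul', 12.5), ('Madrid', 16.0)]
print(bad_records.value)    # 1
```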

3.2.2. Structured API

While each record in a Dataframe, one of the structured APIs, is a structured row whose fields follow a known schema, an RDD record is simply an object in the programming language chosen by the programmer. The Structured API consists of Datasets, Dataframes, and SQL, and aims to handle big data through these abstractions. The Structured APIs are recommended because RDDs are not expected to remain the primary interface in Spark version 3, scheduled for release in 2021.

The Structured APIs are the basic abstraction for defining data flows. Their execution process consists of four steps. First, code is written using Datasets, Dataframes or SQL. Second, if the code is valid, the Spark engine converts it into a logical execution plan. Third, the logical execution plan is converted into a physical execution plan, and further optimizations are checked along the way; building the physical plan means converting the query defined with the Structured API into RDD operations. Finally, the physical execution plan is executed on the cluster, and Spark returns the results to the user.
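A minimal sketch of the Structured API in PySpark; the HDFS path and column names ("date", "max_temp") are assumptions, and explain() simply prints the plans Spark derives in the steps described above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured-api-demo").getOrCreate()

# Hypothetical CSV of daily maximum temperatures.
df = spark.read.csv("hdfs:///data/max_temp.csv", header=True, inferSchema=True)

yearly_max = (df
              .withColumn("year", F.year(F.to_date("date")))
              .groupBy("year")
              .agg(F.max("max_temp").alias("highest")))

yearly_max.explain()   # shows the logical and physical plans built from the query
yearly_max.show(5)
```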

3.2.3. Machine Learning on Spark (MLlib)

MLlib offers data collection, cleaning, feature extraction and selection, training and tuning of supervised and unsupervised machine learning models on large data, and an interface that helps these models be used in production. MLlib is often compared with Scikit-Learn and TensorFlow. From a broad perspective, MLlib can be thought of as performing tasks similar to those provided by Scikit-Learn and other Python tools. Those tools, however, perform machine learning on a single machine, whereas MLlib operates in a cluster environment, so the two are complementary. MLlib has several basic 'structural' types, such as transformers, estimators, evaluators and pipelines.

3.2.3.1. Transformer

A transformer is a function that converts raw data in various ways. This could mean creating a new interaction variable, normalizing a column, or changing an Integer type to a Double type so the value can be fed into a model.

3.2.3.2. Estimator

The estimator has two meanings. First, it refers to a transformer that must be initialized from the data. For example, to normalize numeric data, the transformation is initialized using the current values of the column to be normalized. Second, the algorithms users apply to learn a model from data are also called estimators.

3.2.3.3. Evaluator

An evaluator allows us to see how a given model performs according to a single criterion, such as the Receiver Operating Characteristic (ROC) curve. After selecting the best model among those tested with the evaluator, the final prediction can be made with that model.
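A minimal sketch of how these structural types fit together in a PySpark ML pipeline; the dataframe `df` and its columns ("day_of_year", "max_temp") are hypothetical, and a regression metric (RMSE) is used instead of the ROC curve since the temperature task is not a classification problem.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Transformer: packs raw columns into a feature vector.
assembler = VectorAssembler(inputCols=["day_of_year"], outputCol="raw_features")
# Estimator in the first sense: must be fit on the data before it can transform it.
scaler = MinMaxScaler(inputCol="raw_features", outputCol="features")
# Estimator in the second sense: the learning algorithm itself.
lr = LinearRegression(featuresCol="features", labelCol="max_temp")

pipeline = Pipeline(stages=[assembler, scaler, lr])

train, test = df.randomSplit([0.7, 0.3], seed=42)   # df is an assumed Spark dataframe
model = pipeline.fit(train)

# Evaluator: scores the fitted model on one criterion.
evaluator = RegressionEvaluator(labelCol="max_temp", predictionCol="prediction",
                                metricName="rmse")
print(evaluator.evaluate(model.transform(test)))
```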

3.2.3.4. External libraries

Spark can run various projects using external libraries as well as built-in packages. In particular, a variety of external deep learning libraries can be used, such as TensorFrames, BigDL, TensorFlowOnSpark, DeepLearning4J, and Elephas. There are two ways to develop a new deep learning model. One is to use a Spark cluster to parallelize the training of a single model across multiple servers and update the final result through communication between the servers. The other is to use a specific library to train several model objects in parallel and to review various model architectures and hyperparameters in order to efficiently select and optimize the final model.


Library           | Framework based on DL | Case of application
TensorFrames      | TensorFlow            | Inference, Transfer learning
BigDL             | BigDL                 | Distributed learning, Inference
TensorFlowOnSpark | TensorFlow            | Distributed learning
DeepLearning4J    | DeepLearning4J        | Inference, Transfer learning, Distributed learning
Elephas           | Keras                 | Distributed learning

[Table 1] Deep Learning External Libraries

Elephas is a library designed to run the Keras deep learning framework on Spark. Keras maintains simplicity and high usability while supporting distributed models that can be trained on large datasets. Using Spark's RDDs and Dataframes, Elephas implements data-parallel algorithms on top of Keras: the model is initialized on the Spark Driver, then serialized and passed to the Executors together with the data, and the parameters needed by the model are exchanged using Spark's distributed shared variables, the Broadcast Variable and the Accumulator. The learned parameters and hyperparameters are then passed back to the Driver, where they are synchronized by the optimizer on the master node, and training continues.
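A minimal training sketch with Elephas, following the patterns in the Elephas project documentation [20] rather than the exact code used in this thesis. It assumes a running SparkContext `sc` and NumPy arrays x_train/y_train already prepared; the optimizer, epoch count and batch size are illustrative, and depending on the installed versions the Keras imports may be `keras.*` instead of `tensorflow.keras.*`.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from elephas.spark_model import SparkModel
from elephas.utils.rdd_utils import to_simple_rdd

# Small Keras model; 10 inputs as in the temperature setup described later.
model = Sequential([
    Dense(16, activation="tanh", input_shape=(10,)),
    Dense(1, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")

# Distribute the training data as a Spark RDD of (features, label) pairs.
rdd = to_simple_rdd(sc, x_train, y_train)   # sc, x_train, y_train assumed to exist

# Wrap the Keras model so the executors train on their partitions in parallel.
spark_model = SparkModel(model, frequency="epoch", mode="asynchronous")
spark_model.fit(rdd, epochs=20, batch_size=32, verbose=0, validation_split=0.1)
```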

3.3. Docker

Docker is an open-source project that makes it easy to run applications as containers by building on several features of Linux containers. Docker is written in the Go language. Unlike virtual machines, the traditional approach to virtualization, Docker containers incur little performance loss, drawing attention from many developers as a next-generation cloud infrastructure solution. There are many projects related to Docker, including Docker Compose, Private Registry, Docker Machine, and Kitematic, but Docker typically refers to the Docker Engine. The Docker Engine is the main Docker project; it creates and manages containers, provides a variety of functions, and controls the containers on its own [7].

Traditional virtualization technology used a hypervisor to create and run multiple operating systems on a single host. These operating systems are isolated as virtual machines, each running Ubuntu, CentOS, and so on. Guest operating systems created and managed by the hypervisor use independent space and system resources, completely separate from one another. Typical virtualization tools of this kind include VirtualBox and VMware. However, virtualizing machines and creating independent spaces requires a hypervisor, which causes performance loss compared to an ordinary host. So while virtual machines have the advantage of providing a complete operating system, they tend to lose performance compared to ordinary hosts, and deploying virtual machine images of several gigabytes along with applications is difficult.

In comparison, the Docker container has little performance loss because it creates a process-level isolation environment using Linux's own features, chroot, namespaces and cgroups, to build the virtualized space. Because the container shares the host's kernel and contains only the libraries and executable files needed to run the application, the image size is also significantly reduced when the container is turned into an image. This makes it faster than a virtual machine, with little performance lost when using the virtualized space.


3.3.1. Micro-Service

The approach in which multiple modules of a piece of software run their interacting logic within a single program is called a monolithic application. While this approach may be appropriate for smaller services, its disadvantage is that the more complex and larger the service becomes, the less scalable and flexible the software is. To replace this monolithic approach, the more recent concept is the microservice architecture. The microservice structure has the advantage of being able to respond quickly to change without being tied to a language, and of making each module easier to manage. Docker is the most commonly used virtualization technology in microservice architectures because containers can be created and started in seconds and provide multiple modules with independent environments at the same time. Developing a Big Data platform means turning a large number of nodes into a single cluster, and managing and maintaining such a cluster requires the ability to respond quickly to change. This advantage of Docker is therefore essential for the development of the Big Data platform.

3.3.2. Image and Container

The basic units used in the Docker Engine are images and containers, both of which are central to it. An image is required to create a container and is a concept similar to the iso file used to create virtual machines. An image exists as a binary file with multiple layers and is used as read-only when creating and running containers. Basic Linux operating system images such as Ubuntu and CentOS, which can serve as the basis for applications, can be obtained from the Docker website called Docker Hub. When a container is created from such an image, a separate space is created for a file system containing the files needed for its purpose and for isolated system resources and networking: this is the Docker container. The container uses the image as read-only and stores only its changes in the container layer, so whatever the container does, the original image is not affected. In addition, each container is given its own file system and is isolated from the host, so installing or deleting an application in one container does not change the others.

3.3.3. Networks

To run a single application on Docker Engines installed on several physical servers, the Docker containers need to communicate with each other. By default, on one server the Docker Engine assigns internal IPs to containers sequentially, and these can change each time a container is restarted. This internal IP can only be used on the internal network of the host where Docker is installed, so a connection to the outside world is needed. The network drivers provided by the Docker Engine itself include bridge, host, none, container, overlay, and MacVLAN. Among them, MacVLAN virtualizes the host's network interface card, providing containers with what behaves like their own physical network interface. With MacVLAN, containers therefore have virtual MAC addresses on the physical network, enabling communication with other devices connected to that network. When multiple servers are connected to the same network equipment, such as a single router, each server and container is dynamically allocated an IP in one network band, such as 192.168.0.0/24, so containers using MacVLAN can communicate with each other.

3.3.4. Kubernetes

Kubernetes is an open-source management system that provides automatic deployment and scaling of containerized applications. It was designed by Google in 2014 and is now managed under the Linux Foundation. Its purpose is to provide a platform for automating the deployment, scaling, and operation of application containers across hosts in multiple clusters. Kubernetes manages cluster resources in units called 'objects'. Objects serve a variety of roles: the basic set of containers called a Pod, the ReplicaSet controller that manages sets of containers, and even the Service Account and the Node are each treated as an object. Kubernetes nodes are largely divided into master and worker roles. The master is responsible for managing the cluster so that Kubernetes functions properly, while workers are where the application containers are created. The master node runs the API Server, Controller Manager, Scheduler, DNS server, and so on, and is responsible for all the services needed to deploy applications.

4. CONCLUSION

Research so far shows that the new coronavirus is affected by temperature and humidity. Moreover, cities at similar latitudes generally have similar climates. In 'Temperature and Latitude Analysis to Predict Potential Spread and Seasonality for COVID-19', it was assumed that transmission of the COVID-19 virus would be active in large cities with an average daily temperature of 5 to 11 degrees Celsius and humidity of 47 to 79 percent [2]. This hypothesis will therefore be verified using actual climate data, and we will then implement a model that predicts the date on which COVID-19's spread rate decreases. To do so, we will build a deep learning model that uses historical weather data to predict this year's weather. Following 'Weather forecasting model using artificial neural network' [13], the deep learning model adopts the finding that a network with five hidden layers taking 10 historical data points as input is the most accurate. For the weather data, a Hadoop and Spark cluster will be configured using multiple Raspberry Pi boards to store and process big, semi-structured data efficiently. A climate prediction model based on the Keras deep learning framework will then be implemented using the Elephas external library [20].

III. METHOD AND DATA

In cities at similar latitudes, the climate tends to be similar. Therefore, Madrid, Tehran, Seoul and San Francisco, four major cities located between 37 and 40 degrees latitude, were chosen first. Time series of confirmed cases in each city and daily maximum temperatures observed at weather stations in each city from January 1, 1950 to December 31, 2019 were then obtained. San Francisco has a population of about 880,000, and the other three cities have populations of more than 6 million. It was assumed that the transmission speed of COVID-19 would decrease after the date on which the city's maximum temperature had remained at 16 degrees Celsius or higher for at least 12 of the preceding 14 days, because the incubation period of COVID-19 is generally about two weeks; we therefore observe the temperature change over two-week windows. The condition is 12 out of 14 days, rather than all 14, because abnormal weather may occur during the observed period.
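A minimal sketch of this 12-of-14-days rule, assuming the daily maximum temperatures (observed or predicted) are available as a pandas Series indexed by date; the threshold values come from the hypothesis above, while the function and variable names are illustrative.

```python
import pandas as pd

def predicted_breaking_point(max_temps: pd.Series,
                             threshold: float = 16.0,
                             window: int = 14,
                             required_days: int = 12):
    """Return the first date on which at least `required_days` of the previous
    `window` days had a maximum temperature >= `threshold` (degrees Celsius)."""
    warm_days = (max_temps >= threshold).rolling(window).sum()
    hits = warm_days[warm_days >= required_days]
    return hits.index[0] if not hits.empty else None

# Illustrative usage with predicted 2020 temperatures for one city:
# temps_2020 = pd.Series(predictions, index=pd.date_range("2020-01-01", periods=365))
# print(predicted_breaking_point(temps_2020))   # e.g. Timestamp('2020-04-10')
```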

1. DEVELOPMENT ENVIRONMENT

For the climate data, a framework for handling Big Data is needed because much of the data at the meteorological stations is semi-structured. Therefore, instead of using a Big Data service already available online, I decided to build and use the environment personally. Hadoop connects several machines in a parallel structure to create a single cluster [Figure 2]. Spark is a framework that uses the Big Data stored in the Hadoop file system (HDFS) to support preprocessing and deep learning analysis within the cluster environment. The cluster is managed by the resource manager YARN. Docker was used to install and deploy Hadoop and Spark easily on several physical machines. Zeppelin is an integrated development environment for Spark that supports analysis through a web interface [Figure 1]. Finally, Elephas, the external Spark deep learning package, was installed in the cluster, and weather prediction modeling was carried out through deep learning.

Raspberry Pi 4 | ARM 64-bit
Ubuntu         | 18.04 LTS
Hadoop         | 2.7.7
Spark          | 2.4.0
Zeppelin       | 0.8.2
Python         | 3.6.9

[Table 2] Development Environment Version

[Figure 1] Logical Architecture


[Figure 2] Physical Architecture

2. DATA

2.1. Daily COVID-19 confirmed cases

Before building the weather forecasting model, we examined the distribution of confirmed cases by date during the ongoing COVID-19 outbreak. First, we collected COVID-19 datasets from Wikipedia [17] and the Johns Hopkins Coronavirus Resource Center [16]. The dataset was then organized by date and visualized. For Iran, city-level data were not available after March 26, because from that date only the national total of confirmed cases was announced. The graph below shows the new confirmed cases occurring each day and the change in the number of confirmed cases in the four cities. It also made it possible to identify the day on which the number of confirmed cases peaked.


[Figure 3] Confirmed case Trend at each date

In Seoul, the number of new confirmed cases has been declining since March 10, when it reached 52. Madrid has seen a decline since 3,419 confirmed cases were recorded on March 30. Tehran has trended downward since peaking at 347 confirmed cases on March 14. In San Francisco, a steady increase in confirmed cases has been observed since early March [Figure 3].
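As a hedged sketch, the daily-new-cases series and the peak dates above can be derived with pandas roughly as follows; the file name and column layout are assumptions, not the exact files obtained from [16] and [17].

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical tidy CSV with columns: date, city, cumulative_confirmed
cases = pd.read_csv("confirmed_cases.csv", parse_dates=["date"])

# Daily new cases = difference of the cumulative counts within each city.
cases["new_cases"] = cases.groupby("city")["cumulative_confirmed"].diff().fillna(0)

pivot = cases.pivot(index="date", columns="city", values="new_cases")
pivot.plot(subplots=True, figsize=(8, 8))   # one panel per city, as in Figure 3
plt.tight_layout()
plt.show()

# Peak day per city, used later to compare against the predicted breaking point.
print(pivot.idxmax())
```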

2.2. Max Temperature Data

I wanted to use daily average temperature and average humidity data, but few weather stations have records going back to 1950, so the amount of data was limited. Daily minimum temperature data were available, but human-to-human transmission of the COVID-19 virus was assumed to be unlikely at night, when the minimum temperature is generally recorded and the floating population is small. For this reason, the study was conducted using daytime maximum temperature data, when the floating population is large, instead of average temperature and humidity data [18].

3. EDA (EXPLORATORY DATA ANALYSIS)

In meteorological data, problems with the observation equipment or the environment result in many missing values [Figure 4]. The missing values were therefore replaced with the MissForest algorithm, an imputation technique that estimates and fills in missing values using the Random Forest model from machine learning. The algorithm is a nonparametric method that uses correlations between variables to impute missing values; it can be applied to mixed-type, high-dimensional data and makes no distributional assumptions about the data [15].


[Figure 5] Visualization of Missing Values

For Seoul, San Francisco and Madrid, missing values were imputed with MissForest because several stations were available; for Tehran, where only one observatory existed, missing values were interpolated along the time series date index.
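A hedged sketch of this imputation step. Because the MissForest package itself may not be installed, the multi-station case is approximated with scikit-learn's IterativeImputer wrapped around a random forest, which follows the same idea; the single-station case uses pandas time interpolation. File and column names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates IterativeImputer)
from sklearn.impute import IterativeImputer

# Multi-station cities (Seoul, San Francisco, Madrid): one temperature column per station.
multi = pd.read_csv("seoul_stations.csv", index_col="date", parse_dates=True)
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100),
                           max_iter=10, random_state=0)
multi_filled = pd.DataFrame(imputer.fit_transform(multi),
                            index=multi.index, columns=multi.columns)

# Single-station city (Tehran): interpolate along the time index instead.
tehran = pd.read_csv("tehran_station.csv", index_col="date", parse_dates=True)
tehran_filled = tehran.interpolate(method="time")
```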

4. PREDICTION MODELING

The data consisted of daily maximum temperatures from 1950 to 2019, and the prediction model was implemented with the Multi-Layer Perceptron (MLP) provided by Keras. Because of the large number of rows, the data were processed and analyzed using the Big Data analysis framework Spark.

4.1. Data Preprocessing

For most deep learning models, the input values are scaled to the range -1 to 1 or 0 to 1 before training [14]. After training and accuracy verification, the scaled inputs used for prediction are transformed back to their original scale. All maximum temperature values were scaled to 0 to 1 using Scikit-Learn's MinMaxScaler, then converted back to Fahrenheit, the unit of the raw data, using the stored maximum and minimum values, and finally converted to Celsius. Because one out of every four years is a leap year, the rows recorded on February 29 were dropped. The data were then split into 70% training and 30% validation sets, plus one dataset containing the maximum temperatures from 2010 to 2019, used to predict 2020 with the trained model.
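A minimal sketch of this preprocessing, assuming a CSV of Fahrenheit daily maxima with a date index; the file name, column name and the exact windowing into 10-value inputs are assumptions that follow the MLP setup described next.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

temps = pd.read_csv("max_temp_1950_2019.csv", index_col="date", parse_dates=True)

# Drop leap days so every year has the same 365 rows.
temps = temps[~((temps.index.month == 2) & (temps.index.day == 29))]

# Scale to [0, 1]; the scaler keeps min/max so predictions can be inverted later.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(temps[["max_temp"]])

# Inputs: the same calendar day over 10 consecutive years; target: the following year.
values = scaled.reshape(-1, 365)                      # one row per year
X = np.stack([values[i:i + 10].T for i in range(len(values) - 10)]).reshape(-1, 10)
y = values[10:].reshape(-1, 1)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, shuffle=False)

# Invert predictions back to Fahrenheit, then to Celsius.
def to_celsius(pred_scaled):
    fahrenheit = scaler.inverse_transform(pred_scaled)
    return (fahrenheit - 32) * 5.0 / 9.0
```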

4.2. Multi-Layer Perceptron

Since there are 10 inputs, a network with 5 hidden layers and 16 neurons per layer was used. The first to fourth hidden layers use the tan-sigmoid (tanh) activation function, and the final layer, which returns one output value, uses a linear activation function [13]. Using the Mean Square Error (MSE) as the minimization criterion, the weights and biases of the neurons are adjusted at each step so as to reduce the MSE. After a number of iterations (epochs), the neural network is trained and the weights are saved.
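A minimal Keras sketch of the architecture just described (10 inputs, layers of 16 tanh neurons, one linear output, MSE loss). The optimizer, epoch count and file name are assumptions rather than the exact values used in the thesis; X_train/X_val refer to the arrays from the preprocessing sketch above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_mlp(n_inputs: int = 10, n_hidden: int = 16) -> Sequential:
    model = Sequential()
    model.add(Dense(n_hidden, activation="tanh", input_shape=(n_inputs,)))
    for _ in range(3):                        # hidden layers 2-4, also tanh
        model.add(Dense(n_hidden, activation="tanh"))
    model.add(Dense(1, activation="linear"))  # single output: scaled max temperature
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_mlp()
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=100, batch_size=32, verbose=0)
model.save_weights("mlp_weights.h5")          # weights saved after training, as described
```

One such model is trained per city, which gives the four trained models mentioned below.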


Finally, I obtained four trained models, one for each city: Madrid, Seoul, San Francisco and Tehran.

[Figure 6] Diagrams of Multi-Layer Perceptron

IV. RESULT

The following results examine the hypothesis that the diffusion rate of COVID-19 decreases once the temperature has been maintained above 16 degrees Celsius for at least 12 of 14 days. The process for reaching the results is as follows:

i) Create a `spark_base' container from the Ubuntu Docker image and install Hadoop and Spark in that container. The resulting image is then pushed to the Docker Hub repository.

ii) Set up MacVLAN on each Raspberry Pi so that the three Raspberry Pis can communicate with each other.

iii) Download the `spark_base' image from Docker Hub on all servers and configure the Namenode, Secondary Namenode, Datanode, ResourceManager and NodeManager settings for Hadoop according to the role of each node. The Spark-related settings are also configured.

iv) Install Zeppelin and Elephas, the external deep learning library, in the container acting as Namenode to set up the analysis environment. Then connect all the nodes into a cluster.

v) After the weather dataset is interpolated with Python, the data are imported into Spark for MLP processing. During this process, the data are split into train and test sets to evaluate and improve the model.

vi) After training the model with PySpark and Elephas, the expected temperatures for 2020 are obtained by feeding the temperatures of the last 10 years into the model.

vii) Create a function with condition statements according to the above hypothesis and complete a model that predicts the date of the decrease using the estimated temperature for each day of this year. Results come in a format such as 2020-04-10, meaning that the number of confirmed cases in that city is predicted to begin declining after 10 April 2020.

City          | Predicted breaking point
Madrid        | 2020-04-10
San Francisco | 2020-02-07
Tehran        | 2020-03-14
Seoul         | 2020-04-10

[Table 3] Predicted Breaking Point of COVID-19

City          | Actual breaking point
Madrid        | 2020-03-30
San Francisco | ?
Tehran        | 2020-03-14
Seoul         | 2020-03-10

[Table 4] Real Breaking Point of COVID-19

For two of the four cities, Tehran and Madrid, the predictions were close to the actual results [Table 3][Table 4]: their case counts began to decline after recording their largest numbers of confirmed cases on March 14 and March 30, respectively. However, Seoul and San Francisco showed the opposite of what was expected. San Francisco has seen a sharp increase in confirmed cases since March, as the U.S. outbreak grew starting with New York, yet the predicted start of its decline was February 7. Seoul was predicted to see a decrease only after April 10, but thanks to a quick initial response, the number of confirmed cases has remained low since March 10.

V. DISCUSSION

This research was conducted on COVID-19, the most pressing topic for the international economy and the health sector as of 2020. Indeed, since April 2020 the number of new confirmed cases of COVID-19 has been decreasing in many countries, and economic indicators such as the U.S. stock market confirm a recovery from the worst period in late March and early April. However, COVID-19 variants of various types are being discovered on different continents, and one study suggests that if temperature and humidity return to virus-friendly conditions in winter 2020, there may be a risk of a renewed outbreak. The ultimate solution to this problem is the development of vaccines against the virus. As the results show, the transmission of the coronavirus is somewhat affected by temperature. However, as the cases of Seoul and San Francisco show, differences between countries' health systems affect the number of confirmed cases more than the climate does. Thus, this study of the infectious nature of the coronavirus showed that it can be affected by a variety of external variables beyond the climate alone, and it became an opportunity to stay alert to the dangers of viruses.

For quick optimization of the weather forecasting deep learning model, a model that had already produced the best results in another study was adopted [13]: five hidden layers with 16 neurons per layer, four of which use the 'tangent sigmoid' activation function. In that study, the Mean Square Error (MSE), which measured the accuracy of the model, reached a very stable value of about 2.0, whereas in my study the MSE was below 1, suggesting that the predictive model overfits. The reason for this is that the number of samples in the previous study was relatively small at 730, and that study also showed the MSE dropping from 2.71 to 0.211 when the number of samples was increased from 730 to 1,460. Therefore, when deep learning modeling is carried out with Big Data, overfitting can occur severely.

A general Big Data analysis is characterized not by a large number of rows but by a large number of columns. In that sense this research is not a true Big Data analysis, because only one variable, the daily maximum temperature, was analyzed. However, the physical system implemented with Docker and Raspberry Pi to handle Big Data, and the logical system implemented with Hadoop and Spark, will be needed to analyze higher-quality Big Data in future work.

VI. CONCLUSION

'When will the transmission of COVID-19 begin to decrease, and when will the economic recovery begin?' This research was conducted to predict that date. The study found that the virus has seasonal characteristics, but the four cases show that other external factors also play an important role. We also found that climate forecasting models using deep learning, unlike conventional numerical weather prediction models, can complete hours of work in minutes. Such accelerated weather forecasting models could be used to minimize casualties and asset damage in areas that require real-time responses, such as natural disasters. A high-performance computer is essential in fields such as weather that involve a lot of data and computation, and the data must be collected and processed in real time. However, weather forecasting with a single existing supercomputer can be costly. I believe the distributed computing and cloud sectors are the right technologies to solve these problems, because the universality of distributed computing systems can turn multiple commodity computers into a high-performance system for a single application. Thus, it can dramatically reduce the cost of producing the same result.

References

[1] TED, The next outbreak? We're not ready [Website]. (2015, April 3). Retrieved from https://www.ted.com/talks/bill_gates_the_next_outbreak_we_re_not_ready

[2] Sajadi, M. M., Habibzadeh, P., Vintzileos, A., Shokouhi, S., Miralles-Wilhelm, F., & Amoroso, A. (2020). Temperature and Latitude Analysis to Predict Potential Spread and Seasonality for COVID-19. Available at SSRN 3550308.

[3] Using Machine Learning to «Nowcast» Precipitation in High Resolution, Google AI Blog [Website]. (2020, January 13). Retrieved from https://ai.googleblog.com/2020/01/using-machine-learning-to.nowcast.html

[4] Lai, C. C., Shih, T. P., Ko, W. C., Tang, H. J., & Hsueh, P. R. (2020). Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and corona virus disease-2019 (COVID-19): the epidemic and the challenges. International journal of antimicrobial agents, 105924.

[5] Tom White. (2015). Hadoop: The Definitive Guide. California: O'Reilly Media.

[6] Matei Zaharia. (2018). Spark: The Definitive Guide: Big Data Processing Made Simple. California: O'Reilly Media.

[7] YONG ChanHo. (2020). Start! : Docker/Kubernetes. Paju: Wikibook

[8] Richman DD, Whitley RJ, Hayden FG. Clinical Virology, 4th ed. Washington: ASM Press; 2016.

[9] Turner, D., Wailoo, A., Nicholson, K., Cooper, N., Sutton, A., & Abrams, K. (2003). Systematic review and economic decision modelling for the prevention and treatment of influenza A and B. In NIHR Health Technology Assessment programme: Executive Summaries. NIHR Journals Library.

[10] Centers for Disease Control and Prevention (CDC). (2015). Epidemiology and Prevention of Vaccine Preventable Diseases, 13th Edition. La Vergne: ingram.

[11] Karun, A. K., & Chitharanjan, K. (2013, April). A review on hadoop--HDFS infrastructure extensions. In 2013 IEEE conference on information & communication technologies (pp. 132-137). IEEE.

[12] Scher, S., & Messori, G. (2019). Weather and climate forecasting with neural networks: using general circulation models (GCMs) with different complexity as a study ground. Geoscientific Model Development, 12(7), 2797-2809.

[13] Abhishek, K., Singh, M. P., Ghosh, S., & Anand, A. (2012). Weather forecasting model using artificial neural network. Procedia Technology, 4, 311-318.

[14] Sebastian Raschka and Vahid Mirjalili. (2017). Python Machine Learning - Second Edition: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow. Birmingham: Packt.

[15] Stekhoven, D. J., & Bühlmann, P. (2012). MissForest--non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.

[16] COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [Website]. (2019). Retrieved from https://github.com/CSSEGISandData/COVID-19.

[17] COVID-19 pandemic in the United States [Website]. (2020). Retrieved from https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_the_United_States.

[18] National Centers For Environmental Information: Climate Data Online [Website]. Retrieved from https://www.ncdc.noaa.gov/cdo-web/

[19] World Health Organization: Coronavirus disease (COVID-2019) situation reports [Website]. Retrieved from https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports.

[20] Elephas: Distributed Deep Learning with Keras & Spark [Website]. Retrieved from https://github.com/maxpumperla/elephas.