The impact of covid-19: to predict the breaking point of the disease from big data by neural networks

by Woohyun SHIN
Paris School of Business - MSc Data Management 2001

3. EDA, Exploratory data analysis

In the meteorological data, faults in the observation equipment and environmental problems produced many missing values [Figure 4]. These were imputed with the MissForest algorithm, an imputation technique built on the Random Forest machine-learning model. MissForest is a nonparametric method that uses the correlations between variables to estimate missing values. It can be applied to mixed-type and high-dimensional data and makes no distributional assumptions about the data [15].

[Figure 5] Visualization of Missing Values

For Seoul, San Francisco and Madrid, which each had several stations, missing values were imputed with MissForest. For Tehran, where only one observatory existed, missing values were interpolated along the time-series date index.
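The single-station case can be illustrated with a minimal sketch of linear interpolation over a date index. It uses NumPy only, the function name and the sample readings are hypothetical, and it does not reproduce MissForest, which was used for the multi-station cities.

```python
from datetime import date

import numpy as np


def interpolate_by_date(dates, values):
    """Fill NaN entries by linear interpolation over the calendar date index."""
    x = np.array([d.toordinal() for d in dates], dtype=float)
    y = np.array(values, dtype=float)
    known = ~np.isnan(y)
    # np.interp needs increasing x; the records are assumed sorted by date.
    filled = y.copy()
    filled[~known] = np.interp(x[~known], x[known], y[known])
    return filled


# Hypothetical daily maximum temperatures with two missing readings.
dates = [date(2019, 1, d) for d in (1, 2, 3, 4, 7)]
temps = [50.0, float("nan"), 54.0, float("nan"), 60.0]
filled = interpolate_by_date(dates, temps)
```

Because the interpolation runs over the dates rather than the row positions, the gap between 4 January and 7 January is weighted by the number of days it spans.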

4. PREDICTION MODELING

The model was built on daily maximum temperature data from 1950 to 2019 and implemented as a Multi-Layer Perceptron (MLP) in Keras. Because the dataset has a large number of rows, it was processed and analyzed with the big-data analysis framework Spark.

4.1. Data Preprocessing

Most deep-learning models are trained after the input values have been rescaled to the range -1 to 1 or 0 to 1 [14]. Once the model has been trained and its accuracy verified, the scaled inputs are fed in for prediction and the outputs are converted back to the original scale. All maximum-temperature values were scaled to 0 to 1 with the MinMaxScaler of scikit-learn. Predictions were converted back to Fahrenheit, the unit of the source data, using the stored minimum and maximum values, and then converted to Celsius. Because 29 February occurs only in leap years, rows recorded on that date were dropped. The data were then divided into a training set (70%), a validation set (30%) and one additional dataset holding the maximum temperatures from 2010 to 2019, which the trained model uses to predict 2020.
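The preprocessing steps can be sketched roughly as follows. The sketch is written with NumPy instead of the scikit-learn MinMaxScaler so that the arithmetic is visible. All function names and sample values are hypothetical, and since the text does not say whether the split was shuffled, a simple ordered split is assumed.

```python
from datetime import date, timedelta

import numpy as np


def drop_leap_days(dates, values):
    """Remove 29 February records so every year has 365 rows."""
    keep = [i for i, d in enumerate(dates) if not (d.month == 2 and d.day == 29)]
    return [dates[i] for i in keep], np.asarray(values, dtype=float)[keep]


def minmax_fit(values):
    """Return the (min, max) pair that must be stored for the inverse step."""
    return float(np.min(values)), float(np.max(values))


def minmax_scale(values, lo, hi):
    """Map values linearly so that lo becomes 0 and hi becomes 1."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)


def minmax_inverse(scaled, lo, hi):
    """Undo minmax_scale using the stored minimum and maximum."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo


def fahrenheit_to_celsius(f):
    return (np.asarray(f, dtype=float) - 32.0) * 5.0 / 9.0


# Hypothetical readings around a leap day, in Fahrenheit.
start = date(2016, 2, 27)
dates = [start + timedelta(days=i) for i in range(4)]  # 27, 28, 29 Feb, 1 Mar
temps_f = [50.0, 59.0, 61.0, 68.0]

dates, temps_f = drop_leap_days(dates, temps_f)
lo, hi = minmax_fit(temps_f)
scaled = minmax_scale(temps_f, lo, hi)  # values in [0, 1]
restored_c = fahrenheit_to_celsius(minmax_inverse(scaled, lo, hi))

# 70% / 30% split in record order (assumption, see above).
n_train = int(0.7 * len(scaled))
train, valid = scaled[:n_train], scaled[n_train:]
```

Storing `lo` and `hi` at fit time is what makes it possible to return model outputs to Fahrenheit before the Celsius conversion.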

4.2. Multi-Layer Perceptron

Since there are 10 inputs, I use a network with 5 hidden layers and 16 neurons per layer. The first four of the five layers use the tan-sigmoid (tanh) activation function, and the last layer, which returns the single output value, uses a linear activation function [13]. The weights and biases of the neurons are adjusted at each step to reduce the Mean Squared Error (MSE), which serves as the minimization criterion. After a number of iterations (epochs), the neural network is trained and the weights are saved.

In the end I obtained four trained models, one for each city: Madrid, Seoul, San Francisco and Tehran.

[Figure 6] Diagrams of Multi-Layer Perceptron
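The network described in this section can be approximated by the following forward-pass sketch. It is written in NumPy rather than Keras, and it reads the description as 10 inputs, four tanh layers of 16 units and one linear output unit, which is an interpretation of the text. The weights are random and training is not shown.

```python
import numpy as np


def init_mlp(layer_sizes, seed=0):
    """Create small random weights and zero biases for each layer."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def forward(params, x):
    """Apply tanh on every layer except the last, which is linear."""
    h = x
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w, b = params[-1]
    return h @ w + b


def mse(y_true, y_pred):
    """Mean squared error, the minimization criterion named in the text."""
    return float(np.mean((y_true - y_pred) ** 2))


# 10 inputs -> four tanh layers of 16 units -> 1 linear output (assumed reading).
params = init_mlp([10, 16, 16, 16, 16, 1])

# Hypothetical batch: 8 samples of 10 scaled temperatures each.
rng = np.random.default_rng(1)
x = rng.random((8, 10))
y = rng.random((8, 1))
loss = mse(y, forward(params, x))
```

Under this reading the network has 1,009 trainable parameters, which is small compared with the number of daily records available per city.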

IV. RESULT

The following results are intended to show that the diffusion rate of COVID-19 decreases if the temperature stays above 16 degrees on more than 12 out of 14 days. The process for reaching the results is as follows:

i) Create a 'spark_base' container from the Ubuntu Docker image and install Hadoop and Spark in it. The resulting image is then stored in a Docker Hub repository.

ii) Set up MacVLAN on each Raspberry Pi so that the three Raspberry Pis can communicate with each other.

iii) Download the 'spark_base' image from Docker Hub on all servers and configure the Hadoop NameNode, Secondary NameNode, DataNode, ResourceManager and NodeManager according to the role of each node. The Spark settings are configured as well.

iv) Install Zeppelin and Elephas, an external deep-learning library, to provide the analysis environment in the container that acts as the NameNode. Then set up clustering to connect all nodes.

v) After the weather dataset has been interpolated in Python, import the data into Spark for MLP processing. During this step the data is split into training and test sets to evaluate and improve the model.

vi) After training the deep-learning model with PySpark and Elephas, feed the temperatures of the last 10 years into the model to obtain the expected data for 2020.

vii) Write a function with conditional statements that encodes the hypothesis above, and complete a model that predicts the date on which cases start to decrease from the estimated daily temperatures for this year. Results are returned in the format 2020-04-10, meaning that the number of confirmed cases in the city is expected to begin declining after 10 April 2020.
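The conditional rule in the last step could look like the sketch below. Several points are assumptions, not details given in the text: temperatures are taken to be in Celsius, "more than 12 days" is read as at least 12 warm days (adjustable through `min_days`), and the reported date is the last day of the first 14-day window that qualifies. The function name and the sample temperatures are hypothetical.

```python
from datetime import date, timedelta


def find_breaking_point(dates, temps_c, threshold=16.0, window=14, min_days=12):
    """Return the last day of the first window that satisfies the rule, or None."""
    for end in range(window, len(temps_c) + 1):
        # Count days in the current window that are above the threshold.
        warm_days = sum(t > threshold for t in temps_c[end - window:end])
        if warm_days >= min_days:
            return dates[end - 1]
    return None


# Hypothetical forecast: 5 cool days followed by 15 warm days.
start = date(2020, 4, 1)
dates = [start + timedelta(days=i) for i in range(20)]
temps_c = [10.0] * 5 + [18.0] * 15
breaking_point = find_breaking_point(dates, temps_c)
print(breaking_point)  # 2020-04-17
```

The result is a `date`, so printing it yields the same YYYY-MM-DD format as the predicted dates.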

City	Predicted date
Madrid	2020-04-10
Tehran	2020-03-14
Seoul	2020-04-10
San Francisco	2020-02-07

[Table 3] Predicted Breaking Point of COVID-19

City	Actual date
Madrid	2020-03-30
Tehran	2020-03-14
Seoul	2020-03-10
San Francisco	?

[Table 4] Real Breaking Point of COVID-19

For two of the four cities, Tehran and Madrid, the predictions were close to the actual results [Table 3][Table 4]: confirmed cases began to decline after peaking on March 14 and March 30, respectively. For Seoul and San Francisco, however, the outcome was the opposite of what was predicted. San Francisco, like the rest of the United States, has seen a sharp increase in confirmed cases since March, beginning with New York, yet the model predicted that the decline would start on February 7. Seoul, South Korea, was predicted to see a decrease after April 10, but thanks to a quick initial response the number of confirmed cases has remained low since March 10.
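To make the comparison concrete, the gap between the dates in Table 3 and Table 4 can be computed in days. This is a small illustrative calculation on the published table values. The variable names are hypothetical, and the "?" entry is represented as `None`.

```python
from datetime import date

predicted = {
    "Madrid": date(2020, 4, 10),
    "Tehran": date(2020, 3, 14),
    "Seoul": date(2020, 4, 10),
    "San Francisco": date(2020, 2, 7),
}
# None marks the "?" entry in Table 4.
actual = {
    "Madrid": date(2020, 3, 30),
    "Tehran": date(2020, 3, 14),
    "Seoul": date(2020, 3, 10),
    "San Francisco": None,
}

# Positive values mean the prediction fell after the observed date.
error_days = {
    city: (predicted[city] - actual[city]).days if actual[city] else None
    for city in predicted
}
print(error_days)
```

The prediction matches the observed date for Tehran, is 11 days late for Madrid and 31 days late for Seoul, and cannot be scored for San Francisco.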

V. DISCUSSION

This research was conducted on COVID-19, the subject most in need of analysis in the international economy and health sector as of 2020. Since April 2020, new confirmed cases of COVID-19 have in fact been decreasing in many countries, and economic indicators such as the U.S. stock market confirm a recovery from the low point of late March to early April. However, different types of the COVID-19 virus are being discovered on different continents, and one study suggests that if temperature and humidity return to conditions favorable to the virus in winter 2020, there may be a risk of a new wave. The ultimate solution to this problem is the development of a vaccine. The results show that the spread of the coronavirus was somewhat affected by temperature. However, as the cases of Seoul and San Francisco show, differences in each country's health system affect the number of confirmed cases more than climate does. The study therefore indicates that the spread of the coronavirus depends on a variety of external variables in addition to climate, and it served as a reminder to stay alert to the dangers of viruses.

To optimize the weather-forecasting deep-learning model quickly, a configuration that had already given the best results in another study was adopted [13]: 5 hidden layers with 16 neurons per layer, four of which use the tangent sigmoid as the activation function. In that study, the Mean Squared Error (MSE), used to measure the accuracy of the model, was very stable at 2.0, whereas in my study the MSE was below 1, which suggests that the predictive model is overfitted. The reason is that the number of samples in the previous study was fairly low at 730. That study also reported that the MSE dropped from 2.71 to 0.211 when the number of samples increased from 730 to 1,460. Therefore, when deep-learning modeling is carried out on big data, severe overfitting can occur.
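One common way to check for overfitting, offered here only as a hedged illustration, is to compare the MSE on the training set with the MSE on the validation set, since a much larger validation error points to a model that has memorized its training data. The values below are hypothetical and are not results from this study.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Hypothetical targets and predictions, for illustration only.
train_true, train_pred = [60.0, 62.0, 65.0, 70.0], [60.5, 61.5, 65.5, 69.5]
valid_true, valid_pred = [58.0, 64.0, 71.0], [61.0, 61.0, 68.0]

train_mse = mse(train_true, train_pred)
valid_mse = mse(valid_true, valid_pred)
gap = valid_mse - train_mse  # a large positive gap suggests overfitting
```

The MSE is also expressed in the squared unit of the target, so values computed on different scales are not directly comparable.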

Big-data analysis generally refers not to a large number of rows but to a large number of columns. In that sense this research is not a true big-data analysis, because it used only one variable, the daily maximum temperature. However, the physical system built with Docker and Raspberry Pi and the logical system built with Hadoop and Spark to handle big data will be needed to analyze higher-quality big data in future work.

VI. CONCLUSION

'When will the transmission of COVID-19 begin to decrease?' and 'When will the economic recovery begin?' This research was conducted to predict those dates. Across the four cases, the study found that the virus had seasonal characteristics but that other external factors also played an important role. We also found that climate forecasting models based on deep learning, unlike conventional numerical weather forecasting models, can complete hours of work in minutes. Such accelerated forecasting models could help minimize casualties and property damage when applied to areas that require real-time response, such as natural disasters. A high-performance computer is essential in fields such as weather that handle large volumes of data and require heavy computation. In addition, data for forecasting natural phenomena should be collected and processed in real time. However, weather forecasting on a single supercomputer can be costly. I think distributed computing systems and the cloud are the right technologies to solve these problems, because one characteristic of distributed computing systems is that they can turn multiple computers into a high-performance system for a single application. They can thus dramatically reduce the cost of producing the same result.

References

[1] Ted, The next outbreak? We're not ready [Website]. (2015, April 3). Retrieved from https://www.ted.com/talks/bill_gates_the_next_outbreak_we_re_not_ready

[2] Sajadi, M. M., Habibzadeh, P., Vintzileos, A., Shokouhi, S., Miralles-Wilhelm, F., & Amoroso, A. (2020). Temperature and Latitude Analysis to Predict Potential Spread and Seasonality for COVID-19. Available at SSRN 3550308.

[3] Using Machine Learning to "Nowcast" Precipitation in High Resolution, Google AI Blog [Website]. (2020, January 13). Retrieved from https://ai.googleblog.com/2020/01/using-machine-learning-to.nowcast.html

[4] Lai, C. C., Shih, T. P., Ko, W. C., Tang, H. J., & Hsueh, P. R. (2020). Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and corona virus disease-2019 (COVID-19): the epidemic and the challenges. International journal of antimicrobial agents, 105924.

[5] Tom White. (2015). Hadoop: The Definitive Guide. California: O'Reilly Media.

[6] Matei Zaharia. (2018). Spark: The Definitive Guide: Big Data Processing Made Simple. California: O'Reilly Media.

[7] YONG ChanHo. (2020). Start! : Docker/Kubernetes. Paju: Wikibook

[8] Richman DD, Whitley RJ, Hayden FG. Clinical Virology, 4th ed. Washington: ASM Press; 2016.

[9] Turner, D., Wailoo, A., Nicholson, K., Cooper, N., Sutton, A., & Abrams, K. (2003). Systematic review and economic decision modelling for the prevention and treatment of influenza A and B. In NIHR Health Technology Assessment programme: Executive Summaries. NIHR Journals Library.

[10] Centers for Disease Control and Prevention (CDC). (2015). Epidemiology and Prevention of Vaccine Preventable Diseases, 13th Edition. La Vergne: Ingram.

[11] Karun, A. K., & Chitharanjan, K. (2013, April). A review on hadoop--HDFS infrastructure extensions. In 2013 IEEE conference on information & communication technologies (pp. 132-137). IEEE.

[12] Scher, S., & Messori, G. (2019). Weather and climate forecasting with neural networks: using general circulation models (GCMs) with different complexity as a study ground. Geoscientific Model Development, 12(7), 2797-2809.

[13] Abhishek, K., Singh, M. P., Ghosh, S., & Anand, A. (2012). Weather forecasting model using artificial neural network. Procedia Technology, 4, 311-318.

[14] Sebastian Raschka and Vahid Mirjalili. (2017). Python Machine Learning - Second Edition: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow. Birmingham: Packt.

[15] Stekhoven, D. J., & Bühlmann, P. (2012). MissForest--non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.

[16] COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [Website]. (2019). Retrieved from https://github.com/CSSEGISandData/COVID-19.

[17] COVID-19 pandemic in the United States [Website]. (2020). Retrieved from https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_the_United_States.

[18] National Centers For Environmental Information: Climate Data Online [Website]. Retrieved from https://www.ncdc.noaa.gov/cdo-web/

[19] World Health Organization: Coronavirus disease (COVID-2019) situation reports [Website]. Retrieved from https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports.

[20] Elephas: Distributed Deep Learning with Keras & Spark [Website]. Retrieved from https://github.com/maxpumperla/elephas.
