My Final Project for Data Mining
The objective of this project was to uncover insights from COVID-19 case data and build machine learning models to help provide a clearer picture of what is happening in the United States right now.
The data spans January 21, 2020 to April 30, 2020. Four models were built using supervised learning methods, each answering one of the questions below:
- If a state imposes a stay-at-home or shelter-in-place intervention, can I predict whether a county in that state will see an increase of less than 30% in week-on-week COVID-19 case counts within 3 weeks of the intervention?
- If a state imposes a stay-at-home or shelter-in-place intervention, can I predict whether a county in that state will see a 100% decrease in weekly case counts from week 2 to week 3 post-intervention?
- Can I predict whether a state will implement stay-at-home or shelter-in-place intervention?
- Can I predict whether there is a greater than 5% chance of dying in a U.S. county if diagnosed as COVID-19 positive?
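To make the thresholds in these questions concrete, here is a minimal sketch of how such yes/no labels could be derived from weekly counts. The function names and the exact definitions (which two weeks are compared, deaths divided by confirmed cases) are assumptions for illustration; the full report defines the real labels.

```python
def pct_change(previous, current):
    """Week-on-week percentage change; None if the previous week had no cases."""
    if previous == 0:
        return None
    return (current - previous) * 100.0 / previous


def q1_label(weekly_cases, threshold=30.0):
    """Assumed Question 1 label: the latest week-on-week increase is below the threshold.

    weekly_cases: new cases per week after the intervention, oldest first.
    """
    change = pct_change(weekly_cases[-2], weekly_cases[-1])
    return change is not None and change < threshold


def q4_label(cases, deaths, threshold=0.05):
    """Assumed Question 4 label: deaths / confirmed cases exceeds the threshold."""
    return cases > 0 and deaths / cases > threshold
```

For example, a county going from 100 to 120 weekly cases has a 20% increase, so `q1_label([50, 100, 120])` is true under these assumptions.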
County-level demographic, weather, and COVID-19 case data for the contiguous United States, Alaska, and Hawaii was used to build each model. This data was collected from several sources, including the Census Bureau API, the National Conference of State Legislatures, NOAA's National Centers for Environmental Information, and The New York Times.
Overall, all models showed some predictive capability and would likely improve if provided with more county-level demographic data. The best performing model was the one for predicting whether a state would impose a stay-at-home or shelter-in-place intervention, which identified the political party of a state governor as a key indicator (if the state governor is a Democrat, an intervention is highly likely to be imposed).
The models for Questions 1, 2, and 4 were built using k-Nearest Neighbors (kNN). kNN is a supervised learning algorithm in which you choose a value for k, the number of nearest neighbors the algorithm looks at when classifying an unseen data point. The unseen point is assigned the class held by the majority of those k neighbors.
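The majority vote described above can be sketched in a few lines of Python. This is only an illustration, not the Weka implementation used for the project; it assumes numeric features, Euclidean distance, and a hypothetical `knn_predict` helper.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (features, label) pairs, where features is a tuple of numbers.
    query: tuple of numbers with the same length as each features tuple.
    """
    # Sort the training points by Euclidean distance to the query, keep the k closest.
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Because the vote depends on raw distances, features on very different scales can dominate each other, which is one reason normalization is often tried alongside kNN.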
The model for Question 3 was built using a decision tree (aka CART), a supervised learning algorithm that picks the feature whose split produces subsets of the original dataset (or of the data in the parent node) with less noise, or entropy (a measure of how mixed the class labels in a subset are). This process is repeated on each subset until the data can't be separated any further.
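A small sketch of the entropy idea, assuming Shannon entropy and information gain as the split criterion. Classic CART usually scores splits with Gini impurity instead, and the exact criterion Weka applied here is not stated, so treat this as an illustration of the description above rather than the model's actual internals.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted
```

A split that separates the classes perfectly has the highest possible gain, while a split that leaves each subset as mixed as the parent has a gain of zero; the tree prefers the feature with the larger gain at each node.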
The decision tree constructed for Question 3 is available on GitHub: https://github.com/KISS/covid-19_research/blob/master/q3_model_decision_tree.png
Link to Code
Link to Full Report
https://github.com/KISS/covid-19_research/blob/master/CS619%20-%20COVID-19%20Research%20Report.pdf
How I built it
I used SQL, Python, Excel, Postman, Azure Data Studio, and Visual Studio Code as tools for collecting, visualizing, pre-processing, and merging the different data sources I needed. I also used SQL to output the final datasets I used to build my models.
I used Weka (https://www.cs.waikato.ac.nz/ml/weka/) to convert my datasets from CSV to the format supported by Weka (.arff) and to build my different models. Weka isn't my preferred tool but it was the tool required by my professor.
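For readers who haven't seen the .arff format, the sketch below shows roughly what the conversion produces. The project used Weka itself for this step; this hypothetical `csv_to_arff` helper only illustrates the layout, and it assumes simple column names without spaces, no missing values, and that any non-numeric column is nominal.

```python
import csv
import io


def is_number(value):
    """True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text (with a header row) into a minimal ARFF string."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        values = [row[i] for row in data]
        if all(is_number(v) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: list the distinct values inside braces.
            nominal = ",".join(sorted(set(values)))
            lines.append(f"@attribute {name} {{{nominal}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical column names, for illustration only.
sample = "population,governor_party,intervention\n1000,D,yes\n2500,R,no\n"
print(csv_to_arff(sample, "covid_q3"))
```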
For model selection, I analyzed each model's confusion matrix and ROC area.
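Weka reports both metrics directly, but a rough sketch of what they measure may help. The helper names are hypothetical, and the ROC area is computed here with the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive example gets the higher score, ties counting half), which is equivalent to the area under the ROC curve.

```python
def confusion_matrix(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def roc_area(actual, scores, positive):
    """Area under the ROC curve from predicted scores for the positive class.

    Assumes at least one positive and one negative example.
    """
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An ROC area of 0.5 means the model ranks examples no better than chance, while 1.0 means every positive example is scored above every negative one. The confusion matrix shows which kind of mistake the model makes, which matters when one outcome is more important than the other.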
Additional Thoughts / Feelings / Stories
Two things I found really challenging:
- Coming up with machine learning (or predictive) problems
- Model selection
Correctly forming the questions I wanted to answer was difficult. I found it easy to come up with questions I was curious about but making sure they were actual machine learning problems and not data analysis problems was challenging. It took a lot of iterations. A quick test I started using was figuring out (a) could I find the answer by analyzing the data I had (was it a one-off check), and (b) could I assume that someone using the model I built would have access to the same data I used or was I depending on data not easily available/reproducible.
Model selection was challenging because in most cases the predictive power of a model was pretty shitty, and if it "looked" better, it was overfitting. This is why I started looking purely at the confusion matrix and ROC area, and then adjusted my selection criteria based on the question I was trying to answer and which outcome was most important. I also applied Occam's razor, which can be summarized as: "the simplest solution is most likely the right one".
Data collection, pre-processing, and feature selection were also challenging. Knowing the questions I wanted to answer was a huge help with this step. I also tried several different machine learning algorithms and data transformation methods (e.g., normalization) before narrowing down the models and ultimately selecting the ones I did.
Based on my calendar, I spent about 2 weeks (78-82 hours) on the whole project, from coming up with the proposal to writing the final report.
All the code I wrote, data I used/created, and the final models are available on GitHub if you'd like to check it out. Ping me if you have any questions or are curious about how I did things.
Top comments (2)
Nice 😄 How did you learn Python and machine learning? Did teachers at university teach you, or are you self-taught from online tutorials? Congrats Luisa on graduating 🏆🎉