NY Times COVID-19 data visualization
Anton Antonov
MathematicaForPrediction at WordPress
SystemModeling at GitHub
March 2020
December 2020
January 2021
Introduction
The purpose of this notebook is to give data locations, data ingestion code, and code for rudimentary analysis and visualization of the COVID-19 data provided by The New York Times, [NYT1].
The following steps are taken:
◼ Ingest data
    ◼ Take COVID-19 data from The New York Times, based on reports from state and local health agencies, [NYT1].
    ◼ Take USA counties records data (FIPS codes, geo-coordinates, populations), [WRI1].
◼ Merge the data.
◼ Make data summaries and related plots.
◼ Make corresponding geo-plots.
◼ Do “out of the box” time series forecast.
◼ Analyze fluctuations around time series trends.
Note that other, older repositories with COVID-19 data exist, such as [JH1, VK1].
Remark: The time series section is done for illustration purposes only. The forecasts there should not be taken seriously.
Import data
NYTimes USA states data
In[]:=
dsNYDataStates = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"];
(* Capitalize the column names *)
dsNYDataStates = dsNYDataStates[All, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataStates[[1 ;; 6]]
Out[]=
In[]:=
ResourceFunction["RecordsSummary"][dsNYDataStates]
Out[]=
[Records summary with columns: 1 Date, 2 State, 3 Fips, 4 Cases, 5 Deaths]
NYTimes USA counties data
In[]:=
dsNYDataCounties = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"];
(* Capitalize the column names *)
dsNYDataCounties = dsNYDataCounties[All, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataCounties[[1 ;; 6]]
Out[]=
In[]:=
ResourceFunction["RecordsSummary"][dsNYDataCounties]
Out[]=
[Records summary with columns: 1 Date, 2 County, 3 State, 4 Fips, 5 Cases, 6 Deaths]
US county records
In[]:=
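dsUSACountyData = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/antononcube/SystemModeling/master/Data/dfUSACountyRecords.csv"];
(* Convert the FIPS codes to expressions so they can be matched against the "Fips" column of the NYT data *)
dsUSACountyData = dsUSACountyData[All, Join[#, <|"FIPS" -> ToExpression[#FIPS]|>] &];
dsUSACountyData[[1 ;; 6]]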
Out[]=
In[]:=
ResourceFunction["RecordsSummary"][dsUSACountyData]
Out[]=
[Records summary with columns: 1 Country, 2 State, 3 County, 4 FIPS, 5 Population, 6 Lat, 7 Lon]
Merge data
Verify that the two datasets have common FIPS codes:
In[]:=
Length[Intersection[Normal[dsUSACountyData[All,"FIPS"]],Normal[dsNYDataCounties[All,"Fips"]]]]
Out[]=
3133
Merge the datasets:
In[]:=
dsNYDataCountiesExtended=Dataset[JoinAcross[Normal[dsNYDataCounties],Normal[dsUSACountyData[All,{"FIPS","Lat","Lon","Population"}]],Key["Fips"]->Key["FIPS"]]];
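As a minimal sketch of what this join does, the toy example below uses made-up records (the codes, coordinates, and populations are illustrative assumptions, not real data). It suggests that JoinAcross with a key mapping keeps only the records whose code occurs in both lists, which is why the intersection check above is worth doing first.

```mathematica
(* Made-up stand-ins for the NYT county records and the county reference data *)
nytToy = {
   <|"Date" -> "2021-01-01", "Fips" -> 1001, "Cases" -> 10|>,
   <|"Date" -> "2021-01-02", "Fips" -> 1001, "Cases" -> 12|>,
   <|"Date" -> "2021-01-02", "Fips" -> 9999, "Cases" -> 3|>};
countyToy = {
   <|"FIPS" -> 1001, "Lat" -> 32.5, "Lon" -> -86.6, "Population" -> 55000|>,
   <|"FIPS" -> 1003, "Lat" -> 30.7, "Lon" -> -87.7, "Population" -> 220000|>};

(* Inner join: "Fips" in the first list is matched against "FIPS" in the second *)
joined = JoinAcross[nytToy, countyToy, Key["Fips"] -> Key["FIPS"]];

(* Records with codes 9999 and 1003 have no partner and are dropped *)
Length[joined]
joined[[All, "Population"]]
```

Records without a match are silently discarded by the default inner join, so comparing row counts before and after the merge is a cheap sanity check.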
Add a “DateObject” column and (reverse) sort by date:
In[]:=
dsNYDataCountiesExtended = dsNYDataCountiesExtended[All, Join[<|"DateObject" -> DateObject[#Date]|>, #] &];
(* Most recent dates first *)
dsNYDataCountiesExtended = dsNYDataCountiesExtended[ReverseSortBy[#DateObject &]];
dsNYDataCountiesExtended[[1 ;; 6]]
Out[]=
Basic data analysis
We consider cases and deaths for the last date only. (The queries can easily be adjusted for other dates.)
Here is the summary of the values of cases and deaths across the different USA counties:
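A last-date query of this kind could look like the sketch below. It runs on a small made-up dataset (dsToy and all of its values are hypothetical) rather than on dsNYDataCountiesExtended, and it uses plain built-in statistics in place of a full records summary.

```mathematica
(* Toy stand-in for the merged county dataset; all values are made up *)
dsToy = Dataset[{
    <|"Date" -> "2021-01-02", "County" -> "A", "Cases" -> 120, "Deaths" -> 4|>,
    <|"Date" -> "2021-01-02", "County" -> "B", "Cases" -> 45, "Deaths" -> 1|>,
    <|"Date" -> "2021-01-01", "County" -> "A", "Cases" -> 100, "Deaths" -> 3|>,
    <|"Date" -> "2021-01-01", "County" -> "B", "Cases" -> 40, "Deaths" -> 1|>}];

(* ISO date strings sort chronologically, so the last sorted string is the last date *)
lastDate = Last[Sort[Normal[dsToy[All, "Date"]]]];

(* Keep only the rows for that date *)
dsLast = dsToy[Select[#Date === lastDate &]];

(* Minimum, median, and maximum of each column of interest *)
summary = AssociationMap[
   Function[col,
    With[{v = Normal[dsLast[All, col]]},
     <|"Min" -> Min[v], "Median" -> Median[v], "Max" -> Max[v]|>]],
   {"Cases", "Deaths"}]
```

Replacing lastDate with any other date string present in the data adjusts the query to that date.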
The following table of plots shows the distributions of cases and deaths and the corresponding Pareto principle adherence plots:
A couple of observations:
◼ The logarithms of the cases and deaths have nearly Normal or Logistic distributions.
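Pareto principle adherence can be checked with a cumulative-share curve. The following is a minimal sketch on made-up county counts (the list cases is hypothetical), not the plotting code of the notebook: sort the counts in descending order, accumulate them, and compare the fraction of the total with the fraction of counties needed to reach it.

```mathematica
(* Hypothetical per-county case counts; the values are made up *)
cases = {5000, 1200, 300, 150, 90, 60, 40, 30, 20, 10};

(* Sort descending and compute the cumulative fraction of the total *)
sorted = ReverseSort[cases];
cumFraction = N[Accumulate[sorted]/Total[sorted]];

(* Fraction of counties used to reach each cumulative fraction *)
countyFraction = N[Range[Length[sorted]]/Length[sorted]];

(* Grid lines mark the 20% of counties / 80% of cases reference point *)
ListLinePlot[Transpose[{countyFraction, cumFraction}],
 PlotRange -> {{0, 1}, {0, 1}}, GridLines -> {{0.2}, {0.8}},
 AxesLabel -> {"fraction of counties", "fraction of cases"}]
```

If the curve passes above the marked point, the top 20% of the counties account for more than 80% of the cases.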
Distributions
Pareto principle locations
Geo-histogram
Heat-map plots
An alternative to the geo-visualization is to use a heat-map plot. Here we use the package "HeatmapPlot.m", [AAp1].
Cases
Cross-tabulate states with dates over cases:
Make a heat-map plot by sorting the columns of the cross-tabulation matrix (that correspond to states):
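The cross-tabulation step can be sketched with built-in functions only. The example below is an assumption-laden stand-in: it uses made-up records and MatrixPlot instead of the "HeatmapPlot.m" package, and it builds a dense matrix whose rows are dates and whose columns are states.

```mathematica
(* Made-up state-level records *)
records = {
   <|"State" -> "NY", "Date" -> "2021-01-01", "Cases" -> 10|>,
   <|"State" -> "NY", "Date" -> "2021-01-02", "Cases" -> 14|>,
   <|"State" -> "CA", "Date" -> "2021-01-01", "Cases" -> 7|>,
   <|"State" -> "CA", "Date" -> "2021-01-02", "Cases" -> 9|>,
   <|"State" -> "CA", "Date" -> "2021-01-02", "Cases" -> 1|>};

(* Total cases for every (state, date) pair *)
totals = GroupBy[records, {#State, #Date} &, Total[#[[All, "Cases"]]] &];

(* Union returns the distinct values in sorted order *)
states = Union[records[[All, "State"]]];
dates = Union[records[[All, "Date"]]];

(* Rows are dates, columns are states; missing pairs become 0 *)
mat = Table[Lookup[totals, Key[{s, d}], 0], {d, dates}, {s, states}];

MatrixPlot[mat, PlotLegends -> Automatic]
```

Reordering the columns, for example by their totals, only requires permuting states before the matrix is built.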
Deaths
Cross-tabulate states with dates over deaths and plot:
Time series analysis
Cases
Time series
For each date sum all cases over the states, make a time series, and plot it:
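A minimal sketch of this aggregation, assuming state-level records shaped like those in dsNYDataStates (the records below are made up): group by date, total the cases, and wrap the result in a TimeSeries object.

```mathematica
(* Made-up state-level records *)
records = {
   <|"Date" -> "2021-01-01", "State" -> "NY", "Cases" -> 10|>,
   <|"Date" -> "2021-01-01", "State" -> "CA", "Cases" -> 7|>,
   <|"Date" -> "2021-01-02", "State" -> "NY", "Cases" -> 14|>,
   <|"Date" -> "2021-01-02", "State" -> "CA", "Cases" -> 10|>,
   <|"Date" -> "2021-01-03", "State" -> "NY", "Cases" -> 18|>,
   <|"Date" -> "2021-01-03", "State" -> "CA", "Cases" -> 13|>};

(* Sum the cases over the states for each date; KeySort puts the dates in order *)
totals = KeySort[GroupBy[records, #Date &, Total[#[[All, "Cases"]]] &]];

(* Build a time series from {date, value} pairs *)
ts = TimeSeries[Transpose[{DateObject /@ Keys[totals], Values[totals]}]];

DateListPlot[ts, PlotRange -> All]
DateListLogPlot[ts, PlotRange -> All]
```

The same pattern applies to deaths by replacing the "Cases" key with "Deaths".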
Logarithmic plot:
“Forecast”
Fit a time series model to the base-10 logarithm of the time series:
Plot the base-10 logarithm of the data and the forecast:
Plot data and forecast:
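The fit-and-forecast steps can be sketched as follows. The series is synthetic (seeded pseudo-random values with roughly exponential growth, not NYT data), the model family is left to the automatic selection of TimeSeriesModelFit, and the 10-step horizon is an arbitrary choice. As with the rest of this section, the result is for illustration only.

```mathematica
SeedRandom[2021];

(* Synthetic counts: roughly exponential growth with small multiplicative noise *)
values = Table[100.*1.08^k*(1 + RandomReal[{-0.02, 0.02}]), {k, 0, 59}];
ts = TimeSeries[Transpose[{Range[60], values}]];

(* Work on the base-10 logarithm of the values *)
logTs = TimeSeriesMap[Log10, ts];

(* "Out of the box" model selection and fit *)
fit = TimeSeriesModelFit[logTs];

(* Forecast steps 1 through 10 ahead, then transform back from the log scale *)
logForecast = TimeSeriesForecast[fit, {10}];
forecast = TimeSeriesMap[10^# &, logForecast];

ListLinePlot[{ts, forecast}, PlotRange -> All,
 PlotLegends -> {"data", "forecast"}]
```

Fitting on the logarithmic scale keeps the back-transformed forecast positive, which suits count data.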
Deaths
Time series
For each date sum all deaths over the states, make a time series, and plot it:
“Forecast”
Fit a time series model:
Plot data and forecast:
Fluctuations
We want to see whether the time series data fluctuate around their trends, and to estimate the distributions of those fluctuations. (Knowing those distributions makes further studies possible.)
This can be done efficiently using the software monad QRMon, [AAp2, AA1]. Here we load the QRMon package:
Fluctuations presence
Here we plot the consecutive differences of the cases:
Here we plot the consecutive differences of the deaths:
From the plots we see that time series are not monotonically increasing, and there are non-trivial fluctuations in the data.
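Consecutive differences of this kind can be computed with the built-in Differences function. The sketch below uses a short made-up list of cumulative counts (hypothetical values) to show how day-to-day increments go up and down even when the cumulative totals only grow.

```mathematica
(* Made-up cumulative counts for eight consecutive days *)
cumulative = {100, 130, 170, 200, 260, 300, 330, 390};

(* Day-to-day increments: each element minus its predecessor *)
daily = Differences[cumulative];

ListLinePlot[daily, PlotRange -> All, PlotMarkers -> Automatic,
 AxesLabel -> {"day", "new cases"}]
```

For a TimeSeries object, the same operation can be applied to its values, for example Differences[ts["Values"]].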
Absolute and relative errors distributions
Here we take an interesting part of the cases data:
Here we specify a QRMon workflow that rescales the data, fits a B-spline curve to get the trend, and finds the absolute and relative errors (residuals, fluctuations) around that trend:
Here we find the distribution of the relative errors:
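The idea of measuring errors around a trend can be illustrated without QRMon. The sketch below swaps in a different technique: a least-squares cubic polynomial from the built-in Fit, not the B-spline quantile regression of the workflow above. The data are synthetic and the choice of a normal distribution for the final estimate is an assumption made for illustration.

```mathematica
SeedRandom[7];

(* Synthetic series: a smooth trend plus bounded noise *)
data = Table[{t, 50 + 3 t + 0.05 t^2 + RandomReal[{-4, 4}]}, {t, 1, 60}];

(* Least-squares cubic trend (a stand-in for the B-spline fit) *)
trend = Fit[data, {1, x, x^2, x^3}, x];
fitted = trend /. x -> data[[All, 1]];

(* Absolute errors are the residuals; relative errors divide them by the trend *)
absErrors = data[[All, 2]] - fitted;
relErrors = absErrors/fitted;

Histogram[relErrors, Automatic, "PDF"]

(* Estimate the parameters of an assumed normal distribution *)
EstimatedDistribution[relErrors, NormalDistribution[mu, sigma]]
```

Relative errors are the natural scale when the level of the series changes a lot over time, because a fixed absolute fluctuation matters less as the counts grow.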
References
[AA1] Anton Antonov, "A monad for Quantile Regression workflows", (2018), at MathematicaForPrediction WordPress.
[AAp2] Anton Antonov, Monadic Quantile Regression Mathematica package, (2018), MathematicaForPrediction at GitHub.