The coronavirus disease (COVID-19) outbreak is an ongoing outbreak that started in 2019, hence the "19" in COVID-19. It is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

In this tutorial we will see how to use Python to do some basic exploratory data analysis (EDA) of the coronavirus outbreak. We will be using a dataset collected from several sources.

Let us see some basic questions we will be answering with the data we have:

Number of cases (recovered, confirmed, deaths)
Which country has the highest number of cases?
List of countries affected
Distribution per continent and country
Cases per day
Cases per country
Time series analysis

By inspecting our data we can see that it consists of:

Country/Region
Latitude and Longitude
Date/Time
Numerical data

From this simple overview, we can perform the following types of analysis:

Geo-spatial analysis from the Latitude and Longitudes.
Time series analysis from the Date/Time.
Statistical analysis from the numerical data.

Let us start with the basic EDA and then move on to the geospatial and time series analysis.

We will be using pandas, matplotlib and geopandas to help us with our analysis.

Installation

pip install pandas geopandas matplotlib seaborn descartes

Let us see the entire code for our analysis. You can get the notebook and the dataset here

In [1]:

# Load EDA Pkgs
import pandas as pd
import numpy as np

In [2]:

# Load Data Viz Packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:

# Load Geopandas
import geopandas as gpd
from shapely.geometry import Point, Polygon
import descartes

In [4]:

# Load Dataset
df = pd.read_csv("data/coronavirus_data.csv")

In [5]:

df.head()

Out[5]:

	Index	Province/State\n	Country/Region\n	Lat\n	Long\n\n\n\n\n	Date\n\n\n\n\n	Confirmed\n
0	1	Anhui	Mainland China	31.8257	117.2264	1/22/20	1
1	2	Beijing	Mainland China	40.1824	116.4142	1/22/20	14
2	3	Chongqing	Mainland China	30.0572	107.8740	1/22/20	6
3	4	Fujian	Mainland China	26.0789	117.9874	1/22/20	1
4	5	Gansu	Mainland China	36.0611	103.8343	1/22/20	0

In [6]:

df.columns

Out[6]:

Index(['Index', 'Province/State\n', 'Country/Region\n', 'Lat\n',
       'Long\n\n\n\n\n', 'Date\n\n\n\n\n', 'Confirmed\n', 'Deaths\n\n',
       'Recovered\n\n\n\n\n'],
      dtype='object')

In [8]:

df.columns.str.replace(r'\n','', regex=True)

Out[8]:

Index(['Index', 'Province/State', 'Country/Region', 'Lat', 'Long', 'Date',
       'Confirmed', 'Deaths', 'Recovered'],
      dtype='object')

In [9]:

df.columns = df.columns.str.replace(r'\n','', regex=True)

In [10]:

df.columns

Out[10]:

Index(['Index', 'Province/State', 'Country/Region', 'Lat', 'Long', 'Date',
       'Confirmed', 'Deaths', 'Recovered'],
      dtype='object')
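
As an alternative to the regex replace, `str.strip()` removes leading and trailing whitespace (including newlines) from each column name, and the "/" can be swapped in the same chain. This is only a minimal sketch on a stand-in DataFrame: the messy headers are copied from the output above, while the row values are placeholders.

```python
import pandas as pd

# Stand-in frame with the same messy headers as the raw CSV (values are placeholders)
raw = pd.DataFrame(
    [[1, "Anhui", "Mainland China"]],
    columns=["Index", "Province/State\n", "Country/Region\n"],
)

# Strip trailing newlines/whitespace, then swap "/" for "_" in one chained step
cleaned = raw.copy()
cleaned.columns = cleaned.columns.str.strip().str.replace("/", "_", regex=False)

print(list(cleaned.columns))
```

Doing both steps at once would make the separate `rename` call optional, at the cost of being a little less explicit.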

In [11]:

df.rename(columns={'Province/State':'Province_State','Country/Region':'Country_Region'},inplace=True)

In [12]:

df.columns

Out[12]:

Index(['Index', 'Province_State', 'Country_Region', 'Lat', 'Long', 'Date',
       'Confirmed', 'Deaths', 'Recovered'],
      dtype='object')

In [13]:

# Shape of Dataset
df.shape

Out[13]:

(3885, 9)

In [14]:

# Datatypes
df.dtypes

Out[14]:

Index               int64
Province_State     object
Country_Region     object
Lat               float64
Long              float64
Date               object
Confirmed           int64
Deaths              int64
Recovered           int64
dtype: object

In [15]:

# First 10
df.head(10)

Out[15]:

	Index	Province_State	Country_Region	Lat	Long	Date	Confirmed
0	1	Anhui	Mainland China	31.8257	117.2264	1/22/20	1
1	2	Beijing	Mainland China	40.1824	116.4142	1/22/20	14
2	3	Chongqing	Mainland China	30.0572	107.8740	1/22/20	6
3	4	Fujian	Mainland China	26.0789	117.9874	1/22/20	1
4	5	Gansu	Mainland China	36.0611	103.8343	1/22/20	0
5	6	Guangdong	Mainland China	23.3417	113.4244	1/22/20	26
6	7	Guangxi	Mainland China	23.8298	108.7881	1/22/20	2
7	8	Guizhou	Mainland China	26.8154	106.8748	1/22/20	1
8	9	Hainan	Mainland China	19.1959	109.7453	1/22/20	4
9	10	Hebei	Mainland China	38.0428	114.5149	1/22/20	1

In [17]:

df = df[['Province_State', 'Country_Region', 'Lat', 'Long', 'Date',
       'Confirmed', 'Deaths', 'Recovered']]

In [18]:

df.isna().sum()

Out[18]:

Province_State    1665
Country_Region       0
Lat                  0
Long                 0
Date                 0
Confirmed            0
Deaths               0
Recovered            0
dtype: int64
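
All of the missing values are in `Province_State`, which is what we would expect for countries reported as a single row. Keep in mind that `groupby()` drops rows whose key is NaN by default, so those countries disappear from any province-level grouping. One possible way around this is to fill the gaps with a label first. The sketch below uses a made-up sample, and the label `"(country-level)"` is an arbitrary choice, not something from the dataset.

```python
import pandas as pd

# Small made-up sample: two rows with a province, two without
sample = pd.DataFrame({
    "Province_State": ["Hubei", None, "Ontario", None],
    "Country_Region": ["Mainland China", "Italy", "Canada", "Iran"],
    "Confirmed": [100, 20, 5, 10],
})

# groupby() silently drops rows whose key is NaN ...
dropped = sample.groupby("Province_State")["Confirmed"].sum()

# ... so fill the gaps with a placeholder label first (the label is arbitrary)
filled = sample.assign(Province_State=sample["Province_State"].fillna("(country-level)"))
kept = filled.groupby("Province_State")["Confirmed"].sum()

print(len(dropped), len(kept))
```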

In [19]:

df.describe()

Out[19]:

	Lat	Long	Confirmed	Deaths	Recovered
count	3885.000000	3885.000000	3885.000000	3885.000000	3885.000000
mean	32.252000	45.775760	396.487773	10.804118	78.544402
std	18.256877	84.338854	4017.397180	137.191519	846.918788
min	-37.813600	-123.869500	0.000000	0.000000	0.000000
25%	27.610400	8.227500	0.000000	0.000000	0.000000
50%	35.191700	78.000000	2.000000	0.000000	0.000000
75%	42.315400	113.614000	40.000000	0.000000	4.000000
max	64.000000	153.400000	65596.000000	2641.000000	23383.000000

In [20]:

# Number of cases per date
df.head()

Out[20]:

	Province_State	Country_Region	Lat	Long	Date	Confirmed
0	Anhui	Mainland China	31.8257	117.2264	1/22/20	1
1	Beijing	Mainland China	40.1824	116.4142	1/22/20	14
2	Chongqing	Mainland China	30.0572	107.8740	1/22/20	6
3	Fujian	Mainland China	26.0789	117.9874	1/22/20	1
4	Gansu	Mainland China	36.0611	103.8343	1/22/20	0

In [21]:

df.columns

Out[21]:

Index(['Province_State', 'Country_Region', 'Lat', 'Long', 'Date', 'Confirmed',
       'Deaths', 'Recovered'],
      dtype='object')

In [22]:

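# Select several columns with a list (double brackets); the tuple form no longer works in pandas 2.x
df.groupby('Date')[['Confirmed','Deaths', 'Recovered']].sum()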

Out[22]:

	Confirmed	Deaths	Recovered
Date
1/22/20	555	17	28
1/23/20	653	18	30
1/24/20	941	26	36
1/25/20	1434	42	39
1/26/20	2118	56	52
1/27/20	2927	82	61
1/28/20	5578	131	107
1/29/20	6166	133	126
1/30/20	8234	171	143
1/31/20	9927	213	222
2/1/20	12038	259	284
2/10/20	42763	1013	3946
2/11/20	44803	1113	4683
2/12/20	45222	1118	5150
2/13/20	60370	1371	6295
2/14/20	66887	1523	8058
2/15/20	69032	1666	9395
2/16/20	71226	1770	10865
2/17/20	73260	1868	12583
2/18/20	75138	2007	14352
2/19/20	75641	2122	16121
2/2/20	16787	362	472
2/20/20	76199	2247	18177
2/21/20	76843	2251	18890
2/22/20	78599	2458	22886
2/23/20	78985	2469	23394
2/24/20	79570	2629	25227
2/25/20	80415	2708	27905
2/26/20	81397	2770	30384
2/27/20	82756	2814	33277
2/3/20	19881	426	623
2/4/20	23892	492	852
2/5/20	27636	564	1124
2/6/20	30818	634	1487
2/7/20	34392	719	2011
2/8/20	37121	806	2616
2/9/20	40151	906	3244
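
Because `Date` is still a string column at this point, the groups above are sorted alphabetically, which is why 2/10/20 appears right after 2/1/20. Parsing the strings into timestamps first gives a chronological order. Here is a minimal sketch using a few totals copied from the table above; the variable names are just for illustration.

```python
import pandas as pd

# A few totals copied from the table above, in the same (string-sorted) order
totals = pd.DataFrame({
    "Date": ["1/31/20", "2/1/20", "2/10/20", "2/2/20"],
    "Confirmed": [9927, 12038, 42763, 16787],
})

# Parse the month/day/two-digit-year strings into real timestamps
totals["Date"] = pd.to_datetime(totals["Date"], format="%m/%d/%y")

# Sorting on the parsed column now follows the calendar
chronological = totals.sort_values("Date").reset_index(drop=True)
print(chronological)
```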

In [23]:

# max() gives the largest single-row (single-region) value on each date, not the daily total
df.groupby('Date')[['Confirmed','Deaths', 'Recovered']].max()

Out[23]:

	Confirmed	Deaths	Recovered
Date
1/22/20	444	17	28
1/23/20	444	17	28
1/24/20	549	24	31
1/25/20	761	40	32
1/26/20	1058	52	42
1/27/20	1423	76	45
1/28/20	3554	125	80
1/29/20	3554	125	88
1/30/20	4903	162	90
1/31/20	5806	204	141
2/1/20	7153	249	168
2/10/20	31728	974	2222
2/11/20	33366	1068	2639
2/12/20	33366	1068	2686
2/13/20	48206	1310	3459
2/14/20	54406	1457	4774
2/15/20	56249	1596	5623
2/16/20	58182	1696	6639
2/17/20	59989	1789	7862
2/18/20	61682	1921	9128
2/19/20	62031	2029	10337
2/2/20	11177	350	295
2/20/20	62442	2144	11788
2/21/20	62662	2144	11881
2/22/20	64084	2346	15299
2/23/20	64084	2346	15343
2/24/20	64287	2495	16748
2/25/20	64786	2563	18971
2/26/20	65187	2615	20969
2/27/20	65596	2641	23383
2/3/20	13522	414	386
2/4/20	16678	479	522
2/5/20	19665	549	633
2/6/20	22112	618	817
2/7/20	24953	699	1115
2/8/20	27100	780	1439
2/9/20	29631	871	1795

In [24]:

df_per_day = df.groupby('Date')[['Confirmed','Deaths', 'Recovered']].max()

In [25]:

df_per_day.head()

Out[25]:

	Confirmed	Deaths	Recovered
Date
1/22/20	444	17	28
1/23/20	444	17	28
1/24/20	549	24	31
1/25/20	761	40	32
1/26/20	1058	52	42

In [26]:

df_per_day.describe()

Out[26]:

	Confirmed	Deaths	Recovered
count	37.000000	37.000000	37.000000
mean	32616.756757	1082.513514	5338.540541
std	25664.132012	915.678972	6895.411802
min	444.000000	17.000000	28.000000
25%	5806.000000	204.000000	141.000000
50%	29631.000000	871.000000	1795.000000
75%	61682.000000	1921.000000	9128.000000
max	65596.000000	2641.000000	23383.000000

In [29]:

# Max No of Cases
df_per_day['Confirmed'].max()

Out[29]:

65596

In [30]:

# Min No Of Cases
df_per_day['Confirmed'].min()

Out[30]:

444

In [31]:

# Date for Maximum Number Cases
df_per_day['Confirmed'].idxmax()

Out[31]:

'2/27/20'

In [32]:

# Date for Min Number Cases
df_per_day['Confirmed'].idxmin()

Out[32]:

'1/22/20'
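
Note that `df_per_day` was built with `.max()`, so each value is the largest count reported by any single row on that date, not the worldwide total. The sketch below contrasts the two aggregations on a made-up sample; the region names and numbers are invented for illustration.

```python
import pandas as pd

# Made-up cumulative counts for two regions over two dates
sample = pd.DataFrame({
    "Date": ["1/22/20", "1/22/20", "1/23/20", "1/23/20"],
    "Region": ["A", "B", "A", "B"],
    "Confirmed": [444, 14, 444, 22],
})

by_date = sample.groupby("Date")["Confirmed"]
largest_region = by_date.max()   # biggest single-region count per date
overall_total = by_date.sum()    # all regions added together per date

# idxmax() returns the index label (here the date string) of the largest value
peak_date = overall_total.idxmax()
print(largest_region.tolist(), overall_total.tolist(), peak_date)
```

Which of the two is appropriate depends on the question: `.max()` tracks the worst-hit region, while `.sum()` tracks the overall count.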

In [33]:

# Largest single-row count per country
df.groupby(['Country_Region'])[['Confirmed','Deaths', 'Recovered']].max()

Out[33]:

	Confirmed	Deaths	Recovered
Country_Region
Afghanistan	1	0	0
Algeria	1	0	0
Australia	8	0	4
Austria	3	0	0
Bahrain	33	0	0
Belgium	1	0	1
Brazil	1	0	0
Cambodia	1	0	1
Canada	7	0	3
Croatia	3	0	0
Denmark	1	0	0
Egypt	1	0	0
Estonia	1	0	0
Finland	2	0	1
France	38	2	11
Georgia	1	0	0
Germany	46	0	16
Greece	3	0	0
Hong Kong	92	2	24
India	3	0	3
Iran	245	26	49
Iraq	7	0	0
Israel	3	0	1
Italy	655	17	45
Japan	214	4	22
Kuwait	43	0	0
Lebanon	2	0	0
Macau	10	0	8
Mainland China	65596	2641	23383
Malaysia	23	0	18
Nepal	1	0	1
Netherlands	1	0	0
North Macedonia	1	0	0
Norway	1	0	0
Oman	4	0	0
Others	705	4	10
Pakistan	2	0	0
Philippines	3	1	1
Romania	1	0	0
Russia	2	0	2
San Marino	1	0	0
Singapore	93	0	62
South Korea	1766	13	22
Spain	15	0	2
Sri Lanka	1	0	1
Sweden	7	0	0
Switzerland	8	0	0
Taiwan	32	1	5
Thailand	40	0	22
UK	15	0	8
US	42	0	2
United Arab Emirates	13	0	4
Vietnam	16	0	16
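
Since the counts are cumulative and some countries are split into provinces, `.max()` over a country's rows returns the count of its largest single province rather than a national total (the 65596 shown for Mainland China is one province). One way to get country totals is to keep only the latest date and sum the provinces. This is a sketch on illustrative rows, not the full dataset.

```python
import pandas as pd

# Illustrative rows: cumulative counts for three provinces on two dates
sample = pd.DataFrame({
    "Province_State": ["Hubei", "Zhejiang", "Tianjin"] * 2,
    "Country_Region": ["Mainland China"] * 6,
    "Date": pd.to_datetime(["2020-02-26"] * 3 + ["2020-02-27"] * 3),
    "Confirmed": [65187, 1205, 135, 65596, 1205, 136],
})

# Keep only the most recent snapshot, then add up the provinces per country
latest = sample[sample["Date"] == sample["Date"].max()]
country_totals = latest.groupby("Country_Region")["Confirmed"].sum()

# For comparison: max() over all rows only returns the largest single province
largest_province = sample.groupby("Country_Region")["Confirmed"].max()
print(country_totals["Mainland China"], largest_province["Mainland China"])
```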

In [34]:

# Largest count per province and country (rows without a province are dropped by groupby)
df.groupby(['Province_State','Country_Region'])[['Confirmed','Deaths', 'Recovered']].max()

Out[34]:

		Confirmed	Deaths	Recovered
Province_State	Country_Region
Anhui	Mainland China	989	6	792
Beijing	Mainland China	410	5	248
Boston, MA	US	1	0	0
British Columbia	Canada	7	0	3
Chicago, IL	US	2	0	2
Chongqing	Mainland China	576	6	401
Diamond Princess cruise ship	Others	705	4	10
From Diamond Princess	Australia	8	0	0
Fujian	Mainland China	296	1	228
Gansu	Mainland China	91	2	81
Guangdong	Mainland China	1347	7	890
Guangxi	Mainland China	252	2	161
Guizhou	Mainland China	146	2	112
Hainan	Mainland China	168	5	131
Hebei	Mainland China	317	6	274
Heilongjiang	Mainland China	480	13	270
Henan	Mainland China	1272	20	1068
Hong Kong	Hong Kong	92	2	24
Hubei	Mainland China	65596	2641	23383
Humboldt County, CA	US	1	0	0
Hunan	Mainland China	1017	4	804
Inner Mongolia	Mainland China	75	0	43
Jiangsu	Mainland China	631	0	498
Jiangxi	Mainland China	934	1	754
Jilin	Mainland China	93	1	67
Lackland, TX (From Diamond Princess)	US	2	0	0
Liaoning	Mainland China	121	1	93
London, ON	Canada	1	0	1
Los Angeles, CA	US	1	0	0
Macau	Macau	10	0	8
Madison, WI	US	1	0	0
New South Wales	Australia	4	0	4
Ningxia	Mainland China	72	0	68
Omaha, NE (From Diamond Princess)	US	11	0	0
Orange, CA	US	1	0	0
Qinghai	Mainland China	18	0	18
Queensland	Australia	5	0	1
Sacramento County, CA	US	2	0	0
San Antonio, TX	US	1	0	0
San Benito, CA	US	2	0	0
San Diego County, CA	US	2	0	1
Santa Clara, CA	US	2	0	1
Seattle, WA	US	1	0	1
Shaanxi	Mainland China	245	1	195
Shandong	Mainland China	756	6	387
Shanghai	Mainland China	337	3	276
Shanxi	Mainland China	133	0	107
Sichuan	Mainland China	534	3	321
South Australia	Australia	2	0	2
Taiwan	Taiwan	32	1	5
Tempe, AZ	US	1	0	1
Tianjin	Mainland China	136	3	102
Tibet	Mainland China	1	0	1
Toronto, ON	Canada	5	0	2
Travis, CA (From Diamond Princess)	US	5	0	0
Unassigned Location (From Diamond Princess)	US	42	0	0
Victoria	Australia	4	0	4
Xinjiang	Mainland China	76	2	43
Yunnan	Mainland China	174	2	150
Zhejiang	Mainland China	1205	1	932

In [35]:

df['Country_Region'].value_counts()

Out[35]:

Mainland China          1147
US                       629
Australia                185
Canada                   111
Philippines               37
UK                        37
Egypt                     37
India                     37
Sri Lanka                 37
Cambodia                  37
Spain                     37
Georgia                   37
Hong Kong                 37
South Korea               37
Oman                      37
Estonia                   37
Afghanistan               37
Algeria                   37
Vietnam                   37
Pakistan                  37
San Marino                37
Sweden                    37
Greece                    37
Taiwan                    37
Croatia                   37
Bahrain                   37
Iran                      37
Romania                   37
Netherlands               37
Brazil                    37
Lebanon                   37
Russia                    37
Switzerland               37
Austria                   37
France                    37
Belgium                   37
Germany                   37
North Macedonia           37
Iraq                      37
Norway                    37
Finland                   37
Others                    37
Nepal                     37
Japan                     37
Israel                    37
Thailand                  37
Denmark                   37
Italy                     37
Malaysia                  37
Macau                     37
Kuwait                    37
Singapore                 37
United Arab Emirates      37
Name: Country_Region, dtype: int64

In [36]:

df['Country_Region'].value_counts().plot(kind='bar',figsize=(20,10))

Out[36]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f814b9716d0>

In [37]:

# Which countries are affected?
df['Country_Region'].unique()

Out[37]:

array(['Mainland China', 'Thailand', 'Japan', 'South Korea', 'Taiwan',
       'US', 'Macau', 'Hong Kong', 'Singapore', 'Vietnam', 'France',
       'Nepal', 'Malaysia', 'Canada', 'Australia', 'Cambodia',
       'Sri Lanka', 'Germany', 'Finland', 'United Arab Emirates',
       'Philippines', 'India', 'Italy', 'UK', 'Russia', 'Sweden', 'Spain',
       'Belgium', 'Others', 'Egypt', 'Iran', 'Lebanon', 'Iraq', 'Oman',
       'Afghanistan', 'Bahrain', 'Kuwait', 'Algeria', 'Croatia',
       'Switzerland', 'Austria', 'Israel', 'Pakistan', 'Brazil',
       'Georgia', 'Greece', 'North Macedonia', 'Norway', 'Romania',
       'Denmark', 'Estonia', 'Netherlands', 'San Marino'], dtype=object)
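
`unique()` lists every value that appears in the column, including `'Others'` and countries whose counts are still zero on the earlier dates. If we want the countries with at least one confirmed case on a given day, we can filter first. The helper `affected_countries` below is a hypothetical name, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample in the same long format: one row per country per date
sample = pd.DataFrame({
    "Country_Region": ["Italy", "Romania", "Others", "Italy", "Romania", "Others"],
    "Date": pd.to_datetime(["2020-01-31"] * 3 + ["2020-02-27"] * 3),
    "Confirmed": [2, 0, 0, 655, 1, 705],
})

def affected_countries(frame, as_of):
    """Return sorted country names with at least one confirmed case on `as_of`."""
    day = frame[(frame["Date"] == pd.Timestamp(as_of)) & (frame["Confirmed"] > 0)]
    # 'Others' is a catch-all bucket rather than a country, so leave it out
    names = day.loc[day["Country_Region"] != "Others", "Country_Region"]
    return sorted(names.unique())

print(affected_countries(sample, "2020-01-31"))
print(affected_countries(sample, "2020-02-27"))
```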

In [38]:

# How many countries are affected?
len(df['Country_Region'].unique())

Out[38]:

53

In [40]:

plt.figure(figsize=(20,10))
df['Country_Region'].value_counts().plot.pie(autopct="%1.1f%%")

Out[40]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f814b1e6a90>

Check for Distribution on Map

Lat/Long
Geometry/ Point

In [41]:

dir(gpd)

Out[41]:

['GeoDataFrame',
 'GeoSeries',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_compat',
 '_config',
 '_version',
 'array',
 'base',
 'clip',
 'datasets',
 'geodataframe',
 'geopandas',
 'geoseries',
 'gpd',
 'io',
 'np',
 'options',
 'overlay',
 'pd',
 'plotting',
 'points_from_xy',
 'read_file',
 'read_postgis',
 'show_versions',
 'sjoin',
 'tools']

In [43]:

df.head()

Out[43]:

	Province_State	Country_Region	Lat	Long	Date	Confirmed
0	Anhui	Mainland China	31.8257	117.2264	1/22/20	1
1	Beijing	Mainland China	40.1824	116.4142	1/22/20	14
2	Chongqing	Mainland China	30.0572	107.8740	1/22/20	6
3	Fujian	Mainland China	26.0789	117.9874	1/22/20	1
4	Gansu	Mainland China	36.0611	103.8343	1/22/20	0

In [45]:

# Convert Data to GeoDataframe
gdf01 = gpd.GeoDataFrame(df,geometry=gpd.points_from_xy(df['Long'],df['Lat']))

In [46]:

gdf01.head()

Out[46]:

	Province_State	Country_Region	Lat	Long	Date	Confirmed	geometry
0	Anhui	Mainland China	31.8257	117.2264	1/22/20	1	POINT (117.22640 31.82570)
1	Beijing	Mainland China	40.1824	116.4142	1/22/20	14	POINT (116.41420 40.18240)
2	Chongqing	Mainland China	30.0572	107.8740	1/22/20	6	POINT (107.87400 30.05720)
3	Fujian	Mainland China	26.0789	117.9874	1/22/20	1	POINT (117.98740 26.07890)
4	Gansu	Mainland China	36.0611	103.8343	1/22/20	0	POINT (103.83430 36.06110)

In [47]:

type(gdf01)

Out[47]:

geopandas.geodataframe.GeoDataFrame

In [48]:

# Method 2
points = [ Point(x,y) for x,y in zip(df.Long,df.Lat)]

In [49]:

gdf03 = gpd.GeoDataFrame(df,geometry=points)

In [50]:

gdf03

Out[50]:

	Province_State	Country_Region	Lat	Long	Date	Confirmed	Deaths	Recovered	geometry
0	Anhui	Mainland China	31.8257	117.2264	1/22/20	1	0	0	POINT (117.22640 31.82570)
1	Beijing	Mainland China	40.1824	116.4142	1/22/20	14	0	0	POINT (116.41420 40.18240)
2	Chongqing	Mainland China	30.0572	107.8740	1/22/20	6	0	0	POINT (107.87400 30.05720)
3	Fujian	Mainland China	26.0789	117.9874	1/22/20	1	0	0	POINT (117.98740 26.07890)
4	Gansu	Mainland China	36.0611	103.8343	1/22/20	0	0	0	POINT (103.83430 36.06110)
…	…	…	…	…	…	…	…	…	…
3880	NaN	Romania	45.9432	24.9668	2/27/20	1	0	0	POINT (24.96680 45.94320)
3881	NaN	Denmark	56.2639	9.5018	2/27/20	1	0	0	POINT (9.50180 56.26390)
3882	NaN	Estonia	58.5953	25.0136	2/27/20	1	0	0	POINT (25.01360 58.59530)
3883	NaN	Netherlands	52.1326	5.2913	2/27/20	1	0	0	POINT (5.29130 52.13260)
3884	NaN	San Marino	43.9424	12.4578	2/27/20	1	0	0	POINT (12.45780 43.94240)

3885 rows × 9 columns

In [51]:

# Map Plot
gdf01.plot(figsize=(20,10))

Out[51]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f814af95fd0>

In [52]:

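# Overlay on a world map
# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0; on newer versions,
# download the Natural Earth low-resolution countries file and pass its path to gpd.read_file
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))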
ax = world.plot(figsize=(20,10))
ax.axis('off')

Out[52]:

(-198.0, 198.00000000000006, -98.6822565, 92.32738650000002)

In [53]:

# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01.plot(cmap='Purples',ax=ax)
world.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)

Out[53]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f81491e8e10>

In [54]:

fig,ax = plt.subplots(figsize=(20,10))
gdf01.plot(cmap='Purples',ax=ax)
world.geometry.plot(color='Yellow',edgecolor='k',linewidth=2,ax=ax)

Out[54]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8149126050>

In [55]:

# Per Country
world

Out[55]:

	pop_est	continent	name	iso_a3	gdp_md_est	geometry
0	920938	Oceania	Fiji	FJI	8374.0	MULTIPOLYGON (((180.00000 -16.06713, 180.00000…
1	53950935	Africa	Tanzania	TZA	150600.0	POLYGON ((33.90371 -0.95000, 34.07262 -1.05982…
2	603253	Africa	W. Sahara	ESH	906.5	POLYGON ((-8.66559 27.65643, -8.66512 27.58948…
3	35623680	North America	Canada	CAN	1674000.0	MULTIPOLYGON (((-122.84000 49.00000, -122.9742…
4	326625791	North America	United States of America	USA	18560000.0	MULTIPOLYGON (((-122.84000 49.00000, -120.0000…
…	…	…	…	…	…	…
172	7111024	Europe	Serbia	SRB	101800.0	POLYGON ((18.82982 45.90887, 18.82984 45.90888…
173	642550	Europe	Montenegro	MNE	10610.0	POLYGON ((20.07070 42.58863, 19.80161 42.50009…
174	1895250	Europe	Kosovo	-99	18490.0	POLYGON ((20.59025 41.85541, 20.52295 42.21787…
175	1218208	North America	Trinidad and Tobago	TTO	43570.0	POLYGON ((-61.68000 10.76000, -61.10500 10.890…
176	13026129	Africa	S. Sudan	SSD	20880.0	POLYGON ((30.83385 3.50917, 29.95350 4.17370, …

177 rows × 6 columns

In [56]:

world['continent'].unique()

Out[56]:

array(['Oceania', 'Africa', 'North America', 'Asia', 'South America',
       'Europe', 'Seven seas (open ocean)', 'Antarctica'], dtype=object)

In [57]:

asia = world[world['continent'] == 'Asia']

In [58]:

asia

Out[58]:

	pop_est	continent	name	iso_a3	gdp_md_est	geometry
5	18556698	Asia	Kazakhstan	KAZ	460700.00	POLYGON ((87.35997 49.21498, 86.59878 48.54918…
6	29748859	Asia	Uzbekistan	UZB	202300.00	POLYGON ((55.96819 41.30864, 55.92892 44.99586…
8	260580739	Asia	Indonesia	IDN	3028000.00	MULTIPOLYGON (((141.00021 -2.60015, 141.01706 …
24	1291358	Asia	Timor-Leste	TLS	4975.00	POLYGON ((124.96868 -8.89279, 125.08625 -8.656…
76	8299706	Asia	Israel	ISR	297000.00	POLYGON ((35.71992 32.70919, 35.54567 32.39399…
77	6229794	Asia	Lebanon	LBN	85160.00	POLYGON ((35.82110 33.27743, 35.55280 33.26427…
79	4543126	Asia	Palestine	PSE	21220.77	POLYGON ((35.39756 31.48909, 34.92741 31.35344…
83	10248069	Asia	Jordan	JOR	86190.00	POLYGON ((35.54567 32.39399, 35.71992 32.70919…
84	6072475	Asia	United Arab Emirates	ARE	667200.00	POLYGON ((51.57952 24.24550, 51.75744 24.29407…
85	2314307	Asia	Qatar	QAT	334500.00	POLYGON ((50.81011 24.75474, 50.74391 25.48242…
86	2875422	Asia	Kuwait	KWT	301100.00	POLYGON ((47.97452 29.97582, 48.18319 29.53448…
87	39192111	Asia	Iraq	IRQ	596700.00	POLYGON ((39.19547 32.16101, 38.79234 33.37869…
88	3424386	Asia	Oman	OMN	173100.00	MULTIPOLYGON (((55.20834 22.70833, 55.23449 23…
90	16204486	Asia	Cambodia	KHM	58940.00	POLYGON ((102.58493 12.18659, 102.34810 13.394…
91	68414135	Asia	Thailand	THA	1161000.00	POLYGON ((105.21878 14.27321, 104.28142 14.416…
92	7126706	Asia	Laos	LAO	40960.00	POLYGON ((107.38273 14.20244, 106.49637 14.570…
93	55123814	Asia	Myanmar	MMR	311100.00	POLYGON ((100.11599 20.41785, 99.54331 20.1866…
94	96160163	Asia	Vietnam	VNM	594900.00	POLYGON ((104.33433 10.48654, 105.19991 10.889…
95	25248140	Asia	North Korea	PRK	40000.00	MULTIPOLYGON (((130.78000 42.22001, 130.78000 …
96	51181299	Asia	South Korea	KOR	1929000.00	POLYGON ((126.17476 37.74969, 126.23734 37.840…
97	3068243	Asia	Mongolia	MNG	37000.00	POLYGON ((87.75126 49.29720, 88.80557 49.47052…
98	1281935911	Asia	India	IND	8721000.00	POLYGON ((97.32711 28.26158, 97.40256 27.88254…
99	157826578	Asia	Bangladesh	BGD	628400.00	POLYGON ((92.67272 22.04124, 92.65226 21.32405…
100	758288	Asia	Bhutan	BTN	6432.00	POLYGON ((91.69666 27.77174, 92.10371 27.45261…
101	29384297	Asia	Nepal	NPL	71520.00	POLYGON ((88.12044 27.87654, 88.04313 27.44582…
102	204924861	Asia	Pakistan	PAK	988200.00	POLYGON ((77.83745 35.49401, 76.87172 34.65354…
103	34124811	Asia	Afghanistan	AFG	64080.00	POLYGON ((66.51861 37.36278, 67.07578 37.35614…
104	8468555	Asia	Tajikistan	TJK	25810.00	POLYGON ((67.83000 37.14499, 68.39203 38.15703…
105	5789122	Asia	Kyrgyzstan	KGZ	21010.00	POLYGON ((70.96231 42.26615, 71.18628 42.70429…
106	5351277	Asia	Turkmenistan	TKM	94720.00	POLYGON ((52.50246 41.78332, 52.94429 42.11603…
107	82021564	Asia	Iran	IRN	1459000.00	POLYGON ((48.56797 29.92678, 48.01457 30.45246…
108	18028549	Asia	Syria	SYR	50280.00	POLYGON ((35.71992 32.70919, 35.70080 32.71601…
109	3045191	Asia	Armenia	ARM	26300.00	POLYGON ((46.50572 38.77061, 46.14362 38.74120…
124	80845215	Asia	Turkey	TUR	1670000.00	MULTIPOLYGON (((44.77268 37.17044, 44.29345 37…
138	22409381	Asia	Sri Lanka	LKA	236700.00	POLYGON ((81.78796 7.52306, 81.63732 6.48178, …
139	1379302771	Asia	China	CHN	21140000.00	MULTIPOLYGON (((109.47521 18.19770, 108.65521 …
140	23508428	Asia	Taiwan	TWN	1127000.00	POLYGON ((121.77782 24.39427, 121.17563 22.790…
145	9961396	Asia	Azerbaijan	AZE	167900.00	MULTIPOLYGON (((46.40495 41.86068, 46.68607 41…
146	4926330	Asia	Georgia	GEO	37270.00	POLYGON ((39.95501 43.43500, 40.07696 43.55310…
147	104256076	Asia	Philippines	PHL	801900.00	MULTIPOLYGON (((120.83390 12.70450, 120.32344 …
148	31381992	Asia	Malaysia	MYS	863000.00	MULTIPOLYGON (((100.08576 6.46449, 100.25960 6…
149	443593	Asia	Brunei	BRN	33730.00	POLYGON ((115.45071 5.44773, 115.40570 4.95523…
155	126451398	Asia	Japan	JPN	4932000.00	MULTIPOLYGON (((141.88460 39.18086, 140.95949 …
157	28036829	Asia	Yemen	YEM	73450.00	POLYGON ((52.00001 19.00000, 52.78218 17.34974…
158	28571770	Asia	Saudi Arabia	SAU	1731000.00	POLYGON ((34.95604 29.35655, 36.06894 29.19749…
160	265100	Asia	N. Cyprus	-99	3600.00	POLYGON ((32.73178 35.14003, 32.80247 35.14550…
161	1221549	Asia	Cyprus	CYP	29260.00	POLYGON ((32.73178 35.14003, 32.91957 35.08783…

In [59]:

africa = world[world['continent'] == 'Africa']
north_america = world[world['continent'] == 'North America']
europe = world[world['continent'] == 'Europe']
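
The continent subsets above serve as map backgrounds. To look at the distribution per continent in numbers, the case data could also be joined to the `continent` column of `world`. The sketch below uses a small stand-in for `world[['name', 'continent']]` so that it runs without geopandas. The name mapping (for example `'Mainland China'` to `'China'`) is an assumption about how the two sources line up and would need to be extended for the full dataset, and the case counts are illustrative.

```python
import pandas as pd

# Stand-in for world[['name', 'continent']] (three rows only)
world_lookup = pd.DataFrame({
    "name": ["China", "United States of America", "Egypt"],
    "continent": ["Asia", "North America", "Africa"],
})

# Latest cumulative counts per country (values are illustrative)
cases = pd.DataFrame({
    "Country_Region": ["Mainland China", "US", "Egypt", "Others"],
    "Confirmed": [78000, 60, 1, 705],
})

# Assumed mapping from dataset spellings to the names used in the world table
name_fixes = {"Mainland China": "China", "US": "United States of America"}
cases["name"] = cases["Country_Region"].replace(name_fixes)

# Left join keeps unmatched rows (e.g. 'Others'); their continent stays NaN
merged = cases.merge(world_lookup, on="name", how="left")
per_continent = merged.groupby("continent")["Confirmed"].sum()
print(per_continent)
```

Rows that do not match any country keep a missing continent and drop out of the grouped result, so it is worth checking `merged['continent'].isna()` before trusting the totals.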

In [60]:

# Cases in China
df.head()

Out[60]:

	Province_State	Country_Region	Lat	Long	Date	Confirmed	geometry
0	Anhui	Mainland China	31.8257	117.2264	1/22/20	1	POINT (117.22640 31.82570)
1	Beijing	Mainland China	40.1824	116.4142	1/22/20	14	POINT (116.41420 40.18240)
2	Chongqing	Mainland China	30.0572	107.8740	1/22/20	6	POINT (107.87400 30.05720)
3	Fujian	Mainland China	26.0789	117.9874	1/22/20	1	POINT (117.98740 26.07890)
4	Gansu	Mainland China	36.0611	103.8343	1/22/20	0	POINT (103.83430 36.06110)

In [61]:

df[df['Country_Region'] == 'Mainland China']

Out[61]:

	Province_State	Country_Region	Lat	Long	Date	Confirmed	Deaths	Recovered	geometry
0	Anhui	Mainland China	31.8257	117.2264	1/22/20	1	0	0	POINT (117.22640 31.82570)
1	Beijing	Mainland China	40.1824	116.4142	1/22/20	14	0	0	POINT (116.41420 40.18240)
2	Chongqing	Mainland China	30.0572	107.8740	1/22/20	6	0	0	POINT (107.87400 30.05720)
3	Fujian	Mainland China	26.0789	117.9874	1/22/20	1	0	0	POINT (117.98740 26.07890)
4	Gansu	Mainland China	36.0611	103.8343	1/22/20	0	0	0	POINT (103.83430 36.06110)
…	…	…	…	…	…	…	…	…	…
3806	Tianjin	Mainland China	39.3054	117.3230	2/27/20	136	3	102	POINT (117.32300 39.30540)
3807	Tibet	Mainland China	31.6927	88.0924	2/27/20	1	0	1	POINT (88.09240 31.69270)
3808	Xinjiang	Mainland China	41.1129	85.2401	2/27/20	76	2	43	POINT (85.24010 41.11290)
3809	Yunnan	Mainland China	24.9740	101.4870	2/27/20	174	2	150	POINT (101.48700 24.97400)
3810	Zhejiang	Mainland China	29.1832	120.0934	2/27/20	1205	1	932	POINT (120.09340 29.18320)

1147 rows × 9 columns

In [62]:

gdf01[gdf01['Country_Region'] == 'Mainland China']

Out[62]:

	Province_State	Country_Region	Lat	Long	Date	Confirmed	Deaths	Recovered	geometry
0	Anhui	Mainland China	31.8257	117.2264	1/22/20	1	0	0	POINT (117.22640 31.82570)
1	Beijing	Mainland China	40.1824	116.4142	1/22/20	14	0	0	POINT (116.41420 40.18240)
2	Chongqing	Mainland China	30.0572	107.8740	1/22/20	6	0	0	POINT (107.87400 30.05720)
3	Fujian	Mainland China	26.0789	117.9874	1/22/20	1	0	0	POINT (117.98740 26.07890)
4	Gansu	Mainland China	36.0611	103.8343	1/22/20	0	0	0	POINT (103.83430 36.06110)
…	…	…	…	…	…	…	…	…	…
3806	Tianjin	Mainland China	39.3054	117.3230	2/27/20	136	3	102	POINT (117.32300 39.30540)
3807	Tibet	Mainland China	31.6927	88.0924	2/27/20	1	0	1	POINT (88.09240 31.69270)
3808	Xinjiang	Mainland China	41.1129	85.2401	2/27/20	76	2	43	POINT (85.24010 41.11290)
3809	Yunnan	Mainland China	24.9740	101.4870	2/27/20	174	2	150	POINT (101.48700 24.97400)
3810	Zhejiang	Mainland China	29.1832	120.0934	2/27/20	1205	1	932	POINT (120.09340 29.18320)

1147 rows × 9 columns

In [63]:

# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'Mainland China'].plot(cmap='Purples',ax=ax)
world.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)

Out[63]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8149bb84d0>

In [65]:

# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'Mainland China'].plot(cmap='Purples',ax=ax)
asia.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)

Out[65]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f81491e1b10>

In [66]:

# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'India'].plot(cmap='Purples',ax=ax)
asia.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)

Out[66]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8149021bd0>

In [67]:

# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'Egypt'].plot(cmap='Purples',ax=ax)
africa.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)

Out[67]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8148f946d0>

In [68]:

# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'US'].plot(cmap='Purples',ax=ax)
north_america.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)

Out[68]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8148f102d0>

In [69]:

# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'UK'].plot(cmap='Purples',ax=ax)
europe.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)

Out[69]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8148e85b50>

In [70]:

# Time Series Analysis
df.head()

Out[70]:

	Province_State	Country_Region	Lat	Long	Date	Confirmed	geometry
0	Anhui	Mainland China	31.8257	117.2264	1/22/20	1	POINT (117.22640 31.82570)
1	Beijing	Mainland China	40.1824	116.4142	1/22/20	14	POINT (116.41420 40.18240)
2	Chongqing	Mainland China	30.0572	107.8740	1/22/20	6	POINT (107.87400 30.05720)
3	Fujian	Mainland China	26.0789	117.9874	1/22/20	1	POINT (117.98740 26.07890)
4	Gansu	Mainland China	36.0611	103.8343	1/22/20	0	POINT (103.83430 36.06110)

In [71]:

df_per_day

Out[71]:

	Confirmed	Deaths	Recovered
Date
1/22/20	444	17	28
1/23/20	444	17	28
1/24/20	549	24	31
1/25/20	761	40	32
1/26/20	1058	52	42
1/27/20	1423	76	45
1/28/20	3554	125	80
1/29/20	3554	125	88
1/30/20	4903	162	90
1/31/20	5806	204	141
2/1/20	7153	249	168
2/10/20	31728	974	2222
2/11/20	33366	1068	2639
2/12/20	33366	1068	2686
2/13/20	48206	1310	3459
2/14/20	54406	1457	4774
2/15/20	56249	1596	5623
2/16/20	58182	1696	6639
2/17/20	59989	1789	7862
2/18/20	61682	1921	9128
2/19/20	62031	2029	10337
2/2/20	11177	350	295
2/20/20	62442	2144	11788
2/21/20	62662	2144	11881
2/22/20	64084	2346	15299
2/23/20	64084	2346	15343
2/24/20	64287	2495	16748
2/25/20	64786	2563	18971
2/26/20	65187	2615	20969
2/27/20	65596	2641	23383
2/3/20	13522	414	386
2/4/20	16678	479	522
2/5/20	19665	549	633
2/6/20	22112	618	817
2/7/20	24953	699	1115
2/8/20	27100	780	1439
2/9/20	29631	871	1795

In [72]:

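# Note: this is an alias, not a copy. df2 and df refer to the same DataFrame,
# so columns added to one show up in the other (df.copy() would give an independent copy)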
df2 = df
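
Plain assignment only gives the same DataFrame a second name, whereas `.copy()` creates an independent object. A minimal sketch of the difference, using a throwaway frame and illustrative names:

```python
import pandas as pd

original = pd.DataFrame({"Confirmed": [1, 14, 6]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

# Adding a column through the original shows up in the alias but not the copy
original["flag"] = True

print("flag" in alias.columns, "flag" in independent.columns)
```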

In [73]:

df.to_csv("coronavirus_data_clean.csv")

In [74]:

import datetime as dt

In [75]:

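# Parse month/day/two-digit-year strings explicitly to avoid ambiguous date inference
df['cases_date'] = pd.to_datetime(df2['Date'], format='%m/%d/%y')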

In [76]:

df2.dtypes

Out[76]:

Province_State            object
Country_Region            object
Lat                      float64
Long                     float64
Date                      object
Confirmed                  int64
Deaths                     int64
Recovered                  int64
geometry                geometry
cases_date        datetime64[ns]
dtype: object

In [77]:

df['cases_date'].plot(figsize=(20,10))

Out[77]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8148e09750>

In [78]:

ts = df2.set_index('cases_date')

ts

In [80]:

# Select For January
ts.loc['2020-01']

Out[80]:

	Province_State	Country_Region	Lat	Long	Date	Confirmed	Deaths	Recovered	geometry
cases_date
2020-01-22	Anhui	Mainland China	31.8257	117.2264	1/22/20	1	0	0	POINT (117.22640 31.82570)
2020-01-22	Beijing	Mainland China	40.1824	116.4142	1/22/20	14	0	0	POINT (116.41420 40.18240)
2020-01-22	Chongqing	Mainland China	30.0572	107.8740	1/22/20	6	0	0	POINT (107.87400 30.05720)
2020-01-22	Fujian	Mainland China	26.0789	117.9874	1/22/20	1	0	0	POINT (117.98740 26.07890)
2020-01-22	Gansu	Mainland China	36.0611	103.8343	1/22/20	0	0	0	POINT (103.83430 36.06110)
…	…	…	…	…	…	…	…	…	…
2020-01-31	NaN	Romania	45.9432	24.9668	1/31/20	0	0	0	POINT (24.96680 45.94320)
2020-01-31	NaN	Denmark	56.2639	9.5018	1/31/20	0	0	0	POINT (9.50180 56.26390)
2020-01-31	NaN	Estonia	58.5953	25.0136	1/31/20	0	0	0	POINT (25.01360 58.59530)
2020-01-31	NaN	Netherlands	52.1326	5.2913	1/31/20	0	0	0	POINT (5.29130 52.13260)
2020-01-31	NaN	San Marino	43.9424	12.4578	1/31/20	0	0	0	POINT (12.45780 43.94240)

1050 rows × 9 columns

In [81]:

ts.loc['2020-02-24':'2020-02-25']

Out[81]:

	Province_State	Country_Region	Lat	Long	Date	Confirmed	Deaths	Recovered	geometry
cases_date
2020-02-24	Anhui	Mainland China	31.8257	117.2264	2/24/20	989	6	663	POINT (117.22640 31.82570)
2020-02-24	Beijing	Mainland China	40.1824	116.4142	2/24/20	399	4	198	POINT (116.41420 40.18240)
2020-02-24	Chongqing	Mainland China	30.0572	107.8740	2/24/20	576	6	349	POINT (107.87400 30.05720)
2020-02-24	Fujian	Mainland China	26.0789	117.9874	2/24/20	293	1	183	POINT (117.98740 26.07890)
2020-02-24	Gansu	Mainland China	36.0611	103.8343	2/24/20	91	2	80	POINT (103.83430 36.06110)
…	…	…	…	…	…	…	…	…	…
2020-02-25	NaN	Romania	45.9432	24.9668	2/25/20	0	0	0	POINT (24.96680 45.94320)
2020-02-25	NaN	Denmark	56.2639	9.5018	2/25/20	0	0	0	POINT (9.50180 56.26390)
2020-02-25	NaN	Estonia	58.5953	25.0136	2/25/20	0	0	0	POINT (25.01360 58.59530)
2020-02-25	NaN	Netherlands	52.1326	5.2913	2/25/20	0	0	0	POINT (5.29130 52.13260)
2020-02-25	NaN	San Marino	43.9424	12.4578	2/25/20	0	0	0	POINT (12.45780 43.94240)

210 rows × 9 columns

In [82]:

ts.loc['2020-02-24':'2020-02-25'][['Confirmed','Recovered']]

Out[82]:

	Confirmed	Recovered
cases_date
2020-02-24	989	663
2020-02-24	399	198
2020-02-24	576	349
2020-02-24	293	183
2020-02-24	91	80
…	…	…
2020-02-25	0	0
2020-02-25	0	0
2020-02-25	0	0
2020-02-25	0	0
2020-02-25	0	0

210 rows × 2 columns

In [83]:

ts.loc['2020-02-24':'2020-02-25'][['Confirmed','Recovered']].plot(figsize=(20,10))

Out[83]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8148e01950>

In [86]:

ts.loc['2020-02-02':'2020-02-25'][['Confirmed','Deaths']].plot(figsize=(20,10))

/usr/local/lib/python3.7/dist-packages/pandas/plotting/_matplotlib/core.py:1085: UserWarning: Attempting to set identical left == right == 737480.0 results in singular transformations; automatically expanding.
  ax.set_xlim(left, right)

Out[86]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8148bad7d0>

In [89]:

# Sum only the numeric columns per date; reset_index() turns cases_date back into a column
df_by_date = ts.groupby('cases_date').sum(numeric_only=True).reset_index()

In [90]:

df_by_date

Out[90]:

	cases_date	Lat	Long	Confirmed	Deaths	Recovered
0	2020-01-22	3386.46002	4806.4548	555	17	28
1	2020-01-23	3386.46002	4806.4548	653	18	30
2	2020-01-24	3386.46002	4806.4548	941	26	36
3	2020-01-25	3386.46002	4806.4548	1434	42	39
4	2020-01-26	3386.46002	4806.4548	2118	56	52
5	2020-01-27	3386.46002	4806.4548	2927	82	61
6	2020-01-28	3386.46002	4806.4548	5578	131	107
7	2020-01-29	3386.46002	4806.4548	6166	133	126
8	2020-01-30	3386.46002	4806.4548	8234	171	143
9	2020-01-31	3386.46002	4806.4548	9927	213	222
10	2020-02-01	3386.46002	4806.4548	12038	259	284
11	2020-02-02	3386.46002	4806.4548	16787	362	472
12	2020-02-03	3386.46002	4806.4548	19881	426	623
13	2020-02-04	3386.46002	4806.4548	23892	492	852
14	2020-02-05	3386.46002	4806.4548	27636	564	1124
15	2020-02-06	3386.46002	4806.4548	30818	634	1487
16	2020-02-07	3386.46002	4806.4548	34392	719	2011
17	2020-02-08	3386.46002	4806.4548	37121	806	2616
18	2020-02-09	3386.46002	4806.4548	40151	906	3244
19	2020-02-10	3386.46002	4806.4548	42763	1013	3946
20	2020-02-11	3386.46002	4806.4548	44803	1113	4683
21	2020-02-12	3386.46002	4806.4548	45222	1118	5150
22	2020-02-13	3386.46002	4806.4548	60370	1371	6295
23	2020-02-14	3386.46002	4806.4548	66887	1523	8058
24	2020-02-15	3386.46002	4806.4548	69032	1666	9395
25	2020-02-16	3386.46002	4806.4548	71226	1770	10865
26	2020-02-17	3386.46002	4806.4548	73260	1868	12583
27	2020-02-18	3386.46002	4806.4548	75138	2007	14352
28	2020-02-19	3386.46002	4806.4548	75641	2122	16121
29	2020-02-20	3386.46002	4806.4548	76199	2247	18177
30	2020-02-21	3386.46002	4806.4548	76843	2251	18890
31	2020-02-22	3386.46002	4806.4548	78599	2458	22886
32	2020-02-23	3386.46002	4806.4548	78985	2469	23394
33	2020-02-24	3386.46002	4806.4548	79570	2629	25227
34	2020-02-25	3386.46002	4806.4548	80415	2708	27905
35	2020-02-26	3386.46002	4806.4548	81397	2770	30384
36	2020-02-27	3386.46002	4806.4548	82756	2814	33277
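
The totals above are cumulative, so each row includes all earlier cases. To see how many new cases were added each day, we can take the difference between consecutive rows with `diff()`. A minimal sketch using the first five confirmed totals from the table; the variable names are illustrative.

```python
import pandas as pd

# First five cumulative confirmed totals from the table above
cumulative = pd.Series(
    [555, 653, 941, 1434, 2118],
    index=pd.to_datetime(
        ["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25", "2020-01-26"]
    ),
    name="Confirmed",
)

# diff() subtracts the previous day's total; the first day has no predecessor
new_cases = cumulative.diff().dropna().astype(int)
print(new_cases.tolist())
```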

In [91]:

df_by_date.columns

Out[91]:

Index(['cases_date', 'Lat', 'Long', 'Confirmed', 'Deaths', 'Recovered'], dtype='object')

In [92]:

df_by_date[['Confirmed', 'Deaths', 'Recovered']].plot(kind='line',figsize=(20,10))

Out[92]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8148b40bd0>

You can check out the entire video here

Thanks for your time

Jesus Saves

By Jesse E.Agbe(JCharis)
