The coronavirus disease (COVID-19) outbreak is an ongoing outbreak that started in 2019, hence the 19 in COVID-19. It is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
In this tutorial we will see how to use Python to do some basic exploratory data analysis (EDA) of the coronavirus outbreak. We will be using a dataset compiled from several sources, such as:
- https://github.com/CSSEGISandData/COVID-19
- https://github.com/RamiKrispin/coronavirus
- https://www.kaggle.com/imdevskp/corona-virus-report#covid_19_clean_complete.csv
Let us look at some basic questions we will be answering with this data:
- Number of Cases (Confirmed, Deaths, Recovered)
- Which country has the highest cases?
- List of countries affected
- Distribution Per Continents and Country
- Cases Per Day
- Cases Per Country
- Timeseries Analysis
By inspecting our data we can see that it consists of the following kinds of information:
- Country/Region
- Latitude and Longitude
- Date/ Time
- Numerical Data
From this simple overview, we can perform the following types of analysis:
- Geo-spatial analysis from the Latitude and Longitudes.
- Time series analysis from the Date/Time.
- Statistical analysis from the numerical data.
Let us start with the basic EDA and then the rest.
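We will be using pandas, matplotlib, and geopandas to help us with our analysis. The code below also imports seaborn and descartes, so they are included in the install command.
Installation
pip install pandas geopandas matplotlib seaborn descartes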
Let us see the entire code for our analysis. You can get the notebook and the dataset here
In [1]:
# Load EDA Pkgs
import pandas as pd
import numpy as np
In [2]:
# Load Data Viz Packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [3]:
# Load Geopandas
import geopandas as gpd
from shapely.geometry import Point, Polygon
import descartes
In [4]:
# Load Dataset
df = pd.read_csv("data/coronavirus_data.csv")
In [5]:
df.head()
Out[5]:
In [6]:
df.columns
Out[6]:
In [8]:
df.columns.str.replace(r'\n','', regex=True)
Out[8]:
In [9]:
df.columns = df.columns.str.replace(r'\n','', regex=True)
In [10]:
df.columns
Out[10]:
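To make the column clean-up step concrete, here is a minimal sketch on a small, made-up frame. The column names and the single row are hypothetical stand-ins for the real dataset, and the extra `.str.strip()` is an addition that is not in the original notebook.

```python
import pandas as pd

# Hypothetical frame whose headers picked up stray newlines during export
toy = pd.DataFrame(
    [["Hubei", "Mainland China", 444]],
    columns=["Province/State\n", "Country/Region\n", "Confirmed\n"],
)

# Remove newline characters, then trim any remaining surrounding whitespace
toy.columns = toy.columns.str.replace("\n", "", regex=False).str.strip()

cleaned_columns = list(toy.columns)
print(cleaned_columns)
```

Assigning the result back to `toy.columns` is what makes the change stick; calling `str.replace` on its own only returns a new index.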
In [11]:
df.rename(columns={'Province/State':'Province_State','Country/Region':'Country_Region'},inplace=True)
In [12]:
df.columns
Out[12]:
In [13]:
# Shape of Dataset
df.shape
Out[13]:
In [14]:
# Datatypes
df.dtypes
Out[14]:
In [15]:
# First 10
df.head(10)
Out[15]:
In [17]:
df = df[['Province_State', 'Country_Region', 'Lat', 'Long', 'Date',
'Confirmed', 'Deaths', 'Recovered']]
In [18]:
df.isna().sum()
Out[18]:
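The missing-value count matters because `groupby` drops rows whose key is missing by default. The sketch below uses made-up rows and assumes the gaps sit in `Province_State`, as they would for countries reported as a whole; the `"Unknown"` label is an arbitrary placeholder.

```python
import pandas as pd

# Hypothetical rows: countries reported as a whole have no province value
toy = pd.DataFrame({
    "Province_State": ["Hubei", None, None],
    "Country_Region": ["Mainland China", "Japan", "Thailand"],
    "Confirmed": [444, 2, 3],
})

# Count missing values per column, as df.isna().sum() does above
missing_per_column = toy.isna().sum()

# Fill the gaps with an explicit label so later groupby calls keep these rows
filled = toy.fillna({"Province_State": "Unknown"})
print(missing_per_column)
print(filled)
```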
In [19]:
df.describe()
Out[19]:
In [20]:
# Number of Cases Per Date/Day
df.head()
Out[20]:
In [21]:
df.columns
Out[21]:
In [22]:
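# Select several columns with a list (double brackets); a bare tuple of names fails in pandas 2.x
df.groupby('Date')[['Confirmed','Deaths', 'Recovered']].sum()
Out[22]:
In [23]:
df.groupby('Date')[['Confirmed','Deaths', 'Recovered']].max()
Out[23]:
In [24]:
# Largest single-location count reported on each date
df_per_day = df.groupby('Date')[['Confirmed','Deaths', 'Recovered']].max()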
In [25]:
df_per_day.head()
Out[25]:
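The two aggregations above answer different questions: `sum()` adds every location reported on a date, while `max()` keeps only the largest single location. A small sketch with made-up numbers (the rows below are illustrative, not taken from the dataset) shows the difference.

```python
import pandas as pd

# Hypothetical counts for two locations on two dates
toy = pd.DataFrame({
    "Date": ["1/22/20", "1/22/20", "1/23/20", "1/23/20"],
    "Province_State": ["Hubei", "Beijing", "Hubei", "Beijing"],
    "Confirmed": [444, 14, 549, 22],
    "Deaths": [17, 0, 24, 0],
    "Recovered": [28, 0, 31, 0],
})
cols = ["Confirmed", "Deaths", "Recovered"]

# Total across all locations on each date
total_per_day = toy.groupby("Date")[cols].sum()

# Largest single location on each date
largest_per_day = toy.groupby("Date")[cols].max()

print(total_per_day)
print(largest_per_day)
```

If the goal is a worldwide daily total, the `sum()` version is the one to use; `df_per_day` as defined in the notebook tracks the hardest-hit location instead.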
In [26]:
df_per_day.describe()
Out[26]:
In [29]:
# Max No of Cases
df_per_day['Confirmed'].max()
Out[29]:
In [30]:
# Min No Of Cases
df_per_day['Confirmed'].min()
Out[30]:
In [31]:
# Date of the Maximum Number of Cases
df_per_day['Confirmed'].idxmax()
Out[31]:
In [32]:
# Date of the Minimum Number of Cases
df_per_day['Confirmed'].idxmin()
Out[32]:
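As a first step towards the time series analysis listed earlier, the sketch below parses the date strings and turns cumulative counts into day-over-day increases. It assumes the dates are stored as `month/day/two-digit-year` strings and that `Confirmed` is cumulative; both are assumptions about the dataset, and the numbers are made up.

```python
import pandas as pd

# Hypothetical cumulative counts for one location, deliberately out of order
toy = pd.DataFrame({
    "Date": ["1/24/20", "1/22/20", "1/23/20"],
    "Confirmed": [761, 444, 549],
})

# Parse the strings so sorting is chronological rather than alphabetical
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%y")
series = toy.set_index("Date")["Confirmed"].sort_index()

# Cumulative counts -> increase since the previous date (first entry is NaN)
new_cases = series.diff()

# idxmax returns the index label, here the date with the largest increase
peak_date = new_cases.idxmax()
print(new_cases)
print(peak_date)
```

Parsing matters because plain strings sort alphabetically, so `"1/22/20"` would come before `"1/3/20"`.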
In [33]:
# Number of Cases Per Country
df.groupby(['Country_Region'])[['Confirmed','Deaths', 'Recovered']].max()
Out[33]:
In [34]:
# Number of Cases Per Province and Country
df.groupby(['Province_State','Country_Region'])[['Confirmed','Deaths', 'Recovered']].max()
Out[34]:
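To answer "which country has the highest cases?", one option is to take the most recent snapshot and add up the locations within each country. This is a sketch on made-up rows that assumes the counts are cumulative; it is an alternative to the per-group `max()` above, not the notebook's own method.

```python
import pandas as pd

# Hypothetical cumulative counts: two provinces in one country, plus one
# country reported as a whole
toy = pd.DataFrame({
    "Province_State": ["Hubei", "Hubei", "Beijing", "Beijing", None, None],
    "Country_Region": ["Mainland China"] * 4 + ["Japan"] * 2,
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23"] * 3),
    "Confirmed": [444, 549, 14, 22, 2, 2],
})

# Counts are assumed cumulative, so keep only the most recent date
latest = toy[toy["Date"] == toy["Date"].max()]

# Add up all locations within each country, largest first
country_totals = (
    latest.groupby("Country_Region")["Confirmed"]
    .sum()
    .sort_values(ascending=False)
)
top_country = country_totals.idxmax()
print(country_totals)
print(top_country)
```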
In [35]:
df['Country_Region'].value_counts()
Out[35]:
In [36]:
df['Country_Region'].value_counts().plot(kind='bar',figsize=(20,10))
Out[36]:
In [37]:
# Which Countries Are Affected
df['Country_Region'].unique()
Out[37]:
In [38]:
# How Many Countries Are Affected
len(df['Country_Region'].unique())
Out[38]:
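Note that `value_counts()` counts rows per country (one row per location per date), not cases. The made-up frame below is only an illustration of how the three calls differ.

```python
import pandas as pd

# Hypothetical rows: several reports per country
toy = pd.DataFrame({
    "Country_Region": ["Mainland China", "Mainland China", "Japan", "Japan", "Japan"],
    "Confirmed": [444, 549, 1, 1, 2],
})

# Number of rows (reports) per country
rows_per_country = toy["Country_Region"].value_counts()

# Number of distinct countries, equivalent to len(...unique())
n_countries = toy["Country_Region"].nunique()

# Highest confirmed count seen for each country
max_confirmed = toy.groupby("Country_Region")["Confirmed"].max()

print(rows_per_country)
print(n_countries)
print(max_confirmed)
```

Here Japan has the most rows but far fewer cases, so the bar and pie charts built from `value_counts()` show reporting frequency rather than outbreak size.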
In [40]:
plt.figure(figsize=(20,10))
df['Country_Region'].value_counts().plot.pie(autopct="%1.1f%%")
Out[40]:
Check for Distribution on Map
- Lat/Long
- Geometry/ Point
In [41]:
dir(gpd)
Out[41]:
In [43]:
df.head()
Out[43]:
In [45]:
# Convert Data to GeoDataframe
gdf01 = gpd.GeoDataFrame(df,geometry=gpd.points_from_xy(df['Long'],df['Lat']))
In [46]:
gdf01.head()
Out[46]:
In [47]:
type(gdf01)
Out[47]:
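A point worth checking in the conversion is the argument order: `points_from_xy` takes x (longitude) first and y (latitude) second. The sketch below repeats the conversion on two made-up rows and also sets a coordinate reference system, assuming the coordinates are plain WGS 84 latitude/longitude (EPSG:4326); the original notebook does not set one.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical coordinates for two locations
toy = pd.DataFrame({
    "Country_Region": ["Mainland China", "Japan"],
    "Lat": [30.98, 36.0],
    "Long": [112.27, 138.0],
})

# x is longitude and y is latitude; EPSG:4326 is assumed for these values
gdf = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy["Long"], toy["Lat"]),
    crs="EPSG:4326",
)

first_point = gdf.geometry.iloc[0]
print(gdf)
```

Setting the CRS up front helps when overlaying the points on other layers, since geopandas can then detect mismatched projections.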
In [48]:
# Method 2
points = [ Point(x,y) for x,y in zip(df.Long,df.Lat)]
In [49]:
gdf03 = gpd.GeoDataFrame(df,geometry=points)
In [50]:
gdf03
Out[50]:
In [51]:
# Map Plot
gdf01.plot(figsize=(20,10))
Out[51]:
In [52]:
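# Overlapping with world map
# Note: gpd.datasets was deprecated in geopandas 0.14 and removed in 1.0,
# so this call needs an older geopandas release
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
ax = world.plot(figsize=(20,10))
ax.axis('off')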
Out[52]:
In [53]:
# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01.plot(cmap='Purples',ax=ax)
world.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)
Out[53]:
In [54]:
fig,ax = plt.subplots(figsize=(20,10))
# Draw the filled world map first so the case points are plotted on top of it
world.geometry.plot(color='Yellow',edgecolor='k',linewidth=2,ax=ax)
gdf01.plot(cmap='Purples',ax=ax)
Out[54]:
In [55]:
# Per Country
world
Out[55]:
In [56]:
world['continent'].unique()
Out[56]:
In [57]:
asia = world[world['continent'] == 'Asia']
In [58]:
asia
Out[58]:
In [59]:
africa = world[world['continent'] == 'Africa']
north_america = world[world['continent'] == 'North America']
europe = world[world['continent'] == 'Europe']
In [60]:
# Cases in China
df.head()
Out[60]:
In [61]:
df[df['Country_Region'] == 'Mainland China']
Out[61]:
In [62]:
gdf01[gdf01['Country_Region'] == 'Mainland China']
Out[62]:
In [63]:
# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'Mainland China'].plot(cmap='Purples',ax=ax)
world.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)
Out[63]: