The coronavirus disease (COVID-19) outbreak is an ongoing outbreak that started in 2019, hence the 19 in COVID-19. It is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
In this tutorial we will see how to use Python to do some basic exploratory data analysis (EDA) of the coronavirus outbreak. We will be using a dataset collected from several sources:
- https://github.com/CSSEGISandData/COVID-19
- https://github.com/RamiKrispin/coronavirus
- https://www.kaggle.com/imdevskp/corona-virus-report#covid_19_clean_complete.csv
Here are some basic questions we will answer with the data:
- Number of Cases (Confirmed, Deaths, Recovered)
- Which country has the highest number of cases?
- List of countries affected
- Distribution Per Continent and Country
- Cases Per Day
- Cases Per Country
- Timeseries Analysis
A first look at the data shows that it consists of
- Country/Region
- Latitude and Longitude
- Date/Time
- Numerical Data
From this simple overview we can see that the following types of analysis are possible
- Geo-spatial analysis from the Latitude and Longitudes.
- Time series analysis from the Date/Time.
- Statistical analysis from the numerical data.
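Let us start with basic EDA and then move on to the geospatial and time series analysis.
We will be using pandas, matplotlib, seaborn and geopandas (together with shapely and descartes) to help us with our analysis.
Installation
pip install pandas numpy matplotlib seaborn geopandas shapely descartes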
# Load EDA Pkgs
import pandas as pd
import numpy as np
# Load Data Viz Packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load Geopandas
import geopandas as gpd
from shapely.geometry import Point, Polygon
import descartes
# Load Dataset
df = pd.read_csv("data/coronavirus_data.csv")
df.head()
df.columns
df.columns.str.replace(r'\n','', regex=True)
df.columns = df.columns.str.replace(r'\n','', regex=True)
df.columns
df.rename(columns={'Province/State':'Province_State','Country/Region':'Country_Region'},inplace=True)
df.columns
# Shape of Dataset
df.shape
# Datatypes
df.dtypes
# First 10
df.head(10)
df = df[['Province_State', 'Country_Region', 'Lat', 'Long', 'Date',
'Confirmed', 'Deaths', 'Recovered']]
df.isna().sum()
df.describe()
# Number of Cases Per Date/Day
df.head()
df.columns
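# Select several columns with a list (double brackets); tuple-style selection fails on recent pandas
df.groupby('Date')[['Confirmed','Deaths', 'Recovered']].sum()
df.groupby('Date')[['Confirmed','Deaths', 'Recovered']].max()
df_per_day = df.groupby('Date')[['Confirmed','Deaths', 'Recovered']].max()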
df_per_day.head()
df_per_day.describe()
# Max No of Cases
df_per_day['Confirmed'].max()
# Min No Of Cases
df_per_day['Confirmed'].min()
# Date for Maximum Number Cases
df_per_day['Confirmed'].idxmax()
# Date for Min Number Cases
df_per_day['Confirmed'].idxmin()
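The choice between `sum()` and `max()` in the per-day grouping matters: `max()` returns the largest single row for each date, while `sum()` adds up every region reported on that date. A minimal sketch on a made-up frame (the region names and figures are illustrative, not taken from the dataset) shows the difference:

```python
import pandas as pd

# Toy data (illustrative values only): two regions reported on two dates
toy = pd.DataFrame({
    "Date": ["1/22/20", "1/22/20", "1/23/20", "1/23/20"],
    "Country_Region": ["A", "B", "A", "B"],
    "Confirmed": [10, 2, 15, 4],
    "Deaths": [1, 0, 2, 0],
    "Recovered": [0, 0, 3, 1],
})

cols = ["Confirmed", "Deaths", "Recovered"]

# Total across all regions for each date
per_day_total = toy.groupby("Date")[cols].sum()

# Largest single-region value for each date
per_day_max = toy.groupby("Date")[cols].max()

print(per_day_total)
print(per_day_max)
```

Which one is appropriate depends on the question: totals per day call for `sum()`, while `max()` answers "what was the worst-hit single region on that day".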
# Number of Cases Per Country
df.groupby(['Country_Region'])[['Confirmed','Deaths', 'Recovered']].max()
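To answer "which country has the highest number of cases", one possible approach is to take the rows for the most recent date, sum the provinces within each country, and rank the result. The sketch below assumes the counts are cumulative (as in the JHU CSSE source) and uses a made-up frame; the country and province names and the figures are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical names and values); country B has no province breakdown
toy = pd.DataFrame({
    "Province_State": ["P1", "P2", "P1", "P2", None, None],
    "Country_Region": ["A", "A", "A", "A", "B", "B"],
    "Date": ["2020-02-01", "2020-02-01", "2020-02-02",
             "2020-02-02", "2020-02-01", "2020-02-02"],
    "Confirmed": [5, 3, 8, 4, 1, 2],
})
toy["Date"] = pd.to_datetime(toy["Date"])

# Keep only the rows from the most recent date
latest = toy[toy["Date"] == toy["Date"].max()]

# Sum provinces within each country, then sort from highest to lowest
country_totals = (
    latest.groupby("Country_Region")["Confirmed"]
    .sum()
    .sort_values(ascending=False)
)

top_country = country_totals.idxmax()
print(country_totals)
print("Highest:", top_country)
```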
# Number of Cases Per Province/Country
df.groupby(['Province_State','Country_Region'])[['Confirmed','Deaths', 'Recovered']].max()
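One thing to watch for when grouping by province: by default `groupby` drops rows whose key is missing, so if `Province_State` has missing values (which `df.isna().sum()` would reveal), countries without a province breakdown disappear from the result. A small sketch with illustrative values, assuming pandas 1.1 or later for the `dropna` parameter:

```python
import pandas as pd

# Toy data (illustrative values); the second country has no province recorded
toy = pd.DataFrame({
    "Province_State": ["Hubei", None, "Hubei", None],
    "Country_Region": ["Mainland China", "Egypt", "Mainland China", "Egypt"],
    "Confirmed": [100, 1, 150, 2],
})

keys = ["Province_State", "Country_Region"]

# Default behaviour: groups with a missing key are dropped
default_groups = toy.groupby(keys)[["Confirmed"]].max()

# dropna=False keeps the rows whose province is missing
all_groups = toy.groupby(keys, dropna=False)[["Confirmed"]].max()

print(len(default_groups), len(all_groups))
```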
df['Country_Region'].value_counts()
df['Country_Region'].value_counts().plot(kind='bar',figsize=(20,10))
# List of countries affected
df['Country_Region'].unique()
# Number of countries affected
len(df['Country_Region'].unique())
plt.figure(figsize=(20,10))
df['Country_Region'].value_counts().plot.pie(autopct="%1.1f%%")
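Note that `value_counts()` counts how many rows each country has in the dataset, not how many cases it reported. A pie chart with one slice per country can also become hard to read. One option, sketched below with hypothetical counts, is to keep the largest few entries and fold the rest into a single "Other" slice before plotting.

```python
import pandas as pd

# Hypothetical row counts per country (stand-in for value_counts() output)
counts = pd.Series({"A": 50, "B": 30, "C": 10, "D": 6, "E": 4})

top_n = 2  # how many individual slices to keep

# Largest top_n entries
top = counts.nlargest(top_n)

# Everything else is combined into one value
other = counts.drop(top.index).sum()

grouped = pd.concat([top, pd.Series({"Other": other})])
print(grouped)
```

The resulting `grouped` series can be passed to `.plot.pie(autopct="%1.1f%%")` in the same way as the original counts.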
Check for Distribution on Map
- Lat/Long
- Geometry/ Point
dir(gpd)
df.head()
# Convert Data to GeoDataframe
gdf01 = gpd.GeoDataFrame(df,geometry=gpd.points_from_xy(df['Long'],df['Lat']))
gdf01.head()
type(gdf01)
# Method 2: build the shapely Points manually
points = [ Point(x,y) for x,y in zip(df.Long,df.Lat)]
gdf03 = gpd.GeoDataFrame(df,geometry=points)
gdf03
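Both methods above produce point geometries without a coordinate reference system. Since latitude and longitude are usually expressed in WGS 84, it can help to declare that explicitly so the points line up with other layers. The sketch below assumes WGS 84 (`EPSG:4326`) and uses approximate, illustrative coordinates; the bounding box is a rough hypothetical rectangle, not an official boundary.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

# Toy data with approximate coordinates (illustrative only)
toy = pd.DataFrame({
    "Country_Region": ["Egypt", "India", "US"],
    "Lat": [26.0, 21.0, 37.0],
    "Long": [30.0, 78.0, -95.0],
    "Confirmed": [1, 3, 15],
})

# x is longitude, y is latitude; crs declares the coordinates as WGS 84
gdf = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy["Long"], toy["Lat"]),
    crs="EPSG:4326",
)

# Rough rectangle (min lon, min lat, max lon, max lat), hypothetical extent
region = box(60, 5, 100, 40)

# Keep only the points that fall inside the rectangle
inside = gdf[gdf.within(region)]
print(inside[["Country_Region", "Confirmed"]])
```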
# Map Plot
gdf01.plot(figsize=(20,10))
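# Overlapping with world map
# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0; on newer versions,
# download the Natural Earth low-resolution countries file and pass its path to gpd.read_file
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))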
ax = world.plot(figsize=(20,10))
ax.axis('off')
# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01.plot(cmap='Purples',ax=ax)
world.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)
fig,ax = plt.subplots(figsize=(20,10))
gdf01.plot(cmap='Purples',ax=ax)
world.geometry.plot(color='Yellow',edgecolor='k',linewidth=2,ax=ax)
# Per Continent
world
world['continent'].unique()
asia = world[world['continent'] == 'Asia']
asia
africa = world[world['continent'] == 'Africa']
north_america = world[world['continent'] == 'North America']
europe = world[world['continent'] == 'Europe']
# Cases in China
df.head()
df[df['Country_Region'] == 'Mainland China']
gdf01[gdf01['Country_Region'] == 'Mainland China']
# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'Mainland China'].plot(cmap='Purples',ax=ax)
world.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)
# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'Mainland China'].plot(cmap='Purples',ax=ax)
asia.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)
# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'India'].plot(cmap='Purples',ax=ax)
asia.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)
# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'Egypt'].plot(cmap='Purples',ax=ax)
africa.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)
# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'US'].plot(cmap='Purples',ax=ax)
north_america.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)
# Overlap
fig,ax = plt.subplots(figsize=(20,10))
gdf01[gdf01['Country_Region'] == 'UK'].plot(cmap='Purples',ax=ax)
europe.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)
# Time Series Analysis
df.head()
df_per_day
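# Copy (use .copy() so that changes to df2 do not alter df)
df2 = df.copy()
df.to_csv("coronavirus_data_clean.csv")
import datetime as dt
# Parse the Date column into datetime values
df2['cases_date'] = pd.to_datetime(df2['Date'])
df2.dtypes
df2['cases_date'].plot(figsize=(20,10))
# Use the parsed dates as the index
ts = df2.set_index('cases_date')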
ts
# Select For January
ts.loc['2020-01']
ts.loc['2020-02-24':'2020-02-25']
ts.loc['2020-02-24':'2020-02-25'][['Confirmed','Recovered']]
ts.loc['2020-02-24':'2020-02-25'][['Confirmed','Recovered']].plot(figsize=(20,10))
ts.loc['2020-02-2':'2020-02-25'][['Confirmed','Deaths']].plot(figsize=(20,10))
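Selecting by month or by date range with `.loc` is most reliable when the datetime index is sorted. The sketch below assumes the dates are stored as month/day/two-digit-year strings (the source does not state the format) and uses made-up values.

```python
import pandas as pd

# Toy data, deliberately out of order (illustrative values only)
toy = pd.DataFrame({
    "Date": ["2/25/20", "1/22/20", "2/24/20", "1/23/20"],
    "Confirmed": [40, 1, 30, 2],
})

# Assumed date format: month/day/two-digit year
toy["cases_date"] = pd.to_datetime(toy["Date"], format="%m/%d/%y")

# Sort the index so that label-based slicing behaves predictably
ts_toy = toy.set_index("cases_date").sort_index()

# Partial string selects the whole month
january = ts_toy.loc["2020-01"]

# Slice between two dates (both ends inclusive)
window = ts_toy.loc["2020-02-24":"2020-02-25"]

print(january)
print(window)
```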
# Sum only the numeric case columns for each date
df_by_date = ts.groupby('cases_date')[['Confirmed', 'Deaths', 'Recovered']].sum().reset_index()
df_by_date
df_by_date.columns
df_by_date[['Confirmed', 'Deaths', 'Recovered']].plot(kind='line',figsize=(20,10))
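If the counts are cumulative (an assumption here, although it matches how the JHU CSSE source reports them), the line plot above shows running totals rather than new cases per day. A minimal sketch of turning a cumulative series into daily new cases and a short rolling average, using made-up numbers:

```python
import pandas as pd

# Made-up cumulative confirmed counts for five consecutive days
dates = pd.date_range("2020-02-01", periods=5, freq="D")
cumulative = pd.Series([10, 15, 15, 25, 40], index=dates, name="Confirmed")

# Day-over-day difference; the first day keeps its own value
daily_new = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)

# 3-day rolling mean to smooth out day-to-day noise
rolling_mean = daily_new.rolling(window=3).mean()

print(daily_new)
print(rolling_mean)
```

The window of 3 is only for the toy example; a 7-day window is a common choice for daily reporting data because it evens out weekday effects.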
You can check out the entire video here
Thanks for your time
Jesus Saves
By Jesse E. Agbe (JCharis)