Exploratory data analysis (EDA) is an important part of any data science project. It forms the initial steps before moving on to the machine learning work.
In this tutorial we will explore the drug review dataset in detail using Python. When doing EDA, it helps to keep in mind the basic questions you want your dataset to answer. These questions guide which analyses to run and how deeply to explore the data for more insight. In our case, we will group our questions under the following topics:
- Drugs
- Reviews
- Ratings
- Conditions
- Combinations
We will be using the dataset from the UCI Machine Learning Repository, which already provides some basic information about what we will be doing.
By the end of this tutorial you will learn about
- The various libraries to use for EDA
- Descriptive analytics
- How to do value counts
- How to generate some plots for more insights
- How to classify drugs based on their suffixes
- How to do sentiment analysis on drug reviews
- How to find and identify genuine reviews
- Time series analysis on drug review and rating
- Distribution Analysis
- and More
You can get the entire code on Github here.
Let us start.
Data Science EDA Project From Scratch with Python
- Tools & Libraries
- EDA: Pandas
- Viz: Seaborn, Matplotlib
- NLP: spaCy, TextBlob, NeatText
- ML: sklearn, xgboost, pycaret
DataSource
Attributes
- drugName (categorical): name of drug
- condition (categorical): name of condition
- review (text): patient review
- rating (numerical): 10 star patient rating
- date (date): date of review entry
- usefulCount (numerical): number of users who found review useful
Questions
- Types of questions we can ask (Drugs, Review, Rating, Conditions, Time, Genuineness, etc.)
- What is the most popular drug?
- What are the groups/classification of drugs used?
- Which Drug has the best review?
- How many drugs do we have?
- The number of drugs per condition
- Number of patients that searched on a particular drug
- How genuine is the review? (Using sentiment analysis)
- How many reviews are positive,negative,neutral?
- Correlation between rating and review and users who found the review useful
- Can you predict the rating using the review?
- Distribution of rating
- Amount of review made per year and per month
- Which condition has the most review on drugs
# Load EDA Pkgs
import pandas as pd
import numpy as np
# Load Data Viz
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load Sentiment Pkgs
from textblob import TextBlob
Question on Drugs
- How many drugs do we have?
- What is the most popular drug?
- What are the groups/classification of drugs used?
- Which Drug has the best review?
- The number of drugs per condition
- Number of patients that searched on a particular drug
# Load Dataset
df = pd.read_csv("drugsCom_raw/drugsComTrain_raw.tsv",sep='\t')
# Preview Dataset
df.head()
# Columns
df.columns
# Missing Values
df.isnull().sum()
Narrative
- Most of the missing values are in the condition column
- This suggests that some reviewers either do not know the name of their condition or left it out for privacy reasons
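Raw counts of missing values are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up stand-in for the dataset (`toy_df` and its values are illustrative, not taken from the real file); the same two calls can be run on `df`.

```python
import pandas as pd

# Toy stand-in for the drug review data (values are made up for illustration)
toy_df = pd.DataFrame({
    "drugName": ["Valsartan", "Guanfacine", "Lybrel", "Ortho Evra"],
    "condition": ["Left Ventricular Dysfunction", None, "Birth Control", None],
    "rating": [9, 8, 5, 8],
})

# Number of missing values per column, as in df.isnull().sum()
missing_counts = toy_df.isnull().sum()

# Share of rows that are missing in each column, in percent
missing_percent = toy_df.isnull().mean() * 100

# Put both views side by side
missing_summary = pd.DataFrame({"missing": missing_counts, "percent": missing_percent})
print(missing_summary)
```

A column with a small percentage of missing values can often be left as is for EDA, while a large percentage would call for a closer look.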
Question on Drugs
- How many drugs do we have?
# How many drugs do we have?
df['drugName'].unique().tolist()
# How many drugs do we have?
len(df['drugName'].unique().tolist())
# What is the most popular drug?
df['drugName'].value_counts()
# What is the most popular drug?
# Top 20 Drugs (Most Popular)
df['drugName'].value_counts().nlargest(20)
# Top 20 Drugs (Most Popular)
plt.figure(figsize=(20,10))
df['drugName'].value_counts().nlargest(20).plot(kind='bar')
plt.title("Top 20 Most popular drugs based on counts")
plt.show()
Narrative
- Most of the most popular drugs are hormonal drugs
# Bottom 20 Drugs (Least Popular)
df['drugName'].value_counts().nsmallest(20)
df['drugName'].value_counts().nsmallest(20).plot(kind='bar')
### What are the groups/classification of drugs used?
+ suffix or endings
drug_suffix = {"azole":"antifungal (except metronidazole)",
"caine":"anesthetic",
"cillin":"antibiotic(penicillins)",
"mycin":"antibiotic",
"micin":"antibiotic",
"cycline":"antibiotic",
"oxacin":"antibiotic",
"ceph":"antibiotic(cephalosporins)",
"cef":"antibiotic (cephalosporins)",
"dine":"h2 blockers (anti-ulcers)",
"done":"opioid analgesics",
"ide":"oral hypoglycemics",
"lam":"anti-anxiety",
"pam":"anti-anxiety",
"mide":"diuretics",
"zide":"diuretics",
"nium":"neuromuscular blocking agents",
"olol":"beta blockers",
"tidine":"h2 antagonist",
"tropin":"pituitary hormone",
"zosin":"alpha blocker",
"ase":"thrombolytics",
"plase":"thrombolytics",
"azepam":"anti-anxiety(benzodiazepine)",
"azine":"antipsychotics (phenothiazine)",
"barbital":"barbiturate",
"dipine":"calcium channel blocker",
"lol":"beta blocker",
"zolam":"cns depressants",
"pril":"ace inhibitor",
"artan":"arb blocker",
"statin":"lipid-lowering drugs",
"parin":"anticoagulants",
"sone":"corticosteroid (prednisone)"}
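def classify_drug(drugname):
    # First version: print the class of every suffix that matches
    for i in drug_suffix.keys():
        if drugname.endswith(i):
            print(True)
            print(drug_suffix[i])
classify_drug('Valsartan')
classify_drug('losartan')
def classify_drug(drugname):
    # Final version: return the class of the first matching suffix (None if nothing matches)
    for i in drug_suffix.keys():
        if drugname.endswith(i):
            return drug_suffix[i]
classify_drug('valsartan')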
df['drug_class'] = df['drugName'].apply(classify_drug)
df[['drugName','drug_class']]
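One caveat with the first-match loop above: `drug_suffix` is checked in insertion order, so a short suffix such as "ide" is tested before the more specific "mide" or "zide". A possible variation, sketched here on a small subset of the table, tries the longest suffixes first. The helper name `classify_drug_longest` and the `suffix_subset` table are illustrative and not part of the original notebook.

```python
# Small subset of the suffix table, for illustration only
suffix_subset = {
    "ide": "oral hypoglycemics",
    "mide": "diuretics",
    "zide": "diuretics",
    "lam": "anti-anxiety",
    "zolam": "cns depressants",
    "artan": "arb blocker",
}

def classify_drug_longest(drugname, suffix_map=suffix_subset):
    """Return the class of the longest matching suffix, or None if nothing matches."""
    name = drugname.lower().strip()
    # Try longer suffixes first so that "mide" is checked before "ide"
    for suffix in sorted(suffix_map, key=len, reverse=True):
        if name.endswith(suffix):
            return suffix_map[suffix]
    return None

for example in ["Furosemide", "Alprazolam", "Valsartan", "Lybrel"]:
    print(example, "->", classify_drug_longest(example))
```

Suffix matching remains a rough heuristic either way, so the resulting classes are best read as approximate groupings rather than a pharmacological reference.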
# How many Groups of Drugs By Class
df['drug_class'].unique().tolist()
# How many Groups of Drugs By Class
len(df['drug_class'].unique().tolist())
# Which class of drug is the most common
df['drug_class'].value_counts()
# Which class of drug is the most common
plt.figure(figsize=(20,10))
df['drug_class'].value_counts().plot(kind='bar')
plt.title("Distribution of Drugs By Class")
plt.show()
Narrative
- The most common classes/groups of drugs used are
- Antifungal
- Opioid Analgesics (Pain Killers)
- Oral Hypoglycemics (DM)
- Antibiotic
# Distribution of Drugs Per Drug Group based on size
drug_groups = df.groupby('drug_class').size()
type(drug_groups)
# Convert to DF
# Method 1
drug_groups.to_frame()
# Convert to DF
# Method 2
drug_groups_df = pd.DataFrame({'drug_class':drug_groups.index,'counts':drug_groups.values})
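A third way to turn the grouped Series into a DataFrame is `reset_index(name=...)`, which names the count column in one step. The sketch below uses made-up class labels in `toy_df` and also sorts the result, so that a bar plot built from it would run from the largest group to the smallest.

```python
import pandas as pd

# Toy drug_class column (made-up values); None stands for an unclassified drug
toy_df = pd.DataFrame({
    "drug_class": ["antibiotic", "antifungal", "antibiotic", None,
                   "diuretics", "antibiotic", "antifungal"],
})

# Group sizes as a Series indexed by drug_class (missing classes are dropped)
drug_groups = toy_df.groupby("drug_class").size()

# Method 3: reset_index names the count column and keeps drug_class as a column
drug_groups_df = (
    drug_groups.reset_index(name="counts")
    .sort_values("counts", ascending=False)
)
print(drug_groups_df)
```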
# Seaborn Plot
plt.figure(figsize=(20,10))
g = sns.barplot(data=drug_groups_df,x='drug_class',y='counts')
plt.show()
# Seaborn Plot
plt.figure(figsize=(20,10))
g = sns.barplot(data=drug_groups_df,x='drug_class',y='counts')
g.set_xticklabels(drug_groups_df['drug_class'].values,rotation=30)
plt.show()
# Seaborn Plot
plt.figure(figsize=(20,10))
g = sns.barplot(data=drug_groups_df,x='drug_class',y='counts')
plt.xticks(rotation=30)
plt.show()
### Question on Conditions
+ How many conditions are there?
+ Which conditions are the most common?
+ Distribution of conditions and rating
# Number of Conditions
df['condition'].unique()
len(df['condition'].unique().tolist())
Narrative
- We have 885 different conditions
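Because the condition column contains missing values, `len(df['condition'].unique())` may count the missing marker as one extra "condition". A small sketch of the difference, using a made-up Series named `conditions`:

```python
import pandas as pd

# Toy condition column with two missing entries (values are illustrative)
conditions = pd.Series(
    ["Birth Control", None, "Depression", "Birth Control", None],
    name="condition",
)

# unique() keeps the missing marker, so it is counted as one value
count_with_missing = len(conditions.unique().tolist())

# nunique() ignores missing values by default
count_without_missing = conditions.nunique()

print(count_with_missing, count_without_missing)
```

On the real data, `df['condition'].nunique()` gives the number of named conditions only.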
#### Distribution of Conditions
df['condition'].value_counts()
#### Most Common Conditions
df['condition'].value_counts().nlargest(20)
#### Most Common Conditions
df['condition'].value_counts().nlargest(20).plot(kind='bar',figsize=(20,10))
Narrative
- The most common condition is Birth Control, followed by Depression, Pain and Anxiety
- This is consistent with the drug distribution above, where most of the popular drugs are hormonal
df['condition'].value_counts().nsmallest(20)
#### Least Common Conditions
df['condition'].value_counts().nsmallest(20).plot(kind='bar',figsize=(20,10))
Questions on Drugs and Conditions
- How many drugs per condition
# How many Drugs per condition (Top 20)
df.groupby('condition')['drugName'].nunique().nlargest(20)
# How many Drugs per condition (Top 20)
plt.figure(figsize=(15,10))
df.groupby('condition')['drugName'].nunique().nlargest(20).plot(kind='bar')
plt.title("Number of Drugs Per Condition")
plt.grid()
plt.show()
Narrative
- Pain, Birth Control and HBP (high blood pressure) have the highest number of different/unique drugs for their condition
#### Questions on Rating
+ Distribution of rating
+ Average Rating Per Count
df['rating']
# Distribution of Rating By Size
df.groupby('rating').size()
# Distribution of Rating By Size
df.groupby('rating').size().plot(kind='bar')
# Distribution of Rating By Size Using Histogram
plt.figure(figsize=(20,10))
df['rating'].hist()
plt.title("Distribution of Rating By Size Using Histogram")
plt.show()
Narrative
- Most people rated at the extremes
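To put a number on "rated at the extremes", the rating counts can be converted to shares. This is a minimal sketch on a made-up `ratings` Series; on the real data the same calls would be applied to `df['rating']`.

```python
import pandas as pd

# Toy ratings on the 1-10 scale (made-up values)
ratings = pd.Series([10, 10, 9, 1, 1, 5, 10, 8, 1, 10], name="rating")

# Share of reviews for each rating value, ordered by rating
rating_share = ratings.value_counts(normalize=True).sort_index()

# Combined share of the lowest (1) and highest (10) ratings
extreme_share = rating_share.reindex([1, 10], fill_value=0).sum()

print(rating_share)
print(f"Share of ratings that are 1 or 10: {extreme_share:.0%}")
```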
# Average Rating of Drugs
avg_rating = (df['rating'].groupby(df['drugName']).mean())
avg_rating
# Average Rating For All Drugs
plt.figure(figsize=(20,10))
avg_rating.hist()
plt.title("Distribution of Average Rating For All Drugs")
plt.show()
# Average Rating of Drugs By Class
avg_rating_per_drug_class = (df['rating'].groupby(df['drug_class']).mean())
avg_rating_per_drug_class
# Average Rating For Drug Classes
plt.figure(figsize=(20,10))
avg_rating_per_drug_class.hist()
plt.title("Distribution of Average Rating For Drug Classes")
plt.show()
# Which groups of drugs have the highest mean/average rating
avg_rating_per_drug_class.nlargest(20)
# Which drugs have the highest mean/average rating
avg_rating.nlargest(20)
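A drug with a single 10-star review will top a ranking by mean rating, which can be misleading. One possible refinement, sketched on made-up data, is to compute the review count alongside the mean and keep only drugs above a minimum count. The threshold `min_reviews` is a hypothetical parameter to tune for the real dataset.

```python
import pandas as pd

# Toy reviews (made-up values): drug B has one perfect review only
toy_df = pd.DataFrame({
    "drugName": ["A", "A", "A", "B", "C", "C"],
    "rating": [8, 9, 7, 10, 2, 4],
})

min_reviews = 2  # hypothetical threshold, adjust for the real data

# Mean rating and number of reviews per drug
rating_stats = toy_df.groupby("drugName")["rating"].agg(["mean", "count"])

# Keep only drugs with enough reviews, then rank by mean rating
reliable = rating_stats[rating_stats["count"] >= min_reviews]
top_rated = reliable.sort_values("mean", ascending=False)

print(top_rated)
```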
df.columns
### Question on Review
+ How genuine is the review? (Using sentiment analysis)
+ How many reviews are positive,negative,neutral?
+ Correlation between rating and review and users who found the review useful
+ Distribution of rating
+ Amount of review made per year and per month
+ Which condition has the most review on drugs
+ Can you predict the rating using the review?
# How genuine is the review? (Using sentiment analysis)
from textblob import TextBlob
df['review']
def get_sentiment(text):
    # Polarity score between -1 (negative) and 1 (positive)
    blob = TextBlob(text)
    return blob.polarity
def get_sentiment_label(text):
    # Map the polarity score to a label
    blob = TextBlob(text)
    if blob.polarity > 0:
        result = 'positive'
    elif blob.polarity < 0:
        result = 'negative'
    else:
        result = 'neutral'
    return result
# Test the function
get_sentiment("I love apples")
# Test the function
get_sentiment_label("I love apples")
# Sentiment Score for Review
df['sentiment'] = df['review'].apply(get_sentiment)
# Sentiment Labels for Review
df['sentiment_label'] = df['review'].apply(get_sentiment_label)
df[['review','sentiment','sentiment_label']]
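Both functions above run TextBlob on every review, so the text is analysed twice. A possible shortcut is to derive the label from the score that is already stored in the `sentiment` column. The helper name `label_from_polarity` and the toy scores below are illustrative.

```python
import pandas as pd

def label_from_polarity(polarity):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# Toy sentiment scores standing in for df['sentiment'] (made-up values)
toy_df = pd.DataFrame({"sentiment": [0.5, -0.25, 0.0, 0.1]})

# Reuse the stored score instead of re-running the sentiment analysis
toy_df["sentiment_label"] = toy_df["sentiment"].apply(label_from_polarity)
print(toy_df)
```

This uses the same thresholds as `get_sentiment_label`, so the labels should match while the expensive text processing happens only once.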
# How many positive and negative and neutral reviews?
df['sentiment_label'].value_counts()
# How many positive and negative and neutral reviews?
df['sentiment_label'].value_counts().plot(kind='bar')
#### Correlation Between Our sentiment and rating
sns.lineplot(data=df,x='rating',y='sentiment')
plt.show()
Narrative
- The average sentiment score rises as the rating increases
# Correlation between rating and sentiment
sns.lineplot(data=df,x='rating',y='sentiment',hue='sentiment_label')
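The line plots show the trend visually; a correlation coefficient puts a number on it. The following sketch uses made-up rating and sentiment pairs. Pearson measures a linear relationship, while the rank-based version (Spearman, computed here as Pearson on ranks) only assumes that the two move in the same direction.

```python
import pandas as pd

# Toy rating and sentiment pairs (made-up values)
toy_df = pd.DataFrame({
    "rating": [1, 2, 4, 6, 8, 10],
    "sentiment": [-0.6, -0.2, -0.1, 0.1, 0.3, 0.7],
})

# Pearson correlation: strength of the linear relationship
pearson_r = toy_df["rating"].corr(toy_df["sentiment"])

# Spearman correlation, computed as the Pearson correlation of the ranks
spearman_r = toy_df["rating"].rank().corr(toy_df["sentiment"].rank())

print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}")
```

On the real data the same calls would use `df['rating']` and `df['sentiment']`; the values will differ from this toy example.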
# How many reviews are genuine as compared to the rating
+ genuine good rating = positive sentiment + rating 6-10
+ genuine bad rating = negative sentiment + rating 1-4
# Genuine Good Rating Per Review
good_review = df[(df['rating'] >= 6) & (df['sentiment_label'] == 'positive')]
# Genuine Bad Rating Per Review
bad_review = df[(df['rating'] <= 4) & (df['sentiment_label'] == 'negative')]
good_review.head()
good_review.iloc[0]['review']
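To see how often the rating and the sentiment label agree, the two filters above can be summarised in a cross-tabulation. This is a sketch on made-up rows; the bucket names and the `rating_bucket` helper are illustrative, and the cut-offs follow the definitions used for `good_review` and `bad_review`.

```python
import pandas as pd

# Toy ratings and sentiment labels (made-up values)
toy_df = pd.DataFrame({
    "rating": [10, 9, 2, 1, 8, 3, 5],
    "sentiment_label": ["positive", "negative", "negative", "positive",
                        "positive", "negative", "neutral"],
})

def rating_bucket(rating):
    """Bucket a rating using the same cut-offs as good_review and bad_review."""
    if rating >= 6:
        return "good"
    if rating <= 4:
        return "bad"
    return "middle"

toy_df["rating_bucket"] = toy_df["rating"].apply(rating_bucket)

# Rows: rating bucket, columns: sentiment label
agreement = pd.crosstab(toy_df["rating_bucket"], toy_df["sentiment_label"])

# A review counts as "genuine" here when rating bucket and sentiment agree
genuine_good = (toy_df["rating_bucket"] == "good") & (toy_df["sentiment_label"] == "positive")
genuine_bad = (toy_df["rating_bucket"] == "bad") & (toy_df["sentiment_label"] == "negative")
genuine_share = (genuine_good | genuine_bad).mean()

print(agreement)
print(f"Share of genuine reviews: {genuine_share:.0%}")
```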
#### Questions on UsefulCount
+ number of users who found review useful
+ Top UsefulCount By Drugs/Class
+ Best drugs based on usefulCount
df.groupby('drugName')['usefulCount'].value_counts()
# Top Drugs Per UsefulCount
df.groupby('drugName')['usefulCount'].nunique().nlargest(20)
# Top Drugs Per UsefulCount
df.groupby('drugName')['usefulCount'].nunique().nlargest(20).plot(kind='bar')
# Top Drugs Class Per UsefulCount
df.groupby('drug_class')['usefulCount'].nunique().nlargest(20)
# Top Drugs Class Per UsefulCount
df.groupby('drug_class')['usefulCount'].nunique().nlargest(20).plot(kind='bar')
plt.title("Top Drug Class Per Usefulcount")
plt.show()
# Bottom Drug Classes Per UsefulCount
df.groupby('drug_class')['usefulCount'].nunique().nsmallest(20).plot(kind='bar')
plt.title("Least Drug Class Per Usefulcount")
plt.show()
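Note that `nunique()` counts how many distinct usefulCount values each drug has, which is not the same as how useful its reviews were found. Depending on the question, the total or the mean may be closer to what is wanted. A small comparison on made-up data:

```python
import pandas as pd

# Toy usefulCount values (made-up): drug A repeats the same small count
toy_df = pd.DataFrame({
    "drugName": ["A", "A", "A", "B", "B", "C"],
    "usefulCount": [5, 5, 5, 100, 1, 40],
})

# Compare three summaries of usefulCount per drug
useful_stats = toy_df.groupby("drugName")["usefulCount"].agg(["nunique", "sum", "mean"])

# Ranking by total usefulCount rather than by number of distinct values
top_by_total = useful_stats["sum"].nlargest(2)

print(useful_stats)
print(top_by_total)
```

The same `agg` call works with `drug_class` as the grouping column.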
### Correlation between Rating and Usefulcount
sns.lineplot(data=df,x='rating',y='usefulCount')
Narrative
- As the rating goes up the usefulcount goes up
#### Question on Date
df.columns
# Number of Ratings Per Date
df.groupby('date')['rating'].size()
# Average Rating Per Day
df.groupby('date')['rating'].mean()
# Average Rating Per Day of Every Year
df.groupby('date')['rating'].mean().plot(figsize=(20,10))
plt.title("Average Rating Per Day of Every Year")
plt.show()
# Average Useful Per Day of Every Year
df.groupby('date')['usefulCount'].mean().plot(figsize=(20,10))
plt.title("Average UsefulCount Per Day of Every Year")
plt.show()
# Average Sentiment Per Day of Every Year
df.groupby('date')['sentiment'].mean().plot(figsize=(20,10))
plt.title("Average sentiment Per Day of Every Year")
plt.show()
# Amount of Review Per Day of Every Year
df.groupby('date')['review'].size().plot(figsize=(20,10))
plt.title("Amount of Review Per Day of Every Year")
plt.show()
# Amount of Review Per Day of Every Year
df.groupby('date')['review'].size().plot(kind='bar',figsize=(20,10))
plt.title("Amount of Review Per Day of Every Year")
plt.show()
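The `date` column is read in as text, so grouping on it orders the dates alphabetically rather than chronologically. Parsing the dates first also makes it straightforward to answer the earlier question about the amount of reviews per year and per month. The sketch below uses made-up rows, and the format string `"%B %d, %Y"` is an assumption about how the dates are written; adjust it to match the actual column.

```python
import pandas as pd

# Toy reviews (made-up values); the date format is an assumption
toy_df = pd.DataFrame({
    "date": ["May 20, 2012", "April 27, 2010", "December 14, 2009",
             "May 3, 2012", "January 9, 2010"],
    "rating": [9, 8, 5, 10, 4],
})

# Parse the text dates into real datetimes
toy_df["date"] = pd.to_datetime(toy_df["date"], format="%B %d, %Y")

# Amount of reviews per year
reviews_per_year = toy_df.groupby(toy_df["date"].dt.year).size()

# Amount of reviews per calendar month
reviews_per_month = toy_df.groupby(toy_df["date"].dt.to_period("M")).size()

# Average rating per year
avg_rating_per_year = toy_df.groupby(toy_df["date"].dt.year)["rating"].mean()

print(reviews_per_year)
print(reviews_per_month)
print(avg_rating_per_year)
```

Once the column holds datetimes, the results come out in chronological order and can be plotted with `.plot()` as in the examples above.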
#### Using DatetimeIndex
grouped_date = df.groupby('date').agg({'rating':'mean','usefulCount':'sum','review':'size'})
grouped_date
grouped_date.index
grouped_date['date'] = grouped_date.index
grouped_date['date'] = pd.DatetimeIndex(grouped_date['date'])
grouped_date.dtypes
# Sort by date so that date-range selections work chronologically
grouped_date = grouped_date.set_index('date').sort_index()
# Select A Particular Date Range
grouped_date.loc['2008'].plot()
# Amount of Review For 2008
grouped_date.loc['2008']['review'].plot()
plt.title("Amount of Review For 2008")
plt.show()
# Amount of Review For 2008-2009
grouped_date.loc['2008':'2009']['review'].plot()
plt.title("Amount of Review For 2008-2009")
plt.show()
# Distribution of Rating Over Time
grouped_date.loc['2008':'2009']['rating'].plot()
plt.title("Distribution of Rating Over Time")
plt.show()
# Distribution of Rating Over Time
grouped_date.loc['2008':'2012']['rating'].plot(figsize=(20,10))
plt.title("Distribution of Rating Over Time")
plt.show()
grouped_date.loc['2008-04'].plot()
# Distribution of Rating Over A Month
grouped_date.loc['2008-04':'2008-05']['rating'].plot()
plt.title("Distribution of Rating Over Time")
plt.show()
# Save Dataset
df.to_csv("drug_review_dataset_with_sentiment.csv",index=False)
You can also check out the video tutorial on YouTube or below
Thanks for Your Time
Jesus Saves
By Jesse E.Agbe(JCharis)