Unsupervised machine learning has several applications, including identifying patterns and grouping similar data into clusters. In this tutorial we will see how to use unsupervised ML for a simple task: identifying the continent a country belongs to, given its longitude and latitude.
Unsupervised machine learning refers to the type of ML that looks for previously undetected patterns in a dataset with no predefined labels or target values. The dataset used here has no labels, i.e. there are only independent variables and no dependent target, and our task is to discover the groupings without any human supervision. Hence the term unsupervised: little to no human effort goes into labelling the data.
Several families of techniques are used in unsupervised ML. These include:
- Clustering
- Anomaly Detection
- Novelty Detection
- Neural Networks
- Dimensionality Reduction (for example PCA): simplifying data without losing too much information by merging correlated features into one
- Association Rule Learning: discovering interesting relations between attributes
In our task we will be using clustering to help us solve our challenge.
So what is Clustering?
Clustering refers to the process of grouping or segmenting unlabelled data into groups (clusters) whose members share similar attributes or features. It resembles classification, except that it is used in unsupervised ML, whereas classification is used in supervised ML (on data that has been labelled). We will discuss the difference in another tutorial.
There are several approaches to clustering unlabelled data. All of them basically rely on some notion of distance between the individual data points, such as Euclidean distance, graph distance, etc.
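To make the idea of distance concrete, here is a minimal sketch of the Euclidean distance between two points. The helper name `euclidean_distance` and the two sample points are made up for illustration and are not part of the dataset.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two points given as coordinate sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Two made-up 2D points, for illustration only
p = (0.0, 0.0)
q = (3.0, 4.0)
print(euclidean_distance(p, q))
```

Points that are close under this measure are the ones a distance-based clustering algorithm will tend to place in the same group.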
Types of Clustering
Clustering can be grouped as either
- Flat or Hierarchical
- Centroid Based or Density Based
The various clustering algorithms based on this basic principle of distance include:
- K-Means (distance between points)
- Hierarchical Clustering
- Affinity propagation (graph distance)
- Mean-shift (distance between points)
- DBSCAN (distance between nearest points)
- Gaussian mixtures (Mahalanobis distance to centers)
- Spectral clustering (graph distance)
- BIRCH
- etc.
NB: You can check the full tutorial on most of these algorithms here or in the video.
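In scikit-learn, several of the algorithms listed above share the same `fit_predict` interface, so they can be swapped with little code change. The sketch below tries three of them on synthetic data from `make_blobs`. The data and the parameter values (`eps`, `min_samples`, the number of clusters) are assumptions picked for illustration, not tuned for the countries dataset.

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic 2D points in 3 groups (illustrative data only)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)

models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Hierarchical": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
}

# Every model returns one cluster label per row
labels_by_model = {name: model.fit_predict(X) for name, model in models.items()}
for name, labels in labels_by_model.items():
    print(name, sorted(set(labels)))
```

Note that DBSCAN does not take a number of clusters and marks noise points with the label -1, so its output can differ in shape from the other two.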
Remember that our problem is to segment or cluster our unlabelled countries into their respective continents. K-Means needs the number of clusters up front, and we already know how many groups to expect, so in our case we will be using the K-Means clustering algorithm.
Let us start with our task. Below is a preview of the dataset we are working with. As you can see, there are only the country names and their coordinates, but what we need is an additional column that shows which continent each country belongs to.
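For intuition about what K-Means does, here is a rough from-scratch sketch of the usual assign-then-update loop using NumPy. It is a simplified illustration, not scikit-learn's implementation: the function name `simple_kmeans`, the random initialisation, and the toy points are all assumptions.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=100, seed=0):
    """Toy K-Means: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid, shape (n, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of made-up points
toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.4],
                [10.0, 10.0], [10.3, 9.8], [9.7, 10.1]])
labels, centroids = simple_kmeans(toy, k=2)
print(labels)
print(centroids)
```

The label numbers themselves carry no meaning; only which points share a label matters.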
You can check the original dataset here.
We will be using scikit-learn, pandas and matplotlib for our task, plus geopandas for the world map.
Let us see the workflow of our task
Problem: cluster countries into their respective continents
Steps
- Fetch and Prep the Data
- Visualize the Data using Scatter Plots
- Identify the number of clusters K we need (in our case we know from general knowledge that there are 7 continents)
- Build our KMeans Model
- Cluster our data into their respective continents
- Overlay our result on a world map using geopandas
- Relabel them in plain English
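In this tutorial K is known in advance, but when it is not, one common heuristic is the elbow method: fit K-Means for a range of K values and look at where the inertia (within-cluster sum of squares) stops dropping sharply. The sketch below uses synthetic data from `make_blobs` as a stand-in, so the data and the range of K are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic points in 4 groups (illustrative stand-in for real data)
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting these values against K with matplotlib usually makes the bend easier to spot than reading the numbers.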
Below is the entire code for our task.
# Load EDA Pkgs
import pandas as pd
import numpy as np
# Load Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Preparing the Data
# Load Dataset
df = pd.read_csv("countries_geodata.csv",sep='\t')
df.head()
# Save to CSV (index=False avoids writing the row index as an extra column)
df.to_csv("countries_data.csv", index=False)
df.head()
# Check For Dtypes
df.dtypes
# Check For Missing NAN
df.isnull().sum()
# Drop Na
df = df.dropna()
df.isnull().sum()
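The loading and cleaning steps above can be tried on a tiny in-memory sample. Everything in this sketch is made up for illustration: apart from `longitude` and `latitude`, the column names are assumptions, and the rows are not taken from the real file.

```python
import io
import pandas as pd

# A small tab-separated sample with one row missing its coordinates
raw = (
    "country\tlatitude\tlongitude\tname\n"
    "GH\t7.95\t-1.02\tGhana\n"
    "XX\t\t\tUnknown\n"
    "JP\t36.2\t138.25\tJapan\n"
)

toy = pd.read_csv(io.StringIO(raw), sep="\t")
missing_before = toy.isnull().sum()  # count of missing values per column
clean = toy.dropna()                 # drop rows with any missing value

print(missing_before)
print(clean.shape)
```

Dropping rows is reasonable here because a country without coordinates cannot be clustered by location.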
# Columns Consistency
df.columns
# Check For Countries
df.shape
# Plot of our Countries
plt.scatter(df['longitude'],df['latitude'])
Working with Kmeans
from sklearn.cluster import KMeans
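# By assumption we have 7 continents
# k = 7
# random_state fixes the seed so the cluster ids stay the same between runs
km = KMeans(n_clusters=7, n_init=10, random_state=42)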
# Prep
xfeatures = df[['longitude','latitude']]
# Fit n Predict
clusters = km.fit_predict(xfeatures)
# Get all the Labels(Clusters)
km.labels_
clusters
# Check if the predicted clusters match the fitted labels
# (compare values; `is` only tests whether both names point to the same object)
np.array_equal(clusters, km.labels_)
# Centroid/Center
km.cluster_centers_
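Once a model is fitted, its centroids can also be used to place a new coordinate into the nearest cluster with `predict`. The sketch below is self-contained and uses made-up coordinates and a separate toy model (`toy_km`), so it does not depend on the real dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up coordinates forming two loose groups
toy = pd.DataFrame({
    "longitude": [-60.0, -58.0, -62.0, 20.0, 22.0, 18.0],
    "latitude": [-15.0, -20.0, -10.0, 5.0, 0.0, 8.0],
})

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_km.fit(toy[["longitude", "latitude"]])

# A new point is assigned to the cluster with the nearest centroid
new_point = pd.DataFrame({"longitude": [-59.0], "latitude": [-12.0]})
predicted = toy_km.predict(new_point)[0]

print(predicted)
print(toy_km.cluster_centers_)
```

The new point uses the same column names and order as the training data, which keeps scikit-learn from warning about mismatched feature names.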
# Store and Map
df['cluster_continents'] = clusters
df.head()
# Plot of our clusters
plt.scatter(df['longitude'],df['latitude'],c=df['cluster_continents'],cmap='rainbow')
# Map our scatter plot over worldmap
import geopandas as gpd
from shapely.geometry import Point,Polygon
import descartes
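# World Map
# Note: gpd.datasets was deprecated in geopandas 0.14 and removed in 1.0;
# on newer versions, download the Natural Earth data and pass its file path to gpd.read_file
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))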
ax = world.plot(figsize=(20,10))
ax.axis('off')
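# Overlay our clusters on the world map
g01 = gpd.GeoDataFrame(df,geometry=gpd.points_from_xy(df['longitude'],df['latitude']))
g01
fig,ax = plt.subplots(figsize=(20,10))
# Colour each point by its cluster id (without `column`, the colormap does not follow the clusters)
g01.plot(column='cluster_continents',cmap='rainbow',ax=ax)
world.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)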
# Plot of our clusters
plt.figure(figsize=(20,10))
plt.scatter(df['longitude'],df['latitude'],c=df['cluster_continents'],cmap='rainbow')
plt.show()
df.head(20)
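# Cluster ids are arbitrary and can change between runs, so check them against df before relying on this mapping
# NOTE: the id for 'North America' is missing in the original; None is a placeholder to fill in after inspecting the clusters
continent_dict = {'South America':0,'North America':None,'Asia':1,'Africa':4,'Australasia':5,'Europe':6}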
df[df['cluster_continents'] == 3]
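For the final relabelling step, one option is to map each cluster id to a continent name with pandas `Series.map`. This is a sketch on a toy frame: the id-to-name pairs below are hypothetical, and the real ids must be read off your own run by inspecting which countries fall in each cluster.

```python
import pandas as pd

# Toy frame standing in for the clustered countries
toy = pd.DataFrame({
    "country": ["Brazil", "Ghana", "Japan"],
    "cluster_continents": [0, 4, 1],
})

# Hypothetical id -> name mapping; take the real ids from your own run
cluster_to_continent = {0: "South America", 4: "Africa", 1: "Asia"}

# Any id missing from the mapping becomes NaN, which makes gaps easy to spot
toy["continent"] = toy["cluster_continents"].map(cluster_to_continent)
print(toy)
```

Keying the dictionary by cluster id (rather than by continent name) is what lets `map` translate the numeric column directly.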
You can see how we can practically use unsupervised machine learning to cluster countries into their respective continents. In an upcoming tutorial we will see how to use PyCaret to do the same task.
You can check the video tutorial on Unsupervised ML using Scikit Learn below.
Thank You For Your Time
Jesus Saves
By Jesse E.Agbe(JCharis)