Unsupervised machine learning has several applications, including identifying patterns and grouping similar data points into clusters. In this tutorial we will see how to use unsupervised ML for a simple task: identifying the continent a country belongs to, given only its longitude and latitude.

Unsupervised machine learning refers to the type of ML that looks for previously undetected patterns in a dataset with no predefined labels or target values. In this case the dataset has no labels, i.e. there are only independent variables with no dependent target, and our task is to discover the groupings without any human supervision. Hence the term unsupervised: there is usually little to no human effort spent labelling the data being used.

There are several tasks and techniques used in unsupervised ML. These include:

Clustering
Anomaly Detection
Novelty Detection
Neural Networks
Dimensionality Reduction (for example PCA): simplifying data without losing too much information by merging correlated features into one
Association Rule Learning: discovering interesting relations between attributes

In our task we will be using clustering to help us solve our challenge.

So what is Clustering?

Clustering refers to the process of grouping or segmenting an unlabelled dataset into groups (clusters) whose members share similar attributes or features. It resembles classification, except that clustering is used in unsupervised ML, whereas classification is used in supervised ML (on data that has been labelled). We will discuss the difference in another tutorial.

There are several approaches for clustering unlabelled data. All of these approaches basically deal with some type of distance between the individual data points, such as Euclidean distance, graph distance, etc.
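
As a small illustration of Euclidean distance, the sketch below measures the distance between two points given as (longitude, latitude). The coordinates are taken from the dataset used later in this tutorial. Note that the result is in degrees on a flat plane, which is only a rough stand-in for real distance on the globe.

```python
import numpy as np

# Two points given as (longitude, latitude)
a = np.array([1.601554, 42.546245])    # Andorra
b = np.array([53.847818, 23.424076])   # United Arab Emirates

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))
print(euclidean)
```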

Types of Clustering

Clustering can be grouped as either

Flat or Hierarchical
Centroid Based or Density Based

The various algorithms used in clustering, based on the basic principle of distance, include:

K-Means (distance between points)
Hierarchical Clustering
Affinity propagation (graph distance)
Mean-shift (distance between points)
DBSCAN (distance between nearest points)
Gaussian mixtures (Mahalanobis distance to centers)
Spectral clustering (graph distance)
BIRCH
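
To give a feel for how a density based algorithm differs from a centroid based one, here is a rough sketch with DBSCAN on made-up data. The `eps` and `min_samples` values are assumptions chosen for this toy example only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two dense blobs plus one isolated point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(0, 0), scale=0.3, size=(30, 2)),
               rng.normal(loc=(5, 5), scale=0.3, size=(30, 2)),
               [[20.0, 20.0]]])

# DBSCAN does not need the number of clusters up front;
# points that are not in any dense region get the label -1 (noise)
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)
print(sorted(set(labels.tolist())))
```

Unlike K-Means, the isolated point is not forced into a cluster.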

NB: You can check the full tutorial on most of these algorithms here or in the video.

Remember that our problem is to segment or cluster our unlabelled countries into their respective continents. In our case we will be using the K-Means clustering algorithm.
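
Before using the library version, it may help to see roughly what K-Means does. The sketch below is a simplified teaching version, not scikit-learn's implementation; the function name `kmeans_sketch` and the toy data are made up for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Minimal K-Means loop for illustration; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points picked at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Hypothetical toy data: two well separated groups of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
               rng.normal(loc=(10, 10), scale=0.5, size=(50, 2))])

labels, centers = kmeans_sketch(X, k=2)
print(centers)
```

The two steps, assigning points to the nearest center and then moving each center to the mean of its points, are repeated until the centers stop moving (here, for a fixed number of iterations).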

Let us start with our task. Below is a preview of the dataset that we are working with. As you can see, there are only the country names and their coordinates, but what we need is an additional column that shows which continent each country belongs to.

You can check the original dataset here.

We will be using scikit-learn, pandas and matplotlib for our task, with geopandas for the map overlay.

Let us see the workflow of our task

Problem: cluster countries into their respective continents

Steps

Fetch and Prep the Data
Visualize the Data using Scatter Plots
Identify the Number of Clusters (K) we need to group into (in our case we know by general knowledge that there are 7 continents)
Build our KMeans Model
Cluster our data into their respective continents
Overlay our result on a world map using geopandas
Relabel them in plain English
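
In this tutorial the number of clusters comes from general knowledge. When K is not known in advance, one common heuristic is the elbow method, sketched below on made-up data. This is an optional extra and is not part of the workflow above; the range of K values tried is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data with three obvious groups (not the countries dataset)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2))
               for c in [(0, 0), (10, 0), (5, 10)]])

# inertia_ is the within-cluster sum of squared distances to the nearest center
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, value in zip(range(1, 7), inertias):
    print(k, round(value, 1))
```

The value of K after which the inertia stops dropping sharply (the "elbow") is a reasonable candidate for the number of clusters.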

Below is the entire code for our task.

# Load EDA Pkgs
import pandas as pd
import numpy as np

In [2]:

# Load Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Preparing the Data

In [8]:

# Load Dataset
df = pd.read_csv("countries_geodata.csv",sep='\t')

In [9]:

df.head()

Out[9]:

	country	latitude	longitude	name
0	AD	42.546245	1.601554	Andorra
1	AE	23.424076	53.847818	United Arab Emirates
2	AF	33.939110	67.709953	Afghanistan
3	AG	17.060816	-61.796428	Antigua and Barbuda
4	AI	18.220554	-63.068615	Anguilla

In [10]:

# Save to CSV
df.to_csv("countries_data.csv")

In [12]:

# Check For Dtypes
df.dtypes

Out[12]:

country       object
latitude     float64
longitude    float64
name          object
dtype: object

In [13]:

# Check For Missing NAN
df.isnull().sum()

Out[13]:

country      1
latitude     1
longitude    1
name         0
dtype: int64

In [14]:

# Drop Na
df = df.dropna()

In [15]:

df.isnull().sum()

Out[15]:

country      0
latitude     0
longitude    0
name         0
dtype: int64

In [16]:

# Columns Consistency
df.columns

Out[16]:

Index(['country', 'latitude', 'longitude', 'name'], dtype='object')

In [18]:

# Check the number of countries (rows)
df.shape

Out[18]:

(243, 4)

Data Visualization using Scatter plot

In [19]:

# Plot of our Countries 
plt.scatter(df['longitude'],df['latitude'])

Out[19]:

<matplotlib.collections.PathCollection at 0x7fa0491a31d0>

Working with Kmeans

In [20]:

from sklearn.cluster import KMeans

In [22]:

# By assumption we have 7 continents
# k = 7
km = KMeans(n_clusters=7)
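
K-Means starts from randomly chosen centers, so the cluster numbers can change from one run to the next. A minimal sketch of making a run repeatable with `random_state` is shown below; the data is random and the values `n_init=10` and `random_state=42` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical random points spread over the longitude/latitude range
rng = np.random.default_rng(0)
X = rng.uniform(low=[-180, -90], high=[180, 90], size=(100, 2))

# Two separate models with the same seed give the same labels
labels_a = KMeans(n_clusters=7, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=7, n_init=10, random_state=42).fit_predict(X)
print(np.array_equal(labels_a, labels_b))
```

This matters later when mapping cluster numbers to continent names, since a mapping written for one run only applies to another run if the labels are the same.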

In [24]:

# Prep
xfeatures = df[['longitude','latitude']]

In [25]:

# Fit n Predict
clusters = km.fit_predict(xfeatures)

In [26]:

# Get all the Labels(Clusters)
km.labels_

Out[26]:

array([6, 2, 2, 0, 0, 6, 2, 0, 4, 4, 0, 3, 6, 5, 0, 2, 6, 0, 1, 6, 4, 6,
       2, 4, 4, 0, 1, 0, 0, 0, 1, 4, 4, 6, 0, 0, 1, 4, 4, 4, 6, 4, 3, 0,
       4, 1, 0, 0, 0, 6, 1, 2, 6, 6, 2, 6, 0, 0, 6, 0, 6, 2, 6, 2, 6, 2,
       6, 5, 0, 5, 6, 6, 4, 6, 0, 2, 0, 6, 4, 6, 6, 6, 4, 0, 4, 6, 4, 0,
       5, 6, 0, 2, 1, 4, 0, 6, 0, 6, 1, 6, 2, 6, 1, 1, 2, 2, 6, 6, 6, 0,
       2, 1, 4, 2, 1, 3, 4, 0, 1, 1, 2, 0, 2, 1, 2, 0, 6, 1, 4, 4, 6, 6,
       6, 6, 6, 6, 6, 6, 4, 5, 6, 6, 1, 1, 1, 5, 0, 6, 0, 6, 4, 1, 4, 0,
       1, 4, 5, 6, 5, 4, 0, 6, 6, 1, 5, 3, 5, 2, 0, 0, 3, 5, 1, 2, 6, 0,
       3, 0, 2, 6, 5, 0, 2, 4, 6, 6, 1, 4, 2, 5, 2, 2, 6, 1, 4, 6, 6, 6,
       4, 6, 6, 2, 0, 4, 0, 2, 4, 0, 4, 4, 4, 1, 2, 3, 5, 2, 6, 3, 2, 0,
       5, 1, 4, 6, 4, 0, 0, 2, 6, 0, 0, 0, 0, 1, 5, 3, 3, 6, 2, 4, 4, 4,
       4], dtype=int32)

In [27]:

clusters

Out[27]:

array([6, 2, 2, 0, 0, 6, 2, 0, 4, 4, 0, 3, 6, 5, 0, 2, 6, 0, 1, 6, 4, 6,
       2, 4, 4, 0, 1, 0, 0, 0, 1, 4, 4, 6, 0, 0, 1, 4, 4, 4, 6, 4, 3, 0,
       4, 1, 0, 0, 0, 6, 1, 2, 6, 6, 2, 6, 0, 0, 6, 0, 6, 2, 6, 2, 6, 2,
       6, 5, 0, 5, 6, 6, 4, 6, 0, 2, 0, 6, 4, 6, 6, 6, 4, 0, 4, 6, 4, 0,
       5, 6, 0, 2, 1, 4, 0, 6, 0, 6, 1, 6, 2, 6, 1, 1, 2, 2, 6, 6, 6, 0,
       2, 1, 4, 2, 1, 3, 4, 0, 1, 1, 2, 0, 2, 1, 2, 0, 6, 1, 4, 4, 6, 6,
       6, 6, 6, 6, 6, 6, 4, 5, 6, 6, 1, 1, 1, 5, 0, 6, 0, 6, 4, 1, 4, 0,
       1, 4, 5, 6, 5, 4, 0, 6, 6, 1, 5, 3, 5, 2, 0, 0, 3, 5, 1, 2, 6, 0,
       3, 0, 2, 6, 5, 0, 2, 4, 6, 6, 1, 4, 2, 5, 2, 2, 6, 1, 4, 6, 6, 6,
       4, 6, 6, 2, 0, 4, 0, 2, 4, 0, 4, 4, 4, 1, 2, 3, 5, 2, 6, 3, 2, 0,
       5, 1, 4, 6, 4, 0, 0, 2, 6, 0, 0, 0, 0, 1, 5, 3, 3, 6, 2, 4, 4, 4,
       4], dtype=int32)

In [30]:

# Check if predicted clusters are the same as our labels (compare values, not object identity)
np.array_equal(clusters, km.labels_)

Out[30]:

True

In [31]:

# Centroid/Center
km.cluster_centers_

Out[31]:

array([[ -70.34407094,    9.8465647 ],
       [ 103.45510325,   18.34838525],
       [  47.93853438,   28.52611853],
       [-164.167216  ,  -15.7990057 ],
       [  20.38957626,  -11.53326343],
       [ 156.84523619,   -7.98094281],
       [   7.80604397,   44.17186475]])
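
To show how these centers are used, the sketch below assigns a new point to the nearest center by Euclidean distance. The centers are copied from the output above; the new point is a made-up coordinate, roughly in western Europe.

```python
import numpy as np

# Centers copied from the km.cluster_centers_ output (longitude, latitude)
centers = np.array([
    [-70.34407094, 9.8465647],
    [103.45510325, 18.34838525],
    [47.93853438, 28.52611853],
    [-164.167216, -15.7990057],
    [20.38957626, -11.53326343],
    [156.84523619, -7.98094281],
    [7.80604397, 44.17186475],
])

# Hypothetical new point (longitude, latitude)
new_point = np.array([2.35, 48.85])

# Pick the center with the smallest Euclidean distance to the new point
distances = np.linalg.norm(centers - new_point, axis=1)
nearest = int(distances.argmin())
print(nearest)  # 6
```

With a fitted model, `km.predict` performs the same nearest-center assignment.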

In [32]:

# Store and Map
df['cluster_continents'] = clusters

In [33]:

df.head()

Out[33]:

	country	latitude	longitude	name	cluster_continents
0	AD	42.546245	1.601554	Andorra	6
1	AE	23.424076	53.847818	United Arab Emirates	2
2	AF	33.939110	67.709953	Afghanistan	2
3	AG	17.060816	-61.796428	Antigua and Barbuda	0
4	AI	18.220554	-63.068615	Anguilla	0

In [34]:

# Plot of our clusters
plt.scatter(df['longitude'],df['latitude'],c=df['cluster_continents'],cmap='rainbow')

Out[34]:

<matplotlib.collections.PathCollection at 0x7fa040e54310>

Overlaying our Plot on World Map using Geopandas

In [36]:

# Map our scatter plot over worldmap
import geopandas as gpd
from shapely.geometry import Point,Polygon
import descartes

In [38]:

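# World Map (gpd.datasets was removed in GeoPandas 1.0; on newer versions, load a downloaded Natural Earth file with gpd.read_file instead)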
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
ax = world.plot(figsize=(20,10))
ax.axis('off')

Out[38]:

(-198.0, 198.00000000000006, -98.6822565, 92.32738650000002)

In [39]:

# Overlay our clusters on the world map
g01 = gpd.GeoDataFrame(df,geometry=gpd.points_from_xy(df['longitude'],df['latitude']))

In [40]:

g01

Out[40]:

	country	latitude	longitude	name	cluster_continents	geometry
0	AD	42.546245	1.601554	Andorra	6	POINT (1.60155 42.54624)
1	AE	23.424076	53.847818	United Arab Emirates	2	POINT (53.84782 23.42408)
2	AF	33.939110	67.709953	Afghanistan	2	POINT (67.70995 33.93911)
3	AG	17.060816	-61.796428	Antigua and Barbuda	0	POINT (-61.79643 17.06082)
4	AI	18.220554	-63.068615	Anguilla	0	POINT (-63.06862 18.22055)
…	…	…	…	…	…	…
240	YE	15.552727	48.516388	Yemen	2	POINT (48.51639 15.55273)
241	YT	-12.827500	45.166244	Mayotte	4	POINT (45.16624 -12.82750)
242	ZA	-30.559482	22.937506	South Africa	4	POINT (22.93751 -30.55948)
243	ZM	-13.133897	27.849332	Zambia	4	POINT (27.84933 -13.13390)
244	ZW	-19.015438	29.154857	Zimbabwe	4	POINT (29.15486 -19.01544)

243 rows × 6 columns

In [41]:

fig,ax = plt.subplots(figsize=(20,10))
g01.plot(column='cluster_continents',cmap='rainbow',ax=ax)  # color the points by cluster
world.geometry.boundary.plot(color=None,edgecolor='k',linewidth=2,ax=ax)

Out[41]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fa0408b0d10>

In [42]:

# Plot of our clusters
plt.figure(figsize=(20,10))
plt.scatter(df['longitude'],df['latitude'],c=df['cluster_continents'],cmap='rainbow')
plt.show()

Mapping Our Clusters in Plain English

In [44]:

df.head(20)

Out[44]:

	country	latitude	longitude	name	cluster_continents	geometry
0	AD	42.546245	1.601554	Andorra	6	POINT (1.60155 42.54624)
1	AE	23.424076	53.847818	United Arab Emirates	2	POINT (53.84782 23.42408)
2	AF	33.939110	67.709953	Afghanistan	2	POINT (67.70995 33.93911)
3	AG	17.060816	-61.796428	Antigua and Barbuda	0	POINT (-61.79643 17.06082)
4	AI	18.220554	-63.068615	Anguilla	0	POINT (-63.06862 18.22055)
5	AL	41.153332	20.168331	Albania	6	POINT (20.16833 41.15333)
6	AM	40.069099	45.038189	Armenia	2	POINT (45.03819 40.06910)
7	AN	12.226079	-69.060087	Netherlands Antilles	0	POINT (-69.06009 12.22608)
8	AO	-11.202692	17.873887	Angola	4	POINT (17.87389 -11.20269)
9	AQ	-75.250973	-0.071389	Antarctica	4	POINT (-0.07139 -75.25097)
10	AR	-38.416097	-63.616672	Argentina	0	POINT (-63.61667 -38.41610)
11	AS	-14.270972	-170.132217	American Samoa	3	POINT (-170.13222 -14.27097)
12	AT	47.516231	14.550072	Austria	6	POINT (14.55007 47.51623)
13	AU	-25.274398	133.775136	Australia	5	POINT (133.77514 -25.27440)
14	AW	12.521110	-69.968338	Aruba	0	POINT (-69.96834 12.52111)
15	AZ	40.143105	47.576927	Azerbaijan	2	POINT (47.57693 40.14310)
16	BA	43.915886	17.679076	Bosnia and Herzegovina	6	POINT (17.67908 43.91589)
17	BB	13.193887	-59.543198	Barbados	0	POINT (-59.54320 13.19389)
18	BD	23.684994	90.356331	Bangladesh	1	POINT (90.35633 23.68499)
19	BE	50.503887	4.469936	Belgium	6	POINT (4.46994 50.50389)

In [ ]:

# 'North America' had no cluster id in the original cell, so it is left out here
continent_dict = {'South America':0,'Asia':1,'Africa':4,'Australasia':5,'Europe':6}

In [53]:

df[df['cluster_continents'] == 3]

Out[53]:

	country	latitude	longitude	name	cluster_continents	geometry
11	AS	-14.270972	-170.132217	American Samoa	3	POINT (-170.13222 -14.27097)
42	CK	-21.236736	-159.777671	Cook Islands	3	POINT (-159.77767 -21.23674)
115	KI	-3.370417	-168.734039	Kiribati	3	POINT (-168.73404 -3.37042)
166	NU	-19.054445	-169.867233	Niue	3	POINT (-169.86723 -19.05445)
171	PF	-17.679742	-149.406843	French Polynesia	3	POINT (-149.40684 -17.67974)
177	PN	-24.703615	-127.439308	Pitcairn Islands	3	POINT (-127.43931 -24.70361)
214	TK	-8.967363	-171.855881	Tokelau	3	POINT (-171.85588 -8.96736)
218	TO	-21.178986	-175.198242	Tonga	3	POINT (-175.19824 -21.17899)
237	WF	-13.768752	-177.156097	Wallis and Futuna	3	POINT (-177.15610 -13.76875)
238	WS	-13.759029	-172.104629	Samoa	3	POINT (-172.10463 -13.75903)
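
The relabelling step can be sketched with pandas `map`. The id to name mapping below is the continent dictionary from this tutorial turned around; the ids only match the run shown here, and clusters 2 and 3 are left unnamed because the dictionary does not name them. The `sample` frame is a small hand-copied subset used for illustration.

```python
import pandas as pd

# Cluster id -> name (ids are specific to the run shown in this tutorial)
cluster_names = {0: 'South America', 1: 'Asia', 4: 'Africa', 5: 'Australasia', 6: 'Europe'}

# Small sample copied from the tables above
sample = pd.DataFrame({
    'name': ['Andorra', 'Angola', 'Australia', 'Bangladesh', 'American Samoa'],
    'cluster_continents': [6, 4, 5, 1, 3],
})

# Series.map returns NaN for ids missing from the mapping (here cluster 3)
sample['continent'] = sample['cluster_continents'].map(cluster_names)
print(sample)
```

The same `map` call on `df['cluster_continents']` would add the plain English column to the full dataset.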

You can see how unsupervised machine learning can be used in practice to cluster countries into groups that approximate their respective continents. In an upcoming tutorial we will see how to use PyCaret to do the same task.

You can check the video tutorial on Unsupervised ML using Scikit Learn below.

Thank You For Your Time

Jesus Saves

By Jesse E.Agbe(JCharis)

Clustering Countries into Continents using Unsupervised Machine Learning
