Hash Identification Using Machine Learning and 3 Tools

A hash is a fixed-size value, usually written as a fixed-length string of characters, produced by a hash function. It is the value returned by a hash algorithm, and is also called a message digest or simply a hash.

A hash function is any function that can be used to map data of arbitrary size to fixed-size values. Hashes are the output of a hashing algorithm such as MD5 (Message Digest 5) or SHA (Secure Hash Algorithm). The purpose of a hashing algorithm is essentially to produce a fixed-length string, the hash value, that is effectively unique for any given piece of data or message.
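To make the "arbitrary size in, fixed size out" idea concrete, here is a minimal sketch using Python's standard hashlib module. The helper name sha256_hex and the sample messages are only illustrative assumptions.

```python
import hashlib

def sha256_hex(message: bytes) -> str:
    """Return the SHA-256 digest of message as a hex string."""
    return hashlib.sha256(message).hexdigest()

# Inputs of very different sizes
messages = [b"", b"hello world", b"x" * 10_000]

for message in messages:
    digest = sha256_hex(message)
    # The input length varies, the digest length stays at 64 hex characters
    print(len(message), len(digest))

# Hashing is deterministic: the same input always gives the same digest
print(sha256_hex(b"hello world") == sha256_hex(b"hello world"))
```

Whatever the size of the input, the digest has the same length, which is the property the identification techniques later in this post rely on.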

Types of Hash Algorithm

There are several hash algorithms or hash functions. The type of hash algorithm used can be used to define the type of hash or message digest produced. The various common hash algorithms include

  • MD5
  • SHA-256
  • SHA-512
  • etc
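Each of these algorithms produces a digest of a different length, which is the main clue used to tell them apart. A small sketch, assuming the algorithm names accepted by hashlib.new and a hypothetical helper called hex_length:

```python
import hashlib

def hex_length(algorithm: str) -> int:
    """Return the number of hex characters in a digest of the given algorithm."""
    # digest_size is in bytes, and each byte is written as two hex characters
    return hashlib.new(algorithm).digest_size * 2

for name in ("md5", "sha1", "sha256", "sha512"):
    print(name, hex_length(name))
# md5 32
# sha1 40
# sha256 64
# sha512 128
```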

In this tutorial we will explore how to use machine learning and other tools to identify the hash type of a given hash. Before we start, let us look at some tools and libraries for identifying the hash type of a given hash value. Most of these tools rely on regular expressions or other heuristics.

Tools For Identifying Hash Algorithm

  • hashID
  • hash-identifier
  • Name-That-Hash (nth)

You can also use a regular expression, since each hash type follows a particular pattern and has a specific length. Let us see an example of how to use regex to identify the type of a hash.

For example, to identify an MD5 hash, note that MD5 hashes follow these rules:

  • They have a fixed size of 32 characters
  • They consist only of hexadecimal digits (0-9 and a-f)
  • They are usually written either all lowercase or all uppercase, not a mixture

Based on these rules we can use regex to identify this type of hash

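import re
# Exactly 32 hex characters; the word boundaries stop a longer hash
# (such as a 64-character SHA-256) from matching as two MD5 hashes
md5_pattern = re.compile(r'\b[a-fA-F0-9]{32}\b')
md5_pattern.findall('your text')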
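When the input is a single candidate string rather than free text, a stricter check is possible. The sketch below applies all three rules listed above, including the "no mixed case" rule; the names MD5_LIKE and looks_like_md5 are hypothetical. Note that a match only means the string could be MD5, since other algorithms such as MD4 also produce 32 hex characters.

```python
import re

# Exactly 32 hex digits, either all lowercase or all uppercase
MD5_LIKE = re.compile(r"[a-f0-9]{32}|[A-F0-9]{32}")

def looks_like_md5(candidate: str) -> bool:
    """Return True if the whole string has the shape of an MD5 hex digest."""
    return MD5_LIKE.fullmatch(candidate.strip()) is not None

print(looks_like_md5("79054025255fb1a26e4bc422aef54eb4"))  # True
print(looks_like_md5("79054025255FB1a26e4bc422aef54eb4"))  # False (mixed case)
print(looks_like_md5("79054025255fb1a26e4bc422aef54eb"))   # False (31 characters)
```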

For a more elaborate identification we can use the following function

# Map each compiled pattern to the hash types that share that format
HASH_TYPE_REGEX = {
    re.compile(r"^[a-f0-9]{32}(:.+)?$", re.IGNORECASE): ["MD5", "MD4", "MD2", "Double MD5"],
    re.compile(r"^[a-f0-9]{64}(:.+)?$", re.IGNORECASE): ["SHA-256", "RIPEMD-256", "SHA3-256", "Haval-256"],
    re.compile(r"^[a-f0-9]{128}(:.+)?$", re.IGNORECASE): ["SHA-512", "Whirlpool", "Salsa10", "SHA3-512", "Skein-512"],
}

# Function
def custom_hash_identifier(hashed_text):
     result = [ HASH_TYPE_REGEX[algo] for algo in HASH_TYPE_REGEX if algo.match(hashed_text)]
     return result
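As a usage illustration, here is a self-contained variant of the same idea. It is only a sketch: the shortened table CANDIDATES_BY_PATTERN and the function name candidate_hash_types are assumptions made for this example, and unlike the function above it returns one flat list of names instead of a list of lists.

```python
import re

# Hypothetical, shortened table for illustration; lengths are in hex characters
CANDIDATES_BY_PATTERN = {
    re.compile(r"^[a-f0-9]{32}$", re.IGNORECASE): ["MD5", "MD4", "MD2"],
    re.compile(r"^[a-f0-9]{40}$", re.IGNORECASE): ["SHA-1", "RIPEMD-160"],
    re.compile(r"^[a-f0-9]{64}$", re.IGNORECASE): ["SHA-256", "SHA3-256"],
    re.compile(r"^[a-f0-9]{128}$", re.IGNORECASE): ["SHA-512", "SHA3-512", "Whirlpool"],
}

def candidate_hash_types(hashed_text: str) -> list:
    """Return a flat list of hash types whose pattern matches the input."""
    text = hashed_text.strip()
    return [
        name
        for pattern, names in CANDIDATES_BY_PATTERN.items()
        if pattern.match(text)
        for name in names
    ]

print(candidate_hash_types("79054025255fb1a26e4bc422aef54eb4"))
# ['MD5', 'MD4', 'MD2']
print(candidate_hash_types("not a hash"))
# []
```

The result is a list of candidates rather than a single answer, because the pattern alone cannot distinguish algorithms that share the same output length.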

Using other Tools For Hash Identification

Below are examples of how to use each tool

# hashID
# Installation
pip install hashid

# Usage
hashid 79054025255fb1a26e4bc422aef54eb4

# hash-identifier
# Installation
sudo apt-get install hash-identifier

# Usage
hash-identifier 79054025255fb1a26e4bc422aef54eb4

# Name-That-Hash (nth)
# Installation
pip install name-that-hash

# Usage
nth --text 79054025255fb1a26e4bc422aef54eb4

Using Machine Learning to Identify Hash Type

The goal of this post is to see if we can use ML to identify hash types. To do so, we collected a list of passwords from SecLists and, using hashlib, hashed each word with each of the hash algorithms we will be using (MD5, SHA-1, SHA-256 and SHA-512).
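The dataset preparation described above could look roughly like the following sketch. It is not the original script: the function name build_rows and the three sample words are stand-ins, and the CSV is written to an in-memory buffer instead of a file. The column names hashed_text and label match the ones used when the dataset is loaded for training.

```python
import csv
import hashlib
import io

ALGORITHMS = ("md5", "sha1", "sha256", "sha512")

def build_rows(words):
    """Yield (hashed_text, label) pairs, one per word and algorithm."""
    for word in words:
        data = word.encode("utf-8")
        for algorithm in ALGORITHMS:
            yield hashlib.new(algorithm, data).hexdigest(), algorithm

# Stand-in for the password list taken from SecLists
sample_words = ["123456", "password", "qwerty"]
rows = list(build_rows(sample_words))

# Write the rows as CSV (here to a string buffer, for illustration)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["hashed_text", "label"])
writer.writerows(rows)

print(len(rows))                    # 12 rows: 3 words x 4 algorithms
print(buffer.getvalue().splitlines()[0])  # hashed_text,label
```

Because every word is hashed with every algorithm, the four labels end up equally represented in the dataset.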

Below is the code for loading the dataset and building the model pipelines

# Load Data Pkgs
import pandas as pd

# ML
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report

# Load Dataset
df = pd.read_csv("data/10-million-hashed_password-list-top-10000.csv")

# Features & Label
Xfeatures = df['hashed_text']
ylabels = df['label']

# Split Dataset
x_train,x_test,y_train,y_test = train_test_split(Xfeatures,ylabels,test_size=0.3,random_state=42)

# Base Model
from sklearn.dummy import DummyClassifier

# Build Pipeline (Data To Data and Data to Model)
pipe_dummy = Pipeline(steps=[('cv',CountVectorizer()),('dummy',DummyClassifier())])
pipe_dt = Pipeline(steps=[('cv',CountVectorizer()),('dt',DecisionTreeClassifier())])
pipe_lr = Pipeline(steps=[('cv',CountVectorizer()),('lr',LogisticRegression())])

Let us fit the model pipelines on our training data

# Base Model
pipe_dummy.fit(x_train,y_train)
# Output: Pipeline(steps=[('cv', CountVectorizer()), ('dummy', DummyClassifier())])

# Base Accuracy
pipe_dummy.score(x_test,y_test)

Our base model gave us an accuracy of 0.243, so any useful model should score above 0.24. This is close to the 0.25 we would expect from always guessing one of four roughly equally represented hash types. We will check two ML estimators: DecisionTreeClassifier and LogisticRegression.
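To see where a baseline of about 0.25 comes from, here is a small pure-Python sketch. It assumes a baseline that always predicts the most frequent training label (comparable to scikit-learn's most_frequent strategy) and a perfectly balanced set of four labels; the function name majority_baseline_accuracy is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)

# Four equally represented hash types, as in a balanced dataset
labels = ["md5", "sha1", "sha256", "sha512"] * 250
print(majority_baseline_accuracy(labels, labels))  # 0.25
```

On a real train/test split the classes are only approximately balanced, so a value slightly off 0.25, such as 0.243, is what one would expect.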

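# Using DT
pipe_dt.fit(x_train,y_train)
# %timeit output: 44 s ± 2.09 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Accuracy
pipe_dt.score(x_test,y_test)

# Using LR
pipe_lr.fit(x_train,y_train)

# Accuracy
pipe_lr.score(x_test,y_test)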

From our analysis, we have been able to build a predictive model that identifies the type of a hash using ML. We can also combine the model with the regex approach to make it even more accurate. You can check out the video tutorial below.
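One way the regex check and a model could be combined is sketched here, under stated assumptions: the table LABELS_BY_HEX_LENGTH and the function combined_prediction are hypothetical, and model_scores stands for a label-to-probability mapping such as one built from a classifier's predict_proba output. The format check narrows the candidates first, and the model only chooses among those that remain.

```python
import re

# Hypothetical mapping from hex length to candidate labels
LABELS_BY_HEX_LENGTH = {
    32: ["md5", "md4"],
    40: ["sha1"],
    64: ["sha256"],
    128: ["sha512"],
}
HEX_ONLY = re.compile(r"[a-fA-F0-9]+")

def combined_prediction(hashed_text, model_scores):
    """Pick the best-scoring label among those the format check allows.

    model_scores maps label -> probability (an assumption for this sketch).
    Returns None when the text does not look like a supported hex digest.
    """
    text = hashed_text.strip()
    if not HEX_ONLY.fullmatch(text):
        return None
    allowed = LABELS_BY_HEX_LENGTH.get(len(text))
    if not allowed:
        return None
    # Ignore labels the format check has ruled out, even if the model favours them
    return max(allowed, key=lambda label: model_scores.get(label, 0.0))

scores = {"md5": 0.4, "md4": 0.1, "sha256": 0.5}
print(combined_prediction("79054025255fb1a26e4bc422aef54eb4", scores))  # md5
```

In this example the model's top score goes to sha256, but the 32-character length rules that out, so the combined answer is md5.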

Thanks for your time

Jesus Saves

By Jesse E. Agbemabiase (JCharis)
