A hash is a fixed-size value, usually written as a fixed-length string of characters, produced by a hash function. In other words, it is the value returned by a hash algorithm. Such values are also called message digests, or simply hashes.
A hash function is any function that can be used to map data of arbitrary size to fixed-size values. Hashes are the output of a hashing algorithm such as MD5 (Message Digest 5) or SHA (Secure Hash Algorithm). The purpose of a hashing algorithm is essentially to produce a fixed-length string, the hash value, for any given piece of data or message. A good hash function makes it very unlikely that two different inputs produce the same value, although the value is not strictly guaranteed to be unique.
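The fixed-size property can be checked quickly with Python's built-in hashlib module. This is a minimal sketch; the input messages are arbitrary values chosen for illustration.

```python
import hashlib

# Inputs of very different sizes (illustrative values only)
messages = [b"a", b"hello world", b"x" * 10_000]

for message in messages:
    md5_hex = hashlib.md5(message).hexdigest()
    sha256_hex = hashlib.sha256(message).hexdigest()
    # Prints the input length, then the MD5 and SHA-256 digest lengths
    print(len(message), len(md5_hex), len(sha256_hex))
```

Whatever the input length, the MD5 digest is 32 hexadecimal characters long and the SHA-256 digest is 64.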
Types of Hash Algorithm
There are several hash algorithms or hash functions. The algorithm used determines the type of hash or message digest produced. Common hash algorithms include:
- MD5
- SHA-256
- SHA-512
- etc
In this tutorial we will explore how to use machine learning and other tools to identify the hash type of a given hash. Before we start, let us look at some tools and libraries for identifying the hash type of a given hash value. Most of these tools rely on regular expressions or other heuristics.
Tools For Identifying Hash Algorithm
- Hashid
- Hash-Identifier
- Name-That-Hash (nth)
You can also use a regular expression, since every hash follows a particular pattern and has a specific length. Let us see an example of how to use regex to identify the type of hash.
For example, to identify an MD5 hash you have to know that MD5 hashes follow these rules:
- They have a fixed size of 32 characters
- They consist only of hexadecimal characters (the digits 0-9 and the letters a-f)
- They are usually written either all lowercase or all uppercase, not as a mixture
Based on these rules we can use regex to identify this type of hash:
import re

# 32 consecutive hexadecimal characters
md5_pattern = re.compile(r'[a-fA-F\d]{32}')
re.findall(md5_pattern, 'your text')
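One thing to keep in mind is that an unanchored pattern like the one above also matches 32-character chunks inside longer hashes. The sketch below shows the difference when the whole string is required to match; the helper name looks_like_md5 is made up for this example.

```python
import re

# Unanchored pattern, similar to the one above
md5_search = re.compile(r"[a-fA-F\d]{32}")
# Same character class, but used with fullmatch so the whole string must fit
md5_exact = re.compile(r"[a-fA-F0-9]{32}")

md5_like = "79054025255fb1a26e4bc422aef54eb4"  # 32 hex characters
sha256_like = md5_like * 2                      # 64 hex characters


def looks_like_md5(candidate):
    """Return True only if the entire string is 32 hex characters."""
    return md5_exact.fullmatch(candidate) is not None


print(md5_search.findall(sha256_like))  # finds two 32-character chunks
print(looks_like_md5(md5_like))         # True
print(looks_like_md5(sha256_like))      # False
```

Using fullmatch (or the ^ and $ anchors) is what lets the length rule actually separate one hash type from another.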
For a more elaborate identification we can use the following function
# Each pattern maps to the list of hash types that share that length
HASH_TYPE_REGEX = {
    re.compile(r"^[a-f0-9]{32}(:.+)?$", re.IGNORECASE): ["MD5", "MD4", "MD2", "Double MD5"],
    re.compile(r"^[a-f0-9]{64}(:.+)?$", re.IGNORECASE): ["SHA-256", "RIPEMD-256", "SHA3-256", "Haval-256"],
    re.compile(r"^[a-f0-9]{128}(:.+)?$", re.IGNORECASE): ["SHA-512", "Whirlpool", "Salsa10", "Salsa20",
                                                           "SHA3-512", "Skein-512", "Skein-1024(512)"],
}

# Function
def custom_hash_identifier(hashed_text):
    result = [HASH_TYPE_REGEX[algo] for algo in HASH_TYPE_REGEX if algo.match(hashed_text)]
    return result
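The function returns a list of candidates rather than a single answer because several algorithms produce output with the same length and the same character set. A minimal sketch with hashlib illustrates this for SHA-256 and SHA3-256; the input value is an arbitrary assumption.

```python
import hashlib

message = b"password"  # illustrative input

sha256_hex = hashlib.sha256(message).hexdigest()
sha3_256_hex = hashlib.sha3_256(message).hexdigest()

# Same length and same alphabet, so a length/charset regex cannot tell them apart
print(len(sha256_hex), len(sha3_256_hex))
# The digests themselves are different values
print(sha256_hex == sha3_256_hex)
```

This is the gap that heuristics alone cannot close, and part of the motivation for trying a learned model later on.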
Using other Tools For Hash Identification
Below are examples of how to use each tool
Hashid
# Installation
pip install hashid
# Usage
hashid 79054025255fb1a26e4bc422aef54eb4
Hash-Identifier
# Installation
sudo apt-get install hash-identifier
# Usage
hash-identifier 79054025255fb1a26e4bc422aef54eb4
Name-that-hash
# Installation (the package provides the nth command)
pip install name-that-hash
# Usage
nth --text 79054025255fb1a26e4bc422aef54eb4
Using Machine Learning to Identify Hash Type
The goal of this post is to see if we can use ML to identify hash types. To do so, we collected a list of passwords from SecLists and, using hashlib, hashed each word with each of the hash algorithms we will be using (MD5, SHA-1, SHA-256 and SHA-512).
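The dataset-building step described above could look roughly like the sketch below. It is not the original preparation script: the function name build_rows, the label strings and the tiny word list are assumptions, and only the column names hashed_text and label come from the code in this post.

```python
import hashlib

# Algorithms named in the text; the label strings are assumptions
ALGORITHMS = ("md5", "sha1", "sha256", "sha512")


def build_rows(words):
    """Return one {'hashed_text', 'label'} row per word and algorithm."""
    rows = []
    for word in words:
        data = word.encode("utf-8")
        for name in ALGORITHMS:
            digest = hashlib.new(name, data).hexdigest()
            rows.append({"hashed_text": digest, "label": name})
    return rows


# A tiny stand-in for the real password list
sample_rows = build_rows(["123456", "password", "qwerty"])
for row in sample_rows[:4]:
    print(row["label"], len(row["hashed_text"]))
```

The resulting rows can then be saved to a CSV file, for example with csv.DictWriter or a pandas DataFrame.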
Below is the code for the process
# Data handling
import pandas as pd

# ML
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load Dataset
df = pd.read_csv("data/10-million-hashed_password-list-top-10000.csv")

# Features & Label
Xfeatures = df['hashed_text']
ylabels = df['label']

# Split Dataset (70% train, 30% test)
x_train, x_test, y_train, y_test = train_test_split(Xfeatures, ylabels, test_size=0.3, random_state=42)

# Build Pipeline (Data To Data and Data to Model)
# Base Model
pipe_dummy = Pipeline(steps=[('cv', CountVectorizer()), ('dummy', DummyClassifier())])
pipe_dt = Pipeline(steps=[('cv', CountVectorizer()), ('dt', DecisionTreeClassifier())])
pipe_lr = Pipeline(steps=[('cv', CountVectorizer()), ('lr', LogisticRegression())])
Let us fit the model pipeline on our data to train
# Base Model
>>> pipe_dummy.fit(x_train,y_train)
Pipeline(steps=[('cv', CountVectorizer()), ('dummy', DummyClassifier())])
# Base Accuracy
>>> pipe_dummy.score(x_test,y_test)
0.243
Interestingly, our base model gives an accuracy of 0.243, so a model has to score clearly above 0.24 to be considered better than the baseline. We will check two ML estimators: DecisionTreeClassifier and LogisticRegression.
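A score near 0.25 is roughly what a no-skill baseline gives when four classes are about equally frequent. The sketch below computes a majority-class baseline in plain Python on made-up, perfectly balanced labels. It is not the exact rule DummyClassifier applies, which depends on its strategy setting and the scikit-learn version.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


# Made-up, perfectly balanced labels for the four hash types
labels = ["md5", "sha1", "sha256", "sha512"]
train = labels * 70
test = labels * 30

print(majority_baseline_accuracy(train, test))  # 0.25
```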
# Using DT
pipe_dt.fit(x_train,y_train)
44 s ± 2.09 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Accuracy
pipe_dt.score(x_test,y_test)
0.4063333333333333
# Using LR
pipe_lr.fit(x_train,y_train)
# Accuracy
pipe_lr.score(x_test,y_test)
0.7029166666666666
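From our analysis, we have been able to build a predictive model that identifies the type of a hash using ML. Logistic regression reaches an accuracy of about 0.70, well above both the baseline (0.243) and the decision tree (about 0.41). We can also combine the model with the regex approach to make it even more accurate. You can check out the video tutorial below.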
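One way to combine the two approaches is to let the regex and length rule answer first and only ask the model when the rule is not decisive. The sketch below is an assumption about how that could be wired up, not code from the original project: identify_hash, CANDIDATES_BY_LENGTH and model_predict are hypothetical names, and the label strings are placeholders.

```python
import re

# Length-based candidates for the four hash types used in this post
# (label strings are assumptions)
CANDIDATES_BY_LENGTH = {32: ["md5"], 40: ["sha1"], 64: ["sha256"], 128: ["sha512"]}
HEX_ONLY = re.compile(r"[a-fA-F0-9]+")


def identify_hash(hashed_text, model_predict=None):
    """Use the regex/length rule first; fall back to a model only when needed."""
    if HEX_ONLY.fullmatch(hashed_text):
        candidates = CANDIDATES_BY_LENGTH.get(len(hashed_text), [])
        if len(candidates) == 1:
            return candidates[0]
    if model_predict is not None:
        # model_predict is a hypothetical callable,
        # e.g. lambda h: pipe_lr.predict([h])[0]
        return model_predict(hashed_text)
    return None


print(identify_hash("79054025255fb1a26e4bc422aef54eb4"))               # md5
print(identify_hash("not-a-hash", model_predict=lambda h: "unknown"))  # unknown
```

With a longer candidate table, such as the one in HASH_TYPE_REGEX, the rule would narrow the options and the model would choose among the remaining ones.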
Thanks for your time
Jesus Saves
By Jesse E. Agbemabiase (JCharis)