In this tutorial we will look at how to use the command line in a data science project. We will be doing data analysis from the shell, specifically Bash.
Most data science projects are done with languages such as Python, R, Julia, JavaScript, Java, the C family of languages, and MATLAB. Here we will see how to use some CLI tools to do data analysis instead.
The various tools we will be using include
- Bash
- csvkit
Bash ships with most Linux distributions and with macOS, can be installed on FreeBSD, and is available on Windows through the Windows Subsystem for Linux.
We will have to install csvkit using
sudo apt-get install python3-csvkit
or with pip
pip install csvkit
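After installing, it can help to confirm that the csvkit commands are actually on your PATH. The snippet below is only a sketch: `have_tool` is a helper name made up for this example, and the list of tools is simply the set of csvkit commands used in this tutorial.

```shell
# have_tool is a hypothetical helper, not part of csvkit.
# It succeeds when the named command can be found on the PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report which of the csvkit commands used in this tutorial are installed.
for tool in csvlook csvcut csvstat csvgrep; do
  if have_tool "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```

If any tool is reported as missing, re-run one of the install commands above before continuing.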
csvkit offers several tools for working with CSV files in data science.
Let us start
Viewing Dataset
There are several ways to view or read a dataset from the command line. These include head, cat, vi, and nano.
head data.csv
To View the First n rows
head -n 5 data.csv
Using Cat
cat data.csv
cat data.csv | more
cat data.csv | less -S
Using csvlook to view dataset in a tabular form
head data.csv | csvlook
head -n 5 data.csv | csvlook
Checking Columns of Dataset
You can also list the column names of the dataset, together with their positions, using csvcut with the -n option.
csvcut -n data.csv
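On a machine without csvkit, a similar numbered listing can be approximated with head and awk. This is a rough sketch rather than a replacement: it assumes a simple CSV with no quoted commas in the header, and the sample rows (laid out like the iris dataset) are written to a temporary file purely for illustration.

```shell
# Hypothetical sample file; point csv at your own data.csv instead.
csv=$(mktemp)
cat > "$csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
EOF

# Print each header field with its 1-based position.
names=$(head -n 1 "$csv" | awk -F, '{ for (i = 1; i <= NF; i++) printf "%3d: %s\n", i, $i }')
echo "$names"
```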
Checking For Data Types and Summary of Dataset
This is roughly equivalent to using df.dtypes and df.describe() in pandas. Here we will be using csvstat.
csvstat data.csv
Checking For Statistics or Summary of A Column
csvstat data.csv -c 4
or
csvstat -c 4 data.csv
You can also use the column name instead of the column number (csvkit numbers columns starting from 1).
csvstat data.csv -c "species"
Checking For Value Counts
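You can count the distinct values in each column using the --unique option. To see the most frequent values in each column and how often they occur, use the --freq option instead.
csvstat data.csv --unique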
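For a per-value tally of a single column, in the style of value_counts() in pandas, plain awk can do the counting. The sketch below assumes the column of interest is the fifth one (species in an iris-style file) and that no field contains a quoted comma; the sample file is made up for the example.

```shell
# Hypothetical sample file; point csv at your own data.csv instead.
csv=$(mktemp)
cat > "$csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
EOF

# Skip the header (NR > 1), tally each value of column 5, then sort by value.
counts=$(awk -F, 'NR > 1 { seen[$5]++ } END { for (v in seen) print v, seen[v] }' "$csv" | sort)
echo "$counts"
```

Each output line holds a value followed by the number of rows in which it appears.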
You can get the maximum value, mean value, and sum for each column using the --max, --mean, and --sum options respectively.
csvstat data.csv --max
csvstat data.csv --mean
csvstat data.csv --sum
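To see what these numbers mean, the same three statistics can be computed by hand for one column with awk. This is a minimal sketch under a few assumptions: the column is the first one (sepal_length in an iris-style file), every value in it is numeric, and the sample rows are only there to make the example runnable.

```shell
# Hypothetical sample file; point csv at your own data.csv instead.
csv=$(mktemp)
cat > "$csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
EOF

# Maximum, mean and sum of column 1, skipping the header row.
stats=$(awk -F, '
  NR > 1 {
    value = $1 + 0                      # force a numeric comparison
    sum += value
    if (NR == 2 || value > max) max = value
  }
  END {
    n = NR - 1                          # data rows, excluding the header
    printf "max=%.2f mean=%.2f sum=%.2f", max, sum / n, sum
  }' "$csv")
echo "$stats"
```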
Checking For Shape of Dataset Using wc
wc data.csv
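Note that wc prints line, word, and byte counts rather than rows and columns. To get something closer to df.shape in pandas, count the lines minus the header and count the fields in the header. The sketch below assumes a single header row and no quoted commas or embedded newlines; the sample file is hypothetical.

```shell
# Hypothetical sample file; point csv at your own data.csv instead.
csv=$(mktemp)
cat > "$csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
EOF

# Rows: total lines minus the header line.
rows=$(( $(wc -l < "$csv") - 1 ))
# Columns: number of comma-separated fields in the header.
cols=$(head -n 1 "$csv" | awk -F, '{ print NF }')
echo "rows=$rows cols=$cols"
```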
Checking For Null Values and Missing Values
csvstat data.csv --nulls
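If you want an actual count of missing values per column, a short awk loop can provide one. This sketch treats only completely empty fields as missing and assumes no quoted commas; the sample file is hypothetical, with two values deliberately blanked out for the illustration.

```shell
# Hypothetical sample with two blanked fields (sepal_width on the second
# data row, species on the last one). Swap in your own file as needed.
csv=$(mktemp)
cat > "$csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,
EOF

# Remember the header names, then count empty fields column by column.
missing=$(awk -F, '
  NR == 1 { ncols = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
  { for (i = 1; i <= ncols; i++) if ($i == "") empty[i]++ }
  END { for (i = 1; i <= ncols; i++) printf "%s %d\n", name[i], empty[i] }' "$csv")
echo "$missing"
```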
Selecting Columns
To select a column you can use csvcut, giving either the column's position or its name.
csvcut -c 1 data.csv
csvcut -c "sepal_length" data.csv
You can also select multiple columns
csvcut -c 1,2,3 data.csv
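For simple files, the standard cut utility can make the same positional selection without csvkit. Treat this as a fallback sketch: it assumes no field contains a quoted comma (cut splits on every comma), and the sample file is made up for the example.

```shell
# Hypothetical sample file; point csv at your own data.csv instead.
csv=$(mktemp)
cat > "$csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
EOF

# Keep the first three comma-separated fields of every line, header included.
selected=$(cut -d, -f1,2,3 "$csv")
echo "$selected"
```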
Finding A Value in the Dataset
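To find the rows that contain a particular value, you can use the csvgrep tool. Its -m option matches a substring, so "virgin" in the commands below also matches "virginica".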
csvcut -c "species" data.csv | csvgrep -c "species" -m virgin | csvlook
You can also use a simpler form, without the tabular formatting:
csvgrep -c "species" -m virgin data.csv
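When a substring match is too loose, an exact comparison on one column can be sketched with awk. The assumptions here are that species is the fifth column and that no field contains a quoted comma; the `want` variable and the sample file are illustrative only.

```shell
# Hypothetical sample file; point csv at your own data.csv instead.
csv=$(mktemp)
cat > "$csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
EOF

# Keep the header (NR == 1) plus rows whose fifth field equals want exactly.
matches=$(awk -F, -v want="virginica" 'NR == 1 || $5 == want' "$csv")
echo "$matches"
```

Changing `want` to "virgin" returns only the header here, because no row has exactly that value.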
Using the command line with bash and csvkit is great to have as an addition to your skill set.
You can check out the video tutorial here
Thanks for your time
Jesus Saves
By Jesse E. Agbe (JCharis)