
Data Science In the Command Line

In this tutorial we will look at how to use the command line in a data science project. We will do basic data analysis directly from the shell (Bash).

Most data science projects are done in languages such as Python, R, Julia, JavaScript, Java, the C family, or MATLAB. Here we will use a few CLI tools to do the analysis instead.

The tools we will be using are

  • Bash
  • csvkit

Bash is available on most Unix-like systems (Linux, macOS, FreeBSD), as well as on Windows with the Windows Subsystem for Linux enabled.

csvkit has to be installed separately. On Debian or Ubuntu you can use

sudo apt-get install python3-csvkit

or with pip

pip install csvkit

csvkit offers several tools for working with CSV files in data science.

Let us start

Viewing Dataset

There are several ways to view or read a dataset from the command line, including head, cat, vi, and nano.

head  data.csv

To view the first n lines (the header counts as one line)

head -n 5 data.csv

Using cat

cat data.csv

cat data.csv | more 

cat data.csv | less -S

Using csvlook to view the dataset in tabular form

head data.csv | csvlook
head -n 5 data.csv | csvlook

Checking Columns of Dataset

You can list the column names of the dataset, together with their index numbers, using csvcut

csvcut -n data.csv
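If csvkit is not available, a similar numbered listing can be sketched with standard tools. This is only an approximation: it assumes a simple CSV whose header contains no quoted commas, and the sample file below is made up for illustration.

```shell
# Hypothetical sample data, written to a temporary directory
tmpdir=$(mktemp -d)
cat > "$tmpdir/data.csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
EOF

# Take the header line, put each column name on its own line, then number them
names=$(head -n 1 "$tmpdir/data.csv" | tr ',' '\n' | awk '{ print NR ": " $0 }')
echo "$names"
```

The numbers printed here are 1-based, the same convention csvcut uses for its -c option.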

Checking For Data Types and Summary of Dataset

This is roughly equivalent to using df.dtypes and df.describe() in pandas. Here we will use csvstat

csvstat data.csv

 

Checking For Statistics or Summary of A Column

csvstat data.csv -c 4

or

csvstat -c 4 data.csv

You can also use the column names instead of the column number or index

csvstat data.csv -c "species"

Checking For Value Counts

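You can check the number of unique values in each column using the --unique option. For the most frequent values and their counts, which is closer to value_counts() in pandas, csvstat also has a --freq option.

csvstat data.csv --unique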
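For pandas-style value counts of a single column, a plain-shell sketch with cut, sort, and uniq is shown below. It assumes a simple CSV with no quoted commas, and the sample file and its contents are made up for illustration.

```shell
# Hypothetical sample data, written to a temporary directory
tmpdir=$(mktemp -d)
cat > "$tmpdir/data.csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
EOF

# Column 5 is species: drop the header, then count each distinct value,
# most frequent first
counts=$(cut -d, -f5 "$tmpdir/data.csv" | tail -n +2 | sort | uniq -c | sort -rn)
echo "$counts"
```

The first sort is needed because uniq only collapses adjacent duplicate lines.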

You can get the maximum value, mean value, and sum for each column using the --max, --mean, and --sum options respectively

csvstat data.csv --max

csvstat data.csv --mean

csvstat data.csv --sum
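To make clear what these three statistics are, here is a rough awk sketch that computes them for one numeric column. It assumes a simple CSV with a header line, no quoted commas, and no missing values in that column; the sample data is hypothetical.

```shell
# Hypothetical sample data, written to a temporary directory
tmpdir=$(mktemp -d)
cat > "$tmpdir/data.csv" <<'EOF'
sepal_length,species
5.0,setosa
7.0,versicolor
6.0,virginica
EOF

# Skip the header (NR > 1), accumulate the sum and track the maximum of column 1.
# In the END block NR is the total line count, so NR - 1 is the number of data rows.
stats=$(awk -F, '
  NR > 1 { sum += $1; if (NR == 2 || $1 > max) max = $1 }
  END    { printf "max=%.1f mean=%.1f sum=%.1f", max, sum / (NR - 1), sum }
' "$tmpdir/data.csv")
echo "$stats"
```

Unlike csvstat, this sketch does no type inference, so it should only be pointed at a column that is known to be numeric.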

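Checking For Shape of Dataset Using wc

wc prints the number of lines, words, and bytes in the file. The line count is the number of rows plus one for the header; the number of columns can be read from csvcut -n.

wc data.csv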
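A minimal sketch of getting something like df.shape from the shell follows. It assumes the file has one header line, ends with a newline, and has no quoted commas or embedded line breaks; the sample data is made up.

```shell
# Hypothetical sample data, written to a temporary directory
tmpdir=$(mktemp -d)
cat > "$tmpdir/data.csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
EOF

# Rows: total number of lines minus the header line
rows=$(( $(wc -l < "$tmpdir/data.csv") - 1 ))
# Columns: number of comma-separated fields in the header
cols=$(head -n 1 "$tmpdir/data.csv" | awk -F, '{ print NF }')
echo "rows=$rows cols=$cols"
```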

Checking For Null Values and Missing Values

csvstat data.csv --nulls
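To count how many cells are missing in each column, similar to df.isnull().sum() in pandas, a rough awk sketch is shown below. It only treats completely empty cells as missing and assumes a simple CSV with no quoted commas; the sample data is hypothetical.

```shell
# Hypothetical sample data with a few empty cells
tmpdir=$(mktemp -d)
cat > "$tmpdir/data.csv" <<'EOF'
sepal_length,petal_width,species
5.1,0.2,setosa
,1.4,versicolor
6.3,,virginica
EOF

# Remember the header names, then count empty fields per column in the data rows
missing=$(awk -F, '
  NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; n = NF; next }
          { for (i = 1; i <= n; i++) if ($i == "") empty[i]++ }
  END     { for (i = 1; i <= n; i++) printf "%s=%d\n", name[i], empty[i] }
' "$tmpdir/data.csv")
echo "$missing"
```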

Selecting Columns

To select a column, you can use csvcut with the -c option, passing either the column number or the column name.

csvcut -c 1 data.csv
csvcut -c "sepal_length" data.csv

You can also select multiple columns

csvcut -c 1,2,3 data.csv

Finding A Value in the Dataset

To find rows containing a particular value, you can use the csvgrep tool. The -m option matches a substring, so "virgin" matches "virginica".

csvcut -c "species" data.csv | csvgrep -c "species" -m virgin | csvlook

You can also use a simpler form without the tabular formatting

csvgrep -c "species" -m virgin data.csv
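The same kind of filter can be sketched with awk alone: keep the header, then keep every row whose chosen column matches a pattern. This assumes a simple CSV with no quoted commas and that species is the fifth column; the sample data is made up.

```shell
# Hypothetical sample data, written to a temporary directory
tmpdir=$(mktemp -d)
cat > "$tmpdir/data.csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
EOF

# NR == 1 keeps the header; $5 ~ /virgin/ keeps rows whose species contains "virgin"
matches=$(awk -F, 'NR == 1 || $5 ~ /virgin/' "$tmpdir/data.csv")
echo "$matches"
```

A plain grep virgin data.csv would also find these rows, but it searches every column and drops the header line.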

Knowing how to use the command line with Bash and csvkit is a great addition to your skill set.

 

You can check out the video tutorial here

Thanks for your time

Jesus Saves

By Jesse E. Agbe (JCharis)

 

 
