pypolars tutorial

PyPolars – Data Analysis with PyPolars – a Pandas Alternative

Pandas is one of the best libraries for data analysis. It is so popular and useful that it has become the de facto DataFrame library for data science in Python.

However, there are times when your dataset is too big for Pandas but too small to justify PySpark. This is where PyPolars comes into play.

By the end of this tutorial you will know:

  • What PyPolars is
  • When to use it
  • How to install it
  • How to use it for Exploratory Data Analysis (EDA)

Let us start.

What is PyPolars?

PyPolars is a Python library useful for exploratory data analysis (EDA for short). It is the Python binding for Polars, a DataFrame library written in Rust. (The name is a play on bears: polar bear versus panda bear, hence Polars vs Pandas.)

PyPolars is quite easy to pick up because its API is similar to that of Pandas. Moreover, if you need a feature or function that is not yet available, you can fall back on Pandas by converting your data with the ".to_pandas()" method. Cool, right?

Let us explore this wonderful Python library.

How to Install PyPolars

PyPolars is available on PyPI, so you can install it with pip as below:

pip install py-polars 

or with pip3:

pip3 install py-polars

PyPolars For Data Analysis

As mentioned earlier, PyPolars has a similar API to that of Pandas, so you can read in any CSV file using the .read_csv() function and start from there. You can get the full code here.

To begin, let us import it and load a dataset:

import pypolars as pl 
df = pl.read_csv("data/mydataset.csv")

Exploratory Data Analysis with PyPolars

  • PyPolars can be used for reading CSV files
  • Selection of rows and columns for manipulation
  • Groupby, join, etc.
  • It doesn't have a direct link to matplotlib; however, you can use .to_pandas() to work around that.

In [1]:

#  Load EDA Pkgs
import pypolars as pl

In [2]:

# Methods/Attrib
dir(pl)

Out[2]:

['BinaryIO',
 'Boolean',
 'Callable',
 'DTYPE_TO_FFINAME',
 'DataFrame',
 'Date32',
 'Date64',
 'Dict',
 'DurationMicrosecond',
 'DurationMillisecond',
 'DurationNanosecond',
 'DurationSecond',
 'Expr',
 'Float32',
 'Float64',
 'Int16',
 'Int32',
 'Int64',
 'Int8',
 'LazyFrame',
 'LazyGroupBy',
 'List',
 'Object',
 'Optional',
 'PyExpr',
 'PyLazyFrame',
 'PyLazyGroupBy',
 'Series',
 'TextIO',
 'Time32Millisecond',
 'Time32Second',
 'Time64Microsecond',
 'Time64Nanosecond',
 'TimestampMicrosecond',
 'TimestampMillisecond',
 'TimestampNanosecond',
 'TimestampSecond',
 'UDF',
 'UInt16',
 'UInt32',
 'UInt64',
 'UInt8',
 'Union',
 'Utf8',
 'When',
 'WhenThen',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__pdoc__',
 '__spec__',
 '__version__',
 'arg_where',
 'avg',
 'binary_expr',
 'col',
 'count',
 'ctypes',
 'datatypes',
 'dtype_to_ctype',
 'dtype_to_int',
 'dtype_to_primitive',
 'dtypes',
 'expr_to_lit_or_expr',
 'ffi',
 'first',
 'frame',
 'from_pandas',
 'functions',
 'get_dummies',
 'last',
 'lazy',
 'list',
 'lit',
 'max',
 'mean',
 'median',
 'min',
 'n_unique',
 'np',
 'os',
 'pycol',
 'pylit',
 'pypolars',
 'pytype_to_polars_type',
 'pywhen',
 'read_csv',
 'read_ipc',
 'read_parquet',
 'scan_csv',
 'scan_parquet',
 'series',
 'shutil',
 'std',
 'subprocess',
 'sum',
 'tempfile',
 'udf',
 'var',
 'version',
 'when',
 'wrap_df',
 'wrap_expr',
 'wrap_ldf',
 'wrap_s']

In [3]:

# Reading CSV
df = pl.read_csv("data/diamonds.csv")

In [4]:

# Check Type
type(df)

Out[4]:

pypolars.frame.DataFrame

In [5]:

# Head/First Rows
df.head()

Out[5]:

shape: (5, 10)
╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut     ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---     ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str     ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal   ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ Premium ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Good    ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.29  ┆ Premium ┆ I     ┆ VS2     ┆ ... ┆ 62.4  ┆ 58    ┆ 334   ┆ 4.2  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.31  ┆ Good    ┆ J     ┆ SI2     ┆ ... ┆ 63.3  ┆ 58    ┆ 335   ┆ 4.34 │
╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯

In [6]:

# Last 10 Rows
df.tail(10)

Out[6]:

shape: (10, 10)
╭───────┬───────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut       ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---       ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str       ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═══════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.71  ┆ Premium   ┆ E     ┆ SI1     ┆ ... ┆ 60.5  ┆ 55    ┆ 2756  ┆ 5.79 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.71  ┆ Premium   ┆ F     ┆ SI1     ┆ ... ┆ 59.8  ┆ 62    ┆ 2756  ┆ 5.74 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.7   ┆ Very Good ┆ E     ┆ VS2     ┆ ... ┆ 60.5  ┆ 59    ┆ 2757  ┆ 5.71 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.7   ┆ Very Good ┆ E     ┆ VS2     ┆ ... ┆ 61.2  ┆ 59    ┆ 2757  ┆ 5.69 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...   ┆ ...       ┆ ...   ┆ ...     ┆ ... ┆ ...   ┆ ...   ┆ ...   ┆ ...  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.72  ┆ Ideal     ┆ D     ┆ SI1     ┆ ... ┆ 60.8  ┆ 57    ┆ 2757  ┆ 5.75 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.72  ┆ Good      ┆ D     ┆ SI1     ┆ ... ┆ 63.1  ┆ 55    ┆ 2757  ┆ 5.69 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.7   ┆ Very Good ┆ D     ┆ SI1     ┆ ... ┆ 62.8  ┆ 60    ┆ 2757  ┆ 5.66 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.86  ┆ Premium   ┆ H     ┆ SI2     ┆ ... ┆ 61    ┆ 58    ┆ 2757  ┆ 6.15 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.75  ┆ Ideal     ┆ D     ┆ SI2     ┆ ... ┆ 62.2  ┆ 55    ┆ 2757  ┆ 5.83 │
╰───────┴───────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯

In [7]:

# Check For Shape
df.shape

Out[7]:

(53940, 10)

In [8]:

# Check For Datatypes
df.dtypes

Out[8]:

[pypolars.datatypes.Float64,
 pypolars.datatypes.Utf8,
 pypolars.datatypes.Utf8,
 pypolars.datatypes.Utf8,
 pypolars.datatypes.Float64,
 pypolars.datatypes.Float64,
 pypolars.datatypes.Int64,
 pypolars.datatypes.Float64,
 pypolars.datatypes.Float64,
 pypolars.datatypes.Float64]

In [9]:

# Check For Column Names
df.columns

Out[9]:

['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']

In [10]:

# Method/Attrib on DF
dir(df)

Out[10]:

['__add__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__weakref__',
 '_df',
 '_from_pydf',
 '_rechunk',
 'clone',
 'columns',
 'drop',
 'drop_duplicates',
 'drop_in_place',
 'drop_nulls',
 'dtypes',
 'explode',
 'fill_none',
 'frame_equal',
 'get_columns',
 'groupby',
 'head',
 'height',
 'hstack',
 'insert_at_idx',
 'is_duplicated',
 'is_unique',
 'join',
 'lazy',
 'max',
 'mean',
 'median',
 'melt',
 'min',
 'n_chunks',
 'pipe',
 'quantile',
 'read_csv',
 'read_ipc',
 'read_parquet',
 'replace',
 'replace_at_idx',
 'sample',
 'select_at_idx',
 'shape',
 'shift',
 'slice',
 'sort',
 'std',
 'sum',
 'tail',
 'to_csv',
 'to_dummies',
 'to_ipc',
 'to_numpy',
 'to_pandas',
 'var',
 'vstack',
 'width']

In [11]:

# Not Working
df.describe()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pypolars/frame.py in __getattr__(self, item)
    260         try:
--> 261             return wrap_s(self._df.column(item))
    262         except RuntimeError:

RuntimeError: Any(NotFound("describe"))

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-11-f517d93c89f7> in <module>
      1 # Not Working
----> 2 df.describe()

/usr/local/lib/python3.7/dist-packages/pypolars/frame.py in __getattr__(self, item)
    261             return wrap_s(self._df.column(item))
    262         except RuntimeError:
--> 263             raise AttributeError(f"{item} not found")
    264 
    265     def __iter__(self):

AttributeError: describe not found

In [12]:

# Alternative Using Pandas
df.to_pandas().describe()

Out[12]:

              carat         depth         table         price             x             y             z
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000
mean       0.797940     61.749405     57.457184   3932.799722      5.731157      5.734526      3.538734
std        0.474011      1.432621      2.234491   3989.439738      1.121761      1.142135      0.705699
min        0.200000     43.000000     43.000000    326.000000      0.000000      0.000000      0.000000
25%        0.400000     61.000000     56.000000    950.000000      4.710000      4.720000      2.910000
50%        0.700000     61.800000     57.000000   2401.000000      5.700000      5.710000      3.530000
75%        1.040000     62.500000     59.000000   5324.250000      6.540000      6.540000      4.040000
max        5.010000     79.000000     95.000000  18823.000000     10.740000     58.900000     31.800000
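
For readers curious about what each row of the summary above measures, here is a minimal plain-Python sketch using only the standard library. It uses just the five carat values from the df.head() output rather than the full dataset, so the numbers differ from the table. The variable names are illustrative, and the "inclusive" quartile method is chosen on the assumption that it corresponds to the linear interpolation pandas uses by default.

```python
import statistics

# First five `carat` values from the df.head() output shown earlier
carat = [0.23, 0.21, 0.23, 0.29, 0.31]

# Quartiles with linear interpolation between the smallest and largest value
q1, q2, q3 = statistics.quantiles(carat, n=4, method="inclusive")

summary = {
    "count": len(carat),
    "mean": statistics.mean(carat),
    "std": statistics.stdev(carat),  # sample standard deviation (n - 1)
    "min": min(carat),
    "25%": q1,
    "50%": q2,
    "75%": q3,
    "max": max(carat),
}

for name, value in summary.items():
    print(f"{name:>5} {value:.6f}")
```

This is only a way to reason about the statistics; on real data the df.to_pandas().describe() call above remains the practical route.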

Selection

  • By Columns By Name/Index
  • By Rows

In [17]:

df.columns

Out[17]:

['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']

In [18]:

# Select Columns By Col Name
# pandas:: df['carat']
df['carat']

Out[18]:

Series: 'carat' [f64]
[
	0.23
	0.21
	0.23
	0.29
	0.31
	0.24
	0.24
	0.26
	0.22
	0.23
]

In [19]:

# Select Columns By Index
df[0]

Out[19]:

Series: 'carat' [f64]
[
	0.23
	0.21
	0.23
	0.29
	0.31
	0.24
	0.24
	0.26
	0.22
	0.23
]

In [20]:

# Select Multiple Columns
df[['carat','price']]

Out[20]:

shape: (53940, 2)
╭───────┬───────╮
│ carat ┆ price │
│ ---   ┆ ---   │
│ f64   ┆ i64   │
╞═══════╪═══════╡
│ 0.23  ┆ 326   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.21  ┆ 326   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.23  ┆ 327   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.29  ┆ 334   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ...   ┆ ...   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.72  ┆ 2757  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.72  ┆ 2757  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.7   ┆ 2757  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.86  ┆ 2757  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.75  ┆ 2757  │
╰───────┴───────╯

In [22]:

# Select By Index Position for columns
# pandas:: df.iloc[:, 0]
df.select_at_idx(0)

Out[22]:

Series: 'carat' [f64]
[
	0.23
	0.21
	0.23
	0.29
	0.31
	0.24
	0.24
	0.26
	0.22
	0.23
]

In [23]:

# Select By Rows
# Select rows 0 to 2 (the stop index 3 is excluded) and all columns
# pandas:: df.iloc[0:3]
df[0:3]

Out[23]:

shape: (3, 10)
╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut     ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---     ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str     ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal   ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ Premium ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Good    ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯

In [24]:

# Slicing: start at offset 0 and take 3 rows
# pandas:: df.iloc[0:3]
df.slice(0,3)

Out[24]:

shape: (3, 10)
╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut     ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---     ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str     ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal   ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ Premium ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Good    ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
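
The two cells above return the same rows, but they describe the range differently: df[0:3] uses start and stop positions, while the second argument of df.slice(0, 3) is a length. A plain-Python sketch of the difference, with a list standing in for the DataFrame; slice_rows is a hypothetical helper written for this illustration, not a PyPolars function:

```python
# Stand-in for DataFrame rows: the five `price` values from df.head()
prices = [326, 326, 327, 334, 335]


def slice_rows(rows, offset, length):
    """Hypothetical helper mirroring an (offset, length) style slice."""
    return rows[offset:offset + length]


start_stop = prices[0:3]                  # like df[0:3]
offset_length = slice_rows(prices, 0, 3)  # like df.slice(0, 3)

# The two conventions agree when the offset is 0 ...
print(start_stop, offset_length)
# ... but give different rows for any other starting point
print(prices[1:3], slice_rows(prices, 1, 3))
```

Keeping this in mind avoids off-by-length surprises when the slice does not start at row 0.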

In [28]:

# Select the first 3 rows of a column
df[0:3,"carat"]

Out[28]:

shape: (3, 1)
╭───────╮
│ carat │
│ ---   │
│ f64   │
╞═══════╡
│ 0.23  │
├╌╌╌╌╌╌╌┤
│ 0.21  │
├╌╌╌╌╌╌╌┤
│ 0.23  │
╰───────╯

In [29]:

# Select specific rows of a column
df[[0,4,10],"carat"]

Out[29]:

shape: (3, 1)
╭───────╮
│ carat │
│ ---   │
│ f64   │
╞═══════╡
│ 0.23  │
├╌╌╌╌╌╌╌┤
│ 0.31  │
├╌╌╌╌╌╌╌┤
│ 0.3   │
╰───────╯

In [30]:

# Select specific rows of multiple columns
df[[0,4,10],["carat","cut"]]

Out[30]:

shape: (3, 2)
╭───────┬───────╮
│ carat ┆ cut   │
│ ---   ┆ ---   │
│ f64   ┆ str   │
╞═══════╪═══════╡
│ 0.23  ┆ Ideal │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.31  ┆ Good  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.3   ┆ Good  │
╰───────┴───────╯

In [31]:

# Value Counts: preview the data first
df.head()

Out[31]:

shape: (5, 10)
╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut     ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---     ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str     ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal   ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ Premium ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Good    ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.29  ┆ Premium ┆ I     ┆ VS2     ┆ ... ┆ 62.4  ┆ 58    ┆ 334   ┆ 4.2  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.31  ┆ Good    ┆ J     ┆ SI2     ┆ ... ┆ 63.3  ┆ 58    ┆ 335   ┆ 4.34 │
╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯

In [32]:

df['cut'].value_counts()

Out[32]:

shape: (5, 2)
╭───────────┬────────╮
│ cut       ┆ counts │
│ ---       ┆ ---    │
│ str       ┆ u32    │
╞═══════════╪════════╡
│ Ideal     ┆ 21551  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 13791  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Very Good ┆ 12082  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Good      ┆ 4906   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 1610   │
╰───────────┴────────╯
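
As a rough picture of what value_counts() computes, the same tally can be sketched in plain Python with collections.Counter. The sample below uses only the cut values of the five rows shown by df.head(), so the counts are much smaller than in the table above:

```python
from collections import Counter

# `cut` values of the five rows shown by df.head()
cuts = ["Ideal", "Premium", "Good", "Premium", "Good"]

counts = Counter(cuts)

# most_common() lists labels by descending count, like the table above
for label, n in counts.most_common():
    print(f"{label:<10} {n}")
```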

In [ ]:

# No Describe

Applying Functions and Filtering

  • Apply a function
  • Boolean indexing
  • Conditions

In [33]:

# Get Unique Values
df['cut'].unique()

Out[33]:

Series: 'cut' [str]
[
	"Very ..."
	"Good"
	"Fair"
	"Ideal..."
	"Premi..."
]

In [34]:

# Boolean Indexing & Condition
# Build a boolean mask: true where cut is Ideal
df['cut'] == "Ideal"

Out[34]:

Series: 'cut' [bool]
[
	true
	false
	false
	false
	false
	false
	false
	false
	false
	false
]

In [35]:

# Boolean Indexing & Condition & Find
# Use the mask to keep all rows/datapoints with cut as Ideal
df[df['cut'] == "Ideal"]

Out[35]:

shape: (21551, 10)
╭───────┬───────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut   ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---   ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str   ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═══════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Ideal ┆ J     ┆ VS1     ┆ ... ┆ 62.8  ┆ 56    ┆ 340   ┆ 3.93 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.31  ┆ Ideal ┆ J     ┆ SI2     ┆ ... ┆ 62.2  ┆ 54    ┆ 344   ┆ 4.35 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.3   ┆ Ideal ┆ I     ┆ SI2     ┆ ... ┆ 62    ┆ 54    ┆ 348   ┆ 4.31 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...   ┆ ...   ┆ ...   ┆ ...     ┆ ... ┆ ...   ┆ ...   ┆ ...   ┆ ...  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.79  ┆ Ideal ┆ I     ┆ SI1     ┆ ... ┆ 61.6  ┆ 56    ┆ 2756  ┆ 5.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.71  ┆ Ideal ┆ E     ┆ SI1     ┆ ... ┆ 61.9  ┆ 56    ┆ 2756  ┆ 5.71 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.71  ┆ Ideal ┆ G     ┆ VS1     ┆ ... ┆ 61.4  ┆ 56    ┆ 2756  ┆ 5.76 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.72  ┆ Ideal ┆ D     ┆ SI1     ┆ ... ┆ 60.8  ┆ 57    ┆ 2757  ┆ 5.75 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.75  ┆ Ideal ┆ D     ┆ SI2     ┆ ... ┆ 62.2  ┆ 55    ┆ 2757  ┆ 5.83 │
╰───────┴───────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
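
Boolean indexing, as used in the last two cells, is a two-step idea: first compute one true/false flag per row, then keep the rows whose flag is true. A minimal plain-Python sketch of those two steps, using lists in place of a Series and a DataFrame and only the five rows shown by df.head():

```python
# (cut, price) pairs from the five rows shown by df.head()
rows = [
    ("Ideal", 326),
    ("Premium", 326),
    ("Good", 327),
    ("Premium", 334),
    ("Good", 335),
]

# Step 1: build a boolean mask, one flag per row (like df['cut'] == "Ideal")
mask = [cut == "Ideal" for cut, _ in rows]

# Step 2: keep only the rows whose flag is True (like df[mask])
ideal_rows = [row for row, keep in zip(rows, mask) if keep]

print(mask)
print(ideal_rows)
```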

In [37]:

# Filter rows where price is greater than a value (300)
df[df['price'] > 300]

Out[37]:

shape: (53940, 10)
╭───────┬───────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut       ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---       ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str       ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═══════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal     ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ Premium   ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Good      ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.29  ┆ Premium   ┆ I     ┆ VS2     ┆ ... ┆ 62.4  ┆ 58    ┆ 334   ┆ 4.2  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...   ┆ ...       ┆ ...   ┆ ...     ┆ ... ┆ ...   ┆ ...   ┆ ...   ┆ ...  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.72  ┆ Ideal     ┆ D     ┆ SI1     ┆ ... ┆ 60.8  ┆ 57    ┆ 2757  ┆ 5.75 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.72  ┆ Good      ┆ D     ┆ SI1     ┆ ... ┆ 63.1  ┆ 55    ┆ 2757  ┆ 5.69 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.7   ┆ Very Good ┆ D     ┆ SI1     ┆ ... ┆ 62.8  ┆ 60    ┆ 2757  ┆ 5.66 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.86  ┆ Premium   ┆ H     ┆ SI2     ┆ ... ┆ 61    ┆ 58    ┆ 2757  ┆ 6.15 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.75  ┆ Ideal     ┆ D     ┆ SI2     ┆ ... ┆ 62.2  ┆ 55    ┆ 2757  ┆ 5.83 │
╰───────┴───────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯

Apply a Function on a DataFrame or Series

  • .apply()
  • .lazy() + filter
  • .to_pandas()

In [40]:

df.head()

Out[40]:

shape: (5, 10)
╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut     ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---     ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str     ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal   ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ Premium ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Good    ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.29  ┆ Premium ┆ I     ┆ VS2     ┆ ... ┆ 62.4  ┆ 58    ┆ 334   ┆ 4.2  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.31  ┆ Good    ┆ J     ┆ SI2     ┆ ... ┆ 63.3  ┆ 58    ┆ 335   ┆ 4.34 │
╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯

In [41]:

# Method 1
df['x'].apply(round)

Out[41]:

Series: 'x' [i64]
[
	4
	4
	4
	4
	4
	4
	4
	4
	4
	4
]

In [42]:

# Method 2 Using Lambda
df['x'].apply(lambda x : round(x))

Out[42]:

Series: 'x' [i64]
[
	4
	4
	4
	4
	4
	4
	4
	4
	4
	4
]
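
Methods 1 and 2 above give the same result because apply simply calls the given function once per element, so wrapping round in a lambda adds nothing. A plain-Python sketch of that element-wise behaviour, using the x values of the five rows shown by df.head() and a list in place of the Series:

```python
# `x` values of the five rows shown by df.head()
xs = [3.95, 3.89, 4.05, 4.2, 4.34]

# Passing the function itself and wrapping it in a lambda are equivalent
method_1 = [round(x) for x in xs]
method_2 = list(map(lambda x: round(x), xs))

print(method_1)
print(method_1 == method_2)

# round() with a single argument returns an int, which is consistent
# with the integer dtype of the Series printed above
print(type(method_1[0]).__name__)
```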

In [44]:

# Method 3 : Using Custom Fxn
df['cut'].unique()

Out[44]:

Series: 'cut' [str]
[
	"Very ..."
	"Premi..."
	"Ideal..."
	"Good"
	"Fair"
]

In [45]:

# Convert to A Dictionary For Map
mydict = {v:k for k,v in enumerate(df['cut'].unique())}

In [46]:

mydict

Out[46]:

{'Ideal': 0, 'Very Good': 1, 'Good': 2, 'Fair': 3, 'Premium': 4}
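
Note that the two df['cut'].unique() outputs in this tutorial list the labels in different orders, so codes built by enumerating that result may change from run to run. Below is a minimal plain-Python sketch of two ways to get a stable mapping. The names alphabetical, quality_order and by_quality are illustrative assumptions, not something PyPolars provides, and the rest of the tutorial keeps using mydict as defined above.

```python
# Labels as returned by unique(); this order is not guaranteed
labels = ["Very Good", "Premium", "Ideal", "Good", "Fair"]

# Option 1: sort alphabetically so the codes are reproducible
alphabetical = {label: code for code, label in enumerate(sorted(labels))}

# Option 2: spell out a meaningful order by hand (worst to best cut)
quality_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
by_quality = {label: code for code, label in enumerate(quality_order)}

print(alphabetical)
print(by_quality)
```

The second option has the added benefit that larger codes mean better cuts, which can matter if the codes are later fed to a model.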

In [56]:

# Won't work: apply() needs a callable, and a dict is not callable
df['cut'].apply(mydict)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-56-a91f1064400e> in <module>
----> 1 df['cut'].apply(mydict)

/usr/local/lib/python3.7/dist-packages/pypolars/series.py in apply(self, func, dtype_out, sniff_dtype)
    898             dtype_out = Boolean
    899 
--> 900         return wrap_s(self._s.apply_lambda(func, dtype_out))
    901 
    902     def shift(self, periods: int) -> "Series":

TypeError: 'dict' object is not callable
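
The error arises because apply calls its argument like a function, and a dict cannot be called. The plain-Python sketch below shows the distinction. It uses the built-in map in place of Series.apply, and whether your PyPolars version accepts a bound method such as mydict.get directly is an assumption you would need to verify.

```python
mydict = {'Ideal': 0, 'Very Good': 1, 'Good': 2, 'Fair': 3, 'Premium': 4}

# `cut` values of the five rows shown by df.head()
cuts = ["Ideal", "Premium", "Good", "Premium", "Good"]

print(callable(mydict))      # a dict is a container, not a function
print(callable(mydict.get))  # its lookup method, however, is callable

try:
    mydict("Ideal")          # same kind of failure as in the traceback above
except TypeError as err:
    error_message = str(err)
    print(error_message)

# Anything callable that maps a label to a code will do
codes = list(map(mydict.get, cuts))
print(codes)
```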

In [57]:

def mapcut_values(x):
    # Look up the integer code for a cut label (None if the label is unknown)
    return mydict.get(x)

In [58]:

mapcut_values("Ideal")

Out[58]:

0

In [59]:

df['cut'].apply(mapcut_values)

Out[59]:

Series: 'cut' [i64]
[
	0
	4
	2
	4
	2
	1
	1
	1
	3
	1
]

In [62]:

# Method 4
# lazy API
import pypolars.lazy as plazy

In [63]:

# Methods/Attrib
dir(plazy)

Out[63]:

['Callable',
 'DataFrame',
 'Dict',
 'Expr',
 'LazyFrame',
 'LazyGroupBy',
 'List',
 'Optional',
 'PyExpr',
 'PyLazyFrame',
 'PyLazyGroupBy',
 'Series',
 'UDF',
 'Union',
 'When',
 'WhenThen',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_selection_to_pyexpr_list',
 'avg',
 'binary_expr',
 'col',
 'count',
 'datatypes',
 'expr_to_lit_or_expr',
 'first',
 'last',
 'list',
 'lit',
 'max',
 'mean',
 'median',
 'min',
 'n_unique',
 'os',
 'pycol',
 'pylit',
 'pywhen',
 'shutil',
 'std',
 'subprocess',
 'sum',
 'tempfile',
 'udf',
 'var',
 'when',
 'wrap_df',
 'wrap_expr',
 'wrap_ldf']

In [64]:

# Define your fxn: takes a whole Series and returns the mapped Series
def cust_mapcut_values(s: plazy.Series) -> plazy.Series:
    return s.apply(lambda x: mydict[x])

In [65]:

df.lazy().with_column(plazy.col('cut').map(cust_mapcut_values))

Out[65]:

<pypolars.lazy.LazyFrame at 0x7f93c05666d0>

In [66]:

output = df.lazy().with_column(plazy.col('cut').map(cust_mapcut_values))

In [67]:

output.collect()

Out[67]:

shape: (53940, 10)
╭───────┬─────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ --- ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ i64 ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ 0   ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ 4   ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ 2   ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.29  ┆ 4   ┆ I     ┆ VS2     ┆ ... ┆ 62.4  ┆ 58    ┆ 334   ┆ 4.2  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...   ┆ ... ┆ ...   ┆ ...     ┆ ... ┆ ...   ┆ ...   ┆ ...   ┆ ...  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.72  ┆ 0   ┆ D     ┆ SI1     ┆ ... ┆ 60.8  ┆ 57    ┆ 2757  ┆ 5.75 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.72  ┆ 2   ┆ D     ┆ SI1     ┆ ... ┆ 63.1  ┆ 55    ┆ 2757  ┆ 5.69 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.7   ┆ 1   ┆ D     ┆ SI1     ┆ ... ┆ 62.8  ┆ 60    ┆ 2757  ┆ 5.66 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.86  ┆ 4   ┆ H     ┆ SI2     ┆ ... ┆ 61    ┆ 58    ┆ 2757  ┆ 6.15 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.75  ┆ 0   ┆ D     ┆ SI2     ┆ ... ┆ 62.2  ┆ 55    ┆ 2757  ┆ 5.83 │
╰───────┴─────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯

In [68]:

type(output)

Out[68]:

pypolars.lazy.LazyFrame
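
The cells above show that with_column on a lazy frame returns a LazyFrame right away, and that the data only appears after .collect(). The toy class below sketches that deferred-execution idea in plain Python. TinyLazy and its methods are hypothetical names made up for this illustration and say nothing about how PyPolars is actually implemented.

```python
class TinyLazy:
    """Hypothetical toy: records column transforms, runs them on collect()."""

    def __init__(self, columns, steps=()):
        self._columns = columns      # dict of column name -> list of values
        self._steps = tuple(steps)   # pending (column name, function) pairs

    def with_column(self, name, func):
        # Nothing is computed here; a new plan with one more step is returned
        return TinyLazy(self._columns, self._steps + ((name, func),))

    def collect(self):
        # Only now are the recorded steps applied, in order, on a copy
        result = {name: list(values) for name, values in self._columns.items()}
        for name, func in self._steps:
            result[name] = [func(value) for value in result[name]]
        return result


mydict = {'Ideal': 0, 'Very Good': 1, 'Good': 2, 'Fair': 3, 'Premium': 4}
data = {"cut": ["Ideal", "Premium", "Good"], "price": [326, 326, 327]}

plan = TinyLazy(data).with_column("cut", lambda label: mydict[label])
print(type(plan).__name__)   # still a plan, no work has been done yet

collected = plan.collect()
print(collected)
```

Deferring the work in this way is what gives a lazy engine the chance to inspect the whole plan before running it.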

In [69]:

# Method 5:using Pandas
df2 = df.to_pandas()

In [70]:

type(df2)

Out[70]:

pandas.core.frame.DataFrame

In [71]:

df2['cut'].map(mydict)

Out[71]:

0        0
1        4
2        2
3        4
4        2
        ..
53935    0
53936    2
53937    1
53938    4
53939    0
Name: cut, Length: 53940, dtype: int64

Groupby Using PyPolars

In [78]:

# pandas:: df.groupby('cut')['price'].sum()
df.groupby('cut').select('price').sum()

Out[78]:

shape: (5, 2)
╭───────────┬───────────╮
│ cut       ┆ price_sum │
│ ---       ┆ ---       │
│ str       ┆ i64       │
╞═══════════╪═══════════╡
│ Ideal     ┆ 74513487  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 63221498  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 7017600   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Good      ┆ 19275009  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Very Good ┆ 48107623  │
╰───────────┴───────────╯
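
As a rough picture of what the groupby followed by sum above computes, here is a plain-Python sketch that accumulates one running total per group. It uses only the five rows shown by df.head(), so the totals are far smaller than in the table:

```python
from collections import defaultdict

# (cut, price) pairs from the five rows shown by df.head()
rows = [
    ("Ideal", 326),
    ("Premium", 326),
    ("Good", 327),
    ("Premium", 334),
    ("Good", 335),
]

price_sum = defaultdict(int)
for cut, price in rows:
    price_sum[cut] += price   # accumulate within each group

print(dict(price_sum))
```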

In [80]:

df.groupby('cut').select('price').count()

Out[80]:

shape: (5, 2)
╭───────────┬─────────────╮
│ cut       ┆ price_count │
│ ---       ┆ ---         │
│ str       ┆ u32         │
╞═══════════╪═════════════╡
│ Ideal     ┆ 21551       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 13791       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 1610        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Very Good ┆ 12082       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Good      ┆ 4906        │
╰───────────┴─────────────╯

In [81]:

# Selecting Multiple Columns From Groupby
df.groupby('cut').select(['price','carat']).sum()

Out[81]:

shape: (5, 3)
╭───────────┬───────────┬────────────╮
│ cut       ┆ price_sum ┆ carat_sum  │
│ ---       ┆ ---       ┆ ---        │
│ str       ┆ i64       ┆ f64        │
╞═══════════╪═══════════╪════════════╡
│ Very Good ┆ 48107623  ┆ 9742.7     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 7017600   ┆ 1684.28    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 63221498  ┆ 1.230095e4 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Ideal     ┆ 74513487  ┆ 1.514684e4 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Good      ┆ 19275009  ┆ 4166.1     │
╰───────────┴───────────┴────────────╯

In [82]:

# Multiple Groupby + Selecting Multiple Columns From Groupby + Sum
df.groupby(['cut','color']).select(['price','carat']).sum()

Out[82]:

shape: (35, 4)
╭───────────┬───────┬───────────┬───────────╮
│ cut       ┆ color ┆ price_sum ┆ carat_sum │
│ ---       ┆ ---   ┆ ---       ┆ ---       │
│ str       ┆ str   ┆ i64       ┆ f64       │
╞═══════════╪═══════╪═══════════╪═══════════╡
│ Ideal     ┆ D     ┆ 7450854   ┆ 1603.38   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair      ┆ G     ┆ 1331126   ┆ 321.48    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair      ┆ E     ┆ 824838    ┆ 191.88    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair      ┆ F     ┆ 1194025   ┆ 282.27    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ...       ┆ ...   ┆ ...       ┆ ...       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Premium   ┆ G     ┆ 13160170  ┆ 2460.51   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Very Good ┆ E     ┆ 7715165   ┆ 1623.16   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair      ┆ J     ┆ 592103    ┆ 159.6     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Very Good ┆ G     ┆ 8903461   ┆ 1762.87   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Ideal     ┆ I     ┆ 9317974   ┆ 1910.97   │
╰───────────┴───────┴───────────┴───────────╯
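
Grouping by several columns works the same way as grouping by one, except that the group key becomes a pair of values. A plain-Python sketch under that reading, using four of the Ideal rows printed earlier in this tutorial; the dictionary layout is an illustrative choice, not PyPolars output:

```python
from collections import defaultdict

# (cut, color, carat, price) for four "Ideal" rows printed earlier
rows = [
    ("Ideal", "E", 0.23, 326),
    ("Ideal", "J", 0.23, 340),
    ("Ideal", "J", 0.31, 344),
    ("Ideal", "I", 0.30, 348),
]

totals = defaultdict(lambda: {"price_sum": 0, "carat_sum": 0.0})
for cut, color, carat, price in rows:
    group = totals[(cut, color)]   # the key is the (cut, color) pair
    group["price_sum"] += price
    group["carat_sum"] += carat

for key, sums in totals.items():
    print(key, sums)
```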

In [89]:

# Select Every Column + Fxn
df.groupby(['cut','color']).select_all().first()

Out[89]:

shape: (35, 10)
╭────────────┬───────┬────────────┬────────────┬─────┬───────────┬───────────┬───────────┬─────────╮
│ cut        ┆ color ┆ carat_firs ┆ clarity_fi ┆ ... ┆ depth_fir ┆ table_fir ┆ price_fir ┆ x_first │
│ ---        ┆ ---   ┆ t          ┆ rst        ┆     ┆ st        ┆ st        ┆ st        ┆ ---     │
│ str        ┆ str   ┆ ---        ┆ ---        ┆     ┆ ---       ┆ ---       ┆ ---       ┆ f64     │
│            ┆       ┆ f64        ┆ str        ┆     ┆ f64       ┆ f64       ┆ i64       ┆         │
╞════════════╪═══════╪════════════╪════════════╪═════╪═══════════╪═══════════╪═══════════╪═════════╡
│ Very Good  ┆ F     ┆ 0.23       ┆ VS1        ┆ ... ┆ 60.9      ┆ 57        ┆ 357       ┆ 3.96    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Very Good  ┆ J     ┆ 0.24       ┆ VVS2       ┆ ... ┆ 62.8      ┆ 57        ┆ 336       ┆ 3.94    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Fair       ┆ J     ┆ 1.05       ┆ SI2        ┆ ... ┆ 65.8      ┆ 59        ┆ 2789      ┆ 6.41    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Fair       ┆ E     ┆ 0.22       ┆ VS2        ┆ ... ┆ 65.1      ┆ 61        ┆ 337       ┆ 3.87    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ ...        ┆ ...   ┆ ...        ┆ ...        ┆ ... ┆ ...       ┆ ...       ┆ ...       ┆ ...     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Ideal      ┆ F     ┆ 0.81       ┆ SI2        ┆ ... ┆ 58.8      ┆ 57        ┆ 2761      ┆ 6.14    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Ideal      ┆ D     ┆ 0.3        ┆ SI1        ┆ ... ┆ 62.5      ┆ 57        ┆ 552       ┆ 4.29    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Very Good  ┆ D     ┆ 0.23       ┆ VS2        ┆ ... ┆ 60.5      ┆ 61        ┆ 357       ┆ 3.96    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Premium    ┆ E     ┆ 0.21       ┆ SI1        ┆ ... ┆ 59.8      ┆ 61        ┆ 326       ┆ 3.89    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Premium    ┆ J     ┆ 0.3        ┆ SI2        ┆ ... ┆ 59.3      ┆ 61        ┆ 405       ┆ 4.43    │
╰────────────┴───────┴────────────┴────────────┴─────┴───────────┴───────────┴───────────┴─────────╯

In [90]:

# Select Every Column + Fxn
df.groupby(['cut','color']).select_all().sum()

Out[90]:

shape: (35, 9)
╭────────────┬───────┬───────────┬───────────┬─────┬───────────┬───────────┬───────────┬───────────╮
│ cut        ┆ color ┆ carat_sum ┆ depth_sum ┆ ... ┆ table_sum ┆ price_sum ┆ x_sum     ┆ y_sum     │
│ ---        ┆ ---   ┆ ---       ┆ ---       ┆     ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ str        ┆ str   ┆ f64       ┆ f64       ┆     ┆ f64       ┆ i64       ┆ f64       ┆ f64       │
╞════════════╪═══════╪═══════════╪═══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ Ideal      ┆ H     ┆ 2490.52   ┆ 1.922989e ┆ ... ┆ 1.743336e ┆ 12115278  ┆ 1.785324e ┆ 1.788149e │
│            ┆       ┆           ┆ 5         ┆     ┆ 5         ┆           ┆ 4         ┆ 4         │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Very Good  ┆ J     ┆ 768.32    ┆ 4.19696e4 ┆ ... ┆ 3.95123e4 ┆ 3460182   ┆ 4380.41   ┆ 4403.66   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Ideal      ┆ G     ┆ 3422.29   ┆ 3.013436e ┆ ... ┆ 2.730272e ┆ 18171930  ┆ 2.691677e ┆ 2.697925e │
│            ┆       ┆           ┆ 5         ┆     ┆ 5         ┆           ┆ 4         ┆ 4         │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Ideal      ┆ D     ┆ 1603.38   ┆ 1.747965e ┆ ... ┆ 1.586066e ┆ 7450854   ┆ 1.469912e ┆ 1.47261e4 │
│            ┆       ┆           ┆ 5         ┆     ┆ 5         ┆           ┆ 4         ┆           │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ...        ┆ ...   ┆ ...       ┆ ...       ┆ ... ┆ ...       ┆ ...       ┆ ...       ┆ ...       │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair       ┆ D     ┆ 149.98    ┆ 1.04399e4 ┆ ... ┆ 9612      ┆ 699443    ┆ 980.99    ┆ 972       │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Very Good  ┆ E     ┆ 1623.16   ┆ 1.481526e ┆ ... ┆ 1.392933e ┆ 7715165   ┆ 1.303792e ┆ 1.311171e │
│            ┆       ┆           ┆ 5         ┆     ┆ 5         ┆           ┆ 4         ┆ 4         │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Ideal      ┆ J     ┆ 952.98    ┆ 5.53925e4 ┆ ... ┆ 5.01873e4 ┆ 4406695   ┆ 5662.76   ┆ 5673.56   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair       ┆ E     ┆ 191.88    ┆ 1.41836e4 ┆ ... ┆ 1.32977e4 ┆ 824838    ┆ 1323.63   ┆ 1312.24   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Premium    ┆ F     ┆ 1927.82   ┆ 1.42797e5 ┆ ... ┆ 1.367814e ┆ 10081319  ┆ 1.369857e ┆ 1.362389e │
│            ┆       ┆           ┆           ┆     ┆ 5         ┆           ┆ 4         ┆ 4         │
╰────────────┴───────┴───────────┴───────────┴─────┴───────────┴───────────┴───────────┴───────────╯
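
To make it clearer what the grouped sum above computes, here is a minimal plain-Python sketch. The rows are made up for illustration (they are not taken from the diamonds dataset) and groupby_sum is a hypothetical helper, not a PyPolars function. It only mirrors the idea: collect rows that share the same (cut, color) pair and add up each numeric column, naming the result "<column>_sum" as in the output above.

```python
from collections import defaultdict

# Made-up rows for illustration only (not from the diamonds dataset).
rows = [
    {"cut": "Ideal", "color": "E", "price": 326, "table": 55},
    {"cut": "Ideal", "color": "E", "price": 340, "table": 57},
    {"cut": "Premium", "color": "E", "price": 326, "table": 61},
    {"cut": "Good", "color": "J", "price": 335, "table": 58},
    {"cut": "Premium", "color": "E", "price": 400, "table": 59},
]


def groupby_sum(rows, keys, value_cols):
    """Sum value_cols within each distinct combination of keys.

    Hypothetical helper: a sketch of the idea, not how PyPolars does it.
    """
    totals = defaultdict(lambda: dict.fromkeys(value_cols, 0))
    for row in rows:
        group = tuple(row[k] for k in keys)  # e.g. ("Ideal", "E")
        for col in value_cols:
            totals[group][col] += row[col]
    # Mirror the "<column>_sum" naming seen in the PyPolars output.
    return {
        group: {f"{col}_sum": total for col, total in sums.items()}
        for group, sums in totals.items()
    }


result = groupby_sum(rows, ["cut", "color"], ["price", "table"])
for group, sums in result.items():
    print(group, sums)
```

PyPolars does this work in Rust and on whole columns at once, so the sketch is only meant to show the shape of the result: one row per group and one summed column per numeric input column.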

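Plotting with PyPolars

  • PyPolars has no direct connection to Matplotlib
    • A Pandas-style call such as df['cut'].value_counts().plot(kind='bar') does not work
  • The alternative is to convert the result with .to_pandas() and plot from there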

In [91]:

# Load Data Vis Pkgs
import matplotlib.pyplot as plt
import seaborn as sns

In [92]:

# Load EDA Pkgs
# pip install py-polars
import pypolars as pl

In [95]:

# Create A Series
x = pl.Series('X',['A','B','C'])
y = pl.Series('Y',[3,5,4])

In [96]:

dir(x)

Out[96]:

['__add__',
 '__array_ufunc__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__invert__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmul__',
 '__rsub__',
 '__rtruediv__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__weakref__',
 '_from_pyseries',
 '_s',
 'append',
 'apply',
 'arg_true',
 'arg_unique',
 'argsort',
 'as_duration',
 'cast',
 'chunk_lengths',
 'clone',
 'datetime_str_fmt',
 'day',
 'drop_nulls',
 'dtype',
 'explode',
 'fill_none',
 'filter',
 'head',
 'hour',
 'inner',
 'is_duplicated',
 'is_float',
 'is_not_null',
 'is_null',
 'is_numeric',
 'is_unique',
 'len',
 'limit',
 'max',
 'mean',
 'min',
 'minute',
 'month',
 'n_chunks',
 'name',
 'nanosecond',
 'null_count',
 'ordinal_day',
 'parse_date',
 'rechunk',
 'rename',
 'rolling_max',
 'rolling_mean',
 'rolling_min',
 'rolling_sum',
 'sample',
 'second',
 'series_equal',
 'set',
 'set_at_idx',
 'shift',
 'slice',
 'sort',
 'std',
 'str_contains',
 'str_lengths',
 'str_parse_date',
 'str_replace',
 'str_replace_all',
 'str_to_lowercase',
 'str_to_uppercase',
 'sum',
 'tail',
 'take',
 'take_every',
 'to_dummies',
 'to_list',
 'to_numpy',
 'unique',
 'value_counts',
 'var',
 'view',
 'year',
 'zip_with']

In [97]:

# Bar Chart Using Matplotlib
plt.bar(x,y)

Out[97]:

<BarContainer object of 3 artists>

In [98]:

# Load Dataset
df = pl.read_csv("data/diamonds.csv")

In [99]:

df.head()

Out[99]:

shape: (5, 10)
╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut     ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---     ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str     ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal   ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ Premium ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Good    ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.29  ┆ Premium ┆ I     ┆ VS2     ┆ ... ┆ 62.4  ┆ 58    ┆ 334   ┆ 4.2  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.31  ┆ Good    ┆ J     ┆ SI2     ┆ ... ┆ 63.3  ┆ 58    ┆ 335   ┆ 4.34 │
╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯

In [100]:

df['cut'].value_counts()

Out[100]:

shape: (5, 2)
╭───────────┬────────╮
│ cut       ┆ counts │
│ ---       ┆ ---    │
│ str       ┆ u32    │
╞═══════════╪════════╡
│ Ideal     ┆ 21551  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 13791  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Very Good ┆ 12082  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Good      ┆ 4906   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 1610   │
╰───────────┴────────╯

In [101]:

# type
type(df['cut'].value_counts())

Out[101]:

pypolars.frame.DataFrame

In [102]:

# In Pandas
df['cut'].value_counts().plot(kind='bar')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pypolars/frame.py in __getattr__(self, item)
    260         try:
--> 261             return wrap_s(self._df.column(item))
    262         except RuntimeError:

RuntimeError: Any(NotFound("plot"))

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-102-6c9e7394c03c> in <module>
      1 # In Pandas
----> 2 df['cut'].value_counts().plot(kind='bar')

/usr/local/lib/python3.7/dist-packages/pypolars/frame.py in __getattr__(self, item)
    261             return wrap_s(self._df.column(item))
    262         except RuntimeError:
--> 263             raise AttributeError(f"{item} not found")
    264 
    265     def __iter__(self):

AttributeError: plot not found
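
The error above shows that the PyPolars DataFrame has no plot attribute. One way to deal with this is a small wrapper that uses .plot when it exists and otherwise converts with .to_pandas() first. The sketch below assumes only what the traceback and the tutorial show: that a missing attribute raises AttributeError and that .to_pandas() returns a Pandas object. The name bar_plot is hypothetical and is not part of PyPolars or Pandas.

```python
def bar_plot(frame, **plot_kwargs):
    """Draw a bar chart from a Pandas-like or PyPolars-like frame.

    Hypothetical helper for illustration; not part of PyPolars or Pandas.
    """
    # getattr with a default swallows the AttributeError seen above.
    plot = getattr(frame, "plot", None)
    if not callable(plot):
        # No usable .plot, so convert to Pandas first.
        to_pandas = getattr(frame, "to_pandas", None)
        if not callable(to_pandas):
            raise TypeError(
                f"{type(frame).__name__} has neither .plot nor .to_pandas()"
            )
        plot = to_pandas().plot
    return plot(kind="bar", **plot_kwargs)
```

With the dataset loaded as above, a call might look like bar_plot(df['cut'].value_counts()), and the same call should also accept a Pandas DataFrame directly.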

In [103]:

# Method 1: Pass the value_counts columns to Matplotlib separately
vc = df['cut'].value_counts()

In [104]:

vc.head()

Out[104]:

shape: (5, 2)
╭───────────┬────────╮
│ cut       ┆ counts │
│ ---       ┆ ---    │
│ str       ┆ u32    │
╞═══════════╪════════╡
│ Ideal     ┆ 21551  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 13791  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Very Good ┆ 12082  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Good      ┆ 4906   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 1610   │
╰───────────┴────────╯

In [105]:

plt.bar(vc['cut'],vc['counts'])

Out[105]:

<BarContainer object of 5 artists>
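
The bar chart above has no axis labels or title. Below is a small sketch of how they could be added with Matplotlib's Axes methods. To keep it runnable without PyPolars, the values are copied by hand from the value_counts() output above; with PyPolars they could presumably come from vc['cut'].to_list() and vc['counts'].to_list(), since to_list appears in the dir() listing earlier. The title text and figure size are my own choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Copied from the value_counts() output above.
cuts = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
counts = [21551, 13791, 12082, 4906, 1610]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(cuts, counts)

# Label the axes after the two columns of the value_counts() result.
ax.set_xlabel("cut")
ax.set_ylabel("counts")
ax.set_title("Number of diamonds per cut")
fig.tight_layout()
```

In a notebook the figure shows up on its own once the inline backend is active, so the matplotlib.use("Agg") line can be dropped there.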

In [107]:

# Method 2: Indirect via to_pandas()
df['cut'].value_counts().to_pandas().plot(kind='bar')

Out[107]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f93a76f7450>

In [108]:

# Using Seaborn
df

Out[108]:

shape: (53940, 10)
╭───────┬───────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut       ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---       ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str       ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═══════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal     ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ Premium   ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Good      ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.29  ┆ Premium   ┆ I     ┆ VS2     ┆ ... ┆ 62.4  ┆ 58    ┆ 334   ┆ 4.2  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...   ┆ ...       ┆ ...   ┆ ...     ┆ ... ┆ ...   ┆ ...   ┆ ...   ┆ ...  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.72  ┆ Ideal     ┆ D     ┆ SI1     ┆ ... ┆ 60.8  ┆ 57    ┆ 2757  ┆ 5.75 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.72  ┆ Good      ┆ D     ┆ SI1     ┆ ... ┆ 63.1  ┆ 55    ┆ 2757  ┆ 5.69 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.7   ┆ Very Good ┆ D     ┆ SI1     ┆ ... ┆ 62.8  ┆ 60    ┆ 2757  ┆ 5.66 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.86  ┆ Premium   ┆ H     ┆ SI2     ┆ ... ┆ 61    ┆ 58    ┆ 2757  ┆ 6.15 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.75  ┆ Ideal     ┆ D     ┆ SI2     ┆ ... ┆ 62.2  ┆ 55    ┆ 2757  ┆ 5.83 │
╰───────┴───────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯

In [109]:

import seaborn as sns

In [112]:

# Seaborn works with Pandas DataFrames,
# so convert with .to_pandas() first
sns.displot(data=df.to_pandas(),x='price',hue='cut')

Out[112]:

<seaborn.axisgrid.FacetGrid at 0x7f93a764959

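As we have seen, PyPolars offers a Pandas-like API for grouping and aggregating data, and where a feature such as plotting is missing you can fall back on .to_pandas(). Cool right?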

You can check out the video tutorial below

Thanks For Your Time

Jesus Saves
