Pandas is one of the most widely used libraries for data analysis. It is so popular and useful that it has become the de facto DataFrame library for data science in Python.
However, there are times when your dataset is too big for Pandas yet too small to justify PySpark. This is where PyPolars comes into play.
By the end of this tutorial you will understand:
- What PyPolars is
- When to use it
- How to install it
- How to use it for exploratory data analysis (EDA)
Let us start.
What is PyPolars?
PyPolars is a Python library for exploratory data analysis (EDA). It gives Python access to Polars, a popular DataFrame library written in Rust. (The name is a play on Pandas: a polar bear next to a panda bear, hence Polars vs Pandas.)
PyPolars is quite easy to pick up because its API is similar to that of Pandas. Moreover, if you need a feature or function that is not yet available, you can call the .to_pandas() method and let Pandas do the work behind the scenes. Cool, right?
Let us explore this wonderful Python library.
How to Install PyPolars
PyPolars is available on PyPI, so you can install it with pip:
pip install py-polars
or with pip3:
pip3 install py-polars
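To confirm that the install worked before opening a notebook, a small check like the one below can help. It is only a sketch that uses the standard library; the helper name is_installed is made up for illustration, and the module name pypolars is taken from the import statement used later in this tutorial.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "pypolars" is the import name used throughout this tutorial
    print("pypolars installed:", is_installed("pypolars"))
```

If this prints False, re-run the pip command with the same Python interpreter that your notebook uses.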
PyPolars For Data Analysis
As mentioned earlier, PyPolars has a similar API to that of Pandas, so you can read in any CSV file using the .read_csv() function and start from there. You can get the full code here.
To begin, let us import it and load a dataset:
import pypolars as pl
df = pl.read_csv("data/mydataset.csv")
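Conceptually, reading a CSV file turns lines of text into named columns. The sketch below illustrates that idea with the standard library only; it is not how PyPolars is implemented. The sample rows are copied from the diamonds output shown later in this tutorial, and the helper name read_columns is an assumption made for this example.

```python
import csv
import io

# A few rows copied from the diamonds output shown in this tutorial
SAMPLE_CSV = """carat,cut,price
0.23,Ideal,326
0.21,Premium,326
0.23,Good,327
0.29,Premium,334
0.31,Good,335
"""


def read_columns(text: str) -> dict:
    """Parse CSV text into a dict that maps column name -> list of strings."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


columns = read_columns(SAMPLE_CSV)
shape = (len(columns["carat"]), len(columns))  # (rows, columns), like df.shape
print(shape)
print(columns["cut"])
```

Unlike this sketch, a DataFrame library also infers a datatype for each column, which is why the outputs below report types such as f64, str and i64.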
Exploratory Data Analysis with PyPolars
- PyPolars can be used for reading CSV files
- Selection of rows and columns for manipulation
- Groupby, join, etc.
- It does not have a direct link to matplotlib; however, you can use .to_pandas() to work around that limitation.
In [1]:
# Load EDA Pkgs
import pypolars as pl
In [2]:
# Methods/Attrib
dir(pl)
Out[2]:
['BinaryIO', 'Boolean', 'Callable', 'DTYPE_TO_FFINAME', 'DataFrame', 'Date32', 'Date64', 'Dict', 'DurationMicrosecond', 'DurationMillisecond', 'DurationNanosecond', 'DurationSecond', 'Expr', 'Float32', 'Float64', 'Int16', 'Int32', 'Int64', 'Int8', 'LazyFrame', 'LazyGroupBy', 'List', 'Object', 'Optional', 'PyExpr', 'PyLazyFrame', 'PyLazyGroupBy', 'Series', 'TextIO', 'Time32Millisecond', 'Time32Second', 'Time64Microsecond', 'Time64Nanosecond', 'TimestampMicrosecond', 'TimestampMillisecond', 'TimestampNanosecond', 'TimestampSecond', 'UDF', 'UInt16', 'UInt32', 'UInt64', 'UInt8', 'Union', 'Utf8', 'When', 'WhenThen', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__pdoc__', '__spec__', '__version__', 'arg_where', 'avg', 'binary_expr', 'col', 'count', 'ctypes', 'datatypes', 'dtype_to_ctype', 'dtype_to_int', 'dtype_to_primitive', 'dtypes', 'expr_to_lit_or_expr', 'ffi', 'first', 'frame', 'from_pandas', 'functions', 'get_dummies', 'last', 'lazy', 'list', 'lit', 'max', 'mean', 'median', 'min', 'n_unique', 'np', 'os', 'pycol', 'pylit', 'pypolars', 'pytype_to_polars_type', 'pywhen', 'read_csv', 'read_ipc', 'read_parquet', 'scan_csv', 'scan_parquet', 'series', 'shutil', 'std', 'subprocess', 'sum', 'tempfile', 'udf', 'var', 'version', 'when', 'wrap_df', 'wrap_expr', 'wrap_ldf', 'wrap_s']
In [3]:
# Reading CSV
df = pl.read_csv("data/diamonds.csv")
In [4]:
# Check Type
type(df)
Out[4]:
pypolars.frame.DataFrame
In [5]:
# Head/First Rows
df.head()
Out[5]:
shape: (5, 10)
╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut     ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---     ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str     ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal   ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ Premium ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Good    ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.29  ┆ Premium ┆ I     ┆ VS2     ┆ ... ┆ 62.4  ┆ 58    ┆ 334   ┆ 4.2  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.31  ┆ Good    ┆ J     ┆ SI2     ┆ ... ┆ 63.3  ┆ 58    ┆ 335   ┆ 4.34 │
╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
In [6]:
# Last 10 Rows
df.tail(10)
Out[6]:
shape: (10, 10) ╭───────┬───────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮ │ carat ┆ cut ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ str ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ i64 ┆ f64 │ ╞═══════╪═══════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡ │ 0.71 ┆ Premium ┆ E ┆ SI1 ┆ ... ┆ 60.5 ┆ 55 ┆ 2756 ┆ 5.79 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.71 ┆ Premium ┆ F ┆ SI1 ┆ ... ┆ 59.8 ┆ 62 ┆ 2756 ┆ 5.74 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.7 ┆ Very Good ┆ E ┆ VS2 ┆ ... ┆ 60.5 ┆ 59 ┆ 2757 ┆ 5.71 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.7 ┆ Very Good ┆ E ┆ VS2 ┆ ... ┆ 61.2 ┆ 59 ┆ 2757 ┆ 5.69 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.72 ┆ Ideal ┆ D ┆ SI1 ┆ ... ┆ 60.8 ┆ 57 ┆ 2757 ┆ 5.75 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.72 ┆ Good ┆ D ┆ SI1 ┆ ... ┆ 63.1 ┆ 55 ┆ 2757 ┆ 5.69 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.7 ┆ Very Good ┆ D ┆ SI1 ┆ ... ┆ 62.8 ┆ 60 ┆ 2757 ┆ 5.66 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.86 ┆ Premium ┆ H ┆ SI2 ┆ ... ┆ 61 ┆ 58 ┆ 2757 ┆ 6.15 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.75 ┆ Ideal ┆ D ┆ SI2 ┆ ... ┆ 62.2 ┆ 55 ┆ 2757 ┆ 5.83 │ ╰───────┴───────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
In [7]:
# Check For Shape
df.shape
Out[7]:
(53940, 10)
In [8]:
# Check For Datatypes
df.dtypes
Out[8]:
[pypolars.datatypes.Float64, pypolars.datatypes.Utf8, pypolars.datatypes.Utf8, pypolars.datatypes.Utf8, pypolars.datatypes.Float64, pypolars.datatypes.Float64, pypolars.datatypes.Int64, pypolars.datatypes.Float64, pypolars.datatypes.Float64, pypolars.datatypes.Float64]
In [9]:
# Check For Column Names
df.columns
Out[9]:
['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']
In [10]:
# Method/Attrib on DF
dir(df)
Out[10]:
['__add__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '_df', '_from_pydf', '_rechunk', 'clone', 'columns', 'drop', 'drop_duplicates', 'drop_in_place', 'drop_nulls', 'dtypes', 'explode', 'fill_none', 'frame_equal', 'get_columns', 'groupby', 'head', 'height', 'hstack', 'insert_at_idx', 'is_duplicated', 'is_unique', 'join', 'lazy', 'max', 'mean', 'median', 'melt', 'min', 'n_chunks', 'pipe', 'quantile', 'read_csv', 'read_ipc', 'read_parquet', 'replace', 'replace_at_idx', 'sample', 'select_at_idx', 'shape', 'shift', 'slice', 'sort', 'std', 'sum', 'tail', 'to_csv', 'to_dummies', 'to_ipc', 'to_numpy', 'to_pandas', 'var', 'vstack', 'width']
In [11]:
# Not Working
df.describe()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pypolars/frame.py in __getattr__(self, item)
    260         try:
--> 261             return wrap_s(self._df.column(item))
    262         except RuntimeError:

RuntimeError: Any(NotFound("describe"))

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-11-f517d93c89f7> in <module>
      1 # Not Working
----> 2 df.describe()

/usr/local/lib/python3.7/dist-packages/pypolars/frame.py in __getattr__(self, item)
    261             return wrap_s(self._df.column(item))
    262         except RuntimeError:
--> 263             raise AttributeError(f"{item} not found")
    264 
    265     def __iter__(self):

AttributeError: describe not found
In [12]:
# Alternative Using Pandas
df.to_pandas().describe()
Out[12]:
| | carat | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|
count | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 |
mean | 0.797940 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
min | 0.200000 | 43.000000 | 43.000000 | 326.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.400000 | 61.000000 | 56.000000 | 950.000000 | 4.710000 | 4.720000 | 2.910000 |
50% | 0.700000 | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 |
75% | 1.040000 | 62.500000 | 59.000000 | 5324.250000 | 6.540000 | 6.540000 | 4.040000 |
max | 5.010000 | 79.000000 | 95.000000 | 18823.000000 | 10.740000 | 58.900000 | 31.800000 |
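To make the summary above less of a black box, here is a rough plain-Python sketch of a few of the statistics that describe() reports. It uses only the prices of the first five rows shown by df.head(), so the numbers differ from the full-dataset table. The function name describe is reused here purely for illustration.

```python
import statistics

# Prices of the first five diamonds shown by df.head()
prices = [326, 326, 327, 334, 335]


def describe(values):
    """Return a few of the statistics that a describe() summary reports."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = describe(prices)
for name, value in summary.items():
    print(f"{name:>5}: {value}")
```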
Selection
- By Columns By Name/Index
- By Rows
In [17]:
df.columns
Out[17]:
['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']
In [18]:
# Select Columns By Col Name
# pandas: df['carat']
df['carat']
Out[18]:
Series: 'carat' [f64] [ 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ]
In [19]:
# Select Columns By Index
df[0]
Out[19]:
Series: 'carat' [f64] [ 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ]
In [20]:
# Select Multiple Columns
df[['carat','price']]
Out[20]:
shape: (53940, 2) ╭───────┬───────╮ │ carat ┆ price │ │ --- ┆ --- │ │ f64 ┆ i64 │ ╞═══════╪═══════╡ │ 0.23 ┆ 326 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ │ 0.21 ┆ 326 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ │ 0.23 ┆ 327 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ │ 0.29 ┆ 334 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ │ ... ┆ ... │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ │ 0.72 ┆ 2757 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ │ 0.72 ┆ 2757 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ │ 0.7 ┆ 2757 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ │ 0.86 ┆ 2757 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤ │ 0.75 ┆ 2757 │ ╰───────┴───────╯
In [22]:
# Select By Index Position for columns
# same as df[0]; pandas: df.iloc[:, 0]
df.select_at_idx(0)
Out[22]:
Series: 'carat' [f64] [ 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ]
In [23]:
# Select By Rows
# Select rows 0 to 2 (the end index 3 is exclusive) and all columns
# pandas: df.iloc[0:3]
df[0:3]
Out[23]:
shape: (3, 10) ╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮ │ carat ┆ cut ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ str ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ i64 ┆ f64 │ ╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡ │ 0.23 ┆ Ideal ┆ E ┆ SI2 ┆ ... ┆ 61.5 ┆ 55 ┆ 326 ┆ 3.95 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.21 ┆ Premium ┆ E ┆ SI1 ┆ ... ┆ 59.8 ┆ 61 ┆ 326 ┆ 3.89 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.23 ┆ Good ┆ E ┆ VS1 ┆ ... ┆ 56.9 ┆ 65 ┆ 327 ┆ 4.05 │ ╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
In [24]:
# Slicing: start at row 0 and take 3 rows
# pandas: df.iloc[0:3]
df.slice(0,3)
Out[24]:
shape: (3, 10) ╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮ │ carat ┆ cut ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ str ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ i64 ┆ f64 │ ╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡ │ 0.23 ┆ Ideal ┆ E ┆ SI2 ┆ ... ┆ 61.5 ┆ 55 ┆ 326 ┆ 3.95 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.21 ┆ Premium ┆ E ┆ SI1 ┆ ... ┆ 59.8 ┆ 61 ┆ 326 ┆ 3.89 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.23 ┆ Good ┆ E ┆ VS1 ┆ ... ┆ 56.9 ┆ 65 ┆ 327 ┆ 4.05 │ ╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
In [28]:
# Select from 3 rows of a column
df[0:3,"carat"]
Out[28]:
shape: (3, 1)
╭───────╮
│ carat │
│ ---   │
│ f64   │
╞═══════╡
│ 0.23  │
├╌╌╌╌╌╌╌┤
│ 0.21  │
├╌╌╌╌╌╌╌┤
│ 0.23  │
╰───────╯
In [29]:
# Select from different rows of a column
df[[0,4,10],"carat"]
Out[29]:
shape: (3, 1)
╭───────╮
│ carat │
│ ---   │
│ f64   │
╞═══════╡
│ 0.23  │
├╌╌╌╌╌╌╌┤
│ 0.31  │
├╌╌╌╌╌╌╌┤
│ 0.3   │
╰───────╯
In [30]:
# Select from different rows of multiple columns
df[[0,4,10],["carat","cut"]]
Out[30]:
shape: (3, 2)
╭───────┬───────╮
│ carat ┆ cut   │
│ ---   ┆ ---   │
│ f64   ┆ str   │
╞═══════╪═══════╡
│ 0.23  ┆ Ideal │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.31  ┆ Good  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.3   ┆ Good  │
╰───────┴───────╯
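The selections above all come down to picking row positions from named columns. The following plain-Python sketch mimics that on a dict of lists; it is an illustration of the idea rather than the library's implementation. The helper name select is hypothetical, and the values are the first five rows of the df.head() output.

```python
# First five rows of three columns, copied from the df.head() output
table = {
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "price": [326, 326, 327, 334, 335],
}


def select(table, rows, columns):
    """Pick the given row positions from the given columns."""
    return {name: [table[name][i] for i in rows] for name in columns}


# Rows 0 and 4 of two columns, similar in spirit to df[[0, 4], ["carat", "cut"]]
subset = select(table, [0, 4], ["carat", "cut"])
print(subset)

# A slice of rows, similar in spirit to df[0:3, "carat"]
first_three = select(table, range(0, 3), ["carat"])
print(first_three)
```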
In [31]:
# Value Counts
df.head()
Out[31]:
shape: (5, 10) ╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮ │ carat ┆ cut ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ str ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ i64 ┆ f64 │ ╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡ │ 0.23 ┆ Ideal ┆ E ┆ SI2 ┆ ... ┆ 61.5 ┆ 55 ┆ 326 ┆ 3.95 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.21 ┆ Premium ┆ E ┆ SI1 ┆ ... ┆ 59.8 ┆ 61 ┆ 326 ┆ 3.89 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.23 ┆ Good ┆ E ┆ VS1 ┆ ... ┆ 56.9 ┆ 65 ┆ 327 ┆ 4.05 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.29 ┆ Premium ┆ I ┆ VS2 ┆ ... ┆ 62.4 ┆ 58 ┆ 334 ┆ 4.2 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.31 ┆ Good ┆ J ┆ SI2 ┆ ... ┆ 63.3 ┆ 58 ┆ 335 ┆ 4.34 │ ╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
In [32]:
df['cut'].value_counts()
Out[32]:
shape: (5, 2)
╭───────────┬────────╮
│ cut       ┆ counts │
│ ---       ┆ ---    │
│ str       ┆ u32    │
╞═══════════╪════════╡
│ Ideal     ┆ 21551  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 13791  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Very Good ┆ 12082  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Good      ┆ 4906   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 1610   │
╰───────────┴────────╯
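A value count is simply a tally of how often each label appears. As a minimal sketch of the same idea in plain Python, collections.Counter does the job; the labels below are the cut values of the first five rows only, so the totals are much smaller than in the output above.

```python
from collections import Counter

# Cut labels of the first five rows shown by df.head()
cuts = ["Ideal", "Premium", "Good", "Premium", "Good"]

counts = Counter(cuts)

# most_common() lists the labels from most to least frequent
for label, n in counts.most_common():
    print(label, n)
```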
In [ ]:
# No Describe
Applying Fxn and Filter
- apply fxn
- boolean index
- conditions
In [33]:
# Get Unique Values
df['cut'].unique()
Out[33]:
Series: 'cut' [str] [ "Very ..." "Good" "Fair" "Ideal..." "Premi..." ]
In [34]:
# Boolean Indexing & Condition & Find
# Find all rows/datapoints with cut as Ideal
df['cut'] == "Ideal"
Out[34]:
Series: 'cut' [bool] [ true false false false false false false false false false ]
In [35]:
# Boolean Indexing & Condition & Find
# Find all rows/datapoints with cut as Ideal
df[df['cut'] == "Ideal"]
Out[35]:
shape: (21551, 10) ╭───────┬───────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮ │ carat ┆ cut ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ str ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ i64 ┆ f64 │ ╞═══════╪═══════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡ │ 0.23 ┆ Ideal ┆ E ┆ SI2 ┆ ... ┆ 61.5 ┆ 55 ┆ 326 ┆ 3.95 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.23 ┆ Ideal ┆ J ┆ VS1 ┆ ... ┆ 62.8 ┆ 56 ┆ 340 ┆ 3.93 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.31 ┆ Ideal ┆ J ┆ SI2 ┆ ... ┆ 62.2 ┆ 54 ┆ 344 ┆ 4.35 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.3 ┆ Ideal ┆ I ┆ SI2 ┆ ... ┆ 62 ┆ 54 ┆ 348 ┆ 4.31 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.79 ┆ Ideal ┆ I ┆ SI1 ┆ ... ┆ 61.6 ┆ 56 ┆ 2756 ┆ 5.95 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.71 ┆ Ideal ┆ E ┆ SI1 ┆ ... ┆ 61.9 ┆ 56 ┆ 2756 ┆ 5.71 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.71 ┆ Ideal ┆ G ┆ VS1 ┆ ... ┆ 61.4 ┆ 56 ┆ 2756 ┆ 5.76 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.72 ┆ Ideal ┆ D ┆ SI1 ┆ ... ┆ 60.8 ┆ 57 ┆ 2757 ┆ 5.75 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.75 ┆ Ideal ┆ D ┆ SI2 ┆ ... ┆ 62.2 ┆ 55 ┆ 2757 ┆ 5.83 │ ╰───────┴───────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
In [37]:
# Check For Price More than A Value (300)
df[df['price'] > 300]
Out[37]:
shape: (53940, 10) ╭───────┬───────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮ │ carat ┆ cut ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ str ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ i64 ┆ f64 │ ╞═══════╪═══════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡ │ 0.23 ┆ Ideal ┆ E ┆ SI2 ┆ ... ┆ 61.5 ┆ 55 ┆ 326 ┆ 3.95 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.21 ┆ Premium ┆ E ┆ SI1 ┆ ... ┆ 59.8 ┆ 61 ┆ 326 ┆ 3.89 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.23 ┆ Good ┆ E ┆ VS1 ┆ ... ┆ 56.9 ┆ 65 ┆ 327 ┆ 4.05 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.29 ┆ Premium ┆ I ┆ VS2 ┆ ... ┆ 62.4 ┆ 58 ┆ 334 ┆ 4.2 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.72 ┆ Ideal ┆ D ┆ SI1 ┆ ... ┆ 60.8 ┆ 57 ┆ 2757 ┆ 5.75 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.72 ┆ Good ┆ D ┆ SI1 ┆ ... ┆ 63.1 ┆ 55 ┆ 2757 ┆ 5.69 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.7 ┆ Very Good ┆ D ┆ SI1 ┆ ... ┆ 62.8 ┆ 60 ┆ 2757 ┆ 5.66 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.86 ┆ Premium ┆ H ┆ SI2 ┆ ... ┆ 61 ┆ 58 ┆ 2757 ┆ 6.15 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.75 ┆ Ideal ┆ D ┆ SI2 ┆ ... ┆ 62.2 ┆ 55 ┆ 2757 ┆ 5.83 │ ╰───────┴───────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
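Boolean indexing works in two steps: a comparison produces one True/False value per row, and that mask then decides which rows are kept. The sketch below reproduces both steps in plain Python on the first five rows; filter_rows is a hypothetical helper, not a PyPolars function.

```python
# First five rows of three columns, copied from the df.head() output
table = {
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "price": [326, 326, 327, 334, 335],
}


def filter_rows(table, mask):
    """Keep only the rows whose mask entry is True."""
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in table.items()
    }


# Step 1: build the masks (one boolean per row)
is_ideal = [cut == "Ideal" for cut in table["cut"]]
above_330 = [price > 330 for price in table["price"]]

# Step 2: apply them
ideal_rows = filter_rows(table, is_ideal)
pricey_rows = filter_rows(table, above_330)
print(ideal_rows)
print(pricey_rows)
```

Note that the condition price > 300 used above keeps all 53940 rows, because the minimum price in the dataset is 326.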
In [ ]:
#### Apply A Fxn on A DataFrame or Series
+ .apply()
+ .lazy()
+ filter
+ .to_pandas()
In [40]:
df.head()
Out[40]:
shape: (5, 10) ╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮ │ carat ┆ cut ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ str ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ i64 ┆ f64 │ ╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡ │ 0.23 ┆ Ideal ┆ E ┆ SI2 ┆ ... ┆ 61.5 ┆ 55 ┆ 326 ┆ 3.95 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.21 ┆ Premium ┆ E ┆ SI1 ┆ ... ┆ 59.8 ┆ 61 ┆ 326 ┆ 3.89 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.23 ┆ Good ┆ E ┆ VS1 ┆ ... ┆ 56.9 ┆ 65 ┆ 327 ┆ 4.05 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.29 ┆ Premium ┆ I ┆ VS2 ┆ ... ┆ 62.4 ┆ 58 ┆ 334 ┆ 4.2 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.31 ┆ Good ┆ J ┆ SI2 ┆ ... ┆ 63.3 ┆ 58 ┆ 335 ┆ 4.34 │ ╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
In [41]:
# Method 1
df['x'].apply(round)
Out[41]:
Series: 'x' [i64] [ 4 4 4 4 4 4 4 4 4 4 ]
In [42]:
# Method 2 Using Lambda
df['x'].apply(lambda x : round(x))
Out[42]:
Series: 'x' [i64] [ 4 4 4 4 4 4 4 4 4 4 ]
In [44]:
# Method 3 : Using Custom Fxn
df['cut'].unique()
Out[44]:
Series: 'cut' [str] [ "Very ..." "Premi..." "Ideal..." "Good" "Fair" ]
In [45]:
# Convert to A Dictionary For Map
mydict = {v:k for k,v in enumerate(df['cut'].unique())}
In [46]:
mydict
Out[46]:
{'Ideal': 0, 'Very Good': 1, 'Good': 2, 'Fair': 3, 'Premium': 4}
In [56]:
# Won't work: a dict is not callable
df['cut'].apply(mydict)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-56-a91f1064400e> in <module>
----> 1 df['cut'].apply(mydict)

/usr/local/lib/python3.7/dist-packages/pypolars/series.py in apply(self, func, dtype_out, sniff_dtype)
    898                 dtype_out = Boolean
    899 
--> 900         return wrap_s(self._s.apply_lambda(func, dtype_out))
    901 
    902     def shift(self, periods: int) -> "Series":

TypeError: 'dict' object is not callable
In [57]:
def mapcut_values(x):
    # Look up the integer code for a cut label; returns None for an unknown label
    return mydict.get(x)
In [58]:
mapcut_values("Ideal")
Out[58]:
0
In [59]:
df['cut'].apply(lambda x :mapcut_values(x))
Out[59]:
Series: 'cut' [i64] [ 0 4 2 4 2 1 1 1 3 1 ]
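The TypeError above comes from Python itself: apply expects something it can call, and a dict cannot be called. The plain-Python sketch below shows the difference between the dict and its bound methods, which are callable. Whether passing mydict.get straight to the PyPolars apply behaves the same way has not been checked here, so treat that as an assumption to test.

```python
# Mapping copied from the mydict output above
mydict = {"Ideal": 0, "Very Good": 1, "Good": 2, "Fair": 3, "Premium": 4}

# Cut labels of the first five rows shown by df.head()
cuts = ["Ideal", "Premium", "Good", "Premium", "Good"]

# Calling the dict itself fails, which is what the traceback above reports
try:
    mydict("Ideal")
    dict_is_callable = True
except TypeError:
    dict_is_callable = False

# The dict's bound methods are ordinary callables
codes = [mydict.get(label) for label in cuts]        # None for unknown labels
codes_via_map = list(map(mydict.__getitem__, cuts))  # KeyError for unknown labels

print(dict_is_callable)
print(codes)
```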
In [62]:
# Method 4
# lazy API
import pypolars.lazy as plazy
In [63]:
# Methods/
dir(plazy)
Out[63]:
['Callable', 'DataFrame', 'Dict', 'Expr', 'LazyFrame', 'LazyGroupBy', 'List', 'Optional', 'PyExpr', 'PyLazyFrame', 'PyLazyGroupBy', 'Series', 'UDF', 'Union', 'When', 'WhenThen', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_selection_to_pyexpr_list', 'avg', 'binary_expr', 'col', 'count', 'datatypes', 'expr_to_lit_or_expr', 'first', 'last', 'list', 'lit', 'max', 'mean', 'median', 'min', 'n_unique', 'os', 'pycol', 'pylit', 'pywhen', 'shutil', 'std', 'subprocess', 'sum', 'tempfile', 'udf', 'var', 'when', 'wrap_df', 'wrap_expr', 'wrap_ldf']
In [64]:
# Define your fxn
def cust_mapcut_values(s: plazy.Series) -> plazy.Series:
    return s.apply(lambda x: mydict[x])
In [65]:
df.lazy().with_column(plazy.col('cut').map(cust_mapcut_values))
Out[65]:
<pypolars.lazy.LazyFrame at 0x7f93c05666d0>
In [66]:
output = df.lazy().with_column(plazy.col('cut').map(cust_mapcut_values))
In [67]:
output.collect()
Out[67]:
shape: (53940, 10) ╭───────┬─────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮ │ carat ┆ cut ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ i64 ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ i64 ┆ f64 │ ╞═══════╪═════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡ │ 0.23 ┆ 0 ┆ E ┆ SI2 ┆ ... ┆ 61.5 ┆ 55 ┆ 326 ┆ 3.95 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.21 ┆ 4 ┆ E ┆ SI1 ┆ ... ┆ 59.8 ┆ 61 ┆ 326 ┆ 3.89 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.23 ┆ 2 ┆ E ┆ VS1 ┆ ... ┆ 56.9 ┆ 65 ┆ 327 ┆ 4.05 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.29 ┆ 4 ┆ I ┆ VS2 ┆ ... ┆ 62.4 ┆ 58 ┆ 334 ┆ 4.2 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.72 ┆ 0 ┆ D ┆ SI1 ┆ ... ┆ 60.8 ┆ 57 ┆ 2757 ┆ 5.75 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.72 ┆ 2 ┆ D ┆ SI1 ┆ ... ┆ 63.1 ┆ 55 ┆ 2757 ┆ 5.69 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.7 ┆ 1 ┆ D ┆ SI1 ┆ ... ┆ 62.8 ┆ 60 ┆ 2757 ┆ 5.66 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.86 ┆ 4 ┆ H ┆ SI2 ┆ ... ┆ 61 ┆ 58 ┆ 2757 ┆ 6.15 │ ├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤ │ 0.75 ┆ 0 ┆ D ┆ SI2 ┆ ... ┆ 62.2 ┆ 55 ┆ 2757 ┆ 5.83 │ ╰───────┴─────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
In [68]:
type(output)
Out[68]:
pypolars.lazy.LazyFrame
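The cells above show the key property of the lazy API: building the query returns a LazyFrame immediately, and nothing is computed until collect() is called. The toy class below is a sketch of that behaviour in plain Python. LazyColumn and to_code are invented names for this illustration, and the class models only the deferred execution, not anything else a real lazy engine may do.

```python
class LazyColumn:
    """Toy stand-in for a lazy query: it records steps and runs them on collect()."""

    def __init__(self, values, steps=()):
        self._values = list(values)
        self._steps = tuple(steps)

    def map(self, func):
        # Nothing is computed here; the function is only recorded
        return LazyColumn(self._values, self._steps + (func,))

    def collect(self):
        # Run every recorded step, in order, over the stored values
        result = self._values
        for func in self._steps:
            result = [func(value) for value in result]
        return result


mydict = {"Ideal": 0, "Very Good": 1, "Good": 2, "Fair": 3, "Premium": 4}
calls = []  # records every label the mapping function actually sees


def to_code(label):
    calls.append(label)
    return mydict[label]


query = LazyColumn(["Ideal", "Premium", "Good"]).map(to_code)
calls_before_collect = len(calls)  # the function has not run yet

result = query.collect()
print(calls_before_collect, result)
```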
In [69]:
# Method 5: using Pandas
df2 = df.to_pandas()
In [70]:
type(df2)
Out[70]:
pandas.core.frame.DataFrame
In [71]:
df2['cut'].map(mydict)
Out[71]:
0        0
1        4
2        2
3        4
4        2
        ..
53935    0
53936    2
53937    1
53938    4
53939    0
Name: cut, Length: 53940, dtype: int64
In [72]:
### Groupby Using Pypolars
In [78]:
# pandas: df.groupby('cut')['price'].sum()
df.groupby('cut').select('price').sum()
Out[78]:
shape: (5, 2)
╭───────────┬───────────╮
│ cut       ┆ price_sum │
│ ---       ┆ ---       │
│ str       ┆ i64       │
╞═══════════╪═══════════╡
│ Ideal     ┆ 74513487  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 63221498  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 7017600   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Good      ┆ 19275009  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Very Good ┆ 48107623  │
╰───────────┴───────────╯
In [80]:
df.groupby('cut').select('price').count()
Out[80]:
shape: (5, 2)
╭───────────┬─────────────╮
│ cut       ┆ price_count │
│ ---       ┆ ---         │
│ str       ┆ u32         │
╞═══════════╪═════════════╡
│ Ideal     ┆ 21551       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 13791       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 1610        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Very Good ┆ 12082       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Good      ┆ 4906        │
╰───────────┴─────────────╯
In [81]:
# Selecting Multiple Columns From Groupby
df.groupby('cut').select(['price','carat']).sum()
Out[81]:
shape: (5, 3)
╭───────────┬───────────┬────────────╮
│ cut       ┆ price_sum ┆ carat_sum  │
│ ---       ┆ ---       ┆ ---        │
│ str       ┆ i64       ┆ f64        │
╞═══════════╪═══════════╪════════════╡
│ Very Good ┆ 48107623  ┆ 9742.7     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 7017600   ┆ 1684.28    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 63221498  ┆ 1.230095e4 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Ideal     ┆ 74513487  ┆ 1.514684e4 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Good      ┆ 19275009  ┆ 4166.1     │
╰───────────┴───────────┴────────────╯
In [82]:
# Multiple Groupby + Selecting Multiple Columns From Groupby + Sum
df.groupby(['cut','color']).select(['price','carat']).sum()
Out[82]:
shape: (35, 4) ╭───────────┬───────┬───────────┬───────────╮ │ cut ┆ color ┆ price_sum ┆ carat_sum │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 ┆ f64 │ ╞═══════════╪═══════╪═══════════╪═══════════╡ │ Ideal ┆ D ┆ 7450854 ┆ 1603.38 │ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Fair ┆ G ┆ 1331126 ┆ 321.48 │ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Fair ┆ E ┆ 824838 ┆ 191.88 │ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Fair ┆ F ┆ 1194025 ┆ 282.27 │ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ ... ┆ ... ┆ ... ┆ ... │ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Premium ┆ G ┆ 13160170 ┆ 2460.51 │ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Very Good ┆ E ┆ 7715165 ┆ 1623.16 │ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Fair ┆ J ┆ 592103 ┆ 159.6 │ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Very Good ┆ G ┆ 8903461 ┆ 1762.87 │ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Ideal ┆ I ┆ 9317974 ┆ 1910.97 │ ╰───────────┴───────┴───────────┴───────────╯
In [89]:
# Select Every Column + Fxn
df.groupby(['cut','color']).select_all().first()
Out[89]:
shape: (35, 10) ╭────────────┬───────┬────────────┬────────────┬─────┬───────────┬───────────┬───────────┬─────────╮ │ cut ┆ color ┆ carat_firs ┆ clarity_fi ┆ ... ┆ depth_fir ┆ table_fir ┆ price_fir ┆ x_first │ │ --- ┆ --- ┆ t ┆ rst ┆ ┆ st ┆ st ┆ st ┆ --- │ │ str ┆ str ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ f64 │ │ ┆ ┆ f64 ┆ str ┆ ┆ f64 ┆ f64 ┆ i64 ┆ │ ╞════════════╪═══════╪════════════╪════════════╪═════╪═══════════╪═══════════╪═══════════╪═════════╡ │ Very Good ┆ F ┆ 0.23 ┆ VS1 ┆ ... ┆ 60.9 ┆ 57 ┆ 357 ┆ 3.96 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ │ Very Good ┆ J ┆ 0.24 ┆ VVS2 ┆ ... ┆ 62.8 ┆ 57 ┆ 336 ┆ 3.94 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ │ Fair ┆ J ┆ 1.05 ┆ SI2 ┆ ... ┆ 65.8 ┆ 59 ┆ 2789 ┆ 6.41 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ │ Fair ┆ E ┆ 0.22 ┆ VS2 ┆ ... ┆ 65.1 ┆ 61 ┆ 337 ┆ 3.87 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ │ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ │ Ideal ┆ F ┆ 0.81 ┆ SI2 ┆ ... ┆ 58.8 ┆ 57 ┆ 2761 ┆ 6.14 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ │ Ideal ┆ D ┆ 0.3 ┆ SI1 ┆ ... ┆ 62.5 ┆ 57 ┆ 552 ┆ 4.29 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ │ Very Good ┆ D ┆ 0.23 ┆ VS2 ┆ ... ┆ 60.5 ┆ 61 ┆ 357 ┆ 3.96 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ │ Premium ┆ E ┆ 0.21 ┆ SI1 ┆ ... ┆ 59.8 ┆ 61 ┆ 326 ┆ 3.89 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ │ Premium ┆ J ┆ 0.3 ┆ SI2 ┆ ... 
┆ 59.3 ┆ 61 ┆ 405 ┆ 4.43 │ ╰────────────┴───────┴────────────┴────────────┴─────┴───────────┴───────────┴───────────┴─────────╯
In [90]:
# Select Every Column + Fxn
df.groupby(['cut','color']).select_all().sum()
Out[90]:
shape: (35, 9) ╭────────────┬───────┬───────────┬───────────┬─────┬───────────┬───────────┬───────────┬───────────╮ │ cut ┆ color ┆ carat_sum ┆ depth_sum ┆ ... ┆ table_sum ┆ price_sum ┆ x_sum ┆ y_sum │ │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ f64 ┆ f64 ┆ ┆ f64 ┆ i64 ┆ f64 ┆ f64 │ ╞════════════╪═══════╪═══════════╪═══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡ │ Ideal ┆ H ┆ 2490.52 ┆ 1.922989e ┆ ... ┆ 1.743336e ┆ 12115278 ┆ 1.785324e ┆ 1.788149e │ │ ┆ ┆ ┆ 5 ┆ ┆ 5 ┆ ┆ 4 ┆ 4 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Very Good ┆ J ┆ 768.32 ┆ 4.19696e4 ┆ ... ┆ 3.95123e4 ┆ 3460182 ┆ 4380.41 ┆ 4403.66 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Ideal ┆ G ┆ 3422.29 ┆ 3.013436e ┆ ... ┆ 2.730272e ┆ 18171930 ┆ 2.691677e ┆ 2.697925e │ │ ┆ ┆ ┆ 5 ┆ ┆ 5 ┆ ┆ 4 ┆ 4 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Ideal ┆ D ┆ 1603.38 ┆ 1.747965e ┆ ... ┆ 1.586066e ┆ 7450854 ┆ 1.469912e ┆ 1.47261e4 │ │ ┆ ┆ ┆ 5 ┆ ┆ 5 ┆ ┆ 4 ┆ │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Fair ┆ D ┆ 149.98 ┆ 1.04399e4 ┆ ... ┆ 9612 ┆ 699443 ┆ 980.99 ┆ 972 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Very Good ┆ E ┆ 1623.16 ┆ 1.481526e ┆ ... ┆ 1.392933e ┆ 7715165 ┆ 1.303792e ┆ 1.311171e │ │ ┆ ┆ ┆ 5 ┆ ┆ 5 ┆ ┆ 4 ┆ 4 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Ideal ┆ J ┆ 952.98 ┆ 5.53925e4 ┆ ... 
┆ 5.01873e4 ┆ 4406695 ┆ 5662.76 ┆ 5673.56 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Fair ┆ E ┆ 191.88 ┆ 1.41836e4 ┆ ... ┆ 1.32977e4 ┆ 824838 ┆ 1323.63 ┆ 1312.24 │ ├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤ │ Premium ┆ F ┆ 1927.82 ┆ 1.42797e5 ┆ ... ┆ 1.367814e ┆ 10081319 ┆ 1.369857e ┆ 1.362389e │ │ ┆ ┆ ┆ ┆ ┆ 5 ┆ ┆ 4 ┆ 4 │ ╰────────────┴───────┴───────────┴───────────┴─────┴───────────┴───────────┴───────────┴───────────╯
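A groupby followed by sum() collects the rows that share a key and adds up a value for each group. Here is a minimal plain-Python sketch of that, using the cut and price of the first five rows; groupby_sum is a hypothetical helper written for this example. Notice also that the group order in the outputs above changes from cell to cell, so it is safer not to rely on it.

```python
from collections import defaultdict

# (cut, price) pairs of the first five rows shown by df.head()
rows = [
    ("Ideal", 326),
    ("Premium", 326),
    ("Good", 327),
    ("Premium", 334),
    ("Good", 335),
]


def groupby_sum(pairs):
    """Sum the values of all pairs that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


price_sum = groupby_sum(rows)
print(price_sum)
```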
Plotting with PyPolars
- No direct connection to Matplotlib
- df[...].plot(kind='bar') is not available
- The alternative is to use .to_pandas()
In [91]:
# Load Data Vis Pkgs
import matplotlib.pyplot as plt
import seaborn as sns
In [92]:
# Load EDA Pkgs
# pip install py-polars
import pypolars as pl
In [95]:
# Create A Series
x = pl.Series('X',['A','B','C'])
y = pl.Series('Y',[3,5,4])
In [96]:
dir(x)
Out[96]:
['__add__', '__array_ufunc__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__invert__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmul__', '__rsub__', '__rtruediv__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '_from_pyseries', '_s', 'append', 'apply', 'arg_true', 'arg_unique', 'argsort', 'as_duration', 'cast', 'chunk_lengths', 'clone', 'datetime_str_fmt', 'day', 'drop_nulls', 'dtype', 'explode', 'fill_none', 'filter', 'head', 'hour', 'inner', 'is_duplicated', 'is_float', 'is_not_null', 'is_null', 'is_numeric', 'is_unique', 'len', 'limit', 'max', 'mean', 'min', 'minute', 'month', 'n_chunks', 'name', 'nanosecond', 'null_count', 'ordinal_day', 'parse_date', 'rechunk', 'rename', 'rolling_max', 'rolling_mean', 'rolling_min', 'rolling_sum', 'sample', 'second', 'series_equal', 'set', 'set_at_idx', 'shift', 'slice', 'sort', 'std', 'str_contains', 'str_lengths', 'str_parse_date', 'str_replace', 'str_replace_all', 'str_to_lowercase', 'str_to_uppercase', 'sum', 'tail', 'take', 'take_every', 'to_dummies', 'to_list', 'to_numpy', 'unique', 'value_counts', 'var', 'view', 'year', 'zip_with']
In [97]:
# Bar Chart Using Matplotlib
plt.bar(x, y)
Out[97]:
<BarContainer object of 3 artists>
In [98]:
# Load Dataset
df = pl.read_csv("data/diamonds.csv")
In [99]:
df.head()
Out[99]:
shape: (5, 10)
╭───────┬─────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut     ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---     ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str     ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal   ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ Premium ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Good    ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.29  ┆ Premium ┆ I     ┆ VS2     ┆ ... ┆ 62.4  ┆ 58    ┆ 334   ┆ 4.2  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.31  ┆ Good    ┆ J     ┆ SI2     ┆ ... ┆ 63.3  ┆ 58    ┆ 335   ┆ 4.34 │
╰───────┴─────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
In [100]:
df['cut'].value_counts()
Out[100]:
shape: (5, 2)
╭───────────┬────────╮
│ cut       ┆ counts │
│ ---       ┆ ---    │
│ str       ┆ u32    │
╞═══════════╪════════╡
│ Ideal     ┆ 21551  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 13791  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Very Good ┆ 12082  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Good      ┆ 4906   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 1610   │
╰───────────┴────────╯
In [101]:
# type
type(df['cut'].value_counts())
Out[101]:
pypolars.frame.DataFrame
In [102]:
# In Pandas
df['cut'].value_counts().plot(kind='bar')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pypolars/frame.py in __getattr__(self, item)
    260         try:
--> 261             return wrap_s(self._df.column(item))
    262         except RuntimeError:

RuntimeError: Any(NotFound("plot"))

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-102-6c9e7394c03c> in <module>
      1 # In Pandas
----> 2 df['cut'].value_counts().plot(kind='bar')

/usr/local/lib/python3.7/dist-packages/pypolars/frame.py in __getattr__(self, item)
    261             return wrap_s(self._df.column(item))
    262         except RuntimeError:
--> 263             raise AttributeError(f"{item} not found")
    264
    265     def __iter__(self):

AttributeError: plot not found
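The traceback shows that PyPolars raises an AttributeError for any attribute it cannot resolve, so Python's built-in hasattr can detect the missing method. Here is a minimal sketch of a fallback helper built on that observation. Note that plot_bar and the FakeFrame stand-in class are hypothetical names used only for illustration; FakeFrame merely mimics an object that lacks .plot() but offers .to_pandas(), so the snippet runs without PyPolars installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd


class FakeFrame:
    """Hypothetical stand-in for a PyPolars DataFrame: no .plot(), only .to_pandas()."""

    def __init__(self, data):
        self._data = data

    def to_pandas(self):
        return pd.DataFrame(self._data)


def plot_bar(frame, **kwargs):
    """Call frame.plot() when it exists, otherwise convert to Pandas first."""
    if hasattr(frame, "plot"):
        return frame.plot(kind="bar", **kwargs)
    return frame.to_pandas().plot(kind="bar", **kwargs)


# Counts copied from the value_counts() output shown earlier
vc = FakeFrame({
    "cut": ["Ideal", "Premium", "Very Good", "Good", "Fair"],
    "counts": [21551, 13791, 12082, 4906, 1610],
})
ax = plot_bar(vc)  # falls back to Pandas because FakeFrame has no .plot()
```

The same helper would also accept a real Pandas DataFrame, since in that case the hasattr check succeeds and no conversion happens.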
In [103]:
# Method 1: Plot the two columns separately
vc = df['cut'].value_counts()
In [104]:
vc.head()
Out[104]:
shape: (5, 2)
╭───────────┬────────╮
│ cut       ┆ counts │
│ ---       ┆ ---    │
│ str       ┆ u32    │
╞═══════════╪════════╡
│ Ideal     ┆ 21551  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Premium   ┆ 13791  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Very Good ┆ 12082  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Good      ┆ 4906   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Fair      ┆ 1610   │
╰───────────┴────────╯
In [105]:
plt.bar(vc['cut'],vc['counts'])
Out[105]:
<BarContainer object of 5 artists>
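Method 1 works because plt.bar only needs two sequences of equal length, one for the labels and one for the heights. As a sketch of that idea, the snippet below uses plain Python lists in place of the two PyPolars columns (the values are the counts from the output above) and adds axis labels and a title. The title text is an arbitrary choice, and the lists are an assumption standing in for whatever the real columns yield.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Plain lists standing in for vc['cut'] and vc['counts']
cuts = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
counts = [21551, 13791, 12082, 4906, 1610]

fig, ax = plt.subplots()
bars = ax.bar(cuts, counts)      # one bar per cut category
ax.set_xlabel("cut")
ax.set_ylabel("counts")
ax.set_title("Diamonds per cut")
```

Working with an explicit Axes object (ax) instead of the plt.bar shortcut makes it easier to label the chart afterwards.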
In [107]:
# Method 2: Indirect via to_pandas()
df['cut'].value_counts().to_pandas().plot(kind='bar')
Out[107]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f93a76f7450>
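One thing to watch with Method 2: a Pandas bar plot uses the row index for the x-axis, so after conversion the bars would be labelled 0 to 4 rather than with the cut names. A possible fix, sketched below, is to make cut the index before plotting. The DataFrame here is built by hand as a stand-in for the result of .to_pandas() (an assumption, so the snippet runs without PyPolars), using the counts shown above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hand-built stand-in for df['cut'].value_counts().to_pandas()
pdf = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Very Good", "Good", "Fair"],
    "counts": [21551, 13791, 12082, 4906, 1610],
})

# Use the cut names as the index so they become the x-axis labels
ax = pdf.set_index("cut")["counts"].plot(kind="bar")
ax.set_ylabel("counts")
```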
In [108]:
# Using Seaborn
df
Out[108]:
shape: (53940, 10)
╭───────┬───────────┬───────┬─────────┬─────┬───────┬───────┬───────┬──────╮
│ carat ┆ cut       ┆ color ┆ clarity ┆ ... ┆ depth ┆ table ┆ price ┆ x    │
│ ---   ┆ ---       ┆ ---   ┆ ---     ┆     ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ f64   ┆ str       ┆ str   ┆ str     ┆     ┆ f64   ┆ f64   ┆ i64   ┆ f64  │
╞═══════╪═══════════╪═══════╪═════════╪═════╪═══════╪═══════╪═══════╪══════╡
│ 0.23  ┆ Ideal     ┆ E     ┆ SI2     ┆ ... ┆ 61.5  ┆ 55    ┆ 326   ┆ 3.95 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.21  ┆ Premium   ┆ E     ┆ SI1     ┆ ... ┆ 59.8  ┆ 61    ┆ 326   ┆ 3.89 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.23  ┆ Good      ┆ E     ┆ VS1     ┆ ... ┆ 56.9  ┆ 65    ┆ 327   ┆ 4.05 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.29  ┆ Premium   ┆ I     ┆ VS2     ┆ ... ┆ 62.4  ┆ 58    ┆ 334   ┆ 4.2  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...   ┆ ...       ┆ ...   ┆ ...     ┆ ... ┆ ...   ┆ ...   ┆ ...   ┆ ...  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.72  ┆ Ideal     ┆ D     ┆ SI1     ┆ ... ┆ 60.8  ┆ 57    ┆ 2757  ┆ 5.75 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.72  ┆ Good      ┆ D     ┆ SI1     ┆ ... ┆ 63.1  ┆ 55    ┆ 2757  ┆ 5.69 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.7   ┆ Very Good ┆ D     ┆ SI1     ┆ ... ┆ 62.8  ┆ 60    ┆ 2757  ┆ 5.66 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.86  ┆ Premium   ┆ H     ┆ SI2     ┆ ... ┆ 61    ┆ 58    ┆ 2757  ┆ 6.15 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.75  ┆ Ideal     ┆ D     ┆ SI2     ┆ ... ┆ 62.2  ┆ 55    ┆ 2757  ┆ 5.83 │
╰───────┴───────────┴───────┴─────────┴─────┴───────┴───────┴───────┴──────╯
In [109]:
import seaborn as sns
In [112]:
# Works in Pandas
# Convert to pandas
sns.displot(data=df.to_pandas(), x='price', hue='cut')
Out[112]:
<seaborn.axisgrid.FacetGrid at 0x7f93a764959
PyPolars has some cool benefits and features, and where it lacks one, such as plotting, .to_pandas() fills the gap. Cool right?
You can check out the video tutorial below
Thanks For Your Time
Jesus Saves