Introduction
Purpose
This repository is designed to help those transitioning from Pandas to Polars become acquainted with Polars' syntax. Most of the code examples are sourced from the excellent Polars user guide. Each example features both Polars and Pandas code, encouraging you to practice converting Polars to Pandas independently. If you encounter challenges with Polars, you can refer to my solutions for guidance. I believe that with these hints, you'll develop even better solutions of your own. This approach will enable you to swiftly grasp Polars through the familiar lens of Pandas.
Why take this approach?
Converting code from Polars to Pandas involves a three-step process:
- Familiarizing with Polars: First, you must acquaint yourself with Polars' syntax to understand what the code does.
- Converting to Pandas: During the conversion process, you'll need to determine how to accomplish the same task using Pandas.
- Comparing the results: Finally, you'll compare the results and gain insights into the strengths and weaknesses of both libraries.
This approach ensures a comprehensive understanding of both Polars and Pandas, enabling you to make informed decisions when working with data manipulation libraries.
Embrace the new mindset
Contexts
Contexts in Polars determine how to perform operations similar to df.loc[.., ..], df.iloc[.., ..] and df[..].
In Polars, you'll mainly work with these three contexts to manipulate rows and columns:
- df.select([..]): Select or create columns; only the listed columns are returned.
- df.with_columns([..]): Create or overwrite columns while keeping the existing ones.
- df.filter(..): Filter rows.
It's worth noting that df.group_by(..).agg([..]) serves as a specialized context in Polars for aggregation purposes.
Expressions
Expressions in Polars are akin to the operations you wish to perform. They are present throughout the library. You'll find them used for various tasks, such as changing a column's data type, sorting a column, extracting the initial rows, and even computing the mean value for each group after performing a group by operation.
No more index
Polars excels at data manipulation through a column-based approach, unburdened by index-based constraints. In contrast, Pandas primarily relies on index alignment as the key concept for connecting columns within each row. If you need to break the relationship for a single column in Pandas, especially when the original index is multi-indexed, you'll likely find yourself doing a substantial amount of work to figure out how to realign it.
import pandas as pd
import polars as pl
data = {"nrs": [4, 3, 1, 5, 2], "names": ["foo", "ham", "spam", "egg", "baz"]}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)
Pseudo Index
If you really need an index while you get used to Polars, you can use df.with_row_count(), which adds a row-number column (recent Polars releases name this method df.with_row_index()).
Parallel
Polars is designed to operate in parallel: the expressions within a single context are evaluated independently of one another. This means that you can't refer to a column name you've assigned within the same context. This behavior may require some adjustment for Pandas users who are accustomed to heavily using df.assign().
import pandas as pd
import polars as pl
data = {"nrs": [4, 3, 1, 5, 2], "names": ["foo", "ham", "spam", "egg", "baz"]}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)
# add2 depends on add1, so it has to be created in a second with_columns call.
out_pl = df_pl.with_columns(add1=pl.col("nrs") + 1).with_columns(
    add2=pl.col("add1") + 1
)
print(out_pl)
Namespaces
Namespaces in Polars are akin to accessors in Pandas. However, Polars offers more robust namespaces compared to Pandas, with features such as the list namespace, which can be incredibly useful.
Lazy
Lazy evaluation is at the core of Polars and offers numerous advantages compared to the eager mode, since Polars can optimize the whole query before running it. For a more in-depth understanding, you should refer to the user guide.
Missing data
In Polars, missing data is consistently represented as a null value. Additionally, Polars permits the use of Not a Number (NaN) values for float columns. It's important to avoid conflating these two concepts.