Introduction

Purpose

This repository helps those transitioning from Pandas to Polars become acquainted with Polars' syntax. Most of the code examples are sourced from the excellent Polars user guide. Each example shows both Polars and Pandas code, so you can first practice converting Polars to Pandas on your own. If you get stuck, you can refer to my solutions for guidance; I believe that with these hints you'll develop even better solutions of your own. This approach lets you grasp Polars quickly through the familiar lens of Pandas.

Why take this approach?

Converting code from Polars to Pandas involves a three-step process:

  1. Familiarizing yourself with Polars: First, you must acquaint yourself with Polars' syntax to understand what the code means.
  2. Converting to Pandas: During the conversion process, you'll need to determine how to accomplish tasks using Pandas.
  3. Comparing the results: Finally, you'll compare the results and gain insights into the strengths and weaknesses of both libraries.

This approach ensures a comprehensive understanding of both Polars and Pandas, enabling you to make informed decisions when working with data manipulation libraries.

Embrace the new mindset

Contexts

Contexts in Polars determine how to perform operations similar to df.loc[.., ..], df.iloc[.., ..] and df[..].

In Polars, you'll mainly work with these three contexts to manipulate rows and columns:

  • df.select([..]): Select or create columns.
  • df.with_columns([..]): Add or overwrite columns while keeping the existing ones.
  • df.filter(..): Filter rows.

It's worth noting that df.group_by(..).agg([..]) serves as a specialized context in Polars for aggregation purposes.

Expressions

Expressions in Polars are akin to the operations you wish to perform. They are present throughout the library. You'll find them used for various tasks, such as changing a column's data type, sorting a column, extracting the initial rows, and even computing the mean value for each group after performing a group by operation.

No more index

Polars excels at data manipulation through a column-based approach, unburdened by index-based constraints. In contrast, Pandas primarily relies on index alignment as the key concept for connecting columns within each row. If you need to break the relationship for a single column in Pandas, especially when the original index is multi-indexed, you'll likely find yourself doing a substantial amount of work to figure out how to realign it.

import pandas as pd
import polars as pl

data = {"nrs": [4, 3, 1, 5, 2], "names": ["foo", "ham", "spam", "egg", "baz"]}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)

out_pl = df_pl.select(nrs=pl.col("nrs").sort(), names=pl.col("names").reverse())
print(out_pl)

shape: (5, 2)
┌─────┬───────┐
│ nrs ┆ names │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ baz   │
│ 2   ┆ egg   │
│ 3   ┆ spam  │
│ 4   ┆ ham   │
│ 5   ┆ foo   │
└─────┴───────┘

out_pd = df_pd.assign(
    nrs=lambda df_: df_.nrs.sort_values().reset_index(drop=True),
    names=lambda df_: df_.names[::-1].reset_index(drop=True),
)
print(out_pd)

   nrs names
0    1   baz
1    2   egg
2    3  spam
3    4   ham
4    5   foo

Pseudo Index

If you really need an index to help you get used to Polars, you can add a row-number column with df.with_row_count() (renamed to df.with_row_index() in newer Polars releases).

Parallel

Polars is designed to operate in parallel: the expressions within a single context are evaluated independently of one another. This means that you can't refer to a column name you've assigned within the same context. This behavior may require some adjustment for Pandas users who are accustomed to heavily using df.assign().

import pandas as pd
import polars as pl

data = {"nrs": [4, 3, 1, 5, 2], "names": ["foo", "ham", "spam", "egg", "baz"]}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)

out_pl = df_pl.with_columns(add1=pl.col("nrs") + 1).with_columns(
    add2=pl.col("add1") + 1
)
print(out_pl)

shape: (5, 4)
┌─────┬───────┬──────┬──────┐
│ nrs ┆ names ┆ add1 ┆ add2 │
│ --- ┆ ---   ┆ ---  ┆ ---  │
│ i64 ┆ str   ┆ i64  ┆ i64  │
╞═════╪═══════╪══════╪══════╡
│ 4   ┆ foo   ┆ 5    ┆ 6    │
│ 3   ┆ ham   ┆ 4    ┆ 5    │
│ 1   ┆ spam  ┆ 2    ┆ 3    │
│ 5   ┆ egg   ┆ 6    ┆ 7    │
│ 2   ┆ baz   ┆ 3    ┆ 4    │
└─────┴───────┴──────┴──────┘

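The following snippet raises pl.exceptions.ColumnNotFoundError, because add1 does not exist yet when add2 is evaluated within the same context:

df_pl.with_columns(add1=pl.col("nrs") + 1,
                   add2=pl.col("add1") + 1)

In Pandas, assign() evaluates its keyword arguments in order, so a later lambda can refer to a column created earlier in the same call:

out_pd = df_pd.assign(add1=lambda df_: df_.nrs + 1, add2=lambda df_: df_.add1 + 1)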
print(out_pd)

   nrs names  add1  add2
0    4   foo     5     6
1    3   ham     4     5
2    1  spam     2     3
3    5   egg     6     7
4    2   baz     3     4

Namespaces

Namespaces in Polars are akin to accessors in Pandas. However, Polars offers richer namespaces than Pandas, such as the list namespace, which can be incredibly useful for columns that hold lists.

Lazy

The lazy API is at the core of Polars and offers numerous advantages over eager mode, because Polars can optimize the whole query before running it. For a more in-depth understanding, refer to the user guide.

Missing data

In Polars, missing data is consistently represented as a null value. Additionally, Polars permits the use of Not a Number or NaN values for float columns. It's important to avoid conflating these two concepts.