Column selections
Setup
from datetime import date, datetime
import numpy as np
import pandas as pd
import polars as pl
import polars.selectors as cs
df_pl = pl.DataFrame(
    {
        "id": [9, 4, 2],
        "place": ["Mars", "Earth", "Saturn"],
        "date": pl.date_range(date(2022, 1, 1), date(2022, 1, 3), "1d", eager=True),
        "sales": [33.4, 2142134.1, 44.7],
        "has_people": [False, True, False],
        "logged_at": pl.datetime_range(
            datetime(2022, 12, 1), datetime(2022, 12, 1, 0, 0, 2), "1s", eager=True
        ),
    }
    # Adds a u32 row-number column named "rn"; newer Polars releases
    # deprecate with_row_count in favor of with_row_index.
).with_row_count("rn")
print(df_pl)
shape: (3, 7)
┌─────┬─────┬────────┬────────────┬───────────┬────────────┬─────────────────────┐
│ rn  ┆ id  ┆ place  ┆ date       ┆ sales     ┆ has_people ┆ logged_at           │
│ --- ┆ --- ┆ ---    ┆ ---        ┆ ---       ┆ ---        ┆ ---                 │
│ u32 ┆ i64 ┆ str    ┆ date       ┆ f64       ┆ bool       ┆ datetime[μs]        │
╞═════╪═════╪════════╪════════════╪═══════════╪════════════╪═════════════════════╡
│ 0   ┆ 9   ┆ Mars   ┆ 2022-01-01 ┆ 33.4      ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1   ┆ 4   ┆ Earth  ┆ 2022-01-02 ┆ 2142134.1 ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2   ┆ 2   ┆ Saturn ┆ 2022-01-03 ┆ 44.7      ┆ false      ┆ 2022-12-01 00:00:02 │
└─────┴─────┴────────┴────────────┴───────────┴────────────┴─────────────────────┘
df_pd = (
    pd.DataFrame(
        {
            "id": [9, 4, 2],
            "place": ["Mars", "Earth", "Saturn"],
            "date": pd.date_range("2022-01-01", "2022-01-03"),
            "sales": [33.4, 2142134.1, 44.7],
            "has_people": [False, True, False],
            # Lowercase "s" is the seconds alias; uppercase "S" is deprecated in recent pandas.
            "logged_at": pd.date_range("2022-12-01", "2022-12-01 00:00:02", freq="s"),
        }
    )
    # Name the default index "rn" and turn it into a regular column.
    .rename_axis("rn")
    .reset_index()
)
print(df_pd)
Expression expansion
Select all
shape: (3, 7)
┌─────┬─────┬────────┬────────────┬───────────┬────────────┬─────────────────────┐
│ rn  ┆ id  ┆ place  ┆ date       ┆ sales     ┆ has_people ┆ logged_at           │
│ --- ┆ --- ┆ ---    ┆ ---        ┆ ---       ┆ ---        ┆ ---                 │
│ u32 ┆ i64 ┆ str    ┆ date       ┆ f64       ┆ bool       ┆ datetime[μs]        │
╞═════╪═════╪════════╪════════════╪═══════════╪════════════╪═════════════════════╡
│ 0   ┆ 9   ┆ Mars   ┆ 2022-01-01 ┆ 33.4      ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1   ┆ 4   ┆ Earth  ┆ 2022-01-02 ┆ 2142134.1 ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2   ┆ 2   ┆ Saturn ┆ 2022-01-03 ┆ 44.7      ┆ false      ┆ 2022-12-01 00:00:02 │
└─────┴─────┴────────┴────────────┴───────────┴────────────┴─────────────────────┘
Exclude
shape: (3, 5)
┌─────┬────────┬────────────┬───────────┬────────────┐
│ id  ┆ place  ┆ date       ┆ sales     ┆ has_people │
│ --- ┆ ---    ┆ ---        ┆ ---       ┆ ---        │
│ i64 ┆ str    ┆ date       ┆ f64       ┆ bool       │
╞═════╪════════╪════════════╪═══════════╪════════════╡
│ 9   ┆ Mars   ┆ 2022-01-01 ┆ 33.4      ┆ false      │
│ 4   ┆ Earth  ┆ 2022-01-02 ┆ 2142134.1 ┆ true       │
│ 2   ┆ Saturn ┆ 2022-01-03 ┆ 44.7      ┆ false      │
└─────┴────────┴────────────┴───────────┴────────────┘
By multiple strings
out_pd = df_pd.loc[:, ["date", "logged_at"]].assign(
    # %b is the portable directive for the abbreviated month name (%h is a platform-specific alias).
    date=lambda df_: df_.date.dt.strftime("%Y-%b-%d"),
    logged_at=lambda df_: df_.logged_at.dt.strftime("%Y-%b-%d"),
)
print(out_pd)
If dozens of columns need the same manipulation, I use the following approach instead.
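The approach referred to above is not shown in this section. One way it could look, as a sketch only: keep the column names in a list (cols is a hypothetical name) and apply the same formatting function to each selected column.

```python
import pandas as pd

# Only the two datetime columns from the Setup frame are needed here.
df_pd = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
        "logged_at": pd.to_datetime(
            ["2022-12-01 00:00:00", "2022-12-01 00:00:01", "2022-12-01 00:00:02"]
        ),
    }
)

# Hypothetical list of columns to format; in practice it could hold dozens of names.
cols = ["date", "logged_at"]

# DataFrame.apply calls the function once per selected column (a Series).
out_pd = df_pd.loc[:, cols].apply(lambda s: s.dt.strftime("%Y-%b-%d"))
print(out_pd)
```

This avoids writing one keyword argument per column in assign.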
By regular expressions
By data type
Using selectors
selectors is a unique feature of Polars. It behaves similarly to a combination of df.select_dtypes() and df.filter() in Pandas.
By dtype
Applying set operations
By patterns and substrings
shape: (3, 3)
┌─────┬────────────┬─────────────────────┐
│ rn  ┆ has_people ┆ logged_at           │
│ --- ┆ ---        ┆ ---                 │
│ u32 ┆ bool       ┆ datetime[μs]        │
╞═════╪════════════╪═════════════════════╡
│ 0   ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1   ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2   ┆ false      ┆ 2022-12-01 00:00:02 │
└─────┴────────────┴─────────────────────┘
Reference
The examples in this section have been adapted from the Polars user guide.