df.group_by().agg()
df.group_by(..).agg([..]) groups the rows by the given columns and computes the listed aggregations in parallel.
Setup
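import numpy as np
import pandas as pd
import polars as pl

# Fix the seed so the "random" column is reproducible.
np.random.seed(42)
data = {
    "nrs": [1, 2, 3, 4, 5],
    "names": ["foo", "ham", "spam", "egg", "baz"],
    "random": np.random.rand(5),
    "groups": ["A", "A", "B", "C", "B"],
}
# Build the Polars DataFrame used in the example below.
df_pl = pl.DataFrame(data)
print(df_pl)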
shape: (5, 4)
┌─────┬───────┬──────────┬────────┐
│ nrs ┆ names ┆ random   ┆ groups │
│ --- ┆ ---   ┆ ---      ┆ ---    │
│ i64 ┆ str   ┆ f64      ┆ str    │
╞═════╪═══════╪══════════╪════════╡
│ 1   ┆ foo   ┆ 0.37454  ┆ A      │
│ 2   ┆ ham   ┆ 0.950714 ┆ A      │
│ 3   ┆ spam  ┆ 0.731994 ┆ B      │
│ 4   ┆ egg   ┆ 0.598658 ┆ C      │
│ 5   ┆ baz   ┆ 0.156019 ┆ B      │
└─────┴───────┴──────────┴────────┘
Example
df.group_by(..).agg(..) behaves similarly to df.groupby(..).agg(..) in Pandas. In Polars, aggregation is primarily achieved through expressions, whereas Pandas relies on the methods provided by the grouper object.
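out_pl = (
    df_pl.group_by("groups")
    .agg(
        # Sum of "nrs" per group.
        pl.col("nrs").sum(),
        # Number of non-null "random" values per group.
        pl.col("random").count(),
        # Sum of "random" restricted to rows whose name contains "m".
        (
            pl.col("random")
            .filter(pl.col("names").str.contains("m"))
            .sum()
            .name.suffix("_sum")  # Expr.suffix was replaced by Expr.name.suffix
        ),
        # Names of each group in reverse order, collected into a list.
        pl.col("names").reverse().alias("reversed names"),
    )
    .sort(by="groups")
)
print(out_pl)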
shape: (3, 5)
┌────────┬─────┬────────┬────────────┬─────────────────┐
│ groups ┆ nrs ┆ random ┆ random_sum ┆ reversed names  │
│ ---    ┆ --- ┆ ---    ┆ ---        ┆ ---             │
│ str    ┆ i64 ┆ u32    ┆ f64        ┆ list[str]       │
╞════════╪═════╪════════╪════════════╪═════════════════╡
│ A      ┆ 3   ┆ 2      ┆ 0.950714   ┆ ["ham", "foo"]  │
│ B      ┆ 8   ┆ 2      ┆ 0.731994   ┆ ["baz", "spam"] │
│ C      ┆ 4   ┆ 1      ┆ 0.0        ┆ ["egg"]         │
└────────┴─────┴────────┴────────────┴─────────────────┘
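For comparison, the same result could be approximated in Pandas roughly as follows. This is only a sketch and is not taken from the original page: the names df_pd, out_pd, and the helper column random_m are assumptions introduced for illustration. Pandas has no per-aggregation filter, so the conditional sum is prepared as a masked column before grouping, and the column is called reversed_names because named aggregation needs keyword-style names.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df_pd = pd.DataFrame(
    {
        "nrs": [1, 2, 3, 4, 5],
        "names": ["foo", "ham", "spam", "egg", "baz"],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)

# Hypothetical helper column: keep "random" only where the name contains "m",
# and use 0.0 elsewhere so that it does not contribute to the sum.
df_pd["random_m"] = df_pd["random"].where(df_pd["names"].str.contains("m"), 0.0)

out_pd = (
    df_pd.groupby("groups")
    .agg(
        nrs=("nrs", "sum"),
        random=("random", "count"),
        random_sum=("random_m", "sum"),
        # Collect each group's names into a list in reverse order.
        reversed_names=("names", lambda s: s.tolist()[::-1]),
    )
    .reset_index()
)
print(out_pd)
```

The Polars version keeps the filter inside the expression itself, while this sketch has to add a column to the frame beforehand, which illustrates the difference between the two approaches described above.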
Reference
The examples in this section have been adapted from the Polars user guide.