df.group_by().agg()

df.group_by(..).agg([..]) groups rows by the given columns and evaluates the aggregation expressions in parallel.

Setup

import numpy as np
import pandas as pd
import polars as pl

np.random.seed(42)
data = {
    "nrs": [1, 2, 3, 4, 5],
    "names": ["foo", "ham", "spam", "egg", "baz"],
    "random": np.random.rand(5),
    "groups": ["A", "A", "B", "C", "B"],
}

df_pl = pl.DataFrame(data)
print(df_pl)

shape: (5, 4)
┌─────┬───────┬──────────┬────────┐
│ nrs ┆ names ┆ random   ┆ groups │
│ --- ┆ ---   ┆ ---      ┆ ---    │
│ i64 ┆ str   ┆ f64      ┆ str    │
╞═════╪═══════╪══════════╪════════╡
│ 1   ┆ foo   ┆ 0.37454  ┆ A      │
│ 2   ┆ ham   ┆ 0.950714 ┆ A      │
│ 3   ┆ spam  ┆ 0.731994 ┆ B      │
│ 4   ┆ egg   ┆ 0.598658 ┆ C      │
│ 5   ┆ baz   ┆ 0.156019 ┆ B      │
└─────┴───────┴──────────┴────────┘

df_pd = pd.DataFrame(data)
print(df_pd)

   nrs names    random groups
0    1   foo  0.374540      A
1    2   ham  0.950714      A
2    3  spam  0.731994      B
3    4   egg  0.598658      C
4    5   baz  0.156019      B

Example

df.group_by(..).agg(..) in Polars behaves similarly to df.groupby(..).agg(..) in Pandas. In Polars, aggregations are written as expressions, whereas Pandas relies on the methods provided by the groupby object.

out_pl = (
    df_pl.group_by("groups")
    .agg(
        pl.col("nrs").sum(),
        pl.col("random").count(),
        (
            pl.col("random")
            .filter(pl.col("names").str.contains("m"))
            .sum()
            .name.suffix("_sum")
        ),
        pl.col("names").reverse().alias("reversed names"),
    )
    .sort(by="groups")
)
print(out_pl)

shape: (3, 5)
┌────────┬─────┬────────┬────────────┬─────────────────┐
│ groups ┆ nrs ┆ random ┆ random_sum ┆ reversed names  │
│ ---    ┆ --- ┆ ---    ┆ ---        ┆ ---             │
│ str    ┆ i64 ┆ u32    ┆ f64        ┆ list[str]       │
╞════════╪═════╪════════╪════════════╪═════════════════╡
│ A      ┆ 3   ┆ 2      ┆ 0.950714   ┆ ["ham", "foo"]  │
│ B      ┆ 8   ┆ 2      ┆ 0.731994   ┆ ["baz", "spam"] │
│ C      ┆ 4   ┆ 1      ┆ 0.0        ┆ ["egg"]         │
└────────┴─────┴────────┴────────────┴─────────────────┘
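To make clear what each aggregation in the output above computes, here is a library-independent sketch in plain Python. It is not how Polars works internally. The rows are copied from the printed table, and the names rows, buckets and result are introduced only for illustration. It also shows why group C gets a random_sum of 0.0: no name in that group contains "m", so the filtered sum runs over an empty selection.

```python
from collections import defaultdict

# Rows copied from the printed table: (nrs, names, random, groups).
rows = [
    (1, "foo", 0.37454, "A"),
    (2, "ham", 0.950714, "A"),
    (3, "spam", 0.731994, "B"),
    (4, "egg", 0.598658, "C"),
    (5, "baz", 0.156019, "B"),
]

# Step 1: bucket the rows by their group key.
buckets = defaultdict(list)
for nr, name, rnd, group in rows:
    buckets[group].append((nr, name, rnd))

# Step 2: compute every aggregation per bucket, sorted by group key.
result = {}
for group in sorted(buckets):
    members = buckets[group]
    result[group] = {
        # Sum of "nrs" within the group.
        "nrs": sum(nr for nr, _, _ in members),
        # Number of "random" values (there are no nulls in this data).
        "random": len(members),
        # Sum of "random" only where the name contains "m";
        # an empty selection sums to 0.0.
        "random_sum": sum(
            (rnd for _, name, rnd in members if "m" in name), 0.0
        ),
        # Names of the group in reversed order, kept as a list.
        "reversed names": [name for _, name, _ in members][::-1],
    }

for group, aggs in result.items():
    print(group, aggs)
```

Each entry of result corresponds to one row of the Polars output, with the list-valued column holding all names of the group rather than a single scalar.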

out_pd = (
    df_pd.assign(random_m=lambda df_: df_.random[df_.names.str.contains("m")])
    .groupby("groups")
    .agg(
        **{
            "nrs": ("nrs", "sum"),
            "random": ("random", "count"),
            "random_sum": ("random_m", "sum"),
            "reverse names": ("names", lambda s_: s_[::-1]),
        }
    )
    .reset_index()
)
print(out_pd)

  groups  nrs  random  random_sum reverse names
0      A    3       2    0.950714    [ham, foo]
1      B    8       2    0.731994   [baz, spam]
2      C    4       1    0.000000           egg

Reference

The examples in this section have been adapted from the Polars user guide.