bucketize

bucketize(*exprs, return_dtype=None)

Returns a Polars expression that assigns a value to each row based on its index, cycling through the provided expressions in a round-robin fashion: with n expressions, row i takes its value from the expression at position i modulo n.

bucketize() is the more general form of bucketize_lit(), allowing you to pass Polars expressions instead of just literal values. This enables advanced use cases such as referencing or transforming existing column values.
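The round-robin behaviour described above can be sketched in plain Python, without Polars. The helper name `round_robin_pick` is hypothetical and is not part of turtle_island; each input list stands in for one already-evaluated expression.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def round_robin_pick(columns: Sequence[Sequence[T]]) -> list[T]:
    """Hypothetical helper: row i takes its value from columns[i % len(columns)]."""
    if not columns:
        raise ValueError("at least one column is required")
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all columns must have the same length")
    return [columns[i % len(columns)][i] for i in range(n_rows)]


x = [1, 2, 3, 4, 5]
x_plus_10 = [v + 10 for v in x]   # stands in for pl.col("x").add(10)
hundred = [100] * len(x)          # stands in for pl.lit(100)

result = round_robin_pick([x_plus_10, hundred])
print(result)  # [11, 100, 13, 100, 15]
```

Even-indexed rows come from the first input and odd-indexed rows from the second, which is the pattern shown in the DataFrame example in this page.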

Be cautious when using pl.lit() as the first expression

Polars will automatically infer the data type of pl.lit(). For example, pl.lit(1) is inferred as pl.Int32.

To avoid unexpected type mismatches, it’s recommended to explicitly set the desired data type using return_dtype=.

Parameters

exprs : pl.Expr | Iterable[pl.Expr] = (): One or more pl.Expr objects, which can be passed as separate arguments or as a single iterable containing multiple expressions. All expressions must resolve to the same data type.
return_dtype : pl.DataType | pl.DataTypeExpr | None = None: An optional Polars data type to cast the resulting expression to.

Returns

pl.Expr: A Polars expression that, for each row, selects the input expression at position (row index modulo the number of expressions).

Examples

DataFrame Context

Alternate between a column expression and a literal value:

import polars as pl
import turtle_island as ti

pl.Config.set_fmt_table_cell_list_len(10)
df = pl.DataFrame({"x": [1, 2, 3, 4, 5]})
df.with_columns(
    ti.bucketize(pl.col("x").add(10), pl.lit(100)).alias("bucketized")
)

shape: (5, 2)

x	bucketized
i64	i64
1	11
2	100
3	13
4	100
5	15

This alternates between the values of x + 10 and the literal 100. Make sure all expressions resolve to the same type—in this case, integers.

You can also cast the result to a specific type using return_dtype=:

df.with_columns(
    ti.bucketize(
        pl.col("x").add(10), pl.lit(100), return_dtype=pl.String
    ).alias("bucketized")
)

shape: (5, 2)

x	bucketized
i64	str
1	"11"
2	"100"
3	"13"
4	"100"
5	"15"

List Namespace Context

Working with Lists as Series

In the list namespace, it may be easier to think of each row as an element in a list. Conceptually, you’re working with a pl.Series, where each row corresponds to one item in the list.

Alternate between a column expression and a literal value for each element:

df2 = pl.DataFrame(
    {
        "x": [[1, 2, 3, 4], [5, 6, 7, 8]],
        "y": [[9, 10, 11, 12], [13, 14, 15, 16]],
    }
)
(
    df2.with_columns(
        pl.all().list.eval(ti.bucketize(pl.element().add(10), pl.lit(100)))
    )
)

shape: (2, 2)

x	y
list[i64]	list[i64]
[11, 100, 13, 100]	[19, 100, 21, 100]
[15, 100, 17, 100]	[23, 100, 25, 100]
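The per-list behaviour can be mimicked in plain Python. This sketch assumes that the row index restarts at zero inside each list, which is consistent with the output above but is an assumption rather than something this page states. The helper `bucketize_lists` and the `pickers` list are hypothetical names used only for illustration.

```python
def bucketize_lists(rows, pickers):
    """Hypothetical helper: apply pickers round-robin inside each inner list.

    The position counter restarts at 0 for every inner list (assumption).
    """
    n = len(pickers)
    return [[pickers[i % n](v) for i, v in enumerate(inner)] for inner in rows]


# Stand-ins for pl.element().add(10) and pl.lit(100).
pickers = [lambda v: v + 10, lambda v: 100]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]
y = [[9, 10, 11, 12], [13, 14, 15, 16]]

x_out = bucketize_lists(x, pickers)
y_out = bucketize_lists(y, pickers)
print(x_out)  # [[11, 100, 13, 100], [15, 100, 17, 100]]
print(y_out)  # [[19, 100, 21, 100], [23, 100, 25, 100]]
```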