df.select()
df.select([..])
is a powerful selection tool that allows you to select one or more columns and perform operations on them in parallel. This includes the ability to create new columns as part of your selection process.
Setup
import numpy as np
import pandas as pd
import polars as pl
np.random.seed(42)
data = {
"nrs": [1, 2, 3, 4, 5],
"names": ["foo", "ham", "spam", "egg", "baz"],
"random": np.random.rand(5),
"groups": ["A", "A", "B", "C", "B"],
}
shape: (5, 4)
┌─────┬───────┬──────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞═════╪═══════╪══════════╪════════╡
│ 1 ┆ foo ┆ 0.37454 ┆ A │
│ 2 ┆ ham ┆ 0.950714 ┆ A │
│ 3 ┆ spam ┆ 0.731994 ┆ B │
│ 4 ┆ egg ┆ 0.598658 ┆ C │
│ 5 ┆ baz ┆ 0.156019 ┆ B │
└─────┴───────┴──────────┴────────┘
Basic form
If selecting at least one column from the original dataframe, the behavior of df.select([..])
can be treated as df.loc(:, [..]).assign(..)
in Pandas
.
out_pl = df_pl.select(
pl.col("nrs"),
(pl.col("nrs") + 1).alias("add1"),
(pl.col("nrs") - 1).alias("sub1"),
(pl.col("nrs") * 2).alias("mul2"),
(pl.col("nrs") / 3).alias("div3"),
)
print(out_pl)
shape: (5, 5)
┌─────┬──────┬──────┬──────┬──────────┐
│ nrs ┆ add1 ┆ sub1 ┆ mul2 ┆ div3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 │
╞═════╪══════╪══════╪══════╪══════════╡
│ 1 ┆ 2 ┆ 0 ┆ 2 ┆ 0.333333 │
│ 2 ┆ 3 ┆ 1 ┆ 4 ┆ 0.666667 │
│ 3 ┆ 4 ┆ 2 ┆ 6 ┆ 1.0 │
│ 4 ┆ 5 ┆ 3 ┆ 8 ┆ 1.333333 │
│ 5 ┆ 6 ┆ 4 ┆ 10 ┆ 1.666667 │
└─────┴──────┴──────┴──────┴──────────┘
Generalized form
It's worth noting that df.select([..])
can also be used to create columns. Therefore, the behavior of df.select([..])
is akin to df.assign(..).drop(columns=..)
in Pandas
."
out_pl = df_pl.select(
(pl.col("nrs") + 1).alias("add1"),
(pl.col("nrs") - 1).alias("sub1"),
(pl.col("nrs") * 2).alias("mul2"),
(pl.col("nrs") / 3).alias("div3"),
)
print(out_pl)
shape: (5, 4)
┌──────┬──────┬──────┬──────────┐
│ add1 ┆ sub1 ┆ mul2 ┆ div3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ f64 │
╞══════╪══════╪══════╪══════════╡
│ 2 ┆ 0 ┆ 2 ┆ 0.333333 │
│ 3 ┆ 1 ┆ 4 ┆ 0.666667 │
│ 4 ┆ 2 ┆ 6 ┆ 1.0 │
│ 5 ┆ 3 ┆ 8 ┆ 1.333333 │
│ 6 ┆ 4 ┆ 10 ┆ 1.666667 │
└──────┴──────┴──────┴──────────┘
Example
While this example appears straightforward in Polars
, there are some nuances in Pandas
. Let's explore them.
out_pl = df_pl.select(
pl.col("nrs").sum(),
pl.col("names").sort(),
pl.col("names").first().alias("first name"),
(pl.col("nrs").mean() * 10).alias("10xnrs"),
)
print(out_pl)
shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪════════════╪════════╡
│ 15 ┆ baz ┆ foo ┆ 30.0 │
│ 15 ┆ egg ┆ foo ┆ 30.0 │
│ 15 ┆ foo ┆ foo ┆ 30.0 │
│ 15 ┆ ham ┆ foo ┆ 30.0 │
│ 15 ┆ spam ┆ foo ┆ 30.0 │
└─────┴───────┴────────────┴────────┘
It's noted that, since one of the column names contains a space ('first name'), we need to use a dictionary to store each column name as the key and the corresponding operation as the value. Then, we can unpack this dictionary in pd.assign(..)
.
Here's a breakdown of the operations:
- "nrs": This operation calculates the sum of the "nrs" column and broadcasts it to all rows.
- "names": This operation sorts the values in the "names" column. To realign the index, we use
df.reset_index(drop=True)
. - "first name": This operation selects the first row value of the "names" column and broadcasts it to all rows. It's important to note that since we reassigned "names" in the previous line, we need to use the original
df_pd
to retrieve the "names" column. - "10xnrs": This operation calculates the mean of the "nrs" column and broadcasts it to all rows. Similarly, since we reassigned "nrs" at the beginning, we need to use the original
df_pd
to retrieve the "nrs" column.
Additionally, we use .drop(columns=df_pd.columns.drop(["nrs", "names"]))
to drop the columns that haven't been reassigned.
Tips
- Whenever you wish you could use
df.loc(..)
,df.iloc(..)
ordf[..]
inPolars
, you're likely looking fordf.select([..])
. - Notably,
df.select([..])
not only lets you select columns but also enables you to create new columns simultaneously.
Reference
The examples in this section have been adapted from the Polars
user guide.