Skip to content

df.select()

df.select([..]) is a powerful selection tool that allows you to select one or more columns and perform operations on them in parallel. This includes the ability to create new columns as part of your selection process.

Setup

import numpy as np
import pandas as pd
import polars as pl

np.random.seed(42)
data = {
    "nrs": [1, 2, 3, 4, 5],
    "names": ["foo", "ham", "spam", "egg", "baz"],
    "random": np.random.rand(5),
    "groups": ["A", "A", "B", "C", "B"],
}

df_pl = pl.DataFrame(data)
print(df_pl)

shape: (5, 4)
┌─────┬───────┬──────────┬────────┐
│ nrs ┆ names ┆ random   ┆ groups │
│ --- ┆ ---   ┆ ---      ┆ ---    │
│ i64 ┆ str   ┆ f64      ┆ str    │
╞═════╪═══════╪══════════╪════════╡
│ 1   ┆ foo   ┆ 0.37454  ┆ A      │
│ 2   ┆ ham   ┆ 0.950714 ┆ A      │
│ 3   ┆ spam  ┆ 0.731994 ┆ B      │
│ 4   ┆ egg   ┆ 0.598658 ┆ C      │
│ 5   ┆ baz   ┆ 0.156019 ┆ B      │
└─────┴───────┴──────────┴────────┘

df_pd = pd.DataFrame(data)
print(df_pd)

   nrs names    random groups
0    1   foo  0.374540      A
1    2   ham  0.950714      A
2    3  spam  0.731994      B
3    4   egg  0.598658      C
4    5   baz  0.156019      B

Basic form

If selecting at least one column from the original dataframe, the behavior of df.select([..]) can be treated as df.loc(:, [..]).assign(..) in Pandas.

out_pl = df_pl.select(
    pl.col("nrs"),
    (pl.col("nrs") + 1).alias("add1"),
    (pl.col("nrs") - 1).alias("sub1"),
    (pl.col("nrs") * 2).alias("mul2"),
    (pl.col("nrs") / 3).alias("div3"),
)
print(out_pl)

shape: (5, 5)
┌─────┬──────┬──────┬──────┬──────────┐
│ nrs ┆ add1 ┆ sub1 ┆ mul2 ┆ div3     │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---      │
│ i64 ┆ i64  ┆ i64  ┆ i64  ┆ f64      │
╞═════╪══════╪══════╪══════╪══════════╡
│ 1   ┆ 2    ┆ 0    ┆ 2    ┆ 0.333333 │
│ 2   ┆ 3    ┆ 1    ┆ 4    ┆ 0.666667 │
│ 3   ┆ 4    ┆ 2    ┆ 6    ┆ 1.0      │
│ 4   ┆ 5    ┆ 3    ┆ 8    ┆ 1.333333 │
│ 5   ┆ 6    ┆ 4    ┆ 10   ┆ 1.666667 │
└─────┴──────┴──────┴──────┴──────────┘

out_pd = df_pd.loc[:, ["nrs"]].assign(
    add1=lambda df_: df_.nrs + 1,
    sub1=lambda df_: df_.nrs - 1,
    mul2=lambda df_: df_.nrs * 2,
    div3=lambda df_: df_.nrs / 3,
)
print(out_pd)

   nrs  add1  sub1  mul2      div3
0    1     2     0     2  0.333333
1    2     3     1     4  0.666667
2    3     4     2     6  1.000000
3    4     5     3     8  1.333333
4    5     6     4    10  1.666667

Generalized form

It's worth noting that df.select([..]) can also be used to create columns. Therefore, the behavior of df.select([..]) is akin to df.assign(..).drop(columns=..) in Pandas."

out_pl = df_pl.select(
    (pl.col("nrs") + 1).alias("add1"),
    (pl.col("nrs") - 1).alias("sub1"),
    (pl.col("nrs") * 2).alias("mul2"),
    (pl.col("nrs") / 3).alias("div3"),
)
print(out_pl)

shape: (5, 4)
┌──────┬──────┬──────┬──────────┐
│ add1 ┆ sub1 ┆ mul2 ┆ div3     │
│ ---  ┆ ---  ┆ ---  ┆ ---      │
│ i64  ┆ i64  ┆ i64  ┆ f64      │
╞══════╪══════╪══════╪══════════╡
│ 2    ┆ 0    ┆ 2    ┆ 0.333333 │
│ 3    ┆ 1    ┆ 4    ┆ 0.666667 │
│ 4    ┆ 2    ┆ 6    ┆ 1.0      │
│ 5    ┆ 3    ┆ 8    ┆ 1.333333 │
│ 6    ┆ 4    ┆ 10   ┆ 1.666667 │
└──────┴──────┴──────┴──────────┘

out_pd = df_pd.assign(
    add1=lambda df_: df_.nrs + 1,
    sub1=lambda df_: df_.nrs - 1,
    mul2=lambda df_: df_.nrs * 2,
    div3=lambda df_: df_.nrs / 3,
).drop(columns=df_pd.columns)
print(out_pd)

   add1  sub1  mul2      div3
0     2     0     2  0.333333
1     3     1     4  0.666667
2     4     2     6  1.000000
3     5     3     8  1.333333
4     6     4    10  1.666667

Example

While this example appears straightforward in Polars, there are some nuances in Pandas. Let's explore them.

out_pl = df_pl.select(
    pl.col("nrs").sum(),
    pl.col("names").sort(),
    pl.col("names").first().alias("first name"),
    (pl.col("nrs").mean() * 10).alias("10xnrs"),
)
print(out_pl)

shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ ---   ┆ ---        ┆ ---    │
│ i64 ┆ str   ┆ str        ┆ f64    │
╞═════╪═══════╪════════════╪════════╡
│ 15  ┆ baz   ┆ foo        ┆ 30.0   │
│ 15  ┆ egg   ┆ foo        ┆ 30.0   │
│ 15  ┆ foo   ┆ foo        ┆ 30.0   │
│ 15  ┆ ham   ┆ foo        ┆ 30.0   │
│ 15  ┆ spam  ┆ foo        ┆ 30.0   │
└─────┴───────┴────────────┴────────┘

out_pd = df_pd.assign(
    **{
        "nrs": lambda df_: df_.nrs.sum(),
        "names": lambda df_: df_.names.sort_values().reset_index(drop=True),
        "first name": df_pd.names.iloc[0],
        "10xnrs": df_pd.nrs.mean() * 10,
    }
).drop(columns=df_pd.columns.drop(["nrs", "names"]))
print(out_pd)

   nrs names first name  10xnrs
0   15   baz        foo    30.0
1   15   egg        foo    30.0
2   15   foo        foo    30.0
3   15   ham        foo    30.0
4   15  spam        foo    30.0

It's noted that, since one of the column names contains a space ('first name'), we need to use a dictionary to store each column name as the key and the corresponding operation as the value. Then, we can unpack this dictionary in pd.assign(..).

Here's a breakdown of the operations:

  • "nrs": This operation calculates the sum of the "nrs" column and broadcasts it to all rows.
  • "names": This operation sorts the values in the "names" column. To realign the index, we use df.reset_index(drop=True).
  • "first name": This operation selects the first row value of the "names" column and broadcasts it to all rows. It's important to note that since we reassigned "names" in the previous line, we need to use the original df_pd to retrieve the "names" column.
  • "10xnrs": This operation calculates the mean of the "nrs" column and broadcasts it to all rows. Similarly, since we reassigned "nrs" at the beginning, we need to use the original df_pd to retrieve the "nrs" column.

Additionally, we use .drop(columns=df_pd.columns.drop(["nrs", "names"])) to drop the columns that haven't been reassigned.

Tips

  • Whenever you wish you could use df.loc(..), df.iloc(..) or df[..] in Polars, you're likely looking for df.select([..]).
  • Notably, df.select([..]) not only lets you select columns but also enables you to create new columns simultaneously.

Reference

The examples in this section have been adapted from the Polars user guide.