Missing data
In Polars, missing data is always represented as a null value, regardless of data type. Polars also allows Not a Number (NaN) values in float columns. NaN is a valid floating-point value rather than missing data, so it's important not to conflate the two concepts.
Setup
import numpy as np
import pandas as pd
import polars as pl
data = {"col1": [1, 2, 3], "col2": [1, None, 9]}
Missing data metadata
Is a missing value
Count the missing values
Filling missing data
Fill with specified literal value
Fill with a strategy
Fill with an expression
Fill with interpolation
NaN values
As with null values, Polars provides is_nan and fill_nan for working with NaN values. Note, however, that Polars has no nan_count method.
These NaN values can be created from NumPy's np.nan or Python's native float('nan').
shape: (4, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ 1.0   │
│ NaN   │
│ NaN   │
│ 3.0   │
└───────┘
Is a NaN value
shape: (4, 1)
┌───────┐
│ value │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
│ true  │
│ false │
└───────┘
Count the NaN values
Filling NaN
fill_literal_nan_df_pl = nan_df_pl.with_columns(pl.col("value").fill_nan(pl.lit(2)))
print(fill_literal_nan_df_pl)
shape: (4, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ 1.0   │
│ 2.0   │
│ 2.0   │
│ 3.0   │
└───────┘
Calculating the mean and median values
The mean or median of a column that contains NaN values is itself NaN. To avoid this, replace the NaN values with null values, for example with fill_nan(None). Null values are excluded when calculating the mean or median of a column.
pd.NaT
It's worth noting that Pandas has a special value, pd.NaT, which serves as the datetime equivalent of NaN.
df_pd_nat = pd.DataFrame([pd.Timestamp("2023"), np.nan], columns=["col"])
print(df_pd_nat.dtypes, end="\n" * 2)
print(df_pd_nat)
More about filling with interpolation
While Polars provides only the linear and nearest interpolation strategies, Pandas offers a broader range.
Several of the interpolation methods available through df.interpolate in Pandas are adopted from the SciPy package.
out_pl = df_pl2.with_columns(
    linear=pl.col("col1").interpolate(method="linear"),
    nearest=pl.col("col1").interpolate(method="nearest"),
)
print(out_pl)
shape: (10, 3)
┌──────────┬──────────┬──────────┐
│ col1     ┆ linear   ┆ nearest  │
│ ---      ┆ ---      ┆ ---      │
│ f64      ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════════╡
│ 0.155995 ┆ 0.155995 ┆ 0.155995 │
│ 0.058084 ┆ 0.058084 ┆ 0.058084 │
│ 0.866176 ┆ 0.866176 ┆ 0.866176 │
│ 0.601115 ┆ 0.601115 ┆ 0.601115 │
│ …        ┆ …        ┆ …        │
│ null     ┆ 0.774611 ┆ 0.832443 │
│ 0.832443 ┆ 0.832443 ┆ 0.832443 │
│ 0.212339 ┆ 0.212339 ┆ 0.212339 │
│ 0.181825 ┆ 0.181825 ┆ 0.181825 │
└──────────┴──────────┴──────────┘
out_pd = df_pd2.assign(
    linear=lambda df_: df_.col1.interpolate(method="linear"),
    nearest=lambda df_: df_.col1.interpolate(method="nearest"),
    quadratic=lambda df_: df_.col1.interpolate(method="quadratic"),
    poly_order3=lambda df_: df_.col1.interpolate(method="polynomial", order=3),
    spline_order5=lambda df_: df_.col1.interpolate(method="spline", order=5),
)
print(out_pd)
       col1    linear   nearest  quadratic  poly_order3  spline_order5
0  0.155995  0.155995  0.155995   0.155995     0.155995       0.155995
1  0.058084  0.058084  0.058084   0.058084     0.058084       0.058084
2  0.866176  0.866176  0.866176   0.866176     0.866176       0.866176
3  0.601115  0.601115  0.601115   0.601115     0.601115       0.601115
4       NaN  0.658947  0.601115   0.494938     0.517777       1.084757
5       NaN  0.716779  0.601115   0.815400     0.767544       1.214026
6       NaN  0.774611  0.832443   1.086551     0.991928       1.100660
7  0.832443  0.832443  0.832443   0.832443     0.832443       0.832443
8  0.212339  0.212339  0.212339   0.212339     0.212339       0.212339
9  0.181825  0.181825  0.181825   0.181825     0.181825       0.181825
Reference
The examples in this section have been adapted from the Polars user guide.