Missing data
In Polars, missing data is consistently represented as a null value. Additionally, Polars permits Not a Number (NaN) values in float columns. It's important to avoid conflating these two concepts.
Setup
import numpy as np
import pandas as pd
import polars as pl
data = {"col1": [1, 2, 3], "col2": [1, None, 9]}
Missing data metadata
Is a missing value
Count the missing values
Filling missing data
Fill with specified literal value
Fill with a strategy
Fill with an expression
Fill with interpolation
NaN values
Similar to the null value, Polars has is_nan and fill_nan to work with the NaN value. However, it should be noted that there is no nan_count in Polars.
These NaN values can be created from NumPy's np.nan or the native Python float('nan').
shape: (4, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ 1.0   │
│ NaN   │
│ NaN   │
│ 3.0   │
└───────┘
Is a NaN value
shape: (4, 1)
┌───────┐
│ value │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
│ true  │
│ false │
└───────┘
Count the NaN values
Filling NaN
fill_literal_nan_df_pl = nan_df_pl.with_columns(pl.col("value").fill_nan(pl.lit(2)))
print(fill_literal_nan_df_pl)
shape: (4, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ 1.0   │
│ 2.0   │
│ 2.0   │
│ 3.0   │
└───────┘
Calculating the mean and median values
When calculating the mean or median of a column with NaN values, the result will be NaN. To change this behavior, replace NaN values with null values. With this change, null values will be excluded when calculating the mean or median of a column.
pd.NaT
It's worth noting that Pandas has a special pd.NaT, which serves as the time equivalent of NaN.
df_pd_nat = pd.DataFrame([pd.Timestamp("2023"), np.nan], columns=["col"])
print(df_pd_nat.dtypes, end="\n" * 2)
print(df_pd_nat)
More about filling with interpolation
While Polars provides linear and nearest interpolation strategies, Pandas offers a broader range. Several interpolation methods in df.interpolate of Pandas are adopted from the SciPy package.
out_pl = df_pl2.with_columns(
    linear=pl.col("col1").interpolate(method="linear"),
    nearest=pl.col("col1").interpolate(method="nearest"),
)
print(out_pl)
shape: (10, 3)
┌──────────┬──────────┬──────────┐
│ col1     ┆ linear   ┆ nearest  │
│ ---      ┆ ---      ┆ ---      │
│ f64      ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════════╡
│ 0.155995 ┆ 0.155995 ┆ 0.155995 │
│ 0.058084 ┆ 0.058084 ┆ 0.058084 │
│ 0.866176 ┆ 0.866176 ┆ 0.866176 │
│ 0.601115 ┆ 0.601115 ┆ 0.601115 │
│ …        ┆ …        ┆ …        │
│ null     ┆ 0.774611 ┆ 0.832443 │
│ 0.832443 ┆ 0.832443 ┆ 0.832443 │
│ 0.212339 ┆ 0.212339 ┆ 0.212339 │
│ 0.181825 ┆ 0.181825 ┆ 0.181825 │
└──────────┴──────────┴──────────┘
out_pd = df_pd2.assign(
    linear=lambda df_: df_.col1.interpolate(method="linear"),
    nearest=lambda df_: df_.col1.interpolate(method="nearest"),
    quadratic=lambda df_: df_.col1.interpolate(method="quadratic"),
    poly_order3=lambda df_: df_.col1.interpolate(method="polynomial", order=3),
    spline_order5=lambda df_: df_.col1.interpolate(method="spline", order=5),
)
print(out_pd)
       col1    linear   nearest  quadratic  poly_order3  spline_order5
0  0.155995  0.155995  0.155995   0.155995     0.155995       0.155995
1  0.058084  0.058084  0.058084   0.058084     0.058084       0.058084
2  0.866176  0.866176  0.866176   0.866176     0.866176       0.866176
3  0.601115  0.601115  0.601115   0.601115     0.601115       0.601115
4       NaN  0.658947  0.601115   0.494938     0.517777       1.084757
5       NaN  0.716779  0.601115   0.815400     0.767544       1.214026
6       NaN  0.774611  0.832443   1.086551     0.991928       1.100660
7  0.832443  0.832443  0.832443   0.832443     0.832443       0.832443
8  0.212339  0.212339  0.212339   0.212339     0.212339       0.212339
9  0.181825  0.181825  0.181825   0.181825     0.181825       0.181825
Reference
The examples in this section have been adapted from the Polars user guide.