Polars is a data frame in the Rust language using Apache Arrow Columnar Format. polars-ruby is the Ruby binding for Polars created by Andrew Kane.
Several members of the Ruby community have been deeply involved in the development of Apache Arrow.
Fast data processing with Ruby and Apache Arrow - rubykaigi2022
So while the Arrow C binding for the Ruby language is relatively well-developed, polars-df is not an Arrow C binding, but a binding to Polars implemented in Rust. magnus is used for the connection between Ruby and Rust. In fact, there is also a Ruby data frame that uses the Arrow binding, which is called RedAmber. But we are not talking about that now.
Please note that this post is incomplete and polars-df is still in the development phase, so the API is subject to change.
Documentation
Chapter 1 Getting started in Ruby
Installation
Ruby gem
gem install polars
From source code
git clone
https://github.com/ankane/polars-ruby
cd polars-ruby
bundle
bundle exec rake compile
bundle exec rake install
Quick start
Below we show a simple snippet that parses a CSV file, filters it, and finishes with a groupby operation. This example is presented in python only, as the "eager" API is not the preferred model in Rust.
require 'polars'
require 'uri'
df = Polars.read_csv(URI('https://j.mp/iriscsv'))
df.filter(Polars.col('sepal_length') > 5)
.groupby('species')
.agg(Polars.all.sum)
The snippet above will output:
shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ versicolor ┆ 281.9 ┆ 131.8 ┆ 202.9 ┆ 63.3 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ setosa ┆ 116.9 ┆ 81.7 ┆ 33.2 ┆ 6.1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ virginica ┆ 324.5 ┆ 146.2 ┆ 273.1 ┆ 99.6 │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
As we can see, Polars pretty-prints the output object, including the column name and datatype as headers.
Lazy quick start
If we want to run this query in lazy Polars we'd write:
require 'polars'
Polars.read_csv(URI('https://j.mp/iriscsv'))
.lazy
.filter(Polars.col('sepal_length') > 5)
.groupby('species')
.agg(Polars.all.sum)
.collect
shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ virginica ┆ 324.5 ┆ 146.2 ┆ 273.1 ┆ 99.6 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ setosa ┆ 116.9 ┆ 81.7 ┆ 33.2 ┆ 6.1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ versicolor ┆ 281.9 ┆ 131.8 ┆ 202.9 ┆ 63.3 │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
Chapter 2 Polars cheat sheet in Ruby
Creating / reading DataFrames
Create DataFrame
df = Polars::DataFrame.new({
nrs: [1, 2, 3, nil, 5],
names: ["foo", "ham", "spam", "egg", nil],
random: [0.3, 0.7, 0.1, 0.9, 0.6],
groups: %w[A A B C B],
})
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞══════╪═══════╪════════╪════════╡
│ 1 ┆ foo ┆ 0.3 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.7 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ spam ┆ 0.1 ┆ B │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg ┆ 0.9 ┆ C │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5 ┆ null ┆ 0.6 ┆ B │
└──────┴───────┴────────┴────────┘
Read CSV
iris = Polars.read_csv(URI('https://j.mp/iriscsv'),
has_header: true)
shape: (150, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ setosa │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ setosa │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.3 ┆ 2.5 ┆ 5.0 ┆ 1.9 ┆ virginica │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.5 ┆ 3.0 ┆ 5.2 ┆ 2.0 ┆ virginica │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.2 ┆ 3.4 ┆ 5.4 ┆ 2.3 ┆ virginica │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 5.9 ┆ 3.0 ┆ 5.1 ┆ 1.8 ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
Read parquet
Polars.read_parquet('file.parquet')
Expressions
df.filter(Polars.col('nrs') < 4) # symbols not work?
.groupby('groups')
.agg(Polars.all.sum)
shape: (2, 4)
┌────────┬─────┬───────┬────────┐
│ groups ┆ nrs ┆ names ┆ random │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 │
╞════════╪═════╪═══════╪════════╡
│ A ┆ 3 ┆ null ┆ 1.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ B ┆ 3 ┆ null ┆ 0.1 │
└────────┴─────┴───────┴────────┘
Subset Observations - rows
Filter: Extract rows that meet logical criteria
df.filter(Polars.col('random') > 0.5)
df.filter(
(Polars.col('groups') == 'B') &
(Polars.col('random') > 0.5)
)
shape: (1, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞═════╪═══════╪════════╪════════╡
│ 5 ┆ null ┆ 0.6 ┆ B │
└─────┴───────┴────────┴────────┘
Randomly select fraction of rows.
df.sample(frac: 0.5)
# Results are random.
shape: (1, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞═════╪═══════╪════════╪════════╡
│ 2 ┆ ham ┆ 0.7 ┆ A │
└─────┴───────┴────────┴────────┘
Randomly select n rows.
df.sample(n: 2)
# Results are random.
shape: (2, 4)
┌──────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞══════╪═══════╪════════╪════════╡
│ 3 ┆ spam ┆ 0.1 ┆ B │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg ┆ 0.9 ┆ C │
└──────┴───────┴────────┴────────┘
select first n rows.
df.head(2)
shape: (2, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞═════╪═══════╪════════╪════════╡
│ 1 ┆ foo ┆ 0.3 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.7 ┆ A │
└─────┴───────┴────────┴────────┘
select last n rows.
df.tail(2)
shape: (2, 4)
┌──────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞══════╪═══════╪════════╪════════╡
│ null ┆ egg ┆ 0.9 ┆ C │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5 ┆ null ┆ 0.6 ┆ B │
└──────┴───────┴────────┴────────┘
Subset Observations - columns
Select multiple columns with specific names
df.select(["nrs", "names"])
shape: (5, 2)
┌──────┬───────┐
│ nrs ┆ names │
│ --- ┆ --- │
│ i64 ┆ str │
╞══════╪═══════╡
│ 1 ┆ foo │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ ham │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ spam │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ egg │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5 ┆ null │
└──────┴───────┘
Select columns whose name matches regex
df.select(Polars.col("^n.*$"))
shape: (5, 2)
┌──────┬───────┐
│ nrs ┆ names │
│ --- ┆ --- │
│ i64 ┆ str │
╞══════╪═══════╡
│ 1 ┆ foo │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ ham │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ spam │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ egg │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5 ┆ null │
└──────┴───────┘
Subsets - rows and columns
Select rows 2-4
? # Yet Range support appears to be limited
Select columns in positions 1 and 3 (first column is 0)
?
Select rows meeting logical condition, and only the specific columns
?
Reshaping Data – Change layout, sorting, renaming
Append rows of DataFrames
Polars.concat([df, df2])
Append columns of DataFrames
Polars.concat([df, df3], how: "horizontal")
Gather columns into rows
df.melt(
id_vars: 'nrs',
value_vars: %w[names groups]
)
shape: (10, 3)
┌──────┬──────────┬───────┐
│ nrs ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞══════╪══════════╪═══════╡
│ 1 ┆ names ┆ foo │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ names ┆ ham │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ names ┆ spam │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ names ┆ egg │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ groups ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ groups ┆ B │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ groups ┆ C │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5 ┆ groups ┆ B │
└──────┴──────────┴───────┘
Spread rows into columns
df.pivot(values: 'nrs', index: 'groups',
columns: 'names')
shape: (3, 6)
┌────────┬──────┬──────┬──────┬──────┬──────┐
│ groups ┆ foo ┆ ham ┆ spam ┆ egg ┆ null │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪══════╪══════╪══════╪══════╪══════╡
│ A ┆ 1 ┆ 2 ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B ┆ null ┆ null ┆ 3 ┆ null ┆ 5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ C ┆ null ┆ null ┆ null ┆ null ┆ null │
└────────┴──────┴──────┴──────┴──────┴──────┘
Order rows by values of a column
# low to high
df.sort("random")
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞══════╪═══════╪════════╪════════╡
│ 3 ┆ spam ┆ 0.1 ┆ B │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ foo ┆ 0.3 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5 ┆ null ┆ 0.6 ┆ B │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.7 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg ┆ 0.9 ┆ C │
└──────┴───────┴────────┴────────┘
# high to low
df.sort("random", reverse: true)
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞══════╪═══════╪════════╪════════╡
│ null ┆ egg ┆ 0.9 ┆ C │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.7 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5 ┆ null ┆ 0.6 ┆ B │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ foo ┆ 0.3 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ spam ┆ 0.1 ┆ B │
└──────┴───────┴────────┴────────┘
Rename the columns of a DataFrame
df.rename({"nrs" => "idx"})
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ idx ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞══════╪═══════╪════════╪════════╡
│ 1 ┆ foo ┆ 0.3 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.7 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ spam ┆ 0.1 ┆ B │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg ┆ 0.9 ┆ C │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5 ┆ null ┆ 0.6 ┆ B │
└──────┴───────┴────────┴────────┘
Drop columns from DataFrame
df.drop(["names", "random"])
shape: (5, 2)
┌──────┬────────┐
│ nrs ┆ groups │
│ --- ┆ --- │
│ i64 ┆ str │
╞══════╪════════╡
│ 1 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ B │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ C │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5 ┆ B │
└──────┴────────┘
Summarize Data
Count number of rows with each unique value of variable
df["groups"].value_counts
shape: (3, 2)
┌────────┬────────┐
│ groups ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════╪════════╡
│ B ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ A ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ C ┆ 1 │
└────────┴────────┘
Number of rows in DataFrame
df.height
# => 5
Tuple of number of rows, number of columns in DataFrame
df.shape
# => [5, 4]
Number of distinct values in a column
df["groups"].n_unique
# => 3
Basic descriptive and statistics for each column
df.describe
shape: (7, 5)
┌────────────┬──────────┬───────┬──────────┬────────┐
│ describe ┆ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str ┆ f64 ┆ str │
╞════════════╪══════════╪═══════╪══════════╪════════╡
│ count ┆ 5.0 ┆ 5 ┆ 5.0 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null_count ┆ 1.0 ┆ 1 ┆ 0.0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ mean ┆ 2.75 ┆ null ┆ 0.52 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ std ┆ 1.707825 ┆ null ┆ 0.319374 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ min ┆ 1.0 ┆ egg ┆ 0.1 ┆ A │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ max ┆ 5.0 ┆ spam ┆ 0.9 ┆ C │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ median ┆ 2.5 ┆ null ┆ 0.6 ┆ null │
└────────────┴──────────┴───────┴──────────┴────────┘
Aggregation functions
sum
min
max
std
median
mean
quantile
first
df.select(
[
# Sum values
Polars.sum('random').alias('sum'),
# Minimum value
Polars.min('random').alias('min'),
# Maximum value
Polars.max('random').alias('max'),
# or
Polars.col('random').max.alias('other_max'),
# Standard deviation
Polars.std('random').alias('std dev'),
# Variance
Polars.var('random').alias('variance'),
# Median
Polars.median('random').alias('median'),
# Mean
Polars.mean('random').alias('mean'),
# Quantile
Polars.quantile('random', 0.75).alias('quantile_0.75'),
# or
Polars.col('random').quantile(0.75).alias('other_quantile_0.75'),
# First value
Polars.first('random').alias('first')
]
)
┌─────┬─────┬─────┬───────────┬─────┬──────┬────────────┬──────────────┬───────┐
│ sum ┆ min ┆ max ┆ other_max ┆ ... ┆ mean ┆ quantile_0 ┆ other_quanti ┆ first │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ .75 ┆ le_0.75 ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ --- ┆ --- ┆ f64 │
│ ┆ ┆ ┆ ┆ ┆ ┆ f64 ┆ f64 ┆ │
╞═════╪═════╪═════╪═══════════╪═════╪══════╪════════════╪══════════════╪═══════╡
│ 2.6 ┆ 0.1 ┆ 0.9 ┆ 0.9 ┆ ... ┆ 0.52 ┆ 0.7 ┆ 0.7 ┆ 0.3 │
└─────┴─────┴─────┴───────────┴─────┴──────┴────────────┴──────────────┴───────┘
Group Data
Group by values in column named "col", returning a GroupBy object
df.groupby("groups")
All of the aggregation functions from above can be applied to a group as well
df.groupby(by = 'groups').agg(
[
# Sum values
Polars.sum('random').alias('sum'),
# Minimum value
Polars.min('random').alias('min'),
# Maximum value
Polars.max('random').alias('max'),
# or
Polars.col('random').max.alias('other_max'),
# Standard deviation
Polars.std('random').alias('std_dev'),
# Variance
Polars.var('random').alias('variance'),
# Median
Polars.median('random').alias('median'),
# Mean
Polars.mean('random').alias('mean'),
# Quantile
Polars.quantile('random', 0.75).alias('quantile_0.75'),
# or
Polars.col('random').quantile(0.75).alias('other_quantile_0.75'),
# First value
Polars.first('random').alias('first')
]
)
shape: (3, 12)
┌────────┬─────┬─────┬─────┬─────┬──────┬───────────────┬──────────────┬───────┐
│ groups ┆ sum ┆ min ┆ max ┆ ... ┆ mean ┆ quantile_0.75 ┆ other_quanti ┆ first │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ le_0.75 ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ --- ┆ f64 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ f64 ┆ │
╞════════╪═════╪═════╪═════╪═════╪══════╪═══════════════╪══════════════╪═══════╡
│ C ┆ 0.9 ┆ 0.9 ┆ 0.9 ┆ ... ┆ 0.9 ┆ 0.9 ┆ 0.9 ┆ 0.9 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A ┆ 1.0 ┆ 0.3 ┆ 0.7 ┆ ... ┆ 0.5 ┆ 0.7 ┆ 0.7 ┆ 0.3 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ 0.7 ┆ 0.1 ┆ 0.6 ┆ ... ┆ 0.35 ┆ 0.6 ┆ 0.6 ┆ 0.1 │
└────────┴─────┴─────┴─────┴─────┴──────┴───────────────┴──────────────┴───────┘
Additional GroupBy functions
??
Handling Missing Data
Drop rows with any column having a null value
df.drop_nulls
shape: (3, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞═════╪═══════╪════════╪════════╡
│ 1 ┆ foo ┆ 0.3 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.7 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ spam ┆ 0.1 ┆ B │
└─────┴───────┴────────┴────────┘
Replace null values with given value
df.fill_null(42)
shape: (5, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞═════╪═══════╪════════╪════════╡
│ 1 ┆ foo ┆ 0.3 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.7 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ spam ┆ 0.1 ┆ B │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 42 ┆ egg ┆ 0.9 ┆ C │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5 ┆ 42 ┆ 0.6 ┆ B │
└─────┴───────┴────────┴────────┘
Replace null values using forward strategy
df.fill_null(strategy: "forward")
shape: (5, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞═════╪═══════╪════════╪════════╡
│ 1 ┆ foo ┆ 0.3 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.7 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ spam ┆ 0.1 ┆ B │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ egg ┆ 0.9 ┆ C │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5 ┆ egg ┆ 0.6 ┆ B │
└─────┴───────┴────────┴────────┘
Other fill strategies are "backward", "min", "max", "mean", "zero" and "one"
Replace floating point Nan values with given value
dfn = Polars::DataFrame.new(
{
"a" => [1.5, 2, Float::NAN, 4],
"b" => [0.5, 4, Float::NAN, 13]
}
)
dfn.fill_nan(99)
Make New Columns
Add a new columns to the DataFrame
df.with_column(
(Polars.col('random') * Polars.col('nrs'))
.alias('product')
)
shape: (5, 5)
┌──────┬───────┬────────┬────────┬─────────┐
│ nrs ┆ names ┆ random ┆ groups ┆ product │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str ┆ f64 │
╞══════╪═══════╪════════╪════════╪═════════╡
│ 1 ┆ foo ┆ 0.3 ┆ A ┆ 0.3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.7 ┆ A ┆ 1.4 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ spam ┆ 0.1 ┆ B ┆ 0.3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ null ┆ egg ┆ 0.9 ┆ C ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ null ┆ 0.6 ┆ B ┆ 3.0 │
└──────┴───────┴────────┴────────┴─────────┘
Add several new columns to the DataFrame
df.with_columns(
[
(Polars.col('random') * Polars.col('nrs'))
.alias('product'),
Polars.col('names').str.lengths
.alias('names_lengths')
]
)
shape: (5, 6)
┌──────┬───────┬────────┬────────┬─────────┬───────────────┐
│ nrs ┆ names ┆ random ┆ groups ┆ product ┆ names_lengths │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str ┆ f64 ┆ u32 │
╞══════╪═══════╪════════╪════════╪═════════╪═══════════════╡
│ 1 ┆ foo ┆ 0.3 ┆ A ┆ 0.3 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.7 ┆ A ┆ 1.4 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ spam ┆ 0.1 ┆ B ┆ 0.3 ┆ 4 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ egg ┆ 0.9 ┆ C ┆ null ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ null ┆ 0.6 ┆ B ┆ 3.0 ┆ null │
└──────┴───────┴────────┴────────┴─────────┴───────────────┘
Add a column at index 0 that counts the rows
df.with_row_count
shape: (5, 5)
┌────────┬──────┬───────┬────────┬────────┐
│ row_nr ┆ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ f64 ┆ str │
╞════════╪══════╪═══════╪════════╪════════╡
│ 0 ┆ 1 ┆ foo ┆ 0.3 ┆ A │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ ham ┆ 0.7 ┆ A │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ spam ┆ 0.1 ┆ B │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ null ┆ egg ┆ 0.9 ┆ C │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 5 ┆ null ┆ 0.6 ┆ B │
└────────┴──────┴───────┴────────┴────────┘
Rolling Functions
The following rolling functions are available
df.select(
[
# Rolling maximum value
Polars.col('random')
.rolling_max(2)
.alias('rolling_max'),
# Rolling mean value
Polars.col('random')
.rolling_mean(2)
.alias('rolling_mean'),
# Rolling median value
Polars.col('random')
.rolling_median(2, min_periods: 2)
.alias('rolling_median'),
# Rolling minimum value
Polars.col('random')
.rolling_min(2)
.alias('rolling_min'),
# Rolling standard deviation
Polars.col('random')
.rolling_std(2)
.alias('rolling_std'),
# Rolling sum values
Polars.col('random')
.rolling_sum(2)
.alias('rolling_sum'),
# Rolling variance
Polars.col('random')
.rolling_var(2)
.alias('rolling_var'),
# Rolling quantile
Polars.col('random')
.rolling_quantile(
0.75,
window_size: 2,
min_periods: 2
)
.alias('rolling_quantile'),
# Rolling skew
Polars.col('random')
.rolling_skew(2)
.alias('rolling_skew')
# Rolling custom function
# Polars.col('random')
# .rolling_apply(
# function = np.nanstd, window_size = 2
# )
# .alias('rolling_apply')
]
)
shape: (5, 9)
┌───────────┬────────────┬────────────┬───────────┬─────┬───────────┬───────────┬────────────┬────────────┐
│ rolling_m ┆ rolling_me ┆ rolling_me ┆ rolling_m ┆ ... ┆ rolling_s ┆ rolling_v ┆ rolling_qu ┆ rolling_sk │
│ ax ┆ an ┆ dian ┆ in ┆ ┆ um ┆ ar ┆ antile ┆ ew │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════╪════════════╪════════════╪═══════════╪═════╪═══════════╪═══════════╪════════════╪════════════╡
│ null ┆ null ┆ null ┆ null ┆ ... ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.7 ┆ 0.5 ┆ 0.5 ┆ 0.3 ┆ ... ┆ 1.0 ┆ 0.08 ┆ 0.7 ┆ -4.3368e-1 │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.7 ┆ 0.4 ┆ 0.4 ┆ 0.1 ┆ ... ┆ 0.8 ┆ 0.18 ┆ 0.7 ┆ 3.8549e-16 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.9 ┆ 0.5 ┆ 0.5 ┆ 0.1 ┆ ... ┆ 1.0 ┆ 0.32 ┆ 0.9 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.9 ┆ 0.75 ┆ 0.75 ┆ 0.6 ┆ ... ┆ 1.5 ┆ 0.045 ┆ 0.9 ┆ 0.0 │
└───────────┴────────────┴────────────┴───────────┴─────┴───────────┴───────────┴────────────┴────────────┘
Window Functions
Window functions allow to group by several columns simultaneously
df.select(
[
'names',
'groups',
Polars.col('random').sum.over('names')
.alias('sum_by_names'),
Polars.col('random').sum.over('groups')
.alias('sum_by_groups')
]
)
Combine Data Sets
df_1 = Polars::DataFrame.new(
{
"foo" => [1, 2, 3],
"bar" => [6.0, 7.0, 8.0],
"ham" => ["a", "b", "c"]
}
)
df_2 = Polars::DataFrame.new(
{
"apple" => ["x", "y", "z"],
"ham" => ["a", "b", "d"]
}
)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 6.0 ┆ a │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 7.0 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 8.0 ┆ c │
└─────┴─────┴─────┘
shape: (3, 2)
┌───────┬─────┐
│ apple ┆ ham │
│ --- ┆ --- │
│ str ┆ str │
╞═══════╪═════╡
│ x ┆ a │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ y ┆ b │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ z ┆ d │
└───────┴─────┘
Inner Join
Retains only rows with a match in the other set.
df_1.join(df_2, on: "ham")
df_1.join(df_2, on: "ham", how: "inner")
shape: (2, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str ┆ str │
╞═════╪═════╪═════╪═══════╡
│ 1 ┆ 6.0 ┆ a ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 7.0 ┆ b ┆ y │
└─────┴─────┴─────┴───────┘
Left Join
Retains each row from "left" set (df).
df_1.join(df_2, on: "ham", how: "left")
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str ┆ str │
╞═════╪═════╪═════╪═══════╡
│ 1 ┆ 6.0 ┆ a ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 7.0 ┆ b ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 8.0 ┆ c ┆ null │
└─────┴─────┴─────┴───────┘
Outer Join
Retains each row, even if no other matching row exists.
df_1.join(df_2, on: "ham", how: "outer")
shape: (4, 4)
┌──────┬──────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str ┆ str │
╞══════╪══════╪═════╪═══════╡
│ 1 ┆ 6.0 ┆ a ┆ x │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 7.0 ┆ b ┆ y │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ null ┆ d ┆ z │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 8.0 ┆ c ┆ null │
└──────┴──────┴─────┴───────┘
Anti Join
Contains all rows from df that do not have a match in other_df
df_1.join(df_2, on: "ham", how: "anti")
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3 ┆ 8.0 ┆ c │
└─────┴─────┴─────┘
Top comments (0)