Groupby and Combine - Julia Data Science

4.8 Groupby and Combine

In the R programming language, Wickham (2011) popularized the so-called split-apply-combine strategy for data transformations. In essence, this strategy splits a dataset into distinct groups, applies one or more functions to each group, and then combines the results. DataFrames.jl fully supports split-apply-combine. We will use the student grades example as before. Suppose that we want to know each student’s mean grade:

function all_grades()
    df1 = grades_2020()
    df1 = select(df1, :name, :grade_2020 => :grade)
    df2 = grades_2021()
    df2 = select(df2, :name, :grade_2021 => :grade)
    rename_bob2(data_col) = replace.(data_col, "Bob 2" => "Bob")
    df2 = transform(df2, :name => rename_bob2 => :name)
    return vcat(df1, df2)
end
all_grades()

name	grade
Sally	1.0
Bob	5.0
Alice	8.5
Hank	4.0
Bob	9.5
Sally	9.5
Hank	6.0

The strategy is to split the dataset by student, apply the mean function to each student's grades, and combine the results.

The split step is called groupby. Its second argument is the column whose values define the groups:

groupby(all_grades(), :name)

GroupedDataFrame with 4 groups based on key: name
Group 1 (2 rows): name = "Sally"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Sally       1.0
   2 │ Sally       9.5
Group 2 (2 rows): name = "Bob"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Bob         5.0
   2 │ Bob         9.5
Group 3 (1 row): name = "Alice"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Alice       8.5
Group 4 (2 rows): name = "Hank"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Hank        4.0
   2 │ Hank        6.0
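
A grouped data frame can be inspected before anything is applied to it. The following is a minimal sketch, assuming DataFrames.jl is installed; the `grades` table is typed in by hand from the output above as a stand-in for `all_grades()`, and the variable names are only illustrative. It counts the groups and looks up one group by its key:

```julia
using DataFrames

# Stand-in for all_grades(): the same rows as the table above, typed in by hand.
grades = DataFrame(
    name = ["Sally", "Bob", "Alice", "Hank", "Bob", "Sally", "Hank"],
    grade = [1.0, 5.0, 8.5, 4.0, 9.5, 9.5, 6.0],
)

gdf = groupby(grades, :name)

# One group per distinct name.
n_groups = length(gdf)

# Look up a single group by its key, given as a named tuple.
sally = gdf[(name = "Sally",)]
sally_grades = sally.grade
```

Each group is a view into the original data frame, so no rows are copied by the split itself.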

We apply the mean function from Julia’s standard library Statistics module:

using Statistics

To apply this function to each group, pass the grouped data frame and a source column => function pair to the combine function:

gdf = groupby(all_grades(), :name)
combine(gdf, :grade => mean)

name	grade_mean
Sally	5.25
Bob	7.25
Alice	8.5
Hank	5.0
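
By default the result column is named after the source column and the function (grade_mean). As a hedged sketch, assuming DataFrames.jl is installed and again using a hand-typed stand-in for `all_grades()`, a third element in the pair sets the target column name explicitly; the names `:mean_grade` and `:n_grades` are our own choices for illustration:

```julia
using DataFrames
using Statistics

# Stand-in for all_grades(): the same rows as in the example, typed in by hand.
grades = DataFrame(
    name = ["Sally", "Bob", "Alice", "Hank", "Bob", "Sally", "Hank"],
    grade = [1.0, 5.0, 8.5, 4.0, 9.5, 9.5, 6.0],
)
gdf = groupby(grades, :name)

# source column => function => target column name
result = combine(gdf, :grade => mean => :mean_grade)

# Several summaries can be requested in one call;
# nrow counts the rows in each group.
stats = combine(gdf, :grade => mean => :mean_grade, nrow => :n_grades)
```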

Imagine having to do this without the groupby and combine functions. We would need to loop over our data to split it up into groups, then loop over each split to apply a function, and finally loop over each group to gather the final result. Therefore, the split-apply-combine technique is a great one to know.
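
To make the comparison concrete, here is a rough sketch of what the manual version could look like, using only plain vectors and a dictionary. The grades are typed in by hand from the table above, and all variable names are illustrative rather than part of any library:

```julia
using Statistics

names_col = ["Sally", "Bob", "Alice", "Hank", "Bob", "Sally", "Hank"]
grades_col = [1.0, 5.0, 8.5, 4.0, 9.5, 9.5, 6.0]

# Split: collect each student's grades under their name.
groups = Dict{String, Vector{Float64}}()
for (name, grade) in zip(names_col, grades_col)
    push!(get!(groups, name, Float64[]), grade)
end

# Apply: compute the mean of every group.
means = Dict(name => mean(grades) for (name, grades) in groups)

# Combine: gather the results in order of first appearance.
result_names = unique(names_col)
result_means = [means[name] for name in result_names]
```

This takes three separate passes and some bookkeeping to reproduce what groupby and combine express in two lines.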

4.8.1 Multiple Source Columns

But what if we want to apply a function to multiple columns of our dataset?

group = [:A, :A, :B, :B]
X = 1:4
Y = 5:8
df = DataFrame(; group, X, Y)

group	X	Y
A	1	5
A	2	6
B	3	7
B	4	8

This is accomplished in a similar manner:

gdf = groupby(df, :group)
combine(gdf, [:X, :Y] .=> mean; renamecols=false)

group	X	Y
A	1.5	5.5
B	3.5	7.5

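Note that we’ve used the dot . operator before the right arrow => to broadcast it: the expression [:X, :Y] .=> mean pairs each source column with mean, so the function is applied to :X and :Y separately. The keyword argument renamecols=false keeps the original column names instead of appending the function name (as in grade_mean earlier).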
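
The effect of the dot can be checked without any data frame at all. This small sketch shows that the broadcasted form is just a vector of ordinary pairs; the variable names are only for illustration:

```julia
using Statistics

# Broadcasting => over the vector pairs every column name with the same function.
broadcasted = [:X, :Y] .=> mean

# The same vector of pairs, written out by hand.
explicit = [:X => mean, :Y => mean]

same = broadcasted == explicit
```

Because the two forms are equal, either one can be passed to combine; the broadcasted version simply scales better as the number of columns grows.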

To apply several functions in sequence, a simple approach is to define a new function that performs the intended steps. For instance, for a series of values, let’s first take the mean and then round it to a whole number (an integer, Int):

gdf = groupby(df, :group)
rounded_mean(data_col) = round(Int, mean(data_col))
combine(gdf, [:X, :Y] .=> rounded_mean; renamecols=false)

group	X	Y
A	2	6
B	4	8
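
As an alternative sketch, the same helper can be built with Julia's function composition operator ∘ instead of a nested call. The helper `to_int` is a hypothetical name introduced here for illustration, and the inputs are the X values of group A and the Y values of group B from the table above:

```julia
using Statistics

# Hypothetical helper: round to an Int, written so it takes a single argument.
to_int(x) = round(Int, x)

# `to_int ∘ mean` first takes the mean, then rounds the result.
rounded_mean = to_int ∘ mean

x_group_a = rounded_mean([1, 2])   # mean is 1.5
y_group_b = rounded_mean([7, 8])   # mean is 7.5
```

Both results match the table (2 and 8). Keep in mind that round breaks ties towards the nearest even number by default, so a mean of 2.5 would round to 2, not 3.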

CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso