4.8 Groupby and Combine

In the R programming language, Wickham (2011) has popularized the, so called, split-apply-combine strategy for data transformations. In essence, this strategy splits a dataset into distinct groups, applies one or more functions to each group, and then combines the result. DataFrames.jl fully supports split-apply-combine. We will use student grades example like before. Suppose that we want to know each student’s mean grade:

function all_grades()
    df1 = grades_2020()
    df1 = select(df1, :name, :grade_2020 => :grade)
    df2 = grades_2021()
    df2 = select(df2, :name, :grade_2021 => :grade)
    rename_bob2(data_col) = replace.(data_col, "Bob 2" => "Bob")
    df2 = transform(df2, :name => rename_bob2 => :name)
    return vcat(df1, df2)
end
all_grades()
name grade
Sally 1.0
Bob 5.0
Alice 8.5
Hank 4.0
Bob 9.5
Sally 9.5
Hank 6.0

The strategy is to split the dataset into distinct students, apply the mean function to each student, and combine the result.

The split is called groupby and we give as second argument the column ID that we want to split the dataset into:

groupby(all_grades(), :name)
GroupedDataFrame with 4 groups based on key: name
Group 1 (2 rows): name = "Sally"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Sally       1.0
   2 │ Sally       9.5
Group 2 (2 rows): name = "Bob"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Bob         5.0
   2 │ Bob         9.5
Group 3 (1 row): name = "Alice"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Alice       8.5
Group 4 (2 rows): name = "Hank"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Hank        4.0
   2 │ Hank        6.0

We apply the mean function from Julia’s standard library Statistics module:

using Statistics

Applying this function is done by using the combine function:

gdf = groupby(all_grades(), :name)
combine(gdf, :grade => mean)
name grade_mean
Sally 5.25
Bob 7.25
Alice 8.5
Hank 5.0

Imagine having to do this without the groupby and combine functions. We would need to loop over our data to split it up into groups, then loop over each split to apply a function, and finally loop over each group to gather the final result. Therefore, the split-apply-combine technique is a great one to know.

4.8.1 Multiple Source Columns

But what if we want to apply a function on multiple columns of our dataset?

group = [:A, :A, :B, :B]
X = 1:4
Y = 5:8
df = DataFrame(; group, X, Y)
group X Y
A 1 5
A 2 6
B 3 7
B 4 8

This is accomplished in a similar manner:

gdf = groupby(df, :group)
combine(gdf, [:X, :Y] .=> mean; renamecols=false)
group X Y
A 1.5 5.5
B 3.5 7.5

Note that we’ve used the dot . operator before the right arrow => to indicate that the mean has to be applied to multiple source columns [:X, :Y].

To use composable functions, a simple way is to create a function that does the intended composable transformations. For instance, for a series of values, let’s first take the mean followed by round to a whole number (also known as an integer Int):

gdf = groupby(df, :group)
rounded_mean(data_col) = round(Int, mean(data_col))
combine(gdf, [:X, :Y] .=> rounded_mean; renamecols=false)
group X Y
A 2 6
B 4 8


CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer and Lazaro Alonso