At the start of this chapter, we showed multiple tables and raised questions also related to multiple tables. However, we haven’t talked about combining tables yet, which we will do in this section. In DataFrames.jl
, combining multiple tables is done via joins. Joins are extremely powerful, but it might take a while to wrap your head around them. It is not necessary to know the joins below by heart, because the DataFrames.jl
documentation, along with this book, will list them for you. But, it’s essential to know that joins exist. If you ever find yourself looping over rows in a DataFrame
and comparing it with other data, then you probably need one of the joins below.
In Section 4, we’ve introduced the grades for 2020 with grades_2020
:
grades_2020()
name | grade_2020 |
---|---|
Sally | 1.0 |
Bob | 5.0 |
Alice | 8.5 |
Hank | 4.0 |
Now, we’re going to combine grades_2020
with grades from 2021:
grades_2021()
name | grade_2021 |
---|---|
Bob 2 | 9.5 |
Sally | 9.5 |
Hank | 6.0 |
To do this, we are going to use joins. DataFrames.jl
lists no less than seven kinds of join. This might seem daunting at first, but hang on because they are all useful and we will showcase them all.
This first is innerjoin
. Suppose that we have two datasets A
and B
with respectively columns A_1, A_2, ..., A_n
and B_1, B_2, ..., B_m
and one of the columns has the same name, say A_1
and B_1
are both called :id
. Then, the inner join on :id
will go through all the elements in A_1
and compare it to the elements in B_1
. If the elements are the same, then it will add all the information from A_2, ..., A_n
and B_2, ..., B_m
after the :id
column.
Okay, so no worries if you didn’t get this description. The result on the grades datasets looks like this:
innerjoin(grades_2020(), grades_2021(); on=:name)
name | grade_2020 | grade_2021 |
---|---|---|
Sally | 1.0 | 9.5 |
Hank | 4.0 | 6.0 |
Note that only “Sally” and “Hank” are in both datasets. The name inner join makes sense since, in mathematics, the set intersection is defined by “all elements in \(A\), that are also in \(B\), or all elements in \(B\) that are also in \(A\)”.
Maybe you’re now thinking “aha, if we have an inner, then we probably also have an outer”. Yes, you’ve guessed right!
The outerjoin
is much less strict than the innerjoin
and just takes any row it can find which contains a name in at least one of the datasets:
outerjoin(grades_2020(), grades_2021(); on=:name)
name | grade_2020 | grade_2021 |
---|---|---|
Sally | 1.0 | 9.5 |
Hank | 4.0 | 6.0 |
Bob | 5.0 | missing |
Alice | 8.5 | missing |
Bob 2 | missing | 9.5 |
So, this method can create missing
data even though none of the original datasets had missing values.
We can get even more missing
data if we use the crossjoin
. This gives the Cartesian product of the rows, which is basically multiplication of rows, that is, for every row create a combination with any other row:
crossjoin(grades_2020(), grades_2021(); on=:id)
MethodError: no method matching crossjoin(::DataFrame, ::DataFrame; on::Symbol)
Closest candidates are:
crossjoin(::DataFrames.AbstractDataFrame, ::DataFrames.AbstractDataFrame; makeunique, renamecols) got unsupported keyword argument "on"
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/join/composer.jl:1567
crossjoin(::DataFrames.AbstractDataFrame, ::DataFrames.AbstractDataFrame, DataFrames.AbstractDataFrame...; makeunique) got unsupported keyword argument "on"
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/join/composer.jl:1591
...
Oops. Since crossjoin
doesn’t take the elements in the row into account, we don’t need to specify the on
argument for what we want to join:
crossjoin(grades_2020(), grades_2021())
ArgumentError: Duplicate variable names: :name. Pass makeunique=true to make them unique using a suffix automatically.
Stacktrace:
[1] make_unique!(names::Vector{Symbol}, src::Vector{Symbol}; makeunique::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/other/utils.jl:99
[2] make_unique!
@ ~/.julia/packages/DataFrames/kcA9R/src/other/utils.jl:77 [inlined]
[3] #make_unique#4
@ ~/.julia/packages/DataFrames/kcA9R/src/other/utils.jl:121 [inlined]
[4] make_unique
@ ~/.julia/packages/DataFrames/kcA9R/src/other/utils.jl:120 [inlined]
[5] #Index#7
@ ~/.julia/packages/DataFrames/kcA9R/src/other/index.jl:27 [inlined]
[6] Index
...
Oops again. This is a very common error with DataFrame
s and join
s. The tables for the 2020 and 2021 grades have a duplicate column name, namely :name
. Like before, the error that DataFrames.jl
outputs shows a simple suggestion that might fix the issue. We can just pass makeunique=true
to solve this:
crossjoin(grades_2020(), grades_2021(); makeunique=true)
name | grade_2020 | name_1 | grade_2021 |
---|---|---|---|
Sally | 1.0 | Bob 2 | 9.5 |
Sally | 1.0 | Sally | 9.5 |
Sally | 1.0 | Hank | 6.0 |
Bob | 5.0 | Bob 2 | 9.5 |
Bob | 5.0 | Sally | 9.5 |
Bob | 5.0 | Hank | 6.0 |
Alice | 8.5 | Bob 2 | 9.5 |
Alice | 8.5 | Sally | 9.5 |
Alice | 8.5 | Hank | 6.0 |
Hank | 4.0 | Bob 2 | 9.5 |
Hank | 4.0 | Sally | 9.5 |
Hank | 4.0 | Hank | 6.0 |
So, now, we have one row for each grade from everyone in grades 2020 and grades 2021 datasets. For direct queries, such as “who has the highest grade?”, the Cartesian product is usually not so useful, but for “statistical” queries, it can be.
More useful for scientific data projects are the leftjoin
and rightjoin
. The left join gives all the elements in the left DataFrame
:
leftjoin(grades_2020(), grades_2021(); on=:name)
name | grade_2020 | grade_2021 |
---|---|---|
Sally | 1.0 | 9.5 |
Hank | 4.0 | 6.0 |
Bob | 5.0 | missing |
Alice | 8.5 | missing |
Here, grades for “Bob” and “Alice” were missing
in the grades 2021 table, so that’s why there are also missing
elements. The right join does sort of the opposite:
rightjoin(grades_2020(), grades_2021(); on=:name)
name | grade_2020 | grade_2021 |
---|---|---|
Sally | 1.0 | 9.5 |
Hank | 4.0 | 6.0 |
Bob 2 | missing | 9.5 |
Now, grades in 2020 are missing.
Note that leftjoin(A, B) != rightjoin(B, A)
, because the order of the columns will differ. For example, compare the output below to the previous output:
leftjoin(grades_2021(), grades_2020(); on=:name)
name | grade_2021 | grade_2020 |
---|---|---|
Sally | 9.5 | 1.0 |
Hank | 6.0 | 4.0 |
Bob 2 | 9.5 | missing |
Lastly, we have the semijoin
and antijoin
.
The semi join is even more restrictive than the inner join. It returns only the elements from the left DataFrame
which are in both DataFrame
s. This is like a combination of the left join with the inner join.
semijoin(grades_2020(), grades_2021(); on=:name)
name | grade_2020 |
---|---|
Sally | 1.0 |
Hank | 4.0 |
The opposite of the semi join is the anti join. It returns only the elements from the left DataFrame
which are not in the right DataFrame
:
antijoin(grades_2020(), grades_2021(); on=:name)
name | grade_2020 |
---|---|
Bob | 5.0 |
Alice | 8.5 |