Join - Julia Data Science

4.6 Join

At the start of this chapter, we showed multiple tables and raised questions also related to multiple tables. However, we haven’t talked about combining tables yet, which we will do in this section. In DataFrames.jl, combining multiple tables is done via joins. Joins are extremely powerful, but it might take a while to wrap your head around them. It is not necessary to know the joins below by heart, because the DataFrames.jl documentation, along with this book, will list them for you. But, it’s essential to know that joins exist. If you ever find yourself looping over rows in a DataFrame and comparing it with other data, then you probably need one of the joins below.

In Section 4, we’ve introduced the grades for 2020 with grades_2020:

grades_2020()

name	grade_2020
Sally	1.0
Bob	5.0
Alice	8.5
Hank	4.0

Now, we’re going to combine grades_2020 with grades from 2021:

grades_2021()

name	grade_2021
Bob 2	9.5
Sally	9.5
Hank	6.0

To do this, we are going to use joins. DataFrames.jl lists no less than seven kinds of join. This might seem daunting at first, but hang on because they are all useful and we will showcase them all.

4.6.1 innerjoin

This first is innerjoin. Suppose that we have two datasets A and B with respectively columns A_1, A_2, ..., A_n and B_1, B_2, ..., B_m and one of the columns has the same name, say A_1 and B_1 are both called :id. Then, the inner join on :id will go through all the elements in A_1 and compare it to the elements in B_1. If the elements are the same, then it will add all the information from A_2, ..., A_n and B_2, ..., B_m after the :id column.

Okay, so no worries if you didn’t get this description. The result on the grades datasets looks like this:

innerjoin(grades_2020(), grades_2021(); on=:name)

name	grade_2020	grade_2021
Sally	1.0	9.5
Hank	4.0	6.0

Note that only “Sally” and “Hank” are in both datasets. The name inner join makes sense since, in mathematics, the set intersection is defined by “all elements in \(A\), that are also in \(B\), or all elements in \(B\) that are also in \(A\)”.

4.6.2 outerjoin

Maybe you’re now thinking “aha, if we have an inner, then we probably also have an outer”. Yes, you’ve guessed right!

The outerjoin is much less strict than the innerjoin and just takes any row it can find which contains a name in at least one of the datasets:

outerjoin(grades_2020(), grades_2021(); on=:name)

name	grade_2020	grade_2021
Sally	1.0	9.5
Hank	4.0	6.0
Bob	5.0	missing
Alice	8.5	missing
Bob 2	missing	9.5

So, this method can create missing data even though none of the original datasets had missing values.

4.6.3 crossjoin

We can get even more missing data if we use the crossjoin. This gives the Cartesian product of the rows, which is basically multiplication of rows, that is, for every row create a combination with any other row:

crossjoin(grades_2020(), grades_2021(); on=:id)

MethodError: no method matching crossjoin(::DataFrame, ::DataFrame; on::Symbol)
This error has been manually thrown, explicitly, so the method may exist but be intentionally marked as unimplemented.

Closest candidates are:
  crossjoin(::DataFrames.AbstractDataFrame, ::DataFrames.AbstractDataFrame; makeunique, renamecols) got unsupported keyword argument "on"
   @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/join/composer.jl:1567
  crossjoin(::DataFrames.AbstractDataFrame, ::DataFrames.AbstractDataFrame, DataFrames.AbstractDataFrame...; makeunique) got unsupported keyword argument "on"
   @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/join/composer.jl:1591

  ...

Oops. Since crossjoin doesn’t take the elements in the row into account, we don’t need to specify the on argument for what we want to join:

crossjoin(grades_2020(), grades_2021())

ArgumentError: Duplicate variable names: :name. Pass makeunique=true to make them unique using a suffix automatically.
Stacktrace:
  [1] make_unique!(names::Vector{Symbol}, src::Vector{Symbol}; makeunique::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/other/utils.jl:99
  [2] make_unique!
    @ ~/.julia/packages/DataFrames/kcA9R/src/other/utils.jl:77 [inlined]
  [3] #make_unique#4
    @ ~/.julia/packages/DataFrames/kcA9R/src/other/utils.jl:121 [inlined]
  [4] make_unique
    @ ~/.julia/packages/DataFrames/kcA9R/src/other/utils.jl:120 [inlined]
  [5] #Index#7
    @ ~/.julia/packages/DataFrames/kcA9R/src/other/index.jl:27 [inlined]
  [6] Index
  ...

Oops again. This is a very common error with DataFrames and joins. The tables for the 2020 and 2021 grades have a duplicate column name, namely :name. Like before, the error that DataFrames.jl outputs shows a simple suggestion that might fix the issue. We can just pass makeunique=true to solve this:

crossjoin(grades_2020(), grades_2021(); makeunique=true)

name	grade_2020	name_1	grade_2021
Sally	1.0	Bob 2	9.5
Sally	1.0	Sally	9.5
Sally	1.0	Hank	6.0
Bob	5.0	Bob 2	9.5
Bob	5.0	Sally	9.5
Bob	5.0	Hank	6.0
Alice	8.5	Bob 2	9.5
Alice	8.5	Sally	9.5
Alice	8.5	Hank	6.0
Hank	4.0	Bob 2	9.5
Hank	4.0	Sally	9.5
Hank	4.0	Hank	6.0

So, now, we have one row for each grade from everyone in grades 2020 and grades 2021 datasets. For direct queries, such as “who has the highest grade?”, the Cartesian product is usually not so useful, but for “statistical” queries, it can be.

4.6.4 leftjoin and rightjoin

More useful for scientific data projects are the leftjoin and rightjoin. The left join gives all the elements in the left DataFrame:

leftjoin(grades_2020(), grades_2021(); on=:name)

name	grade_2020	grade_2021
Sally	1.0	9.5
Hank	4.0	6.0
Bob	5.0	missing
Alice	8.5	missing

Here, grades for “Bob” and “Alice” were missing in the grades 2021 table, so that’s why there are also missing elements. The right join does sort of the opposite:

rightjoin(grades_2020(), grades_2021(); on=:name)

name	grade_2020	grade_2021
Sally	1.0	9.5
Hank	4.0	6.0
Bob 2	missing	9.5

Now, grades in 2020 are missing.

Note that leftjoin(A, B) != rightjoin(B, A), because the order of the columns will differ. For example, compare the output below to the previous output:

leftjoin(grades_2021(), grades_2020(); on=:name)

name	grade_2021	grade_2020
Sally	9.5	1.0
Hank	6.0	4.0
Bob 2	9.5	missing

4.6.5 semijoin and antijoin

Lastly, we have the semijoin and antijoin.

The semi join is even more restrictive than the inner join. It returns only the elements from the left DataFrame which are in both DataFrames. This is like a combination of the left join with the inner join.

semijoin(grades_2020(), grades_2021(); on=:name)

name	grade_2020
Sally	1.0
Hank	4.0

The opposite of the semi join is the anti join. It returns only the elements from the left DataFrame which are not in the right DataFrame:

antijoin(grades_2020(), grades_2021(); on=:name)

name	grade_2020
Bob	5.0
Alice	8.5

4.5 Types and Categorical Da.. ← → 4.7 Variable Transformations

Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso