4.7 Variable Transformations

In Section 4.3.1, we saw that filter works by taking one or more source columns and filtering it by applying a “filtering” function. To recap, here’s an example of filter using the source => f::Function syntax: filter(:name => name -> name == "Alice", df).

In Section 4.4, we saw that select can take one or more source columns and put it into one or more target columns source => target. Also to recap here’s an example: select(df, :name => :people_names).

In this section, we discuss how to transform variables, that is, how to modify data. In DataFrames.jl, the syntax is source => transformation => target.

Like before, we use the grades_2020 dataset:

function grades_2020()
    name = ["Sally", "Bob", "Alice", "Hank"]
    grade_2020 = [1, 5, 8.5, 4]
    DataFrame(; name, grade_2020)
end
grades_2020()
name grade_2020
Sally 1.0
Bob 5.0
Alice 8.5
Hank 4.0

Suppose we want to increase all the grades in grades_2020 by 1. First, we define a function that takes as argument a vector of data and returns all of its elements increased by 1. Then we use the transform function from DataFrames.jl that, like all native DataFrames.jl’s functions, takes a DataFrame as first argument followed by the transformation syntax:

plus_one(grades) = grades .+ 1
transform(grades_2020(), :grade_2020 => plus_one)
name grade_2020 grade_2020_plus_one
Sally 1.0 2.0
Bob 5.0 6.0
Alice 8.5 9.5
Hank 4.0 5.0

Here, the plus_one function receives the whole :grade_2020 column. That is the reason why we’ve added the broadcasting “dot” . before the plus + operator. For a recap on broadcasting please see Section 3.3.1.

Like we said above, the DataFrames.jl minilanguage is always source => transformation => target. So, if we want to keep the naming of the target column in the output, we can do:

transform(grades_2020(), :grade_2020 => plus_one => :grade_2020)
name grade_2020
Sally 2.0
Bob 6.0
Alice 9.5
Hank 5.0

We can also use the keyword argument renamecols=false:

transform(grades_2020(), :grade_2020 => plus_one; renamecols=false)
name grade_2020
Sally 2.0
Bob 6.0
Alice 9.5
Hank 5.0

The same transformation can also be written with select as follows:

select(grades_2020(), :, :grade_2020 => plus_one => :grade_2020)
name grade_2020
Sally 2.0
Bob 6.0
Alice 9.5
Hank 5.0

where the : means “select all the columns” as described in Section 4.4. Alternatively, you can also use Julia’s broadcasting and modify the column grade_2020 by accessing it with df.grade_2020:

df = grades_2020()
df.grade_2020 = plus_one.(df.grade_2020)
df
name grade_2020
Sally 2.0
Bob 6.0
Alice 9.5
Hank 5.0

But, although the last example is easier since it builds on more native Julia operations, we strongly advise to use the functions provided by DataFrames.jl in most cases because they are more capable and easier to work with.

4.7.1 Multiple Transformations

To show how to transform two columns at the same time, we use the left joined data from Section 4.6:

leftjoined = leftjoin(grades_2020(), grades_2021(); on=:name)
name grade_2020 grade_2021
Sally 1.0 9.5
Hank 4.0 6.0
Bob 5.0 missing
Alice 8.5 missing

With this, we can add a column saying whether someone was approved by the criterion that one of their grades was above 5.5:

pass(A, B) = [5.5 < a || 5.5 < b for (a, b) in zip(A, B)]
transform(leftjoined, [:grade_2020, :grade_2021] => pass; renamecols=false)
name grade_2020 grade_2021 grade_2020_grade_2021
Sally 1.0 9.5 true
Hank 4.0 6.0 true
Bob 5.0 missing missing
Alice 8.5 missing true

We can clean up the outcome and put the logic in a function to get a list of all the approved students:

function only_pass()
    leftjoined = leftjoin(grades_2020(), grades_2021(); on=:name)
    pass(A, B) = [5.5 < a || 5.5 < b for (a, b) in zip(A, B)]
    leftjoined = transform(leftjoined, [:grade_2020, :grade_2021] => pass => :pass)
    passed = subset(leftjoined, :pass; skipmissing=true)
    return passed.name
end
only_pass()
["Sally", "Hank", "Alice"]


Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso