## 4.7 Variable Transformations

In Section 4.3.1, we saw that filter works by taking one or more source columns and filtering it by applying a “filtering” function. To recap, here’s an example of filter using the source => f::Function syntax: filter(:name => name -> name == "Alice", df).

In Section 4.4, we saw that select can take one or more source columns and put it into one or more target columns source => target. Also to recap here’s an example: select(df, :name => :people_names).

In this section, we discuss how to transform variables, that is, how to modify data. In DataFrames.jl, the syntax is source => transformation => target.

Like before, we use the grades_2020 dataset:

function grades_2020()
name = ["Sally", "Bob", "Alice", "Hank"]
grade_2020 = [1, 5, 8.5, 4]
end
grades_2020()
Sally 1.0
Bob 5.0
Alice 8.5
Hank 4.0

Suppose we want to increase all the grades in grades_2020 by 1. First, we define a function that takes as argument a vector of data and returns all of its elements increased by 1. Then we use the transform function from DataFrames.jl that, like all native DataFrames.jl’s functions, takes a DataFrame as first argument followed by the transformation syntax:

plus_one(grades) = grades .+ 1
transform(grades_2020(), :grade_2020 => plus_one)
Sally 1.0 2.0
Bob 5.0 6.0
Alice 8.5 9.5
Hank 4.0 5.0

Here, the plus_one function receives the whole :grade_2020 column. That is the reason why we’ve added the broadcasting “dot” . before the plus + operator. For a recap on broadcasting please see Section 3.3.1.

Like we said above, the DataFrames.jl minilanguage is always source => transformation => target. So, if we want to keep the naming of the target column in the output, we can do:

transform(grades_2020(), :grade_2020 => plus_one => :grade_2020)
Sally 2.0
Bob 6.0
Alice 9.5
Hank 5.0

We can also use the keyword argument renamecols=false:

transform(grades_2020(), :grade_2020 => plus_one; renamecols=false)
Sally 2.0
Bob 6.0
Alice 9.5
Hank 5.0

The same transformation can also be written with select as follows:

select(grades_2020(), :, :grade_2020 => plus_one => :grade_2020)
Sally 2.0
Bob 6.0
Alice 9.5
Hank 5.0

where the : means “select all the columns” as described in Section 4.4. Alternatively, you can also use Julia’s broadcasting and modify the column grade_2020 by accessing it with df.grade_2020:

df = grades_2020()
df
Sally 2.0
Bob 6.0
Alice 9.5
Hank 5.0

But, although the last example is easier since it builds on more native Julia operations, we strongly advise to use the functions provided by DataFrames.jl in most cases because they are more capable and easier to work with.

### 4.7.1 Multiple Transformations

To show how to transform two columns at the same time, we use the left joined data from Section 4.6:

leftjoined = leftjoin(grades_2020(), grades_2021(); on=:name)
Sally 1.0 9.5
Hank 4.0 6.0
Bob 5.0 missing
Alice 8.5 missing

With this, we can add a column saying whether someone was approved by the criterion that one of their grades was above 5.5:

pass(A, B) = [5.5 < a || 5.5 < b for (a, b) in zip(A, B)]
transform(leftjoined, [:grade_2020, :grade_2021] => pass; renamecols=false)
Sally 1.0 9.5 true
Hank 4.0 6.0 true
Bob 5.0 missing missing
Alice 8.5 missing true

We can clean up the outcome and put the logic in a function to get a list of all the approved students:

function only_pass()
only_pass()
["Sally", "Hank", "Alice"]