4.10 Performance

So far, we haven’t thought about making our DataFrames.jl code fast. Like everything in Julia, DataFrames.jl can be really fast. In this section, we will give some performance tips and tricks.

4.10.1 In-place operations

As we explained in Section 3.3.2, functions that end with a bang ! are a common pattern to denote functions that modify one or more of their arguments. In the context of high performance Julia code, this means that functions with ! change the objects that we supply as arguments in place, instead of building new ones.

Almost all the DataFrames.jl functions that we’ve seen have a "! twin". For example, filter has an in-place filter!, select has select!, subset has subset!, dropmissing has dropmissing!, and so on. Notice that these functions do not return a new DataFrame, but instead they update the DataFrame that they act upon. Additionally, DataFrames.jl (version 1.3 onwards) supports an in-place left join with the function leftjoin!. This function updates the left DataFrame with the joined columns from the right DataFrame. The caveat is that each row of the left table must match at most one row in the right table.
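To make the leftjoin! behavior concrete, here is a minimal sketch. The two tables, their column names, and their values are made up for illustration and are not part of the book's data:

```julia
using DataFrames

# Hypothetical example tables; names and values are assumptions.
left = DataFrame(id=[1, 2, 3], name=["Ann", "Bob", "Cy"])
right = DataFrame(id=[1, 3], grade=[9.5, 7.0])

# leftjoin! adds the columns of `right` to `left` in place.
# Each row of `left` matches at most one row of `right`, as required.
leftjoin!(left, right; on=:id)

# `left` now has a `grade` column; the unmatched id 2 holds `missing`.
println(left)
```

No new DataFrame is bound here: the joined column shows up in left itself.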

If you want the highest speed and performance in your code, prefer the ! functions over the regular DataFrames.jl functions whenever you do not need to keep the original DataFrame.

Let’s go back to the example of the select function in the beginning of Section 4.4. Here is the responses DataFrame:

responses()
id q1 q2 q3 q4 q5
1 28 us F B A
2 61 fr B C E

Now, let’s perform the selection with the select function, like we did before:

select(responses(), :id, :q1)
id q1
1 28
2 61

And here is the in-place function:

select!(responses(), :id, :q1)
id q1
1 28
2 61
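Because responses() builds a fresh DataFrame on every call, the two outputs above look identical. The difference becomes visible once the table is kept in a variable. The sketch below assumes a small stand-in table (hypothetical, not the book's responses function):

```julia
using DataFrames

# Hypothetical stand-in for the responses() table used in this chapter.
df = DataFrame(id=[1, 2], q1=[28, 61], q2=["us", "fr"])

# select returns a new DataFrame and leaves `df` untouched.
selected = select(df, :id, :q1)
ncol(df)        # still 3
ncol(selected)  # 2

# select! drops the other columns from `df` itself and returns `df`.
result = select!(df, :id, :q1)
ncol(df)        # now 2
result === df   # true
```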

The @allocated macro tells us how much memory was allocated. In other words, how much new information the computer had to store in its memory while running the code. Let’s see how they will perform:

df = responses()
@allocated select(df, :id, :q1)
4512
df = responses()
@allocated select!(df, :id, :q1)
4096

As we can see, select! allocates less than select. So, it should be faster, while consuming less memory.

4.10.2 Copying vs Not Copying Columns

There are two ways to access a DataFrame column. They differ in whether the column is copied: one gives direct access to the column without copying and the other creates a whole new vector by copying the original column.

The first way uses the regular dot . operator followed by the column name, like in df.col. This kind of access does not copy the column col. Instead, df.col returns the column vector that the DataFrame itself stores, so no new memory is allocated and any change made through it is visible in the DataFrame. Additionally, the syntax df.col is the same as df[!, :col] with the bang ! as the row selector.

The second way to access a DataFrame column is the df[:, :col] with the colon : as the row selector. This kind of access does copy the column col, so beware that it may produce unwanted allocations.

As before, let’s try out these two ways to access a column in the responses DataFrame:

df = responses()
@allocated col = df[:, :id]
147152
df = responses()
@allocated col = df[!, :id]
0

When we access a column without copying it we are making zero allocations and our code should be faster. So, if you don’t need a copy, always access your DataFrame columns with df.col or df[!, :col] instead of df[:, :col].
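Skipping the copy also changes what happens when you write to the result. The following sketch, using a small made-up table, shows that the non-copying access shares memory with the DataFrame while the copying access does not:

```julia
using DataFrames

# Hypothetical two-row table for illustration.
df = DataFrame(id=[1, 2], q1=[28, 61])

no_copy = df[!, :id]   # same as df.id
copied = df[:, :id]

no_copy === df.id      # true: both refer to the stored vector
copied === df.id       # false: a separate vector with equal contents

# Writing through the non-copying access changes the DataFrame...
no_copy[1] = 100
df.id[1]               # 100

# ...while writing to the copy leaves the DataFrame as it was.
copied[2] = -1
df.id[2]               # 2
```

So the non-copying access is the faster choice, but only the copying access is safe if you plan to modify the result and want df to stay unchanged.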

4.10.3 CSV.read versus CSV.File

If you take a look at the help output for CSV.read, you will see that it is a convenience function that accepts the same keyword arguments as CSV.File. Both CSV.read and CSV.File will read the contents of a CSV file, but they differ in the default behavior. CSV.read, by default, will not make copies of the incoming data. Instead, CSV.read will pass all the data to the second argument (known as the “sink”).

So, something like this:

df = CSV.read("file.csv", DataFrame)

will pass all the incoming data from file.csv to the DataFrame sink, thus returning a DataFrame type that we store in the df variable.

For the case of CSV.File, the default behavior is the opposite: it will make copies of every column contained in the CSV file. Also, the syntax is slightly different. We need to wrap anything that CSV.File returns in a DataFrame constructor function:

df = DataFrame(CSV.File("file.csv"))

Or, with the pipe |> operator:

df = CSV.File("file.csv") |> DataFrame

Like we said, going through CSV.File will make copies of each column in the underlying CSV file. Ultimately, if you want the most performance, you should use CSV.read instead of CSV.File. That’s why we only covered CSV.read in Section 4.1.1.
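Both routes produce the same table; only the copying differs. Here is a minimal sketch that checks this on a throwaway file. The file contents are made up, and the copycols=false variant is shown on the assumption that the DataFrame constructor's copycols keyword is what controls the copy in the CSV.File route:

```julia
using CSV, DataFrames

# Write a small throwaway CSV file; path and contents are assumptions.
path = joinpath(mktempdir(), "file.csv")
write(path, "id,q1\n1,28\n2,61\n")

df_read = CSV.read(path, DataFrame)                     # no column copies
df_file = DataFrame(CSV.File(path))                     # copies each column
df_nocopy = DataFrame(CSV.File(path); copycols=false)   # skips the copy

df_read == df_file     # true: same contents either way
df_read == df_nocopy   # true
```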

4.10.4 CSV.jl Multiple Files

Now let’s turn our attention to CSV.jl. Specifically, to the case when we have multiple CSV files to read into a single DataFrame. Since version 0.9 of CSV.jl we can provide a vector of strings representing filenames. Before, we needed to read each file separately and then concatenate the results vertically into a single DataFrame. To exemplify, the code below reads from multiple CSV files and then concatenates them vertically using vcat into a single DataFrame with the reduce function:

files = filter(endswith(".csv"), readdir())
df = reduce(vcat, CSV.read(file, DataFrame) for file in files)

One additional trait is that reduce will not parallelize because it needs to keep the order of vcat which follows the same ordering of the files vector.

With this functionality in CSV.jl we simply pass the files vector into the CSV.read function:

files = filter(endswith(".csv"), readdir())
df = CSV.read(files, DataFrame)

CSV.jl will designate a file for each thread available in the computer while it lazily concatenates each thread-parsed output into a DataFrame. So we have the additional benefit of multithreading that we don’t have with the reduce option.
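To try both approaches side by side without touching your working directory, the sketch below writes two small files into a temporary directory first. The file names and contents are made up for illustration:

```julia
using CSV, DataFrames

# Create two small CSV files in a temporary directory (made-up data).
dir = mktempdir()
write(joinpath(dir, "a.csv"), "id,q1\n1,28\n2,61\n")
write(joinpath(dir, "b.csv"), "id,q1\n3,45\n")

# readdir(dir; join=true) returns full paths instead of bare file names.
files = filter(endswith(".csv"), readdir(dir; join=true))

# Option 1: read each file, then concatenate vertically.
df_reduce = reduce(vcat, [CSV.read(file, DataFrame) for file in files])

# Option 2: hand the whole vector of paths to CSV.read.
df_vector = CSV.read(files, DataFrame)

nrow(df_reduce) == nrow(df_vector)   # both hold all three rows
```

Note that the files must share the same columns for either option to produce a sensible result.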

4.10.5 CategoricalArrays.jl compression

If you are handling data with a lot of categorical values, i.e. many columns of text in which a small set of labels repeats over and over, you would probably benefit from using CategoricalArrays.jl compression.

By default, CategoricalArrays.jl will use an unsigned integer of size 32 bits UInt32 to represent the underlying categories:

typeof(categorical(["A", "B", "C"]))
CategoricalVector{String, UInt32, String, CategoricalValue{String, UInt32}, Union{}}

This means that CategoricalArrays.jl can represent up to \(2^{32}\) different categories in a given vector or column, which is a huge value (close to 4.3 billion). You probably would never need to have this sort of capacity in dealing with regular data. That’s why categorical has a compress argument that accepts either true or false to determine whether or not the underlying categorical data is compressed. If you pass compress=true, CategoricalArrays.jl will try to compress the underlying categorical data to the smallest possible representation in UInt. For example, the previous categorical vector would be represented as an unsigned integer of size 8 bits UInt8 (mostly because this is the smallest unsigned integer available in Julia):

typeof(categorical(["A", "B", "C"]; compress=true))
CategoricalVector{String, UInt8, String, CategoricalValue{String, UInt8}, Union{}}

What does this all mean? Suppose you have a big vector. For example, a vector with one million entries, but only 4 underlying categories: A, B, C, or D. If you do not compress the resulting categorical vector, you will have one million entries stored as UInt32. On the other hand, if you do compress it, you will have one million entries stored instead as UInt8. With the Base.summarysize function we can get the underlying size, in bytes, of a given object. So let’s quantify how much more memory we would need if we did not compress our categorical vector of one million entries:

using Random
one_mi_vec = rand(["A", "B", "C", "D"], 1_000_000)
Base.summarysize(categorical(one_mi_vec))
4000612

4 million bytes, which is approximately 3.8 MB. Don’t get us wrong, this is a good improvement over the raw string size:

Base.summarysize(one_mi_vec)
8000076

We reduced the raw data size by 50% by using the default CategoricalArrays.jl underlying representation as UInt32.

Now let’s see how we would fare with compression:

Base.summarysize(categorical(one_mi_vec; compress=true))
1000564

We reduced the size to 25% (one quarter) of the uncompressed categorical vector, and to about one eighth of the raw string vector, without losing information. Our compressed categorical vector now has 1 million bytes which is approximately 1.0 MB.

So whenever possible, in the interest of performance, consider using compress=true in your categorical data.
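If the categorical vector already exists, it does not have to be rebuilt from the raw strings. The sketch below assumes the compress function exported by CategoricalArrays.jl, which returns a compressed copy of an existing categorical array. The variable names and the sample size are made up for illustration:

```julia
using CategoricalArrays

# Hypothetical sample: 1,000 labels drawn from four categories.
labels = rand(["A", "B", "C", "D"], 1_000)

default = categorical(labels)                # UInt32 codes by default
direct = categorical(labels; compress=true)  # compressed at construction
later = compress(default)                    # compressed after the fact

# Compression changes how the codes are stored, not the values or levels.
direct == default
levels(later) == levels(default)

# The compressed copy takes less memory than the default one.
Base.summarysize(later) < Base.summarysize(default)
```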



CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso