4.9 Missing Data

Let’s dive into how to handle missing values in DataFrames.jl. We’ll cover three main approaches for dealing with missing data:

  1. filtering missing values with ismissing and dropmissing
  2. filling or replacing missing values with coalesce
  3. skipping over missing values with skipmissing

First, we need a DataFrame filled with missing values to showcase these approaches:

df_missing = DataFrame(;
    name=[missing, "Sally", "Alice", "Hank"],
    age=[17, missing, 20, 19],
    grade_2020=[5.0, 1.0, missing, 4.0],
)
name age grade_2020
missing 17 5.0
Sally missing 1.0
Alice 20 missing
Hank 19 4.0

This is the same DataFrame from Section 4 but with some missing values added.

4.9.1 Missing type

Some languages have several types to represent missing values. One such example is R which uses NA, NA_integer_, NA_real_, NA_character_, and NA_complex_. Julia, on the contrary, has only one: Missing.

typeof(missing)
Missing

As you can see missing is an instance of the type Missing.

NOTE: In the Julia Style Guide, there’s a guidance to use camel case for types and modules (see Section 8.2).

The first thing we need to cover for missing values is that they propagate through several operations. For example, addition, subtraction, multiplication, and division:

missing + 1
missing
missing - 1
missing
missing * 1
missing
missing / 1
missing

They also propagate through equality and comparison operators:

missing == 1
missing
missing == missing
missing
missing > 1
missing
missing > missing
missing

That’s why we need to be very cautious when comparing and testing equalities in the presence of missing values. For equality testing use the ismissing function instead.

4.9.2 Filtering Missing Values

Most of the time we want to remove missing values from our data.

Removing missings can be done in two ways:

  1. dropmissing function applied to a whole DataFrame or a subset of its columns.
  2. ismissing function applied to a filtering procedure (see Section 4.3).

The dropmissing function takes as first positional argument a DataFrame, and as an optional second argument either a single column or a vector of columns by which you’ll want to remove the missing data from.

By default, if you do not specify column(s), as the second positional argument, it will remove any observation (row) having missing values:

dropmissing(df_missing)
name age grade_2020
Hank 19 4.0

Since 3 out of 4 rows had at least one missing value, we get back a DataFrame with a single row as a result.

However, if we specify column(s) as the second positional argument to dropmissing, the resulting DataFrame will only drop rows that have missing values in the specified column(s).

Here’s a single column with a Symbol:

dropmissing(df_missing, :name)
name age grade_2020
Sally missing 1.0
Alice 20 missing
Hank 19 4.0

And now multiple columns with a vector of Symbols:

dropmissing(df_missing, [:name, :age])
name age grade_2020
Alice 20 missing
Hank 19 4.0

NOTE: You can use any of the column selectors described in Section 4.4 for the second positional argument of dropmissing.

The ismissing function tests if the underlying value is of the Missing type returning either true or false.

You can use negation ! to use it both ways:

filter(:name => ismissing, df_missing)
name age grade_2020
missing 17 5.0
filter(:name => !ismissing, df_missing)
name age grade_2020
Sally missing 1.0
Alice 20 missing
Hank 19 4.0

4.9.3 Filling or Replacing Missing Values

A common data wrangling pattern is to replace or fill missing values.

Like R (and SQL), Julia has the coalesce function. We often use it in a broadcasted way over an array to fill all missing values with a specific value.

Here’s an example of a vector containing two missing values:

coalesce.([missing, "some value", missing], "zero")
["zero", "some value", "zero"]

You can see that coalesce replaces missing values with "zero".

We can definitely use it in a transform procedure (Section 4.7):

transform(df_missing, :name => ByRow(x -> coalesce(x, "John Doe")); renamecols=false)
name age grade_2020
John Doe 17 5.0
Sally missing 1.0
Alice 20 missing
Hank 19 4.0

4.9.4 Skipping over Missing Values

As we saw on Section 4.8, we can use combine to apply summarizing functions to data. However, as explained, missing values propagate through most operations in Julia. Suppose you want to calculate the mean of :grade_2020 column in our df_missing:

combine(df_missing, :grade_2020 => mean)
grade_2020_mean
missing

You can skip missing values in any array or summarizing function by passing the skipmissing function:

combine(df_missing, :grade_2020 => mean ∘ skipmissing )
grade_2020_mean_skipmissing
3.3333333333333335

NOTE: We are using the function composition operator (which you can type with \circ<TAB>) to compose two functions into one. It is just like the mathematical operator:

$$f \circ g (x) = f(g(x))$$

Hence, (mean ∘ skipmissing)(x) becomes mean(skipmissing(x)).



Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso