Missing Data - Julia Data Science

4.9 Missing Data

Let’s dive into how to handle missing values in DataFrames.jl. We’ll cover three main approaches for dealing with missing data:

filtering missing values with ismissing and dropmissing
filling or replacing missing values with coalesce
skipping over missing values with skipmissing

First, we need a DataFrame filled with missing values to showcase these approaches:

df_missing = DataFrame(;
    name=[missing, "Sally", "Alice", "Hank"],
    age=[17, missing, 20, 19],
    grade_2020=[5.0, 1.0, missing, 4.0],
)

name	age	grade_2020
missing	17	5.0
Sally	missing	1.0
Alice	20	missing
Hank	19	4.0

This is the same DataFrame from Section 4 but with some missing values added.

4.9.1 `Missing` type

Some languages have several types to represent missing values. One such example is R which uses NA, NA_integer_, NA_real_, NA_character_, and NA_complex_. Julia, on the contrary, has only one: Missing.

typeof(missing)

Missing

As you can see missing is an instance of the type Missing.

NOTE: In the Julia Style Guide, there’s a guidance to use camel case for types and modules (see Section 8.2).

The first thing we need to cover for missing values is that they propagate through several operations. For example, addition, subtraction, multiplication, and division:

missing + 1

missing

missing - 1

missing

missing * 1

missing

missing / 1

missing

They also propagate through equality and comparison operators:

missing == 1

missing

missing == missing

missing

missing > 1

missing

missing > missing

missing

That’s why we need to be very cautious when comparing and testing equalities in the presence of missing values. For equality testing use the ismissing function instead.

4.9.2 Filtering Missing Values

Most of the time we want to remove missing values from our data.

Removing missings can be done in two ways:

dropmissing function applied to a whole DataFrame or a subset of its columns.
ismissing function applied to a filtering procedure (see Section 4.3).

The dropmissing function takes as first positional argument a DataFrame, and as an optional second argument either a single column or a vector of columns by which you’ll want to remove the missing data from.

By default, if you do not specify column(s), as the second positional argument, it will remove any observation (row) having missing values:

dropmissing(df_missing)

name	age	grade_2020
Hank	19	4.0

Since 3 out of 4 rows had at least one missing value, we get back a DataFrame with a single row as a result.

However, if we specify column(s) as the second positional argument to dropmissing, the resulting DataFrame will only drop rows that have missing values in the specified column(s).

Here’s a single column with a Symbol:

dropmissing(df_missing, :name)

name	age	grade_2020
Sally	missing	1.0
Alice	20	missing
Hank	19	4.0

And now multiple columns with a vector of Symbols:

dropmissing(df_missing, [:name, :age])

name	age	grade_2020
Alice	20	missing
Hank	19	4.0

NOTE: You can use any of the column selectors described in Section 4.4 for the second positional argument of dropmissing.

The ismissing function tests if the underlying value is of the Missing type returning either true or false.

You can use negation ! to use it both ways:

ismissing to just keep the missing values.
!ismissing to keep anything but the missing values

filter(:name => ismissing, df_missing)

name	age	grade_2020
missing	17	5.0

filter(:name => !ismissing, df_missing)

name	age	grade_2020
Sally	missing	1.0
Alice	20	missing
Hank	19	4.0

4.9.3 Filling or Replacing Missing Values

A common data wrangling pattern is to replace or fill missing values.

Like R (and SQL), Julia has the coalesce function. We often use it in a broadcasted way over an array to fill all missing values with a specific value.

Here’s an example of a vector containing two missing values:

coalesce.([missing, "some value", missing], "zero")

["zero", "some value", "zero"]

You can see that coalesce replaces missing values with "zero".

We can definitely use it in a transform procedure (Section 4.7):

transform(df_missing, :name => ByRow(x -> coalesce(x, "John Doe")); renamecols=false)

name	age	grade_2020
John Doe	17	5.0
Sally	missing	1.0
Alice	20	missing
Hank	19	4.0

4.9.4 Skipping over Missing Values

As we saw on Section 4.8, we can use combine to apply summarizing functions to data. However, as explained, missing values propagate through most operations in Julia. Suppose you want to calculate the mean of :grade_2020 column in our df_missing:

combine(df_missing, :grade_2020 => mean)

grade_2020_mean
missing

You can skip missing values in any array or summarizing function by passing the skipmissing function:

combine(df_missing, :grade_2020 => mean ∘ skipmissing )

grade_2020_mean_skipmissing
3.3333333333333335

NOTE: We are using the function composition operator ∘ (which you can type with \circ<TAB>) to compose two functions into one. It is just like the mathematical operator:

$$f \circ g (x) = f(g(x))$$

Hence, (mean ∘ skipmissing)(x) becomes mean(skipmissing(x)).

4.8 Groupby and Combine ← → 4.10 Performance

Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso

4.9 Missing Data

4.9.1 Missing type

4.9.2 Filtering Missing Values

4.9.3 Filling or Replacing Missing Values

4.9.4 Skipping over Missing Values

4.9.1 `Missing` type