## 4.9 Missing Data

Let’s dive into how to handle missing values in DataFrames.jl. We’ll cover three main approaches for dealing with missing data:

1. filtering missing values with ismissing and dropmissing
2. filling or replacing missing values with coalesce
3. skipping over missing values with skipmissing

First, we need a DataFrame filled with missing values to showcase these approaches:

df_missing = DataFrame(;
name=[missing, "Sally", "Alice", "Hank"],
age=[17, missing, 20, 19],
)
missing 17 5.0
Sally missing 1.0
Alice 20 missing
Hank 19 4.0

This is the same DataFrame from Section 4 but with some missing values added.

### 4.9.1Missing type

Some languages have several types to represent missing values. One such example is R which uses NA, NA_integer_, NA_real_, NA_character_, and NA_complex_. Julia, on the contrary, has only one: Missing.

typeof(missing)
Missing

As you can see missing is an instance of the type Missing.

NOTE: In the Julia Style Guide, there’s a guidance to use camel case for types and modules (see Section 8.2).

The first thing we need to cover for missing values is that they propagate through several operations. For example, addition, subtraction, multiplication, and division:

missing + 1
missing
missing - 1
missing
missing * 1
missing
missing / 1
missing

They also propagate through equality and comparison operators:

missing == 1
missing
missing == missing
missing
missing > 1
missing
missing > missing
missing

That’s why we need to be very cautious when comparing and testing equalities in the presence of missing values. For equality testing use the ismissing function instead.

### 4.9.2 Filtering Missing Values

Most of the time we want to remove missing values from our data.

Removing missings can be done in two ways:

1. dropmissing function applied to a whole DataFrame or a subset of its columns.
2. ismissing function applied to a filtering procedure (see Section 4.3).

The dropmissing function takes as first positional argument a DataFrame, and as an optional second argument either a single column or a vector of columns by which you’ll want to remove the missing data from.

By default, if you do not specify column(s), as the second positional argument, it will remove any observation (row) having missing values:

dropmissing(df_missing)
Hank 19 4.0

Since 3 out of 4 rows had at least one missing value, we get back a DataFrame with a single row as a result.

However, if we specify column(s) as the second positional argument to dropmissing, the resulting DataFrame will only drop rows that have missing values in the specified column(s).

Here’s a single column with a Symbol:

dropmissing(df_missing, :name)
Sally missing 1.0
Alice 20 missing
Hank 19 4.0

And now multiple columns with a vector of Symbols:

dropmissing(df_missing, [:name, :age])
Alice 20 missing
Hank 19 4.0

NOTE: You can use any of the column selectors described in Section 4.4 for the second positional argument of dropmissing.

The ismissing function tests if the underlying value is of the Missing type returning either true or false.

You can use negation ! to use it both ways:

• ismissing to just keep the missing values.
• !ismissing to keep anything but the missing values
filter(:name => ismissing, df_missing)
missing 17 5.0
filter(:name => !ismissing, df_missing)
Sally missing 1.0
Alice 20 missing
Hank 19 4.0

### 4.9.3 Filling or Replacing Missing Values

A common data wrangling pattern is to replace or fill missing values.

Like R (and SQL), Julia has the coalesce function. We often use it in a broadcasted way over an array to fill all missing values with a specific value.

Here’s an example of a vector containing two missing values:

coalesce.([missing, "some value", missing], "zero")
["zero", "some value", "zero"]

You can see that coalesce replaces missing values with "zero".

We can definitely use it in a transform procedure (Section 4.7):

transform(df_missing, :name => ByRow(x -> coalesce(x, "John Doe")); renamecols=false)
John Doe 17 5.0
Sally missing 1.0
Alice 20 missing
Hank 19 4.0

### 4.9.4 Skipping over Missing Values

As we saw on Section 4.8, we can use combine to apply summarizing functions to data. However, as explained, missing values propagate through most operations in Julia. Suppose you want to calculate the mean of :grade_2020 column in our df_missing:

missing

You can skip missing values in any array or summarizing function by passing the skipmissing function:

combine(df_missing, :grade_2020 => mean ∘ skipmissing )
$$f \circ g (x) = f(g(x))$$