Let’s dive into how to handle missing values in DataFrames.jl. We’ll cover three main approaches for dealing with missing data:
ismissing and dropmissingcoalesceskipmissingFirst, we need a DataFrame filled with missing values to showcase these approaches:
df_missing = DataFrame(;
name=[missing, "Sally", "Alice", "Hank"],
age=[17, missing, 20, 19],
grade_2020=[5.0, 1.0, missing, 4.0],
)
| name | age | grade_2020 |
|---|---|---|
| missing | 17 | 5.0 |
| Sally | missing | 1.0 |
| Alice | 20 | missing |
| Hank | 19 | 4.0 |
This is the same DataFrame from Section 4 but with some missing values added.
Missing typeSome languages have several types to represent missing values. One such example is R which uses NA, NA_integer_, NA_real_, NA_character_, and NA_complex_. Julia, on the contrary, has only one: Missing.
typeof(missing)
Missing
As you can see missing is an instance of the type Missing.
NOTE: In the Julia Style Guide, there’s a guidance to use camel case for types and modules (see Section 8.2).
The first thing we need to cover for missing values is that they propagate through several operations. For example, addition, subtraction, multiplication, and division:
missing + 1
missing
missing - 1
missing
missing * 1
missing
missing / 1
missing
They also propagate through equality and comparison operators:
missing == 1
missing
missing == missing
missing
missing > 1
missing
missing > missing
missing
That’s why we need to be very cautious when comparing and testing equalities in the presence of missing values. For equality testing use the ismissing function instead.
Most of the time we want to remove missing values from our data.
Removing missings can be done in two ways:
dropmissing function applied to a whole DataFrame or a subset of its columns.ismissing function applied to a filtering procedure (see Section 4.3).The dropmissing function takes as first positional argument a DataFrame, and as an optional second argument either a single column or a vector of columns by which you’ll want to remove the missing data from.
By default, if you do not specify column(s), as the second positional argument, it will remove any observation (row) having missing values:
dropmissing(df_missing)
| name | age | grade_2020 |
|---|---|---|
| Hank | 19 | 4.0 |
Since 3 out of 4 rows had at least one missing value, we get back a DataFrame with a single row as a result.
However, if we specify column(s) as the second positional argument to dropmissing, the resulting DataFrame will only drop rows that have missing values in the specified column(s).
Here’s a single column with a Symbol:
dropmissing(df_missing, :name)
| name | age | grade_2020 |
|---|---|---|
| Sally | missing | 1.0 |
| Alice | 20 | missing |
| Hank | 19 | 4.0 |
And now multiple columns with a vector of Symbols:
dropmissing(df_missing, [:name, :age])
| name | age | grade_2020 |
|---|---|---|
| Alice | 20 | missing |
| Hank | 19 | 4.0 |
NOTE: You can use any of the column selectors described in Section 4.4 for the second positional argument of
dropmissing.
The ismissing function tests if the underlying value is of the Missing type returning either true or false.
You can use negation ! to use it both ways:
ismissing to just keep the missing values.!ismissing to keep anything but the missing valuesfilter(:name => ismissing, df_missing)
| name | age | grade_2020 |
|---|---|---|
| missing | 17 | 5.0 |
filter(:name => !ismissing, df_missing)
| name | age | grade_2020 |
|---|---|---|
| Sally | missing | 1.0 |
| Alice | 20 | missing |
| Hank | 19 | 4.0 |
A common data wrangling pattern is to replace or fill missing values.
Like R (and SQL), Julia has the coalesce function. We often use it in a broadcasted way over an array to fill all missing values with a specific value.
Here’s an example of a vector containing two missing values:
coalesce.([missing, "some value", missing], "zero")
["zero", "some value", "zero"]
You can see that coalesce replaces missing values with "zero".
We can definitely use it in a transform procedure (Section 4.7):
transform(df_missing, :name => ByRow(x -> coalesce(x, "John Doe")); renamecols=false)
| name | age | grade_2020 |
|---|---|---|
| John Doe | 17 | 5.0 |
| Sally | missing | 1.0 |
| Alice | 20 | missing |
| Hank | 19 | 4.0 |
As we saw on Section 4.8, we can use combine to apply summarizing functions to data. However, as explained, missing values propagate through most operations in Julia. Suppose you want to calculate the mean of :grade_2020 column in our df_missing:
combine(df_missing, :grade_2020 => mean)
| grade_2020_mean |
|---|
| missing |
You can skip missing values in any array or summarizing function by passing the skipmissing function:
combine(df_missing, :grade_2020 => mean ∘ skipmissing )
| grade_2020_mean_skipmissing |
|---|
| 3.3333333333333335 |
NOTE: We are using the function composition operator
∘(which you can type with\circ<TAB>) to compose two functions into one. It is just like the mathematical operator:$$f \circ g (x) = f(g(x))$$
Hence,
(mean ∘ skipmissing)(x)becomesmean(skipmissing(x)).