Filter and Subset - Julia Data Science

4.3 Filter and Subset

There are two ways to remove rows from a DataFrame, one is filter (Section 4.3.1) and the other is subset (Section 4.3.2). filter was added earlier to DataFrames.jl, is more powerful and more consistent with syntax from Julia base, so that is why we start discussing filter first. subset is newer and often more convenient.

4.3.1 Filter

From this point on, we start to get into the more powerful features of DataFrames.jl. To do this, we need to learn some functions, such as select and filter. But don’t worry! It might be a relief to know that the general design goal of DataFrames.jl is to keep the number of functions that a user has to learn to a minimum¹⁵.

Like before, we resume from the grades_2020:

grades_2020()

name	grade_2020
Sally	1.0
Bob	5.0
Alice	8.5
Hank	4.0

We can filter rows by using filter(source => f::Function, df). Note how this function is very similar to the function filter(f::Function, V::Vector) from Julia Base module. This is because DataFrames.jl uses multiple dispatch (see Section 2.3.3) to define a new method of filter that accepts a DataFrame as argument.

At first sight, defining and working with a function f for filtering can be a bit difficult to use in practice. Hold tight, that effort is well-paid, since it is a very powerful way of filtering data. As a simple example, we can create a function equals_alice that checks whether its input equals “Alice”:

equals_alice(name::String) = name == "Alice"
JDS.equals_alice("Bob")


false

equals_alice("Alice")


true

Equipped with such a function, we can use it as our function f to filter out all the rows for which name equals “Alice”:

filter(:name => equals_alice, grades_2020())

name	grade_2020
Alice	8.5

Note that this doesn’t only work for DataFrames, but also for vectors:

filter(equals_alice, ["Alice", "Bob", "Dave"])

["Alice"]

We can make it a bit less verbose by using an anonymous function (see Section 3.2.4.4):

filter(n -> n == "Alice", ["Alice", "Bob", "Dave"])

["Alice"]

which we can also use on grades_2020:

filter(:name => n -> n == "Alice", grades_2020())

name	grade_2020
Alice	8.5

To recap, this function call can be read as “for each element in the column :name, let’s call the element n, check whether n equals Alice”. For some people, this is still too verbose. Luckily, Julia has added a partial function application of ==. The details are not important – just know that you can use it just like any other function:

filter(:name => ==("Alice"), grades_2020())

name	grade_2020
Alice	8.5

To get all the rows which are not Alice, == (equality) can be replaced by != (inequality) in all previous examples:

filter(:name => !=("Alice"), grades_2020())

name	grade_2020
Sally	1.0
Bob	5.0
Hank	4.0

Now, to show why functions are so powerful, we can come up with a slightly more complex filter. In this filter, we want to have the people whose names start with A or B and have a grade above 6:

function complex_filter(name, grade)::Bool
    interesting_name = startswith(name, 'A') || startswith(name, 'B')
    interesting_grade = 6 < grade
    interesting_name && interesting_grade
end

filter([:name, :grade_2020] => complex_filter, grades_2020())

name	grade_2020
Alice	8.5

4.3.2 Subset

The subset function was added to make it easier to work with missing values (Section 4.9). In contrast to filter, subset works on complete columns instead of rows or single values. If we want to use our earlier defined functions, we should wrap it inside ByRow:

subset(grades_2020(), :name => ByRow(equals_alice))

name	grade_2020
Alice	8.5

Also note that the DataFrame is now the first argument subset(df, args...), whereas in filter it was the second one filter(f, df). The reason for this is that Julia defines filter as filter(f, V::Vector) and DataFrames.jl chose to maintain consistency with existing Julia functions that were extended to DataFrames types by multiple dispatch.

NOTE: Most of native DataFrames.jl functions, which subset belongs to, have a consistent function signature that always takes a DataFrame as first argument.

Just like with filter, we can also use anonymous functions inside subset:

subset(grades_2020(), :name => ByRow(name -> name == "Alice"))

name	grade_2020
Alice	8.5

Or, the partial function application for ==:

subset(grades_2020(), :name => ByRow(==("Alice")))

name	grade_2020
Alice	8.5

Ultimately, let’s show the real power of subset. First, we create a dataset with some missing values:

function salaries()
    names = ["John", "Hank", "Karen", "Zed"]
    salary = [1_900, 2_800, 2_800, missing]
    DataFrame(; names, salary)
end
salaries()

Table 6: Salaries.
names	salary
John	1900
Hank	2800
Karen	2800
Zed	missing

This data is about a plausible situation where you want to figure out your colleagues’ salaries, and haven’t figured it out for Zed yet. Even though we don’t want to encourage these practices, we suspect it is an interesting example. Suppose we want to know who earns more than 2000. If we use filter, without taking the missing values into account, it will fail:

filter(:salary => >(2_000), salaries())

TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
  [1] (::DataFrames.var"#105#106"{Base.Fix2{typeof(>), Int64}})(x::Missing)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/abstractdataframe.jl:1248
  ...

subset will also fail, but it will fortunately point us towards an easy solution:

subset(salaries(), :salary => ByRow(>(2_000)))

ArgumentError: missing was returned in condition number 1 but only true or false are allowed; pass skipmissing=true to skip missing values
Stacktrace:
  [1] _and(x::Missing)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/subset.jl:11
  ...

So, we just need to pass the keyword argument skipmissing=true:

subset(salaries(), :salary => ByRow(>(2_000)); skipmissing=true)

names	salary
Hank	2800
Karen	2800

4.2 Index and Summarize ← → 4.4 Select

Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso