There are two ways to remove rows from a `DataFrame`

, one is `filter`

(Section 4.3.1) and the other is `subset`

(Section 4.3.2). `filter`

was added earlier to `DataFrames.jl`

, is more powerful and more consistent with syntax from Julia base, so that is why we start discussing `filter`

first. `subset`

is newer and often more convenient.

From this point on, we start to get into the more powerful features of `DataFrames.jl`

. To do this, we need to learn some functions, such as `select`

and `filter`

. But don’t worry! It might be a relief to know that the **general design goal of DataFrames.jl is to keep the number of functions that a user has to learn to a minimum ^{15}**.

Like before, we resume from the `grades_2020`

:

`grades_2020()`

name | grade_2020 |
---|---|

Sally | 1.0 |

Bob | 5.0 |

Alice | 8.5 |

Hank | 4.0 |

We can filter rows by using `filter(source => f::Function, df)`

. Note how this function is very similar to the function `filter(f::Function, V::Vector)`

from Julia `Base`

module. This is because `DataFrames.jl`

uses **multiple dispatch** (see Section 2.3.3) to define a new method of `filter`

that accepts a `DataFrame`

as argument.

At first sight, defining and working with a function `f`

for filtering can be a bit difficult to use in practice. Hold tight, that effort is well-paid, since **it is a very powerful way of filtering data**. As a simple example, we can create a function `equals_alice`

that checks whether its input equals “Alice”:

```
equals_alice(name::String) = name == "Alice"
equals_alice("Bob")
```

`false`

`equals_alice("Alice")`

`true`

Equipped with such a function, we can use it as our function `f`

to filter out all the rows for which `name`

equals “Alice”:

`filter(:name => equals_alice, grades_2020())`

name | grade_2020 |
---|---|

Alice | 8.5 |

Note that this doesn’t only work for `DataFrame`

s, but also for vectors:

`filter(equals_alice, ["Alice", "Bob", "Dave"])`

`["Alice"]`

We can make it a bit less verbose by using an **anonymous function** (see Section 3.2.4.4):

`filter(n -> n == "Alice", ["Alice", "Bob", "Dave"])`

`["Alice"]`

which we can also use on `grades_2020`

:

`filter(:name => n -> n == "Alice", grades_2020())`

name | grade_2020 |
---|---|

Alice | 8.5 |

To recap, this function call can be read as “for each element in the column `:name`

, let’s call the element `n`

, check whether `n`

equals Alice.” For some people, this is still too verbose. Luckily, Julia has added a *partial function application* of `==`

. The details are not important – just know that you can use it just like any other function:

`filter(:name => ==("Alice"), grades_2020())`

name | grade_2020 |
---|---|

Alice | 8.5 |

To get all the rows which are *not* Alice, `==`

(equality) can be replaced by `!=`

(inequality) in all previous examples:

`filter(:name => !=("Alice"), grades_2020())`

name | grade_2020 |
---|---|

Sally | 1.0 |

Bob | 5.0 |

Hank | 4.0 |

Now, to show **why functions are so powerful**, we can come up with a slightly more complex filter. In this filter, we want to have the people whose names start with A or B **and** have a grade above 6:

```
function complex_filter(name, grade)::Bool
interesting_name = startswith(name, 'A') || startswith(name, 'B')
interesting_grade = 6 < grade
interesting_name && interesting_grade
end
```

`filter([:name, :grade_2020] => complex_filter, grades_2020())`

name | grade_2020 |
---|---|

Alice | 8.5 |

The `subset`

function was added to make it easier to work with missing values (Section 4.5). In contrast to `filter`

, `subset`

works on complete columns instead of rows or single values. If we want to use our earlier defined functions, we should wrap it inside `ByRow`

:

`subset(grades_2020(), :name => ByRow(equals_alice))`

name | grade_2020 |
---|---|

Alice | 8.5 |

Also note that the `DataFrame`

is now the first argument `subset(df, args...)`

, whereas in `filter`

it was the second one `filter(f, df)`

. The reason for this is that Julia defines filter as `filter(f, V::Vector)`

and `DataFrames.jl`

chose to maintain consistency with existing Julia functions that were extended to `DataFrame`

s types by multiple dispatch.

Most of nativeNOTE:`DataFrames.jl`

functions, which`subset`

belongs to, have aconsistent function signature that always takes a.`DataFrame`

as first argument

Just like with `filter`

, we can also use anonymous functions inside `subset`

:

`subset(grades_2020(), :name => ByRow(name -> name == "Alice"))`

name | grade_2020 |
---|---|

Alice | 8.5 |

Or, the partial function application for `==`

:

`subset(grades_2020(), :name => ByRow(==("Alice")))`

name | grade_2020 |
---|---|

Alice | 8.5 |

Ultimately, let’s show the real power of `subset`

. First, we create a dataset with some missing values:

```
function salaries()
names = ["John", "Hank", "Karen", "Zed"]
salary = [1_900, 2_800, 2_800, missing]
DataFrame(; names, salary)
end
salaries()
```

names | salary |
---|---|

John | 1900 |

Hank | 2800 |

Karen | 2800 |

Zed | missing |

This data is about a plausible situation where you want to figure out your colleagues’ salaries, and haven’t figured it out for Zed yet. Even though we don’t want to encourage these practices, we suspect it is an interesting example. Suppose we want to know who earns more than 2000. If we use `filter`

, without taking the `missing`

values into account, it will fail:

`filter(:salary => >(2_000), salaries())`

```
TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
[1] (::DataFrames.var"#97#98"{Base.Fix2{typeof(>), Int64}})(x::Missing)
@ DataFrames ~/.julia/packages/DataFrames/zqFGs/src/abstractdataframe/abstractdataframe.jl:1110
...
```

`subset`

will also fail, but it will fortunately point us towards an easy solution:

`subset(salaries(), :salary => ByRow(>(2_000)))`

```
ArgumentError: missing was returned in condition number 1 but only true or false are allowed; pass skipmissing=true to skip missing values
Stacktrace:
[1] _and(x::Missing)
@ DataFrames ~/.julia/packages/DataFrames/zqFGs/src/abstractdataframe/subset.jl:11
...
```

So, we just need to pass the keyword argument `skipmissing=true`

:

`subset(salaries(), :salary => ByRow(>(2_000)); skipmissing=true)`

names | salary |
---|---|

Hank | 2800 |

Karen | 2800 |

15. According to Bogumił Kamiński (lead developer and maintainer of

`DataFrames.jl`

) on Discourse (https://discourse.julialang.org/t/pull-dataframes-columns-to-the-front/60327/5).↩︎