4.4 Select

Whereas filter removes rows, select removes columns. However, select is much more versatile than just removing columns, as we will discuss in this section. First, let’s create a dataset with multiple columns:

function responses()
    id = [1, 2]
    q1 = [28, 61]
    q2 = [:us, :fr]
    q3 = ["F", "B"]
    q4 = ["B", "C"]
    q5 = ["A", "E"]
    DataFrame(; id, q1, q2, q3, q4, q5)
Table 7: Responses.
id q1 q2 q3 q4 q5
1 28 us F B A
2 61 fr B C E

Here, the data represents answers for five questions (q1, q2, …, q5) in a given questionnaire. We will start by “selecting” a few columns from this dataset. As usual, we use symbols to specify columns:

select(responses(), :id, :q1)
id q1
1 28
2 61

We can also use strings if we want:

select(responses(), "id", "q1", "q2")
id q1 q2
1 28 us
2 61 fr

To select everything except one or more columns, use Not with either a single column:

select(responses(), Not(:q5))
id q1 q2 q3 q4
1 28 us F B
2 61 fr B C

Or, with multiple columns:

select(responses(), Not([:q4, :q5]))
id q1 q2 q3
1 28 us F
2 61 fr B

It’s also fine to mix and match columns that we want to preserve with columns that we do Not want to select:

select(responses(), :q5, Not(:q5))
q5 id q1 q2 q3 q4
A 1 28 us F B
E 2 61 fr B C

Note how q5 is now the first column in the DataFrame returned by select. There is a more clever way to achieve the same using :. The colon : can be thought of as “all the columns that we didn’t include yet.” For example:

select(responses(), :q5, :)
q5 id q1 q2 q3 q4
A 1 28 us F B
E 2 61 fr B C

Or, to put q5 at the second position16:

select(responses(), 1, :q5, :)
id q5 q1 q2 q3 q4
1 A 28 us F B
2 E 61 fr B C

NOTE: As you might have observed there are several ways to select a column. These are known as column selectors.

We can use:

  • Symbol: select(df, :col)

  • String: select(df, "col")

  • Integer: select(df, 1)

Even renaming columns is possible via select using the source => target pair syntax:

select(responses(), 1 => "participant", :q1 => "age", :q2 => "nationality")
participant age nationality
1 28 us
2 61 fr

Additionally, thanks to the “splat” operator ... (see Section 3.3.11), we can also write:

renames = (1 => "participant", :q1 => "age", :q2 => "nationality")
select(responses(), renames...)
participant age nationality
1 28 us
2 61 fr

  1. 16. thanks to Sudete on Discourse (https://discourse.julialang.org/t/pull-dataframes-columns-to-the-front/60327/4) for this suggestion.↩︎

CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso