Select - Julia Data Science

4.4 Select

Whereas filter removes rows, select removes columns. However, select is much more versatile than just removing columns, as we will discuss in this section. First, let’s create a dataset with multiple columns:

function responses()
    id = [1, 2]
    q1 = [28, 61]
    q2 = [:us, :fr]
    q3 = ["F", "B"]
    q4 = ["B", "C"]
    q5 = ["A", "E"]
    DataFrame(; id, q1, q2, q3, q4, q5)
end
responses()

Table 7: Responses.
id	q1	q2	q3	q4	q5
1	28	us	F	B	A
2	61	fr	B	C	E

Here, the data represents answers for five questions (q1, q2, …, q5) in a given questionnaire. We will start by “selecting” a few columns from this dataset. As usual, we use symbols to specify columns:

select(responses(), :id, :q1)

id	q1
1	28
2	61

We can also use strings if we want:

select(responses(), "id", "q1", "q2")

id	q1	q2
1	28	us
2	61	fr

Additionally, we can use Regular Expressions with Julia’s regex string literal. A string literal in Julia is a prefix that you use while constructing a String. For example, the regex string literal can be created with r"..." where ... is the Regular Expression. For example, suppose you only want to select the columns that start with q:

select(responses(), r"^q")

q1	q2	q3	q4	q5
28	us	F	B	A
61	fr	B	C	E

NOTE: We won’t cover regular expressions in this book, but you are encouraged to learn about them. To build and test regular expressions interactively, we advice to use online tools for them such as https://regex101.com/.

To select everything except one or more columns, use Not with either a single column:

select(responses(), Not(:q5))

id	q1	q2	q3	q4
1	28	us	F	B
2	61	fr	B	C

Or, with multiple columns:

select(responses(), Not([:q4, :q5]))

id	q1	q2	q3
1	28	us	F
2	61	fr	B

It’s also fine to mix and match columns that we want to preserve with columns that we do Not want to select:

select(responses(), :q5, Not(:q5))

q5	id	q1	q2	q3	q4
A	1	28	us	F	B
E	2	61	fr	B	C

Note how q5 is now the first column in the DataFrame returned by select. There is a more clever way to achieve the same using :. The colon : can be thought of as “all the columns that we didn’t include yet”. For example:

select(responses(), :q5, :)

q5	id	q1	q2	q3	q4
A	1	28	us	F	B
E	2	61	fr	B	C

Or, to put q5 at the second position¹⁶:

select(responses(), 1, :q5, :)

id	q5	q1	q2	q3	q4
1	A	28	us	F	B
2	E	61	fr	B	C

NOTE: As you might have observed there are several ways to select a column. These are known as column selectors.

We can use:

Symbol: select(df, :col)

String: select(df, "col")

Integer: select(df, 1)

RegEx: select(df, r"RegEx")

Even renaming columns is possible via select using the source => target pair syntax:

select(responses(), 1 => "participant", :q1 => "age", :q2 => "nationality")

participant	age	nationality
1	28	us
2	61	fr

Additionally, thanks to the “splat” operator ... (see Section 3.3.11), we can also write:

renames = (1 => "participant", :q1 => "age", :q2 => "nationality")
select(responses(), renames...)

participant	age	nationality
1	28	us
2	61	fr

4.3 Filter and Subset ← → 4.5 Types and Categorical Da..

Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso