Data comes mostly in a tabular format. By tabular, we mean that the data consists of a table containing rows and columns. Entries within any one column are usually of the same data type, whereas entries within a given row typically have different types. The rows, in practice, denote observations while columns denote variables. For example, we can have a table of TV shows containing the country in which each was produced and our personal rating, see Table 1.
name | country | rating |
---|---|---|
Game of Thrones | United States | 8.2 |
The Crown | England | 7.3 |
Friends | United States | 7.8 |
… | … | … |
Here, the dots mean that this could be a very long table and we only show a few rows. While analyzing data, often we come up with interesting questions about the data, also known as data queries. For large tables, computers would be able to answer these kinds of questions much quicker than you could do it by hand. Some examples of these so-called queries for this data could be:
But, as a researcher, real science often starts with having multiple tables or data sources. For example, if we also have data from someone else’s ratings for the TV shows (Table 2):
name | rating |
---|---|
Game of Thrones | 7 |
Friends | 6.4 |
… | … |
Now, questions that we could ask ourselves could be:
In the rest of this chapter, we will show you how you can easily answer these questions in Julia. To do so, we first show why we need a Julia package called DataFrames.jl
. In the next sections, we show how you can use this package and, finally, we show how to write fast data transformations (Section 4.10).
NOTE:
DataFrames.jl
has some guiding principles. Notably, we would like to highlight two of them:
- Stay consistent with Julia’s
Base
module functions.- Minimize the number of functions
DataFrames.jl
provides.Those two principles are really powerful because if you have a good grasp of Julia’s basic functions, such as
filter
, then you can do powerful operations on tabular data.This is a benefit over Python’s
pandas
or R’sdplyr
which differ more from the core languages.
Let’s look at a table of grades like the one in Table 3:
name | age | grade_2020 |
---|---|---|
Bob | 17 | 5.0 |
Sally | 18 | 1.0 |
Alice | 20 | 8.5 |
Hank | 19 | 4.0 |
Here, the column name has type string
, age has type integer
, and grade has type float
.
So far, this book has only handled Julia’s basics. These basics are great for many things, but not for tables. To show that we need more, lets try to store the tabular data in arrays:
function grades_array()
name = ["Bob", "Sally", "Alice", "Hank"]
age = [17, 18, 20, 19]
grade_2020 = [5.0, 1.0, 8.5, 4.0]
(; name, age, grade_2020)
end
Now, the data is stored in so-called column-major form, which is cumbersome when we want to get data from a row:
function second_row()
name, age, grade_2020 = grades_array()
i = 2
row = (name[i], age[i], grade_2020[i])
end
JDS.second_row()
("Sally", 18, 1.0)
Or, if you want to have the grade for Alice, you first need to figure out in what row Alice is:
function row_alice()
names = grades_array().name
i = findfirst(names .== "Alice")
end
row_alice()
3
and then we can get the value:
function value_alice()
grades = grades_array().grade_2020
i = row_alice()
grades[i]
end
value_alice()
8.5
DataFrames.jl
can easily solve these kinds of issues. You can start by loading DataFrames.jl
with using
:
using DataFrames
With DataFrames.jl
, we can define a DataFrame
to hold our tabular data:
names = ["Sally", "Bob", "Alice", "Hank"]
grades = [1, 5, 8.5, 4]
df = DataFrame(; name=names, grade_2020=grades)
name | grade_2020 |
---|---|
Sally | 1.0 |
Bob | 5.0 |
Alice | 8.5 |
Hank | 4.0 |
which gives us a variable df
containing our data in table format.
NOTE: This works, but there is one thing that we need to change straight away. In this example, we defined the variables
name
,grade_2020
anddf
in global scope. This means that these variables can be accessed and edited from anywhere. If we would continue writing the book like this, we would have a few hundred variables at the end of the book even though the data that we put into the variablename
should only be accessed viaDataFrame
! The variablesname
andgrade_2020
were never meant to be kept for long! Now, imagine that we would change the contents ofgrade_2020
a few times in this book. Given only the book as PDF, it would be near impossible to figure out the contents of the variable by the end.We can solve this very easily by using functions.
Let’s do the same thing as before but now in a function:
function grades_2020()
name = ["Sally", "Bob", "Alice", "Hank"]
grade_2020 = [1, 5, 8.5, 4]
DataFrame(; name, grade_2020)
end
grades_2020()
name | grade_2020 |
---|---|
Sally | 1.0 |
Bob | 5.0 |
Alice | 8.5 |
Hank | 4.0 |
Note that name
and grade_2020
are destroyed after the function returns, that is, they are only available in the function. There are two other benefits of doing this. First, it is now clear to the reader where name
and grade_2020
belong to: they belong to the grades of 2020. Second, it is easy to determine what the output of grades_2020()
would be at any point in the book. For example, we can now assign the data to a variable df
:
df = grades_2020()
name | grade_2020 |
---|---|
Sally | 1.0 |
Bob | 5.0 |
Alice | 8.5 |
Hank | 4.0 |
Change the contents of df
:
df = DataFrame(name = ["Malice"], grade_2020 = ["10"])
name | grade_2020 |
---|---|
Malice | 10 |
And still recover the original data back without any problem:
df = grades_2020()
name | grade_2020 |
---|---|
Sally | 1.0 |
Bob | 5.0 |
Alice | 8.5 |
Hank | 4.0 |
Of course, this assumes that the function is not re-defined. We promise to not do that in this book, because it is a bad idea exactly for this reason. Instead of “changing” a function, we will make a new one and give it a clear name.
So, back to the DataFrames
constructor. As you might have seen, the way to create one is simply to pass vectors as arguments into the DataFrame
constructor. You can come up with any valid Julia vector and it will work as long as the vectors have the same length. Duplicates, Unicode symbols and any sort of numbers are fine:
DataFrame(σ = ["a", "a", "a"], δ = [π, π/2, π/3])
σ | δ |
---|---|
a | 3.141592653589793 |
a | 1.5707963267948966 |
a | 1.0471975511965976 |
Typically, in your code, you would create a function which wraps around one or more DataFrame
s’ functions. For example, we can make a function to get the grades for one or more names
:
function grades_2020(names::Vector{Int})
df = grades_2020()
df[names, :]
end
JDS.grades_2020([3, 4])
name | grade_2020 |
---|---|
Alice | 8.5 |
Hank | 4.0 |
This way of using functions to wrap around basic functionality in programming languages and packages is quite common. Basically, you can think of Julia and DataFrames.jl
as providers of building blocks. They provide very generic building blocks which allow you to build things for your specific use case like this grades example. By using the blocks, you can make a data analysis script, control a robot or whatever you like to build.
So far, the examples were quite cumbersome, because we had to use indexes. In the next sections, we will show how to load and save data, and many powerful building blocks provided by DataFrames.jl
.