3.2 Native Data Structures

Julia has several native data structures. They are abstractions that represent data in some structured form. We will cover the most used ones. They hold homogeneous or heterogeneous data. Since they are collections, they can be looped over with for loops.

We will cover String, Tuple, NamedTuple, UnitRange, Arrays, Pair, Dict, Symbol.

When you stumble into a data structure in Julia, you can find methods that accept it as an argument with the methodswith function. In Julia, the distinction between methods and functions is as follows: Every function can have multiple methods like we have shown earlier. The methodswith function is a nice trick to have in your bag of tricks. Let’s see what we can do with a String for example:

first(methodswith(String), 5)
[1] crc32c(s::String) in CRC32c at /opt/hostedtoolcache/julia/1.6.2/x64/share/julia/stdlib/v1.6/CRC32c/src/CRC32c.jl:39
[2] crc32c(s::String, crc::UInt32) in CRC32c at /opt/hostedtoolcache/julia/1.6.2/x64/share/julia/stdlib/v1.6/CRC32c/src/CRC32c.jl:39
[3] getindex(p::Base.BinaryPlatforms.AbstractPlatform, k::String) in Base.BinaryPlatforms at binaryplatforms.jl:137
[4] getindex(f::CSV.File, col::String) in CSV at /home/runner/.julia/packages/CSV/b4GfC/src/file.jl:152
[5] getindex(d::MathTeXEngine.CanonicalDict{Char}, key::String) in MathTeXEngine at /home/runner/.julia/packages/MathTeXEngine/ZP0gS/src/parser/commands_registration.jl:20

3.2.1 Broadcasting Operators and Functions

Before we dive into data structures, we need to talk about broadcasting (also known as vectorization) and the “dot” operator ..

For mathematical operations, like * (multiplication) or + (addition), we can broadcast them using the dot operator. For example, broadcasted addition means changing the + to .+:

[1, 2, 3] .+ 1
[2, 3, 4]

It also works with functions automatically. (Technically, the mathematical operations, or infix operators, are also functions, but that is not so important to know.) Remember our logarithm function?

logarithm.([1, 2, 3])
[0.0, 0.6931471805599569, 1.0986122886681282]

Functions with a bang !

It is a Julia convention to append a bang ! to names of functions that modify one or more of their arguments. This convention warns the user that the function is not pure, i.e., that it has side effects. A function with side effects is useful when you want to update a large data structure or variable container without having all the overhead from creating a new instance.

For example, we can create a function that adds 1 to each element in a vector V:

function add_one!(V)
    for i in 1:length(V)
        V[i] += 1
    end
    return nothing
end
my_data = [1, 2, 3]

add_one!(my_data)

my_data
[2, 3, 4]

3.2.2 String

Strings are delimited by double quotes:

typeof("This is a string")
String

We can also write a multiline string:

text = "
This is a big multiline string.
As you can see.
It is still a String to Julia.
"

print(text)

This is a big multiline string.
As you can see.
It is still a String to Julia.

But it is typically clearer to use triple quotation marks:

s = """
    This is a big multiline string with a nested "quotation".
    As you can see.
    It is still a String to Julia.
    """

print(s)
This is a big multiline string with a nested "quotation".
As you can see.
It is still a String to Julia.

When using triple quotes, the indentation and the newline at the start are ignored by Julia. This improves code readability because you can indent the block in your source code without those spaces ending up in your string.

String Concatenation

A common string operation is string concatenation. Suppose that you want to construct a new string that is the concatenation of two or more strings. This is accomplished in Julia either with the * operator or the join function. The * symbol might sound like a weird choice, and it actually is. For now, many Julia codebases are using this symbol, so it will stay in the language. If you’re interested, you can read a discussion from 2015 about it at https://github.com/JuliaLang/julia/issues/11030.

hello = "Hello"
goodbye = "Goodbye"

hello * goodbye
HelloGoodbye

As you can see, we are missing a space between hello and goodbye. We could concatenate an additional " " string with the *, but that would be cumbersome for more than two strings. That’s where the join function comes in handy. We just pass the strings inside the brackets [] and the separator as arguments:

join([hello, goodbye], " ")
Hello Goodbye

String Interpolation

Concatenating strings can be convoluted. We can be much more expressive with string interpolation. It works like this: you specify whatever you want to be included in your string with the dollar sign $. Here’s the example from before, but now using interpolation:

"$hello $goodbye"
Hello Goodbye

It works even inside functions. Let’s revisit our test function from Section 3.1.5:

function test_interpolated(a, b)
    if a < b
        "$a is less than $b"
    elseif a > b
        "$a is greater than $b"
    else
        "$a is equal to $b"
    end
end

test_interpolated(3.14, 3.14)
3.14 is equal to 3.14

String Manipulations

There are several functions to manipulate strings in Julia. We will demonstrate the most common ones. Also, note that most of these functions accept a Regular Expression (RegEx) as an argument. We won’t cover RegEx in this book, but you are encouraged to learn about them, especially if most of your work uses textual data.

First, let us define a string for us to play around with:

julia_string = "Julia is an amazing opensource programming language"
Julia is an amazing opensource programming language
  1. occursin, startswith and endswith: A conditional (returns either true or false) that checks whether one string is a:

    • substring of another; with occursin, the substring is the first argument

      occursin("Julia", julia_string)
      true
    • prefix of another; with startswith, the prefix is the second argument

      startswith(julia_string, "Julia")
      true
    • suffix of another; with endswith, the suffix is the second argument

      endswith(julia_string, "Julia")
      false
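  2. lowercase, uppercase, titlecase and lowercasefirst:

    lowercase(julia_string)
    julia is an amazing opensource programming language
    uppercase(julia_string)
    JULIA IS AN AMAZING OPENSOURCE PROGRAMMING LANGUAGE
    titlecase(julia_string)
    Julia Is An Amazing Opensource Programming Language
    lowercasefirst(julia_string)
    julia is an amazing opensource programming language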
  3. replace: introduces a new syntax, called the Pair

    replace(julia_string, "amazing" => "awesome")
    Julia is an awesome opensource programming language
  4. split: breaks up a string by a delimiter:

    split(julia_string, " ")
    SubString{String}["Julia", "is", "an", "amazing", "opensource", "programming", "language"]

String Conversions

Often, we need to convert between types in Julia. To convert a number to a string, we can use the string function:

my_number = 123
typeof(string(my_number))
String

Sometimes, we want the opposite: to convert a string to a number. Julia has a handy function for that: parse.

typeof(parse(Int64, "123"))
Int64

Sometimes, we want to play it safe with these conversions. That’s when the tryparse function steps in. It has the same functionality as parse but returns either a value of the requested type, or nothing. That makes tryparse handy when we want to avoid errors. Of course, you would need to deal with all those nothing values afterwards.

tryparse(Int64, "A very non-numeric string")
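
As a rough sketch of what "dealing with those nothing values" might look like, the example below parses a small, made-up vector of strings and then either drops the failed conversions or replaces them with a default. The input raw_values and the default of 0 are assumptions chosen only for illustration.

```julia
# Hypothetical input: strings that may or may not contain integers.
raw_values = ["123", "42", "not a number"]

# tryparse returns `nothing` instead of throwing an error.
parsed = [tryparse(Int64, s) for s in raw_values]

# Option 1: keep only the successful conversions.
valid = [p for p in parsed if !isnothing(p)]

# Option 2: replace failures with a default value (0 is an arbitrary choice).
with_default = [something(p, 0) for p in parsed]
```

Which option fits best depends on whether a missing value should be ignored or should still occupy a position in the result.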

3.2.3 Tuple

Julia has a data structure called tuple. Tuples are really special in Julia because they are often used in relation to functions. Since functions are an important feature in Julia, every Julia user should know the basics of tuples.

A tuple is a fixed-length container that can hold values of multiple different types. A tuple is an immutable object, meaning that it cannot be modified after instantiation. To construct a tuple, use parentheses () to delimit the beginning and end, along with commas , as delimiters between values:

my_tuple = (1, 3.14, "Julia")
(1, 3.14, "Julia")

Here, we are creating a tuple with three values. Each one of the values is a different type. We can access them via indexing. Like this:
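
A minimal sketch of tuple indexing, reusing my_tuple from above; the names on the left-hand side are only illustrative.

```julia
my_tuple = (1, 3.14, "Julia")

first_value = my_tuple[1]     # indexing starts at 1
second_value = my_tuple[2]
last_value = my_tuple[end]    # `end` refers to the last position
```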


We can also loop over tuples with the for keyword. And even apply functions to tuples. But we can never change any value of a tuple since they are immutable.

Remember the functions that return multiple values from an earlier section? Let’s inspect what our add_multiply function returns:

return_multiple = add_multiply(1, 2)
typeof(return_multiple)
Tuple{Int64, Int64}

This is because return a, b is the same as return (a, b):

1, 2
(1, 2)

So, now you can see why they are often related.

One more thing about tuples. When you want to pass more than one variable to an anonymous function, guess what you would need to use? Once again: tuples!

map((x, y) -> x^y, 2, 3)

Or, even more than two arguments:

map((x, y, z) -> x^y + z, 2, 3, 1)

3.2.4 Named Tuple

Sometimes, you want to name the values in tuples. That’s when named tuples come in. Their functionality is pretty much the same as tuples: they are immutable and can hold any type of value.

The construction of named tuples is slightly different from that of tuples. You have the familiar parentheses () and the comma , value separator. But now you name the values:

my_namedtuple = (i=1, f=3.14, s="Julia")
(i = 1, f = 3.14, s = "Julia")

We can access a named tuple’s values via indexing like regular tuples or, alternatively, access by their names with the .:
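
A short sketch of both access styles, reusing my_namedtuple from above. Indexing with a Symbol is included as a third option; the variable names are only illustrative.

```julia
my_namedtuple = (i=1, f=3.14, s="Julia")

by_position = my_namedtuple[3]    # positional indexing, as with a regular tuple
by_name = my_namedtuple.s         # access by name with the dot
by_symbol = my_namedtuple[:s]     # indexing with a Symbol also works
```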


To finish named tuples, there is one important quick syntax that you’ll see a lot in Julia code. Often Julia users create a named tuple by using the familiar parentheses () and commas ,, but without naming the values. To do so, you begin the named tuple construction with a semicolon ; before the values. This is especially useful when the values that would compose the named tuple are already defined in variables or when you want to avoid long lines:

i = 1
f = 3.14
s = "Julia"

my_quick_namedtuple = (; i, f, s)
(i = 1, f = 3.14, s = "Julia")

3.2.5 Ranges

A range in Julia represents an interval between start and stop boundaries. The syntax is start:stop:
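
A minimal sketch, assuming a range from 1 to 10 (the name my_range is only illustrative):

```julia
my_range = 1:10

range_type = typeof(my_range)    # UnitRange{Int64} on a 64-bit machine
```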


As you can see, our instantiated range is of type UnitRange{T} where T is the type inside the UnitRange:


And, if we gather all the values, we get:

[x for x in 1:10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

We can also construct ranges for other types:

typeof(1.0:10.0)
StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}}

Sometimes, we want to change the default interval stepsize behavior. We can do that by adding a stepsize in the range syntax start:step:stop. For example, suppose we want a range of Float64 from 0 to 1 with steps of size 0.2:
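
A small sketch of the start:step:stop syntax for exactly this case; the variable names are only illustrative.

```julia
step_range = 0.0:0.2:1.0

range_step = step(step_range)         # the step size of the range
range_values = collect(step_range)    # materialize the range as a Vector{Float64}
```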


If you want to “materialize” a range into a collection, you can use the function collect:

collect(1:10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

We have an array of the type specified in the UnitRange between the boundaries that we’ve set. Speaking of arrays, let’s talk about them.

3.2.6 Array

Arrays are a systematic arrangement of similar objects, usually in rows and columns. Most of the time you would want arrays of a single type for performance reasons, but note that they can also hold objects of different types. They are the “bread and butter” of data scientists, because arrays are what constitute most data manipulation and data visualization workflows.

Arrays are a powerful data structure. They are one of the main features that make Julia blazing fast.

Array Types

Let’s start with array types. There are several, but we will focus on the two most used in data science:

  • Vector{T}: one-dimensional array. Alias for Array{T, 1}.
  • Matrix{T}: two-dimensional array. Alias for Array{T, 2}.

Note here that T is the type of the elements of the array. So, for example, Vector{Int64} is a Vector in which all elements are Int64s, and Matrix{AbstractFloat} is a Matrix in which all elements are subtypes of AbstractFloat.

Most of the time, especially when dealing with tabular data, we are using either one- or two-dimensional arrays. They are both Array types for Julia. But we can use the handy aliases Vector and Matrix for clear and concise syntax.

Array Construction

How do we construct an array? The simplest answer is to use the default constructor. It accepts the element type as the type parameter inside the {} brackets, and inside the constructor you pass an initializer followed by the desired dimensions. It is common to initialize vectors and matrices with undefined elements by passing undef as the initializer. A vector of 10 undef Float64 elements can be constructed as:

my_vector = Vector{Float64}(undef, 10)
[6.9127222211763e-310, 6.9127222211779e-310, 6.91272222117946e-310, 6.91272222118104e-310, 6.9127222211826e-310, 6.9127222211842e-310, 6.9127222211858e-310, 6.91272222118737e-310, 6.91272222118895e-310, 0.0]

For matrices, since we are dealing with two-dimensional objects, we need to pass two dimension arguments inside the constructor: one for rows and another for columns. For example, a matrix with 10 rows, 2 columns and undef elements can be instantiated as:

my_matrix = Matrix{Float64}(undef, 10, 2)
10×2 Matrix{Float64}:
 0.0  0.0
 0.0  0.0
 0.0  0.0
 0.0  0.0
 0.0  0.0
 0.0  0.0
 0.0  0.0
 0.0  0.0
 0.0  0.0
 0.0  0.0

We also have some syntax aliases for the most common elements in array construction:
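
The list that belongs here appears to be missing from this copy. As a hedged sketch, the two most common helpers in Base are zeros and ones, which fill an array with zeros or ones; the variable names below are only illustrative.

```julia
my_vector_zeros = zeros(10)              # 10 elements, Float64 by default
my_matrix_zeros = zeros(Int64, 10, 2)    # element type given as the first argument

my_vector_ones = ones(Int64, 10)
my_matrix_ones = ones(10, 2)             # Float64 by default
```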

For other elements, we can first instantiate an array with undef elements and use the fill! function to fill all elements of the array with the desired element. Here’s an example with 3.14 (an approximation of π):

my_matrix_π = Matrix{Float64}(undef, 2, 2)
fill!(my_matrix_π, 3.14)
2×2 Matrix{Float64}:
 3.14  3.14
 3.14  3.14

We can also create arrays with array literals. For example, a 2x2 matrix of integers:

[[1 2]
 [3 4]]
2×2 Matrix{Int64}:
 1  2
 3  4

Array literals also accept a type specification before the [] brackets. So, if we want the same 2x2 array as before but now as floats, we can do so:

Float64[[1 2]
        [3 4]]
2×2 Matrix{Float64}:
 1.0  2.0
 3.0  4.0

It also works for vectors:

Bool[0, 1, 0, 1]
Bool[0, 1, 0, 1]

You can even mix and match array literals with the constructors:

[ones(Int, 2, 2) zeros(Int, 2, 2)]
2×4 Matrix{Int64}:
 1  1  0  0
 1  1  0  0
[zeros(Int, 2, 2)
 ones(Int, 2, 2)]
4×2 Matrix{Int64}:
 0  0
 0  0
 1  1
 1  1
[ones(Int, 2, 2) [1; 2]
 [3 4]            5]
3×3 Matrix{Int64}:
 1  1  1
 1  1  2
 3  4  5

Another powerful way to create arrays is with array comprehensions. This is our preferred way of creating arrays because it avoids explicit loops, indexing and other error-prone operations. You specify what you want to do inside the [] brackets. For example, say we want to create a vector of the squares of the numbers from 1 to 10:

[x^2 for x in 1:10]
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

They also support multiple inputs:

[x*y for x in 1:10 for y in 1:2]
[1, 2, 2, 4, 3, 6, 4, 8, 5, 10, 6, 12, 7, 14, 8, 16, 9, 18, 10, 20]

And conditionals:

[x^2 for x in 1:10 if isodd(x)]
[1, 9, 25, 49, 81]

As with array literals you can specify your desired type before the [] brackets:

Float64[x^2 for x in 1:10 if isodd(x)]
[1.0, 9.0, 25.0, 49.0, 81.0]

Finally, we can also create arrays with concatenation functions such as cat, vcat and hcat.

Array Inspection

Once we have arrays, the next logical step is to inspect them. There are a lot of handy functions that give the user insight into any array.

It is most useful to know what type of elements are inside an array. We can do this with eltype:
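
A minimal sketch of eltype, reusing my_matrix_π from the fill! example; the two literal vectors are extra illustrations that do not come from the original text.

```julia
my_matrix_π = Matrix{Float64}(undef, 2, 2)
fill!(my_matrix_π, 3.14)

matrix_eltype = eltype(my_matrix_π)    # Float64
vector_eltype = eltype([1, 2, 3])      # Int64 on a 64-bit machine
mixed_eltype = eltype([1, 2.5])        # Float64, since the integer is promoted
```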


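After knowing its types, one might be interested in array dimensions. Julia has several functions to inspect array dimensions: length returns the total number of elements, ndims the number of dimensions, and size a tuple with the length of each dimension.

Array Indexing and Slicing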

Sometimes we want to inspect only certain parts of an array. This is called indexing and slicing. If you want a particular observation of a vector, or a row or column of a matrix, you’ll probably need to index an array.

First, we will create an example vector and matrix to play around with:

my_example_vector = [1, 2, 3, 4, 5]

my_example_matrix = [[1 2 3]
                     [4 5 6]
                     [7 8 9]]

Let’s see first an example with vectors. Suppose you want the second element of a vector. You append [] brackets with the desired index inside:
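
A minimal sketch with the example vector defined above (the name second_element is only illustrative):

```julia
my_example_vector = [1, 2, 3, 4, 5]

second_element = my_example_vector[2]    # indexing starts at 1
```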


The same syntax follows with matrices. But, since matrices are 2-dimensional arrays, we have to specify both rows and columns. Let’s retrieve the element from the second row (first dimension) and first column (second dimension):

my_example_matrix[2, 1]

Julia also has conventional keywords for the first and last elements of an array: begin and end. For example, the second to last element of a vector can be retrieved as:
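
A short sketch with the example vector defined above; the variable names are only illustrative.

```julia
my_example_vector = [1, 2, 3, 4, 5]

first_element = my_example_vector[begin]
last_element = my_example_vector[end]
second_to_last = my_example_vector[end-1]    # arithmetic on `end` is allowed
```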


It also works for matrices. Let’s retrieve the element of the last row and second column:

my_example_matrix[end, begin+1]

Often, we are not only interested in just one array element, but in a whole subset of array elements. We can accomplish this by slicing an array. It uses the same index syntax, but with the added colon : to denote the boundaries that we are slicing through the array. For example, suppose we want to get the 2nd to 4th element of a vector:

my_example_vector[2:4]
[2, 3, 4]

We could do the same with matrices. With matrices in particular, if we want to select all elements along a given dimension, we can do so with just a colon :. For example, all elements in the second row:

my_example_matrix[2, :]
[4, 5, 6]

You can interpret this as something like “take the 2nd row and all columns.”

It also supports begin and end:

my_example_matrix[begin+1:end, end]
[6, 9]

Array Manipulations

There are several ways we could manipulate an array. The first would be to manipulate a singular element of the array. We just index the array by the desired element and proceed with an assignment =:

my_example_matrix[2, 2] = 42
my_example_matrix
3×3 Matrix{Int64}:
 1   2  3
 4  42  6
 7   8  9

Or you can manipulate a certain subset of elements of the array. In this case, we need to slice the array and then assign with =:

my_example_matrix[3, :] = [17, 16, 15]
my_example_matrix
3×3 Matrix{Int64}:
  1   2   3
  4  42   6
 17  16  15

Note that we had to assign a vector because our sliced array is of type Vector:

typeof(my_example_matrix[3, :])
Vector{Int64} (alias for Array{Int64, 1})

The second way we could manipulate an array is to alter its shape. Suppose you have a 6-element vector and you want to make it a 3x2 matrix. You can do so with reshape, by using the array as first argument and a tuple of dimensions as second argument:

six_vector = [1, 2, 3, 4, 5, 6]
three_two_matrix = reshape(six_vector, (3, 2))
3×2 Matrix{Int64}:
 1  4
 2  5
 3  6

You can do the reverse, converting it back to a vector, by specifying a tuple with only one dimension as the second argument:

reshape(three_two_matrix, (6, ))
[1, 2, 3, 4, 5, 6]

The third way we could manipulate an array is to apply a function over every array element. This is where the familiar broadcasting “dot” operator . comes in:

logarithm.(my_example_matrix)
3×3 Matrix{Float64}:
 0.0      0.693147  1.09861
 1.38629  3.73767   1.79176
 2.83321  2.77259   2.70805

We can also broadcast operators:

my_example_matrix .+ 100
3×3 Matrix{Int64}:
 101  102  103
 104  142  106
 117  116  115

We can also use map to apply a function to every element of an array:

map(logarithm, my_example_matrix)
3×3 Matrix{Float64}:
 0.0      0.693147  1.09861
 1.38629  3.73767   1.79176
 2.83321  2.77259   2.70805

It also accepts an anonymous function:

map(x -> x*3, my_example_matrix)
3×3 Matrix{Int64}:
  3    6   9
 12  126  18
 51   48  45

It also works with slicing:

map(x -> x + 100, my_example_matrix[:, 3])
[103, 106, 115]

Finally, sometimes, and especially when dealing with tabular data, we want to apply a function over all elements in a specific array dimension. This can be done with the mapslices function. Similar to map, the first argument is the function and the second argument is the array. The only change is that we need to specify the dims argument to indicate along which dimension we want to transform the elements.

For example, let’s use mapslices with the sum function along both the rows (dims=1, which yields one sum per column) and the columns (dims=2, which yields one sum per row):

# rows
mapslices(sum, my_example_matrix; dims=1)
1×3 Matrix{Int64}:
 22  60  24
# columns
mapslices(sum, my_example_matrix; dims=2)
3×1 Matrix{Int64}:
  6
 52
 48

Array Iteration

One common operation is to iterate over an array with a for loop. The regular for loop over an array returns each element.

The simplest example is with a vector.

simple_vector = [1, 2, 3]

empty_vector = Int64[]

for i in simple_vector
    push!(empty_vector, i + 1)
end

empty_vector
[2, 3, 4]

Sometimes you don’t want to loop over each element, but actually over each array index. We can use the eachindex function combined with a for loop to iterate over each array index.

Again, let’s show an example with a vector:

forty_two_vector = [42, 42, 42]

empty_vector = Int64[]

for i in eachindex(forty_two_vector)
    push!(empty_vector, i)
end

empty_vector
[1, 2, 3]

In this example the eachindex(forty_two_vector) iterator inside the for loop returns not forty_two_vector’s values but its indices: [1, 2, 3].

Iterating over matrices involves more details. The standard for loop goes first over columns then over rows. It will first traverse all elements in column 1, from the first row to the last row, then it will move to column 2 in a similar fashion until it has covered all columns.

For those familiar with other programming languages: Julia, like most scientific programming languages, is “column-major.” This means that arrays are stored contiguously using a column orientation. If at any time you are seeing performance problems and there is an array for loop involved, chances are that you are mismatching Julia’s native column-major storage orientation.

Ok, let’s show this in an example:

column_major = [[1 2]
                [3 4]]

row_major = [[1 3]
             [2 4]]

empty_vector = Int64[]

for i in column_major
    push!(empty_vector, i)
end

empty_vector
[1, 3, 2, 4]
empty_vector = Int64[]

for i in row_major
    push!(empty_vector, i)
end

empty_vector
[1, 2, 3, 4]
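
To connect the column-major advice above with explicit indexing, here is a rough sketch of nested loops ordered to match Julia's storage layout: the column index goes in the outer loop and the row index in the inner loop. The helper visit_column_major is hypothetical and exists only for this illustration.

```julia
# Hypothetical helper (not from the original text): visits a matrix with
# nested loops ordered to match Julia's column-major storage.
function visit_column_major(M)
    visited = Vector{eltype(M)}()
    for col in axes(M, 2)        # outer loop: columns
        for row in axes(M, 1)    # inner loop: rows, contiguous in memory
            push!(visited, M[row, col])
        end
    end
    return visited
end

column_major = [[1 2]
                [3 4]]

# Same visiting order as `for i in column_major`.
order = visit_column_major(column_major)
```

Swapping the two loops would still visit every element, but it would jump across memory instead of walking through it contiguously, which tends to be slower for large matrices.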

There are some handy functions to iterate over matrices.
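
Two such functions in Base are eachcol and eachrow, which iterate over the columns and the rows of a matrix respectively. The sketch below reuses column_major from above; the other names are only illustrative.

```julia
column_major = [[1 2]
                [3 4]]

first_column = first(eachcol(column_major))    # the elements 1 and 3
first_row = first(eachrow(column_major))       # the elements 1 and 2

# Both iterators can be used in comprehensions and for loops.
column_sums = [sum(col) for col in eachcol(column_major)]
row_sums = [sum(row) for row in eachrow(column_major)]
```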

3.2.7 Pair

Compared to the huge section on arrays, this section on pairs will be brief. Pair is a data structure that holds two objects (which typically belong to each other). We construct a pair in Julia using the following syntax:

my_pair = Pair("Julia", 42)
"Julia" => 42

Alternatively, we can create a pair by specifying both values and in between we use the pair => operator:

my_pair = "Julia" => 42
"Julia" => 42

The elements are stored in the fields first and second.
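
A minimal sketch of accessing both elements of my_pair, either through the fields or through the first and last functions; the variable names are only illustrative.

```julia
my_pair = "Julia" => 42

key_by_field = my_pair.first
value_by_field = my_pair.second

# The functions first and last work on pairs as well.
key_by_function = first(my_pair)
value_by_function = last(my_pair)
```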


Pairs will be used a lot in data manipulation and data visualization since the main functions of both DataFrames.jl (Section 4) and Plots.jl (Section 5) depend on Pair as type arguments.

3.2.8 Dict

If you understood what a Pair is, then Dict won’t be a problem. Dict in Julia is just a “hash table” with pairs of key and value. Keys and values can be of any type, but generally you’ll see keys as strings.

There are two ways to construct Dicts in Julia. The first is using the default constructor Dict and passing a vector of tuples composed of (key, value):

my_dict = Dict([("one", 1), ("two", 2)])
Dict{String, Int64} with 2 entries:
  "two" => 2
  "one" => 1

We prefer the second way of constructing Dicts. It offers a much more elegant and expressive syntax. You use the same default constructor Dict, but now you pass pairs of key and value:

my_dict = Dict("one" => 1, "two" => 2)
Dict{String, Int64} with 2 entries:
  "two" => 2
  "one" => 1

You can retrieve a Dict’s value by indexing it by the corresponding key:
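
A short sketch with my_dict from above. The get function with a default value is an extra illustration, not something the original text covers; it avoids an error when the key is absent.

```julia
my_dict = Dict("one" => 1, "two" => 2)

value_one = my_dict["one"]    # indexing by key

# Indexing a missing key throws a KeyError; get returns a default instead
# (the default of 0 is an arbitrary choice).
value_missing = get(my_dict, "three", 0)
```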


Similarly, to add a new entry you index the Dict by the desired key and assign a value with the assignment = operator:

my_dict["three"] = 3

If you want to check if a Dict has a certain key you can use the haskey function:

haskey(my_dict, "two")

To delete a key you can use the delete! function:

delete!(my_dict, "three")
Dict{String, Int64} with 2 entries:
  "two" => 2
  "one" => 1

Or, to delete a key while returning its value, you can use the pop! function:

popped_value = pop!(my_dict, "two")

Now my_dict has only one key:

my_dict
Dict{String, Int64} with 1 entry:
  "one" => 1

Dicts are also used in data manipulations by DataFrames.jl (Section 4) and data visualization by Plots.jl (Section 5). So it is important to know their basic functionality.

There is one useful Dict constructor that we use a lot. Suppose you have two vectors and you want to construct a Dict with one of them as keys and the other as values. You can do that with the zip function which “glues” together two objects just like a zipper:

A = ["one", "two", "three"]
B = [1, 2, 3]

dic = Dict(zip(A, B))
Dict{String, Int64} with 3 entries:
  "two" => 2
  "one" => 1
  "three" => 3

For instance, we can now get the number 3 via:

dic["three"]
3

3.2.9 Symbol

Symbol is actually not a data structure. It is a type and behaves a lot like a string. Instead of surrounding the text by quotation marks, a symbol starts with a colon (:) and can contain underscores:

sym = :some_text

Since symbols and strings are so similar, we can easily convert a symbol to string and vice versa:

s = string(sym)
sym = Symbol(s)

One simple benefit of symbols is that you have to type one character less, that is, :some_text versus "some_text". We use Symbols a lot in data manipulations with the DataFrames.jl package (Section 4) and data visualizations (Sections 5 and 6).

3.2.10 Splat Operator

In Julia we have the “splat” operator ..., which is mainly used in function calls to turn a collection into a sequence of arguments. We will occasionally use splatting in some function calls in the data manipulation and data visualization chapters.

The most intuitive way to learn about splatting is with an example. The add_elements function below takes three arguments to be added together:

add_elements(a, b, c) = a + b + c
add_elements (generic function with 1 method)

Now suppose that we have a collection with three elements. The naïve way to do this would be to supply the function with all three elements as function arguments like this:

my_collection = [1, 2, 3]

add_elements(my_collection[1], my_collection[2], my_collection[3])

Here is where we use the “splat” operator ..., which takes a collection (often an array, vector, tuple or range) and converts it into a sequence of arguments:

add_elements(my_collection...) # and splat!

The ... is included after the collection that we want to “splat” into a sequence of arguments. In the example above, syntactically speaking, the following are the same:

collection = [x, y, z]

function(collection...) = function(x, y, z)

Anytime Julia sees a splatting operator inside a function call, it will be converted into a sequence of arguments for all elements of the collection separated by commas.

It also works for ranges:

add_elements(1:3...) # and splat!
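
As a further sketch, splatting also works with tuples and can be mixed with regular arguments. The examples reuse add_elements from above; the variable names are only illustrative.

```julia
add_elements(a, b, c) = a + b + c

from_vector = add_elements([1, 2, 3]...)
from_tuple = add_elements((1, 2, 3)...)
from_range = add_elements(1:3...)

# A regular argument followed by a splatted collection of the remaining two.
mixed = add_elements(10, (20, 30)...)
```

Note that the number of splatted elements must match the number of arguments the function expects; otherwise Julia raises a MethodError.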

CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer and Lazaro Alonso