7.5 Statistical Visualizations

There are several statistical visualization techniques. For now, we will focus on only three: histograms, box plots, and density plots, since they are commonly used to analyze univariate data.

We will again use the more_grades dataset from Section 7.1.

7.5.1 Histograms

As already briefly shown in Section 7.4, histograms approximate the distribution of the given data. We construct them by “binning,” i.e., dividing the range of values into a series of intervals (the bins) and then counting how many values fall in each interval. Each bin is represented as a bar whose height describes the frequency of values belonging to that bin.
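
To make the binning step concrete, here is a minimal sketch in plain Julia. The grades vector and the bin_counts helper are hypothetical and exist only for this illustration; the sketch uses equal-width bins and is not meant to reproduce how hist! picks its bin edges.

```julia
# Hypothetical grades, made up for illustration (not the more_grades data)
grades = [1.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0, 10.0]

# Count how many values fall into each of `nbins` equal-width intervals.
# Assumes the data are not all identical (otherwise the bin width is zero).
function bin_counts(data, nbins)
    lo, hi = extrema(data)
    width = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for v in data
        # Bins are half-open [edge, next edge); the maximum goes into the last bin
        idx = min(floor(Int, (v - lo) / width) + 1, nbins)
        counts[idx] += 1
    end
    edges = [lo + k * width for k in 0:nbins]
    return edges, counts
end

edges, counts = bin_counts(grades, 3)
# edges  == [1.0, 4.0, 7.0, 10.0]
# counts == [2, 6, 5]
```

The heights of the three bars in the corresponding histogram would be 2, 6, and 5.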

We can draw histograms using Makie.jl:

df = more_grades()
fig = Figure(; resolution=(600, 400))
ax = Axis(fig[1, 1], xticks=1:10)
hist!(ax, df.grade; color=(:dodgerblue, 0.5))

Note that by default hist! uses 15 bins. We can change that with the bins keyword:

df = more_grades()
fig = Figure(; resolution=(600, 400))
ax = Axis(fig[1, 1], xticks=1:10)
hist!(ax, df.grade; color=(:dodgerblue, 0.5), bins=10)

We can see clearly that most of the grades are between 4 and 9.

7.5.2 Box Plots

Box plots are a method for graphically depicting numerical data through their quartiles (see Figure 67). The “box” spans the first to the third quartile (see Section 7.2.3). The median, also known as the second quartile (Q2) or the 0.5 quantile, is the line inside the box. The first and third quartiles, Q1 and Q3, or the 0.25 and 0.75 quantiles, respectively, are the box’s lower and upper bounds. Finally, we have the “whiskers” which, traditionally (and by default in most data visualization tools), extend from the box to the most extreme observations lying within 1.5 times the interquartile range (IQR) of Q1 and Q3.
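
The quantities behind a box plot can be computed by hand. The following is a minimal sketch using only the Statistics standard library; the grades vector is hypothetical, and a plotting library may use a slightly different quantile definition than the default of quantile.

```julia
using Statistics

# Hypothetical grades, made up for illustration
grades = [0.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.0, 7.5, 8.0, 8.5, 9.0]

# Box: first quartile, median, and third quartile
q1, q2, q3 = quantile(grades, [0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences: 1.5 times the IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside = filter(v -> lower_fence <= v <= upper_fence, grades)
whisker_lo, whisker_hi = extrema(inside)

# Anything beyond the fences is flagged as an outlier
outliers = filter(v -> v < lower_fence || v > upper_fence, grades)
# (q1, q2, q3) == (5.75, 7.0, 7.75), whiskers at 5.0 and 9.0, outliers == [0.0]
```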

The basic box plot can be drawn using Makie.jl (see Chapter 6). It accepts x and y vectors, which represent the positions of the categories and the values within the boxes, respectively. Since the elements of our x vector are of type String, we need to convert them to categorical using CategoricalArrays.jl (Section 4.5) and then pass the Axis keyword argument xticks (see Section 6.2) as a tuple of values and labels. For the xticks labels we use the levels function from CategoricalArrays.jl, which returns the categorical levels of our name variable in the same order as the integer codes. Finally, for the x vector inside Makie’s boxplot function, we wrap the name variable with the levelcode function, also from CategoricalArrays.jl, which returns the underlying integer codes of our categorical variable name. We do this because Makie’s boxplot expects numeric positions rather than strings for the x argument. Here is the code:

df = more_grades()
transform!(df, :name => categorical; renamecols=false)
fig = Figure(; resolution=(600, 400))
ax = Axis(fig[1, 1]; xticks = (1:4, levels(df.name)))
boxplot!(ax, levelcode.(df.name), df.grade)
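
To see what levels and levelcode return, here is a small sketch with CategoricalArrays.jl. The names vector is hypothetical and made up for this illustration.

```julia
using CategoricalArrays

# Hypothetical names, made up for illustration
names_vec = categorical(["Sally", "Bob", "Alice", "Bob", "Sally"])

# The distinct categories; for strings the default order is alphabetical
lvls = levels(names_vec)       # ["Alice", "Bob", "Sally"]

# The integer position of each element's level within `lvls`
codes = levelcode.(names_vec)  # [3, 2, 1, 2, 3]
```

Since the codes index into the levels, passing the levels as tick labels at positions 1 to the number of levels keeps boxes and labels aligned.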

By default, Makie.jl draws the whiskers with a range of 1.5 times the IQR. However, sometimes we see whiskers drawn with a different IQR multiplier, or with a small bar at the whiskers’ tips to make them easier to see. We can control both of those with the range (default 1.5) and whiskerwidth (default 0.0) arguments:

df = more_grades()
transform!(df, :name => categorical; renamecols=false)
fig = Figure(; resolution=(600, 400))
ax = Axis(fig[1, 1]; xticks = (1:4, levels(df.name)))
boxplot!(ax, levelcode.(df.name), df.grade; range=2.0, whiskerwidth=0.5)

Box plots can also flag anything outside the whiskers as outliers. Whether these observations are drawn in Makie.jl is controlled by the show_outliers argument:

df = more_grades()
transform!(df, :name => categorical; renamecols=false)
fig = Figure(; resolution=(600, 400))
ax = Axis(fig[1, 1]; xticks = (1:4, levels(df.name)))
boxplot!(ax, levelcode.(df.name), df.grade; range=0.5, show_outliers=true)

As you can see, box plots are a useful way to visualize data using measures of central tendency and dispersion that are robust to outliers.

7.5.3 Density Plots

Box plots limit us to summary statistics like the median, quartiles, and IQR. Often we want to see the underlying distribution of the data. Histograms are discrete approximations. If we would like a continuous approximation we need something else: density plots. Density plots are graphical density estimates of numerical data. They show us the approximate distribution of a given variable by depicting it as a density: the higher the curve at a given point, the more likely the variable is to take values near that point.
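
One common way to obtain such a continuous approximation is a kernel density estimate. The following is a minimal sketch with a Gaussian kernel, using only the Statistics standard library. The grades vector, the gaussian and kde helpers, and the bandwidth rule are assumptions made for this illustration, and this is not necessarily the estimator that a plotting library uses internally.

```julia
using Statistics

# Standard normal probability density, used here as the kernel
gaussian(u) = exp(-u^2 / 2) / sqrt(2π)

# Kernel density estimate at point `x` for `data` with bandwidth `h`:
# an average of kernels centered at each observation
kde(x, data, h) = sum(gaussian((x - d) / h) for d in data) / (length(data) * h)

# Hypothetical grades, made up for illustration
grades = [4.0, 5.0, 5.5, 6.0, 6.0, 6.5, 7.0, 8.0, 9.0]

# Normal reference rule of thumb; one bandwidth choice among several
h = 1.06 * std(grades) * length(grades)^(-1 / 5)

# Evaluate the estimate on a grid that comfortably covers the data
xs = range(0, 12; length=241)
ys = [kde(x, grades, h) for x in xs]

# Riemann-sum check: a density should integrate to roughly 1
area = sum(ys) * step(xs)
```

A smaller bandwidth h gives a wigglier curve that follows individual observations, while a larger one gives a smoother curve.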

A density plot can also be drawn using Makie.jl. However, it is more convoluted than the box plot. First, we want to pass to each density! call only the values belonging to one category (here, one name). Thus, we define a values function that accepts a code argument and filters the dataset by comparing the name variable, wrapped with the levelcode function, against that code. Then, we plot a density plot object for each of the levels of name. Finally, we make sure that each density plot object sits at its own ytick by pairing the offset keyword with custom yticks in the Axis constructor, specified, same as before, as a tuple of values and labels. The for loop increments i from 1 to 4, in steps of 1, and uses it both as the offset argument for density! and as the code argument for values:

df = more_grades()
transform!(df, :name => categorical; renamecols=false)
categories = levels(df.name)
values(code) = filter(row -> levelcode(row.name) == code, df).grade
fig = Figure(; resolution=(600, 400))
ax = Axis(fig[1, 1]; yticks = (1:4, categories), limits=((-1, 11), nothing))
for i in 1:length(categories)
density!(ax, values(i); offset=i)
end

As explained in Section 6.5, we can change Makie’s colors by either specifying a color or colormap. This can also be applied to density:

df = more_grades()
transform!(df, :name => categorical; renamecols=false)
categories = levels(df.name)
values(code) = filter(row -> levelcode(row.name) == code, df).grade
fig = Figure(; resolution=(600, 400))
ax1 = Axis(fig[1, 1]; yticks = (1:4, categories), limits=((-1, 11), nothing))
ax2 = Axis(fig[1, 2]; yticks = (1:4, categories), limits=((-1, 11), nothing))
for i in 1:length(categories)
density!(ax1, values(i); offset=i, color=(:dodgerblue, 0.5))
end
for i in 1:length(categories)
density!(ax2, values(i); offset=i, color=:x, colormap=:viridis)
end

Here, in the first axis (left) we use one specific color for all of the density plot objects. In the second axis (right) we pass :x to color to tell Makie to apply the colormap gradient along the x-axis (from left to right), and we choose the palette with colormap=:viridis. A color gradient in the y direction is the more common choice and serves as a visual aid for identifying trends; a gradient in the x direction is useful when you want to see how things change along, for example, a time-dependent variable, but it is not widely used.

7.5.4 Anscombe Quartet

We conclude this Statistics chapter with a demonstration of the importance of data visualization in statistical analysis. For this, we present the Anscombe Quartet, which comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and look very different when plotted. Each dataset has 11 observations of the variables x and y. They were created in 1973 by the statistician Francis Anscombe to show the importance of plotting data before conducting statistical analysis. Here is the table with the four datasets:

Table 8: Anscombe Quartet
x_1    y_1    x_2    y_2    x_3    y_3    x_4    y_4
10.0   8.04   10.0   9.14   10.0   7.46   8.0    6.58
8.0    6.95   8.0    8.14   8.0    6.77   8.0    5.76
13.0   7.58   13.0   8.74   13.0   12.74  8.0    7.71
9.0    8.81   9.0    8.77   9.0    7.11   8.0    8.84
11.0   8.33   11.0   9.26   11.0   7.81   8.0    8.47
14.0   9.96   14.0   8.1    14.0   8.84   8.0    7.04
6.0    7.24   6.0    6.13   6.0    6.08   8.0    5.25
4.0    4.26   4.0    3.1    4.0    5.39   19.0   12.5
12.0   10.84  12.0   9.13   12.0   8.15   8.0    5.56
7.0    4.82   7.0    7.26   7.0    6.42   8.0    7.91
5.0    5.68   5.0    4.74   5.0    5.73   8.0    6.89

Now, if you look at the descriptive statistics for both the x and y variables in all 4 datasets, they are pretty much the same, and so is the correlation between x and y (all rounded to 2 decimal places):

df = anscombe_quartet()
round_up = x -> round(x; digits=2)
combine(groupby(df, :dataset),
[:x, :y] .=> round_up .∘ [mean std],
[:x, :y]  => round_up ∘ cor)
dataset  x_function_mean  y_function_mean  x_function_std  y_function_std  x_y_function_cor
1.0      9.0              7.5              3.32            2.03            0.82
2.0      9.0              7.5              3.32            2.03            0.82
3.0      9.0              7.5              3.32            2.03            0.82
4.0      9.0              7.5              3.32            2.03            0.82

Now, if we take a look at a simple scatter plot of all 4 datasets, we clearly see that something else is going on:

Here, the first dataset (upper left) is a frequent situation that we encounter in data science: x and y are correlated with added random noise. In the second dataset (upper right), the relationship is non-linear. For the third dataset (lower left), we see a perfect linear relationship except for one outlier at the second-largest x value (x = 13, y = 12.74). Finally, for the fourth dataset (lower right) there is no relationship at all, apart from a single outlier observation.
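
A scatter plot like the one described above can be sketched as follows, assuming CairoMakie.jl as the Makie backend. To keep the sketch self-contained, the data are copied from Table 8 instead of being loaded with anscombe_quartet(); the variable names and the panel titles are choices made for this illustration.

```julia
using CairoMakie

# Data copied from Table 8; datasets 1 to 3 share the same x values
x123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]
datasets = [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]

fig = Figure()
for (i, (x, y)) in enumerate(datasets)
    # 2x2 grid: datasets 1 and 2 in the top row, 3 and 4 in the bottom row
    ax = Axis(fig[fld1(i, 2), mod1(i, 2)]; title="Dataset $i", xlabel="x", ylabel="y")
    scatter!(ax, x, y)
end
fig
```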

The Anscombe Quartet tells us that descriptive statistics can sometimes fool us, so we should also rely on visualizations when analyzing our data.

DRAFT - CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso