7.2 Dispersion Measures

In statistics, dispersion measures how spread out observations are around a central tendency. It is also called variability or spread. One interesting property is that, contrary to central tendencies, dispersion measures cannot be negative: they are either zero or positive. In other words, we do not have negative measures of dispersion.

7.2.1 Variance and Standard Deviation

The first dispersion measures that we will cover are variance and standard deviation. We cover both measures together because they are closely related.

Let’s start with the variance. Variance is the mean of the squared difference between measurements and their average value:

\[ \sigma^2(x) = \frac{1}{n-1} \sum^n_{i=1} (x_i - \bar{x})^2. \qquad(3)\]

We use the squared Greek letter \(\sigma^2\) to denote variance, but you might also find variance denoted by the operator \(\operatorname{var}\). Note that we divide by \(n-1\) in Equation 3. This is a bias correction: we have already used one degree of freedom to estimate the mean \(\bar{x}\). Degrees of freedom are outside the scope of this book, so we won’t cover them in detail, but feel free to check Wikipedia for an in-depth explanation.

Since we are squaring the differences in Equation 3, the variance has a property that all dispersion measures have: the variance cannot be negative.

The variance can be found with the var function from Julia’s standard library Statistics module:

using Statistics: var

Like before, we can apply the variance to different groups in our more_grades DataFrame:

gdf = groupby(more_grades(), :name)
combine(gdf, :grade => var)
name grade_var
Sally 19.083333333333336
Bob 5.25
Alice 4.5
Hank 1.3333333333333335

We can see that Sally has the highest dispersion in her grades, as measured by variance.
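
To connect Equation 3 with the var function, here is a minimal sketch that writes the formula out by hand. It uses Hank’s three grades, which are listed later in this section. The helper name sample_variance is our own and not part of any library.

```julia
using Statistics: mean, var

# Hank's grades, as listed later in this section
hank = [4.0, 6.0, 6.0]

# Equation 3 written out by hand (sample_variance is a hypothetical helper name)
function sample_variance(x)
    n = length(x)
    x̄ = mean(x)
    # sum of squared deviations from the mean, divided by n - 1
    return sum((xi - x̄)^2 for xi in x) / (n - 1)
end

# Both approaches should agree up to floating-point rounding
sample_variance(hank) ≈ var(hank)

# Dividing by n instead of n - 1 gives the uncorrected (biased) estimate
var(hank; corrected=false) ≈ sum((hank .- mean(hank)) .^ 2) / length(hank)
```

Both comparisons should evaluate to true, and sample_variance(hank) should agree with Hank’s row in the table above.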

The standard deviation is the square root of the mean of the squared difference between measurements and their average value. Or, in simpler words: it is the square root of the variance:

\[ \sigma(x) = \sqrt{\sigma^2(x)} = \sqrt{\frac{1}{n-1} \sum^n_{i=1} (x_i - \bar{x})^2}. \qquad(4)\]

In a similar fashion, for the standard deviation, we can use the std function from Julia’s standard library Statistics module:

using Statistics: std
gdf = groupby(more_grades(), :name)
combine(gdf, :grade => std)
name grade_std
Sally 4.368447474027053
Bob 2.29128784747792
Alice 2.1213203435596424
Hank 1.1547005383792517

Since the standard deviation is the square root of the variance, our measures of dispersion have only been rescaled. Sally still has the highest dispersion in her grades measured either by variance or standard deviation.
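
As a quick check of Equation 4, the sketch below takes the square root of the variance of Hank’s grades (listed later in this section) and compares it with std. Note that the standard deviation is expressed in the same units as the grades, whereas the variance is in squared units.

```julia
using Statistics: std, var

# Hank's grades, as listed later in this section
hank = [4.0, 6.0, 6.0]

# Equation 4: the standard deviation is the square root of the variance
sqrt(var(hank)) ≈ std(hank)
```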

As we did before, in Figure 52, we have two data distributions:

Figure 52: Normal and Non-Normal Distributed Data – Differences Between Standard Deviations.

We can see that the mean \(\mu\) is slightly shifted to the right by the few influential observations and that the dispersion, measured as \(\pm 1\sigma\) (one standard deviation) away from the mean, inherits the bias from the mean.

7.2.2 Median Absolute Deviation

Since variance and standard deviation use the mean in their mathematical formulation, they are also sensitive to outliers. This is where a dispersion measure that uses the median instead of the mean is useful. This is exactly the case of the median absolute deviation (mad), which is defined as the median of the absolute difference between measurements and their median value. mad is an extremely robust dispersion measure since it uses the median twice: first to calculate the central tendency, and then to summarize the absolute differences between the observations and that central tendency:

\[ \operatorname{mad}(x) = \operatorname{median}(|x_i - \operatorname{median}(x)|), \qquad(5)\]

where \(|x|\) denotes the absolute value of \(x\).

The median absolute deviation is available as the function mad in StatsBase.jl:

using StatsBase: mad

Let’s see what the dispersion in our more_grades DataFrame looks like when measured with mad:

gdf = groupby(more_grades(), :name)
combine(gdf, :grade => mad)
name grade_mad
Sally 3.7065055462640046
Bob 2.2239033277584026
Alice 2.2239033277584026
Hank 0.0

We can see that Sally still has the highest dispersion in her grades, but now Bob’s and Alice’s dispersions are the same. Also note that, by using \(\operatorname{mad}\), Hank’s dispersion is zero. This happens because two of Hank’s three grades have the same value:

df = more_grades()
filter!(row -> row.name == "Hank", df)
df
name grade
Hank 4.0
Hank 6.0
Hank 6.0

If we plug Hank’s grades into Equation 5, the median grade is \(6\), so the absolute differences from the median are \([2, 0, 0]\). We then have to calculate \(\operatorname{median}([2, 0, 0])\), so we end up with the middle value of the ordered list, which is \(0\).
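
The calculation for Hank can be reproduced step by step with only the median function. This is a minimal sketch of Equation 5 as written, and raw_mad is a hypothetical helper name. Note that the non-zero values in the table above appear to be about 1.4826 times larger than what Equation 5 alone would give (for example, 2.2239 ≈ 1.4826 × 1.5). This is consistent with StatsBase’s mad applying a normalization constant by default. We believe that passing normalize=false returns the unscaled value, but please check the StatsBase.jl documentation for your version.

```julia
using Statistics: median

# Hank's grades, as listed above
hank = [4.0, 6.0, 6.0]

# Equation 5 written out (raw_mad is a hypothetical helper name)
raw_mad(x) = median(abs.(x .- median(x)))

center = median(hank)                 # central tendency: 6.0
deviations = abs.(hank .- center)     # absolute deviations: [2.0, 0.0, 0.0]
raw_mad(hank)                         # median of the deviations: 0.0
```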

Once more, in Figure 53, we have two data distributions:

Figure 53: Normal and Non-Normal Distributed Data – Differences Between Median Absolute Deviations.

Note that mad is extremely robust against influential observations and, contrary to the standard deviation, does not inherit any bias from the underlying central tendency measure, since that measure is the median. mad can be an effective dispersion measure for non-normal data in which a few influential observations shift a non-robust central tendency (such as the mean).
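
To make this robustness concrete, the sketch below compares the standard deviation and the median absolute deviation on a small made-up vector, with and without a single extreme value. The data and the helper raw_mad are illustrative assumptions and are not taken from the figure.

```julia
using Statistics: median, std

# Unscaled median absolute deviation, following Equation 5 (hypothetical helper)
raw_mad(x) = median(abs.(x .- median(x)))

clean = [1.0, 2.0, 3.0, 4.0]     # made-up data
with_outlier = [clean; 100.0]    # same data plus one influential observation

std(clean), std(with_outlier)          # roughly 1.29 versus roughly 43.6
raw_mad(clean), raw_mad(with_outlier)  # 1.0 in both cases
```

In this toy example a single extreme value inflates the standard deviation by more than an order of magnitude, while the median absolute deviation does not move.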

7.2.3 Percentiles and Quantiles

In statistics, a percentile is a score below which a given percentage of the observations falls. For example, the median is the 50th percentile (the 0.5 quantile). Or, if we want the top 5% highest values of our observations, we would select observations from the 95th percentile onwards.
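
In Julia, percentiles can be computed with the quantile function from the Statistics module, which takes the probability as a number between 0 and 1. The sketch below uses made-up data to show that the 0.5 quantile coincides with the median, and one possible way to select the top 5% of the observations.

```julia
using Statistics: median, quantile

scores = collect(1.0:100.0)   # made-up data: 1, 2, ..., 100

# The 50th percentile is the median
quantile(scores, 0.5) ≈ median(scores)

# Observations at or above the 95th percentile
cutoff = quantile(scores, 0.95)
top = filter(s -> s >= cutoff, scores)
```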

Some percentiles have special names:

These, including percentiles, are broadly called quantiles. Quantiles are cut points that divide the ordered observations into intervals that each contain an equal share of the data.

The most important and commonly used quantiles are the quartiles, or 4-quantiles, which we denote with the letter Q followed by a number to identify which quartile we are referring to. Since three cut points divide the data into four parts, we have Q1, Q2, and Q3, corresponding to the first, second, and third quartile, respectively. Q2 (the 50th percentile) is also the median, and Q1 and Q3 are the 25th and 75th percentiles. The quartiles are important because we often use them to build a measure of dispersion. This measure is known as the interquartile range (IQR) and is the difference between the third and first quartile:

\[ \operatorname{IQR}(x) = \operatorname{Q3}(x) - \operatorname{Q1}(x). \qquad(6)\]

Like the median absolute deviation, the IQR is robust to outliers, since it is based on quartiles rather than on the mean.
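
Equation 6 can be sketched with the quantile function from the Statistics module. The data below are made up, and interquartile_range is a hypothetical helper name. StatsBase.jl also offers an iqr function, which we do not rely on here.

```julia
using Statistics: quantile

# Equation 6 written out (interquartile_range is a hypothetical helper name)
interquartile_range(x) = quantile(x, 0.75) - quantile(x, 0.25)

clean = collect(1.0:9.0)              # made-up data: 1, 2, ..., 9
with_outlier = [clean[1:8]; 100.0]    # replace the largest value by an outlier

interquartile_range(clean)            # Q3 - Q1 = 7 - 3
interquartile_range(with_outlier)     # unchanged by the outlier
```

Because the outlier lies beyond the third quartile, it does not change either cut point, so both calls should return the same value.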

As before, in Figure 54, we have two data distributions:

Figure 54: Normal and Non-Normal Distributed Data – Differences Between IQR Measures.

Here we can see that the median is not affected by the few influential observations and that the IQR, bounded by the first and third quartiles, covers the central 50% of our observations.

7.2.4 Advice on Dispersion Measures

You might be wondering: “which dispersion measure shall I use? Variance? Standard Deviation? Median Absolute Deviation? IQR?” Like before, we provide the following advice:


  17. we would suggest using the StatsBase.countmap function which returns a dictionary that maps each unique value in a given vector to its number of occurrences.



DRAFT - CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso