## 7.1 Measures of Central Tendency

The most basic way of using descriptive statistics is to summarize data by a measure of central tendency.

### 7.1.1 Mean

The most common measure of central tendency is the mean. Let $$\mathbf{x}$$ denote the vector $$x_1, x_2, \dots, x_n$$. The mean of $$\mathbf{x}$$, also known as the average, is the sum of all observations divided by the number of observations:

$\bar{x} = \frac{1}{n} \sum^n_{i=1} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}, \qquad(1)$

where $$\bar{x}$$ is pronounced “x bar.” It is common in statistics to denote sample statistics with a Roman letter, such as $$x$$ or $$\bar{x}$$, and population parameters with a Greek letter, such as $$\mu$$. Additionally, the mean can be used to calculate the expectation in discrete settings: when each of the $$n$$ values is equally likely, the expectation is exactly the mean. The expectation is typically represented by the operator $$\operatorname{E}$$, thus the mean of $$\mathbf{x}$$ is $$\text{mean}(\mathbf{x}) = \operatorname{E}(\mathbf{x})$$.

The mean can be found using the mean function from the Statistics module, part of Julia’s standard library:

    using Statistics: mean

We can also apply the mean to different groups in our data, as we did in Section 4.8. For example, we have the all_grades DataFrame:

    all_grades()

| name | grade |
| --- | --- |
| Sally | 1.0 |
| Bob | 5.0 |
| Alice | 8.5 |
| Hank | 4.0 |
| Bob | 9.5 |
| Sally | 9.5 |
| Hank | 6.0 |

Let’s add more grades for our students so that we have more numbers from which to calculate central tendencies:

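    more_grades()

| name | grade |
| --- | --- |
| Sally | 1.0 |
| Bob | 5.0 |
| Alice | 8.5 |
| Hank | 4.0 |
| Bob | 9.5 |
| Sally | 9.5 |
| Hank | 6.0 |
| Bob | 6.5 |
| Sally | 7.0 |
| Hank | 6.0 |
| Alice | 5.5 |

    gdf = groupby(more_grades(), :name)
    combine(gdf, :grade => mean)

| name | grade_mean |
| --- | --- |
| Sally | 5.833333333333333 |
| Bob | 7.0 |
| Alice | 7.0 |
| Hank | 5.333333333333333 |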
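To see where one of these group means comes from, Equation (1) can be applied by hand to a single student. The sketch below copies Sally's three grades from the more_grades output (1.0, 9.5, and 7.0) into a plain vector; the variable names are ours and are used only for illustration.

```julia
using Statistics: mean

# Sally's three grades, copied from the more_grades output
sally = [1.0, 9.5, 7.0]

# Equation (1) written out: sum of the values divided by their count
manual_mean = sum(sally) / length(sally)

# The library function should agree with the manual computation
library_mean = mean(sally)

println(manual_mean)   # about 5.83, matching Sally's row in the grouped result
println(library_mean)
```

Both values should agree with Sally's entry in the grouped result, which is a quick sanity check that the grouped mean does nothing more than apply Equation (1) within each group.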

### 7.1.2 Median

We will see that the mean is highly sensitive to outliers and can sometimes be misleading. This is especially true when we are dealing with long-tailed (non-normal) distributions[^16]. That is why we are sometimes interested in the median, which is the middle value that separates the higher half from the lower half of the data. Intuitively, the median tells us the value of the data’s 50th percentile. One example is when we are analyzing income: the median is the value below which half of the observations fall. So, if the median income is \$80,000, we expect half of our observations to earn between the minimum value and \$80,000. The mathematical formula for the median is:

$\operatorname{median}(\mathbf{x}) = \frac{x_{\lfloor (\# \mathbf{x}+1) \div 2 \rfloor} + x_{\lceil (\# \mathbf{x}+1) \div 2 \rceil}}{2}, \qquad(2)$

where:

• $$\mathbf{x} = x_1, \dots, x_n$$ is a vector of numbers sorted in ascending order
• $$x_i$$ is the element of vector $$\mathbf{x}$$ at position $$i$$
• $$\# \mathbf{x}$$ is the length of $$\mathbf{x}$$
• $$\lfloor u \rfloor$$ is $$u$$ rounded down to the nearest integer
• $$\lceil u \rceil$$ is $$u$$ rounded up to the nearest integer

Similarly, we can use the median function from the Statistics module to compute the median of our data:

    using Statistics: median
    gdf = groupby(more_grades(), :name)
    combine(gdf, :grade => median)

| name | grade_median |
| --- | --- |
| Sally | 7.0 |
| Bob | 6.5 |
| Alice | 7.0 |
| Hank | 6.0 |

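As we can see, the median can differ noticeably from the mean. Sally’s single low grade of 1.0 pulls her mean down to about 5.83, while her median is 7.0. For Alice, who has only two grades, the mean and the median coincide at 7.0.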
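Equation (2) can also be written out directly. The sketch below uses a hypothetical helper, median_by_formula, which is our own name and not part of any library, and checks it against the median function on Sally's grades (odd length) and Alice's grades (even length) from more_grades.

```julia
using Statistics: median

# Hypothetical helper that follows Equation (2) literally
function median_by_formula(x)
    s = sort(x)                     # Equation (2) assumes an ordered vector
    n = length(s)                   # this is #x in the text
    lo = floor(Int, (n + 1) / 2)    # lower middle position
    hi = ceil(Int, (n + 1) / 2)     # upper middle position
    return (s[lo] + s[hi]) / 2      # both positions coincide when n is odd
end

sally = [1.0, 9.5, 7.0]   # odd length: the single middle value is used
alice = [8.5, 5.5]        # even length: the two middle values are averaged

println(median_by_formula(sally), " ", median(sally))
println(median_by_formula(alice), " ", median(alice))
```

For an odd number of observations the floor and ceiling pick the same position, so the formula returns the middle value itself; for an even number it averages the two values around the middle.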

### 7.1.3 Mode

The mean and median can be useful for numerical and ordinal data. However, they are ineffective for nominal data, that is, qualitative data (also known as categorical data). This is where we use the mode, defined as the most frequent value in our data.

There is no mode function in the Statistics module of Julia’s standard library. Instead, we need the StatsBase.jl package, which provides less common statistical functions:

    using StatsBase: mode

In Section 4.5, we have the correct_types DataFrame, which is mainly categorical with Dates and Strings:

    correct_types()

| id | date | age |
| --- | --- | --- |
| 3 | 2018-08-01 | infant |

We can compute the mode with the combine function from DataFrames.jl:

    combine(correct_types(), :age => mode)

| age_mode |
| --- |
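To make the definition of the mode concrete, here is a minimal sketch that counts frequencies by hand using only base Julia. The helper most_frequent and the ages vector are made up for illustration; they are not the StatsBase.jl implementation and not the contents of correct_types.

```julia
# Hypothetical helper: return the most frequent value by counting occurrences
function most_frequent(values)
    counts = Dict{eltype(values), Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1   # tally each value
    end
    best = first(values)
    for v in values                         # input order, so ties go to the earliest value
        if counts[v] > counts[best]
            best = v
        end
    end
    return best
end

# Made-up categorical labels, for illustration only
ages = ["infant", "adult", "adult", "adolescent"]

println(most_frequent(ages))
```

Because the helper only compares values for equality, it works on strings and other nominal data where a mean or median would not be defined.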

### 7.1.4 Visualization of Central Tendencies

In order to have a better intuition behind the difference between mean, median, and mode, visualizations are useful. We will cover statistical visualizations in depth in Section 7.5. Below, in Figure 51, we have two data distributions:

• Upper row: normally distributed data
• Lower row: non-normally distributed data

Figure 51: Normal and Non-Normal Distributed Data – Differences Between Central Tendencies.

You can see that the mean, median, and mode are almost the same in the normally distributed data, but they differ a lot in the non-normally distributed data. In the latter, the mean is pulled strongly to the right, biasing our measure of central tendency. This is an example of an outlier scenario, where the mean is highly sensitive to influential observations. The median, in contrast, is barely affected by the outliers and can be used as a reliable measure of central tendency. This demonstrates that the median is robust to outliers. In both cases, the mode is shown only for comparison, since it is not advised for use with numerical data.
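The same effect can be reproduced with a handful of numbers. The sketch below uses made-up incomes (they are not the data behind the figure) and compares the mean and median before and after adding a single extreme value.

```julia
using Statistics: mean, median

# Made-up incomes, for illustration only
incomes = [40_000.0, 50_000.0, 60_000.0, 70_000.0, 80_000.0]

# The same data with one extreme observation appended
with_outlier = vcat(incomes, 1_000_000.0)

println("without outlier: mean = ", mean(incomes), ", median = ", median(incomes))
println("with outlier:    mean = ", mean(with_outlier), ", median = ", median(with_outlier))
```

In this toy example the single large income more than triples the mean, while the median moves only from 60,000 to 65,000.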

### 7.1.5 Advice on Central Tendencies

You might be wondering: “Which measure of central tendency should I use? Mean? Median? Mode?” Here is our advice:

• For data that do not have outliers, use the mean
• For data that do have outliers, use the median
• For categorical/nominal data, use the mode

[^16]: More in Section 7.4.1.

DRAFT - CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso