7 Statistics

Statistics is an essential part of data science. In this chapter, we aim to present the concepts in the way we would have liked to learn them. Our approach is to simplify the concepts as much as possible.

Statistics is important because it is a tool to make sense of data. With the abundant availability of data, we are often overwhelmed by numbers. Statistics offers a way to comprehend, summarize, and infer information from data. We believe that every data scientist needs to have a basic understanding of statistics and how to perform simple statistical operations.

We can divide statistics into two broad categories: descriptive and inferential. Descriptive statistics summarizes and quantifies the characteristics of given observed data. Common metrics are: mean, median, mode, standard deviation, variance, correlation, percentiles.
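To make these metrics concrete, here is a minimal sketch that computes each of them for a small data set. It uses Python's standard library purely for illustration. The `goals` and `shots` values are made-up numbers, and the `pearson` helper is a hypothetical name introduced only for this example.

```python
import math
import statistics

# Hypothetical data: goals scored and shots on target in ten matches.
goals = [0, 1, 1, 2, 2, 2, 3, 3, 4, 7]
shots = [2, 3, 4, 5, 4, 6, 7, 6, 9, 12]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


summary = {
    "mean": statistics.mean(goals),
    "median": statistics.median(goals),
    "mode": statistics.mode(goals),
    "variance": statistics.variance(goals),  # sample variance (divides by n - 1)
    "std": statistics.stdev(goals),          # sample standard deviation
    "quartiles": statistics.quantiles(goals, n=4),  # 25th, 50th, 75th percentiles
    "correlation": pearson(goals, shots),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that the single high-scoring match (7 goals) pulls the mean above the median, which is one reason to report more than one summary metric.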

Inferential statistics draws inferences (statements) about the data generating process from observed data. Figure 50 summarizes the relationship between the data generating process and observed data. Every phenomenon has an underlying data generating process that describes how its data comes about. For example, in a soccer game, a scored goal can be explained by an underlying process: a tactic, an error, a stroke of luck, or a mix of those. If we know a phenomenon’s data generating process, we can use probability to simulate the data it could plausibly produce. Most of the time, especially in applied sciences, we do not have full knowledge of the data generating process. Instead, we start from the observed data and attempt to recover aspects of the process that generated it. This is known as statistical inference, and it is the realm of inferential statistics.

With knowledge of the data generating process, we can apply probability to generate and simulate plausible data. Conversely, starting from the observed data, we can use inference to gain knowledge about the underlying data generating process.

Figure 50: Statistics vs Probability
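The two directions described above can be sketched with a toy example. It assumes a deliberately simple data generating process in which every shot becomes a goal with a fixed probability. The function names `simulate_shots` and `estimate_p`, the value `true_p = 0.3`, and the seed are all hypothetical choices made for illustration. Python's standard library is used here as an assumption about tooling.

```python
import random


def simulate_shots(p_goal, n_shots, rng):
    """Probability direction: generate data from a known process.

    Each shot is an independent trial that scores with probability p_goal.
    """
    return [rng.random() < p_goal for _ in range(n_shots)]


def estimate_p(observed):
    """Inference direction: estimate the scoring probability from data.

    The estimate is the observed proportion of shots that scored.
    """
    return sum(observed) / len(observed)


rng = random.Random(42)  # fixed seed so the simulation is reproducible
true_p = 0.3             # assumed "true" parameter of the toy process

observed = simulate_shots(true_p, 10_000, rng)
p_hat = estimate_p(observed)

print(f"true p = {true_p}, estimated p = {p_hat:.3f}")
```

In practice we only ever see `observed`, never `true_p`. The simulation is useful precisely because it lets us check how close the estimate gets when the answer is known.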

In this chapter, we will cover only descriptive statistics. Inferential statistics is an important and fundamental component of applied sciences, but its scope is too broad to cover here. So, let’s begin with some simple ways to summarize our data by using measures of central tendency.

DRAFT - CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso