# 7 Statistics

Statistics is an essential part of data science. In this chapter, we aim to present the concepts in the way we would have liked to learn them, keeping each one as simple as possible. We will cover:

• what statistics is
• the difference between descriptive statistics and inferential statistics
• measures of central tendency
• measures of dispersion
• measures of dependence
• probability distributions
• statistical visualization

Statistics is important because it is a tool to make sense of data. With the abundant availability of data, we are often overwhelmed by numbers. Statistics offers a way to comprehend, summarize, and infer information from data. We believe that every data scientist needs to have a basic understanding of statistics and how to perform simple statistical operations.

We can divide statistics into two broad categories: descriptive and inferential. Descriptive statistics summarizes and quantifies the characteristics of the observed data. Common metrics include the mean, median, mode, standard deviation, variance, correlation, and percentiles.
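To make these metrics concrete, here is a minimal sketch using Python's standard `statistics` module. The `goals` list is made-up illustrative data, not taken from any real match, and the sketch covers only some of the metrics listed above.

```python
import statistics

# Hypothetical data: goals scored by one team in ten matches (made-up values).
goals = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]

mean_goals = statistics.mean(goals)          # arithmetic average
median_goals = statistics.median(goals)      # middle value of the sorted data
mode_goals = statistics.mode(goals)          # most frequent value
variance_goals = statistics.variance(goals)  # sample variance (divides by n - 1)
stdev_goals = statistics.stdev(goals)        # sample standard deviation
quartiles = statistics.quantiles(goals, n=4) # three cut points: 25th, 50th, 75th percentiles

print("mean:", mean_goals)
print("median:", median_goals)
print("mode:", mode_goals)
print("variance:", round(variance_goals, 3))
print("standard deviation:", round(stdev_goals, 3))
print("quartiles:", quartiles)
```

Note that `statistics.variance` and `statistics.stdev` compute the sample versions; the population versions are `statistics.pvariance` and `statistics.pstdev`.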

Inferential statistics lets us draw inferences (statements) from observed data about the process that generated it. In Figure 50, we summarize the relationship between the data generating process and observed data. Every phenomenon has an underlying data generating process that describes how its data comes about. For example, in a soccer game, a scored goal can be explained by an underlying process: a tactic, an error, a stroke of luck, or a mix of those. If we know a phenomenon's data generating process, we can use probability to simulate the scenarios it could produce. Most of the time, especially in the applied sciences, we do not have full knowledge of the data generating process. Instead, we start from the observed data and attempt to recover aspects of the process behind it. This is known as statistical inference, and it is the realm of inferential statistics.

With knowledge of the data generating process, we can apply probability to generate and simulate plausible data. Conversely, starting from the observed data, we can use inference to gain knowledge about the underlying data generating process.
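The two directions can be illustrated with a small sketch. It assumes a deliberately simplified data generating process that is not part of the soccer example above: each shot becomes a goal independently with a fixed probability `TRUE_P`. The names `TRUE_P`, `N_SHOTS`, and `simulate_shots` are hypothetical and chosen only for illustration.

```python
import random

# Assumed (hypothetical) data generating process:
# every shot scores independently with probability TRUE_P.
TRUE_P = 0.3
N_SHOTS = 10_000

rng = random.Random(42)  # fixed seed so the simulation is repeatable


def simulate_shots(p, n, rng):
    """Probability direction: generate n outcomes (1 = goal, 0 = miss) from a known p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


# Known process -> simulated data.
observed = simulate_shots(TRUE_P, N_SHOTS, rng)

# Inference direction: pretend TRUE_P is unknown and estimate it
# from the observed data using the sample proportion of goals.
p_hat = sum(observed) / len(observed)

print("true scoring probability:", TRUE_P)
print("estimated scoring probability:", round(p_hat, 3))
```

With many observations, the estimate should land close to the true value, although it will rarely match it exactly. In practice we only see the observed data, so the estimate is all we have.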