Probability: Introduction

Reading time: ~10 min

When we do data science, we begin with a data set and work to gain insights about the process that generated the data. Crucial to this endeavor is a robust vocabulary for discussing the behavior of data-generating processes.

It is helpful to begin with data-generating processes whose randomness properties are specified completely and precisely. The study of such processes is called probability. For example, "What's the probability that I get at least 7 heads in 10 independent flips of a fair coin?" is a probability question, because the setup is fully specified: the coin has exactly 50% probability of heads on each flip, and the different flips do not affect one another.
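Because the setup is fully specified, the coin question can be answered by direct calculation. The following is a minimal sketch, assuming the standard binomial formula for independent trials; the function name `prob_at_least` is introduced here for illustration and is not part of the original text.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k successes in n independent trials,
    each of which succeeds with probability p."""
    # Add up the binomial probabilities of exactly j successes for j = k, ..., n.
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# At least 7 heads in 10 independent flips of a fair coin: 176/1024
print(prob_at_least(7, 10))  # 0.171875
```

The answer is 176/1024, or about 17%. Nothing in this calculation checks whether the coin is actually fair; it takes the stated setup as given.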

The question of whether the coin is really fair or whether the flips are really independent will be deferred to our study of statistics. In statistics, we will have the outcome of a random experiment in hand and will be looking to draw inferences about the unknown setup. Once we are able to answer questions in the "setup → outcome" direction, we will be well positioned to approach the "outcome → setup" direction.

Exercise

Each of the questions below is either a probability question or a statistics question. Select the ones that are probability questions.

On days when the weather forecast says that the chance of rain is 10%, it actually rains only about 5% of the time. What is the probability of rain on a day when the weather forecast says "10% chance of rain"?
If it will rain today with probability 40%, what is the probability that it will not rain today?
If you roll two fair dice, what is the average total number of pips showing on the top faces?
Your friend rolled 12 on each of the first three rolls of the board game they're playing with you. What is the probability that the dice they're using are weighted in favor of the 6's?

Solution. The first question is a statistics question. We don't know the probability of rain, and we are trying to draw an inference about it based on observed samples.

The second question is a probability question. We are given the setup and asked a question which assumes its validity.

The third question is also a probability question. We're told the dice are fair, and we're asked a question about the outcome of the rolls.

The fourth question is a statistics question, since the outcome of the rolls is known and the probabilities are in question.
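The two probability questions in this exercise (the chance of no rain, and the average total of two fair dice) can be worked out directly from their stated setups. The sketch below is one way to do so, assuming the complement rule for the rain question and 36 equally likely outcomes for the pair of dice; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Rain question: the complement rule gives P(no rain) = 1 - P(rain).
p_rain = Fraction(40, 100)
p_no_rain = 1 - p_rain
print(p_no_rain)  # 3/5, i.e. 60%

# Dice question: enumerate all 36 equally likely (first die, second die) pairs
# and average the totals.
faces = range(1, 7)
outcomes = list(product(faces, repeat=2))
average_total = Fraction(sum(a + b for a, b in outcomes), len(outcomes))
print(average_total)  # 7
```

The two statistics questions cannot be settled this way, because the probabilities they ask about are not part of the given setup.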
