
Discrete probability distributions:

Binomial Distribution:

  1. Nature of Experiment: It models the number of successes in a fixed number of independent trials, with each trial having two possible outcomes (success or failure).

  2. Parameters: The two parameters are n (the number of trials) and p (the probability of success in each trial).

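  3. Assumptions: The number of trials is fixed in advance, each trial has exactly two possible outcomes, the probability of success is the same on every trial, and the trials are independent of one another.

  4. Applications: Used in scenarios like quality control (defect vs. no defect), survey responses (agree vs. disagree), and any situation with a clear dichotomy in outcomes.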

The Binomial Distribution in a novel way:

Imagine you're in a mystical garden, filled with a peculiar kind of flower called the "Fatebloom." Each Fatebloom has exactly two types of petals: silver and gold. A legend says that if you pluck a petal at random, your day will be filled with luck if it's gold, and ordinary if it's silver. Curious, you decide to conduct an experiment with these magical flowers to see how much luck you might gather in a day.

You decide on a simple ritual: every morning, you will visit the garden, choose a Fatebloom at random, and pluck exactly ten petals, one after the other, recording whether each is silver or gold. You wonder, "What are the chances I'll pluck exactly six gold petals out of ten?" This question, as it turns out, can be answered by the Binomial Distribution.

Here's how the Binomial Distribution comes into play in this enchanted scenario:

Each petal plucked represents a trial. In the context of the Binomial Distribution, a trial is an event with two possible outcomes (in this case, gold for luck, silver for an ordinary day).

The probability of plucking a gold petal (success) is consistent every time you pluck a petal. Let's say, based on the legend, that the chance of getting a gold petal is 50% (0.5) for simplicity.

The number of trials (petal plucks) is fixed in advance: You decide to pluck 10 petals each morning, no more, no less.

Each petal pluck is independent of the others. The outcome of plucking one petal doesn't affect the outcomes of plucking the others.

With these magical elements, the Binomial Distribution tells you the likelihood of various outcomes, such as the probability of plucking exactly six gold petals out of ten, or any number of gold petals, for that matter. It encapsulates the essence of probability in a scenario of multiple, independent trials each with two possible outcomes.

from scipy.stats import binom

# Given values for the scenario
n = 10   # number of petals plucked
p = 0.5  # probability of plucking a gold petal (success)
k = 6    # number of gold petals we're interested in

# Calculate the probability of plucking exactly k gold petals out of n
probability = binom.pmf(k, n, p)

print(f"The probability of plucking exactly {k} gold petals out of {n} is: {probability:.4f}")
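The same number can be reproduced by hand from the binomial probability mass function, P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). The sketch below uses only the standard library; the helper name `binomial_pmf` is an illustrative choice, not part of SciPy.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    # comb(n, k) counts the ways to choose which k of the n trials succeed.
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Six gold petals out of ten, each gold with probability 0.5:
# C(10, 6) = 210 and 0.5**10 = 1/1024, so the result is 210/1024.
manual_probability = binomial_pmf(6, 10, 0.5)
print(f"{manual_probability:.4f}")  # 0.2051
```

Summing `binomial_pmf(k, 10, 0.5)` over k = 0 through 10 gives 1, which is a quick sanity check that the formula describes a complete distribution.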

Poisson Distribution

  1. Nature of Experiment: It models the number of events occurring in a fixed interval of time or space, with the events occurring independently of the time since the last event.

  2. Parameters: The key parameter is λ (lambda), which is the average rate at which events occur in a given time interval or space region.

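  3. Assumptions: Events occur independently of one another, the average rate λ stays constant over the interval, and two events cannot occur at exactly the same instant.

  4. Applications: Used for modeling counts of events like emails received per day, calls arriving at a call center, decay events per second from a radioactive source, and distribution of organisms in a field.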
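As a rough illustration, the Poisson probability mass function P(X = k) = λ^k · e^(−λ) / k! can be evaluated with the standard library alone. The rate of 4 emails per hour is a made-up value chosen for the example, and `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average count per interval is lam."""
    return lam**k * exp(-lam) / factorial(k)


lam = 4.0  # assumed average: 4 emails per hour (illustrative value only)

p_exactly_2 = poisson_pmf(2, lam)
# "At most 2" adds up the probabilities of 0, 1, and 2 events.
p_at_most_2 = sum(poisson_pmf(k, lam) for k in range(3))

print(f"P(X = 2)  = {p_exactly_2:.4f}")
print(f"P(X <= 2) = {p_at_most_2:.4f}")
```

Unlike the binomial case, there is no fixed number of trials here: k can be any non-negative integer, and only the average rate λ is needed.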

Continuous probability distributions:

The Normal (Gaussian) Distribution

The normal or Gaussian distribution is a continuous probability distribution that is symmetric around its mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, this distribution will appear as a bell curve.

Key characteristics of the normal distribution include:

  1. Symmetry: The distribution is perfectly symmetric around its mean, μ (mu), meaning the bulk of the observations cluster around the central peak and probabilities for values further away from the mean taper off equally in both directions.

  2. Mean, Median, and Mode: In a normal distribution, the mean, median, and mode are all located at the center of the distribution.

  3. Standard Deviation, σ (sigma): This measures the dispersion or variability around the mean; about 68% of the data falls within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.

  4. Inflection Points: These occur at one standard deviation from the mean on either side and mark the "change of curvature" on the bell curve.
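The 68/95/99.7 figures can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper `within_k_sigma` is an illustrative name.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1. Because of symmetry,
# the same shares hold for any other mean and standard deviation.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def within_k_sigma(k: float) -> float:
    """Share of the distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


coverage = {k: within_k_sigma(k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} sigma: {share:.4f}")
```

The printed shares come out close to 0.68, 0.95, and 0.997, which is where the rounded rule of thumb comes from.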

Likelihood vs Probability

Probability pertains to predicting observations before they occur, based on a known model. Likelihood pertains to interpreting or estimating the parameters of the model after observations are made.
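The distinction can be made concrete with the petal example. In the sketch below, the same binomial formula is read in two directions: first with p fixed and the outcome unknown (probability), then with the outcome fixed at 6 gold petals out of 10 and p unknown (likelihood). The grid of candidate values for p is an assumption made for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 10

# Probability: the model is known (p = 0.5) and the outcome is not yet observed.
prob_6_gold = binomial_pmf(6, n, 0.5)

# Likelihood: the outcome is observed (6 gold out of 10) and p is what we ask about.
observed_gold = 6
candidate_ps = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
likelihoods = [binomial_pmf(observed_gold, n, p) for p in candidate_ps]
best_p = candidate_ps[likelihoods.index(max(likelihoods))]

print(f"P(6 gold | p = 0.5) = {prob_6_gold:.4f}")
print(f"p with the highest likelihood given 6 gold of 10: {best_p:.2f}")
```

The value of p that makes the observed data most likely is the observed proportion, 6/10, which is the idea behind maximum likelihood estimation.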

Probability of a single point

One particularly confusing aspect of continuous distributions, such as the normal distribution, is the concept of the probability of a single point. In continuous distributions, the probability that a random variable equals any specific exact value is zero. This concept can be counterintuitive because it seems to contradict our usual understanding of probability, where we often calculate the probability of specific outcomes.

TL;DR: In continuous distributions, we should think in terms of probabilities over intervals (areas under the density curve), not exact values.
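One way to see this, sketched with the standard library's `statistics.NormalDist`: as an interval around a point shrinks, the probability it captures shrinks toward zero, even though the density at that point stays positive. The interval widths below are arbitrary illustrative choices.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def interval_probability(center: float, half_width: float) -> float:
    """Probability that the variable falls in [center - half_width, center + half_width]."""
    return standard_normal.cdf(center + half_width) - standard_normal.cdf(center - half_width)


half_widths = [1.0, 0.1, 0.001, 0.0]
interval_probs = [interval_probability(0.0, w) for w in half_widths]

for w, prob in zip(half_widths, interval_probs):
    print(f"P(-{w} <= X <= {w}) = {prob:.6f}")

# The density at 0 is a height of the curve, not a probability.
density_at_zero = standard_normal.pdf(0.0)
print(f"density at 0 = {density_at_zero:.4f}")
```

The last interval has zero width, so its probability is exactly zero, while the density at the same point is roughly 0.4.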

Models

To evaluate the quality of these models, we can use the following strategies:

Explaining R Squared (R²) with an analogy

Imagine you are an archery coach, and your goal is to train an archer (the model) to hit a target (predict outcomes). Every arrow the archer shoots represents a prediction made by your model.

R-squared measures the closeness of the arrows to the bulls-eye:

Thus, R-squared gives you a score that helps you understand how well your predictive model performs in terms of explaining the data it's supposed to predict. This helps you gauge the effectiveness of your model and whether there might be other variables or models that could do a better job.
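In numeric terms, a common definition is R² = 1 − RSS / TSS, where RSS is the sum of squared differences between actual and predicted values, and TSS is the sum of squared differences between actual values and their mean. The sketch below assumes that definition; the function name and the sample numbers are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """R-squared as 1 - RSS / TSS for paired actual and predicted values."""
    mean_y = sum(y_true) / len(y_true)
    # TSS: how far the actual values are from their own mean.
    tss = sum((y - mean_y) ** 2 for y in y_true)
    # RSS: how far the predictions (arrows) land from the actual values.
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    return 1 - rss / tss


actual = [2.0, 4.0, 6.0, 8.0]        # made-up target values
close_shots = [2.1, 3.9, 6.2, 7.8]   # predictions that land near the target
mean_only = [5.0, 5.0, 5.0, 5.0]     # always predicting the mean of `actual`

r2_close = r_squared(actual, close_shots)
r2_mean = r_squared(actual, mean_only)

print(f"close predictions: R^2 = {r2_close:.3f}")
print(f"mean-only baseline: R^2 = {r2_mean:.3f}")
```

Predictions that sit close to the actual values give an R² near 1, while a model that always predicts the mean scores 0, which is the baseline the archer is being compared against.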

R Squared (R²) exercise


Write a Python script to calculate R-squared for a simple linear regression model that predicts students' exam scores from their study hours.

Dataset:

Tasks:

  1. Calculate the mean of X and Y.
  2. Calculate the Total Sum of Squares (TSS).
  3. Use the formula for the slope (m) and intercept (b) of the regression line.