Introduction to Machine Learning - Fundamental Concepts in Statistics (3) (in progress)

Discrete probability distributions:

Binomial distribution
Poisson distribution

Binomial Distribution:

Nature of Experiment: It models the number of successes in a fixed number of independent trials, with each trial having two possible outcomes (success or failure).
Parameters: The two parameters are n (the number of trials) and p (the probability of success in each trial).
Assumptions:

There is a fixed number of trials (n).
Each trial is independent of others.
The probability of success (p) is the same for each trial.
The outcome of each trial is binary (success/failure).

Applications: Used in scenarios like quality control (defect vs. no defect), survey responses (agree vs. disagree), and any situation with a clear dichotomy in outcomes.

The Binomial Distribution in a novel way:

Imagine you're in a mystical garden, filled with a peculiar kind of flower called the "Fatebloom." Each Fatebloom has exactly two types of petals: silver and gold. A legend says that if you pluck a petal at random, your day will be filled with luck if it's gold, and ordinary if it's silver. Curious, you decide to conduct an experiment with these magical flowers to see how much luck you might gather in a day.

You decide on a simple ritual: every morning, you will visit the garden, choose a Fatebloom at random, and pluck exactly ten petals, one after the other, recording whether each is silver or gold. You wonder, "What are the chances I'll pluck exactly six gold petals out of ten?" This question, as it turns out, can be answered by the Binomial Distribution.

Here's how the Binomial Distribution comes into play in this enchanted scenario:

Each petal plucked represents a trial. In the context of the Binomial Distribution, a trial is an event with two possible outcomes (in this case, gold for luck, silver for an ordinary day).

The probability of plucking a gold petal (success) is consistent every time you pluck a petal. Let's say, based on the legend, that the chance of getting a gold petal is 50% (0.5) for simplicity.

The number of trials (petal plucks) is fixed in advance: You decide to pluck 10 petals each morning, no more, no less.

Each petal pluck is independent of the others. The outcome of plucking one petal doesn't affect the outcomes of plucking the others.

With these magical elements, the Binomial Distribution tells you the likelihood of various outcomes, such as the probability of plucking exactly six gold petals out of ten, or any number of gold petals, for that matter. It encapsulates the essence of probability in a scenario of multiple, independent trials each with two possible outcomes.

from scipy.stats import binom
 
# Given values for the scenario
 
n = 10 # number of petals plucked
p = 0.5 # probability of plucking a gold petal (success)
k = 6 # number of gold petals we're interested in
 
# Calculate the probability of plucking exactly 6 gold petals out of 10
 
probability = binom.pmf(k, n, p)
 
print(f"The probability of plucking exactly 6 gold petals out of 10 is: {probability:.4f}")

Poisson Distribution

Nature of Experiment: It models the number of events occurring in a fixed interval of time or space, with the events occurring independently of the time since the last event.
Parameters: The key parameter is λ (lambda), which is the average rate at which events occur in a given time interval or space region.
Assumptions:

Events occur independently.
The average rate (λ) is constant throughout the experiment.
Two events cannot occur at exactly the same instant.
The probability of more than one occurrence in an infinitesimally small time interval is negligible.

Applications: Used for modeling counts of events like emails received per day, calls arriving at a call center, decay events per second from a radioactive source, and distribution of organisms in a field.

Continuous probability distributions:

Normal (GAussian) distribution
Exponential distribution

The Normal (Gaussian) Distribution

The normal or Gaussian distribution is a continuous probability distribution that is symmetric around its mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, this distribution will appear as a bell curve.

Key characteristics of the normal distribution include:

Symmetry: The distribution is perfectly symmetric around its mean, μ (mu), meaning the bulk of the observations cluster around the central peak and probabilities for values further away from the mean taper off equally in both directions.
Mean, Median, and Mode: In a normal distribution, the mean, median, and mode are all located at the center of the distribution.
Standard Deviation, σ (sigma): This measures the dispersion or variability around the mean; about 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
Inflection Points: These occur at one standard deviation from the mean on either side and mark the "change of curvature" on the bell curve.

Likelihood vs Probability

Probability measures the chance of an event occurring in the future. It is used to quantify the likelihood of specific outcomes in relation to the total number of possible outcomes.
Likelihood measures how probable a particular set of observations is given a model parameter. It is a function of the parameters of a model; not the outcomes.

Probability pertains to predicting observations before they occur, based on a known model. Likelihood pertains to interpreting or estimating the parameters of the model after observations are made.

Probability of a single point

One particularly confusing aspect of continuous distributions, such as the normal distribution, is the concept of the probability of a single point. In continuous distributions, the probability that a random variable equals any specific exact value is zero. This concept can be counterintuitive because it seems to contradict our usual understanding of probability, where we often calculate the probability of specific outcomes.

tl dr; In continuous distributions, we should think in terms of probabilities over intervals (areas), not exact values.

Models

Models approximate reality to let us explore relationships and make predictions.
In machine learning, we build models by training machine learning algorithms with Training Data.
The model learns patterns from the training data to make predictions on new, unseen data (Test Data) and we use statistics to evaluate how well the model performs.

To evalueate the quality of these models we can use the following strategies:

Sum of the Squared Residuals (SSR)
Mean Squared Error (MSE): It's the average of the squared residuals (SSR divided by the number of data points).
R Squared (R²)
p-value

Explaining R Squared (R²) with an analogy

Imagine you are an archery coach, and your goal is to train an archer (the model) to hit a target (predict outcomes). Every arrow the archer shoots represents a prediction made by your model.

Bulls-eye (Center of the target): Represents the actual values from your data.
Where the arrows land: Represents the predicted values from your model.

R-squared measures the closeness of the arrows to the bulls-eye:

R-squared = 1 (100%): All arrows hit the bulls-eye perfectly. Your model's predictions are exactly the same as the actual outcomes. This is an ideal scenario where your model explains all the variability in the data perfectly.
R-squared = 0%: The arrows land all around the target but never close to the bulls-eye. This means your model does no better at predicting than simply taking the average of the actual values. The variables you used in the model provide no information about the outcomes according to the data you have.
0 < R-squared < 1: Arrows are closer to the bulls-eye but not always hitting it. The higher the R-squared, the closer the arrows land to the bulls-eye on average, indicating your model explains a larger proportion of the variability in the data.

Thus, R-squared gives you a score that helps you understand how well your predictive model performs in terms of explaining the data it's supposed to predict. This helps you gauge the effectiveness of your model and whether there might be other variables or models that could do a better job.

R Squared (R²) exercise

Link: R Squared (R²) exercise

Write a Python script to calculate R-squared for a simple linear regression model where you predict students exam scores based on their study hours.

Dataset:

Study Hours (X): [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Exam Scores (Y): [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

Tasks:

Calculate the mean of X and Y.
Calculate the Total Sum of Squares (TSS).
Use the formula for the slope (m) and intercept (b) of the regression line.