Continuous Frequency Distributions – Understanding Continuous Frequency Distributions and the Probability Density Function (PDF)

Let’s take a deep dive into the world of statistics and unravel the mysteries of continuous frequency distributions and the probability density function (PDF). By the end of this post, you’ll have a clearer understanding of these foundational concepts and their significance in statistics.

In this blog post we will learn:

1. Continuous Frequency Distributions
2. Constructing the Frequency Distribution
3. Probability Density Function (PDF)
3.1. Estimating Density
4. Probability Density Function (PDF) Estimation
5. Visualizing Probability Distributions using Python
6. Types of Continuous Frequency Distributions
6.1. Normal (Gaussian) Distribution
6.2. Uniform Distribution
6.3. Exponential Distribution
6.4. Gamma Distribution
6.5. Beta Distribution
6.6. Chi-Squared Distribution
6.7. Weibull Distribution
6.8. Log-Normal Distribution
7. Conclusion

1. Continuous Frequency Distributions:

The term “distribution” refers to the way in which a set of data points is spread or scattered across a range of values. When we talk about continuous frequency distributions, we’re looking at data that can take on any value within a specified range. This is in contrast to discrete data, which can only take on specific, separate values (like the number of pets someone has, which can only be a whole number).

Imagine measuring the height of every person in a city. Since height can vary by very small amounts, you have a continuous data set. In a continuous frequency distribution, instead of listing each individual height, you’d group the heights into intervals (like 5’0″ to 5’1″, 5’1″ to 5’2″, and so on).

Continuous variables, such as height, weight, or expenditure, can take any value within a given range. When we collect data on such variables, one of the ways to summarize and visualize this data is through a continuous frequency distribution.

A continuous frequency distribution groups data into intervals and tells us how many data points fall into each interval.

Let’s consider sample data on monthly expenditure on groceries of 50 families in a neighborhood (in USD):

# Sample data
expenditure = [105, 120, 130, 125, 140, 110, 135, 145, 125, 115, 140, 150, 
               120, 155, 135, 130, 125, 155, 145, 135, 110, 145, 160, 165, 
               130, 125, 150, 155, 165, 160, 135, 140, 120, 130, 145, 150, 
               165, 155, 160, 135, 120, 110, 155, 160, 165, 130, 145, 140, 135, 125]  # 50 values, one per family

2. Constructing the Frequency Distribution

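Determine the Range:
- Lowest value = 105
- Highest value = 165
- Range = 165 - 105 = 60
Choose Intervals:
- We chose 6 intervals with a width of 10 (60 / 10 = 6).
- Each interval includes its lower bound and excludes its upper bound, except the last interval, which also includes 165.
Count Observations:

Interval	Frequency
105-115	4
115-125	5
125-135	10
135-145	10
145-155	8
155-165	13
Total	50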
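
The counting step can also be done programmatically. The sketch below is one possible way to do it in plain Python, assuming the convention that each interval includes its lower bound and excludes its upper bound, with the maximum value (165) folded into the last interval. The names `low`, `width`, `n_bins`, and `counts` are illustrative. Whatever convention is chosen, the frequencies must add up to the number of observations.

```python
# Sample data from above: monthly grocery expenditure (USD) of 50 families
expenditure = [105, 120, 130, 125, 140, 110, 135, 145, 125, 115, 140, 150,
               120, 155, 135, 130, 125, 155, 145, 135, 110, 145, 160, 165,
               130, 125, 150, 155, 165, 160, 135, 140, 120, 130, 145, 150,
               165, 155, 160, 135, 120, 110, 155, 160, 165, 130, 145, 140, 135, 125]

low, width, n_bins = 105, 10, 6

counts = [0] * n_bins
for value in expenditure:
    # Integer division finds the interval whose lower bound is <= value
    index = (value - low) // width
    # The maximum sits exactly on the top edge, so fold it into the last interval
    index = min(index, n_bins - 1)
    counts[index] += 1

for i, count in enumerate(counts):
    lower = low + i * width
    print(f"{lower}-{lower + width}\t{count}")

# Sanity check: every observation is counted exactly once
print(f"Total\t{sum(counts)}")
```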

3. Probability Density Function (PDF)

The Probability Density Function (PDF) describes how probability is spread over the values of a continuous random variable. Its value at a point is a density, not a probability: for a continuous random variable $X$, the probability that $X$ takes on any specific value $x$ is 0. Instead, we look at the probability that $X$ falls within a specific interval, which equals the area under the PDF over that interval.
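
To make the interval idea concrete, here is a small sketch that uses a standard normal distribution purely as an illustrative choice. It computes $P(a \le X \le b)$ in two ways: as the area under the PDF between $a$ and $b$ (numerical integration with `scipy.integrate.quad`) and as a difference of CDF values. The endpoints `a` and `b` are arbitrary.

```python
from scipy import stats
from scipy.integrate import quad

X = stats.norm(loc=0, scale=1)  # illustrative choice of distribution
a, b = -1.0, 1.0                # arbitrary interval endpoints

# Area under the PDF between a and b
area, _ = quad(X.pdf, a, b)

# The same probability from the cumulative distribution function
prob = X.cdf(b) - X.cdf(a)

print(f"area under the PDF on [{a}, {b}]: {area:.4f}")
print(f"CDF(b) - CDF(a):                 {prob:.4f}")

# A single point is an interval of zero width, so its probability is 0
print(f"P(X = 0) = {X.cdf(0.0) - X.cdf(0.0)}")
```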

3.1. Estimating Density

The density of a histogram bar is a measure that allows the area of the bar to represent the proportion of data points in that interval. The formula for density is:

$ \text{Density} = \frac{f}{N \times w} $

Where:

$ f $ is the frequency of the interval
$ N $ is the total number of observations
$ w $ is the width of the interval

4. Probability Density Function (PDF) Estimation:

Total number of observations ($N$) = 50

Interval width ($w$) = 10

Using the density formula, we calculate the densities for our intervals:

For 105-115 : $ \text{Density} = \frac{4}{50 \times 10} $ = 0.008
For 115-125 : $ \text{Density} = \frac{5}{50 \times 10} $ = 0.010
For 125-135 : $ \text{Density} = \frac{10}{50 \times 10} $ = 0.020
For 135-145 : $ \text{Density} = \frac{10}{50 \times 10} $ = 0.020
For 145-155 : $ \text{Density} = \frac{8}{50 \times 10} $ = 0.016
For 155-165 : $ \text{Density} = \frac{13}{50 \times 10} $ = 0.026

Interval	Frequency	Density
105-115	4	0.008
115-125	5	0.010
125-135	10	0.020
135-145	10	0.020
145-155	8	0.016
155-165	13	0.026

The densities add up to 0.100, so multiplied by the interval width of 10 the total area of the bars is 1, as required of a density.
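
A quick way to check a density calculation is to confirm that the bar areas (density × width) sum to 1. The sketch below applies $ \frac{f}{N \times w} $ to counts obtained from the sample data with `np.histogram`, and compares the result with `np.histogram(..., density=True)`, which performs the same normalization. It assumes NumPy's bin convention, in which every bin excludes its upper edge except the last.

```python
import numpy as np

expenditure = [105, 120, 130, 125, 140, 110, 135, 145, 125, 115, 140, 150,
               120, 155, 135, 130, 125, 155, 145, 135, 110, 145, 160, 165,
               130, 125, 150, 155, 165, 160, 135, 140, 120, 130, 145, 150,
               165, 155, 160, 135, 120, 110, 155, 160, 165, 130, 145, 140, 135, 125]

bin_edges = np.arange(105, 166, 10)  # 105, 115, ..., 165: six intervals
widths = np.diff(bin_edges)          # every width is 10

counts, _ = np.histogram(expenditure, bins=bin_edges)

# Density = f / (N * w), applied to each interval
manual_density = counts / (len(expenditure) * widths)

# NumPy applies the same normalization when density=True
numpy_density, _ = np.histogram(expenditure, bins=bin_edges, density=True)

total_area = np.sum(manual_density * widths)
print("densities: ", manual_density)
print("total area:", total_area)
```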

5. Visualizing Probability Distributions using Python

Visualizing these distributions can be insightful. The examples in this post use Python’s matplotlib and numpy libraries, and the later sections also use the scipy.stats module for some distributions. If they are not already installed, you can install them using pip:

pip install matplotlib numpy scipy

To visualize the expenditure data, we’ll plot both the frequency and the density using matplotlib:

import matplotlib.pyplot as plt

# Sample data: monthly grocery expenditure (USD) of 50 families
expenditure = [105, 120, 130, 125, 140, 110, 135, 145, 125, 115, 140, 150, 
               120, 155, 135, 130, 125, 155, 145, 135, 110, 145, 160, 165, 
               130, 125, 150, 155, 165, 160, 135, 140, 120, 130, 145, 150, 
               165, 155, 160, 135, 120, 110, 155, 160, 165, 130, 145, 140, 135, 125]

# Frequency and density data
intervals = ["105-115", "115-125", "125-135", "135-145", "145-155", "155-165"]
frequencies = [4, 5, 10, 10, 8, 13]  # counts per interval; they sum to 50
# Density = frequency / (N * width), with N = 50 and width = 10
densities = [f / (len(expenditure) * 10) for f in frequencies]

fig, ax1 = plt.subplots(figsize=(6, 4))

# Plotting frequency on the primary y-axis
ax1.bar(intervals, frequencies, color='skyblue', alpha=0.6, label='Frequency')
ax1.set_xlabel('Expenditure ($)')
ax1.set_ylabel('Frequency', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_title('Frequency and Density of Monthly Expenditure on Groceries')

# Creating the secondary y-axis for the density
ax2 = ax1.twinx()
ax2.plot(intervals, densities, color='green', marker='o', label='Density', linewidth=2)
ax2.set_ylabel('Density', color='green')
ax2.tick_params(axis='y', labelcolor='green')

# Showing the plot
plt.tight_layout()
plt.show()

6. Types of Continuous Frequency Distributions:

6.1. Normal (Gaussian) Distribution

Normal (Gaussian) Distribution: It’s bell-shaped and symmetric about the mean. Most of the observed values cluster around the mean, and the probabilities for values farther away from the mean taper off equally in both directions.

Application: Heights, IQ scores, and residuals in regression analysis are often approximately normally distributed.

import matplotlib.pyplot as plt
import numpy as np

mean, std_dev = 0, 0.1
s = np.random.normal(mean, std_dev, 1000)
plt.hist(s, 30, density=True)
plt.title('Normal Distribution')
plt.show()
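
The bell curve behind this histogram has a closed form, $ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)} $. As a rough sketch, the helper below (`normal_pdf` is a name chosen here for illustration) evaluates that formula with NumPy and compares it with `scipy.stats.norm.pdf`. With $\sigma = 0.1$ the peak is close to 4, a reminder that a density is not a probability and can exceed 1.

```python
import numpy as np
from scipy import stats

def normal_pdf(x, mean, std_dev):
    """Evaluate the normal density at x (illustrative helper)."""
    coeff = 1.0 / (std_dev * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-((x - mean) ** 2) / (2.0 * std_dev ** 2))

mean, std_dev = 0, 0.1  # same parameters as the histogram above
x = np.linspace(-0.4, 0.4, 9)

by_formula = normal_pdf(x, mean, std_dev)
by_scipy = stats.norm.pdf(x, loc=mean, scale=std_dev)

print("max difference:", np.max(np.abs(by_formula - by_scipy)))
print("peak density at the mean:", normal_pdf(0.0, mean, std_dev))
```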

6.2. Uniform Distribution

Uniform Distribution: All values in a specified range are equally probable. It has no single mode, and every interval of equal length on the distribution’s support is equally probable.

Example: The angle at which a fair spinner comes to rest, or a random number drawn between 0 and 1. (The roll of a fair die is the discrete counterpart, since it can only take the values 1 to 6.)

s = np.random.uniform(-1, 1, 1000)
plt.hist(s, 30, density=True)
plt.title('Uniform Distribution')
plt.show()
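
Because the density is flat, the probability of an interval depends only on its length. A minimal sketch, assuming the same range $[-1, 1]$ as above and SciPy's parameterization of `scipy.stats.uniform`, where `loc` is the lower bound and `scale` is the width of the range:

```python
from scipy import stats

low, high = -1.0, 1.0
U = stats.uniform(loc=low, scale=high - low)

# The density is constant: 1 / (high - low)
print("pdf at 0.3:", U.pdf(0.3))

# Two intervals of equal length get equal probability
p1 = U.cdf(-0.5) - U.cdf(-1.0)
p2 = U.cdf(0.75) - U.cdf(0.25)
print("P(-1 <= X <= -0.5):  ", p1)
print("P(0.25 <= X <= 0.75):", p2)
```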

6.3. Exponential Distribution

Exponential Distribution: It describes the time between consecutive events in a homogeneous Poisson process, where events occur independently at a constant average rate. It is often used to model how long we need to wait before a given event occurs.

Application: Modeling the lifetime of an electronic component.

scale = 1.0
s = np.random.exponential(scale, 1000)
plt.hist(s, 30, density=True)
plt.title('Exponential Distribution')
plt.show()
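
One property that makes the exponential distribution a natural waiting-time model is that it is memoryless: having already waited $s$ units does not change the distribution of the remaining wait. The sketch below checks this numerically with `scipy.stats.expon`, using the survival function $ P(X > t) = e^{-t / \text{scale}} $. The values of `s` and `t` are arbitrary.

```python
import math
from scipy import stats

scale = 1.0  # mean waiting time, as in the histogram above
X = stats.expon(scale=scale)

s, t = 0.7, 1.5  # arbitrary illustrative times

# Survival function: P(X > t)
p_t = X.sf(t)

# Conditional probability P(X > s + t | X > s)
p_cond = X.sf(s + t) / X.sf(s)

print("P(X > t):            ", p_t)
print("P(X > s + t | X > s):", p_cond)
print("exp(-t / scale):     ", math.exp(-t / scale))
```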

6.4. Gamma Distribution

Gamma Distribution: A two-parameter family of continuous probability distributions, described by a shape and a scale. With an integer shape n, it models the time until the nth event in a Poisson process.

Application: Modeling the sum of independent exponentially distributed random variables that share the same rate.

import scipy.stats as stats

a = 2
s = np.linspace(0, 10, 1000)
y = stats.gamma.pdf(s, a)
plt.plot(s, y)
plt.title('Gamma Distribution')
plt.show()
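
The link to sums of exponentials can be checked by simulation. In this sketch (the seed and sample size are arbitrary choices), adding `a = 2` independent exponential waiting times with scale 1 gives samples whose mean and variance should be close to those of `stats.gamma(a)`, which are both equal to $a$ when the scale is 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
a = 2                           # shape: number of exponentials summed
n = 100_000                     # number of simulated sums

# Each row holds `a` independent exponential waiting times (scale = 1)
waits = rng.exponential(scale=1.0, size=(n, a))
totals = waits.sum(axis=1)

gamma_mean, gamma_var = stats.gamma.stats(a, moments="mv")
print("simulated mean/variance:", totals.mean(), totals.var())
print("gamma mean/variance:    ", float(gamma_mean), float(gamma_var))
```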

6.5. Beta Distribution

Beta Distribution: It’s defined on the interval [0,1] and is used to model random variables confined to that range, such as proportions and probabilities.

Application: Used in Bayesian analysis and to model random behavior in game theory.

a, b = 2.5, 2.5
s = np.linspace(0, 1, 1000)
y = stats.beta.pdf(s, a, b)
plt.plot(s, y)
plt.title('Beta Distribution')
plt.show()
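
In Bayesian analysis the beta distribution is commonly used as a prior for an unknown probability, because observing successes and failures simply adds to its two parameters. The sketch below assumes a Beta(2.5, 2.5) prior (the parameters plotted above) and made-up data of 7 successes and 3 failures. Both the data and the variable names are illustrative.

```python
from scipy import stats

a, b = 2.5, 2.5             # prior parameters
successes, failures = 7, 3  # hypothetical observations

# Conjugate update: add successes to a and failures to b
posterior = stats.beta(a + successes, b + failures)

prior_mean = stats.beta(a, b).mean()
posterior_mean = posterior.mean()

print("prior mean:    ", prior_mean)
print("posterior mean:", posterior_mean)
print("95% credible interval:", posterior.interval(0.95))
```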

6.6. Chi-Squared Distribution

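Chi-Squared Distribution: It’s related to the standard normal distribution. If you square a standard normal random variable, the resulting distribution is a chi-squared distribution with 1 degree of freedom. More generally, the sum of k independent squared standard normal variables follows a chi-squared distribution with k degrees of freedom.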

Application: Used in hypothesis testing and confidence interval estimation for variance.

df = 4
s = np.linspace(0, 20, 1000)
y = stats.chi2.pdf(s, df)
plt.plot(s, y)
plt.title('Chi-Squared Distribution')
plt.show()
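
The relationship with the standard normal can be verified without simulation: if $Z$ is standard normal, then $ P(Z^2 \le x) = P(-\sqrt{x} \le Z \le \sqrt{x}) = 2\Phi(\sqrt{x}) - 1 $, which should match the chi-squared CDF with 1 degree of freedom. A short check, with arbitrary test points:

```python
import numpy as np
from scipy import stats

x = np.array([0.5, 1.0, 2.0, 3.84])  # arbitrary test points

# P(Z^2 <= x) written in terms of the standard normal CDF
from_normal = 2.0 * stats.norm.cdf(np.sqrt(x)) - 1.0

# Chi-squared CDF with 1 degree of freedom
from_chi2 = stats.chi2.cdf(x, 1)

print("from normal:", from_normal)
print("from chi2:  ", from_chi2)
```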

6.7. Weibull Distribution

Weibull Distribution: Used to analyze life data, model reliability data, and describe the time to failure of a process or system.

Application: Analyzing the time to first failure of a process.

c = 1.75
s = np.linspace(0, 3, 1000)
y = stats.weibull_min.pdf(s, c)
plt.plot(s, y)
plt.title('Weibull Distribution')
plt.show()
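
For the standard form used by `scipy.stats.weibull_min` (scale 1), the probability of surviving past time $x$ is $ e^{-x^c} $, and setting $c = 1$ recovers the exponential distribution. The sketch below checks both statements numerically. The evaluation points are arbitrary.

```python
import numpy as np
from scipy import stats

c = 1.75  # shape parameter used in the plot above
x = np.array([0.25, 0.5, 1.0, 2.0])

# Survival function P(X > x) versus the closed form exp(-(x ** c))
survival = stats.weibull_min.sf(x, c)
closed_form = np.exp(-(x ** c))
print("max difference:", np.max(np.abs(survival - closed_form)))

# With c = 1 the Weibull density equals the exponential density
weibull_c1 = stats.weibull_min.pdf(x, 1.0)
exponential = stats.expon.pdf(x)
print("c = 1 matches exponential:", np.allclose(weibull_c1, exponential))
```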

6.8. Log-Normal Distribution

Log-Normal Distribution: If the logarithm of a variable has a normal distribution, then the original variable has a log-normal distribution.

Application: Used to model stock prices and other financial data.

s = 0.954
x = np.linspace(0, 5, 1000)
pdf = stats.lognorm.pdf(x, s)
plt.plot(x, pdf)
plt.title('Log-Normal Distribution')
plt.show()
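
The definition translates directly into a check: if $\ln X$ is normal with standard deviation $s$ and mean 0 (which is what `scipy.stats.lognorm` gives with its default `scale=1`), then $ P(X \le x) = \Phi(\ln x / s) $. A brief sketch with arbitrary evaluation points:

```python
import numpy as np
from scipy import stats

s = 0.954  # shape parameter: standard deviation of log(X)
x = np.array([0.5, 1.0, 2.0, 4.0])

# CDF of the log-normal variable X
lognormal_cdf = stats.lognorm.cdf(x, s)

# CDF of log(X), which should be normal with mean 0 and std dev s
normal_cdf_of_log = stats.norm.cdf(np.log(x), loc=0.0, scale=s)

print("log-normal CDF:     ", lognormal_cdf)
print("normal CDF of log x:", normal_cdf_of_log)
```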

7. Conclusion:

Understanding and visualizing continuous frequency distributions and probability distributions is essential for a variety of applications, from predicting future events and estimating risks to making decisions under uncertainty.

Python, with its rich libraries like NumPy and Matplotlib, provides a simple yet powerful platform for anyone keen on diving into the world of probability and statistics. So, harness this knowledge, and make your data speak the language of probability.
