Sampling and Sampling Distributions: A Comprehensive Guide

Explore the fundamentals of sampling and sampling distributions in statistics. Dive deep into various sampling methods, from simple random to stratified, and uncover the significance of sampling distributions in detail.

In this blog post, we will cover:

  1. What is Sampling?
  2. Why Sample?
  3. Types of Sampling Methods
    3.1. Simple Random Sampling (SRS)
    3.2. Stratified Sampling
    3.3. Cluster Sampling
    3.4. Systematic Sampling
    3.5. Convenience Sampling
    3.6. Quota Sampling
  4. Simple demonstration of different sampling methods using Python
  5. What is a Sampling Distribution?
    5.1. Simulate and visualize the sampling distribution of the sample mean using Python
    5.2. Key Concepts in Sampling Distributions
    5.3. Importance of Sampling Distributions
  6. Conclusion

1. What is Sampling?

Sampling refers to the process of selecting a subset (or a sample) from a larger set (often called a population). Instead of collecting data from every individual in the population (which can be time-consuming and costly), researchers typically collect data from a sample and then use that sample to make inferences about the larger population.

For example, if we wanted to know the average height of all adult men in a country, instead of measuring every single man, we could measure a sample of them and then estimate the average height for the entire group.

2. Why Sample?

Sampling has a host of benefits:

  1. Cost-effective: It’s often cheaper to collect data from a sample than from an entire population.
  2. Time-saving: Sampling can save a considerable amount of time.
  3. Feasibility: In some cases, it’s virtually impossible to survey an entire population.
  4. Accuracy: If done correctly, sampling can provide accurate estimates of population parameters.

3. Types of Sampling Methods

3.1. Simple Random Sampling (SRS)

Definition: Every individual in the population has an equal chance of being selected, and every possible sample of a given size is equally likely to be drawn.

Example: Imagine a bowl containing 100 unique lottery tickets. If you were to close your eyes and pick out 10 tickets one at a time, you’re engaging in simple random sampling.

3.2. Stratified Sampling

Definition: The population is divided into non-overlapping groups (or strata) based on a particular characteristic, and then a random sample is taken from each group.

Example: Let’s say you’re researching study habits among high school students across freshmen, sophomores, juniors, and seniors. Instead of picking randomly from the whole school, you first divide students by grade level and then randomly pick an equal number from each grade. This ensures representation from all grades.
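The grade-level example can be sketched in pandas as follows. This is a minimal illustration, not a prescribed method: the `students` table and its grade sizes are made up, and the second variant (sampling the same fraction of each grade, often called proportional allocation) is shown only for comparison with the equal-number approach described above.

```python
import numpy as np
import pandas as pd

# Hypothetical school with grades of different sizes
students = pd.DataFrame({
    "student_id": np.arange(400),
    "grade": np.repeat(["Freshman", "Sophomore", "Junior", "Senior"], [160, 120, 80, 40]),
})

# Equal allocation: the same number of students from every grade
equal_sample = students.groupby("grade").sample(n=10, random_state=1)

# Proportional allocation: the same fraction (10%) of every grade
proportional_sample = students.groupby("grade").sample(frac=0.10, random_state=1)

print(equal_sample["grade"].value_counts().to_dict())
print(proportional_sample["grade"].value_counts().to_dict())
```

Equal allocation guarantees every grade is represented by the same number of students, while proportional allocation keeps each grade's share of the sample in line with its share of the school.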

3.3. Cluster Sampling

Definition: The population is divided into clusters (often geographically), and then a random sample of clusters is chosen. All or a random sample of members from those selected clusters will be surveyed.

Example: Imagine you want to survey households in a large city. The city is divided into different neighborhoods (clusters). Instead of sampling households from the entire city, you randomly select a few neighborhoods and then survey all households (or a random sample of them) within those selected neighborhoods.
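The neighborhood example can be sketched as a two-step selection. The `households` table, the number of neighborhoods, and the choice of 2 clusters are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Hypothetical city: 200 households spread evenly over 10 neighborhoods
households = pd.DataFrame({
    "household_id": np.arange(200),
    "neighborhood": np.repeat(np.arange(10), 20),
})

# Step 1: randomly choose 2 of the 10 neighborhoods (the clusters)
chosen = rng.choice(households["neighborhood"].unique(), size=2, replace=False)

# Step 2: survey every household in the chosen neighborhoods
cluster_sample = households[households["neighborhood"].isin(chosen)]

print(sorted(chosen.tolist()), len(cluster_sample))
```

Choosing a fixed number of clusters without replacement guarantees that the sample is never empty and that its size is predictable when clusters are equally large.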

3.4. Systematic Sampling

Definition: Every $k$th individual is selected from a list or sequence.

Example: You have a list of 1,000 customers and want to select 50 for a survey. To do this, you might select every 20th customer from the list (1,000 divided by 50 equals 20). So you’d survey the 20th, 40th, 60th customer, and so on.
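The arithmetic in this example can be sketched in a few lines. One common refinement, assumed here rather than taken from the example, is to pick the starting position at random within the first interval instead of always starting at a fixed customer; the customer numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

customers = np.arange(1, 1001)   # hypothetical customer numbers 1..1000
n = 50                           # desired sample size
k = len(customers) // n          # sampling interval: 1000 // 50 = 20

start = rng.integers(0, k)       # random starting offset in [0, k)
systematic_sample = customers[start::k]  # every k-th customer from the start

print(k, len(systematic_sample))
```

Because the list length is an exact multiple of the interval, any starting offset yields exactly 50 customers spaced 20 apart.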

3.5. Convenience Sampling

Definition: The sample is chosen based on what is easy or convenient, rather than any systematic or random method.

Example: A street interviewer stops passers-by at a mall entrance to ask about their shopping preferences. Here, the sample consists of whoever happens to be at that particular entrance at that time – it’s convenient, but not necessarily representative of all shoppers.

3.6. Quota Sampling

Definition: The researcher ensures equal or proportionate representation of subjects depending on certain characteristics, but the selection within those categories might be non-random.

Example: If you’re surveying voters’ intentions before an election and you know the gender distribution is 50% male and 50% female, you might ensure that out of 100 surveyed individuals, 50 are male and 50 are female. However, how you select those 50 males and females might not be random.
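A minimal sketch of this idea, assuming a made-up list of respondents recorded in the order they were encountered and a quota of 3 per gender: the quota is filled with whoever comes first in each group, which is exactly the non-random step that distinguishes quota sampling from stratified sampling.

```python
import pandas as pd

# Hypothetical stream of respondents in the order they were encountered
respondents = pd.DataFrame({
    "respondent_id": range(1, 13),
    "gender": ["M", "F", "F", "M", "M", "F", "M", "F", "F", "M", "M", "F"],
})

quota_per_group = 3  # assumed quota: 3 men and 3 women

# Non-random selection: take the first respondents encountered in each group
quota_sample = respondents.groupby("gender").head(quota_per_group)

print(quota_sample["gender"].value_counts().to_dict())
```

The quota guarantees the gender split, but because later respondents never get a chance to be selected, the sample may still be unrepresentative in other respects.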

4. Simple demonstration of different sampling methods using Python

import pandas as pd
import numpy as np

# Create a sample DataFrame for demonstration
data = {
    'ID': range(1, 101),  # IDs for 100 individuals
    'Age': np.random.randint(15, 65, 100),  # Random ages from 15 to 64 (the upper bound 65 is exclusive)
    'Grade': np.random.choice(['Freshman', 'Sophomore', 'Junior', 'Senior'], 100)  # Random school grades
}
df = pd.DataFrame(data)

df.head()
   ID  Age      Grade
0   1   48     Junior
1   2   23  Sophomore
2   3   62  Sophomore
3   4   24   Freshman
4   5   44     Junior

# 1. Simple Random Sampling (SRS)
srs_sample = df.sample(n=10)  # Get 10 random rows from the DataFrame

print("Simple Random Sampling (SRS) Sample:")
srs_sample
Simple Random Sampling (SRS) Sample:
    ID  Age     Grade
89  90   38  Freshman
98  99   21    Junior
76  77   37    Junior
97  98   37    Junior
28  29   18    Junior
6    7   16    Junior
32  33   44  Freshman
24  25   56    Junior
94  95   33    Senior
8    9   24    Senior

# 2. Stratified Sampling
strat_sample = df.groupby('Grade').sample(n=2).reset_index(drop=True)  # Randomly draw 2 rows from each grade (stratum)

print("\nStratified Sampling Sample:")
strat_sample
Stratified Sampling Sample:
   ID  Age      Grade
0  10   59   Freshman
1  97   52   Freshman
2  12   17     Junior
3  98   37     Junior
4  88   51     Senior
5  30   34     Senior
6  72   33  Sophomore
7  35   30  Sophomore

# 3. Cluster Sampling
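clusters = df.groupby(df.index // 10)  # Create 10 clusters of 10 consecutive rows each
# Keep each cluster independently with probability 0.2, so about 20% of clusters are selected on average
# (the exact number of selected clusters varies from run to run and can even be zero)
selected_clusters = clusters.apply(lambda x: x if np.random.rand() < 0.2 else None).dropna()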

print("\nCluster Sampling Sample:")
selected_clusters
Cluster Sampling Sample:
      ID  Age      Grade
5 50  51   22     Senior
  51  52   50  Sophomore
  52  53   33  Sophomore
  53  54   25   Freshman
  54  55   30     Senior
  55  56   46     Senior
  56  57   28   Freshman
  57  58   48     Senior
  58  59   26     Junior
  59  60   25     Junior

# 4. Systematic Sampling
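k = len(df) // 10  # Sampling interval: 100 rows // 10 desired rows = 10
sys_sample = df.iloc[::k].head(10)  # Take every k-th row, starting from the first row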

print("\nSystematic Sampling Sample:")
sys_sample
Systematic Sampling Sample:
    ID  Age      Grade
0    1   48     Junior
10  11   46     Junior
20  21   41   Freshman
30  31   34     Junior
40  41   24  Sophomore
50  51   22     Senior
60  61   52   Freshman
70  71   28     Senior
80  81   60   Freshman
90  91   18   Freshman

# 5. Convenience Sampling
# Here, we'll just take the first 10 rows. In real-world scenarios, this would be akin to surveying whoever comes first.
conv_sample = df.head(10)

print("\nConvenience Sampling Sample:")
conv_sample
Convenience Sampling Sample:
   ID  Age      Grade
0   1   48     Junior
1   2   23  Sophomore
2   3   62  Sophomore
3   4   24   Freshman
4   5   44     Junior
5   6   52  Sophomore
6   7   16     Junior
7   8   50  Sophomore
8   9   24     Senior
9  10   59   Freshman

# 6. Quota Sampling
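# Let's say we have a quota to sample 3 individuals from each grade.
# For simplicity, individuals are picked at random within each grade here; in real quota sampling,
# selection within each category is often non-random (for example, whoever is available first).
quota_sample = df.groupby('Grade').sample(n=3).reset_index(drop=True)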

print("\nQuota Sampling Sample:")
quota_sample
Quota Sampling Sample:
    ID  Age      Grade
0   47   36   Freshman
1   33   44   Freshman
2   10   59   Freshman
3   96   39     Junior
4    5   44     Junior
5   59   26     Junior
6   30   34     Senior
7   88   51     Senior
8   78   56     Senior
9   53   33  Sophomore
10   8   50  Sophomore
11  64   16  Sophomore

5. What is a Sampling Distribution?

A sampling distribution is the distribution of a statistic (like the mean or proportion) based on all possible samples of a given size from a population. It tells us how much we would expect our sample statistic to vary from one sample to another.

For instance, if we were to repeatedly draw different samples of 100 men from our earlier example and calculate the average height for each sample, the distribution of those sample means would be the sampling distribution of the mean.

5.1. Simulate and visualize the sampling distribution of the sample mean using Python

In this example:

  1. We’ve created a population with a mean of 75 and a standard deviation of 15.
  2. We then repeatedly (1,000 times) drew random samples (each of size 100) from this population.
  3. For each sample, we computed its mean and stored it.
  4. Finally, we visualized the distribution of these sample means.

import numpy as np
import matplotlib.pyplot as plt

# Generating the population data
np.random.seed(0)
# Assume the population is normally distributed with mean 75 and standard deviation 15
population_data = np.random.randn(10000) * 15 + 75

# Simulate the sampling distribution of the sample mean
num_samples = 1000
sample_size = 100
sample_means = []

for _ in range(num_samples):
    sample = np.random.choice(population_data, size=sample_size, replace=False)
    sample_means.append(np.mean(sample))

# Plotting
plt.hist(sample_means, bins=30, edgecolor='k', alpha=0.7)
plt.title("Sampling Distribution of the Sample Mean")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
# Mark the mean of the sample means with a dashed red line
plt.axvline(x=np.mean(sample_means), color='r', linestyle='dashed', linewidth=1)
plt.show()

5.2. Key Concepts in Sampling Distributions

Central Limit Theorem (CLT): For a sufficiently large sample size, the sampling distribution of the sample mean will be approximately normal, regardless of the shape of the population’s distribution, provided the observations are sampled independently and the population has a finite variance. This is a powerful property that allows us to make statistical inferences.

To learn more about the Central Limit Theorem, refer to the blog post “Central Limit Theorem”.
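The effect of the CLT can be checked with a small simulation. The sketch below assumes a strongly right-skewed exponential population (an arbitrary choice) and uses skewness as a rough measure of how far the sample means are from a symmetric, normal-like shape; the helper names `skewness` and `sample_means` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strongly right-skewed population (exponential with mean 2); an assumed example
population = rng.exponential(scale=2.0, size=100_000)

def skewness(x):
    """Third standardized moment: 0 for a perfectly symmetric distribution."""
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

def sample_means(sample_size, num_samples=2000):
    """Means of num_samples random samples drawn with replacement from the population."""
    draws = rng.choice(population, size=(num_samples, sample_size), replace=True)
    return draws.mean(axis=1)

for n in (2, 30, 200):
    means = sample_means(n)
    print(f"n={n:>3}  skewness of sample means = {skewness(means):.2f}")
```

As the sample size grows, the skewness of the sample means should shrink toward zero, even though the population itself remains heavily skewed.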

Standard Error (SE): The standard error is the standard deviation of a statistic’s sampling distribution, so it measures how much the statistic varies from one sample to the next. A smaller SE indicates that our sample statistic (like the mean) is more consistent across different samples.
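For the sample mean of independent observations, the standard error equals $\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation and $n$ is the sample size. The sketch below compares that value with the spread of simulated sample means, assuming a normal population with mean 75 and standard deviation 15 and samples drawn with replacement; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population: mean 75, standard deviation 15
population = rng.normal(loc=75, scale=15, size=10_000)
sample_size = 100

# Theoretical standard error of the mean: sigma / sqrt(n)
theoretical_se = population.std() / np.sqrt(sample_size)

# Empirical standard error: standard deviation of many simulated sample means
sample_means = [
    rng.choice(population, size=sample_size, replace=True).mean()
    for _ in range(2000)
]
empirical_se = np.std(sample_means)

# With a single sample, sigma is unknown and is replaced by the sample standard deviation
one_sample = rng.choice(population, size=sample_size, replace=True)
estimated_se = one_sample.std(ddof=1) / np.sqrt(sample_size)

print(f"theoretical: {theoretical_se:.3f}")
print(f"empirical:   {empirical_se:.3f}")
print(f"estimated:   {estimated_se:.3f}")
```

All three values should be close to 15 / 10 = 1.5, and quadrupling the sample size would roughly halve each of them.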

5.3. Importance of Sampling Distributions

Sampling distributions are crucial for hypothesis testing and confidence interval estimation. Knowing how our sample statistic behaves (its distribution) under repeated sampling allows us to:

  1. Assess the likelihood of observing our sample results if some null hypothesis were true.
  2. Gauge the precision of our sample estimates.
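As an illustration of the second point, the sketch below builds an approximate 95% confidence interval for a population mean from a single sample. It assumes a normal population with mean 75 and standard deviation 15, and relies on the sampling distribution of the mean being approximately normal, so the interval is the sample mean plus or minus 1.96 standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)

population = rng.normal(loc=75, scale=15, size=10_000)  # assumed population
sample = rng.choice(population, size=100, replace=False)

sample_mean = sample.mean()
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))

z = 1.96  # approximate 97.5th percentile of the standard normal distribution
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"sample mean: {sample_mean:.2f}")
print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

Under repeated sampling, roughly 95% of intervals constructed this way would be expected to contain the true population mean; any single interval either does or does not.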

6. Conclusion

Sampling and sampling distributions provide the foundation for much of inferential statistics. By understanding these concepts, we are better equipped to make informed decisions based on sample data. As always, the key lies in choosing the right sampling method and ensuring that our sample is representative of the larger population.
