Sampling and Inference
In statistics, it is often impossible or impractical to collect data from every member of a group. Instead, we use a sample to draw conclusions, called inferences, about the entire group, which is known as the population.
What is a Random Sample?
For an inference to be accurate, the sample must fairly represent the whole population. The best way to achieve this is through a random sample, where every member of the population has an equal chance of being selected.
For example, if a school wants to estimate the average height of all 7th graders, measuring every student might take too long. Instead, the school can randomly select 20 students. If the students are chosen truly at random (like drawing names from a hat), the average height of these 20 students is likely to be a reasonable estimate for the whole grade.
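The height example can be sketched in Python. This is only an illustration: the grade of 200 students and their heights are made up (generated around an assumed average of 155 cm), and the names `population_heights` and `sample_heights` are hypothetical. The call to `sample` plays the role of drawing names from a hat.

```python
import random
import statistics

# Hypothetical data: heights (in cm) for a made-up grade of 200 students.
rng = random.Random(7)  # fixed seed so the sketch gives the same result each run
population_heights = [round(rng.gauss(155, 8), 1) for _ in range(200)]

# Draw 20 students without replacement; every student is equally likely to be picked.
sample_heights = rng.sample(population_heights, 20)

sample_mean = statistics.mean(sample_heights)
population_mean = statistics.mean(population_heights)

print(f"Sample mean:     {sample_mean:.1f} cm")
print(f"Population mean: {population_mean:.1f} cm")
```

In a real survey the population mean would be unknown; it is printed here only so the two numbers can be compared.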
Making Inferences
An inference is a logical conclusion based on your sample data. If you survey a random sample of 50 students and 30 of them prefer pizza over burgers, you can infer that roughly 30/50 (or 60%) of the entire school population also prefers pizza.
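The arithmetic behind the pizza inference can be written out as a short sketch. The survey numbers (30 out of 50) come from the example; the school size of 600 and the function name `estimate_count` are assumptions added for illustration.

```python
def estimate_count(sample_size, sample_yes, population_size):
    """Scale a sample proportion up to a population of a given size."""
    proportion = sample_yes / sample_size
    return proportion, proportion * population_size

# 50 students surveyed, 30 prefer pizza (from the example).
# The school size of 600 is a hypothetical value.
proportion, estimated_pizza_fans = estimate_count(50, 30, 600)

print(f"Sample proportion: {proportion:.0%}")
print(f"Estimated pizza fans in a school of 600: about {estimated_pizza_fans:.0f}")
```

The result is an estimate, not an exact count: a different random sample of 50 students would probably give a slightly different proportion.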
Keep in mind:
- Sample Size: Larger random samples generally give more accurate predictions.
- Bias: If a sample is not random (e.g., only surveying players on the basketball team about their height), it is biased and cannot be used to make a valid inference about the whole school.
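The point about sample size can be explored with a small simulation. This is a sketch under made-up conditions: a hypothetical school of 1,000 students in which exactly 60% prefer pizza, and a helper named `average_error` that is not part of any library. It repeats the survey many times and measures how far the sample proportion lands from the true value on average.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch gives the same result each run

# Hypothetical school of 1,000 students in which exactly 60% prefer pizza (1 = pizza).
population = [1] * 600 + [0] * 400
true_proportion = sum(population) / len(population)

def average_error(sample_size, trials=500):
    """Average distance between the sample proportion and the true proportion."""
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # random, without replacement
        total += abs(sum(sample) / sample_size - true_proportion)
    return total / trials

small_error = average_error(10)
large_error = average_error(200)

print(f"Average error with 10 students:  {small_error:.3f}")
print(f"Average error with 200 students: {large_error:.3f}")
```

The larger sample should show a smaller average error. Note that a bigger sample does not fix bias: if the sample is not random, collecting more of it still gives a misleading result.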
Comparing Two Populations
You can also use samples to compare two different populations. To do this, you look at two main features of the sample data:
- Center: Where is the middle of the data? We usually measure this using the mean (average) or the median.
- Spread: How spread out is the data? We measure this using the range or the Mean Absolute Deviation (MAD).
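The four measures above can be computed as follows. This is a minimal sketch using a made-up list of five scores; `data_range` and `mean_absolute_deviation` are hypothetical helper names, while `mean` and `median` come from Python's standard `statistics` module.

```python
import statistics

def data_range(values):
    """Range: the largest value minus the smallest value."""
    return max(values) - min(values)

def mean_absolute_deviation(values):
    """MAD: the average distance of each value from the mean."""
    center = statistics.mean(values)
    return sum(abs(v - center) for v in values) / len(values)

scores = [70, 75, 80, 85, 90]  # made-up sample data

print("Mean:  ", statistics.mean(scores))          # (70+75+80+85+90) / 5 = 80
print("Median:", statistics.median(scores))        # middle value = 80
print("Range: ", data_range(scores))               # 90 - 70 = 20
print("MAD:   ", mean_absolute_deviation(scores))  # (10+5+0+5+10) / 5 = 6.0
```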
Example: Imagine you are comparing the test score distributions from two different classes.
- Class A's sample has a mean score of 85 and a range of 10.
- Class B's sample has a mean score of 75 and a range of 25.
By comparing the centers, you can infer that Class A generally scored higher. By comparing the spread, you can infer that Class A's scores were much more consistent (closer together), while Class B's scores were more widely varied.
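The comparison can be reproduced with a short sketch. The individual scores below are invented so that they match the means and ranges given in the example; the original example does not list the actual scores, and the `summarize` helper is hypothetical.

```python
import statistics

# Made-up scores chosen to match the example's summary values.
class_a = [80, 82, 84, 85, 86, 88, 90]  # mean 85, range 10
class_b = [62, 68, 72, 75, 78, 83, 87]  # mean 75, range 25

def summarize(scores):
    """Return the center (mean) and spread (range) of a list of scores."""
    return {"mean": statistics.mean(scores), "range": max(scores) - min(scores)}

summary_a = summarize(class_a)
summary_b = summarize(class_b)

print("Class A:", summary_a)
print("Class B:", summary_b)

# Compare centers and spreads.
higher = "A" if summary_a["mean"] > summary_b["mean"] else "B"
more_consistent = "A" if summary_a["range"] < summary_b["range"] else "B"
print(f"Class {higher} has the higher mean score.")
print(f"Class {more_consistent} has the more consistent scores.")
```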