Sampling and Inference

In statistics, it is often impossible or impractical to collect data from every member of a group. Instead, we use a sample to draw conclusions, called inferences, about the entire group, which is known as the population.

What is a Random Sample?

For an inference to be accurate, the sample must fairly represent the whole population. The best way to achieve this is through a random sample, where every member of the population has an equal chance of being selected.

For example, if a school wants to estimate the average height of all 7th graders, measuring every student might take too long. Instead, they can randomly select 20 students. If the students are chosen truly at random (like drawing names from a hat), the average height of these 20 students is likely to be a good estimate for the whole grade.
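
The height example can be sketched in a few lines of Python. This is only an illustration: the grade of 200 students and their heights are made-up values generated by the program, and the names `grade_heights` and `sample` are hypothetical. It uses `random.sample`, which picks items without repeats and gives every item an equal chance, much like drawing names from a hat.

```python
import random
import statistics

# Fixed seed so the run is repeatable (an assumption for illustration).
rng = random.Random(7)

# Hypothetical population: heights in cm for a made-up grade of 200 students.
grade_heights = [round(rng.gauss(155, 8), 1) for _ in range(200)]

# Simple random sample: 20 students, each equally likely to be chosen,
# and no student chosen twice.
sample = rng.sample(grade_heights, 20)

sample_mean = statistics.mean(sample)
grade_mean = statistics.mean(grade_heights)

print(f"Sample mean (20 students): {sample_mean:.1f} cm")
print(f"Grade mean (200 students): {grade_mean:.1f} cm")
```

In a real study the grade mean would be unknown; it is computed here only to show how close the sample estimate lands.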

Making Inferences

An inference is a logical conclusion based on your sample data. If you survey a random sample of 50 students and 30 of them prefer pizza over burgers, you can infer that roughly $\frac{30}{50}$ (or 60%) of the entire school population also prefers pizza.
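
The pizza survey arithmetic can be checked with a short Python sketch. The school size of 800 is an assumption added here purely to show how a sample proportion scales up to a population estimate; it does not come from the survey itself.

```python
from fractions import Fraction

sample_size = 50     # students surveyed
prefer_pizza = 30    # students in the sample who prefer pizza

# Sample proportion as an exact fraction: 30/50 reduces to 3/5.
proportion = Fraction(prefer_pizza, sample_size)

# Hypothetical school size, used only for illustration.
school_size = 800
estimated_pizza_fans = int(proportion * school_size)

print(f"Sample proportion: {float(proportion):.0%}")
# prints "Sample proportion: 60%"
print(f"Estimated pizza fans: about {estimated_pizza_fans} of {school_size}")
# prints "Estimated pizza fans: about 480 of 800"
```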

Keep in mind:

  • Sample Size: Larger random samples generally give more accurate predictions.
  • Bias: If a sample is not random (e.g., only surveying players on the basketball team about their height), it is biased and cannot be used to make a valid inference about the whole school.
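
Both points can be explored with a small simulation. The sketch below is a rough illustration under stated assumptions: the school of 500 students and their heights are made up, the helper `average_error` is hypothetical, and the "basketball team" is simply taken to be the 15 tallest students.

```python
import random
import statistics

# Fixed seed so the run is repeatable (an assumption for illustration).
rng = random.Random(42)

# Hypothetical school: 500 made-up student heights in cm.
school = [rng.gauss(160, 9) for _ in range(500)]
true_mean = statistics.mean(school)


def average_error(sample_size, trials=300):
    """Average distance between a random sample's mean and the true mean."""
    errors = []
    for _ in range(trials):
        sample = rng.sample(school, sample_size)
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)


# Sample size: compare small and large random samples.
error_small = average_error(5)
error_large = average_error(50)

# Bias: pretend the 15 tallest students are the basketball team.
basketball_team = sorted(school)[-15:]
biased_mean = statistics.mean(basketball_team)

print(f"Average error with  5 students: {error_small:.2f} cm")
print(f"Average error with 50 students: {error_large:.2f} cm")
print(f"True mean: {true_mean:.1f} cm, basketball-team mean: {biased_mean:.1f} cm")
```

The larger samples should land closer to the true mean on average, while the basketball-team sample overestimates it no matter how carefully the heights are measured.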

Comparing Two Populations

You can also use samples to compare two different populations. To do this, you look at two main features of the sample data:

  1. Center: Where is the middle of the data? We usually measure this using the mean (average) or the median.
  2. Spread: How spread out is the data? We measure this using the range or the Mean Absolute Deviation (MAD).
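
For reference, the MAD is the average distance of each data value from the mean. The formula and worked example below use a made-up data set (2, 4, 6, 8, 10) chosen only for illustration, with $x_i$ standing for the data values, $\bar{x}$ for their mean, and $n$ for how many values there are.

```latex
\[
\text{MAD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
\]
% Worked example with the made-up data set 2, 4, 6, 8, 10:
\[
\bar{x} = \frac{2+4+6+8+10}{5} = 6, \qquad
\text{MAD} = \frac{4+2+0+2+4}{5} = \frac{12}{5} = 2.4, \qquad
\text{range} = 10 - 2 = 8
\]
```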

Example: Imagine you are comparing the test score distributions from two different classes.

  • Class A's sample has a mean score of 85 and a range of 10.
  • Class B's sample has a mean score of 75 and a range of 25.

By comparing the centers, you can infer that Class A generally scored higher. By comparing the spread, you can infer that Class A's scores were much more consistent (closer together), while Class B's scores were more widely varied.
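
The comparison can be reproduced with a short Python sketch. The individual scores below are hypothetical, chosen only so that their means and ranges match the summary values in the example; the helper names `mean_absolute_deviation` and `summarize` are likewise made up for illustration.

```python
import statistics


def mean_absolute_deviation(values):
    """Average distance of each value from the mean."""
    center = statistics.mean(values)
    return statistics.mean([abs(v - center) for v in values])


def summarize(scores):
    """Return one measure of center and two measures of spread."""
    return {
        "mean": statistics.mean(scores),
        "range": max(scores) - min(scores),
        "mad": mean_absolute_deviation(scores),
    }


# Hypothetical samples, picked to match the example's means and ranges.
class_a = summarize([80, 83, 85, 87, 90])
class_b = summarize([62, 70, 75, 81, 87])

for name, stats in (("Class A", class_a), ("Class B", class_b)):
    print(f"{name}: mean={stats['mean']}, range={stats['range']}, MAD={stats['mad']:.1f}")
```

For these made-up scores the MAD tells the same story as the range: Class A's value is smaller, so its scores sit closer to the mean.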