Sampling Techniques in Statistics for Machine Learning

By | September 14, 2020

Data is the fuel of a data scientist’s work. Any study or research project needs a good amount of quality data, and what counts as a ‘good amount of quality data’ depends on the kind of study. Sampling techniques exist to get you just that.

As a researcher, you may want to study animals, changing weather, human behavior, car sales for an automotive company, fraud at an insurance company, or any of a million other topics.

One common and important exercise in all these studies is collecting quality data. Sometimes collecting the data is the challenge; in other cases the data already exists, but it is hard to work with the full dataset. In those scenarios, one needs a way to select a ‘good amount of quality data’ to work with.

In this article, we will focus on selecting data (sampling): why it is necessary and how to do it effectively. The rest of the article is divided into the following parts:

  1. Sampling
  2. Sampling Techniques
  3. Qualities of Good Sampling Techniques
  4. Conclusion

1. Sampling

Suppose you want to study the heights of all the humans on planet Earth. There is no way you can collect the height of every living person. What you can do is select a subset of people whose heights are available and treat it as representative of the whole population.

But wait! Two questions arise immediately: (1) How do you decide which people go into the data and which do not? (2) How much data must you select so that it represents the original population accurately?

[Figure: Sampling]

Sampling is the answer to those questions, and there are various sampling techniques to choose from. Which technique works best depends on the kind of problem you are trying to solve.

If you understand these techniques well, deciding which sampling technique to use in a study shouldn’t be a big problem.


2. Sampling Techniques

Sampling techniques fall into two broad categories, each with its own way of operating:

  1. Probability Sampling
  2. Non-Probability Sampling

1. Probability Sampling

In probability-based sampling techniques, each element of the population has a known, non-zero probability of being selected.

Three probability-based sampling techniques are commonly used in statistics and machine learning studies:

  1. Simple Random Sampling
  2. Stratified Random Sampling
  3. Cluster Sampling

Simple Random Sampling

In simple random sampling, each element of the population has an equal chance of being selected into the sample.

Selecting one element places no restriction on which other elements can be selected later, so every element remains equally likely to be chosen throughout the sampling procedure.

Simple Random Sampling is an unbiased surveying technique.
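
As a minimal sketch of the idea, assuming the population fits in a Python list and that we sample without replacement, simple random sampling can be written with the standard library. The function name `simple_random_sample` and the height values are made up for illustration.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct elements; every element has the same chance of selection."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    # Random.sample draws without replacement from the given sequence.
    return rng.sample(list(population), k)

# Hypothetical population: 100 heights in centimetres.
heights_cm = [150 + i for i in range(100)]
sample = simple_random_sample(heights_cm, 10, seed=42)
print(sample)
```

Fixing the seed is only for reproducibility; leaving it as `None` gives a different sample on each run.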


Stratified Random Sampling

In the stratified random sampling technique, the population is first divided into multiple sub-groups. These sub-groups are formed based on a pre-defined condition (or possibly a set of conditions). These sub-groups of the population are called strata.

Once these sub-groups are formed, simple random sampling is applied to each sub-group of the population to get small samples. All the collected small samples from each sub-group are then joined together to form the final sampled dataset.

For example, suppose a researcher wants to study the heights of students in a school. The researcher might divide the students into two sub-groups based on sex: boys and girls.

Suppose there are 200 boys and 100 girls in that school and you need to draw a sample of 60 students. There are two obvious ways to do that:

  1. Select 30 students from each sub-group (30 boys and 30 girls)
  2. Select 40 boys and 20 girls (considering the original gender ratio of 2:1)

In the first case, we take an equal number of elements from each sub-group without regard to the population ratio between the sub-groups. Because the sample sizes are not proportional to the sizes of the strata, this is called disproportionate stratified random sampling.

In the second case, we sample from each sub-group in proportion to its share of the population. This is called proportionate stratified random sampling.
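
The school example can be sketched in Python as follows. This is an illustration under simple assumptions: the student identifiers are invented, and the helper names `stratified_sample` and `proportionate_sizes` are hypothetical rather than part of any library.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Apply simple random sampling inside each stratum, then join the results."""
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, sizes[name]))
    return sample

def proportionate_sizes(strata, total):
    """Split the total sample size according to each stratum's population share."""
    population = sum(len(members) for members in strata.values())
    # Rounding can make the sizes miss the total when shares are not whole numbers.
    return {name: round(total * len(members) / population)
            for name, members in strata.items()}

# Hypothetical school: 200 boys and 100 girls.
strata = {
    "boys": [f"boy_{i}" for i in range(200)],
    "girls": [f"girl_{i}" for i in range(100)],
}

proportionate = proportionate_sizes(strata, 60)   # 40 boys, 20 girls
disproportionate = {"boys": 30, "girls": 30}      # equal size per stratum

prop_sample = stratified_sample(strata, proportionate, seed=0)
equal_sample = stratified_sample(strata, disproportionate, seed=0)
print(proportionate, len(prop_sample), len(equal_sample))
```

Both calls use the same sampling routine; only the per-stratum sizes differ, which is exactly the difference between the two variants.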


[Figure: Stratified Sampling vs Cluster Sampling]

Cluster Sampling

Cluster sampling is used when the population can be divided into groups that are similar to one another (mutually homogeneous) while each group is internally heterogeneous, so that, ideally, any one group resembles the population as a whole.

In this technique, the population is first divided into such groups, which are known as clusters.

Once clusters are formed, simple random sampling is applied to select a few clusters for the study. Once the clusters are selected (sampled), there are two possibilities:

  1. Take all the elements from each selected cluster.
  2. Choose elements from each selected cluster using simple random sampling or stratified sampling, and combine them afterwards.

The first case is single-stage cluster sampling. In the second case, we are performing sampling in two stages, so this kind of cluster sampling is called ‘two-stage’ cluster sampling.

Thus, cluster sampling can be ‘multi-stage’, depending on the requirements.
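
The two possibilities can be sketched as one function, assuming the clusters are already given as a dictionary. The name `cluster_sample`, its `per_cluster` parameter, and the school data are hypothetical choices made for this illustration.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Pick clusters at random, then take all or some elements from each."""
    rng = random.Random(seed)
    # Stage 1: simple random sampling over the cluster names.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample.extend(members)  # single stage: keep the whole cluster
        else:
            # Stage 2: simple random sampling inside the chosen cluster.
            sample.extend(rng.sample(members, per_cluster))
    return chosen, sample

# Hypothetical population: six schools (clusters) with 50 students each.
clusters = {
    f"school_{c}": [f"school_{c}_student_{i}" for i in range(50)]
    for c in "ABCDEF"
}

chosen_one, one_stage = cluster_sample(clusters, 2, seed=1)
chosen_two, two_stage = cluster_sample(clusters, 2, per_cluster=10, seed=1)
print(chosen_one, len(one_stage))   # two schools, 100 students
print(chosen_two, len(two_stage))   # two schools, 20 students
```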


2. Non-Probability Sampling

In non-probability sampling, elements are selected primarily based on the judgment of the analyst rather than by random selection. There are four common non-probability based sampling techniques in statistics:

  1. Accidental Sampling
  2. Quota Sampling
  3. Snowball Sampling
  4. Purposive Sampling

Accidental Sampling

Suppose you want to study what people think about the current education system in your country, and you plan to interview people until you have at least 1,000 responses for the analysis.

Suppose you interview the first 1,000 people who agree to share their thoughts, and you study this sample. This is called ‘accidental sampling’ (also known as convenience sampling): you have no control over the kind of person you end up interviewing.

This kind of sampling is often biased and might not represent the original population accurately.

This technique is chosen mostly when other kinds of sampling are not feasible.


Quota Sampling

Quota sampling aims to make the sample diverse enough to represent the original population.

Each sub-group of the population is given a quota of places in the sample, but the elements within each sub-group are not selected using any probability sampling technique.

The main goal of quota sampling is not to reproduce the original proportions of the sub-groups. Rather, it ensures that there are enough examples from each sub-group to represent the diversity of the original population.
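
One way to picture this is to fill each quota in the order respondents happen to arrive, with no random selection at all. The sketch below assumes that setup; the function `quota_sample`, the respondent tuples, and the quotas are invented for illustration.

```python
def quota_sample(respondents, group_of, quotas):
    """Accept respondents in arrival order until every sub-group quota is full."""
    counts = {group: 0 for group in quotas}
    sample = []
    for person in respondents:
        group = group_of(person)
        # Keep the person only if their sub-group still has room.
        if group in counts and counts[group] < quotas[group]:
            sample.append(person)
            counts[group] += 1
        if counts == quotas:
            break  # all quotas filled; later respondents are ignored
    return sample

# Hypothetical respondents: every third one is a teacher, the rest are students.
respondents = [(f"r{i}", "student" if i % 3 else "teacher") for i in range(30)]
picked = quota_sample(respondents, lambda p: p[1], {"student": 5, "teacher": 5})
print(picked)
```

Teachers are a third of the respondents here but half of the sample, which shows how a quota secures representation without following the original proportions.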


Snowball Sampling

Snowball sampling is mainly useful for investigating sensitive topics. In this technique, people who are already in the sample lead you to new people to sample.

Suppose you want to study alcoholics or drug abusers. People you have already identified might introduce you to others like them, so your sample keeps growing with the help of those already sampled.
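
The referral process can be sketched as a breadth-first walk over a "who introduces whom" mapping. This is only an illustration: the mapping `referrals`, the participant labels, and the function `snowball_sample` are hypothetical.

```python
from collections import deque

def snowball_sample(seeds, referrals, max_size):
    """Grow the sample by following referrals from already sampled people."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        # Each sampled person may introduce new, not yet seen participants.
        for contact in referrals.get(person, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return sample

# Hypothetical referral data: p1 is the only participant known at the start.
referrals = {
    "p1": ["p2", "p3"],
    "p2": ["p4"],
    "p3": ["p4", "p5"],
    "p5": ["p6"],
}
wave = snowball_sample(["p1"], referrals, max_size=5)
print(wave)  # ['p1', 'p2', 'p3', 'p4', 'p5']
```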


Purposive Sampling

Purposive sampling, also called judgmental sampling, aims at selecting the sample based on some pre-defined criteria (judgment).

Suppose the police are trying to find a stolen car in a city. A good strategy would be to first investigate known criminals in that area (known thieves). Sampling that is based on such a judgment-driven condition is called ‘purposive’ or ‘judgmental’ sampling.


Tip: When sampling is done in multiple stages, it is possible to combine probability and non-probability based sampling techniques to arrive at the final sample.

3. Qualities of Good Sampling Techniques

Sampling techniques aim to select a small portion of the total population that is representative of that population. In practice, a small sample will never represent a large population with 100% accuracy, and a researcher who performs sampling is well aware of that.

Researchers therefore generally define some limits before sampling. For example, a researcher might require that the sample represent the true population with an error no larger than a defined limit.

A sampling plan that ensures the sample statistics are correct within such defined limits is termed ‘a good or representative sampling plan’.
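
One common way to make such a limit concrete is the margin of error of the sample mean, z * s / sqrt(n). The sketch below assumes a simple random sample and a normal approximation with z = 1.96 (roughly 95% confidence); for a sample as small as this one a t-based value would be more careful. The heights and the helper names `margin_of_error` and `within_limit` are made up for illustration.

```python
import math
import statistics

def margin_of_error(sample, z=1.96):
    """Approximate margin of error of the sample mean: z * s / sqrt(n)."""
    s = statistics.stdev(sample)  # sample standard deviation
    return z * s / math.sqrt(len(sample))

def within_limit(sample, limit, z=1.96):
    """Check whether the estimated error stays inside the researcher's limit."""
    return margin_of_error(sample, z) <= limit

# Hypothetical sample of ten heights in centimetres.
heights_cm = [158, 162, 165, 167, 170, 171, 173, 175, 178, 181]
moe = margin_of_error(heights_cm)
print(round(statistics.mean(heights_cm), 1), round(moe, 2))
print(within_limit(heights_cm, 5.0), within_limit(heights_cm, 2.0))
```

If the check fails for the chosen limit, the usual remedy is a larger sample, since the margin shrinks with the square root of the sample size.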


4. Conclusion

This article discussed various sampling techniques used in statistical data analysis.

It also explained why sampling is needed in research and described the qualities of a good sampling plan.

Thanks for reading! I hope you enjoyed the article. Please share your feedback by commenting below.

See you in the next article!

