Data Collection in Statistics and Machine Learning




Data collection is the most important prerequisite of any statistics or machine learning project. Without data, you cannot proceed with your statistical or machine learning tasks.

There are two ways to collect data:

  • a complete enumeration of the entire population, called a census, which in many cases is impractical because of the cost and time involved,
  • a partial enumeration based on a sample, which saves time and money.
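The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is synthetic (10,000 values drawn from a normal distribution), the sample is a simple random sample of 200 units, and the function names `census_mean` and `sample_mean` are hypothetical helpers, not part of any library.

```python
import random
import statistics


def census_mean(population):
    """Complete enumeration: use every unit in the population."""
    return statistics.mean(population)


def sample_mean(population, n, seed=0):
    """Partial enumeration: use a simple random sample of n units,
    drawn without replacement."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    return statistics.mean(sample)


# Synthetic population (an assumption for illustration only).
rng = random.Random(42)
population = [rng.gauss(50, 10) for _ in range(10_000)]

print("census mean:", census_mean(population))
print("sample mean:", sample_mean(population, 200))
```

The sample mean is computed from 2% of the units, yet it usually lands close to the census mean, which is why sampling is often preferred when a census is too costly.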

Raw data that have not undergone any statistical treatment are called primary data. Data that have been treated by a statistical method at least once (that is, collected, classified, tabulated or presented in some form for a certain purpose) are called secondary data.

How Data Can Help Statistics and Machine Learning

Data is the backbone of a wide range of disciplines in government, academia and industry. In practice, you cannot do any statistics or machine learning without data, and these days there is hardly a field that is not fueled by it. Governments, businesses and scientists rely heavily on data to generate descriptive and inferential statistics and to build machine learning models.

Here are a few data-driven statistical and machine learning applications:

  • In experimental science, scientists collect data from their experiments and then analyze it statistically to test hypotheses.
  • Governments are collecting more data than ever before and using it for descriptive and inferential statistics, as well as for building machine learning models and predictive analytics. These applications help governments make informed decisions and improve planning and risk management.
  • In education, data plays a critical role in computing statistics that describe the results and standards of education. Data is also used to build interactive and online/remote education programs.

Misusing Data and Statistics

Sometimes data and statistics are misused by relying on

  • unrepresentative samples,
  • small sample sizes,
  • ambiguous averages and dispersions,
  • detached facts,
  • implied connections,
  • wrong and misleading graphs,
  • wrong use of statistical techniques,
  • serious violations of the assumptions behind the statistical techniques, and faulty surveys.
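As one illustration of the list above, the "ambiguous averages" problem can be shown with a small sketch. The income figures below are made up for the example, and the variable names are hypothetical. With a single extreme value in the data, the mean and the median tell very different stories, so reporting "the average" without saying which one can mislead.

```python
import statistics

# Made-up incomes (an assumption for illustration): five typical
# values and one extreme value.
incomes = [30_000, 32_000, 35_000, 38_000, 40_000, 1_000_000]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # middle of the sorted values

print("mean:  ", mean_income)
print("median:", median_income)
```

Here the median is 36,500, while the mean is more than five times larger because of one extreme value. Both are legitimate "averages", which is exactly why the choice should be stated.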
