The Averages a Data Scientist Must Know About




In data science, we often spend a considerable amount of time measuring and understanding the central tendency of data. These metrics in turn help us build meaningful analytics for the business, from simple to advanced. Several different types of averages can be used for this task.

In this article, we will quickly look into some of the important averages along with their advantages and disadvantages.

Arithmetic Mean

The arithmetic mean is the sum of a series of numbers divided by the count of numbers in that series. For example, if we are asked to compute the arithmetic mean of test scores, we simply add up all the students' test scores and then divide that sum by the number of students.
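As a minimal sketch of the definition above, the following Python snippet computes the arithmetic mean of a handful of test scores. The scores and the helper name `arithmetic_mean` are made up for illustration; the standard library's `statistics.mean` gives the same result.

```python
from statistics import mean

def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    if not values:
        raise ValueError("arithmetic mean of an empty series is undefined")
    return sum(values) / len(values)

# Hypothetical test scores for five students.
scores = [70, 85, 90, 65, 80]

print(arithmetic_mean(scores))  # 78.0

# Cross-check against the standard library.
assert arithmetic_mean(scores) == mean(scores)
```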

Advantages:

1 - Rigorously defined by a mathematical formula.
2 - Based on all the observations in the data.
3 - Easy to calculate and simple to comprehend.
4 - Can be determined for almost every kind of numerical data.
5 - Relatively stable under sampling fluctuations, which is why it is so widely used.
6 - Amenable to mathematical treatment.

Disadvantages:

1 - Extreme values in the data greatly affect it.
2 - It sometimes leads to fallacious conclusions.
3 - The mean is not an appropriate measure of average in a highly skewed distribution.
4 - If the grouped data have "open-end" classes, the mean cannot be calculated without assuming the class limits.
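The sensitivity to extreme values can be seen with a small experiment. The salary figures below are invented for illustration; the snippet compares the mean with the median (covered later in this article) before and after one extreme value is added.

```python
from statistics import mean, median

# Hypothetical monthly salaries (in thousands).
salaries = [30, 32, 35, 38, 40]

# The same data with one extreme value appended.
salaries_with_outlier = salaries + [500]

print(mean(salaries), median(salaries))                            # 35 35
print(mean(salaries_with_outlier), median(salaries_with_outlier))  # 112.5 36.5
```

A single extreme value more than triples the mean, while the median barely moves.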

Geometric Mean

The geometric mean takes period-over-period compounding into account. It is calculated by taking the product of all the numbers in a series and then raising that product to the inverse of the length of the series (that is, taking the n-th root for a series of n numbers).
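The calculation above can be sketched in Python as follows. The growth factors and the helper name `geo_mean` are assumptions made for illustration; `math.prod` and `statistics.geometric_mean` both require Python 3.8 or later.

```python
import math
from statistics import geometric_mean

def geo_mean(values):
    """n-th root of the product of n non-negative values."""
    if not values or any(v < 0 for v in values):
        raise ValueError("this sketch expects a non-empty series of non-negative values")
    return math.prod(values) ** (1 / len(values))

# Hypothetical yearly growth factors: +10%, -5%, +20%.
growth = [1.10, 0.95, 1.20]

g = geo_mean(growth)
print(round(g, 4))

# Compounding the geometric mean over three periods reproduces the total growth.
assert math.isclose(g ** 3, math.prod(growth))

# Cross-check against the standard library.
assert math.isclose(g, geometric_mean(growth))
```

The first assertion is the reason the geometric mean suits growth rates: applying the average factor in every period gives the same end result as applying the actual factors one after another.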

Advantages:

1 - It is rigorously defined by a mathematical formula.
2 - Based on all observed values.
3 - In some cases it is amenable to mathematical treatment.
4 - It gives equal weightage to all the observations.
5 - It is not much affected by sampling variability.
6 - It is an appropriate type of average to be used in case rates of change or ratios are to be averaged.

Disadvantages:

1 - It is neither easy to calculate nor easy to understand.
2 - It vanishes if any observation is zero.
3 - In case of negative values, it cannot be computed at all.

Harmonic Mean

The harmonic mean is calculated by dividing the number of observations by the sum of the reciprocals of the numbers in the series. In other words, the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.
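A minimal sketch of "the reciprocal of the arithmetic mean of the reciprocals", using a made-up trip in which the same distance is driven at two different speeds. The helper name `harm_mean` and the speeds are illustrative assumptions; `statistics.harmonic_mean` is the standard-library equivalent.

```python
from statistics import harmonic_mean

def harm_mean(values):
    """Number of observations divided by the sum of their reciprocals."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("this sketch expects a non-empty series of positive values")
    return len(values) / sum(1 / v for v in values)

# Hypothetical trip: the same distance driven at 40 km/h and then at 60 km/h.
speeds = [40, 60]

average_speed = harm_mean(speeds)
print(round(average_speed, 2))

# Cross-check against the standard library (allowing for floating-point rounding).
assert abs(average_speed - harmonic_mean(speeds)) < 1e-9
```

The result is about 48 km/h rather than the arithmetic mean of 50 km/h, because more time is spent driving at the slower speed.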

Advantages:

1 - It is rigorously defined by a mathematical formula.
2 - Based on all the observations in the data.
3 - Amenable to mathematical treatment.
4 - It is not much affected by sampling variability.
5 - It is an appropriate type for averaging rates and ratios.

Disadvantages:

1 - It is not readily understood.
2 - It cannot be calculated if any one of the observations is zero.
3 - It gives too much weightage to the smaller observations.

Median

The median is the middle number in an ordered list of numbers. To find the median value, we must first order the numbers from lowest to highest. If the count of numbers is odd, the median is the number in the middle. If the count of numbers is even, we must find the middle pair, add the two values together and divide by two to get the median.
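The procedure above can be sketched as a short Python function. The name `median_of` and the sample values are illustrative; in practice `statistics.median` does the same job.

```python
from statistics import median

def median_of(values):
    """Middle value of the sorted series; mean of the middle pair if the count is even."""
    if not values:
        raise ValueError("median of an empty series is undefined")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([7, 1, 5]))     # 5
print(median_of([7, 1, 5, 3]))  # 4.0

# Cross-check against the standard library.
assert median_of([7, 1, 5, 3]) == median([7, 1, 5, 3])
```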

Advantages:

1 - It is easily calculated and understood.
2 - It can be located even when the values are not capable of quantitative measurement, as long as they can be ranked.
3 - It is not affected by extreme values.
4 - It can be computed even when a frequency distribution involves "open-end" classes like those of income and prices.

Disadvantages:

1 - It is not uniquely determined when the number of observations is even; taking the midpoint of the middle pair is a convention.
2 - It does not lend itself to further statistical treatment.
3 - It requires arranging the data into an ordered array, which can be tedious and time consuming for a large body of data.

Mode

The mode is the most frequently occurring value in a set of observations. To find the mode, we count the frequency of each distinct value. The value with the highest number of occurrences is the mode.
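The counting step can be sketched with `collections.Counter`. The helper name `modes` and the sample data are assumptions made for illustration. The function returns a list because several values can share the highest frequency.

```python
from collections import Counter

def modes(values):
    """All values that share the highest frequency, in order of first appearance."""
    if not values:
        raise ValueError("mode of an empty series is undefined")
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

print(modes([2, 3, 3, 5, 7, 3, 5]))           # [3]
print(modes(["red", "blue", "red", "blue"]))  # ['red', 'blue']
```

The second call shows that the mode also works for qualitative data, and that a tie leaves more than one candidate.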

Advantages:

1 - It is simply defined and easily calculated. In many cases, it is extremely easy to locate the mode.
2 - It is not affected by abnormally large or small observations.
3 - It can be determined for both the quantitative and the qualitative data.

Disadvantages:

1 - It is not rigorously defined.
2 - It is often indeterminate and indefinite.
3 - It is not based on all the observations made.
