Data management is a tedious process. Data mishandling can lead to a vicious circle where eliminating complexities become difficult, sometimes impossible. Therefore, understanding the information regarding a data set is crucial to streamline the decision-making process and analyze data in the long term.
What is Descriptive Statistics?
Descriptive statistics describes the elementary characteristics of data by displaying the basic sum-up of a sample or a population. The term descriptive statistics is also applied to display quantitative descriptions in research where there is an involvement of measuring big data.
Classification of Descriptive Statistics
The classification of statistics are as follows:
- Measures of frequency distribution
- Measures of central tendency
- Measures of dispersion or variability
1. Measures of frequency distribution
Datasets are the distribution of values. The number of occurrences of a particular event in a sequence or a data set is known as frequency. Experts use tables and graphs for calculating the frequency of each likely reading of a variable, extracted in proportions.
Let us understand this with the help of an example:
Scores = 4, 4, 6, 6, 2, 1, 1, 4, 6, 6, 2, 2, 4, 6
No of sixes: 5
No of fours: 4
No of twos: 3
No of singles: 2
The normal distribution is popularly known as the Bell-curve or the Gaussian distribution. They are symmetric at the Mean, displaying the values close to the Mean more often in occurrence than compared to the data away from the Mean. The normal distribution looks like a bell-shaped curve having the characteristics stated below:
- Symmetric bell shape.
- The same Mean and Median; all situated at the middle of the Mean.
Note: The standard normal distribution contains two parameters, the Mean and Standard Deviation. With a normal distribution, 68% of the readings are in the +/- 1 Standard Deviation of the Mean, 95% are in the +/- 2 Standard Deviations, and 99.7% are in the +/- 3 Standard Deviations.
The diagram below shows you a clear view of a Normal distribution structure.
The normal distribution uses a theory called Central Limit Theorem. The theory explains that Means generated from independent, identically spread random variables have just about normal distributions, irrespective of the method of distribution by which the finite variables are tested.
2. Measures of central tendency (Mean, Median, Mode)
The central tendency is used to find the central or the middle-value of a dataset. The most commonly used central tendencies are Mean, Mode, and Median.
The Mean is also denoted as M and is the most popular technique of obtaining averages. To find the Mean of the data set, sum up the total values in the sequence, and divide the sum of the values by the number of responses, denoted as N.
Let us understand this with an example:
A person imagines the number of hours in a day he sleeps for in a week. Therefore, the dataset would contain the hours (7,8,8,10,8,6,9), and the total of the values, which is 56 and the number of values, which is 7.
We divide 56 by 7 to find the Mean. The result is 8, which is the Mean.
The Mode is simply the most repeated term in the sequence. We can find the Mode by rearranging our data set in ascending order, which is, from the lowest to the highest. We then find the most repeated term in the entire data set.
Let us understand this with an example:
Sample dataset: 4, 6, 7, 7, 8, 9, 10, 9, 7, 9, 7
Mode = 7 (since it is the most repeated value).
Attention: In our sample data set, it is evident that the number 7 appears the most, and hence we choose 7 as the Mode of the dataset.
The Median is the value in the exact midpoint of a dataset. To obtain the Median, we arrange the values in the ascending order, that is, from the lowest to the highest. We then locate the value in the centre of the set.
Let us understand this with examples:
When N is odd.
Odd Data Set – 2, 3, 5, 8, 10, 12, 14 Median (N = 7) = [(5 +1)/2]th term = 8/2 term = 4th term Therefore, the Median is 8 (since it lies it the exact middle of the dataset)
When N is even
Even Data Set – 2, 3, 5, 8, 10, 12, 14, 16 Median (N = 8) = [N/2th + (N/2 + 1)th]/2 = [ 8/2th + (8/2 +1)th]/2 = (4th + 5th )/2 = (8 + 10)/2 =18/2 =9 Therefore, the Median is 9.
3.Measure of Dispersion or Variability (Range, Interquartile Range)
The measure of Dispersion is mainly used to describe the spread of data. We use Range, Standard Deviation, and Variance to explain the measure of Dispersion.
Range regulates how far away the values are. To obtain the Range, we begin by subtracting the lowest value in a data set from the highest value.
in the data set (4,6,7,8,8,9,10), 4 is the smallest value while 10 is the highest value. Therefore, we get the Range by subtracting 4 from 10, and that equals 6.
The Interquartile Range demonstrates the center 50% of values when sorted in ascending order, that is, from the lowest to the highest. To obtain the Interquartile Range (IQR), we obtain the Mean of the lower and upper half of the dataset. The values are the quartile 1(Q1) and quartile 3 (Q3). Interquartile Range = Q3 and Q1.
The table below shows us a clear view of the Interquartile Range:
Let us understand with the help of an example.
Variance and Standard Deviation
Variance mirrors the dataset amount of Dispersion. The Variance is always greater than the Mean when the Dispersion of the data is of a greater extent. We can obtain Variance by simply squaring the Standard Deviation.
The Standard Deviation is the Mean of Variability, displaying how far are the values in the sequence from the Mean.
Let us follow the following steps to find the Standard Deviation:
- Highlight the values and their averages.
- Place the Deviation by subtracting the average from each value.
- Square each Deviation.
- Sum up all the squared Deviations.
- Divide the totals of the squared Deviations by N-1
- Calculate the square root of the outcome.
Let us understand this with an example:
|Raw Data||Deviation from Mean||Deviation Squared|
|M= 7.3||Total= 0.9||Square total= 23.83|
When we divide the total of squared Deviations by 6 (N-1): 23.83/6, we obtain 3.971, and the square root of the outcome is 1.992. Through the results, we have noted that every value differs from the Mean by 1.992 average points.
We can calculate the Modality of the distribution by calculating its total number of Peaks. Several distributions have only one Peak, but we will likely come across distributions with two or more Peaks.
The three types of Modality are:
- Unimodal: A Unimodal distribution refers to a distribution with only one Peak. This means that there is one recurrently happening value, clustered at the top.
- Bimodal: A Bimodal distribution has two Peaks, hence two recurrently happening values.
- Multimodal: A Multimodal distribution has two or multiple Peaks, hence multiple recurrently happening values.
Skewness is the calculation of how a distribution is symmetrical. It demonstrates the extent to which a distribution contrasts from the normal distribution, either to the left or right. The value of skewness of a distribution can be positive, negative, or zero. A skewness of zero implies that the Mean equals the Median.
In the picture below, we can see a better demonstration of the types of Skewness:
To identify the positive Skew, we notice that most of the data is heaped up to the left. A negative skew on the other side has most of its data heaped up to the right. We need to note that positive Skews are very popular compared to negative Skews. The Skew () function allows us to calculate the Skewness of a distribution.
Kurtosis estimates the degree to which our dataset is heavy-tailed or light-tailed, contrasted with the normal distribution. Datasets that contain high Kurtosis have high tails and many outliers, while datasets containing low Kurtosis have light tails and less outliers. Histogram and Probability are the operative ways to show the Skewness and Kurtosis of datasets.
Fisher’s measurement of Kurtosis arithmetically and efficiently calculates the Kurtosis of a distribution.
Kurtosis has three main types:
- Mesokurtic: This is a normal distribution having zero Kurtosis.
- Platykurtic: Platykurtic is a type of distribution that contains negative Kurtosis and thin ends contrasted with the normal distribution.
- Leptokurtic: Leptokurtic is a type of distribution containing a kurtosis value of more than three and fat ends giving the distribution much greater value, and less Standard Deviation.
This article has given us a comprehensive introduction to the various terms used in descriptive statistics. We have focused on the areas of normal distribution and their advantages. The commonly used measures of descriptive statistics were also explained with suitable examples. After understanding the descriptive statistic in-depth, we know how the data gets analyzed. We should keep in mind that descriptive statistics does not allow any conclusions to be made on data analysis, rather, it is a measure that describes the data.