DEV Community

Shubham Singh


Understanding Data For Data Analytics, Data Science, and Machine Learning – Part-2

Things to know beforehand

  • What is Variability?

    Variability describes how spread out the data is.


[1] Central Tendency

[2] Median

When your data is heavily influenced by outliers, the median is a good choice because it is not affected by them.
To calculate the median, sort your data (ascending or descending does not matter) and then find the middle point.

The middle point depends on whether n, the number of values, is even or odd.

[a] when n is even

When n is even, there are 2 middle positions, and the median is the average of the values at those two positions:

$$ First~point = \frac{n}{2}, \quad Second~point = \frac{n}{2}+1 $$

[b] when n is odd

For odd n, there is a single middle position, and the median is the value at that position:

$$ Mid~point = \frac{n+1}{2} $$

In R, both cases are handled by the same function:

median(data)
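
To make the even and odd cases concrete, here is a minimal sketch that finds the median by position and compares it with R's built-in `median()`. The helper `manual_median` and the two sample vectors are made up for illustration; the sketch assumes the middle position for odd n is (n + 1) / 2.

```r
# Hypothetical helper: finds the median by position in the sorted data.
manual_median <- function(x) {
  x <- sort(x)          # the data must be sorted first
  n <- length(x)
  if (n %% 2 == 0) {
    # even n: average the values at positions n/2 and n/2 + 1
    (x[n / 2] + x[n / 2 + 1]) / 2
  } else {
    # odd n: take the single value at position (n + 1) / 2
    x[(n + 1) / 2]
  }
}

odd_data  <- c(7, 1, 5, 3, 100)      # n = 5, with one outlier (100)
even_data <- c(7, 1, 5, 3, 100, 9)   # n = 6

manual_median(odd_data)    # 5
manual_median(even_data)   # 6
mean(odd_data)             # 23.2, pulled upward by the outlier
```

The outlier drags the mean far away from most of the values, while the median stays in the middle of them.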

[3] Mode

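The mode is the value that occurs most often in the data.
Calculating the mode by hand means counting the occurrences of each value in the data.

R doesn't have a built-in function for the statistical mode (base R's mode() reports an object's storage mode instead), so we can define our own:

# Named get_mode so that it does not mask base R's mode()
get_mode <- function(v) {
   uniqv <- unique(v)
   # Count each unique value; if there is a tie, the first value seen wins
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
get_mode(data)  # data is expected to be a vector here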

[2] Measure of Spread

Understanding the spread of your data is essential to understanding the data itself. Two data sets can have the same mean but very different spreads, and a large spread can lead to low-quality estimates.
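
As a quick illustration, the two made-up vectors below share the same mean but differ a lot in spread. The values are invented for this sketch and do not come from any real data set.

```r
# Two hypothetical samples with the same mean but different spread
a <- c(48, 49, 50, 51, 52)
b <- c(10, 30, 50, 70, 90)

mean(a)   # 50
mean(b)   # 50

# The standard deviation tells them apart
sd(a)
sd(b)
```

Looking at the mean alone, `a` and `b` would appear identical; a measure of spread is what separates them.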

[1] Range

It is one of the simplest measures of variability. To calculate the range:

$$ Range = max~value - min~value $$

diff(range(data))
# or
print(max(data) - min(data))

[2] Interquartile Range (IQR) and Box-and-Whisker Plot

Dividing your sorted data into 4 equal parts generates the quartiles; each part contains 25% of the data, i.e.,
the 1st quartile is the value below which 25% of the data falls (25th percentile); the 2nd quartile is the value below which 50% falls (50th percentile, the median); the 3rd quartile is the value below which 75% falls (75th percentile); the 4th quartile covers 100% of the data (100th percentile, the maximum).

$$ IQR = 75th~percentile - 25th~percentile $$
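
A minimal sketch of the same calculation in R, using the built-in `iris` data set. It assumes R's default quantile method (type 7); other methods can give slightly different quartiles on small samples.

```r
x <- iris$Sepal.Length

# 25th, 50th and 75th percentiles (R's default quantile type 7)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# IQR by hand: 75th percentile minus 25th percentile
iqr_manual <- unname(q["75%"] - q["25%"])

# IQR with the built-in function
iqr_builtin <- IQR(x)

all.equal(iqr_manual, iqr_builtin)
```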

A box-and-whisker plot is very useful for the five-point summary and for understanding spread and outliers.

library(ggplot2)
data <- iris

# One vertical box per species: species on the x axis, sepal length on the y axis
ggplot(data, mapping = aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

The five-point summary in a box plot consists of the minimum value, the first quartile, the median, the third quartile, and the maximum value.

Each of these can be located in the plot:

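  1. Minimum value : bottom end of the vertical line (when outliers are drawn as separate points, the line stops at the smallest non-outlier value).
  2. First Quartile : bottom edge of the box.
  3. Median : the bold horizontal line inside the box.
  4. Third Quartile : top edge of the box.
  5. Maximum value : top end of the vertical line (or the largest non-outlier value when outliers are drawn separately).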
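
The same five numbers can also be computed directly. The sketch below uses base R's `fivenum()`, which returns Tukey's five-number summary; its "hinges" are a close variant of the first and third quartiles, so they can differ slightly from `quantile()` output. The row labels are added here only for readability.

```r
# Sepal length split by species, as in the box plot
by_species <- split(iris$Sepal.Length, iris$Species)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
five <- sapply(by_species, fivenum)
rownames(five) <- c("minimum", "first_quartile", "median",
                    "third_quartile", "maximum")
five
```

Note that `fivenum()` reports the true minimum and maximum, including any point that the plot draws separately as an outlier.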

And if you are wondering what that point outside the box in virginica is, it is an outlier.

To identify outliers mathematically, you need a threshold: any point that lies more than the threshold below the first quartile or above the third quartile is considered an outlier.

$$ outliers~threshold = 1.5 \times IQR $$
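
Here is a rough sketch of that rule applied to the virginica sepal lengths. The variable names are made up for illustration, and the quartiles come from `quantile()` with R's default method, which may differ slightly from the hinges a box plot uses.

```r
virginica <- iris$Sepal.Length[iris$Species == "virginica"]

q1 <- quantile(virginica, 0.25, names = FALSE)
q3 <- quantile(virginica, 0.75, names = FALSE)

# The threshold is 1.5 times the interquartile range
threshold <- 1.5 * (q3 - q1)

# A point is an outlier if it lies beyond either limit
lower_limit <- q1 - threshold
upper_limit <- q3 + threshold

outliers <- virginica[virginica < lower_limit | virginica > upper_limit]
outliers   # 4.9
```

The single value returned is the point that sits below the virginica box in the plot.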

[3] Variance

Variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value.

$$ S^2 = \sigma^2 = \frac{\Sigma{}(x_{i}-\overline{x})^2}{n-1} $$

Why is it squared? The deviations x_i - x̄ always sum to zero, because the positive and negative deviations cancel each other out, so each deviation is squared to make it non-negative before summing.

var(data)
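
To see the formula at work, here is a small sketch that computes the variance step by step and compares it with `var()`. The vector `x` is a made-up example.

```r
x <- c(2, 4, 4, 5, 10)   # hypothetical sample, mean is 5

# Deviations from the mean cancel out
deviations <- x - mean(x)
sum(deviations)          # 0

# Squaring removes the sign, then divide by n - 1
manual_var <- sum(deviations^2) / (length(x) - 1)
manual_var               # 9

all.equal(manual_var, var(x))
```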

[4] Standard Deviation

The standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
It is closely related to the variance; the difference is that the SD is expressed in the same unit as the data, while the variance is expressed in that unit squared.

$$ SD = \sigma = \sqrt{S^2} $$

sd(data)
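
The point about units can be checked with a short sketch. The heights below are invented for illustration: converting them from centimetres to metres scales the SD by 1/100, but scales the variance by 1/10000.

```r
heights_cm <- c(150, 160, 170, 180, 190)   # hypothetical heights in cm
heights_m  <- heights_cm / 100             # the same heights in metres

# SD is the square root of the variance
all.equal(sd(heights_cm), sqrt(var(heights_cm)))

# SD follows the unit of the data
sd(heights_cm)    # in cm
sd(heights_m)     # in m, i.e. sd(heights_cm) / 100

# Variance follows the squared unit
var(heights_cm)   # in cm^2
var(heights_m)    # in m^2, i.e. var(heights_cm) / 10000
```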

Normal distributions with standard deviations of 5 and 10.

For Part-3 go here
