omarxadel21@gmail.com

Review of the required Statistics for Data Science - Part 1 - 26/11/2023

A quick review of core concepts: mean, variance, covariance, correlation coefficient, standard deviation, and z-score.

This article is a quick review of the essential statistics for working with data.

Table of Contents
  1. Mean
  2. Variance
  3. Covariance
  4. Correlation Coefficient
  5. Standard Deviation
  6. Z-Score

Mean

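The mean is the average of a data set: the sum of all values divided by the number of values. It is written $E(x)$ in this section and $\bar{x}$ in the formulas that follow.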

$$E(x) = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$E(x)$ = arithmetic mean

$n$ = total number of samples

$x_i$ = dataset value
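
As a minimal sketch of the formula above, the mean can be computed in plain Python. The function name `mean` and the sample list `data` are made up for illustration; they are not part of the original article.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    n = len(values)
    if n == 0:
        raise ValueError("mean requires at least one value")
    return sum(values) / n


# Hypothetical sample data, chosen only to keep the arithmetic easy to follow.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
print(mean(data))  # 5.0
```

The standard library offers the same calculation as `statistics.mean`, which is usually the better choice outside of a teaching example.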


Variance

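Variance is a measure of how far data points spread from the mean, computed from the squared deviations. The formula below divides by $n-1$ rather than $n$ (Bessel's correction), which makes it the sample variance.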

$$var(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

$var(x)$ = variance

$n$ = total number of samples

$x_i$ = dataset value

$\bar{x}$ = arithmetic mean
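
The variance formula above can be sketched directly in Python. This assumes the sample form with the $n-1$ divisor; `sample_variance` and `data` are illustrative names, not taken from the article.

```python
def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance requires at least two values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


# Hypothetical sample data: the mean is 5, the squared deviations sum to 24.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
print(sample_variance(data))  # 4.8
```

`statistics.variance` in the standard library uses the same $n-1$ divisor, while `statistics.pvariance` divides by $n$.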


Covariance

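Covariance measures the direction of the linear relationship between two variables. It is positive when the variables tend to increase together and negative when one tends to decrease as the other increases. Its magnitude depends on the units of both variables, so it is hard to interpret on its own.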

$$cov(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

$cov(x, y)$ = covariance

$n$ = total number of samples

$x_i, y_i$ = dataset values

$\bar{x}, \bar{y}$ = arithmetic means of $x$ and $y$
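
A minimal Python sketch of the covariance formula above, again assuming the $n-1$ divisor. The names `sample_covariance`, `xs`, and `ys` and the paired values are invented for illustration.

```python
def sample_covariance(xs, ys):
    """Sample covariance of paired values, using the n - 1 divisor."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("covariance requires at least two paired values")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical paired data in which y tends to rise with x.
xs = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
ys = [1.0, 3.0, 2.0, 5.0, 6.0, 7.0]
print(sample_covariance(xs, ys))  # 5.0
```

Negating one of the variables flips the sign of the result, which is the "direction" the covariance reports.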


Correlation Coefficient

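A correlation coefficient is a numerical measure of the statistical relationship between two variables. The formula below is the Pearson correlation coefficient: the covariance divided by the product of the two standard deviations. The result is unitless and always lies between -1 and 1.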

$$corr(x, y) = \frac{cov(x, y)}{\sqrt{var(x)} \, \sqrt{var(y)}}$$

$corr(x, y)$ = correlation coefficient

$cov(x, y)$ = covariance

$var(x)$ = variance of $x$

$var(y)$ = variance of $y$
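
The correlation formula above can be sketched in Python by combining the variance and covariance definitions from the earlier sections. All function names and the sample data here are assumptions made for illustration.

```python
import math


def sample_variance(values):
    """Sample variance with the n - 1 divisor."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_covariance(xs, ys):
    """Sample covariance of paired values with the n - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    denominator = math.sqrt(sample_variance(xs)) * math.sqrt(sample_variance(ys))
    if denominator == 0:
        raise ValueError("correlation is undefined when either variable is constant")
    return sample_covariance(xs, ys) / denominator


# Hypothetical paired data: covariance 5.0, variances 4.8 and 5.6.
xs = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
ys = [1.0, 3.0, 2.0, 5.0, 6.0, 7.0]
print(round(correlation(xs, ys), 3))  # 0.964
```

Because the same divisor appears in the numerator and the denominator, using $n$ instead of $n-1$ throughout would give the same correlation.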


Example Problem


Standard Deviation

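The standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. It is the square root of the variance, so it is expressed in the same units as the data.

$$\sigma(x) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

$\sigma(x)$ = standard deviation

$n$ = total number of samples

$x_i$ = dataset value

$\bar{x}$ = arithmetic mean

This form divides by $n$ and gives the population standard deviation. Taking the square root of the sample variance from the Variance section, which divides by $n-1$, gives the sample standard deviation instead. The two differ noticeably only when $n$ is small.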
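
The formula in this section divides by $n$, whereas the variance formula earlier in the article divides by $n-1$. The Python sketch below computes both versions side by side so the difference is visible. The function names and the sample data are illustrative assumptions.

```python
import math


def population_std(values):
    """Standard deviation with divisor n, as in the formula in this section."""
    n = len(values)
    if n == 0:
        raise ValueError("standard deviation requires at least one value")
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / n)


def sample_std(values):
    """Square root of the sample variance, which uses divisor n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample standard deviation requires at least two values")
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


# Hypothetical sample data: the mean is 5, the squared deviations sum to 24.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
print(population_std(data))        # 2.0
print(round(sample_std(data), 3))  # 2.191
```

The standard library exposes the two variants as `statistics.pstdev` (divisor $n$) and `statistics.stdev` (divisor $n-1$).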


Z-Score

A z-score, or standard score, puts values on a common scale by dividing each value's deviation from the mean by the standard deviation of the data set. The result says how many standard deviations a value lies above or below the mean.

$$z(x_i) = \frac{x_i - \bar{x}}{\sigma(x)}$$

$z(x_i)$ = z-score for sample $i$

$\sigma(x)$ = standard deviation

$x_i$ = dataset value

$\bar{x}$ = arithmetic mean
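
As a rough sketch, z-scores for a whole data set can be computed in Python as follows. It assumes the standard deviation with divisor $n$ from the previous section; the name `z_scores` and the sample data are made up for illustration.

```python
import math


def z_scores(values):
    """Return (x_i - mean) / sigma for every value, with sigma using divisor n."""
    n = len(values)
    if n == 0:
        raise ValueError("z-scores require at least one value")
    x_bar = sum(values) / n
    sigma = math.sqrt(sum((x - x_bar) ** 2 for x in values) / n)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - x_bar) / sigma for x in values]


# Hypothetical sample data with mean 5 and standard deviation 2.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
print(z_scores(data))  # [-1.5, -0.5, -0.5, 0.0, 1.0, 1.5]
```

The standardized values have mean 0, and under this choice of divisor their standard deviation is 1.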