Python statistics

Python has a built-in module, statistics, that you can use to calculate mathematical statistics of numeric data.

mean()

  • Arithmetic mean ("average") of data.
from statistics import mean

print(mean([1, 2, 3, 4, 4]))
# output: 2.8
print(mean([-1.0, 2.5, 3.25, 5.75]))
# output: 2.625

fmean()

  • Convert data to floats and compute the arithmetic mean.
  • This runs faster than the mean() function and it always returns a float.
  • If the input dataset is empty, it raises a StatisticsError.
from statistics import fmean

print(fmean([3.5, 4.0, 5.25]))
# output: 4.25

geometric_mean()

  • Convert data to floats and compute the geometric mean.
  • Raises a StatisticsError if the input dataset is empty, if it contains a zero, or if it contains a negative value.
from statistics import geometric_mean

print(geometric_mean([54, 24, 36]))
# output: 36.000000000000014
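
As a sanity check, the geometric mean of n values equals the n-th root of their product; the two computations below (using the same values as above) agree up to floating-point error:

```python
from statistics import geometric_mean

data = [54, 24, 36]
# geometric_mean() and the direct n-th-root formula give the same result.
print(geometric_mean(data))
print((54 * 24 * 36) ** (1 / 3))
```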

harmonic_mean()

  • Return the harmonic mean of data.
  • It can be used for averaging ratios or rates.
from statistics import harmonic_mean

print(harmonic_mean([40, 60]))
# output: 48.0
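
The example above is the classic rate-averaging case: if the same distance is covered once at 40 km/h and once at 60 km/h, the average speed for the whole trip is the harmonic mean, not the arithmetic mean of 50. A worked check with a hypothetical distance:

```python
from statistics import harmonic_mean

# Same distance covered at 40 km/h one way and 60 km/h the other.
distance = 120                               # km each way (hypothetical)
total_time = distance / 40 + distance / 60   # 3 h + 2 h = 5 h
print(2 * distance / total_time)
# output: 48.0
print(harmonic_mean([40, 60]))
# output: 48.0
```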

median()

  • Return the median (middle value) of numeric data.
from statistics import median

print(median([1, 3, 5]))
# output: 3
print(median([1, 3, 5, 7]))
# output: 4.0

median_low()

  • Return the low median of numeric data.
  • When the number of data points is even, the smaller of the two middle values is returned.
from statistics import median_low

print(median_low([1, 3, 5]))
# output: 3
print(median_low([1, 3, 5, 7]))
# output: 3

median_high()

  • Return the high median of data.
  • When the number of data points is even, the larger of the two middle values is returned.
from statistics import median_high

print(median_high([1, 3, 5]))
# output: 3
print(median_high([1, 3, 5, 7]))
# output: 5
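
Comparing the three functions on the same even-length dataset makes the difference clear: median() interpolates between the two middle values, while median_low() and median_high() always return an actual member of the dataset:

```python
from statistics import median, median_high, median_low

data = [1, 3, 5, 7]
# median() interpolates; median_low()/median_high() pick real data points.
print(median(data), median_low(data), median_high(data))
# output: 4.0 3 5
```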

median_grouped()

  • Return the 50th percentile (median) of grouped continuous data.
from statistics import median_grouped

print(median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]))
# output: 3.7
print(median_grouped([52, 52, 53, 54]))
# output: 52.5

mode()

  • Return the most common data point from discrete or nominal data.
from statistics import mode

print(mode([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]))
# output: 4
print(mode([52, 52, 53, 54]))
# output: 52
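
Because mode() accepts nominal data, it also works on non-numeric values such as strings. A small sketch with hypothetical data:

```python
from statistics import mode

# mode() works on nominal (non-numeric) data as well.
colors = ['red', 'blue', 'blue', 'red', 'green', 'red']
print(mode(colors))
# output: red
```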

multimode()

  • Return a list of the most frequently occurring values, in the order they were first encountered in the data.

from statistics import multimode

print(multimode("aabbbbbbbbcc"))
# output: ['b']
print(multimode('aabbbbccddddeeffffgg'))
# output: ['b', 'd', 'f']
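
When several values tie for the highest count, multimode() returns all of them, while mode() (since Python 3.8) returns only the first one encountered. A sketch with hypothetical data:

```python
from statistics import mode, multimode

# 'red' and 'blue' each appear twice.
votes = ['red', 'blue', 'red', 'green', 'blue']
print(multimode(votes))
# output: ['red', 'blue']
print(mode(votes))
# output: red
```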

quantiles()

  • Divide data into n continuous intervals with equal probability.
from statistics import quantiles

data = [
    105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
    100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
    106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
    111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
    103, 107, 101, 81, 109, 104]

print([round(q, 1) for q in quantiles(data, n=10)])
# output: [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
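
With the default n=4, quantiles() returns the three quartile cut points. A sketch with hypothetical exam scores (using the default 'exclusive' interpolation method):

```python
from statistics import quantiles

scores = [62, 65, 71, 75, 78, 81, 84, 88, 91, 95]
# Default n=4 yields the three quartile cut points Q1, Q2, Q3.
print(quantiles(scores))
# output: [69.5, 79.5, 88.75]
```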

pstdev()

  • Return the square root of the population variance.
from statistics import pstdev

print(pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]))
# output: 0.986893273527251

pvariance()

  • Return the population variance of data.
from statistics import pvariance

data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
print(pvariance(data))
# output: 1.25

stdev()

  • Return the square root of the sample variance.
from statistics import stdev

print(stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]))
# output: 1.0810874155219827
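
The two standard deviations are the square roots of the corresponding variances: the sample versions (stdev, variance) divide by n - 1, while the population versions (pstdev, pvariance) divide by n. A quick check on the dataset used above:

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance, stdev, variance

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]
# Each standard deviation is the square root of its variance.
print(isclose(stdev(data), sqrt(variance(data))))
# output: True
print(isclose(pstdev(data), sqrt(pvariance(data))))
# output: True
```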

variance()

  • Return the sample variance of data.
from statistics import variance

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
print(variance(data))
# output: 1.3720238095238095

covariance()

  • Return the sample covariance of two inputs x and y.
  • Covariance is a measure of the joint variability of two inputs.
from statistics import covariance

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(covariance(x, y))
# output: 0.75

correlation()

  • Return the Pearson's correlation coefficient for two inputs.
  • Pearson's correlation coefficient r takes values between -1 and +1.
  • It measures the strength and direction of a linear relationship: +1 indicates a very strong positive linear relationship, -1 a very strong negative linear relationship, and 0 no linear relationship.
from statistics import correlation

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
print(correlation(x, x))
# output: 1.0
print(correlation(x, y))
# output: -1.0

linear_regression()

  • Return the slope and intercept of simple linear regression parameters estimated using ordinary least squares. Simple linear regression describes the relationship between an independent variable x and a dependent variable y in terms of this linear function:

y = slope * x + intercept + noise

  • where slope and intercept are the regression parameters that are estimated, and noise represents the variability of the data that was not explained by the linear regression (it is equal to the difference between predicted and actual values of the dependent variable).
from statistics import NormalDist, linear_regression

x = [1, 2, 3, 4, 5]
noise = NormalDist().samples(5, seed=42)
y = [3 * x[i] + 2 + noise[i] for i in range(5)]
print(linear_regression(x, y))
# output: LinearRegression(slope=3.0907891417020465, intercept=1.756849704861633)
