Python statistics
Python's built-in statistics module provides functions for calculating mathematical statistics of numeric data.
Prerequisite¶
- A basic understanding of statistics is assumed.
- https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/
mean()¶
- Arithmetic mean ("average") of data.
from statistics import mean
print(mean([1, 2, 3, 4, 4]))
# output: 2.8
print(mean([-1.0, 2.5, 3.25, 5.75]))
# output: 2.625
fmean()¶
- Convert data to floats and compute the arithmetic mean.
- It runs faster than mean() and always returns a float.
- If the input dataset is empty, it raises a StatisticsError.
from statistics import fmean
print(fmean([3.5, 4.0, 5.25]))
# output: 4.25
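As a small illustration of the points above, the following sketch compares the return types of mean() and fmean() and checks that empty input raises StatisticsError. The variable names are illustrative only.

```python
from statistics import StatisticsError, fmean, mean

values = [1, 2, 3]

exact = mean(values)   # mean() tries to keep the type of the input data
fast = fmean(values)   # fmean() converts everything to float first

# An empty dataset has no mean, so fmean() raises StatisticsError.
try:
    fmean([])
    empty_raised = False
except StatisticsError:
    empty_raised = True

print(type(exact).__name__, type(fast).__name__, empty_raised)
```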
geometric_mean()¶
- Convert data to floats and compute the geometric mean.
- Raises a StatisticsError if the input dataset is empty, if it contains a zero, or if it contains a negative value.
from statistics import geometric_mean
print(geometric_mean([54, 24, 36]))
# output: 36.000000000000014
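The geometric mean is the n-th root of the product of n values, which can also be computed as the exponential of the mean of the logarithms. The sketch below does that by hand for comparison and checks the error for a negative value; it is an illustration under that definition, not necessarily how the module computes it internally.

```python
from math import exp, log
from statistics import StatisticsError, geometric_mean

data = [54, 24, 36]

# exp(mean(log(v))) is equivalent to the n-th root of the product.
manual = exp(sum(log(v) for v in data) / len(data))
library = geometric_mean(data)
print(round(manual, 6), round(library, 6))

# Negative values are rejected.
try:
    geometric_mean([4, -9])
    negative_raised = False
except StatisticsError:
    negative_raised = True
print(negative_raised)
```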
harmonic_mean()¶
- Return the harmonic mean of data.
- It is suited to averaging ratios or rates, such as speeds.
from statistics import harmonic_mean
print(harmonic_mean([40, 60]))
# output: 48.0
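To see why the harmonic mean is the right average for rates, consider a trip with two legs of equal distance driven at 40 and 60 km/h. The sketch below (the 120 km leg length is an arbitrary assumption) computes the overall average speed directly and compares it with the reciprocal-based formula.

```python
from statistics import harmonic_mean

speeds = [40, 60]   # km/h on two legs of equal distance
leg = 120           # km per leg (arbitrary, chosen for illustration)

# Average speed = total distance / total time.
total_time = sum(leg / s for s in speeds)
trip_average = len(speeds) * leg / total_time

# Harmonic mean = n / sum of reciprocals.
manual = len(speeds) / sum(1 / s for s in speeds)

print(trip_average, manual, harmonic_mean(speeds))
```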
median()¶
- Return the median (middle value) of numeric data. When the number of data points is even, the median is the mean of the two middle values.
from statistics import median
print(median([1, 3, 5]))
# output: 3
print(median([1, 3, 5, 7]))
# output: 4.0
median_low()¶
- Return the low median of numeric data. When the number of data points is even, the smaller of the two middle values is returned.
from statistics import median_low
print(median_low([1, 3, 5]))
# output: 3
print(median_low([1, 3, 5, 7]))
# output: 3
median_high()¶
- Return the high median of data. When the number of data points is even, the larger of the two middle values is returned.
from statistics import median_high
print(median_high([1, 3, 5]))
# output: 3
print(median_high([1, 3, 5, 7]))
# output: 5
median_grouped()¶
- Return the 50th percentile (median) of grouped continuous data.
from statistics import median_grouped
print(median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]))
# output: 3.7
print(median_grouped([52, 52, 53, 54]))
# output: 52.5
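The results above can be reproduced with the usual grouped-median interpolation formula, L + interval * (n/2 - cf) / f, where each value is treated as the midpoint of a class of width interval. The helper grouped_median_sketch below is a hypothetical name used only for illustration and assumes the default interval of 1.

```python
from statistics import median_grouped

def grouped_median_sketch(data, interval=1):
    """Interpolate the median inside the class containing the middle value."""
    values = sorted(data)
    n = len(values)
    x = values[n // 2]                     # midpoint of the median class
    lower = x - interval / 2               # lower boundary L of that class
    cf = sum(1 for v in values if v < x)   # number of values below the class
    f = values.count(x)                    # number of values in the class
    return lower + interval * (n / 2 - cf) / f

data = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
print(grouped_median_sketch(data), median_grouped(data))
```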
mode()¶
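- Return the most common data point from discrete or nominal data. In Python 3.8 and later, if several values are tied for most common, the first one encountered in the data is returned.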
from statistics import mode
print(mode([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]))
# output: 4
print(mode([52, 52, 53, 54]))
# output: 52
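multimode()¶
- Return a list of the most frequently occurring values, in the order they were first encountered in the data.
from statistics import multimode
print(multimode("aabbbbbbbbcc"))
# output: ['b']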
print(multimode('aabbbbccddddeeffffgg'))
# output: ['b', 'd', 'f']
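One way to think about multimode() is as a frequency count followed by a filter on the highest count. The function multimode_sketch below is a hypothetical stand-in written for illustration, not the module's own implementation.

```python
from collections import Counter
from statistics import multimode

def multimode_sketch(data):
    """Return every value that shares the highest frequency."""
    counts = Counter(data)        # preserves first-seen order of the values
    if not counts:
        return []                 # no data, no modes
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

letters = "aabbbbccddddeeffffgg"
print(multimode_sketch(letters), multimode(letters))
```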
quantiles()¶
- Divide data into n continuous intervals with equal probability. Returns a list of n - 1 cut points separating the intervals.
from statistics import quantiles
data = [
105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
103, 107, 101, 81, 109, 104]
print([round(q, 1) for q in quantiles(data, n=10)])
# output: [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
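A common special case is n=4, which yields the three quartile cut points. The sketch below uses a small made-up dataset and also shows the method="inclusive" option, which treats the data as a whole population rather than a sample; in both cases the middle cut point should coincide with the median.

```python
from statistics import median, quantiles

scores = [1, 2, 3, 4, 5]   # made-up data for illustration

quartiles = quantiles(scores, n=4)                       # default method="exclusive"
inclusive = quantiles(scores, n=4, method="inclusive")   # data treated as a population

# n=4 gives n - 1 = 3 cut points; the second one is the median.
print(quartiles, inclusive, median(scores))
```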
pstdev()¶
- Return the square root of the population variance.
from statistics import pstdev
print(pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]))
# output: 0.986893273527251
pvariance()¶
- Return the population variance of data.
from statistics import pvariance
data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
print(pvariance(data))
# output: 1.25
stdev()¶
- Return the square root of the sample variance.
from statistics import stdev
print(stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]))
# output: 1.0810874155219827
variance()¶
- Return the sample variance of data.
from statistics import variance
data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
print(variance(data))
# output: 1.3720238095238095
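The population and sample versions differ only in the divisor: the population variance divides the sum of squared deviations by n, while the sample variance divides by n - 1. The following sketch works this out by hand on a made-up dataset and compares it with the library functions.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data for illustration
n = len(data)
center = mean(data)

# Sum of squared deviations from the mean.
squared_deviations = sum((v - center) ** 2 for v in data)

manual_pvariance = squared_deviations / n        # population: divide by n
manual_variance = squared_deviations / (n - 1)   # sample: divide by n - 1

print(manual_pvariance, pvariance(data), pstdev(data))
print(manual_variance, variance(data), stdev(data))
```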
covariance()¶
- Return the sample covariance of two inputs x and y.
- Covariance is a measure of the joint variability of two inputs.
from statistics import covariance
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(covariance(x, y))
# output: 0.75
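The result above can be checked by hand with the sample covariance formula: the sum of the products of paired deviations from the means, divided by n - 1. The function covariance_sketch is a hypothetical name used only for this illustration.

```python
from statistics import mean

def covariance_sketch(x, y):
    """Sample covariance: sum((xi - mean_x) * (yi - mean_y)) / (n - 1)."""
    mean_x, mean_y = mean(x), mean(y)
    products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return products / (len(x) - 1)

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(covariance_sketch(x, y))
```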
correlation()¶
- Return Pearson's correlation coefficient for two inputs.
- Pearson's correlation coefficient r takes values between -1 and +1.
- It measures the strength and direction of the linear relationship: +1 means a very strong positive linear relationship, -1 a very strong negative linear relationship, and 0 no linear relationship.
from statistics import correlation
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
print(correlation(x, x))
# output: 1.0
print(correlation(x, y))
# output: -1.0
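Pearson's r can be viewed as the sample covariance scaled by the two sample standard deviations, which is what keeps it between -1 and +1. The sketch below computes it by hand for a weakly related pair of inputs; the helper name pearson_sketch is hypothetical, and its result should agree with correlation() on the same data.

```python
from statistics import mean, stdev

def pearson_sketch(x, y):
    """r = sample covariance / (stdev(x) * stdev(y))."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]   # only loosely follows x
r = pearson_sketch(x, y)
print(round(r, 4))
```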
linear_regression()¶
- Return the slope and intercept of a simple linear regression, estimated using ordinary least squares. Simple linear regression describes the relationship between an independent variable x and a dependent variable y in terms of the linear function:
y = slope * x + intercept + noise
- where slope and intercept are the regression parameters that are estimated, and noise represents the variability of the data that was not explained by the linear regression (it is equal to the difference between predicted and actual values of the dependent variable).
from statistics import NormalDist, linear_regression
x = [1, 2, 3, 4, 5]
noise = NormalDist().samples(5, seed=42)
y = [3 * x[i] + 2 + noise[i] for i in range(5)]
print(linear_regression(x, y))
# output: LinearRegression(slope=3.0907891417020465, intercept=1.756849704861633)