Python 3 statistics Module: A Guide to Basic Statistical Calculations - 代码聚汇网

聂世歆

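1. Getting Started with the Python 3 statistics Module

Statistical computation is a basic part of data analysis, and the statistics module in the Python standard library gives developers a convenient set of functions for it. The module was introduced in Python 3.4 and covers common statistical operations on numeric data without requiring a third-party library.

Real projects often call for quick statistics on a dataset: the average of sales figures, the median user age, or the spread of product ratings. The statistics module is designed for exactly these basic needs and provides functions for the mean, median, variance, standard deviation, and other common statistics.

Compared with third-party libraries such as numpy and pandas, the statistics module has these advantages:

- No installation required; it ships with the Python standard library
- A simple, intuitive interface with a low learning curve
- Adequate speed for small datasets
- Support for several numeric types (int, float, Decimal, Fraction)

Tip: For large datasets or complex statistical analysis, use a specialized library such as numpy or pandas. The statistics module is better suited to quick calculations on small datasets.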

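2. Core Functions in Detail

2.1 Measures of Central Tendency

Measures of central tendency describe where the "center" of a distribution lies and are the most commonly used statistics.

Mean functions:

- mean(): arithmetic mean
- harmonic_mean(): harmonic mean (Python 3.6+)
- geometric_mean(): geometric mean (Python 3.8+)

from statistics import mean

data = [1, 2, 3, 4, 5]
avg = mean(data)  # returns 3 (an int, because all inputs are ints and the result is whole)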
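The other two mean functions suit rates and growth factors. The sketch below is illustrative only; the speed and growth figures are made-up sample values, not data from this article.

```python
from statistics import geometric_mean, harmonic_mean

# Hypothetical sample values chosen for illustration.
speeds = [40, 60]            # km/h over two legs of equal distance
growth = [1.10, 1.20, 0.90]  # yearly growth factors

# Harmonic mean: the appropriate average for rates over equal distances.
avg_speed = harmonic_mean(speeds)

# Geometric mean: the constant factor that yields the same compound growth.
avg_growth = geometric_mean(growth)

print(f"average speed: {avg_speed:.1f} km/h")
print(f"average growth factor: {avg_growth:.4f}")
```

The arithmetic mean of the two speeds would be 50 km/h, which overstates the true average because more time is spent on the slower leg.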

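Median functions:

- median(): median (middle value); averages the two middle values when the count is even
- median_low(): low median
- median_high(): high median
- median_grouped(): median of grouped (binned) data

from statistics import median

data = [1, 3, 5, 7]
med = median(data)  # returns 4.0, the average of 3 and 5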
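When the result must be an actual member of the dataset, the low and high medians are the usual choice. A minimal sketch using the same four values:

```python
from statistics import median, median_high, median_low

data = [1, 3, 5, 7]  # even number of values, so there are two middle values

mid = median(data)        # 4.0, the average of 3 and 5
low = median_low(data)    # 3, the smaller middle value
high = median_high(data)  # 5, the larger middle value

print(mid, low, high)
```

With an odd number of values all three functions return the same middle element.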

Mode functions:

- mode(): the mode (most frequent value); from Python 3.8 on, ties resolve to the first mode encountered
- multimode(): list of all modes (Python 3.8+)

from statistics import mode

data = [1, 2, 2, 3, 3, 3]
m = mode(data)  # returns 3
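Both mode functions also accept non-numeric data. The sketch below, which assumes Python 3.8 or later and uses a made-up list of votes, shows how they differ when two values tie:

```python
from statistics import mode, multimode

votes = ["red", "blue", "blue", "red", "green"]  # hypothetical sample data

first_mode = mode(votes)      # "red": the first of the tied values
all_modes = multimode(votes)  # ["red", "blue"], in order of first appearance

print(first_mode, all_modes)
```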

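2.2 Measures of Dispersion

Measures of dispersion describe how much the data varies.

Variance and standard deviation:

- variance(): sample variance (divides by n - 1)
- stdev(): sample standard deviation
- pvariance(): population variance (divides by n)
- pstdev(): population standard deviation

from statistics import stdev

data = [2, 4, 6, 8, 10]
std_dev = stdev(data)  # returns 3.1622776601683795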
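The sample and population variants differ only in the divisor. A short check by hand, using the same data, makes this concrete:

```python
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 6, 8, 10]
n = len(data)
center = sum(data) / n
sum_sq = sum((x - center) ** 2 for x in data)  # sum of squared deviations: 40.0

sample_var = sum_sq / (n - 1)  # 40 / 4 = 10.0, matches variance(data)
population_var = sum_sq / n    # 40 / 5 = 8.0, matches pvariance(data)

print(f"sample:     variance={variance(data)}, stdev={stdev(data):.4f}")
print(f"population: variance={pvariance(data)}, stdev={pstdev(data):.4f}")
```

Use the sample functions when the data is a subset drawn from a larger population, and the population functions when the data is the entire population.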

Range:

The module has no range function, but one is easy to write:

data = [1, 3, 5, 7]
data_range = max(data) - min(data)  # returns 6

2.3 Other Useful Functions

Covariance and correlation:

- covariance(): sample covariance (Python 3.10+)
- correlation(): Pearson correlation coefficient (Python 3.10+)

from statistics import covariance

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
cov = covariance(x, y)  # returns 5.0
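On Python versions older than 3.10, the same quantities can be computed by hand. The helper name `sample_covariance` below is hypothetical; it is a sketch of the textbook formulas rather than the module's own implementation.

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

def sample_covariance(xs, ys):
    """Sample covariance: sum of co-deviations divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)

cov = sample_covariance(x, y)    # 20 / 4 = 5.0
r = cov / (stdev(x) * stdev(y))  # Pearson correlation coefficient

print(f"covariance={cov}, correlation={r:.3f}")
```

Because y is an exact multiple of x, the correlation comes out at 1 up to floating-point rounding.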

Linear regression:

- linear_regression(): slope and intercept of a simple linear regression (Python 3.10+)

from statistics import linear_regression

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
slope, intercept = linear_regression(x, y)
# slope=2.0, intercept=0.0
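The fitted line follows from the covariance and variance introduced earlier. The sketch below uses a hypothetical helper, `fit_line`, to show the ordinary least-squares formulas and how the result can be used for prediction; it is not the module's internal code.

```python
from statistics import mean, variance

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

def fit_line(xs, ys):
    """Least squares: slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)."""
    mx, my = mean(xs), mean(ys)
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)
    slope = cov_xy / variance(xs)
    intercept = my - slope * mx
    return slope, intercept

slope, intercept = fit_line(x, y)
predicted = slope * 6 + intercept  # estimate y at x = 6

print(f"y = {slope} * x + {intercept}; prediction at x=6: {predicted}")
```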

3. Data Type Support and Performance Considerations

3.1 Supported Data Types

Besides ordinary integers and floats, the statistics module supports exact numeric types:

- Decimal: for financial calculations and other cases that need exact decimal arithmetic
- Fraction: for cases that need exact rational values

from statistics import mean
from decimal import Decimal
from fractions import Fraction

# Decimal example
dec_data = [Decimal("1.5"), Decimal("2.75"), Decimal("3.25")]
dec_mean = mean(dec_data)  # returns Decimal('2.5')

# Fraction example
frac_data = [Fraction(1, 2), Fraction(3, 4), Fraction(5, 8)]
frac_mean = mean(frac_data)  # returns Fraction(5, 8)

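3.2 Performance Recommendations

The statistics module is convenient, but performance needs attention on large datasets:

Data size:
- Well suited to small datasets (as a rough guide, under 10,000 data points)
- For large datasets, prefer numpy or pandas
Type conversion overhead:
- Mixing int and float values produces float results and adds conversion overhead; other mixtures, such as Decimal with float, are not supported and may raise TypeError
- Keep every value in a dataset the same type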
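The effect of mixing types can be checked directly. The sketch below reflects the behavior of current CPython releases; the documentation treats mixed-type input as implementation-dependent, so treat the second result as an observation rather than a guarantee.

```python
from decimal import Decimal
from statistics import mean

# int mixed with float: the result is a float.
mixed = mean([1, 2.5])
print(type(mixed).__name__, mixed)

# Decimal mixed with float: the module cannot pick a common result type.
try:
    mean([Decimal("1.5"), 2.5])
    outcome = "no error"
except TypeError as exc:
    outcome = f"TypeError: {exc}"
print(outcome)
```

Converting everything to one type up front, for example with `Decimal(str(value))`, avoids the problem.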

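Comparison with alternatives (illustrative orders of magnitude only; actual timings depend on hardware, Python version, and data type):

# Small dataset (100 points)
statistics.mean: 0.0001s
numpy.mean: 0.0003s  # pays a fixed per-call overhead

# Large dataset (1,000,000 points)
statistics.mean: 0.15s
numpy.mean: 0.005s  # clear advantage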
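Timings like these are best measured on your own machine. The sketch below uses only the standard library, so numpy is left out; it compares mean() with fmean() (a float-only variant available from Python 3.8) and a plain sum/len. The dataset size and repeat count are arbitrary choices.

```python
import timeit
from statistics import fmean, mean

data = [float(i) for i in range(10_000)]  # arbitrary test data

candidates = {
    "statistics.mean": lambda: mean(data),
    "statistics.fmean": lambda: fmean(data),
    "sum/len": lambda: sum(data) / len(data),
}

repeats = 20
results = {}
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=repeats)
    results[name] = func()  # keep the value to confirm all three agree
    print(f"{name:>16}: {seconds / repeats * 1e3:.3f} ms per call")
```

To include numpy, add an entry that calls `numpy.mean` on an array built once outside the timed function, so array construction is not counted.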

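Memory use:
- Most statistics functions need the whole dataset in memory (median(), for example, sorts a copy of the data)
- For very large datasets, consider computing in chunks or streaming values from a generator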
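Chunked computation can be sketched as follows. The function name `chunked_mean` and the chunk size are hypothetical, and the running total uses ordinary float addition, so the result may differ slightly from statistics.mean(), which works with exact fractions internally.

```python
from itertools import islice

def chunked_mean(values, chunk_size=10_000):
    """Mean of an iterable, reading at most chunk_size items at a time."""
    iterator = iter(values)
    total, count = 0.0, 0
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("chunked_mean requires at least one data point")
    return total / count

# A generator stands in for a data source too large to hold in memory.
readings = (i % 100 for i in range(1_000_000))
result = chunked_mean(readings)
print(result)  # 49.5
```

Only one chunk is held in memory at a time, at the cost of not being able to compute order statistics such as the median this way.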

4. Practical Examples

4.1 Analyzing Student Grades

Suppose we have a set of student grades and want to compute several statistics:

from statistics import mean, median, mode, stdev, variance

grades = [85, 92, 78, 90, 82, 88, 76, 95, 89, 83]

# Central tendency
mean_grade = mean(grades)
median_grade = median(grades)
# Every grade is unique here, so on Python 3.8+ mode() returns the first value (85)
mode_grade = mode(grades)

# Dispersion
grade_range = max(grades) - min(grades)
variance_grade = variance(grades)
stdev_grade = stdev(grades)

print(f"Mean: {mean_grade:.1f}")
print(f"Median: {median_grade}")
print(f"Mode: {mode_grade}")
print(f"Range: {grade_range}")
print(f"Variance: {variance_grade:.2f}")
print(f"Standard deviation: {stdev_grade:.2f}")

4.2 Analyzing Sales Data

Analyze the distribution of monthly sales figures:

from statistics import mean, median, stdev

sales = [12000, 15000, 9000, 18000, 11000, 13000, 16000]

# Basic statistics
avg_sales = mean(sales)
med_sales = median(sales)
sales_std = stdev(sales)

# Outlier detection: flag values more than 2 standard deviations from the mean
lower_bound = avg_sales - 2*sales_std
upper_bound = avg_sales + 2*sales_std
outliers = [s for s in sales if s < lower_bound or s > upper_bound]

print(f"Average sales: ${avg_sales:,.2f}")
print(f"Sales standard deviation: ${sales_std:,.2f}")
print(f"Outliers: {outliers}")

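4.3 Processing Scientific Measurements

Process experimental measurements and compute the uncertainty of the mean:

from statistics import mean, stdev

measurements = [12.5, 12.7, 12.4, 12.6, 12.8, 12.5, 12.9]

avg = mean(measurements)
# Standard error of the mean: sample standard deviation divided by sqrt(n)
std_err = stdev(measurements) / (len(measurements) ** 0.5)

print(f"Measured mean: {avg:.2f} ± {std_err:.2f} (standard error)")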

5. Common Problems and Solutions

5.1 Error Handling and Edge Cases

Empty dataset:

import statistics

try:
    statistics.mean([])
except statistics.StatisticsError as e:
    print(f"Error: {e}")  # prints: Error: mean requires at least one data point

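No unique mode:

Before Python 3.8, mode() raised StatisticsError when several values tied for most common. From 3.8 on it returns the first mode encountered, so use multimode() to detect ties:

from statistics import mode, multimode

data = [1, 2, 3, 4]
print(mode(data))       # prints 1 on Python 3.8+ (first of the tied values)
print(multimode(data))  # prints [1, 2, 3, 4]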

Type mismatch:

from statistics import mean

data = [1, 2, '3', 4]
try:
    mean(data)
except TypeError as e:
    print(f"Error: {e}")  # prints: Error: can't convert type 'str' to numerator/denominator

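5.2 Performance Tips

Preprocessing data:

from statistics import mean

raw_data = [3.5, None, 4.0, None, 5.5]  # placeholder input for illustration

# Builds an intermediate list before calling mean()
cleaned_data = [x for x in raw_data if x is not None]
result = mean(cleaned_data)

# Passes a generator expression directly; on Python 3.11+ this avoids the
# intermediate list (earlier versions copy the iterator into a list internally)
result = mean(x for x in raw_data if x is not None)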

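Using generator expressions:

from statistics import mean
huge_dataset, threshold = range(1_000_000), 500_000  # placeholder data source and cutoff

# For large datasets
data = (x for x in huge_dataset if x > threshold)
avg = mean(data)  # on Python 3.11+ the values are consumed in one pass without building a list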

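Type consistency check:

from decimal import Decimal
from numbers import Real
from statistics import mean

def safe_mean(data):
    data = list(data)  # materialize so an iterator is not consumed by the check
    # Decimal is not registered as numbers.Real, so it is allowed explicitly
    if not all(isinstance(x, (Real, Decimal)) for x in data):
        raise TypeError("all elements must be numeric")
    return mean(data)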

5.3 Special Cases

Handling infinity and NaN:

import math
from statistics import mean

data = [1, 2, float('inf'), float('nan'), 3]
filtered = [x for x in data if math.isfinite(x)]  # drops inf, -inf and NaN
avg = mean(filtered)  # returns 2

Weighted mean:
statistics.mean() takes no weights (fmean() accepts a weights argument from Python 3.11 on), but a weighted mean is easy to write:

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

scores = [80, 90, 70]
weights = [0.2, 0.5, 0.3]
w_mean = weighted_mean(scores, weights)  # returns 82.0 (16 + 45 + 21)

Grouped data:
For data that has already been binned, use median_grouped():

from statistics import median_grouped

# Each value is the midpoint of a class interval: [100, 110), [110, 120), ...
midpoints = [105, 115, 125, 135, 145]
grouped_median = median_grouped(midpoints, interval=10)  # returns 125.0
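The interpolation behind median_grouped() is easier to see when a class holds several values. The sketch below reproduces the standard grouped-median formula, L + interval * (n/2 - cf) / f, by hand on made-up data with a class width of 1; the variable names are chosen for illustration.

```python
from statistics import median_grouped

# Hypothetical binned data with class width 1: a value of 4 stands for [3.5, 4.5).
data = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
interval = 1

n = len(data)
ordered = sorted(data)
x = ordered[n // 2]                    # value of the median class: 4
lower = x - interval / 2               # lower class boundary L: 3.5
cf = sum(1 for v in ordered if v < x)  # values below the median class: 4
f = ordered.count(x)                   # values inside the median class: 5

by_hand = lower + interval * (n / 2 - cf) / f  # 3.5 + 1 * (5 - 4) / 5
from_module = median_grouped(data, interval)

print(f"{by_hand:.2f} {from_module:.2f}")
```

Both values come out at about 3.7, which lies inside the median class rather than on one of the recorded values.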

6. Extending the Module and Advanced Usage

6.1 Custom Statistical Functions

More elaborate statistics can be built on top of the statistics module:

Coefficient of variation (CV):

from statistics import mean, stdev

def coefficient_of_variation(data):
    m = mean(data)
    s = stdev(data)
    return (s / m) * 100 if m != 0 else float('nan')

data = [10, 12, 14, 16, 18]
cv = coefficient_of_variation(data)  # returns about 22.59 (percent)

Skewness and kurtosis:

from statistics import mean, pstdev

# Population (biased) moment estimators: both the central moments and the
# standard deviation divide by n

def skewness(data):
    m = mean(data)
    s = pstdev(data)
    n = len(data)
    return (sum((x - m)**3 for x in data) / n) / s**3

def kurtosis(data):
    m = mean(data)
    s = pstdev(data)
    n = len(data)
    return (sum((x - m)**4 for x in data) / n) / s**4 - 3  # excess kurtosis

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(f"Skewness: {skewness(data):.2f}")  # prints 0.00
print(f"Kurtosis: {kurtosis(data):.2f}")  # prints -1.22

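6.2 Integration with Other Libraries

Combining with the collections module:

from collections import Counter

def weighted_mean_by_frequency(data):
    # statistics.mean() takes no weights, so the weighted sum is computed directly
    # (on Python 3.11+, statistics.fmean(values, weights) is an alternative)
    freq = Counter(data)
    total = sum(value * count for value, count in freq.items())
    return total / sum(freq.values())

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
wm = weighted_mean_by_frequency(data)  # returns 3.0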

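Combining with the random module to generate simulated data:

from statistics import mean, stdev
import random

# Generate normally distributed random data
mu, sigma = 100, 15
data = [random.gauss(mu, sigma) for _ in range(1000)]

# Check the statistics
print(f"Sample mean: {mean(data):.1f}")  # should be close to 100
print(f"Sample standard deviation: {stdev(data):.1f}")  # should be close to 15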

6.3 Implementing Statistical Tests

A simple Z-test:

from statistics import mean, stdev
import math

def z_test(sample, pop_mean, pop_std=None):
    sample_mean = mean(sample)
    sample_size = len(sample)

    if pop_std is None:
        # Fall back to the sample standard deviation; strictly speaking,
        # the statistic is then a t statistic rather than a z statistic
        pop_std = stdev(sample)

    std_error = pop_std / math.sqrt(sample_size)
    z_score = (sample_mean - pop_mean) / std_error
    return z_score

# Example: test whether the sample mean differs significantly from 100
sample = [98, 102, 105, 97, 103, 99, 101, 100, 104, 96]
z = z_test(sample, 100)  # returns about 0.522
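A z-score on its own does not say whether the difference is significant. One way to turn it into a two-sided p-value is the NormalDist class that the statistics module has provided since Python 3.8. The sketch below recomputes the score inline so that it runs on its own; the 0.05 threshold is a conventional choice, not something this article prescribes.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

sample = [98, 102, 105, 97, 103, 99, 101, 100, 104, 96]
pop_mean = 100

# Same statistic as the z-test above, using the sample standard deviation.
z = (mean(sample) - pop_mean) / (stdev(sample) / sqrt(len(sample)))

# Two-sided p-value: probability of a standard normal value at least as extreme as z.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05  # conventional significance level (an assumption of this sketch)
print(f"z = {z:.3f}, p = {p_value:.3f}, significant: {p_value < alpha}")
```

For this sample the p-value is far above 0.05, so the data gives no evidence that the mean differs from 100.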

A simple t-test:

from statistics import mean, stdev
import math

def t_test(sample, pop_mean):
    sample_mean = mean(sample)
    sample_std = stdev(sample)
    n = len(sample)

    t = (sample_mean - pop_mean) / (sample_std / math.sqrt(n))
    return t

# Example: one-sample t-test
sample = [20, 22, 19, 25, 21, 23]
t = t_test(sample, 20)  # returns about 1.890