Random Forest Classification in Practice: From Data Preparation to Model Optimization

埃琳娜莱农

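1. A Practical Guide to Random Forest Classification

As a data science practitioner, I deal with classification problems all the time. On a recent client project I used a random forest to solve a complex multi-class problem and reached 92% accuracy. This post walks through my complete workflow: data preparation, model training, evaluation, and feature analysis.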

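Random forest is one of the best-known ensemble learning algorithms, and its strong out-of-the-box performance and ease of use have made it the "Swiss Army knife" of machine learning. It copes well with high-dimensional features and is fairly robust to outliers, which makes it a good baseline model for checking whether a problem is solvable at all. Missing values come with a caveat: scikit-learn's RandomForestClassifier only gained native NaN support in recent releases (1.4 and later, to my knowledge), so with older versions you need to impute missing values first.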

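2. Environment Setup and Data Loading

2.1 Basic Environment Configuration

Before starting, we need a Python environment with the required libraries. I recommend creating an isolated virtual environment with Anaconda: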

conda create -n rf_classifier python=3.8
conda activate rf_classifier
pip install pandas numpy scikit-learn matplotlib

Tip: pinning the Python and library versions keeps experiments reproducible. I suggest recording the exact versions in a requirements.txt file.
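To make the version-pinning tip concrete, here is a minimal sketch that prints the installed versions in requirements.txt format. The helper name pinned_requirements is my own illustration, and the package list simply mirrors the pip install command above.

```python
from importlib import metadata


def pinned_requirements(packages):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Skip anything that is not installed in this environment
            continue
    return lines


packages = ["pandas", "numpy", "scikit-learn", "matplotlib"]
print("\n".join(pinned_requirements(packages)))
```

Redirecting this output into requirements.txt gives a pinned file for just these packages; pip freeze > requirements.txt is the usual one-liner if you want every installed package recorded.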

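2.2 Loading and Exploring the Data

Suppose we have an e-commerce user behavior dataset and want to predict whether a user will purchase a certain category of product. The data is loaded and inspected as follows:

import pandas as pd

# Load the datasets
train_data = pd.read_csv('user_behavior_train.csv')
test_data = pd.read_csv('user_behavior_test.csv')

# Take a first look at the data
print(f"Training set shape: {train_data.shape}")
print(f"Test set shape: {test_data.shape}")
print("\nFirst 5 rows of the training set:")
print(train_data.head())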

Typical output might look like this:

Training set shape: (8000, 15)
Test set shape: (2000, 14)

First 5 rows of the training set:
   user_id  page_views  cart_adds  ...  avg_session_duration  is_vip  purchase_flag
0    10001          12          2  ...                  325.4       0              1
1    10002           5          0  ...                  112.8       1              0
...
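Beyond shapes and the first rows, I usually also check column types and missing values before modeling. The sketch below does this on a small hypothetical frame that borrows a few column names from the output above; the values are made up, and median imputation is only one simple option, not necessarily the right one for your data.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame reusing a few of the article's column names
df = pd.DataFrame({
    'page_views': [12, 5, np.nan, 8],
    'cart_adds': [2, 0, 1, 0],
    'avg_session_duration': [325.4, 112.8, 98.1, np.nan],
    'purchase_flag': [1, 0, 0, 1],
})

print(df.dtypes)            # column types
missing = df.isna().sum()   # number of missing values per column
print(missing)

# One simple option: fill numeric gaps with each column's median
df_filled = df.fillna(df.median(numeric_only=True))
print(df_filled)
```

On the real data the same calls would be applied to train_data and test_data, with the medians computed on the training set only.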

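3. Data Preprocessing and Feature Engineering

3.1 Separating Features and Labels

# Separate features and labels
X = train_data.drop(['user_id', 'purchase_flag'], axis=1)  # drop the ID column and the label column
y = train_data['purchase_flag']  # target variable

# Check the class distribution
print("Class distribution:\n", y.value_counts(normalize=True))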

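Note: if the classes are severely imbalanced (for example, a positive-to-negative ratio beyond 1:4), consider oversampling, undersampling, or adjusting the class weights.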
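To show what "adjusting the class weights" means in numbers, here is a minimal sketch of the heuristic scikit-learn documents for class_weight='balanced': n_samples / (n_classes * class_count). The helper name balanced_class_weights and the 800/200 label counts are illustrative assumptions, not values from the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical labels with a 4:1 imbalance (800 negatives, 200 positives)
labels = [0] * 800 + [1] * 200
weights = balanced_class_weights(labels)
print(weights)
```

The minority class ends up with the larger weight, so errors on it cost more during training.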

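3.2 Splitting and Standardizing the Data

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Work on explicit copies so the column assignments below do not
# trigger pandas' SettingWithCopyWarning
X_train, X_val = X_train.copy(), X_val.copy()

# Standardize the numeric features (fit on the training set only)
scaler = StandardScaler()
num_cols = ['page_views', 'cart_adds', 'avg_session_duration']
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_val[num_cols] = scaler.transform(X_val[num_cols])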

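Tip: the stratify parameter keeps the class proportions of the training and validation sets consistent. Standardize only the numeric features, and transform the validation set with the parameters fitted on the training set. Tree-based models such as random forest are insensitive to feature scaling, so this step is not required for the forest itself; it is mainly useful if you later compare against scale-sensitive models.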
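The effect of stratify is easy to verify on toy data. This sketch uses made-up labels (800 negatives, 200 positives) and a dummy feature column, and counts the classes on each side of the split.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 800 negatives, 200 positives
y_demo = [0] * 800 + [1] * 200
X_demo = [[i] for i in range(len(y_demo))]  # dummy single-feature matrix

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

train_counts = Counter(y_tr)
val_counts = Counter(y_va)
print("train:", dict(train_counts))
print("val:  ", dict(val_counts))
```

Both splits keep the 20% positive rate of the full label set; without stratify the validation share would drift with the random seed.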

4. Building the Random Forest Model

4.1 Training a Baseline Model

from sklearn.ensemble import RandomForestClassifier

# Initialize the random forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    class_weight='balanced'
)

# Train the model
rf.fit(X_train, y_train)

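Parameter notes:

n_estimators: number of decision trees, typically 100-500
max_depth: limits tree complexity to prevent overfitting
class_weight: handles imbalanced data; 'balanced' adjusts the weights automatically, inversely proportional to class frequencies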
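To see how max_depth trades training fit against generalization, a quick sweep like the one below can help. It is only a sketch: the data comes from make_classification as a stand-in for the e-commerce dataset, and the depth values are arbitrary choices of mine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the e-commerce dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=5,
    weights=[0.8, 0.2], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

results = {}
for depth in [2, 5, 10, None]:
    model = RandomForestClassifier(
        n_estimators=100, max_depth=depth,
        class_weight='balanced', random_state=42)
    model.fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for this depth
    results[depth] = (model.score(X_tr, y_tr), model.score(X_va, y_va))

for depth, (train_acc, val_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, val={val_acc:.3f}")
```

A growing gap between training and validation accuracy as depth increases is the usual sign that the trees are starting to overfit.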

4.2 Model Evaluation

from sklearn.metrics import classification_report, confusion_matrix

# Predict on the validation set
y_pred = rf.predict(X_val)

# Evaluation metrics
print("Classification report:\n", classification_report(y_val, y_pred))
print("\nConfusion matrix:\n", confusion_matrix(y_val, y_pred))

Example of typical output:

Classification report:
               precision    recall  f1-score   support

           0       0.92      0.95      0.94      1280
           1       0.77      0.68      0.72       320

    accuracy                           0.89      1600
   macro avg       0.85      0.81      0.83      1600
weighted avg       0.89      0.89      0.89      1600

Confusion matrix:
 [[1215   65]
 [ 103  217]]
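It helps to know how the report's numbers follow from the confusion matrix. Here is a minimal sketch that derives the positive-class metrics by hand; the helper name binary_metrics is my own, and the matrix is the one from the example output above.

```python
def binary_metrics(cm):
    """Derive positive-class precision, recall, F1, and overall accuracy.

    cm follows scikit-learn's layout: rows are true labels and
    columns are predicted labels, i.e. [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Confusion matrix from the example output above
cm = [[1215, 65], [103, 217]]
for name, value in binary_metrics(cm).items():
    print(f"{name}: {value:.3f}")
```

For this task the recall on class 1 is the number I watch most closely, since it is the share of actual buyers the model catches.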

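5. Feature Importance Analysis

5.1 Computing Feature Importances

import matplotlib.pyplot as plt
import numpy as np

# Get the feature importances
importances = rf.feature_importances_
# Standard deviation of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]  # sort in descending order

# Print the ranked feature importances
print("Feature importance ranking:")
for i in indices:
    print(f"{X.columns[i]}: {importances[i]:.4f}")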

5.2 Visualization

plt.figure(figsize=(12, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), X.columns[indices], rotation=45, ha="right")
plt.xlim([-1, X.shape[1]])
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300)
plt.show()
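One caveat: feature_importances_ is impurity-based, computed on the training data, and can favor high-cardinality features. As a cross-check I sometimes use permutation importance on held-out data. The sketch below uses scikit-learn's permutation_importance on synthetic data; the dataset and the feature_0 to feature_7 names are placeholders, not the article's features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 8 anonymous features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in score
result = permutation_importance(
    model, X_va, y_va, n_repeats=10, random_state=0)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

If the two rankings disagree strongly, that is worth investigating before drawing business conclusions from either one.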

6. Model Optimization and Hyperparameter Tuning

6.1 Grid Search

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2']
}

# With no scoring argument, candidates are ranked by accuracy;
# pass e.g. scoring='f1' to focus on the minority class
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42, class_weight='balanced'),
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
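Before launching a grid search it is worth estimating how many model fits it implies. This small sketch counts them for the grid above using only the standard library; the variable names are mine.

```python
from math import prod

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2'],
}
cv_folds = 5

# Every combination of values is one candidate
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

The final refit on the full training set adds one more fit. If that budget is too large, scikit-learn's RandomizedSearchCV samples a fixed number of candidates (n_iter) from the same space instead of trying them all.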

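6.2 Retraining with the Best Parameters

# GridSearchCV refits the best parameter combination on all of X_train
# by default (refit=True), so best_estimator_ is ready to use
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_val)
print("Classification report after tuning:\n", classification_report(y_val, y_pred_best))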

7. Model Deployment and Application

7.1 Saving and Loading the Model

import joblib

# Save the model and the preprocessing object
joblib.dump(best_rf, 'rf_classifier.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Example of loading them back
loaded_model = joblib.load('rf_classifier.pkl')
loaded_scaler = joblib.load('scaler.pkl')

# Preprocess new data and predict; new_data must contain the same
# feature columns, in the same order, as the training features
new_data = pd.read_csv('new_users.csv')
new_data[num_cols] = loaded_scaler.transform(new_data[num_cols])
predictions = loaded_model.predict(new_data.drop('user_id', axis=1))
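Saving the scaler and the model as two files works, but they can get out of sync. One alternative, sketched here as an option rather than the approach used above, is to bundle both into a single scikit-learn Pipeline and persist that one object. The tiny DataFrame is hypothetical and only reuses the article's column names; the temporary file path is also just for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical frame reusing the article's column names
train = pd.DataFrame({
    'page_views': [12, 5, 30, 2, 18, 7],
    'cart_adds': [2, 0, 5, 0, 3, 1],
    'avg_session_duration': [325.4, 112.8, 540.0, 45.2, 410.9, 150.3],
    'is_vip': [0, 1, 1, 0, 1, 0],
    'purchase_flag': [1, 0, 1, 0, 1, 0],
})
num_cols = ['page_views', 'cart_adds', 'avg_session_duration']
X_demo = train.drop(columns='purchase_flag')
y_demo = train['purchase_flag']

pipeline = Pipeline([
    # Scale numeric columns, pass the remaining columns through unchanged
    ('prep', ColumnTransformer(
        [('num', StandardScaler(), num_cols)], remainder='passthrough')),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X_demo, y_demo)

# Persist and reload the whole pipeline as one artifact
path = os.path.join(tempfile.mkdtemp(), 'rf_pipeline.pkl')
joblib.dump(pipeline, path)
loaded = joblib.load(path)

new_users = X_demo.head(2)  # stand-in for unseen data
print(loaded.predict(new_users))
```

With this layout the prediction code no longer has to remember which columns to scale, because the loaded pipeline applies the same preprocessing it was fitted with.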