CatBoost in Practice: The London House Price Prediction Competition

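1. Project Overview: The London House Price Prediction Competition

London is one of the world's most closely watched property markets. The Kaggle competition "London House Price Prediction: Advanced Techniques" asks participants to build a model that predicts sale prices from a property's multi-dimensional features. Doing well takes both machine learning skill and a working understanding of the housing market.

1.1 Core Challenges

The dataset contains detailed records of property transactions in London. Each transaction has the following key features:

Basic attributes: number of bedrooms, number of bathrooms, property type (propertyType), floor area, etc.
Location: full address (fullAddress), postcode (postcode), outward code (outcode), country (country)
Legal and energy: tenure type (tenure), current energy rating (currentEnergyRating)
Transaction time: month and year of sale

The target is the sale price (price). Evaluation metrics include MAE (mean absolute error), MSE (mean squared error), RMSE (root mean squared error), R² (coefficient of determination), and MAPE (mean absolute percentage error).

Note: the competition evaluates predictions in log10(price) space. We therefore log-transform the price during preprocessing and convert predictions back to the original price scale before submission.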
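To make the log10 round trip concrete, here is a minimal sketch. The prices are made-up values, not competition data. It also shows why an error measured in log10 space is a multiplicative error in price space: being off by 0.05 in log10 corresponds to being off by roughly 12% in price.

```python
import numpy as np

# Made-up sale prices, for illustration only
prices = np.array([250_000.0, 480_000.0, 1_200_000.0])

# Forward transform used for training, inverse transform used for submission
log_prices = np.log10(prices)
restored = 10 ** log_prices

# An additive error in log10 space is a multiplicative error in price space
error_in_log_space = 0.05
relative_factor = 10 ** error_in_log_space  # about 1.122, i.e. ~12% off

print(restored)
print(f"0.05 in log10 space = x{relative_factor:.3f} in price space")
```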

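1.2 Choosing the Technical Approach

For this structured-data prediction problem we chose CatBoost as the base model, for the following reasons:

- The data contains many categorical features (postcode, tenure type, and so on), and CatBoost supports categorical features natively
- The address text (fullAddress) is an important feature, and CatBoost's built-in text processing simplifies feature engineering
- Gradient-boosted tree models usually perform well on tabular data
- CatBoost supports GPU acceleration, which suits large datasets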

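2. Data Preprocessing and Feature Engineering

2.1 Loading and Initial Exploration

We first load the training and test sets and take an initial look at the data:

import pandas as pd

# Load the data
train = pd.read_csv('/kaggle/input/london-house-price-prediction-advanced-techniques/train.csv')
test = pd.read_csv('/kaggle/input/london-house-price-prediction-advanced-techniques/test.csv')

# Overview of the data
print(f"Training set shape: {train.shape}")
print(f"Test set shape: {test.shape}")

# Check for missing values
print("\nMissing values in the training set:")
print(train.isnull().sum())

print("\nMissing values in the test set:")
print(test.isnull().sum())

This step shows the size of the data and how many values are missing per feature, which informs the processing that follows.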

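2.2 Missing-Value Strategy

We handle missing data in two tiers:

Features with a high missing rate:
- Drop features whose missing rate exceeds 50% (measured on the training set)
- Record their names so the same columns are dropped from both the training and test sets
Features with a low missing rate:
- Numeric features: fill with the median
- Categorical features: fill with the mode
- Important: fill the test set with statistics computed on the training set, to avoid data leakage

def handle_missing_data(train, test, threshold=0.5):
    # Identify features with a high missing rate (measured on the training set)
    high_missing_cols = [col for col in train.columns
                         if train[col].isnull().mean() > threshold]

    # Drop them from both sets; errors='ignore' covers columns that
    # exist only in train (for example the target)
    train = train.drop(columns=high_missing_cols)
    test = test.drop(columns=high_missing_cols, errors='ignore')

    # Fill the remaining gaps using training-set statistics only
    for col in train.columns:
        in_test = col in test.columns
        has_gap = train[col].isnull().any() or (in_test and test[col].isnull().any())
        if not has_gap:
            continue

        if train[col].dtype == 'object':  # categorical
            fill_value = train[col].mode()[0]
        else:  # numeric
            fill_value = train[col].median()

        train[col] = train[col].fillna(fill_value)
        if in_test:  # the target column is absent from the test set
            test[col] = test[col].fillna(fill_value)

    return train, test

train, test = handle_missing_data(train, test)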

2.3 Reducing Memory Usage

Memory matters when working with large datasets. We downcast each numeric column to a smaller dtype where its value range allows:

import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """Downcast each numeric column to the smallest dtype that holds its range."""
    start_mem = df.memory_usage().sum() / 1024**2
    print(f"Memory usage before: {start_mem:.2f} MB")

    for col in df.columns:
        col_type = df[col].dtype

        if pd.api.types.is_integer_dtype(col_type):
            c_min, c_max = df[col].min(), df[col].max()
            # Try the narrowest integer type first; keep the column as-is if none fits
            for candidate in (np.int8, np.int16, np.int32):
                info = np.iinfo(candidate)
                if info.min <= c_min and c_max <= info.max:
                    df[col] = df[col].astype(candidate)
                    break
        elif pd.api.types.is_float_dtype(col_type):
            c_min, c_max = df[col].min(), df[col].max()
            # float16 is deliberately skipped: about three significant digits
            # is too coarse for continuous features such as floor area
            info = np.finfo(np.float32)
            if info.min <= c_min and c_max <= info.max:
                df[col] = df[col].astype(np.float32)
        # Object (string) and boolean columns are left unchanged

    end_mem = df.memory_usage().sum() / 1024**2
    print(f"Memory usage after: {end_mem:.2f} MB")
    print(f"Reduction: {(start_mem - end_mem)/start_mem:.1%}")

    return df

train = reduce_mem_usage(train)
test = reduce_mem_usage(test)

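3. Model Building and Training

3.1 Data Split and Target Transformation

House prices are typically right-skewed; taking the logarithm brings the distribution closer to normal:

from sklearn.model_selection import train_test_split
import numpy as np

# Separate features and target
X = train.drop(columns=['ID', 'price'])
y = np.log10(train['price'])  # log10 of the price

# Split into training and validation sets (90% train, 10% validation)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, 
    test_size=0.1, 
    random_state=927
)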
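The effect of the log transform on skewness can be checked directly. The sketch below uses synthetic log-normally distributed prices (an assumption made for illustration; it does not use the competition data) and compares the sample skewness before and after taking log10.

```python
import numpy as np
import pandas as pd

# Synthetic, log-normally distributed "prices" (illustrative assumption only)
rng = np.random.default_rng(927)
synthetic_price = pd.Series(rng.lognormal(mean=13.0, sigma=0.6, size=10_000))

# Skewness near 0 indicates a roughly symmetric distribution
raw_skew = synthetic_price.skew()
log_skew = np.log10(synthetic_price).skew()

print(f"skewness of price:        {raw_skew:.2f}")
print(f"skewness of log10(price): {log_skew:.2f}")
```

On the real data, the same two lines applied to train['price'] would show whether the transform helps as much as expected.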

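3.2 CatBoost Model Configuration

We configured the CatBoost model as follows. The iteration count is an upper bound; early stopping (section 3.3) normally ends training well before it is reached.

from catboost import CatBoostRegressor

model = CatBoostRegressor(
    iterations=4096*4,       # maximum number of boosting iterations
    learning_rate=0.08,      # learning rate
    depth=8,                 # tree depth
    l2_leaf_reg=0.4,         # L2 regularization coefficient
    task_type='GPU',         # train on the GPU
    bagging_temperature=0.5, # strength of the Bayesian bootstrap sampling
    border_count=128,        # number of bins for numeric features
    use_best_model=True,     # keep the best iteration on the validation set
    random_state=927,        # random seed
    verbose=100              # log every 100 iterations
)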

3.3 Training and Declaring Feature Types

A key strength of CatBoost is that it handles categorical and text features natively:

# Declare the categorical and text features
cat_features = ['postcode', 'country', 'outcode', 'tenure', 'propertyType', 'currentEnergyRating']
text_features = ['fullAddress']

# Train the model
model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    cat_features=cat_features,
    text_features=text_features,
    early_stopping_rounds=128  # stop after 128 rounds without improvement
)
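The idea behind early_stopping_rounds can be sketched in a few lines of plain Python. This is a simplified model of a patience-based stopping rule, not CatBoost's actual implementation; the function name best_iteration_with_patience and the loss values are hypothetical.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_iteration, stop_iteration) for a patience-based stopping rule.

    Training stops once `patience` iterations have passed since the best
    validation loss was observed.
    """
    best_iter, best_loss = 0, float('inf')
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            return best_iter, i  # stopped early at iteration i
    return best_iter, len(val_losses) - 1  # ran through every iteration

# Hypothetical validation losses: improvement stalls after iteration 5
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.34, 0.35, 0.36, 0.37, 0.38]
result = best_iteration_with_patience(losses, patience=3)
print(result)
```

With use_best_model=True, the model kept after training corresponds to the best iteration rather than the iteration at which training stopped.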

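4. Model Evaluation and Optimization

4.1 Implementing the Evaluation Metrics

We wrote a helper that computes the full set of regression metrics. Because y is log10(price), every value it returns is measured in log10 space, not in pounds:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_model(model, X, y_true):
    y_pred = model.predict(X)
    
    # All metrics are in the same space as y_true (here: log10 of the price)
    metrics = {
        'MAE': mean_absolute_error(y_true, y_pred),
        'MSE': mean_squared_error(y_true, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)),
        'R2': r2_score(y_true, y_pred),
        'MAPE': np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    }
    
    return pd.DataFrame([metrics])

# Evaluate on the validation set
val_metrics = evaluate_model(model, X_val, y_val)
print(val_metrics)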
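Metrics in log10 space are hard to read as money. As a complement, the following sketch converts log10 targets and predictions back to prices before computing errors. The helper name price_space_metrics and the demo values are hypothetical and not part of the original notebook.

```python
import numpy as np

def price_space_metrics(y_true_log10, y_pred_log10):
    """Compute MAE, RMSE and MAPE after converting log10 values back to prices."""
    y_true = 10 ** np.asarray(y_true_log10, dtype=float)
    y_pred = 10 ** np.asarray(y_pred_log10, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    return {
        'MAE': float(abs_err.mean()),
        'RMSE': float(np.sqrt(np.mean((y_true - y_pred) ** 2))),
        'MAPE': float(np.mean(abs_err / y_true) * 100),
    }

# Two made-up properties: predicted 10% too high and 10% too low
demo = price_space_metrics(
    np.log10([100_000, 200_000]),
    np.log10([110_000, 180_000]),
)
print(demo)
```

In the notebook this could be called as price_space_metrics(y_val, model.predict(X_val)), assuming y_val holds log10 prices as in section 3.1.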

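4.2 Directions for Further Optimization

4.2.1 A Better Cross-Validation Strategy

The original approach uses a single hold-out split. We can upgrade to stratified K-fold cross-validation by binning the continuous target and stratifying on the bins:

from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import KBinsDiscretizer

# Bin the (log) price into quantile buckets to stratify on
bins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
binned_y = bins.fit_transform(y.values.reshape(-1, 1)).ravel().astype(int)

# Stratified K-fold: each fold gets a similar mix of price buckets
# (a plain KFold would ignore binned_y entirely)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=927)
fold_metrics = []

for train_idx, val_idx in skf.split(X, binned_y):
    # Separate names so the hold-out split from section 3.1 is not overwritten
    X_tr, X_va = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_va = y.iloc[train_idx], y.iloc[val_idx]
    
    # Calling fit again retrains the model from scratch on this fold
    model.fit(X_tr, y_tr, eval_set=(X_va, y_va),
              cat_features=cat_features, text_features=text_features,
              early_stopping_rounds=128, verbose=False)
    
    metrics = evaluate_model(model, X_va, y_va)
    fold_metrics.append(metrics)

# Average the metrics across folds
final_metrics = pd.concat(fold_metrics).mean()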

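4.2.2 Richer Feature Engineering

We can extract more structured information from the address text. Note that these features must be created before X is built in section 3.1, and postcode_prefix should be added to cat_features:

import re

def extract_address_features(df):
    # Leading letters of the postcode (the postcode area)
    df['postcode_prefix'] = (df['postcode'].str.extract(r'^([A-Z]+)', expand=False)
                             .fillna('unknown'))
    
    # Whether the address contains specific keywords (na=False treats missing text as no match)
    df['has_flat'] = df['fullAddress'].str.contains(r'\bflat\b', flags=re.IGNORECASE, na=False).astype(int)
    df['has_road'] = df['fullAddress'].str.contains(r'\broad\b', flags=re.IGNORECASE, na=False).astype(int)
    df['has_street'] = df['fullAddress'].str.contains(r'\bstreet\b', flags=re.IGNORECASE, na=False).astype(int)
    
    # Address length features
    df['address_length'] = df['fullAddress'].str.len()
    df['word_count'] = df['fullAddress'].str.split().str.len()
    
    return df

train = extract_address_features(train)
test = extract_address_features(test)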

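4.2.3 Model Ensembling

Combining the strengths of several models can make predictions more stable. XGBoost and LightGBM cannot consume raw string columns, so their inputs are ordinal-encoded and the free-text address is dropped, while CatBoost receives the original columns:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

def numeric_view():
    # Encode categorical columns as integers, drop the text column, keep the rest
    return ColumnTransformer(
        [('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), cat_features),
         ('text', 'drop', text_features)],
        remainder='passthrough'
    )

# Base models. StackingRegressor.fit does not forward per-estimator fit
# arguments by default, so CatBoost gets its feature lists in the constructor.
estimators = [
    ('catboost', CatBoostRegressor(iterations=2000, learning_rate=0.05, depth=6,
                                   cat_features=cat_features, text_features=text_features,
                                   task_type='GPU', random_state=927, verbose=0)),
    # XGBoost 2.0+ selects the GPU with device='cuda'; 'gpu_hist' is deprecated
    ('xgb', make_pipeline(numeric_view(),
                          XGBRegressor(n_estimators=1000, learning_rate=0.05, max_depth=6,
                                       tree_method='hist', device='cuda', random_state=927))),
    # device='gpu' requires a GPU-enabled LightGBM build
    ('lgbm', make_pipeline(numeric_view(),
                           LGBMRegressor(n_estimators=1000, learning_rate=0.05, max_depth=6,
                                         device='gpu', random_state=927)))
]

# Meta-model trained on the base models' out-of-fold predictions
stacking_model = StackingRegressor(
    estimators=estimators,
    final_estimator=CatBoostRegressor(iterations=500, learning_rate=0.02, 
                                     depth=4, task_type='GPU', random_state=927, verbose=0)
)

# Train the ensemble
stacking_model.fit(X_train, y_train)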

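5. Submission and Post-Processing

5.1 Generating Predictions

# Predict on the test set, using the same columns as in training
test_preds_log = model.predict(test[X_train.columns])

# Convert the log10 predictions back to the original price scale
test_preds = 10 ** test_preds_log

# Load the submission template
submission = pd.read_csv('/kaggle/input/london-house-price-prediction-advanced-techniques/sample_submission.csv')

# Fill in the predictions (assumes the template rows are in the same order as test.csv)
submission['price'] = test_preds

# Save the submission file
submission.to_csv('submission.csv', index=False)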

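5.2 Post-Processing the Predictions

To keep predictions plausible, we can add a few post-processing steps. These are heuristics and should be checked against validation data before being kept:

# Minimum and maximum price in the training set
min_price = train['price'].min()
max_price = train['price'].max()

# Clip predictions to a plausible range around the observed prices
submission['price'] = submission['price'].clip(lower=min_price*0.9, upper=max_price*1.1)

# Apply an extra adjustment to extremely high-value properties
price_99_percentile = train['price'].quantile(0.99)
high_value_mask = submission['price'] > price_99_percentile
submission.loc[high_value_mask, 'price'] = submission.loc[high_value_mask, 'price'] * 0.95  # modest downward adjustment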

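6. Practical Lessons and Pitfalls

6.1 Key Points to Watch

Preventing data leakage:
- Never use test-set information to fill missing values in the training set
- In cross-validation, run feature engineering independently within each fold
- Operations such as target encoding must be done inside the cross-validation loop
Handling categorical features:
- Make sure categorical features are explicitly cast to string type
- For high-cardinality categorical features (such as postcode), consider target encoding or frequency encoding
- Category values that appear in the test set but not in the training set need special handling
Improving text features:
- Spelling errors in the address text hurt model performance
- Consider text preprocessing (normalization, spelling correction)
- Try adding external geographic data (for example, distance to the nearest Underground station)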
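The advice on leakage and high-cardinality categories can be combined in an out-of-fold target encoding. The sketch below is one possible implementation under stated assumptions: the helper name oof_target_encode, the smoothing constant, and the demo data are all hypothetical. Each row is encoded using only statistics from the other folds, and categories unseen in those folds fall back to the fold's overall mean.

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=927):
    """Out-of-fold mean target encoding with additive smoothing."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(df)), n_splits)

    for val_pos in folds:
        # Fit statistics on every row except the ones being encoded
        fit_mask = np.ones(len(df), dtype=bool)
        fit_mask[val_pos] = False
        fit_part = df[fit_mask]

        prior = fit_part[target].mean()
        stats = fit_part.groupby(col)[target].agg(['sum', 'count'])
        # Shrink small categories towards the prior
        smooth = (stats['sum'] + prior * smoothing) / (stats['count'] + smoothing)

        # Categories unseen in the fitting part fall back to the prior
        val_keys = df[col].iloc[val_pos]
        encoded.iloc[val_pos] = val_keys.map(smooth).fillna(prior).to_numpy()

    return encoded

# Made-up data: outward codes and log10 prices
demo = pd.DataFrame({
    'outcode': ['E1', 'E1', 'E1', 'SW3', 'SW3', 'N7'],
    'log_price': [5.60, 5.70, 5.65, 6.30, 6.40, 5.80],
})
demo['outcode_te'] = oof_target_encode(demo, 'outcode', 'log_price', n_splits=3)
print(demo)
```

For the test set, the encoding would be fitted once on the full training set and applied with the same fallback for unseen categories.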

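6.2 Performance Tuning Tips

Learning rate and iteration count:
- A smaller learning rate usually needs more iterations
- A learning-rate decay schedule may give better results
- Do not set the early-stopping patience too high, or training time is wasted
Tree depth and regularization:
- Trees with depth between 6 and 10 usually perform well
- Increasing L2 regularization helps prevent overfitting
- The feature sampling rate (colsample_bylevel) can add diversity; check that it is supported for your task type before relying on it on GPU
GPU acceleration:
- Make sure a GPU-capable CatBoost build is installed correctly
- The GPU advantage grows with larger datasets
- Monitor GPU memory usage to avoid running out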
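One way to keep the notebook runnable on machines without a GPU is to choose task_type at runtime. The sketch below relies on get_gpu_device_count from catboost.utils; the helper name pick_task_type is hypothetical, and it falls back to CPU when CatBoost is not installed or no GPU is visible.

```python
def pick_task_type():
    """Return 'GPU' if CatBoost reports at least one GPU device, otherwise 'CPU'."""
    try:
        from catboost.utils import get_gpu_device_count
    except ImportError:
        # CatBoost is not installed in this environment
        return 'CPU'
    return 'GPU' if get_gpu_device_count() > 0 else 'CPU'

task_type = pick_task_type()
print(f"Training will use: {task_type}")
```

The result can then be passed as CatBoostRegressor(task_type=task_type, ...) instead of hard-coding 'GPU'.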