Object detection is one of the core tasks in computer vision, and its development reads like an epic of algorithmic refinement. Looking back at this history, the evolution from RCNN to Faster RCNN was not a simple linear improvement but an organic combination of key breakthroughs. This article retraces that evolution in PyTorch code, focusing on two revolutionary innovations: SPPNet's spatial pyramid pooling and Fast RCNN's RoI Pooling.
RCNN, introduced in 2014, was the first method to bring CNNs into object detection, lifting mAP from 35.1% (the traditional DPM approach) to 53.7%. Let's first look at its core implementation logic:
```python
import torch
import torch.nn as nn

class RCNN(nn.Module):
    def __init__(self, backbone='vgg16', num_classes=20):
        super(RCNN, self).__init__()
        # get_backbone is assumed to return the CNN trunk up to the
        # flattened 4096-d fc7 output
        self.feature_extractor = get_backbone(backbone)
        self.classifier = nn.Linear(4096, num_classes)
        self.bbox_regressor = nn.Linear(4096, num_classes * 4)

    def forward(self, regions):
        # regions: (N, 3, 227, 227) -- warped region proposals
        features = self.feature_extractor(regions)  # (N, 4096)
        cls_scores = self.classifier(features)
        bbox_deltas = self.bbox_regressor(features)
        return cls_scores, bbox_deltas
```
RCNN's main problems are:

- Every region proposal is warped and passed through the CNN independently, producing massive redundant computation
- The fixed 227×227 input forces cropping/warping, which distorts objects
- Training is a multi-stage pipeline (CNN fine-tuning, per-class SVMs, box regressors) with features cached to disk between stages
Tip: in real projects, selective search typically generates around 2,000 candidate boxes, which means roughly 2,000 CNN forward passes per image.
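The cost described above comes from warping every proposal to the network's fixed input size before its own forward pass. A minimal sketch of that warping step (the proposal coordinates here are made up for illustration):

```python
import torch
import torch.nn.functional as F

image = torch.randn(3, 480, 640)  # one input image (C, H, W)
proposals = [(30, 40, 230, 310), (100, 120, 400, 500)]  # illustrative (x1, y1, x2, y2)

# RCNN warps each cropped proposal to the fixed 227x227 network input
crops = []
for x1, y1, x2, y2 in proposals:
    crop = image[:, y1:y2, x1:x2].unsqueeze(0)
    crops.append(F.interpolate(crop, size=(227, 227),
                               mode='bilinear', align_corners=False))
batch = torch.cat(crops)  # each warped crop then needs its own CNN forward pass
```

With ~2,000 proposals this loop becomes 2,000 full CNN evaluations, which is exactly the bottleneck SPPNet attacks next.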
SPPNet's core innovation is removing the CNN's strict constraint on input size. Its spatial pyramid pooling layer lets the network accept inputs of arbitrary size while still emitting a fixed-length feature vector. Here is the key part of a PyTorch implementation:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(1, 2, 4)):
        super(SpatialPyramidPooling, self).__init__()
        self.levels = levels

    def forward(self, x):
        n = x.size(0)
        features = []
        for level in self.levels:
            # adaptive pooling guarantees a level x level output grid
            # even when H or W is not divisible by the level
            pooled = F.adaptive_max_pool2d(x, output_size=level)
            features.append(pooled.view(n, -1))
        return torch.cat(features, dim=1)
```
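A quick sanity check of the fixed-length property, rewritten as a standalone helper so it runs on its own:

```python
import torch
import torch.nn.functional as F

def spp_vector(x, levels=(1, 2, 4)):
    # flatten each pyramid level's pooled grid and concatenate
    n = x.size(0)
    return torch.cat(
        [F.adaptive_max_pool2d(x, l).view(n, -1) for l in levels], dim=1)

# feature maps of different spatial sizes yield the same-length vector:
# 256 channels x (1 + 4 + 16) bins = 5376 dimensions in both cases
a = spp_vector(torch.randn(1, 256, 13, 13))
b = spp_vector(torch.randn(1, 256, 24, 17))
```

This is what lets SPPNet run the CNN once per image and pool each candidate region from the shared feature map.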
SPPNet's main improvements over RCNN:
| Property | RCNN | SPPNet |
|---|---|---|
| Feature extraction passes | ~2000 per image | 1 per image |
| Input size | fixed 227×227 | arbitrary |
| Speed | slow (~14 s/image) | fast (~2-3 s/image) |
| Feature robustness | sensitive to warping distortion | more robust across scales |
Fast RCNN solved the pain points of its predecessors through two key innovations: RoI Pooling and a multi-task loss function. Let's focus on the RoI Pooling implementation:
```python
import torch
import torch.nn.functional as F

def roi_pooling(features, rois, output_size, spatial_scale=1 / 16):
    """
    features: (1, C, H, W) whole-image feature map
    rois: (N, 5) rows of [batch_idx, x1, y1, x2, y2] in image coordinates
    output_size: (pooled_h, pooled_w)
    spatial_scale: image-to-feature-map ratio (e.g. 1/16 for VGG16 conv5)
    """
    pooled = []
    for roi in rois:
        batch_idx = int(roi[0])
        # project the ROI onto the feature map, rounding outward so the
        # cropped region is never empty
        x1 = int(torch.floor(roi[1] * spatial_scale))
        y1 = int(torch.floor(roi[2] * spatial_scale))
        x2 = max(int(torch.ceil(roi[3] * spatial_scale)), x1 + 1)
        y2 = max(int(torch.ceil(roi[4] * spatial_scale)), y1 + 1)
        region = features[batch_idx, :, y1:y2, x1:x2]
        # divide the region into a fixed grid and max-pool each bin
        pooled.append(F.adaptive_max_pool2d(region, output_size))
    return torch.stack(pooled)  # (N, C, pooled_h, pooled_w)
```
Fast RCNN's training pipeline was also a major step forward:

- The whole image passes through the CNN once; all ROIs share the resulting feature map
- A single multi-task loss (classification plus smooth-L1 box regression) replaces the separately trained SVMs and regressors, enabling end-to-end training
- No features need to be cached to disk between stages
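The multi-task loss mentioned above can be sketched as follows; this is a simplified version (`fast_rcnn_loss` and its tensor shapes are my own illustrative choices), combining cross-entropy over K+1 classes with smooth-L1 regression applied only to foreground ROIs:

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(cls_scores, bbox_deltas, labels, bbox_targets, lam=1.0):
    # cls_scores: (N, K+1), bbox_deltas: (N, (K+1)*4), labels: (N,)
    # classification loss over K+1 classes (label 0 = background)
    loss_cls = F.cross_entropy(cls_scores, labels)
    # box regression only for foreground ROIs, using the true class's deltas
    fg = labels > 0
    if fg.any():
        per_class = bbox_deltas.view(bbox_deltas.size(0), -1, 4)
        deltas = per_class[fg, labels[fg]]
        loss_loc = F.smooth_l1_loss(deltas, bbox_targets[fg])
    else:
        loss_loc = bbox_deltas.sum() * 0.0
    return loss_cls + lam * loss_loc

# 4 ROIs, 20 object classes + background
loss = fast_rcnn_loss(torch.randn(4, 21), torch.randn(4, 21 * 4),
                      torch.tensor([0, 3, 7, 0]), torch.randn(4, 4))
```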
Faster RCNN's innovation is replacing the time-consuming selective search with an RPN (Region Proposal Network). Here is the key part of an RPN implementation:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPN(nn.Module):
    def __init__(self, in_channels=512, mid_channels=256,
                 anchor_ratios=(0.5, 1, 2), anchor_scales=(8, 16, 32)):
        super(RPN, self).__init__()
        # k anchors per spatial location (3 ratios x 3 scales = 9 in the paper)
        k = len(anchor_ratios) * len(anchor_scales)
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        self.cls_logits = nn.Conv2d(mid_channels, k * 2, 1)  # 2k objectness scores
        self.bbox_pred = nn.Conv2d(mid_channels, k * 4, 1)   # 4k box deltas

    def forward(self, x):
        x = F.relu(self.conv(x))
        logits = self.cls_logits(x)
        bbox_deltas = self.bbox_pred(x)
        return logits, bbox_deltas
```
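The `k` anchors the RPN scores at each location come from a fixed set of reference boxes. A minimal sketch of generating them (one box per ratio/scale pair, centered at the origin; the receptive-field `base_size` of 16 matches a VGG16 conv5 stride):

```python
import torch

def make_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    # one anchor per (ratio, scale) pair, centered at the origin
    boxes = []
    for r in ratios:
        for s in scales:
            h = base_size * s * (r ** 0.5)  # height/width ratio equals r
            w = base_size * s / (r ** 0.5)
            boxes.append([-w / 2, -h / 2, w / 2, h / 2])
    return torch.tensor(boxes)  # (k, 4) in [x1, y1, x2, y2] form

anchors = make_anchors()  # (9, 4): shifted to every feature-map location at runtime
```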
Faster RCNN's full training pipeline (the original paper's 4-step alternating scheme) consists of:

1. Train the RPN, initialized from an ImageNet-pretrained backbone
2. Train Fast RCNN on the RPN's proposals, also from a pretrained backbone
3. Fix the now-shared convolutional layers and fine-tune only the RPN-specific layers
4. Keeping the shared layers fixed, fine-tune the Fast RCNN head
In real projects, I have found several key tuning points: