Step-by-step tutorial: converting the CrowdHuman dataset to YOLOv5 format with a single Python script (including code that keeps only the person class)

一只特立独行的cherry

Python Automation in Practice: A Complete Walkthrough of Converting CrowdHuman to YOLOv5 Format

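Format conversion is the biggest headache every time I pick up a new dataset. Last week I took over a pedestrian detection project that uses CrowdHuman, whose ODGT annotation format is not compatible with YOLOv5. After two days of trial and error I ended up with a complete automated pipeline. This step-by-step tutorial shares it, with the full code and an explanation of how it works.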

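1. Environment Setup and Data Download

Before starting, set up the basic environment. Python 3.8 or later is recommended. Install the following dependencies (the scripts in this tutorial import only tqdm and opencv-python; pandas and numpy are handy for inspecting the results):

pip install pandas numpy tqdm opencv-python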

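The CrowdHuman dataset can be downloaded from its official website. It consists of the following files:

Image files (ZIP archives):
- CrowdHuman_train01.zip
- CrowdHuman_train02.zip
- CrowdHuman_train03.zip
- CrowdHuman_val.zip
Annotation files:
- annotation_train.odgt
- annotation_val.odgt

Tip: after downloading, verify the integrity of the files first to avoid errors in later steps.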
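
One way to act on this tip is to let Python test each archive before extracting it. The sketch below relies on the standard library's `zipfile.ZipFile.testzip()`, which reads every member and checks its CRC. The helper name `check_archives` and the `./raw_data` path are illustrative assumptions, not part of the dataset's tooling.

```python
import os
import zipfile


def check_archives(input_dir, names):
    """Return {archive name: problem} for every archive that is missing or damaged."""
    problems = {}
    for name in names:
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            problems[name] = "missing"
            continue
        try:
            with zipfile.ZipFile(path) as zf:
                bad_member = zf.testzip()  # name of the first corrupt member, or None
        except zipfile.BadZipFile:
            problems[name] = "not a valid zip file"
            continue
        if bad_member is not None:
            problems[name] = f"corrupt member: {bad_member}"
    return problems


if __name__ == "__main__":
    archives = [
        "CrowdHuman_train01.zip",
        "CrowdHuman_train02.zip",
        "CrowdHuman_train03.zip",
        "CrowdHuman_val.zip",
    ]
    # "./raw_data" is a placeholder for wherever the downloads were saved
    print(check_archives("./raw_data", archives) or "all archives OK")
```

An empty result means every archive opened and passed the CRC check. This confirms internal consistency only; it cannot tell whether a download matches the publisher's copy.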

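2. Dataset Structure and Preprocessing

CrowdHuman annotations use the ODGT (Open Dataset Ground Truth) format, a text format based on JSON Lines. Each annotation file contains many lines, and each line is one JSON object holding the annotations for a single image.

A typical record looks like this:

{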
  "ID": "273271,1017c000ac1360b7",
  "gtboxes": [
    {
      "tag": "person",
      "vbox": [x,y,w,h],
      "fbox": [x,y,w,h],
      "hbox": [x,y,w,h],
      "extra": {...}
    },
    ...
  ]
}

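The key fields are:

tag: "person" for a person, "mask" for a region that should be ignored
vbox: visible box (the visible part of the person)
fbox: full box (the whole body, including occluded parts)
hbox: head box

Every box is [x, y, w, h] in pixels, where (x, y) is the top-left corner. Note that the head is not a separate entry: it is the hbox field of each person entry.

For pedestrian detection we usually take fbox as the detection target.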
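
Before converting anything, it can help to read the annotation file once and see what it contains. The following is a minimal sketch under the structure described above (one JSON object per line, with an `ID` and a `gtboxes` list); the function names `iter_odgt` and `summarize` are made up for this example.

```python
import json
from collections import Counter


def iter_odgt(odgt_path):
    """Yield one parsed record per non-empty line of an ODGT file."""
    with open(odgt_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(odgt_path):
    """Return (number of images, Counter of box tags) for an ODGT file."""
    n_images = 0
    tags = Counter()
    for record in iter_odgt(odgt_path):
        n_images += 1
        tags.update(box["tag"] for box in record.get("gtboxes", []))
    return n_images, tags


# Usage (the path is a placeholder):
# n_images, tags = summarize("annotation_val.odgt")
# print(n_images, dict(tags))
```

Because `iter_odgt` is a generator, the file is read line by line and never held in memory as a whole.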

3. Core Code: Converting ODGT to YOLOv5 Format

Here is the core of the conversion script I wrote, with detailed comments:

import json
import os
from pathlib import Path

import cv2
from tqdm import tqdm


def convert_odgt_to_yolo(odgt_path, images_dir, output_dir, class_filter=None):
    """
    Convert CrowdHuman ODGT annotations to YOLOv5 label files.

    Args:
        odgt_path: path to the ODGT annotation file
        images_dir: directory containing the images
        output_dir: directory the label .txt files are written to
        class_filter: class IDs to keep (0 = head, 1 = person); None keeps both
    """
    os.makedirs(output_dir, exist_ok=True)

    with open(odgt_path, 'r') as f:
        lines = [line for line in f if line.strip()]

    for line in tqdm(lines, desc=f"Processing {odgt_path}"):
        data = json.loads(line)
        img_id = data['ID']
        img_path = find_image(images_dir, img_id)

        if not img_path:
            continue

        img = cv2.imread(img_path)
        if img is None:  # unreadable or corrupt image
            continue
        img_h, img_w = img.shape[:2]

        txt_path = os.path.join(output_dir, f"{Path(img_path).stem}.txt")

        with open(txt_path, 'w') as f_txt:
            for box in data.get('gtboxes', []):
                # 'mask' entries mark ignore regions, not people
                if box['tag'] == 'mask':
                    continue

                # There is no separate 'head' tag: each person entry carries
                # its head box in 'hbox' and its full-body box in 'fbox'.
                # Class IDs: 0 = head, 1 = person
                for class_id, key in ((0, 'hbox'), (1, 'fbox')):
                    # With a class filter set, keep only the listed classes
                    if class_filter is not None and class_id not in class_filter:
                        continue
                    if key not in box:
                        continue

                    x, y, w, h = box[key]

                    # Convert to YOLO format: normalized center point and size
                    x_center = (x + w / 2) / img_w
                    y_center = (y + h / 2) / img_h
                    w_norm = w / img_w
                    h_norm = h / img_h

                    # Write one YOLO label line
                    f_txt.write(f"{class_id} {x_center:.6f} {y_center:.6f} {w_norm:.6f} {h_norm:.6f}\n")


def find_image(images_dir, img_id):
    """Find the image file that matches an annotation ID."""
    for ext in ['.jpg', '.png', '.jpeg']:
        img_path = os.path.join(images_dir, f"{img_id}{ext}")
        if os.path.exists(img_path):
            return img_path
    return None
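
To sanity-check the coordinate math in isolation, the conversion can be pulled out into a pure function and paired with its inverse. This is a small sketch; `xywh_to_yolo` and `yolo_to_xywh` are hypothetical helper names, and the box and image size in the example are made-up numbers.

```python
def xywh_to_yolo(box, img_w, img_h):
    """Pixel [x, y, w, h] (top-left corner) -> normalized YOLO (cx, cy, w, h)."""
    x, y, w, h = box
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def yolo_to_xywh(yolo_box, img_w, img_h):
    """Normalized YOLO (cx, cy, w, h) -> pixel [x, y, w, h] (top-left corner)."""
    cx, cy, w, h = yolo_box
    return [(cx - w / 2) * img_w, (cy - h / 2) * img_h, w * img_w, h * img_h]


if __name__ == "__main__":
    # A 200x400 box whose top-left corner is at (100, 50) in an 800x600 image
    yolo = xywh_to_yolo([100, 50, 200, 400], img_w=800, img_h=600)
    print(" ".join(f"{v:.4f}" for v in yolo))  # prints: 0.2500 0.4167 0.2500 0.6667
```

Converting a label back with `yolo_to_xywh` and drawing it on the image is a quick way to confirm that a generated label file lines up with the picture.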

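4. Going Further: Keeping Only the person Class

Many real projects only need to detect the person class. The following code filters the head class out of the converted labels: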

def filter_person_only(input_dir, output_dir):
    """
    Filter YOLO label files so that only the person class (class 1) remains,
    and remap it to class 0 because it is then the only class.
    """
    os.makedirs(output_dir, exist_ok=True)

    for txt_file in tqdm(os.listdir(input_dir), desc="Filtering person only"):
        if not txt_file.endswith('.txt'):  # skip anything that is not a label file
            continue
        input_path = os.path.join(input_dir, txt_file)
        output_path = os.path.join(output_dir, txt_file)

        with open(input_path, 'r') as f_in, open(output_path, 'w') as f_out:
            for line in f_in:
                parts = line.strip().split()
                if not parts:
                    continue

                class_id = int(parts[0])
                if class_id == 1:  # keep person only
                    # Change the class ID to 0
                    parts[0] = '0'
                    f_out.write(' '.join(parts) + '\n')

Usage example:

# Convert the whole dataset
convert_odgt_to_yolo(
    "annotation_train.odgt",
    "Images",
    "labels/train"
)

# Keep only the person class
filter_person_only(
    "labels/train",
    "labels/train_person_only"
)
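
A quick check after filtering is to count how many boxes of each class ID the label directory contains; in a person-only directory every box should be class 0. The sketch below assumes the one-box-per-line label layout written above, and `count_classes` is a name invented for this example.

```python
import os
from collections import Counter


def count_classes(label_dir):
    """Count boxes per class ID across all .txt label files in label_dir."""
    counts = Counter()
    for name in os.listdir(label_dir):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(label_dir, name), "r") as f:
            for line in f:
                parts = line.split()
                if parts:
                    counts[int(parts[0])] += 1
    return counts


# Usage (the directory is the one produced by the example above):
# print(count_classes("labels/train_person_only"))
```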

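5. Building the Automated Pipeline

To speed things up further, I wrote a complete automation script, process_crowdhuman.py, which covers:

- Extracting the dataset archives
- Format conversion
- Class filtering
- Train/val split (following the official annotation files)
- Generating the YOLOv5 dataset config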

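import argparse
import os
import shutil
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# convert_odgt_to_yolo, filter_person_only and find_image from the sections
# above must be defined in (or imported into) this file.


def unzip_file(zip_path, extract_to):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)


def move_images(label_dir, src_dir, dst_dir):
    """Move every image that has a label file in label_dir into dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for txt_file in os.listdir(label_dir):
        img_path = find_image(src_dir, Path(txt_file).stem)
        if img_path:
            shutil.move(img_path, os.path.join(dst_dir, os.path.basename(img_path)))


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-dir', required=True, help='directory with the downloaded zips and .odgt files')
    parser.add_argument('--output-dir', default='yolov5_data', help='output directory')
    parser.add_argument('--person-only', action='store_true', help='keep only the person class')
    args = parser.parse_args()

    # Create the output directory layout
    images_dir = os.path.join(args.output_dir, 'images')
    labels_dir = os.path.join(args.output_dir, 'labels')
    raw_dir = os.path.join(args.output_dir, 'raw')
    # The archives are assumed to unpack into an Images/ subfolder; creating it
    # up front also keeps the parallel extractions from racing on mkdir.
    raw_images = os.path.join(raw_dir, 'Images')
    for d in (images_dir, labels_dir, raw_images):
        os.makedirs(d, exist_ok=True)

    # Extract the image archives in parallel
    print("Extracting image archives...")
    zip_files = [
        'CrowdHuman_train01.zip',
        'CrowdHuman_train02.zip',
        'CrowdHuman_train03.zip',
        'CrowdHuman_val.zip'
    ]

    with ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(unzip_file, os.path.join(args.input_dir, zip_file), raw_dir)
            for zip_file in zip_files
        ]
        for future in futures:
            future.result()  # re-raise any extraction error
    src_dir = raw_images if os.listdir(raw_images) else raw_dir

    # Convert the annotations, keeping the official train/val split
    print("Converting annotations...")
    for split in ('train', 'val'):
        convert_odgt_to_yolo(
            os.path.join(args.input_dir, f'annotation_{split}.odgt'),
            src_dir,
            os.path.join(labels_dir, split)
        )
        # YOLOv5 finds labels by replacing /images/ with /labels/ in the image
        # path, so the images go to images/<split> next to labels/<split>
        move_images(os.path.join(labels_dir, split), src_dir, os.path.join(images_dir, split))

    # Class filtering
    if args.person_only:
        print("Keeping only the person class...")
        for split in ('train', 'val'):
            split_dir = os.path.join(labels_dir, split)
            tmp_dir = os.path.join(labels_dir, f'{split}_person')
            filter_person_only(split_dir, tmp_dir)
            shutil.rmtree(split_dir)
            os.rename(tmp_dir, split_dir)

    # Generate the YOLOv5 dataset config; class names must match the label IDs
    print("Writing the YOLOv5 dataset config...")
    names = ['person'] if args.person_only else ['head', 'person']
    with open(os.path.join(args.output_dir, 'crowdhuman.yaml'), 'w') as f:
        f.write(f"path: {os.path.abspath(args.output_dir)}\n")
        f.write("train: images/train\nval: images/val\n\nnames:\n")
        for class_id, name in enumerate(names):
            f.write(f"  {class_id}: {name}\n")


if __name__ == '__main__':
    main()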

Usage:

python process_crowdhuman.py --input-dir ./raw_data --output-dir ./yolov5_data --person-only

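6. Common Problems and Solutions

You may run into the following problems in practice:

Images and annotations do not match
- Check that each image file name matches the ID in the annotation
- Make sure no files were renamed during extraction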
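
Both checks can be automated by comparing the IDs in an annotation file against the file names in the image directory. The following is a sketch only; `find_missing_images` is a hypothetical helper, and it assumes all images sit directly in one flat directory.

```python
import json
import os


def find_missing_images(odgt_path, images_dir, exts=(".jpg", ".jpeg", ".png")):
    """Return the annotation IDs that have no matching image file in images_dir."""
    stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(images_dir)
        if name.lower().endswith(exts)
    }
    missing = []
    with open(odgt_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                img_id = json.loads(line)["ID"]
                if img_id not in stems:
                    missing.append(img_id)
    return missing


# Usage (paths are placeholders):
# print(find_missing_images("annotation_val.odgt", "Images")[:10])
```

An empty list means every annotated ID has an image. A long list usually points at the wrong directory or at an archive that was not extracted.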

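Boxes extending beyond the image

Full-body boxes (fbox) can extend past the image border. Clip the box corners in pixel space and recompute the center and size before writing the label. Clamping the normalized center and size independently is not enough, because the resulting box can still reach outside the image:

# Clip the box corners to the image, then recompute center and size
x1, y1 = max(0, x), max(0, y)
x2, y2 = min(img_w, x + w), min(img_h, y + h)
x_center = (x1 + x2) / 2 / img_w
y_center = (y1 + y2) / 2 / img_h
w_norm = (x2 - x1) / img_w
h_norm = (y2 - y1) / img_h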
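
A worked case shows why the clipping has to happen on the box corners. Take a made-up box with x = -100 and w = 200 in a 400-pixel-wide image: its normalized center is 0 and its width 0.5, both already inside [0, 1], yet the box spans -0.25 to 0.25. The sketch below clips the corners instead and drops boxes that lie entirely outside the image; `clip_to_yolo` is a hypothetical helper name.

```python
def clip_to_yolo(box, img_w, img_h):
    """Clip a pixel [x, y, w, h] box to the image and return normalized
    (cx, cy, w, h), or None if no part of the box lies inside the image."""
    x, y, w, h = box
    x1, y1 = max(0, x), max(0, y)
    x2, y2 = min(img_w, x + w), min(img_h, y + h)
    if x2 <= x1 or y2 <= y1:
        return None  # nothing left after clipping
    return (
        (x1 + x2) / 2 / img_w,
        (y1 + y2) / 2 / img_h,
        (x2 - x1) / img_w,
        (y2 - y1) / img_h,
    )


if __name__ == "__main__":
    print(clip_to_yolo([-100, 0, 200, 100], img_w=400, img_h=200))  # (0.125, 0.25, 0.25, 0.5)
    print(clip_to_yolo([500, 0, 50, 50], img_w=400, img_h=200))     # None
```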

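Slow processing

Use multiple threads:

from concurrent.futures import ThreadPoolExecutor

# process_image and image_paths stand in for your own per-image function and inputs
with ThreadPoolExecutor(max_workers=8) as executor:
    # list() consumes the results so that worker exceptions are raised here
    results = list(executor.map(process_image, image_paths))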

Running out of memory
- Process the data in batches
- Use generators instead of loading all the data at once
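
Both points can be combined: read the annotation file lazily and hand records on in fixed-size batches. This is a minimal sketch using only the standard library; `iter_odgt` and `batched` are names chosen for this example, and the batch size is arbitrary.

```python
import json
from itertools import islice


def iter_odgt(odgt_path):
    """Lazily yield one parsed record per non-empty line."""
    with open(odgt_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without materializing the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Usage (path and batch size are placeholders):
# for batch in batched(iter_odgt("annotation_train.odgt"), 500):
#     handle(batch)  # at most 500 records are in memory at a time
```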

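7. Performance Tips

After several rounds of use, these are the techniques I found most helpful for speeding up processing:

Use a faster JSON parser: replace the standard library's json with orjson (a third-party package, installed with pip install orjson)
```
import orjson

data = orjson.loads(line)
```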

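Parallel processing: for large datasets, use multiple processes

from multiprocessing import Pool

# process_func and data_chunks are placeholders for your own worker and inputs
if __name__ == '__main__':  # required where workers are spawned (Windows, macOS)
    with Pool(processes=4) as pool:
        pool.map(process_func, data_chunks)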

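Caching: for repeated operations, cache the results to avoid recomputing them

from functools import lru_cache

@lru_cache(maxsize=1000)
def find_image_cached(images_dir, img_id):
    # images_dir is part of the cache key instead of a hidden global
    return find_image(images_dir, img_id)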
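
A different technique that serves the same goal is to scan the image directory once and keep a dictionary from file stem to path, so that each lookup becomes a dictionary access instead of several file-system checks. This is an alternative to the cache above, not the author's method; `build_image_index` is a hypothetical helper, and it assumes a flat image directory.

```python
import os


def build_image_index(images_dir, exts=(".jpg", ".jpeg", ".png")):
    """Scan images_dir once and map each image's file stem to its full path."""
    index = {}
    with os.scandir(images_dir) as entries:
        for entry in entries:
            stem, ext = os.path.splitext(entry.name)
            if entry.is_file() and ext.lower() in exts:
                index[stem] = entry.path
    return index


# Usage (the directory is a placeholder):
# index = build_image_index("Images")
# img_path = index.get("273271,1017c000ac1360b7")  # None if the image is absent
```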

Progress display: use tqdm to show progress

from tqdm import tqdm

for item in tqdm(items, desc="Processing"):
    process_item(item)

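This pipeline has worked well in my recent pedestrian detection project: processing the full CrowdHuman dataset (about 15 GB) takes under 30 minutes, far less than handling it by hand. The advantage of an automated script is most obvious when you need to rerun experiments with different settings.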
