A Step-by-Step Guide to Reverse-Engineering the Wanfang Data API with Python and BlackboxProtobuf (Full Code Included)

穆晶波

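Python Reverse Engineering in Practice: Black-Box Techniques for Parsing Protobuf Data Without a .proto File

When an API returns Protobuf binary data, the conventional approach is to obtain the .proto definition file first and parse the data against it. In real reverse-engineering work that metadata is often unavailable. This article shows how to use blackboxprotobuf, a tool from the Python ecosystem, to decode Protobuf data whose structure is unknown.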

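1. Groundwork for Protobuf Reverse Engineering

Protobuf (Protocol Buffers) is a binary serialization format developed by Google. Compared with JSON it typically produces smaller payloads that are cheaper to transmit. The same compact binary encoding leaves field names off the wire, which is what makes reverse parsing harder.

Before starting, set up the following environment: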

pip install blackboxprotobuf requests

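Key tools:

blackboxprotobuf: the core reverse-engineering tool; decodes Protobuf without a .proto file
requests: sends the API requests and fetches the raw response bytes
Chrome DevTools: captures and inspects network traffic

For the API under analysis, first use the browser's developer tools to confirm these characteristics:

The response Content-Type contains application/grpc-web+proto or a similar identifier
The response body is unreadable binary data
The request headers may include fields such as grpc-encoding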
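The checklist above can be turned into a quick programmatic test. The following is a minimal sketch, not part of any library: the function name `looks_like_grpc_web` is hypothetical, and it only covers the binary `application/grpc-web` variants, on the assumption that the body starts with a 5-byte frame header (1 flag byte plus a 4-byte big-endian length).

```python
def looks_like_grpc_web(headers: dict, body: bytes) -> bool:
    """Heuristic: does this response look like a binary gRPC-web payload?"""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    media_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if media_type not in ("application/grpc-web", "application/grpc-web+proto"):
        return False
    # A frame header is 5 bytes: 1 flag byte + 4-byte big-endian payload length.
    if len(body) < 5:
        return False
    declared_length = int.from_bytes(body[1:5], "big")
    # The declared payload must fit inside the bytes that were actually received.
    return declared_length <= len(body) - 5
```

A `True` result only says the framing is plausible; it does not prove the payload is Protobuf.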

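2. Capturing and Extracting the Data

2.1 Capturing Traffic in the Browser

In the Network panel of Chrome DevTools, follow these steps:

Filter for XHR requests
Locate the target API endpoint
Inspect the raw data in the Response tab

A typical unreadable response looks like this (hex dump):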

00000000 1a 04 08 01 10 02 1a 0e 0a 0c 48 65 6c 6c 6f 20
00000010 57 6f 72 6c 64 21
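To see what a decoder has to work with, the dump can be walked by hand. Each field starts with a varint key whose low 3 bits are the wire type and whose remaining bits are the field number. The sketch below is a minimal reader for illustration only; the helper names `read_varint` and `iter_fields` are hypothetical, and the deprecated group wire types are not handled.

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7


def iter_fields(buf: bytes):
    """Yield (field_number, wire_type, value) for each top-level field."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:    # fixed 64-bit, returned as raw bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:    # length-delimited: string, bytes or nested message
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:    # fixed 32-bit, returned as raw bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


sample = bytes.fromhex("1a 04 08 01 10 02 1a 0e 0a 0c 48 65 6c 6c 6f 20 57 6f 72 6c 64 21")
for number, wire_type, value in iter_fields(sample):
    # Both top-level entries are field 3 with wire type 2; look one level down.
    print(number, wire_type, list(iter_fields(value)))
```

The first entry of field 3 contains two varints (fields 1 and 2), and the second wraps a length-delimited field 1 holding the ASCII text `Hello World!`.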

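2.2 Preprocessing the Binary Data

Captured data usually carries protocol framing around the Protobuf message, and that framing has to be stripped first:

def clean_protobuf_data(raw_bytes):
    # Heuristic for one common case: a 5-byte frame header in front and a
    # 20-byte trailer at the end. The trailer length is not fixed, so check
    # it against the captured bytes before relying on this slice.
    return raw_bytes[5:-20] if len(raw_bytes) > 25 else raw_bytes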
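A fixed slice is fragile because the trailer length varies. In gRPC-web every frame starts with a flag byte and a 4-byte big-endian length, and a trailer frame has the high bit (0x80) of the flag set. A trailer that carries only `grpc-status:0\r\n` is 5 + 15 = 20 bytes, which may be where the 20-byte figure comes from. The sketch below reads the frame lengths instead of assuming them; the function names are hypothetical, and compressed frames (flag bit 0x01) are returned as-is without decompression.

```python
def split_grpc_web_frames(raw: bytes):
    """Split a gRPC-web body into a list of (is_trailer, payload) tuples."""
    frames = []
    pos = 0
    while pos + 5 <= len(raw):
        flag = raw[pos]
        length = int.from_bytes(raw[pos + 1:pos + 5], "big")
        payload = raw[pos + 5:pos + 5 + length]
        if len(payload) != length:
            raise ValueError("frame is truncated")
        # The high bit of the flag byte marks a trailer frame.
        frames.append((bool(flag & 0x80), payload))
        pos += 5 + length
    if pos != len(raw):
        raise ValueError("unexpected bytes after the last complete frame")
    return frames


def extract_messages(raw: bytes):
    """Return only the data-frame payloads, i.e. the Protobuf messages."""
    return [payload for is_trailer, payload in split_grpc_web_frames(raw) if not is_trailer]
```

Each payload returned by `extract_messages` can then be handed to the decoder without guessing offsets.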

3. Core Techniques with BlackboxProtobuf

3.1 Basic Decoding

The basic pattern for decoding with blackboxprotobuf:

import json

import blackboxprotobuf

def decode_protobuf(binary_data):
    try:
        # protobuf_to_json returns a JSON string plus the inferred type definition
        json_data, message_type = blackboxprotobuf.protobuf_to_json(binary_data)
        return {
            'data': json.loads(json_data),
            'schema': message_type
        }
    except Exception as e:
        print(f"Decoding failed: {e}")
        return None

Typical output:

{
  "data": {
    "1": "value1",
    "2": 123,
    "3": {
      "1": "nested_value"
    }
  },
  "schema": {
    "1": {"type": "bytes", "name": ""},
    "2": {"type": "int", "name": ""},
    "3": {"type": "message", "message_typedef": {...}}
  }
}

3.2 Automatic Field Type Inference

blackboxprotobuf infers field types automatically. Common mappings:

Protobuf type	Inferred type	Python type
string	bytes	str
int32	int	int
message	message	dict
repeated	list	list
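The reason a string can show up as bytes in the table above is that strings, raw bytes, and nested messages all share wire type 2 (length-delimited), so a schema-less decoder has to guess. The sketch below illustrates the kind of guess involved. It is a deliberately simplified illustration with a hypothetical function name, not blackboxprotobuf's actual inference logic.

```python
def guess_length_delimited(payload: bytes) -> str:
    """Rough guess at the content of a length-delimited (wire type 2) field."""
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so it cannot be a Protobuf string.
        return "bytes"
    # Valid UTF-8 made only of printable characters is most likely text;
    # control characters suggest packed numbers or a nested message instead.
    return "string" if text.isprintable() else "bytes"


print(guess_length_delimited(b"Hello World!"))
print(guess_length_delimited(b"\x08\x01\x10\x02"))
```

Because any such guess can be wrong, inferred types are worth checking against what the application actually displays.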

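3.3 Handling Deeply Nested Structures

For deeply nested Protobuf messages, a layer-by-layer decoding strategy can be used:

import base64
import json

import blackboxprotobuf

def deep_decode(binary_data):
    result = {}
    data, schema = blackboxprotobuf.protobuf_to_json(binary_data)
    result.update(json.loads(data))

    # Recursively decode nested messages that were left undecoded.
    # Assumes such a field comes back as a base64 string; adjust the
    # decoding step if your version returns it in another form.
    for field in schema:
        if schema[field]['type'] == 'message':
            nested_data = result.get(field)
            if isinstance(nested_data, str):
                nested_binary = base64.b64decode(nested_data)
                result[field] = deep_decode(nested_binary)
    return result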

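4. Case Study: Reverse-Engineering the Wanfang Data API

4.1 Endpoint Characteristics

Analysis of the Wanfang Data platform's endpoints shows these typical characteristics:

POST requests with Content-Type: application/grpc-web+proto
The request body is a binary header followed by the Protobuf data
The response payload is standard Protobuf encoding

4.2 Complete Implementation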

import requests
import blackboxprotobuf
import json

def build_grpc_request_body(proto_body):
    # gRPC-web frame header: 5 bytes, a 1-byte flag (0x00 = uncompressed data)
    # followed by the 4-byte big-endian payload length
    header = b'\x00' + len(proto_body).to_bytes(4, 'big')
    return header + proto_body

def decode_wanfang_response(response):
    # Strip the gRPC frame header and a possible trailer (heuristic lengths)
    clean_data = response.content[5:-20] if len(response.content) > 25 else response.content
    return blackboxprotobuf.protobuf_to_json(clean_data)

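# Example request
url = "https://s.wanfangdata.com.cn/SearchService.SearchService/search"
headers = {
    "Content-Type": "application/grpc-web+proto",
    "User-Agent": "Mozilla/5.0"
}

# Build the request body (simulating a real request): outer field 1 wraps
# varint field 1 = 1 and string field 2 = "围棋" (6 bytes of UTF-8).
# The length prefixes must match the payload, or the message is malformed.
request_body = build_grpc_request_body(b'\x0a\x0a\x08\x01\x12\x06\xe5\x9b\xb4\xe6\xa3\x8b')

response = requests.post(url, headers=headers, data=request_body, timeout=10)
if response.status_code == 200:
    json_data, message_type = decode_wanfang_response(response)
    print("Decoded result:", json.dumps(json.loads(json_data), indent=2, ensure_ascii=False))
    print("Message structure:", message_type)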
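Hand-written hex payloads are easy to get wrong, because every length prefix has to match the bytes that follow it. A small encoder avoids that. The sketch below builds a payload of the same shape as the sample request (an outer field 1 wrapping a varint field 1 and a string field 2) and frames it for gRPC-web. All helper names are hypothetical, the field layout is the article's illustrative one rather than a confirmed Wanfang schema, and only non-negative integers are supported.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_field(number: int, value: int) -> bytes:
    """Field with wire type 0 (varint)."""
    return encode_varint(number << 3 | 0) + encode_varint(value)


def bytes_field(number: int, payload: bytes) -> bytes:
    """Field with wire type 2 (length-delimited)."""
    return encode_varint(number << 3 | 2) + encode_varint(len(payload)) + payload


def grpc_web_frame(message: bytes) -> bytes:
    """Prefix a message with the 1-byte flag and 4-byte big-endian length."""
    return b"\x00" + len(message).to_bytes(4, "big") + message


# Inner message: field 1 = 1, field 2 = the search keyword as UTF-8.
inner = varint_field(1, 1) + bytes_field(2, "围棋".encode("utf-8"))
body = bytes_field(1, inner)
print(grpc_web_frame(body).hex(" "))
```

Changing the keyword now updates both length prefixes automatically.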

4.3 Interpreting the Result and Mapping Fields

Typical decoded result:

{
  "1": "paper",
  "2": "围棋",
  "3": 1,
  "4": 20,
  "5": 0,
  "6": [0],
  "7": {
    "1": [
      {
        "1": "基于深度学习的围棋局面评估方法研究",
        "2": ["张三", "李四"],
        "3": "2020",
        "4": "计算机学报"
      }
    ]
  }
}

The corresponding message structure:

{
  '1': {'type': 'bytes', 'name': ''},        # searchType
  '2': {'type': 'bytes', 'name': ''},        # searchWord 
  '3': {'type': 'int', 'name': ''},          # currentPage
  '4': {'type': 'int', 'name': ''},          # pageSize
  '5': {'type': 'int', 'name': ''},          # searchScope
  '6': {'type': 'int', 'name': '', 'repeated': True},  # searchFilter
  '7': {'type': 'message', 'message_typedef': {        # results
    '1': {'type': 'message', 'message_typedef': {
      '1': {'type': 'bytes', 'name': ''},    # title
      '2': {'type': 'bytes', 'name': '', 'repeated': True},  # authors
      '3': {'type': 'bytes', 'name': ''},    # year
      '4': {'type': 'bytes', 'name': ''}     # journal
    }, 'repeated': True}}
  }
}
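Once the field numbers have been matched to meanings, the decoded dictionary can be rewritten with readable keys. The sketch below is one way to do that, under these assumptions: the names come from the comments in the structure above, `items` is a made-up label for the repeated message inside field 7, and `rename_fields` is a hypothetical helper. Unmapped field numbers are kept unchanged.

```python
# Mapping from field number to a name, or to (name, nested mapping) for messages.
FIELD_NAMES = {
    "1": "searchType",
    "2": "searchWord",
    "3": "currentPage",
    "4": "pageSize",
    "5": "searchScope",
    "6": "searchFilter",
    "7": ("results", {
        "1": ("items", {"1": "title", "2": "authors", "3": "year", "4": "journal"}),
    }),
}


def rename_fields(value, names):
    """Recursively replace numeric field keys with readable names."""
    if isinstance(value, list):
        return [rename_fields(item, names) for item in value]
    if not isinstance(value, dict):
        return value
    renamed = {}
    for key, item in value.items():
        entry = names.get(key, key)  # unknown fields keep their number
        if isinstance(entry, tuple):
            label, nested_names = entry
            renamed[label] = rename_fields(item, nested_names)
        else:
            renamed[entry] = item
    return renamed


decoded = {"2": "围棋", "3": 1, "7": {"1": [{"1": "Sample title", "2": ["A", "B"]}]}}
print(rename_fields(decoded, FIELD_NAMES))
```

Keeping the mapping as plain data makes it easy to revise when a guess about a field turns out to be wrong.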

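5. Advanced Techniques and Error Handling

5.1 Forcing Field Types

When automatic inference gets a type wrong, supply type hints manually:

type_hints = {
    '1': {'type': 'bytes'},
    '3': {'type': 'int'}
}

# The type definition is passed through the message_type parameter
json_data, message_type = blackboxprotobuf.protobuf_to_json(
    binary_data,
    message_type=type_hints
)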

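5.2 Troubleshooting Common Errors

Symptom	Likely cause	Fix
DecodeError	Data is incomplete or corrupted	Check the slice boundaries and confirm the protocol header and trailer were removed
KeyError	Conflicting field numbers	Supply type_hints to pin the field types
JSON parsing fails	Binary data contains bytes that are not valid text	Base64-encode the data before further processing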
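For the first row of the table, checking the slice boundaries can be partly automated by trying a few header and trailer trims in turn. The sketch below is a minimal helper under stated assumptions: `first_decodable` is a hypothetical name, the candidate trims are guesses to adapt to your capture, and the decoder is passed in as a callable (in practice, something like `blackboxprotobuf.protobuf_to_json`).

```python
def first_decodable(raw: bytes, decode, candidates=((0, 0), (5, 0), (5, 20))):
    """Try several (head, tail) trims; return ((head, tail), result) for the first success."""
    errors = []
    for head, tail in candidates:
        end = len(raw) - tail
        if head >= end:
            continue  # this trim would leave nothing to decode
        try:
            return (head, tail), decode(raw[head:end])
        except Exception as exc:
            errors.append(((head, tail), exc))
    raise ValueError(f"no candidate trim decoded cleanly: {errors}")
```

A schema-less decoder can sometimes accept garbage without raising, so the winning trim should still be checked by eye against the decoded output.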

5.3 Performance Tips

For large volumes of data, the main saving is to infer the message type once and reuse it, so that later messages are decoded against known field types instead of being guessed again:

# Infer the type definition from the first message
first_data, known_types = blackboxprotobuf.protobuf_to_json(binary_data1)

# Reuse it for later messages of the same kind
second_data, known_types = blackboxprotobuf.protobuf_to_json(binary_data2, message_type=known_types)

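6. Engineering Recommendations

For real projects, a layout along these lines is recommended:

protobuf_decoder/
├── __init__.py
├── decoder.py       # core decoding logic
├── schemas/         # known message structures
│   ├── wanfang.json
│   └── types.py
└── utils/
    ├── grpc.py      # gRPC protocol handling
    └── cleanup.py   # data cleanup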

A sketch of a production-style decoder class:

import json
import os

import blackboxprotobuf

class ProtobufDecodeError(Exception):
    """Raised when a payload cannot be decoded."""

class ProtobufDecoder:
    def __init__(self, schema_dir='schemas'):
        self.schemas = self._load_schemas(schema_dir)
        self.type_cache = {}

    def decode(self, binary_data, schema_name=None):
        if schema_name and schema_name in self.schemas:
            typedef = self.schemas[schema_name]
            return blackboxprotobuf.protobuf_to_json(binary_data, typedef)

        # Automatic inference mode
        try:
            return blackboxprotobuf.protobuf_to_json(
                binary_data,
                message_type=self.type_cache
            )
        except Exception as e:
            raise ProtobufDecodeError(f"Decoding failed: {e}") from e

    def _load_schemas(self, dir_path):
        # Each <name>.json file in the directory becomes schema <name>
        schemas = {}
        for file in os.listdir(dir_path):
            if file.endswith('.json'):
                with open(os.path.join(dir_path, file), encoding='utf-8') as f:
                    schemas[file[:-5]] = json.load(f)
        return schemas
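The loader above reads `<name>.json` files, so an inferred type definition has to be written in the same form for it to be picked up later. A minimal sketch follows; `save_schema` is a hypothetical helper, and it assumes the type definition consists only of JSON-serializable values (dicts, lists, strings, numbers, booleans).

```python
import json
import os


def save_schema(name: str, typedef: dict, dir_path: str = "schemas") -> str:
    """Write a type definition to <dir_path>/<name>.json and return the path."""
    os.makedirs(dir_path, exist_ok=True)
    path = os.path.join(dir_path, name + ".json")
    with open(path, "w", encoding="utf-8") as f:
        # Sorted keys keep the file stable, which makes diffs easier to review.
        json.dump(typedef, f, indent=2, ensure_ascii=False, sort_keys=True)
    return path
```

Filling in the `name` entries of the saved file by hand is a convenient place to record what each field number means.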

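7. Security and Ethical Considerations

Before any reverse-engineering work, make sure that you:

Check the target API's robots.txt and Terms of Service
Keep the request rate within reasonable limits
Do not bypass any explicit access restrictions
Use the data only for lawful, compliant analysis

It is also advisable to identify yourself clearly in the request headers:

headers = {
    'User-Agent': 'ResearchBot/1.0 (+https://example.com/bot-info)',
    'X-Requested-With': 'AcademicResearch'
}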
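Keeping the request rate reasonable is easiest to enforce in code. The sketch below is one simple approach, a minimum interval between consecutive requests. The class name is hypothetical, and the right interval depends on the site's terms, so the value in the usage note is only a placeholder. The clock and sleep functions are injectable so the behaviour can be checked without real delays.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum number of seconds between consecutive calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Typical use is to create one limiter, for example `limiter = MinIntervalLimiter(2.0)`, and call `limiter.wait()` immediately before each request.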
