Last week I watched a colleague's Excel crash outright while opening wind-farm SCADA data: a 16 GB CSV file locked up an 8-core, 32 GB workstation completely. That's like catching a fire hose with a coffee filter. Wind-turbine data carries industrial-grade "dirty data" in its DNA: sensor drift, zero-filled gaps from communication dropouts, test records from the commissioning phase mixed into production logs... Today we'll use the Python + PyArrow combo to turn this "raw water" into drinkable "purified water".
Key insight: a wind-farm CSV is not an ordinary spreadsheet. It is a high-frequency, time-stamped sensor log, and regular office software is simply not built for industrial time-series data at this scale.
```python
# Typical invalid values: seven kinds of "bad data" found in one and the same temperature column
['9999', '-999', '0', '1.23e+30', 'NaN', 'NULL', '风机停机']  # the last one is free text: "turbine stopped"
```
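Before the full pipeline, a quick triage trick for a column like this. This is a minimal pandas sketch (the sentinel codes and thresholds are assumptions for illustration): `pd.to_numeric(errors='coerce')` turns any non-numeric token, including the free-text entry, into NaN in a single pass, and sentinel codes can be nulled out afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical raw column mixing numbers, sentinel codes, and free text
raw = pd.Series(['9999', '-999', '0', '1.23e+30', 'NaN', 'NULL', '风机停机'])

temp = pd.to_numeric(raw, errors='coerce')   # non-numeric text -> NaN
temp = temp.replace([9999, -999], np.nan)    # known sentinel codes -> NaN
temp = temp.mask(temp.abs() > 1e6)           # absurd magnitudes -> NaN
print(temp.isna().sum(), "of", len(temp), "values flagged")
```

Note that `'0'` survives this pass on purpose: whether zero is valid depends on physical context, which the range filter further down handles.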
The full cleaning pipeline looks like this:

```mermaid
graph TD
A[Raw CSV] --> B{Memory-mapped read}
B --> C[Type inference + conversion]
C --> D[Invalid-value flagging]
D --> E[Time-series alignment]
E --> F[Physical-range filtering]
F --> G[Parquet output]
```
Step one: stream the CSV in with PyArrow instead of loading it whole.

```python
import pyarrow as pa
import pyarrow.csv as pacsv

# Key parameters: block_size controls how much is read per batch (in bytes),
# null_values maps sentinel strings straight to null during parsing
parse_options = pacsv.ParseOptions(delimiter=',', quote_char='"')
read_options = pacsv.ReadOptions(block_size=1 << 30)  # 1 GiB per block
convert_options = pacsv.ConvertOptions(
    column_types={'wind_speed': pa.float32()},
    null_values=['NA', 'NULL', '-999'],
)

with pacsv.open_csv('wind_data.csv',
                    parse_options=parse_options,
                    read_options=read_options,
                    convert_options=convert_options) as reader:
    for next_chunk in reader:      # each chunk is a pyarrow.RecordBatch
        process_chunk(next_chunk)  # your processing function
```
Two more `ConvertOptions` knobs worth knowing: leave `auto_dict_encode=False` (the default) so categorical-looking string columns are not silently dictionary-encoded, and declare the timestamp format up front via `timestamp_parsers`:

```python
convert_options = pacsv.ConvertOptions(
    # strptime-style format for the timestamp column; drop %z if your
    # timestamps carry no UTC offset
    timestamp_parsers=['%Y/%m/%d %H:%M:%S%z'],
    null_values=['NA', 'NULL', '-999'],
)
```
For outlier detection, a dynamic threshold against a rolling baseline beats fixed limits:

```python
from scipy.ndimage import uniform_filter1d

def dynamic_threshold(df, window=144):
    """24-hour sliding window (10-minute data: 144 points = 1 day)."""
    rolling_mean = uniform_filter1d(df['value'].to_numpy(), size=window)
    residuals = df['value'] - rolling_mean
    std = residuals.std()
    return residuals.abs() > 4 * std  # 4-sigma rule: True marks an outlier
```
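A quick usage sketch on synthetic data (the column name `value` matches the function above; the injected spike is an assumption for demonstration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
t = pd.date_range('2023-01-01', periods=1000, freq='10min')
df = pd.DataFrame({'value': 8 + rng.normal(0, 1, 1000)}, index=t)
df.iloc[500, 0] = 45.0            # inject one obvious spike

outliers = dynamic_threshold(df)  # boolean mask, True = outlier
print(df[outliers])               # should flag the injected spike
```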
Then a hard filter based on the turbine's spec sheet:

```python
def physical_limits(df):
    """Filter on the turbine model's technical limits (values are model-specific)."""
    mask = (
        (df['rotor_speed'] >= 0) & (df['rotor_speed'] <= 18.5) &  # rotor speed range, rpm
        (df['wind_speed'] <= 35) &                                # cut-out wind speed, m/s
        (df['temp_gearbox'] < 120)                                # gearbox alarm threshold, °C
    )
    return df[mask]
```
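Rather than silently discarding rows, it can pay to count what each rule rejects so the numbers feed the quality report below. A minimal sketch of that variant (`physical_limits_with_stats` is a hypothetical helper, same limits as above):

```python
import numpy as np

def physical_limits_with_stats(df):
    """Same filter, but also reports how many rows each rule rejects."""
    rules = {
        'rotor_speed_range': (df['rotor_speed'] >= 0) & (df['rotor_speed'] <= 18.5),
        'cut_out_wind':      df['wind_speed'] <= 35,
        'gearbox_temp':      df['temp_gearbox'] < 120,
    }
    stats = {name: int((~rule).sum()) for name, rule in rules.items()}  # rejects per rule
    mask = np.logical_and.reduce(list(rules.values()))
    return df[mask], stats
```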
How does this stack up against plain Pandas on the same 16 GB file?

| Step | Pandas time | PyArrow time | Peak memory |
|---|---|---|---|
| Read 16 GB CSV | crash | 2 min 18 s | 3.2 GB |
| Type conversion | 6 min 47 s | 1 min 12 s | 16 GB |
| Outlier filtering | 3 min 21 s | 47 s | 9 GB |
| Write Parquet | 4 min 02 s | 38 s | 2.1 GB |
Test environment: Azure D4s v3 VM (4 vCPU, 16 GB RAM), PyArrow 8.0.0.
Every cleaned dataset ships with a machine-readable quality report:

```json
{
  "data_quality": {
    "missing_rate": 0.021,
    "invalid_dropped": 12487,
    "time_range": ["2023-01-01T00:00Z", "2023-03-31T23:50Z"],
    "sampling_rate": "10min"
  },
  "processing_log": [
    {"step": "unit_conversion", "comment": "normalize wind speed to m/s"},
    {"step": "time_alignment", "params": {"method": "linear"}}
  ]
}
```
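One convenient home for this report is the Parquet file's own key-value metadata, so data and provenance travel together. A sketch under that assumption (`write_with_report` and `quality_report` are hypothetical names; the PyArrow calls themselves are standard):

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

def write_with_report(table: pa.Table, path: str, quality_report: dict):
    # Merge the report into the schema-level key-value metadata
    meta = dict(table.schema.metadata or {})
    meta[b'data_quality'] = json.dumps(quality_report).encode()
    pq.write_table(table.replace_schema_metadata(meta), path)

# Reading it back later:
# json.loads(pq.read_schema('output.parquet').metadata[b'data_quality'])
```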
Putting it all together into a streaming pipeline that never holds the full file in memory:

```python
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

def preprocess_pipeline(input_path):
    # Streaming read: note block_size is in bytes, not rows;
    # tune it so each batch lands around the row count you want
    read_options = pacsv.ReadOptions(block_size=64 << 20)  # 64 MB blocks
    reader = pacsv.open_csv(input_path, read_options=read_options)

    # Parquet writer with an explicit output schema
    schema = pa.schema([
        ('timestamp', pa.timestamp('us')),
        ('wind_speed', pa.float32()),
        ('power', pa.float32()),
    ])
    writer = pq.ParquetWriter('output.parquet', schema)

    for batch in reader:
        df = batch.to_pandas()
        # Run every cleaning step from the sections above
        df = (df.pipe(time_align)
                .pipe(unit_convert)
                .pipe(outlier_detect)
                .pipe(physical_limits))
        # Back to an Arrow Table that matches the writer's schema
        table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
        writer.write_table(table)
    writer.close()
```
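The pipeline references helpers such as `time_align`. Here is a minimal sketch of what that step might look like, matching the `{"method": "linear"}` entry in the processing log (assumes a datetime `timestamp` column, all-numeric sensor columns, and a 10-minute target grid):

```python
import pandas as pd

def time_align(df):
    """Snap records onto a regular 10-minute grid, filling small gaps linearly."""
    df = df.set_index('timestamp').sort_index()
    aligned = (df.resample('10min').mean()                 # one row per 10-minute slot
                 .interpolate(method='linear', limit=3))   # bridge gaps up to 30 minutes
    return aligned.reset_index()
```

The `limit=3` cap is deliberate: interpolating across hours-long outages would manufacture data, so longer gaps stay as nulls.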
Once cleaned, querying the data with DuckDB is about 60x faster than scanning the original CSV:
```sql
-- Query a 50 GB Parquet dataset directly, no import step needed
SELECT date_trunc('hour', timestamp) AS hour,
       avg(wind_speed) AS avg_speed
FROM 'wind_data.parquet'
WHERE power > 2000
GROUP BY 1;
```
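The same query runs straight from Python via the `duckdb` package (a minimal sketch; assumes `pip install duckdb`, version 0.8 or newer for `con.sql`):

```python
import duckdb

con = duckdb.connect()
hourly = con.sql("""
    SELECT date_trunc('hour', timestamp) AS hour,
           avg(wind_speed) AS avg_speed
    FROM 'wind_data.parquet'
    WHERE power > 2000
    GROUP BY 1
""").df()   # fetch the result as a pandas DataFrame
print(hourly.head())
```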