E-commerce platforms generate huge volumes of order data every day, and that data is usually scattered across systems: user profiles live in a MySQL table, order details in MongoDB, and payment records in yet another PostgreSQL database. These data silos make analytics genuinely hard.
Last year I worked on a cross-border e-commerce project that had to consolidate user profiles, order behavior, and payment records from three databases into a data warehouse. Hand-written ETL scripts were not only slow to build, they kept producing inconsistent data through field-mapping mistakes. We switched to SeaTunnel's multi-table sync, and configuring it turned out to be far simpler than expected.
The core configuration logic: define multiple source blocks to connect the different data sources, use transform for field mapping and conversion, and write the processed data to the target table in sink. A typical configuration looks like this:
env {
  execution.parallelism = 4
  job.mode = "BATCH"
}
source {
  # User table
  Jdbc {
    url = "jdbc:mysql://user-db:3306/ecommerce"
    query = "SELECT user_id, register_time, vip_level FROM user_profile"
    result_table_name = "user_source"
  }
  # Order table
  MongoDB {
    uri = "mongodb://order-db:27017"
    database = "order_system"
    collection = "orders"
    result_table_name = "order_source"
  }
}
transform {
  # Join users with their orders
  Sql {
    query = """
      SELECT u.user_id, u.vip_level, o.order_id, o.total_amount
      FROM user_source u JOIN order_source o ON u.user_id = o.buyer_id
    """
    result_table_name = "joined_data"
  }
}
sink {
  ClickHouse {
    host = "analytics-db:8123"
    database = "dwh"
    table = "user_order_analysis"
    fields = ["user_id", "vip_level", "order_id", "total_amount"]
  }
}
This configuration does three things: it reads user profiles from MySQL and orders from MongoDB into named intermediate tables, joins them on user_id with a SQL transform, and writes the joined rows into a ClickHouse analytics table.
The most common multi-table sync problem is mismatched field types. I once watched a MySQL datetime column arrive in Elasticsearch as a bare numeric timestamp; the root cause was a missing timezone setting. Specify the timezone explicitly on both the source and the sink:
# MySQL source configuration
Jdbc {
  url = "jdbc:mysql://db:3306/test?serverTimezone=Asia/Shanghai"
}
# Elasticsearch sink configuration
Elasticsearch {
  hosts = ["es:9200"]
  index = "orders"
  timestamp_format = "yyyy-MM-dd HH:mm:ss"
  timezone = "+08:00"
}
Another common issue is field-name case sensitivity. Databases handle identifier case differently, so it pays to normalize names during the transform stage, either by aliasing every column to lowercase or by enabling an option such as field_name_lowercase = true where the connector supports one; see the sketch below.
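A minimal sketch of lowercase normalization using the Sql transform; the upper-case column names are illustrative, not taken from the project above:

transform {
  # Alias upper-case source columns to stable lowercase names
  Sql {
    query = "SELECT USER_ID AS user_id, VIP_LEVEL AS vip_level FROM user_source"
    result_table_name = "lowercased_user"
  }
}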
A smart-home project needed to aggregate device data from different vendors in real time, including air-conditioner sensor readings polled over HTTP and door-lock events streamed through Kafka:
env {
  execution.parallelism = 6
  job.mode = "STREAMING"
  checkpoint.interval = 30000
}
source {
  # HTTP source - air conditioner data
  Http {
    url = "http://ac-api/getSensorData"
    format = "json"
    polling_interval = 5000
    result_table_name = "ac_data"
  }
  # Kafka source - door lock events
  Kafka {
    # bootstrap.servers is required by the Kafka source; the broker address is illustrative
    bootstrap.servers = "kafka:9092"
    topics = "doorlock_events"
    consumer.group.id = "seatunnel_consumer"
    result_table_name = "lock_events"
  }
}
transform {
  # Normalize the device ID format
  Sql {
    query = """
      SELECT
        CONCAT('ac_', device_id) AS device_id,
        event_time,
        'temperature' AS metric_type,
        temp_value AS metric_value
      FROM ac_data
    """
    result_table_name = "normalized_ac"
  }
  # Filter out unexpected event types
  Sql {
    query = "SELECT * FROM lock_events WHERE event_type IN ('unlock', 'lock')"
    result_table_name = "filtered_lock"
  }
}
sink {
  # Write to TDengine
  TDengine {
    url = "jdbc:TAOS://tdengine:6030"
    database = "iot"
    stable = "devices"
    fields = ["device_id", "event_time", "metric_type", "metric_value"]
  }
}
Batched writes: for batch jobs, tuning the sink's batch settings can significantly improve throughput:
Jdbc {
  batch_size = 500
  batch_interval_ms = 1000
}
Parallelism: set execution.parallelism according to the number and partitioning of your sources; a value near the number of source tables (or Kafka partitions) is a sensible starting point, raised only while throughput keeps improving.
Memory tuning: add memory settings in the env block:
env {
  execution.parallelism = 4
  job.mode = "BATCH"
  job.memory.mb = 2048
}
A supermarket-chain scenario needs daily sales records joined against inventory for analysis. This example joins a MySQL sales table with a Hive inventory table and writes the result to StarRocks:
source {
  # MySQL sales data
  Jdbc {
    url = "jdbc:mysql://pos-db:3306/retail"
    query = """
      SELECT
        store_id,
        product_code,
        DATE(sale_time) AS sale_date,
        SUM(quantity) AS daily_sales
      FROM sales
      GROUP BY 1, 2, 3
    """
    result_table_name = "sales_summary"
  }
  # Hive inventory data
  Hive {
    query = """
      SELECT
        store_id,
        sku AS product_code,
        stock_date,
        closing_stock
      FROM inventory_daily
    """
    result_table_name = "inventory"
  }
}
transform {
  Sql {
    query = """
      SELECT
        s.store_id,
        s.product_code,
        s.sale_date,
        s.daily_sales,
        i.closing_stock,
        -- NULLIF avoids a division-by-zero error when closing_stock is 0
        s.daily_sales / NULLIF(i.closing_stock, 0) AS sell_through_rate
      FROM sales_summary s
      JOIN inventory i ON s.store_id = i.store_id
        AND s.product_code = i.product_code
        AND s.sale_date = i.stock_date
    """
    result_table_name = "sales_analysis"
  }
}
sink {
  StarRocks {
    jdbc_url = "jdbc:mysql://starrocks:9030"
    load_url = "starrocks:8030"
    database = "analytics"
    table = "store_performance"
    columns = ["store_id", "product_code", "sale_date", "daily_sales", "closing_stock", "sell_through_rate"]
  }
}
For daily incremental loads, filter on a timestamp column:
Jdbc {
  query = """
    SELECT * FROM sales
    WHERE update_time > '${last_update_time}'
  """
  incremental_column = "update_time"
  incremental_column_type = "timestamp"
}
Run this under a scheduler such as DolphinScheduler, and each execution substitutes ${last_update_time} with the maximum timestamp recorded by the previous sync, as sketched below.
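A sketch of how the scheduler's shell task might inject that variable; SeaTunnel's -i flag performs config variable substitution, while the config path and timestamp value here are illustrative:

# Illustrative DolphinScheduler shell task
./bin/seatunnel.sh --config jobs/sales_incremental.conf \
  -i last_update_time='2025-06-01 00:00:00'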
Syncing between different database types calls for special attention to the following:
Type conversion: numeric precision, date/time, and boolean columns map differently across engines, so cast ambiguous columns explicitly in a transform rather than trusting implicit conversion; see the sketch after this item.
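A minimal sketch of explicit casting with the Sql transform; the column names and target types are illustrative:

transform {
  # Pin down types before the sink instead of relying on implicit conversion
  Sql {
    query = "SELECT CAST(price AS DECIMAL(10, 2)) AS price, CAST(created_at AS TIMESTAMP) AS created_at FROM source_table"
    result_table_name = "typed_data"
  }
}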
Special-character handling:
Jdbc {
  url = "jdbc:oracle:thin:@//oracle:1521/ORCLCDB?escapeProcessing=false"
}
Large-object (LOB) fields:
Jdbc {
  lob_fetch_size = 10240
}
Transaction isolation level:
Jdbc {
  transaction_isolation = "READ_COMMITTED"
  fetch_size = 1000
}
On one project, a PostgreSQL-to-MySQL sync turned a JSONB column on the source into a plain string on the target. The JsonParse transform plugin solved it:
transform {
  JsonParse {
    source_field = "json_data"
    target_field = "parsed_json"
  }
}