A data-warehouse project recently surfaced a classic requirement: syncing business data from MongoDB into ClickHouse in real time for analytics. Traditional approaches are either too slow or too complex to implement. After some research and hands-on testing, I settled on a stable, reliable real-time sync pipeline built on Flink CDC 3.5 and Flink 1.20.
The core advantage of this approach is that it avoids that trade-off: low end-to-end latency with a straightforward implementation.

The pipeline has been running stably in production, handling tens of millions of records per day. Below are the implementation details and the pitfalls I hit along the way.
Key point: Flink CDC's MongoDB connector is built on MongoDB Change Streams, and Change Streams are only available on replica sets and sharded clusters. A standalone MongoDB instance cannot use this approach.
Create the MongoDB config file /data/mongodb/conf/mongod.conf:

```yaml
# The replica set name is required
replication:
  replSetName: rs0
storage:
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 4
  dbPath: /data/db
  journal:
    enabled: true
systemLog:
  destination: file
  logAppend: true
  path: /data/db/mongod.log
net:
  bindIp: 0.0.0.0
  port: 27017
security:
  authorization: enabled
  keyFile: /data/db/mongo-keyfile
```
Generate a keyfile for replica-set internal authentication:

```bash
openssl rand -base64 756 > /data/mongodb/data/mongo-keyfile
chmod 400 /data/mongodb/data/mongo-keyfile
chown 999:999 /data/mongodb/data/mongo-keyfile  # 999 is the mongod user in the official image
```
```bash
docker run -d \
  --name=mongodb \
  -v /data/mongodb/data:/data/db \
  -v /data/mongodb/conf/mongod.conf:/etc/mongod.conf \
  -p 27017:27017 \
  mongo:4.4 \
  mongod --config /etc/mongod.conf
```
```bash
# Initialize the replica set first, via the localhost exception;
# a node started with replSetName rejects writes until rs.initiate() runs
docker exec -it mongodb mongo --eval "rs.initiate()"

# Create the admin user (also via the localhost exception)
docker exec -it mongodb mongo --eval '
db.getSiblingDB("admin").createUser({
  user: "mongoadmin",
  pwd: "YourSecurePassword",
  roles: [{ role: "root", db: "admin" }]
})'

# Set the externally reachable address (important!)
docker exec -it mongodb mongo -u mongoadmin -p YourSecurePassword \
  --authenticationDatabase admin --eval '
var cfg = rs.conf();
cfg.members[0].host = "mongo.example.com:27017";
rs.reconfig(cfg);'
```
```javascript
// Create the access-log collection
db.createCollection("access_logs", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["_id", "c_d", "channel", "device_id", "url", "version"],
      properties: {
        _id: { bsonType: "objectId" },
        user_id: { bsonType: "string" },
        c_d: { bsonType: "date" },
        channel: { bsonType: "string" },
        device_id: { bsonType: "string" },
        ip: { bsonType: "string" },
        remark: { bsonType: "string" },
        trace_id: { bsonType: "string" },
        url: { bsonType: "string" },
        version: { bsonType: "string" }
      }
    }
  }
});

// Create indexes; the TTL index expires documents 14 days after c_d
db.access_logs.createIndex({ user_id: 1 }, { name: "idx_user_id" });
db.access_logs.createIndex({ c_d: 1 }, { name: "idx_c_d", expireAfterSeconds: 1209600 });
```
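As a sanity check, the expireAfterSeconds value of 1209600 works out to exactly 14 days:

```java
import java.time.Duration;

public class TtlCheck {
    public static void main(String[] args) {
        // 1,209,600 seconds / 86,400 seconds per day = 14 days
        System.out.println(Duration.ofSeconds(1_209_600).toDays()); // 14
    }
}
```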
```sql
CREATE TABLE dw.ods_mongo_access_logs
(
    `_id` String COMMENT 'MongoDB document ID',
    `user_id` Nullable(String) COMMENT 'User ID',
    `device_id` String COMMENT 'Device ID',
    `trace_id` Nullable(String) COMMENT 'Request trace ID',
    `channel` String COMMENT 'Channel',
    `version` String COMMENT 'Version',
    `ip` Nullable(String) COMMENT 'Request IP',
    `url` String COMMENT 'Request URL',
    `remark` String COMMENT 'Remark',
    `create_date` Nullable(DateTime('Asia/Shanghai')) COMMENT 'Creation time',
    `sync_time` DateTime DEFAULT now() COMMENT 'Sync timestamp'
)
ENGINE = ReplacingMergeTree(sync_time)
ORDER BY (device_id, create_date, _id)
PARTITION BY toYYYYMMDD(create_date)
SETTINGS allow_nullable_key = 1;
```
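ReplacingMergeTree keeps, for each ORDER BY key, only the row with the largest version column (sync_time in this table). For intuition, here is a plain-Java sketch of that keep-the-latest semantics, with rows modeled as (key, syncTime) pairs; no ClickHouse involved:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplacingMergeDemo {
    // For each key, keep only the entry with the largest syncTime,
    // mirroring what ReplacingMergeTree(sync_time) does at merge time.
    public static Map<String, Long> dedupe(List<Map.Entry<String, Long>> rows) {
        Map<String, Long> latest = new HashMap<>();
        for (Map.Entry<String, Long> row : rows) {
            latest.merge(row.getKey(), row.getValue(), Math::max);
        }
        return latest;
    }

    public static void main(String[] args) {
        // Keys stand in for (device_id, create_date, _id) tuples
        List<Map.Entry<String, Long>> rows = new ArrayList<>(List.of(
                Map.entry("dev1|2024-01-01|a1", 100L),
                Map.entry("dev1|2024-01-01|a1", 200L), // re-synced later: wins
                Map.entry("dev2|2024-01-01|b2", 150L)
        ));
        Map<String, Long> result = dedupe(rows);
        System.out.println(result.get("dev1|2024-01-01|a1")); // 200
        System.out.println(result.size()); // 2
    }
}
```

This is only a mental model: ClickHouse applies it lazily during background merges, not on insert.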
Why this engine: ReplacingMergeTree deduplicates rows that share the same ORDER BY key, keeping the record with the largest sync_time. Note that deduplication happens asynchronously during background merges, so queries that need exact results should use FINAL.

Maven dependencies:

```xml
<properties>
    <flink.version>1.20.1</flink.version>
</properties>
<dependencies>
    <!-- Flink core -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <!-- Flink CDC MongoDB -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-mongodb-cdc</artifactId>
        <version>3.5.0</version>
    </dependency>
    <!-- ClickHouse JDBC -->
    <dependency>
        <groupId>com.clickhouse</groupId>
        <artifactId>clickhouse-jdbc</artifactId>
        <version>0.8.5</version>
        <classifier>all</classifier>
    </dependency>
    <!-- Flink JDBC connector -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-jdbc</artifactId>
        <version>3.3.0-1.20</version>
    </dependency>
</dependencies>
```
```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.cdc.connectors.mongodb.source.MongoDBSource;
import org.apache.flink.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MongoToClickHouseSync {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // single parallelism is recommended for the CDC source

        // 1. Build the MongoDB CDC source
        MongoDBSource<String> mongoSource = MongoDBSource.<String>builder()
                .hosts("mongo.example.com:27017")
                .username("flink_user")
                .password("YourSecurePassword")
                .databaseList("app_data")
                .collectionList("app_data.access_logs")
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        // 2. Define the row type
        RowTypeInfo rowType = new RowTypeInfo(
                Types.STRING, Types.STRING, Types.STRING, Types.STRING,
                Types.STRING, Types.STRING, Types.STRING, Types.STRING,
                Types.STRING, Types.SQL_TIMESTAMP
        );

        // 3. Build the pipeline; list the target columns explicitly so the
        //    11th column, sync_time, keeps its DEFAULT now()
        env.fromSource(mongoSource, WatermarkStrategy.noWatermarks(), "MongoDB CDC Source")
                .flatMap(new MongoDocParser()).returns(rowType)
                .addSink(JdbcSink.sink(
                        "INSERT INTO ods_mongo_access_logs " +
                        "(_id, user_id, device_id, trace_id, channel, version, ip, url, remark, create_date) " +
                        "VALUES (?,?,?,?,?,?,?,?,?,?)",
                        (ps, row) -> {
                            ps.setString(1, (String) row.getField(0));
                            ps.setString(2, (String) row.getField(1));
                            // set the remaining fields...
                        },
                        JdbcExecutionOptions.builder()
                                .withBatchSize(2000)
                                .withBatchIntervalMs(5000)
                                .build(),
                        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                                .withUrl("jdbc:clickhouse://ck.example.com:8123/dw")
                                .withDriverName("com.clickhouse.jdbc.ClickHouseDriver")
                                .build()
                ));

        env.execute("MongoDB to ClickHouse Sync");
    }
}
```
```java
public static class MongoDocParser implements FlatMapFunction<String, Row> {
    private static final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public void flatMap(String value, Collector<Row> out) throws Exception {
        JsonNode root = objectMapper.readTree(value);
        String operationType = root.path("operationType").asText();
        // Only handle insert/update/replace events
        if (!"insert".equals(operationType) && !"update".equals(operationType)
                && !"replace".equals(operationType)) {
            return;
        }
        JsonNode doc = root.path("fullDocument");
        if (doc.isMissingNode()) {
            return;
        }
        // Map the document fields into a Row
        Row row = Row.of(
                getOid(doc, "_id"),
                getString(doc, "user_id")
                // other fields...
        );
        out.collect(row);
    }

    // field-extraction helpers (getOid, getString, ...) omitted
}
```
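The getOid helper is elided above; in the actual job it walks the Jackson tree. For illustration only, here is a self-contained, regex-based sketch (a hypothetical OidExtract class, not part of the job) that pulls the 24-hex-character ObjectId out of the extended-JSON form the change events carry, e.g. `{"$oid": "652f1a2b3c4d5e6f78901234"}`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OidExtract {
    // Matches "$oid": "<24 hex chars>" anywhere in the JSON string
    private static final Pattern OID =
            Pattern.compile("\"\\$oid\"\\s*:\\s*\"([0-9a-fA-F]{24})\"");

    /** Returns the ObjectId hex string, or null if none is present. */
    public static String extractOid(String json) {
        Matcher m = OID.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String doc = "{\"_id\": {\"$oid\": \"652f1a2b3c4d5e6f78901234\"}, \"url\": \"/home\"}";
        System.out.println(extractOid(doc)); // 652f1a2b3c4d5e6f78901234
    }
}
```

A regex is fragile against arbitrary JSON; the Jackson-based version is what belongs in production, but this shows the shape of the data being unwrapped.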
Build and submit the job:

```bash
mvn clean package -P prod -DskipTests
```

```bash
flink run -c com.example.MongoToClickHouseSync target/flink-mongo-to-ck.jar
```
Key metrics to monitor:

- `sourceIdleTime`
- `numRecordsInPerSecond`
- `lastCheckpointDuration`

Performance tuning parameters:
```yaml
# flink-conf.yaml
execution.checkpointing.interval: 30000
execution.checkpointing.timeout: 600000
state.backend.type: rocksdb   # 'state.backend' is deprecated in recent Flink releases
state.checkpoints.dir: hdfs:///flink/checkpoints
```
Error message:

```
The $changeStream stage is only supported on replica sets
```

Solution: the MongoDB instance is not running as a replica set. Verify with `rs.status()` and make sure the replica set was initialized as described above.

Symptom: duplicate rows appear in ClickHouse.

Solution: check that the table uses the ReplacingMergeTree engine, that the ORDER BY key contains a unique identifier, and that the sync_time field is being updated correctly. Remember that ReplacingMergeTree only deduplicates during background merges, so add FINAL to queries that need exact results.

Directions for optimization:
```java
JdbcExecutionOptions.builder()
        .withBatchSize(5000)       // larger batches
        .withBatchIntervalMs(2000) // shorter flush interval
        .build()
```
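JdbcSink flushes a batch when either threshold is hit: the buffer reaches withBatchSize rows, or withBatchIntervalMs has elapsed since the last flush. The interplay of the two knobs can be sketched with a hypothetical BatchBuffer class (an illustration of the flush policy, not Flink's implementation; for simplicity the interval is checked on each add, whereas Flink uses a background timer):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchBuffer<T> {
    private final int batchSize;
    private final long intervalMs;
    private long lastFlush;
    private final List<T> buffer = new ArrayList<>();
    private final List<List<T>> flushed = new ArrayList<>();

    public BatchBuffer(int batchSize, long intervalMs, long now) {
        this.batchSize = batchSize;
        this.intervalMs = intervalMs;
        this.lastFlush = now;
    }

    /** Buffer a record; flush if the size or time threshold is reached. */
    public void add(T record, long now) {
        buffer.add(record);
        if (buffer.size() >= batchSize || now - lastFlush >= intervalMs) {
            flush(now);
        }
    }

    private void flush(long now) {
        if (!buffer.isEmpty()) {
            flushed.add(new ArrayList<>(buffer));
            buffer.clear();
        }
        lastFlush = now;
    }

    public List<List<T>> getFlushed() {
        return flushed;
    }

    public static void main(String[] args) {
        BatchBuffer<Integer> buf = new BatchBuffer<>(3, 1000, 0);
        buf.add(1, 10);
        buf.add(2, 20);
        buf.add(3, 30);   // size threshold reached -> flush
        buf.add(4, 2000); // interval elapsed -> flush
        System.out.println(buf.getFlushed()); // [[1, 2, 3], [4]]
    }
}
```

The trade-off: a larger batch size raises ClickHouse write throughput (fewer, bigger inserts), while a shorter interval caps the latency a record can sit in the buffer.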
This setup has been running stably across several production environments, handling up to a billion records per day at peak. Adjust the sync strategy and the performance parameters above to match your own workload.