In real-time big data processing, Flink has become the de facto standard compute engine. As a core piece of the Flink ecosystem, SQL connectors play the key role of bridging external systems. In real projects, however, we regularly hit scenarios the official connectors cannot cover:

Last year I ran into exactly this on a financial risk-control project: we had to consume encrypted transaction data from Kafka in real time, decrypt it, join it against user profiles in HBase, and finally write the results to an Oracle risk database. No official connector covers that pipeline end to end, so we had to build a custom connector.

A complete SQL connector needs to implement the following interfaces:
```java
import org.apache.flink.table.connector.sink.DynamicTableSink;
import org.apache.flink.table.connector.sink.SinkFunctionProvider;
import org.apache.flink.table.connector.source.ScanTableSource;
import org.apache.flink.table.connector.source.SourceFunctionProvider;
import org.apache.flink.table.factories.DynamicTableSinkFactory;
import org.apache.flink.table.factories.DynamicTableSourceFactory;

// Factory entry point, discovered via Java SPI and matched by identifier
public class CustomTableFactory implements
        DynamicTableSourceFactory,
        DynamicTableSinkFactory {

    // The connector identifier referenced in the WITH clause
    @Override
    public String factoryIdentifier() {
        return "custom";
    }

    // createDynamicTableSource/createDynamicTableSink and the option
    // declarations are covered in the sections below
}

// Source implementation
public class CustomTableSource implements ScanTableSource {
    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext runtimeProviderContext) {
        // 'false' = unbounded, i.e. a streaming source
        return SourceFunctionProvider.of(new CustomSourceFunction(), false);
    }
    // getChangelogMode(), copy() and asSummaryString() omitted for brevity
}

// Sink implementation
public class CustomTableSink implements DynamicTableSink {
    @Override
    public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
        return SinkFunctionProvider.of(new CustomSinkFunction());
    }
    // getChangelogMode(requestedMode), copy() and asSummaryString() omitted
}
```
When Flink processes a CREATE TABLE statement, it matches the factory class by its factoryIdentifier. Note that discovery happens via Java SPI: the connector jar must contain a `META-INF/services/org.apache.flink.table.factories.Factory` file listing the factory's fully qualified class name.
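To make the matching concrete, here is a hedged sketch of the DDL side. The table name, columns, and option values are placeholders for illustration:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Hypothetical DDL wiring up the custom connector
TableEnvironment tableEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());
tableEnv.executeSql(
        "CREATE TABLE encrypted_trades (\n"
      + "  trade_id STRING,\n"
      + "  payload  BYTES,\n"
      + "  ts       TIMESTAMP(3)\n"
      + ") WITH (\n"
      + "  'connector' = 'custom',           -- matches factoryIdentifier()\n"
      + "  'topic' = 'trades.encrypted',\n"
      + "  'decrypt.algorithm' = 'AES'\n"
      + ")");
```

With the factory discoverable and the DDL in place, let's take the encrypted Kafka data source as an example: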
```java
import java.time.Duration;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.table.data.RowData;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DecryptKafkaSource implements SourceFunction<RowData> {

    private volatile boolean isRunning = true;
    private final String topic;
    private final Decryptor decryptor;

    public DecryptKafkaSource(String topic, Decryptor decryptor) {
        this.topic = topic;
        this.decryptor = decryptor;
    }

    @Override
    public void run(SourceContext<RowData> ctx) throws Exception {
        KafkaConsumer<byte[], byte[]> consumer = createConsumer();
        while (isRunning) {
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<byte[], byte[]> record : records) {
                // Decrypt the raw payload, then map it to Flink's internal row type
                byte[] decrypted = decryptor.decrypt(record.value());
                RowData row = convertToRowData(decrypted);
                ctx.collect(row);
            }
        }
        consumer.close();
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
```
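The convertToRowData helper above is left undefined. A minimal sketch, assuming the decrypted payload is a simple comma-separated string with two fields (the format is hypothetical; a real connector would drive this off the table's resolved schema):

```java
import java.nio.charset.StandardCharsets;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;

// Map a decrypted payload onto Flink's internal row representation.
// Assumes a fixed two-column layout, e.g. "tradeId,amount".
private RowData convertToRowData(byte[] decrypted) {
    String payload = new String(decrypted, StandardCharsets.UTF_8);
    String[] fields = payload.split(",", 2);
    GenericRowData row = new GenericRowData(2);
    row.setField(0, StringData.fromString(fields[0]));
    row.setField(1, StringData.fromString(fields[1]));
    return row;
}
```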
Define the configurable options in the TableFactory:
```java
@Override
public Set<ConfigOption<?>> requiredOptions() {
    Set<ConfigOption<?>> options = new HashSet<>();
    options.add(ConfigOptions.key("topic")
            .stringType()
            .noDefaultValue());
    return options;
}

@Override
public Set<ConfigOption<?>> optionalOptions() {
    Set<ConfigOption<?>> options = new HashSet<>();
    options.add(ConfigOptions.key("decrypt.algorithm")
            .stringType()
            .defaultValue("AES"));
    return options;
}
```
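These options are then consumed in the factory's create methods, typically through FactoryUtil's helper, which also validates them. A sketch, assuming the two options above are extracted into `ConfigOption<String>` constants named TOPIC and DECRYPT_ALGORITHM, and that CustomTableSource takes them as constructor arguments:

```java
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.factories.FactoryUtil;

@Override
public DynamicTableSource createDynamicTableSource(Context context) {
    FactoryUtil.TableFactoryHelper helper =
            FactoryUtil.createTableFactoryHelper(this, context);
    // Rejects unknown keys and enforces the required options declared above
    helper.validate();
    String topic = helper.getOptions().get(TOPIC);
    String algorithm = helper.getOptions().get(DECRYPT_ALGORITHM);
    return new CustomTableSource(topic, algorithm);
}
```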
For an RDBMS like Oracle, implement batched writes in the sink:
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.table.data.RowData;

public class OracleSink extends RichSinkFunction<RowData> {

    // jdbcUrl, user, password and insertSql are assumed to be supplied
    // through the constructor (omitted) from connector options
    private String jdbcUrl, user, password, insertSql;

    private transient Connection connection;
    private transient PreparedStatement stmt;
    private int batchSize = 1000;
    private int batchCount = 0;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(jdbcUrl, user, password);
        stmt = connection.prepareStatement(insertSql);
    }

    @Override
    public void invoke(RowData value, Context context) throws Exception {
        bindParameters(stmt, value);
        stmt.addBatch();
        // Flush once the batch threshold is reached
        if (++batchCount >= batchSize) {
            stmt.executeBatch();
            batchCount = 0;
        }
    }

    @Override
    public void close() throws Exception {
        // Flush any remainder before shutting down
        if (batchCount > 0) {
            stmt.executeBatch();
        }
        stmt.close();
        connection.close();
    }
}
```
Buffered but unflushed batches are lost on failure, so the sink should also implement CheckpointedFunction to snapshot its pending rows:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

public class OracleSink extends RichSinkFunction<RowData>
        implements CheckpointedFunction {

    private transient ListState<RowData> checkpointedState;
    private final List<RowData> pendingRows = new ArrayList<>();

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedState.clear();
        // Save the rows of the not-yet-committed batch
        for (RowData row : pendingRows) {
            checkpointedState.add(row);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // Restore buffered rows from the checkpoint
        checkpointedState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("buffered-rows", RowData.class));
        if (context.isRestored()) {
            for (RowData row : checkpointedState.get()) {
                pendingRows.add(row);
            }
        }
    }
}
```
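An alternative worth considering: rather than snapshotting buffered rows, flush the JDBC batch whenever a checkpoint is taken, so a completed checkpoint implies the rows are already in Oracle (at-least-once on replay). A sketch of that variant's snapshotState:

```java
// Variant: flush the open batch on every checkpoint instead of storing
// pending rows in operator state.
@Override
public void snapshotState(FunctionSnapshotContext context) throws Exception {
    if (batchCount > 0) {
        stmt.executeBatch();
        batchCount = 0;
    }
}
```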
Source parallelism: usually kept in line with the number of Kafka partitions; for JDBC-style sources, the scan can be split into parallel queries via a split-size option.

Sink parallelism: for a database with primary-key constraints such as Oracle, force upsert materialization so out-of-order changelog events cannot corrupt the result:
```sql
-- For databases with primary-key constraints like Oracle
SET 'table.exec.sink.upsert-materialize' = 'FORCE';
```
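Beyond the session-level SQL setting, a custom connector can expose sink parallelism directly: SinkFunctionProvider has an overload that accepts an explicit parallelism. A sketch (the hard-coded 4 stands in for a value read from a ConfigOption):

```java
@Override
public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
    // The second argument pins the sink operator's parallelism
    return SinkFunctionProvider.of(new CustomSinkFunction(), 4);
}
```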
Network buffer tuning:

```yaml
taskmanager.network.memory.fraction: 0.2
taskmanager.network.memory.max: 1gb
```
A typical configuration on YARN:

```yaml
# Slots per TaskManager
taskmanager.numberOfTaskSlots: 4
# Total process memory and JVM overhead
taskmanager.memory.process.size: 8192m
taskmanager.memory.jvm-overhead.min: 512m
```
Symptom: ClassNotFoundException or NoSuchMethodError

Solutions:

Make sure dependencies the Flink runtime already provides use the provided scope:
```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-api-java-bridge</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
```
Use child-first class loading (Flink's default), set in flink-conf.yaml:

```yaml
classloader.resolve-order: child-first
```
Diagnostic commands:

```bash
# Locate the running job on the YARN cluster; backpressure itself
# is then inspected per-vertex in the job's Web UI
flink list -m yarn-cluster -yid application_123456789
```
Optimizations:

Tune the output buffer flush interval (setBufferTimeout trades latency for throughput on the path to the sink):

```java
env.setBufferTimeout(100);
```
Adjust the checkpoint interval:

```java
env.enableCheckpointing(30000, CheckpointingMode.EXACTLY_ONCE);
```
Packaging with the maven-shade-plugin is recommended. One pitfall: shading must merge the SPI files under META-INF/services, or Flink will no longer discover the factory, so add the ServicesResourceTransformer:

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <!-- Merge META-INF/services files so factory discovery keeps working -->
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>
```
For production, enable ZooKeeper-based high availability:

```yaml
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181
high-availability.storageDir: hdfs:///flink/ha/
```
Version compatibility notes:

| Flink version | Migration notes |
|---|---|
| 1.13.x | Uses the TableFactory-based stack |
| 1.14.x | Adds DynamicTableSink#getChangelogMode |
| 1.15.x | Supports the SupportsRowLevelDelete interface |
During a recent upgrade from 1.14 to 1.15, we found that the serialization of DataType had changed, which broke checkpoint restores. The fix was to override snapshotState in CustomSourceFunction and handle the type information explicitly.