In real-time big data computing, Flink has become the de facto standard engine. Flink SQL, its high-level abstraction, is widely used for real-time ETL and streaming analytics thanks to its declarative programming model and low learning curve. In real projects, however, we often run into cases that the standard connectors cannot handle:
I hit exactly this on a financial risk-control project last year. We needed to consume transaction data from Kafka in real time, but the message bodies contained deeply nested JSON and had to be decrypted before use. The standard Kafka connector could not handle this out of the box, and a custom connector ultimately solved the problem cleanly.
A complete custom connector needs to implement the following key interfaces (a usage sketch follows the list):

- DynamicTableSourceFactory: matched against declarations like 'connector' = 'custom' in the DDL and responsible for creating the source
- DynamicTableSource: describes a table that can be read at runtime
- ScanTableSource: the scanning variant that supplies the actual runtime reader
- SerializationSchema / DeserializationSchema: handle (de)serialization at the format boundary
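For orientation, here is a minimal sketch of how such a connector is declared and used from Flink SQL. The table name, columns, and connection options are hypothetical; only the 'connector' key is what actually selects the factory:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CustomConnectorUsage {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // 'connector' = 'jdbc' selects the factory whose
        // factoryIdentifier() returns "jdbc" (implemented below)
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  id INT," +
                "  amount DECIMAL(10, 2)" +
                ") WITH (" +
                "  'connector' = 'jdbc'," +
                "  'url' = 'jdbc:mysql://localhost:3306/shop'," +  // hypothetical
                "  'table-name' = 'orders'" +
                ")");

        tEnv.executeSql("SELECT * FROM orders").print();
    }
}
```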
When Flink SQL executes a query, the custom connector goes through the following flow:

```
// Pseudocode of the core flow
SQL parsing → Factory creates the TableSource →
RuntimeProvider is generated → converted to a DataStream →
actual computation runs → results are emitted
```
First, add the required dependencies to pom.xml:
```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-api-java-bridge_2.12</artifactId>
    <version>1.14.4</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-jdbc_2.12</artifactId>
    <version>1.14.4</version>
</dependency>
```
Note: all Flink artifacts must share the same version; mixing versions leads to runtime errors. Managing them through a single `<flink.version>` Maven property is a simple safeguard.
```java
public class JdbcDynamicTableFactory implements DynamicTableSourceFactory {

    private static final String IDENTIFIER = "jdbc";

    @Override
    public DynamicTableSource createDynamicTableSource(Context context) {
        FactoryUtil.TableFactoryHelper helper =
                FactoryUtil.createTableFactoryHelper(this, context);
        // Validate the supplied options against the declared ones
        helper.validate();
        // Parse the JDBC connection parameters
        JdbcOptions options = new JdbcOptions(helper.getOptions());
        // Obtain the resolved table schema
        ResolvedSchema schema = context.getCatalogTable().getResolvedSchema();
        return new JdbcDynamicTableSource(options, schema);
    }

    @Override
    public String factoryIdentifier() {
        return IDENTIFIER;
    }

    @Override
    public Set<ConfigOption<?>> requiredOptions() {
        Set<ConfigOption<?>> options = new HashSet<>();
        options.add(JdbcOptions.URL);
        options.add(JdbcOptions.TABLE_NAME);
        return options;
    }

    @Override
    public Set<ConfigOption<?>> optionalOptions() {
        return Collections.emptySet();
    }
}
```
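One step that is easy to forget: Flink discovers table factories via Java SPI, so the factory class must be registered in a service file on the classpath, or the 'connector' = 'jdbc' declaration will fail with a "factory not found" error. The package name below is a placeholder:

```
# src/main/resources/META-INF/services/org.apache.flink.table.factories.Factory
com.example.connector.JdbcDynamicTableFactory
```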
```java
public class JdbcDynamicTableSource implements ScanTableSource {

    private final JdbcOptions options;
    private final ResolvedSchema schema;

    public JdbcDynamicTableSource(JdbcOptions options, ResolvedSchema schema) {
        this.options = options;
        this.schema = schema;
    }

    @Override
    public ChangelogMode getChangelogMode() {
        // A plain JDBC scan only produces inserts
        return ChangelogMode.insertOnly();
    }

    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext runtimeProviderContext) {
        // InputFormatProvider declares a bounded (batch) source;
        // this connector does not support unbounded streaming reads
        return InputFormatProvider.of(
                new JdbcRowDataInputFormat(
                        options,
                        schema.getColumnNames(),
                        schema.getColumnDataTypes()));
    }

    @Override
    public DynamicTableSource copy() {
        return new JdbcDynamicTableSource(options, schema);
    }

    @Override
    public String asSummaryString() {
        return "JDBC table source";
    }
}
```
```java
public class JdbcRowDataInputFormat extends AbstractJdbcInputFormat<RowData> {

    // Convert one external Row into Flink's internal RowData.
    // Note: GenericRowData expects internal data structures,
    // e.g. StringData rather than String (see the sketch below).
    @Override
    protected RowData convert(Row externalRow) throws IOException {
        GenericRowData rowData = new GenericRowData(externalRow.getArity());
        for (int i = 0; i < externalRow.getArity(); i++) {
            rowData.setField(i, externalRow.getField(i));
        }
        return rowData;
    }
}
```
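The plain field copy above only works for types whose external and internal representations coincide (e.g. Integer). Strings and timestamps must be wrapped in Flink's internal data structures, so a more faithful per-field conversion looks like this; the helper class is hypothetical, not part of the connector above:

```java
import java.time.LocalDateTime;
import org.apache.flink.table.data.StringData;
import org.apache.flink.table.data.TimestampData;

public final class InternalConverters {

    // Map a JDBC-provided Java object to Flink's internal representation
    public static Object toInternal(Object value) {
        if (value instanceof String) {
            // Internal strings are StringData, not java.lang.String
            return StringData.fromString((String) value);
        }
        if (value instanceof LocalDateTime) {
            // Internal timestamps are TimestampData
            return TimestampData.fromLocalDateTime((LocalDateTime) value);
        }
        // Integer, Long, Double, Boolean etc. are used as-is
        return value;
    }
}
```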
For large tables, incremental reads significantly improve performance:

```java
// Incremental logic inside the InputFormat: only read rows whose
// partition column falls into the window (lastValue, currentValue]
String query = String.format(
        "SELECT * FROM %s WHERE %s > ? AND %s <= ?",
        tableName, partitionColumn, partitionColumn);
ps = connection.prepareStatement(query);
ps.setObject(1, lastValue);
ps.setObject(2, currentValue);
```
Parallel reads can be implemented via a shard key (see the split-based sketch after the snippet):

```java
// Build one query per parallel subtask; `parallelism` would come from
// connector options (ScanContext itself does not expose a parallelism)
for (int i = 0; i < parallelism; i++) {
    String shardQuery = String.format(
            "SELECT * FROM %s WHERE MOD(%s, %d) = %d",
            tableName, shardKey, parallelism, i);
    // create an independent InputFormat query (or input split) per subtask
}
```
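In the InputFormat world, the idiomatic way to express this is through input splits: the format announces one split per shard, and Flink assigns splits to parallel subtasks. A minimal sketch of the two methods involved, assuming they live inside JdbcRowDataInputFormat and that the shard count comes from connector options:

```java
import org.apache.flink.core.io.GenericInputSplit;
import org.apache.flink.core.io.InputSplit;

// Inside JdbcRowDataInputFormat:
@Override
public InputSplit[] createInputSplits(int minNumSplits) {
    int numShards = Math.max(minNumSplits, 4); // e.g. from connector options
    InputSplit[] splits = new InputSplit[numShards];
    for (int i = 0; i < numShards; i++) {
        splits[i] = new GenericInputSplit(i, numShards);
    }
    return splits;
}

@Override
public void open(InputSplit split) {
    int shard = ((GenericInputSplit) split).getSplitNumber();
    // Each subtask opens its own connection and runs the
    // MOD-based shard query for its split number
}
```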
Handle the mapping from SQL types to Flink types:

```java
// java.sql.Types exposes int constants, so the parameter is an int, not an enum
private DataType fromJdbcType(int sqlType, int precision, int scale) {
    switch (sqlType) {
        case Types.VARCHAR:   return DataTypes.STRING();
        case Types.INTEGER:   return DataTypes.INT();
        case Types.DECIMAL:   return DataTypes.DECIMAL(precision, scale);
        case Types.TIMESTAMP: return DataTypes.TIMESTAMP(3);
        // other types...
        default:
            throw new UnsupportedOperationException("Unsupported JDBC type: " + sqlType);
    }
}
```
Connection pooling is configured through additional connector options:

```sql
-- add to the connector's WITH (...) clause
'connection.pool.size' = '5',
'connection.max.idle.time' = '5min',
'validation.query' = 'SELECT 1'
```

Important: the pool must have a sensible maximum size, or it can exhaust the database's connection limit.
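Inside the connector, these options would typically be forwarded to a pooling library. A minimal sketch, assuming HikariCP as the pool implementation (an assumption; it is not among the dependencies shown above):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public final class PooledConnections {

    public static HikariDataSource createPool(String jdbcUrl) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        // 'connection.pool.size' -> hard cap on concurrent connections
        config.setMaximumPoolSize(5);
        // 'connection.max.idle.time' -> 5 minutes, in milliseconds
        config.setIdleTimeout(5 * 60 * 1000L);
        // 'validation.query' -> used to probe connection health
        config.setConnectionTestQuery("SELECT 1");
        return new HikariDataSource(config);
    }
}
```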
| Parameter | Suggested value | Description |
|---|---|---|
| fetch.size | 5000 | Rows fetched from the database per round trip |
| auto.commit | false | Disable auto-commit |
| socket.timeout | 300s | Network timeout |
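Applied to a plain JDBC connection, these parameters map to standard driver calls; the socket timeout is usually a driver-specific URL parameter or property, so the comment below is illustrative only:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public final class JdbcTuning {

    public static PreparedStatement prepareTunedStatement(
            String url, String query) throws SQLException {
        // socket.timeout: many drivers take it on the URL,
        // e.g. "...?socketTimeout=300000" for MySQL (illustrative)
        Connection connection = DriverManager.getConnection(url);
        // auto.commit = false: avoid one implicit transaction per statement
        connection.setAutoCommit(false);
        PreparedStatement ps = connection.prepareStatement(query);
        // fetch.size = 5000: stream results in chunks instead of
        // materializing the whole result set in memory
        ps.setFetchSize(5000);
        return ps;
    }
}
```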
Implement robust retry logic:

```java
// abstract: the remaining InputFormat methods are omitted in this sketch
public abstract class RetryableInputFormat extends RichInputFormat<RowData, InputSplit> {

    private static final int MAX_RETRIES = 3;
    private static final long RETRY_DELAY_MS = 1000L;

    @Override
    public void openInputFormat() throws IOException {
        int attempt = 0;
        while (true) {
            try {
                // try to establish the connection
                break;
            } catch (Exception e) {
                if (++attempt > MAX_RETRIES) {
                    throw new IOException("Giving up after " + MAX_RETRIES + " retries", e);
                }
                try {
                    // back off linearly: 1s, 2s, 3s
                    Thread.sleep(RETRY_DELAY_MS * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted while waiting to retry", ie);
                }
            }
        }
    }
}
```
```java
public class JdbcConnectorTest {

    @Test
    public void testTableFactory() {
        Map<String, String> options = new HashMap<>();
        options.put("connector", "jdbc");
        options.put("url", "jdbc:derby:memory:test");
        options.put("table-name", "t");
        ResolvedSchema schema = ResolvedSchema.of(Column.physical("id", DataTypes.INT()));
        // FactoryUtil discovers the factory registered under the "jdbc" identifier
        DynamicTableSource source = FactoryUtil.createTableSource(
                null, // no catalog needed
                ObjectIdentifier.of("default_catalog", "default_database", "t"),
                new ResolvedCatalogTable(
                        CatalogTable.of(Schema.newBuilder().fromResolvedSchema(schema).build(),
                                null, Collections.emptyList(), options),
                        schema),
                new Configuration(),
                Thread.currentThread().getContextClassLoader(),
                false);
        assertTrue(source instanceof JdbcDynamicTableSource);
    }
}
```
Use Testcontainers for tests against a real database (an end-to-end sketch follows the skeleton):

```java
@Testcontainers
public class JdbcConnectorITCase {

    @Container
    public static PostgreSQLContainer<?> postgres =
            new PostgreSQLContainer<>("postgres:13");

    @Test
    public void testEndToEnd() throws Exception {
        // create a test table and insert data
        // run the Flink SQL query
        // verify the results
    }
}
```
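Filling in that skeleton, the test wires the container's JDBC URL into the connector's DDL. The Testcontainers and TableEnvironment calls are real APIs; the table and its contents are hypothetical, and a production connector would also need username/password options, which are omitted here:

```java
// inside testEndToEnd(), continuing the skeleton above
try (Connection conn = DriverManager.getConnection(
             postgres.getJdbcUrl(), postgres.getUsername(), postgres.getPassword());
     Statement stmt = conn.createStatement()) {
    stmt.execute("CREATE TABLE t (id INT, name VARCHAR(32))");
    stmt.execute("INSERT INTO t VALUES (1, 'a'), (2, 'b')");
}

TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inBatchMode().build());
tEnv.executeSql(
        "CREATE TABLE t (id INT, name STRING) WITH (" +
        "  'connector' = 'jdbc'," +
        "  'url' = '" + postgres.getJdbcUrl() + "'," +
        "  'table-name' = 't')");

// collect() returns a CloseableIterator<Row> over the query result
try (CloseableIterator<Row> it = tEnv.executeSql("SELECT * FROM t").collect()) {
    List<Row> rows = new ArrayList<>();
    it.forEachRemaining(rows::add);
    assertEquals(2, rows.size());
}
```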
It is also worth monitoring a few key runtime indicators: records read per second (the counter example further below shows how to expose this), query latency, active versus idle connections in the pool, and the number of connection retries.
Use maven-shade-plugin to resolve dependency conflicts; relocating the conflicting packages is what actually avoids the clash:

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.4</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
                <relocations>
                    <!-- illustrative: relocate a conflicting package,
                         adjust the pattern to your actual conflict -->
                    <relocation>
                        <pattern>com.google.common</pattern>
                        <shadedPattern>com.example.shaded.guava</shadedPattern>
                    </relocation>
                </relocations>
            </configuration>
        </execution>
    </executions>
</plugin>
```
Support overriding parameters at runtime (a per-query override is shown after the class):

```java
public class JdbcOptions implements Serializable {

    public static final ConfigOption<String> URL =
            ConfigOptions.key("url").stringType().noDefaultValue();
    public static final ConfigOption<String> TABLE_NAME =
            ConfigOptions.key("table-name").stringType().noDefaultValue();

    private final String url;
    private final String tableName;

    public JdbcOptions(ReadableConfig config) {
        this.url = config.get(URL);
        this.tableName = config.get(TABLE_NAME);
    }
}
```
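With options declared this way, Flink SQL's dynamic table options feature lets a single query override them via the OPTIONS hint; on Flink versions where the feature is off by default, it must first be enabled (the replica URL below is hypothetical):

```java
// enable dynamic table options, then override 'url' for this one query only
tEnv.getConfig().getConfiguration()
        .setBoolean("table.dynamic-table-options.enabled", true);
tEnv.executeSql(
        "SELECT * FROM orders /*+ OPTIONS('url' = 'jdbc:mysql://replica:3306/shop') */");
```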
Expose custom metrics through the MetricGroup:

```java
// abstract: the remaining InputFormat methods are omitted in this sketch
public abstract class JdbcInputFormat extends RichInputFormat<RowData, InputSplit> {

    private transient Counter recordCounter;

    @Override
    public void openInputFormat() {
        // register the counter once per task, under the operator's metric group
        this.recordCounter = getRuntimeContext()
                .getMetricGroup()
                .counter("jdbc.records.read");
    }

    @Override
    public RowData nextRecord(RowData reuse) throws IOException {
        recordCounter.inc();
        // actual read logic...
        return reuse;
    }
}
```
In a real project, we used a custom connector to ingest Oracle CDC data into Flink in real time, sustaining about 100,000 records/second with end-to-end latency under 500 ms. The keys were a well-designed partitioning strategy and batched commits, which avoid hammering the database with many small transactions.