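In real-world data warehouse work, we often need to match latitude/longitude coordinates to geographic regions, for example determining the city or administrative division that a GPS coordinate falls in. The specific scenario I ran into this time was:
Our company's data warehouse stores a large amount of business data that carries latitude/longitude information (user locations, device locations, and so on), and these points need to be matched to the corresponding city codes. The city boundary data, however, lives in a Greenplum database and is stored as the PostGIS geometry type (specifically MULTIPOLYGON). This raised several technical challenges: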
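After some research and evaluation, I designed two complementary approaches: a spatial grid approach and an H3 approach.
Core idea:
Advantages:
Core idea:
Advantages: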
First, the geometry data in PostGIS has to be converted into a format the data warehouse can process:
```sql
-- Export the data from Greenplum
SELECT
    code,
    name,
    full_name,
    encode(ST_AsBinary(geom), 'base64') AS geom_base64
FROM public.base_geography;
```
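Here ST_AsBinary converts the geometry to WKB, and encode(..., 'base64') then turns the WKB bytes into a Base64 string so that the binary data survives the transfer intact.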
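To make the encoding step concrete, the following is a minimal sketch that decodes such a Base64 string and reads the WKB header using only the Java standard library. `WkbHeader` and `geometryType` are hypothetical names introduced for illustration; the UDFs in this article use a JTS `WKBReader` instead. It assumes plain 2D WKB as produced by `ST_AsBinary`, where type code 6 denotes a MultiPolygon.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

/** Hypothetical helper: peeks at the header of a Base64-encoded WKB value. */
final class WkbHeader {
    /** Returns the WKB geometry type code (e.g. 3 = Polygon, 6 = MultiPolygon). */
    static int geometryType(String geomBase64) {
        // The MIME decoder ignores the line breaks that encode(..., 'base64') may insert.
        byte[] wkb = Base64.getMimeDecoder().decode(geomBase64);
        ByteBuffer buf = ByteBuffer.wrap(wkb);
        // Byte 0 is the byte-order flag: 0 = big-endian (XDR), 1 = little-endian (NDR).
        buf.order(buf.get() == 1 ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        // Bytes 1-4 hold the geometry type as a 32-bit integer.
        return buf.getInt();
    }

    public static void main(String[] args) {
        // Little-endian MultiPolygon with zero polygons: 01 | 06 00 00 00 | 00 00 00 00
        byte[] emptyMultiPolygon = {1, 6, 0, 0, 0, 0, 0, 0, 0};
        String encoded = Base64.getEncoder().encodeToString(emptyMultiPolygon);
        System.out.println(geometryType(encoded)); // prints 6
    }
}
```

A check like this can be a cheap sanity test on exported rows before they reach the heavier geometry parsing.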
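```java
// Imports are an assumption: the UDTF/Binary/forward API matches the MaxCompute (ODPS) SDK.
import com.aliyun.odps.data.Binary;
import com.aliyun.odps.udf.UDFException;
import com.aliyun.odps.udf.UDTF;
import org.apache.commons.codec.binary.Base64;
import org.locationtech.jts.geom.Envelope;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKBReader;

public class PolygonToRowGrids extends UDTF {
    private final double gridSize = 0.1; // grid cell size in degrees
    private final WKBReader wkbReader = new WKBReader();

    @Override
    public void process(Object[] args) throws UDFException {
        String geomBase64 = (String) args[0];
        String spatialId = String.valueOf(args[1]); // region code, passed as the second argument
        byte[] wkbBytes = Base64.decodeBase64(geomBase64);
        Geometry geom;
        try {
            geom = wkbReader.read(wkbBytes);
        } catch (ParseException e) {
            throw new UDFException("Invalid WKB: " + e.getMessage());
        }
        // Bounding box of the geometry in longitude/latitude
        Envelope env = geom.getEnvelopeInternal();
        double minX = env.getMinX();
        double maxX = env.getMaxX();
        double minY = env.getMinY();
        double maxY = env.getMaxY();
        // Range of grid cells covered by the bounding box
        int minCol = (int) Math.floor((minX + 180) / gridSize);
        int maxCol = (int) Math.floor((maxX + 180) / gridSize);
        int minRow = (int) Math.floor((minY + 90) / gridSize);
        int maxRow = (int) Math.floor((maxY + 90) / gridSize);
        // Emit one row per covered cell
        for (int row = minRow; row <= maxRow; row++) {
            for (int col = minCol; col <= maxCol; col++) {
                String gridId = row + "_" + col;
                forward(gridId, spatialId, new Binary(wkbBytes), minX, maxX, minY, maxY);
            }
        }
    }
}
```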
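```java
import com.aliyun.odps.udf.UDF; // assumed SDK, see PolygonToRowGrids

public class PointToRowGrid extends UDF {
    private final double gridSize = 0.1; // must match the cell size used in PolygonToRowGrids

    public String evaluate(Double lon, Double lat) {
        if (lon == null || lat == null) {
            return null;
        }
        int col = (int) Math.floor((lon + 180) / gridSize);
        int row = (int) Math.floor((lat + 90) / gridSize);
        return row + "_" + col;
    }
}
```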
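To make the grid mapping concrete, here is a minimal stand-alone sketch of the same formula applied to the sample point used in the queries later in this article. `GridIdDemo` is a hypothetical name; the arithmetic simply mirrors `PointToRowGrid` without the UDF framework.

```java
/** Hypothetical stand-alone version of the grid-ID formula, for illustration only. */
final class GridIdDemo {
    static final double GRID_SIZE = 0.1; // cell size in degrees, as in PointToRowGrid

    static String gridId(double lon, double lat) {
        // Shift longitude by 180 and latitude by 90 so that indices are non-negative.
        int col = (int) Math.floor((lon + 180) / GRID_SIZE);
        int row = (int) Math.floor((lat + 90) / GRID_SIZE);
        return row + "_" + col;
    }

    public static void main(String[] args) {
        System.out.println(gridId(130.511909, 45.20249)); // prints 1352_3105
    }
}
```

Because the polygon side and the point side use the same formula and the same cell size, a point and any polygon whose bounding box covers that point end up with the same `grid_id`, which is what makes the equality join possible.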
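```java
// Imports are an assumption: MaxCompute (ODPS) SDK, H3 Java 4.x, JTS, Commons Codec.
import com.aliyun.odps.data.Binary;
import com.aliyun.odps.udf.UDFException;
import com.aliyun.odps.udf.UDTF;
import com.uber.h3core.H3Core;
import com.uber.h3core.util.LatLng;
import java.util.List;
import org.apache.commons.codec.binary.Base64;
import org.locationtech.jts.geom.Envelope;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKBReader;
public class PolygonToH3 extends UDTF {
    private final int resolution = 8; // H3 resolution level
    private final WKBReader wkbReader = new WKBReader();
    private H3Core h3; // assumed to be created via H3Core.newInstance() in setup(), not shown
    public void process(Object[] args) throws UDFException {
        String geomBase64 = (String) args[0];
        String spatialId = String.valueOf(args[1]); // region code, passed as the second argument
        byte[] wkbBytes = Base64.decodeBase64(geomBase64);
        Geometry geom;
        try {
            geom = wkbReader.read(wkbBytes);
        } catch (ParseException e) {
            throw new UDFException("Invalid WKB: " + e.getMessage());
        }
        // Outer-ring coordinates of the polygon (extractOuterCoords is a helper not shown here)
        List<LatLng> outerCoords = extractOuterCoords(geom);
        // H3 cells covering the polygon; H3 selects cells whose centers fall inside it
        List<String> cells = h3.polygonToCellAddresses(outerCoords, null, resolution);
        Envelope env = geom.getEnvelopeInternal();
        for (String cell : cells) {
            forward(cell, spatialId, new Binary(wkbBytes),
                    env.getMinX(), env.getMaxX(), env.getMinY(), env.getMaxY());
        }
    }
}
```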
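```java
import com.aliyun.odps.udf.UDF; // assumed SDK, see PolygonToRowGrids
import com.uber.h3core.H3Core;
public class PointToH3 extends UDF {
    private final int resolution = 8; // must match the resolution used in PolygonToH3
    private H3Core h3; // assumed to be created via H3Core.newInstance() in setup(), not shown
    public String evaluate(Double lon, Double lat) {
        if (lon == null || lat == null) return null;
        // H3 takes latitude first, then longitude
        return h3.latLngToCellAddress(lat, lon, resolution);
    }
}
```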
Whichever approach is used, the final step is always an exact point-in-polygon containment check:
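For readers unfamiliar with what a containment check does, the sketch below shows the classic ray-casting test for a single polygon ring in plain Java. It is an illustration only and not the method used in this article, which delegates to the geometry library's `contains`; `RayCastingDemo` and the sample square are made up, and holes and multi-part geometries are deliberately left out.

```java
/** Illustrative ray-casting test for one polygon ring; not the library algorithm used by the UDF. */
final class RayCastingDemo {
    /**
     * @param ring vertices as {lon, lat} pairs; the ring may be open or closed
     * @return true if the point lies inside; points exactly on the boundary are not handled reliably
     */
    static boolean contains(double[][] ring, double lon, double lat) {
        boolean inside = false;
        for (int i = 0, j = ring.length - 1; i < ring.length; j = i++) {
            double xi = ring[i][0], yi = ring[i][1];
            double xj = ring[j][0], yj = ring[j][1];
            // Does the edge (j -> i) straddle the horizontal line through the point?
            boolean straddles = (yi > lat) != (yj > lat);
            // If so, and the crossing lies to the right of the point, flip the inside state.
            if (straddles && lon < (xj - xi) * (lat - yi) / (yj - yi) + xi) {
                inside = !inside;
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        double[][] square = {{130.0, 45.0}, {131.0, 45.0}, {131.0, 46.0}, {130.0, 46.0}};
        System.out.println(contains(square, 130.511909, 45.20249)); // prints true
        System.out.println(contains(square, 129.5, 45.2));          // prints false
    }
}
```

The cost of this test grows with the number of vertices, which is why both approaches first narrow the candidates down with a cheap cell-ID join and only then run the exact check.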
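```java
import com.aliyun.odps.data.Binary; // assumed SDK, see PolygonToRowGrids
import com.aliyun.odps.udf.UDF;
import org.locationtech.jts.geom.*;
import org.locationtech.jts.io.*;
public class STContainsPointBinary extends UDF {
    private final WKBReader wkbReader = new WKBReader();
    private final GeometryFactory geometryFactory = new GeometryFactory();
    public Boolean evaluate(Binary geomWkb, Double lon, Double lat) throws ParseException {
        if (geomWkb == null || lon == null || lat == null) return null;
        Geometry geom = wkbReader.read(geomWkb.data());
        // JTS coordinates are (x, y) = (longitude, latitude)
        Point point = geometryFactory.createPoint(new Coordinate(lon, lat));
        return geom.contains(point); // false for points exactly on the boundary
    }
}
```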
```sql
-- Spatial grid approach
ADD JAR geometry-1.0-SNAPSHOT-jar-with-dependencies.jar;
CREATE FUNCTION polygon_to_rowgrids AS 'com.udtf.PolygonToRowGrids' USING 'geometry-1.0-SNAPSHOT-jar-with-dependencies.jar';
CREATE FUNCTION point_to_rowgrid AS 'com.udf.PointToRowGrid' USING 'geometry-1.0-SNAPSHOT-jar-with-dependencies.jar';
-- H3 approach
CREATE FUNCTION polygon_to_h3 AS 'com.udtf.PolygonToH3' USING 'geometry-1.0-SNAPSHOT-jar-with-dependencies.jar';
CREATE FUNCTION point_to_h3 AS 'com.udf.PointToH3' USING 'geometry-1.0-SNAPSHOT-jar-with-dependencies.jar';
-- Spatial relationship check
CREATE FUNCTION st_contains_point_binary AS 'com.udf.STContainsPointBinary' USING 'geometry-1.0-SNAPSHOT-jar-with-dependencies.jar';
```
```sql
-- Spatial grid approach: join on grid_id first, then run the exact check
SELECT
    a.*, st_contains_point_binary(a.geom_wkb, b.lon, b.lat) AS is_inside
FROM (
    SELECT src.*, grid_id, spatial_id, geom_wkb
    FROM dwd.test src
    LATERAL VIEW polygon_to_rowgrids(geom_base64, code) t AS grid_id, spatial_id, geom_wkb, min_x, max_x, min_y, max_y
) a
JOIN (
    SELECT 130.511909 AS lon, 45.20249 AS lat,
           point_to_rowgrid(130.511909, 45.20249) AS grid_id
) b ON a.grid_id = b.grid_id;
```
```sql
-- H3 approach: join on the H3 cell ID first, then run the exact check
SELECT
    a.*, st_contains_point_binary(a.geom_wkb, b.lon, b.lat) AS is_inside
FROM (
    SELECT src.*, grid_id, spatial_id, geom_wkb
    FROM dwd.test src
    LATERAL VIEW polygon_to_h3(geom_base64, code) t AS grid_id, spatial_id, geom_wkb, min_x, max_x, min_y, max_y
) a
JOIN (
    -- lon/lat are numeric literals so they match the Double parameters of the UDF
    SELECT 130.511909 AS lon, 45.20249 AS lat,
           point_to_h3(130.511909, 45.20249) AS grid_id
) b ON a.grid_id = b.grid_id;
```
Spatial grid: the cell size directly affects query performance and precision
H3 grid: choosing the right resolution is critical
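To give a feel for the cell-size trade-off in the spatial grid approach, the sketch below counts how many grid rows a single bounding box expands into at different cell sizes, using the same floor-based formula as `PolygonToRowGrids`. `GridCellCount` and the bounding box values are made up for illustration.

```java
/** Hypothetical helper: counts the grid cells a bounding box expands into for a given cell size. */
final class GridCellCount {
    static long cellCount(double minX, double maxX, double minY, double maxY, double gridSize) {
        long cols = (long) Math.floor((maxX + 180) / gridSize)
                  - (long) Math.floor((minX + 180) / gridSize) + 1;
        long rows = (long) Math.floor((maxY + 90) / gridSize)
                  - (long) Math.floor((minY + 90) / gridSize) + 1;
        return cols * rows;
    }

    public static void main(String[] args) {
        // Made-up bounding box of roughly 1.2 x 0.9 degrees.
        double minX = 130.053, maxX = 131.257, minY = 45.052, maxY = 45.958;
        for (double size : new double[] {0.5, 0.1, 0.01}) {
            System.out.println(size + " -> " + cellCount(minX, maxX, minY, maxY, size));
        }
        // prints 0.5 -> 6, then 0.1 -> 130, then 0.01 -> 11011
    }
}
```

Smaller cells mean fewer candidate polygons per point at join time, but every polygon is duplicated into many more rows, each carrying the full WKB payload. A larger cell does the opposite, so the right value depends on the size of the regions and the volume of points.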
Coordinate order:
Coordinate system consistency:
Geometry validity:
Performance bottlenecks:
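On the coordinate-order point: the UDFs in this article pass (lon, lat) to the geometry library but (lat, lon) to H3, so a swapped pair is easy to introduce. The following is a minimal sketch of a guard that could run before the matching step. `CoordinateGuard` and both method names are hypothetical and not part of the original UDFs, and the swap heuristic only catches pairs whose longitude magnitude exceeds 90 degrees.

```java
/** Hypothetical guard against out-of-range or swapped coordinates; not part of the original UDFs. */
final class CoordinateGuard {
    /** Returns true when (lon, lat) is a plausible longitude/latitude pair in degrees. */
    static boolean isValidLonLat(Double lon, Double lat) {
        if (lon == null || lat == null || lon.isNaN() || lat.isNaN()) {
            return false;
        }
        return lon >= -180 && lon <= 180 && lat >= -90 && lat <= 90;
    }

    /** Heuristic: the pair is invalid as given but would be valid with the values swapped. */
    static boolean looksSwapped(Double lon, Double lat) {
        return !isValidLonLat(lon, lat) && isValidLonLat(lat, lon);
    }

    public static void main(String[] args) {
        System.out.println(isValidLonLat(130.511909, 45.20249)); // prints true
        System.out.println(looksSwapped(45.20249, 130.511909));  // prints true
    }
}
```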
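| Dimension | Spatial grid approach | H3 approach |
|---|---|---|
| Implementation complexity | Simple | Moderate |
| Cell shape | Rectangular | Hexagonal |
| Cell distortion | Severe at high latitudes | Nearly uniform worldwide |
| Accuracy | Depends on cell size | Depends on resolution |
| Use cases | Simple geometries, small areas | Complex geometries, global data |
| Performance | Moderate | Relatively high |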
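The "cell distortion" row can be illustrated with a little arithmetic: a grid cell of fixed size in degrees keeps its north-south height but its east-west width shrinks with the cosine of the latitude. The sketch below assumes a spherical approximation with about 111.32 km per degree of longitude at the equator; `GridCellWidth` is a hypothetical name.

```java
/** Illustrates how the east-west width of a fixed-degree grid cell shrinks with latitude. */
final class GridCellWidth {
    // Approximate length of one degree of longitude at the equator, in kilometres.
    static final double KM_PER_DEGREE_AT_EQUATOR = 111.32;

    static double widthKm(double gridSizeDegrees, double latitudeDegrees) {
        return gridSizeDegrees * KM_PER_DEGREE_AT_EQUATOR
                * Math.cos(Math.toRadians(latitudeDegrees));
    }

    public static void main(String[] args) {
        for (double lat : new double[] {0, 30, 45, 60}) {
            System.out.println("lat " + lat + ": " + widthKm(0.1, lat) + " km");
        }
    }
}
```

Under this approximation a 0.1 degree cell is roughly 11.1 km wide at the equator and about half that at 60 degrees latitude, so the same `gridSize` yields noticeably different cell areas across regions.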
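Recommendations:
Beyond city-code matching, this approach can also be applied to:
Over the course of the implementation, I came away with the following lessons:
This project drove home for me that spatial computation is a distinct and important area of data processing. With well-designed indexing and computation schemes, we can efficiently handle join and analysis workloads over massive spatial datasets.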