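In day-to-day data analysis we often need to count every possible combination of dimensions, for example the number of students for each class and each blood type, even when some combinations never occur in the actual data. The usual approaches are either to write complex subqueries or to pre-generate the base data in application code, and neither is elegant or efficient. This post covers a seriously underrated tool in Hive, CROSS JOIN, which lets you solve this kind of problem with a single compact SQL statement.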
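Most data analysts know LEFT JOIN and INNER JOIN inside out but steer clear of CROSS JOIN, because its Cartesian product is usually seen as a performance killer. In the specific case of full-combination statistics, however, where the dimension sets being crossed are small, CROSS JOIN turns out to be the most efficient solution.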
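Consider this requirement: count the students for every combination of class and blood type (A/B/C/D) across the whole school, including combinations with zero students. The traditional approach usually requires either nested subqueries or application code that pre-generates every combination before the counts can be attached.
With CROSS JOIN, the whole process collapses into a single SQL statement: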
```sql
SELECT
    classes.class,
    blood_types.blood,
    COUNT(students.id) AS student_count
FROM
    (SELECT DISTINCT class FROM students) classes
CROSS JOIN
    (SELECT 'A' AS blood UNION ALL
     SELECT 'B' UNION ALL
     SELECT 'C' UNION ALL
     SELECT 'D') blood_types
LEFT JOIN
    students ON students.class = classes.class
            AND students.blood = blood_types.blood
GROUP BY
    classes.class, blood_types.blood
```
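What makes this query work: the CROSS JOIN of the two subqueries produces one row for every class and blood type pair, the LEFT JOIN attaches the matching students to each pair, and COUNT(students.id) ignores the NULLs left by unmatched pairs, so empty combinations come out as 0.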
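One detail in the query above is easy to get wrong: the aggregate is `COUNT(students.id)`, not `COUNT(*)`. The sketch below puts the two side by side on the same join, assuming `students.id` is never NULL for a real student; the aliases `matched_students` and `joined_rows` are illustrative names, not part of the original query.

```sql
SELECT
    classes.class,
    blood_types.blood,
    -- Counts only rows where the LEFT JOIN found a student: 0 for an empty combination
    COUNT(students.id) AS matched_students,
    -- Counts every joined row, including the NULL-padded row that the LEFT JOIN
    -- keeps for an unmatched combination: 1 instead of 0 for an empty combination
    COUNT(*) AS joined_rows
FROM
    (SELECT DISTINCT class FROM students) classes
CROSS JOIN
    (SELECT 'A' AS blood UNION ALL
     SELECT 'B' AS blood UNION ALL
     SELECT 'C' AS blood UNION ALL
     SELECT 'D' AS blood) blood_types
LEFT JOIN
    students ON students.class = classes.class
            AND students.blood = blood_types.blood
GROUP BY
    classes.class, blood_types.blood;
```

For combinations that do have students the two columns agree; they differ only on the empty combinations, which is exactly where the report needs a 0.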
Let's walk through a complete example to show the technique in practice. Suppose we have a student table, student_blood, with the following structure:
| id | name | class | blood |
|---|---|---|---|
| 1 | 张三 | 1 | A |
| 2 | 李四 | 2 | C |
| 3 | 王五 | 1 | B |
| 4 | 黄六 | 3 | D |
| 5 | 朱八 | 2 | C |
```sql
CREATE TABLE student_blood (
    id INT,
    name STRING,
    class STRING,
    blood STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Insert sample data
INSERT INTO TABLE student_blood VALUES
    (1, '张三', '1', 'A'),
    (2, '李四', '2', 'C'),
    (3, '王五', '1', 'B'),
    (4, '黄六', '3', 'D'),
    (5, '朱八', '2', 'C');
```
```sql
SELECT
    b.class,
    a.blood,
    COUNT(s.id) AS num
FROM
    (SELECT 'A' AS blood UNION ALL
     SELECT 'B' UNION ALL
     SELECT 'C' UNION ALL
     SELECT 'D') a
CROSS JOIN
    (SELECT DISTINCT class FROM student_blood) b
LEFT JOIN
    student_blood s ON a.blood = s.blood AND b.class = s.class
GROUP BY
    b.class, a.blood
ORDER BY
    b.class, a.blood;
```
The result contains every possible combination:
| class | blood | num |
|---|---|---|
| 1 | A | 1 |
| 1 | B | 1 |
| 1 | C | 0 |
| 1 | D | 0 |
| 2 | A | 0 |
| 2 | B | 0 |
| 2 | C | 2 |
| 2 | D | 0 |
| 3 | A | 0 |
| 3 | B | 0 |
| 3 | C | 0 |
| 3 | D | 1 |
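A quick sanity check on a result like this is that its row count should equal the number of distinct classes times the number of blood types, which is 3 × 4 = 12 for the sample data. A minimal sketch of that check against `student_blood`, with the 4 hard-coded to match the blood type list in the query (the alias `expected_rows` is illustrative):

```sql
-- Expected size of the class x blood grid; 4 is the number of hard-coded blood types
SELECT
    COUNT(DISTINCT class) * 4 AS expected_rows
FROM
    student_blood;
```

If the report returns fewer rows than this, some combinations were dropped somewhere after the CROSS JOIN, typically by a filter on the left-joined table.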
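Tip: when a dimension has many values, consider creating a dedicated dimension table instead of hard-coding them, which makes the SQL easier to maintain.
In real projects the dimension values may not be fixed. The set of blood types, for example, could change over time, in which case we need a more dynamic solution.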
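```sql
-- Create the blood type dimension table (assumed to be refreshed as values change).
-- The outer DISTINCT removes the duplicates that UNION ALL would otherwise keep.
CREATE TABLE dim_blood_type AS
SELECT DISTINCT blood
FROM (
    SELECT blood FROM student_blood
    UNION ALL SELECT 'A' AS blood
    UNION ALL SELECT 'B' AS blood
    UNION ALL SELECT 'C' AS blood
    UNION ALL SELECT 'D' AS blood
) t;

-- Dynamic-dimension solution
SELECT
    b.class,
    d.blood,
    COUNT(s.id) AS num
FROM
    (SELECT DISTINCT blood FROM dim_blood_type) d
CROSS JOIN
    (SELECT DISTINCT class FROM student_blood) b
LEFT JOIN
    student_blood s ON d.blood = s.blood AND b.class = s.class
GROUP BY
    b.class, d.blood;
```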
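The strength of CROSS JOIN is that it extends easily to more dimensions. For example, to count the three-way combinations of class, blood type, and gender: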
```sql
-- Assumes student_blood also has a gender column holding 'M' / 'F';
-- the table definition shown earlier does not include one.
SELECT
    b.class,
    d.blood,
    g.gender,
    COUNT(s.id) AS num
FROM
    (SELECT DISTINCT blood FROM dim_blood_type) d
CROSS JOIN
    (SELECT DISTINCT class FROM student_blood) b
CROSS JOIN
    (SELECT 'M' AS gender UNION ALL SELECT 'F') g
LEFT JOIN
    student_blood s ON d.blood = s.blood
                   AND b.class = s.class
                   AND g.gender = s.gender
GROUP BY
    b.class, d.blood, g.gender;
```
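The three-dimension query joins on `s.gender`, a column that the `student_blood` definition earlier in this post does not contain. One possible way to make the example runnable is to add the column first; the column name and the 'M' / 'F' coding are assumptions taken from the query itself, not from any real schema.

```sql
-- Hypothetical schema change: add the gender column the three-way query joins on
ALTER TABLE student_blood ADD COLUMNS (gender STRING COMMENT 'assumed coding: M or F');
```

Rows written before the change read back with a NULL gender, so they match no combination and every count stays at 0 until the column is populated.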
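Although CROSS JOIN is very efficient for combination statistics, the number of rows it generates is the product of the dimension sizes, so performance still deserves attention. A rough comparison with the alternatives: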
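| Approach | SQL complexity | Execution time | Maintainability |
|---|---|---|---|
| Generate combinations in application code | High | Medium | Low |
| Nested subqueries | Very high | High | Low |
| CROSS JOIN | Low | Low | High |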
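Because the size of the combination grid is the product of the dimension sizes, it can be worth estimating that product before running the full query. The sketch below does so for the `student_blood` and `dim_blood_type` tables used earlier; the aliases `class_cnt`, `blood_cnt`, and `combo_rows` are illustrative. The commented-out `SET` line is an assumption about your environment: recent Hive versions have a `hive.strict.checks.cartesian.product` setting that, when enabled, rejects Cartesian products, and older versions control the same behavior through `hive.mapred.mode`.

```sql
-- Only needed if the cluster rejects Cartesian products under strict checks
-- (assumption about your configuration; leave commented out otherwise):
-- SET hive.strict.checks.cartesian.product=false;

-- Estimate how many rows the class x blood grid will contain.
-- Each subquery returns a single row, so this CROSS JOIN itself yields one row.
SELECT
    c.class_cnt * d.blood_cnt AS combo_rows
FROM
    (SELECT COUNT(DISTINCT class) AS class_cnt FROM student_blood) c
CROSS JOIN
    (SELECT COUNT(DISTINCT blood) AS blood_cnt FROM dim_blood_type) d;
```

If `combo_rows` comes back in the millions, the grid is probably no longer a small dimension product, and the low execution time in the table above should not be taken for granted.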
Suppose we need to analyze sales for every combination of product color and size, even combinations that were never purchased:
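```sql
-- Color and size dimension tables
CREATE TABLE dim_colors AS SELECT DISTINCT color FROM products;
CREATE TABLE dim_sizes AS SELECT DISTINCT size FROM products;

-- Combination analysis
SELECT
    c.color,
    s.size,
    COUNT(o.product_id) AS sales_count,
    -- SUM over no rows is NULL, so report 0 for combinations with no sales
    COALESCE(SUM(o.amount), 0) AS total_sales
FROM
    dim_colors c
CROSS JOIN
    dim_sizes s
LEFT JOIN
    -- The date range belongs in the ON clause: in a WHERE clause it would
    -- discard the NULL-padded rows of combinations that were never purchased
    order_items o ON o.color = c.color
                 AND o.size = s.size
                 AND o.order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    c.color, s.size;
```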
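This query shows clearly which color and size combinations are popular and which attract no buyers at all, giving inventory management something direct to act on.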
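To pull out just the combinations nobody bought, the same join can be filtered with `HAVING`. This is a sketch that assumes the `dim_colors`, `dim_sizes`, and `order_items` tables from the example above and that `order_items.product_id` is never NULL for a real order line.

```sql
-- List color/size combinations with no sales in 2023
SELECT
    c.color,
    s.size
FROM
    dim_colors c
CROSS JOIN
    dim_sizes s
LEFT JOIN
    -- Keep the date range in the ON clause so unmatched combinations
    -- survive the LEFT JOIN instead of being filtered away
    order_items o ON o.color = c.color
                 AND o.size = s.size
                 AND o.order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    c.color, s.size
HAVING
    COUNT(o.product_id) = 0;
```

A combination appears in this list either because it was never ordered at all or because all of its orders fall outside the date range.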