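When I first started doing data analysis in SQL, I often wondered why a simple grouped count needed a detour through window functions. It only clicked when I was working with e-commerce order data: when you need to show detail rows and group-level aggregates side by side, a plain GROUP BY collapses each group into a single row and drops the non-grouped columns, whereas a window function keeps every original row. A practical example, counting each user's orders: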
```sql
/* Traditional GROUP BY version */
SELECT user_id, COUNT(*) AS order_count
FROM orders
GROUP BY user_id;

/* Window function version */
SELECT
    user_id,
    order_id,
    COUNT(*) OVER (PARTITION BY user_id) AS order_count
FROM orders;
```
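The core difference between the two: the GROUP BY version returns one row per user, while the window function version returns one row per order, with the user's total repeated on each row.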
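To make the difference in result shape concrete, here is a minimal sketch that runs both queries against a throwaway in-memory SQLite database through Python's built-in sqlite3 module. The four sample rows are made up for illustration, and the window query assumes SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical sample data: three orders for user 1, one order for user 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany(
    "INSERT INTO orders (order_id, user_id) VALUES (?, ?)",
    [(101, 1), (102, 1), (103, 1), (201, 2)],
)

# GROUP BY collapses each user's orders into a single summary row.
grouped = conn.execute(
    "SELECT user_id, COUNT(*) AS order_count "
    "FROM orders GROUP BY user_id ORDER BY user_id"
).fetchall()

# The window function keeps every order row and repeats the per-user count.
windowed = conn.execute(
    "SELECT user_id, order_id, "
    "       COUNT(*) OVER (PARTITION BY user_id) AS order_count "
    "FROM orders ORDER BY order_id"
).fetchall()

print(len(grouped), "rows from GROUP BY")          # one row per user
print(len(windowed), "rows from the window query") # one row per order
```

The GROUP BY result has as many rows as there are users, while the window result has as many rows as the table itself.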
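What I actually did on MySQL 5.7, which has no window functions (they arrived in MySQL 8.0), was join the table back to a grouped subquery:
```sql
SELECT
    a.user_id,
    a.order_id,
    b.order_count
FROM orders a
JOIN (
    SELECT user_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY user_id
) b ON a.user_id = b.user_id;
```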
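Note: on large tables, index the join column, otherwise performance degrades sharply. I once saw a query on a 2-million-row table jump from 0.5 s to 28 s; after adding the index it came back down to 0.7 s.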
In an Oracle project I wrote it as a correlated subquery:
```sql
SELECT
    user_id,
    order_id,
    (SELECT COUNT(*) FROM orders b
     WHERE b.user_id = a.user_id) AS order_count
FROM orders a;
```
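The self-join and correlated-subquery rewrites should return exactly the rows the window function returns. The sketch below checks that on a small made-up data set, using an in-memory SQLite database as a stand-in for MySQL or Oracle. The sample rows are assumptions, and the window query needs SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER)")
# Hypothetical sample: users 1, 2 and 3 with 2, 1 and 3 orders respectively.
conn.executemany(
    "INSERT INTO orders (order_id, user_id) VALUES (?, ?)",
    [(101, 1), (102, 1), (201, 2), (301, 3), (302, 3), (303, 3)],
)

JOIN_SQL = """
SELECT a.user_id, a.order_id, b.order_count
FROM orders a
JOIN (SELECT user_id, COUNT(*) AS order_count
      FROM orders GROUP BY user_id) b ON a.user_id = b.user_id
ORDER BY a.order_id
"""

SUBQUERY_SQL = """
SELECT a.user_id, a.order_id,
       (SELECT COUNT(*) FROM orders b WHERE b.user_id = a.user_id) AS order_count
FROM orders a
ORDER BY a.order_id
"""

WINDOW_SQL = """
SELECT user_id, order_id,
       COUNT(*) OVER (PARTITION BY user_id) AS order_count
FROM orders
ORDER BY order_id
"""

join_rows = conn.execute(JOIN_SQL).fetchall()
subquery_rows = conn.execute(SUBQUERY_SQL).fetchall()
window_rows = conn.execute(WINDOW_SQL).fetchall()

# All three formulations should agree row for row.
print(join_rows == subquery_rows == window_rows)
```

Equal results say nothing about equal cost; the three forms can be planned very differently by each engine.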
Performance comparison (1 million rows):
| Approach | Execution time | Memory usage |
|---|---|---|
| Self-join | 1.2 s | 1.5 GB |
| Subquery | 3.8 s | 2.1 GB |
| Window function | 0.8 s | 0.9 GB |
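An optimized version I used on Alibaba Cloud RDS, based on LATERAL (available in PostgreSQL, and in MySQL from 8.0.14):
```sql
SELECT
    a.*,
    cnt.order_count
FROM orders a,
LATERAL (
    SELECT COUNT(*) AS order_count
    FROM orders b
    WHERE b.user_id = a.user_id
) cnt;
```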
A reporting requirement from a real logistics system, with per-day and per-customer shipment counts on every row:
```sql
SELECT
    a.ship_date,
    a.customer_id,
    a.tracking_no,
    daily_orders.order_count AS daily_orders,
    customer_orders.order_count AS total_orders
FROM shipments a
JOIN (
    SELECT ship_date, COUNT(*) AS order_count
    FROM shipments
    GROUP BY ship_date
) daily_orders ON a.ship_date = daily_orders.ship_date
JOIN (
    SELECT customer_id, COUNT(*) AS order_count
    FROM shipments
    GROUP BY customer_id
) customer_orders ON a.customer_id = customer_orders.customer_id;
```
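Where window functions are available, the two grouped joins can be replaced by two windows with different partitions. This is a sketch of that alternative, not the query used in the original system. The four shipments are made-up sample data, run in an in-memory SQLite database (3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shipments ("
    "tracking_no TEXT PRIMARY KEY, ship_date TEXT, customer_id INTEGER)"
)
# Hypothetical sample: two shipping days, customers 10 and 11.
conn.executemany(
    "INSERT INTO shipments (tracking_no, ship_date, customer_id) VALUES (?, ?, ?)",
    [
        ("T1", "2023-01-01", 10),
        ("T2", "2023-01-01", 11),
        ("T3", "2023-01-02", 10),
        ("T4", "2023-01-02", 10),
    ],
)

# One window per dimension: shipments per day and shipments per customer.
shipment_rows = conn.execute(
    """
    SELECT ship_date, customer_id, tracking_no,
           COUNT(*) OVER (PARTITION BY ship_date)   AS daily_orders,
           COUNT(*) OVER (PARTITION BY customer_id) AS total_orders
    FROM shipments
    ORDER BY tracking_no
    """
).fetchall()

for row in shipment_rows:
    print(row)
```

Each additional dimension costs one more OVER clause here instead of one more derived table and join.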
Counting each user's abnormal orders:
```sql
SELECT
    a.user_id,
    a.order_id,
    abnormal_cnt.abnormal_count
FROM orders a
JOIN (
    SELECT
        user_id,
        SUM(CASE WHEN status = 'abnormal' THEN 1 ELSE 0 END) AS abnormal_count
    FROM orders
    GROUP BY user_id
) abnormal_cnt ON a.user_id = abnormal_cnt.user_id;
```
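Indexing strategy: index the columns you join and group on (user_id in the examples above).
Check the execution plan:
```sql
EXPLAIN SELECT ... -- MySQL
EXPLAIN ANALYZE SELECT ... -- PostgreSQL
```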
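The effect of an index on the plan can be observed on a small scale. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in for the MySQL and PostgreSQL commands, whose output formats differ. The table contents and the index name idx_orders_user_id are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER)")
# Hypothetical data: 1000 orders spread over 100 users.
conn.executemany(
    "INSERT INTO orders (order_id, user_id) VALUES (?, ?)",
    [(i, i % 100) for i in range(1, 1001)],
)


def plan(sql, params=()):
    """Return the human-readable 'detail' column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


query = "SELECT COUNT(*) FROM orders WHERE user_id = ?"

plan_before = plan(query, (42,))  # no index on user_id yet: full table scan
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = plan(query, (42,))   # the planner can now search the index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan lines varies between SQLite versions, so the thing to look for is whether the index name shows up after the index is created.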
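Processing large data sets in slices:
```sql
/* Process in batches by time range; the half-open range also covers timestamps during January 31 */
WHERE create_time >= '2023-01-01' AND create_time < '2023-02-01'
```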
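When create_time carries a time of day, a half-open range is a safer batch boundary than BETWEEN with an end date, because a filter ending at '2023-01-31' stops at midnight and skips the rest of that day. The sketch below walks month-sized batches and accumulates per-user counts. The month_ranges helper, the sample rows, and the text timestamp format are all assumptions for illustration.

```python
import sqlite3
from datetime import date


def month_ranges(start, end):
    """Yield half-open (month_start, next_month_start) pairs from start up to end."""
    current = date(start.year, start.month, 1)
    while current < end:
        if current.month == 12:
            following = date(current.year + 1, 1, 1)
        else:
            following = date(current.year, current.month + 1, 1)
        yield current, following
        current = following


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, create_time TEXT)"
)
conn.executemany(
    "INSERT INTO orders (order_id, user_id, create_time) VALUES (?, ?, ?)",
    [
        (1, 1, "2023-01-15 10:00:00"),
        (2, 1, "2023-01-31 23:59:59"),  # last second of January
        (3, 2, "2023-02-01 00:00:00"),  # first second of February
        (4, 2, "2023-03-20 08:30:00"),
    ],
)

# Accumulate per-user counts one month at a time.
totals = {}
for lower, upper in month_ranges(date(2023, 1, 1), date(2023, 4, 1)):
    batch = conn.execute(
        "SELECT user_id, COUNT(*) FROM orders "
        "WHERE create_time >= ? AND create_time < ? GROUP BY user_id",
        (lower.isoformat(), upper.isoformat()),
    )
    for user_id, cnt in batch:
        totals[user_id] = totals.get(user_id, 0) + cnt

# For comparison: BETWEEN with an end date misses the order late on January 31.
between_count = conn.execute(
    "SELECT COUNT(*) FROM orders "
    "WHERE create_time BETWEEN '2023-01-01' AND '2023-01-31'"
).fetchone()[0]

print(totals, between_count)
```

Because each batch boundary is shared by exactly one batch, no row is counted twice and none is skipped.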
Temporary table approach (for more than 10 million rows):
```sql
CREATE TEMPORARY TABLE temp_counts AS
SELECT user_id, COUNT(*) AS cnt FROM big_table GROUP BY user_id;

SELECT a.*, b.cnt
FROM big_table a JOIN temp_counts b ON a.user_id = b.user_id;
```
```sql
/* CTE version of the GROUP BY + JOIN pattern */
WITH user_order_counts AS (
    SELECT user_id, COUNT(*) AS cnt
    FROM orders
    GROUP BY user_id
)
SELECT o.*, u.cnt
FROM orders o JOIN user_order_counts u ON o.user_id = u.user_id;
```
```sql
/* One row per user, using a named window inside a plain CTE */
WITH agg_data AS (
    SELECT
        user_id,
        COUNT(*) OVER w AS user_order_count,
        ROW_NUMBER() OVER w AS rn
    FROM orders
    WINDOW w AS (PARTITION BY user_id)
)
SELECT * FROM agg_data WHERE rn = 1;
```
```sql
/* CROSS APPLY version (SQL Server; also Oracle 12c and later) */
SELECT
    o.*,
    c.order_count
FROM orders o
CROSS APPLY (
    SELECT COUNT(*) AS order_count
    FROM orders i
    WHERE i.user_id = o.user_id
) c;
```
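The NULL trap:
```sql
/* Watch out: COUNT(*) counts every row; COUNT(column) skips rows where the column is NULL */
SELECT
    department,
    COUNT(*) AS total_emps,           -- all rows, including those with a NULL manager_id
    COUNT(manager_id) AS managed_emps -- only rows with a non-NULL manager_id
FROM employees
GROUP BY department;
```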
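A quick way to see the two counts diverge is to run the query on a table where some manager_id values are NULL. This is a minimal sketch against an in-memory SQLite database; the four employees are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, department TEXT, manager_id INTEGER)"
)
# Hypothetical sample: None becomes SQL NULL (an employee without a manager).
conn.executemany(
    "INSERT INTO employees (emp_id, department, manager_id) VALUES (?, ?, ?)",
    [(1, "eng", None), (2, "eng", 1), (3, "eng", 1), (4, "ops", None)],
)

null_demo = conn.execute(
    """
    SELECT department,
           COUNT(*)          AS total_emps,   -- every row in the group
           COUNT(manager_id) AS managed_emps  -- rows where manager_id is not NULL
    FROM employees
    GROUP BY department
    ORDER BY department
    """
).fetchall()

for row in null_demo:
    print(row)
```

Note that a group whose manager_id values are all NULL still appears, with a count of 0 rather than NULL.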
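Two kinds of counting, distinct and conditional:
```sql
/* Per user: number of distinct status values, and number of paid orders */
SELECT
    user_id,
    COUNT(DISTINCT status) AS status_types,
    COUNT(CASE WHEN status = 'paid' THEN 1 END) AS paid_orders
FROM orders
GROUP BY user_id;
```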
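A pitfall when selecting columns after grouping:
```sql
/* Wrong: with GROUP BY, the select list may contain only aggregates or grouping columns */
SELECT
    user_id,
    COUNT(*) AS cnt,
    order_id -- error: not a grouping column
FROM orders
GROUP BY user_id;
```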
A handy use of HAVING:
```sql
/* Keep only users who have placed more than 5 orders */
SELECT
    user_id,
    COUNT(*) AS order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;
```
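HAVING filters groups after aggregation, while WHERE filters rows before it, and combining the two changes what gets counted. The sketch below illustrates this on made-up data in an in-memory SQLite database, with a threshold of 2 instead of 5 to keep the sample small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT)"
)
# Hypothetical sample: user 1 has 3 orders, user 2 has 2, user 3 has 1.
conn.executemany(
    "INSERT INTO orders (order_id, user_id, status) VALUES (?, ?, ?)",
    [
        (1, 1, "paid"), (2, 1, "paid"), (3, 1, "cancelled"),
        (4, 2, "paid"), (5, 2, "cancelled"),
        (6, 3, "paid"),
    ],
)

# HAVING alone: groups are formed from all rows, then small groups are dropped.
frequent = conn.execute(
    "SELECT user_id, COUNT(*) FROM orders "
    "GROUP BY user_id HAVING COUNT(*) >= 2 ORDER BY user_id"
).fetchall()

# WHERE + HAVING: cancelled rows are removed first, so only paid orders are counted.
frequent_paid = conn.execute(
    "SELECT user_id, COUNT(*) FROM orders WHERE status = 'paid' "
    "GROUP BY user_id HAVING COUNT(*) >= 2 ORDER BY user_id"
).fetchall()

print(frequent)
print(frequent_paid)
```

User 2 passes the first filter but not the second, because only one of their two orders is paid.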
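In a data warehouse project I once needed the top 3 products in each category. The solution I settled on:
```sql
SELECT *
FROM (
    SELECT
        product_id,
        category_id,
        sales,
        -- alias is rn rather than rank, which is a reserved word in MySQL 8.0
        ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY sales DESC) AS rn
    FROM products
) t
WHERE rn <= 3;
```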
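One thing worth checking with a top-N query is how ties in sales are handled. ROW_NUMBER always returns exactly N rows per category, while RANK and DENSE_RANK can return more. The sketch below compares the three on made-up data in an in-memory SQLite database (3.25 or newer). The top_n helper and the product_id tie-breaker are assumptions added for illustration, not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, category_id TEXT, sales INTEGER)"
)
# Hypothetical sample: category A has a three-way tie at 90.
conn.executemany(
    "INSERT INTO products (product_id, category_id, sales) VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "A", 90), (3, "A", 90), (4, "A", 90), (5, "A", 80), (6, "B", 50)],
)

RANKED_SQL = """
SELECT product_id, category_id, sales,
       ROW_NUMBER() OVER (PARTITION BY category_id
                          ORDER BY sales DESC, product_id) AS rn,
       RANK()       OVER (PARTITION BY category_id ORDER BY sales DESC) AS rnk,
       DENSE_RANK() OVER (PARTITION BY category_id ORDER BY sales DESC) AS dense_rnk
FROM products
"""


def top_n(rank_column, n):
    """Return product ids whose chosen ranking column is at most n."""
    if rank_column not in {"rn", "rnk", "dense_rnk"}:
        raise ValueError("unknown ranking column: " + rank_column)
    sql = (
        "SELECT product_id FROM (" + RANKED_SQL + ") t "
        "WHERE " + rank_column + " <= ? ORDER BY product_id"
    )
    return [row[0] for row in conn.execute(sql, (n,))]


print(top_n("rn", 3))         # exactly 3 rows for category A
print(top_n("rnk", 3))        # tied rows share a rank, so the tie is kept whole
print(top_n("dense_rnk", 3))  # ranks have no gaps, so the third sales level is included
```

Which of the three is right depends on whether "top 3" means three rows, or every product at the three best sales levels.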
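That case taught me that although GROUP BY plus JOIN can do the job, window functions are more concise and efficient in complex scenarios. When a query starts to get complicated, it is time to ask whether it should be upgraded to window functions.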