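As a data analyst, I still remember the meltdown I had the first time I ran into window functions: it's all SQL, so why did this syntax read like hieroglyphics? After countless errors and late-night debugging sessions, I finally cracked this tough nut. Today I'll share, in the most down-to-earth way I can, the core usage of window functions and the practical tricks the official docs won't tell you.
Window functions are the go-to tool in SQL for the "I need an aggregate but also want to keep the detail rows" scenario. Unlike a plain GROUP BY, they compute an aggregate over a specific window of rows while keeping every original row. A practical example: when you need each employee's sales rank and also want to show their individual sales figures, a window function is usually the most direct choice (a self-join or correlated subquery can do it too, but far more awkwardly).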

    SELECT
        column1,
        column2,
        window_function(column3) OVER (
            [PARTITION BY partition_expression]
            [ORDER BY sort_expression [ASC | DESC]]
            [frame_clause]
        ) AS result_column
    FROM table_name

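This seemingly complex structure breaks down into three key parts:
- PARTITION BY: splits the rows into groups (partitions); the function is computed separately within each one
- ORDER BY: defines the order of the rows inside each partition
- frame_clause: narrows the calculation to a range of rows relative to the current row

Key difference: a plain GROUP BY collapses rows, whereas a window function leaves the row count of the source table unchanged.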
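
To see that difference in row counts for yourself, here is a minimal sketch. It assumes Python's built-in `sqlite3` module with SQLite 3.25 or newer (the article doesn't name a database), and the three sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_records (salesperson TEXT, region TEXT, amount INTEGER)")
# Hypothetical sample data, not from the article
conn.executemany(
    "INSERT INTO sales_records VALUES (?, ?, ?)",
    [("Ann", "East", 100), ("Bob", "East", 300), ("Cid", "West", 200)],
)

# GROUP BY collapses each region into a single row.
grouped = conn.execute(
    "SELECT region, SUM(amount) FROM sales_records GROUP BY region ORDER BY region"
).fetchall()

# The window version keeps every row and repeats the regional total on each one.
windowed = conn.execute(
    """
    SELECT salesperson, region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales_records
    ORDER BY salesperson
    """
).fetchall()

print(len(grouped), len(windowed))  # prints "2 3"
```

The grouped query returns one row per region, while the windowed query still returns all three salespeople, each carrying their region's total.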
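Suppose we have a sales table, sales_records:

    -- Rank each salesperson's sales amount within their region
    SELECT
        salesperson,
        region,
        amount,
        RANK() OVER (
            PARTITION BY region
            ORDER BY amount DESC
        ) AS regional_rank
    FROM sales_records

This query keeps every original record and adds a column showing each salesperson's sales rank within their region. That's the magic of window functions: you see the trees (the detail) and the forest (the aggregate) at the same time.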
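
Here is a small runnable version of that regional ranking. It's only a sketch: it assumes Python's `sqlite3` module (SQLite 3.25+), and the four salespeople and their amounts are invented so that one region contains a tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_records (salesperson TEXT, region TEXT, amount INTEGER)")
# Hypothetical data: Ann and Bob tie in the East region
conn.executemany(
    "INSERT INTO sales_records VALUES (?, ?, ?)",
    [("Ann", "East", 300), ("Bob", "East", 300), ("Cid", "East", 100), ("Dee", "West", 200)],
)

ranked = conn.execute(
    """
    SELECT salesperson, region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS regional_rank
    FROM sales_records
    ORDER BY region, regional_rank, salesperson  -- outer ORDER BY sorts the output
    """
).fetchall()

for row in ranked:
    print(row)
```

The ranking restarts at 1 in each region, and the two tied rows in East share rank 1 while the next row gets rank 3.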
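ROW_NUMBER, RANK, and DENSE_RANK all number rows in sorted order, but they treat ties differently:
| Function | Tie handling | Ranks for the sequence (100, 100, 90) | Typical use |
|---|---|---|---|
| ROW_NUMBER | Always consecutive; tied rows get distinct numbers in arbitrary order | 1, 2, 3 | When you need a strictly unique sequence number |
| RANK | Ties share a rank, then the numbering skips | 1, 1, 3 | Sports rankings |
| DENSE_RANK | Ties share a rank, no gaps | 1, 1, 2 | Academic grade tiers |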

    -- Compare the three ranking functions
    SELECT
        student_id,
        score,
        ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
        RANK() OVER (ORDER BY score DESC) AS rank_val,
        DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank_val
    FROM exam_results

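
The (100, 100, 90) sequence is easy to verify. The sketch below assumes Python's `sqlite3` module (SQLite 3.25+) and three made-up students; I also add `student_id` as a tie-breaker for ROW_NUMBER, which the query above does not have, so the numbering of the tied rows is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exam_results (student_id INTEGER, score INTEGER)")
# Hypothetical scores matching the (100, 100, 90) example
conn.executemany("INSERT INTO exam_results VALUES (?, ?)", [(1, 100), (2, 100), (3, 90)])

comparison = conn.execute(
    """
    SELECT student_id, score,
           ROW_NUMBER() OVER (ORDER BY score DESC, student_id) AS row_num,
           RANK()       OVER (ORDER BY score DESC)             AS rank_val,
           DENSE_RANK() OVER (ORDER BY score DESC)             AS dense_rank_val
    FROM exam_results
    ORDER BY row_num
    """
).fetchall()

for row in comparison:
    print(row)
```

Reading down the last three columns gives 1, 2, 3 for ROW_NUMBER, 1, 1, 3 for RANK, and 1, 1, 2 for DENSE_RANK.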
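Ordinary aggregate functions such as SUM, AVG, COUNT, MIN, and MAX can be combined with OVER:

    -- Moving average over the current row and the 2 preceding rows
    -- (the first two rows average fewer than 3 values)
    SELECT
        date,
        revenue,
        AVG(revenue) OVER (
            ORDER BY date
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) AS moving_avg
    FROM daily_sales
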
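
A quick way to sanity-check a moving average is to run it on numbers you can average in your head. This sketch assumes Python's `sqlite3` module (SQLite 3.25+) and four invented days of revenue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (date TEXT, revenue REAL)")
# Hypothetical revenue figures
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-03", 30), ("2024-01-04", 40)],
)

moving = conn.execute(
    """
    SELECT date,
           AVG(revenue) OVER (
               ORDER BY date
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS moving_avg
    FROM daily_sales
    ORDER BY date
    """
).fetchall()

# The first row averages 1 value, the second 2 values, the rest 3 values.
moving_avgs = [avg for _, avg in moving]
print(moving_avgs)  # [10.0, 15.0, 20.0, 30.0]
```

Note how the frame simply shrinks at the start of the data instead of producing NULL; if you need a "full window only" average, you have to filter those early rows yourself.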
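Offset functions such as LAG are invaluable when analyzing time-series data:

    -- Day-over-day growth rate
    SELECT
        date,
        revenue,
        LAG(revenue, 1) OVER (ORDER BY date) AS prev_day_revenue,
        -- NULLIF avoids a division by zero when the previous day's revenue is 0;
        -- if revenue is an integer column, multiply by 1.0 to avoid integer division
        (revenue - LAG(revenue, 1) OVER (ORDER BY date)) /
            NULLIF(LAG(revenue, 1) OVER (ORDER BY date), 0) AS growth_rate
    FROM daily_sales
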
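
The two edge cases worth checking in a growth-rate query are the first row (no previous day) and a previous day with zero revenue. The sketch below assumes Python's `sqlite3` module (SQLite 3.25+) and invented figures, and it moves LAG into a CTE so it is only written once; that restructuring is my own choice, not something the query above requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (date TEXT, revenue REAL)")
# Hypothetical data including a zero-revenue day
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2024-01-01", 100), ("2024-01-02", 125), ("2024-01-03", 0), ("2024-01-04", 50)],
)

growth_rows = conn.execute(
    """
    WITH with_prev AS (
        SELECT date, revenue,
               LAG(revenue) OVER (ORDER BY date) AS prev_day_revenue
        FROM daily_sales
    )
    SELECT date,
           (revenue - prev_day_revenue) / NULLIF(prev_day_revenue, 0) AS growth_rate
    FROM with_prev
    ORDER BY date
    """
).fetchall()

growth = [rate for _, rate in growth_rows]
print(growth)  # [None, 0.25, -1.0, None]
```

Both edge cases come back as NULL (None in Python) rather than an error or a misleading number.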
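Use ROWS or RANGE to specify the range of rows (the frame) that the calculation covers:

    -- Running total from the start of the partition to the current row
    SELECT
        date,
        revenue,
        SUM(revenue) OVER (
            PARTITION BY YEAR(date)  -- MySQL / SQL Server; PostgreSQL and Oracle use EXTRACT(YEAR FROM date)
            ORDER BY date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS ytd_revenue
    FROM daily_sales
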
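
ROWS and RANGE only give different answers when the ORDER BY column has duplicates, which is exactly when it bites. This is a sketch assuming Python's `sqlite3` module (SQLite 3.25+); the `payments` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, day INTEGER, amount INTEGER)")
# Hypothetical data: two payments share day 1
conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", [(1, 1, 10), (2, 1, 20), (3, 2, 30)])

totals = conn.execute(
    """
    SELECT id,
           -- ROWS counts physical rows; id breaks the tie so the order is deterministic
           SUM(amount) OVER (
               ORDER BY day, id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS rows_total,
           -- RANGE treats all rows with the same day as peers and includes them together
           SUM(amount) OVER (
               ORDER BY day
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS range_total
    FROM payments
    ORDER BY id
    """
).fetchall()

for row in totals:
    print(row)
```

With ROWS the running total grows row by row (10, 30, 60), while with RANGE both day-1 rows already show 30. Keep in mind that when you write ORDER BY without any frame clause, the default frame is the RANGE variant.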
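Common frame bounds:
- UNBOUNDED PRECEDING: the start of the partition
- n PRECEDING: n rows before the current row (with ROWS; under RANGE, n is an offset in the ORDER BY value)
- CURRENT ROW: the current row
- n FOLLOWING: n rows after the current row

A few pitfalls to watch for:
- Forgetting the OVER keyword: this is the most common source of syntax errors

    -- Wrong: a window function without OVER
    SELECT RANK() FROM exam_results;
    -- Right: ranking functions need OVER, and in practice an ORDER BY inside it
    SELECT RANK() OVER (ORDER BY score DESC) FROM exam_results;

- Misreading what ORDER BY does: inside OVER, ORDER BY only controls the order in which the function is computed; it does not sort the result set. Add a separate ORDER BY at the end of the query if you need sorted output.
- Overlooking NULL handling: aggregate window functions such as SUM and AVG skip NULLs, and where NULLs land in a window's ORDER BY depends on the database. MySQL and SQL Server sort them as the smallest values by default, while PostgreSQL and Oracle sort them as the largest.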
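
The NULL behaviour is worth testing on your own database. The sketch below assumes Python's `sqlite3` module (SQLite 3.25+) and three invented rows. It shows aggregates skipping NULLs, and uses a CASE expression as one portable way to force NULL scores to the bottom of a ranking regardless of the database's default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exam_results (student TEXT, score INTEGER)")
# Hypothetical data: student b has no score
conn.executemany("INSERT INTO exam_results VALUES (?, ?)", [("a", 10), ("b", None), ("c", 30)])

null_demo = conn.execute(
    """
    SELECT student, score,
           -- Sort NULL scores last explicitly instead of relying on the default
           RANK() OVER (
               ORDER BY CASE WHEN score IS NULL THEN 1 ELSE 0 END, score DESC
           ) AS rnk,
           AVG(score)   OVER () AS avg_score,       -- NULL is skipped: (10 + 30) / 2
           COUNT(score) OVER () AS non_null_count   -- counts only non-NULL scores
    FROM exam_results
    ORDER BY student
    """
).fetchall()

for row in null_demo:
    print(row)
```

The average is 20.0 rather than 13.3, because the NULL row is not counted as a zero.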
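When a window function returns something unexpected, compute it in a CTE first and inspect the intermediate result:

    -- Step-by-step debugging example
    WITH ranked_data AS (
        SELECT
            product_id,
            category,
            sales,
            RANK() OVER (PARTITION BY category ORDER BY sales DESC) AS rank_val
        FROM products
    )
    SELECT * FROM ranked_data WHERE rank_val <= 3  -- check that the top 3 look right
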

    -- Top 3 fastest-growing products by sales in each category
    -- (DENSE_RANK keeps ties, so a category can return more than 3 rows)
    WITH monthly_growth AS (
        SELECT
            product_id,
            category,
            (SUM(CASE WHEN month = '2023-04' THEN amount ELSE 0 END) -
             SUM(CASE WHEN month = '2023-03' THEN amount ELSE 0 END)) /
                -- NULLIF turns a zero March total into NULL instead of dividing by zero
                NULLIF(SUM(CASE WHEN month = '2023-03' THEN amount ELSE 0 END), 0) AS growth_rate
        FROM sales
        GROUP BY product_id, category
    ),
    ranked_products AS (
        SELECT
            product_id,
            category,
            growth_rate,
            DENSE_RANK() OVER (PARTITION BY category ORDER BY growth_rate DESC) AS growth_rank
        FROM monthly_growth
        WHERE growth_rate IS NOT NULL
    )
    SELECT * FROM ranked_products WHERE growth_rank <= 3

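
Here is the growth ranking run end to end on toy data. It's a sketch that assumes Python's `sqlite3` module (SQLite 3.25+); the products and amounts are invented, including one product with no March sales, and the zero baseline is guarded with NULLIF so that product drops out instead of breaking the division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id TEXT, category TEXT, month TEXT, amount REAL)")
# Hypothetical data: p3 has no March sales, so its growth rate is undefined
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("p1", "A", "2023-03", 100), ("p1", "A", "2023-04", 150),
        ("p2", "A", "2023-03", 100), ("p2", "A", "2023-04", 125),
        ("p3", "A", "2023-04", 50),
        ("p4", "B", "2023-03", 200), ("p4", "B", "2023-04", 100),
    ],
)

top_growth = conn.execute(
    """
    WITH monthly_growth AS (
        SELECT product_id, category,
               (SUM(CASE WHEN month = '2023-04' THEN amount ELSE 0 END) -
                SUM(CASE WHEN month = '2023-03' THEN amount ELSE 0 END)) /
               NULLIF(SUM(CASE WHEN month = '2023-03' THEN amount ELSE 0 END), 0) AS growth_rate
        FROM sales
        GROUP BY product_id, category
    ),
    ranked_products AS (
        SELECT product_id, category, growth_rate,
               DENSE_RANK() OVER (PARTITION BY category ORDER BY growth_rate DESC) AS growth_rank
        FROM monthly_growth
        WHERE growth_rate IS NOT NULL
    )
    SELECT * FROM ranked_products
    WHERE growth_rank <= 3
    ORDER BY category, growth_rank
    """
).fetchall()

for row in top_growth:
    print(row)
```

Product p3 is excluded because its growth rate is NULL; whether "new this month" products should instead be ranked first is a business decision the query can't make for you.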

    -- Flag anomalous transactions: amount above the trailing 30-day mean plus 3 standard deviations
    -- Window functions are not allowed in WHERE, so compute them in a CTE and filter afterwards
    -- INTERVAL '30' DAY works in PostgreSQL and Oracle; MySQL writes INTERVAL 30 DAY
    WITH stats AS (
        SELECT
            transaction_id,
            account_id,
            amount,
            AVG(amount) OVER (
                PARTITION BY account_id
                ORDER BY transaction_date
                RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW
            ) AS avg_amount,
            STDDEV(amount) OVER (
                PARTITION BY account_id
                ORDER BY transaction_date
                RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW
            ) AS std_amount
        FROM transactions
    )
    SELECT
        transaction_id,
        account_id,
        amount,
        avg_amount,
        std_amount
    FROM stats
    WHERE amount > avg_amount + 3 * std_amount

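
To try the compute-in-a-CTE-then-filter pattern locally, here is a sketch with several loudly stated assumptions: it uses Python's `sqlite3` module (SQLite 3.28+ for RANGE with a numeric offset), a hypothetical integer `day_no` column instead of a real date with INTERVAL, and, because SQLite has no built-in STDDEV, the population variance computed as the mean of squares minus the squared mean. The "3 standard deviations" test is then done on squared values to avoid needing a square root.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (transaction_id INTEGER, account_id TEXT, day_no INTEGER, amount INTEGER)"
)
# Hypothetical data: eleven ordinary transactions of 100, then one of 1000 on day 12
rows = [(i, "acct-1", i, 100) for i in range(1, 12)] + [(12, "acct-1", 12, 1000)]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)

flagged = conn.execute(
    """
    WITH stats AS (
        SELECT transaction_id, amount,
               AVG(amount) OVER (
                   PARTITION BY account_id ORDER BY day_no
                   RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
               ) AS avg_amount,
               AVG(amount * amount) OVER (
                   PARTITION BY account_id ORDER BY day_no
                   RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
               ) AS avg_sq
        FROM transactions
    )
    SELECT transaction_id, amount
    FROM stats
    -- amount > mean + 3*sigma, rewritten as (amount - mean)^2 > 9 * variance
    WHERE amount > avg_amount
      AND (amount - avg_amount) * (amount - avg_amount)
          > 9 * (avg_sq - avg_amount * avg_amount)
    ORDER BY transaction_id
    """
).fetchall()

print(flagged)  # [(12, 1000)]
```

One design caveat: because the window includes the current row, a large outlier inflates its own mean and standard deviation, so with only a handful of rows in the window nothing can ever exceed 3 sigma. Ending the frame at `1 PRECEDING` is a common alternative if you want to compare each transaction against its history only.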
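Window functions are part of the SQL standard, but databases still differ:
| Feature | PostgreSQL | MySQL 8+ | SQL Server | Oracle |
|---|---|---|---|---|
| Window frame syntax (ROWS / RANGE) | Supported | Supported | Supported | Supported |
| RANGE with an offset (n PRECEDING / n FOLLOWING) | Supported (11+) | Supported | Not supported | Supported |
| Named windows (WINDOW clause) | Supported | Supported | Supported from SQL Server 2022 | Supported from Oracle 21c |
| Performance (my impression) | Excellent | Average | Excellent | Excellent |

Note: MySQL versions before 8.0 do not support window functions, and MariaDB has supported them since 10.2.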
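
Since named windows appear in the table, here is what they look like. This is a sketch assuming Python's `sqlite3` module (SQLite 3.25+ supports the WINDOW clause) and three invented rows; the same clause should carry over to the databases listed as supporting it, but check your version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (date TEXT, revenue INTEGER)")
# Hypothetical revenue figures
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-03", 30)],
)

named = conn.execute(
    """
    SELECT date,
           ROW_NUMBER()  OVER w AS day_index,
           SUM(revenue)  OVER w AS running_total
    FROM daily_sales
    WINDOW w AS (ORDER BY date)   -- define the window once, reuse it twice
    ORDER BY date
    """
).fetchall()

for row in named:
    print(row)
```

Defining the window once keeps the two functions guaranteed to use the same partitioning and ordering, which removes a whole class of copy-paste mistakes.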
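Based on the pitfalls I've hit, I'd suggest learning window functions in this order:
The most effective way to learn is to practice on real business data, for example:
Remember that the learning curve for window functions is steep at first and then flattens out: the concepts feel complicated early on, but once you get past a certain tipping point, you'll find them more intuitive and more efficient than stacking subqueries or self-joins.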