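Over the past three years, HiveSQL has climbed to a 92% appearance rate in data-role interviews at major internet companies. According to feedback from the 30-plus students I coached last year who landed offers at those companies, each technical interview included two to three hands-on HiveSQL problems on average. This reflects how much weight employers place on data-warehouse construction skills and data thinking.
Why does HiveSQL matter so much? According to a survey by the Meituan technical team, 75% of day-to-day data-warehouse development effort goes into writing HiveSQL. A typical user-behavior analysis request, from data cleaning to metric calculation, often takes more than ten nested queries. SQL problems let an interviewer quickly assess three core abilities in a candidate:
Problems of this type appear in as many as 68% of interviews at companies such as ByteDance and Meituan. Here is a classic question from ByteDance:
Problem: given a table that logs when each streamer goes live and goes offline, compute the number of streamers online simultaneously at the platform's peak. The data format is as follows: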
CREATE TABLE live_stream_log (
    user_id INT,
    start_time STRING,
    end_time STRING
);
Solution approach:
WITH event_log AS (
    SELECT user_id, start_time AS action_time, 1 AS change
    FROM live_stream_log
    UNION ALL
    SELECT user_id, end_time AS action_time, -1 AS change
    FROM live_stream_log
)
SELECT MAX(online_cnt) AS peak_online_users
FROM (
    SELECT
        SUM(change) OVER (ORDER BY action_time) AS online_cnt
    FROM event_log
) t;
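A common follow-up is to ask when the peak occurred. The same event-log idea extends to that case. The following is a minimal sketch, assuming start_time and end_time are strings in a sortable format such as yyyy-MM-dd HH:mm:ss; the names delta, net_change, per_instant, running, and peak_time are illustrative and do not come from the original question.

```sql
WITH event_log AS (
    -- +1 when a streamer goes live, -1 when the streamer goes offline
    SELECT start_time AS action_time, 1 AS delta FROM live_stream_log
    UNION ALL
    SELECT end_time AS action_time, -1 AS delta FROM live_stream_log
),
per_instant AS (
    -- Collapse all events that share a timestamp into one net change,
    -- so a session ending at the instant another starts does not overlap it
    SELECT action_time, SUM(delta) AS net_change
    FROM event_log
    GROUP BY action_time
),
running AS (
    SELECT
        action_time,
        SUM(net_change) OVER (ORDER BY action_time) AS online_cnt
    FROM per_instant
)
-- Earliest instant at which the running count reaches its maximum
SELECT action_time AS peak_time, online_cnt AS peak_online_users
FROM running
ORDER BY online_cnt DESC, action_time
LIMIT 1;
```

Whether a session that ends at the exact moment another begins should count as overlapping is a business decision. If it should, the rows in per_instant would need to be ordered with starts before ends instead of being netted out.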
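Key techniques:
- UNION ALL merges the opposing start and end events
- SUM() OVER computes the running total
Companies such as Tencent and Alibaba favor questions that test user-behavior analysis. This one comes from a real business scenario at Tencent Music:
Problem: from the user operation log, find the users who completed the behavior path "A→B→D", where: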
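WITH user_sequences AS (
    SELECT
        user_id,
        DATE(op_time) AS dt,
        -- Running list of the day's operations in time order (one prefix of the path per row)
        COLLECT_LIST(op_id) OVER (PARTITION BY user_id, DATE(op_time) ORDER BY op_time) AS path
    FROM action_log
)
SELECT COUNT(DISTINCT user_id)
FROM user_sequences
WHERE
    ARRAY_CONTAINS(path, 'A') AND
    ARRAY_CONTAINS(path, 'B') AND
    ARRAY_CONTAINS(path, 'D') AND
    ARRAY_POSITION(path, 'B') < ARRAY_POSITION(path, 'D') AND
    -- No 'C' strictly between the first B and the first D.
    -- SLICE takes (array, start, length), so the length is pos(D) - pos(B) - 1.
    -- ARRAY_POSITION and SLICE are Spark SQL built-ins; older Hive releases do not ship them.
    NOT ARRAY_CONTAINS(
        SLICE(path, ARRAY_POSITION(path, 'B') + 1, ARRAY_POSITION(path, 'D') - ARRAY_POSITION(path, 'B') - 1),
        'C'
    );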
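Because ARRAY_POSITION and SLICE are not available in every engine, it can help to have a version that relies only on window functions. The sketch below is one possible reading of the rules, assumed here because the original rule list is not shown: on the same day, some A precedes a B, that B precedes a D, and no C falls between that B and the D. The aliases tagged, first_a, last_b, last_c, and user_cnt are illustrative, and op_time is assumed to be a sortable timestamp string with no ties per user.

```sql
WITH tagged AS (
    SELECT
        user_id,
        op_id,
        -- For each row, look only at rows strictly before it (same user, same day):
        -- the earliest A, the latest B, and the latest C
        MIN(CASE WHEN op_id = 'A' THEN op_time END) OVER (
            PARTITION BY user_id, TO_DATE(op_time) ORDER BY op_time
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS first_a,
        MAX(CASE WHEN op_id = 'B' THEN op_time END) OVER (
            PARTITION BY user_id, TO_DATE(op_time) ORDER BY op_time
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS last_b,
        MAX(CASE WHEN op_id = 'C' THEN op_time END) OVER (
            PARTITION BY user_id, TO_DATE(op_time) ORDER BY op_time
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS last_c
    FROM action_log
)
SELECT COUNT(DISTINCT user_id) AS user_cnt
FROM tagged
WHERE op_id = 'D'                               -- evaluate the rule at each D
  AND first_a < last_b                          -- an A came before the latest B
  AND (last_c IS NULL OR last_c < last_b);      -- no C after that B and before this D
```

Comparisons against NULL evaluate to NULL, so rows with no preceding A or B are filtered out without extra checks. Unlike the array version, this sketch accepts a path such as D, A, B, D, because it tests every D rather than only the first one.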
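Key points:
- COLLECT_LIST preserves the order of operations
- ARRAY_POSITION locates the key nodes
- SLICE extracts the sub-path for rule checking
Baidu's 2023 campus recruitment included this consecutive sign-in problem:
Problem: compute the reward coins users earn from sign-ins, where the cycle resets after 7 consecutive days. Rules: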
WITH signin_groups AS (
    SELECT
        user_id,
        sign_date,
        -- A new group starts at the first sign-in or whenever the gap exceeds one day
        SUM(IF(prev_date IS NULL OR DATEDIFF(sign_date, prev_date) > 1, 1, 0))
            OVER (PARTITION BY user_id ORDER BY sign_date) AS group_id
    FROM (
        SELECT
            user_id,
            sign_date,
            LAG(sign_date, 1) OVER (PARTITION BY user_id ORDER BY sign_date) AS prev_date
        FROM signin_log
        WHERE is_signed = 1
    ) t
)
SELECT
    user_id,
    SUM(
        CASE
            -- The cycle resets after 7 consecutive days, so day 8 of a streak counts as day 1 again
            WHEN (day_in_group - 1) % 7 + 1 = 3 THEN 2
            WHEN (day_in_group - 1) % 7 + 1 = 7 THEN 5
            ELSE 1
        END
    ) AS total_coins
FROM (
    SELECT
        user_id,
        sign_date,
        ROW_NUMBER() OVER (PARTITION BY user_id, group_id ORDER BY sign_date) AS day_in_group
    FROM signin_groups
) t
GROUP BY user_id;
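Another way to group consecutive days, often seen alongside the LAG approach, subtracts a row number from the date: within one streak both advance by one per day, so the difference stays constant. The sketch below assumes signin_log has at most one row per user per day and that sign_date is a DATE or a yyyy-MM-dd string; the names numbered, rn, streak_start, streak_end, and streak_days are illustrative.

```sql
WITH numbered AS (
    SELECT
        user_id,
        sign_date,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY sign_date) AS rn
    FROM signin_log
    WHERE is_signed = 1
)
SELECT
    user_id,
    MIN(sign_date) AS streak_start,
    MAX(sign_date) AS streak_end,
    COUNT(*)       AS streak_days
FROM numbered
-- Consecutive dates minus consecutive row numbers yield the same anchor date,
-- so each distinct anchor identifies one unbroken streak
GROUP BY user_id, DATE_SUB(sign_date, rn);
```

This form returns one row per streak, which is handy when the question asks for the longest streak. The LAG-based grouping is more convenient when each day needs its position within the streak, as in the coin calculation.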
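Key breakthroughs:
- LAG detects breaks in the sign-in streak
Alibaba once set this stock-analysis problem:
Problem: find every trading day whose closing price is higher than both the previous day's and the next day's closing price (a peak).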
SELECT
    ts_code,
    trade_date,
    close_price
FROM (
    SELECT
        ts_code,
        trade_date,
        close_price,
        LAG(close_price, 1) OVER (PARTITION BY ts_code ORDER BY trade_date) AS prev_close,
        LEAD(close_price, 1) OVER (PARTITION BY ts_code ORDER BY trade_date) AS next_close
    FROM stock_daily
) t
WHERE close_price > prev_close AND close_price > next_close;
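Technical points:
- LAG/LEAD enable row-to-row comparison
This problem came up in a Meituan interview: when computing age percentiles of users per city, the Beijing partition holds more than 100 times as much data as any other city.
Solution: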
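-- Stage 1: pre-aggregate into a per-city age histogram
CREATE TABLE city_age_stats AS
SELECT
    city,
    age,
    COUNT(1) AS cnt
FROM (
    SELECT
        city,
        age,
        -- Keep roughly 10% of the oversized city ('北京' = Beijing); keep every row elsewhere
        CASE WHEN city = '北京' AND RAND() > 0.1 THEN NULL ELSE 1 END AS sample_flag
    FROM user_profile
) t
WHERE sample_flag = 1
GROUP BY city, age;
-- Stage 2: median from the histogram, weighting each age by cnt
-- (an estimate for the sampled city, exact for the others).
-- PERCENTILE_APPROX(age, 0.5) on the raw rows is the one-step approximate alternative.
SELECT
    city,
    MIN(age) AS median_age
FROM (
    SELECT
        city,
        age,
        SUM(cnt) OVER (PARTITION BY city ORDER BY age) AS cum_cnt,
        SUM(cnt) OVER (PARTITION BY city) AS total_cnt
    FROM city_age_stats
) t
-- First age at which the cumulative count reaches half of the city's total
WHERE cum_cnt * 2 >= total_cnt
GROUP BY city;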
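Sampling trades accuracy for speed. When an exact per-city aggregate is needed under the same skew, a common alternative is to add a random salt so the hot key is spread over several reducers, then merge the partial results. The sketch below illustrates that two-stage pattern on a simple user count; the bucket count of 16 and the names salt, partial_cnt, and user_cnt are assumptions chosen for illustration.

```sql
SELECT
    city,
    SUM(partial_cnt) AS user_cnt        -- Stage 2: merge the partial counts per city
FROM (
    SELECT
        city,
        salt,
        COUNT(1) AS partial_cnt         -- Stage 1: count per (city, salt) bucket
    FROM (
        SELECT
            city,
            -- Random integer in [0, 15]; rows of the hot city land in 16 different groups
            CAST(FLOOR(RAND() * 16) AS INT) AS salt
        FROM user_profile
    ) s
    GROUP BY city, salt
) p
GROUP BY city;
```

The pattern works for aggregates that can be merged from partial results, such as COUNT, SUM, MIN, and MAX. A percentile cannot be merged this way, which is why the solution above goes through a histogram instead.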
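Optimization strategies:
- APPROX_PERCENTILE (named PERCENTILE_APPROX in Hive) for approximate computation
In a NetEase interview, the task was to optimize this query: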
EXPLAIN
SELECT
    a.user_id,
    COUNT(DISTINCT b.order_id) AS order_count
FROM user_info a
JOIN order_detail b ON a.user_id = b.user_id
WHERE a.register_date > '2023-01-01'
GROUP BY a.user_id;
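Optimization steps:
1. EXPLAIN reveals a full scan of the large table
2. Restrict the scan with the filter b.dt BETWEEN '2023-01-01' AND '2023-12-31'
3. Rewrite COUNT DISTINCT as an aggregation in a subquery first
4. Add the hint /*+ MAPJOIN(a) */
Performance ultimately improved 17-fold. The key is learning to identify the following in the execution plan: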
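Applied together, the four steps could yield a query along the following lines. This is a sketch under assumptions the original does not state: dt is a partition column of order_detail, the date range covers every order of interest, user_id is unique in user_info, and the filtered user_info is small enough to fit in memory. The aliases d and b are illustrative.

```sql
SELECT /*+ MAPJOIN(a) */
    a.user_id,
    b.order_count
FROM (
    -- Filter the small side before the join
    SELECT user_id
    FROM user_info
    WHERE register_date > '2023-01-01'
) a
JOIN (
    -- Count distinct orders per user in two GROUP BY passes instead of COUNT(DISTINCT)
    SELECT user_id, COUNT(*) AS order_count
    FROM (
        SELECT user_id, order_id
        FROM order_detail
        -- Filter on dt so only the needed date range is scanned
        WHERE dt BETWEEN '2023-01-01' AND '2023-12-31'
        GROUP BY user_id, order_id      -- deduplicate (user_id, order_id) pairs
    ) d
    GROUP BY user_id
) b ON a.user_id = b.user_id;
```

Depending on the Hive configuration, the MAPJOIN hint may be ignored in favor of automatic map-join conversion, so it is worth running EXPLAIN again to confirm which join strategy was actually chosen.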
Problem: given a friend-relationship table and a step-count table, compute each user's step ranking within their own friend list.
WITH friend_with_self AS (
    SELECT user_id, friend_id FROM user_friend
    UNION ALL
    -- Add each user to their own list so they are ranked alongside their friends
    SELECT user_id, user_id AS friend_id FROM user_steps
),
rank_data AS (
    SELECT
        f.user_id,
        f.friend_id,  -- needed by the final filter
        s.steps,
        DENSE_RANK() OVER (PARTITION BY f.user_id ORDER BY s.steps DESC) AS step_rank
    FROM friend_with_self f
    JOIN user_steps s ON f.friend_id = s.user_id
)
SELECT
    user_id,
    steps,
    step_rank
FROM rank_data
-- Keep only the row that describes the user themselves
WHERE user_id = friend_id;
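What this tests:
Problem: merge subway entry/exit records and shopping-mall QR-scan records into one complete trajectory ordered by time.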
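SELECT
    user_id,
    -- Hive's COLLECT_LIST takes no inner ORDER BY, so collect "time|id" strings,
    -- sort them with SORT_ARRAY, then strip the "time|" prefixes.
    -- Assumes event_time renders in a sortable format and IDs contain no ',' or '|'.
    REGEXP_REPLACE(
        CONCAT_WS(',',
            SORT_ARRAY(
                COLLECT_LIST(
                    CONCAT(CAST(event_time AS STRING), '|', CAST(station_id AS STRING))
                )
            )
        ),
        '[^,|]*\\|', ''
    ) AS path
FROM (
    SELECT
        user_id,
        COALESCE(in_time, out_time) AS event_time,
        station_id
    FROM subway_log
    UNION ALL
    SELECT
        user_id,
        MAX(check_time) AS event_time, -- take the most recent scan record
        market_id AS station_id
    FROM market_scan
    GROUP BY user_id, market_id
) t
GROUP BY user_id;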
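Technical highlights:
- COALESCE handles NULL values
- Ordering the collected elements inside the aggregate (engines whose COLLECT_LIST does not accept ORDER BY need a SORT_ARRAY workaround)
Based on the latest interview feedback from major companies, I have distilled three essential training directions: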
Pattern-recognition training
Four go-to performance optimizations
graph TD
A[Data skew] --> B[Bucketing]
A --> C[Random prefix]
D[Large-table JOIN] --> E[MapJoin hint]
D --> F[Predicate pushdown]
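Building business thinking
I recommend practicing two medium-difficulty problems a day, focusing on:
I have compiled a checklist of high-frequency topics, sorted by how often they appear:
Remember: what interviewers value most is the ability to turn business logic into a SQL implementation, not rote memorization of syntax. When explaining your approach, I suggest a three-part structure: problem decomposition → data transformation → result aggregation.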