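In day-to-day database administration, counting and summarizing duplicate data is a basic but essential task. As a DBA who has worked with SQL Server for years, I find that many developers stop at a simple COUNT query when dealing with duplicate records and overlook SQL Server's powerful grouping features. This article walks through a complete hands-on case showing how to combine GROUP BY, WITH ROLLUP, and HAVING to analyze duplicate questions in a question bank system in detail.
This approach is particularly suitable for the following scenarios:
The following environment is recommended for practice:
Tip: the examples use SQL Server 2016, but the core SQL syntax works on 2008 R2 and later; the differences are mainly in the graphical management tools.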
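We designed a typical question bank table, [exams], with the following structure:
| No. | Column | Type | Description |
|---|---|---|---|
| 1 | sortid | int | Question sort number (unique identifier) |
| 2 | etype | nvarchar(50) | Question type (单选 single-choice / 多选 multiple-choice / 判断 true-false) |
| 3 | title | nvarchar(500) | Question text |
| 4 | A | nvarchar(200) | Option A text |
| 5 | B | nvarchar(200) | Option B text |
| 6 | C | nvarchar(200) | Option C text |
| 7 | D | nvarchar(200) | Option D text |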
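For reference, a table matching this structure could be created roughly as follows. This is only a sketch: the dbo schema, the constraint name PK_exams, and the NULL/NOT NULL choices are assumptions for illustration, not details taken from the original system.

```sql
-- Sketch of the [exams] table described above.
-- Assumptions: dbo schema, constraint name, and nullability are illustrative.
CREATE TABLE dbo.exams (
    sortid INT           NOT NULL,  -- question sort number (unique identifier)
    etype  NVARCHAR(50)  NOT NULL,  -- question type
    title  NVARCHAR(500) NOT NULL,  -- question text
    A      NVARCHAR(200) NULL,      -- option A
    B      NVARCHAR(200) NULL,      -- option B
    C      NVARCHAR(200) NULL,      -- option C (NULL for true/false questions)
    D      NVARCHAR(200) NULL,      -- option D (NULL for true/false questions)
    CONSTRAINT PK_exams PRIMARY KEY CLUSTERED (sortid)
);
```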
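Key design considerations:
- sortid is the primary key, which guarantees that every record is unique.
- The title column needs a nonclustered index to speed up grouping queries.

To demonstrate duplicates, we deliberately inserted repeated questions at numbers 207-212: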
-- N'...' literals keep the Chinese text intact regardless of the database's default code page
INSERT INTO [exams] (sortid, etype, title, A, B, C, D) VALUES
    (207, N'单选', N'下列哪个是关系型数据库?', N'MySQL', N'MongoDB', N'Redis', N'Neo4j'),
    (208, N'单选', N'下列哪个是关系型数据库?', N'MySQL', N'MongoDB', N'Redis', N'Neo4j'),
    (209, N'多选', N'SQL语句分类包含?', N'DQL', N'DML', N'DDL', N'DCL'),
    (210, N'多选', N'SQL语句分类包含?', N'DQL', N'DML', N'DDL', N'DCL'),
    (211, N'判断', N'SQL Server是微软的产品', N'正确', N'错误', NULL, NULL),
    (212, N'判断', N'SQL Server是微软的产品', N'正确', N'错误', NULL, NULL);
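Although this article focuses on SQL analysis, a complete solution starts with data import. Two approaches are recommended:
Option 1: use the SQL Server Import and Export Wizard
Option 2: import from a C# program (recommended for batch processing)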
// Read the Excel file with EPPlus
using (var package = new ExcelPackage(new FileInfo("题库.xlsx")))
{
    var worksheet = package.Workbook.Worksheets[0];
    // dataTable is a DataTable filled from the worksheet rows (not shown here)
    // Bulk-load it with SqlBulkCopy
    using (var bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "exams";
        // Column mappings...
        bulkCopy.WriteToServer(dataTable);
    }
}
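Note: production code should add exception handling and logging, and large imports (more than 100,000 rows) are best committed in batches.
The most basic duplicate-detection query looks like this:
SELECT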
title,
etype,
count(title) AS repeat_count,
min(sortid) AS first_appear,
max(sortid) AS last_appear
FROM [exams]
GROUP BY etype, Title
ORDER BY repeat_count DESC;
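Key points:
- count(title) counts the rows in each group that have a non-NULL title, that is, the number of times the question occurs.
- min(sortid) and max(sortid) locate the first and last occurrence of the question.
- Sorting by repeat_count in descending order lists the duplicates first.

Sample result: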
| title | etype | repeat_count | first_appear | last_appear |
|---|---|---|---|---|
| 下列哪个是... | 单选 | 2 | 207 | 208 |
| SQL语句分类... | 多选 | 2 | 209 | 210 |
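The basic query returns every (etype, title) group, so unique questions still appear with a repeat_count of 1. The sketch below shows two common refinements: a HAVING filter that keeps only duplicated groups, and a windowed COUNT that lists each individual row belonging to a duplicate group. The temp table #exams_demo and its extra row 206 are hypothetical stand-ins so the example runs on its own.

```sql
-- Self-contained demo; #exams_demo is a hypothetical stand-in for [exams].
CREATE TABLE #exams_demo (
    sortid INT PRIMARY KEY,
    etype  NVARCHAR(50),
    title  NVARCHAR(500)
);
INSERT INTO #exams_demo (sortid, etype, title) VALUES
    (206, N'单选', N'Which statement removes a table?'),  -- unique question
    (207, N'单选', N'下列哪个是关系型数据库?'),
    (208, N'单选', N'下列哪个是关系型数据库?'),
    (209, N'多选', N'SQL语句分类包含?'),
    (210, N'多选', N'SQL语句分类包含?');

-- 1) Only the groups that occur more than once.
SELECT etype, title, COUNT(*) AS repeat_count,
       MIN(sortid) AS first_appear, MAX(sortid) AS last_appear
FROM #exams_demo
GROUP BY etype, title
HAVING COUNT(*) > 1
ORDER BY repeat_count DESC, first_appear;

-- 2) Every individual row that belongs to a duplicate group.
SELECT sortid, etype, title, copies
FROM (
    SELECT sortid, etype, title,
           COUNT(*) OVER (PARTITION BY etype, title) AS copies
    FROM #exams_demo
) AS x
WHERE copies > 1
ORDER BY title, sortid;
```

With these five sample rows, the first query returns two groups and the second returns rows 207 through 210; row 206 is excluded from both.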
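Adding WITH ROLLUP produces hierarchical summaries:
SELECT
CASE
-- Test the grand total first: etype is NULL on that row, so '[Subtotal] ' + etype would yield NULL
WHEN GROUPING(etype) = 1 THEN '[Total]'
WHEN GROUPING(title) = 1 THEN '[Subtotal] ' + etype
ELSE title
END AS display_title,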
etype,
count(title) AS record_count,
min(sortid) AS min_id,
max(sortid) AS max_id
FROM [exams]
GROUP BY etype, Title WITH ROLLUP
HAVING count(title) > 1 OR GROUPING(title) = 1 OR GROUPING(etype) = 1;
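Technical details:
- The GROUPING() function identifies summary rows (it returns 1 for them).
- CASE WHEN builds readable display text.
- The HAVING clause keeps three kinds of rows:
  - duplicate questions (count > 1)
  - per-type subtotal rows (GROUPING(title) = 1)
  - the grand-total row (GROUPING(etype) = 1)

A further refinement of the result display: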
SELECT
CASE
WHEN title IS NULL AND etype IS NULL THEN 'Question bank total'
WHEN title IS NULL THEN etype + ' subtotal'
ELSE title
END AS category,
CASE
WHEN title IS NULL THEN NULL -- summary rows carry no question type
ELSE etype
END AS question_type,
count(title) AS count,
min(sortid) AS first_id,
max(sortid) AS last_id
FROM [exams]
GROUP BY etype, Title WITH ROLLUP
HAVING count(title) > 1 OR title IS NULL;
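Testing title IS NULL to spot summary rows works only if no stored title is NULL. As a hedged alternative, GROUPING_ID derives the rollup level from the grouping itself, so genuinely missing titles are not mistaken for subtotals. The temp table #rollup_demo and rows 213-214 are hypothetical, added to show that case.

```sql
-- Self-contained demo; #rollup_demo is a hypothetical stand-in for [exams].
CREATE TABLE #rollup_demo (
    sortid INT PRIMARY KEY,
    etype  NVARCHAR(50),
    title  NVARCHAR(500) NULL
);
INSERT INTO #rollup_demo (sortid, etype, title) VALUES
    (207, N'单选', N'下列哪个是关系型数据库?'),
    (208, N'单选', N'下列哪个是关系型数据库?'),
    (209, N'多选', N'SQL语句分类包含?'),
    (210, N'多选', N'SQL语句分类包含?'),
    (213, N'单选', NULL),   -- rows whose title is genuinely missing
    (214, N'单选', NULL);

-- GROUPING_ID(etype, title): 0 = detail row, 1 = per-type subtotal, 3 = grand total.
SELECT
    CASE GROUPING_ID(etype, title)
        WHEN 3 THEN N'Grand total'
        WHEN 1 THEN etype + N' subtotal'
        ELSE ISNULL(title, N'(no title)')
    END AS category,
    COUNT(*)    AS row_count,   -- COUNT(*) also counts rows with a NULL title
    MIN(sortid) AS first_id,
    MAX(sortid) AS last_id
FROM #rollup_demo
GROUP BY ROLLUP (etype, title)
HAVING COUNT(*) > 1 OR GROUPING_ID(etype, title) > 0
ORDER BY GROUPING_ID(etype, title), etype;
```

Here the two untitled rows show up as a detail group labelled "(no title)" rather than being confused with the 单选 subtotal.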
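Index strategy (the key below can reach 1,100 bytes, which fits the 1,700-byte nonclustered key limit of SQL Server 2016 and later but exceeds the 900-byte limit of earlier versions):
-- Create a covering index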
CREATE INDEX IX_exams_title_etype ON [exams](title, etype) INCLUDE (sortid);
Statistics update:
-- Run after large data modifications
UPDATE STATISTICS [exams] WITH FULLSCAN;
Query hints (for very large data sets):
SELECT ... FROM [exams] WITH (NOLOCK) -- only when dirty reads are acceptable
OPTION (OPTIMIZE FOR UNKNOWN, MAXDOP 4);
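For scenarios that need flexible filtering, dynamic SQL can be used:
-- @start_id (INT) and @question_type (NVARCHAR(50)) are assumed to be declared
-- by the caller, for example as stored procedure parameters.
DECLARE @sql NVARCHAR(MAX) = N'
SELECT title, etype, count(*) AS cnt
FROM [exams]
WHERE 1=1 ';
-- Append filters as parameter placeholders; never concatenate the values
-- themselves, which would open the query to SQL injection.
IF @start_id IS NOT NULL
    SET @sql = @sql + N' AND sortid >= @start_id';
IF @question_type IS NOT NULL
    SET @sql = @sql + N' AND etype = @question_type';
SET @sql = @sql + N' GROUP BY title, etype HAVING count(*) > 1';
EXEC sp_executesql @sql,
    N'@start_id INT, @question_type NVARCHAR(50)',
    @start_id = @start_id, @question_type = @question_type;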
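When there are only a few optional filters, a static query can cover the same cases without building any SQL string. The sketch below is one such alternative, not the article's method: each filter is skipped when its variable is NULL. The temp table #filter_demo and the variable values are hypothetical.

```sql
-- Self-contained demo; #filter_demo is a hypothetical stand-in for [exams].
CREATE TABLE #filter_demo (
    sortid INT PRIMARY KEY,
    etype  NVARCHAR(50),
    title  NVARCHAR(500)
);
INSERT INTO #filter_demo (sortid, etype, title) VALUES
    (207, N'单选', N'下列哪个是关系型数据库?'),
    (208, N'单选', N'下列哪个是关系型数据库?'),
    (209, N'多选', N'SQL语句分类包含?'),
    (210, N'多选', N'SQL语句分类包含?'),
    (211, N'判断', N'SQL Server是微软的产品'),
    (212, N'判断', N'SQL Server是微软的产品');

DECLARE @start_id      INT          = 209;   -- NULL means "no lower bound"
DECLARE @question_type NVARCHAR(50) = NULL;  -- NULL means "all question types"

SELECT title, etype, COUNT(*) AS cnt
FROM #filter_demo
WHERE (@start_id IS NULL OR sortid >= @start_id)
  AND (@question_type IS NULL OR etype = @question_type)
GROUP BY title, etype
HAVING COUNT(*) > 1
OPTION (RECOMPILE);  -- plan for the actual variable values on each execution
```

The trade-off is a compilation on every run, which is usually acceptable for an occasional maintenance query but may not be for a hot code path.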
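Example call from C#:
// Query<T> is assumed to be Dapper's extension method (Dapper NuGet package).
// Column aliases match the DuplicateItem property names so that every field is mapped.
public List<DuplicateItem> FindDuplicates(string connectionString)
{
var sql = @"SELECT title AS Title, etype AS Type, count(*) AS Count
FROM exams
GROUP BY title, etype
HAVING count(*) > 1";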
using(var conn = new SqlConnection(connectionString))
{
return conn.Query<DuplicateItem>(sql).ToList();
}
}
public class DuplicateItem
{
public string Title { get; set; }
public string Type { get; set; }
public int Count { get; set; }
}
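Symptom: grouping on Chinese text columns gives unexpected results
Cause: SQL Server's collation setting affects how Chinese text is compared
Solution:
-- Specify a Chinese collation in the query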
SELECT title COLLATE Chinese_PRC_CI_AS, ...
GROUP BY title COLLATE Chinese_PRC_CI_AS, ...
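To see how much the collation matters, the same rows can be grouped under two collations. This is a minimal sketch using a hypothetical temp table, #collation_demo, with two titles that differ only in letter case; the final statement shows one way to check which collation a real column uses (it returns no rows if dbo.exams does not exist).

```sql
-- Self-contained demo; #collation_demo is hypothetical.
CREATE TABLE #collation_demo (title NVARCHAR(500));
INSERT INTO #collation_demo (title) VALUES
    (N'SQL Server是微软的产品'),
    (N'sql server是微软的产品');

-- Case-insensitive: both spellings fall into one group.
SELECT title COLLATE Chinese_PRC_CI_AS AS title_ci, COUNT(*) AS cnt
FROM #collation_demo
GROUP BY title COLLATE Chinese_PRC_CI_AS;

-- Case-sensitive: the two spellings are counted separately.
SELECT title COLLATE Chinese_PRC_CS_AS AS title_cs, COUNT(*) AS cnt
FROM #collation_demo
GROUP BY title COLLATE Chinese_PRC_CS_AS;

-- Inspect the collation actually assigned to each text column of a table.
SELECT name, collation_name
FROM sys.columns
WHERE object_id = OBJECT_ID(N'dbo.exams') AND collation_name IS NOT NULL;
```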
Optimization options:
WITH CTE AS (
SELECT title, etype, count(*) as cnt,
ROW_NUMBER() OVER(ORDER BY count(*) DESC) AS rn
FROM [exams]
GROUP BY title, etype
HAVING count(*) > 1
)
SELECT * FROM CTE WHERE rn BETWEEN 1 AND 100;
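On SQL Server 2012 and later, the same paging can be written with OFFSET/FETCH. The sketch below is an alternative formulation rather than the article's query; it also adds tie-breaker columns to the ORDER BY, because ordering by the count alone leaves the order of equal counts undefined and pages could overlap. The temp table #page_demo and the page variables are hypothetical.

```sql
-- Self-contained demo; #page_demo is a hypothetical stand-in for [exams].
CREATE TABLE #page_demo (
    sortid INT PRIMARY KEY,
    etype  NVARCHAR(50),
    title  NVARCHAR(500)
);
INSERT INTO #page_demo (sortid, etype, title) VALUES
    (207, N'单选', N'下列哪个是关系型数据库?'),
    (208, N'单选', N'下列哪个是关系型数据库?'),
    (209, N'多选', N'SQL语句分类包含?'),
    (210, N'多选', N'SQL语句分类包含?'),
    (211, N'判断', N'SQL Server是微软的产品'),
    (212, N'判断', N'SQL Server是微软的产品');

DECLARE @page_number INT = 1;  -- 1-based page index
DECLARE @page_size   INT = 2;  -- rows per page

SELECT title, etype, COUNT(*) AS cnt
FROM #page_demo
GROUP BY title, etype
HAVING COUNT(*) > 1
ORDER BY cnt DESC, title, etype            -- tie-breakers keep page boundaries stable
OFFSET (@page_number - 1) * @page_size ROWS
FETCH NEXT @page_size ROWS ONLY;
```

With the six sample rows there are three duplicate groups, so page 1 returns two of them and page 2 returns the remaining one.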
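-- Alternative: stage the aggregated rows in a temporary table first
SELECT title, etype, count(*) as cnt
INTO #temp_results
FROM [exams]
WHERE ... -- apply the filter conditions first
GROUP BY title, etype;
-- Then query the temporary table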
SELECT * FROM #temp_results WHERE cnt > 1;
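Exact matching (default):
GROUP BY title -- only identical text counts as a duplicate
Fuzzy matching (detecting similar questions):
-- DIFFERENCE scores 0-4 (4 = most similar) by comparing SOUNDEX codes, which are
-- designed for English pronunciation; treat its results on Chinese titles with caution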
SELECT a.title, b.title,
DIFFERENCE(a.title, b.title) AS similarity
FROM [exams] a
JOIN [exams] b ON a.sortid < b.sortid
WHERE DIFFERENCE(a.title, b.title) >= 3
ORDER BY similarity DESC;
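For Chinese titles, a simpler technique that often catches near-duplicates is to normalize the text before grouping, for example by removing spaces and unifying punctuation variants. The sketch below illustrates that idea only; it is not the article's method, the list of characters to strip is an assumption you would extend for real data, and #fuzzy_demo is a hypothetical table.

```sql
-- Self-contained demo; #fuzzy_demo is hypothetical.
CREATE TABLE #fuzzy_demo (
    sortid INT PRIMARY KEY,
    title  NVARCHAR(500)
);
INSERT INTO #fuzzy_demo (sortid, title) VALUES
    (301, N'下列哪个是关系型数据库?'),     -- full-width question mark
    (302, N'下列哪个是关系型数据库?'),      -- half-width question mark
    (303, N' 下列哪个是 关系型数据库 '),    -- stray spaces, no punctuation
    (304, N'SQL语句分类包含?');

WITH normalized AS (
    SELECT sortid, title,
           -- Strip spaces and both question-mark variants.
           REPLACE(REPLACE(REPLACE(title, N' ', N''), N'?', N''), N'?', N'') AS norm_title
    FROM #fuzzy_demo
)
SELECT norm_title,
       COUNT(*)    AS variants,
       MIN(sortid) AS first_id,
       MAX(sortid) AS last_id
FROM normalized
GROUP BY norm_title
HAVING COUNT(*) > 1;
```

Rows 301 to 303 collapse into one normalized title and are reported as three variants; row 304 stays on its own and is filtered out.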
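Once duplicates are found, they can be handled automatically:
-- Flag duplicate records (without deleting them)
ALTER TABLE [exams] ADD is_duplicate BIT NOT NULL DEFAULT 0;
-- GO is the SSMS/sqlcmd batch separator: the new column must exist before the UPDATE below is compiled
GO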
UPDATE e1
SET is_duplicate = 1
FROM [exams] e1
INNER JOIN (
SELECT title, etype, min(sortid) as keep_id
FROM [exams]
GROUP BY title, etype
HAVING count(*) > 1
) e2 ON e1.title = e2.title AND e1.etype = e2.etype
WHERE e1.sortid <> e2.keep_id;
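If removing the extra copies is acceptable, a ROW_NUMBER-based CTE is a common way to keep the first occurrence and delete the rest. The sketch below runs as a dry run inside a transaction that is rolled back; it is an illustration under the assumption that the lowest sortid is the copy to keep, and #dedupe_demo is a hypothetical table.

```sql
-- Self-contained demo; #dedupe_demo is a hypothetical stand-in for [exams].
CREATE TABLE #dedupe_demo (
    sortid INT PRIMARY KEY,
    etype  NVARCHAR(50),
    title  NVARCHAR(500)
);
INSERT INTO #dedupe_demo (sortid, etype, title) VALUES
    (207, N'单选', N'下列哪个是关系型数据库?'),
    (208, N'单选', N'下列哪个是关系型数据库?'),
    (209, N'多选', N'SQL语句分类包含?'),
    (210, N'多选', N'SQL语句分类包含?'),
    (211, N'判断', N'SQL Server是微软的产品'),
    (212, N'判断', N'SQL Server是微软的产品');

BEGIN TRANSACTION;

WITH ranked AS (
    SELECT sortid,
           ROW_NUMBER() OVER (PARTITION BY etype, title ORDER BY sortid) AS rn
    FROM #dedupe_demo
)
DELETE FROM ranked        -- deleting through the CTE removes rows from #dedupe_demo
OUTPUT deleted.sortid     -- report which rows were removed
WHERE rn > 1;             -- rn = 1 is the copy with the lowest sortid

SELECT COUNT(*) AS remaining_rows FROM #dedupe_demo;

ROLLBACK TRANSACTION;     -- dry run; switch to COMMIT once the output has been reviewed
```

Inside the transaction the OUTPUT clause lists 208, 210, and 212 and three rows remain; after the rollback all six rows are back.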
Create a SQL Server Agent job to run the check periodically:
-- Job step scheduled to run every Monday at 6:00 AM
DECLARE @count INT;
SELECT @count = COUNT(*)
FROM (
SELECT title, etype
FROM [exams]
GROUP BY title, etype
HAVING count(*) > 1
) t;
IF @count > 0
BEGIN
-- Send an email notification
EXEC msdb.dbo.sp_send_dbmail
@profile_name = 'DBA_Alerts',
@recipients = 'dba@example.com',
@subject = 'Question bank duplicate alert',
@body = 'Duplicate questions were found; please handle them promptly.';
END
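The script above is only the job step; the job and its weekly schedule still have to be registered with SQL Server Agent. One way to do that from T-SQL is sketched below with the msdb job procedures. It assumes the check has been saved as a stored procedure named dbo.usp_check_exam_duplicates in a database named ExamDB; both names, as well as the job and schedule names, are hypothetical.

```sql
-- Assumptions: dbo.usp_check_exam_duplicates, ExamDB, and the job/schedule
-- names below are hypothetical placeholders.
USE msdb;

EXEC dbo.sp_add_job
    @job_name = N'Check exam duplicates';

EXEC dbo.sp_add_jobstep
    @job_name      = N'Check exam duplicates',
    @step_name     = N'Run duplicate check',
    @subsystem     = N'TSQL',
    @database_name = N'ExamDB',
    @command       = N'EXEC dbo.usp_check_exam_duplicates;';

EXEC dbo.sp_add_schedule
    @schedule_name          = N'Weekly Monday 06:00',
    @freq_type              = 8,       -- weekly
    @freq_interval          = 2,       -- Monday
    @freq_recurrence_factor = 1,       -- every week
    @active_start_time      = 060000;  -- HHMMSS

EXEC dbo.sp_attach_schedule
    @job_name      = N'Check exam duplicates',
    @schedule_name = N'Weekly Monday 06:00';

EXEC dbo.sp_add_jobserver
    @job_name = N'Check exam duplicates';  -- targets the local server by default
```

The same job can of course be created through the SSMS job dialogs; the script form is mainly useful for repeatable deployments.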
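The same approach extends to multi-table scenarios:
-- Find questions that appear in both the current and the archived question bank
SELECT a.title, a.etype, 'current' AS source
FROM [exams] a
WHERE EXISTS (
    SELECT 1 FROM [exams_archive] b
    WHERE a.title = b.title AND a.etype = b.etype
)
UNION ALL
-- The outer table needs an alias here too: an unqualified title inside the
-- subquery would bind to b.title and make the condition always true
SELECT a.title, a.etype, 'archive' AS source
FROM [exams_archive] a
WHERE EXISTS (
    SELECT 1 FROM [exams] b
    WHERE a.title = b.title AND a.etype = b.etype
);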
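When only the list of shared questions is needed, without one row per source table, INTERSECT expresses the same overlap more compactly. The sketch below uses two hypothetical temp tables, #current_demo and #archive_demo, in place of [exams] and [exams_archive].

```sql
-- Self-contained demo; both temp tables are hypothetical stand-ins.
CREATE TABLE #current_demo (
    sortid INT PRIMARY KEY,
    etype  NVARCHAR(50),
    title  NVARCHAR(500)
);
CREATE TABLE #archive_demo (
    sortid INT PRIMARY KEY,
    etype  NVARCHAR(50),
    title  NVARCHAR(500)
);
INSERT INTO #current_demo (sortid, etype, title) VALUES
    (207, N'单选', N'下列哪个是关系型数据库?'),
    (209, N'多选', N'SQL语句分类包含?');
INSERT INTO #archive_demo (sortid, etype, title) VALUES
    (1, N'单选', N'下列哪个是关系型数据库?'),
    (2, N'判断', N'SQL Server是微软的产品');

-- INTERSECT returns each distinct (etype, title) pair present in both tables.
SELECT etype, title FROM #current_demo
INTERSECT
SELECT etype, title FROM #archive_demo;
```

Unlike an equality join, INTERSECT treats two NULLs as matching, which matters if either column can be NULL.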
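In real projects I often use this approach to help clients clean up question bank systems. In one case, the analysis revealed that 15% of the questions on an online education platform were duplicates; cleaning them up improved database performance and noticeably improved the experience of users working through the questions. The key is not only to find the duplicates but also to put a lasting mechanism in place that prevents them from appearing again.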