Last week I wrapped up a data analysis project for a cross-border e-commerce client who handed me an Amazon sales file with more than 500,000 rows. At that scale, Excel simply freezes, and full-table queries in MySQL are frustratingly slow. In this post I'll walk through my complete workflow for handling mid-sized datasets like this, focusing on the practical efficiency tricks I've picked up from real projects.
The core idea of the workflow: use Python to draw a smart sample from the raw data, run multi-dimensional aggregations in SQL, and then visualize the results back in Python. This combination keeps the analysis fast while keeping the results explainable. I'll use the Amazon sales dataset from Kaggle as the running example and point out the pitfalls at each step.
Important note: all code samples here have been tested on real data, but be sure to adapt field names and parameters to your actual data format. This goes double for the date conversion step, since date formats vary wildly between data sources.
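To make that concrete, here is a minimal sketch of defensive date parsing; the file name and the format string '%m/%d/%Y %H:%M' are assumptions (borrowed from the SQL later in this post), so swap in whatever your source actually uses.

import pandas as pd

df = pd.read_csv('amazon_sales.csv')
# errors='coerce' turns unparseable values into NaT instead of raising,
# so you can inspect exactly which rows didn't match the expected format
parsed = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%Y %H:%M', errors='coerce')
print(f"Unparseable dates: {parsed.isna().sum()}")
print(df.loc[parsed.isna(), 'InvoiceDate'].head())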
When a 500k-row CSV lands on my desk, my first move is not to load it into a database but to run a quick diagnostic with Pandas. This step saves a lot of downstream pain:
import pandas as pd

# Quick look at the data without loading the whole file
# (nrows=5 keeps this instant even on huge files)
df_info = pd.read_csv('amazon_sales.csv', nrows=5)
print(df_info.head())
print("\nColumn dtype preview:")
print(df_info.dtypes)
This quick check surfaced several key problems: the date field is stored as a string, some amount fields contain special symbols, and CustomerID has a lot of nulls. If you only discover these at the SQL stage, debugging becomes very painful.
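As one possible cleanup for the amount symbols, plus a quick audit of the nulls (the regex character set is an assumption; check what your file actually contains):

# df is the full DataFrame loaded earlier
# Strip currency symbols and thousands separators, then cast to float
df['UnitPrice'] = (df['UnitPrice'].astype(str)
                   .str.replace(r'[$,£]', '', regex=True)
                   .astype(float))
# Quantify the CustomerID gap before deciding whether to drop or impute
print(f"Missing CustomerID: {df['CustomerID'].isna().mean():.1%}")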
For mid-sized datasets I recommend stratified sampling over simple random sampling, because it keeps key dimensions (month, product category, and so on) balanced in the sample:
# Complete sampling pipeline
def process_large_file(input_path, output_path, sample_size=10000):
    # Read in chunks to avoid loading 500k rows into memory at once
    chunk_iter = pd.read_csv(input_path, chunksize=50000)
    sampled_parts = []
    for chunk in chunk_iter:
        # Basic cleaning: drop rows with no CustomerID, filter extreme returns
        chunk = chunk.dropna(subset=['CustomerID'])
        chunk = chunk[chunk['Quantity'] > -100]
        # Derive the stratification key if the raw file doesn't already have one
        if 'Month' not in chunk.columns:
            chunk['Month'] = pd.to_datetime(
                chunk['InvoiceDate'], errors='coerce').dt.strftime('%Y-%m')
        # Stratified sampling: make sure every month is represented
        chunk_sample = chunk.groupby('Month', group_keys=False).apply(
            lambda x: x.sample(min(len(x), sample_size // 12), random_state=42))
        sampled_parts.append(chunk_sample)
    sampled_data = pd.concat(sampled_parts, ignore_index=True)
    # Second pass caps the total sample size (each chunk oversamples on its own)
    sampled_data = sampled_data.sample(
        n=min(sample_size, len(sampled_data)), random_state=42)
    sampled_data.to_csv(output_path, index=False)
    return sampled_data
This sampling scheme has three big advantages: chunked reading keeps memory usage bounded, stratifying by month keeps the time dimension balanced, and the fixed random_state makes the sample reproducible.
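Typical usage looks like this (file names are placeholders); the value_counts check confirms the monthly strata came out balanced:

sample = process_large_file('amazon_sales.csv', 'amazon_sample.csv', sample_size=10000)
print(sample['Month'].value_counts().sort_index())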
After importing the sampled data into MySQL, I usually do three things to speed up queries:
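As for the load itself, here is one possible sketch; the connection string, driver, and database name are placeholders for your own setup.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; point it at your own MySQL instance
engine = create_engine('mysql+pymysql://user:password@localhost/ecommerce')
sample = pd.read_csv('amazon_sample.csv')
# Write the sampled data into a table named amazon, replacing any old copy
sample.to_sql('amazon', engine, if_exists='replace', index=False)

With the table loaded, the three optimizations: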
-- 1. Add key indexes
ALTER TABLE amazon ADD INDEX idx_invoice (InvoiceNo);
ALTER TABLE amazon ADD INDEX idx_customer (CustomerID);
ALTER TABLE amazon ADD INDEX idx_date (InvoiceDate);
-- 2. Tighten the column types
ALTER TABLE amazon MODIFY COLUMN UnitPrice DECIMAL(10,2);
ALTER TABLE amazon MODIFY COLUMN Quantity INT;
-- 3. Pre-compute common dimensions
-- (assumes a Month column exists; run ALTER TABLE amazon ADD COLUMN Month VARCHAR(7) first if not)
UPDATE amazon
SET Month = DATE_FORMAT(STR_TO_DATE(InvoiceDate, '%m/%d/%Y %H:%i'), '%Y-%m');
Everyone knows how to compute basic sales totals, so I'd rather focus on a few analyses with more direct business value:
-- Deep-dive return analysis
SELECT
    Description,
    COUNT(*) AS return_count,
    SUM(ABS(Quantity) * UnitPrice) AS return_amount,
    -- return_rate here = this product's share of all return transactions
    ROUND(COUNT(*) / (SELECT COUNT(*) FROM amazon WHERE Quantity < 0), 4) AS return_rate
FROM amazon
WHERE Quantity < 0
GROUP BY Description
HAVING return_count > 5
ORDER BY return_amount DESC;
This query helps you spot products that sell well but also come back often, which usually points to quality issues or misleading listings.
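To tie returns back to sales volume explicitly, here is a small pandas sketch (reusing the engine from earlier; the 100-unit cutoff is an arbitrary illustration):

# Sold and returned units per product, joined into one return-ratio table
sales = pd.read_sql(
    "SELECT Description, SUM(Quantity) AS sold FROM amazon "
    "WHERE Quantity > 0 GROUP BY Description", engine)
returns = pd.read_sql(
    "SELECT Description, SUM(-Quantity) AS returned FROM amazon "
    "WHERE Quantity < 0 GROUP BY Description", engine)
ratio = sales.merge(returns, on='Description', how='left').fillna(0)
ratio['return_ratio'] = ratio['returned'] / ratio['sold']
# High-volume products with the worst return ratios deserve a closer look
print(ratio[ratio['sold'] > 100].nlargest(10, 'return_ratio'))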
-- Full RFM analysis
WITH rfm_base AS (
    SELECT
        CustomerID,
        DATEDIFF('2011-12-01', MAX(STR_TO_DATE(InvoiceDate, '%m/%d/%Y %H:%i'))) AS recency,
        COUNT(DISTINCT InvoiceNo) AS frequency,
        SUM(Quantity * UnitPrice) AS monetary
    FROM amazon
    WHERE Quantity > 0
    GROUP BY CustomerID
)
SELECT
    CustomerID,
    recency,
    frequency,
    monetary,
    NTILE(5) OVER (ORDER BY recency DESC) AS r_score,
    NTILE(5) OVER (ORDER BY frequency) AS f_score,
    NTILE(5) OVER (ORDER BY monetary) AS m_score,
    CONCAT(
        NTILE(5) OVER (ORDER BY recency DESC),
        NTILE(5) OVER (ORDER BY frequency),
        NTILE(5) OVER (ORDER BY monetary)
    ) AS rfm_cell
FROM rfm_base
WHERE monetary > 0;
The output of this RFM query feeds straight into segmented marketing: for example, a 555 cell marks your champions, a low r_score with a high m_score flags high-value customers drifting toward churn, and a high r_score with a low f_score marks new customers worth nurturing.
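If you'd rather do the labeling in Python, here is an illustrative sketch; the segment names and thresholds are my own assumptions rather than a standard, and rfm_query stands for the SQL above.

# rfm_query holds the RFM SQL above; engine is the SQLAlchemy engine from earlier
rfm = pd.read_sql(rfm_query, engine)

def label_segment(row):
    # Cutoffs are illustrative; tune them to your own customer base
    if row['r_score'] >= 4 and row['f_score'] >= 4 and row['m_score'] >= 4:
        return 'champion'
    if row['r_score'] <= 2 and row['m_score'] >= 4:
        return 'at-risk high value'
    if row['r_score'] >= 4 and row['f_score'] <= 2:
        return 'new customer'
    return 'regular'

rfm['segment'] = rfm.apply(label_segment, axis=1)
print(rfm['segment'].value_counts())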
For visualization I prefer interactive Plotly charts over static Matplotlib:
import plotly.express as px

# Build a DataFrame straight from a SQL query
# (engine is the SQLAlchemy engine created earlier)
monthly_sales = pd.read_sql("""
    SELECT
        DATE_FORMAT(STR_TO_DATE(InvoiceDate, '%m/%d/%Y %H:%i'), '%Y-%m') AS Month,
        SUM(Quantity * UnitPrice) AS Sales
    FROM amazon
    WHERE Quantity > 0
    GROUP BY Month
    ORDER BY Month
""", engine)
# Interactive line chart with point markers
fig = px.line(monthly_sales, x='Month', y='Sales',
              title='Monthly Sales Trend with Annotations',
              markers=True)
# Add an average line as a reference
mean_sales = monthly_sales['Sales'].mean()
fig.add_hline(y=mean_sales, line_dash="dot",
              annotation_text=f'Avg: ${mean_sales:,.2f}',
              annotation_position="bottom right")
fig.show()
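One small follow-up I find handy: write_html saves the interactive chart as a standalone file you can hand to a client.

# Optional: export the chart as a self-contained HTML file
fig.write_html('monthly_sales_trend.html')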
A heatmap works well for visualizing which products tend to sell together:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Build the transaction matrix (df is the cleaned, sampled DataFrame from earlier)
basket = df.groupby(['InvoiceNo', 'Description'])['Quantity'] \
           .sum().unstack().fillna(0)
# Convert quantities to a boolean purchased/not-purchased indicator
basket_sets = basket > 0
# Mine frequent itemsets
frequent_itemsets = apriori(basket_sets, min_support=0.05, use_colnames=True)
# Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
# Flatten the frozenset labels so the heatmap axes stay readable
rules['antecedents'] = rules['antecedents'].apply(lambda s: ', '.join(s))
rules['consequents'] = rules['consequents'].apply(lambda s: ', '.join(s))
# Plot the lift matrix as a heatmap
fig = px.imshow(rules.pivot(index='antecedents',
                            columns='consequents',
                            values='lift'),
                title='Product Association Heatmap')
fig.show()
Once the table grows past roughly 100,000 rows, these techniques noticeably speed up queries. First, materialize expensive intermediate results into a temporary table:
CREATE TEMPORARY TABLE temp_high_value_customers AS
SELECT CustomerID FROM amazon
WHERE Quantity > 0
GROUP BY CustomerID
HAVING SUM(Quantity * UnitPrice) > 1000;
-- Later queries reuse the temporary table
SELECT * FROM amazon
WHERE CustomerID IN (SELECT CustomerID FROM temp_high_value_customers);
Second, rewrite IN subqueries as joins against a derived table:

-- Slow version
SELECT * FROM amazon
WHERE InvoiceNo IN (SELECT InvoiceNo FROM ...);
-- Fast version
SELECT a.* FROM amazon a
JOIN (SELECT DISTINCT InvoiceNo FROM ...) b
ON a.InvoiceNo = b.InvoiceNo;
Memory is the most common bottleneck when working with large files. My first line of defense is to shrink column dtypes at read time:
# Map each column to the smallest dtype that safely holds it
# (use the nullable 'Int16' instead of 'int16' if the column may contain NaN)
dtypes = {
    'InvoiceNo': 'category',
    'Description': 'category',
    'Quantity': 'int16',
    'UnitPrice': 'float32'
}
df = pd.read_csv('large_file.csv', dtype=dtypes)
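To verify the savings, compare the deep memory footprint of both versions (reading the file twice here is purely for illustration):

raw = pd.read_csv('large_file.csv')
optimized = pd.read_csv('large_file.csv', dtype=dtypes)
print(f"{raw.memory_usage(deep=True).sum() / 1e6:.1f} MB ->"
      f" {optimized.memory_usage(deep=True).sum() / 1e6:.1f} MB")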
If the file still won't fit even with tighter dtypes, switch to Dask for out-of-core processing:

import dask.dataframe as dd

# Dask reads the files lazily and only materializes the final aggregate
ddf = dd.read_csv('very_large_*.csv')
result = ddf.groupby('CustomerID').agg({'UnitPrice': 'mean'}).compute()
Through this project I distilled e-commerce data analysis into three key stages:
Data preparation: profile the raw file quickly with Pandas, then clean it and draw a stratified sample.
Analysis: add indexes and pre-computed dimensions in MySQL, then run aggregations, return analysis, and RFM segmentation.
Visualization: present trends with interactive Plotly charts and cross-sell patterns with association heatmaps.
From here, there are several directions worth exploring next.
This workflow isn't limited to e-commerce data: with minor adjustments it applies to mid-sized datasets in retail, logistics, and similar domains. The key is tuning the analysis dimensions and metric weights to the specifics of the business.