电商平台每天产生海量用户行为数据,这些数据蕴含着巨大的商业价值。作为一名数据分析师,我最近完成了一个基于Python的电商用户消费行为分析项目,通过数据挖掘和机器学习技术,从原始行为日志中提取出有价值的用户洞察。这个项目涵盖了从数据收集到模型部署的全流程,特别适合想要进入电商数据分析领域的朋友参考。
项目核心目标是回答三个关键业务问题:
整个分析流程采用Python技术栈,主要依赖pandas进行数据处理,sklearn构建机器学习模型,matplotlib/seaborn实现可视化。下面我将详细拆解每个环节的技术实现和实操要点。
电商用户行为数据通常来自以下几个渠道:
典型的数据字段包括:
python复制{
'user_id': '用户唯一标识',
'session_id': '会话ID',
'product_id': '商品ID',
'category_id': '类目ID',
'behavior_type': '行为类型(点击/收藏/加购/购买)',
'timestamp': '行为时间戳',
'purchase_amount': '购买金额(仅购买行为)'
}
数据清洗是分析的基础,这里分享几个我在实践中总结的关键点:
缺失值处理:
python复制# 向前填充适用于时间序列数据
data.fillna(method='ffill', inplace=True)
# 分类变量用众数填充
from sklearn.impute import SimpleImputer
cat_imputer = SimpleImputer(strategy='most_frequent')
data[['category']] = cat_imputer.fit_transform(data[['category']])
异常值检测:
python复制# 基于IQR方法检测购买金额异常
Q1 = data['amount'].quantile(0.25)
Q3 = data['amount'].quantile(0.75)
IQR = Q3 - Q1
data = data[~((data['amount'] < (Q1 - 1.5 * IQR)) |
(data['amount'] > (Q3 + 1.5 * IQR)))]
时间格式标准化:
python复制# 统一时间格式并提取时间特征
data['timestamp'] = pd.to_datetime(data['timestamp'], unit='ms')
data['hour'] = data['timestamp'].dt.hour
data['day_of_week'] = data['timestamp'].dt.dayofweek
特别注意:电商数据中用户ID和会话ID的关联关系需要特别验证,我曾遇到过因会话超时设置不合理导致用户行为被错误分割的情况。
RFM是电商领域最经典的用户分群方法,计算三个核心指标:
python复制# 计算RFM指标
now = pd.to_datetime('2023-12-01') # 分析基准日
rfm = data.groupby('user_id').agg({
'timestamp': lambda x: (now - x.max()).days, # Recency
'order_id': 'nunique', # Frequency
'amount': 'sum' # Monetary
}).rename(columns={
'timestamp': 'recency',
'order_id': 'frequency',
'amount': 'monetary'
})
# 对指标进行分箱评分
rfm['R_score'] = pd.qcut(rfm['recency'], q=5, labels=[5,4,3,2,1])
rfm['F_score'] = pd.qcut(rfm['frequency'], q=5, labels=[1,2,3,4,5])
rfm['M_score'] = pd.qcut(rfm['monetary'], q=5, labels=[1,2,3,4,5])
rfm['RFM_score'] = rfm['R_score'].astype(int) + rfm['F_score'].astype(int) + rfm['M_score'].astype(int)
除了基础的K-Means,实践中我发现这些方法效果更好:
特征缩放:
python复制from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[['recency','frequency','monetary']])
确定最佳聚类数:
python复制# 肘部法则
from sklearn.cluster import KMeans
inertia = []
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(rfm_scaled)
inertia.append(kmeans.inertia_)
# 绘制肘部曲线选择拐点
plt.plot(range(1,11), inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
聚类结果解读:
python复制# 添加聚类标签
kmeans = KMeans(n_clusters=4, random_state=42)
rfm['cluster'] = kmeans.fit_predict(rfm_scaled)
# 分析各簇特征
cluster_profile = rfm.groupby('cluster').agg({
'recency': 'mean',
'frequency': 'mean',
'monetary': ['mean', 'count']
})
Apriori算法是挖掘商品关联关系的经典方法,但在实际应用中需要注意:
数据预处理:
python复制# 将订单-商品关系转换为矩阵形式
order_item = data[data['behavior_type']=='purchase'].groupby('order_id')['product_id'].apply(list)
# 生成one-hot编码
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(order_item).transform(order_item)
df = pd.DataFrame(te_ary, columns=te.columns_)
参数调优:
python复制from mlxtend.frequent_patterns import apriori
# 支持度设置需要根据商品数量调整
# 商品数少(100):min_support=0.05
# 商品数多(10000):min_support=0.001
frequent_itemsets = apriori(df, min_support=0.01, use_colnames=True)
# 计算关联规则
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.sort_values('confidence', ascending=False, inplace=True)
使用马尔可夫链分析用户行为路径:
python复制# 构建状态转移矩阵
def create_transition_matrix(data):
users = data['user_id'].unique()
transition_counts = pd.DataFrame(0, index=behavior_types, columns=behavior_types)
for user in users:
user_actions = data[data['user_id']==user].sort_values('timestamp')['behavior_type'].values
for i in range(len(user_actions)-1):
from_state = user_actions[i]
to_state = user_actions[i+1]
transition_counts.loc[from_state, to_state] += 1
# 归一化得到概率
transition_matrix = transition_counts.div(transition_counts.sum(axis=1), axis=0)
return transition_matrix
除了基本的RFM特征,这些特征在实践中很有效:
python复制# 时间维度特征
features['last_3d_actions'] = data[data['timestamp'] > (now - pd.Timedelta(days=3))].groupby('user_id')['behavior_type'].count()
# 品类偏好特征
category_pref = pd.get_dummies(data['category_id']).groupby(data['user_id']).mean()
features = pd.concat([features, category_pref], axis=1)
# 行为序列特征
def get_action_sequence(user_data):
seq = user_data.sort_values('timestamp')['behavior_type'].values
return '>'.join(seq[:10]) # 取最近10个行为
user_sequences = data.groupby('user_id').apply(get_action_sequence)
XGBoost参数调优:
python复制from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.1, 0.2],
'n_estimators': [100, 200],
'subsample': [0.8, 1.0]
}
xgb = XGBClassifier(random_state=42)
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)
模型解释:
python复制# 特征重要性
import shap
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)
# 可视化
shap.summary_plot(shap_values, X_test)
使用Plotly Dash构建交互式看板:
python复制import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
app = dash.Dash(__name__)
app.layout = html.Div([
dcc.Graph(id='rfm-scatter'),
dcc.Dropdown(
id='cluster-selector',
options=[{'label': f'Cluster {i}', 'value': i} for i in range(4)],
value=[0,1,2,3],
multi=True
)
])
@app.callback(
Output('rfm-scatter', 'figure'),
[Input('cluster-selector', 'value')]
)
def update_scatter(selected_clusters):
filtered = rfm[rfm['cluster'].isin(selected_clusters)]
fig = px.scatter_3d(filtered, x='recency', y='frequency', z='monetary',
color='cluster', hover_name=filtered.index)
return fig
基于分析结果可制定以下策略:
使用Flask构建预测API:
python复制from flask import Flask, request, jsonify
import pickle
app = Flask(__name__)
model = pickle.load(open('model.pkl','rb'))
@app.route('/predict', methods=['POST'])
def predict():
data = request.get_json()
df = pd.DataFrame(data, index=[0])
pred = model.predict_proba(df)[0][1] # 返回流失概率
return jsonify({'churn_probability': float(pred)})
建立模型性能监控看板:
python复制# 监控特征漂移
def detect_drift(reference, current, threshold=0.1):
drift_report = {}
for col in reference.columns:
ks_stat = ks_2samp(reference[col], current[col]).statistic
if ks_stat > threshold:
drift_report[col] = ks_stat
return drift_report
这个项目从数据收集到业务应用形成了完整闭环,在实际运营中取得了显著效果。通过合理的用户分群和行为分析,客户回购率提升了15%,营销成本降低了20%。