Stop Memorizing Formulas! Hand-Code Self-Attention in Python and Understand the QKV Matrix Computation in 5 Minutes

fire life

Self-Attention from Scratch: Taking Apart the Core Transformer Computation in Python

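Many people learning the Transformer get lost in the Q, K, and V matrices of the Self-Attention mechanism. Today we skip the heavy formulas and implement a complete Self-Attention computation directly in Python. The code shows exactly how each matrix takes part in the computation, how the attention scores are produced, and how the final output is formed.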

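1. Setup and Basic Concepts

Before writing any code, we need to pin down a few key concepts. The Q (Query), K (Key), and V (Value) matrices in Self-Attention are simply three different linear transformations of the same input. The design is inspired by information retrieval systems:

Query: your search request
Key: the index keywords attached to each document
Value: the actual content to be retrieved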
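The retrieval analogy above can be made concrete with a tiny "soft lookup". This is only an illustrative sketch: the keys, values, and query below are made-up toy numbers, not part of the article's running example. Instead of returning the single best match, the lookup blends all values according to how well each key matches the query.

```python
import numpy as np

# Hypothetical toy "index": three keys, each with one stored value.
keys = np.array([
    [1.0, 0.0],   # key for document A
    [0.0, 1.0],   # key for document B
    [1.0, 1.0],   # key for document C
])
values = np.array([10.0, 20.0, 40.0])  # content stored under each key

query = np.array([1.0, 0.0])           # the "search request"

# Similarity between the query and every key (dot product).
scores = keys @ query

# Softmax turns the similarities into weights that sum to 1.
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()

# A soft lookup returns a weighted blend of all values
# rather than one exact match.
result = weights @ values

print("weights:", weights)
print("blended result:", result)
```

Keys A and C match the query equally well, so they receive equal weight, while key B receives less. Self-Attention performs this same lookup for every word at once, with learned projections producing the queries, keys, and values.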

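In the implementation we use NumPy for all matrix operations. First make sure NumPy is installed in your Python environment, then import it:

import numpy as np

Suppose our input is a sentence of 3 words, each represented by a 4-dimensional vector (real models use far larger dimensions):

# Input matrix X: 3 words, each a 4-dimensional vector
X = np.array([
    [1, 0, 1, 0],   # word 1
    [0, 2, 0, 2],   # word 2
    [1, 1, 1, 1]    # word 3
])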

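2. Generating the Q, K, and V Matrices

Next we define three weight matrices, Wq, Wk, and Wv, which project the input X into the Q, K, and V matrices. In a real Transformer these weights are learned during training; here we simply fill them with random values:

# Weight matrices (random initialization, so the printed values differ on every run)
Wq = np.random.randn(4, 3)  # maps the 4-dim input to a 3-dim query space
Wk = np.random.randn(4, 3)  # maps the 4-dim input to a 3-dim key space
Wv = np.random.randn(4, 3)  # maps the 4-dim input to a 3-dim value space

# Compute Q, K, V: each has shape (3, 3), one row per word
Q = np.dot(X, Wq)
K = np.dot(X, Wk)
V = np.dot(X, Wv)

print("Q matrix:\n", Q)
print("K matrix:\n", K)
print("V matrix:\n", V)

Note: in practice these weight matrices are usually initialized with a more principled scheme, such as Xavier initialization.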
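As a rough illustration of the note above, here is a minimal sketch of Xavier (Glorot) uniform initialization, which draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The helper name `xavier_uniform` and the fixed seed are assumptions made for this example, not something the article defines.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)   # fixed seed so repeated runs give the same weights

# Same shapes as in the article: 4-dim input, 3-dim Q/K/V spaces.
Wq = xavier_uniform(4, 3, rng)
Wk = xavier_uniform(4, 3, rng)
Wv = xavier_uniform(4, 3, rng)

print("Wq:\n", Wq)
```

Compared with `np.random.randn`, the values stay inside a bounded range that shrinks as the layer gets wider, which keeps the magnitude of Q, K, and V from growing with the input dimension.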

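3. Computing the Attention Scores

Computing the attention scores takes three steps:

Compute the dot product of Q and K (a similarity measure)
Scale the dot products by 1/sqrt(d_k) (this keeps softmax from saturating, which would otherwise lead to vanishing gradients)
Apply softmax to normalize each row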

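# Compute QK^T: entry (i, j) is the similarity between query i and key j
attention_scores = np.dot(Q, K.T)

# Scale (divide by sqrt(d_k))
d_k = K.shape[1]  # dimension of each key vector
attention_scores = attention_scores / np.sqrt(d_k)

# Row-wise softmax; subtracting the row maximum first avoids overflow in np.exp
exp_scores = np.exp(attention_scores - np.max(attention_scores, axis=1, keepdims=True))
attention_weights = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

print("Attention weight matrix:\n", attention_weights)

This weight matrix shows how strongly each word attends to every other word. Each row sums to 1; for example, the first row is how the first word distributes its attention over all words, itself included.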
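A quick way to confirm how the weight matrix should be read is to check its rows directly. The sketch below uses made-up scores rather than the article's random ones, and the helper name `softmax_rows` is an assumption of this example.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in np.exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical scaled scores for a 3-word sentence (rows = queries, columns = keys).
scores = np.array([
    [2.0, 0.5, 1.0],
    [0.1, 3.0, 0.2],
    [1.0, 1.0, 1.0],
])
weights = softmax_rows(scores)

print(weights)
print("row sums:", weights.sum(axis=1))          # each row is a probability distribution
print("most-attended key per query:", weights.argmax(axis=1))
```

The third row has identical scores, so its attention is spread evenly over all three words; the first two rows concentrate on whichever key scored highest.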

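4. Weighted Sum: the Final Output

The last step applies the attention weights to the V matrix:

# Weighted sum of the value vectors
output = np.dot(attention_weights, V)

print("Self-Attention output:\n", output)

This output matrix is the final result of Self-Attention. Each row is a weighted average of the value vectors, so each word's representation now blends in information from the other words in the sentence.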

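5. Complete Code and Common Pitfalls

Putting the steps above together into a single Self-Attention function:

def self_attention(X, Wq, Wk, Wv):
    # Project the input into Q, K, V
    Q = np.dot(X, Wq)
    K = np.dot(X, Wk)
    V = np.dot(X, Wv)
    
    # Scaled dot-product scores
    attention_scores = np.dot(Q, K.T) / np.sqrt(K.shape[1])
    # Row-wise softmax (row max subtracted for numerical stability)
    exp_scores = np.exp(attention_scores - np.max(attention_scores, axis=1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    
    # Weighted sum of the values
    output = np.dot(attention_weights, V)
    
    return output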

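Troubleshooting common problems:

Dimension mismatch errors: Wq, Wk, and Wv must each have as many rows as X has columns, and Wq and Wk must have the same number of columns
Vanishing gradients: do not forget the scaling step that divides by sqrt(d_k)
All attention weights identical: check that softmax is computed along the correct axis and whether the scores differ too little from one another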
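To see why the scaling item in the list above matters, the following sketch compares softmax on unscaled and scaled scores. The dimension `d_k = 512` and the random Q and K are assumptions chosen to exaggerate the effect; they are not taken from the article's example.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with the row max subtracted for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512                              # hypothetical, deliberately large key dimension
Q = rng.standard_normal((4, d_k))
K = rng.standard_normal((4, d_k))

raw = Q @ K.T                          # typical magnitude grows like sqrt(d_k)
scaled = raw / np.sqrt(d_k)            # typical magnitude stays around 1

raw_max = softmax_rows(raw).max(axis=1)
scaled_max = softmax_rows(scaled).max(axis=1)

print("largest weight per row, unscaled:", raw_max)
print("largest weight per row, scaled:  ", scaled_max)
```

Without scaling, each row of the softmax tends to put almost all of its weight on one key. In that saturated regime the softmax gradient is close to zero, which is the vanishing-gradient problem the scaling step is meant to avoid.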

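6. Visualizing Intermediate Results

One of the best ways to understand Self-Attention is to visualize the intermediate results. We can draw the attention weights from section 3 as a heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.heatmap(attention_weights, annot=True, cmap="YlGnBu", 
            xticklabels=["Word 1", "Word 2", "Word 3"],
            yticklabels=["Word 1", "Word 2", "Word 3"])
plt.title("Attention weight heatmap")
plt.xlabel("Key")
plt.ylabel("Query")
plt.show()

The heatmap shows at a glance how strongly each Query word (row) attends to each Key word (column).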

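7. Extending to Multi-Head Attention

Once Self-Attention is clear, Multi-Head Attention is easy to implement. It is essentially several Self-Attention computations run in parallel, with their results concatenated:

def multi_head_attention(X, head=4):
    # Uses 4 heads by default
    outputs = []
    for _ in range(head):
        # Each head has its own Wq, Wk, Wv (random here; learned in a real model)
        Wq = np.random.randn(X.shape[1], 3)
        Wk = np.random.randn(X.shape[1], 3)
        Wv = np.random.randn(X.shape[1], 3)
        
        # Output of a single head, shape (seq_len, 3)
        output = self_attention(X, Wq, Wk, Wv)
        outputs.append(output)
    
    # Concatenate all heads along the feature axis: shape (seq_len, 3 * head)
    multi_head_output = np.concatenate(outputs, axis=1)
    
    return multi_head_output

The advantage of Multi-Head Attention is that each head can learn from a different representation subspace, which increases the expressive power of the model.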
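In the usual Transformer formulation the concatenated heads are also passed through an output projection that maps them back to the model dimension. The sketch below shows that variant under a few assumptions of this example: the function name `multi_head_with_projection`, the choice `head_dim = d_model // num_heads`, the weight scaling, and the fixed seed are all illustrative, and the weights are random rather than learned.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with the row max subtracted for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def multi_head_with_projection(X, num_heads, rng):
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    head_dim = d_model // num_heads
    scale = 1.0 / np.sqrt(d_model)      # keeps the random projections moderate in size

    heads = []
    for _ in range(num_heads):
        # Per-head projections (random stand-ins for learned weights).
        Wq = rng.standard_normal((d_model, head_dim)) * scale
        Wk = rng.standard_normal((d_model, head_dim)) * scale
        Wv = rng.standard_normal((d_model, head_dim)) * scale
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax_rows(Q @ K.T / np.sqrt(head_dim))
        heads.append(weights @ V)        # (seq_len, head_dim)

    concat = np.concatenate(heads, axis=1)                 # (seq_len, d_model)
    Wo = rng.standard_normal((d_model, d_model)) * scale   # output projection
    return concat @ Wo                                      # (seq_len, d_model)

X = np.array([[1, 0, 1, 0],
              [0, 2, 0, 2],
              [1, 1, 1, 1]], dtype=float)
out = multi_head_with_projection(X, num_heads=2, rng=np.random.default_rng(0))
print(out.shape)  # (3, 4)
```

Because each head works in a `d_model // num_heads` dimensional space, the output has the same shape as the input, which is what makes residual connections around the layer possible.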

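8. Optimizations Used in Practice

In real projects, the basic implementation above is extended in several ways:

Batched computation: process several sentences at once
Masking: handle variable-length (padded) input and hide future positions during decoding
Residual connections: ease the training of deep networks
Layer Normalization: stabilize training and speed up convergence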
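The last two items in the list above are usually applied together, right after the attention layer. The following is a minimal sketch under stated assumptions: it uses the "add, then normalize" arrangement, omits the learned scale and shift parameters that a full Layer Normalization has, and substitutes random numbers for a real attention output. The helper names `layer_norm` and `add_and_norm` are made up for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance.
    Simplified: no learned scale/shift parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))          # hypothetical input: 3 tokens, 4 features
attn_out = rng.standard_normal((3, 4))   # stand-in for an attention output of the same shape

y = add_and_norm(x, attn_out)
print("per-token mean:", y.mean(axis=-1))   # close to 0
print("per-token std: ", y.std(axis=-1))    # close to 1
```

The residual addition requires the attention output to have the same shape as its input, which is one reason practical attention layers end with a projection back to the input dimension.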

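A more complete implementation of the attention layer itself, covering batching, multiple heads, masking, and an output projection (residual connections and Layer Normalization are applied outside this layer), might look like this:

class SelfAttention:
    def __init__(self, input_dim, head_dim, num_heads):
        self.input_dim = input_dim
        self.head_dim = head_dim
        self.num_heads = num_heads
        
        # Initialize the weight matrices (random here; learned in a real model)
        self.Wq = np.random.randn(input_dim, head_dim * num_heads)
        self.Wk = np.random.randn(input_dim, head_dim * num_heads)
        self.Wv = np.random.randn(input_dim, head_dim * num_heads)
        self.Wo = np.random.randn(head_dim * num_heads, input_dim)
    
    def __call__(self, X, mask=None):
        # X has shape (batch, seq, input_dim)
        batch_size, seq_len, _ = X.shape
        
        # Linear projections to Q, K, V
        Q = np.dot(X, self.Wq)  # (batch, seq, num_heads * head_dim)
        K = np.dot(X, self.Wk)
        V = np.dot(X, self.Wv)
        
        # Split into heads: (batch, seq, num_heads, head_dim)
        Q = Q.reshape(batch_size, seq_len, self.num_heads, self.head_dim)
        K = K.reshape(batch_size, seq_len, self.num_heads, self.head_dim)
        V = V.reshape(batch_size, seq_len, self.num_heads, self.head_dim)
        
        # Scaled attention scores: (batch, num_heads, query, key)
        attention_scores = np.einsum('bqhd,bkhd->bhqk', Q, K) / np.sqrt(self.head_dim)
        
        # Apply the mask, if any: 1 marks positions to block, 0 positions to keep;
        # the mask must broadcast to (batch, num_heads, query, key)
        if mask is not None:
            attention_scores = attention_scores + mask * -1e9
        
        # Softmax over the key axis (row max subtracted for numerical stability)
        exp_scores = np.exp(attention_scores - np.max(attention_scores, axis=3, keepdims=True))
        attention_weights = exp_scores / np.sum(exp_scores, axis=3, keepdims=True)
        
        # Weighted sum of the values, then merge the heads back together
        output = np.einsum('bhqk,bkhd->bqhd', attention_weights, V)
        output = output.reshape(batch_size, seq_len, self.num_heads * self.head_dim)
        
        # Final output projection back to input_dim
        output = np.dot(output, self.Wo)
        
        return output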
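The class above adds `mask * -1e9` to the scores, so it expects a mask in which 1 marks a position to block. As a small illustration of that convention, the sketch below builds a causal ("no peeking at the future") mask and applies it to all-equal scores. The helper names `causal_mask` and `masked_softmax` are assumptions of this example and are not part of the class.

```python
import numpy as np

def causal_mask(seq_len):
    """1.0 where the key position lies after the query position (blocked), else 0.0."""
    return np.triu(np.ones((seq_len, seq_len)), k=1)

def masked_softmax(scores, mask):
    """Softmax over the last axis after pushing blocked positions to a very negative score."""
    scores = scores + mask * -1e9
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

seq_len = 4
# All-equal scores with shape (batch, heads, query, key) make the mask's effect easy to see.
scores = np.zeros((1, 1, seq_len, seq_len))
mask = causal_mask(seq_len)              # (query, key); broadcasts over batch and heads

weights = masked_softmax(scores, mask)
print(np.round(weights[0, 0], 2))
```

Each query spreads its attention evenly over itself and the earlier positions, and every later position receives zero weight. A padding mask follows the same convention, with 1 at the padded key positions.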