[LLM Basics] Implementing GPT in 60 Lines of NumPy: Principles and Code Walkthrough

Published: 2023/12/29


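Preface

This article is a condensed walkthrough of the blog post https://jaykmody.com/blog/gpt-from-scratch/, with some of my own understanding added.
Chinese translation: https://jiqihumanr.github.io/2023/04/13/gpt-from-scratch/#circle=on
Project repository: https://github.com/jaymody/picoGPT

By the end, we will have implemented a GPT in 60 lines of code. It can load the pretrained GPT-2 weights released by OpenAI and generate text. Note: this article only implements inference for the GPT model (no batching, and no training).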

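1. Introduction to GPT

GPT (Generative Pre-trained Transformer) is a language model built on the Transformer decoder: it autoregressively predicts the next token given the tokens that precede it.

The idea is that a model which predicts the next token well enough may have the potential to exhibit broadly intelligent behavior.

That is a high-level overview of GPT and its capabilities. Let's dig into the details.

Input / Output

The function signature of a GPT looks roughly like this: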

def gpt(inputs: list[int]) -> list[list[float]]:
    """ GPT: predicts the next token at every position
    inputs: list[int], shape [n_seq], token ids of the input text
    output: list[list[float]], shape [n_seq, n_vocab], predicted logits
    """
    output = ...  # placeholder for the GPT computation implemented below
    return output

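Input

The input is a piece of text represented as a sequence of integers, where each integer corresponds to one token of the text. For example: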

text  = "robot must obey orders"
tokens = ["robot", "must", "obey", "orders"]
inputs = [1, 0, 2, 4]

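A token is a sub-piece of the text, produced by some kind of tokenizer.

The tokenizer splits text into indivisible token units, which gives a compact representation of the text and makes it easier for the model to learn its structure and semantics.

Every tokenizer comes with a vocabulary, which we use to map tokens to integers: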

# The index of a token in the vocabulary is its integer ID
# e.g. the integer ID of "robot" is 1, because vocab[1] = "robot"
vocab = ["must", "robot", "obey", "the", "orders", "."]

# A tokenizer that splits on whitespace
tokenizer = WhitespaceTokenizer(vocab)

# encode() converts a str into a list[int]
ids = tokenizer.encode("robot must obey orders") # ids = [1, 0, 2, 4]

# Mapping through the vocabulary shows what the actual tokens are
tokens = [tokenizer.vocab[i] for i in ids] # tokens = ["robot", "must", "obey", "orders"]

# decode() converts a list[int] back into a str
text = tokenizer.decode(ids) # text = "robot must obey orders"

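In short:

A vocabulary vocab containing every token in the text can be built from a corpus and a tokenizer.
The tokenizer splits the text into a sequence of tokens, and vocab maps each token to its integer token id, which gives the input token sequence.

Finally, vocab can be used to convert a sequence of token ids back into text.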
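The WhitespaceTokenizer used above is not defined in the article. A minimal sketch of what such a class might look like is shown here; the class body, the token_to_id attribute and the lack of unknown-token handling are all assumptions made for illustration. GPT-2 itself uses a byte pair encoding tokenizer, not whitespace splitting.

```python
class WhitespaceTokenizer:
    """Toy tokenizer (illustrative assumption): split on whitespace, look up in a fixed vocab."""

    def __init__(self, vocab: list[str]):
        self.vocab = vocab
        # Reverse lookup: token string -> integer id
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for tokens outside the vocabulary (no unknown-token handling)
        return [self.token_to_id[tok] for tok in text.split()]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.vocab[i] for i in ids)


vocab = ["must", "robot", "obey", "the", "orders", "."]
tokenizer = WhitespaceTokenizer(vocab)
ids = tokenizer.encode("robot must obey orders")
text = tokenizer.decode(ids)
print(ids, text)
```

Encoding followed by decoding only round-trips exactly here because the text contains single spaces and every word is in the vocabulary.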

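Output

output is a 2-D array, where output[i][j] is the model's predicted probability that the token following position i (that is, the token after inputs[i]) is vocab[j]. Strictly speaking, the model returns unnormalized logits; probabilities are shown here for readability. For example: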

inputs = [1, 0, 2]  # "robot" "must" "obey"
vocab = ["must", "robot", "obey", "the", "orders", "."]
output = gpt(inputs)

# output[0] = [0.75, 0.1, 0.15, 0.0, 0.0, 0.0]
# given just "robot", the model predicts "must" with the highest probability

# output[1] = [0.0, 0.0, 0.8, 0.1, 0.0, 0.1]
# given the sequence ["robot", "must"], the model predicts "obey" with the highest probability

# output[-1] = [0.0, 0.0, 0.1, 0.0, 0.85, 0.05]
# given the whole sequence ["robot", "must", "obey"], the model predicts "orders" with the highest probability
next_token_id = np.argmax(output[-1])  # next_token_id = 4
next_token = vocab[next_token_id]      # next_token = "orders"

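In the example above, the input sequence is ["robot", "must", "obey"], and the GPT model predicts that the next token is "orders", because output[-1][4] = 0.85 is the highest value over the vocabulary.

output[0] says that, given the input token "robot", the most likely next token is "must", with probability 0.75.
output[-1] says that, given the whole input sequence ["robot", "must", "obey"], the most likely next token is "orders", with probability 0.85.

To predict the next token of a sequence, we simply pick the most likely token at the last position of output. By repeatedly appending the previous prediction to the input and feeding it back into the model, we can keep generating tokens.

Always taking the most likely token is called greedy decoding (greedy sampling). In practice, the categorical distribution is often reshaped with a temperature T (values below 1 sharpen it, values above 1 flatten it) and truncated to the top-k candidates before a token is sampled from it.

Concretely, each iteration appends the token predicted in the previous round to the end of the input and then predicts the value at the next position. Repeating this is the whole autoregressive prediction process: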

def generate(inputs, n_tokens_to_generate):
    """ GPT generation loop
    inputs: list[int], token ids of the input text (extended in place)
    n_tokens_to_generate: int, number of tokens to generate
    """
    # autoregressive decoding loop
    for _ in range(n_tokens_to_generate):
        output = gpt(inputs)            # forward pass, logits of shape [n_seq, n_vocab]
        next_id = np.argmax(output[-1]) # greedy sampling
        inputs.append(int(next_id))     # append the prediction to the input
    return inputs[len(inputs) - n_tokens_to_generate :]  # only return the generated ids

# A toy example
input_ids = [1, 0, 2]                          # ["robot", "must", "obey"]
output_ids = generate(input_ids, 1)            # output_ids = [4]
output_tokens = [vocab[i] for i in output_ids] # ["orders"]
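The loop above always takes the argmax. The temperature and top-k sampling mentioned earlier could be sketched roughly as follows. The function name sample_next_token and its parameters are hypothetical and do not come from picoGPT; the logits reuse the toy values from the output example.

```python
import numpy as np


def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from logits using temperature scaling and optional top-k."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:
        # Keep the k largest logits and mask everything else out
        # (ties at the threshold are all kept, so slightly more than k may survive)
        kth_largest = np.sort(logits)[-top_k]
        logits = np.where(logits < kth_largest, -np.inf, logits)
    # Numerically stable softmax over the remaining candidates
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


logits = np.array([0.0, 0.0, 0.1, 0.0, 0.85, 0.05])
rng = np.random.default_rng(0)
samples = [sample_next_token(logits, temperature=0.7, top_k=2, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

With top_k=2, only the two highest-scoring ids can ever be drawn. As the temperature approaches 0, the sampler behaves like greedy decoding.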

2. GPT Architecture and Implementation

2.1 Basic building blocks

First, import the dependencies and define a few plotting helpers.

import random
import numpy as np
import matplotlib.pyplot as plt

def plot(x, y, x_axis=None, y_axis=None):
    plt.plot(x, y) 
    if x_axis and isinstance(x_axis, tuple):    
        plt.xlim(x_axis[0], x_axis[1])
    if y_axis and isinstance(y_axis, tuple): 
        plt.ylim(y_axis[0], y_axis[1])
    plt.show()

def plotHot(w):
    plt.figure()
    plt.imshow(w, cmap='hot', interpolation='nearest')
    plt.show()

GELU

The non-linear activation used in the FFN of GPT-2 is GELU (Gaussian Error Linear Unit), an alternative to ReLU. It is approximated by the following function:

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def relu(x):
    return np.maximum(0, x)

Comparing GELU with ReLU

print(gelu(np.array([1, 2, -2, 0.5])))
print(relu(np.array([1, 2, -2, 0.5])))

x = np.linspace(-4, 4, 100) 
plot(x, np.array([gelu(x), relu(x)]).transpose())
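To get a feel for how close the tanh form is to the exact definition GELU(x) = x * Phi(x), where Phi is the standard normal CDF, the two can be compared numerically. This is a small sketch; gelu_tanh and gelu_exact are names introduced here for illustration only.

```python
import math

import numpy as np


def gelu_tanh(x):
    # tanh approximation, as used in the article
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def gelu_exact(x):
    # exact definition: x * Phi(x), with Phi written via the error function
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1 + erf(x / math.sqrt(2)))


x = np.linspace(-4, 4, 801)
max_abs_diff = np.max(np.abs(gelu_tanh(x) - gelu_exact(x)))
print(f"max |tanh approx - exact| on [-4, 4]: {max_abs_diff:.2e}")
```

On this range the two curves are visually indistinguishable, which is why the cheaper tanh form is a common substitute.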

Softmax

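The original softmax formula: $$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Compared with the original softmax, the implementation here subtracts the maximum max(x) before exponentiating, which keeps the computation numerically stable.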

def softmax(x):
    # subtract the max to avoid overflow; the resulting distribution is unchanged
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def rawSoftmax(x):
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x)

num = 100  # generate distinct random numbers and compare the raw values, the raw softmax and the stabilized softmax
numbers = []
for i in range(num):
    number = random.uniform(1, 3)
    while number in numbers:
        number = random.uniform(1, 3)
    numbers.append(number)
plot(np.array(range(num)), np.array([numbers, rawSoftmax(numbers), softmax(numbers)]).transpose())

When the inputs are in a reasonable range, the two give essentially the same output.

raw_x = np.array([[-200, 100, -300, 0, 70000000]])
x1 = softmax(raw_x)
x2 = rawSoftmax(np.array(raw_x))
print(x1, x1.sum(axis=-1), softmax(x1))
print(x2, x2.sum(axis=-1), softmax(x2))

When the input contains an extreme value, the outputs differ (the original softmax produces nan):

[[0. 0. 0. 0. 1.]] [1.] [[0.14884758 0.14884758 0.14884758 0.14884758 0.40460968]]
[[ 0.  0.  0.  0. nan]] [nan] [[nan nan nan nan nan]]
tmp.py:7: RuntimeWarning: overflow encountered in exp
  exp_x = np.exp(x)
tmp.py:8: RuntimeWarning: invalid value encountered in divide
  return exp_x / np.sum(exp_x)
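Subtracting the maximum does not change the result because softmax is invariant to adding a constant c to every input: the factor e^{-c} appears in both the numerator and the denominator and cancels. A quick numerical check of this property, using made-up values:

```python
import numpy as np


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


x = np.array([1.0, 2.0, 3.0])
shifted = x + 1000.0  # the naive formula would overflow on these values

same = np.allclose(softmax(x), softmax(shifted))
print(softmax(x), softmax(shifted), same)
```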

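Layer normalization

Layer normalization standardizes the data along the feature dimension (mean 0, variance 1), then multiplies by a learnable scale and adds a learnable shift, so that the layer keeps its expressive power:

\[\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]

Here $\mu$ and $\sigma^2$ are the mean and variance of $x$ along the feature dimension, and $\epsilon$ is a small constant that guards against division by zero.

Layer normalization effectively mitigates problems such as instability and slow convergence during optimization.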

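def layer_norm(x, g, b, eps: float = 1e-5):
    """ Layer normalization
    x: np.ndarray, input
    g: float or np.ndarray[n_embd], learnable scale parameter gamma
    b: float or np.ndarray[n_embd], learnable shift parameter beta
    eps: float, small constant that avoids division by zero when the variance is 0
    """
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    x = (x - mean) / np.sqrt(variance + eps)  # standardize x along the last axis
    return g * x + b                          # rescale and shift the standardized x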

Visualization example

num, dim = 5, 5
x = np.array([[random.randint(-10, 10) for _ in range(dim)] for _ in range(num)] )
g, b = 1, 0 # no scaling or shifting
x_norm = layer_norm(x, g, b)
print(x)
print(x_norm)
plotHot(x)
plotHot(x_norm)

Output

# before layer normalization
[[ -9   3  -2  -6  -6]
 [-10  -6 -10   8   4]
 [ -1   5  -4  -3  -5]
 [  8   7  -5  -5   9]
 [ 10  -1  -5   3   9]]
 
# after layer normalization
[[-1.2056067   1.68784939  0.48224268 -0.48224268 -0.48224268]
 [-0.96768591 -0.43008263 -0.96768591  1.45152886  0.91392558]
 [ 0.16876312  1.8563943  -0.67505247 -0.39378061 -0.95632434]
 [ 0.8124999   0.65624992 -1.21874985 -1.21874985  0.96874988]
 [ 1.18444594 -0.73156955 -1.42830246 -0.03483665  1.01026272]]

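Figure: heatmap before layer normalization

Figure: heatmap after layer normalization (after standardizing each row, the rows differ less in distribution, which constrains the distribution of the data fed into the network)

A line plot makes the change even more visible (each line represents one row vector):

axis = np.arange(x.shape[1])
plot(axis, x.T)       # transpose so that each row of x becomes one line
plot(axis, x_norm.T)

Figure: line plot before layer normalization

Figure: line plot after layer normalization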
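The effect of layer normalization can also be checked numerically: with g = 1 and b = 0, every row should come out with mean close to 0 and variance close to 1. The sketch below reuses the 5x5 matrix printed in the output above; the variable names row_means and row_vars are introduced here for illustration.

```python
import numpy as np


def layer_norm(x, g, b, eps=1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    return g * (x - mean) / np.sqrt(variance + eps) + b


# The matrix printed in the article's example output
x = np.array([
    [-9, 3, -2, -6, -6],
    [-10, -6, -10, 8, 4],
    [-1, 5, -4, -3, -5],
    [8, 7, -5, -5, 9],
    [10, -1, -5, 3, 9],
], dtype=np.float64)

x_norm = layer_norm(x, g=1.0, b=0.0)
row_means = x_norm.mean(axis=-1)  # statistics along the feature axis
row_vars = x_norm.var(axis=-1)
print(np.round(row_means, 6), np.round(row_vars, 6))
```

The variances are very slightly below 1 because of the eps term under the square root.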

Linear (affine transformation) layer

A standard matrix multiplication plus a bias:

def linear(x, w, b):  # [m, in], [in, out], [out] -> [m, out]
    return x @ w + b

Example

n_num = 3
in_dim, hid_dim = 4, 4
x = np.random.normal(size=(n_num, in_dim))
w = np.random.normal(size=(in_dim, hid_dim))
b = np.random.normal(size=(hid_dim,))
h = linear(x, w, b)
print(f"shape of w: {w.shape}")
print(f"input shape: {x.shape}, output shape: {h.shape}")
plotHot(w)

shape of w: (4, 4)
input shape: (3, 4), output shape: (3, 4)
Figure: visualization of the weights

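2.2 GPT architecture

At a high level, the GPT architecture consists of three parts:

Embedding layer: token embeddings + positional embeddings
Transformer decoder stack: several decoder blocks stacked on top of each other
Prediction: projection of the output back to the vocabulary (projection to vocab)

GPT implementation in code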

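def gpt2(inputs, wte, wpe, blocks, ln_f, n_head):
    """ GPT-2 model
        input/output shapes: [n_seq] -> [n_seq, n_vocab]
        n_vocab, vocabulary size
        n_seq, length of the input token sequence
        n_layer, number of transformer decoder blocks
        n_embd, token embedding dimension
        n_ctx, maximum input sequence length (the length supported by the positional embeddings; rotary position embeddings (RoPE) are one way to improve length extrapolation)
    params:
        inputs: list[int], input token ids
        wte: np.ndarray[n_vocab, n_embd], token embedding matrix (shared with the output classifier)
        wpe: np.ndarray[n_ctx, n_embd], positional embedding matrix
        blocks: list[dict], parameters of the n_layer causal self-attention blocks
        ln_f: dict, parameters (g, b) of the final layer normalization
        n_head: int, number of attention heads
    """
    # 1. add positional information to the token embeddings: token + positional embeddings
    x = wte[inputs] + wpe[range(len(inputs))]  # [n_seq] -> [n_seq, n_embd]

    # 2. forward pass through the n_layer transformer blocks
    for block in blocks:
        x = transformer_block(x, **block, n_head=n_head)  # [n_seq, n_embd] -> [n_seq, n_embd]

    # 3. project the output of the transformer blocks onto the vocabulary
    # these are the logits of the next-token distribution, i.e. the language-model conditional p(x_t | x_{t-1}, ..., x_1)
    x = layer_norm(x, **ln_f)  # [n_seq, n_embd] -> [n_seq, n_embd]
    # an inner product with the embedding matrix (the block output acts as a predicted embedding; the inner product scores its similarity to every vocabulary entry)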
    return x @ wte.T  # [n_seq, n_embd] -> [n_seq, n_vocab]

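Embedding layer

Token embeddings
wte is a learnable parameter matrix of shape [n_vocab, n_embd]. It acts as a lookup table for token embeddings: row $i$ of the matrix is the embedding of the $i$-th token in the vocabulary.
wte[inputs] uses integer array indexing to retrieve the vector of each token in the input.

Positional embeddings
To encode the order of the sequence, positional information is injected by adding a positional encoding to the input representation.
Positional encodings can either be learned or fixed.

wpe, of shape [n_ctx, n_embd], is the learnable positional embedding matrix: row $i$ is the embedding of position $i$ in the input sequence and encodes that position.

n_ctx is the maximum sequence length, which bounds how far the model can extrapolate.

In GPT, the positional embedding matrix wpe is handled like the token embeddings: it is randomly initialized and then learned during training. wpe[range(len(inputs))] uses the positions 0, 1, ..., n_seq - 1 as integer indices to retrieve the positional embedding of each input position.

Token + Positional embeddings
The token embeddings and positional embeddings are added element-wise (not concatenated). The resulting combined embedding encodes both token and position information, and it is the actual input to the transformer decoder blocks.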

x = wte[inputs] + wpe[range(len(inputs))]  # [n_seq] -> [n_seq, n_embd]
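A toy illustration of the two lookups, with randomly initialized tables and sizes chosen only for this sketch. The same token id at two different positions gets the same token embedding but a different combined embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, n_ctx, n_embd = 6, 8, 4  # toy sizes chosen for illustration

wte = rng.normal(size=(n_vocab, n_embd))  # token embedding table
wpe = rng.normal(size=(n_ctx, n_embd))    # positional embedding table

inputs = [1, 0, 2, 1]  # token id 1 appears at positions 0 and 3
tok = wte[inputs]                 # [n_seq, n_embd], rows selected by token id
pos = wpe[range(len(inputs))]     # [n_seq, n_embd], rows selected by position
x = tok + pos                     # element-wise sum, shape unchanged

print(x.shape)
print(np.allclose(tok[0], tok[3]), np.allclose(x[0], x[3]))
```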

Decoder layer

A transformer decoder block consists of two sub-layers:

Multi-head causal self attention
Position-wise feed forward neural network

The transformer decoder stacks n_layer copies of the following transformer_block:

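def transformer_block(x, mlp, attn, ln_1, ln_2, n_head):
    """ Transformer decoder block (logic only; the parameters of each sub-module are passed in)
        input/output shapes: [n_seq, n_embd] -> [n_seq, n_embd]
        n_seq, length of the input token sequence
        n_embd, token embedding dimension
    params:
        x: np.ndarray[n_seq, n_embd], input sequence of token embeddings
        mlp: dict, parameters of the feed forward network
        attn: dict, parameters of the attention layer
        ln_1: dict, parameters of the layer normalization before attention
        ln_2: dict, parameters of the layer normalization before the feed forward network
        n_head: int, number of attention heads
    """
    # Multi-head Causal Self-Attention (layer norm + multi-head self-attention + residual connection)
    x = x + mha(layer_norm(x, **ln_1), **attn, n_head=n_head)  # [n_seq, n_embd] -> [n_seq, n_embd]

    # Position-wise Feed Forward Network (layer norm + FFN + residual connection)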
    x = x + ffn(layer_norm(x, **ln_2), **mlp)  # [n_seq, n_embd] -> [n_seq, n_embd]

    return x

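The layer normalization and residual connections around each sub-layer are there to make training more stable.

Residual connections
A residual connection adds a direct path from the input to the output, which makes it easier for gradients to flow back and so alleviates the vanishing-gradient problem that very deep networks face during optimization.

\[\mathbf{x}^{l+1} = f(\mathbf{x}^l) + \mathbf{x}^l \]

Position-wise feed forward network
The same two-layer MLP is used to transform the representation at every position of the sequence, hence the name position-wise feed forward network.

\[\mathrm{FFN}(\mathbf x) = \mathrm{GELU}(\mathbf{x} \mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2 \]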

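def ffn(x, c_fc, c_proj):
    """ Two-layer feed forward network (logic only; the parameters of each sub-module are passed in)
        input/output shapes: [n_seq, n_embd] -> [n_seq, n_embd]
        n_seq, length of the input token sequence
        n_embd, token embedding dimension
        n_hid, hidden dimension (4*n_embd in GPT-2)
    params:
        x: np.ndarray[n_seq, n_embd], input sequence of token embeddings
        c_fc: dict, up-projection parameters: w of shape [n_embd, n_hid] and b of shape [n_hid]
        c_proj: dict, down-projection parameters: w of shape [n_hid, n_embd] and b of shape [n_embd]
    """
    # project up: map n_embd to the higher dimension 4*n_embd
    a = gelu(linear(x, **c_fc))  # [n_seq, n_embd] -> [n_seq, 4*n_embd]

    # project back down to n_embd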
    x = linear(a, **c_proj)  # [n_seq, 4*n_embd] -> [n_seq, n_embd]

    return x

The network simply projects up and then back down: n_embd is mapped to the higher dimension 4*n_embd and then projected back to n_embd.
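"Position-wise" means that each row of the input is transformed independently by the same weights. The following sketch, with random weights and toy sizes chosen for illustration, checks that running the FFN on the whole sequence matches running it on one position at a time:

```python
import numpy as np


def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def linear(x, w, b):
    return x @ w + b


def ffn(x, c_fc, c_proj):
    return linear(gelu(linear(x, **c_fc)), **c_proj)


rng = np.random.default_rng(0)
n_seq, n_embd = 5, 8
c_fc = {"w": rng.normal(size=(n_embd, 4 * n_embd)), "b": rng.normal(size=4 * n_embd)}
c_proj = {"w": rng.normal(size=(4 * n_embd, n_embd)), "b": rng.normal(size=n_embd)}

x = rng.normal(size=(n_seq, n_embd))
full = ffn(x, c_fc, c_proj)  # all positions at once
per_position = np.vstack([ffn(x[i:i + 1], c_fc, c_proj) for i in range(n_seq)])
print(full.shape, np.allclose(full, per_position))
```

Because no information moves between positions here, mixing across the sequence is left entirely to the attention sub-layer.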

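Multi-head causal self attention
We will build up an understanding of "multi-head causal self attention" step by step, by explaining each word of the name in turn:

Attention
Self
Causal
Multi-Head

Scaled dot-product attention

\[\mathbf{H} = \mathrm{softmax}\left(\frac{\mathbf Q \mathbf K^\top }{\sqrt{d}}\right) \mathbf V \in \mathbb{R}^{T\times d} \]

where the queries $\mathbf Q\in\mathbb R^{T\times d}$, keys $\mathbf K \in\mathbb R^{T\times d}$ and values $\mathbf V\in\mathbb R^{T\times d}$ are matrices, and $T$ is the sequence length.

The attention scores are divided by $\sqrt{d}$ because, when $d$ is large, the dot products grow large in magnitude and push the softmax into saturated regions where its gradients become extremely small, which makes the model hard to optimize.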

def attention_raw(q, k, v):
    """ Plain scaled dot-product attention (no mask)
        input/output shapes: [n_q, d_k], [n_k, d_k], [n_k, d_v] -> [n_q, d_v]
    params:
        q: np.ndarray[n_seq, n_embd], queries
        k: np.ndarray[n_seq, n_embd], keys
        v: np.ndarray[n_seq, n_embd], values
    """
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
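The reason for the square root in the scaling factor can be seen empirically. Assuming the entries of q and k are independent with mean 0 and variance 1 (an idealization made for this sketch), the dot product of two d-dimensional vectors has a standard deviation of about sqrt(d), and dividing by sqrt(d) brings it back to about 1 whatever the dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 2_000

for d in (4, 64, 512):
    # q and k have independent entries with mean 0 and variance 1
    q = rng.normal(size=(n_samples, d))
    k = rng.normal(size=(n_samples, d))
    dots = np.sum(q * k, axis=-1)  # one dot product per sample
    scaled = dots / np.sqrt(d)
    print(f"d={d:4d}  std(q.k)={dots.std():6.2f}  std(q.k/sqrt(d))={scaled.std():.2f}")
```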

# Self-attention can be strengthened by applying learned projections to q, k and v
def self_attention_raw(x, w_k, w_q, w_v, w_proj):
    """ Basic self-attention
        input/output shapes: [n_seq, n_embd] -> [n_seq, n_embd]
    params:
        x: np.ndarray[n_seq, n_embd], input sequence of token embeddings
        w_k: np.ndarray[n_embd, n_embd], key projection parameters
        w_q: np.ndarray[n_embd, n_embd], query projection parameters
        w_v: np.ndarray[n_embd, n_embd], value projection parameters
        w_proj: np.ndarray[n_embd, n_embd], output projection parameters
    """
    # qkv projections
    q = x @ w_q # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]
    k = x @ w_k # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]
    v = x @ w_v # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]

    # perform self attention
    x = attention_raw(q, k, v) # [n_seq, n_embd] -> [n_seq, n_embd]

    # out projection
    x = x @ w_proj # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]

    return x

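# Combining w_q, w_k and w_v into a single matrix, projecting once and then splitting the result reduces the number of matrix multiplications from 4 to 2
def self_attention(x, c_attn, c_proj):
    """ Optimized self-attention (w_q, w_k and w_v are fused into one matrix for the projection, and the result is split afterwards)
        As in GPT-2, bias terms are included (so linear layers, i.e. affine transformations, are used)
        input/output shapes: [n_seq, n_embd] -> [n_seq, n_embd]
    params:
        x: np.ndarray[n_seq, n_embd], input sequence of token embeddings
        c_attn: dict, fused qkv projection parameters: w of shape [n_embd, 3*n_embd] and b of shape [3*n_embd]
        c_proj: dict, output projection parameters: w of shape [n_embd, n_embd] and b of shape [n_embd]
    """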
    # qkv projections
    x = linear(x, **c_attn) # [n_seq, n_embd] -> [n_seq, 3*n_embd]

    # split into qkv
    q, k, v = np.split(x, 3, axis=-1) # [n_seq, 3*n_embd] -> 3 of [n_seq, n_embd]

    # perform self attention
    x = attention_raw(q, k, v) # [n_seq, n_embd] -> [n_seq, n_embd]

    # out projection
    x = linear(x, **c_proj) # [n_seq, n_embd] @ [n_embd, n_embd] = [n_seq, n_embd]

    return x
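That the fused projection is equivalent to three separate ones follows from how matrix multiplication treats column blocks. A small check with random matrices (bias terms are omitted to keep the sketch short, and the name w_fused is introduced here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_seq, n_embd = 4, 6
x = rng.normal(size=(n_seq, n_embd))
w_q, w_k, w_v = (rng.normal(size=(n_embd, n_embd)) for _ in range(3))

# Separate projections: three matrix multiplications
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Fused projection: concatenate the weights column-wise, multiply once, split
w_fused = np.hstack([w_q, w_k, w_v])            # [n_embd, 3*n_embd]
q2, k2, v2 = np.split(x @ w_fused, 3, axis=-1)  # 3 of [n_seq, n_embd]

print(np.allclose(q, q2), np.allclose(k, k2), np.allclose(v, v2))
```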

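Causal
To prevent information leakage in sequence modeling, the attention matrix has to be modified (by adding a mask) so that each position cannot see the tokens that come after it. Otherwise the model could simply look at the following text during training and would not learn anything useful.

# The input is ["not", "all", "heroes", "wear", "capes"]

# Plain self-attention
        not    all   heroes  wear  capes
   not 0.116  0.159  0.055  0.226  0.443
   all 0.180  0.397  0.142  0.106  0.175
heroes 0.156  0.453  0.028  0.129  0.234
  wear 0.499  0.055  0.133  0.017  0.295
 capes 0.089  0.290  0.240  0.228  0.153

 # Causal self-attention (row i is the query, column j is the key)
 # To stop queries from looking into the future, every entry with j > i has to be 0:
        not    all   heroes  wear  capes
   not 0.116  0.     0.     0.     0.
   all 0.180  0.397  0.     0.     0.
heroes 0.156  0.453  0.028  0.     0.
  wear 0.499  0.055  0.133  0.017  0.
 capes 0.089  0.290  0.240  0.228  0.153

 # Zeroing entries after the softmax would leave rows that no longer sum to 1,
 # so the attention matrix is modified before the softmax instead: the scores of the
 # masked entries are set to -∞, which gives them a weight of 0 after normalization
 # mask matrix
 0 -1e10 -1e10 -1e10 -1e10
 0   0   -1e10 -1e10 -1e10
 0   0     0   -1e10 -1e10
 0   0     0     0   -1e10
 0   0     0     0     0

 -1e10 is used instead of -np.inf, because -np.inf can lead to nans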

Attention with the mask added:

def attention(q, k, v, mask):
    """ Scaled dot-product attention with a mask
        input/output shapes: [n_q, d_k], [n_k, d_k], [n_k, d_v] -> [n_q, d_v]
    params:
        q: np.ndarray[n_seq, n_embd], queries
        k: np.ndarray[n_seq, n_embd], keys
        v: np.ndarray[n_seq, n_embd], values
        mask: np.ndarray[n_seq, n_seq], attention mask added to the scores
    """
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v

Visualizing the causal attention mask

x = np.array([1, 1, 1, 1, 1])
causal_mask = (1 - np.tri(x.shape[0], dtype=x.dtype))* -1e10   
print(causal_mask)
plotHot(causal_mask)

[[-0.e+00 -1.e+10 -1.e+10 -1.e+10 -1.e+10]
 [-0.e+00 -0.e+00 -1.e+10 -1.e+10 -1.e+10]
 [-0.e+00 -0.e+00 -0.e+00 -1.e+10 -1.e+10]
 [-0.e+00 -0.e+00 -0.e+00 -0.e+00 -1.e+10]
 [-0.e+00 -0.e+00 -0.e+00 -0.e+00 -0.e+00]]

Figure: attention mask visualization

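def causal_self_attention(x, c_attn, c_proj):
    """ Optimized causal self-attention (w_q, w_k and w_v are fused into one matrix for the projection, and the result is split afterwards)
        As in GPT-2, bias terms are included (so linear layers, i.e. affine transformations, are used)
        input/output shapes: [n_seq, n_embd] -> [n_seq, n_embd]
    params:
        x: np.ndarray[n_seq, n_embd], input sequence of token embeddings
        c_attn: dict, fused qkv projection parameters: w of shape [n_embd, 3*n_embd] and b of shape [3*n_embd]
        c_proj: dict, output projection parameters: w of shape [n_embd, n_embd] and b of shape [n_embd]
    """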
    # qkv projections
    x = linear(x, **c_attn) # [n_seq, n_embd] -> [n_seq, 3*n_embd]

    # split into qkv
    q, k, v = np.split(x, 3, axis=-1) # [n_seq, 3*n_embd] -> 3 of [n_seq, n_embd]

    # causal mask to hide future inputs from being attended to
    causal_mask = (1 - np.tri(x.shape[0], dtype=x.dtype))* -1e10   # [n_seq, n_seq]

    # perform causal self attention
    x = attention(q, k, v, causal_mask) # [n_seq, n_embd] -> [n_seq, n_embd]

    # out projection
    x = linear(x, **c_proj) # [n_seq, n_embd] @ [n_embd, n_embd] = [n_seq, n_embd]

    return x

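In practice, -1e10 is used in place of -np.inf because -np.inf leads to nans: the mask is built by multiplying a 0/1 matrix by the constant, and 0 * -np.inf evaluates to nan.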
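To see the effect of the mask concretely, the sketch below computes the attention weights for random queries and keys (toy sizes, chosen for illustration). It checks that no weight lands on future positions while every row still sums to 1, and then shows the nan that appears if the mask is built by multiplying with -np.inf.

```python
import numpy as np


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
n_seq, d = 5, 8
q = rng.normal(size=(n_seq, d))
k = rng.normal(size=(n_seq, d))

causal_mask = (1 - np.tri(n_seq)) * -1e10  # 0 on and below the diagonal
weights = softmax(q @ k.T / np.sqrt(d) + causal_mask)

future_weight = np.triu(weights, k=1).sum()  # total weight on future positions
row_sums = weights.sum(axis=-1)
print(np.round(weights, 3))
print(future_weight, row_sums)

# Building the mask by multiplication with -np.inf breaks: 0 * -inf is nan
with np.errstate(invalid="ignore"):
    bad_mask = (1 - np.tri(n_seq)) * -np.inf
print(np.isnan(bad_mask).any())
```

The first row of the weights always puts all of its mass on the first token, since that is the only position it is allowed to see.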

Multi-head self-attention

def mha(x, c_attn, c_proj, n_head):
    """ Multi-head causal self-attention
        input/output shapes: [n_seq, n_embd] -> [n_seq, n_embd]
        The dimension of each attention computation drops from n_embd to n_embd/n_head.
        With the reduced dimension, the model attends in several subspaces at once.
    params:
        x: np.ndarray[n_seq, n_embd], input sequence of token embeddings
        c_attn: dict, fused qkv projection parameters: w of shape [n_embd, 3*n_embd] and b of shape [3*n_embd]
        c_proj: dict, output projection parameters: w of shape [n_embd, n_embd] and b of shape [n_embd]
        n_head: int, number of attention heads
    """
    # qkv projection
    x = linear(x, **c_attn)  # [n_seq, n_embd] -> [n_seq, 3*n_embd]

    # split into q, k, v
    qkv = np.split(x, 3, axis=-1)  # [n_seq, 3*n_embd] -> [3, n_seq, n_embd]

    # split n_embd further into n_head attention heads
    qkv_heads = list(map(lambda x: np.split(x, n_head, axis=-1), qkv))  # [3, n_seq, n_embd] -> [3, n_head, n_seq, n_embd/n_head]

    # build the causal mask
    causal_mask = (1 - np.tri(x.shape[0], dtype=x.dtype))* -1e10  # [n_seq, n_seq]

    # run causal self-attention for each head separately (the heads are independent, so they could run in parallel)
    out_heads = [attention(q, k, v, causal_mask) for q, k, v in zip(*qkv_heads)]  # [3, n_head, n_seq, n_embd/n_head] -> [n_head, n_seq, n_embd/n_head]

    # merge the results of the heads
    x = np.hstack(out_heads)  # [n_head, n_seq, n_embd/n_head] -> [n_seq, n_embd]

    # output projection of the multi-head causal self-attention
    x = linear(x, **c_proj)  # [n_seq, n_embd] -> [n_seq, n_embd]

    return x
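The head splitting in mha() works on Python lists of arrays. As an illustration of what the split and merge do to the data, the sketch below (toy sizes, variable names introduced here) shows that np.split along the last axis is the same as a reshape plus transpose, and that np.hstack exactly undoes the split:

```python
import numpy as np

rng = np.random.default_rng(0)
n_seq, n_embd, n_head = 3, 8, 2
q = rng.normal(size=(n_seq, n_embd))

# Split as in mha(): a list of n_head arrays of shape [n_seq, n_embd/n_head]
heads = np.split(q, n_head, axis=-1)

# The same split expressed as a reshape + transpose: [n_head, n_seq, n_embd/n_head]
heads_reshaped = q.reshape(n_seq, n_head, n_embd // n_head).transpose(1, 0, 2)

# np.hstack undoes the split, which is how mha() merges the per-head outputs
merged = np.hstack(heads)

print(len(heads), heads[0].shape, merged.shape)
print(np.allclose(np.stack(heads), heads_reshaped), np.array_equal(merged, q))
```

Each head therefore sees a contiguous slice of n_embd/n_head features. Batched implementations typically use the reshape form so that all heads can be processed in one tensor operation.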

Putting it all together
Putting all of the code together gives gpt2.py, which is only 120 lines in total (or 60 lines if you remove comments and whitespace).
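As a sanity check that does not need the pretrained weights, the pieces can be assembled and run on randomly initialized parameters. This is only a sketch: random_params is a hypothetical helper written for this example (it mirrors the nested-dict layout that gpt2() unpacks with **), the sizes are toy values, and the logits are meaningless because nothing has been trained. It does confirm the output shape and the causal property.

```python
import numpy as np


def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def layer_norm(x, g, b, eps=1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    return g * (x - mean) / np.sqrt(variance + eps) + b

def linear(x, w, b):
    return x @ w + b

def ffn(x, c_fc, c_proj):
    return linear(gelu(linear(x, **c_fc)), **c_proj)

def attention(q, k, v, mask):
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v

def mha(x, c_attn, c_proj, n_head):
    x = linear(x, **c_attn)
    qkv_heads = [np.split(t, n_head, axis=-1) for t in np.split(x, 3, axis=-1)]
    causal_mask = (1 - np.tri(x.shape[0], dtype=x.dtype)) * -1e10
    out_heads = [attention(q, k, v, causal_mask) for q, k, v in zip(*qkv_heads)]
    return linear(np.hstack(out_heads), **c_proj)

def transformer_block(x, mlp, attn, ln_1, ln_2, n_head):
    x = x + mha(layer_norm(x, **ln_1), **attn, n_head=n_head)
    return x + ffn(layer_norm(x, **ln_2), **mlp)

def gpt2(inputs, wte, wpe, blocks, ln_f, n_head):
    x = wte[inputs] + wpe[range(len(inputs))]
    for block in blocks:
        x = transformer_block(x, **block, n_head=n_head)
    return layer_norm(x, **ln_f) @ wte.T


def random_params(n_vocab, n_ctx, n_embd, n_layer, rng):
    """Hypothetical helper: random weights in the nested-dict layout gpt2() expects."""
    def lin(n_in, n_out):
        return {"w": rng.normal(scale=0.02, size=(n_in, n_out)), "b": np.zeros(n_out)}

    def ln():
        return {"g": np.ones(n_embd), "b": np.zeros(n_embd)}

    blocks = [
        {
            "mlp": {"c_fc": lin(n_embd, 4 * n_embd), "c_proj": lin(4 * n_embd, n_embd)},
            "attn": {"c_attn": lin(n_embd, 3 * n_embd), "c_proj": lin(n_embd, n_embd)},
            "ln_1": ln(),
            "ln_2": ln(),
        }
        for _ in range(n_layer)
    ]
    return {
        "wte": rng.normal(scale=0.02, size=(n_vocab, n_embd)),
        "wpe": rng.normal(scale=0.02, size=(n_ctx, n_embd)),
        "blocks": blocks,
        "ln_f": ln(),
    }


rng = np.random.default_rng(0)
params = random_params(n_vocab=10, n_ctx=16, n_embd=8, n_layer=2, rng=rng)

logits_a = gpt2([1, 0, 2, 4], **params, n_head=2)
logits_b = gpt2([1, 0, 2, 7], **params, n_head=2)  # only the last token differs
print(logits_a.shape)
print(np.allclose(logits_a[:3], logits_b[:3]))
```

Changing only the last input token leaves the logits at all earlier positions untouched, which is exactly what the causal mask is meant to guarantee.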

3. Running the project

You can test it with the following command:

python gpt2.py "Alan Turing theorized that computers would one day become" --n_tokens_to_generate 8

The output is: the most powerful machines on the planet.

ToDO

References

[1] Some of the figures are taken from https://jalammar.github.io/illustrated-gpt2/
