Preparing Instruction Fine-Tuning Datasets and Training Batches for Large Language Models

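Instruction fine-tuning a large language model requires a sizable dataset, and such datasets are typically stored as JSON. To train the model effectively, the JSON data has to be converted into a format the model can work with, such as the Alpaca or Phi-3 prompt style. The code in this article shows how to download and parse a JSON dataset and how to format it in the Alpaca style with a Python function. To make training efficient, the dataset is then split into training, validation, and test sets, and a custom PyTorch data loader is built around a custom collate function that pads each batch so that all sequences in the batch have the same length. The code also shows how to pad with pad_token_id and how to build the target token IDs used for training.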

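7.2 Preparing a dataset for supervised instruction fine-tuning

Let's download and format the instruction dataset used to instruction fine-tune a pretrained large language model (LLM). The dataset contains 1,100 instruction-response pairs, as shown in Figure 7.2. It was created specifically for this text, but interested readers can find other publicly available instruction datasets in Appendix B.

The following code implements and runs a function that downloads this dataset, a relatively small file (only 204 KB) stored in JSON format. JSON (JavaScript Object Notation) mirrors the structure of Python dictionaries and provides a simple data interchange format that is easy for humans to read and for machines to process.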

import json
import os
import urllib.request

def download_and_load_file(file_path, url):
    # Download the file only if there is no local copy yet
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)

    # Parse the local copy; a JSON array of objects becomes a list of dicts
    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)
    return data

file_path = "instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)
data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

Running the code above prints:

Number of entries: 1100

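Code walkthrough:

download_and_load_file checks whether the specified file already exists locally. If it does not, the function downloads it from the given URL and saves it to disk.
urllib.request.urlopen opens the URL and reads the data, which is decoded as UTF-8 and written to the local file.
If the file already exists, the local copy is read directly.
json.load parses the JSON file into Python objects, here a list of dictionaries.
Finally, the loaded data is returned.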
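
To make the correspondence between JSON and Python structures concrete, here is a minimal sketch that parses a small JSON string shaped like the dataset file. The sample string is made up for illustration and is not taken from the actual file.

```python
import json

# Hypothetical two-entry sample shaped like instruction-data.json
sample_json = """
[
  {"instruction": "Add the numbers.", "input": "2, 3", "output": "5"},
  {"instruction": "Name a primary color.", "input": "", "output": "Red"}
]
"""

entries = json.loads(sample_json)

# A JSON array becomes a Python list; each JSON object becomes a dict
print(type(entries).__name__, type(entries[0]).__name__)
print(sorted(entries[0].keys()))
```

json.loads works on a string, while json.load, used in the function above, reads from an open file object; both produce the same Python structures.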

The data list loaded from the JSON file contains the 1,100 entries of the instruction dataset. Let's print one of them to see how each entry is structured:

print("Example entry:\n", data[50])

The content of the example entry is:

Example entry:
 {'instruction': 'Identify the correct spelling of the following word.',
 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}

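Code walkthrough:

Each entry is a Python dictionary with three keys: 'instruction', 'input', and 'output'.
'instruction' describes the task, 'input' holds the content to operate on, and 'output' is the expected response.

Let's look at another example: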

print("Another example entry:\n", data[999])

As this entry shows, the 'input' field can sometimes be empty:

Another example entry:
 {'instruction': "What is an antonym of 'complicated'?",
 'input': '',
 'output': "An antonym of 'complicated' is 'simple'."}

Code walkthrough:

The 'input' field is optional. It is sometimes an empty string, which means the instruction needs no additional input.
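
Because the formatting code later has to treat these two cases differently, it can help to check how many entries fall into each. The following is a small sketch that uses only the two entries printed above; on the full dataset the same list comprehensions would run over data instead of sample_data.

```python
# The two entries printed earlier in the article
sample_data = [
    {
        "instruction": "Identify the correct spelling of the following word.",
        "input": "Ocassion",
        "output": "The correct spelling is 'Occasion.'",
    },
    {
        "instruction": "What is an antonym of 'complicated'?",
        "input": "",
        "output": "An antonym of 'complicated' is 'simple'.",
    },
]

# An empty string is falsy in Python, so a plain truth test is enough
with_input = [entry for entry in sample_data if entry["input"]]
without_input = [entry for entry in sample_data if not entry["input"]]

print("Entries with input:", len(with_input))
print("Entries without input:", len(without_input))
```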

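Instruction fine-tuning trains a model on a dataset in which the input-output pairs, like those we extracted from the JSON file, are provided explicitly. There are several ways to format these entries for an LLM. Figure 7.4 shows two example formats, usually called prompt styles, that were used to train well-known LLMs such as Alpaca and Phi-3.

Alpaca was one of the first LLMs to publicly detail its instruction fine-tuning process. Phi-3, developed by Microsoft, is included to show the diversity of prompt styles. The rest of this chapter uses the Alpaca prompt style because it is one of the most popular, largely because it helped define the original approach to fine-tuning.

Let's define a format_input function that converts the entries in the data list into the Alpaca-style input format.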

def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = (
        f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    )
    return instruction_text + input_text

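Code walkthrough:

format_input takes a dictionary entry as input and builds a formatted string.
If entry['input'] is not empty, the ### Input: section is added; otherwise, only the ### Instruction: section is included.

Let's test the function on data[50], the dataset entry we looked at earlier: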

model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"
print(model_input + desired_response)

The formatted input looks like this:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'
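
The example above exercises the branch where 'input' is present. As a complementary check, the sketch below runs the same function on the entry with an empty 'input' field shown earlier. The function body is repeated from the article so that the snippet runs on its own.

```python
def format_input(entry):
    # Same logic as the article's format_input, repeated for self-containment
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text

# The entry printed earlier whose 'input' field is an empty string
entry = {
    "instruction": "What is an antonym of 'complicated'?",
    "input": "",
    "output": "An antonym of 'complicated' is 'simple'.",
}

text = format_input(entry) + f"\n\n### Response:\n{entry['output']}"
print(text)
```

The printed prompt goes straight from the ### Instruction: section to the ### Response: section, with no ### Input: section in between.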

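Exercise 7.1 Changing the prompt style

After fine-tuning the model with the Alpaca prompt style, try the Phi-3 prompt style shown in Figure 7.4 and observe whether it affects the quality of the model's responses.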
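
As a possible starting point for the exercise, here is a rough sketch of a Phi-3 style formatter. Figure 7.4 is not reproduced in this article, so the exact layout is an assumption built around the <|user|> and <|assistant|> role markers; the function names format_input_phi3 and format_response_phi3 are hypothetical. Adjust the spacing and punctuation to match the figure.

```python
def format_input_phi3(entry):
    """Hypothetical Phi-3 style counterpart to format_input (layout is assumed)."""
    instruction_text = f"<|user|>\n{entry['instruction']}"
    # Append the optional input on its own line when it is not empty
    input_text = f"\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text

def format_response_phi3(entry):
    """Hypothetical helper that formats the expected response (layout is assumed)."""
    return f"\n<|assistant|>\n{entry['output']}"

entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion.'",
}

prompt = format_input_phi3(entry)
full_text = prompt + format_response_phi3(entry)
print(full_text)
```

This style is much shorter than the Alpaca template, so each training example uses fewer tokens.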

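Fine-tuning a large language model to follow instructions

Instruction fine-tuning requires the dataset to be formatted and processed appropriately. Next, we look at how to split the dataset into training, validation, and test sets and how to build a custom data loader for efficient batched training.

Splitting the dataset

First, we split the downloaded dataset into training, validation, and test sets. This step is similar to how the spam classification dataset was handled in the previous chapter. The code for the split is: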

train_portion = int(len(data) * 0.85)   # 85% for training
test_portion = int(len(data) * 0.1)     # 10% for testing
val_portion = len(data) - train_portion - test_portion  # remaining 5% for validation
train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

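Code walkthrough:

The code first computes the number of entries for the training set (85%) and the test set (10%), assigns the remainder (about 5%) to the validation set, and then slices the original data list into three parts accordingly. The training set is used to train the model, the validation set is used to evaluate the model during training, and the test set is used for the final evaluation.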
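
The same slicing logic can be wrapped in a small helper, which makes it easy to verify that the three parts cover the data exactly once. This is a minimal sketch: split_data is a hypothetical name, and a list of integers stands in for the 1,100 loaded entries.

```python
def split_data(entries, train_frac=0.85, test_frac=0.1):
    """Split a list into train, test, and validation parts, in that order."""
    train_end = int(len(entries) * train_frac)
    test_end = train_end + int(len(entries) * test_frac)
    # Whatever remains after the train and test slices becomes validation data
    return entries[:train_end], entries[train_end:test_end], entries[test_end:]

# Stand-in for the 1,100 loaded entries
placeholder_entries = list(range(1100))
train_part, test_part, val_part = split_data(placeholder_entries)

print(len(train_part), len(test_part), len(val_part))
# The three slices are contiguous and non-overlapping
print(train_part + test_part + val_part == placeholder_entries)
```

With 1,100 entries this gives 935 training, 110 test, and 55 validation entries. Note that the split is positional, so it assumes the entries are not ordered in a way that would bias any of the three parts.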

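Creating a custom data loader

For instruction fine-tuning we need a custom data loader that handles batches efficiently. This involves defining a custom collate function, which is responsible for merging individual samples into a batch.

Creating the `InstructionDataset` class

First, we create an InstructionDataset class that applies the format_input function and pre-tokenizes the inputs. The code is: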

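import torch

class InstructionDataset(torch.utils.data.Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer
        # Format each entry as an Alpaca-style prompt followed by its response,
        # so that the response text is part of the training sequence
        self.formatted_data = [
            format_input(item) + f"\n\n### Response:\n{item['output']}"
            for item in data
        ]
        # Tokenize once up front instead of on every __getitem__ call
        self.encoded_data = [tokenizer.encode(text) for text in self.formatted_data]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.encoded_data[idx]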

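Code walkthrough:

InstructionDataset inherits from PyTorch's Dataset class and formats and tokenizes the data during initialization. The __len__ method returns the size of the dataset, and __getitem__ returns the encoded sample at the given index.

A custom `collate` function

Next, we define a custom collate function that pads the samples to a common length and creates the target token IDs.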

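import torch

def custom_collate_fn(batch):
    # Pad every sequence in the batch to the length of the longest one
    padded_batch = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor(seq) for seq in batch],
        batch_first=True,
        padding_value=50256,  # <|endoftext|> token ID in the GPT-2 vocabulary
    )

    # Targets are the inputs shifted one position to the left (next-token prediction)
    target_ids = padded_batch[:, 1:]
    # The final position has no next token, so fill it with -100,
    # the default ignore_index of PyTorch's cross-entropy loss
    ignore_column = torch.full((target_ids.size(0), 1), -100)
    target_ids = torch.cat((target_ids, ignore_column), dim=1)

    return padded_batch, target_ids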

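Code walkthrough:

custom_collate_fn first pads the sequences in the batch so that they all have the same length. It then builds the target token IDs by shifting the input sequence one position to the left, so that each position's target is the next token, and fills the final position with -100. PyTorch's cross-entropy loss ignores targets equal to -100 by default, so that position does not contribute to the loss. Note that this simple version masks only the final position; padding tokens elsewhere in the targets still count toward the loss.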
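
The shifting and masking can be followed step by step on plain Python lists. The sketch below is not the function above: it is a variant, under the assumption that one wants to keep the first padding token as an end-of-text target and mask every padding token after it. The name make_inputs_and_targets is hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss
PAD_ID = 50256       # padding token ID used throughout the article

def make_inputs_and_targets(tokens, max_len, pad_id=PAD_ID):
    """Pad `tokens` to max_len + 1, then split into inputs and shifted targets."""
    padded = tokens + [pad_id] * (max_len + 1 - len(tokens))
    inputs = padded[:-1]   # everything except the last position
    targets = padded[1:]   # the same sequence shifted by one position
    # Index in `targets` of the first padding token, kept as an end-of-text target
    first_pad = len(tokens) - 1
    # Mask every padding token after the first one
    for i in range(first_pad + 1, len(targets)):
        targets[i] = IGNORE_INDEX
    return inputs, targets

inputs, targets = make_inputs_and_targets([5, 6], max_len=5)
print(inputs)   # [5, 6, 50256, 50256, 50256]
print(targets)  # [6, 50256, -100, -100, -100]
```

Keeping one padding token as a target lets the model learn when a response ends, while the masked positions are excluded from the loss.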

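7.3 Organizing data into training batches

Instruction fine-tuning is faster when the data is organized into batches. This requires padding the inputs so that all sequences within a batch have the same length.

Implementing the instruction dataset class

First, define an InstructionDataset class that handles preprocessing and encoding of the data: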

import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(tokenizer.encode(full_text))

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

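This class formats the input data and encodes it into sequences of token IDs.

Code walkthrough:

__init__: initializes the dataset object by iterating over the data and formatting and encoding each entry.
format_input: formats the entry according to the prompt template.
tokenizer.encode: encodes the formatted text into a sequence of token IDs.
__getitem__: returns the encoded text at the given index.
__len__: returns the length of the dataset.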
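
To see what the dataset yields without installing a real tokenizer, the following sketch reproduces the same __getitem__ and __len__ protocol in plain Python. ToyTokenizer and ListBackedDataset are hypothetical stand-ins: the toy tokenizer assigns one ID per whitespace-separated word, which is far cruder than a real subword tokenizer, and the class omits the torch Dataset base class so that the snippet has no dependencies.

```python
def format_input(entry):
    # Same logic as the article's format_input, repeated for self-containment
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text

class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one ID per distinct word."""
    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Assign the next free ID the first time a word is seen
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]

class ListBackedDataset:
    """Same protocol as InstructionDataset, without the torch base class."""
    def __init__(self, data, tokenizer):
        self.data = data
        self.encoded_texts = []
        for entry in data:
            full_text = format_input(entry) + f"\n\n### Response:\n{entry['output']}"
            self.encoded_texts.append(tokenizer.encode(full_text))

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

sample_data = [
    {"instruction": "Identify the correct spelling of the following word.",
     "input": "Ocassion", "output": "The correct spelling is 'Occasion.'"},
    {"instruction": "What is an antonym of 'complicated'?",
     "input": "", "output": "An antonym of 'complicated' is 'simple'."},
]

dataset = ListBackedDataset(sample_data, ToyTokenizer())
print(len(dataset))
print([len(dataset[i]) for i in range(len(dataset))])
```

The two encoded sequences have different lengths, which is why the sequences in a batch have to be padded before they can be stacked into a single tensor.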

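A custom batch padding function

To minimize unnecessary padding, we develop a custom collate function that pads the sequences in each batch only to the length of the longest sequence in that batch: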

def custom_collate_draft_1(batch, pad_token_id=50256, device="cpu"):
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst = []
    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])
        inputs_lst.append(inputs)
    inputs_tensor = torch.stack(inputs_lst).to(device)
    return inputs_tensor

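This function pads the sequences in a batch to a common length and converts them to a tensor.

Code walkthrough:

batch_max_length: the length of the longest sequence in the batch plus one, to account for the extra padding token appended to every sequence.
new_item and padded: each sequence is copied, given one extra padding token, and then padded up to batch_max_length.
inputs: the padded sequence converted to a tensor, with the extra trailing padding token removed.
inputs_tensor: all input tensors stacked into a single batch tensor and moved to the target device.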
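
The benefit of padding per batch rather than to the longest sequence in the whole dataset can be estimated with simple arithmetic. The sketch below counts padding tokens under both strategies; count_padding is a hypothetical helper and the sequence lengths are made up for illustration.

```python
def count_padding(lengths, batch_size, per_batch=True):
    """Count the padding tokens needed to equalize sequence lengths."""
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        # Pad to the longest sequence in this batch, or in the whole dataset
        target = max(chunk) if per_batch else global_max
        total += sum(target - n for n in chunk)
    return total

# Hypothetical sequence lengths: one batch of short and one of long sequences
lengths = [5, 2, 3, 40, 38, 41]

per_batch_padding = count_padding(lengths, batch_size=3, per_batch=True)
global_padding = count_padding(lengths, batch_size=3, per_batch=False)
print(per_batch_padding)  # 9
print(global_padding)     # 117
```

In this toy case per-batch padding needs 9 padding tokens instead of 117. The actual saving depends on how much the sequence lengths vary and on how sequences are grouped into batches.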

Testing the custom batch padding function

Let's test custom_collate_draft_1 to make sure it works as intended:

inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]
batch = (inputs_1, inputs_2, inputs_3)
print(custom_collate_draft_1(batch))

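The output is:

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])

The result shows that all input sequences have been padded to the same length.

Code walkthrough:

Input sequences: inputs_1, inputs_2, and inputs_3 are three input sequences of different lengths.
Padding: custom_collate_draft_1 pads the shorter sequences with the token ID 50256 up to the length of the longest one.
Output tensor: the padded sequences are returned as a single tensor of shape 3 × 5.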

玄貓 BlackCat

A technology enthusiast who shares insights on software development, cloud technology, and AI applications.
