Implementing Data Embedding, Storage, and Evaluation Metrics in a RAG Pipeline

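Deeplake provides efficient vector storage that streamlines data handling for retrieval-augmented generative AI models. The text data is first split into fixed-size chunks, and OpenAI's text-embedding-3-small model then converts each chunk into a vector representation. These vectors are kept in a vector store created with Deeplake, ready for later retrieval. With the vector store in place, user input can be augmented by appending the relevant retrieved information to it, which improves the accuracy and relevance of the generated content. Finally, evaluation metrics such as cosine similarity measure how similar the generated content is to the user input and to the retrieved data, which gives a way to assess the generative AI model's performance.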

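Data Embedding and Storage

The second team's job is to embed and store the data prepared by the first team. This involves converting the text data into vector representations that machine learning models can work with, and storing those vectors in a suitable database or storage system.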

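Flowchart

  graph LR
    A[Data collection] --> B[Data cleaning]
    B --> C[Data embedding]
    C --> D[Data storage]

Diagram explanation: This flowchart shows the data processing flow. The data is first collected, then cleaned to remove unnecessary information. The cleaned data is then converted into vector representations, and finally the vectors are stored in a suitable system.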

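Using Deeplake to Build a Vector Store for RAG Data Processing

In this chapter, we look at how to use Deeplake to build a vector store that streamlines the data processing flow for RAG (Retrieval-Augmented Generation). First, we need to download the required data file.

Downloading the Data File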

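# NOTE: `grequests` is assumed here to be a project-local helper module that
# provides download(directory, filename), not the PyPI package of the same name.
from grequests import download
source_text = "llm.txt"
directory = "Chapter02"
filename = "llm.txt"
download(directory, filename)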

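Once the download finishes, we verify the file's content to make sure it is correct.

Verifying the Content

# Open the file and read all of its lines
with open('llm.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
# Print the first 20 lines
for line in lines[:20]:
    print(line.strip())

The output looks as expected: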

Exploration of space, planets, and moons "Space Exploration" redi

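Chunking the Data

We split the data into fixed-size chunks to streamline the data processing flow. In this example, the chunk size is set to 1,000 characters.

Chunking code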

with open(source_text, 'r') as f:
    text = f.read()
CHUNK_SIZE = 1000
chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

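Code walkthrough:

with open(source_text, 'r') as f: opens the file named by source_text in read-only mode.
text = f.read(): reads the entire file content into the text variable.
CHUNK_SIZE = 1000: sets the size of each text chunk to 1,000 characters, so that a long text can be split into smaller parts.
chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]: uses a list comprehension to split text into chunks of CHUNK_SIZE characters and stores them in the chunked_text list.
- text[i:i+CHUNK_SIZE]: takes the substring from index i up to, but not including, index i+CHUNK_SIZE.
- range(0, len(text), CHUNK_SIZE): produces the start indices 0, CHUNK_SIZE, 2*CHUNK_SIZE, and so on, stopping before len(text). The last chunk may therefore be shorter than CHUNK_SIZE.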
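To make the slicing behavior concrete, here is a minimal sketch of the same list comprehension wrapped in a function and applied to a tiny string. The function name chunk_text and the sample string are illustrative assumptions, not part of the original code.

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Same comprehension as in the article, with the size passed as a parameter
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "abcdefghij"  # 10 made-up characters, for illustration only
chunks = chunk_text(sample, 4)
print(chunks)  # ['abcd', 'efgh', 'ij']
```

The final chunk holds only the two leftover characters, and joining the chunks reproduces the original text, so nothing is lost or duplicated at the boundaries.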

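Creating the Vector Store

We need to define the path of the Activeloop vector store and check whether the store already exists. If it does not, a new vector store is created.

Code for creating the vector store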

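# Deeplake 3.x import path (assumed; adjust to the installed version)
from deeplake.core.vector_store.deeplake_vectorstore import VectorStore

vector_store_path = "hub://denis76/space_exploration_v1"
create_vector_store = False  # set to True below if the store cannot be loaded
try:
    # Try to load the vector store
    vector_store = VectorStore(path=vector_store_path)
    print("Vector store exists")
except FileNotFoundError:
    print("Vector store does not exist. You can create it.")
    # Flag that the vector store still has to be created
    create_vector_store = True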

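Code walkthrough:

vector_store_path = "hub://denis76/space_exploration_v1": defines the path of the vector store, here a dataset path on the Activeloop platform.
try...except block:
- VectorStore(path=vector_store_path): tries to load the vector store at the given path.
- except FileNotFoundError: if the vector store does not exist, catches the FileNotFoundError exception and prints a message saying so.
- create_vector_store = True: flags that a new vector store needs to be created.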

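Creating the Embedding Function

The embedding function converts text chunks into vectors so that vector-based search becomes possible. In this example, we use OpenAI's "text-embedding-3-small" model.

Code for the embedding function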

import openai  # the API key must be configured, e.g. via the OPENAI_API_KEY environment variable

def embedding_function(texts, model="text-embedding-3-small"):
    if isinstance(texts, str):
        texts = [texts]
    texts = [t.replace("\n", " ") for t in texts]
    return [data.embedding for data in openai.embeddings.create(input=texts, model=model).data]

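Code walkthrough:

def embedding_function(texts, model="text-embedding-3-small"): defines a function named embedding_function that converts the input text into embedding vectors. The default model is "text-embedding-3-small".
if isinstance(texts, str): texts = [texts]: checks whether texts is a single string and, if so, wraps it in a list so that both cases are handled the same way.
texts = [t.replace("\n", " ") for t in texts]: replaces the newline character \n with a space in every string in texts, so that line breaks do not interfere with the embedding.
return [data.embedding for data in openai.embeddings.create(input=texts, model=model).data]: calls the OpenAI embeddings API on the processed texts and returns the resulting embedding vectors, one per input string.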
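The input handling in embedding_function can be tried without calling the API. The sketch below isolates the two preprocessing steps in a helper named normalize_texts, which is a hypothetical name introduced here for illustration; the sample strings are made up.

```python
def normalize_texts(texts):
    """Return a list of strings with newlines replaced by spaces.

    Mirrors the preprocessing inside embedding_function: a single string
    is wrapped in a list, then every newline becomes a space.
    """
    if isinstance(texts, str):
        texts = [texts]
    return [t.replace("\n", " ") for t in texts]


single = normalize_texts("Mars\nrover")
batch = normalize_texts(["Moon\nbase", "Apollo"])
print(single)  # ['Mars rover']
print(batch)   # ['Moon base', 'Apollo']
```

Because the result is always a list, the caller gets back one embedding per input string whether it passed a single prompt or a batch of chunks.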

Adding Data to the Vector Store

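add_to_vector_store = True
if add_to_vector_store:
    with open(source_text, 'r') as f:
        text = f.read()
    CHUNK_SIZE = 1000
    # Split the text into chunks of CHUNK_SIZE characters
    chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    # Embed each chunk and store it together with its source as metadata
    vector_store.add(text=chunked_text,
                     embedding_function=embedding_function,
                     embedding_data=chunked_text,
                     metadata=[{"source": source_text}]*len(chunked_text))

Code walkthrough:

add_to_vector_store = True: defines a flag, add_to_vector_store, that controls whether data is added to the vector store.
if add_to_vector_store: checks the flag and runs the code that follows only when it is True.
with open(source_text, 'r') as f: text = f.read(): opens the source_text file and reads its content.
chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]: splits the text into chunks of CHUNK_SIZE (1,000) characters and stores them in the chunked_text list.
vector_store.add(...): adds the chunks in chunked_text to the vector store.
- text=chunked_text: the text data to add.
- embedding_function=embedding_function: the function used to generate the embedding vectors.
- embedding_data=chunked_text: the data from which the embedding vectors are generated.
- metadata=[{"source": source_text}]*len(chunked_text): attaches metadata to every text chunk, recording where the data came from.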
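One detail of the metadata expression is worth seeing in isolation: multiplying a one-element list repeats a reference to the same dictionary rather than copying it. The sketch below uses made-up chunks to compare that form with a list comprehension. Whether the difference matters for vector_store.add is not stated in the source; it is harmless as long as the entries are only read.

```python
source_text = "llm.txt"
chunked_text = ["chunk one", "chunk two", "chunk three"]  # made-up chunks

# Form used in the article: every entry is the same dict object
shared = [{"source": source_text}] * len(chunked_text)

# Alternative: one independent dict per chunk
independent = [{"source": source_text} for _ in chunked_text]

shared[0]["note"] = "edited"
independent[0]["note"] = "edited"

print(shared[1])       # {'source': 'llm.txt', 'note': 'edited'}
print(independent[1])  # {'source': 'llm.txt'}
```

If per-chunk metadata such as a chunk index is added later, the comprehension form is the safer starting point.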

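Viewing the Vector Store Summary

print(vector_store.summary())

Code walkthrough:

print(vector_store.summary()): prints summary information about the vector store, including the structure and content of the dataset.

Loading and Visualizing the Vector Store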

import deeplake

ds = deeplake.load(vector_store_path)
ds_size = ds.size_approx()

Code walkthrough:

ds = deeplake.load(vector_store_path): loads the vector store dataset at the given path with Deeplake's load method.
ds_size = ds.size_approx(): estimates the size of the dataset in bytes.
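Since the estimate is in bytes, it is usually easier to read after conversion to megabytes or gigabytes. The following is a minimal sketch of that conversion; the helper name describe_size and the byte count are hypothetical, not values from the original dataset.

```python
def describe_size(num_bytes: int) -> str:
    """Format a byte count as megabytes and gigabytes (binary units)."""
    megabytes = num_bytes / 1024 ** 2
    gigabytes = num_bytes / 1024 ** 3
    return f"{megabytes:.5f} MB ({gigabytes:.5f} GB)"


ds_size = 55_158_000  # hypothetical value standing in for ds.size_approx()
size_report = describe_size(ds_size)
print(size_report)
```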

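With these steps, we have created the vector store and added data to it, laying the groundwork for the RAG application that follows.

Augmented Input Generation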

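Augmented generation is the third pipeline component, in which the retrieved data is used to augment the user input. This component processes the user input, queries the vector store, augments the input, and calls gpt-4o.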

Vector Store Selection and Loading

First, we select the vector store path and load the dataset:

vector_store_path = "hub://denis76/space_exploration_v1"
ds = deeplake.load(vector_store_path)
vector_store = VectorStore(path=vector_store_path)

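Code walkthrough:

vector_store_path specifies where the vector store is located.
deeplake.load(vector_store_path) loads the vector store dataset at that path.
VectorStore(path=vector_store_path) initializes a vector store object that is used for the search operations that follow.

Input and Query Retrieval

We need an embedding function to embed the user input: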

def embedding_function(texts, model="text-embedding-3-small"):
    if isinstance(texts, str):
        texts = [texts]
    texts = [t.replace("\n", " ") for t in texts]
    return [data.embedding for data in openai.embeddings.create(input=texts, model=model).data]

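Code walkthrough:

embedding_function defines the function that generates text embeddings.
It uses the same embedding model as the data embedding and storage component, so that the embedded input and the vector dataset are fully compatible.

Next, we can process a single user input or a batch of user inputs: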

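user_prompt = "Tell me about space exploration on the Moon and Mars"
# Pass the embedding function so that the prompt is embedded before the search
search_results = vector_store.search(embedding_data=user_prompt,
                                     embedding_function=embedding_function,
                                     k=1)

Code walkthrough:

user_prompt defines the query string entered by the user.
vector_store.search embeds the user input with embedding_function, searches the vector store, and returns the most relevant result (k=1).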

Augmenting the Input

The program appends the top-ranked retrieved text to the user input:

top_text = search_results['text'][0].strip()
augmented_input = user_prompt + " " + top_text
print(augmented_input)

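Code walkthrough:

top_text extracts the first text in the search results and strips surrounding whitespace.
augmented_input combines the user input with the top-ranked retrieved text to form the augmented input.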
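The same augmentation step can be sketched as a small function that also handles more than one retrieved text. The function name build_augmented_input, the top_n parameter, and the sample results dictionary are assumptions made for illustration. Only the 'text' key is taken from the code above.

```python
def build_augmented_input(user_prompt: str, search_results: dict, top_n: int = 1) -> str:
    """Append the first top_n retrieved texts to the user prompt."""
    retrieved = search_results.get("text", [])[:top_n]
    cleaned = [t.strip() for t in retrieved]
    return " ".join([user_prompt, *cleaned])


# Made-up search results shaped like the dictionary used above
fake_results = {
    "text": ["  Lunar missions mapped the surface.\n", "Rovers studied Martian soil."],
}
prompt = "Tell me about space exploration on the Moon and Mars"

augmented_one = build_augmented_input(prompt, fake_results)
augmented_two = build_augmented_input(prompt, fake_results, top_n=2)
print(augmented_one)
print(augmented_two)
```

With top_n=1 the result matches the article's single-text concatenation. A larger top_n gives the model more context at the cost of a longer prompt.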

Generating Content

We generate content with the gpt-4o model:

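from openai import OpenAI

client = OpenAI()     # reads the API key from the OPENAI_API_KEY environment variable
gpt_model = "gpt-4o"  # model named in this section

def call_gpt4_with_full_text(itext):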
    prompt = f"Please summarize or elaborate on the following content: {itext}"
    try:
        response = client.chat.completions.create(
            model=gpt_model,
            messages=[
                {"role": "system", "content": "You are a space expert."},
                {"role": "assistant", "content": "You can read the text."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return str(e)

gpt4_response = call_gpt4_with_full_text(augmented_input)
print(gpt_model, "Response:", gpt4_response)

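Code walkthrough:

call_gpt4_with_full_text defines a function that calls the gpt-4o model to generate a response to the augmented input.
It uses the client.chat.completions.create method to interact with the gpt-4o model and returns the generated response text. If the call fails, the error message is returned as a string instead.

Formatting the Output

Finally, we can format the response for display: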

print_formatted_response(gpt4_response)

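Code walkthrough:

The print_formatted_response function first checks whether the response contains Markdown features. If it does, the response is formatted accordingly; otherwise, it is printed with standard text wrapping.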
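The body of print_formatted_response is not shown above, so the following is only a stand-in sketch of the described behavior using the standard library. The names looks_like_markdown, format_response, and print_formatted_response_sketch, the regular expression, and the 80-character width are all assumptions. The author's actual function may detect and render Markdown differently.

```python
import re
import textwrap

# Heuristic Markdown markers: headings, bullet items, bold text, inline code
MARKDOWN_PATTERN = re.compile(
    r"(^#{1,6}\s)|(^\s*[-*]\s)|(\*\*.+?\*\*)|(`.+?`)", re.MULTILINE
)


def looks_like_markdown(text: str) -> bool:
    """Return True if the text contains any of the heuristic Markdown markers."""
    return bool(MARKDOWN_PATTERN.search(text))


def format_response(response: str, width: int = 80) -> str:
    """Wrap a response, keeping line structure when it looks like Markdown."""
    if looks_like_markdown(response):
        # Wrap each line on its own so headings and list items stay separate
        lines = [
            textwrap.fill(line, width=width) if line.strip() else ""
            for line in response.splitlines()
        ]
        return "\n".join(lines)
    # Plain text: reflow everything into one wrapped paragraph
    return textwrap.fill(response, width=width)


def print_formatted_response_sketch(response: str) -> None:
    print(format_response(response))


print_formatted_response_sketch("# Moon\n- Apollo landings\n- Artemis plans")
```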

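The following flowchart shows the complete augmented input generation flow, from selecting the vector store and processing the user input to building the augmented input and generating the final content.

  graph LR
    A[User input] --> B[Embedding function]
    B --> C[Vector store search]
    C --> D[Retrieved results]
    D --> E[Augmented input]
    E --> F[gpt-4o content generation]
    F --> G[Output formatting]

Diagram explanation: The user input is embedded, the vector store is searched, the retrieved results are combined with the input to form the augmented input, the gpt-4o model generates content from it, and the output is formatted.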

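Implementing Evaluation Metrics for Generative AI Output Quality

In the previous sections, we built a RAG (Retrieval-Augmented Generation) pipeline that produces output from a generative AI model. We now look at how to evaluate the quality of that output.

Evaluating Output Quality with Cosine Similarity

Cosine similarity is a common evaluation metric for measuring how similar two texts are once they have been represented as vectors. In this article, we implement cosine similarity to evaluate the similarity between the user input and the output of the generative AI model.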
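For reference, the standard definition of cosine similarity between two vectors is given below. The symbols A and B stand for the vector representations of the two texts and n for their dimension; these names are introduced here and do not appear in the original text.

```latex
\cos(\theta)
  = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

For vectors with non-negative components, such as TF-IDF vectors, the value lies between 0 (no shared terms) and 1 (same direction).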

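Defining the Cosine Similarity Function

First, we define a cosine similarity function that uses TfidfVectorizer and cosine_similarity to compute the similarity between two texts.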

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf[0:1], tfidf[1:2])
    return similarity[0][0]
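To show what the library call computes, here is a dependency-free sketch of cosine similarity over raw word counts. It is a deliberate simplification: it uses plain term frequencies and whitespace tokenization instead of TF-IDF weighting, so its scores will differ from those of calculate_cosine_similarity. The function name count_cosine_similarity is an assumption.

```python
import math
from collections import Counter


def count_cosine_similarity(text1: str, text2: str) -> float:
    """Cosine similarity between two texts using raw word counts."""
    counts1 = Counter(text1.lower().split())
    counts2 = Counter(text2.lower().split())
    # Dot product over the words that both texts share
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm1 * norm2)


score = count_cosine_similarity("moon mars", "moon rover")
print(f"{score:.3f}")  # 0.500
```

Two texts sharing one of their two words score 0.5, identical texts score 1, and texts with no words in common score 0.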

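Computing the Similarity Between the User Input and the GPT-4 Output

Next, we use the function to compute the similarity between the user input and the GPT-4 output.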

similarity_score = calculate_cosine_similarity(user_prompt, gpt4_response)
print(f"Cosine Similarity Score: {similarity_score:.3f}")

The output may look like this:

Cosine Similarity Score: 0.396

Computing the Similarity Between the Augmented Input and the GPT-4 Output

We can also compute the similarity between the augmented input and the GPT-4 output.

similarity_score = calculate_cosine_similarity(augmented_input, gpt4_response)
print(f"Cosine Similarity Score: {similarity_score:.3f}")

The output may look like this:

Cosine Similarity Score: 0.857

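Computing Similarity with Sentence Transformers

Besides TF-IDF-based cosine similarity, we can also use Sentence Transformers to compute the similarity between texts. Sentence Transformers capture deeper semantic relationships between texts, so they can be more effective in some cases.

First, we need to install the sentence-transformers library.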

!pip install sentence-transformers

Then we can use the all-MiniLM-L6-v2 model to compute the similarity between texts.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_cosine_similarity_with_embeddings(text1, text2):
    embeddings1 = model.encode(text1)
    embeddings2 = model.encode(text2)
    similarity = cosine_similarity([embeddings1], [embeddings2])
    return similarity[0][0]

similarity_score = calculate_cosine_similarity_with_embeddings(augmented_input, gpt4_response)
print(f"Cosine Similarity Score: {similarity_score:.3f}")

The output may look like this:

Cosine Similarity Score: 0.739

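Code walkthrough:

Cosine similarity is a metric that measures how similar two texts are.
TF-IDF is a text representation method that weights each term by how often it appears in a text and how rare it is across texts, which highlights the terms that characterize a text.
Sentence Transformers capture deeper semantic relationships between texts, so they can be more effective in some cases.
The choice of evaluation metric depends on the specific requirements and goals of the project.

Diagram explanation:

The diagram shows the flow of the cosine similarity calculation with both methods, TF-IDF-based and Sentence Transformer-based, including the input, processing, and output of each step.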

玄貓

A technology enthusiast focused on sharing insights on software development, cloud technologies, and AI applications.