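Space Exploration Data Processing and RAG Framework Construction

This article uses Wikipedia articles on space exploration to build a complete RAG framework that strengthens a language model's question-answering ability. First, it collects articles on topics such as the Apollo program and the Hubble Space Telescope, then cleans and preprocesses the data with Python. Next, it converts the text into vectors with an OpenAI embedding model and stores them in a Deep Lake vector store for efficient vector search. Finally, it combines the user query, the vector search results, and a GPT-4-class model to generate more accurate, better-informed responses.

Developing the Space Exploration Data Processing and Embedding Storage System

Space exploration is a multifaceted field that spans history, technology, missions, and programs. This article processes a set of Wikipedia articles and builds an embedding storage system around them to show how a complete RAG (Retrieval-Augmented Generation) framework is put together.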

Data Collection and Preparation

First, we need to collect Wikipedia articles related to space exploration. The selected articles are:

Space exploration
Apollo program
Hubble Space Telescope
Mars rover
International Space Station
SpaceX
Juno (spacecraft)
Voyager program
Galileo (spacecraft)
Kepler Space Telescope

These articles cover many aspects of space exploration, from historic programs to modern technology and missions.

Code implementation

import requests
from bs4 import BeautifulSoup
import re

# URLs of the Wikipedia articles to fetch
urls = [
    "https://en.wikipedia.org/wiki/Space_exploration",
    "https://en.wikipedia.org/wiki/Apollo_program",
    "https://en.wikipedia.org/wiki/Hubble_Space_Telescope",
    "https://en.wikipedia.org/wiki/Mars_rover",
    "https://en.wikipedia.org/wiki/International_Space_Station",
    "https://en.wikipedia.org/wiki/SpaceX",
    "https://en.wikipedia.org/wiki/Juno_(spacecraft)",
    "https://en.wikipedia.org/wiki/Voyager_program",
    "https://en.wikipedia.org/wiki/Galileo_(spacecraft)",
    "https://en.wikipedia.org/wiki/Kepler_Space_Telescope"
]

# Remove numeric citation markers such as [1] or [23]
def clean_text(content):
    content = re.sub(r'\[\d+\]', '', content)
    return content

# Fetch a page and return its cleaned article text
def fetch_and_clean(url):
    # Wikipedia may reject requests without a descriptive User-Agent
    # (placeholder value; replace it with your own project name and contact)
    headers = {'User-Agent': 'space-rag-tutorial/0.1'}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    content = soup.find('div', {'class': 'mw-parser-output'})
    if content is None:
        raise ValueError(f"No article body found at {url}")

    # Drop the references, bibliography, and external links sections,
    # together with everything that follows them
    for section_title in ['References', 'Bibliography', 'External_links']:
        # Match the heading by id regardless of tag name
        section = content.find(id=section_title)
        if section:
            for sib in section.parent.find_next_siblings():
                sib.decompose()
            section.parent.decompose()

    text = content.get_text(separator=' ', strip=True)
    text = clean_text(text)
    return text

# Write the cleaned text of every article to one file
with open('llm.txt', 'w', encoding='utf-8') as file:
    for url in urls:
        clean_article_text = fetch_and_clean(url)
        file.write(clean_article_text + '\n')
print("Content written to llm.txt")

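Code walkthrough:

First, we import the required libraries: requests sends the HTTP requests, BeautifulSoup parses the HTML, and re handles regular expressions.
Next, we define the list of Wikipedia article URLs to fetch.
The clean_text function removes numeric citation markers such as [1] and [2] from the text.
The fetch_and_clean function retrieves the content at a given URL, strips the HTML tags, and removes the reference sections.
Finally, the cleaned text is written to llm.txt, with a newline after each article.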
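To see what clean_text does and does not remove, here is a minimal sketch that applies the same regular expression to a made-up sentence. The sample string is an assumption for illustration only. Note that the pattern matches numeric markers only, so a marker such as [citation needed] is left in place.

```python
import re

def clean_text(content):
    # Same pattern as in the scraper: strip numeric citation markers like [1] or [23]
    return re.sub(r'\[\d+\]', '', content)

# Hypothetical sample sentence, not taken from the scraped articles
sample = "Apollo 11 landed in 1969.[1][23] It carried three astronauts.[citation needed]"
cleaned = clean_text(sample)
print(cleaned)
```

If non-numeric markers matter for your data, the pattern would need to be widened, at the risk of also removing legitimate bracketed text.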

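Data Embedding and Storage

Team #2 is responsible for embedding and storing the prepared data. It receives the data batch produced by Team #1 and processes it.

Code implementation

# Download the data batch provided by Team #1
# (this assumes the data sits on a server; here we read the local copy)

# Load the data batch
with open('llm.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Print the first 20 lines to verify the content
for line in lines[:20]:
    print(line.strip())

Code walkthrough:

The code loads the llm.txt file provided by Team #1.
It then prints up to the first 20 lines to check that the content looks correct.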
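Because the scraping step writes each article followed by a single newline, one line of llm.txt may hold an entire article, and printing it in full can flood the console. The following is a small sketch of a truncated preview. The helper name preview, its parameters, and the sample lines are all assumptions introduced here for illustration.

```python
def preview(lines, max_lines=20, width=80):
    # Return up to max_lines entries, each truncated to width characters
    out = []
    for line in lines[:max_lines]:
        line = line.strip()
        out.append(line if len(line) <= width else line[:width] + "...")
    return out

# Made-up stand-in for the contents of llm.txt
sample_lines = [
    "Space exploration is the use of astronomy and space technology. " * 5 + "\n",
    "Short line\n",
]
for entry in preview(sample_lines):
    print(entry)
```

In the actual pipeline, you would pass the lines list read from llm.txt instead of sample_lines.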

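Creating a Vector Store for Efficient Vector Search

The previous sections covered how to prepare and process the data. This section focuses on creating a vector store, either to vectorize the data from scratch or to add it to an existing store. The vector store is the key component behind efficient vector search.

Checking Whether the Vector Store Exists and Creating It If Needed

First, we define the path of the Activeloop vector store, whether or not the dataset already exists:

vector_store_path = "hub://denis76/space_exploration_v1"

Be sure to replace hub://denis76/space_exploration_v1 with your own organization and dataset name.

Next, we write code that tries to load the vector store and flags it for creation if it does not exist: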

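from deeplake.core.vectorstore.deeplake_vectorstore import VectorStore
import deeplake.util

try:
    # Try to load the vector store
    vector_store = VectorStore(path=vector_store_path)
    print("Vector store exists")
except FileNotFoundError:
    print("Vector store does not exist. It will be created.")
    # Add the code that creates the vector store here
    create_vector_store = True

Code walkthrough:

The try block attempts to load the specified vector store. If it succeeds, it prints "Vector store exists".
The except FileNotFoundError block catches the exception raised when the vector store is missing and prints a message saying so.
Setting create_vector_store = True records that a new vector store has to be created.

The output confirms that the vector store has been created:

Your Deep Lake dataset has been successfully created!
Vector store exists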

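Creating the Embedding Function

The embedding function converts the data chunks we created into vectors so that they can be searched by vector similarity. In this program, we use "text-embedding-3-small" to embed the documents.

import openai  # the API key is read from the OPENAI_API_KEY environment variable

def embedding_function(texts, model="text-embedding-3-small"):
    if isinstance(texts, str):
        texts = [texts]
    texts = [t.replace("\n", " ") for t in texts]
    return [data.embedding for data in openai.embeddings.create(input=texts, model=model).data]

Code walkthrough:

The embedding_function function converts a list of input texts into embedding vectors.
If the input texts is a single string, it is wrapped in a list.
Newline characters (\n) in the text are replaced with spaces before embedding.
The embedding vectors are generated with OpenAI's embeddings.create method.

OpenAI's text-embedding-3-small model produces 1,536-dimensional embeddings by default, which matches the shape of the embedding tensor shown later. This size balances the level of detail captured against compute cost and storage space. Before running the code, be sure to check the model page and pricing information.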
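To make "search by vector similarity" concrete, here is a toy sketch that ranks two made-up vectors against a query with cosine similarity. Cosine similarity is one common metric; this sketch does not reproduce Deep Lake's internal search, and the three-dimensional vectors are assumptions chosen only to keep the arithmetic readable.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made up; real ones have 1,536 dimensions)
query = [1.0, 0.0, 1.0]
chunks = {
    "moon landing": [0.9, 0.1, 0.8],
    "cooking pasta": [0.0, 1.0, 0.1],
}

# Rank chunk names from most to least similar to the query
ranked = sorted(
    chunks,
    key=lambda name: cosine_similarity(query, chunks[name]),
    reverse=True,
)
print(ranked)
```

The chunk whose vector points in nearly the same direction as the query is ranked first, which is the intuition behind retrieving the most relevant text for a prompt.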

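Adding Data to the Vector Store

Set the flag that controls whether data is added to True:

add_to_vector_store = True
source_text = "llm.txt"  # file produced in the data preparation step

if add_to_vector_store:
    with open(source_text, 'r', encoding='utf-8') as f:
        text = f.read()
    # Split the text into fixed-size chunks of 1,000 characters
    CHUNK_SIZE = 1000
    chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    vector_store.add(
        text=chunked_text,
        embedding_function=embedding_function,
        embedding_data=chunked_text,
        metadata=[{"source": source_text}] * len(chunked_text)
    )

Code walkthrough:

If add_to_vector_store is True, the code reads the source text file and splits it into chunks.
The vector_store.add method then adds the chunked text, the embedding function, the data to embed, and the metadata to the vector store.
The metadata records where the data came from, in this case the llm.txt file.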
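The fixed-size chunking used above can be tried on a short string to see how it behaves. The following is a minimal sketch; the helper name chunk_text and the sample sentence are assumptions for illustration, and the chunk size is reduced to 20 characters so the result is easy to read.

```python
import math

def chunk_text(text, chunk_size):
    # Slice the text into consecutive, non-overlapping windows of chunk_size characters
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

sample = "The Apollo program landed the first humans on the Moon."
chunks = chunk_text(sample, chunk_size=20)
for chunk in chunks:
    print(repr(chunk))

# The number of chunks is the text length divided by the chunk size, rounded up
print(len(chunks), math.ceil(len(sample) / 20))
```

Two properties are worth noting: the last chunk is usually shorter than the rest, and the cuts ignore word and sentence boundaries, so a chunk can start or end in the middle of a word.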

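The source text, source_text = "llm.txt", has now been embedded and stored. The dataset structure summary that is displayed shows that the dataset has been populated: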

Creating 839 embeddings in 2 batches of size 500: 100%|█████████|
Dataset(path='hub://denis76/space_exploration_v1', tensors=['text', 'metadata', 'embedding', 'id'])
tensor      htype       shape       dtype       compression
-------     -------     -------     -------     -------
text        text        (839, 1)     str         None
metadata    json        (839, 1)     str         None
embedding   embedding   (839, 1536)  float32     None
id          text        (839, 1)     str         None

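Output walkthrough:

The dataset contains four tensors: embedding, id, metadata, and text.
embedding: each data chunk is embedded into one vector.
id: a unique string identifier for each chunk.
metadata: information about where the data came from, in this case the llm.txt file.
text: the content of each text chunk in the dataset.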
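The shape of the embedding tensor also gives a rough idea of its storage footprint. The sketch below works through the arithmetic using the numbers from the summary above. It counts raw, uncompressed float32 values only and ignores the other tensors and any storage overhead, so treat it as a lower-bound estimate.

```python
num_chunks = 839        # rows reported in the dataset summary
dimensions = 1536       # width of each embedding vector
bytes_per_float32 = 4   # a float32 value occupies 4 bytes

embedding_bytes = num_chunks * dimensions * bytes_per_float32
print(f"{embedding_bytes:,} bytes, about {embedding_bytes / 1e6:.2f} MB")
```

For this dataset the embeddings come to roughly 5 MB, which shows why embedding dimension and chunk count are the main drivers of vector store size.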

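Viewing the Vector Store Summary

We can display the same summary again at any time with the following code:

# Print the vector store summary
print(vector_store.summary())

Code walkthrough:

The vector_store.summary() method provides a detailed summary of the vector store, including its tensor information.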

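Viewing Vector Store Information

Activeloop's API reference documentation covers everything needed to manage datasets. We can visualize our dataset at https://app.activeloop.ai/datasets/mydatasets/.

We can also load our dataset with a single line of code:

ds = deeplake.load(vector_store_path)

The output provides a path for visualizing, querying, and exploring the dataset online: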

This dataset can be visualized in Jupyter Notebook by ds.visualize()
hub://denis76/space_exploration_v1 loaded successfully.
You can also access your dataset directly on Activeloop by signing in and going to your datasets.

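Figure: querying and exploring a Deep Lake dataset on the Activeloop platform.

Among the many features available, we can display the estimated size of the dataset:

# Estimate the size of the dataset (in bytes)
ds_size = ds.size_approx()

Once we have the size, we can convert it to megabytes and gigabytes.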
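The conversion itself is plain arithmetic. Here is a minimal sketch, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes). The helper name bytes_to_mb_gb and the example size are assumptions introduced here; in the notebook you would pass ds_size instead.

```python
def bytes_to_mb_gb(size_in_bytes):
    # Decimal units: 1 MB = 10**6 bytes, 1 GB = 10**9 bytes
    return size_in_bytes / 1_000_000, size_in_bytes / 1_000_000_000

# Example with a made-up size in bytes
mb, gb = bytes_to_mb_gb(55_123_456)
print(f"{mb:.2f} MB, {gb:.4f} GB")
```

If you prefer binary units, divide by 1024**2 and 1024**3 instead; the results differ by a few percent.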

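Augmented Input Generation Explained

Principles and Practice of Augmented Input Generation

Augmented input generation is the third pipeline component. It processes the user input, queries the vector store, augments the input, and calls an OpenAI GPT-4-class model (gpt-4o in the code below). This section covers the technical details and the implementation of this component.

Loading the Data and Initializing the Vector Store

First, we load the vector store created earlier:

import deeplake
from deeplake.core.vectorstore.deeplake_vectorstore import VectorStore

vector_store_path = "hub://denis76/space_exploration_v1"
ds = deeplake.load(vector_store_path)
vector_store = VectorStore(path=vector_store_path)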

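Input Processing and Query Retrieval

Next, we process the user input and retrieve the relevant information:

import openai  # the API key is read from the OPENAI_API_KEY environment variable

def embedding_function(texts, model="text-embedding-3-small"):
    if isinstance(texts, str):
        texts = [texts]
    texts = [t.replace("\n", " ") for t in texts]
    return [data.embedding for data in openai.embeddings.create(input=texts, model=model).data]

user_prompt = "Tell me about space exploration on the Moon and Mars"
# Pass the embedding function so the query string is embedded with the same model as the stored chunks
search_results = vector_store.search(
    embedding_data=user_prompt,
    embedding_function=embedding_function,
    k=5
)

Generating the Augmented Input

After retrieving the relevant information, we combine it with the user input to produce the augmented input:

top_text = search_results['text'][0].strip()
augmented_input = user_prompt + " " + top_text
print(augmented_input)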
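The code above keeps only the top-ranked chunk even though the search asks for five results. As a variation, the sketch below joins several retrieved chunks into one context block. This is an alternative under stated assumptions, not the approach used in the article: the helper name build_augmented_input, the max_chunks parameter, and the stand-in results dictionary are all hypothetical, and the dictionary only mimics the 'text' key used above.

```python
def build_augmented_input(user_prompt, search_results, max_chunks=3):
    # Take up to max_chunks retrieved passages, in the order the search returned them
    passages = [text.strip() for text in search_results["text"][:max_chunks]]
    context = "\n\n".join(passages)
    return f"{user_prompt}\n\nContext:\n{context}"

# Stand-in for the dictionary returned by the search step (made-up contents)
fake_results = {
    "text": [
        " Apollo 11 landed on the Moon. ",
        "Curiosity explores Mars.",
        "Hubble orbits Earth.",
        "Juno studies Jupiter.",
    ]
}
augmented = build_augmented_input("Tell me about the Moon and Mars", fake_results)
print(augmented)
```

Including more chunks gives the model more context at the cost of a longer prompt, so max_chunks is a trade-off between coverage and token usage.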

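Calling the GPT-4 Model to Generate a Response

Finally, we call the model with the augmented input to generate the final response:

from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def call_gpt4_with_full_text(itext):
    prompt = f"Please summarize or elaborate on the following content: {itext}"
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a space expert."},
                {"role": "assistant", "content": "You can read the text."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1  # low temperature for more deterministic output
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        # On failure, the error message is returned in place of a response
        return str(e)

gpt4_response = call_gpt4_with_full_text(augmented_input)
print("GPT-4 Response:", gpt4_response)

#### Code walkthrough:

This code shows how augmented input generation can improve the quality of a GPT model's responses. First, we load the previously created vector store, then embed the user input and retrieve the relevant information. Next, we combine the retrieved information with the original input to produce the augmented input. Finally, we call the model with this augmented input to generate the final response. Grounding the prompt in retrieved text in this way can improve the accuracy and relevance of the model's output.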

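Key Technical Points

Use of a vector store: this example uses Deep Lake as the vector store, which makes efficient similarity search possible.
Consistency of the embedding function: the same embedding model is used for the stored data and for the user input, so the two sets of vectors are comparable.
Handling of query results: the system selects the most relevant search result and combines it with the original input to produce the augmented input.
Calling the GPT-4 model: the prompt and the role assignments are designed to steer the model toward higher-quality output.

Performance Optimization Suggestions

Optimize the embedding model: choosing a more efficient embedding model can improve overall performance.
Tune the query parameters: adjusting the value of k balances the relevance of the search results against search cost.
Caching: a cache reduces the cost of repeated queries.
Asynchronous processing: consider asynchronous processing to increase the system's throughput under concurrent requests.
Multimodal input support: a future version could support multimodal input, such as mixed image and text queries.
More advanced query mechanisms: more sophisticated retrieval, such as context-aware queries, could make the system smarter.
Online learning: integrating a feedback loop would let the system keep improving based on user feedback.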
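The caching suggestion above can be sketched as a small wrapper that stores embeddings by input text, so that a repeated query does not trigger another embedding call. Everything here is an assumption for illustration: make_cached_embedder is a hypothetical helper, and fake_embed is a stand-in that records its calls instead of contacting an embedding API.

```python
def make_cached_embedder(embed_fn):
    cache = {}  # maps text -> embedding vector

    def cached_embed(text):
        # Only call the underlying (possibly paid) embedding function on a cache miss
        if text not in cache:
            cache[text] = embed_fn(text)
        return cache[text]

    return cached_embed

calls = []

def fake_embed(text):
    # Hypothetical stand-in for a real embedding call; records each invocation
    calls.append(text)
    return [float(len(text))]

embed = make_cached_embedder(fake_embed)
embed("Moon and Mars")
embed("Moon and Mars")  # served from the cache, no second call
print(len(calls))
```

An in-memory dictionary like this grows without bound and is lost when the process exits, so a production system would likely need an eviction policy or a persistent store.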

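Appendix: Complete Code Listing

import deeplake
from deeplake.core.vectorstore.deeplake_vectorstore import VectorStore
import openai
from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

# Load the vector store
vector_store_path = "hub://denis76/space_exploration_v1"
ds = deeplake.load(vector_store_path)
vector_store = VectorStore(path=vector_store_path)

# Define the embedding function (same as in the embedding section)
def embedding_function(texts, model="text-embedding-3-small"):
    if isinstance(texts, str):
        texts = [texts]
    texts = [t.replace("\n", " ") for t in texts]
    return [data.embedding for data in openai.embeddings.create(input=texts, model=model).data]

# Process the user input and retrieve the closest chunks
user_prompt = "Tell me about space exploration on the Moon and Mars"
search_results = vector_store.search(
    embedding_data=user_prompt,
    embedding_function=embedding_function,
    k=5
)

# Build the augmented input
top_text = search_results['text'][0].strip()
augmented_input = user_prompt + " " + top_text

# Call the GPT-4 model (same as in the generation section)
def call_gpt4_with_full_text(itext):
    prompt = f"Please summarize or elaborate on the following content: {itext}"
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a space expert."},
                {"role": "assistant", "content": "You can read the text."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return str(e)

gpt4_response = call_gpt4_with_full_text(augmented_input)
print("GPT-4 Response:", gpt4_response)

#### Code walkthrough:

This complete listing shows the core implementation of augmented input generation. By loading the vector store, defining the embedding function, processing the user input, building the augmented input, and calling the GPT-4 model, we obtain a complete augmented input generation pipeline. The pipeline can be adjusted and optimized to suit different application scenarios.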

玄貓

A technology enthusiast who shares insights and experience on software development, cloud technology, and AI applications.