Efficient Text Generation and Query Planning with LangChain

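LangChain provides the core components that modern AI applications need, giving developers more effective control over large language models (LLMs). This article covers how to generate structured query plans and how to build few-shot prompt templates, including fixed-length and length-based example selection strategies, so that an LLM can produce more accurate results from a limited number of examples. It also covers LangChain's role in data processing: loading, transforming, embedding, and storing data in different formats so that an LLM can make better use of an organization's varied internal data.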

Implementing Efficient Text Generation and Query Planning with LangChain

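In modern AI and natural language processing, LangChain offers a powerful toolkit for building and managing applications based on large language models (LLMs). This article looks at how to use LangChain to generate query plans, build few-shot prompt templates, and optimize the text generation process.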

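Generating a Query Plan

A query plan is a key step in task execution. With LangChain, a natural-language query can be converted into a structured query plan. Here is a concrete example: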

from typing import List

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate
from langchain_openai.chat_models import ChatOpenAI
from pydantic import BaseModel, Field

# Initialize the ChatOpenAI model (expects OPENAI_API_KEY in the environment)
model = ChatOpenAI()

# A single node of the query graph. The original listing does not show this
# class, so the fields below are an assumed, illustrative definition.
class Query(BaseModel):
    id: int
    question: str
    dependencies: List[int] = Field(
        default_factory=list,
        description="IDs of the queries that must finish before this one",
    )

# Define the structure of the query plan
class QueryPlan(BaseModel):
    query_graph: List[Query]

# Initialize the PydanticOutputParser
parser = PydanticOutputParser(pydantic_object=QueryPlan)

# Define the prompt template
template = """Generate a query plan. This will be used for task execution.
Answer the following query: {query}
Return the following query graph format:
{format_instructions}
"""

# Build the system message prompt template
system_message_prompt = SystemMessagePromptTemplate.from_template(template)

# Build the chat prompt template
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt])

# Build the LCEL chain
chain = chat_prompt | model | parser

# Run the query plan generation
result = chain.invoke({
    "query": '''I want to get the results from my database. Then I want to find
                out what the average age of my top 10 customers is. Once I have the average
                age, I want to send an email to John. Also I just generally want to send a
                welcome introduction email to Sarah, regardless of the other tasks.''',
    "format_instructions": parser.get_format_instructions()
})

# Print the query plan
print(result.query_graph)

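Code walkthrough:

Initialize the ChatOpenAI model: ChatOpenAI() creates an LLM instance that handles the natural-language query.
Define the query plan structure: QueryPlan is defined with Pydantic so that the output conforms to the expected format.
Build the prompt template: SystemMessagePromptTemplate and ChatPromptTemplate are combined into a structured prompt that instructs the LLM to produce a query plan.
Run the LCEL chain: the prompt template, the LLM, and the output parser are piped together into a single chain, which is then invoked to generate the query plan.
Output the result: the generated query plan is printed in structured form.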
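Once a query plan has been parsed, its dependency information can be used to decide the execution order. The following is a minimal sketch using only the Python standard library; PlannedQuery, its fields, and the sample plan are hypothetical stand-ins for whatever structure the model returns, not part of LangChain.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Dict, List


@dataclass
class PlannedQuery:
    """Hypothetical stand-in for one node of a parsed query graph."""
    id: int
    question: str
    dependencies: List[int] = field(default_factory=list)


def execution_order(queries: List[PlannedQuery]) -> List[int]:
    """Return query ids so that each query comes after all of its dependencies."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph: Dict[int, List[int]] = {q.id: q.dependencies for q in queries}
    return list(TopologicalSorter(graph).static_order())


# Hand-written plan that mirrors the example query (assumed, not model output).
plan = [
    PlannedQuery(1, "Get the results from the database"),
    PlannedQuery(2, "Compute the average age of the top 10 customers", [1]),
    PlannedQuery(3, "Send an email to John with the average age", [2]),
    PlannedQuery(4, "Send a welcome introduction email to Sarah"),
]

order = execution_order(plan)
print(order)
```

Query 4 has no dependencies, so it may appear anywhere in the order; only the relative order of queries 1, 2, and 3 is constrained.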

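Building Few-Shot Prompt Templates

In many cases we want to steer the LLM toward more accurate output by supplying a small number of examples. LangChain provides FewShotChatMessagePromptTemplate for this purpose.

Fixed-Length Few-Shot Examples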

from langchain_core.prompts import FewShotChatMessagePromptTemplate, ChatPromptTemplate

# Define the examples
examples = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the capital of Spain?", "answer": "Madrid"},
    # more examples...
]

# Define the prompt template used to format each example
example_prompt = ChatPromptTemplate.from_messages([
    ("human", "{question}"),
    ("ai", "{answer}"),
])

# Build the few-shot prompt template
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

# Print the formatted few-shot prompt
print(few_shot_prompt.format())

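Code walkthrough:

Define the examples: a list of question/answer pairs serves as the examples that guide the LLM's output.
Build the example prompt template: ChatPromptTemplate formats each example as a HumanMessage/AIMessage pair.
Build the few-shot prompt template: FewShotChatMessagePromptTemplate combines the examples into a complete few-shot prompt.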
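To make the resulting message layout concrete, here is a library-free sketch of how a fixed set of examples expands into alternating human/ai turns followed by the real question. The function name build_few_shot_messages and the tuple representation are assumptions made for illustration; this is not how LangChain represents messages internally.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (role, content)


def build_few_shot_messages(
    examples: List[Dict[str, str]], question: str
) -> List[Message]:
    """Expand every example into a human/ai pair, then append the real question."""
    messages: List[Message] = []
    for example in examples:
        messages.append(("human", example["question"]))
        messages.append(("ai", example["answer"]))
    messages.append(("human", question))
    return messages


examples = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the capital of Spain?", "answer": "Madrid"},
]

messages = build_few_shot_messages(examples, "What is the capital of Italy?")
for role, content in messages:
    print(f"{role}: {content}")
```

Because every example is always included, the prompt grows linearly with the number of examples, which is the limitation that length-based selection addresses.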

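Length-Based Example Selector

In some applications the length of the user's input affects how the prompt should be built. LangChain's LengthBasedExampleSelector chooses how many examples to include based on the length of the input.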

import tiktoken
from langchain_core.example_selectors import LengthBasedExampleSelector
from langchain_core.prompts import PromptTemplate

# Define the examples
examples = [
    {"input": "Gollum", "output": "<Story involving Gollum>"},
    {"input": "Gandalf", "output": "<Story involving Gandalf>"},
    {"input": "Bilbo", "output": "<Story involving Bilbo>"},
]

# Define the template used to format each example
story_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Character: {input}\nStory: {output}",
)

# Count the number of tokens in a string
def num_tokens_from_string(string: str) -> int:
    encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = len(encoding.encode(string))
    return num_tokens

# Build the length-based example selector
example_selector = LengthBasedExampleSelector(
    examples=examples,
    example_prompt=story_prompt,
    max_length=1000,  # length budget, measured in tokens by get_text_length
    get_text_length=num_tokens_from_string,
)

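Code walkthrough:

Define the examples and template: provide story-generation examples and the prompt template used to format them.
Count tokens: tiktoken counts the tokens in a string so that the prompt does not exceed the LLM's context limit.
Build the example selector: the selector chooses examples based on the length of the input, keeping the prompt within a reasonable token budget.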
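The selection idea can be sketched without any library: subtract the input length from the budget, then add examples in order until the next one would no longer fit. This is a simplified illustration of the general technique rather than LangChain's actual implementation; the function names and the word-count length measure are assumptions.

```python
import re
from typing import Callable, Dict, List


def count_words(text: str) -> int:
    """Rough length measure: number of pieces after splitting on spaces and newlines."""
    return len(re.split(r"\n| ", text))


def select_examples_by_length(
    examples: List[Dict[str, str]],
    template: str,
    user_input: str,
    max_length: int,
    get_text_length: Callable[[str], int] = count_words,
) -> List[Dict[str, str]]:
    """Greedily keep examples, in order, while the length budget allows."""
    remaining = max_length - get_text_length(user_input)
    selected: List[Dict[str, str]] = []
    for example in examples:
        remaining -= get_text_length(template.format(**example))
        if remaining < 0:
            break  # this example would exceed the budget
        selected.append(example)
    return selected


template = "Character: {input}\nStory: {output}"
examples = [
    {"input": "Gollum", "output": "A story about a ring"},
    {"input": "Gandalf", "output": "A story about a wizard"},
    {"input": "Bilbo", "output": "A story about a hobbit"},
]

# Each formatted example is 8 words long and the input "Frodo" is 1 word.
print(len(select_examples_by_length(examples, template, "Frodo", max_length=1000)))
print(len(select_examples_by_length(examples, template, "Frodo", max_length=20)))
```

A longer user input leaves a smaller budget, so fewer examples are selected; swapping count_words for a token counter changes the unit of the budget but not the logic.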

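Advanced Prompting Techniques with LangChain in Practice

The previous sections covered the basics of LangChain and its use in text generation. This section looks more closely at advanced prompting techniques with LangChain, particularly few-shot learning.

Building a Dynamic Few-Shot Prompt Template

In LangChain, a few-shot prompt template is a common technique for guiding an LLM to generate content from a set of given examples. The following example builds a dynamic few-shot prompt template: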

from langchain_core.example_selectors import LengthBasedExampleSelector
from langchain_core.messages import SystemMessage
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai.chat_models import ChatOpenAI

# Template used to format each example (shared by the selector and the prompt)
example_prompt = PromptTemplate(
    template="Input: {input}\nOutput: {output}",
    input_variables=["input", "output"],
)

# Define the example selector (placeholder examples; replace with real pairs)
example_selector = LengthBasedExampleSelector(
    examples=[
        {"input": "Input 1", "output": "Output 1"},
        {"input": "Input 2", "output": "Output 2"},
    ],
    example_prompt=example_prompt,
    max_length=1000,
)

# Define the dynamic prompt template
dynamic_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Generate a story based on the character/story pairs provided.",
    suffix="Character: {character}\nStory:",
    input_variables=["character"],
)

# Format the prompt with a new character
formatted_prompt = dynamic_prompt.format(character="Frodo")

# Create the chat model and send the formatted prompt
chat = ChatOpenAI()
response = chat.invoke([SystemMessage(content=formatted_prompt)])

print(response.content)

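Code walkthrough:

Example selector (LengthBasedExampleSelector): adjusts the number of examples according to the length of the input so that the prompt stays within the LLM's context window.
Dynamic prompt template (FewShotPromptTemplate): combines the example selector with the example template to produce a prompt tailored to the input character.
Chat model (ChatOpenAI): the formatted prompt is sent to the chat model, which generates the corresponding story.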
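It can help to see what the final prompt string looks like before it is sent to a model. The sketch below assembles a prefix, the formatted examples, and a suffix with plain string operations; render_few_shot_prompt and the blank-line separator are assumptions chosen for illustration, not a reproduction of FewShotPromptTemplate.

```python
from typing import Dict, List


def render_few_shot_prompt(
    prefix: str,
    examples: List[Dict[str, str]],
    example_template: str,
    suffix: str,
    separator: str = "\n\n",
    **inputs: str,
) -> str:
    """Join prefix, formatted examples, and the filled-in suffix into one prompt."""
    parts = [prefix]
    parts.extend(example_template.format(**example) for example in examples)
    parts.append(suffix.format(**inputs))
    return separator.join(part for part in parts if part)


rendered = render_few_shot_prompt(
    prefix="Generate a story based on the character/story pairs provided.",
    examples=[
        {"input": "Gollum", "output": "<Story involving Gollum>"},
        {"input": "Gandalf", "output": "<Story involving Gandalf>"},
    ],
    example_template="Character: {input}\nStory: {output}",
    suffix="Character: {character}\nStory:",
    character="Frodo",
)
print(rendered)
```

Ending the suffix with an open "Story:" line mirrors the example format, which nudges the model to continue in the same pattern.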

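Limitations of Few-Shot Learning and How to Mitigate Them

Few-shot learning is very useful in some scenarios, but it has limitations. The main issues are:

Overfitting: the LLM may rely too heavily on the provided examples and neglect the actual prompt.
Token limits: the LLM's token limit forces a trade-off between the number of examples and the length of the response.

The following strategies can help address these issues:

Adjust the prompt wording: rephrasing the prompt can lead to different responses.
Give explicit instructions: tell the model to disregard the examples once it understands the task, or to use them only as a guide to format.
Fine-tune the model: for complex tasks where few-shot learning performs poorly, consider fine-tuning the model.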
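The trade-off between examples and response length comes down to simple arithmetic. Here is a rough sketch; the function name and all of the numbers (context window, instruction size, tokens per example) are hypothetical and should be replaced with measurements for the model and prompt actually in use.

```python
def max_examples_that_fit(
    context_window: int,
    instruction_tokens: int,
    tokens_per_example: int,
    reserved_for_response: int,
) -> int:
    """Return how many equally sized examples fit after reserving response space."""
    if tokens_per_example <= 0:
        raise ValueError("tokens_per_example must be positive")
    available = context_window - instruction_tokens - reserved_for_response
    if available <= 0:
        return 0
    return available // tokens_per_example


# Hypothetical budget: 4096-token window, 200-token instructions, 150 tokens per example.
short_answers = max_examples_that_fit(4096, 200, 150, reserved_for_response=1000)
long_answers = max_examples_that_fit(4096, 200, 150, reserved_for_response=3000)
print(short_answers, long_answers)
```

Reserving more room for the response directly reduces the number of examples that can be included, and vice versa.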

Saving and Loading LLM Prompts

To make prompts easier to share, store, and version, LangChain supports saving prompts as JSON or YAML files and loading them back from those files.

from langchain_core.prompts import PromptTemplate, load_prompt

# Define the prompt template
prompt = PromptTemplate(
    template="Translate this sentence from English to Spanish.\nSentence: {sentence}\nTranslation:",
    input_variables=["sentence"],
)

# Save the prompt template
prompt.save("translation_prompt.json")

# Load the prompt template
loaded_prompt = load_prompt("translation_prompt.json")

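Code walkthrough:

Save the prompt template: the prompt template is written to a JSON file.
Load the prompt template: load_prompt reads the prompt template back from the JSON file.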
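The underlying idea, a prompt stored as data and formatted after loading, can be sketched with the standard library alone. The JSON field names and helper functions below are assumptions for illustration and are not necessarily the schema that LangChain writes to disk.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def save_prompt(path: Path, template: str, input_variables: List[str]) -> None:
    """Write the template and its variable names to a JSON file."""
    payload = {"template": template, "input_variables": input_variables}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_and_format(path: Path, **values: str) -> str:
    """Load the template from disk, check its variables, and fill it in."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    missing = set(payload["input_variables"]) - set(values)
    if missing:
        raise KeyError(f"Missing prompt variables: {sorted(missing)}")
    return payload["template"].format(**values)


with tempfile.TemporaryDirectory() as tmp_dir:
    prompt_path = Path(tmp_dir) / "translation_prompt.json"
    save_prompt(
        prompt_path,
        "Translate this sentence from English to Spanish.\n"
        "Sentence: {sentence}\nTranslation:",
        ["sentence"],
    )
    rendered = load_and_format(prompt_path, sentence="Good morning")

print(rendered)
```

Keeping prompts in files like this lets them be reviewed and versioned alongside the code that uses them.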

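Data Connection and Retrieval: A Key Step in Strengthening LLM Applications

In modern organizations, data keeps growing in variety and complexity, and connecting to and retrieving that data effectively is key to getting more out of LLM applications. This section looks at how tools such as LangChain can be used to load, transform, embed, and store data, and ultimately improve the decisions an LLM can support.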

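Data Diversity and Its Challenges

Enterprise data usually exists in many forms, including unstructured data (such as Google Docs, Slack conversations, and web pages) and structured data (such as data in SQL, NoSQL, or graph databases). Using an LLM effectively means overcoming the following challenges:

Handling unstructured data: converting scattered unstructured data into a queryable form.
Integrating structured data: extracting specific features from already indexed structured databases for the LLM to use.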

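Data Processing Workflow

Processing Unstructured Data

Unstructured data goes through the following steps:

Loading: document loaders (such as LangChain's Docx2txtLoader and PyPDFLoader) convert data in different formats (such as PDF and DOCX) into Document objects that can be processed.
Transforming: document transformers (such as CharacterTextSplitter) split long text into smaller chunks that fit the LLM's input requirements.
Embedding: a text embedding model converts the text into vectors so that similarity search is possible.
Storing: the embedded vectors are stored in a vector database (such as Faiss or Pinecone) for fast retrieval.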
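The four steps above can be sketched end to end in plain Python. This is a toy illustration only: the bag-of-words "embedding", the InMemoryVectorStore class, and the sample text are all assumptions, standing in for a real embedding model and vector database.

```python
import math
from collections import Counter
from typing import List, Tuple


def split_text(text: str, chunk_size: int) -> List[str]:
    """Transform: split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Embed: toy bag-of-words vector (a real system would call an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Store: keeps (vector, text) pairs in a list and ranks them by similarity."""

    def __init__(self) -> None:
        self._items: List[Tuple[Counter, str]] = []

    def add(self, chunks: List[str]) -> None:
        for chunk in chunks:
            self._items.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 1) -> List[str]:
        query_vector = embed(query)
        ranked = sorted(
            self._items, key=lambda item: cosine(query_vector, item[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]


# Load: a hard-coded string stands in for the output of a document loader.
text = (
    "the marketing plan targets young professionals "
    "our snack bars use organic oats "
    "revenue grew steadily last year"
)
chunks = split_text(text, chunk_size=6)
store = InMemoryVectorStore()
store.add(chunks)
top = store.search("organic snack bars", k=1)
print(top)
```

The retrieved chunk would then be placed into the LLM prompt as context, which is the role retrievers play in a full pipeline.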

Processing Structured Data

Structured data can be queried directly through a LangChain agent, which extracts the specific features needed and includes them in the LLM prompt.
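As a simplified illustration of the same idea without an agent, the sketch below runs a hand-written SQL query against an in-memory SQLite database and places the extracted value into a prompt. The customers table, its columns, and the sample rows are hypothetical.

```python
import sqlite3
from typing import Optional

# Hypothetical sample data: (name, age, total_spent)
ROWS = [
    ("Ana", 30, 500.0),
    ("Ben", 40, 900.0),
    ("Cy", 50, 100.0),
    ("Dee", 20, 700.0),
]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a small customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, age INTEGER, total_spent REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", ROWS)
    return conn


def average_age_of_top_customers(conn: sqlite3.Connection, limit: int) -> Optional[float]:
    """Average age of the `limit` customers with the highest total spend."""
    row = conn.execute(
        "SELECT AVG(age) FROM ("
        "  SELECT age FROM customers ORDER BY total_spent DESC LIMIT ?"
        ")",
        (limit,),
    ).fetchone()
    return row[0]


conn = build_demo_db()
avg_age = average_age_of_top_customers(conn, limit=2)

# Insert the extracted feature into the prompt that would be sent to the LLM.
prompt = (
    f"The average age of our top 2 customers is {avg_age:.1f}. "
    "Write a short email to John summarizing this."
)
print(prompt)
```

An agent automates the step written by hand here, namely deciding which query to run, but the extracted value reaches the prompt in the same way.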

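LangChain's Role in Data Processing

LangChain provides several tools that simplify the data processing workflow:

Document loaders: load data from many formats (such as PDF, DOCX, and CSV).
Document transformers: provide text splitting, format conversion, and similar functions.
Text embedding models: convert text into vector representations.
Vector stores: store and query embedding vectors efficiently.
Retrievers: support query and retrieval operations against vector stores.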

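Worked Example: Building an LLM Data Collection Pipeline for NutriFusion Foods

The following worked example shows how to build an LLM data collection pipeline for NutriFusion Foods that covers several data sources, including PDF, CSV, and DOCX files.

Step 1: Install the Required Python Packages

First, install the pdf2image, docx2txt, and pypdf packages, which are used to process the different file formats. The loading code in Step 2 additionally assumes that langchain, langchain-community, and tiktoken are already installed.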

pip install pdf2image docx2txt pypdf

Step 2: Load Data in Different Formats

Use LangChain's document loaders to load the PDF, CSV, and DOCX files.

import glob

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders.csv_loader import CSVLoader

# Collect every loaded document in one list
all_documents = []

# Load the PDF file and tag each page with a description
loader = PyPDFLoader("data/principles_of_marketing_book.pdf")
pages = loader.load_and_split()
for page in pages:
    page.metadata["description"] = "Principles of Marketing Book"
all_documents.extend(pages)

# Load the CSV files whose names contain "Marketing"
csv_files = glob.glob("data/*.csv")
csv_files = [f for f in csv_files if "Marketing" in f]
for csv_file in csv_files:
    loader = CSVLoader(file_path=csv_file)
    data = loader.load()
    all_documents.extend(data)

# Load the DOCX files and split their text into chunks
urls = [
    'https://storage.googleapis.com/oreilly-content/NutriFusion%20Foods%20Marketing%20Plan%202022.docx',
    'https://storage.googleapis.com/oreilly-content/NutriFusion%20Foods%20Marketing%20Plan%202023.docx',
]
# chunk_size and chunk_overlap are measured in tokens here
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=200, chunk_overlap=0)
docs = []
for url in urls:
    loader = Docx2txtLoader(url)
    pages = loader.load()
    chunks = text_splitter.split_documents(pages)
    for chunk in chunks:
        # Overwrite the source URL with a human-readable label
        chunk.metadata["source"] = "NutriFusion Foods Marketing Plan - 2022/2023"
    docs.extend(chunks)
all_documents.extend(docs)

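Code walkthrough:

Using PyPDFLoader: PyPDFLoader loads the PDF file and splits it into pages, and extra metadata (such as a description) can be added to each page.
Using CSVLoader: CSVLoader loads each matching CSV file in turn and converts its contents into LangChain Document objects.
Using Docx2txtLoader: for DOCX files stored remotely, Docx2txtLoader loads the file and CharacterTextSplitter splits the text into suitably sized chunks for later processing.
Managing metadata: each document or chunk is given appropriate metadata (such as its source) during processing, which adds context and helps the LLM interpret the content.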
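After loading, metadata makes it easy to check what ended up in the collection. Here is a small sketch under stated assumptions: Doc is a hypothetical stand-in for LangChain's Document class (same page_content and metadata attribute names), and the sample documents are invented.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document with text and metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def count_by_metadata(documents: List[Doc], key: str) -> Counter:
    """Count documents per value of a metadata key ('<missing>' if absent)."""
    return Counter(doc.metadata.get(key, "<missing>") for doc in documents)


def filter_by_metadata(documents: List[Doc], key: str, value: Any) -> List[Doc]:
    """Keep only the documents whose metadata[key] equals value."""
    return [doc for doc in documents if doc.metadata.get(key) == value]


all_documents = [
    Doc("Chapter 1 ...", {"description": "Principles of Marketing Book"}),
    Doc("Chapter 2 ...", {"description": "Principles of Marketing Book"}),
    Doc("Plan summary ...", {"source": "NutriFusion Foods Marketing Plan - 2022/2023"}),
]

counts = count_by_metadata(all_documents, "description")
book_pages = filter_by_metadata(
    all_documents, "description", "Principles of Marketing Book"
)
print(counts)
print(len(book_pages))
```

The same kind of filter can later restrict retrieval to a single source, for example only the marketing plans.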

玄貓

A technology enthusiast who shares insights on software development, cloud technologies, and AI applications.