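Trie Compression and Text Processing Techniques Explained

A trie can compress text by exploiting shared prefixes: each word is stored only up to the point where it becomes distinguishable from the others. Python's textwrap module takes care of line wrapping at a given width, and a few extra lines of code turn that into full justification. An extractive summarizer uses word-frequency counts to pull out a document's key sentences. N-gram models are applied to part-of-speech tagging, where the surrounding context improves tagging accuracy. Word-network visualization draws the semantic relations between words as a graph, which makes the structure of a vocabulary easier to see. Combined, these techniques form a text processing pipeline that can be applied to information retrieval, natural language understanding, and related areas.

Compressing Text with a Trie

Task

Use a trie to process text: locate each word's uniqueness point (the shortest prefix that distinguishes it from every other word in the vocabulary) and discard the rest of the word.

Implementation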

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False
        self.count = 0  # number of inserted words whose path passes through this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Insert a word; insert each distinct word only once so the counts stay accurate."""
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
            node.count += 1
        node.is_end_of_word = True


def compress_text(text, trie):
    """Truncate each word at its uniqueness point in the trie."""
    compressed = []
    for word in text.split():
        node = trie.root
        prefix = ""
        for char in word:
            if char not in node.children:
                # The word is not in the trie: keep it whole rather than lose text.
                prefix = word
                break
            prefix += char
            node = node.children[char]
            if node.count == 1:
                # Only one word passes through this node: the uniqueness point.
                break
        compressed.append(prefix)
    return ' '.join(compressed)


# Example
trie = Trie()
trie.insert("apple")
trie.insert("application")
print(compress_text("apple application", trie))  # apple appli

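Code walkthrough:

Building the trie: inserting the vocabulary words builds the trie structure.
compress_text function: iterates over the text and looks up the unique prefix of each word.

Technical analysis

Tries are well suited to prefix matching and prefix-based compression, because words that share a prefix share a single path from the root.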
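
To make the prefix-matching point concrete, here is a minimal sketch of prefix lookup. It assumes a nested-dict trie rather than the TrieNode class used earlier; the names build_trie, words_with_prefix, and the END marker are illustrative choices, not part of any library.

```python
END = "$"  # hypothetical end-of-word marker; assumes "$" never occurs in a word


def build_trie(words):
    """Build a nested-dict trie: each key is a character, each value a child dict."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[END] = True
    return root


def words_with_prefix(trie, prefix):
    """Return every stored word that starts with prefix, in sorted order."""
    node = trie
    for char in prefix:
        if char not in node:
            return []
        node = node[char]

    results = []

    def collect(current, path):
        # Walk the subtree below the prefix and record each complete word.
        for key, child in current.items():
            if key == END:
                results.append(prefix + path)
            else:
                collect(child, path + key)

    collect(node, "")
    return sorted(results)


trie = build_trie(["apple", "application", "apply", "banana"])
print(words_with_prefix(trie, "appl"))
print(words_with_prefix(trie, "c"))
```

The lookup cost depends on the length of the prefix and the size of the matching subtree, not on the total number of stored words.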

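Line Wrapping and Justification

Task

Use Python's textwrap module to wrap text into lines, then insert extra spaces so that each line fills the full width.

Implementation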

import textwrap

def justify_text(text, width):
    """Wrap text to the given width and fully justify each line."""
    lines = textwrap.wrap(text, width)
    justified_lines = []
    for line in lines:
        words = line.split()
        if len(words) == 1:
            justified_lines.append(line + ' ' * (width - len(line)))
        else:
            total_chars = sum(len(word) for word in words)
            total_spaces = width - total_chars
            gaps = len(words) - 1
            base_spaces = total_spaces // gaps
            extra_spaces = total_spaces % gaps
            justified_line = ''
            for i, word in enumerate(words):
                justified_line += word
                if i < gaps:
                    justified_line += ' ' * (base_spaces + (1 if i < extra_spaces else 0))
            justified_lines.append(justified_line)
    return '\n'.join(justified_lines)

# Example
text = "This is a sample text that needs to be justified."
print(justify_text(text, 20))

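Code walkthrough:

Wrapping: textwrap.wrap splits the text into lines no longer than the given width.
Justification: the remaining width of each line is computed and distributed as spaces between its words.

Technical analysis

textwrap takes care of the line-breaking logic.
Computing the space distribution per line produces fully justified text; when the spaces do not divide evenly, the leftmost gaps receive one extra space each.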
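
Typeset text usually leaves the final line of a paragraph left-aligned instead of stretching it. The following is a minimal sketch of that variant, assuming the same space-distribution rule as justify_text; the function name justify_paragraph is an illustrative choice.

```python
import textwrap


def justify_paragraph(text, width):
    """Justify every line except the last, which stays left-aligned."""
    lines = textwrap.wrap(text, width)
    result = []
    for index, line in enumerate(lines):
        words = line.split()
        is_last = index == len(lines) - 1
        if is_last or len(words) == 1:
            # Nothing to stretch: keep the line as wrapped.
            result.append(line)
            continue
        gaps = len(words) - 1
        spaces = width - sum(len(word) for word in words)
        base, extra = divmod(spaces, gaps)
        # The first `extra` gaps each receive one additional space.
        pieces = []
        for i, word in enumerate(words[:-1]):
            pieces.append(word + " " * (base + (1 if i < extra else 0)))
        pieces.append(words[-1])
        result.append("".join(pieces))
    return "\n".join(result)


print(justify_paragraph("This is a sample text that needs to be justified.", 20))
```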

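Extractive Summarization Tool

Task

Build a simple extractive summarizer that prints the sentences of a document with the highest total word frequency.

Implementation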

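from collections import Counter
import re

def extractive_summarization(text, n_sentences):
    """Return the n sentences with the highest total word frequency, in document order."""
    # Split after sentence-final punctuation so each sentence keeps its terminator.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    # Count case-insensitive word frequencies over the whole text.
    tokenized = [re.findall(r'\w+', s.lower()) for s in sentences]
    word_freq = Counter(word for words in tokenized for word in words)

    # A sentence's score is the sum of the frequencies of its words.
    scores = [sum(word_freq[word] for word in words) for words in tokenized]

    # Pick the top-scoring sentences, then restore their original order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return ' '.join(sentences[i] for i in sorted(top))

# Example
text = "This is a sample text. It contains multiple sentences. Some sentences are more important than others."
print(extractive_summarization(text, 2))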

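Code walkthrough:

Sentence splitting: the text is split into a list of sentences.
Word frequencies: the frequency of each word is counted.
Sentence scoring: each sentence is scored by summing the frequencies of its words.
Selection: the sentences are sorted by score and the top n are kept.

Technical analysis

Sentences are split with a regular expression.
Sentences are scored from the word-frequency counts, so longer sentences tend to score higher.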
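
Because the score is a sum, a long sentence can win on length alone. The sketch below makes the scores visible and adds an optional length normalization. The normalize flag and the function name score_sentences are assumptions introduced for illustration, and the sample text is made up.

```python
from collections import Counter
import re


def score_sentences(text, normalize=False):
    """Return (sentence, score) pairs; the score is the summed word frequency."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(word for words in tokenized for word in words)
    scored = []
    for sentence, words in zip(sentences, tokenized):
        score = sum(freq[word] for word in words)
        if normalize and words:
            # Dividing by length stops long sentences from winning by size alone.
            score /= len(words)
        scored.append((sentence, score))
    return scored


text = "Cats sleep. Cats and dogs and birds and fish all sleep at night. Dogs bark."
for sentence, score in score_sentences(text):
    print(score, sentence)
for sentence, score in score_sentences(text, normalize=True):
    print(round(score, 2), sentence)
```

On this sample the raw score favors the long middle sentence, while the normalized score favors "Cats sleep".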

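Implementing an NgramTagger Class

Task

Develop a custom NgramTagger class that wraps a method for collecting the vocabulary of tagged training and test data.

Implementation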

from nltk.tag import NgramTagger

class CustomNgramTagger(NgramTagger):
    def __init__(self, n, train, backoff=None):
        super().__init__(n, train, backoff=backoff)

    def context_vocab(self, tagged_data):
        """Collect the set of words that occur in the tagged sentences."""
        # Simplified example; adapt the vocabulary logic to the actual requirements.
        vocab = set()
        for sentence in tagged_data:
            for word, tag in sentence:
                vocab.add(word)
        return vocab

# Example
train_data = [[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]]
tagger = CustomNgramTagger(2, train_data)
print(sorted(tagger.context_vocab(train_data)))  # sorted for a stable print order

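Code walkthrough:

Inheriting from NgramTagger: the custom class derives from NLTK's NgramTagger.
context_vocab method: computes and returns the vocabulary of the tagged data.

Technical analysis

The existing NgramTagger class is inherited and extended to meet a specific need.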
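
To show what an n-gram tagger does with context, here is a dependency-free sketch of a bigram tagger with backoff. It is a simplified stand-in, not NLTK's implementation: the context is assumed to be the previous tag plus the current word, and the function names, the default tag, and the toy training data are all made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_tagger(tagged_sentences):
    """Learn the most frequent tag for each (previous tag, word) context and each word."""
    bigram_counts = defaultdict(Counter)
    unigram_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        previous_tag = None  # None marks the start of a sentence
        for word, tag in sentence:
            bigram_counts[(previous_tag, word)][tag] += 1
            unigram_counts[word][tag] += 1
            previous_tag = tag
    bigram_model = {ctx: c.most_common(1)[0][0] for ctx, c in bigram_counts.items()}
    unigram_model = {w: c.most_common(1)[0][0] for w, c in unigram_counts.items()}
    return bigram_model, unigram_model


def tag_sentence(words, bigram_model, unigram_model, default_tag="NN"):
    """Tag words left to right, backing off from bigram to unigram to a default tag."""
    tagged = []
    previous_tag = None
    for word in words:
        tag = bigram_model.get((previous_tag, word))
        if tag is None:
            tag = unigram_model.get(word, default_tag)
        tagged.append((word, tag))
        previous_tag = tag
    return tagged


train = [
    [("the", "DT"), ("run", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("run", "VBP"), ("daily", "RB")],
    [("we", "PRP"), ("run", "VBP")],
]
bigram_model, unigram_model = train_bigram_tagger(train)
print(tag_sentence(["the", "run"], bigram_model, unigram_model))
print(tag_sentence(["dogs", "run"], bigram_model, unigram_model))
```

After a determiner, "run" is tagged as a noun. After the unseen word "dogs", the bigram context is missing, so the tagger falls back to the unigram model, which prefers the verb tag.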

Visualizing a Word Network

Task

Draw a network graph with NetworkX based on the semantic orientation of adjectives.

Implementation

import networkx as nx
import matplotlib.pyplot as plt

def visualize_adjective_network(adjectives, orientations):
    """Draw a graph of adjectives linked by shared semantic orientation."""
    G = nx.Graph()
    for adj in adjectives:
        G.add_node(adj)
    
    # Simplified example: assume an edge between adjectives with the same orientation.
    for i in range(len(adjectives)):
        for j in range(i+1, len(adjectives)):
            if orientations[i] == orientations[j]:
                G.add_edge(adjectives[i], adjectives[j])
    
    nx.draw(G, with_labels=True)
    plt.show()

# Example
adjectives = ["good", "bad", "excellent", "poor"]
orientations = ["positive", "negative", "positive", "negative"]
visualize_adjective_network(adjectives, orientations)

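Code walkthrough:

Building the graph: a NetworkX graph is created and nodes and edges are added.
Drawing the network: the graph is drawn so that adjectives with the same semantic orientation are connected.

Technical analysis

NetworkX builds the graph and Matplotlib renders it.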

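Finding Statistically Improbable Phrases

Task

Design an algorithm that finds statistically improbable phrases in a collection of documents.

Implementation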

from collections import Counter
import re

def statistically_improbable_phrases(documents, n):
    """Find two-word phrases that occur far more often than chance predicts."""
    # Simplified example; a real implementation needs more elaborate logic.
    tokenized = [re.findall(r'\w+', doc.lower()) for doc in documents]

    # Observed frequency of each word and of each adjacent word pair.
    word_freq = Counter()
    phrase_freq = Counter()
    for words in tokenized:
        word_freq.update(words)
        phrase_freq.update(zip(words, words[1:]))
    total_words = sum(word_freq.values())

    improbable_phrases = []
    for (first, second), freq in phrase_freq.items():
        # Background frequency: the count expected if the two words were independent.
        background_count = word_freq[first] * word_freq[second] / total_words
        if freq > background_count * 2:  # simplified threshold
            improbable_phrases.append((f"{first} {second}", freq / background_count))

    # Each result is (phrase, ratio of observed to background frequency).
    return sorted(improbable_phrases, key=lambda x: x[1], reverse=True)[:n]

# Example
documents = ["This is a sample document.", "Another document for testing."]
print(statistically_improbable_phrases(documents, 2))

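Code walkthrough:

Phrase extraction: two-word phrases are extracted from the documents.
Frequency counting: the observed phrase frequency and a background frequency are computed.
Identification: the two frequencies are compared to find the statistically improbable phrases.

Technical analysis

A regular expression drives the phrase extraction.
Observed and background frequencies are compared to identify improbable phrases.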
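
One common way to define the background is a separate reference corpus. The sketch below ranks phrases by how much more frequent they are in the target documents than in that reference. The function names, the smoothing parameter, and both toy corpora are assumptions made for illustration.

```python
from collections import Counter
import re


def bigram_counts(documents):
    """Count adjacent word pairs, ignoring case and punctuation."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"\w+", doc.lower())
        counts.update(zip(words, words[1:]))
    return counts


def improbable_phrases(target_docs, background_docs, n, smoothing=1.0):
    """Rank target bigrams by relative frequency versus a background corpus."""
    target = bigram_counts(target_docs)
    background = bigram_counts(background_docs)
    target_total = sum(target.values()) or 1
    # Additive smoothing so phrases unseen in the background do not divide by zero.
    background_total = sum(background.values()) + smoothing * len(target)
    scored = []
    for phrase, count in target.items():
        target_rate = count / target_total
        background_rate = (background[phrase] + smoothing) / background_total
        scored.append((" ".join(phrase), target_rate / background_rate))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]


target = ["The quantum widget sat on the shelf. The quantum widget sat on the desk."]
background = ["The cat sat on the mat.", "The dog sat on the log.", "The widget broke."]
for phrase, ratio in improbable_phrases(target, background, 2):
    print(f"{phrase}: {ratio:.2f}")
```

Phrases such as "sat on" that are also common in the background score close to 1, while phrases specific to the target documents score well above it.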

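Word Squares

Task

Implement a brute-force algorithm to find word squares: n × n grids of letters in which, for every i, the i-th row and the i-th column spell the same word.

Implementation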

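import itertools

def word_squares(words, n):
    """Brute-force search for n x n word squares."""
    # Simplified example; a practical implementation needs a more efficient algorithm.
    # Only words of exactly n letters can fill an n x n grid.
    candidates = [word for word in words if len(word) == n]
    for square in itertools.product(candidates, repeat=n):
        # Row i must equal column i, so the grid is symmetric about its diagonal.
        if all(square[i][j] == square[j][i] for i in range(n) for j in range(i + 1, n)):
            yield square

# Example
words = ["bit", "ice", "ten", "cat"]
for square in word_squares(words, 3):
    for word in square:
        print(word)
    print()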

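Code walkthrough:

Generating combinations: itertools.product generates the n-fold Cartesian product of the word list.
Checking the square condition: each combination is tested against the word-square condition.

Technical analysis

Brute force enumerates every possible combination, so a list of k candidate words requires checking up to k^n grids.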
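
A cheaper alternative to full enumeration is backtracking with prefix pruning: once the first k rows are placed, row k must begin with the k-th letter of each of them. The following is a sketch under that idea; the function name and the prefix index are illustrative choices.

```python
from collections import defaultdict


def word_squares_backtracking(words, n):
    """Yield n x n word squares, pruning partial squares by required prefix."""
    candidates = [w for w in words if len(w) == n]
    by_prefix = defaultdict(list)
    for word in candidates:
        for k in range(n + 1):
            by_prefix[word[:k]].append(word)

    def extend(rows):
        if len(rows) == n:
            yield tuple(rows)
            return
        # Row k must start with the k-th letter of every row already placed,
        # because row k and column k must spell the same word.
        k = len(rows)
        prefix = "".join(row[k] for row in rows)
        for word in by_prefix.get(prefix, []):
            yield from extend(rows + [word])

    yield from extend([])


words = ["bit", "ice", "ten", "cat"]
for square in word_squares_backtracking(words, 3):
    print("\n".join(square))
```

Only rows that can still lead to a valid square are ever tried, so most of the Cartesian product is never visited.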

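Categorizing and Tagging Words: Word Tagging Techniques Explained

Part of speech is a basic concept in natural language processing and is essential to understanding language structure. This chapter covers the principles and applications of part-of-speech (POS) tagging and introduces the related implementation techniques.

5.1 Using a Tagger

A POS tagger is a tool that assigns a part-of-speech tag to each word in a sentence. NLTK provides a convenient interface for POS tagging: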

import nltk
# Requires NLTK's tokenizer and tagger data (see nltk.download()).
text = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))

The output is:

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), 
('completely', 'RB'), ('different', 'JJ')]

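Code walkthrough:

nltk.pos_tag() takes a list of tokens and returns the corresponding POS tags.
Each word is tagged with a part of speech, such as 'CC' (coordinating conjunction), 'RB' (adverb), or 'IN' (preposition).
POS tags are essential for understanding the structure and meaning of a sentence.

5.2 Tagged Corpora

In NLTK, a tagged token is conventionally represented as a tuple containing the word itself and its tag: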

import nltk

tagged_token = nltk.tag.str2tuple('fly/NN')
print(tagged_token)  # ('fly', 'NN')

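Code walkthrough:

nltk.tag.str2tuple() converts a tagged token in string form into a tuple.
NLTK presents each word of a tagged corpus in this form, which simplifies later processing and analysis.
This representation makes the correspondence between a word and its part of speech explicit.

5.3 Why POS Tagging Matters

POS tagging is essential to many natural language processing tasks, for example:

Syntactic parsing: knowing the parts of speech helps in analyzing the grammatical structure of a sentence.
Semantic understanding: distinguishing parts of speech helps determine what a word means in a given context.
Speech synthesis: correct POS tags guide a speech synthesis system toward the correct pronunciation.

5.4 Automatic Tagging

Automatic POS tagging is a basic task in natural language processing. Common approaches include:

Rule-based methods: tag with predefined grammatical rules.
Statistical methods: tag with a statistical model such as a hidden Markov model.
Machine learning methods: train a machine learning model to do the tagging.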

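Comparison of approaches

Method	Advantages	Disadvantages
Rule-based	Highly interpretable	Rules are hard to maintain
Statistical	Relatively high accuracy	Requires a large amount of tagged data
Machine learning	Highly flexible	Requires a large amount of training data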
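
As an illustration of the rule-based row, here is a small suffix-rule tagger in plain Python, similar in spirit to NLTK's RegexpTagger. The rules themselves are hypothetical and deliberately crude.

```python
import re

# Hypothetical, hand-written rules; the first matching pattern wins.
RULES = [
    (r"^-?\d+(\.\d+)?$", "CD"),  # cardinal numbers
    (r".*ing$", "VBG"),          # gerunds
    (r".*ed$", "VBD"),           # simple past
    (r".*ly$", "RB"),            # adverbs
    (r".*s$", "NNS"),            # plural nouns
    (r".*", "NN"),               # default: singular noun
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in RULES]


def rule_based_tag(tokens):
    """Tag each token with the tag of the first rule whose pattern matches."""
    tagged = []
    for token in tokens:
        for pattern, tag in COMPILED:
            if pattern.match(token):
                tagged.append((token, tag))
                break
    return tagged


print(rule_based_tag(["dogs", "barked", "loudly", "during", "3", "storms"]))
```

Every decision can be traced to a single rule, which is the interpretability advantage. The preposition "during" is tagged VBG because it ends in "ing", and fixing cases like this one by one is what makes rule sets hard to maintain.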

5.5 Implementing a Simple Tagger

A simple tagger can be built from the tools NLTK provides:

import nltk
from nltk import pos_tag, word_tokenize

def simple_tagger(sentence):
    tokens = word_tokenize(sentence)
    return pos_tag(tokens)

# Example usage
sentence = "The quick brown fox jumps over the lazy dog."
print(simple_tagger(sentence))

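Code walkthrough:

word_tokenize() splits the sentence into individual tokens.
pos_tag() assigns a POS tag to each token.
This simple implementation covers basic tagging needs.

5.6 Evaluating a Tagger

Tagger performance is usually measured by accuracy, the fraction of tokens whose predicted tag matches the gold-standard tag: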

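import nltk
from nltk.metrics import accuracy

# Assumes `tagger` is an already trained tagger and `tagged_test_data` is a list
# of gold-standard sentences, each a list of (word, tag) pairs.
tagged_test_data = [...]

# Flatten gold and predicted tags into two parallel lists.
gold_tags = []
predicted_tags = []
for sentence in tagged_test_data:
    words = [word for word, _ in sentence]
    gold_tags.extend(tag for _, tag in sentence)
    predicted_tags.extend(tag for _, tag in tagger.tag(words))

# Compute accuracy
accuracy_score = accuracy(gold_tags, predicted_tags)
print(f"Tagging accuracy: {accuracy_score:.4f}")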

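Code walkthrough:

Accuracy is an important measure of tagger performance.
Measuring real-world performance requires tagged test data that was not used for training.
Besides accuracy, other metrics such as precision and recall can also be considered.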
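
Per-tag precision and recall can be computed directly from two parallel tag lists. The sketch below uses plain Python and made-up gold and predicted sequences; the function name is an illustrative choice.

```python
from collections import Counter


def per_tag_precision_recall(gold_tags, predicted_tags):
    """Return {tag: (precision, recall)} for two equal-length tag sequences."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    true_positive = Counter()
    gold_count = Counter(gold_tags)
    predicted_count = Counter(predicted_tags)
    for gold, predicted in zip(gold_tags, predicted_tags):
        if gold == predicted:
            true_positive[gold] += 1
    scores = {}
    for tag in set(gold_count) | set(predicted_count):
        # Precision: correct predictions of this tag over all predictions of it.
        precision = true_positive[tag] / predicted_count[tag] if predicted_count[tag] else 0.0
        # Recall: correct predictions of this tag over all gold occurrences of it.
        recall = true_positive[tag] / gold_count[tag] if gold_count[tag] else 0.0
        scores[tag] = (precision, recall)
    return scores


gold = ["DT", "NN", "VBZ", "DT", "NN", "NN"]
predicted = ["DT", "NN", "NN", "DT", "VBZ", "NN"]
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(f"accuracy: {accuracy:.4f}")
for tag, (precision, recall) in sorted(per_tag_precision_recall(gold, predicted).items()):
    print(f"{tag}: precision={precision:.2f} recall={recall:.2f}")
```

Overall accuracy hides the fact that the VBZ tag is never predicted correctly here, which the per-tag breakdown makes visible.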

玄貓

A technology enthusiast who shares insights on software development, cloud technology, and AI applications.