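Python Text Processing Techniques and Algorithm Implementations

This article covers a range of Python text processing techniques, from basic word-frequency counting to trie operations, WordNet semantic analysis, network analysis, and edit-distance computation. Each exercise pairs a code example with a short explanation of how it works and why the approach was chosen. It is aimed at developers interested in natural language processing and text mining. The article also touches on two research directions: authorship attribution and gender-specific lexical choice.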

Omitting the Most Frequent Words

Task

Write a function shorten(text, n) that omits the n most frequently occurring words from a text.

Implementation

from collections import Counter

def shorten(text, n):
    """Omit the n most frequent words from text (case-insensitive)."""
    words = text.split()
    # Count lowercased tokens so that "This" and "this" are the same word
    # and the filter below compares like with like
    freq_dist = Counter(word.lower() for word in words)
    stop_words = {word for word, _ in freq_dist.most_common(n)}

    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Example
text = "This is a sample text. This text is just a sample."
print(shorten(text, 2))

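Code walkthrough:

Count word frequencies: Counter tallies how often each token occurs.
Find the most frequent words: most_common(n) returns the n tokens with the highest counts.
Filter: tokens in that set are dropped, and the rest are joined back into a string.

Technical analysis

Counter reduces frequency counting to a single call.
The stop-word collection is built from the text itself and stored as a set, so each membership test during filtering takes constant time on average.
Because the text is split on whitespace, punctuation stays attached to tokens, so "text." and "text" are counted as different words.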
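
The punctuation issue can be avoided by tokenizing with a regular expression instead of split(). The following is a minimal sketch under that assumption; the name shorten_tokens and the token pattern are illustrative choices, not part of the original exercise, and the function returns a list of tokens rather than a string.

```python
import re
from collections import Counter

def shorten_tokens(text, n):
    """Drop the n most frequent words, ignoring case and punctuation."""
    # Assumed token pattern: runs of letters, digits, or apostrophes
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    top = {word for word, _ in Counter(tokens).most_common(n)}
    return [tok for tok in tokens if tok not in top]

# "the" occurs 3 times and "and" twice, so both are removed
print(shorten_tokens("The cat and the dog, and the bird.", 2))
```

Returning tokens instead of a joined string keeps the sketch short; restoring the original spacing and punctuation would need extra bookkeeping.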

Building a Lexical Index

Task

Write code that builds a lexical index and lets the user look up words by meaning or by pronunciation.

Implementation

class Lexicon:
    def __init__(self):
        self.index = {}

    def add_entry(self, word, meaning, pronunciation):
        """Add an entry for a word."""
        if word not in self.index:
            self.index[word] = []
        self.index[word].append({'meaning': meaning, 'pronunciation': pronunciation})

    def lookup(self, query):
        """Return (word, entry) pairs whose meaning or pronunciation contains query."""
        results = []
        for word, entries in self.index.items():
            for entry in entries:
                # Substring match against either field
                if query in entry['meaning'] or query in entry['pronunciation']:
                    results.append((word, entry))
        return results

# Example
lexicon = Lexicon()
lexicon.add_entry("hello", "a greeting", "/həˈloʊ/")
lexicon.add_entry("world", "the earth", "/wɜːrld/")
print(lexicon.lookup("greeting"))

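Code walkthrough:

Lexicon class: holds the lexical index.
- A dictionary maps each word to a list of entries; each entry stores a meaning and a pronunciation.
add_entry method: adds an entry for a word.
- If the word is not yet in the index, its entry list is initialized to an empty list.
lookup method: returns the entries that match a query.
- Iterates over the whole index and matches the query as a substring of the meaning or the pronunciation.

Technical analysis

A dictionary keyed by headword gives constant-time access to the entries of a known word. lookup, however, searches by meaning or pronunciation and therefore scans every entry, so its cost grows linearly with the size of the lexicon.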
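
If lookups by meaning are frequent, one option is to maintain an inverted index alongside the entries. The sketch below assumes whole-word matching on meanings and exact matching on pronunciations, which is a different (stricter) matching rule than the substring search above; the class name IndexedLexicon and its attributes are hypothetical.

```python
from collections import defaultdict

class IndexedLexicon:
    """Lexicon with inverted indexes for meaning words and pronunciations."""

    def __init__(self):
        self.entries = {}                         # headword -> [(meaning, pronunciation)]
        self.by_meaning_word = defaultdict(set)   # meaning word -> headwords
        self.by_pronunciation = defaultdict(set)  # pronunciation -> headwords

    def add_entry(self, word, meaning, pronunciation):
        self.entries.setdefault(word, []).append((meaning, pronunciation))
        for token in meaning.lower().split():
            self.by_meaning_word[token].add(word)
        self.by_pronunciation[pronunciation].add(word)

    def lookup(self, query):
        """Return headwords whose meaning contains the word, or whose pronunciation equals it."""
        by_meaning = self.by_meaning_word.get(query.lower(), set())
        by_sound = self.by_pronunciation.get(query, set())
        return sorted(by_meaning | by_sound)

lexicon = IndexedLexicon()
lexicon.add_entry("hello", "a greeting", "/həˈloʊ/")
lexicon.add_entry("world", "the earth", "/wɜːrld/")
print(lexicon.lookup("greeting"))
```

Each lookup is now two dictionary accesses instead of a full scan, at the cost of extra memory and the loss of substring matching.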

Sorting WordNet Synsets

Task

Write a list comprehension that sorts a list of WordNet synsets by their path distance to a given synset.

Implementation

from nltk.corpus import wordnet as wn

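def sort_synsets(synsets, target_synset):
    """Sort synsets by shortest path distance to target_synset, nearest first."""
    # shortest_path_distance counts the edges between two synsets in the
    # hypernym hierarchy; it returns None when no connecting path exists.
    return sorted(synsets, key=lambda synset: synset.shortest_path_distance(target_synset))

# Example
target = wn.synset('right_whale.n.01')
synsets = [wn.synset('minke_whale.n.01'), wn.synset('orca.n.01'), 
           wn.synset('novel.n.01'), wn.synset('tortoise.n.01')]
print(sort_synsets(synsets, target))

Code walkthrough:

sort_synsets function: sorts synsets by their path distance to the target.
- Uses the shortest_path_distance method as the sort key, so the closest synsets come first.

Technical analysis

WordNet already exposes the distance computation, so the sorting logic reduces to a single sorted call. The noun synsets in the example all descend from a common root, so a distance exists for every pair.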
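
To make "path distance" concrete without requiring the WordNet data, here is a small sketch that computes the same kind of distance on a hand-made taxonomy using breadth-first search. The PARENT table is a hypothetical toy hierarchy, not real WordNet data, and the helper names are illustrative.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent); not real WordNet data
PARENT = {
    "right_whale": "baleen_whale",
    "minke_whale": "baleen_whale",
    "baleen_whale": "whale",
    "orca": "whale",
    "whale": "animal",
    "tortoise": "animal",
}

def neighbors(node):
    """Return the children and the parent of node in the toy taxonomy."""
    linked = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        linked.append(PARENT[node])
    return linked

def path_distance(start, goal):
    """Number of edges on the shortest path between two nodes (BFS)."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path

candidates = ["tortoise", "orca", "minke_whale"]
print(sorted(candidates, key=lambda c: path_distance(c, "right_whale")))
```

In this toy hierarchy the minke whale is two edges from the right whale, the orca three, and the tortoise four, which gives the nearest-first order.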

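Sorting Words by Frequency

Task

Write a function that turns a list containing repeated words into a list of unique words sorted by decreasing frequency.

Implementation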

from collections import Counter

def sort_by_frequency(word_list):
    """Return the unique words of word_list, most frequent first."""
    freq_dist = Counter(word_list)
    return sorted(freq_dist, key=freq_dist.get, reverse=True)

# Example
words = ["table", "chair", "table", "table", "chair"]
print(sort_by_frequency(words))

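Code walkthrough:

Count word frequencies: Counter tallies how often each word occurs.
Sort: iterating over the Counter yields each distinct word once, and sorted orders them by decreasing count.

Technical analysis

Counter reduces frequency counting to a single call.
The key argument of sorted (here freq_dist.get) makes the sort use each word's count rather than the word itself.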
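
Counter also provides most_common, which returns the (word, count) pairs already ordered by decreasing count. A minimal alternative sketch, with sort_by_frequency_mc as an illustrative name:

```python
from collections import Counter

def sort_by_frequency_mc(word_list):
    """Unique words ordered by decreasing frequency, via Counter.most_common."""
    return [word for word, _ in Counter(word_list).most_common()]

words = ["table", "chair", "table", "table", "chair", "lamp"]
print(sort_by_frequency_mc(words))
```

This form is convenient when the counts themselves are also needed, since most_common() can be used directly without the list comprehension.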

Set Difference

Task

Write a function that returns the set of words that occur in a text but not in a vocabulary.

Implementation

def diff_vocabulary(text, vocabulary):
    """Return the set of words that occur in text but not in vocabulary."""
    text_words = set(text.split())
    vocab_set = set(vocabulary)
    return text_words.difference(vocab_set)

# Example
text = "This is a sample text with some words."
vocabulary = ["this", "is", "a", "sample"]
print(diff_vocabulary(text.lower(), vocabulary))

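Code walkthrough:

Convert to sets: the text tokens and the vocabulary are each converted to a set.
Set difference: the difference method returns the tokens that are not in the vocabulary.

Technical analysis

Set membership tests take constant time on average, so the difference is computed in time proportional to the number of distinct tokens rather than by comparing every token with every vocabulary word. The comparison is exact, so the caller must normalize case (as the example does with text.lower()), and punctuation attached to a token, as in "words.", is treated as part of the word.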
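
A variant that normalizes case and strips punctuation inside the function might look like the sketch below. The name unknown_words and the token pattern are assumptions made for illustration; the pattern only covers ASCII letters and apostrophes.

```python
import re

def unknown_words(text, vocabulary):
    """Return lowercased words found in text but missing from vocabulary."""
    # Assumed token pattern: runs of ASCII letters or apostrophes
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return tokens - {word.lower() for word in vocabulary}

text = "This is a sample text with some words."
print(unknown_words(text, ["this", "is", "a", "sample"]))
```

The - operator on sets is equivalent to calling difference with a set argument.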

Using itemgetter

Task

Use the itemgetter function to sort a list of words.

Implementation

from operator import itemgetter

def sort_words(words):
    """Sort a list of (word, freq) tuples by frequency using itemgetter."""
    # words is assumed to be a list of tuples, e.g. [(word, freq), ...]
    return sorted(words, key=itemgetter(1))

# Example
words = [("apple", 3), ("banana", 1), ("cherry", 2)]
print(sort_words(words))

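Code walkthrough:

itemgetter usage: itemgetter(1) returns the second element of each tuple, so the list is sorted by frequency in ascending order.

Technical analysis

itemgetter is a convenient way to pick out a specific element of a tuple or other indexable structure as the sort key, and it avoids writing a lambda such as lambda pair: pair[1].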
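
itemgetter also accepts several indexes, in which case it returns a tuple that can serve as a compound sort key. The following sketch uses made-up sample data to show a compound key and a two-pass stable sort for mixed ordering.

```python
from operator import itemgetter

# Made-up sample data: (word, frequency) pairs with a tie at frequency 3
pairs = [("cherry", 3), ("banana", 1), ("apple", 3), ("date", 2)]

# itemgetter(1, 0) yields (freq, word): sort by frequency, then alphabetically
by_freq_then_word = sorted(pairs, key=itemgetter(1, 0))

# Descending frequency with alphabetical tie-breaking: sort by the secondary
# key first, then by the primary key; Python's sort is stable
by_freq_desc = sorted(sorted(pairs, key=itemgetter(0)), key=itemgetter(1), reverse=True)

print(by_freq_then_word)
print(by_freq_desc)
```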

Trie Lookup

Task

Implement a lookup function that searches a trie for a key and returns the corresponding value.

Implementation

class TrieNode:
    def __init__(self):
        self.children = {}
        self.value = None

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        """Insert a key-value pair."""
        node = self.root
        for char in key:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.value = value

    def lookup(self, key):
        """Return the value stored for key, or None if there is none."""
        node = self.root
        for char in key:
            if char not in node.children:
                return None
            node = node.children[char]
        return node.value

# Example
trie = Trie()
trie.insert("hello", "greeting")
print(trie.lookup("hello"))

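Code walkthrough:

TrieNode class: represents a node in the trie.
- Holds a dictionary of child nodes and a value attribute.
Trie class: implements the trie.
- Provides insert and lookup methods.
lookup method: walks the trie one character at a time and returns the value at the final node.
- Returns None both when the key is absent and when the key is only a prefix of a stored key.

Technical analysis

A lookup takes time proportional to the length of the key, regardless of how many keys are stored. Keys that share a prefix share the same path of nodes, which makes a trie a good fit for prefix-based queries.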
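
The prefix property can be illustrated with a function that lists every stored key beginning with a given prefix. This is a sketch under the same node layout as above; the standalone insert helper and the name items_with_prefix are illustrative and not part of the original exercise.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.value = None

def insert(root, key, value):
    """Insert key into the trie rooted at root."""
    node = root
    for char in key:
        node = node.children.setdefault(char, TrieNode())
    node.value = value

def items_with_prefix(root, prefix):
    """Yield (key, value) pairs for every stored key that starts with prefix."""
    node = root
    for char in prefix:
        if char not in node.children:
            return  # no stored key has this prefix
        node = node.children[char]
    # Depth-first walk of the subtree below the prefix node
    stack = [(prefix, node)]
    while stack:
        key, current = stack.pop()
        if current.value is not None:
            yield key, current.value
        for char, child in current.children.items():
            stack.append((key + char, child))

root = TrieNode()
insert(root, "hello", "greeting")
insert(root, "help", "assistance")
insert(root, "world", "the earth")
print(sorted(items_with_prefix(root, "he")))
```

Only the subtree under the prefix node is visited, so keys with other prefixes are never examined.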

Keyword Linkage Network

Task

Extract keywords from NLTK's Shakespeare corpus and draw a keyword linkage network with NetworkX.

Implementation

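import networkx as nx
import matplotlib.pyplot as plt
from nltk.corpus import shakespeare  # corpus reader; not used yet in this simplified example

def keyword_linkage():
    """Build and draw a keyword linkage network."""
    # Simplified example: the keywords are hard-coded here; a real
    # implementation would extract them from the corpus.
    keywords = ["love", "hate", "jealousy", "power"]
    G = nx.Graph()
    G.add_nodes_from(keywords)

    # Connect only the first two keywords
    G.add_edge(keywords[0], keywords[1])

    nx.draw(G, with_labels=True)
    plt.show()

# Example
keyword_linkage()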

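Code walkthrough:

Keywords: the example uses a fixed keyword list; a full solution would extract the keywords from the corpus.
Build the graph: NetworkX creates the graph, and the keywords are added as nodes with an edge between linked keywords.
Draw the network: nx.draw renders the graph through Matplotlib.

Technical analysis

NLTK supplies the text and NetworkX supplies the graph model, so the two can be combined for text network analysis once keyword extraction and a linking criterion are defined.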
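
One possible linking criterion is co-occurrence: two keywords are linked when they appear in the same sentence, and the edge weight is the number of such sentences. The sketch below uses made-up sentences rather than the Shakespeare corpus, and the function name cooccurrence_edges is hypothetical. It depends only on the standard library.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences, keywords):
    """Count how often each pair of keywords appears in the same sentence."""
    keywords = set(keywords)
    counts = Counter()
    for sentence in sentences:
        present = sorted(keywords & set(sentence.lower().split()))
        counts.update(combinations(present, 2))
    return [(a, b, weight) for (a, b), weight in counts.items()]

# Made-up sentences for illustration; not taken from the Shakespeare corpus
sentences = [
    "love and hate are close",
    "power breeds jealousy",
    "love conquers hate",
    "love of power",
]
edges = cooccurrence_edges(sentences, ["love", "hate", "jealousy", "power"])
print(sorted(edges))
```

The resulting (node, node, weight) triples have the shape accepted by NetworkX's Graph.add_weighted_edges_from, so they can replace the single hard-coded edge.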

Edit Distance

Task

Study the Levenshtein algorithm and compute edit distance using NLTK.

Implementation

import nltk

def edit_distance(s1, s2):
    """Compute the edit distance between s1 and s2."""
    return nltk.edit_distance(s1, s2)

# Example
print(edit_distance("kitten", "sitting"))

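Code walkthrough:

NLTK edit distance function: the wrapper calls nltk.edit_distance directly.

Technical analysis

NLTK ships a ready-made edit distance function, so no dynamic programming code has to be written. The Levenshtein distance it computes is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.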
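
For readers studying the algorithm itself, here is a minimal pure-Python sketch of the standard dynamic programming formulation, keeping only two rows of the table. It is an illustration of the technique, not NLTK's implementation; the name levenshtein is chosen for this example.

```python
def levenshtein(s1, s2):
    """Levenshtein distance using two rows of the dynamic programming table."""
    # previous[j] = distance between the first i-1 chars of s1 and first j chars of s2
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning the first i chars of s1 into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The running time is proportional to len(s1) * len(s2), and memory use is proportional to len(s2).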

Computing Catalan Numbers

Task

Compute Catalan numbers with both a recursive and a dynamic programming implementation.

Implementation

def catalan_recursive(n):
    """Compute the n-th Catalan number recursively."""
    if n == 0:
        return 1
    else:
        return sum(catalan_recursive(i) * catalan_recursive(n-1-i) for i in range(n))

def catalan_dp(n):
    """Compute the n-th Catalan number with dynamic programming."""
    catalan = [0] * (n + 1)
    catalan[0] = 1
    for i in range(1, n + 1):
        catalan[i] = sum(catalan[j] * catalan[i-1-j] for j in range(i))
    return catalan[n]

# Example
print(catalan_recursive(5))
print(catalan_dp(5))

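Code walkthrough:

catalan_recursive function: computes Catalan numbers recursively.
- Applies the recurrence C(0) = 1, C(n) = sum of C(i) * C(n-1-i) for i from 0 to n-1.
catalan_dp function: computes Catalan numbers with dynamic programming.
- Stores intermediate results in a list so that each C(i) is computed only once.

Technical analysis

The plain recursion recomputes the same subproblems many times, so its running time grows exponentially with n. The dynamic programming version fills a table of n + 1 entries with a sum of at most n terms each, which takes on the order of n squared operations.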
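
Two further variants are worth knowing: memoizing the recursion, which keeps the recursive structure while avoiding repeated work, and the closed form C(n) = binomial(2n, n) / (n + 1). The sketch below shows both; the function names are illustrative, and math.comb requires Python 3.8 or later.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def catalan_memo(n):
    """Recursive Catalan number with cached subresults."""
    if n == 0:
        return 1
    return sum(catalan_memo(i) * catalan_memo(n - 1 - i) for i in range(n))

def catalan_closed(n):
    """Catalan number from the closed form binomial(2n, n) // (n + 1)."""
    return comb(2 * n, n) // (n + 1)

print([catalan_memo(i) for i in range(6)])    # [1, 1, 2, 5, 14, 42]
print([catalan_closed(i) for i in range(6)])  # [1, 1, 2, 5, 14, 42]
```

The integer division in the closed form is exact because n + 1 always divides binomial(2n, n).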

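Authorship Attribution Study

Task

Reproduce the authorship attribution results of (Zhao & Zobel, 2007).

Implementation

No code is given for this task. It depends on many details of the cited study, so reproducing it requires a close reading of the paper and related literature and the design of a suitable authorship attribution algorithm.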

Gender-Specific Lexical Choice Study

Task

Investigate gender-specific lexical choice and reproduce the related results.

Implementation

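# Simplified example; an actual study needs more rigorous statistics and analysis
from nltk.corpus import brown

def gender_specific_lexical_choice():
    """Count masculine and feminine pronouns in the Brown corpus."""
    # The Brown corpus is not annotated for author gender, so this only
    # counts gendered pronouns as a rough stand-in for a labeled corpus.
    male_words = {"he", "him", "his"}
    female_words = {"she", "her", "hers"}

    male_count = female_count = 0
    for word in brown.words():  # single pass over the corpus
        token = word.lower()
        if token in male_words:
            male_count += 1
        elif token in female_words:
            female_count += 1

    print(f"Male words count: {male_count}")
    print(f"Female words count: {female_count}")

# Example
gender_specific_lexical_choice()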

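Code walkthrough:

Corpus selection: a real study needs a corpus annotated with author or speaker gender. The example uses the Brown corpus, which has no such annotation, purely as a source of text.
Counting: the code tallies occurrences of masculine and feminine pronouns.

Technical analysis

Counting over a large corpus gives stable frequency estimates. Raw pronoun counts, however, only show how often each gender is mentioned; they say nothing about how writers of each gender choose their words, which is what the task asks about.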

Trie Visualization

Task

Write a recursive function that pretty-prints a trie in alphabetical order.

Implementation

class TrieNode:
    def __init__(self):
        self.children = {}
        self.value = None

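def pretty_print_trie(node, prefix=""):
    """Recursively print every key in the trie with its value."""
    if node.value is not None:
        print(f"{prefix}: {node.value}")
    # Visit children in sorted order so keys print alphabetically
    for char in sorted(node.children.keys()):
        pretty_print_trie(node.children[char], prefix + char)

# Example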
root = TrieNode()
root.children['a'] = TrieNode()
root.children['a'].value = "apple"
root.children['b'] = TrieNode()
root.children['b'].children['a'] = TrieNode()
root.children['b'].children['a'].value = "banana"
pretty_print_trie(root)

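Code walkthrough:

TrieNode class: represents a node in the trie.
pretty_print_trie function: recursively traverses the trie, accumulating the key in prefix and printing each node that holds a value.

Technical analysis

Recursion maps naturally onto tree traversal: each call handles one node and delegates its subtrees to further calls. Visiting the children in sorted order produces the keys in alphabetical order.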
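
The function above prints a flat list of keys. If the goal is to show the shape of the trie as well, one option is to indent each character by its depth. The sketch below assumes the same node layout; trie_lines is a hypothetical helper that returns the lines instead of printing them, which makes the output easier to check.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.value = None

def trie_lines(node, depth=0):
    """Return one line per node, indented by depth, children in sorted order."""
    lines = []
    for char in sorted(node.children):
        child = node.children[char]
        label = char if child.value is None else f"{char} ({child.value})"
        lines.append("  " * depth + label)
        lines.extend(trie_lines(child, depth + 1))
    return lines

# Small example trie holding the keys "a", "ba", and "be"
root = TrieNode()
root.children["b"] = TrieNode()
root.children["b"].children["e"] = TrieNode()
root.children["b"].children["e"].value = "be"
root.children["b"].children["a"] = TrieNode()
root.children["b"].children["a"].value = "ba"
root.children["a"] = TrieNode()
root.children["a"].value = "a"

print("\n".join(trie_lines(root)))
```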

玄貓

A technology enthusiast who shares notes and experience on software development, cloud technologies, and AI applications.