Automating AI-Driven Security Threat Hunting with Python: A Complete Guide from Machine Learning to Real-World Deployment

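On today's information security battlefield, the sophistication and automation of attacks have far outpaced what traditional defenses can handle. From advanced persistent threats (APTs) to zero-day exploits, from multi-stage attack chains to fileless malware, attackers' constantly evolving tactics keep security teams stretched thin. Signature matching, rule engines, and manual analysis still have value, but they struggle to keep pace with a rapidly shifting threat landscape and cannot cope with the sheer volume of security events generated every day.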

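Against this backdrop, machine learning and AI are reshaping security defense. By learning attack patterns from historical data, AI systems can surface subtle anomalies that human analysts tend to miss. By automating large-scale data processing, machine learning models can score millions of log records in seconds to minutes, depending on the model and hardware. And with continued retraining and adaptation, a detection system can evolve alongside the threats it faces.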

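Python plays a central role in this shift. As the dominant language in data science and machine learning, it has a rich and mature ecosystem. TensorFlow and PyTorch provide powerful frameworks for building deep learning models, scikit-learn packages well-tested implementations of classical machine learning algorithms, and pandas and NumPy provide efficient data processing. Together, these tools let security teams prototype quickly, validate ideas, and move successful models into production.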

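Architecturally, an AI-driven threat hunting system has to integrate capabilities across several layers. The data collection layer gathers logs and events from security devices and systems, including firewalls, IDS/IPS, endpoint protection, network traffic analyzers, and SIEM platforms. The preprocessing layer cleans, normalizes, and transforms raw data and extracts meaningful features for training. The model training layer builds threat detection models: supervised learning to classify known threats, unsupervised learning for anomaly detection, and reinforcement learning to optimize dynamic decisions. The inference layer applies trained models to live data streams and produces threat scores and alerts. The response layer then triggers automated defenses based on model output or notifies analysts for deeper investigation.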
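The layers described above can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not a production design: the record fields (`source`, `bytes_sent`, `failed_logins`), the function names, the threshold, and the scoring heuristic are all hypothetical, and `score` is a hand-written stand-in for a trained model. The training layer is omitted.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Normalized security event (hypothetical schema for this sketch)."""
    source: str
    bytes_sent: int
    failed_logins: int


def collect():
    """Collection layer: stand-in for pulling records from firewalls, EDR, or a SIEM."""
    return [
        {"source": "fw-01", "bytes_sent": "1200", "failed_logins": "0"},
        {"source": "edr-07", "bytes_sent": "50000000", "failed_logins": "6"},
        {"source": "ids-02", "bytes_sent": "n/a"},  # malformed record
    ]


def preprocess(raw_records):
    """Preprocessing layer: convert types and drop records that cannot be parsed."""
    events = []
    for record in raw_records:
        try:
            events.append(Event(
                source=record["source"],
                bytes_sent=int(record["bytes_sent"]),
                failed_logins=int(record["failed_logins"]),
            ))
        except (KeyError, ValueError):
            continue  # a real pipeline would log or quarantine these
    return events


def score(event):
    """Inference layer: placeholder heuristic standing in for a trained model."""
    risk = 0.1 * event.failed_logins
    if event.bytes_sent > 10_000_000:
        risk += 0.5
    return min(1.0, risk)


def respond(risk, threshold=0.5):
    """Response layer: decide between alerting an analyst and just logging."""
    return "alert" if risk >= threshold else "log"


def run_pipeline(raw_records):
    """Run every layer in order and return (source, risk, action) tuples."""
    results = []
    for event in preprocess(raw_records):
        risk = score(event)
        results.append((event.source, risk, respond(risk)))
    return results


results = run_pipeline(collect())
for source, risk, action in results:
    print(f"{source}: risk={risk:.2f} action={action}")
```

Keeping each layer behind its own function makes it straightforward to swap the heuristic in `score` for a real model later without touching collection or response logic.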

Technical Foundations of AI-Driven Security Automation

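Building an enterprise-grade AI threat hunting system requires a complete technology stack. Its foundation is the Python runtime, which must be a stable version compatible with the mainstream machine learning frameworks. The package versions used in this guide require Python 3.8 or later; since Python 3.8 itself has reached end of life, a currently supported release that your chosen frameworks list as compatible is the safer choice.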

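For environment management, isolating project dependencies in a virtual environment is essential practice. Whether you use the built-in venv module, the third-party virtualenv tool, or the more full-featured conda environment manager, each prevents package conflicts between projects. This isolation simplifies development and helps keep production environments stable.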

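The choice of machine learning framework depends on the use case. TensorFlow, with its broad ecosystem and production deployment toolchain, suits enterprise applications that need large-scale deployment. PyTorch, with its intuitive dynamic computation graph and research-friendly design, suits rapid prototyping and algorithm experimentation. For classical machine learning tasks, scikit-learn offers a wide range of well-validated algorithm implementations, from decision trees to support vector machines and from random forests to gradient boosting, covering the large majority of needs.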

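Data processing is the most time-consuming and also the most critical part of the machine learning workflow. pandas provides the DataFrame structure, which makes data cleaning, transformation, and feature engineering intuitive and efficient. NumPy provides the underlying numerical capabilities, supporting large matrix operations and vectorized computation. For time series data, specialized libraries such as statsmodels offer additional analysis tools.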

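# Python environment setup and dependency installation
# This script shows how to set up a complete AI security development environment

# First, create a virtual environment
# Run on the command line:
# python -m venv ai_security_env
# source ai_security_env/bin/activate  # Linux/macOS
# ai_security_env\Scripts\activate     # Windows

# Install the core machine learning frameworks
# Quote each version specifier so the shell does not treat ">" as a redirect
# TensorFlow for deep learning model development
# pip install "tensorflow>=2.13.0"

# PyTorch as an alternative deep learning framework
# pip install torch torchvision torchaudio

# scikit-learn for classical machine learning algorithms
# pip install "scikit-learn>=1.3.0"

# Data processing and analysis packages
# pandas for structured data processing
# pip install "pandas>=2.0.0"

# NumPy for numerical computation
# pip install "numpy>=1.24.0"

# Data visualization packages
# matplotlib for basic plotting
# pip install "matplotlib>=3.7.0"

# seaborn for higher-level statistical visualization
# pip install "seaborn>=0.12.0"

# Security-specific libraries
# scapy for network packet analysis
# pip install "scapy>=2.5.0"

# impacket for network protocol implementations
# pip install "impacket>=0.11.0"

# yara-python for malware rule matching
# pip install "yara-python>=4.3.0"

# psutil for the system resource check in the verification function below
# pip install psutil
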
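# Verify the installation
import sys

import numpy as np
import pandas as pd
import psutil  # third-party; used for the memory and CPU checks
import sklearn
import tensorflow as tf

def verify_environment():
    """
    Verify that the development environment is set up correctly.
    Checks the version and availability of every key package.
    """
    print("Environment verification report")
    print("=" * 50)

    # Check the Python version
    print(f"Python version: {sys.version}")

    # Check TensorFlow and list any visible GPUs (empty list means CPU only)
    print(f"TensorFlow version: {tf.__version__}")
    print(f"GPU devices: {tf.config.list_physical_devices('GPU')}")

    # Check scikit-learn
    print(f"scikit-learn version: {sklearn.__version__}")

    # Check the data processing packages
    print(f"pandas version: {pd.__version__}")
    print(f"NumPy version: {np.__version__}")

    # Check system resources
    print(f"Available memory: {psutil.virtual_memory().available / (1024**3):.2f} GB")
    print(f"CPU cores: {psutil.cpu_count()}")

    print("=" * 50)
    print("Environment verification complete")

# Run the verification
if __name__ == "__main__":
    verify_environment()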

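This setup script gives a complete guide to building the development environment. The virtual environment isolates project dependencies and avoids conflicts with globally installed packages. The version specifiers set minimum versions, which guards against outdated packages; for strict reproducibility across a team or in production, pin exact versions as well, for example by recording the output of pip freeze in a requirements file. The verification function is a quick check that all required components are installed correctly.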

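Integrating data sources is another key part of an AI security system. In a modern enterprise, security data is scattered across many systems: consolidated logs in the SIEM platform, endpoint telemetry from EDR, packet captures from network traffic analyzers, and indicator feeds from threat intelligence platforms. An effective collection strategy needs a standardized data pipeline that keeps data timely, complete, and consistent.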

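The preprocessing stage has to deal with several data quality problems. Missing values can come from sensor failures or network outages and need either a suitable imputation strategy or removal. Outliers may be genuine attack signals or device errors, so they call for careful analysis rather than automatic deletion. Standardizing data formats lets data from different sources be processed together. Feature engineering then turns raw data into a form the model can use, a step that usually depends heavily on domain knowledge.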
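The preprocessing concerns above can be sketched with pandas. This is one possible approach on a made-up four-row table, not a prescribed recipe: the column names, median imputation, and the 1.5 x IQR outlier rule are all assumptions chosen for illustration. Outliers are flagged rather than removed, since in security data an outlier may be exactly the signal of interest.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records with inconsistent formats, a gap, and one extreme value
raw = pd.DataFrame({
    "timestamp": [
        "2025-01-01 00:00:00",
        "2025-01-01 00:01:00",
        "2025-01-01 00:02:00",
        "2025-01-01 00:03:00",
    ],
    "protocol": ["tcp", "TCP ", "udp", "Tcp"],
    "bytes_sent": [1200.0, np.nan, 900.0, 5_000_000.0],
})


def preprocess(df):
    """Normalize formats, impute missing values, and flag (not drop) outliers."""
    out = df.copy()

    # Format normalization: parse timestamps, unify protocol spelling
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    out["protocol"] = out["protocol"].str.strip().str.upper()

    # Missing values: remember where the gap was, then impute with the median
    out["bytes_missing"] = out["bytes_sent"].isna().astype(int)
    out["bytes_sent"] = out["bytes_sent"].fillna(out["bytes_sent"].median())

    # Outliers: flag values above Q3 + 1.5 * IQR so an analyst or model can use them
    q1, q3 = out["bytes_sent"].quantile([0.25, 0.75])
    upper = q3 + 1.5 * (q3 - q1)
    out["bytes_outlier"] = (out["bytes_sent"] > upper).astype(int)
    return out


clean = preprocess(raw)
print(clean)
```

Keeping the `bytes_missing` and `bytes_outlier` indicator columns lets a downstream model learn from the fact that a value was missing or extreme, instead of silently losing that information.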

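Model training needs adequate compute. For deep learning models, GPU acceleration can cut training time substantially, in favorable cases from days to hours. Cloud platforms provide elastic compute so teams can scale with demand. Container technology such as Docker simplifies environment configuration and keeps training and production environments consistent.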

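The choice of evaluation metrics should reflect business goals. Accuracy is intuitive but can be misleading in security settings, where classes are heavily imbalanced. Precision measures how many alerts are true detections, so it drops as false positives rise; recall measures how many real threats are caught, so it drops as threats are missed; the F1 score balances the two. The ROC curve and AUC give a threshold-independent view of performance. In practice the criteria need to be tuned to the scenario: in critical infrastructure protection, for example, a missed detection may cost far more than a false alarm.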
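A small worked example shows why accuracy alone misleads on imbalanced data. The labels below are invented for illustration: a test set of 990 benign and 10 malicious samples, a trivial classifier that labels everything benign, and a hypothetical detector that catches 8 of the 10 threats at the cost of 20 false alarms.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels (1 = malicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Define the ratios as 0.0 when the denominator is empty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced test set: 990 benign followed by 10 malicious samples
y_true = [0] * 990 + [1] * 10

# Trivial classifier: everything is benign
always_benign = [0] * 1000

# Hypothetical detector: 20 false alarms, 8 of 10 threats caught
detector_pred = [1] * 20 + [0] * 970 + [1] * 8 + [0] * 2

naive = binary_metrics(y_true, always_benign)
useful = binary_metrics(y_true, detector_pred)

for name, m in (("always benign", naive), ("detector", useful)):
    print(f"{name}: " + ", ".join(f"{k}={v:.3f}" for k, v in m.items()))
```

The trivial classifier scores higher on accuracy than the detector while catching no threats at all, which is why recall, precision, and F1 are the more informative numbers here.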

Implementing Malware Detection with Machine Learning

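Malware detection is one of the most mature areas of AI in security. Signature matching works against known threats but cannot handle variants and zero-day attacks. Machine learning approaches analyze a file's behavioral features, structural properties, and execution patterns, which lets them identify malicious programs that have never been seen before.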

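From an implementation standpoint, malware detection is a typical binary classification problem. The training data consists of labeled benign and malicious file samples, and the model learns the feature patterns that separate them. Feature extraction is the most critical step, because each file has to be converted into a numeric vector. Static features include file size, imported functions, string content, and PE structure; dynamic features include API call sequences, file system operations, and network communication behavior.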

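The support vector machine (SVM) is one of the algorithms commonly used for malware detection. An SVM separates classes by finding the maximum-margin decision boundary and performs well in high-dimensional feature spaces. The kernel trick lets an SVM handle problems that are not linearly separable; common kernel functions include the linear, polynomial, and RBF kernels.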
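The effect of the kernel trick is easiest to see on data that no straight line can separate. The sketch below uses scikit-learn's synthetic `make_circles` dataset (two concentric rings) as a stand-in for real features; the dataset and its parameters are assumptions chosen for illustration and have nothing to do with malware specifically.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def kernel_accuracy(kernel):
    """Train an SVC with the given kernel and return its test-set accuracy."""
    model = SVC(kernel=kernel, gamma="scale")
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


linear_acc = kernel_accuracy("linear")
rbf_acc = kernel_accuracy("rbf")

print(f"linear kernel accuracy: {linear_acc:.3f}")
print(f"RBF kernel accuracy:    {rbf_acc:.3f}")
```

On this data the linear kernel performs close to chance while the RBF kernel separates the rings, which is the motivation for defaulting to RBF when the feature relationships are unknown.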

# End-to-end malware detection workflow
# This example walks through the full process from data preparation to model training

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns

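class MalwareDetector:
    """
    Malware detector.
    Wraps the full detection workflow: preprocessing, training, and evaluation.
    """

    def __init__(self, kernel='rbf', C=1.0, gamma='scale'):
        """
        Initialize the detector.

        Args:
            kernel: SVM kernel type (linear, rbf, poly)
            C: regularization parameter; larger values penalize misclassification more
            gamma: kernel coefficient for the RBF kernel; controls how smooth
                the decision boundary is
        """
        self.scaler = StandardScaler()
        self.model = SVC(
            kernel=kernel,
            C=C,
            gamma=gamma,
            probability=True,  # enable probability estimates for threshold tuning
            random_state=42
        )
        self.is_trained = False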
        
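    def extract_features(self, file_path):
        """
        Extract a feature vector from a file.
        This is a simplified placeholder; real use needs far richer feature engineering.

        Args:
            file_path: path of the file to analyze

        Returns:
            Feature vector (numpy array)
        """
        # A real implementation would include:
        # 1. PE header analysis (import table, section info, resources)
        # 2. String extraction with TF-IDF vectorization
        # 3. API call sequence analysis
        # 4. Entropy calculation
        # 5. File metadata extraction

        features = {
            'file_size': 0,           # file size
            'num_sections': 0,        # number of PE sections
            'num_imports': 0,         # number of imported functions
            'entropy': 0.0,           # file entropy
            'suspicious_strings': 0,  # count of suspicious strings
            'api_calls_diversity': 0, # diversity of API calls
            'network_indicators': 0,  # network behavior indicators
            'persistence_methods': 0  # number of persistence mechanisms
        }

        # The actual extraction logic belongs here.
        # For now this returns placeholder values (all zeros) and ignores file_path.
        return np.array(list(features.values()), dtype=float)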
    
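    def prepare_data(self, X, y, test_size=0.2):
        """
        Prepare the training and test sets.

        Args:
            X: feature matrix (n_samples, n_features)
            y: label vector (n_samples,), 0 = benign, 1 = malicious
            test_size: fraction of samples held out for testing

        Returns:
            Tuple (X_train, X_test, y_train, y_test), with features already scaled
        """
        # Split the dataset.
        # stratify keeps the class ratio the same in the training and test sets.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y,
            test_size=test_size,
            stratify=y,  # preserve class proportions
            random_state=42
        )

        # Standardize features.
        # SVMs are sensitive to feature scale; fit the scaler on the training set
        # only, then apply the same transform to the test set.
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)

        return X_train_scaled, X_test_scaled, y_train, y_test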
    
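    def train(self, X_train, y_train):
        """
        Train the malware detection model.

        Args:
            X_train: training feature matrix (already scaled by prepare_data)
            y_train: training label vector
        """
        print("Training malware detection model...")

        # Use cross-validation to gauge how stable the model is.
        # 5-fold cross-validation gives a more reliable performance estimate
        # than a single split.
        cv_scores = cross_val_score(
            self.model, X_train, y_train,
            cv=5,  # 5-fold cross-validation
            scoring='f1'  # evaluate with the F1 score
        )

        print(f"Cross-validation F1: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

        # Train the final model on the full training set
        self.model.fit(X_train, y_train)
        self.is_trained = True

        print("Model training complete")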
    
    def evaluate(self, X_test, y_test):
        """
        Evaluate model performance.

        Args:
            X_test: test feature matrix (already scaled by prepare_data)
            y_test: test label vector
        """
        if not self.is_trained:
            raise ValueError("Model is not trained; call train() first")

        # Predict labels and the probability of the malicious class
        y_pred = self.model.predict(X_test)
        y_pred_proba = self.model.predict_proba(X_test)[:, 1]

        # Compute evaluation metrics
        print("\nModel evaluation:")
        print("=" * 50)

        # The classification report includes precision, recall, and F1
        print("\nClassification report:")
        print(classification_report(
            y_test, y_pred,
            target_names=['benign', 'malicious']
        ))

        # AUC measures overall ability to rank malicious above benign samples
        auc_score = roc_auc_score(y_test, y_pred_proba)
        print(f"AUC: {auc_score:.4f}")

        # Visualize the confusion matrix
        self._plot_confusion_matrix(y_test, y_pred)

        return {
            'predictions': y_pred,
            'probabilities': y_pred_proba,
            'auc_score': auc_score
        }
    
    def _plot_confusion_matrix(self, y_true, y_pred):
        """
        Plot the confusion matrix.
        Shows false positives and false negatives at a glance.
        """
        cm = confusion_matrix(y_true, y_pred)

        plt.figure(figsize=(8, 6))
        sns.heatmap(
            cm, annot=True, fmt='d',
            cmap='Blues',
            xticklabels=['benign', 'malicious'],
            yticklabels=['benign', 'malicious']
        )
        plt.title('Malware detection confusion matrix')
        plt.ylabel('Actual class')
        plt.xlabel('Predicted class')
        plt.tight_layout()
        plt.savefig('confusion_matrix.png', dpi=300)
        plt.close()  # release the figure once it has been written to disk
        print("\nConfusion matrix saved to confusion_matrix.png")
    
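    def predict(self, X):
        """
        Predict on new samples.

        Args:
            X: feature matrix or a single feature vector (unscaled)

        Returns:
            Predicted labels and class probabilities
        """
        if not self.is_trained:
            raise ValueError("Model is not trained")

        # Make sure the input is a 2D array
        X = np.asarray(X)
        if X.ndim == 1:
            X = X.reshape(1, -1)

        # Apply the scaler fitted during prepare_data
        X_scaled = self.scaler.transform(X)

        # Predict
        predictions = self.model.predict(X_scaled)
        probabilities = self.model.predict_proba(X_scaled)

        return predictions, probabilities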

# Usage example
def main():
    """
    Demonstrate the full malware detection workflow.
    """
    # Generate sample data.
    # A real application would extract features from actual files.
    np.random.seed(42)

    # Simulated features for 1000 benign files
    benign_samples = np.random.randn(1000, 8) * 0.5

    # Simulated features for 1000 malicious files.
    # Malicious files are shifted clearly on every feature.
    malicious_samples = np.random.randn(1000, 8) * 0.8 + np.array([2, 1, 3, 2, 1, 2, 3, 2])

    # Combine into one dataset
    X = np.vstack([benign_samples, malicious_samples])
    y = np.array([0] * 1000 + [1] * 1000)  # 0 = benign, 1 = malicious

    # Initialize the detector
    detector = MalwareDetector(kernel='rbf', C=10, gamma='scale')

    # Prepare the data
    X_train, X_test, y_train, y_test = detector.prepare_data(X, y)

    # Train the model
    detector.train(X_train, y_train)

    # Evaluate the model
    results = detector.evaluate(X_test, y_test)

    # Predict on a new sample drawn from the simulated malicious distribution
    new_sample = np.random.randn(8) * 0.8 + np.array([2, 1, 3, 2, 1, 2, 3, 2])
    pred, proba = detector.predict(new_sample)

    print(f"\nPrediction for new sample: {'malicious' if pred[0] == 1 else 'benign'}")
    print(f"Malicious probability: {proba[0][1]:.4f}")

if __name__ == "__main__":
    main()

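This example shows the standard malware detection workflow. The MalwareDetector class wraps the required functionality, from a feature extraction stub through training and evaluation. Preprocessing covers splitting and standardization, which protects the quality of training. Cross-validation gives a more reliable performance estimate than a single split and helps reveal overfitting, though it does not prevent it. Evaluation goes beyond accuracy to precision, recall, and AUC, which are more meaningful in security settings. Because the sample data is synthetic and well separated, the scores it produces say nothing about performance on real malware.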

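In real applications, feature engineering is the decisive factor in model performance. Static analysis can extract structural features from PE files, including the number of sections, import table contents, and resource types. String analysis can identify suspicious URLs, IP addresses, or file paths. Entropy calculation can detect encrypted or compressed code sections, which are characteristic of many malware samples. Dynamic analysis requires running the sample in a sandbox and monitoring its behavior, including file operations, registry modifications, network communication, and process injection.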
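The entropy feature mentioned above is typically Shannon entropy over byte values, which ranges from 0 bits per byte (a single repeated byte) to 8 bits per byte (all 256 values equally likely). The sketch below is a minimal standard-library version; the function name and sample inputs are illustrative, and a real extractor would usually compute it per PE section rather than over the whole file.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    entropy = 0.0
    for count in Counter(data).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Illustrative inputs: repetitive text, uniform bytes, and random bytes
repetitive = b"A" * 1024
uniform = bytes(range(256)) * 4
random_like = os.urandom(4096)  # stands in for encrypted or compressed content

print(f"repetitive:  {shannon_entropy(repetitive):.3f}")
print(f"uniform:     {shannon_entropy(uniform):.3f}")
print(f"random-like: {shannon_entropy(random_like):.3f}")
```

High entropy is only a hint: legitimate installers and compressed resources also score close to 8, so this value works best as one feature among many rather than as a verdict.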

Anomaly Detection in Network Security

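Anomaly detection is another important application of AI in security. Where supervised learning needs large amounts of labeled data, anomaly detection can work with few labels or none. That is especially valuable in security, because malicious behavior is rare and varied, which makes adequate training samples hard to obtain.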

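Unsupervised algorithms build a baseline of normal behavior and identify patterns that deviate significantly from it. Isolation Forest is particularly well suited to anomaly detection. Its core idea is that anomalies are easier to isolate, because their feature values differ clearly from those of most data points. The algorithm builds trees by randomly choosing a feature and a split value, and anomalies usually need fewer splits to end up alone in a partition.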
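The "fewer splits" intuition can be checked with a toy, one-dimensional version of the isolation procedure. This is a simplified illustration using only the standard library, not scikit-learn's implementation: it repeatedly picks a random split value, keeps the side containing the target point, and counts how many splits pass before the point is alone. The data and function names are assumptions made for the example.

```python
import random


def isolation_depth(values, target, rng):
    """Count random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:  # identical values cannot be separated further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def average_depth(values, target, trials=200, seed=0):
    """Average isolation depth over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials


# 200 "normal" points around 0 plus one obvious outlier
data_rng = random.Random(42)
normal_points = [data_rng.gauss(0, 1) for _ in range(200)]
outlier = 10.0
inlier = sorted(normal_points)[100]  # a point from the dense middle
values = normal_points + [outlier]

outlier_depth = average_depth(values, outlier)
inlier_depth = average_depth(values, inlier)

print(f"average splits to isolate outlier: {outlier_depth:.2f}")
print(f"average splits to isolate inlier:  {inlier_depth:.2f}")
```

The outlier is isolated after far fewer splits on average than the inlier. Isolation Forest turns that average path length, computed across many trees and several features, into its anomaly score.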

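In network security, anomaly detection applies to many scenarios. Network traffic analysis monitors connection patterns, transfer volumes, and protocol usage to identify possible DDoS attacks or data exfiltration. User behavior analysis builds a baseline of normal activity for each user to detect compromised accounts or insider threats. System log analysis monitors the state of servers and applications to catch security incidents or system failures early.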

# Network traffic anomaly detection system
# Uses the Isolation Forest algorithm to identify anomalous network behavior

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

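class NetworkAnomalyDetector:
    """
    Network traffic anomaly detector.
    Uses Isolation Forest to build a baseline of normal traffic and flag anomalies.
    """

    def __init__(self, contamination=0.1, random_state=42):
        """
        Initialize the anomaly detector.

        Args:
            contamination: expected fraction of anomalies, in (0, 0.5];
                it sets the score threshold used by predict()
            random_state: random seed, for reproducible results
        """
        self.contamination = contamination
        self.model = IsolationForest(
            contamination=contamination,
            random_state=random_state,
            n_estimators=100,  # number of trees; more trees give more stable scores
            max_samples='auto',  # samples drawn to train each tree
            max_features=1.0,    # fraction of features drawn to train each tree
            bootstrap=False      # sample without replacement
        )
        self.scaler = StandardScaler()
        self.feature_names = None
        self.is_trained = False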
        
    def extract_network_features(self, network_data):
        """
        Extract features from network traffic data.

        Args:
            network_data: DataFrame of network traffic records

        Returns:
            Feature DataFrame
        """
        features = pd.DataFrame()

        # Time features:
        # anomalous activity may occur outside working hours
        features['hour'] = network_data['timestamp'].dt.hour
        features['day_of_week'] = network_data['timestamp'].dt.dayofweek
        features['is_weekend'] = (features['day_of_week'] >= 5).astype(int)

        # Traffic features:
        # transfer volume and packet counts
        features['bytes_sent'] = network_data['bytes_sent']
        features['bytes_received'] = network_data['bytes_received']
        features['total_bytes'] = features['bytes_sent'] + features['bytes_received']
        features['packet_count'] = network_data['packet_count']
        # +1 in the denominator avoids division by zero
        features['avg_packet_size'] = features['total_bytes'] / (features['packet_count'] + 1)

        # Connection features:
        # duration, and how many distinct destinations each source IP contacts
        features['connection_duration'] = network_data['duration']
        features['unique_dst_ips'] = network_data.groupby('src_ip')['dst_ip'].transform('nunique')
        features['unique_dst_ports'] = network_data.groupby('src_ip')['dst_port'].transform('nunique')

        # Protocol features:
        # one-hot flags for protocol usage
        features['protocol_tcp'] = (network_data['protocol'] == 'TCP').astype(int)
        features['protocol_udp'] = (network_data['protocol'] == 'UDP').astype(int)
        features['protocol_icmp'] = (network_data['protocol'] == 'ICMP').astype(int)

        # Port features:
        # unusual port usage may indicate malicious activity
        common_ports = [80, 443, 22, 21, 25, 53]
        features['is_common_port'] = network_data['dst_port'].isin(common_ports).astype(int)
        features['port_number'] = network_data['dst_port']

        # Ratio features:
        # an unusual send/receive ratio may indicate data exfiltration
        features['send_recv_ratio'] = features['bytes_sent'] / (features['bytes_received'] + 1)

        self.feature_names = features.columns.tolist()
        return features
    
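    def train(self, X_train):
        """
        Train the anomaly detection model.

        Args:
            X_train: training feature matrix (normal traffic only)
        """
        print("Training anomaly detection model...")
        print(f"Training samples: {len(X_train)}")
        print(f"Feature dimensions: {X_train.shape[1]}")

        # Standardize features
        X_train_scaled = self.scaler.fit_transform(X_train)

        # Train the model
        self.model.fit(X_train_scaled)
        self.is_trained = True

        # Inspect the anomaly score distribution on the training set.
        # decision_function is already shifted by the model's fitted offset, so
        # this percentile lands close to 0, the cut-off that predict() uses.
        scores = self.model.decision_function(X_train_scaled)
        threshold = np.percentile(scores, self.contamination * 100)

        print(f"Anomaly score threshold: {threshold:.4f}")
        print("Model training complete")

        return scores, threshold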
    
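    def detect(self, X_test):
        """
        Detect anomalous traffic.

        Args:
            X_test: test feature matrix

        Returns:
            predictions: predicted labels (1 = normal, -1 = anomalous)
            scores: anomaly scores (lower means more anomalous)
        """
        if not self.is_trained:
            raise ValueError("Model is not trained")

        # Apply the scaler fitted during training
        X_test_scaled = self.scaler.transform(X_test)

        # Predict:
        # -1 means anomalous, 1 means normal
        predictions = self.model.predict(X_test_scaled)

        # Anomaly scores:
        # negative scores are anomalies; the more negative, the more anomalous
        scores = self.model.decision_function(X_test_scaled)

        return predictions, scores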
    
    def analyze_anomalies(self, X_test, predictions, scores, top_k=10):
        """
        Summarize the detected anomalies.

        Args:
            X_test: test feature DataFrame
            predictions: predicted labels
            scores: anomaly scores
            top_k: how many of the most anomalous samples to print
        """
        # Find the anomalous samples
        anomaly_indices = np.where(predictions == -1)[0]

        print("\nAnomaly detection results:")
        print("=" * 50)
        print(f"Total samples: {len(predictions)}")
        print(f"Anomalous samples: {len(anomaly_indices)}")
        print(f"Anomaly rate: {len(anomaly_indices) / len(predictions) * 100:.2f}%")

        if len(anomaly_indices) > 0:
            # Sort by anomaly score, most anomalous (lowest score) first
            sorted_indices = anomaly_indices[np.argsort(scores[anomaly_indices])]

            print(f"\nTop {min(top_k, len(sorted_indices))} most anomalous samples:")
            print("-" * 50)

            for i, idx in enumerate(sorted_indices[:top_k], 1):
                print(f"\nAnomalous sample #{i} (index {idx}):")
                print(f"  Anomaly score: {scores[idx]:.4f}")

                # Show the feature values
                if self.feature_names:
                    print("  Feature values:")
                    for feat_name, feat_value in zip(self.feature_names, X_test.iloc[idx]):
                        print(f"    {feat_name}: {feat_value}")

        return anomaly_indices
    
    def visualize_anomalies(self, X_test, predictions, scores):
        """
        Visualize the anomaly detection results.
        """
        # Reduce to 2D with PCA for plotting
        from sklearn.decomposition import PCA

        pca = PCA(n_components=2)
        X_pca = pca.fit_transform(self.scaler.transform(X_test))

        # Draw the figure
        plt.figure(figsize=(12, 5))

        # Subplot 1: predictions
        plt.subplot(1, 2, 1)
        normal_mask = predictions == 1
        anomaly_mask = predictions == -1

        plt.scatter(
            X_pca[normal_mask, 0],
            X_pca[normal_mask, 1],
            c='blue', label='normal', alpha=0.5, s=20
        )
        plt.scatter(
            X_pca[anomaly_mask, 0],
            X_pca[anomaly_mask, 1],
            c='red', label='anomalous', alpha=0.8, s=50, marker='x'
        )
        plt.xlabel('First principal component')
        plt.ylabel('Second principal component')
        plt.title('Anomaly detection results (PCA projection)')
        plt.legend()
        plt.grid(True, alpha=0.3)

        # Subplot 2: anomaly score distribution
        plt.subplot(1, 2, 2)
        plt.hist(scores[normal_mask], bins=50, alpha=0.5, label='normal', color='blue')
        plt.hist(scores[anomaly_mask], bins=50, alpha=0.5, label='anomalous', color='red')
        plt.xlabel('Anomaly score')
        plt.ylabel('Sample count')
        plt.title('Anomaly score distribution')
        plt.legend()
        plt.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.savefig('anomaly_detection_results.png', dpi=300)
        plt.close()  # release the figure once it has been written to disk
        print("\nVisualization saved to anomaly_detection_results.png")

# Usage example
def generate_sample_network_data(n_normal=1000, n_anomalous=100):
    """
    Generate sample network traffic data.
    """
    np.random.seed(42)

    # Generate normal traffic
    normal_data = {
        'timestamp': pd.date_range('2025-01-01', periods=n_normal, freq='1min'),
        'src_ip': ['192.168.1.{}'.format(np.random.randint(1, 255)) for _ in range(n_normal)],
        'dst_ip': ['10.0.0.{}'.format(np.random.randint(1, 255)) for _ in range(n_normal)],
        'dst_port': np.random.choice([80, 443, 22], n_normal),
        'protocol': np.random.choice(['TCP', 'UDP'], n_normal, p=[0.8, 0.2]),
        'bytes_sent': np.random.lognormal(10, 1, n_normal).astype(int),
        'bytes_received': np.random.lognormal(10, 1, n_normal).astype(int),
        'packet_count': np.random.poisson(50, n_normal),
        'duration': np.random.exponential(30, n_normal)
    }

    # Generate anomalous traffic.
    # Traits: large outbound transfers, unusual ports, early-morning timestamps
    anomalous_data = {
        'timestamp': pd.date_range('2025-01-01 02:00:00', periods=n_anomalous, freq='1min'),
        'src_ip': ['192.168.1.{}'.format(np.random.randint(1, 255)) for _ in range(n_anomalous)],
        'dst_ip': ['8.8.8.{}'.format(np.random.randint(1, 255)) for _ in range(n_anomalous)],
        'dst_port': np.random.randint(1024, 65535, n_anomalous),
        'protocol': np.random.choice(['TCP', 'UDP'], n_anomalous),
        'bytes_sent': np.random.lognormal(15, 2, n_anomalous).astype(int),  # much larger transfers
        'bytes_received': np.random.lognormal(8, 1, n_anomalous).astype(int),
        'packet_count': np.random.poisson(200, n_anomalous),  # more packets
        'duration': np.random.exponential(60, n_anomalous)  # longer connections
    }

    # Combine the data
    df_normal = pd.DataFrame(normal_data)
    df_anomalous = pd.DataFrame(anomalous_data)
    df_all = pd.concat([df_normal, df_anomalous], ignore_index=True)

    return df_normal, df_all

def main():
    """
    Demonstrate the network anomaly detection workflow.
    """
    # Generate sample data
    df_normal, df_all = generate_sample_network_data()

    # Initialize the detector
    detector = NetworkAnomalyDetector(contamination=0.1)

    # Extract features
    print("Extracting network traffic features...")
    X_train = detector.extract_network_features(df_normal)
    X_test = detector.extract_network_features(df_all)

    # Train the model (normal traffic only)
    scores_train, threshold = detector.train(X_train)

    # Detect anomalies
    predictions, scores = detector.detect(X_test)

    # Analyze the results
    anomaly_indices = detector.analyze_anomalies(X_test, predictions, scores)

    # Visualize
    detector.visualize_anomalies(X_test, predictions, scores)

if __name__ == "__main__":
    main()

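This network anomaly detection system shows the full implementation workflow. Feature engineering extracts multi-dimensional features from raw traffic data, covering time patterns, traffic statistics, connection characteristics, and protocol usage. By learning the distribution of normal traffic, the Isolation Forest model can identify patterns that deviate significantly from it. The visualization uses PCA to project the high-dimensional data so that anomalous patterns can be seen directly. One caveat: with contamination set to 0.1, the model's threshold flags roughly 10% of the training traffic as anomalous even though that traffic is all normal, so this parameter should be tuned to the false-positive rate the team can tolerate.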

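In production, an anomaly detection system needs its baseline updated continuously. Network environments change over time: new applications, business processes, or infrastructure changes all alter what counts as normal. Regularly retraining the model on recent normal traffic keeps detection accurate. A feedback mechanism is also needed so that analysts can mark false positives and missed detections; that information can be used to tune model parameters or improve feature engineering.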
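The baseline-update and analyst-feedback ideas can be sketched with a sliding window. To keep the example short and dependency-free it swaps Isolation Forest for a plain z-score check on a single metric; the class name, window size, threshold, and the `confirmed_benign` flag are all hypothetical. The same pattern applies to periodically refitting a real model on the most recent confirmed-normal data.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Z-score detector whose baseline is the most recent accepted observations."""

    def __init__(self, window=50, z_threshold=3.0):
        self.window = deque(maxlen=window)  # oldest values fall out automatically
        self.z_threshold = z_threshold

    def is_anomalous(self, value):
        """Check a value against the current baseline without updating it."""
        if len(self.window) < 2:
            return False  # not enough history to judge
        mu, sigma = mean(self.window), pstdev(self.window)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.z_threshold

    def observe(self, value, confirmed_benign=None):
        """Score a value, then update the baseline using analyst feedback if given.

        confirmed_benign=True  -> analyst marked it a false positive; learn from it
        confirmed_benign=False -> analyst confirmed a threat; never learn from it
        confirmed_benign=None  -> no feedback; learn only if it was not flagged
        """
        flagged = self.is_anomalous(value)
        if confirmed_benign is True or (not flagged and confirmed_benign is None):
            self.window.append(value)
        return flagged


baseline = RollingBaseline(window=50)

# Phase 1: steady traffic around 100 requests per minute
for i in range(50):
    baseline.observe(100 + i % 5)
flag_before = baseline.is_anomalous(200)

# Phase 2: a new application roughly doubles normal traffic; analysts confirm it
for i in range(50):
    baseline.observe(200 + i % 5, confirmed_benign=True)
flag_after = baseline.is_anomalous(202)
old_level_flag = baseline.is_anomalous(100)

print(f"200 flagged before the shift was confirmed: {flag_before}")
print(f"202 flagged after the baseline adapted:     {flag_after}")
print(f"100 flagged against the new baseline:       {old_level_flag}")
```

Requiring confirmation before flagged values enter the baseline is a deliberate choice here: it stops an attacker from slowly shifting the baseline, at the cost of needing analyst time when the environment legitimately changes.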

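Python and AI open up new possibilities for automating security threat hunting. From malware detection to anomalous behavior analysis, from phishing defense to threat intelligence analysis, machine learning algorithms are changing how security teams work. Technology, however, is only part of the solution: a successful AI security system combines domain knowledge, continuous optimization, and human oversight. As the threat landscape evolves, security teams need to keep learning and innovating, drawing on the Python ecosystem to build smarter and more effective defenses.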

玄貓 BlackCat

A technology enthusiast focused on sharing insights on software development, cloud technology, and AI applications.