A Machine Learning-Driven Credit Scoring System: A Complete Walkthrough from Data Analysis to Intelligent Loan Decisions

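Modern financial institutions face increasingly complex credit risk management challenges. Traditional credit assessment often relies on manual judgment and simple rule engines, which struggle to process large application volumes and to estimate a borrower's default risk accurately. Machine learning has substantially changed credit scoring and loan decisioning: by analyzing multidimensional features such as credit history, income level, debt ratios, and employment status, it becomes possible to build predictive models that assess credit risk automatically. These models can capture complex nonlinear relationships and learn default patterns from large volumes of historical loan data, surfacing risk signals that traditional methods tend to miss. Machine learning also brings new challenges, however, including model interpretability, algorithmic bias, fairness, and regulatory compliance. While pursuing predictive accuracy, institutions must keep the decision process transparent and fair and avoid systematic discrimination against particular groups.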

Structure of Credit Scoring Datasets and Their Business Meaning

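A credit scoring dataset typically contains the borrower's personal information, financial condition, credit history, and the final loan decision. The dataset needs enough samples for model estimates to be stable; as a rough rule of thumb, at least several thousand historical loan records are needed to train a robust predictive model. Each record represents one loan application and includes the applicant's unique identifier, credit score, annual income, loan amount, loan term, debt-to-income ratio, years of employment, home ownership status, and the final approval decision. The credit score is usually computed by a credit bureau from the borrower's credit history and generally ranges from 300 to 850, with higher scores indicating lower credit risk. Income reflects repayment capacity, although the stability and source of that income matter just as much. The debt-to-income ratio measures the share of income consumed by existing debt payments and is a key indicator of repayment capacity.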
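The record layout described above can be sketched as a small typed structure with basic range checks. This is only an illustration: the `LoanApplication` class and the `validate` helper are hypothetical names, and the specific checks are assumptions based on the ranges mentioned in the text, not rules taken from any real lender.

```python
from dataclasses import dataclass


@dataclass
class LoanApplication:
    """One loan application; field names mirror the columns used in this article."""
    applicant_id: int
    credit_score: int
    income: float
    loan_amount: float
    debt_to_income: float
    employment_years: int
    home_ownership: str
    loan_approved: int  # 1 = approved, 0 = rejected


def validate(app):
    """Return a list of problems found; an empty list means the record passed."""
    problems = []
    if not 300 <= app.credit_score <= 850:
        problems.append("credit_score outside 300-850")
    if app.income <= 0:
        problems.append("income must be positive")
    if app.loan_amount <= 0:
        problems.append("loan_amount must be positive")
    if app.debt_to_income < 0:
        problems.append("debt_to_income must be non-negative")
    if app.employment_years < 0:
        problems.append("employment_years must be non-negative")
    if app.loan_approved not in (0, 1):
        problems.append("loan_approved must be 0 or 1")
    return problems


# Illustrative records (made-up values)
good = LoanApplication(1, 720, 65000, 150000, 0.28, 6, "own", 1)
bad = LoanApplication(2, 910, 0, 150000, 0.28, 6, "rent", 1)

print(validate(good))  # []
print(validate(bad))   # ['credit_score outside 300-850', 'income must be positive']
```

Running checks like these before any modeling step makes it easier to tell data entry errors apart from genuine but unusual applicants.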

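Data quality has a decisive effect on model performance. Common problems in practice include missing values, outliers, data entry errors, and inconsistencies. Missing values may arise because applicants did not supply complete information or because of defects in the data collection system; depending on the missingness mechanism, the appropriate treatment may be deletion, imputation, or an algorithm that handles missing values natively. Outlier detection and handling matter just as much: records with unusually high income or a credit score outside the valid range may be data errors, or they may be genuine but rare cases, so they need careful verification. Class imbalance is especially pronounced in credit scoring. Because defaults are usually far rarer than normal repayments, a model trained on the raw data tends to predict the majority class, so techniques such as oversampling, undersampling, or class weighting are needed to rebalance training.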
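Of the rebalancing techniques just mentioned, class weighting is the simplest to show by hand. The sketch below uses the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for `class_weight='balanced'`. The function name `balanced_class_weights` and the 90/10 label split are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * n_c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up example: 90 repaid loans (0) and 10 defaults (1)
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

print(weights[0])  # 100 / (2 * 90), about 0.556
print(weights[1])  # 100 / (2 * 10) = 5.0
```

With these weights each class contributes the same total weight to the loss (90 x 0.556 and 10 x 5.0 both come to about 50), so the minority class is no longer drowned out.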

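# Credit scoring data analysis and preprocessing system
# Complete workflow for data loading, cleaning, exploratory analysis, and feature engineering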

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# 設定繁體中文字型以正確顯示圖表
plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei']
plt.rcParams['axes.unicode_minus'] = False

class CreditDataAnalyzer:
    """
    信用資料分析器
    負責資料載入、清洗、探索性分析與視覺化
    """
    
    def __init__(self):
        """初始化資料分析器"""
        self.data = None  # 原始資料
        self.processed_data = None  # 處理後的資料
        self.numeric_features = []  # 數值型特徵清單
        self.categorical_features = []  # 類別型特徵清單
        self.target_column = 'loan_approved'  # 目標變數欄位名稱
        
    def load_data(self, filepath):
        """
        載入信用評分資料集
        
        參數:
            filepath: 資料檔案路徑（支援CSV格式）
            
        回傳:
            data: 載入的DataFrame
        """
        try:
            # 讀取CSV檔案
            self.data = pd.read_csv(filepath, encoding='utf-8')
            
            print(f"成功載入資料集")
            print(f"資料維度: {self.data.shape[0]} 筆記錄, {self.data.shape[1]} 個欄位")
            print(f"\n欄位清單:")
            print(self.data.columns.tolist())
            
            return self.data
            
        except FileNotFoundError:
            print(f"錯誤: 找不到檔案 {filepath}")
            return None
        except Exception as e:
            print(f"載入資料時發生錯誤: {str(e)}")
            return None
    
    def generate_sample_data(self, n_samples=145):
        """
        生成範例信用評分資料集
        用於示範與測試
        
        參數:
            n_samples: 要生成的樣本數量
            
        回傳:
            data: 生成的DataFrame
        """
        np.random.seed(42)
        
        # 生成申請人ID
        applicant_ids = np.arange(1, n_samples + 1)
        
        # 生成信用評分（300-850之間，符合常態分佈）
        credit_scores = np.random.normal(loc=650, scale=100, size=n_samples)
        credit_scores = np.clip(credit_scores, 300, 850).astype(int)
        
        # 生成年收入（30000-120000之間，略呈右偏分佈）
        incomes = np.random.lognormal(mean=11, sigma=0.5, size=n_samples)
        incomes = np.clip(incomes, 30000, 120000).astype(int)
        
        # 生成貸款金額（與收入相關）
        loan_amounts = (incomes * np.random.uniform(1.5, 4.0, n_samples)).astype(int)
        
        # 生成債務收入比
        debt_to_income = np.random.uniform(0.1, 0.5, n_samples)
        
        # 生成就業年資（0-40年）
        employment_years = np.random.exponential(scale=8, size=n_samples)
        employment_years = np.clip(employment_years, 0, 40).astype(int)
        
        # 生成房屋所有權狀態
        home_ownership = np.random.choice(
            ['自有', '租賃', '抵押'],
            size=n_samples,
            p=[0.3, 0.4, 0.3]
        )
        
        # 生成貸款目的
        loan_purpose = np.random.choice(
            ['債務整合', '購屋', '創業', '教育', '其他'],
            size=n_samples,
            p=[0.35, 0.25, 0.15, 0.15, 0.10]
        )
        
        # 基於多個因素決定貸款批准結果
        # 計算批准機率（信用評分和收入是主要因素）
        approval_prob = (
            (credit_scores - 300) / 550 * 0.4 +  # 信用評分影響40%
            (incomes - 30000) / 90000 * 0.3 +  # 收入影響30%
            (1 - debt_to_income) * 0.2 +  # 債務收入比影響20%
            (employment_years / 40) * 0.1  # 就業年資影響10%
        )
        
        # 加入隨機性
        approval_prob = np.clip(approval_prob + np.random.normal(0, 0.1, n_samples), 0, 1)
        
        # 根據機率決定批准結果
        loan_approved = (approval_prob > 0.5).astype(int)
        
        # 生成貸款狀態描述
        loan_status = ['已批准' if approved else '已拒絕' 
                      for approved in loan_approved]
        
        # 建立DataFrame
        self.data = pd.DataFrame({
            'applicant_id': applicant_ids,
            'credit_score': credit_scores,
            'income': incomes,
            'loan_amount': loan_amounts,
            'debt_to_income': debt_to_income,
            'employment_years': employment_years,
            'home_ownership': home_ownership,
            'loan_purpose': loan_purpose,
            'loan_status': loan_status,
            'loan_approved': loan_approved
        })
        
        print(f"成功生成 {n_samples} 筆範例資料")
        print(f"批准率: {loan_approved.mean():.2%}")
        
        return self.data
    
    def explore_data(self):
        """
        執行探索性資料分析
        輸出資料統計摘要與基本特徵
        """
        if self.data is None:
            print("錯誤: 尚未載入資料")
            return
        
        print("\n" + "="*60)
        print("資料探索性分析")
        print("="*60)
        
        # 基本資訊
        print("\n【資料基本資訊】")
        print(f"總記錄數: {len(self.data)}")
        print(f"總欄位數: {len(self.data.columns)}")
        print(f"\n資料型態:")
        print(self.data.dtypes)
        
        # 遺失值統計
        print("\n【遺失值統計】")
        missing_stats = pd.DataFrame({
            '遺失數量': self.data.isnull().sum(),
            '遺失比例': self.data.isnull().sum() / len(self.data) * 100
        })
        print(missing_stats[missing_stats['遺失數量'] > 0])
        
        if missing_stats['遺失數量'].sum() == 0:
            print("沒有遺失值")
        
        # Identify numeric and categorical features
        # ('number' also catches int32, which astype(int) produces on some platforms)
        self.numeric_features = self.data.select_dtypes(
            include='number'
        ).columns.tolist()
        
        self.categorical_features = self.data.select_dtypes(
            include=['object']
        ).columns.tolist()
        
        # 移除ID欄位和目標變數
        if 'applicant_id' in self.numeric_features:
            self.numeric_features.remove('applicant_id')
        if self.target_column in self.numeric_features:
            self.numeric_features.remove(self.target_column)
        if 'loan_status' in self.categorical_features:
            self.categorical_features.remove('loan_status')
        
        print(f"\n數值型特徵: {self.numeric_features}")
        print(f"類別型特徵: {self.categorical_features}")
        
        # 數值型特徵統計
        print("\n【數值型特徵統計摘要】")
        print(self.data[self.numeric_features].describe())
        
        # 目標變數分佈
        print("\n【目標變數分佈】")
        target_dist = self.data[self.target_column].value_counts()
        target_pct = self.data[self.target_column].value_counts(normalize=True) * 100
        
        print(f"貸款批准: {target_dist.get(1, 0)} 筆 ({target_pct.get(1, 0):.2f}%)")
        print(f"貸款拒絕: {target_dist.get(0, 0)} 筆 ({target_pct.get(0, 0):.2f}%)")
        
        # 類別型特徵分佈
        print("\n【類別型特徵分佈】")
        for feature in self.categorical_features:
            print(f"\n{feature}:")
            print(self.data[feature].value_counts())
    
    def detect_outliers(self, method='iqr', threshold=3):
        """
        偵測數值型特徵中的異常值
        
        參數:
            method: 偵測方法 ('iqr' 或 'zscore')
            threshold: Z-score方法的閾值
            
        回傳:
            outliers: 包含異常值資訊的字典
        """
        if self.data is None:
            print("錯誤: 尚未載入資料")
            return None
        
        outliers = {}
        
        print(f"\n使用 {method.upper()} 方法偵測異常值")
        print("="*60)
        
        for feature in self.numeric_features:
            if method == 'iqr':
                # 使用四分位距方法
                Q1 = self.data[feature].quantile(0.25)
                Q3 = self.data[feature].quantile(0.75)
                IQR = Q3 - Q1
                
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                
                outlier_mask = (
                    (self.data[feature] < lower_bound) | 
                    (self.data[feature] > upper_bound)
                )
                
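            elif method == 'zscore':
                # Z-score method: flag values more than `threshold` standard deviations from the mean
                z_scores = np.abs(stats.zscore(self.data[feature]))
                outlier_mask = z_scores > threshold

            else:
                # Fail fast; otherwise outlier_mask would be undefined below
                raise ValueError(f"Unsupported method: {method!r} (expected 'iqr' or 'zscore')")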
            
            outlier_count = outlier_mask.sum()
            outlier_pct = outlier_count / len(self.data) * 100
            
            if outlier_count > 0:
                outliers[feature] = {
                    'count': outlier_count,
                    'percentage': outlier_pct,
                    'indices': self.data[outlier_mask].index.tolist()
                }
                
                print(f"\n{feature}:")
                print(f"  異常值數量: {outlier_count} ({outlier_pct:.2f}%)")
                print(f"  正常範圍: {self.data[~outlier_mask][feature].min():.2f} "
                      f"- {self.data[~outlier_mask][feature].max():.2f}")
        
        if not outliers:
            print("\n未偵測到異常值")
        
        return outliers
    
    def visualize_distributions(self, save_path=None):
        """
        視覺化特徵分佈
        
        參數:
            save_path: 圖表儲存路徑（選填）
        """
        if self.data is None:
            print("錯誤: 尚未載入資料")
            return
        
        # 設定圖表樣式
        sns.set_style("whitegrid")
        
        # 數值型特徵分佈
        n_numeric = len(self.numeric_features)
        fig, axes = plt.subplots(
            nrows=(n_numeric + 1) // 2, 
            ncols=2,
            figsize=(15, 5 * ((n_numeric + 1) // 2))
        )
        axes = axes.flatten()
        
        for idx, feature in enumerate(self.numeric_features):
            ax = axes[idx]
            
            # 繪製直方圖
            self.data[feature].hist(bins=30, ax=ax, edgecolor='black', alpha=0.7)
            ax.set_title(f'{feature} 分佈', fontsize=12, fontweight='bold')
            ax.set_xlabel(feature)
            ax.set_ylabel('頻率')
            
            # 添加統計資訊
            mean_val = self.data[feature].mean()
            median_val = self.data[feature].median()
            ax.axvline(mean_val, color='red', linestyle='--', 
                      label=f'平均值: {mean_val:.2f}')
            ax.axvline(median_val, color='green', linestyle='--', 
                      label=f'中位數: {median_val:.2f}')
            ax.legend()
        
        # 隱藏多餘的子圖
        for idx in range(n_numeric, len(axes)):
            axes[idx].set_visible(False)
        
        plt.tight_layout()
        
        if save_path:
            plt.savefig(f'{save_path}_distributions.png', dpi=300, bbox_inches='tight')
            print(f"圖表已儲存至: {save_path}_distributions.png")
        
        plt.show()
        
        # 類別型特徵分佈
        if self.categorical_features:
            n_categorical = len(self.categorical_features)
            fig, axes = plt.subplots(
                nrows=(n_categorical + 1) // 2,
                ncols=2,
                figsize=(15, 5 * ((n_categorical + 1) // 2))
            )
            
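            # plt.subplots returns an array of Axes here (ncols=2), so always
            # flatten it; wrapping it in a list would break axes[idx] below
            axes = np.atleast_1d(axes).flatten()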
            
            for idx, feature in enumerate(self.categorical_features):
                ax = axes[idx]
                
                # 繪製長條圖
                value_counts = self.data[feature].value_counts()
                value_counts.plot(kind='bar', ax=ax, edgecolor='black', alpha=0.7)
                ax.set_title(f'{feature} 分佈', fontsize=12, fontweight='bold')
                ax.set_xlabel(feature)
                ax.set_ylabel('數量')
                ax.tick_params(axis='x', rotation=45)
                
                # 添加數值標籤
                for container in ax.containers:
                    ax.bar_label(container, fmt='%d')
            
            # 隱藏多餘的子圖
            for idx in range(n_categorical, len(axes)):
                axes[idx].set_visible(False)
            
            plt.tight_layout()
            
            if save_path:
                plt.savefig(f'{save_path}_categorical.png', 
                           dpi=300, bbox_inches='tight')
                print(f"圖表已儲存至: {save_path}_categorical.png")
            
            plt.show()
    
    def analyze_feature_correlations(self):
        """
        分析特徵之間的相關性
        特別關注特徵與目標變數的關係
        """
        if self.data is None:
            print("錯誤: 尚未載入資料")
            return
        
        print("\n特徵相關性分析")
        print("="*60)
        
        # 計算數值型特徵的相關係數矩陣
        numeric_data = self.data[self.numeric_features + [self.target_column]]
        correlation_matrix = numeric_data.corr()
        
        # 輸出與目標變數的相關性
        target_corr = correlation_matrix[self.target_column].sort_values(
            ascending=False
        )
        
        print("\n與貸款批准結果的相關性:")
        for feature, corr in target_corr.items():
            if feature != self.target_column:
                print(f"{feature:20s}: {corr:>7.4f}")
        
        # 視覺化相關性矩陣
        plt.figure(figsize=(12, 10))
        sns.heatmap(
            correlation_matrix,
            annot=True,
            fmt='.3f',
            cmap='coolwarm',
            center=0,
            square=True,
            linewidths=1
        )
        plt.title('特徵相關性熱圖', fontsize=14, fontweight='bold', pad=20)
        plt.tight_layout()
        plt.show()
        
        return correlation_matrix

# 主程式執行範例
if __name__ == "__main__":
    # 初始化資料分析器
    analyzer = CreditDataAnalyzer()
    
    # 生成範例資料
    print("生成信用評分範例資料集...")
    data = analyzer.generate_sample_data(n_samples=500)
    
    # 顯示前幾筆資料
    print("\n前5筆資料範例:")
    print(data.head())
    
    # 執行探索性資料分析
    analyzer.explore_data()
    
    # 偵測異常值
    outliers = analyzer.detect_outliers(method='iqr')
    
    # 視覺化資料分佈
    print("\n生成資料分佈圖表...")
    analyzer.visualize_distributions()
    
    # 分析特徵相關性
    correlation_matrix = analyzer.analyze_feature_correlations()
    
    print("\n資料分析完成!")

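This credit data analysis system shows how to process and analyze a credit scoring dataset systematically. The CreditDataAnalyzer class implements the workflow from data loading through exploratory analysis, including data quality checks, summary statistics, outlier detection, and visualization. The data generator imitates realistic distributions: credit scores follow a normal distribution centered near 650, income is right-skewed to reflect real-world income inequality, and the approval decision combines credit score, income, debt-to-income ratio, and years of employment. The exploratory step gives an overall view of the data, covering its dimensions, missing value statistics, feature types, and the class distribution of the target variable, all of which inform later feature engineering and model selection. It also populates the numeric and categorical feature lists, so explore_data must run before the outlier, visualization, and correlation methods. Outlier detection implements the two common IQR and Z-score methods to help identify extreme values and possible erroneous records. The visualizations use histograms for the distributions of numeric features, bar charts for the category counts of categorical features, and a correlation heatmap for the strength of linear relationships among features and between features and the target.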
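The two outlier rules do not always agree, and a small hand-computed example makes the difference visible. The sketch below reimplements both rules with the standard library only; the function names and the seven income values (in thousands) are made up for illustration, and the quartiles use linear interpolation, which is what the pandas `quantile` default does.

```python
import statistics


def iqr_outliers(values):
    """Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` population standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]


incomes_k = [10, 12, 13, 15, 16, 18, 95]

print(iqr_outliers(incomes_k))     # [95]
print(zscore_outliers(incomes_k))  # []
```

The extreme value inflates the mean and standard deviation so much that its own Z-score stays below 3, while the quartile-based rule still flags it. On small or heavily skewed samples the IQR rule is therefore often the safer first check.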

Feature Engineering Strategy and Construction of Credit Risk Indicators

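Feature engineering plays a key role in credit scoring models, and high-quality features can noticeably improve predictive power. Beyond raw features such as credit score and income, derived features can capture more complex risk patterns. The debt-to-income ratio is one of the most important derived indicators. It is computed as total monthly debt payments divided by total monthly income and directly reflects the borrower's repayment pressure; a ratio above roughly 43% is often treated as high risk, though the exact cutoff varies by lender and product. Credit utilization measures the share of available credit that the borrower has already used, and high utilization suggests financial strain that may raise default risk. The loan-to-income ratio divides the requested loan amount by annual income and gauges how large the loan is relative to the borrower's finances.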
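The three ratios defined above reduce to simple divisions, sketched here with a worked example. The function names and the sample figures are illustrative assumptions, and the 0.43 cutoff is simply the threshold quoted in the text rather than a universal rule.

```python
def _ratio(numerator, denominator, name):
    """Divide with a guard so a zero or negative base fails loudly."""
    if denominator <= 0:
        raise ValueError(f"{name} must be positive")
    return numerator / denominator


def debt_to_income(monthly_debt_payments, monthly_gross_income):
    """Share of monthly income consumed by debt payments."""
    return _ratio(monthly_debt_payments, monthly_gross_income, "monthly_gross_income")


def credit_utilization(balance_used, total_credit_limit):
    """Share of available credit already in use."""
    return _ratio(balance_used, total_credit_limit, "total_credit_limit")


def loan_to_income(loan_amount, annual_income):
    """Requested loan relative to annual income."""
    return _ratio(loan_amount, annual_income, "annual_income")


# Made-up applicant
dti = debt_to_income(1800, 5000)         # 0.36
utilization = credit_utilization(4500, 10000)  # 0.45
lti = loan_to_income(150000, 60000)      # 2.5

high_dti = dti > 0.43                    # False: below the quoted cutoff
print(dti, utilization, lti, high_dti)
```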

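Time-related features also carry predictive value. Length of credit history reflects how long the borrower has been building a credit record, and a longer history usually goes with more stable credit behavior. The number of recent credit inquiries can reveal whether the borrower has applied for a lot of credit in a short period, which may be a warning sign of financial difficulty. Employment stability is measured by tenure in the current job, since stable employment usually means a stable income source. Encoding strategies for categorical features need careful selection: ordinal encoding suits categories with a natural order such as education level, one-hot encoding suits nominal categories such as occupation type, and target encoding maps each category to the mean of the target variable within that category, which works well for high-cardinality features but carries a risk of overfitting. Feature interactions can capture synergies between features; for example, the combination of high income and a high credit score may have a more-than-additive positive effect.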
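Target encoding and its overfitting risk are easier to see in a few lines of code. The sketch below uses additive smoothing toward the global mean, which is one common mitigation and not the only one. The function names, the `smoothing` value of 2.0, and the tiny dataset are all illustrative assumptions.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=2.0):
    """Learn smoothed per-category target means from training data only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Rare categories are pulled toward the global mean; frequent ones keep their own mean
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


def apply_target_encoding(categories, encoding, global_mean):
    """Encode categories; unseen ones fall back to the global mean."""
    return [encoding.get(cat, global_mean) for cat in categories]


# Made-up training data: loan purpose and approval outcome
train_purpose = ["education", "education", "education", "home_purchase"]
train_approved = [1, 1, 0, 1]

encoding, prior = fit_target_encoding(train_purpose, train_approved)
encoded = apply_target_encoding(["education", "startup"], encoding, prior)

print(encoding["education"])      # (2 + 2*0.75) / (3 + 2) = 0.7
print(encoding["home_purchase"])  # (1 + 2*0.75) / (1 + 2), about 0.833
print(encoded)                    # [0.7, 0.75]
```

Without smoothing, the single "home_purchase" record would be encoded as exactly 1.0, which leaks its own label into the feature. Fitting the encoding on training folds only and applying it to held-out data limits the same problem further.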

# Credit scoring feature engineering system
# Implements feature transformation, derived-feature construction, and feature selection

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer  # required by handle_missing_values
import pandas as pd
import numpy as np

class CreditFeatureEngineering:
    """
    信用特徵工程類別
    負責特徵轉換、衍生特徵建構與特徵選擇
    """
    
    def __init__(self):
        """初始化特徵工程器"""
        self.scalers = {}  # 儲存各種標準化器
        self.encoders = {}  # 儲存編碼器
        self.selected_features = []  # 選擇的特徵清單
        
    def create_derived_features(self, data):
        """
        建構衍生特徵
        基於原始特徵計算新的風險指標
        
        參數:
            data: 原始資料DataFrame
            
        回傳:
            data: 添加衍生特徵後的DataFrame
        """
        # 複製資料避免修改原始資料
        data = data.copy()
        
        print("建構衍生特徵...")
        
        # 1. 貸款收入比 (Loan to Income Ratio)
        # 衡量貸款金額相對收入的比例
        data['loan_to_income_ratio'] = data['loan_amount'] / data['income']
        print("✓ 建立貸款收入比")
        
        # 2. 信用評分等級分類
        # 將連續的信用評分轉換為類別
        data['credit_score_category'] = pd.cut(
            data['credit_score'],
            bins=[0, 580, 670, 740, 800, 850],
            labels=['差', '普通', '良好', '優良', '卓越']
        )
        print("✓ 建立信用評分等級")
        
        # 3. 收入等級分類
        # 根據收入百分位數分類
        data['income_category'] = pd.qcut(
            data['income'],
            q=4,
            labels=['低收入', '中低收入', '中高收入', '高收入']
        )
        print("✓ 建立收入等級分類")
        
        # 4. 債務負擔指標
        # 結合債務收入比與貸款金額的綜合指標
        data['debt_burden_score'] = (
            data['debt_to_income'] * 0.6 + 
            data['loan_to_income_ratio'] * 0.4
        )
        print("✓ 建立債務負擔指標")
        
        # 5. 就業穩定性評分
        # 將就業年資轉換為穩定性評分
        data['employment_stability'] = np.where(
            data['employment_years'] >= 5, '穩定',
            np.where(data['employment_years'] >= 2, '一般', '不穩定')
        )
        print("✓ 建立就業穩定性評分")
        
        # 6. 綜合風險評分
        # 結合多個因素的標準化風險分數
        # 標準化各項指標到0-1範圍
        credit_normalized = (data['credit_score'] - 300) / 550
        income_normalized = (data['income'] - data['income'].min()) / (
            data['income'].max() - data['income'].min()
        )
        debt_normalized = 1 - data['debt_to_income']  # 反轉因為低債務是好的
        employment_normalized = data['employment_years'] / data['employment_years'].max()
        
        # 計算加權綜合分數
        data['composite_risk_score'] = (
            credit_normalized * 0.35 +
            income_normalized * 0.25 +
            debt_normalized * 0.25 +
            employment_normalized * 0.15
        )
        print("✓ 建立綜合風險評分")
        
        # 7. 高風險標誌
        # 識別高風險申請人的二元指標
        data['high_risk_flag'] = (
            (data['credit_score'] < 600) |
            (data['debt_to_income'] > 0.4) |
            (data['loan_to_income_ratio'] > 3)
        ).astype(int)
        print("✓ 建立高風險標誌")
        
        print(f"\n總共建立 7 個衍生特徵")
        print(f"資料維度: {data.shape}")
        
        return data
    
    def encode_categorical_features(self, data, categorical_features, method='onehot'):
        """
        編碼類別型特徵
        
        參數:
            data: 資料DataFrame
            categorical_features: 類別型特徵清單
            method: encoding method ('onehot', 'label', or 'ordinal')
            
        回傳:
            data: 編碼後的DataFrame
        """
        data = data.copy()
        
        print(f"\n使用 {method} 方法編碼類別型特徵...")
        
        if method == 'onehot':
            # 獨熱編碼：適用於無序類別特徵
            for feature in categorical_features:
                # 使用get_dummies進行獨熱編碼
                dummies = pd.get_dummies(
                    data[feature],
                    prefix=feature,
                    drop_first=True  # 避免共線性
                )
                data = pd.concat([data, dummies], axis=1)
                data = data.drop(feature, axis=1)
                
                print(f"✓ {feature} 編碼為 {len(dummies.columns)} 個二元特徵")
        
        elif method == 'label':
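            # Label encoding: maps each category to an integer in sorted-label order,
            # so the numbers imply no meaningful ranking; best suited to tree-based models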
            for feature in categorical_features:
                if feature not in self.encoders:
                    self.encoders[feature] = LabelEncoder()
                    data[feature] = self.encoders[feature].fit_transform(
                        data[feature].astype(str)
                    )
                else:
                    data[feature] = self.encoders[feature].transform(
                        data[feature].astype(str)
                    )
                
                print(f"✓ {feature} 編碼為數值標籤")
        
        elif method == 'ordinal':
            # 順序編碼：適用於具有明確順序的類別
            # 需要預先定義類別順序
            ordinal_mappings = {
                'credit_score_category': ['差', '普通', '良好', '優良', '卓越'],
                'income_category': ['低收入', '中低收入', '中高收入', '高收入'],
                'employment_stability': ['不穩定', '一般', '穩定']
            }
            
            for feature in categorical_features:
                if feature in ordinal_mappings:
                    data[feature] = data[feature].map(
                        {cat: idx for idx, cat in enumerate(ordinal_mappings[feature])}
                    )
                    print(f"✓ {feature} 編碼為順序數值")
        
        else:
            # Fail loudly instead of silently returning unencoded data
            raise ValueError(f"Unsupported encoding method: {method!r}")
        
        return data
    
    def scale_features(self, data, numeric_features, method='standard'):
        """
        標準化數值型特徵
        
        參數:
            data: 資料DataFrame
            numeric_features: 數值型特徵清單
            method: 標準化方法 ('standard', 'minmax', 'robust')
            
        回傳:
            data: 標準化後的DataFrame
        """
        data = data.copy()
        
        print(f"\n使用 {method} 方法標準化數值型特徵...")
        
        # 選擇標準化器
        if method == 'standard':
            # Z-score標準化：適用於常態分佈資料
            scaler = StandardScaler()
            print("使用Z-score標準化（均值0，標準差1）")
            
        elif method == 'minmax':
            # Min-Max標準化：將資料縮放到[0,1]範圍
            scaler = MinMaxScaler()
            print("使用Min-Max標準化（範圍[0,1]）")
            
        elif method == 'robust':
            # 穩健標準化：對異常值不敏感
            scaler = RobustScaler()
            print("使用穩健標準化（使用中位數和四分位距）")
        
        # 執行標準化
        data[numeric_features] = scaler.fit_transform(data[numeric_features])
        self.scalers[method] = scaler
        
        print(f"✓ 已標準化 {len(numeric_features)} 個數值型特徵")
        
        return data
    
    def select_features(self, X, y, method='mutual_info', k=10):
        """
        特徵選擇
        選擇最具預測力的特徵
        
        參數:
            X: 特徵矩陣
            y: 目標變數
            method: 選擇方法 ('mutual_info', 'f_classif', 'rfe')
            k: 要選擇的特徵數量
            
        回傳:
            selected_features: 選擇的特徵名稱清單
            scores: 特徵重要性分數
        """
        print(f"\n使用 {method} 方法選擇前 {k} 個特徵...")
        
        if method == 'mutual_info':
            # 互資訊法：衡量特徵與目標變數的相互依賴性
            selector = SelectKBest(mutual_info_classif, k=k)
            selector.fit(X, y)
            scores = selector.scores_
            selected_mask = selector.get_support()
            
        elif method == 'f_classif':
            # F統計量法：基於ANOVA F值選擇特徵
            selector = SelectKBest(f_classif, k=k)
            selector.fit(X, y)
            scores = selector.scores_
            selected_mask = selector.get_support()
            
        elif method == 'rfe':
            # 遞迴特徵消除：使用模型遞迴地移除特徵
            estimator = RandomForestClassifier(n_estimators=100, random_state=42)
            selector = RFE(estimator, n_features_to_select=k)
            selector.fit(X, y)
            scores = selector.ranking_  # RFE返回的是排名
            selected_mask = selector.get_support()
        
        # 取得選擇的特徵名稱
        self.selected_features = X.columns[selected_mask].tolist()
        
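        # Build the feature-importance DataFrame.
        # For RFE, `scores` holds rankings (1 = best), so sort ascending;
        # for the other methods a higher score is better, so sort descending.
        feature_importance = pd.DataFrame({
            'feature': X.columns,
            'score': scores,
            'selected': selected_mask
        }).sort_values('score', ascending=(method == 'rfe'))
        
        print("\nTop 10 features:")
        print(feature_importance.head(10))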
        
        print(f"\n已選擇的特徵: {self.selected_features}")
        
        return self.selected_features, feature_importance
    
    def handle_missing_values(self, data, numeric_strategy='median', 
                            categorical_strategy='most_frequent'):
        """
        處理遺失值
        
        參數:
            data: 資料DataFrame
            numeric_strategy: 數值型特徵的填補策略
            categorical_strategy: 類別型特徵的填補策略
            
        回傳:
            data: 處理後的DataFrame
        """
        data = data.copy()
        
        # 檢查遺失值
        missing_counts = data.isnull().sum()
        features_with_missing = missing_counts[missing_counts > 0]
        
        if len(features_with_missing) == 0:
            print("沒有遺失值需要處理")
            return data
        
        print(f"\n處理 {len(features_with_missing)} 個特徵的遺失值...")
        
        # Separate numeric and categorical features
        # ('number' covers every numeric dtype, not only int64/float64)
        numeric_cols = data.select_dtypes(include='number').columns
        categorical_cols = data.select_dtypes(include=['object', 'category']).columns
        
        # 處理數值型特徵的遺失值
        if any(col in features_with_missing.index for col in numeric_cols):
            numeric_imputer = SimpleImputer(strategy=numeric_strategy)
            numeric_missing = [col for col in numeric_cols 
                             if col in features_with_missing.index]
            data[numeric_missing] = numeric_imputer.fit_transform(
                data[numeric_missing]
            )
            print(f"✓ 使用 {numeric_strategy} 策略填補數值型遺失值")
        
        # 處理類別型特徵的遺失值
        if any(col in features_with_missing.index for col in categorical_cols):
            categorical_imputer = SimpleImputer(strategy=categorical_strategy)
            categorical_missing = [col for col in categorical_cols 
                                 if col in features_with_missing.index]
            data[categorical_missing] = categorical_imputer.fit_transform(
                data[categorical_missing]
            )
            print(f"✓ 使用 {categorical_strategy} 策略填補類別型遺失值")
        
        return data

# 完整的特徵工程流程範例
if __name__ == "__main__":
    # Import the analyzer defined earlier (assumes that listing is saved as CreditDataAnalyzer.py)
    from CreditDataAnalyzer import CreditDataAnalyzer
    
    # 生成資料
    analyzer = CreditDataAnalyzer()
    data = analyzer.generate_sample_data(n_samples=1000)
    
    # 初始化特徵工程器
    fe = CreditFeatureEngineering()
    
    # 步驟1: 建構衍生特徵
    data_with_features = fe.create_derived_features(data)
    
    # 步驟2: 處理遺失值（如果有）
    data_clean = fe.handle_missing_values(data_with_features)
    
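    # Step 3: encode categorical features
    # Include the derived category columns too; otherwise their string labels
    # stay in X and feature selection fails when converting them to floats
    categorical_features = [
        'home_ownership', 'loan_purpose', 'employment_stability',
        'credit_score_category', 'income_category'
    ]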
    data_encoded = fe.encode_categorical_features(
        data_clean,
        categorical_features,
        method='onehot'
    )
    
    # 步驟4: 標準化數值型特徵
    numeric_features = [
        'credit_score', 'income', 'loan_amount', 
        'debt_to_income', 'employment_years',
        'loan_to_income_ratio', 'debt_burden_score', 'composite_risk_score'
    ]
    
    # 分離特徵和目標變數
    X = data_encoded.drop(['applicant_id', 'loan_status', 'loan_approved'], axis=1)
    y = data_encoded['loan_approved']
    
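    # Scale the numeric features. Note: for brevity this fits the scaler on the full
    # dataset; in practice, split first and fit on the training set only to avoid leakage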
    numeric_features_in_X = [f for f in numeric_features if f in X.columns]
    X_scaled = fe.scale_features(X, numeric_features_in_X, method='standard')
    
    # 步驟5: 特徵選擇
    selected_features, importance_df = fe.select_features(
        X_scaled, y, 
        method='mutual_info', 
        k=15
    )
    
    # 最終資料集
    X_final = X_scaled[selected_features]
    
    print("\n" + "="*60)
    print("特徵工程完成!")
    print(f"最終特徵數量: {X_final.shape[1]}")
    print(f"訓練樣本數量: {X_final.shape[0]}")
    print("="*60)

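This feature engineering system illustrates common practice for feature processing in credit scoring model development. The CreditFeatureEngineering class implements the transformation from raw features to model-ready features. The derived-feature step creates seven new risk indicators: the loan-to-income ratio measures the loan burden directly; the credit score tier discretizes a continuous variable, which can help a model capture nonlinear effects; the income tier uses quantile cuts so that each tier holds a similar number of samples; the debt burden indicator combines several debt-related factors; the employment stability label summarizes how reliable employment is; the composite risk score folds several dimensions into a single measure; and the high-risk flag is a binary indicator for applications that need closer attention. Categorical encoding offers three methods for different situations: one-hot encoding suits unordered categories and avoids introducing a spurious order, label encoding is compact and works well with tree models, and ordinal encoding preserves the order among categories. Target encoding, discussed earlier as an option for high-cardinality features, is not implemented in this class and needs careful handling of overfitting if added. Numeric scaling offers three options: Z-score standardization rescales each feature to zero mean and unit variance and is a common default for linear models (it does not require normally distributed data, but it is sensitive to outliers); Min-Max scaling maps features to a fixed range, which often suits neural networks; and robust scaling uses the median and interquartile range, making it less sensitive to outliers and better suited to data with extreme values. Feature selection implements three methods: mutual information can capture nonlinear dependence, the F-statistic is based on analysis of variance and mainly detects linear separation between classes, and recursive feature elimination accounts for feature combinations at a higher computational cost.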
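One practical point about Z-score standardization is that the mean and standard deviation should be learned from the training split alone and then reused for any held-out data, so that no information from the test set leaks into the features. The sketch below shows this with the standard library; the helper names and income values are made up, and the population standard deviation is used because that is what scikit-learn's StandardScaler computes.

```python
import statistics


def fit_standard_scaler(train_values):
    """Learn mean and population standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std


def transform(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]


# Made-up incomes: five training applicants, one held-out applicant
train_income = [40000, 50000, 60000, 70000, 80000]
test_income = [90000]

mean, std = fit_standard_scaler(train_income)
train_scaled = transform(train_income, mean, std)
test_scaled = transform(test_income, mean, std)

print(mean)         # 60000.0
print(test_scaled)  # one value, about 2.12
```

The held-out applicant lands outside the range seen in training (about 2.12 versus a training maximum of about 1.41), which is expected: the scaler describes the training distribution and is not refitted to make new data look typical.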

Building and Training Credit Scoring Prediction Models

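Building an accurate and robust credit scoring model requires careful algorithm selection and hyperparameter tuning. Logistic regression, the most basic classification algorithm, is interpretable and computationally cheap, which makes it a good baseline for measuring how much more complex models actually improve results. Decision trees handle nonlinear relationships and feature interactions automatically, and the rules they produce are easy to understand and to explain to regulators, but a single tree overfits easily and needs careful pruning. Random forests reduce variance and improve generalization by combining many decision trees, and they also provide feature importance estimates that help show which factors matter most to the credit decision. Gradient-boosted trees such as XGBoost and LightGBM perform strongly in credit scoring. These algorithms reach higher accuracy by correcting the previous round's prediction errors step by step, and their built-in regularization terms and early stopping help limit overfitting.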

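Several aspects of training need attention to make the final model usable in practice. Cross-validation is the standard way to estimate generalization: k-fold cross-validation splits the data into k parts and uses each in turn as the validation set, which gives a more stable performance estimate than a single split. Stratified sampling keeps the class proportions the same in the training and test sets, which is especially important with imbalanced classes. Hyperparameters can be tuned with grid search or random search to explore the parameter space systematically, while Bayesian optimization offers a more efficient search that suits computationally expensive models. Evaluation should use several metrics together. Accuracy measures the overall share of correct predictions but can mislead on imbalanced data. Precision is the share of predicted positives that are truly positive and reflects the cost of false alarms. Recall is the share of actual positives that are correctly identified and reflects the cost of misses. The F1 score is the harmonic mean of precision and recall and gives a balanced view. AUC-ROC evaluates how well the model ranks cases across all thresholds and does not depend on a chosen cutoff; it is less sensitive to class imbalance than accuracy, although under severe imbalance it can look optimistic, so precision-recall analysis is a useful complement.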
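Stratified fold assignment and the precision, recall, and F1 definitions above can be written out by hand to make them concrete. The sketch below is a simplified illustration, not a replacement for a library implementation: the function names are hypothetical, the fold assignment is a plain round-robin per class with no shuffling, and the labels and predictions are made up.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds, keeping class ratios similar in each fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class out round-robin so every fold gets its share
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 8 negatives and 4 positives split into 4 folds: each fold gets 2 + 1
labels = [0] * 8 + [1] * 4
folds = stratified_folds(labels, k=4)
print(folds)  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]

# tp = 2, fp = 1, fn = 2
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # about 0.667, 0.5, about 0.571
```

In this example accuracy is 0.7 even though half of the positives are missed, which is the kind of gap between accuracy and recall that the paragraph above warns about.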

# Credit scoring prediction model system
# Implements training, evaluation, and comparison of several machine learning algorithms

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, classification_report, confusion_matrix
)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import joblib

class CreditScoringModel:
    """
    信用評分預測模型類別
    支援多種機器學習演算法的訓練與評估
    """
    
    def __init__(self, model_type='random_forest', random_state=42):
        """
        初始化模型
        
        參數:
            model_type: 模型類型
            random_state: 隨機種子確保可重現性
        """
        self.model_type = model_type
        self.random_state = random_state
        self.model = self._initialize_model()
        self.is_trained = False
        self.feature_names = None
        self.training_history = {}
        
    def _initialize_model(self):
        """
        根據指定類型初始化模型
        
        回傳:
            model: 初始化的模型物件
        """
        models = {
            'logistic': LogisticRegression(
                random_state=self.random_state,
                max_iter=1000
            ),
            'decision_tree': DecisionTreeClassifier(
                random_state=self.random_state,
                max_depth=10,
                min_samples_split=20,
                min_samples_leaf=10
            ),
            'random_forest': RandomForestClassifier(
                n_estimators=100,
                random_state=self.random_state,
                max_depth=15,
                min_samples_split=20,
                min_samples_leaf=10,
                n_jobs=-1
            ),
            'gradient_boosting': GradientBoostingClassifier(
                n_estimators=100,
                learning_rate=0.1,
                max_depth=5,
                random_state=self.random_state
            ),
            'svm': SVC(
                kernel='rbf',
                probability=True,
                random_state=self.random_state
            ),
            'naive_bayes': GaussianNB(),
            'knn': KNeighborsClassifier(
                n_neighbors=5,
                weights='distance'
            )
        }
        
        if self.model_type not in models:
            raise ValueError(f"不支援的模型類型: {self.model_type}")
        
        return models[self.model_type]
    
    def train(self, X_train, y_train, X_val=None, y_val=None):
        """
        Train the model.

        Args:
            X_train: training features
            y_train: training labels
            X_val: validation features (optional)
            y_val: validation labels (optional)
        """
        print(f"\nTraining {self.model_type} model...")
        print(f"Training set size: {X_train.shape[0]} records, {X_train.shape[1]} features")

        # Record the start time
        start_time = datetime.now()

        # Keep the feature names when the input is a DataFrame
        if hasattr(X_train, 'columns'):
            self.feature_names = X_train.columns.tolist()

        # Fit the model
        self.model.fit(X_train, y_train)

        # Elapsed training time in seconds
        training_time = (datetime.now() - start_time).total_seconds()

        # Training-set accuracy is optimistic; it is mainly useful for spotting overfitting
        train_pred = self.model.predict(X_train)
        train_accuracy = accuracy_score(y_train, train_pred)

        print(f"✓ Training finished (elapsed: {training_time:.2f} s)")
        print(f"Training accuracy: {train_accuracy:.4f}")

        # Store the training history
        self.training_history = {
            'train_accuracy': train_accuracy,
            'training_time': training_time
        }

        # Evaluate on the validation set when one is provided
        if X_val is not None and y_val is not None:
            val_pred = self.model.predict(X_val)
            val_accuracy = accuracy_score(y_val, val_pred)
            print(f"Validation accuracy: {val_accuracy:.4f}")
            self.training_history['val_accuracy'] = val_accuracy

        self.is_trained = True
    
    def predict(self, X):
        """
        Predict class labels.

        Args:
            X: feature matrix

        Returns:
            predictions: predicted labels
        """
        if not self.is_trained:
            raise ValueError("Model is not trained yet; call train() first")

        return self.model.predict(X)

    def predict_proba(self, X):
        """
        Predict class probabilities.

        Args:
            X: feature matrix

        Returns:
            probabilities: predicted probabilities, one column per class
        """
        if not self.is_trained:
            raise ValueError("Model is not trained yet; call train() first")

        if hasattr(self.model, 'predict_proba'):
            return self.model.predict_proba(X)
        else:
            raise AttributeError(f"{self.model_type} model does not support probability prediction")
    
    def evaluate(self, X_test, y_test, detailed=True):
        """
        Evaluate model performance.

        Args:
            X_test: test features
            y_test: test labels (0 = reject, 1 = approve)
            detailed: whether to print the detailed report

        Returns:
            metrics: dictionary of evaluation metrics
        """
        if not self.is_trained:
            raise ValueError("Model is not trained yet; call train() first")

        print(f"\nEvaluating {self.model_type} model...")
        print(f"Test set size: {X_test.shape[0]} records")

        # Predict labels
        y_pred = self.predict(X_test)

        # Compute the metrics (the positive class is 1 = approve)
        metrics = {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred, zero_division=0),
            'recall': recall_score(y_test, y_pred, zero_division=0),
            'f1': f1_score(y_test, y_pred, zero_division=0)
        }

        # Compute AUC when the model supports probability prediction
        if hasattr(self.model, 'predict_proba'):
            y_proba = self.predict_proba(X_test)[:, 1]
            metrics['roc_auc'] = roc_auc_score(y_test, y_proba)

        # Print the results
        print("\nPerformance metrics:")
        print(f"Accuracy:  {metrics['accuracy']:.4f}")
        print(f"Precision: {metrics['precision']:.4f}")
        print(f"Recall:    {metrics['recall']:.4f}")
        print(f"F1 score:  {metrics['f1']:.4f}")

        if 'roc_auc' in metrics:
            print(f"AUC-ROC:   {metrics['roc_auc']:.4f}")

        # Detailed report
        if detailed:
            print("\nClassification report:")
            print(classification_report(
                y_test, y_pred,
                labels=[0, 1],
                target_names=['Rejected', 'Approved'],
                digits=4,
                zero_division=0
            ))

            # Confusion matrix; labels=[0, 1] keeps it 2x2 even if one class is absent
            cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
            tn, fp, fn, tp = cm.ravel()
            print("\nConfusion matrix:")
            print("                  Pred. reject  Pred. approve")
            print(f"Actual reject     {tn:12d}  {fp:13d}")
            print(f"Actual approve    {fn:12d}  {tp:13d}")

            # Business metrics
            # False approval rate: share of should-be-rejected applicants that were approved
            false_positive_rate = fp / (tn + fp) if (tn + fp) > 0 else 0
            # False rejection rate: share of should-be-approved applicants that were rejected
            false_negative_rate = fn / (fn + tp) if (fn + tp) > 0 else 0

            print("\nBusiness impact analysis:")
            print(f"False approval rate (approved but should be rejected):  {false_positive_rate:.2%}")
            print(f"False rejection rate (rejected but should be approved): {false_negative_rate:.2%}")

        return metrics
    
    def cross_validate(self, X, y, cv=5):
        """
        Run k-fold cross-validation.

        Args:
            X: feature matrix
            y: labels
            cv: number of folds

        Returns:
            cv_results: per-metric scores, mean, and standard deviation
        """
        print(f"\nRunning {cv}-fold cross-validation...")

        # Metrics to evaluate; the model is re-fitted on every fold for each metric
        scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

        cv_results = {}
        for metric in scoring:
            scores = cross_val_score(
                self.model, X, y,
                cv=cv,
                scoring=metric,
                n_jobs=-1
            )
            cv_results[metric] = {
                'scores': scores,
                'mean': scores.mean(),
                'std': scores.std()
            }

            print(f"{metric:10s}: {scores.mean():.4f} (+/- {scores.std():.4f})")

        return cv_results
    
    def get_feature_importance(self, top_n=10):
        """
        Return feature importances.
        Only supported by tree-based models.

        Args:
            top_n: number of top features to display

        Returns:
            importance_df: DataFrame of feature importances
        """
        if not self.is_trained:
            raise ValueError("Model is not trained yet")

        if not hasattr(self.model, 'feature_importances_'):
            raise AttributeError(f"{self.model_type} model does not support feature importance analysis")

        # Read the importances from the fitted model
        importances = self.model.feature_importances_

        # Build the DataFrame; fall back to column indices when no names were stored
        importance_df = pd.DataFrame({
            'feature': self.feature_names if self.feature_names else range(len(importances)),
            'importance': importances
        }).sort_values('importance', ascending=False)

        # The model may have fewer than top_n features, so plot only what exists
        top_features = importance_df.head(top_n)
        n_shown = len(top_features)

        print(f"\nTop {n_shown} features:")
        print(top_features)

        # Visualization
        plt.figure(figsize=(10, 6))
        plt.barh(range(n_shown), top_features['importance'])
        plt.yticks(range(n_shown), top_features['feature'])
        plt.xlabel('Importance score')
        plt.title(f'{self.model_type} feature importance (top {n_shown})')
        plt.gca().invert_yaxis()
        plt.tight_layout()
        plt.show()

        return importance_df
    
    def plot_roc_curve(self, X_test, y_test):
        """
        Plot the ROC curve.

        Args:
            X_test: test features
            y_test: test labels
        """
        if not hasattr(self.model, 'predict_proba'):
            raise AttributeError(f"{self.model_type} model does not support probability prediction")

        # Predicted probability of the positive class
        y_proba = self.predict_proba(X_test)[:, 1]

        # Compute the ROC curve
        fpr, tpr, thresholds = roc_curve(y_test, y_proba)
        roc_auc = roc_auc_score(y_test, y_proba)

        # Plot
        plt.figure(figsize=(8, 6))
        plt.plot(fpr, tpr, color='darkorange', lw=2,
                label=f'ROC curve (AUC = {roc_auc:.4f})')
        plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--',
                label='Random guess')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title(f'{self.model_type} ROC curve')
        plt.legend(loc="lower right")
        plt.grid(alpha=0.3)
        plt.tight_layout()
        plt.show()
    
    def save_model(self, filepath):
        """
        Save the model.

        Args:
            filepath: path to write the model file to
        """
        if not self.is_trained:
            raise ValueError("Model is not trained yet")

        # Bundle the model with its metadata
        model_data = {
            'model': self.model,
            'model_type': self.model_type,
            'feature_names': self.feature_names,
            'training_history': self.training_history,
            'is_trained': self.is_trained
        }

        joblib.dump(model_data, filepath)
        print(f"✓ Model saved to: {filepath}")

    @classmethod
    def load_model(cls, filepath):
        """
        Load a model.

        Args:
            filepath: path of the model file

        Returns:
            model_instance: the restored model instance
        """
        # joblib files are pickle-based: only load files from a trusted source
        model_data = joblib.load(filepath)

        # Rebuild the wrapper and attach the stored state
        instance = cls(model_type=model_data['model_type'])
        instance.model = model_data['model']
        instance.feature_names = model_data['feature_names']
        instance.training_history = model_data['training_history']
        instance.is_trained = model_data['is_trained']

        print(f"✓ Loaded {model_data['model_type']} model")

        return instance

# Model comparison and selection
class ModelComparison:
    """
    Model comparison tool.
    Trains several models and compares their performance.
    """

    def __init__(self):
        """Initialize the comparator."""
        self.models = {}
        self.results = {}

    def add_model(self, model_name, model_type):
        """
        Register a model to compare.

        Args:
            model_name: display name of the model
            model_type: model type key understood by CreditScoringModel
        """
        self.models[model_name] = CreditScoringModel(model_type=model_type)
        print(f"✓ Added {model_name} ({model_type}) model")

    def train_all(self, X_train, y_train, X_test, y_test):
        """
        Train and evaluate every registered model, one after another.

        Args:
            X_train, y_train: training data
            X_test, y_test: test data
        """
        print("\n" + "="*60)
        print("Training all models")
        print("="*60)

        for name, model in self.models.items():
            print(f"\nTraining {name} model...")
            model.train(X_train, y_train)

            # Evaluate the model
            metrics = model.evaluate(X_test, y_test, detailed=False)
            self.results[name] = metrics

        print("\n" + "="*60)
        print("All models trained")
        print("="*60)

    def compare_results(self):
        """
        Compare the results of all models.
        """
        if not self.results:
            print("No models have been trained yet")
            return

        # Build the comparison table, best F1 first
        comparison_df = pd.DataFrame(self.results).T
        comparison_df = comparison_df.sort_values('f1', ascending=False)

        print("\nModel performance comparison:")
        print("="*80)
        print(comparison_df.to_string())

        # Visual comparison: one panel per metric
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))

        metrics = ['accuracy', 'precision', 'recall', 'f1']
        titles = ['Accuracy', 'Precision', 'Recall', 'F1 score']

        for idx, (metric, title) in enumerate(zip(metrics, titles)):
            ax = axes[idx // 2, idx % 2]

            values = [self.results[name][metric] for name in self.results.keys()]
            names = list(self.results.keys())

            ax.barh(range(len(names)), values)
            ax.set_yticks(range(len(names)))
            ax.set_yticklabels(names)
            ax.set_xlabel('Score')
            ax.set_title(title)
            ax.set_xlim([0, 1])

            # Add value labels next to the bars
            for i, v in enumerate(values):
                ax.text(v, i, f' {v:.4f}', va='center')

        plt.tight_layout()
        plt.show()

        # Pick the best model by F1
        best_model = comparison_df['f1'].idxmax()
        print(f"\nBest model by F1 score: {best_model}")
        print(f"F1 score: {comparison_df.loc[best_model, 'f1']:.4f}")

        return comparison_df

# Example entry point
if __name__ == "__main__":
    # Assumes feature engineering is already done and processed data is available
    print("Preparing training data...")

    # Real processed data should be loaded here;
    # for demonstration we generate synthetic data instead
    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=1000,
        n_features=15,
        n_informative=10,
        n_redundant=3,
        n_classes=2,
        weights=[0.6, 0.4],
        random_state=42
    )

    # Split into training and test sets, preserving the class ratio
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    print(f"Training set: {X_train.shape[0]} records")
    print(f"Test set: {X_test.shape[0]} records")

    # Initialize the comparator
    comparator = ModelComparison()

    # Register the models to compare
    comparator.add_model('Logistic Regression', 'logistic')
    comparator.add_model('Decision Tree', 'decision_tree')
    comparator.add_model('Random Forest', 'random_forest')
    comparator.add_model('Gradient Boosting', 'gradient_boosting')

    # Train all models
    comparator.train_all(X_train, y_train, X_test, y_test)

    # Compare the results
    comparison_results = comparator.compare_results()

    print("\nModel training and evaluation complete!")

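This training system shows how a credit scoring model can be built behind a consistent, reusable interface. The CreditScoringModel class wraps several machine learning algorithms, each configured with reasonable starting hyperparameters rather than tuned values. Logistic regression serves as the linear baseline and yields interpretable coefficients. The decision tree limits overfitting by capping depth and requiring a minimum number of samples per split and per leaf. The random forest uses an ensemble of 100 trees as a compromise between accuracy and compute cost. Gradient boosting combines a learning rate of 0.1 with a moderate tree depth of 5. The train method records the training time along with training accuracy and, when a validation set is supplied, validation accuracy, for later analysis. The evaluate method reports accuracy, precision, recall, F1, and AUC-ROC, and the confusion matrix shows how performance differs between the two classes. The business impact analysis restates these figures in business terms: the false approval rate reflects potential bad-debt risk, while the false rejection rate represents lost business. Cross-validation gives a more robust estimate by reducing the randomness of a single train/test split. Feature importance analysis, available for the tree-based models, shows which factors most influence the predicted decision, which helps with both model refinement and business insight. The ROC curve visualizes the trade-off across decision thresholds and helps an institution pick a threshold that matches its risk appetite. The ModelComparison class automates model selection by training the models one after another and producing a side-by-side metrics table and charts. Because it ranks models on the test set, the winning score is slightly optimistic; a separate validation set or cross-validation is a safer basis for the final choice.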
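
The threshold choice described above can be made concrete by attaching a cost to each type of error. The following is a minimal sketch and not part of the system above: the function name `pick_threshold` and both cost values are hypothetical, and a real institution would derive the costs from its own loss and margin figures. It reuses scikit-learn's `roc_curve` and assumes label 1 means "approve".

```python
import numpy as np
from sklearn.metrics import roc_curve


def pick_threshold(y_true, y_proba, cost_false_approve=5.0, cost_false_reject=1.0):
    """Return (threshold, total_cost) with the lowest total misclassification cost.

    y_true:  1 = should be approved, 0 = should be rejected
    y_proba: predicted probability of class 1
    The default costs are illustrative assumptions, not figures from the article.
    """
    y_true = np.asarray(y_true)
    fpr, tpr, thresholds = roc_curve(y_true, y_proba)
    n_neg = np.sum(y_true == 0)
    n_pos = np.sum(y_true == 1)
    # False approvals = FPR * negatives; false rejections = (1 - TPR) * positives
    total_cost = (cost_false_approve * fpr * n_neg
                  + cost_false_reject * (1 - tpr) * n_pos)
    best = int(np.argmin(total_cost))
    return thresholds[best], total_cost[best]


# Tiny hand-made example: two applicants to reject, two to approve
y_true = [0, 0, 1, 1]
y_proba = [0.1, 0.6, 0.4, 0.9]
threshold, cost = pick_threshold(y_true, y_proba)
print(f"threshold={threshold:.2f}, total cost={cost:.1f}")
```

An applicant would then be approved when the predicted probability is at or above the returned threshold. Raising `cost_false_approve` relative to `cost_false_reject` pushes the threshold up, which is the conservative direction for a lender worried about bad debt.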

Model Bias Detection and Fairness Evaluation Framework

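Applying machine learning to credit scoring demands close attention to algorithmic bias and fairness, because unfair lending decisions can violate anti-discrimination law and entrench social inequity. Bias can enter at several points. The training data may carry traces of historical discrimination: if certain groups were approved less often in the past, the model learns that pattern and perpetuates it. Feature selection can introduce proxy variables without anyone intending it: even when the model does not use sensitive attributes such as race or gender directly, features such as postal code or name can reveal them indirectly. Modeling choices, such as the optimization objective or class weights, can also affect groups differently.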
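
One way to look for proxy variables is to test how well the supposedly neutral features predict the sensitive attribute itself. The sketch below is an illustration under stated assumptions: `proxy_auc` is a hypothetical helper, the data is synthetic, and `region_score` is only a stand-in for something like a postal-code feature. A cross-validated AUC near 0.5 suggests the feature carries little information about the group, while a value near 1.0 suggests a strong proxy. A logistic regression only detects roughly linear relationships, so a low score is not proof that no proxy exists.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def proxy_auc(X, sensitive, cv=5):
    """Cross-validated AUC for predicting a binary sensitive attribute from X."""
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, sensitive, cv=cv, scoring='roc_auc')
    return scores.mean()


# Synthetic data (all names and distributions are assumptions for illustration)
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, size=n)                      # binary sensitive attribute
income = rng.normal(50, 10, size=n)                     # generated independently of group
region_score = group * 2.0 + rng.normal(0, 1, size=n)   # correlated with group by construction

auc_income = proxy_auc(income.reshape(-1, 1), group)
auc_region = proxy_auc(region_score.reshape(-1, 1), group)
print(f"income       -> AUC {auc_income:.3f}")
print(f"region_score -> AUC {auc_region:.3f}")
```

Features that score high are candidates for removal, coarsening, or closer review with the compliance team rather than automatic exclusion, since a proxy can also carry legitimate risk information.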

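Fairness has to be assessed with several metrics, each examining model behavior from a different angle. Demographic parity requires that different groups receive positive predictions at similar rates; it emphasizes equal outcomes but can ignore real differences in risk between groups. Equal opportunity requires that, among applicants who are truly positive, each group has the same probability of being correctly identified as positive, which protects qualified applicants from being wrongly rejected because of group membership. Predictive parity requires that, among applicants predicted positive, the share who are truly positive is the same across groups, so that the realized default rate among approved borrowers is consistent across groups. Calibration requires that the predicted probabilities match the observed outcome rates within every group. These definitions can be mathematically incompatible: when base rates differ between groups, an imperfect model generally cannot satisfy all of them at once. Institutions therefore have to choose the fairness criteria that fit their business goals and regulatory requirements.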
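
The first three definitions map directly onto per-group statistics of the predictions. The following sketch computes them with pandas; the function name `group_fairness_report` and the toy data are assumptions made for illustration, and label 1 is taken to mean "approve". The selection rate corresponds to demographic parity, the true positive rate to equal opportunity, and precision to predictive parity.

```python
import numpy as np
import pandas as pd


def group_fairness_report(y_true, y_pred, group):
    """Per-group selection rate, TPR, and precision for binary predictions."""
    df = pd.DataFrame({
        'y': np.asarray(y_true),
        'pred': np.asarray(y_pred),
        'g': np.asarray(group),
    })
    rows = {}
    for g, part in df.groupby('g'):
        actual_pos = part[part['y'] == 1]    # applicants who truly qualify
        pred_pos = part[part['pred'] == 1]   # applicants the model approves
        rows[g] = {
            # Demographic parity: share of the group that is approved
            'selection_rate': part['pred'].mean(),
            # Equal opportunity: share of qualified applicants that is approved
            'tpr': actual_pos['pred'].mean() if len(actual_pos) else np.nan,
            # Predictive parity: share of approved applicants that truly qualify
            'precision': pred_pos['y'].mean() if len(pred_pos) else np.nan,
        }
    return pd.DataFrame.from_dict(rows, orient='index')


# Toy data: two groups of four applicants each
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']

report = group_fairness_report(y_true, y_pred, group)
gaps = report.max() - report.min()   # largest between-group gap per metric
print(report)
print(gaps)
```

In this toy data group A is approved more often and has the higher true positive rate, while group B has the higher precision, which illustrates how the criteria can point in different directions for the same set of predictions.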

@startuml
!define PLANTUML_FORMAT svg
!theme _none_

skinparam dpi auto
skinparam shadowing false
skinparam linetype ortho
skinparam roundcorner 5
skinparam defaultFontName "Microsoft JhengHei UI"
skinparam defaultFontSize 16
skinparam minClassWidth 100

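title Fairness Evaluation Workflow for Credit Scoring Models

package "Data preparation" {
  [Historical loan data] as DATA
  [Sensitive attribute identification] as SENSITIVE
  [Bias detection analysis] as BIAS_CHECK
}

package "Model training" {
  [Feature engineering] as FEATURE
  [Model training and tuning] as TRAIN
  [Fairness constraint integration] as FAIRNESS_CONSTRAINT
}

package "Model evaluation" {
  [Performance metrics] as PERFORMANCE
  [Between-group difference analysis] as GROUP_ANALYSIS
  [Fairness metric evaluation] as FAIRNESS_METRIC
}

package "Continuous monitoring" {
  [Production deployment] as DEPLOY
  [Real-time bias monitoring] as MONITOR
  [Periodic model audit] as AUDIT
}

actor "Data science team" as DS
actor "Compliance department" as COMPLIANCE
actor "Business decision makers" as BUSINESS

DATA --> SENSITIVE : data analysis
SENSITIVE --> BIAS_CHECK : group identification

BIAS_CHECK --> FEATURE : bias report
FEATURE --> TRAIN : processed features
TRAIN --> FAIRNESS_CONSTRAINT : base model

FAIRNESS_CONSTRAINT --> PERFORMANCE : fair model
PERFORMANCE --> GROUP_ANALYSIS : overall performance
GROUP_ANALYSIS --> FAIRNESS_METRIC : per-group performance

FAIRNESS_METRIC --> DEPLOY : evaluation passed
DEPLOY --> MONITOR : model goes live
MONITOR --> AUDIT : monitoring data

DS --> FEATURE
DS --> TRAIN
COMPLIANCE --> FAIRNESS_CONSTRAINT
COMPLIANCE --> FAIRNESS_METRIC
BUSINESS --> DEPLOY
BUSINESS --> AUDIT

note right of SENSITIVE
  Sensitive attributes to identify:
  - Gender
  - Age
  - Race / ethnicity
  - Geographic location
  - Marital status
end note

note right of FAIRNESS_METRIC
  Fairness metrics:
  - Demographic parity
  - Equal opportunity
  - Predictive parity
  - Calibration
end note

note right of MONITOR
  Items monitored continuously:
  - Approval rate trend
  - Between-group differences
  - Default rate distribution
  - Complaint case analysis
end note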

@enduml

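This diagram lays out the fairness safeguards for a credit scoring model from development through deployment. The data preparation stage starts by identifying sensitive attributes in the historical loan data. In some jurisdictions these attributes are legally protected and may not be used in lending decisions; they include gender, age, race, ethnicity, religion, and marital status. Bias detection then checks the historical data for systematic differences between groups, for example an unusually low approval rate for some group that persists after controlling for legitimate factors such as credit score and income. In the model training stage, feature engineering has to handle features that may act as proxies for sensitive attributes with care: postal code can hint at racial composition, some names can hint at ethnic background, and educational institution can correlate with socioeconomic status. Fairness constraint integration builds fairness requirements into the optimization objective, for example by adding a penalty term that limits prediction differences between groups, or by using adversarial debiasing to learn representations that are insensitive to the sensitive attributes.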
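
The penalty-term and adversarial approaches mentioned above need a custom training loop. A simpler pre-processing alternative, shown here in their place, is reweighing: each training record gets the weight P(group) × P(label) / P(group, label), so that group and label look statistically independent in the weighted data. This is a minimal sketch with a hypothetical helper name and toy data. The weights can be passed to estimators whose `fit` accepts `sample_weight`, such as scikit-learn's `LogisticRegression` and tree-based models; k-nearest neighbors does not accept them.

```python
import numpy as np
import pandas as pd


def reweighing_weights(group, y):
    """Sample weights w(g, y) = P(g) * P(y) / P(g, y) for each record."""
    df = pd.DataFrame({'g': np.asarray(group), 'y': np.asarray(y)})
    n = len(df)
    # Marginal probabilities, mapped back onto each row
    p_g = df['g'].map(df['g'].value_counts(normalize=True)).to_numpy()
    p_y = df['y'].map(df['y'].value_counts(normalize=True)).to_numpy()
    # Joint probability of each row's (group, label) combination
    p_gy = df.groupby(['g', 'y'])['y'].transform('count').to_numpy() / n
    return p_g * p_y / p_gy


# Toy data: group A has a 4/6 positive rate, group B only 1/4
group = np.array(['A'] * 6 + ['B'] * 4)
y = np.array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0])
w = reweighing_weights(group, y)

# Weighted positive rate per group after reweighing
for g in ['A', 'B']:
    mask = group == g
    rate = np.sum(w[mask] * y[mask]) / np.sum(w[mask])
    print(f"group {g}: weighted positive rate = {rate:.2f}")

# Usage with any estimator that supports it: model.fit(X, y, sample_weight=w)
```

The train method of CreditScoringModel shown earlier does not forward `sample_weight`, so using these weights there would require adding such a parameter.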

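The model evaluation stage performs a full fairness check. Performance metrics give the model's overall accuracy, but that is not enough: performance also has to be compared across groups. Between-group difference analysis computes the approval rate, false approval rate, and false rejection rate for each sensitive group and checks for significant gaps. Fairness metric evaluation quantifies fairness with several mathematical definitions: demographic parity checks whether groups receive positive predictions at similar rates, equal opportunity checks between-group differences in the true positive rate, predictive parity checks between-group differences in precision, and calibration checks whether predicted probabilities are accurate within each group. The continuous monitoring stage keeps the model fair in production, because shifts in the data distribution can introduce bias into a model that was fair at launch. Real-time bias monitoring tracks the approval rate trend and the between-group difference metrics and raises an alert when it detects an anomaly. Periodic model audits are carried out by an independent team, which runs a full fairness evaluation, looks for systematic patterns in complaint cases, and decides whether the model needs retraining or adjustment. The data science team, the compliance department, and business decision makers work together throughout, so that the model meets technical performance requirements as well as regulatory and ethical standards.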
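
The monitoring step can be sketched as a periodic check on logged decisions. The example below is an assumption-laden illustration: the column names `period`, `group`, and `approved`, the function name, and the 0.8 default are all hypothetical. The default echoes the "four-fifths" rule of thumb from US employment-selection guidance; whether that threshold is appropriate for lending in a given jurisdiction is something to confirm with the compliance department.

```python
import pandas as pd


def disparate_impact_alerts(decisions, min_ratio=0.8):
    """Flag periods where the lowest group approval rate falls below
    min_ratio times the highest group approval rate.

    decisions: DataFrame with columns 'period', 'group', 'approved' (0/1).
    Returns one row per period with min_rate, max_rate, ratio, and alert.
    """
    # Approval rate per period and group, one column per group
    rates = (decisions.groupby(['period', 'group'])['approved']
             .mean()
             .unstack('group'))
    summary = pd.DataFrame({
        'min_rate': rates.min(axis=1),
        'max_rate': rates.max(axis=1),
    })
    # If nobody is approved in a period the ratio is NaN and no alert is raised
    summary['ratio'] = summary['min_rate'] / summary['max_rate']
    summary['alert'] = summary['ratio'] < min_ratio
    return summary


# Toy decision log covering two periods and two groups
decisions = pd.DataFrame({
    'period': ['2024-01'] * 8 + ['2024-02'] * 8,
    'group': (['A'] * 4 + ['B'] * 4) * 2,
    'approved': [1, 1, 1, 0, 1, 1, 0, 0,
                 1, 1, 0, 0, 1, 1, 0, 0],
})
summary = disparate_impact_alerts(decisions)
print(summary)
```

In practice the alert would feed the audit process described above rather than change decisions automatically, and small groups would need a minimum sample size before their rates are compared.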

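AI-driven credit scoring delivers clear value to financial institutions: handling large application volumes with machine learning models speeds up review and lowers labor costs, and the models can pick up complex default patterns that give a more accurate risk assessment than a traditional rule engine. The technology also brings new challenges and responsibilities. Institutions must keep algorithmic decisions transparent and explainable, especially when they have to tell an applicant why a loan was rejected. Model bias and fairness need ongoing attention, backed by monitoring and audit mechanisms that guard against systematic discrimination toward any group. Data privacy and security matter just as much: credit data contains a great deal of sensitive personal information and must be protected by strict controls against leakage and misuse.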

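Looking ahead, credit scoring is moving toward more intelligent and more personalized assessment. Integrating alternative data such as social media activity, online shopping behavior, and mobile payment records can give groups without a traditional credit history, such as young people or new immigrants, a fairer chance of being assessed. Deep learning may uncover more complex default patterns, but it also makes models harder to explain and calls for new explanation methods. Privacy-preserving techniques such as federated learning let several institutions train a model together without sharing raw data, improving model quality while protecting customer privacy. Real-time dynamic scoring updates the credit assessment as the borrower's financial situation changes, which allows more timely risk monitoring. Financial institutions should adopt these technologies actively while building a governance framework that keeps their use in line with regulatory requirements and public expectations, balancing business performance against fairness.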

玄貓 BlackCat

A technology enthusiast who shares notes and insights on software development, cloud technology, and AI applications.