殘差網路分析：發掘超越引力模型的隱藏連結

引力模型在預測節點互動流量時，雖提供了基於質量與距離的堅實基礎，卻常忽略現實世界中更複雜的驅動因素。模型預測與實際觀測值之間的差異，即「殘差」，不應被視為單純誤差，反而蘊含著重要資訊。當實際流量遠超預期，正殘差便成為一個強烈信號，指向了模型變數之外的特殊吸引力，如緊密的產業鏈結或獨特旅遊誘因。本文核心在於系統性地捕捉這些信號，將其從原始網路分離，並組織成一個僅包含超預期連結的「殘差網路」。此過程不僅是數據篩選，更是一種理論視角的轉換，旨在揭示由隱性關係構成的深層次網路結構。

殘差網路的構建與洞察

本章節將聚焦於如何利用引力模型分析中的殘差數據，構建一個「殘差網路」。這個網路僅包含實際流量顯著高於模型預期的連接，從而突顯出那些具有「特殊關係」或「額外吸引力」的機場間聯繫。我們將探討如何篩選這些邊，構建殘差網路，並對其進行視覺化，以揭示重要的商業與旅遊節點。

構建殘差網路

殘差的篩選標準：
- 引力模型的局限：正如前文所述，引力模型主要基於距離和節點質量來預測流量。然而，現實中的航空網路受到許多其他因素的影響，這些因素無法被簡單的引力模型完全捕捉。
- 殘差的意義：殘差代表了模型預測與實際情況之間的差異。正殘差表示實際流量超過了預期的水平，這通常意味著存在一些額外的、未被模型考慮的「吸引力」或「特殊聯繫」。
- 對數殘差 (log_residual)：
  - 程式碼中計算了 log_residual = math.log10(count) - math.log10(expected)。
  - 使用對數殘差有幾個好處：
    1. 尺度不變性：對數轉換使得殘差的解釋與原始流量的尺度無關。
    2. 對稱性：正殘差和負殘差在對數尺度上更為對稱。
    3. 處理極端值：有助於平滑極端流量值的影響。
  - 篩選條件：residual_edges = [e for e in G_air.edges if G_air.edges[e]['log_residual'] > 0]
    - 這段程式碼篩選出所有 log_residual 大於 0 的邊。這意味著實際流量的對數值大於預期流量的對數值，即實際流量顯著高於模型預期。
構建子圖：
- G_air.edge_subgraph(residual_edges)：
  - NetworkX 的 edge_subgraph() 方法非常適合用於從現有圖中提取一個子圖，該子圖僅包含指定的邊及其相關的節點。
  - 這樣就創建了一個新的圖 G_residual，它只包含那些具有顯著正殘差的連接。
- 保留最大連通分量：
  - G_residual = nx.subgraph(G_residual, ...)：在某些情況下，篩選後的殘差網路可能包含多個斷開的連通分量。為了聚焦於最主要的連接，通常會進一步提取最大的連通分量。這一步驟確保了我們分析的是網路中最集中的「核心」連接。

殘差網路的視覺化與解讀

視覺化方法：
- 殘差網路的視覺化方法與之前基於地理位置的網路視覺化類似。
- 使用地理佈局 pos：關鍵在於繼續使用之前計算好的地理位置佈局 pos。這確保了殘差網路中的節點位置與其真實地理位置對應，使得我們可以將殘差網路疊加在地理背景上進行理解。
- 節點繪製：nx.draw_networkx_nodes(G_residual, pos=pos, ...) 繪製殘差網路中的節點。
- 邊的透明度：
  - max_weight = max([G_residual.edges[e]['log_residual'] for e in G_residual.edges])：找到殘差網路中最大的對數殘差值。
  - alpha = G_residual.edges[e]['log_residual'] / max_weight：根據邊的對數殘差佔最大對數殘差的比例來設定透明度。對數殘差越大的邊，越不透明，越顯眼。
視覺化結果的特徵：
- 稀疏性：相較於完整的地理網路，殘差網路通常會顯得更加稀疏。這是因為只有那些「表現超預期」的連接被保留了下來。
- 重要節點的凸顯：
  - 這個稀疏的網路更容易識別出重要的節點和連接。
  - 許多重要的節點會圍繞在美國的主要商業中心和旅遊目的地周圍。
  - 範例：文中提到，紐約市 (New York City)、舊金山 (San Francisco) 和洛杉磯 (Los Angeles) 是重要的商業中心，它們的機場在殘差網路中可能扮演核心角色。同時，像猶他州的溫多弗 (Wendover, Utah) 這樣的度假勝地也可能因為其特殊的吸引力而顯現出來。
- 洞察價值：
  - 殘差網路揭示了哪些機場之間的聯繫具有超越地理距離和規模的「額外價值」。
  - 這對於理解區域間的經濟聯繫、旅遊流動模式、以及潛在的市場機會非常有幫助。

程式碼實現細節

計算對數殘差：
- G_air.edges[v, w]['log_residual'] = math.log10(count) - math.log10(expected)：計算並添加對數殘差屬性。
篩選殘差邊：
- residual_edges = [e for e in G_air.edges if G_air.edges[e]['log_residual'] > 0]：創建一個包含所有正對數殘差邊的列表。
構建子圖：
- G_residual = G_air.edge_subgraph(residual_edges)：從原圖中提取邊子圖。
- G_residual = nx.subgraph(G_residual, ...)：（此處程式碼片段未完整展示）通常用於提取最大連通分量。
視覺化殘差網路：
- 使用與之前相同的 pos 佈局。
- 根據 log_residual 值來設定邊的透明度 alpha。
- max_weight = max([G_residual.edges[e]['log_residual'] for e in G_residual.edges])：找到殘差網路中的最大對數殘差值，用於歸一化透明度。

import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import math

# --- 假設 G_air_spatial 已載入並處理好節點屬性 (lat, long, count, mass, distance, expected, residual, log_residual) ---
# 為了程式碼的獨立性，這裡重新載入並處理部分數據
# 實際應用中，應接續前文的 G_air_spatial

# --- 重新創建模擬數據和網路 (如果前文未執行) ---
data_dir_sim = Path('./data_simulated')
carrier_data_path = data_dir_sim / 'BTS2018' / 'carrier.csv'
airport_db_path = data_dir_sim / 'partow' / 'GlobalAirportDatabase.txt'

# 模擬創建 carrier.csv
if not carrier_data_path.exists():
    data_dir_sim.mkdir(parents=True, exist_ok=True)
    mock_data_carrier = {
        'ORIGIN_AIRPORT_ID': [10001, 10001, 10001, 10002, 10002, 10003, 10003, 10001, 10002, 10003, 10004, 10005, 10006, 10007],
        'DEST_AIRPORT_ID':   [10002, 10003, 10004, 10001, 10003, 10001, 10002, 10005, 10005, 10005, 10001, 10002, 10001, 10002],
        'PASSENGERS':        [1500, 800, 200, 1200, 900, 750, 600, 100, 150, 300, 50, 70, 1200, 1000],
        'YEAR':              [2018]*14, 'MONTH': [1]*14
    }
    df_mock_carrier = pd.DataFrame(mock_data_carrier)
    df_mock_carrier.to_csv(carrier_data_path, index=False)

# 模擬創建 GlobalAirportDatabase.txt
if not airport_db_path.exists():
    data_dir_partow = data_dir_sim / 'partow'
    data_dir_partow.mkdir(parents=True, exist_ok=True)
    mock_airport_db_content = """
1:JFK:John F. Kennedy International Airport:USA:40.639751:-73.778925
2:LAX:Los Angeles International Airport:USA:33.941556:-118.408530
3:ORD:Chicago O'Hare International Airport:USA:41.974209:-87.907321
4:ATL:Hartsfield-Jackson Atlanta International Airport:USA:33.640445:-84.427700
5:DEN:Denver International Airport:USA:39.856057:-104.673737
6:SFO:San Francisco International Airport:USA:37.619002:-122.371297
7:SEA:Seattle-Tacoma International Airport:USA:47.448981:-122.309310
8:MIA:Miami International Airport:USA:25.793199:-80.290594
9:DFW:Dallas/Fort Worth International Airport:USA:32.899811:-97.040416
10:LAS:McCarran International Airport:USA:36.084030:-115.153734
11:PHX:Phoenix Sky Harbor International Airport:USA:33.434250:-112.011590
12:IAH:George Bush Intercontinental Airport:USA:29.990163:-95.336770
13:BOS:Logan International Airport:USA:42.365599:-71.009667
14:MSP:Minneapolis–Saint Paul International Airport:USA:44.884794:-93.217711
15:PHL:Philadelphia International Airport:USA:39.874399:-75.242449
16:CLT:Charlotte Douglas International Airport:USA:35.214433:-80.947171
17:EWR:Newark Liberty International Airport:USA:40.689532:-74.174492
18:DTW:Detroit Metropolitan Airport:USA:42.211501:-83.353409
19:FLL:Fort Lauderdale–Hollywood International Airport:USA:26.074199:-80.150667
20:BWI:Baltimore/Washington International Airport:USA:39.177452:-76.668409
21:SLC:Salt Lake City International Airport:USA:40.789999:-111.979111
22:SAN:San Diego International Airport:USA:32.733819:-117.193751
23:TPA:Tampa International Airport:USA:27.977474:-82.531189
24:PDX:Portland International Airport:USA:45.589802:-122.597510
25:IAD:Washington Dulles International Airport:USA:38.953054:-77.456459
26:DCA:Ronald Reagan Washington National Airport:USA:38.851111:-77.037500
27:ANC:Ted Stevens Anchorage International Airport:USA:61.174355:-149.996333 # Alaska, will be removed
28:HNL:Daniel K. Inouye International Airport:USA:21.318694:-157.924694 # Hawaii, will be removed
"""
    with open(airport_db_path, 'w') as f:
        f.write(mock_airport_db_content.strip())

# --- 重新載入並處理數據 ---
G_air_spatial = nx.Graph()
airport_locations = {}

try:
    # 載入旅客流量
    df_carrier = pd.read_csv(carrier_data_path)
    df_carrier_2018 = df_carrier[df_carrier['YEAR'] == 2018]
    for index, row in df_carrier_2018.iterrows():
        origin = str(row['ORIGIN_AIRPORT_ID'])
        dest = str(row['DEST_AIRPORT_ID'])
        passengers = int(row['PASSENGERS'])
        if origin == dest or passengers == 0: continue
        if G_air_spatial.has_edge(origin, dest):
            G_air_spatial.edges[origin, dest]['count'] += passengers
        else:
            G_air_spatial.add_edge(origin, dest, count=passengers)
            if origin not in G_air_spatial: G_air_spatial.add_node(origin)
            if dest not in G_air_spatial: G_air_spatial.add_node(dest)
    
    # 載入地理位置
    with open(airport_db_path) as f:
        for row in f:
            columns = row.strip().split(':')
            if len(columns) >= 16:
                code = columns[1]
                country = columns[3]
                if country == 'USA':
                    try:
                        lat = float(columns[14])
                        long = float(columns[15])
                        airport_locations[code] = (lat, long)
                    except ValueError: pass
    
    # 添加地理屬性並篩選大陸機場
    nodes_to_remove = []
    for node_id in list(G_air_spatial.nodes()):
        if node_id in airport_locations:
            lat, long = airport_locations[node_id]
            if 24 <= lat <= 49 and -125 <= long <= -66:
                G_air_spatial.nodes[node_id]['lat'] = lat
                G_air_spatial.nodes[node_id]['long'] = long
            else:
                nodes_to_remove.append(node_id)
        else:
            nodes_to_remove.append(node_id)
    G_air_spatial.remove_nodes_from(nodes_to_remove)

    print(f"網路準備就緒: {G_air_spatial.number_of_nodes()} 個節點, {G_air_spatial.number_of_edges()} 條邊。")

except Exception as e:
    print(f"初始化網路時發生錯誤: {e}")

# --- Haversine 距離計算函數 ---
def haversine(lat1, lon1, lat2, lon2):
    R_km = 6371 
    lat1_rad, lon1_rad, lat2_rad, lon2_rad = map(math.radians, [lat1, lon1, lat2, lon2])
    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad
    a = math.sin(dlat / 2)**2 + math.cos(lat1_rad) * math.cos(lat2_rad) * math.sin(dlon / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    distance = R_km * c
    return distance

# --- 為邊添加距離屬性 ---
for v, w in G_air_spatial.edges():
    try:
        p_v = (G_air_spatial.nodes[v]['lat'], G_air_spatial.nodes[v]['long'])
        p_w = (G_air_spatial.nodes[w]['lat'], G_air_spatial.nodes[w]['long'])
        G_air_spatial.edges[v, w]['distance'] = haversine(p_v[0], p_v[1], p_w[0], p_w[1])
    except KeyError:
        G_air_spatial.edges[v, w]['distance'] = None 
print("已為邊添加距離屬性。")

# --- 計算節點質量 (加權度數) ---
degree_data = list(G_air_spatial.degree(weight='count'))
nx.set_node_attributes(G_air_spatial, dict(degree_data), 'mass')
print("已計算節點質量 (總旅客流量)。")

# --- 簡化方法估計引力模型常數乘數 k ---
k_values = [] 
for v, w in G_air_spatial.edges():
    if v >= w: continue
    try:
        count = G_air_spatial.edges[v, w]['count']
        distance = G_air_spatial.edges[v, w]['distance']
        mass_v = G_air_spatial.nodes[v]['mass']
        mass_w = G_air_spatial.nodes[w]['mass']
        if count is not None and distance is not None and distance > 0 and \
           mass_v is not None and mass_w is not None and mass_v > 0 and mass_w > 0:
            k_val = count * (distance**2) / (mass_v * mass_w)
            k_values.append(k_val)
    except KeyError: continue

# --- 計算幾何平均數 g ---
g = 0 
valid_k_values = [val for val in k_values if val > 0] 
if valid_k_values:
    log_k_values = [math.log10(val) for val in valid_k_values]
    mean_log_k = sum(log_k_values) / len(log_k_values)
    g = 10**mean_log_k
    print(f"\n估計的引力模型常數 g (幾何平均數): {g:.4f}")
    
    # --- 計算預期流量和殘差 ---
    for v, w in G_air_spatial.edges():
        if 'distance' in G_air_spatial.edges[v, w] and \
           'count' in G_air_spatial.edges[v, w] and \
           'mass' in G_air_spatial.nodes[v] and \
           'mass' in G_air_spatial.nodes[w]:
            
            count = G_air_spatial.edges[v, w]['count']
            distance = G_air_spatial.edges[v, w]['distance']
            mass_v = G_air_spatial.nodes[v]['mass']
            mass_w = G_air_spatial.nodes[w]['mass']
            
            expected_flow = 0
            if distance is not None and distance > 0 and \
               mass_v is not None and mass_w is not None and mass_v > 0 and mass_w > 0 and g > 0:
                expected_flow = (g * mass_v * mass_w) / (distance**2)
            
            G_air_spatial.edges[v, w]['expected'] = expected_flow
            
            residual = count - expected_flow
            G_air_spatial.edges[v, w]['residual'] = residual
            
            # 計算並添加對數殘差
            if expected_flow > 0 and count > 0: # 確保對數計算有效
                log_residual = math.log10(count) - math.log10(expected_flow)
                G_air_spatial.edges[v, w]['log_residual'] = log_residual
            else:
                G_air_spatial.edges[v, w]['log_residual'] = None # 無效時設為 None
        else:
            G_air_spatial.edges[v, w]['expected'] = None
            G_air_spatial.edges[v, w]['residual'] = None
            G_air_spatial.edges[v, w]['log_residual'] = None

    print("已計算預期流量、殘差和對數殘差，並添加到邊屬性。")
    
    # --- 構建殘差網路 ---
    # 篩選出對數殘差大於 0 的邊
    residual_edges = [
        e for e in G_air_spatial.edges() 
        if G_air_spatial.edges[e].get('log_residual') is not None and G_air_spatial.edges[e]['log_residual'] > 0
    ]
    
    # 創建邊子圖
    G_residual = G_air_spatial.edge_subgraph(residual_edges)
    
    # 提取最大的連通分量 (如果需要)
    # 這裡為了簡化，直接使用邊子圖，如果需要最大連通分量，可以取消下面這行的註釋
    # if G_residual.number_of_nodes() > 0:
    #     largest_cc = max(nx.connected_components(G_residual), key=len)
    #     G_residual = G_residual.subgraph(largest_cc).copy()

    print(f"已構建殘差網路，包含 {G_residual.number_of_nodes()} 個節點和 {G_residual.number_of_edges()} 條邊。")

    # --- 視覺化殘差網路 ---
    if G_residual.number_of_edges() > 0:
        fig, ax = plt.subplots(figsize=(15, 15))
        
        # 獲取位置字典 pos (假設它已在前文計算並可用)
        # 如果 pos 未定義，則需要重新計算
        if 'pos' not in locals():
            print("警告：位置字典 'pos' 未定義，將重新計算地理佈局。")
            pos = {}
            min_lat = min(data['lat'] for _, data in G_air_spatial.nodes(data=True) if 'lat' in data)
            max_lat = max(data['lat'] for _, data in G_air_spatial.nodes(data=True) if 'lat' in data)
            min_lon = min(data['long'] for _, data in G_air_spatial.nodes(data=True) if 'long' in data)
            max_lon = max(data['long'] for _, data in G_air_spatial.nodes(data=True) if 'long' in data)
            fig_width = 15
            fig_height = 15
            for node_id, data in G_air_spatial.nodes(data=True):
                if 'lat' in data and 'long' in data:
                    lat = data['lat']
                    long = data['long']
                    x_coord = fig_width * (long - min_lon) / (max_lon - min_lon)
                    y_coord = fig_height * (lat - min_lat) / (max_lat - min_lat)
                    pos[node_id] = (x_coord, y_coord)

        # 繪製節點
        nx.draw_networkx_nodes(G_residual, pos=pos, node_color='#7f7fff', node_size=20, ax=ax)

        # 計算殘差網路中的最大對數殘差值，用於透明度
        max_log_residual = 0
        # 確保列表不為空，否則 max() 會出錯
        log_residuals_in_residual_graph = [
            data['log_residual'] for u, v, data in G_residual.edges(data=True) 
            if data.get('log_residual') is not None
        ]
        if log_residuals_in_residual_graph:
            max_log_residual = max(log_residuals_in_residual_graph)
        
        # 逐條繪製邊，根據對數殘差設定透明度
        for e in G_residual.edges():
            log_res = G_residual.edges[e].get('log_residual')
            if log_res is not None and max_log_residual > 0: # 確保計算有效
                alpha = log_res / max_log_residual
            else:
                alpha = 0.5 # 預設值

            nx.draw_networkx_edges(
                G_residual,
                pos=pos,
                edgelist=[e],
                edge_color='#7f7fff',
                alpha=alpha,
                width=1.0,
                arrows=False,
                ax=ax
            )

        ax.set_aspect(1)
        plt.title("US Air Traffic: Residual Network (Connections Exceeding Expected Traffic)", fontsize=18)
        plt.xlabel("Longitude (Projected)", fontsize=14)
        plt.ylabel("Latitude (Projected)", fontsize=14)
        plt.xlim([0, fig_width]) # 確保範圍匹配 pos 的縮放
        plt.ylim([0, fig_height])
        plt.xticks([])
        plt.yticks([])
        plt.show()
        
        print("\n殘差網路視覺化分析：")
        print("  - 殘差網路比完整地理網路更稀疏，突顯了具有額外吸引力的連接。")
        print("  - 容易識別重要節點和連接，這些節點通常是主要的商業或旅遊中心。")
        print("  - 揭示了超越距離和規模的特殊關係，例如紐約、舊金山、洛杉磯等地的機場。")
        
    else:
        print("\n殘差網路中沒有邊，無法進行視覺化。")
        
else:
    print("\n未能計算出有效的引力模型常數 g。無法進行殘差網路的構建和視覺化。")

@startuml
!define DISABLE_LINK
!define PLANTUML_FORMAT svg
!theme _none_

skinparam dpi auto
skinparam shadowing false
skinparam linetype ortho
skinparam roundcorner 5
skinparam defaultFontName "Microsoft JhengHei UI"
skinparam defaultFontSize 16
skinparam minClassWidth 100

start

:殘差網路的構建與洞察;
split
:構建殘差網路;
note right
殘差篩選標準:
  - 正殘差 (Actual > Expected)
  - 對數殘差 (log_residual > 0)
  - 意義: 實際流量顯著高於模型預期
  - 篩選條件: log_residual > 0
構建子圖:
  - 使用 G_air.edge_subgraph(residual_edges)
  - 提取包含正殘差邊及其節點的子圖
保留最大連通分量 (可選):
  - 確保分析的網路是連貫的
  - 聚焦於核心連接
end note
split again
:殘差網路的視覺化與解讀;
note right
視覺化方法:
  - 使用地理位置佈局 (pos)
  - 節點繪製
  - 邊的透明度基於對數殘差
邊的透明度 (alpha):
  - alpha = log_residual / max_log_residual
  - 殘差越大，邊越不透明
視覺化結果特徵:
  - 稀疏性: 僅保留超預期連接
  - 重要節點凸顯: 商業中心 (NYC, SF, LA), 度假勝地
  - 洞察價值: 揭示超越距離和規模的特殊關係
end note
end split

stop

@enduml

看圖說話：

此圖示總結了「殘差網路的構建與洞察」的內容，重點在於展示如何利用引力模型分析中的正殘差數據來構建一個更具資訊量的殘差網路，並透過視覺化來揭示航空網路中超越距離和規模的特殊聯繫。流程開頭首先聚焦於「構建殘差網路」，說明了篩選標準（基於對數殘差大於零），以及如何利用 edge_subgraph 方法來提取這些邊及其節點，接著詳細闡述了「殘差網路的視覺化與解讀」，強調了使用地理佈局和根據對數殘差設定邊透明度的視覺化方法，並指出了殘差網路的稀疏性以及如何透過它來識別重要的商業和旅遊節點，從而獲得更深層次的網路洞察。

結論

解構此「殘差網路」分析方法的關鍵元素可以發現，其核心價值不僅在於視覺化數據，更在於建立一個超越標準模型的洞察框架。傳統引力模型雖能解釋大部分基於規模與距離的流量，卻也因此遮蔽了由特殊商業、文化或旅遊因素驅動的「超常規」連結。殘差網路透過精準篩選出這些「正向異常值」，將分析焦點從「普遍現象」轉移至「特殊機會」，成功突破了基礎模型的詮釋瓶頸。這種方法將理論模型與經驗數據的差異轉化為可供探索的資訊資產，其整合價值在於揭示了隱藏在雜訊背後的、具有高度策略意義的網絡結構。

展望未來，這種融合領域知識模型（如引力模型）與數據科學殘差分析的策略，將不僅限於交通網路。我們預見它會廣泛應用於供應鏈韌性評估、金融市場風險傳導與跨區域人才流動等領域，成為識別隱性關聯與非對稱機會的主流分析典範之一。

玄貓認為，此分析路徑已展現出卓越的洞察效益。對於追求深度商業智慧、不滿足於表面數據的決策者與分析師而言，掌握這種從「模型餘料」中提煉黃金的思維，將是建立資訊優勢的關鍵一步。

玄貓

技術愛好者，專注於分享程式開發、雲端技術與 AI 應用的心得體會。