pandas Data Loading and Memory Optimization Techniques

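Memory management matters whenever you work with large datasets. pandas offers several ways to reduce memory use while reading and processing data. The dtype, usecols, and chunksize parameters of read_csv give precise control over how data is loaded (read_excel accepts dtype and usecols, but has no chunksize parameter). pandas can also read a specific worksheet, skip rows and columns, and load Excel data with hierarchical headers and indexes. For SQL databases, pandas works together with SQLAlchemy to read and write tables.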

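Optimizing data loading and memory use in pandas

When you load a large CSV file, how the data is read largely determines how much memory it occupies. This article shows how to combine read_csv with explicit data types to lower memory use without altering the data.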

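Choosing appropriate data types

By default, pandas infers each column's data type from its contents, typically 64-bit integers and floats. Specifying narrower types explicitly can cut memory use considerably. The following example passes a dtype mapping to read_csv: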

import pandas as pd

# Define the data type of each column
dtypes = {
    "carat": pd.Float32Dtype(),
    "cut": pd.StringDtype(),
    "color": pd.StringDtype(),
    "clarity": pd.StringDtype(),
    "depth": pd.Float32Dtype(),
    "table": pd.Float32Dtype(),
    "price": pd.Int16Dtype(),
    "x": pd.Float32Dtype(),
    "y": pd.Float32Dtype(),
    "z": pd.Float32Dtype(),
}

# Read the CSV file with the explicit data types
df2 = pd.read_csv(
    "data/diamonds.csv",
    nrows=1_000,
    dtype=dtypes
)

# Inspect column types and memory usage
df2.info()

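Walkthrough:

pd.Float32Dtype() stores the floating-point columns (carat, depth, table, x, y, z) as nullable 32-bit floats, half the width of the default 64-bit type.
pd.StringDtype() stores the text columns (cut, color, clarity) in a dedicated string type with consistent missing-value handling.
pd.Int16Dtype() stores the integer column (price) as a nullable 16-bit integer. This is only safe when every value fits in the range -32,768 to 32,767.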
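The effect of the narrower types can be measured with memory_usage(deep=True). The sketch below is a minimal illustration under stated assumptions: the CSV text is hypothetical sample data standing in for data/diamonds.csv, and only three of the columns are included.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for data/diamonds.csv (1,000 rows)
csv_text = "carat,cut,price\n" + "0.23,Ideal,326\n0.21,Premium,327\n" * 500

# Let pandas infer the column types
inferred = pd.read_csv(io.StringIO(csv_text))

# Read the same text with explicit, narrower nullable types
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={
        "carat": pd.Float32Dtype(),
        "cut": pd.StringDtype(),
        "price": pd.Int16Dtype(),
    },
)

# Per-column memory in bytes; deep=True also counts string contents
inferred_bytes = inferred.memory_usage(deep=True)
explicit_bytes = explicit.memory_usage(deep=True)
print(inferred_bytes[["carat", "price"]])
print(explicit_bytes[["carat", "price"]])
```

Comparing the two printouts column by column shows where the savings come from; the numeric columns shrink because each value needs fewer bytes.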

Verifying data integrity

After changing the data types, use describe() to check that the values survived the conversion:

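# Baseline: the same rows read with pandas' inferred data types
df = pd.read_csv("data/diamonds.csv", nrows=1_000)

# Compare summary statistics before and after the conversion
print(df.describe())
print(df2.describe())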

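Walkthrough:

Compare the summary statistics of the original data (df) and the converted data (df2): count, minimum, maximum, mean, and standard deviation.
If they agree, allowing for small rounding differences in the 32-bit float columns, the conversion has not changed the data in any meaningful way.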
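The describe() comparison catches problems after the conversion. A complementary check, sketched below, tests whether a column fits the target integer type before narrowing it. The helper name fits_in and the sample prices are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical price values; the last one is too large for 16 bits
prices = pd.Series([326, 18_823, 40_000], name="price")


def fits_in(series: pd.Series, np_dtype) -> bool:
    """Return True if every value lies within the range of the integer type."""
    info = np.iinfo(np_dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)


# int16 covers -32768..32767, so 40_000 does not fit
fits_16 = fits_in(prices, np.int16)
fits_32 = fits_in(prices, np.int32)
print(fits_16, fits_32)
```

When the check fails for the intended type, the next wider type (here 32-bit) is the safer choice.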

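Using categorical types for low-cardinality columns

Text columns with few distinct values (such as cut, color, and clarity) can be converted to a categorical type to save more memory: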

# Convert the string columns to a categorical type
cat_cols = ["cut", "color", "clarity"]
df3 = pd.read_csv(
    "data/diamonds.csv",
    nrows=1_000,
    dtype=dtypes
)
df3[cat_cols] = df3[cat_cols].astype(pd.CategoricalDtype())

# Inspect the converted DataFrame
df3.info()

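Walkthrough:

pd.CategoricalDtype() stores each distinct label once and represents every row by a small integer code, which significantly lowers memory use for low-cardinality text columns.
Categoricals pay off when values repeat often, as they do in cut, color, and clarity.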
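To see why repeated labels compress well, the following sketch compares the memory of a string column with its categorical version. The series contents are hypothetical; they only mimic a column with three distinct labels.

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct labels, 3,000 rows
cut = pd.Series(["Ideal", "Premium", "Good"] * 1_000, dtype=pd.StringDtype())
cut_cat = cut.astype(pd.CategoricalDtype())

# Bytes used by each representation, including the string contents
string_bytes = cut.memory_usage(deep=True)
category_bytes = cut_cat.memory_usage(deep=True)
print(string_bytes, category_bytes)

# The distinct labels are stored once; rows hold integer codes
categories = cut_cat.cat.categories.tolist()
print(categories)
```

The saving shrinks as the number of distinct values approaches the number of rows, so a column of unique identifiers gains nothing from this conversion.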

Skipping unneeded columns

If some columns are not needed for the analysis, the usecols parameter keeps them from being loaded at all:

# Define the columns to keep and their data types
dtypes = {
    "carat": pd.Float32Dtype(),
    "cut": pd.StringDtype(),
    "color": pd.StringDtype(),
    "clarity": pd.StringDtype(),
    "depth": pd.Float32Dtype(),
    "table": pd.Float32Dtype(),
    "price": pd.Int16Dtype(),
}

# Load only the listed columns from the CSV file
df4 = pd.read_csv(
    "data/diamonds.csv",
    nrows=1_000,
    dtype=dtypes,
    usecols=dtypes.keys(),
)

# Convert the string columns to a categorical type
df4[cat_cols] = df4[cat_cols].astype(pd.CategoricalDtype())

# Inspect the final DataFrame
df4.info()

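Walkthrough:

usecols restricts which columns are read, so x, y, and z never reach memory.
Combining it with the categorical conversion lowers memory use further.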
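As one possible variation, the categorical type can be requested directly in the dtype mapping, which avoids the separate astype step. The sketch below assumes a small hypothetical CSV in place of data/diamonds.csv.

```python
import io

import pandas as pd

# Hypothetical two-row sample with the same column names as the example
csv_text = (
    "carat,cut,price,x,y,z\n"
    "0.23,Ideal,326,3.95,3.98,2.43\n"
    "0.21,Premium,326,3.89,3.84,2.31\n"
)

# Columns to keep, with the categorical type requested at read time
wanted = {
    "carat": pd.Float32Dtype(),
    "cut": pd.CategoricalDtype(),
    "price": pd.Int16Dtype(),
}

subset = pd.read_csv(io.StringIO(csv_text), dtype=wanted, usecols=list(wanted))
print(subset.dtypes)
```

Passing list(wanted) keeps the column selection and the type mapping in one place, so the two cannot drift apart.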

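The pandas I/O system: the key to efficient data handling

The pandas I/O system reads and writes many data formats, including CSV and Excel. This section looks at how to use it on larger datasets and introduces some practical techniques.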

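Optimizing memory use for large datasets

Memory use is a major consideration with large datasets. pandas offers several ways to reduce it, starting with the dtype parameter for specifying data types.

Specifying data types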

dtypes = {
    "carat": pd.Float32Dtype(),
    "cut": pd.StringDtype(),
    "color": pd.StringDtype(),
    "clarity": pd.StringDtype(),
    "depth": pd.Float32Dtype(),
    "table": pd.Float32Dtype(),
    "price": pd.Int16Dtype(),
}

df = pd.read_csv("data/diamonds.csv", dtype=dtypes, usecols=dtypes.keys())

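Walkthrough:

The dtypes dictionary assigns a data type to each column, and its keys double as the list of columns to load.
pd.Float32Dtype() and pd.Int16Dtype() select narrower floating-point and integer types to reduce memory use.
pd.StringDtype() gives text columns a dedicated string type with consistent missing-value handling; the larger memory savings for such columns come from converting them to categoricals.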

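Processing data in chunks

For very large datasets, the chunksize parameter makes read_csv return an iterator that yields the data one chunk at a time.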

df_iter = pd.read_csv(
    "data/diamonds.csv",
    nrows=1_000,
    dtype=dtypes,
    usecols=dtypes.keys(),
    chunksize=200
)

for df in df_iter:
    cat_cols = ["cut", "color", "clarity"]
    df[cat_cols] = df[cat_cols].astype(pd.CategoricalDtype())
    print(f"processed chunk of shape {df.shape}")

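Walkthrough:

chunksize=200 splits the data into chunks of 200 rows; with nrows=1_000 the loop runs five times.
Each chunk's low-cardinality columns are converted to a categorical type to reduce memory use.
Because only one chunk is held in memory at a time, this approach avoids running out of memory on files that do not fit at once.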
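Chunked reading is most useful when each chunk is reduced to a small result before the next one is loaded. The sketch below computes a mean across chunks by accumulating a running total and row count; the CSV text is hypothetical sample data.

```python
import io

import pandas as pd

# Hypothetical data: 10 rows with prices 1 through 10
csv_text = "cut,price\n" + "".join(f"Ideal,{i}\n" for i in range(1, 11))

total = 0
rows = 0

# The reader is a context manager, so the source is closed when the loop ends
with pd.read_csv(io.StringIO(csv_text), chunksize=4) as reader:
    for chunk in reader:
        # Keep only the aggregate, not the chunk itself
        total += int(chunk["price"].sum())
        rows += len(chunk)

mean_price = total / rows
print(mean_price)
```

Only the two accumulators persist between iterations, so peak memory depends on the chunk size rather than on the file size.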

Selecting columns with a callable

The usecols parameter also accepts a callable. pandas calls it with each column name and loads the columns for which it returns True.

def startswith_c(column_name: str) -> bool:
    return column_name.startswith("c")

pd.read_csv(
    "data/diamonds.csv",
    dtype_backend="numpy_nullable",
    usecols=startswith_c,
)

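Walkthrough:

startswith_c checks whether a column name starts with "c".
usecols=startswith_c loads only the columns whose names start with "c".
dtype_backend="numpy_nullable" (available since pandas 2.0) asks pandas to use nullable extension types for the columns it infers.
Columns rejected by the callable are never added to the DataFrame, which reduces memory use.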
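A callable is also handy when the header's capitalization is not known in advance. The following sketch uses a lambda that compares lower-cased names against a set; the CSV header and the wanted set are hypothetical.

```python
import io

import pandas as pd

# Hypothetical file whose header uses inconsistent capitalization
csv_text = "Carat,CUT,price,Depth\n0.23,Ideal,326,61.5\n"

# Names to keep, compared case-insensitively
wanted = {"carat", "cut"}

df_sel = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda name: name.lower() in wanted,
)

selected = list(df_sel.columns)
print(selected)
```

The selected columns keep the spelling found in the file, so later code should refer to them by those original names.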

Working with Microsoft Excel data

pandas can also read and write Microsoft Excel files.

Basic reading and writing

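import io

import pandas as pd

df = pd.DataFrame([
    ["Paul", "McCartney", 1942],
    ["John", "Lennon", 1940],
    ["Richard", "Starkey", 1940],
    ["George", "Harrison", 1943],
], columns=["first", "last", "birth"])

# Write the workbook to an in-memory buffer instead of a file on disk
buf = io.BytesIO()
df.to_excel(buf, index=False)
# Rewind the buffer so read_excel starts from the beginning
buf.seek(0)
pd.read_excel(buf, dtype_backend="numpy_nullable")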

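Walkthrough:

A small DataFrame is created and written with to_excel into an io.BytesIO() buffer. Writing .xlsx data requires an Excel engine such as openpyxl to be installed.
pd.read_excel reads the Excel data back from the buffer.
index=False keeps the index from being written as an extra column.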

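Finding Excel tables in non-default locations

Data in an Excel file does not always sit on the first worksheet or start at the first cell. pandas provides parameters for telling read_excel where to look.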

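# Write the DataFrame again, this time including the index (the default)
buf = io.BytesIO()
df.to_excel(buf)
buf.seek(0)
# index_col=0 reads the first column back as the index
pd.read_excel(buf, dtype_backend="numpy_nullable", index_col=0)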

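Walkthrough:

index_col=0 makes the first column of the sheet the index.
This is the right choice when the file was written with its index (the to_excel default); pandas would otherwise read that column back as an extra column named "Unnamed: 0".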

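Reading specific data from an Excel file with pandas

Excel files often require reading from a particular worksheet or ignoring certain rows and columns. The read_excel function supports both.

Reading data from a specific worksheet

Suppose an Excel file beatles.xlsx contains several worksheets. We want to read the worksheet named "the_data", ignore its first four rows, and skip the columns we do not need.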

import pandas as pd

# Read one worksheet, skip the first four rows, and keep only columns C to E
df = pd.read_excel(
    "data/beatles.xlsx",
    dtype_backend="numpy_nullable",
    sheet_name="the_data",
    skiprows=4,
    usecols="C:E",
)

print(df)

Output:

     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

Walkthrough:

sheet_name="the_data": the name of the worksheet to read.
skiprows=4: ignore the first four rows, so the header is taken from the fifth row.
usecols="C:E": keep Excel columns C through E.
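Since data/beatles.xlsx is not included here, the layout described above can be reproduced in memory. The sketch below is an assumption about what such a file might look like: it writes a decoy first sheet (the name "readme" is hypothetical) and places the table at cell C5 of "the_data" using the startrow and startcol parameters of to_excel. It needs an Excel engine such as openpyxl.

```python
import io

import pandas as pd

beatles = pd.DataFrame(
    [["Paul", "McCartney", 1942], ["John", "Lennon", 1940]],
    columns=["first", "last", "birth"],
)

buf = io.BytesIO()
with pd.ExcelWriter(buf) as writer:
    # Decoy first sheet, so the data is not on the default sheet
    pd.DataFrame({"note": ["see the_data"]}).to_excel(
        writer, sheet_name="readme", index=False
    )
    # startrow=4 and startcol=2 put the header row at cell C5
    beatles.to_excel(
        writer, sheet_name="the_data", startrow=4, startcol=2, index=False
    )

# Rewind, then read the table back from its non-default location
buf.seek(0)
result = pd.read_excel(buf, sheet_name="the_data", skiprows=4, usecols="C:E")
print(result)
```

The startrow and startcol values are zero-based offsets, which is why 4 and 2 correspond to the fifth row and column C.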

Selecting columns by label

Besides column letters (such as "C:E"), usecols also accepts the column labels themselves.

df = pd.read_excel(
    "data/beatles.xlsx",
    dtype_backend="numpy_nullable",
    sheet_name="the_data",
    skiprows=4,
    usecols=["first", "last", "birth"],
)

print(df)

The output is the same as in the previous example.

Walkthrough:

usecols=["first", "last", "birth"]: selects the columns by their header labels.

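Handling hierarchical Excel data

Data analysis often involves hierarchical data, such as region/sub-region and year/quarter. pandas can read this kind of data and preserve its hierarchy.

Reading hierarchical Excel data

Suppose an Excel file hierarchical.xlsx contains data with hierarchical row and column headers.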

df = pd.read_excel(
    "data/hierarchical.xlsx",
    dtype_backend="numpy_nullable",
    index_col=[0, 1],
    header=[0, 1],
)

print(df)

Output:

Year                2024            2025
Quarter               Q1      Q2      Q1       Q2
Region  Sub-Region
America East           1       2       4        8
        West          16      32      64      128
        South        256     512    1024     4096
Europe  West        8192   16384   32768    65536
        East      131072  262144  524288  1048576

Walkthrough:

index_col=[0, 1]: uses the first two columns as a hierarchical row index.
header=[0, 1]: uses the first two rows as a hierarchical column index.

Selecting data with the hierarchy

With hierarchical indexes in place, selections can target any level of the rows and columns.

print(df.loc[(slice(None), "East"), (slice(None), "Q2")])

Output:

Year                 2024     2025
Quarter                Q2       Q2
Region  Sub-Region
America East            2        8
Europe  East       262144  1048576
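The tuple-of-slices syntax can be hard to read. As a sketch of two alternatives, pd.IndexSlice expresses the same selection with ordinary slice notation, and xs selects a single value on one level. The DataFrame below is built in memory with hypothetical values so that it runs without hierarchical.xlsx.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("America", "East"), ("America", "West"),
     ("Europe", "East"), ("Europe", "West")],
    names=["Region", "Sub-Region"],
)
columns = pd.MultiIndex.from_product(
    [[2024, 2025], ["Q1", "Q2"]], names=["Year", "Quarter"]
)

# Hypothetical values, chosen only to make the selections easy to follow
sales = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    index=index,
    columns=columns,
)

# Same selection as (slice(None), "East"), (slice(None), "Q2")
idx = pd.IndexSlice
east_q2 = sales.loc[idx[:, "East"], idx[:, "Q2"]]
print(east_q2)

# xs picks one value on a level and drops that level from the result
east_only = sales.xs("East", level="Sub-Region")
print(east_only)
```

Both forms rely on the level order of the index, so naming the levels, as done here, makes the intent easier to verify.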

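Notes on the output:

The output above shows hierarchical data in which Region and Sub-Region form the row index while Year and Quarter form the column index. This structure makes multi-level analysis straightforward.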

@startuml
skinparam backgroundColor #FEFEFE
skinparam componentStyle rectangle

title pandas Data Loading and Memory Optimization Techniques

package "pandas data processing" {
    package "Data structures" {
        component [Series\n1-D array] as series
        component [DataFrame\n2-D table] as df
        component [Index] as index
    }

    package "Data operations" {
        component [Selection] as select
        component [Filtering] as filter
        component [GroupBy] as group
        component [Merge/Join] as merge
    }

    package "Data transformation" {
        component [Reshape] as reshape
        component [Pivot] as pivot
        component [Aggregation] as agg
    }
}

series --> df : composes
index --> df : indexes
df --> select : loc/iloc
df --> filter : boolean indexing
df --> group : grouped operations
group --> agg : aggregation functions
df --> merge : merge data
df --> reshape : melt/stack
reshape --> pivot : reorganize

note right of df
  Core data structure
  Similar to an Excel sheet
end note

@enduml

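Diagram notes: the diagram gives an overview of the core pandas data structures (Series, DataFrame, Index) and shows how a DataFrame feeds selection, filtering, grouping, merging, and reshaping operations.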

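Interacting with SQL databases through SQLAlchemy

pandas can exchange data with SQL databases for analysis. It relies on SQLAlchemy, a Python SQL toolkit, for the database connection.

Using pandas with a SQL database

First create a connection to the database with SQLAlchemy. Then use the pandas read_sql function to load data from it.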

from sqlalchemy import create_engine
import pandas as pd

# Create the database connection (the URL parts are placeholders)
engine = create_engine('postgresql://username:password@host:port/dbname')

# Read data from the database
df = pd.read_sql('SELECT * FROM table_name', con=engine)

print(df)

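Code notes:

create_engine sets up the connection to a PostgreSQL database.
read_sql runs the query and returns all rows of the named table as a DataFrame.

Caveats:

Replace username, password, host, port, dbname, and table_name with your actual connection details and table name. The port must be a number, and a PostgreSQL driver such as psycopg2 must be installed.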
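To try the round trip without a database server, the sketch below swaps PostgreSQL for an in-memory SQLite database, which is an assumption made purely for illustration. The table name "beatles" is hypothetical. It writes a DataFrame with to_sql and reads part of it back with read_sql.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database, used only so the sketch runs without a server
engine = create_engine("sqlite:///:memory:")

beatles = pd.DataFrame(
    {
        "first": ["Paul", "John"],
        "last": ["McCartney", "Lennon"],
        "birth": [1942, 1940],
    }
)

# Write the DataFrame to a table; index=False skips the index column
beatles.to_sql("beatles", con=engine, index=False, if_exists="replace")

# Read back only the rows and columns the query asks for
older = pd.read_sql(
    "SELECT first, birth FROM beatles WHERE birth < 1942", con=engine
)
print(older)
```

Filtering and selecting columns in the SQL query keeps unneeded data out of memory, in the same spirit as usecols for CSV files.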

玄貓 BlackCat

A technology enthusiast who shares notes on software development, cloud technology, and AI applications.