Applying Python Tools for Data Analysis and Validation

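Data analysis, validation, and visualization are indispensable parts of any data science project. YData Profiling quickly generates a report covering a dataset overview, column descriptions, and correlation analysis, which gives us a first look at the characteristics of the data. Great Expectations lets us define expectations about the data and validate them to ensure data quality. Plotly and PyGWalker provide interactive visualization, making it easy to explore distributions and patterns. Beyond that, the pandas ecosystem works with machine learning libraries such as scikit-learn and XGBoost, as well as the DuckDB database system, so data scientists can train models and query data more efficiently.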

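Data Analysis and Validation with YData Profiling

Understanding and validating data are critical steps in any data science workflow. YData Profiling and Great Expectations are two powerful tools that make both tasks more efficient.

YData Profiling: Generating Data Profiling Reports Quickly

YData Profiling is a library for generating data profiling reports that help us quickly understand the characteristics of a dataset. With it, we can easily produce a report containing many common visualizations that describe each column of the data.

How to Use YData Profiling

First, we need to install the YData Profiling library. This article uses ydata_profiling version 4.9.0. To generate a profiling report, run the following code, where df is a pandas DataFrame holding the vehicles data: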

from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Vehicles Profile Report")

#### Code walkthrough:

from ydata_profiling import ProfileReport: imports the ProfileReport class from YData Profiling.
profile = ProfileReport(df, title="Vehicles Profile Report"): creates a ProfileReport object from our DataFrame df and a report title.

If we run the code in a Jupyter Notebook, we can view the report directly in the notebook:

profile.to_widgets()

Alternatively, we can export the report to a local HTML file:

profile.to_file("vehicles_profile.html")

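#### Code walkthrough:

profile.to_widgets(): displays the report inside a Jupyter Notebook.
profile.to_file("vehicles_profile.html"): exports the report to a local HTML file.

The generated report has several sections, including an Overview, per-column descriptions, and correlation analysis. The Overview section lists information such as the number of missing values and duplicate rows in the data.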
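To make the Overview numbers concrete, here is a minimal sketch that counts missing cells and duplicate rows by hand using only the standard library. The sample rows and the overview helper are hypothetical, and the report's own definitions of these statistics may differ in detail.

```python
from collections import Counter

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"make": "Toyota", "city08": 28},
    {"make": "Honda", "city08": None},
    {"make": "Toyota", "city08": 28},  # exact duplicate of the first row
    {"make": None, "city08": 19},
]


def overview(records):
    """Count missing cells and duplicate rows in a list of dict records."""
    missing_cells = sum(value is None for row in records for value in row.values())
    # Turn each row into a hashable key (sorted by column name) so repeats can be counted.
    row_counts = Counter(
        tuple(sorted(row.items(), key=lambda item: item[0])) for row in records
    )
    # Every repeat beyond the first occurrence counts as one duplicate row.
    duplicate_rows = sum(count - 1 for count in row_counts.values())
    return {
        "rows": len(records),
        "missing_cells": missing_cells,
        "duplicate_rows": duplicate_rows,
    }


summary = overview(rows)
print(summary)
```

A profiling report computes figures like these for the whole DataFrame at once, alongside the per-column statistics and plots.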

Great Expectations: Data Validation

Great Expectations is a data validation library that helps us ensure data quality. With it, we can define a set of Expectations and verify whether the data meets them.

How to Use Great Expectations

First, we need to install the Great Expectations library. This article uses Great Expectations version 1.0.2. To get started, import the library and create a Context:

import great_expectations as gx
context = gx.get_context()

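#### Code walkthrough:

import great_expectations as gx: imports the Great Expectations library.
context = gx.get_context(): creates a Great Expectations Context.

Next, we need to create a Data Source and a Data Asset. A Data Source represents where the data comes from, while a Data Asset serves as a grouping mechanism for results.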

datasource = context.data_sources.add_pandas(name="pandas_datasource")
data_asset = datasource.add_dataframe_asset(name="vehicles")

#### Code walkthrough:

datasource = context.data_sources.add_pandas(name="pandas_datasource"): creates a pandas Data Source.
data_asset = datasource.add_dataframe_asset(name="vehicles"): creates a Data Asset named vehicles.

We can then create a Batch Definition and obtain a Batch from it:

batch_definition_name = "dataframe_definition"
batch_definition = data_asset.add_batch_definition_whole_dataframe(batch_definition_name)
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

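#### Code walkthrough:

batch_definition = data_asset.add_batch_definition_whole_dataframe(batch_definition_name): creates a Batch Definition that covers the whole DataFrame.
batch = batch_definition.get_batch(batch_parameters={"dataframe": df}): obtains a Batch by passing in our DataFrame df.

Finally, we can define Expectations and verify whether the data meets them. For example, we can check that a column contains no null values: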

city_exp = gx.expectations.ExpectColumnValuesToNotBeNull(column="city08")
result = batch.validate(city_exp)
print(result)

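#### Code walkthrough:

city_exp = gx.expectations.ExpectColumnValuesToNotBeNull(column="city08"): defines an expectation that the city08 column contains no null values.
result = batch.validate(city_exp): validates the data against the expectation.
print(result): prints the validation result.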
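As a rough illustration of what a not-null expectation checks, the sketch below performs the same kind of test on a plain Python list. The check_not_null helper, the sample values, and the keys of the returned dictionary are all hypothetical names chosen for this sketch. They are not a description of how Great Expectations implements the check or formats its result.

```python
def check_not_null(values):
    """Report whether any value is missing, and how many are."""
    unexpected = [index for index, value in enumerate(values) if value is None]
    total = len(values)
    return {
        "success": not unexpected,
        "element_count": total,
        "unexpected_count": len(unexpected),
        "unexpected_percent": 100 * len(unexpected) / total if total else 0.0,
    }


# Hypothetical city08 values; None stands in for a missing entry.
city08 = [19, 23, None, 31]
result = check_not_null(city08)
print(result)
```

The useful idea to take away is that a validation result carries both a pass/fail flag and enough detail to see how badly the data deviates.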

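Data Validation and Visualization

Data validation and visualization are two essential steps in data science. Validation ensures the quality and reliability of the data, while visualization helps us better understand its distribution and characteristics.

Data Validation with Great Expectations

Great Expectations is a powerful tool for defining and validating expectations about data. It lets us make a variety of assertions about the data, such as checking for null values or checking that values fall within a range.

Defining Expectations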

import pandas as pd
import great_expectations as gx

# Create the Great Expectations Data Context
context = gx.get_context()

# Load the data
df = pd.read_csv("data/vehicles.csv.zip")

# Create an Expectation Suite and register it with the context
suite = context.suites.add(gx.ExpectationSuite(name="vehicles_suite"))

# Define the expectations
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="city08"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(column="city08", min_value=0, max_value=100)
)

#### Code walkthrough:

gx.get_context(): creates the Great Expectations Data Context, which manages and stores Expectation Suites.
context.suites.add(gx.ExpectationSuite(name="vehicles_suite")): creates an Expectation Suite named vehicles_suite and registers it with the context.
ExpectColumnValuesToNotBeNull(column="city08"): defines an expectation that the city08 column contains no null values.
ExpectColumnValuesToBeBetween(column="city08", min_value=0, max_value=100): defines an expectation that the values in the city08 column lie between 0 and 100.
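The sketch below shows, in plain Python, how several checks on one column combine into a single pass/fail outcome. The function names and sample values are hypothetical, and skipping missing values in the range check is an assumption of this sketch rather than a statement about the library's behaviour.

```python
def has_no_nulls(values):
    """True when no value is missing."""
    return all(value is not None for value in values)


def all_between(values, min_value, max_value):
    """True when every non-missing value lies in [min_value, max_value]."""
    return all(min_value <= value <= max_value for value in values if value is not None)


def run_checks(values, checks):
    """Run each named check and report per-check and overall success."""
    results = {name: check(values) for name, check in checks.items()}
    return {"success": all(results.values()), "results": results}


# Hypothetical city08 values: one missing entry and one out-of-range entry.
city08 = [19, 23, None, 140]
report = run_checks(
    city08,
    {
        "not_null": has_no_nulls,
        "between_0_and_100": lambda vals: all_between(vals, 0, 100),
    },
)
print(report)
```

Grouping checks this way is the idea behind a suite: the column passes only if every individual expectation passes, and the per-check results show which one failed.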

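Data Visualization

Visualization is an important way to understand data. Beyond static charts, interactive visualization tools such as Plotly and PyGWalker offer a richer experience.

Interactive Visualization with Plotly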

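import pandas as pd
import plotly.express as px

# Load the data
df = pd.read_csv("data/vehicles.csv.zip")

# Create a scatter plot
fig = px.scatter(df, x="city08", y="highway08", hover_data=["make", "model", "year"])
fig.show()

#### Code walkthrough:

px.scatter(): creates a scatter plot with Plotly Express.
x="city08" and y="highway08": specify the columns for the x-axis and y-axis.
hover_data=["make", "model", "year"]: specifies the extra fields shown when hovering over a point.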

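Interactive Data Exploration with PyGWalker

import pandas as pd
import pygwalker as pyg

# Load the data
df = pd.read_csv("data/vehicles.csv.zip")

# Launch PyGWalker
pyg.walk(df)

#### Code walkthrough:

pyg.walk(df): launches PyGWalker, which provides an interactive interface where users drag and drop columns to build charts.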

Data Science and Machine Learning

pandas works closely with many data science and machine learning libraries, such as scikit-learn.

Linear Regression with scikit-learn

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the data and keep only the numeric columns
df = pd.read_csv("data/vehicles.csv.zip")
num_df = df.select_dtypes(include=["number"]).dropna()

# Separate features and target, then split into training and test sets
X = num_df.drop(columns=["comb08"])
y = num_df["comb08"]
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Create the linear regression model and train it
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

#### Code walkthrough:

select_dtypes(include=["number"]): selects the numeric columns.
dropna(): drops rows that contain missing values.
train_test_split(X, y): splits the data into training and test sets.
LinearRegression(): creates the linear regression model.
model.fit(X_train, y_train): trains the model.
model.predict(X_test): uses the model to make predictions.
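When neither test_size nor train_size is given, train_test_split holds out 25% of the rows for testing, and passing random_state makes the shuffle reproducible. The standard-library sketch below illustrates the general idea of such a split. The simple_train_test_split helper is hypothetical and is not scikit-learn's implementation.

```python
import random


def simple_train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # a fixed seed keeps the split reproducible
    n_test = max(1, round(len(rows) * test_fraction))
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


# Hypothetical data: twenty rows identified by their position.
rows = list(range(20))
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

Because the model never sees the held-out rows during fitting, its predictions on them give an estimate of how it behaves on new data.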

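With these tools and techniques, we can validate data, visualize it, and build machine learning models effectively, and so understand and use the data better.

Machine Learning and Databases in the pandas Ecosystem

pandas provides powerful data processing capabilities, and combining it with machine learning and database technology extends its reach further. This section shows how to use scikit-learn and XGBoost for machine learning, and how to integrate the DuckDB database system with pandas.

Regression Analysis with scikit-learn

scikit-learn is a widely used machine learning library for Python that offers a rich set of algorithms and tools. First, we use a LinearRegression model to run a regression on the vehicles dataset: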

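from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# X and y are the features and target prepared in the previous example
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Create and train the LinearRegression model
regr = LinearRegression()
regr.fit(X_train, y_train)

# Predict on the test set
y_pred = regr.predict(X_test)

# Evaluate the model's error (lower is better)
print(mean_squared_error(y_test, y_pred))

#### Code walkthrough:

Data splitting: train_test_split divides the dataset into training and test sets so that we can assess how well the model generalizes.
Model training: a LinearRegression model is created and fitted on the training data.
Prediction and evaluation: the model predicts on the test set, and mean_squared_error measures the prediction error, where a lower value indicates a better fit.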
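For reference, the mean squared error is the average of the squared differences between actual and predicted values. The sketch below computes it by hand on hypothetical numbers. The helper name and the sample values are made up for illustration.

```python
import math


def mean_squared_error_by_hand(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted fuel-economy values.
y_test = [30.0, 22.0, 18.0, 25.0]
y_pred = [28.0, 24.0, 18.0, 21.0]

mse = mean_squared_error_by_hand(y_test, y_pred)
rmse = math.sqrt(mse)  # the square root puts the error back in the target's units
print(mse, rmse)
```

Because the errors are squared, a single large miss (here the last prediction, off by 4) contributes far more to the total than several small ones.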

Classification with XGBoost

XGBoost is another powerful machine learning library, built around gradient boosting. Here we use it to classify vehicles by make:

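import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Keep only the rows for selected makes (df is the vehicles DataFrame loaded earlier)
brands = {"Dodge", "Toyota", "Volvo", "BMW", "Buick", "Audi", "Volkswagen", "Subaru"}
df2 = df[df["make"].isin(brands)]

# Use the numeric columns as features and convert the target to a categorical type
X = df2.select_dtypes(include=["number"])
y = df2["make"].astype(pd.CategoricalDtype())

# Split the dataset, using the integer category codes as labels, and train an XGBClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y.cat.codes)
bst = XGBClassifier()
bst.fit(X_train, y_train)

# Predict and evaluate the model's accuracy
preds = bst.predict(X_test)
accuracy = (preds == y_test).sum() / len(y_test)
print(f"Model prediction accuracy is: {accuracy:.2%}")

#### Code walkthrough:

Data filtering: only rows for the selected makes are kept for the classification task.
Target conversion: the target variable is converted to a categorical type, and its integer codes are passed to the XGBoost model as class labels.
Model training and evaluation: an XGBClassifier is trained, and its accuracy is the share of test rows it predicts correctly.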
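The two less obvious steps here are turning text labels into integer codes and computing accuracy. The standard-library sketch below mimics both on hypothetical data. Sorting the distinct labels mirrors how codes typically come out when pandas infers the categories, but this is an illustration, not pandas' implementation.

```python
def to_codes(labels):
    """Map each distinct label to an integer code based on sorted order."""
    categories = sorted(set(labels))
    lookup = {label: code for code, label in enumerate(categories)}
    return categories, [lookup[label] for label in labels]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical makes and hypothetical model predictions (as codes).
makes = ["Toyota", "Audi", "BMW", "Audi"]
categories, codes = to_codes(makes)
hypothetical_preds = [2, 0, 0, 0]

acc = accuracy(codes, hypothetical_preds)
print(categories, codes, f"{acc:.2%}")
```

Keeping the categories list around makes it possible to translate a predicted code back into a make name, for example categories[2] for the first row.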

SQL Queries with DuckDB

DuckDB is a lightweight, in-process database system that integrates seamlessly with pandas and provides SQL querying. We can use it to query a pandas DataFrame directly:

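import duckdb

# Load the DataFrame into a DuckDB table
# (df is the pandas DataFrame loaded earlier; DuckDB finds it by its variable name)
duckdb.sql("CREATE TABLE vehicles AS SELECT * FROM df")

# Run a SQL query
result = duckdb.sql("SELECT COUNT(*) FROM vehicles WHERE make = 'Honda'")

# Convert the query result to a DataFrame
result_df = result.df()

#### Code walkthrough:

Data loading: the pandas DataFrame is loaded into a DuckDB table.
SQL query: standard SQL syntax is used to query the table.
Result conversion: the query result is converted back to a pandas DataFrame for further analysis.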

玄貓

A technology enthusiast focused on sharing insights about software development, cloud technology, and AI applications.