# An Interactive Development Workflow with Pandas and a Database

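Pandas provides convenient tools for data processing and analysis. Combined with a PostgreSQL database and Docker containers, it can make a data science project noticeably more efficient to develop. Using the Adult dataset as an example, this article walks through loading data with Pandas, running a first analysis, and storing the data in a PostgreSQL database. Docker provides a reproducible development environment that keeps the project consistent and portable, and Jupyter Notebook supports interactive development and testing of the code.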

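### Loading the dataset with pandas

First, load the Adult dataset from its CSV file. To control how much data is loaded, use `random.sample` to decide which rows to skip, so that only a smaller random sample of the original dataset is read for analysis.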

```python
import pandas as pd
import random

# Total number of rows in the dataset file and the desired sample size
number_of_rows = 32562
sample_size = 3300

# Pick the indices of the rows to skip
rows_to_skip = random.sample(range(number_of_rows), number_of_rows - sample_size)
rows_to_skip.sort()

# Read the CSV file with pandas, skipping the selected rows
adult_df = pd.read_csv('data/adult.data', header=None, skiprows=rows_to_skip)

# Name the DataFrame's columns
adult_df.columns = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education_num',
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'gender',
    'capital_gain',
    'capital_loss',
    'hours_per_week',
    'native_country',
    'income_label'
]
```

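#### Code walkthrough:
1. **`random.sample(range(number_of_rows), number_of_rows - sample_size)`**: randomly selects, from all row indices in the file, the `number_of_rows - sample_size` rows to skip. The rows that are left form the sample.
2. **`rows_to_skip.sort()`**: sorts the indices of the rows to skip. As far as the pandas documentation states, `skiprows` accepts any list-like of line numbers, so the sort mainly keeps the list easy to inspect.
3. **`pd.read_csv`**: the pandas function for reading CSV files. Passing `skiprows` skips the listed rows, so only a subset of the dataset is loaded.
4. **`adult_df.columns`**: gives the DataFrame's columns meaningful names, taken from the dataset's description.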
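
To see the skip-based sampling in isolation, the following minimal sketch runs it against a small in-memory CSV instead of `data/adult.data`. The 20-row file, the seed, and the `sample_csv` helper are assumptions made for illustration and are not part of the original project.

```python
import io
import random

import pandas as pd


def sample_csv(csv_text, number_of_rows, sample_size, seed=None):
    """Read `sample_size` random rows by skipping all the others."""
    rng = random.Random(seed)
    # Choose the row indices to skip; the rows left over form the sample.
    rows_to_skip = rng.sample(range(number_of_rows), number_of_rows - sample_size)
    return pd.read_csv(io.StringIO(csv_text), header=None, skiprows=rows_to_skip)


# Hypothetical 20-row file with two columns: an id and its square.
csv_text = "\n".join(f"{i},{i * i}" for i in range(20))
sample_df = sample_csv(csv_text, number_of_rows=20, sample_size=5, seed=0)
sample_df.columns = ["id", "square"]
print(len(sample_df))
```

Because the skipped indices are drawn without replacement, exactly `number_of_rows - sample_size` distinct rows are dropped, which is why the sample size is predictable as long as `number_of_rows` matches the file.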

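### Checking data types and preparing the database schema

Once the data is loaded, check its types to decide how to store it in PostgreSQL. To do so, compare the metadata from the UCI Machine Learning Repository with the data types of the pandas DataFrame.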

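```python
# Show a few rows of the DataFrame
print(adult_df.sample(3))

# Check the DataFrame's data types
print(adult_df.dtypes)
```

#### Code walkthrough:

1. **`adult_df.sample(3)`**: shows three random rows of the DataFrame as a quick check that the data loaded correctly.
2. **`adult_df.dtypes`**: prints the data type of each column, which is the starting point for defining the schema in PostgreSQL.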
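
One way to turn the `dtypes` output into a schema draft is a small lookup from dtype names to PostgreSQL column types. The sketch below is only an illustration: the `PG_TYPES` mapping and the `create_table_sql` helper are assumptions, and the two-column dtype dictionary is hard-coded so the example runs without the dataset.

```python
# Assumed mapping from pandas dtype names to PostgreSQL column types.
PG_TYPES = {
    "int64": "INTEGER",
    "float64": "DOUBLE PRECISION",
    "bool": "BOOLEAN",
    "object": "TEXT",
}


def create_table_sql(table_name, dtypes):
    """Build a CREATE TABLE statement from a {column: dtype name} mapping."""
    columns = [
        # Unknown dtype names fall back to TEXT.
        f"    {column} {PG_TYPES.get(dtype, 'TEXT')}"
        for column, dtype in dtypes.items()
    ]
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(columns) + "\n);"


# With pandas this mapping could come from adult_df.dtypes.astype(str).to_dict();
# a two-column excerpt is hard-coded here so the sketch runs on its own.
example_dtypes = {"age": "int64", "workclass": "object"}
print(create_table_sql("adult", example_dtypes))
```

The generated statement is a draft to review by hand. For example, `INTEGER` is enough for the Adult columns, but a column with larger values would need `BIGINT`.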

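### Managing the project with Git

Before continuing, commit the changes made so far with Git. Check the project's status first, then commit the initial docker-compose.yml file and the schema preparation work as two separate commits.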

```bash
# Check the project status
git status

# Add and commit the initial docker-compose.yml file
git add docker-compose.yml
git commit -m 'initial docker-compose.yml file'

# Add and commit the schema preparation work
git add -A
git commit -m 'schema preparation work'
```

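#### Code walkthrough:

1. **`git status`**: shows the current state of the Git repository, including which files were modified or added.
2. **`git add`**: adds files or changes to the staging area in preparation for a commit.
3. **`git commit -m 'message'`**: commits the staged changes to the repository with a commit message.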

### Adding the database to the application

To store the Adult dataset in PostgreSQL, extend the docker-compose.yml file with a PostgreSQL service and a data volume.

```yaml
version: '3'
services:
  this_jupyter:
    build: docker/jupyter
    ports:
      - "8888:8888"
    volumes:
      - .:/home/jovyan
  this_postgres:
    build: docker/postgres
    volumes:
      - postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data:
```

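#### Code walkthrough:

1. **`this_jupyter` service**: builds its image from the local Dockerfile in `docker/jupyter`, maps port 8888, and mounts the project directory at `/home/jovyan`.
2. **`this_postgres` service**: also builds its image from a local Dockerfile, in `docker/postgres`, and mounts the named volume `postgres_data` at `/var/lib/postgresql/data`.
3. **`volumes`**: declares the named volume `postgres_data`, which persists the PostgreSQL data.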

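The next steps are to finish initializing the database, develop the application logic, and carry out the necessary testing and tuning. This includes, but is not limited to:

- Importing the Adult dataset into the PostgreSQL database.
- Developing data analysis and visualization in Jupyter Notebook.
- Performance testing and optimizing the application.

These steps round out the application so that it can process and analyze the Adult dataset effectively.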

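### Security considerations

During development, pay particular attention to the following points:

- Do not hard-code sensitive data, such as the database password, in the source code.
- Use a secure connection (such as SSL/TLS) to access the database.
- Validate and sanitize user input to prevent SQL injection attacks.

Following these practices helps keep the application secure and reliable.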
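
Two of these points can be sketched in a few lines of Python: reading credentials from environment variables and passing user input as query parameters. The helper names `connection_settings` and `count_by_workclass` are hypothetical, and the choice of the `PGHOST`, `PGDATABASE`, `PGUSER`, and `PGPASSWORD` variable names is an assumption made for this sketch.

```python
import os


def connection_settings(env=os.environ):
    """Collect connection settings from environment variables."""
    return {
        "host": env.get("PGHOST", "this_postgres"),
        "database": env.get("PGDATABASE", "postgres"),
        "user": env.get("PGUSER", "postgres"),
        # No default on purpose: a missing password should fail loudly.
        "password": env["PGPASSWORD"],
    }


def count_by_workclass(cursor, workclass):
    """Count rows for one workclass using a parameterized query."""
    # The driver binds the value to %s; it is never pasted into the SQL string.
    cursor.execute("SELECT COUNT(*) FROM adult WHERE workclass = %s;", (workclass,))
    return cursor.fetchone()[0]
```

With psycopg2 installed, the settings could be passed on as `psycopg2.connect(**connection_settings())`, and `count_by_workclass` would receive a cursor obtained from that connection.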

@startuml
skinparam backgroundColor #FEFEFE
skinparam componentStyle rectangle

title Interactive development workflow with Pandas and a database

package "Docker architecture" {
    actor "Developer" as dev

    package "Docker Engine" {
        component [Docker Daemon] as daemon
        component [Docker CLI] as cli
        component [REST API] as api
    }

    package "Container runtime" {
        component [containerd] as containerd
        component [runc] as runc
    }

    package "Storage" {
        database [Images] as images
        database [Volumes] as volumes
        database [Networks] as networks
    }

    cloud "Registry" as registry
}

dev --> cli : runs commands
cli --> api : API calls
api --> daemon : handles requests
daemon --> containerd : manages containers
containerd --> runc : runs containers
daemon --> images : manages images
daemon --> registry : pull/push
daemon --> volumes : persists data
daemon --> networks : configures networks

@enduml

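Diagram notes: The diagram shows the Docker architecture that underpins this workflow. The developer issues commands through the Docker CLI, which calls the REST API of the Docker daemon. The daemon manages containers through containerd and runc, and handles images, volumes, networks, and registry pulls and pushes. On top of this architecture, the workflow in this article runs as follows: load the Adult dataset, check the data types and design a matching PostgreSQL schema, extend docker-compose.yml with the required services and data volume, import the dataset into the database, develop the application logic, and finally test and tune the application.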

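This section covered loading the Adult dataset with pandas, checking its data types, preparing a PostgreSQL schema, tracking changes with Git, and extending docker-compose.yml with a PostgreSQL service and a data volume. These steps lay the groundwork for developing and testing the application further. What remains is to initialize the database, develop the application logic, and test and tune the result, while keeping the security considerations above in mind.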

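## Interactive software development

Interactive development has become an important way of working in modern software development. It lets developers write code in a Jupyter Notebook and test and verify the results immediately. This section shows how to build an interactive development environment with Docker and Jupyter Notebook.

### Setting up the Docker environment

First, set up a Docker environment that supports interactive development. We use docker-compose to manage several containers, including Jupyter Notebook and the PostgreSQL database.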

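#### Step 1: Create the new build contexts

Create two new build context directories, docker/jupyter and docker/postgres. The `-p` flag also creates the parent docker/ directory if it does not exist yet.

```bash
$ mkdir -p docker/jupyter docker/postgres
```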

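#### Step 2: Define the Jupyter service

Create a Dockerfile in the docker/jupyter directory with the following content:

```dockerfile
FROM jupyter/scipy-notebook
USER root
RUN conda install --yes --name root psycopg2
USER jovyan
```

This Dockerfile uses jupyter/scipy-notebook as its base image and installs the psycopg2 package, switching to root for the installation and back to the jovyan user afterwards.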

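#### Step 3: Define the PostgreSQL service

Create a Dockerfile in the docker/postgres directory with the following content:

```dockerfile
FROM postgres:alpine
COPY get_data.sh /docker-entrypoint-initdb.d/get_data.sh
COPY initdb.sql /docker-entrypoint-initdb.d/initdb.sql
```

This Dockerfile uses postgres:alpine as its base image and copies get_data.sh and initdb.sql into /docker-entrypoint-initdb.d/, where the official postgres image runs them when it initializes an empty data directory.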

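#### Step 4: Create get_data.sh and initdb.sql

Create get_data.sh in the docker/postgres directory with the following content:

```bash
#!/bin/bash
wget -P /tmp/ http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
sed '/^\s*$/d' /tmp/adult.data > /tmp/adult-clean.csv
```

This script downloads the Adult dataset and removes blank lines from it, writing the cleaned copy to /tmp/adult-clean.csv.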
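
For readers less familiar with `sed`, the cleaning step can be expressed in Python as well. This is only an equivalent sketch of the blank-line filter, with a hypothetical helper name `drop_blank_lines`. It is not a replacement for the script, which has to run inside the postgres container.

```python
def drop_blank_lines(text):
    """Remove empty and whitespace-only lines, like sed '/^\\s*$/d'."""
    return "".join(
        line for line in text.splitlines(keepends=True) if line.strip()
    )


# Two data rows separated by an empty line and a whitespace-only line.
raw = "39, State-gov\n\n   \n50, Private\n"
print(drop_blank_lines(raw), end="")
```

Blank lines matter here because a CSV import expects every line to contain a full row of values.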

Create initdb.sql in the docker/postgres directory with the following content:

```sql
CREATE TABLE adult (
    age INTEGER,
    workclass TEXT,
    fnlwgt INTEGER,
    education TEXT,
    education_num INTEGER,
    marital_status TEXT,
    occupation TEXT,
    relationship TEXT,
    race TEXT,
    gender TEXT,
    capital_gain INTEGER,
    capital_loss INTEGER,
    hours_per_week INTEGER,
    native_country TEXT,
    income_label TEXT
);

COPY adult FROM '/tmp/adult-clean.csv' DELIMITER ',' CSV;
```

This SQL script creates the adult table and imports the cleaned file that get_data.sh produced.

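### Starting the application

Start the application with the following command:

```bash
$ docker-compose up -d --build
```

This command builds the images and starts the Jupyter Notebook and PostgreSQL containers in the background.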

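### Verifying database initialization

Use the following commands to verify that the database was initialized. The container name `ch10adult_this_postgres_1` depends on the project directory name, so use the name that `docker-compose ps` reports.

```bash
$ docker-compose ps
$ docker exec -it ch10adult_this_postgres_1 psql postgres postgres
postgres=# SELECT COUNT(*) FROM adult;
```

If initialization succeeded, the query returns the number of rows in the adult table.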

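### Committing the changes

Commit the changes with the following commands:

```bash
$ git add -A
$ git commit -m 'add postgres service with database seed'
```

These commands stage all changes and commit them to the Git repository.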

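### The interactive development workflow

The interactive development workflow goes as follows:

1. Write code in a Jupyter Notebook.
2. When a code block grows too large or needs to be reused, abstract it into a function.
3. Test how the function performs in the Jupyter Notebook.
4. Move the function into a code library.
5. Import the function wherever it is needed.

Let's walk through this workflow. First, create a new notebook in the ipynb/ directory and rename it 20170613-Initial_Database_Connection.ipynb. Then enter the following code in the Notebook: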

```python
import psycopg2
```

#### Code walkthrough:

This imports psycopg2, a Python package for connecting to PostgreSQL databases.

Next, we create a function that connects to the PostgreSQL database.

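```python
def connect_to_database():
    # "this_postgres" is the service name from docker-compose.yml;
    # it resolves as a hostname on the Compose network.
    conn = psycopg2.connect(
        host="this_postgres",
        database="postgres",
        user="postgres",
        password="postgres"
    )
    return conn
```

#### Code walkthrough:

This function uses psycopg2 to connect to the PostgreSQL database. It returns a connection object that can be used to run SQL queries.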

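We can test the function in the Jupyter Notebook:

```python
conn = connect_to_database()
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM adult;")
result = cur.fetchone()
print(result)

# Release the cursor and the connection once the check is done
cur.close()
conn.close()
```

#### Code walkthrough:

This code connects to the database with `connect_to_database`, runs a SQL query that counts the rows in the adult table, prints the result, and then closes the cursor and the connection.

If everything works, we can move `connect_to_database` into the code library and import it wherever it is needed.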
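
The last two steps of the workflow could look roughly like the module below. This is a sketch under stated assumptions: the module path `lib/postgres_helper.py`, the `count_rows` helper, and the optional `connect` parameter (added so the function can be exercised without a running database) are all hypothetical and do not come from the original project.

```python
# lib/postgres_helper.py (hypothetical module path, not given in the text)


def connect_to_database(connect=None):
    """Return a connection to the `this_postgres` service."""
    if connect is None:
        # Imported lazily so the module can be loaded without the driver installed.
        import psycopg2

        connect = psycopg2.connect
    return connect(
        host="this_postgres",
        database="postgres",
        user="postgres",
        password="postgres",
    )


def count_rows(conn):
    """Return the number of rows in the adult table."""
    cur = conn.cursor()
    try:
        cur.execute("SELECT COUNT(*) FROM adult;")
        return cur.fetchone()[0]
    finally:
        # Always release the cursor, even if the query fails.
        cur.close()
```

Assuming the `lib` directory is importable from the notebook, the function would then be pulled in with `from lib.postgres_helper import connect_to_database` instead of being redefined in every notebook.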

玄貓 BlackCat

A technology enthusiast who shares insights on software development, cloud technology, and AI applications.