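Building HTTP Clients in Python with urllib and requests

Python offers capable modules such as urllib.request and requests for building HTTP clients that talk to web servers and APIs. urllib.request ships with the standard library and covers the basic HTTP functionality, while requests is a third-party library that simplifies HTTP operations behind a friendlier API. With either one you can send HTTP requests such as GET and POST, set request headers, handle responses in formats such as JSON, and download files. Knowing how to use these modules is essential for developing networked applications.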

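Building an HTTP client with urllib.request

The urlopen function returns a file-like object for reading data from a URL. The object offers read, readline, readlines, and close methods that behave like their counterparts on file objects, even though underneath we are working with a wrapper that abstracts the low-level socket.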

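Introduction to the urllib.request module

The urllib.request module gives access to resources published on the network through their addresses. The Python 3 documentation for the module lists all the functions and classes it provides.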

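Sending a POST request with urlopen

The urlopen function takes an optional data argument; when it is supplied, the request is sent to the HTTP address with the POST method instead of GET. The argument must be a bytes object (or a file-like object or an iterable of bytes), not a str, so form data has to be URL-encoded and then encoded to bytes. The signature documented through Python 3.12 is shown below; cafile, capath, and cadefault were deprecated in Python 3.6 and removed in Python 3.13 in favor of context:

urllib.request.urlopen(url, data=None, [timeout,] *, cafile=None, capath=None, cadefault=False, context=None)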

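The following script URL-encodes a dictionary, converts the result to bytes, and passes it to urlopen as the data argument to perform a POST request: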

import urllib.request
import urllib.parse

data_dictionary = {"id": "0123456789"}
data = urllib.parse.urlencode(data_dictionary)
data = data.encode('ascii')

with urllib.request.urlopen("http://httpbin.org/post", data) as response:
    print(response.read().decode('utf-8'))

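Code walkthrough:

First, we define data_dictionary, a dictionary holding the key-value pairs to send.
urllib.parse.urlencode() encodes the dictionary as a query string.
The encoded string is converted to bytes so that it can be sent in the body of the POST request.
urlopen sends the POST request to the given URL, and we read and decode the response body.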
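
To see what each of these steps produces without contacting a server, the following sketch builds the same kind of request but never sends it. The extra "name" field is a hypothetical value added only to show how spaces are encoded.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields; "name" is added only to show how a space is encoded
form_fields = {"id": "0123456789", "name": "Ada Lovelace"}

# Step 1: URL-encode the dictionary into a query string (still a str)
query_string = urllib.parse.urlencode(form_fields)

# Step 2: convert the string to bytes, the type urlopen expects for data
body = query_string.encode("ascii")

# Step 3: attach the body to a Request; nothing is sent at this point
request = urllib.request.Request("http://httpbin.org/post", data=body)

print(query_string)          # id=0123456789&name=Ada+Lovelace
print(type(body).__name__)   # bytes
print(request.get_method())  # POST, because a body is attached
```

Passing the Request object to urlopen would send it; the method switches from GET to POST only because data is present.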

Reading URL content with urllib.request

Reading the content of a URL with urllib.request is straightforward. Run the following commands in the Python interpreter:

>>> from urllib.request import urlopen
>>> response = urlopen('http://www.python.org')
>>> response
<http.client.HTTPResponse object at 0x7fa3c53059b0>
>>> response.readline()

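Code walkthrough:

urlopen() sends the request and receives the response for the given resource.
The response object offers several methods, such as readline(), for reading the response content.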
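
The file-like reading methods can be tried without a network connection. The sketch below assumes a data: URL, which urlopen also accepts and which embeds its own payload; the object returned in that case is a urllib.response.addinfourl rather than an http.client.HTTPResponse, but it exposes the same reading methods.

```python
from urllib.request import urlopen

# A data: URL carries its own payload, so nothing is fetched from the network.
# %20 is an encoded space and %0A an encoded newline.
SAMPLE_URL = "data:text/plain,first%20line%0Asecond%20line%0Athird%20line%0A"

with urlopen(SAMPLE_URL) as response:
    first = response.readline()       # one line, as bytes, newline included
    remaining = response.readlines()  # the rest, as a list of bytes lines
    leftover = response.read()        # nothing is left once the stream is consumed

print(first)      # b'first line\n'
print(remaining)  # [b'second line\n', b'third line\n']
print(leftover)   # b''
```

As with a file, the response is a stream: each call continues where the previous one stopped, and the content cannot be read a second time.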

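Customizing request headers

An HTTP message has two main parts: headers and a body. Request headers carry metadata about the client and the request, while response headers carry metadata about the response and tell the client how to interpret it. With the urllib.request module, we can check whether the response headers reveal information about the web server.

Sample code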

import urllib.request
from urllib.request import Request

url = "http://python.org"
USER_AGENT = 'Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.101 Mobile Safari/537.36'

def chrome_user_agent():
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', USER_AGENT)]
    urllib.request.install_opener(opener)
    response = urllib.request.urlopen(url)
    
    print("Response headers")
    print("-" * 20)
    for header, value in response.getheaders():
        print(header + ":" + value)

    request = Request(url)
    request.add_header('User-agent', USER_AGENT)
    
    print("\nRequest headers")
    print("-" * 20)
    for header, value in request.header_items():
        print(header + ":" + value)

if __name__ == '__main__':
    chrome_user_agent()

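Code walkthrough:

We set a custom User-Agent header to mimic a specific version of the Chrome browser.
Two ways of changing the User-Agent are shown: the addheaders attribute of an opener object, and the add_header() method of a Request object. The Request object here is only built and inspected; the request that is actually sent goes through the installed opener.
The script prints the response headers and the request headers to verify the effect of the custom header.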
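
The Request side of this can be inspected offline. The following is a small sketch, assuming a made-up client name ("example-client/1.0") in place of the Chrome string; it shows that Request normalizes header names with str.capitalize(), which matters when looking a header up again.

```python
from urllib.request import Request

# "example-client/1.0" is a made-up User-Agent value used only for illustration
request = Request("http://python.org", headers={"User-Agent": "example-client/1.0"})
request.add_header("Accept", "application/json")

# Header names are stored capitalized: "User-Agent" becomes "User-agent"
print(request.header_items())
# [('User-agent', 'example-client/1.0'), ('Accept', 'application/json')]

# Lookups must use the stored spelling
print(request.get_header("User-agent"))  # example-client/1.0
print(request.get_header("User-Agent"))  # None
```

To send the request with these headers, pass the Request object itself to urllib.request.urlopen() instead of the bare URL.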

Getting a response in JSON format

If the response arrives in JSON format, Python's json module can process it:

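>>> import json
>>> import urllib.request
>>> url = "http://httpbin.org/json"  # example endpoint, assumed here; use any URL that returns JSON
>>> response = urllib.request.urlopen(url, timeout=30)
>>> json_response = json.loads(response.read())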

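Code walkthrough:

json.loads() converts the JSON response body into the corresponding Python object, typically a dictionary.
The information we need can then be read through the dictionary's keys.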
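
The parsing step can be sketched on its own, with a hard-coded bytes value standing in for response.read(). The sample payload and the parse_json_body helper are hypothetical; the point is that json.loads() accepts bytes directly and raises json.JSONDecodeError when the body is not JSON.

```python
import json

# Stand-in for response.read(): a hypothetical JSON body, as bytes
raw_body = b'{"slideshow": {"title": "Sample Slide Show", "slides": [{"title": "Overview"}]}}'


def parse_json_body(body):
    """Return the decoded JSON value, or None if the body is not valid JSON."""
    try:
        return json.loads(body)  # bytes input is accepted since Python 3.6
    except json.JSONDecodeError:
        return None


document = parse_json_body(raw_body)
print(document["slideshow"]["title"])               # Sample Slide Show
print(document["slideshow"]["slides"][0]["title"])  # Overview
print(parse_json_body(b"<html>not json</html>"))    # None
```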

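HTTP programming with `urllib.request`

The previous sections showed how to use the urllib.request module to obtain information about a web server, including request and response headers. This section looks at how to use the module to extract email addresses and to download files.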

Extracting email addresses from a URL

With urllib.request and regular expressions we can extract email addresses from the content of a web page. The following sample code shows one way to do it:

import urllib.request
import re

# Custom User-Agent
USER_AGENT = 'Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.101 Mobile Safari/537.36'

# Read the URL from the user
url = input("Enter the URL: http://")

# Create an opener object and add the custom User-Agent
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', USER_AGENT)]
urllib.request.install_opener(opener)

try:
    # Send the request and read the response
    response = urllib.request.urlopen('http://' + url)
    # Decode the bytes to text; str() on bytes would yield the "b'...'" repr instead
    html_content = response.read().decode('utf-8', errors='replace')

    # Regular expression pattern that matches email addresses
    pattern = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")
    mails = re.findall(pattern, html_content)

    # Print the extracted email addresses
    print(mails)
except Exception as e:
    print("An error occurred:", e)

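Code analysis:

Custom User-Agent: build_opener() creates an opener object, to which we add a custom User-Agent header to mimic a browser.
Sending the request: urlopen() sends the HTTP request and returns the response content.
Regular expression matching: we define a pattern that matches email addresses and use re.findall() to find every match in the HTML content.
Error handling: possible exceptions are caught and printed, which makes the program more robust.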
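
The matching step can be checked against a fixed piece of HTML, which also exposes a limitation of this pattern. The sketch below uses a hypothetical snippet with example.org and example.com addresses: because the final character class allows dots, an address at the end of a sentence is captured with its trailing period, and repeated addresses are returned more than once.

```python
import re

# Same pattern as in the script above
EMAIL_PATTERN = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")

# Hypothetical page content, as bytes, the way response.read() returns it
sample_html = (
    b'<p>Contact <a href="mailto:info@example.org">info@example.org</a> '
    b"or sales@example.com.</p>"
)

text = sample_html.decode("utf-8")
found = EMAIL_PATTERN.findall(text)
print(found)
# ['info@example.org', 'info@example.org', 'sales@example.com.']

# Strip the trailing period and drop duplicates
cleaned = sorted({address.rstrip(".") for address in found})
print(cleaned)
# ['info@example.org', 'sales@example.com']
```

A pattern this simple is adequate for a quick survey of a page, but it is not a validator for email addresses.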

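Downloading files with `urllib.request`

urllib.request offers two ways to download a file: urlretrieve() and urlopen(). The Python documentation lists urlretrieve() as a legacy interface that might be deprecated in the future.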

import urllib.request

# URL of the file to download
url = "https://www.python.org/static/img/python-logo.png"

try:
    # Method 1: download the file with urlretrieve()
    print("Starting download...")
    urllib.request.urlretrieve(url, "python.png")
    print("Download complete!")

    # Method 2: download the file with urlopen()
    with urllib.request.urlopen(url) as response:
        print("Status code:", response.status)
        print("Downloading...")
        with open("python_open.png", "wb") as image:
            image.write(response.read())
        print("Download complete!")
except Exception as e:
    print("Download failed:", e)

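Code analysis:

urlretrieve(): downloads the remote resource straight to a local file.
urlopen(): reads the content through the returned response object and writes it to a local file by hand.
Error handling: possible exceptions are caught and printed so that the program fails gracefully.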
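
The urlopen() variant above reads the whole body into memory with response.read(). For large files, one alternative is to copy the response to disk in chunks. The sketch below is one possible approach, not part of the original script: the download helper is hypothetical, and to stay runnable offline it "downloads" a temporary local file through a file:// URL, which urlopen also handles. An http or https URL can be passed in the same way.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download(url, destination, chunk_size=64 * 1024):
    """Stream url to destination in chunks; return the number of bytes written."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as target:
        # copyfileobj reads chunk_size bytes at a time instead of the whole body
        shutil.copyfileobj(response, target, chunk_size)
    return Path(destination).stat().st_size


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for a remote file: 200,004 bytes written to a temporary directory
    source = Path(workdir) / "source.bin"
    source.write_bytes(b"\x89PNG" + bytes(200_000))

    copy = Path(workdir) / "copy.bin"
    size = download(source.as_uri(), copy)  # as_uri() yields a file:// URL
    same = copy.read_bytes() == source.read_bytes()

print(size)  # 200004
print(same)  # True
```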

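Handling `urllib.request` exceptions

Various exceptions can occur when using urllib.request, for example when a URL does not exist or the network fails. Handling them properly is essential for writing reliable programs.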

import urllib.request
import urllib.error

def count_words_in_url(url):
    try:
        response = urllib.request.urlopen(url)
        content = response.read()
        return len(content.split())
    except urllib.error.URLError as e:
        print('Exception:', e)
        print('Reason:', e.reason)

# Test with a valid URL
print(count_words_in_url('https://www.gutenberg.org/cache/epub/2000/pg2000.txt'))

# Test with an invalid URL
print(count_words_in_url('https://not-exists.txt'))

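Code analysis:

Exception handling: a try-except block catches urllib.error.URLError. When that happens the function returns None, which is what the second print call displays.
URLError: raised when the URL cannot be reached or a network error occurs. Its subclass urllib.error.HTTPError, raised for HTTP error statuses such as 404, is caught by the same clause.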
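
When the two kinds of failure need different treatment, HTTPError has to be caught before URLError, since it is a subclass. The following sketch uses a hypothetical fetch_or_explain helper; to stay runnable offline it exercises the success path with a data: URL and the URLError path with a file:// URL that points at a missing file, so the HTTPError branch is shown but not triggered here.

```python
import urllib.error
import urllib.request


def fetch_or_explain(url):
    """Return ("ok", body) on success, or a (kind, detail) pair describing the failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return "ok", response.read()
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500
        return "http-error", error.code
    except urllib.error.URLError as error:
        # No usable answer: DNS failure, refused connection, missing local file, ...
        return "url-error", error.reason


# HTTPError is a subclass of URLError, so its except clause must come first
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True

print(fetch_or_explain("data:text/plain,hello"))  # ('ok', b'hello')

kind, detail = fetch_or_explain("file:///no/such/dir/missing.txt")
print(kind)  # url-error
```

Returning a tagged result rather than printing inside the function leaves it to the caller to decide how to report or retry.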

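Building an HTTP client with the `requests` module

Besides urllib.request, the requests module is another popular HTTP client implementation for Python, with a more concise and friendlier API.

Installing the `requests` module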

pip3 install requests

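So far this article has covered HTTP programming with urllib.request, including sending requests, handling responses, downloading files, and handling errors, and has introduced the requests module. These skills are essential for developing networked applications and services. Next, we look more closely at using the requests module to interact with RESTful APIs.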

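Building an HTTP client with requests

To try the library in a script, simply import it like any other module. requests is built on top of urllib3 (not urllib.request) and combines it with other Python modules to offer simple methods that map onto REST operations. It therefore provides get, post, put, patch, delete, head, and options methods, which are the ones needed to interact with a RESTful API.

A simple use of the requests module

For example, a GET request with requests looks like this: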

>>> import requests
>>> response = requests.get('http://www.python.org')

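As we can see, requests.get() returns a response object, which holds all the information about the response to the request.

Main attributes of the response object

response.status_code: the HTTP status code returned by the server.
response.content: the body of the server's response as raw bytes; response.text gives the same body decoded to a string.
response.json(): if the body is JSON, this method deserializes it and returns the corresponding Python structure, typically a dictionary. If the body is not valid JSON, it raises an exception.

Inspecting the attributes of the response object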

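In the script below we inspect the attributes of the response object for a domain such as python.org. The response.headers attribute holds the headers of the web server's response. It is a case-insensitive, dictionary-like object, so we can iterate over it as key-value pairs with the items() method.

Sample code: requests_headers.py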

#!/usr/bin/env python3
import requests, json

domain = input("Enter the hostname http://")
response = requests.get("http://" + domain)

print("Status code: " + str(response.status_code))
print("Headers response: ")
for header, value in response.headers.items():
    print(header, '-->', value)
print("Headers request : ")
for header, value in response.request.headers.items():
    print(header, '-->', value)

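Code explanation:

First we import the requests module (the script also imports json, although it does not use it).
The user enters the domain name to query.
requests.get() sends a GET request to that domain.
The script prints the HTTP status code and the headers of the server's response.
It also prints the request headers, including information such as User-Agent.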
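
The headers object used in this script is an instance of requests.structures.CaseInsensitiveDict. Its behavior can be sketched without sending a request; the header values below are hypothetical stand-ins for a real response.

```python
from requests.structures import CaseInsensitiveDict

# Hypothetical header values, standing in for response.headers
headers = CaseInsensitiveDict(
    {"Content-Type": "text/html; charset=utf-8", "Server": "nginx"}
)

# Lookups ignore case, as HTTP header names do
print(headers["content-type"])  # text/html; charset=utf-8
print("SERVER" in headers)      # True

# Iteration keeps the spelling the headers were stored with
for name, value in headers.items():
    print(name, "-->", value)
```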

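Getting images and links with requests

In the following example we use requests and the regular expression module to extract images and links. Using the re module to look for img elements and href attributes in the target URL is a simple approach, although regular expressions only cope with straightforward HTML.

Sample code: get_images_links_url.py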

#!/usr/bin/env python3
import re
from urllib.parse import urljoin

import requests

url = input("Enter URL > ")
html = requests.get(url).text

print("Images:")
print("#########################")
# Capture the src attribute of every <img> tag (double-quoted values only)
for src in re.findall(r'<img\s[^>]*?src="([^"]*)"', html):
    # urljoin leaves absolute URLs untouched and resolves relative ones
    print(urljoin(url, src))

print("#########################")
print("Links:")
print("#########################")
# Capture the href attribute of every <a> tag (double-quoted values only)
for href in re.findall(r'<a\s[^>]*?href="([^"]*)"', html):
    print(urljoin(url, href))

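Code explanation:

The user enters the URL from which to extract images and links.
requests.get() retrieves the content of the URL.
Regular expressions extract the image links from img elements and the hyperlinks from a elements.
The script prints the extracted images and links.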
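
Regular expressions can miss attributes that are single-quoted or laid out unusually. As an alternative sketch, not taken from the original script, the standard library's html.parser can collect the same information. The LinkAndImageParser class, the sample HTML, and the example.org base URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageParser(HTMLParser):
    """Collect absolute URLs of <img src=...> and <a href=...> attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            self.images.append(urljoin(self.base_url, attributes["src"]))
        elif tag == "a" and attributes.get("href"):
            self.links.append(urljoin(self.base_url, attributes["href"]))


# Hypothetical page content; in the script above this would be requests.get(url).text
sample_html = """
<a href="/about">About</a> <a class="ext" href='https://docs.python.org/3/'>Docs</a>
<img alt="logo" src="static/logo.png">
"""

parser = LinkAndImageParser("https://example.org/")
parser.feed(sample_html)
print(parser.images)  # ['https://example.org/static/logo.png']
print(parser.links)   # ['https://example.org/about', 'https://docs.python.org/3/']
```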

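Sending GET requests to a REST API

To test sending requests with the requests module, we can use the https://httpbin.org/ service and try different request types. In every case the code is run the same way to obtain the desired output; only the request type and the data sent to the server change.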
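
As a minimal sketch of such a GET request, the code below builds and prepares a request for httpbin's /get endpoint without sending it, so the final URL and headers can be inspected offline. The id query parameter and the Accept header are assumptions chosen for illustration.

```python
import requests

# Build the request; the "id" parameter and Accept header are illustrative values
request = requests.Request(
    "GET",
    "https://httpbin.org/get",
    params={"id": "0123456789"},
    headers={"Accept": "application/json"},
)

# prepare() encodes the parameters into the URL but sends nothing
prepared = request.prepare()
print(prepared.method)             # GET
print(prepared.url)                # https://httpbin.org/get?id=0123456789
print(prepared.headers["accept"])  # application/json
```

To send it, pass the prepared request to a session, for example requests.Session().send(prepared, timeout=10), or call requests.get() with the same url, params, and headers arguments.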

玄貓 BlackCat

A technology enthusiast who shares notes and insights on software development, cloud technology, and AI applications.
