The Ultimate Guide to Python Web Scraping: A Complete 2025 Overview

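Let me take you back to the first time I tried to scrape business data from the web. I was hunched over my kitchen table, sipping coffee and typing away at a half-finished Python script, trying to pull product prices from a competitor's site. I thought, "This should be easy, right?" All I got was an empty CSV file and a new respect for anyone who says automating things with Python is easy. Fast forward to 2025, and web scraping has become a core tool for data-driven businesses, giving sales, e-commerce, marketing, and operations teams up-to-the-minute information that would be impossible to gather by hand.

But here's the catch: Python web scraping is more powerful than ever, and the field is changing fast. The web scraping market is booming, and businesses increasingly lean on scraped data to make faster decisions. The real challenge is not just writing code; it is choosing the right tools, scaling effectively, and not drowning in a pile of scripts. This guide walks through the main Python web scraping libraries (with code examples), real business use cases, and why, even though I love Python, I think no-code solutions like Thunderbit are the best choice for most business users in 2025.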

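What Is Python Web Scraping? A Plain-Language Explanation for Non-Technical Readers

Put simply, web scraping is automated copy and paste. Instead of asking a team of interns to collect product prices, contact lists, or reviews by hand, you let software browse the pages, extract the data you want, and export it to a spreadsheet or database. Python web scraping means doing this with Python scripts: fetching pages automatically, parsing the HTML, and pulling out the information you care about.

Think of it as a digital assistant that browses websites and collects data around the clock without getting tired. What do businesses scrape most? Pricing, product details, contact information, reviews, images, news articles, even real estate listings. Some sites offer an API, but most either have none or restrict it heavily. That is where web scraping comes in: it lets you collect public data at scale even when there is no "Download" button.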
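To make the "parse the HTML and pull out what you care about" step concrete, here is a minimal sketch that uses only Python's standard library. The markup in SAMPLE_HTML and the PriceParser class are made up for illustration; a real scraper would download the page first and would usually use one of the libraries covered later.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False   # True while we are inside a price <span>
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$19.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">$5.00</span></li>
</ul>
"""

parser = PriceParser()
parser.feed(SAMPLE_HTML)
print(parser.prices)
```

The parser walks the page tag by tag and keeps only the text it was told to look for, which is the whole idea of scraping in miniature.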

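Why Python Web Scraping Matters for Business Teams

Honestly, if your company is not using web scraping in 2025, you are leaving opportunities on the table. The reasons are simple:

Automated data collection: No more copying and pasting data by hand from competitor sites or online directories.
Real-time insight: Stay on top of the latest prices, inventory, and market trends.
Extraction at scale: Scrape thousands of pages in minutes, far faster than any manual process.
High return on investment: Data-driven strategies tend to pay off, because decisions rest on current data rather than guesswork.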

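Here are some high-impact use cases by department:

Department	Example use case	Value delivered
Sales	Scrape leads from directories, enrich with email addresses	Larger, better-targeted lead lists
Marketing	Track competitor prices, promotions, and reviews	Smarter campaigns, faster adjustments
E-commerce	Monitor product prices, inventory, and reviews	Dynamic pricing, stock alerts
Operations	Consolidate supplier data, automate reporting	Time saved, fewer human errors
Real estate	Collect listings from multiple sites	More listings, faster client response

In one sentence: web scraping is the secret weapon that helps businesses make faster, smarter decisions.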

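An Overview of the Main Python Web Scraping Libraries (With Code Examples)

The Python web scraping ecosystem is rich, covering everything from simple downloads to advanced browser automation. Each tool below comes with an example: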

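urllib and urllib3: Basic HTTP Requests

urllib ships with Python's standard library, while urllib3 is a separate third-party package (installed with pip) that adds connection pooling. Both are bare-bones, but they cover basic needs.

import urllib3
import urllib3.util
# A PoolManager reuses connections across requests.
http = urllib3.PoolManager()
headers = urllib3.util.make_headers(user_agent="MyBot/1.0")
response = http.request("GET", "https://httpbin.org/json", headers=headers)
print(response.status)        # HTTP status code
print(response.data[:100])    # first 100 bytes of the body

Use these if you want to minimize dependencies or need fine-grained control. In most cases, requests is more convenient.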
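If the goal really is zero extra dependencies, the standard library's urllib.request can do the same job. The sketch below is one possible shape, not the only one: build_request and fetch are helper names I made up, and the httpbin.org URL is just a convenient test endpoint.

```python
import urllib.request


def build_request(url, user_agent="MyBot/1.0"):
    """Create a GET request object that carries a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Send the request and return (status code, first 100 bytes of the body)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.status, resp.read(100)


# Building the request does not touch the network; only fetch() does.
req = build_request("https://httpbin.org/json")
print(req.get_method(), req.full_url)
```

Calling fetch("https://httpbin.org/json") would then perform the actual download. Keeping request construction separate from sending makes the header logic easy to check without going online.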

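requests: The Most Popular Python Web Scraping Library

If Python scraping had a mascot, it would be requests. It is simple yet powerful and takes care of the HTTP details for you.

import requests
r = requests.get("https://httpbin.org/json", headers={"User-Agent": "MyBot/1.0"}, timeout=10)
print(r.status_code)      # 200 on success
print(r.json())           # the parsed JSON body

It handles cookies, sessions, redirects, and more automatically, so you can focus on the data. Note that requests only downloads the HTML; you still need a parser such as BeautifulSoup to extract data from it.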

BeautifulSoup: Easy HTML Parsing and Data Extraction

BeautifulSoup is the go-to choice for parsing HTML in Python. Its syntax is friendly enough for beginners to pick up quickly.

from bs4 import BeautifulSoup
html = "<div class='product'><h2>Widget</h2><span class='price'>$19.99</span></div>"
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h2").text                    # "Widget"
price = soup.find("span", class_="price").text  # "$19.99"

It suits small to medium projects and is a good starting point. For large volumes of data or complex queries, consider lxml.
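The price extracted above is still a string like "$19.99", so it cannot be sorted or compared numerically yet. Here is a small sketch of one way to clean it up. The parse_price helper is my own name, and it assumes "," is a thousands separator and "." is the decimal point, which will not hold for every locale.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Turn a scraped price string such as "$1,299.50" into a Decimal.

    Returns None when the text contains no number.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return Decimal(match.group().replace(",", ""))


print(parse_price("$19.99"))
print(parse_price("USD 1,299.50"))
print(parse_price("N/A"))
```

Decimal is used instead of float so that prices keep their exact cents when you add or compare them.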

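lxml and XPath: Fast, Powerful HTML/XML Parsing

When you need speed or want to use XPath query syntax, lxml is a good companion.

from lxml import html
# Sample markup; in practice page_content is the HTML you downloaded.
page_content = "<div><span class='price'>$19.99</span><span class='price'>$5.00</span></div>"
doc = html.fromstring(page_content)
prices = doc.xpath("//span[@class='price']/text()")
print(prices)

XPath lets you target data precisely. lxml is fast, but its learning curve is a bit steeper than BeautifulSoup's.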
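If you want to get a feel for XPath-style queries before installing anything, the standard library's xml.etree.ElementTree supports a limited subset of XPath. The sketch below is a stand-in for lxml, not a replacement: ElementTree is a strict XML parser and will reject the sloppy HTML that lxml.html tolerates, and the PAGE markup here is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup (every tag is closed properly).
PAGE = """
<div>
  <div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$5.00</span></div>
</div>
"""

root = ET.fromstring(PAGE)

# ".//span[@class='price']" selects every price span below the root.
prices = [span.text for span in root.findall(".//span[@class='price']")]

# Querying relative to each product keeps a name paired with its own price.
products = [
    (item.findtext("h2"), item.findtext("span[@class='price']"))
    for item in root.findall(".//div[@class='product']")
]

print(prices)
print(products)
```

The second query shows the pattern that matters most in practice: select the repeating container first, then run short relative queries inside it, so that fields from different products never get mixed up.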

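Scrapy: A Framework for Large-Scale Web Scraping

Scrapy is the framework of choice for large crawling projects. It is full-featured, and it is to scraping roughly what Django is to web development.

import scrapy

# Run with: scrapy runspider quotes_spider.py -o quotes.json
# (assuming this file is saved as quotes_spider.py)
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote on the page sits inside a <div class="quote">.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

Scrapy supports asynchronous requests, automatic link following, item pipelines, and export to multiple formats. It is overkill for a small script but hard to beat for large-scale crawling.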
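To give a sense of what an item pipeline looks like, here is a rough sketch. A Scrapy pipeline is a plain Python class with a process_item method; the sketch uses the long-standing process_item(item, spider) signature, so check the documentation for your Scrapy version. The CleanQuotePipeline name and the cleaning rules are my own assumptions.

```python
class CleanQuotePipeline:
    """Strip whitespace and decorative quote marks from scraped items."""

    def process_item(self, item, spider):
        # `spider` is unused here; it is kept to match the pipeline signature.
        text = (item.get("text") or "").strip().strip("\u201c\u201d\"")
        author = (item.get("author") or "").strip()
        return {"text": text, "author": author}


# Quick check without running a crawl: feed the pipeline one dict by hand.
cleaned = CleanQuotePipeline().process_item(
    {"text": " \u201cBe yourself.\u201d ", "author": " Oscar Wilde\n"}, spider=None
)
print(cleaned)
```

In a real project you would enable it through the ITEM_PIPELINES setting, for example {"myproject.pipelines.CleanQuotePipeline": 300}, where the module path is hypothetical and depends on your project layout.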

Selenium, Playwright, Pyppeteer: Scraping Dynamic Websites

When a site loads its content dynamically with JavaScript, you need a browser automation tool. Selenium and Playwright are the mainstream choices.

Selenium example:

from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://example.com/login")
# Fill in the login form and submit it.
driver.find_element(By.NAME, "username").send_keys("user123")
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.ID, "submit-btn").click()
titles = [el.text for el in driver.find_elements(By.CLASS_NAME, "product-title")]
driver.quit()  # always close the browser when done

Playwright example:

from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://website.com")
    page.wait_for_selector(".item")  # wait until the content has rendered
    data = page.eval_on_selector(".item", "el => el.textContent")
    print(data)
    browser.close()

These tools can handle anything a human can see in a browser, but they are slower and use more resources. Use them only when necessary.
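One cheap way to decide whether a browser is "necessary" is to download the raw HTML once and check whether the content you want is already in it. The sketch below is a simple heuristic, not a guarantee; needs_browser and the sample markers are hypothetical.

```python
def needs_browser(raw_html, markers):
    """Return True when none of the expected markers appear in the raw HTML.

    `markers` are strings you expect in a fully rendered page, for example
    a CSS class name or a known product title.
    """
    return not any(marker in raw_html for marker in markers)


# A page that already contains the data versus an empty JavaScript shell.
static_page = '<div class="product-title">Widget</div>'
js_shell = '<div id="root"></div><script src="/app.js"></script>'

print(needs_browser(static_page, ["product-title"]))
print(needs_browser(js_shell, ["product-title"]))
```

If the check says the data is already there, a plain HTTP request plus a parser will be faster and lighter than launching a browser.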

MechanicalSoup, RoboBrowser, PyQuery, Requests-HTML: Other Useful Tools

MechanicalSoup: automates form filling and navigation, built on Requests and BeautifulSoup.

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("http://example.com/login")
browser.select_form("form#loginForm")   # pick the form by CSS selector
browser["username"] = "user123"
browser["password"] = "secret"
browser.submit_selected()
page = browser.get_current_page()       # a BeautifulSoup object
print(page.title.text)

RoboBrowser: similar to MechanicalSoup, but less actively maintained.

PyQuery: jQuery-style HTML parsing.

from pyquery import PyQuery as pq
doc = pq("<div><p class='title'>Hello</p><p>World</p></div>")
print(doc("p.title").text())      # "Hello"
print(doc("p").eq(1).text())      # "World"

Requests-HTML: combines HTTP requests and parsing, and even supports JavaScript rendering.

from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://example.com")
r.html.render(timeout=20)   # runs the page's JavaScript in a headless browser
links = [a.text for a in r.html.find("a.story-link")]

These tools fit situations where you need quick form handling, CSS selectors, or lightweight JavaScript rendering.

asyncio and aiohttp: Speeding Up Large-Scale Python Web Scraping

When you need to fetch hundreds or thousands of pages, synchronous requests are too slow. aiohttp and asyncio let you fetch pages concurrently.

import asyncio
import aiohttp

async def fetch_page(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def fetch_all(urls):
    # One shared session reuses connections across all requests.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, url) for url in urls]
        return await asyncio.gather(*tasks)

urls = ["https://example.com/page1", "https://example.com/page2"]
html_pages = asyncio.run(fetch_all(urls))

This approach fetches many pages at the same time, which greatly improves throughput.
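Firing off thousands of requests at once can overwhelm both your machine and the target site, so it is common to cap how many run in parallel. Here is a minimal sketch using asyncio.Semaphore. To keep it runnable offline, fake_fetch stands in for a real HTTP call; fetch_all_limited and the limit of 3 are assumptions for illustration.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real HTTP call so the sketch runs offline."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_all_limited(urls, fetch=fake_fetch, max_concurrent=3):
    """Fetch every URL, never running more than `max_concurrent` at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        # Only `max_concurrent` coroutines can be inside this block at a time.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


urls = [f"https://example.com/page{i}" for i in range(1, 8)]
pages = asyncio.run(fetch_all_limited(urls))
print(len(pages))
```

To use it with aiohttp, you could pass a fetch function that wraps session.get() in the same way as fetch_page in the example above.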

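Specialized Libraries: PRAW (Reddit), PyPDF2, and More

PRAW: retrieves Reddit data through the Reddit API.

import praw
reddit = praw.Reddit(client_id="XXX", client_secret="YYY", user_agent="myapp")
for submission in reddit.subreddit("learnpython").hot(limit=5):
    print(submission.title, submission.score)

PyPDF2: extracts text from PDF files. (The project now continues under the name pypdf, which offers the same PdfReader interface.)

from PyPDF2 import PdfReader
reader = PdfReader("sample.pdf")
num_pages = len(reader.pages)
text = reader.pages[0].extract_text()   # text of the first page

Others: there are also dedicated Python libraries for Instagram, Twitter, OCR (Tesseract), and more. Whatever your niche need, there is almost certainly a tool for it.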

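Python Scraping Library Comparison Table

Tool / Library	Ease of use	Speed and scale	Best for
Requests + BeautifulSoup	Easy	Moderate	Beginners, static pages, quick scripts
lxml (with XPath)	Moderate	Fast	Large projects, complex parsing
Scrapy	Hard	Very fast	Enterprise-grade, large crawls, data pipelines
Selenium / Playwright	Moderate	Slow	JavaScript-heavy pages, interactive sites
aiohttp + asyncio	Moderate	Very fast	High volume, mostly static pages
MechanicalSoup	Easy	Moderate	Logins, forms, session management
PyQuery	Moderate	Fast	Fans of CSS selectors, DOM manipulation
Requests-HTML	Easy	Varies	Small projects, lightweight JavaScript rendering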

Hands-On Tutorial: Building a Web Scraper in Python (With Example)

The example below scrapes a product listing from a (hypothetical) e-commerce site, handles pagination, and exports the results to CSV:

import csv
import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/products"
page_num = 1
all_products = []

while True:
    # Page 1 lives at the base URL; later pages at /page/<n>.
    url = base_url if page_num == 1 else f"{base_url}/page/{page_num}"
    print(f"Scraping page: {url}")
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        print(f"Page {page_num} returned status {response.status_code}, stopping.")
        break
    soup = BeautifulSoup(response.text, "html.parser")
    products = soup.find_all("div", class_="product-item")
    if not products:
        print("No more products found, stopping.")
        break
    for prod in products:
        name_tag = prod.find("h2", class_="product-title")
        price_tag = prod.find("span", class_="price")
        # Fall back to "N/A" when a field is missing from the page.
        name = name_tag.get_text(strip=True) if name_tag else "N/A"
        price = price_tag.get_text(strip=True) if price_tag else "N/A"
        all_products.append((name, price))
    page_num += 1

print(f"Collected {len(all_products)} products. Saving to CSV...")
with open("products_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Product Name", "Price"])
    writer.writerows(all_products)
print("Data saved to products_data.csv")

What does this script do?

It fetches the listing page by page, parses each product, collects names and prices, and stops automatically when there is no more data.
It then exports the results to CSV for later analysis.
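The loop above stops as soon as a page returns a non-200 status, which means one temporary network hiccup can cut a run short. A common remedy is to retry with exponential backoff. The sketch below is one way to do that under stated assumptions: fetch_with_retries is my own helper name, it accepts any callable that raises on failure, and the flaky_fetch function exists only to demonstrate the behavior offline.

```python
import time


def fetch_with_retries(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or `retries` attempts are used up.

    Waits backoff, 2*backoff, 4*backoff, ... seconds between attempts.
    `fetch` is any callable that raises an exception on failure.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise                      # out of attempts: surface the error
            sleep(backoff * 2 ** attempt)  # exponential backoff


# Offline demonstration with a hypothetical fetcher that fails twice.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"content of {url}"


delays = []  # record the waits instead of actually sleeping
result = fetch_with_retries(flaky_fetch, "https://example.com/products",
                            sleep=delays.append)
print(result, delays)
```

With requests, the fetch callable could be a small function that calls requests.get(url, timeout=10) and then response.raise_for_status(), so that error statuses trigger a retry too.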

Want to export straight to Excel? pandas makes it easy:

import pandas as pd
# Writing .xlsx files requires the openpyxl package to be installed.
df = pd.DataFrame(all_products, columns=["Product Name", "Price"])
df.to_excel("products_data.xlsx", index=False)

Handling Forms, Logins, and Sessions

Many sites require a login or a form submission. Here is how to handle that:

Keeping a session with requests:

import requests
session = requests.Session()
login_data = {"username": "user123", "password": "secret"}
session.post("https://targetsite.com/login", data=login_data)
resp = session.get("https://targetsite.com/account/orders")  # sent with the login cookies

With MechanicalSoup:

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("http://example.com/login")
browser.select_form("form#login")
browser["user"] = "user123"
browser["pass"] = "secret"
browser.submit_selected()

A session preserves your login state, which makes it easy to scrape multiple pages in a row.

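Scraping Dynamic Content and JavaScript-Rendered Pages

If the data is not in the HTML source (View Source shows only an empty div), you need browser automation.

Selenium example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("http://examplesite.com/dashboard")
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "stats-table")))
html = driver.page_source  # HTML after JavaScript has rendered the table
driver.quit()

Alternatively, if you can find the API endpoint that the JavaScript calls, fetch the JSON directly with requests, which is much faster.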
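Once you have the JSON from such an endpoint, turning it into rows is usually a few lines of code. The sketch below works on a made-up response body; the field names (items, name, price, amount, currency) are assumptions, since every real endpoint uses its own structure.

```python
import json

# Hypothetical response body; real endpoints use their own field names.
SAMPLE_BODY = """
{"items": [
  {"name": "Widget", "price": {"amount": 19.99, "currency": "USD"}},
  {"name": "Gadget", "price": {"amount": 5.0, "currency": "USD"}}
]}
"""


def rows_from_payload(body):
    """Flatten the JSON body into (name, amount, currency) tuples."""
    payload = json.loads(body)
    return [
        (item["name"], item["price"]["amount"], item["price"]["currency"])
        for item in payload.get("items", [])
    ]


rows = rows_from_payload(SAMPLE_BODY)
print(rows)
```

With requests, response.json() gives you the parsed payload directly, so the json.loads step would not be needed.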

Exporting Data: CSV, Excel, Databases, and More

CSV: use Python's built-in csv module (as in the example above).
Excel: use pandas or openpyxl.

Google Sheets: use the gspread library.

import gspread
# Assumes all_products is the list of (name, price) tuples built earlier.
gc = gspread.service_account(filename="credentials.json")
sh = gc.open("My Data Sheet")
worksheet = sh.sheet1
worksheet.clear()
# One batched call instead of one API request per row.
worksheet.append_rows([["Name", "Price"]] + [list(row) for row in all_products])

Databases: for SQL, use sqlite3, pymysql, psycopg2, or SQLAlchemy; for NoSQL, use pymongo to connect to MongoDB.
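As a quick illustration of the database route, here is a minimal sketch with the built-in sqlite3 module. The table layout and the sample rows are assumptions; the rows simply mirror the (name, price) shape of all_products from the tutorial.

```python
import sqlite3

# Hypothetical rows in the same (name, price) shape as all_products.
all_products = [("Widget", "$19.99"), ("Gadget", "$5.00")]

conn = sqlite3.connect(":memory:")   # use a file path to persist the data
conn.execute(
    "CREATE TABLE IF NOT EXISTS products (name TEXT PRIMARY KEY, price TEXT)"
)
# Parameterized query; INSERT OR REPLACE updates existing rows on a repeat run.
conn.executemany(
    "INSERT OR REPLACE INTO products (name, price) VALUES (?, ?)", all_products
)
conn.commit()

stored = conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
print(stored)
```

Using the product name as the primary key means a scheduled scraper can run again and again without piling up duplicate rows, though a real catalog would likely need a sturdier key such as a SKU or URL.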

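Python Web Scraping vs. Modern No-Code Tools: Why Thunderbit Is the Top Choice in 2025

Let's talk about the most practical problem: maintenance. Writing your own scraper is satisfying, but when you are scraping 100 sites, each with its own rules, and they all break the night before a report is due, you learn how painful it can be.

That is why I recommend Thunderbit. For business users in 2025, Thunderbit has these advantages:

No coding at all: Thunderbit offers a visual interface. Click "AI Suggest Fields", adjust the fields, press "Scrape", and you are done. No Python, no debugging, no searching Stack Overflow.
Large-scale extraction without the stress: Need to scrape 10,000 products? Thunderbit's cloud engine handles it, so you do not have to babysit a script.
No maintenance burden: If you track 100 competitor sites, maintaining 100 Python scripts will wear you down. With Thunderbit you pick or tweak a template, and the AI adapts to layout changes automatically.
Subpage and pagination support: Thunderbit can click through links and handle pagination automatically, and it can go on to scrape the detail page for each product.
Instant templates: Ready-made templates for popular sites (Amazon, Zillow, LinkedIn, and others) let you get data in one click.
Free data export: Export directly to Excel, Google Sheets, Airtable, or Notion at no cost.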