"Python 爬虫用什么库?"
新手:requests 老手:aiohttp + scrapy 高手:这 10 个库组合使用
今天分享 10 个我私藏的 Python 爬虫库,每一个都能让你的效率翻倍。
1️⃣ requests - the essential basics
Install:
```shell
pip install requests
```
Usage:
```python
import requests

response = requests.get('https://api.github.com/users/openclaw')
data = response.json()
print(data['login'])  # openclaw
```
Use case: simple HTTP requests
2️⃣ aiohttp - async concurrency
Install:
```shell
pip install aiohttp
```
Usage:
```python
import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.json()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [
            fetch(session, f'https://api.github.com/users/{user}')
            for user in ['openclaw', 'torvalds', 'guido']
        ]
        return await asyncio.gather(*tasks)

data = asyncio.run(main())
```
Use case: high-concurrency crawling
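One caveat: the `asyncio.gather` call above launches every request at once, which can get you rate-limited on a real high-concurrency crawl. A common pattern is to cap the number of in-flight requests with a semaphore. A minimal sketch of that pattern, where the `demo` coroutine is a stand-in for a real `fetch`:

```python
import asyncio

async def gather_limited(coros, limit=10):
    # Run coroutines with at most `limit` in flight at any moment.
    sem = asyncio.Semaphore(limit)

    async def _bounded(coro):
        async with sem:
            return await coro

    # gather preserves the input order of results.
    return await asyncio.gather(*(_bounded(c) for c in coros))

async def demo(i):
    # Stand-in for a real network fetch.
    await asyncio.sleep(0)
    return i * 2

results = asyncio.run(gather_limited([demo(i) for i in range(5)], limit=2))
print(results)  # [0, 2, 4, 6, 8]
```

The same `gather_limited` helper works unchanged with real aiohttp fetch coroutines.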
3️⃣ BeautifulSoup - HTML parsing
Install:
```shell
pip install beautifulsoup4
```
Usage:
```python
import requests
from bs4 import BeautifulSoup

html = requests.get('https://github.com/trending').text
soup = BeautifulSoup(html, 'html.parser')
for row in soup.select('article.Box-row'):
    name = row.select_one('h2 a').text.strip()
    stars = row.select_one('[aria-label="Stars"]').text.strip()
    print(f"{name}: {stars}")
```
Use case: parsing static pages
4️⃣ Scrapy - crawler framework
Install:
```shell
pip install scrapy
```
Usage:
```python
import scrapy

class GithubSpider(scrapy.Spider):
    name = 'github'
    start_urls = ['https://github.com/trending']

    def parse(self, response):
        for row in response.css('article.Box-row'):
            yield {
                'name': row.css('h2 a::text').get(),
                'stars': row.css('[aria-label="Stars"]::text').get(),
            }
```
Use case: large crawler projects
5️⃣ Playwright - browser automation
Install:
```shell
pip install playwright
playwright install
```
Usage:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://github.com/trending')
    page.screenshot(path='trending.png')
    browser.close()
```
Use case: dynamic pages / JavaScript required
6️⃣ Selenium - the old-school automation tool
Install:
```shell
pip install selenium
```
Usage:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://github.com/trending')
elements = driver.find_elements(By.CSS_SELECTOR, 'article.Box-row')
for el in elements:
    print(el.text)
driver.quit()
```
Use case: maintaining legacy projects
7️⃣ httpx - the modern HTTP client ⭐
Install:
```shell
pip install httpx
```
Usage:
```python
import asyncio
import httpx

# Sync
with httpx.Client() as client:
    response = client.get('https://api.github.com/users/openclaw')
    print(response.json())

# Async (await must live inside an async function)
async def main():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://api.github.com/users/openclaw')
        print(response.json())

asyncio.run(main())
```
Why it's great:
- A requests-style API, so switching over is painless
- Sync and async support in one library
- Optional HTTP/2 support (`pip install httpx[http2]`)

90% of people don't know about this library!
8️⃣ fake-useragent - fake User-Agent strings
Install:
```shell
pip install fake-useragent
```
Usage:
```python
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}
print(headers['User-Agent'])
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...
```
Use case: bypassing User-Agent checks
9️⃣ retrying - automatic retries
Install:
```shell
pip install retrying
```
Usage:
```python
import requests
from retrying import retry

@retry(stop_max_attempt_number=3, wait_fixed=2000)
def fetch_url(url):
    response = requests.get(url)
    response.raise_for_status()
    return response.text

data = fetch_url('https://api.github.com/users/openclaw')
```
Use case: unstable networks. (Note: `retrying` is no longer maintained; `tenacity` is its actively maintained successor with a similar API.)
🔟 Proxy pools - rotating IPs
Install:
Note: there is no standard `proxy-pool` package on PyPI. Proxy pools are usually run as a separate service (for example the open-source proxy_pool project) or built from a proxy list you maintain yourself. At the request level, you route traffic through a proxy with requests' built-in `proxies` parameter:
Usage:
```python
import requests

proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
}
response = requests.get('https://www.example.com', proxies=proxies)
```
Use case: avoiding IP bans
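A single fixed proxy defeats the purpose of a pool. A lightweight way to get pool-like behavior is to rotate through a proxy list per request. A sketch, where `PROXY_LIST` and the example hostnames are placeholders for proxies you actually control:

```python
import itertools

# Hypothetical proxy list; in practice, fill this from your own proxy source.
PROXY_LIST = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]
_proxy_cycle = itertools.cycle(PROXY_LIST)

def next_proxies():
    # Return a requests-style proxies dict, advancing through the list
    # (and wrapping around) on every call.
    proxy = next(_proxy_cycle)
    return {'http': proxy, 'https': proxy}

print(next_proxies()['http'])  # http://proxy1.example.com:8080
print(next_proxies()['http'])  # http://proxy2.example.com:8080
```

Each call to `requests.get(url, proxies=next_proxies())` then goes out through a different IP.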
📊 Library comparison

| Library | Best for |
|---|---|
| requests | Simple HTTP requests |
| aiohttp | High-concurrency crawling |
| BeautifulSoup | Parsing static pages |
| Scrapy | Large crawler projects |
| Playwright | Dynamic / JavaScript-heavy pages |
| Selenium | Maintaining legacy projects |
| httpx | Modern sync + async HTTP |
| fake-useragent | Bypassing UA checks |
| retrying | Unstable networks |
| Proxy pools | Avoiding IP bans |
🎁 Recommended combos
Beginner combo:
requests + BeautifulSoup + fake-useragent
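Here is how the beginner combo fits together. Inline HTML stands in for a fetched page so the parsing step is reproducible; in a real crawl the commented `requests.get` line fetches the page, and fake-useragent's `UserAgent().random` supplies the header:

```python
from bs4 import BeautifulSoup

# Static fallback UA; with fake-useragent you would use UserAgent().random here.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Real crawl: html = requests.get(url, headers=headers).text
html = '''
<article class="Box-row"><h2><a href="/a/b">a / b</a></h2></article>
<article class="Box-row"><h2><a href="/c/d">c / d</a></h2></article>
'''
soup = BeautifulSoup(html, 'html.parser')
names = [row.select_one('h2 a').text.strip()
         for row in soup.select('article.Box-row')]
print(names)  # ['a / b', 'c / d']
```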
Intermediate combo:
httpx + BeautifulSoup + retrying + a proxy pool
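The glue in this combo is the retry layer. In case `retrying` is unfamiliar, here is a stdlib-only sketch of what its decorator does under the hood; the `flaky` function is a hypothetical stand-in that simulates two transient network errors before succeeding:

```python
import time

def retry_call(func, attempts=3, wait=0.01, exceptions=(Exception,)):
    # Minimal stand-in for @retry(stop_max_attempt_number=..., wait_fixed=...):
    # call func, sleep and retry on failure, re-raise after the last attempt.
    last = None
    for _ in range(attempts):
        try:
            return func()
        except exceptions as exc:
            last = exc
            time.sleep(wait)
    raise last

calls = {'n': 0}

def flaky():
    # Fails twice, then succeeds, like a request against an unstable network.
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('transient')
    return 'ok'

print(retry_call(flaky))  # ok  (succeeds on the third attempt)
```

In the real combo you would wrap an `httpx.Client().get(...)` call the same way, catching `httpx.HTTPError` instead of the blanket `Exception`.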
Advanced combo:
aiohttp + Scrapy + Playwright
What other scraping libraries do you swear by? Share them in the comments!