import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
resp = requests.get('https://example.com', headers=headers)

A more automated approach is to use the fake-useragent library:

from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}
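If installing an extra library is not an option, the same rotation idea can be approximated with the standard library. The following is a minimal sketch: the `USER_AGENTS` pool and the `random_headers` helper are hypothetical names introduced here for illustration, and the strings in the pool are only examples that should be replaced with values copied from real browsers.

```python
import random

# Hypothetical, hand-picked pool; replace with strings copied from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers(rng=random):
    """Return a headers dict whose User-Agent is picked at random from the pool."""
    return {"User-Agent": rng.choice(USER_AGENTS)}
```

Calling `random_headers()` before each request and passing the result as `headers=` gives every request a User-Agent drawn from the pool.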

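2. IP Blocking / Rate Limiting

How it works: When a single IP sends too many requests in a short period, the site bans it or responds with a CAPTCHA.

Countermeasure: Use a pool of proxy IPs and add random delays between requests.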

proxies = {
    'http': 'http://127.0.0.1:8080',
    # The URL scheme says how to reach the proxy itself; most local proxies speak plain HTTP
    'https': 'http://127.0.0.1:8080'
}
resp = requests.get('https://example.com', proxies=proxies)

Fetching free proxies dynamically (unreliable; for learning purposes only):

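import requests
from bs4 import BeautifulSoup

def get_free_proxies():
    url = 'https://free-proxy-list.net/'
    soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')
    proxies = []
    for row in soup.select('table.table tbody tr'):
        cells = row.find_all('td')
        # Skip rows without the expected columns (assumed layout: IP, port, ..., HTTPS flag at index 6)
        if len(cells) > 6 and cells[6].text.strip() == 'yes':  # supports HTTPS
            # Index 4 appears to hold the anonymity level, not a URL scheme, so use http:// explicitly
            proxy = f"http://{cells[0].text.strip()}:{cells[1].text.strip()}"
            proxies.append(proxy)
    return proxies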

Delay example:

import time
import random

for url in url_list:
    resp = requests.get(url, headers=headers)
    time.sleep(random.uniform(1, 3))  # sleep for a random 1 to 3 seconds
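The proxy pool mentioned as a countermeasure can be sketched as a small rotation helper. This is an illustration under stated assumptions, not a definitive implementation: `ProxyPool` and `fetch_with_rotation` are hypothetical names, and `fetch` is a caller-supplied function (for example a thin wrapper around `requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)`) so that the rotation logic stays independent of any HTTP library.

```python
from collections import deque


class ProxyPool:
    """Round-robin pool that drops proxies reported as failing."""

    def __init__(self, proxies):
        self._proxies = deque(proxies)

    def get(self):
        """Return the next proxy and move it to the back of the queue."""
        if not self._proxies:
            raise RuntimeError("proxy pool is empty")
        proxy = self._proxies[0]
        self._proxies.rotate(-1)
        return proxy

    def mark_bad(self, proxy):
        """Remove a proxy from the pool; unknown proxies are ignored."""
        try:
            self._proxies.remove(proxy)
        except ValueError:
            pass

    def __len__(self):
        return len(self._proxies)


def fetch_with_rotation(url, pool, fetch, max_attempts=3):
    """Call fetch(url, proxy) through successive proxies until one succeeds."""
    last_error = None
    for _ in range(max_attempts):
        proxy = pool.get()
        try:
            return fetch(url, proxy)
        except Exception as exc:  # broad on purpose: any failure retires the proxy
            pool.mark_bad(proxy)
            last_error = exc
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

Retiring a proxy on its first failure is a deliberately simple policy; free proxies in particular fail often, so a real pool would probably allow a few strikes before removal.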

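3. Request Header Completeness Checks

How it works: The server checks whether headers such as Referer, Accept-Language, and Cookie look like those sent by a real browser.

Countermeasure: Copy the complete set of request headers from a real browser.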

headers = {
    'User-Agent': '...',
    'Accept': 'text/html,application/xhtml+xml,...',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Referer': 'https://www.google.com/',
    'Cookie': 'your_cookie_value'
}
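Before sending a request, it can help to check that the headers dict is not missing anything a browser would normally send. The sketch below is only illustrative: `EXPECTED_HEADERS` and `missing_headers` are hypothetical names, and the list of expected headers is an assumption based on the ones named above, not an authoritative set.

```python
# Headers a typical browser request is assumed to carry (illustrative only).
EXPECTED_HEADERS = ("User-Agent", "Accept", "Accept-Language", "Referer")


def missing_headers(headers, expected=EXPECTED_HEADERS):
    """Return the expected header names that are absent or empty in `headers`.

    Header names are compared case-insensitively, as HTTP requires.
    """
    present = {name.lower() for name, value in headers.items() if value}
    return [name for name in expected if name.lower() not in present]
```

For example, `missing_headers({'User-Agent': '...'})` reports `Accept`, `Accept-Language`, and `Referer` as missing.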

For sites that require an account login, use requests.Session to persist cookies automatically:

session = requests.Session()
session.headers.update(headers)
resp = session.get('https://example.com/login')
# After logging in, the session automatically sends back the cookies it received

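4. Dynamic Content (JavaScript Rendering)

How it works: The data is generated dynamically by JavaScript, so the HTML fetched directly does not contain the target data.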

Countermeasure:

Analyze the AJAX endpoints (the most lightweight option)

Simulate a browser with Selenium / Playwright / Puppeteer

Method 1: Find the real API

Open the browser developer tools → Network → XHR and find the actual request that returns the data.

# Request the data endpoint directly
api_url = 'https://example.com/api/getData?param=value'
resp = requests.get(api_url, headers=headers)
data = resp.json()
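Data endpoints found this way often return results one page at a time. The following sketch shows one way to walk such an API, under loud assumptions: the `items` and `has_more` keys are a hypothetical response shape (the real field names must be read from the Network panel), and `fetch_page` is a caller-supplied function that returns the decoded JSON for a given page number.

```python
def fetch_all_items(fetch_page, max_pages=50):
    """Collect items from a paginated JSON API.

    fetch_page(page) must return the decoded JSON dict for that page.
    The 'items' and 'has_more' keys are assumptions about the response shape.
    """
    items = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        items.extend(payload.get("items", []))
        if not payload.get("has_more"):
            break  # the server reports no further pages
    return items
```

With requests, `fetch_page` could be something like `lambda page: requests.get(api_url, params={'page': page}, headers=headers).json()`, where the `page` parameter name is again an assumption. The `max_pages` cap guards against an endpoint that never reports the last page.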

Method 2: Selenium example

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # headless mode
options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')
content = driver.page_source
driver.quit()

Method 3: Playwright (more modern)

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')
    content = page.content()
    browser.close()

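5. CAPTCHAs

How it works: The site presents an image CAPTCHA, a slider challenge, or a similar test.

Countermeasure: Integrate a CAPTCHA-solving service (such as 2Captcha or Chaojiying), use machine learning to recognize simple CAPTCHAs, or lower the request rate to avoid triggering them.

Example (using a CAPTCHA-solving service API):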

# Chaojiying example (pseudocode)
from chaojiying import Chaojiying_Client

chaojiying = Chaojiying_Client('username', 'password', 'soft_id')
with open('captcha.png', 'rb') as f:
    im = f.read()
result = chaojiying.PostPic(im, 1902)  # 1902 is the CAPTCHA type code
code = result['pic_str']
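The last countermeasure listed above, lowering the request rate, can be sketched as exponential backoff that kicks in once a response looks like a block or a CAPTCHA page. Both helpers are hypothetical: `looks_like_captcha` uses status codes (403, 429) and a keyword that are assumptions and will differ between sites, and `backoff_delay` uses one common jitter variant rather than any site-specific rule.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`; the actual delay is
    drawn uniformly from [0, ceiling] so that retries do not line up.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def looks_like_captcha(status_code, body):
    """Heuristic check; the status codes and keyword are assumptions."""
    return status_code in (403, 429) or "captcha" in body.lower()
```

In a crawl loop, a response for which `looks_like_captcha(resp.status_code, resp.text)` is true would be followed by `time.sleep(backoff_delay(attempt))` and a retry with an incremented `attempt`.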