
How to Scrape Website Data with Python

2026-06-29 02:43:43


1. Getting Started with Python Web Scraping

Let's start with a simple Python crawler that fetches the title of a web page:

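import requests
from bs4 import BeautifulSoup

# Send the HTTP request
url = 'http://www.baidu.com/'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on a 4xx/5xx response
response.encoding = response.apparent_encoding  # avoid garbled non-ASCII titles

# Parse the HTML document
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.title

# Print the result (soup.title is None when the page has no <title> tag)
print('Page title:', title.string if title else None)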

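The requests library sends the HTTP request, and the BeautifulSoup library parses the returned HTML document. Together they make it easy to retrieve page data for later analysis and processing.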


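2. Using a Proxy IP

Using a proxy IP is straightforward: pass a proxies argument to the requests library's get() or post() method. The following crawler sends its request through a proxy and scrapes a list of proxy IPs from a website: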

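import requests
from bs4 import BeautifulSoup

# Configure the proxy (replace with a proxy you can actually reach)
proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080'
}

# Send the HTTP request through the proxy
url = 'http://www.zdaye.cn/freeproxy.html'
response = requests.get(url, proxies=proxies, timeout=10)

# Parse the HTML document
soup = BeautifulSoup(response.text, 'html.parser')
trs = soup.select('.table tbody tr')

# Print each proxy as ip:port
for tr in trs:
    tds = tr.select('td')
    if len(tds) < 2:
        continue  # skip rows without both an IP cell and a port cell
    ip = tds[0].get_text(strip=True)
    port = tds[1].get_text(strip=True)
    print(f'{ip}:{port}')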

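In this program we configure a proxy IP and pass it to requests through the proxies argument when sending the HTTP request. We then parse the HTML document with BeautifulSoup, locate the proxy IPs listed in the page's table, and print them.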
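When several proxies are in play, writing the proxies dict by hand gets repetitive. The sketch below wraps that step in a small helper. The name build_proxies and the credential values are hypothetical, and the sketch assumes a plain HTTP proxy that, if it needs a login, accepts credentials embedded in the proxy URL.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Return a proxies dict for requests (build_proxies is a hypothetical helper)."""
    if username is not None and password is not None:
        # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    else:
        auth = ""
    proxy_url = f"http://{auth}{host}:{port}"
    # Both plain-HTTP and HTTPS traffic are routed through the same proxy.
    return {"http": proxy_url, "https": proxy_url}


# Placeholder values, not real proxy credentials.
proxies = build_proxies("127.0.0.1", 8080, username="user", password="p@ss")
print(proxies["http"])
```

The returned dict can be passed unchanged as requests.get(url, proxies=proxies).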


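3. Anti-Scraping Techniques

To keep crawlers out, some websites deploy anti-scraping measures such as rate limiting and CAPTCHAs.

① Request Intervals

Adding a delay between requests reduces the load on the target site and makes it less likely that the crawler trips the site's anti-scraping measures. An implementation looks like this: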

import requests
import time

# Send the HTTP request repeatedly
url = 'http://www.baidu.com/'
while True:
    response = requests.get(url)
    print(response.text)
    time.sleep(5)  # wait 5 seconds before the next request

This code uses the time library to make the program wait 5 seconds before it sends the next HTTP request.
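A perfectly regular 5-second rhythm is itself easy to recognize, so a common variation is to randomize the wait and to crawl a finite list of URLs rather than loop forever. The following is a minimal sketch of that idea. The names polite_delays and crawl, and the fetch and sleep parameters, are hypothetical and introduced only for illustration.

```python
import random
import time


def polite_delays(count, base=5.0, jitter=2.0):
    """Yield `count` wait times, each drawn from [base, base + jitter] seconds."""
    for _ in range(count):
        yield base + random.uniform(0, jitter)


def crawl(urls, fetch, base=5.0, jitter=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a randomized interval between calls."""
    results = []
    delays = polite_delays(len(urls), base, jitter)
    for i, (url, delay) in enumerate(zip(urls, delays)):
        if i > 0:
            sleep(delay)  # no need to wait before the very first request
        results.append(fetch(url))
    return results
```

In a real crawler, fetch could be something like lambda u: requests.get(u, timeout=10).text. Passing sleep as a parameter keeps the pacing logic testable without actually waiting.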

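② Random User-Agent

Some websites inspect the User-Agent header to decide whether a request comes from a crawler. Sending a randomly chosen User-Agent makes the crawler harder to detect. An implementation looks like this: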

import requests
from fake_useragent import UserAgent

# Pick a random User-Agent
ua = UserAgent()
headers = {
    'User-Agent': ua.random
}

# Send the HTTP request with the randomized header
url = 'http://www.baidu.com/'
response = requests.get(url, headers=headers)
print(response.text)

The fake_useragent library generates a random User-Agent string, which we then set in the headers of the HTTP request.
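If installing an extra package is not an option, roughly the same effect can be had with the standard library alone by picking from a hand-maintained pool. This is a swapped-in alternative, not what fake_useragent does internally. The USER_AGENTS list and the random_headers helper are illustrative assumptions, and the strings would need to be refreshed with current browser versions.

```python
import random

# Example User-Agent strings; an illustrative pool, to be replaced with current ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers(pool=USER_AGENTS):
    """Return request headers carrying one User-Agent picked at random from pool."""
    return {"User-Agent": random.choice(pool)}


headers = random_headers()
print(headers["User-Agent"])
```

The resulting dict is used the same way as before: requests.get(url, headers=headers).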

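③ Using Cookies

Some websites use a visitor's cookies to decide whether a request comes from a crawler. By obtaining the site's cookies and sending them along with the crawler's requests, the crawler can pass as a normal user. An implementation looks like this: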

import requests

url = 'http://www.baidu.com/'

# First request: obtain the site's cookies
response = requests.get(url)
cookies = response.cookies

# Second request: send the cookies back through the cookies parameter
response = requests.get(url, cookies=cookies)
print(response.text)

We first send an HTTP request to obtain the site's cookies, then pass them to the next request through the cookies parameter of requests.get().
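Another way to obtain a site's cookies is to copy the Cookie header of a logged-in browser session from the developer tools. The sketch below assumes that workflow and converts such a raw string into a dict. The helper name cookie_header_to_dict and the sample cookie values are hypothetical.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(raw_cookie):
    """Convert a raw 'name=value; name2=value2' Cookie header string into a dict."""
    jar = SimpleCookie()
    jar.load(raw_cookie)
    return {name: morsel.value for name, morsel in jar.items()}


# Placeholder string; a real one would be copied from the browser's developer tools.
raw = "sessionid=abc123; theme=dark"
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The dict can then be passed as requests.get(url, cookies=cookies), since the cookies parameter accepts a plain dict as well as a cookie jar.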

