
Python Web Scraping for Beginners: Fetching Web Pages Is Actually Simple

2026-02-22 11:37:33

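Want to pull data from websites but don't know where to start? This post walks you through writing your first scraper from scratch.

Web scraper (Spider/Scraper) = automated program + simulated browser requests + data extraction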


Our computer ──request──> Web server
             <──returns page HTML──
             ──parse & extract──> Useful data

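What can a scraper do?

Use case	Example
Data collection	Scraping news, housing prices, stock data
Content retrieval	Downloading images, videos, papers
Price monitoring	Alerts when e-commerce prices change
Search engines	The crawlers behind Baidu and Google
Data analysis	Gathering data for analysis reports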

II. Preparation

1. Install the Python libraries

# The basic HTTP library used for scraping
pip install requests

# HTML parsing libraries (covered later)
pip install beautifulsoup4 lxml
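To confirm the install worked, a quick check along these lines may help. This is only a sketch: the helper name `installed_versions` is made up for illustration, and it relies on the standard library's `importlib.metadata` (Python 3.8+) to look up each package by the name you passed to pip.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


# Report on the three packages installed above
for name, ver in installed_versions(['requests', 'beautifulsoup4', 'lxml']).items():
    print(f'{name}: {ver or "NOT INSTALLED"}')
```

If any line reports NOT INSTALLED, pip most likely installed into a different Python interpreter than the one running the script.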

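2. Understanding HTTP requests

Visiting a web page is, at bottom, sending an HTTP request:

Method	Purpose	Example
GET	Retrieve data	Loading a page, running a query
POST	Submit data	Logging in, submitting a form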
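To make the difference concrete, the sketch below builds one GET and one POST request with `requests` and inspects them without sending anything over the network. The URLs, the `q` parameter, and the form fields are placeholders invented for this example.

```python
import requests

# Build (but do not send) a GET request: the parameters end up in the URL
get_req = requests.Request(
    'GET',
    'https://www.example.com/search',   # placeholder URL
    params={'q': 'python'},             # hypothetical query parameter
).prepare()

# Build (but do not send) a POST request: the data ends up in the body
post_req = requests.Request(
    'POST',
    'https://www.example.com/login',    # placeholder URL
    data={'username': 'alice', 'password': 'secret'},  # hypothetical form fields
).prepare()

print(get_req.method, get_req.url)   # GET https://www.example.com/search?q=python
print(get_req.body)                  # None
print(post_req.method, post_req.url) # POST https://www.example.com/login
print(post_req.body)                 # username=alice&password=secret
```

In everyday code you would simply call `requests.get(...)` or `requests.post(...)`; preparing the request by hand is used here only to show where the data travels.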

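III. Your First Scraper: requests Basics

1. The simplest GET request

import requests

# Send a GET request
response = requests.get('https://www.example.com')

# Check the status code (200 = OK, 404 = not found, 403 = forbidden)
print(response.status_code)

# The response body decoded as text (the HTML)
print(response.text)

# The raw response body as bytes (use this for images and videos)
print(response.content)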

2. Requests with headers

Many sites inspect the request headers and refuse requests that look like they come from a script:

import requests

# Present a browser-style User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

response = requests.get(
    'https://www.example.com',
    headers=headers
)

print(response.status_code)
print(response.text)
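Why is a script so easy to spot? By default, requests announces itself in the User-Agent header. The sketch below prints that default and then shows one way to set a browser-style header once on a `requests.Session` so every request made through it carries it. Nothing is sent over the network; the request is only prepared so the final headers can be inspected.

```python
import requests

# What requests sends when you do not set a User-Agent yourself
print(requests.utils.default_user_agent())  # e.g. python-requests/2.x.y

BROWSER_UA = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
)

# Set the header once; the session applies it to every request it makes
session = requests.Session()
session.headers.update({'User-Agent': BROWSER_UA})

# Prepare a request through the session to see the merged headers
prepared = session.prepare_request(
    requests.Request('GET', 'https://www.example.com')
)
print(prepared.headers['User-Agent'])
```

After this setup, `session.get(url)` behaves like `requests.get(url, headers=headers)` without repeating the headers each time.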

3. Requests with parameters

import requests

# URL query parameters
params = {
    'wd': 'Python爬虫',  # search keyword
    'pn': 10             # result offset (10 skips the first 10 results)
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

response = requests.get(
    'https://www.baidu.com/s',
    params=params,
    headers=headers
)

# Print the full URL that was requested
print(response.url)
# Expected output (if the site does not redirect):
# https://www.baidu.com/s?wd=Python%E7%88%AC%E8%99%AB&pn=10
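The `%E7%88%AC...` part of that URL is just the keyword percent-encoded as UTF-8. As a small offline sketch, the standard library's `urllib.parse` can reproduce the same query string and decode it back, which is handy when you want to check what a URL copied from the browser actually contains.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

params = {'wd': 'Python爬虫', 'pn': 10}

# Encode the parameters into a query string (non-ASCII text becomes %XX bytes)
query = urlencode(params)
url = f'https://www.baidu.com/s?{query}'
print(url)  # https://www.baidu.com/s?wd=Python%E7%88%AC%E8%99%AB&pn=10

# Decode a URL back into its parameters; every value comes back as a list of strings
decoded = parse_qs(urlsplit(url).query)
print(decoded)  # {'wd': ['Python爬虫'], 'pn': ['10']}
```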

IV. Parsing Data: Getting Started with BeautifulSoup

The server returns HTML, and we need to pull the useful information out of it.

from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <div class="article">
            <h1>Article title</h1>
            <p class="content">This is the article content</p>
            <a href="https://example.com">Link</a>
        </div>
    </body>
</html>
"""

# Parse the HTML
soup = BeautifulSoup(html, 'lxml')

# 1. Look up by tag name (returns the first match)
print(soup.h1)           # <h1>Article title</h1>
print(soup.p)            # <p class="content">This is the article content</p>

# 2. Get the text content
print(soup.h1.text)      # Article title
print(soup.p.get_text()) # This is the article content

# 3. Look up by class
article = soup.find('div', class_='article')
print(article.text)

# 4. Read an attribute
link = soup.find('a')
print(link['href'])      # https://example.com

# 5. Find all matches
all_links = soup.find_all('a')
for link in all_links:
    print(link['href'])
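One thing the example above glosses over: real pages often contain tags without the attribute you want, and `tag['href']` raises `KeyError` in that case. The following sketch shows two ways around it. The sample HTML is made up, and it uses the built-in `'html.parser'` instead of `'lxml'` only so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<div class="article">
    <a href="https://example.com/a">First</a>
    <a>No href here</a>
    <a href="https://example.com/b">Second</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Option 1: tag.get('href') returns None instead of raising KeyError
hrefs = [a.get('href') for a in soup.find_all('a')]
print(hrefs)  # ['https://example.com/a', None, 'https://example.com/b']

# Option 2: ask find_all for only those <a> tags that have an href at all
only_with_href = [a['href'] for a in soup.find_all('a', href=True)]
print(only_with_href)  # ['https://example.com/a', 'https://example.com/b']

# find() returns None when nothing matches, so check before using the result
missing = soup.find('h2')
print(missing)  # None
```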

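Common lookup methods

Method	What it does
`find()`	Returns the first match
`find_all()`	Returns all matches
`select()`	Matches by CSS selector

# CSS selector examples
soup.select('.article')          # elements with class="article"
soup.select('#header')           # the element with id="header"
soup.select('div > a')           # <a> tags that are direct children of a <div>
soup.select('a[href="xxx"]')     # <a> tags whose href equals "xxx"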
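Here is a runnable version of those selectors against a small made-up page, mainly to show the difference between `>` (direct children only) and a space (any depth). `select_one()` returns the first match rather than a list. As before, `'html.parser'` is used only to avoid requiring lxml.

```python
from bs4 import BeautifulSoup

html = """
<div id="header"><a href="/home">Home</a></div>
<div class="article">
    <h1>Article title</h1>
    <p><a href="https://example.com/nested">Nested link</a></p>
    <a href="https://example.com/direct">Direct link</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# '>' matches direct children only, so the link inside <p> is skipped
direct = [a['href'] for a in soup.select('div.article > a')]
print(direct)     # ['https://example.com/direct']

# A space matches descendants at any depth
any_depth = [a['href'] for a in soup.select('div.article a')]
print(any_depth)  # ['https://example.com/nested', 'https://example.com/direct']

# select_one() returns the first matching element (or None)
title = soup.select_one('.article h1').get_text()
header_link = soup.select_one('#header a')['href']
print(title, header_link)  # Article title /home
```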

V. Hands-On: Scraping a Simple Page

Goal: scrape a simple article page

import requests
from bs4 import BeautifulSoup

# Target page
url = 'https://example.com/blog/article-1'

# Send the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

response = requests.get(url, headers=headers)
response.encoding = 'utf-8'  # set the encoding explicitly

# Parse the page
soup = BeautifulSoup(response.text, 'lxml')

# Extract the fields
title = soup.find('h1').text
content = soup.find('div', class_='content').text
date = soup.find('time').text

# Print the results
print(f'Title: {title}')
print(f'Published: {date}')
print(f'Content: {content[:100]}...')  # show only the first 100 characters
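The script above assumes every element exists; if the page has no `<time>` tag, `soup.find('time').text` fails with `AttributeError`. One possible way to harden it is to move the extraction into a function that tolerates missing elements. The names `text_or_none` and `parse_article` are made up for this sketch, and the sample HTML stands in for a downloaded page so the example runs offline.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the tag's stripped text, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_article(html):
    """Extract title, date and content from an article page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')  # 'lxml' also works if installed
    return {
        'title': text_or_none(soup.find('h1')),
        'date': text_or_none(soup.find('time')),
        'content': text_or_none(soup.find('div', class_='content')),
    }


# Stand-in for response.text; note that there is no <time> element
sample = """
<html><body>
  <h1> Hello scraping </h1>
  <div class="content">Body text goes here.</div>
</body></html>
"""
article = parse_article(sample)
print(article)
# {'title': 'Hello scraping', 'date': None, 'content': 'Body text goes here.'}
```

Keeping the parsing separate from the download also makes it easy to test the extraction logic against saved HTML files.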

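VI. Common Problems

Problem	Cause	Fix
Garbled text	Wrong encoding	`response.encoding = 'utf-8'`
403 error	Identified as a scraper	Add a User-Agent, switch IP
404 error	Page does not exist	Check that the URL is correct
Slow or refused responses	Requests sent too frequently	Add a delay: `time.sleep(1)`
Element not found	Page structure changed	Inspect the HTML with the browser's developer tools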
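The garbled-text row is easier to understand with a tiny offline demonstration. When a server sends a text response without declaring a charset, requests falls back to ISO-8859-1, so UTF-8 bytes get decoded with the wrong table. The sketch below reproduces that with plain bytes; the sample string is arbitrary.

```python
# Bytes as a server would send them for a UTF-8 page
raw = '爬虫'.encode('utf-8')

# Decoding with the wrong codec produces garbage instead of an error
wrong = raw.decode('iso-8859-1')
print(repr(wrong))

# Decoding with the right codec gives the original text back
right = raw.decode('utf-8')
print(right)  # 爬虫

# The garbled string still holds the same bytes, so it can be repaired
repaired = wrong.encode('iso-8859-1').decode('utf-8')
print(repaired == right)  # True
```

This is what setting `response.encoding = 'utf-8'` fixes: it tells requests which codec to use when building `response.text`. If you do not know the page's encoding, `response.apparent_encoding` offers a guess based on the content.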

Delays and exception handling

import requests
from bs4 import BeautifulSoup
import time

url = 'https://example.com/page'

try:
    response = requests.get(url, timeout=10)  # give up after 10 seconds
    response.raise_for_status()  # raise an exception on 4xx/5xx status codes
    soup = BeautifulSoup(response.text, 'lxml')
    print('Scrape succeeded')
except requests.exceptions.RequestException as e:
    print(f'Request failed: {e}')

# Delay: don't scrape too fast
time.sleep(2)  # pause for 2 seconds
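The two ideas combine naturally into a retry helper: try the request, and on failure wait a moment before trying again. The version below is a minimal sketch, not a definitive implementation. The function name `fetch_with_retry` and its `retries`, `delay`, and `get` parameters are invented for this example; `get` defaults to `requests.get` and exists only so the function can be exercised without a network.

```python
import time

import requests


def fetch_with_retry(url, get=requests.get, retries=3, delay=2.0):
    """Fetch url, retrying on failure; return the response or None."""
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=10)
            response.raise_for_status()  # treat 4xx/5xx as failures too
            return response
        except requests.exceptions.RequestException as exc:
            print(f'Attempt {attempt}/{retries} failed: {exc}')
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    return None
```

A typical call would be `response = fetch_with_retry('https://example.com/page')`, followed by a check for `None` before parsing. When looping over many pages, keep a `time.sleep()` between successful requests as well.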

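VII. Wrapping Up

The core of this first scraping tutorial comes down to:

`send the request with requests → parse with BeautifulSoup → extract the data`

Coming next: downloading images and files, crawling across multiple pages, and saving results locally.

If this post helped, a like or a share is appreciated 👍

Questions are welcome in the comments. Let's improve together!

#AILearning #PythonScraping #requests #BeautifulSoup #DataAnalysis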

Photo by zenigame photo on Unsplash
