
Python Web Scraping: The Most Common Functions Revealed!

2026-06-23 14:53:23

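New to scraping and never getting any data back? You may be using the wrong functions! I have put together the 10 most commonly used functions in Python web scraping, each with working code: `requests.get()` to send requests, `BeautifulSoup()` to parse HTML, `json.loads()` to handle JSON, `re.findall()` to extract content with regular expressions, and `headers` and cookies to mimic browser behavior, with proxy IPs and anti-blocking handling touched on as well!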

1. `requests.get(url)`

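Purpose: send a GET request and fetch the page content.

    import requests

    url = "https://httpbin.org/get"
    response = requests.get(url)
    print(response.text)

**Analysis**: `requests.get` is the most basic entry point for a scraper. It retrieves a page's HTML or the data returned by an API.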
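In practice a request usually carries query parameters, a timeout, and a status check. The sketch below shows one way to combine them. `fetch` is a hypothetical helper name and the query parameters are made up for illustration. The last lines build the request without sending it, so the encoded URL can be inspected offline.

```python
import requests


def fetch(url, params=None, timeout=10):
    """Send a GET request and return the body text (hypothetical helper)."""
    response = requests.get(url, params=params, timeout=timeout)
    # Raise requests.exceptions.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Build (but do not send) a request to see how params are encoded into the URL
prepared = requests.Request(
    "GET", "https://httpbin.org/get", params={"q": "python", "page": 2}
).prepare()
print(prepared.url)
```

Without a `timeout`, a request to an unresponsive server can hang indefinitely, so passing one is generally a good habit.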

2. `BeautifulSoup(html, 'html.parser')`

Purpose: parse an HTML page and extract elements.

    from bs4 import BeautifulSoup

    html = "<html><body><h1>Title</h1></body></html>"
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.h1.text)

Analysis: `BeautifulSoup` is a powerful HTML parser that offers convenient access to tags.

3. `soup.select(css_selector)`

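Purpose: extract HTML elements with CSS selectors.

    from bs4 import BeautifulSoup

    html = '<div><span class="price">$99</span></div>'
    soup = BeautifulSoup(html, 'html.parser')
    price = soup.select('.price')[0].text
    print(price)

Analysis: `select` supports jQuery-like CSS selectors, which makes it powerful and flexible.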
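Indexing with `[0]` raises `IndexError` when nothing matches, because `select` then returns an empty list. As a minimal sketch, the example below loops over repeated items and uses `select_one`, which returns `None` instead. The HTML snippet and field names are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item"><a href="/a">Apple</a><span class="price">$3</span></li>
  <li class="item"><a href="/b">Banana</a><span class="price">$1</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for li in soup.select("li.item"):
    link = li.select_one("a")        # first matching element, or None
    price = li.select_one(".price")
    rows.append({
        "name": link.get_text(strip=True),
        "href": link["href"],        # attribute access works like a dict
        "price": price.get_text(strip=True),
    })
print(rows)

# A selector with no match gives None rather than raising
print(soup.select_one(".missing"))
```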

4. `re.findall(pattern, string)`

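Purpose: extract content from a string with a regular expression.

    import re

    text = "Price: 100 yuan, discounted to 80 yuan"
    prices = re.findall(r"\d+", text)
    print(prices)

Analysis: regular expressions suit content with an irregular structure and are handy when the data cannot be pulled directly from the HTML tree.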
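What `re.findall` returns depends on the capture groups in the pattern, which often surprises newcomers. The short sketch below shows the three cases; the sample strings are made up.

```python
import re

text = "Price: 100 yuan, discounted to 80 yuan"

# No capture group: each result is the whole match
whole = re.findall(r"\d+", text)
print(whole)

# One capture group: each result is just that group
single = re.findall(r"(\d+) yuan", text)
print(single)

# Two capture groups: each result is a tuple of the groups
pairs = re.findall(r"(\w+): (\d+)", "apple: 3, banana: 12")
print(pairs)
```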

5. `json.loads(json_str)`

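Purpose: convert a JSON string into Python objects (a dictionary for a JSON object).

    import json

    json_str = '{"name": "Tom", "age": 22}'
    data = json.loads(json_str)
    print(data['name'])

Analysis: commonly used when scraping APIs, to turn the returned JSON into structured data.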
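API responses are not always valid JSON, and optional fields may be missing. As a rough sketch, the hypothetical `parse_user` helper below handles both cases; the field names `address` and `city` are assumptions for illustration.

```python
import json


def parse_user(raw):
    """Return (name, city) from a JSON string, or None if it is not valid JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # .get() avoids KeyError when an optional field is missing
    city = data.get("address", {}).get("city", "unknown")
    return data.get("name"), city


print(parse_user('{"name": "Tom", "address": {"city": "Paris"}}'))
print(parse_user('{"name": "Tom"}'))
print(parse_user("<html>not json</html>"))
```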

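6. `headers`: disguising as a browser

Purpose: set request headers to reduce the chance of tripping anti-scraping checks.

    import requests

    headers = {
        "User-Agent": "Mozilla/5.0"
    }
    response = requests.get("https://httpbin.org/headers", headers=headers)
    print(response.text)

Analysis: many sites check the `User-Agent` header; this is one of the most basic anti-scraping measures.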
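Some sites look at more than `User-Agent`. The sketch below adds a couple of other common headers and prepares the request without sending it, so the outgoing headers can be checked offline. The header values are illustrative assumptions, not values any particular site requires.

```python
import requests

# Illustrative header values; real sites may check other fields too
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://httpbin.org/",
}

# Prepare the request without sending it, to inspect what would go out
prepared = requests.Request(
    "GET", "https://httpbin.org/headers", headers=headers
).prepare()
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```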

7. `session = requests.Session()`

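Purpose: keep a session alive to mimic behavior after logging in.

    import requests

    session = requests.Session()
    session.get("https://httpbin.org/cookies/set?name=value")
    res = session.get("https://httpbin.org/cookies")
    print(res.text)

Analysis: a session persists cookies across requests, which suits scrapers that need to act as a logged-in user.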
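A session can also carry default headers and manually seeded cookies. The following sketch sets both and then merges the session state into a request without sending it; the cookie name and value are placeholders.

```python
import requests

session = requests.Session()
# Headers set here are sent with every request made through the session
session.headers.update({"User-Agent": "Mozilla/5.0"})
# Manually seed a cookie (the name and value are placeholders)
session.cookies.set("name", "value")

# Merge session state into a request without sending it
prepared = session.prepare_request(
    requests.Request("GET", "https://httpbin.org/cookies")
)
print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```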

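8. `xpath()`: extracting HTML elements

Purpose: extract elements precisely by path.

    from lxml import etree

    html = "<div><p>Hello</p><p>World</p></div>"
    tree = etree.HTML(html)
    text = tree.xpath("//p/text()")
    print(text)

Analysis: XPath is one of the most precise ways to extract data from structured HTML.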
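XPath can also filter by attribute and return attribute values directly. A minimal sketch follows, using an invented HTML fragment.

```python
from lxml import etree

html = """
<div>
  <a class="link" href="/page/1">First</a>
  <a class="link" href="/page/2">Second</a>
  <a href="/other">Other</a>
</div>
"""
tree = etree.HTML(html)

# [@class="link"] filters by attribute; /@href selects the attribute value
hrefs = tree.xpath('//a[@class="link"]/@href')
# text() selects the text nodes of the matched elements
labels = tree.xpath('//a[@class="link"]/text()')
print(list(zip(labels, hrefs)))
```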

9. `time.sleep(seconds)`

Purpose: throttle the request rate to lower the risk of getting your IP blocked.

    import time

    for i in range(3):
        print(f"Scraping page {i + 1}")
        time.sleep(1)
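A fixed pause produces a very regular request pattern. One common variation, sketched here under the assumption that a little randomness is acceptable, adds jitter to each delay. `polite_sleep` is a hypothetical helper name, and the short delays are only for the demo.

```python
import random
import time


def polite_sleep(base=1.0, jitter=0.5):
    """Sleep for base plus a random extra of up to jitter seconds; return the delay."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay


for page in range(1, 4):
    waited = polite_sleep(base=0.1, jitter=0.1)  # short values for the demo
    print(f"Scraped page {page}, waited {waited:.2f}s")
```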

10. `try-except` exception handling

Purpose: catch errors so the scraper stays robust.

    import requests

    try:
        r = requests.get("http://bad.url")
    except requests.exceptions.RequestException as e:
        print("Request failed:", e)

Analysis: real-world scrapers frequently run into network problems, so request logic should be wrapped in `try-except`.

These functions are the core tools of Python web scraping; once you are comfortable with them, you can handle most data-collection scenarios.
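Building on item 10, catching an error is often followed by retrying. The sketch below is one possible retry wrapper with a doubling delay. `with_retries` and `flaky_fetch` are hypothetical names, and the failing function is simulated so the example runs without a network.

```python
import time


def with_retries(func, attempts=3, base_delay=0.5, exceptions=(Exception,)):
    """Call func(); on failure wait and retry, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)


# Demo with a fake fetch that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return "ok"


result = with_retries(flaky_fetch, base_delay=0.01, exceptions=(ConnectionError,))
print(result, calls["count"])
```

With `requests`, the same wrapper could be called with `exceptions=(requests.exceptions.RequestException,)` and a function that performs the actual `requests.get` call.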

