🐍 Python Day 43: Web Page Parsing with XPath, a More Precise Extraction Tool
🕐 Estimated time: 2-3 hours | 🎯 Goal: learn the lxml library, XPath syntax, and parsing with lxml.etree
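📖 Today's Contents
- What is XPath?
- Getting started with lxml
- Basic XPath syntax
- Common XPath expressions
- Predicates (conditional filtering)
- XPath functions
- Hands-on: lxml + requests
- BeautifulSoup vs XPath comparison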
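1. What is XPath?
XPath (XML Path Language) is a query language for locating nodes in XML and HTML documents. It is more expressive than CSS selectors: it can match on text content, step from a node to its parent or siblings, and return attribute values or text directly.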
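To make the comparison with CSS selectors concrete, here is a minimal sketch. The table markup and variable names are made up for illustration; the point is only that one XPath expression can match on text, step up to the parent, and read a sibling.

```python
from lxml import etree

# Hypothetical snippet: two rows, only one of which mentions "Python".
html = """
<table>
  <tr><td>Python</td><td>1991</td></tr>
  <tr><td>Go</td><td>2009</td></tr>
</table>
"""
tree = etree.HTML(html)

# Match a cell by its text, step up to the parent row, then read the second cell.
# CSS selectors have no standard way to match on text content.
year = tree.xpath("//td[text()='Python']/../td[2]/text()")
print(year)  # ['1991']

# XPath can also return text nodes (or attribute values) directly as strings.
first_cells = tree.xpath("//tr/td[1]/text()")
print(first_cells)  # ['Python', 'Go']
```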
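2. Getting Started with lxml
# Install: pip install lxml
from lxml import etree
html = """
<html>
<body>
  <div class="container">
    <h1>Title</h1>
    <p id="p1">First paragraph</p>
    <p id="p2">Second paragraph</p>
  </div>
</body>
</html>
"""
# Parse the HTML string; etree.HTML returns the root <html> element
tree = etree.HTML(html)
# Query with XPath; xpath() returns a list of matches
result = tree.xpath("//h1/text()")
print(result) # ['Title']
result = tree.xpath("//p/text()")
print(result) # ['First paragraph', 'Second paragraph']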
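# Parsing from a file or from the network
from lxml import etree
import requests
# Parse a local HTML file (returns an ElementTree, which also supports xpath())
tree = etree.parse("page.html", etree.HTMLParser())
# Parse a page fetched over the network
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()  # fail early on HTTP errors
tree = etree.HTML(response.text)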
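One detail worth seeing once: etree.HTML is tolerant of incomplete markup and wraps a bare fragment in html and body elements. The sketch below uses a made-up fragment to show this, which is also why absolute paths starting at /html/body work on fragments.

```python
from lxml import etree

# A bare fragment with no <html> or <body> (made-up example input).
fragment = "<div><p>hello</p></div>"
root = etree.HTML(fragment)

# etree.HTML returns the root <html> element; the parser adds the missing wrappers.
print(root.tag)     # html
print(root[0].tag)  # body

# Serialize the tree to inspect what the parser actually built.
print(etree.tostring(root, encoding="unicode"))

# An absolute path starting at /html/body therefore reaches the fragment's content.
texts = root.xpath("/html/body/div/p/text()")
print(texts)  # ['hello']
```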
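3. Basic XPath Syntax
| Expression | Meaning | Example |
|---|---|---|
| / | Select from the root, or step to a direct child | /html/body/div |
| // | Select matching nodes at any depth | //div |
| . | The current node | .//p |
| .. | The parent node | ../.. |
| @ | Select an attribute | //a/@href |
| * | Any element | //div/* |
| text() | Text nodes | //p/text() |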
from lxml import etree
html = """
<div class="content">
  <h2>Title</h2>
  <p>Paragraph 1</p>
  <a href="/page1">Link 1</a>
  <a href="/page2">Link 2</a>
</div>
"""
tree = etree.HTML(html)
# //tag: select every element with that tag
print(tree.xpath("//a/text()")) # ['Link 1', 'Link 2']
# //tag/@attr: select attribute values
print(tree.xpath("//a/@href")) # ['/page1', '/page2']
# /html/body: absolute path (etree.HTML adds the missing <html> and <body>)
print(tree.xpath("/html/body/div/h2/text()")) # ['Title']
# //div/p: <p> elements that are direct children of a <div>
print(tree.xpath("//div/p/text()")) # ['Paragraph 1']
# .//tag: relative path (searches below the current node)
div = tree.xpath("//div")[0]
print(div.xpath(".//a/@href")) # ['/page1', '/page2']
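The leading dot in a relative path matters more than it looks. As a small sketch with made-up markup: calling xpath() on an element with an expression that starts with // still searches the whole document, while .// stays inside that element.

```python
from lxml import etree

html = """
<div id="nav"><a href="/home">Home</a></div>
<div id="main"><a href="/page1">Link 1</a></div>
"""
tree = etree.HTML(html)
main = tree.xpath("//div[@id='main']")[0]

# "//a" always starts from the document root, even when called on an element.
from_root = main.xpath("//a/@href")
print(from_root)  # ['/home', '/page1']

# ".//a" starts from the current element and only sees its descendants.
from_node = main.xpath(".//a/@href")
print(from_node)  # ['/page1']
```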
4. Common XPath Expressions
from lxml import etree
html = """
<div class="products">
  <div class="item" data-id="1">
    <h3>Phone</h3>
    <span class="price">¥2999</span>
    <span class="tag">Hot</span>
  </div>
  <div class="item" data-id="2">
    <h3>Laptop</h3>
    <span class="price">¥5999</span>
    <span class="tag">New</span>
  </div>
  <div class="item" data-id="3">
    <h3>Headphones</h3>
    <span class="price">¥399</span>
  </div>
</div>
"""
tree = etree.HTML(html)
# All product names
names = tree.xpath("//div[@class='item']/h3/text()")
print(names) # ['Phone', 'Laptop', 'Headphones']
# All prices
prices = tree.xpath("//span[@class='price']/text()")
print(prices) # ['¥2999', '¥5999', '¥399']
# Names of products that have a tag (step from the tag up to its parent, then to <h3>)
tagged = tree.xpath("//span[@class='tag']/../h3/text()")
print(tagged) # ['Phone', 'Laptop']
# Read the data-id attribute
ids = tree.xpath("//div[@class='item']/@data-id")
print(ids) # ['1', '2', '3']
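Document-wide queries like the ones above return flat lists, so three names and two tags cannot simply be zipped together. A common way around this, sketched here with made-up markup and a hypothetical helper named first_or_none, is to select each item node first and run relative queries inside it.

```python
from lxml import etree

html = """
<div class="item"><h3>Phone</h3><span class="tag">Hot</span></div>
<div class="item"><h3>Headphones</h3></div>
"""
tree = etree.HTML(html)


def first_or_none(node, expr):
    """Return the first XPath result, or None when nothing matches (hypothetical helper)."""
    found = node.xpath(expr)
    return found[0] if found else None


rows = []
for item in tree.xpath("//div[@class='item']"):
    # Relative queries keep each field attached to its own item.
    rows.append({
        "name": first_or_none(item, "./h3/text()"),
        "tag": first_or_none(item, "./span[@class='tag']/text()"),
    })

print(rows)
# [{'name': 'Phone', 'tag': 'Hot'}, {'name': 'Headphones', 'tag': None}]
```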
5. Predicates (Conditional Filtering)
from lxml import etree
html = """
<ul id="products">
  <li class="item">Phone ¥2999</li>
  <li class="item">Laptop ¥5999</li>
  <li class="item">Headphones ¥399</li>
  <li class="item">Tablet ¥3999</li>
  <li class="item">Watch ¥1999</li>
</ul>
"""
tree = etree.HTML(html)
# [n]: the n-th element within its parent (counting starts at 1, not 0!)
print(tree.xpath("//li[1]/text()")) # ['Phone ¥2999']
print(tree.xpath("//li[last()]/text()")) # ['Watch ¥1999']
# [position()]: filter by position
print(tree.xpath("//li[position()<3]/text()")) # the first 2: ['Phone ¥2999', 'Laptop ¥5999']
# [contains()]: text contains a substring
print(tree.xpath("//li[contains(text(),'5999')]/text()")) # ['Laptop ¥5999']
# [starts-with()]: value starts with a prefix
print(tree.xpath("//li[starts-with(@class,'item')]/text()")) # all five items
# [@attr='value']: attribute equals a value
print(tree.xpath("//li[@class='item']/text()")) # all five items
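Positional predicates are evaluated relative to each parent, which only becomes visible once a page has more than one list. The following sketch (made-up markup) contrasts //li[1] with (//li)[1].

```python
from lxml import etree

html = """
<ul><li>A1</li><li>A2</li></ul>
<ul><li>B1</li><li>B2</li></ul>
"""
tree = etree.HTML(html)

# [1] applies per parent: the first <li> inside each <ul>.
per_parent = tree.xpath("//li[1]/text()")
print(per_parent)  # ['A1', 'B1']

# Parentheses collect every <li> first, then take the first one overall.
first_overall = tree.xpath("(//li)[1]/text()")
print(first_overall)  # ['A1']

# The same applies to last().
last_overall = tree.xpath("(//li)[last()]/text()")
print(last_overall)  # ['B2']
```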
6. XPath Functions
from lxml import etree
html = """
<div>
  <p class="title">Hello World</p>
  <p class="content">Learning Python is fun</p>
  <p class="content">XPath is powerful</p>
  <a href="/page1" target="_blank">Link 1</a>
  <a href="/page2">Link 2</a>
</div>
"""
tree = etree.HTML(html)
# text(): get text nodes
print(tree.xpath("//p/text()")) # ['Hello World', 'Learning Python is fun', 'XPath is powerful']
# contains(): substring test
print(tree.xpath("//p[contains(text(),'Python')]/text()")) # ['Learning Python is fun']
# normalize-space(): trim and collapse whitespace
print(tree.xpath("normalize-space(//p[@class='title'])")) # 'Hello World'
# string(): all descendant text of the first match, including the link text
# and the whitespace between tags
print(tree.xpath("string(//div)"))
print(tree.xpath("normalize-space(//div)")) # 'Hello World Learning Python is fun XPath is powerful Link 1 Link 2'
# count(): number of matches (returned as a float)
print(tree.xpath("count(//p)")) # 3.0
# concat(): join strings
print(tree.xpath("concat(//p[1]/text(), ' - ', //p[2]/text())")) # 'Hello World - Learning Python is fun'
# Reading several attributes
print(tree.xpath("//a/@href")) # ['/page1', '/page2']
print(tree.xpath("//a/@target")) # ['_blank'] (elements without target return nothing)
# Logical operators
print(tree.xpath("//a[@target='_blank' and @href='/page1']/text()")) # ['Link 1']
print(tree.xpath("//p[@class='title' or @class='content']/text()")) # all three paragraphs
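The difference between text() and string() shows up as soon as an element contains inline markup. This sketch uses a made-up paragraph with a nested b element to show what each one returns, and why contains(., ...) is often the safer test.

```python
from lxml import etree

html = '<p class="note">Learning <b>Python</b> is fun</p>'
tree = etree.HTML(html)
p = tree.xpath("//p[@class='note']")[0]

# text() returns only the direct text nodes of <p>, so "Python" is missing.
direct = p.xpath("./text()")
print(direct)  # ['Learning ', ' is fun']

# string(.) concatenates the text of the element and all its descendants.
full = p.xpath("string(.)")
print(full)  # Learning Python is fun

# contains(text(), ...) only tests the first text node;
# contains(., ...) tests the element's whole string value.
by_text = tree.xpath("//p[contains(text(), 'Python')]")
by_value = tree.xpath("//p[contains(., 'Python')]")
print(len(by_text), len(by_value))  # 0 1
```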
7. Hands-On: lxml + requests
import requests
from lxml import etree
def scrape_quotes(url="https://quotes.toscrape.com/"):
    """Scrape quotes from the quotes site."""
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    tree = etree.HTML(response.text)
    # Extract data with XPath: one node per quote, then relative queries inside it
    quotes = tree.xpath("//div[@class='quote']")
    results = []
    for quote in quotes:
        # [0] raises IndexError if the element is missing from a quote block
        text = quote.xpath(".//span[@class='text']/text()")[0]
        author = quote.xpath(".//small[@class='author']/text()")[0]
        tags = quote.xpath(".//a[@class='tag']/text()")
        results.append({
            "text": text,
            "author": author,
            "tags": tags
        })
    return results
quotes = scrape_quotes()
for q in quotes[:3]:
    print(f"📝 {q['text']}")
    print(f"   by {q['author']} | tags: {', '.join(q['tags'])}")
    print()
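A natural next step is following pagination. The sketch below keeps parsing separate from fetching so it can be tried offline on a sample string. The function name parse_quotes_page, the sample markup, and the assumption that the next-page link sits in an li element with class "next" are all illustrative and should be checked against the real page.

```python
from urllib.parse import urljoin

from lxml import etree


def parse_quotes_page(html_text, base_url):
    """Parse one page; return (quotes, next_url). next_url is None on the last page."""
    tree = etree.HTML(html_text)
    quotes = []
    for node in tree.xpath("//div[@class='quote']"):
        # string(...) returns '' instead of raising when nothing matches.
        text = node.xpath("string(.//span[@class='text'])").strip()
        author = node.xpath("string(.//small[@class='author'])").strip()
        if not text:
            continue  # skip blocks that do not look like a quote
        quotes.append({"text": text, "author": author})
    # ASSUMPTION: the "next page" link sits in <li class="next"><a href="...">.
    hrefs = tree.xpath("//li[@class='next']/a/@href")
    next_url = urljoin(base_url, hrefs[0]) if hrefs else None
    return quotes, next_url


# Made-up sample that imitates the assumed page structure.
sample = """
<div class="quote">
  <span class="text">"Sample quote."</span>
  <small class="author">Someone</small>
</div>
<ul><li class="next"><a href="/page/2/">Next</a></li></ul>
"""
quotes, next_url = parse_quotes_page(sample, "https://quotes.toscrape.com/")
print(quotes)    # [{'text': '"Sample quote."', 'author': 'Someone'}]
print(next_url)  # https://quotes.toscrape.com/page/2/
```

In a real crawl, a loop would fetch each next_url with requests, pass response.text to this function, and stop when next_url is None.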
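8. BeautifulSoup vs XPath Comparison
| Task | BeautifulSoup | XPath |
|---|---|---|
| Find by tag | soup.find("div") (first match only) | tree.xpath("//div") (list of all matches) |
| Filter by class | find("div", class_="a") | xpath("//div[@class='a']") |
| Get text | tag.text | xpath("//div/text()") |
| Get an attribute | tag["href"] | xpath("//a/@href") |
| Text contains | | contains(text(),'x') |
| Parent node | tag.parent | xpath("..") |
| Text equals | | //a[text()='Link'] |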
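The class row hides a difference worth knowing: BeautifulSoup's class_ argument matches one class among several, whereas @class='a' compares the whole attribute string. The sketch below (made-up markup) shows the exact match, the substring match, and a commonly used whole-token idiom on the XPath side.

```python
from lxml import etree

html = """
<div class="item">A</div>
<div class="item featured">B</div>
<div class="items">C</div>
"""
tree = etree.HTML(html)

# Exact string comparison: misses class="item featured".
exact = tree.xpath("//div[@class='item']/text()")
print(exact)  # ['A']

# Substring test: also matches the unrelated class "items".
substring = tree.xpath("//div[contains(@class, 'item')]/text()")
print(substring)  # ['A', 'B', 'C']

# Pad the attribute with spaces and match a whole token: A and B only.
token_expr = "//div[contains(concat(' ', normalize-space(@class), ' '), ' item ')]/text()"
token = tree.xpath(token_expr)
print(token)  # ['A', 'B']
```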
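💡 Which one to choose:
• Simple parsing → BeautifulSoup (more Pythonic)
• Complex extraction → XPath (more expressive)
• Scrapy → its selectors support both XPath and CSS (CSS selectors are translated to XPath internally)
• Mixing the two: BeautifulSoup objects have no XPath support of their own, but BeautifulSoup can use lxml as its parser (BeautifulSoup(html, "lxml")), and lxml.html.soupparser can parse with BeautifulSoup into an lxml tree that you then query with XPath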
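9. Today's Summary
XPath Cheat Sheet
| Expression | Meaning |
|---|---|
| //tag | All tag elements at any depth |
| //tag[@attr='value'] | Elements whose attribute equals value |
| //tag/text() | Direct text nodes of the elements |
| //tag/@attr | Attribute values |
| //tag[n] | The n-th tag within its parent (1-based) |
| //tag[last()] | The last tag within its parent |
| //tag[contains(@class,'x')] | Elements whose class attribute contains the substring x |
| .. | Parent node |
| .// | Descendants of the current node |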
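🎯 Practice suggestions:
1. Use XPath to extract the headlines and links from a news site
2. Parse an e-commerce page and export product name, price, and rating to CSV
3. Compare the performance of BeautifulSoup and XPath on the same page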
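As a possible starting point for exercise 2, here is a rough sketch that writes XPath results to CSV. The product markup and column names are made up, and it writes to an in-memory buffer so it runs without touching the disk; a real page will need its own expressions.

```python
import csv
import io

from lxml import etree

# Made-up product markup; real pages will need their own XPath expressions.
html = """
<div class="item"><h3>Phone</h3><span class="price">¥2999</span></div>
<div class="item"><h3>Headphones</h3><span class="price">¥399</span></div>
"""
tree = etree.HTML(html)

# In-memory buffer; swap for open("products.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "price"])

for item in tree.xpath("//div[@class='item']"):
    # string(...) yields '' for a missing field instead of raising IndexError.
    name = item.xpath("string(./h3)").strip()
    price = item.xpath("string(./span[@class='price'])").strip().lstrip("¥")
    writer.writerow([name, price])

print(buffer.getvalue())
```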
📚 Day 43 complete! Tomorrow: a combined exercise, a Douban Movie Top 250 scraper