How Fun Is Hands-On Python Web Scraping? Learn It Once, Scrape Whatever You Want!
- What is a web crawler?
- Environment setup
- requests: making HTTP requests
- BeautifulSoup: parsing page content
- Hands-on example: scraping articles in bulk
- Common problems
- Summary
Imagine going to a library: flipping through books one by one is far too slow. A crawler is like sending a librarian in for you, quickly pulling out exactly the content you want. Typical uses:
- Bulk-download images and copy
- Track product prices for monitoring
- Collect news and articles
- Aggregate all kinds of public data
First, install the two libraries we will use:

```bash
pip install requests beautifulsoup4
```
- `requests`: sends HTTP requests, like the "librarian" who fetches the page for you
- `BeautifulSoup`: parses HTML, like "scissors" that cut out the content you want
```python
import requests

# Send a GET request
response = requests.get('https://www.baidu.com')

# Check the status code
print(response.status_code)  # 200 means success

# View the page content
print(response.text)
```
To send query parameters, pass them via `params`:

```python
# Search for a keyword
url = 'https://www.baidu.com/s'
params = {'wd': 'Python爬虫'}
response = requests.get(url, params=params)
print(response.text)
```
Some sites reject requests that don't look like they come from a browser, so set a `User-Agent` header:

```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
```
💡 Tip: adding a `User-Agent` makes the request look like it comes from a browser, so the site is less likely to flag it as a crawler.
Once you have the HTML, `BeautifulSoup` pulls out the pieces you want:

```python
from bs4 import BeautifulSoup

html = '''
<html>
<title>Test Page</title>
<body>
<h1>Welcome</h1>
<a href="http://example.com">Link</a>
</body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Extract the title
print(soup.title.text)

# Extract all links
for link in soup.find_all('a'):
    print(link.get('href'))
```
Common lookup methods:

```python
# Find by tag
soup.find('h1')       # first match only
soup.find_all('a')    # every match

# Find by class
soup.find_all('div', class_='content')

# Find by id
soup.find(id='main')

# Get the text content
soup.get_text()
```
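The lookup methods above can be tried end to end on a small inline page. The HTML below is made up purely for illustration:

```python
from bs4 import BeautifulSoup

# A tiny hypothetical page to exercise the selectors on
html = '''
<html><body>
<div id="main">
  <div class="content">First post</div>
  <div class="content">Second post</div>
  <a href="https://example.com/1">Link 1</a>
  <a href="https://example.com/2">Link 2</a>
</div>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div', class_='content')      # first match only
posts = soup.find_all('div', class_='content')  # every match
links = soup.find_all('a')
main = soup.find(id='main')

print(first.get_text())  # First post
print(len(posts))        # 2
print(len(links))        # 2
```

BeautifulSoup also supports CSS selectors via `soup.select('div.content')`, which returns the same two `<div>` elements here.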
Goal: grab article titles from the cnblogs front page, several pages at a time:

```python
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnblogs.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

# Scrape multiple pages
for page in range(1, 4):  # 3 pages
    url = f'{base_url}?page={page}'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the article titles
    titles = soup.find_all('a', class_='post-title')
    print(f'=== Page {page} ===')
    for title in titles:
        print(title.get_text().strip())

print('Done!')
```
To save the results to a file:

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

url = 'https://www.cnblogs.com/'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Save to a file
with open('titles.txt', 'w', encoding='utf-8') as f:
    titles = soup.find_all('a', class_='post-title')
    for title in titles:
        f.write(title.get_text().strip() + '\n')

print('Saved to titles.txt')
```
Q: What if the site blocks my requests?

A: Add browser-like headers, or slow down with `time.sleep(1)` between requests.
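One way to space requests out is a small client-side throttle, as in the sketch below. The 0.2-second interval is only to keep the demo fast; in practice a full second or more between requests is friendlier to the site:

```python
import time

def make_throttle(min_interval):
    """Return a wait() function that blocks until at least
    min_interval seconds have passed since the previous call."""
    last = [0.0]  # mutable cell so the closure can update it

    def wait():
        elapsed = time.monotonic() - last[0]
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
        last[0] = time.monotonic()

    return wait

wait = make_throttle(0.2)
start = time.monotonic()
for page in range(3):
    wait()  # in a real crawler, call this right before each requests.get(...)
elapsed = time.monotonic() - start
# The first call passes immediately; the next two each wait about 0.2 s
```

This keeps the delay logic in one place instead of scattering `time.sleep` calls through the scraping loop.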
Q: The scraped text comes out garbled.

A: Try `response.encoding = 'utf-8'`.
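The root cause is usually that the bytes from the server were decoded with the wrong codec. A quick illustration with plain bytes, no network needed:

```python
# The server sends raw bytes; what you see depends on the codec used to decode them
raw = '你好, 世界'.encode('utf-8')

garbled = raw.decode('latin-1')  # wrong guess -> mojibake
fixed = raw.decode('utf-8')      # correct codec -> readable text

print(garbled)
print(fixed)  # 你好, 世界
```

With `requests`, `response.text` is decoded using `response.encoding`, which is guessed from the HTTP headers. Setting `response.encoding = 'utf-8'` (or `response.encoding = response.apparent_encoding`) before reading `response.text` applies the right codec.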
The most important part of learning is hands-on practice. Start by scraping simple pages, and work up to more complex sites as you get comfortable.