
Python Beginner's Journey · Lesson 12 | Web Scraping and Web Automation (Part 2)

2026-07-01 04:02:18

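Hi everyone, I'm Xingyuan, a 19-year-old programming beginner teaching myself Python 🤓. This is my study-notes series: I write up what I learn each day and share it with others on the same road, hoping we can all improve together 🚀.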

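📌 Today's Topics

👉 "Today: advanced web scraping applications and hands-on projects"

✨ Key Concepts

Hands-on project: building a sitemap

Concept: Starting from one page, follow every link that stays on the same site and record each page visited. The result is a complete sitemap, useful for SEO work or content audits.

Code example: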

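import requests, bs4

def build_sitemap(url):
    visited_urls = set()   # pages fetched successfully
    failed_urls = set()    # pages that raised an error, so they are not retried
    urls_to_visit = [url]
    while urls_to_visit:
        current_url = urls_to_visit.pop(0)  # FIFO queue gives breadth-first order
        if current_url in visited_urls or current_url in failed_urls:
            continue
        try:
            res = requests.get(current_url, timeout=10)
            res.raise_for_status()
            visited_urls.add(current_url)
            print(f'Crawling: {current_url}')
            soup = bs4.BeautifulSoup(res.text, 'html.parser')
            for link in soup.find_all('a'):
                href = link.get('href')
                if href:
                    # Resolve relative links against the current page
                    abs_url = requests.compat.urljoin(current_url, href)
                    # Stay on the same site and skip pages already crawled
                    if abs_url.startswith(url) and abs_url not in visited_urls:
                        urls_to_visit.append(abs_url)
        except Exception as e:
            failed_urls.add(current_url)
            print(f'Failed to crawl {current_url}: {e}')
    # Save the results to a file
    with open('sitemap.txt', 'w', encoding='utf-8') as f:
        for page_url in sorted(visited_urls):
            f.write(page_url + '\n')

# Usage example
build_sitemap('https://example.com')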

Tip: When crawling a site, respect the crawl scope defined in its robots.txt file and avoid putting excessive load on the server.
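
To make the robots.txt tip concrete, here is a minimal sketch using the standard library's urllib.robotparser. The rules in SAMPLE_ROBOTS and the helper name make_robot_checker are made up for illustration; a real crawler would load the site's actual file (for example with set_url() followed by read()) instead of a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

def make_robot_checker(robots_txt, user_agent="*"):
    """Return a function that tells whether a URL may be fetched."""
    parser = RobotFileParser()
    # parse() accepts the lines of a robots.txt file, so no network is needed here
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed

# Hypothetical rules, only for this example
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = make_robot_checker(SAMPLE_ROBOTS)
print(allowed("https://example.com/blog/post"))
print(allowed("https://example.com/private/data"))
```

In a crawler like the one above, a check such as allowed(abs_url) could be placed just before a link is added to the queue.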

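Hands-on project: price-monitoring scraper

Concept: Write a scraper that watches a product's price on an e-commerce site and sends a notification when the price drops to or below a set threshold.

Code example: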

import requests, bs4, smtplib
from email.mime.text import MIMEText  # needed to build the email body

def check_price(url, target_price, notify_email):
    try:
        res = requests.get(url, timeout=10)
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        # Extract the price (adjust the selector to the actual page structure)
        price_elem = soup.select_one('.price')
        if price_elem:
            price_text = price_elem.text.strip()
            price = float(price_text.replace('$', ''))
            if price <= target_price:
                # Send the notification email
                msg = MIMEText(f'The price of the product is now ${price}!')
                msg['Subject'] = 'Price Drop Alert'
                msg['From'] = 'your_email@gmail.com'
                msg['To'] = notify_email
                with smtplib.SMTP('smtp.gmail.com', 587) as smtp_obj:
                    smtp_obj.ehlo()
                    smtp_obj.starttls()  # upgrade to an encrypted connection before login
                    smtp_obj.login('your_email@gmail.com', 'your_password')
                    smtp_obj.sendmail('your_email@gmail.com', notify_email, msg.as_string())
    except Exception as e:
        print(f'Error checking price: {e}')

# Usage example
check_price('https://example.com/product', 100.0, 'notify@example.com')
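
The price-extraction step above only strips the '$' sign, so text such as "$1,299.00" or "Now: $89.5" would make float() raise an error. Here is a slightly more tolerant sketch. The helper name parse_price is my own, and it assumes a comma is a thousands separator and a dot is the decimal point, which is not true for every site.

```python
import re

# One digit, then digits or commas, then an optional decimal part
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_price(text):
    """Return the first number found in a price string as a float."""
    match = _PRICE_RE.search(text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    # Assumption: commas are thousands separators, so they can be dropped
    return float(match.group().replace(",", ""))

print(parse_price("$1,299.00"))
print(parse_price("Now: $89.5"))
```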

Tip: E-commerce sites usually restrict crawler access, so leave a reasonable interval between requests to avoid getting blocked.
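
One way to keep a "reasonable interval" is a small throttle object that is called before every request. This is only a sketch: the class name Throttle and the 2-second interval are my own choices, and the clock and sleep parameters exist just so the class can be tried out without really waiting.

```python
import time

class Throttle:
    """Make sure at least min_interval seconds pass between requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        """Sleep only as long as needed; return the number of seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay

# Usage idea (2 seconds is an arbitrary example value):
# throttle = Throttle(2.0)
# throttle.wait()            # call this right before each requests.get(...)
```

Because the throttle measures the time since the previous request, a slow page load already counts toward the interval, so the scraper does not wait longer than it has to.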

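Hands-on project: collecting social media data

Concept: Collect public data from social media platforms (such as Weibo or forum posts) for public-opinion analysis.

Code example: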

from selenium import webdriver
from bs4 import BeautifulSoup
import time

def scrape_social_media(url, scroll_times, output_file):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Simulate scrolling so the page loads more posts
        for _ in range(scroll_times):
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            time.sleep(2)  # give the newly loaded posts time to render
        soup = BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        driver.quit()  # always close the browser, even if something fails
    # Extract post information (adjust the selectors to the actual page structure)
    posts = soup.find_all('div', class_='post-container')
    with open(output_file, 'w', encoding='utf-8') as f:
        for post in posts:
            text = post.find('div', class_='post-text')
            author = post.find('span', class_='author-name')
            if text and author:
                f.write(f'Author: {author.text.strip()}\n')
                f.write(f'Content: {text.text.strip()}\n\n')

# Usage example
scrape_social_media('https://example.com/social-media', 5, 'social_media_posts.txt')

Tip: When collecting social media data, always comply with the platform's terms of service and with applicable laws and regulations.

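Hands-on project: building an automated form-filling tool

Concept: Use selenium to fill in online forms automatically and save time on repetitive work.

Code example: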

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com/form')
try:
    # Locate and fill in the form fields
    # (the find_element_by_* helpers were removed in Selenium 4; use find_element with By)
    name_input = driver.find_element(By.NAME, 'name')
    email_input = driver.find_element(By.NAME, 'email')
    message_input = driver.find_element(By.NAME, 'message')
    name_input.send_keys('Your Name')
    email_input.send_keys('your_email@example.com')
    message_input.send_keys('This is an automated message.')
    # Submit the form
    submit_button = driver.find_element(By.CLASS_NAME, 'submit-btn')
    submit_button.click()
finally:
    driver.quit()

Tip: In real applications, you can keep the form data in a configuration file or a database to make the automation more flexible.
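
As a rough sketch of the configuration-file idea, the form values could live in a JSON document and be validated before the browser is even started. The helper name load_form_data and the JSON layout are assumptions of mine; the field names name, email and message are the ones used in the form example above.

```python
import json

def load_form_data(config_text, required=("name", "email", "message")):
    """Parse a JSON config and check that every required field is present."""
    data = json.loads(config_text)
    missing = [field for field in required if field not in data]
    if missing:
        raise KeyError(f"missing form fields: {missing}")
    # Keep only the fields the form needs, as strings ready for send_keys()
    return {field: str(data[field]) for field in required}

# Hypothetical config; in practice this text would be read from a file
SAMPLE_CONFIG = '{"name": "Your Name", "email": "your_email@example.com", "message": "Hello"}'

form_data = load_form_data(SAMPLE_CONFIG)
print(form_data)

# With selenium, each entry could then be typed into the matching field:
# for field, value in form_data.items():
#     driver.find_element(By.NAME, field).send_keys(value)
```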

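Exercises

The difference between WebDriverWait and time.sleep: WebDriverWait keeps checking a condition (for example, that an element has appeared) and continues as soon as it is met, up to a timeout. time.sleep always blocks for the full fixed duration, which wastes time when the page loads quickly and still fails when it loads slowly.
How to deal with anti-scraping mechanisms: lower the risk of being detected by setting sensible request headers, using proxy IPs, and simulating normal user behavior.
Write a scraper that monitors the price of a specific product: combine requests and BeautifulSoup to extract the price, and use the email modules to send a notification.
How to handle drop-down select boxes in the form-filling tool: use selenium's Select class to operate on them.
How to store scraped data efficiently in a real project: use a database (such as MySQL or MongoDB) or a data-analysis library (such as Pandas) for structured storage.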
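
For the first exercise, the idea behind an explicit wait can be sketched in plain Python: keep polling a condition and return the moment it holds, instead of sleeping a fixed time. This is not selenium's actual implementation, only an illustration of the technique. The function name wait_until is my own, and the clock and sleep parameters are there so the function can be tried without real waiting.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop waiting as soon as the condition holds
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(poll_interval)

# The selenium counterpart looks roughly like this:
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.NAME, 'name')))
```

If the condition becomes true after one second, the polling version returns after about one second, while time.sleep(10) would still block for the full ten.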

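✅ Summary

Building a sitemap: crawl all of a site's links breadth-first to generate a complete sitemap.
Price-monitoring scraper: combine HTTP requests with email notifications to monitor prices.
Social media data collection: handle dynamically loaded pages and parse the result to collect public social data.
Automated form filling: operate on form elements through selenium to fill in and submit forms automatically.
Data storage and management: choose a suitable storage method so the scraped data can actually be put to use.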
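
As a small illustration of the storage point, here is a sketch that saves scraped posts into a database. To keep it self-contained it uses Python's built-in sqlite3 module instead of the MySQL or MongoDB mentioned in the exercises. The table name posts, its columns, and the sample rows are all made up for this example.

```python
import sqlite3

def save_posts(conn, posts):
    """Insert (author, content) rows, skipping exact duplicates; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "author TEXT NOT NULL, content TEXT NOT NULL, "
        "UNIQUE(author, content))"
    )
    with conn:  # commits on success, rolls back if an error occurs
        conn.executemany(
            "INSERT OR IGNORE INTO posts (author, content) VALUES (?, ?)",
            posts,
        )
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]

# An in-memory database keeps the example from touching the disk
conn = sqlite3.connect(":memory:")
count = save_posts(conn, [("alice", "hello"), ("bob", "hi"), ("alice", "hello")])
print(count)
```

The UNIQUE constraint together with INSERT OR IGNORE means that running the scraper twice does not store the same post twice.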

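📢 Discussion Question

In your own projects, have you tried analyzing the data you scraped any further, for example by visualizing it or generating reports? How did you do it?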
