Python Beginner's Journey · Lesson 12 | Web Scraping and Web Automation (Part 2)

2026-07-02 11:55:57

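Hi everyone, I'm Xingyuan, a 19-year-old programming beginner teaching myself Python 🤓. This is my series of study notes: I write up what I learn each day and share it with others on the same path, in the hope that we can improve together 🚀.

📌 Today's Topics

👉 "Today we go deeper into web scraping and web automation and pick up some more advanced techniques"

✨ Key Concepts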

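Handling JavaScript-Rendered Pages

Concept: Many modern web pages load their content dynamically with JavaScript. requests only downloads the raw HTML and BeautifulSoup only parses it, so neither sees content that scripts add afterwards. For such pages, use the selenium module to launch a real browser, wait for the JavaScript to finish, and then read the page content.

Code example: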

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')
try:
    # Wait up to 10 seconds for the target element to be present in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "target_element"))
    )
    print('Page title:', driver.title)
    print('Element text:', element.text)
finally:
    # Always close the browser, even if the wait times out
    driver.quit()

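Tip: Combining WebDriverWait with expected_conditions is an effective way to deal with page load delays: the script waits only as long as it needs to, up to the timeout, instead of sleeping for a fixed time.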
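To make the idea of an explicit wait more concrete, here is a minimal sketch of the polling loop behind it, written in plain Python so it runs without a browser. This is not Selenium's implementation; `wait_until`, its parameters, and the fake `element_appears` condition are hypothetical names made up for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # the condition was met, hand back its value
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)  # pause briefly before checking again


# A fake condition that only succeeds on the third check,
# standing in for "the element has appeared on the page".
attempts = {'count': 0}


def element_appears():
    attempts['count'] += 1
    return 'target_element' if attempts['count'] >= 3 else None


found = wait_until(element_appears, timeout=2.0, poll_interval=0.01)
print(found)  # prints: target_element
```

The loop returns as soon as the condition holds, which is why this style of waiting usually finishes sooner than a fixed time.sleep() and still fails loudly when the element never shows up.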

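Handling Logins and Form Submission

Concept: Many sites require you to log in before you can see their content. With selenium you can simulate the user typing into a form and submitting it.

Code example: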

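from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get('https://accounts.google.com/signin')

    # Locate the email field and type the address.
    # find_element_by_id() was removed in Selenium 4.3; use find_element(By.ID, ...)
    email_input = driver.find_element(By.ID, 'identifierId')
    email_input.send_keys('your_email@gmail.com')
    email_input.submit()

    # Wait for the password field that appears in the next step
    password_input = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.NAME, 'password'))
    )
    password_input.send_keys('your_password')
    password_input.submit()
finally:
    driver.quit()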

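Tip: When automating a login flow, do not hardcode sensitive information such as passwords in your code. Store it in environment variables or a configuration file instead.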
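As a rough sketch of that tip, the snippet below reads the credentials from environment variables and fails early when one is missing. The variable names SCRAPER_EMAIL and SCRAPER_PASSWORD and the helper `load_credentials` are assumptions chosen for this example, not part of any library.

```python
import os


def load_credentials(environ=os.environ):
    """Read login credentials from environment variables (hypothetical names)."""
    email = environ.get('SCRAPER_EMAIL')
    password = environ.get('SCRAPER_PASSWORD')
    # Collect the names of any variables that are unset or empty
    missing = [
        name
        for name, value in (('SCRAPER_EMAIL', email), ('SCRAPER_PASSWORD', password))
        if not value
    ]
    if missing:
        raise RuntimeError('missing environment variables: ' + ', '.join(missing))
    return email, password


# Demo with an explicit mapping so the example runs anywhere;
# in a real script you would call load_credentials() with no arguments.
email, password = load_credentials(
    {'SCRAPER_EMAIL': 'me@example.com', 'SCRAPER_PASSWORD': 's3cret'}
)
print(email)  # prints: me@example.com
```

The returned values can then be passed to send_keys() in place of the string literals, so the password never appears in the source file.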

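Scraping Dynamic Data: Twitter as an Example

Concept: Scraping dynamically loaded content, such as an infinitely scrolling social media page, requires selenium to simulate the user scrolling so that the page keeps loading more content.

Code example: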

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://twitter.com/search?q=python')

# Simulate the user scrolling down
for _ in range(3):  # scroll 3 times
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # wait for new content to load

# Grab the rendered page and parse it
soup = BeautifulSoup(driver.page_source, 'html.parser')
# These class names are illustrative and may not match the site's current markup
tweets = soup.find_all('div', class_='tweet')
for tweet in tweets:
    text = tweet.find('div', class_='tweet-text')
    if text:
        print(text.text.strip())

driver.quit()

Tip: When scraping dynamic pages, avoid sending requests too frequently, so that you do not put undue load on the target site's servers.
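A fixed number of scrolls can stop too early or keep going after the page has run out of content. One possible alternative, sketched below, is to scroll until the page height stops growing. `scroll_until_stable` is a hypothetical helper; it relies only on the driver's execute_script() method, and the `FakeDriver` class is a made-up stand-in so the example runs without a browser.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    loaded_rounds = 0
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give the page time to fetch and render new items
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # nothing new was loaded, so we have reached the end
        last_height = new_height
        loaded_rounds += 1
    return loaded_rounds


class FakeDriver:
    """Stand-in for a Selenium driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith('return'):
            return self.heights[self.index]  # report the current page height
        # Any other script is treated as a scroll: move to the next height
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None


rounds = scroll_until_stable(FakeDriver(), pause=0)
print(rounds)  # prints: 2
```

With a real driver, keeping `pause` at a second or more also spaces out the requests the page makes, and `max_rounds` puts an upper bound on how long the loop can run.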

Hands-On Project: Batch File Download

Concept: Write a program that downloads every file linked from a given web page, such as PDFs or images.

Code example:

import os

import bs4
import requests


def download_files(url, file_type, download_folder):
    # Fetch the page
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Create the download folder
    os.makedirs(download_folder, exist_ok=True)

    # Find every link whose href ends with the given file type
    link_elems = soup.select(f'a[href$="{file_type}"]')
    for link in link_elems:
        file_url = link.get('href')
        if not file_url.startswith('http'):
            file_url = requests.compat.urljoin(url, file_url)  # resolve relative paths

        # Download the file
        print(f'Downloading {file_url}...')
        file_res = requests.get(file_url)
        file_res.raise_for_status()

        # Save the file in chunks of 100,000 bytes
        file_name = os.path.basename(file_url)
        with open(os.path.join(download_folder, file_name), 'wb') as f:
            for chunk in file_res.iter_content(100000):
                f.write(chunk)


# Example usage
download_files('https://example.com/resources', '.pdf', 'downloaded_pdfs')

Tip: Use requests.compat.urljoin() to convert relative links into absolute ones.
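To show what urljoin() does with different kinds of links, here is a small sketch using urllib.parse.urljoin from the standard library, which, as far as I know, is the function requests.compat exposes on Python 3. The helper `resolve_links` and the sample URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Turn every href found on a page into an absolute URL."""
    return [urljoin(page_url, href) for href in hrefs]


page = 'https://example.com/resources/index.html'
links = resolve_links(page, [
    'files/a.pdf',                    # relative to the page's folder
    '/files/a.pdf',                   # relative to the site root
    '//cdn.example.com/a.pdf',        # protocol-relative
    'https://cdn.example.com/a.pdf',  # already absolute
])
for link in links:
    print(link)
# https://example.com/resources/files/a.pdf
# https://example.com/files/a.pdf
# https://cdn.example.com/a.pdf
# https://cdn.example.com/a.pdf

# Without a trailing slash, the last path segment of the base URL is replaced
print(urljoin('https://example.com/resources', 'a.pdf'))   # https://example.com/a.pdf
print(urljoin('https://example.com/resources/', 'a.pdf'))  # https://example.com/resources/a.pdf
```

Because urljoin() returns an already absolute URL unchanged, it can be applied to every link without first checking whether the link starts with 'http'.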

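Hands-On Project: Monitoring a Web Page for Updates

Concept: Write a program that periodically checks whether a given web page has changed and sends a notification when it detects an update.

Code example: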

import smtplib
import time
from email.mime.text import MIMEText

import requests


def check_website_update(url, check_interval, notify_email):
    last_content = None
    while True:
        # Fetch the page
        res = requests.get(url)
        res.raise_for_status()
        current_content = res.text

        # Notify only on a real change; the first fetch just records a baseline
        if last_content is not None and last_content != current_content:
            # Send the notification email
            msg = MIMEText(f'The website {url} has been updated!')
            msg['Subject'] = 'Website Update Notification'
            msg['From'] = 'your_email@gmail.com'
            msg['To'] = notify_email
            with smtplib.SMTP('smtp.gmail.com', 587) as smtp_obj:
                smtp_obj.ehlo()
                smtp_obj.starttls()
                smtp_obj.login('your_email@gmail.com', 'your_password')
                smtp_obj.sendmail('your_email@gmail.com', notify_email, msg.as_string())
        last_content = current_content

        time.sleep(check_interval)


# Example usage
check_website_update('https://example.com', 3600, 'notify@example.com')

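Tip: In practice, you can use a task scheduler such as cron on Linux or Task Scheduler on Windows to run the monitoring script at regular intervals. In that setup the script performs a single check per run, so it has to save what it saw last time somewhere that survives between runs.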
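When a scheduler starts the script fresh each time, the previous page content is no longer in memory. One possible approach, sketched here, is to save a SHA-256 digest of the page to a small file and compare against it on the next run. `fingerprint`, `check_once`, and the state file name are hypothetical, and the page text is passed in as a string so the example runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(text):
    """Return a SHA-256 hex digest of the page text."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


def check_once(current_text, state_file):
    """Compare the page against the digest saved by the previous run.

    Returns True only when an earlier digest exists and differs.
    """
    state_path = Path(state_file)
    previous = None
    if state_path.exists():
        previous = state_path.read_text(encoding='utf-8').strip()
    current = fingerprint(current_text)
    state_path.write_text(current, encoding='utf-8')  # remember for the next run
    return previous is not None and previous != current


# Demo: three "runs" against a temporary state file
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / 'last_digest.txt'
    results = [
        check_once('<p>version 1</p>', state),  # first run, only records a baseline
        check_once('<p>version 1</p>', state),  # same content, no change
        check_once('<p>version 2</p>', state),  # content differs, change detected
    ]
print(results)  # prints: [False, False, True]
```

Storing a digest rather than the whole page keeps the state file tiny, at the cost of not being able to show what actually changed.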

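✅ Summary

Dynamic pages: use selenium with explicit waits to scrape pages rendered by JavaScript.
Logins and form submission: use selenium to simulate user input and complete the login flow.
Dynamic data: simulate user scrolling to scrape pages that load content through infinite scroll.
Batch file download: locate links to files of a given type, then download and save them.
Page update monitoring: check a page for changes at regular intervals and send an email notification when it updates.

📢 Question for You

What challenges have you run into when scraping dynamically loaded pages or automating a login flow, and how did you solve them?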
