Skills: Scrapling, a Python web scraping framework with anti-bot bypass, JS rendering, and scraping of Cloudflare-protected sites (GitHub Stars 139K+)

2026-06-30 16:30:35

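Scrapling is a web scraping framework that combines anti-bot bypass, stealth browser automation, and a spider (crawler) framework. It provides three fetching strategies (HTTP, dynamic JS, and stealth/Cloudflare) and a full CLI.

This skill is for educational and research purposes only. Users must comply with local and international data scraping laws and respect each website's terms of service.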

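When to use

Scraping static HTML pages (faster than browser tools)
Scraping JS pages that need real browser rendering
Bypassing Cloudflare Turnstile or bot detection
Crawling multiple pages with a spider
When the built-in web_extract tool cannot return the required data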

Installation

pip install "scrapling[all]"
scrapling install

Minimal install (HTTP only, no browser):

pip install scrapling

Install only the browser automation features:

pip install "scrapling[fetchers]"
scrapling install

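Quick reference:

Approach	Class	When to use
HTTP	Fetcher/FetcherSession	Static pages, APIs, fast bulk requests
Dynamic	DynamicFetcher/DynamicSession	JS-rendered content, SPAs
Stealth	StealthyFetcher/StealthySession	Cloudflare, sites protected by anti-bot systems
Spider	Spider	Multi-page crawls with link following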
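The quick-reference table can be read as a small decision rule. The sketch below encodes it as a plain function; `choose_fetcher` and its three boolean parameters are hypothetical names introduced here for illustration, and the priority order (spider first, then stealth, then dynamic) is an assumption rather than anything Scrapling prescribes.

```python
def choose_fetcher(needs_js: bool = False, anti_bot: bool = False,
                   multi_page: bool = False) -> str:
    """Return the class name from the quick-reference table that fits the job."""
    if multi_page:
        return "Spider"            # multi-page crawl with link following
    if anti_bot:
        return "StealthyFetcher"   # Cloudflare or other anti-bot protection
    if needs_js:
        return "DynamicFetcher"    # JS-rendered content, SPAs
    return "Fetcher"               # static pages, APIs, bulk requests


print(choose_fetcher(needs_js=True))
```

Starting from the cheapest option and moving up only when a page demands it keeps most requests on the plain HTTP path.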

CLI usage

Extract a static page

scrapling extract get 'https://example.com' output.md

With a CSS selector and browser impersonation:

scrapling extract get 'https://example.com' output.md \
  --css-selector '.content' \
  --impersonate 'chrome'

Extract a JS-rendered page

scrapling extract fetch 'https://example.com' output.md \
  --css-selector '.dynamic-content' \
  --disable-resources \
  --network-idle

Extract a Cloudflare-protected page

scrapling extract stealthy-fetch 'https://protected-site.com' output.html \
  --solve-cloudflare \
  --block-webrtc \
  --hide-canvas

POST request

scrapling extract post 'https://example.com/api' output.json \
  --json '{"query": "search term"}'
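When these commands are launched from a script, building the argument list in Python avoids shell-quoting mistakes, especially for the JSON payload. The following is a minimal sketch: `build_extract_command` is a hypothetical helper, and it only uses the subcommands and flags shown above.

```python
import json
import shlex


def build_extract_command(mode, url, output, flags=(), options=None):
    """Build the argv list for `scrapling extract <mode> <url> <output> ...`."""
    argv = ["scrapling", "extract", mode, url, output]
    for name, value in (options or {}).items():
        argv += [f"--{name}", value]             # options that take a value
    argv += [f"--{flag}" for flag in flags]      # boolean switches
    return argv


post_cmd = build_extract_command(
    "post",
    "https://example.com/api",
    "output.json",
    options={"json": json.dumps({"query": "search term"})},
)
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(post_cmd))
```

The resulting list can be handed to `subprocess.run(post_cmd, check=True)` without going through a shell.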

Output formats

The output format is determined by the file extension:

.html: raw HTML
.md: converted to Markdown
.txt: plain text
.json / .jsonl: JSON
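Because the extension alone selects the format, a typo in the output filename changes what gets written. A small pre-flight check, sketched here with a hypothetical `output_format_for` helper that mirrors the list above, can catch that before a long scrape starts.

```python
from pathlib import Path

# Mapping taken from the extension list above.
OUTPUT_FORMATS = {
    ".html": "raw HTML",
    ".md": "Markdown",
    ".txt": "plain text",
    ".json": "JSON",
    ".jsonl": "JSON",
}


def output_format_for(path: str) -> str:
    """Return the format implied by the file extension, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return OUTPUT_FORMATS[suffix]


print(output_format_for("output.md"))
```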

Python: HTTP scraping

Single request

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
for q in quotes:
    print(q)

Sessions (persistent cookies)

from urllib.parse import urljoin
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate='chrome') as session:
    page = session.get('https://example.com/', stealthy_headers=True)
    links = page.css('a::attr(href)').getall()
    for link in links[:5]:
        # hrefs may be relative, so resolve them against the page URL first
        sub = session.get(urljoin('https://example.com/', link))
        print(sub.css('h1::text').get())
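Values pulled from `a::attr(href)` can be relative paths, fragment-only anchors, `mailto:` links, or links to other hosts. The sketch below uses only the standard library to turn such a list into unique, same-host URLs before following them; `crawlable_links` is a hypothetical helper, not part of Scrapling.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def crawlable_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against base_url; keep unique same-host http(s) URLs in order."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Make the link absolute and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https") or parts.netloc != base_host:
            continue  # skip mailto:, javascript:, and off-site links
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


hrefs = ["/a", "a#top", "docs/", "mailto:x@y.z", "https://other.com/b"]
print(crawlable_links("https://example.com/", hrefs))
```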

POST / PUT / DELETE

from scrapling.fetchers import Fetcher

page = Fetcher.post('https://api.example.com/data', json={"key": "value"})
page = Fetcher.put('https://api.example.com/item/1', data={"name": "updated"})
page = Fetcher.delete('https://api.example.com/item/1')

Using a proxy

page = Fetcher.get('https://example.com', proxy='http://user:pass@proxy:8080')
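Proxy credentials that contain characters such as `@` or `:` break a URL of the form `http://user:pass@proxy:8080` unless they are percent-encoded. A minimal sketch of building the proxy string safely follows; `build_proxy_url` is a hypothetical helper based only on the standard library.

```python
from urllib.parse import quote


def build_proxy_url(host: str, port: int, user: str = "", password: str = "",
                    scheme: str = "http") -> str:
    """Return scheme://user:pass@host:port with credentials percent-encoded."""
    if user:
        credentials = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    else:
        credentials = ""
    return f"{scheme}://{credentials}{host}:{port}"


print(build_proxy_url("proxy", 8080, "user", "p@ss:1"))
```

The returned string can be passed as the `proxy` argument shown above.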

Python: dynamic pages (JS rendering)

For pages that need JavaScript execution (SPAs, lazy-loaded content):

from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch('https://example.com', headless=True)
data = page.css('.js-loaded-content::text').getall()

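Waiting for a specific element

# wait_selector takes the CSS selector; wait_selector_state sets the state to wait for
page = DynamicFetcher.fetch(
    'https://example.com',
    wait_selector='.results',
    wait_selector_state='visible',
    network_idle=True,
)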

Disabling resources for speed

Block fonts, images, media, and stylesheets (about 25% faster):

from scrapling.fetchers import DynamicSession

with DynamicSession(headless=True, disable_resources=True, network_idle=True) as session:
    page = session.fetch('https://example.com')
    items = page.css('.item::text').getall()

Custom page automation

from playwright.sync_api import Page
from scrapling.fetchers import DynamicFetcher

def scroll_and_click(page: Page):
    page.mouse.wheel(0, 3000)                 # scroll down to trigger lazy loading
    page.wait_for_timeout(1000)               # give the page one second to react
    page.click('button.load-more')
    page.wait_for_selector('.extra-results')  # wait for the new results to appear

page = DynamicFetcher.fetch('https://example.com', page_action=scroll_and_click)
results = page.css('.extra-results .item::text').getall()

Python: stealth mode (anti-bot bypass)

For sites protected by Cloudflare or strict fingerprinting:

from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    'https://protected-site.com',
    headless=True,
    solve_cloudflare=True,
    block_webrtc=True,
    hide_canvas=True,
)
content = page.css('.protected-content::text').getall()

Stealth sessions

from scrapling.fetchers import StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:
    page1 = session.fetch('https://protected-site.com/page1')
    page2 = session.fetch('https://protected-site.com/page2')
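Since stealth mode drives a real browser, one common pattern is to try the cheaper strategies first and escalate only when they fail. The sketch below shows that pattern with injected callables so it runs without a network; `fetch_with_escalation` and the stand-in fetch functions are hypothetical, and in practice each callable would wrap one of the fetchers described in this document.

```python
from typing import Callable, Optional, Sequence, Tuple

Fetch = Callable[[str], Optional[str]]


def fetch_with_escalation(url: str,
                          strategies: Sequence[Tuple[str, Fetch]]) -> Tuple[str, str]:
    """Try each (name, fetch) pair in order; return the first non-empty result."""
    errors = []
    for name, fetch in strategies:
        try:
            body = fetch(url)
        except Exception as exc:  # a real caller would catch narrower errors
            errors.append(f"{name}: {exc}")
            continue
        if body:
            return name, body
        errors.append(f"{name}: empty response")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


# Stand-ins for the three strategies, cheapest first.
def plain_http(url):
    raise ConnectionError("blocked")

def dynamic(url):
    return ""  # page rendered, but the content was missing

def stealth(url):
    return "<html>ok</html>"

print(fetch_with_escalation("https://protected-site.com",
                            [("http", plain_http), ("dynamic", dynamic), ("stealth", stealth)]))
```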

Element selection

All fetchers return a Selector object with the following methods:

CSS selectors

page.css('h1::text').get()              # text of the first h1
page.css('a::attr(href)').getall()      # href of every link
page.css('.quote .text::text').getall() # nested selection

XPath

page.xpath('//div[@class="content"]/text()').getall()
page.xpath('//a/@href').getall()

Find methods

page.find_all('div', class_='quote')       # by tag + attribute
page.find_by_text('Read more', tag='a')    # by text content
page.find_by_regex(r'\$\d+\.\d{2}')        # by regular expression pattern
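The pattern passed to `find_by_regex` above matches a dollar sign, one or more digits, a dot, and exactly two decimal digits. Its behavior can be checked in isolation with the standard `re` module; the sample string here is made up for illustration.

```python
import re

PRICE = re.compile(r'\$\d+\.\d{2}')

sample = "Widget $19.99, gadget $5.00, coupon $3 off, total $1250.5"
# "$3" has no decimals and "$1250.5" has only one, so neither matches.
print(PRICE.findall(sample))  # ['$19.99', '$5.00']
```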

Similar elements

Find structurally similar elements (useful for product listings and the like):

first_product = page.css('.product')[0]
all_similar = first_product.find_similar()

Python: spider framework

For multi-page crawls with link following:

from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10
    download_delay = 1

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
                "tags": quote.css('.tag::text').getall(),
            }
        # follow the pagination link, if there is one
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
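The spider above sets `concurrent_requests` and `download_delay`. As an illustration of what settings like these usually mean, here is a minimal asyncio sketch that caps the number of in-flight requests with a semaphore and pauses after each one. It is not Scrapling's implementation; `crawl` and the fake fetch function are hypothetical.

```python
import asyncio


async def crawl(urls, fetch, concurrent_requests=10, download_delay=0.0):
    """Fetch every URL with at most `concurrent_requests` in flight at once."""
    semaphore = asyncio.Semaphore(concurrent_requests)

    async def worker(url):
        async with semaphore:
            result = await fetch(url)
            # Hold the slot for the delay so requests are spaced out.
            await asyncio.sleep(download_delay)
            return result

    # gather preserves the order of the input URLs.
    return await asyncio.gather(*(worker(url) for url in urls))


async def fake_fetch(url):
    await asyncio.sleep(0.01)  # stand-in for network latency
    return f"fetched {url}"


print(asyncio.run(crawl(["a", "b", "c"], fake_fetch, concurrent_requests=2)))
```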

Multi-session spiders

Route requests to different fetcher types:

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class SmartSpider(Spider):
    name = "smart"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        # lazy=True: the stealth browser session is registered as lazy
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)
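The `"protected" in link` check above matches the substring anywhere in the URL, including the path. A host-based rule is often more predictable. The sketch below returns the session id for a URL using only the standard library; `session_id_for` and the `STEALTH_HOSTS` set are hypothetical names chosen for this example.

```python
from urllib.parse import urlparse

# Hypothetical hosts known to sit behind anti-bot protection.
STEALTH_HOSTS = {"protected-site.com"}


def session_id_for(url: str, stealth_hosts=STEALTH_HOSTS) -> str:
    """Return "stealth" for listed hosts (or their subdomains), else "fast"."""
    host = (urlparse(url).hostname or "").lower()
    for protected in stealth_hosts:
        if host == protected or host.endswith("." + protected):
            return "stealth"
    return "fast"


print(session_id_for("https://www.protected-site.com/page1"))
```

Inside `parse`, the returned value could be passed as the `sid` argument of `Request`.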

Pausing and resuming a crawl

spider = QuotesSpider(crawldir="./crawl_checkpoint")
spider.start()  # press Ctrl+C to pause; run again to resume from the checkpoint
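To make the checkpoint idea concrete, the sketch below persists the pending and finished URLs to a JSON file so that a restarted process picks up where the last one stopped. It illustrates the general technique only; the `Checkpoint` class and the `state.json` filename are hypothetical and do not describe the files Scrapling writes into `crawldir`.

```python
import json
from pathlib import Path


class Checkpoint:
    """Persist pending and finished URLs so an interrupted crawl can resume."""

    def __init__(self, directory: str):
        self.path = Path(directory) / "state.json"
        self.pending = []
        self.done = set()
        if self.path.exists():  # resume from an earlier run
            state = json.loads(self.path.read_text(encoding="utf-8"))
            self.pending = state["pending"]
            self.done = set(state["done"])

    def add(self, url: str) -> None:
        """Queue a URL unless it is already queued or finished."""
        if url not in self.done and url not in self.pending:
            self.pending.append(url)
            self.save()

    def mark_done(self, url: str) -> None:
        self.pending.remove(url)
        self.done.add(url)
        self.save()

    def save(self) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        state = {"pending": self.pending, "done": sorted(self.done)}
        self.path.write_text(json.dumps(state), encoding="utf-8")
```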

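Notes

Browser installation required: run scrapling install after pip install; otherwise DynamicFetcher and StealthyFetcher will fail
Timeouts: DynamicFetcher/StealthyFetcher timeouts are in milliseconds (default 30000); Fetcher timeouts are in seconds
Cloudflare bypass: solve_cloudflare=True adds 5-15 seconds to the fetch time; enable it only when needed
Resource usage: StealthyFetcher runs a real browser; limit concurrent use
Legal: always check robots.txt and the site's terms of service before scraping. This library is for educational and research purposes only
Python version: requires Python 3.10+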
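The timeout note is an easy place to slip, since the same number means very different things to the two kinds of fetcher. A small conversion helper, sketched below, keeps the unit explicit; `browser_timeout_ms` is a hypothetical name, and the assumption that the keyword is called `timeout` on both fetcher types should be checked against the library's documentation.

```python
DEFAULT_BROWSER_TIMEOUT_MS = 30000  # default quoted in the notes above


def browser_timeout_ms(seconds: float) -> int:
    """Convert seconds to the milliseconds used by the browser-based fetchers."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return int(round(seconds * 1000))


# 45 seconds for a slow page, expressed in milliseconds.
print(browser_timeout_ms(45))
```

With this helper, a 45-second limit would be written as 45 for Fetcher and as `browser_timeout_ms(45)` for DynamicFetcher or StealthyFetcher.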

Install command

npx skills add https://github.com/nousresearch/hermes-agent --skill scrapling

Repository

https://github.com/D4Vinci/Scrapling

GitHub stars: 83.2K

First seen: 11 days ago

Security audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Warn

Installed on: opencode (2), deepagents (2), antigravity (2), github-copilot (2), codex (2), warp (2)
