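Scrapling is a web scraping framework that combines anti-bot bypass, stealth browser automation, and a spider (crawler) framework. It offers three fetching strategies (HTTP, dynamic JS, and stealth/Cloudflare) and a full CLI.
This skill is intended for educational and research purposes only. Users must comply with local and international data-scraping laws and respect each website's terms of service.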

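Use this skill when the web_extract tool cannot return the data you need.

Install with all extras:

    pip install "scrapling[all]"
    scrapling install

Minimal install (HTTP only, no browser):

    pip install scrapling

Browser automation only:

    pip install "scrapling[fetchers]"
    scrapling install
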
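| Approach | Class | When to use |
|---|---|---|
| HTTP | Fetcher / FetcherSession | Static pages, APIs, fast bulk requests |
| Dynamic | DynamicFetcher / DynamicSession | JS-rendered content, SPAs |
| Stealth | StealthyFetcher / StealthySession | Cloudflare and other anti-bot-protected sites |
| Spider | Spider | Multi-page crawls with link following |
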
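The table above can be read as a simple decision rule. The sketch below encodes it in plain Python; the function name `pick_fetcher`, its flags, and the order in which the checks are applied are assumptions made for illustration and are not part of Scrapling.

```python
def pick_fetcher(needs_js: bool = False, anti_bot: bool = False,
                 follow_links: bool = False) -> str:
    """Return the Scrapling class name that the table suggests for a task.

    Assumed precedence: crawling beats stealth, stealth beats dynamic,
    and plain HTTP is the default.
    """
    if follow_links:
        return "Spider"
    if anti_bot:
        return "StealthyFetcher"
    if needs_js:
        return "DynamicFetcher"
    return "Fetcher"


print(pick_fetcher())                # static page or API
print(pick_fetcher(needs_js=True))   # SPA or lazy-loaded content
print(pick_fetcher(anti_bot=True))   # Cloudflare-protected site
```

Starting with the cheapest option and moving down the table only when needed keeps runs fast, since the browser-based fetchers are slower than plain HTTP.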
    # Plain HTTP GET, saved as Markdown
    scrapling extract get 'https://example.com' output.md

With a CSS selector and browser impersonation:

    scrapling extract get 'https://example.com' output.md \
        --css-selector '.content' \
        --impersonate 'chrome'

    # Dynamic (browser-rendered) fetch
    scrapling extract fetch 'https://example.com' output.md \
        --css-selector '.dynamic-content' \
        --disable-resources \
        --network-idle

    # Stealth fetch for a protected site
    scrapling extract stealthy-fetch 'https://protected-site.com' output.html \
        --solve-cloudflare \
        --block-webrtc \
        --hide-canvas

    # POST a JSON body
    scrapling extract post 'https://example.com/api' output.json \
        --json '{"query": "search term"}'

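The output format is determined by the file extension:

- .html -- raw HTML
- .md -- converted to Markdown
- .txt -- plain text
- .json / .jsonl -- JSON

A basic HTTP fetch from Python:

    from scrapling.fetchers import Fetcher

    # Fetch a static page over HTTP and print the text of every quote
    page = Fetcher.get('https://quotes.toscrape.com/')
    quotes = page.css('.quote .text::text').getall()
    for q in quotes:
        print(q)
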
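    from scrapling.fetchers import FetcherSession

    # Reuse one session (impersonating Chrome) across several requests
    with FetcherSession(impersonate='chrome') as session:
        page = session.get('https://example.com/', stealthy_headers=True)
        links = page.css('a::attr(href)').getall()
        # Visit the first five links; this assumes the hrefs are absolute URLs
        for link in links[:5]:
            sub = session.get(link)
            print(sub.css('h1::text').get())
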
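The hrefs collected in the session example may be relative (for instance `/about`), in which case they would need to be resolved against the page URL before being requested. A minimal sketch using only the standard library follows; the helper name `absolutize` and its skip rules are hypothetical and not part of Scrapling.

```python
from urllib.parse import urljoin


def absolutize(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against base_url, skipping empty and non-navigable links."""
    resolved = []
    for href in hrefs:
        # Skip empty values, in-page anchors, and non-HTTP pseudo-links
        if not href or href.startswith(("#", "mailto:", "javascript:")):
            continue
        resolved.append(urljoin(base_url, href))
    return resolved


links = absolutize(
    "https://example.com/blog/",
    ["/about", "post-1", "https://other.example/x", "#top"],
)
for link in links:
    print(link)
```

The resolved list could then be passed to `session.get` one URL at a time, as in the loop above.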
    # Other HTTP verbs
    page = Fetcher.post('https://api.example.com/data', json={"key": "value"})
    page = Fetcher.put('https://api.example.com/item/1', data={"name": "updated"})
    page = Fetcher.delete('https://api.example.com/item/1')

    # Route a request through a proxy
    page = Fetcher.get('https://example.com', proxy='http://user:pass@proxy:8080')

For pages that need JavaScript execution (SPAs, lazy-loaded content):

    from scrapling.fetchers import DynamicFetcher

    # Render the page in a headless browser before extracting
    page = DynamicFetcher.fetch('https://example.com', headless=True)
    data = page.css('.js-loaded-content::text').getall()

    # Wait for an element to become visible and for the network to go idle
    page = DynamicFetcher.fetch(
        'https://example.com',
        wait_selector='.results',
        wait_selector_state='visible',
        network_idle=True,
    )

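Blocking fonts, images, media, and stylesheets makes page loads roughly 25% faster:

    from scrapling.fetchers import DynamicSession

    # One browser session with resource blocking enabled for every fetch
    with DynamicSession(headless=True, disable_resources=True, network_idle=True) as session:
        page = session.fetch('https://example.com')
        items = page.css('.item::text').getall()
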
    from playwright.sync_api import Page
    from scrapling.fetchers import DynamicFetcher

    def scroll_and_click(page: Page):
        """Scroll down, click "load more", and wait for the extra results."""
        page.mouse.wheel(0, 3000)
        page.wait_for_timeout(1000)
        page.click('button.load-more')
        page.wait_for_selector('.extra-results')

    # page_action receives the Playwright page so it can be automated before extraction
    page = DynamicFetcher.fetch('https://example.com', page_action=scroll_and_click)
    results = page.css('.extra-results .item::text').getall()

For Cloudflare-protected sites or sites with strict fingerprint detection:

    from scrapling.fetchers import StealthyFetcher

    page = StealthyFetcher.fetch(
        'https://protected-site.com',
        headless=True,
        solve_cloudflare=True,
        block_webrtc=True,
        hide_canvas=True,
    )
    content = page.css('.protected-content::text').getall()

    from scrapling.fetchers import StealthySession

    # Keep one stealth browser session open across several pages
    with StealthySession(headless=True, solve_cloudflare=True) as session:
        page1 = session.fetch('https://protected-site.com/page1')
        page2 = session.fetch('https://protected-site.com/page2')

Every fetcher returns a Selector object with the following methods:

    # CSS selectors
    page.css('h1::text').get()                 # text of the first h1
    page.css('a::attr(href)').getall()         # href of every link
    page.css('.quote .text::text').getall()    # nested selection

    # XPath
    page.xpath('//div[@class="content"]/text()').getall()
    page.xpath('//a/@href').getall()

    # Find helpers
    page.find_all('div', class_='quote')       # by tag + attribute
    page.find_by_text('Read more', tag='a')    # by text content
    page.find_by_regex(r'\$\d+\.\d{2}')        # by regular-expression pattern

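To make the pattern passed to `find_by_regex` concrete, the sketch below runs the same regular expression with Python's standard `re` module on a made-up string. It only illustrates what the pattern matches; it does not call Scrapling, and the sample text is invented.

```python
import re

# Same pattern as in the find_by_regex example: a dollar sign, one or more
# digits, a dot, and exactly two digits.
PRICE = re.compile(r'\$\d+\.\d{2}')

sample = "Widget $19.99, gadget $5.00, free sample $0, bulk 1200.50"
prices = PRICE.findall(sample)
print(prices)
```

Amounts without a dollar sign or without two decimal places, such as `$0` and `1200.50` in the sample, are not matched.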
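Find structurally similar elements (useful for product listings and the like):

    first_product = page.css('.product')[0]
    all_similar = first_product.find_similar()

    # Navigate relative to an element
    el = page.css('.target')[0]
    el.parent          # parent element
    el.children        # child elements
    el.next_sibling    # next sibling element
    el.prev_sibling    # previous sibling element
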
For multi-page crawls with link following:

    from scrapling.spiders import Spider, Request, Response

    class QuotesSpider(Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]
        concurrent_requests = 10
        download_delay = 1

        async def parse(self, response: Response):
            # Yield one item per quote on the page
            for quote in response.css('.quote'):
                yield {
                    "text": quote.css('.text::text').get(),
                    "author": quote.css('.author::text').get(),
                    "tags": quote.css('.tag::text').getall(),
                }
            # Follow the pagination link, if there is one
            next_page = response.css('.next a::attr(href)').get()
            if next_page:
                yield response.follow(next_page)

    result = QuotesSpider().start()
    print(f"Scraped {len(result.items)} quotes")
    result.items.to_json("quotes.json")

Route requests to different fetcher types:

    from scrapling.fetchers import FetcherSession, AsyncStealthySession
    from scrapling.spiders import Spider, Request, Response

    class SmartSpider(Spider):
        name = "smart"
        start_urls = ["https://example.com/"]

        def configure_sessions(self, manager):
            # Register a fast HTTP session and a lazily started stealth session
            manager.add("fast", FetcherSession(impersonate="chrome"))
            manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

        async def parse(self, response: Response):
            for link in response.css('a::attr(href)').getall():
                if "protected" in link:
                    yield Request(link, sid="stealth")
                else:
                    yield Request(link, sid="fast", callback=self.parse)

    # Press Ctrl+C to pause; rerun to resume from the checkpoint
    spider = QuotesSpider(crawldir="./crawl_checkpoint")
    spider.start()

- Run scrapling install before using the browser-based fetchers -- otherwise DynamicFetcher and StealthyFetcher will fail.
- solve_cloudflare=True adds 5-15 seconds to the fetch time -- enable it only when needed.

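Because the Cloudflare solver adds several seconds per fetch, one way to enable it only when needed is to try a cheap strategy first and escalate only if the response looks blocked. The following is a minimal offline sketch of that idea under stated assumptions: the function `fetch_with_escalation`, the set of "blocked" status codes, and the stand-in fetchers are all hypothetical, and it assumes the response object exposes a `status` attribute.

```python
from types import SimpleNamespace

# Assumption: these status codes are treated as "blocked by anti-bot protection".
BLOCKED_STATUSES = {403, 429, 503}


def fetch_with_escalation(url, strategies):
    """Try (name, fetch) pairs in order; return the first response that is not blocked.

    If every strategy is blocked, the last (name, response) pair is returned.
    """
    if not strategies:
        raise ValueError("at least one strategy is required")
    for name, fetch in strategies:
        response = fetch(url)
        if getattr(response, "status", None) not in BLOCKED_STATUSES:
            break
    return name, response


# Stand-ins for real fetchers so the sketch runs without network access.
calls = []


def fake_http(url):
    calls.append("http")
    return SimpleNamespace(status=403)  # pretend the plain request was blocked


def fake_stealth(url):
    calls.append("stealth")
    return SimpleNamespace(status=200)  # pretend the stealth request succeeded


used, response = fetch_with_escalation(
    "https://protected-site.com",
    [("http", fake_http), ("stealth", fake_stealth)],
)
print(used, response.status)
```

With Scrapling installed, the stand-ins could be replaced by `Fetcher.get` and a small wrapper around `StealthyFetcher.fetch(url, headless=True, solve_cloudflare=True)`, so the slower solver runs only after the cheaper request has been blocked.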
Install command:

    npx skills add https://github.com/nousresearch/hermes-agent --skill scrapling

Repository: https://github.com/D4Vinci/Scrapling
GitHub stars: 83.2K
First seen: 11 days ago
Security audit: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Warn
Installed on: opencode (2), deepagents (2), antigravity (2), github-copilot (2), codex (2), warp (2)
