
An Introduction to the Crawlee Library for Python

2026-06-28 06:55:42

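Crawlee for Python is a modern, end-to-end open-source library for web scraping and browser automation, developed and maintained by the Apify team. Its goal is to help developers build reliable, efficient crawlers, with built-in features that make crawler traffic look more like a human visitor and so less likely to trip modern anti-bot defences.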

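Core architecture and how it runs

Crawlee's architecture is built around three core concepts:

RequestQueue: a dynamic, persistent queue of URL requests. You can seed it with a batch of start URLs in advance, and you can also enqueue new URLs while the crawler is running, based on the content of pages already visited. This lets a crawler start from one or a few entry points and systematically traverse a whole site.

Crawler: the central component, which orchestrates the whole crawl. It takes URLs from the RequestQueue, runs request handling, manages the number of concurrent requests, retries failed requests automatically, and coordinates storage and data processing. You pick a crawler type to match the kind of pages you are targeting.

Request Handler: where you write your scraping logic. It is a user-defined async function that the crawler calls for every request. Inside it you can access the content of the current page, extract the data you need, save results, and decide whether to add new URLs to the queue.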

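Crawler types and how to choose

Crawlee provides several crawler types for different scraping scenarios.

HTTP crawlers: BeautifulSoupCrawler and ParselCrawler work over plain HTTP and fetch the HTML returned by the server. They are fast and light on resources, but they cannot execute JavaScript. The former parses HTML with the popular BeautifulSoup library; the latter uses Parsel, whose API resembles Scrapy's and will feel familiar to Scrapy users.

Browser crawler: PlaywrightCrawler launches a real browser (such as Chromium or Firefox) to render the page, so it can handle dynamic sites that depend on JavaScript. It is built on the Playwright library.

Adaptive crawler: AdaptivePlaywrightCrawler is a hybrid crawler introduced in v0.6. It analyses pages and switches automatically between fast HTTP crawling and full browser rendering to balance performance and compatibility.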

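Code in practice

The following examples show how to use the different crawlers for a range of scraping tasks.

Installation

Crawlee requires Python 3.10 or later. You can install it with pip: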

pip install 'crawlee[all]'

If you plan to use PlaywrightCrawler, you also need to install the browsers it depends on:

playwright install

1. A basic BeautifulSoupCrawler example

This example uses BeautifulSoupCrawler to fetch a single page and extract its title.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        url = context.request.url
        # Extract the title from the parsed BeautifulSoup object
        title = context.soup.title.string if context.soup.title else ''
        context.log.info(f'The title of {url} is: {title}.')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

2. Scraping dynamic pages with PlaywrightCrawler

If the target site relies on JavaScript to load its content, you need PlaywrightCrawler. This example waits for an element to appear, reads the rendered page, and enqueues the links it finds.

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=False,  # False shows the browser window, which helps when debugging
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Wait for a specific element to load (Playwright timeouts are in milliseconds)
        await context.page.wait_for_selector('h1', timeout=5000)

        # Read the page title
        title = await context.page.title()
        context.log.info(f'Page title: {title}')

        # Read the rendered HTML
        content = await context.page.content()
        context.log.info(f'Page content length: {len(content)}')

        # Find the links on the page and enqueue them to crawl deeper
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())

3. Organising complex crawl logic with a Router

When the crawl logic grows and different kinds of pages need different handling, you can use a Router to organise the code. Each request is dispatched to the handler registered for its label, and requests without a label go to the default handler.

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router

router = Router[ParselCrawlingContext]()


@router.default_handler
async def default_handler(context: ParselCrawlingContext) -> None:
    """Default logic for ordinary pages."""
    context.log.info(f'Default handler processing: {context.request.url}')
    # Enqueue product links; the label decides which handler receives them
    await context.enqueue_links(selector='a.product-link', label='PRODUCT')


@router.handler('PRODUCT')
async def product_handler(context: ParselCrawlingContext) -> None:
    """Logic dedicated to product pages."""
    context.log.info(f'Product handler processing: {context.request.url}')

    # Extract product data with CSS selectors
    product_name = context.selector.css('h1.product-title::text').get()
    product_price = context.selector.css('.price::text').get()
    context.log.info(f'Product name: {product_name}, price: {product_price}')

    # Save the structured data to the dataset
    await context.push_data({
        'url': context.request.url,
        'name': product_name,
        'price': product_price,
    })


async def main() -> None:
    crawler = ParselCrawler(
        request_handler=router,
        max_requests_per_crawl=20,
    )
    await crawler.run(['https://example-ecommerce-site.com'])


if __name__ == '__main__':
    asyncio.run(main())

4. A crawler that logs in

Many sites require authentication before they show their content. The example below uses PlaywrightCrawler to complete a login form and a single-session SessionPool so that all requests share one session. The selectors, credentials and URLs are placeholders.

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.sessions import SessionPool


async def main() -> None:
    # A pool with one session, so every request reuses the same session
    session_pool = SessionPool(max_pool_size=1)

    crawler = PlaywrightCrawler(
        headless=False,  # set to True to run in the background
        max_requests_per_crawl=10,
        session_pool=session_pool,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Check whether this is the login page
        if '/login' in context.request.url:
            # Fill in and submit the login form
            await context.page.fill('#username', 'your_username')
            await context.page.fill('#password', 'your_password')
            await context.page.click('button[type="submit"]')
            # Wait for the page that follows the form submission to load
            await context.page.wait_for_load_state()
            context.log.info('Login complete')

            # After logging in, queue a page that requires authentication
            await context.add_requests(['https://example.com/dashboard'])
        else:
            # Handle pages that are only available after login
            title = await context.page.title()
            context.log.info(f'Reached content page: {title}')
            await context.enqueue_links()

    await crawler.run(['https://example.com/login'])


if __name__ == '__main__':
    asyncio.run(main())

5. Handling infinite-scroll pages

For pages that use infinite scroll, PlaywrightCrawler can load more content by scrolling until the page height stops growing.

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(headless=False)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing: {context.request.url}')

        # Scroll repeatedly to load more content
        previous_height = await context.page.evaluate('document.body.scrollHeight')
        while True:
            # Scroll to the bottom
            await context.page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            await context.page.wait_for_timeout(2000)  # wait for new content to load

            # Check whether new content was loaded
            new_height = await context.page.evaluate('document.body.scrollHeight')
            if new_height == previous_height:
                break  # nothing new, stop scrolling
            previous_height = new_height

        # Collect all product elements
        items = await context.page.query_selector_all('.product-item')
        context.log.info(f'Found {len(items)} products')

        for item in items:
            # Extract the name and price of each product
            name = await item.query_selector('.product-name')
            price = await item.query_selector('.product-price')
            if name and price:
                context.log.info(
                    f'Product: {await name.inner_text()}, price: {await price.inner_text()}'
                )

    await crawler.run(['https://example.com/products'])


if __name__ == '__main__':
    asyncio.run(main())

6. Storing and managing data

Crawlee provides a built-in Dataset for storing and managing the structured data you scrape. Data is saved as JSON in the local storage directory.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset


async def main() -> None:
    # Create or open a dataset
    dataset = await Dataset.open()

    # Cap the crawl so that following links does not walk the whole site
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        url = context.request.url
        title = context.soup.title.string if context.soup.title else ''
        loaded_at = context.request.loaded_at

        # Push the record to the dataset
        await dataset.push_data({
            'url': url,
            'title': title,
            'scraped_at': loaded_at.isoformat() if loaded_at else None,
        })
        context.log.info(f'Saved: {url}')

        # Keep crawling newly found links
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])

    # After the crawl, read the dataset back and count the items.
    # The handler's context no longer exists here, so print is used instead.
    data = await dataset.get_data()
    print(f'Scraped and saved {len(data.items)} items')


if __name__ == '__main__':
    asyncio.run(main())

7. Efficient crawling with AdaptivePlaywrightCrawler

AdaptivePlaywrightCrawler is Crawlee's adaptive crawler. It switches automatically between lightweight HTTP crawling and full browser crawling, so the handler below sticks to helpers that work in either mode.

import asyncio
from datetime import timedelta

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext


async def main() -> None:
    # Use BeautifulSoup as the static parser
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=5,
        playwright_crawler_specific_kwargs={'browser_type': 'chromium'},
    )

    @crawler.router.default_handler
    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing: {context.request.url}')

        # parsed_content is the page parsed by the static parser (BeautifulSoup here)
        title = context.parsed_content.title
        context.log.info(f'Page title: {title.string if title else ""}')

        # Locate a specific element, waiting up to 5 seconds for it
        element = await context.query_selector_one('h2', timeout=timedelta(seconds=5))
        if element:
            context.log.info(f'Found element: {element.get_text()}')

        # Discover and enqueue new links
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

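Core features

Beyond flexible crawlers and request routing, Crawlee ships with several built-in features that further simplify crawler development:

Automatic concurrency and retries: Crawlee scales parallel crawling automatically according to available system resources. When a request fails or is blocked, it retries the request automatically, which makes crawls more stable.

Session and proxy management: for scenarios that need a login or frequent IP changes, Crawlee has a built-in SessionPool and proxy rotation, so you can manage session state and identities and reduce the risk of being blocked.

Anti-detection and fingerprinting: through its integration with the browserforge library, Crawlee generates realistic browser fingerprints and HTTP headers automatically, so crawler traffic looks more like a real visitor and is harder for anti-bot systems to detect.

Persistent storage: Crawlee provides components such as RequestQueue and Dataset, which persist the queue of URLs still to be crawled and the structured data already scraped, so that a crawl can resume after the program is interrupted.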
