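Why are async crawlers more efficient?
A traditional synchronous crawler blocks on every network I/O call, while an async crawler can keep many requests in flight at once. In the author's tests, async crawler throughput reached 8 to 10 times that of the synchronous version.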
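import asyncio
import aiohttp

async def fetch(url):
    # Opening a session per call keeps the example short; real crawlers usually share one session.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()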
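How does the asyncio event loop work?
The event loop is the core of async programming, and you need to understand it to write efficient code. Put simply, it is an infinite loop that keeps pulling tasks off a queue and running them. Each task runs until it reaches an await that has to wait, at which point the loop moves on to the next task.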
async def main():
    urls = ["https://example.com"] * 10
    # Build the coroutines first, then let gather() run them concurrently.
    tasks = [fetch(url) for url in urls]
    results = await asyncio.gather(*tasks)
    print(f"Fetched {len(results)} pages")

if __name__ == "__main__":
    asyncio.run(main())
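To make the "loop pulling tasks off a queue" picture concrete, here is a minimal sketch that needs no network. The `worker` and `demo` names are made up for illustration, and `asyncio.sleep(0)` stands in for any await that hands control back to the loop. The recorded order shows the loop alternating between the two tasks instead of finishing one before starting the other.

```python
import asyncio


async def worker(name, steps, log):
    for i in range(steps):
        log.append(f"{name}-{i}")
        # sleep(0) does no real waiting; it only yields to the event loop
        # so that other scheduled tasks get a turn.
        await asyncio.sleep(0)


async def demo():
    log = []
    # gather() schedules both coroutines as tasks on the same loop.
    await asyncio.gather(worker("a", 2, log), worker("b", 2, log))
    return log


order = asyncio.run(demo())
print(order)
```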
Hands-on: crawling the Douban Top250 movies
Straight to the code. It includes concurrency control so your IP does not get banned.
async def fetch_movie(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def crawl_douban():
    # One shared session for all requests.
    async with aiohttp.ClientSession() as session:
        tasks = []
        for page in range(0, 250, 25):
            url = f"https://movie.douban.com/top250?start={page}"
            tasks.append(fetch_movie(session, url))
            # Run at most 5 requests at a time, then start a new batch.
            if len(tasks) >= 5:
                await asyncio.gather(*tasks)
                tasks = []
        # Flush whatever is left in the last, partial batch.
        if tasks:
            await asyncio.gather(*tasks)

asyncio.run(crawl_douban())
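The batching above throws the downloaded HTML away. Here is a minimal sketch of the same batch-of-five idea that keeps the results. `crawl_in_batches`, its `fetch` parameter, and `fake_fetch` are hypothetical names, and `fake_fetch` simulates a request so the example runs offline. In a real crawl you would pass a function that wraps `fetch_movie` with a shared session.

```python
import asyncio


async def crawl_in_batches(urls, fetch, batch_size=5):
    """Run fetch(url) for every URL, at most batch_size at a time."""
    results = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        # gather() preserves input order, so results line up with urls.
        results.extend(await asyncio.gather(*(fetch(url) for url in batch)))
    return results


async def fake_fetch(url):
    # Stand-in for a real HTTP request (assumption: about 10 ms of latency).
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


urls = [f"https://movie.douban.com/top250?start={page}" for page in range(0, 250, 25)]
pages = asyncio.run(crawl_in_batches(urls, fake_fetch))
print(f"Collected {len(pages)} pages")
```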
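Troubleshooting common errors
Mixing in synchronous blocking code, such as calling the requests library inside an async function. This stalls the whole program, because the event loop cannot run anything else while the call blocks.
Forgetting await, so you get a coroutine object back instead of the actual data.
Setting concurrency too high, which easily gets you rate-limited. Use a semaphore to cap it.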
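# At most 10 requests in flight. On Python older than 3.10, create the
# semaphore inside a running coroutine instead of at module level.
semaphore = asyncio.Semaphore(10)

async def safe_fetch(url):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                return await resp.text()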
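The first two pitfalls are easy to reproduce without any network access. In this sketch, `blocking_io` is a hypothetical stand-in for a synchronous call such as `requests.get`, and `asyncio.to_thread` (available since Python 3.9) is one way to move it off the event loop. Treat it as an illustration, not the only fix.

```python
import asyncio
import time


async def get_data():
    await asyncio.sleep(0.01)
    return "data"


def blocking_io():
    # Hypothetical stand-in for a synchronous call like requests.get().
    time.sleep(0.05)
    return "blocking result"


async def main():
    # Pitfall: without await you only hold a coroutine object.
    pending = get_data()
    forgot_await = asyncio.iscoroutine(pending)
    value = await pending  # awaiting it yields the actual data

    # Pitfall: calling blocking_io() directly would freeze the loop for 50 ms.
    # Running it in a worker thread lets other coroutines keep going.
    start = time.perf_counter()
    results = await asyncio.gather(
        asyncio.to_thread(blocking_io),
        asyncio.to_thread(blocking_io),
    )
    elapsed = time.perf_counter() - start
    return forgot_await, value, results, elapsed


forgot_await, value, results, elapsed = asyncio.run(main())
print(forgot_await, value, results, f"{elapsed:.3f}s")
```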
Performance comparison: async vs sync
The data comes from real tests, with network latency of about 100 ms.
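A quick way to see where the gap comes from is to simulate it. The sketch below replaces real requests with sleeps of roughly 100 ms (an assumption, as are the `sync_request`, `async_request`, and `N_REQUESTS` names), so the numbers it prints will differ from a real crawl.

```python
import asyncio
import time

LATENCY = 0.1     # simulated network latency in seconds (about 100 ms)
N_REQUESTS = 5    # hypothetical number of pages


def sync_request():
    time.sleep(LATENCY)  # blocks until the "response" arrives


async def async_request():
    await asyncio.sleep(LATENCY)  # yields to the event loop while waiting


def run_sync():
    start = time.perf_counter()
    for _ in range(N_REQUESTS):
        sync_request()
    return time.perf_counter() - start


async def run_async():
    start = time.perf_counter()
    await asyncio.gather(*(async_request() for _ in range(N_REQUESTS)))
    return time.perf_counter() - start


sync_seconds = run_sync()
async_seconds = asyncio.run(run_async())
print(f"sync:  {sync_seconds:.2f}s")
print(f"async: {async_seconds:.2f}s")
```

With simulated waits, the synchronous total grows with the number of requests while the async total stays close to a single latency. Real crawls add parsing, connection limits, and server throttling on top of that.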
Going further: pairing with an async database
Use asyncpg to connect to PostgreSQL so the whole pipeline stays async.
import asyncpg

async def save_to_db(data):
    conn = await asyncpg.connect(user='user', password='pass', database='crawl')
    try:
        await conn.execute('INSERT INTO movies(name, rating) VALUES($1, $2)',
                           data['name'], data['rating'])
    finally:
        # Close the connection even if the INSERT fails.
        await conn.close()
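Opening a connection per row gets expensive once a crawl produces many records. As a sketch, the hypothetical `save_movies` helper below takes an already-open connection and writes a whole batch with `executemany`, a method asyncpg connections provide. The SQL reuses the `movies(name, rating)` table from above. `RecordingConn` and the sample rows are placeholders so the example runs without a database.

```python
import asyncio

INSERT_SQL = "INSERT INTO movies(name, rating) VALUES($1, $2)"


async def save_movies(conn, movies):
    """Insert a batch of movie dicts through an open asyncpg-style connection."""
    rows = [(movie["name"], movie["rating"]) for movie in movies]
    if rows:
        # One call for the whole batch instead of one execute() per row.
        await conn.executemany(INSERT_SQL, rows)
    return len(rows)


class RecordingConn:
    """In-memory stand-in for a connection (not part of asyncpg)."""

    def __init__(self):
        self.calls = []

    async def executemany(self, query, rows):
        self.calls.append((query, list(rows)))


# Placeholder data, only to exercise the helper.
sample = [{"name": "Movie A", "rating": 9.0}, {"name": "Movie B", "rating": 8.5}]
conn = RecordingConn()
saved = asyncio.run(save_movies(conn, sample))
print(f"Saved {saved} rows")
```

With a real database, pass the object returned by `await asyncpg.connect(...)` in place of `RecordingConn()` and close it when the batch is done.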
Hands-on example: batch-downloading images
Downloading a large number of images is more efficient when done asynchronously.
async def download_image(session, url, save_path):
    async with session.get(url) as resp:
        data = await resp.read()
    # Note: this file write is synchronous, and the img/ directory must already exist.
    with open(save_path, 'wb') as f:
        f.write(data)

async def batch_download(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [download_image(session, url, f"img/{i}.jpg")
                 for i, url in enumerate(urls)]
        await asyncio.gather(*tasks)
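With a plain `gather`, the first failed download raises out of `batch_download` and the other outcomes are lost. One possible alternative is `return_exceptions=True`, which hands failures back alongside successes. In this sketch `fake_download` simulates the request so it runs offline, and `download_all` is a hypothetical name.

```python
import asyncio


async def fake_download(url):
    # Stand-in for session.get(...) plus a file write; fails on "bad" URLs.
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError(f"download failed: {url}")
    return url


async def download_all(urls, download):
    # return_exceptions=True turns raised exceptions into ordinary results.
    outcomes = await asyncio.gather(
        *(download(url) for url in urls), return_exceptions=True
    )
    succeeded = [o for o in outcomes if not isinstance(o, Exception)]
    failed = [o for o in outcomes if isinstance(o, Exception)]
    return succeeded, failed


urls = [
    "https://example.com/1.jpg",
    "https://example.com/bad.jpg",
    "https://example.com/3.jpg",
]
succeeded, failed = asyncio.run(download_all(urls, fake_download))
print(f"{len(succeeded)} succeeded, {len(failed)} failed")
```

The failed entries keep their exception objects, so they can be logged or retried later.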
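Summary
Async programming is a must-have skill for intermediate Python, and it is especially well suited to I/O-bound tasks. From crawlers to API services, mastering asyncio can noticeably improve how efficiently your code runs.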