
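Wiring up APIs from different model providers is tedious: OpenAI has one format, Anthropic another, and Google yet another. Authentication differs across all of them, and the response fields don't line up. If a project uses several models at once, or may switch models later, the adapter code becomes a maintenance nightmare.
LiteLLM addresses exactly this: it wraps 100+ models behind a single OpenAI-compatible interface, so switching models means changing one parameter.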
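LiteLLM does three things:
Unified format: whatever the underlying model, input is an OpenAI-style messages array and the reply is read from response.choices[0].message.content. The library handles field differences between providers internally, so application code doesn't have to.
Unified authentication: put each platform's API key in an environment variable and LiteLLM handles the authentication details at call time. There is no need to write separate auth logic per model.
Easy switching: to change models, change the model parameter and leave the rest of the code alone.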
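To make the "unified format" point concrete, here is a toy adapter. It is not LiteLLM's implementation; it assumes simplified versions of the raw response bodies returned by each provider's HTTP API, and the function names `extract_text` and `to_openai_shape` are made up for this sketch.

```python
from typing import Any


def extract_text(provider: str, raw: dict[str, Any]) -> str:
    """Pull the generated text out of a provider-specific response body."""
    if provider == "openai":
        return raw["choices"][0]["message"]["content"]
    if provider == "anthropic":
        # Anthropic returns a list of content blocks; join the text ones.
        return "".join(
            block["text"] for block in raw["content"] if block.get("type") == "text"
        )
    if provider == "gemini":
        # Gemini nests text under candidates -> content -> parts.
        parts = raw["candidates"][0]["content"]["parts"]
        return "".join(part.get("text", "") for part in parts)
    raise ValueError(f"unknown provider: {provider}")


def to_openai_shape(provider: str, raw: dict[str, Any]) -> dict[str, Any]:
    """Re-wrap any provider's reply in the OpenAI-style choices structure."""
    return {
        "choices": [
            {"message": {"role": "assistant", "content": extract_text(provider, raw)}}
        ]
    }
```

With an adapter like this, calling code always reads `["choices"][0]["message"]["content"]`, which is the kind of translation a unifying library does on your behalf.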
When you need to compare how different models perform on the same task, LiteLLM keeps it simple:
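
    import os
    from litellm import completion

    # Placeholder keys for the demo; in real code, load them from the environment.
    # LiteLLM reads GEMINI_API_KEY for models with the gemini/ prefix.
    os.environ.update({
        "OPENAI_API_KEY": "your-openai-key",
        "ANTHROPIC_API_KEY": "your-anthropic-key",
        "GEMINI_API_KEY": "your-gemini-key",
    })


    def compare_models(prompt, models):
        """Compare how several models respond to the same prompt."""
        results = {}
        for model_name in models:
            try:
                response = completion(
                    model=model_name,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.7,
                    max_tokens=500,
                )
                results[model_name] = response.choices[0].message.content
                print(f"✓ {model_name}: succeeded")
            except Exception as e:
                # Record the failure so one bad model doesn't abort the comparison.
                results[model_name] = f"Call failed: {e}"
                print(f"✗ {model_name}: failed")
        return results


    prompt = "Implement quicksort in Python with detailed comments."
    # Model IDs change over time; check each provider's docs for current names.
    models_to_test = [
        "openai/gpt-4",
        "anthropic/claude-3-sonnet-20240229",
        "gemini/gemini-1.5-pro",
    ]

    results = compare_models(prompt, models_to_test)
    for model, response in results.items():
        print(f"\n{'=' * 50}")
        print(f"Model: {model}")
        print(f"Response length: {len(response)} characters")
        print(f"Response preview: {response[:200]}...")
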
In production, no single model can guarantee 100% availability. LiteLLM's Router provides a pool of models with automatic failover:

    import os
    from litellm import Router, completion

    # Three deployments share one alias, so the router can balance across them.
    model_pool = [
        {
            "model_name": "primary-chat",
            "litellm_params": {
                "model": "openai/gpt-4",
                "api_key": os.environ["OPENAI_API_KEY"],
                "api_base": "https://api.openai.com/v1",
            },
        },
        {
            "model_name": "primary-chat",
            "litellm_params": {
                "model": "anthropic/claude-3-haiku-20240307",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
                "max_tokens": 1000,
            },
        },
        {
            "model_name": "primary-chat",
            "litellm_params": {
                "model": "gemini/gemini-1.5-flash",
                "api_key": os.environ["GEMINI_API_KEY"],
            },
        },
    ]

    router = Router(
        model_list=model_pool,
        routing_strategy="usage-based-routing",  # load-balance by usage
        timeout=30,
        num_retries=2,
    )


    def robust_chat_completion(messages, fallback_models=None):
        """Call the model pool, then try explicit fallbacks if the pool fails."""
        try:
            response = router.completion(
                model="primary-chat",
                messages=messages,
                temperature=0.8,
                stream=False,
            )
            return {
                "success": True,
                "model_used": getattr(response, "model", "unknown"),
                "content": response.choices[0].message.content,
                "usage": getattr(response, "usage", None),
            }
        except Exception as e:
            print(f"Primary model pool failed: {e}")
            for fallback_model in fallback_models or []:
                try:
                    response = completion(
                        model=fallback_model,
                        messages=messages,
                        temperature=0.8,
                    )
                    return {
                        "success": True,
                        "model_used": fallback_model,
                        "content": response.choices[0].message.content,
                        "fallback": True,
                    }
                except Exception as fallback_error:
                    print(f"Fallback {fallback_model} failed: {fallback_error}")
            return {"success": False, "error": str(e)}


    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain overfitting in machine learning and how to prevent it."},
    ]

    result = robust_chat_completion(messages)
    if result["success"]:
        print(f"Model used: {result.get('model_used')}")
        print(f"Response: {result['content'][:300]}...")
    else:
        print(f"All model calls failed: {result['error']}")

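The failover idea itself does not depend on any particular library. The sketch below shows it with plain callables standing in for model backends; `call_with_fallback` and the backend names are hypothetical and are not part of LiteLLM.

```python
from typing import Callable, Sequence

Backend = tuple[str, Callable[[str], str]]


def call_with_fallback(prompt: str, backends: Sequence[Backend]) -> dict:
    """Try each (name, callable) in order and return the first success."""
    errors: dict[str, str] = {}
    for name, call in backends:
        try:
            content = call(prompt)
        except Exception as exc:
            # Remember why this backend failed, then move on to the next one.
            errors[name] = str(exc)
            continue
        return {"success": True, "model_used": name, "content": content, "errors": errors}
    return {"success": False, "model_used": None, "content": None, "errors": errors}
```

Keeping the per-backend errors in the result makes it easier to tell afterwards whether a fallback answer hid an outage in the primary.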
Once call volume grows, token cost becomes unavoidable. Below is a call wrapper with a budget cap:

    import os
    from datetime import datetime
    from litellm import completion

    # Only takes effect if Helicone logging is enabled,
    # e.g. litellm.success_callback = ["helicone"].
    os.environ["HELICONE_API_KEY"] = "your-helicone-key"


    class CostAwareLLMClient:
        def __init__(self, project_name, budget_limit=None):
            self.project_name = project_name
            self.budget_limit = budget_limit
            self.total_cost = 0.0
            self.usage_log = []

        def call_with_cost_tracking(self, model, messages, **kwargs):
            """Call a model and record token usage and estimated cost."""
            if self.budget_limit is not None and self.total_cost >= self.budget_limit:
                raise ValueError(
                    f"Project '{self.project_name}' has reached its budget limit of ${self.budget_limit}"
                )

            metadata = {
                "project": self.project_name,
                "call_timestamp": datetime.now().isoformat(),
                **kwargs.pop("metadata", {}),
            }

            try:
                response = completion(
                    model=model,
                    messages=messages,
                    metadata=metadata,
                    **kwargs,
                )
            except Exception as e:
                print(f"Model call failed: {e}")
                raise

            call_record = {
                "timestamp": datetime.now(),
                "model": model,
                "input_tokens": getattr(response.usage, "prompt_tokens", 0),
                "output_tokens": getattr(response.usage, "completion_tokens", 0),
                "total_tokens": getattr(response.usage, "total_tokens", 0),
            }
            self.usage_log.append(call_record)

            estimated_cost = self._estimate_cost(call_record)
            self.total_cost += estimated_cost
            print(
                f"Call record: {model} | input tokens: {call_record['input_tokens']} | "
                f"output tokens: {call_record['output_tokens']} | "
                f"estimated cost: ${estimated_cost:.6f}"
            )
            return response

        def _estimate_cost(self, call_record):
            """Estimate cost from model and token usage (simplified; real pricing differs per model)."""
            # Illustrative prices in USD per 1K tokens, not real rates.
            pricing = {
                "openai/gpt-4": 0.03,
                "openai/gpt-3.5-turbo": 0.0015,
                "anthropic/claude-3-sonnet-20240229": 0.015,
                "gemini/gemini-1.5-pro": 0.0075,
            }
            base_price = pricing.get(call_record["model"], 0.01)
            return (call_record["total_tokens"] / 1000) * base_price

        def get_cost_summary(self):
            """Return a summary of calls, tokens, and cost so far."""
            calls = len(self.usage_log)
            return {
                "project": self.project_name,
                "total_calls": calls,
                "total_tokens": sum(r["total_tokens"] for r in self.usage_log),
                "total_cost": self.total_cost,
                "average_cost_per_call": self.total_cost / calls if calls else 0,
            }


    client = CostAwareLLMClient(project_name="customer-support-bot", budget_limit=50.0)

    try:
        for i in range(5):
            response = client.call_with_cost_tracking(
                model="openai/gpt-3.5-turbo",
                messages=[{
                    "role": "user",
                    "content": f"Test question {i + 1}: how can I optimize Python code performance?",
                }],
                max_tokens=200,
            )
            print(f"Response {i + 1}: {response.choices[0].message.content[:100]}...\n")

        summary = client.get_cost_summary()
        print("\nCost summary:")
        for key, value in summary.items():
            print(f"  {key}: {value}")
    except ValueError as e:
        print(f"Budget exceeded: {e}")

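The estimate above applies one blended price to all tokens. Providers usually price input and output tokens differently, so a slightly closer sketch keeps two rates per model. Everything here is an assumption for illustration: the `Price` class, the `estimate_cost` helper, the model names, and the numbers are invented and are not real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per 1K tokens, split by direction."""
    input_per_1k: float
    output_per_1k: float


# Invented example rates; replace with the provider's published pricing.
EXAMPLE_PRICES = {
    "example/small-model": Price(input_per_1k=0.0005, output_per_1k=0.0015),
    "example/large-model": Price(input_per_1k=0.01, output_per_1k=0.03),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  prices: dict[str, Price] = EXAMPLE_PRICES) -> float:
    """Estimate the cost of one call; raises KeyError for unpriced models."""
    price = prices[model]
    return (input_tokens / 1000) * price.input_per_1k \
        + (output_tokens / 1000) * price.output_per_1k
```

Raising on an unknown model, rather than falling back to a default price, keeps an unpriced model from silently skewing the budget. LiteLLM also ships its own cost helper (`completion_cost`); check the documentation for its current signature before relying on it.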
As a team grows, letting everyone manage their own API keys becomes a headache. LiteLLM offers a proxy gateway mode that manages all model calls centrally:

    # Start the proxy from the command line:
    #   litellm --model openai/gpt-4 --port 8000 --api_base "https://api.openai.com/v1"

    # The client uses the regular OpenAI SDK, pointed at the local proxy.
    from openai import OpenAI

    client = OpenAI(
        api_key="your-enterprise-key",
        base_url="http://localhost:8000",
    )

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a professional technical consultant."},
            {"role": "user", "content": "Evaluate the pros and cons of a microservices architecture."},
        ],
        temperature=0.7,
        max_tokens=500,
    )
    print(response.choices[0].message.content)

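Proxy mode mainly addresses four needs: unified authentication (integrating with existing enterprise LDAP/OAuth), QPS rate limiting (per team or per project), complete audit logs, and per-project cost allocation.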
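As an illustration of per-team QPS limiting, here is a minimal token-bucket sketch. It shows the general technique only and says nothing about how the LiteLLM proxy implements its limits; `TokenBucket`, `TeamRateLimiter`, and the team names are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` requests per second, with bursts of up to `rate` requests."""

    def __init__(self, rate: float, clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = rate
        self.tokens = rate
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TeamRateLimiter:
    """One bucket per team; a team without a configured limit raises KeyError."""

    def __init__(self, qps_by_team: dict[str, float],
                 clock: Callable[[], float] = time.monotonic):
        self.buckets = {team: TokenBucket(qps, clock) for team, qps in qps_by_team.items()}

    def allow(self, team: str) -> bool:
        return self.buckets[team].allow()
```

A gateway would call `limiter.allow(team)` before forwarding a request and return HTTP 429 when it comes back `False`. The injectable `clock` exists so the behavior can be tested without sleeping.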
Key management: keep API keys in a .env file and load them with python-dotenv; never hard-code them.
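
    import os
    from dotenv import load_dotenv

    load_dotenv()  # read key=value pairs from .env into the environment

    api_keys = {
        "openai": os.getenv("OPENAI_API_KEY"),
        "anthropic": os.getenv("ANTHROPIC_API_KEY"),
        # LiteLLM reads GEMINI_API_KEY for models with the gemini/ prefix.
        "google": os.getenv("GEMINI_API_KEY"),
    }
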
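Since `os.getenv` quietly returns `None` for a missing variable, it can help to check for required keys at startup instead of finding out on the first failed call. A small sketch follows; the `missing_keys` helper is hypothetical and the list of required names is up to the project.

```python
import os
from typing import Iterable, Mapping, Optional


def missing_keys(required: Iterable[str],
                 env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

At startup, something like `missing_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"])` can feed a clear error message that names the missing variables without printing any key values.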
Retries: add exponential-backoff retries for rate limits and timeouts. tenacity is a good fit:

    from litellm import completion
    from litellm.exceptions import RateLimitError, Timeout
    from tenacity import (
        retry,
        retry_if_exception_type,
        stop_after_attempt,
        wait_exponential,
    )


    @retry(
        # Retry only transient errors; anything else propagates immediately.
        retry=retry_if_exception_type((RateLimitError, Timeout)),
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=10),
        reraise=True,  # surface the original exception after the last attempt
    )
    def robust_completion_with_retry(model, messages, **kwargs):
        """Call completion with exponential-backoff retries on rate limits and timeouts."""
        return completion(model=model, messages=messages, **kwargs)

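Other performance tips: cache responses for repeated queries; batch multiple requests where possible; reuse HTTP connections to avoid the cost of setting up a new connection on every call.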
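Response caching for repeated queries can be sketched as a thin wrapper that keys on the full request. This is a generic in-memory illustration; the `CachedClient` class and the injected `call` function are hypothetical stand-ins for whatever actually sends the request.

```python
import hashlib
import json
from typing import Any, Callable


def cache_key(model: str, messages: list[dict], **params: Any) -> str:
    """Hash the full request so identical calls map to the same key."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedClient:
    """Serve repeated identical requests from memory instead of calling again."""

    def __init__(self, call: Callable[..., str]):
        self._call = call
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, messages: list[dict], **params: Any) -> str:
        key = cache_key(model, messages, **params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call(model=model, messages=messages, **params)
        self._store[key] = result
        return result
```

Caching like this suits requests where a repeated answer is acceptable, for example at temperature 0. A production cache would also need expiry and a size bound. LiteLLM documents its own caching support, which is worth checking before building one.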
Install:

    pip install litellm

Docs: https://docs.litellm.ai
