LLM Gateway Design: Rate Limiting, Caching, and Fallback Across Multiple Providers
Your app calls openai.chat.completions.create() directly. Tomorrow, OpenAI has an outage or raises prices 10x. An LLM Gateway abstracts providers away and adds resilience.
The Problem: Vendor Lock-In
// ❌ Coupled to OpenAI
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function chat(message: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: message }]
  });
  return response.choices[0].message.content;
}
If you want to switch to Anthropic Claude, you have to rewrite everything.
The Solution: LLM Gateway
// The gateway hides each provider behind a common interface
import OpenAI from 'openai';
import Anthropic from '@anthropic-ai/sdk';

type Message = { role: 'user' | 'assistant'; content: string };

interface LLMGateway {
  chat(messages: Message[]): Promise<string>;
}

class OpenAIProvider implements LLMGateway {
  private openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  async chat(messages: Message[]): Promise<string> {
    const response = await this.openai.chat.completions.create({
      model: 'gpt-4',
      messages
    });
    return response.choices[0].message.content ?? '';
  }
}

class AnthropicProvider implements LLMGateway {
  private anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

  async chat(messages: Message[]): Promise<string> {
    const response = await this.anthropic.messages.create({
      model: 'claude-3-opus',
      max_tokens: 1024, // required by the Anthropic Messages API
      messages
    });
    const block = response.content[0];
    return block.type === 'text' ? block.text : '';
  }
}

// The app only sees the generic interface
const gateway: LLMGateway = new OpenAIProvider();
const answer = await gateway.chat([{ role: 'user', content: 'Hello' }]);
Essential Features
1. Caching
class CachingGateway implements LLMGateway {
  private cache = new Map<string, string>();

  constructor(private provider: LLMGateway) {}

  async chat(messages: Message[]): Promise<string> {
    const key = JSON.stringify(messages);
    if (this.cache.has(key)) {
      return this.cache.get(key)!;
    }
    const response = await this.provider.chat(messages);
    this.cache.set(key, response);
    return response;
  }
}
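The Map above never evicts anything, so a long-running process also needs an expiry policy. A minimal sketch of a TTL variant; TTLCachingGateway, CacheEntry, and ttlMs are illustrative names, not part of the original:

// Sketch only: same decorator, but entries expire after a TTL.
type CacheEntry = { value: string; expiresAt: number };

class TTLCachingGateway implements LLMGateway {
  private cache = new Map<string, CacheEntry>();

  constructor(private provider: LLMGateway, private ttlMs = 60_000) {}

  async chat(messages: Message[]): Promise<string> {
    const key = JSON.stringify(messages);
    const hit = this.cache.get(key);
    if (hit && hit.expiresAt > Date.now()) {
      return hit.value; // fresh cache hit
    }
    const value = await this.provider.chat(messages);
    this.cache.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}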
2. Rate Limiting
class RateLimitedGateway implements LLMGateway {
  constructor(
    private provider: LLMGateway,
    private bucket: TokenBucket
  ) {}

  async chat(messages: Message[]): Promise<string> {
    await this.bucket.consume(1); // waits until a token is available
    return this.provider.chat(messages);
  }
}
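TokenBucket is used here but not defined. A minimal sketch of one possible implementation, assuming consume() should wait until capacity frees up; the capacity and refillPerSecond parameter names are assumptions:

// Sketch: classic token bucket. consume() resolves once enough tokens exist.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSecond);
    this.lastRefill = now;
  }

  async consume(count: number): Promise<void> {
    // Poll until enough tokens have accumulated.
    for (;;) {
      this.refill();
      if (this.tokens >= count) {
        this.tokens -= count;
        return;
      }
      await new Promise((resolve) => setTimeout(resolve, 50));
    }
  }
}

With this shape, new TokenBucket(100, 10) in the composition below would mean bursts of up to 100 requests, refilled at 10 per second.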
3. Fallback (Resilience)
class FallbackGateway implements LLMGateway {
  constructor(
    private primary: LLMGateway,
    private fallback: LLMGateway
  ) {}

  async chat(messages: Message[]): Promise<string> {
    try {
      return await this.primary.chat(messages);
    } catch (error) {
      console.warn('Primary failed, using fallback');
      return await this.fallback.chat(messages);
    }
  }
}
4. Load Balancing
class LoadBalancedGateway implements LLMGateway {
  private currentIndex = 0;

  constructor(private providers: LLMGateway[]) {}

  async chat(messages: Message[]): Promise<string> {
    const provider = this.providers[this.currentIndex];
    this.currentIndex = (this.currentIndex + 1) % this.providers.length;
    return provider.chat(messages);
  }
}
Composition
// Combine features by stacking decorators
const gateway = new CachingGateway(
  new RateLimitedGateway(
    new FallbackGateway(
      new OpenAIProvider(),
      new AnthropicProvider()
    ),
    new TokenBucket(100, 10)
  )
);

// 1. Check the cache
// 2. Apply the rate limit
// 3. Try OpenAI
// 4. If it fails, fall back to Anthropic
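Usage stays identical to a single provider, since every layer implements the same LLMGateway interface:

const answer = await gateway.chat([{ role: 'user', content: 'Hello' }]);
// First call goes to OpenAI (or Anthropic on failure); a repeated call is served from the cache.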
Off-the-Shelf Tools
LiteLLM
from litellm import completion

# Abstracts providers automatically
response = completion(
    model="gpt-4",  # or "claude-3-opus", "gemini-pro"
    messages=[{"role": "user", "content": "Hello"}]
)
LangChain
import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';

const model = new ChatOpenAI(); // easily swappable for new ChatAnthropic()
const response = await model.invoke('Hello');
Conclusion
An LLM Gateway decouples your app from vendors and adds caching, rate limiting, and resilience. For production, it is essential.
If you call OpenAI directly with no abstraction, you are one API change away from rewriting everything.
References
- LiteLLM - https://github.com/BerriAI/litellm
- "Gateway Pattern" - Enterprise Integration Patterns
- LangChain Documentation
Implementation Checklist
- Define auth, rate-limit, and cache policies.
- Route requests by cost and quality.
- Log requests and meter token usage (see the sketch after this list).
- Implement fallback and failover.
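For the logging/token-metering item, one option is yet another decorator over the same interface. This is a sketch only; the LoggingGateway name and the ~4-characters-per-token estimate are assumptions, not an exact count:

// Sketch: logs latency and a rough token estimate for each call.
class LoggingGateway implements LLMGateway {
  constructor(private provider: LLMGateway) {}

  async chat(messages: Message[]): Promise<string> {
    const start = Date.now();
    const response = await this.provider.chat(messages);

    // Rough heuristic: ~4 characters per token (estimate only).
    const inputChars = messages.reduce((sum, m) => sum + m.content.length, 0);
    const estTokens = Math.ceil((inputChars + response.length) / 4);

    console.log(`chat: ${Date.now() - start}ms, ~${estTokens} tokens`);
    return response;
  }
}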
Source: https://lemon.dev.br/pt