Your forecast is in
Single-turn RAG agent that retrieves chunks from our internal engineering docs (architecture decisions, runbooks, API references) and answers questions. ~120 engineers ask 4-5 questions a day.
Action plan
The full reasoning behind each recommendation — copy into your build doc.
V3.2 Chat from deepseek runs the same workload at lower cost (budget tier, one tier below). Spec lists it as good for: general purpose. Verify quality on a sample of your traffic before fully switching.
Considered, didn’t apply
PitCrew checks every lever — model fit, prompt caching, batch lanes, prompt trimming. Here’s why the rest didn’t make the cut on this build.
- Prompt caching: openai GPT-4o doesn't support prompt caching.
- Trim system prompt: No redundancy detected; your 49-token prompt is already tight.
- Batch API: This is a real-time agent (0% async traffic). No work to route to a batch lane.
Alternative models
Same quality tier, your wizard inputs. No caching or batch applied — every row is a directly-comparable raw monthly cost. Click Try as default to re-render this report with that model as the new baseline.
| Model | Input $/Mtok | Output $/Mtok | Context | Monthly cost | vs default | Open in audit |
|---|---|---|---|---|---|---|
| deepgram Voice Agent (voice, low-latency, accurate-stt) | $0 | $0 | — | $0/mo | -$36/mo | Try as default → |
| cartesia Conversational (voice, low-cost, high-throughput) | $0 | $0 | — | $0/mo | -$36/mo | Try as default → |
| voyage voyage-3 (embedding, semantic-search, code-friendly) | $0.06 | $0 | — | $0.13/mo | -$35/mo | Try as default → |
| openai text-embedding-ada-002 (embedding, legacy, semantic-search) | $0.10 | $0 | — | $0.22/mo | -$35/mo | Try as default → |
| cohere embed-english-v3 (embedding, english-only, high-quality) | $0.10 | $0 | — | $0.22/mo | -$35/mo | Try as default → |
| cohere embed-multilingual-v3 (embedding, multilingual, 100+ languages) | $0.10 | $0 | — | $0.22/mo | -$35/mo | Try as default → |
| deepseek V4 (coding, reasoning) | $0.30 | $0.50 | — | $2/mo | -$33/mo | Try as default → |
| deepseek R1 (complex reasoning) | $0.55 | $2 | — | $8/mo | -$28/mo | Try as default → |
| mistral Large 2 (multilingual, reasoning) | $2 | $6 | — | $22/mo | -$13/mo | Try as default → |
| google Gemini 2.5 Pro (long context, multimodal) | $1 | $10 | — | $33/mo | -$3/mo | Try as default → |
| openai GPT-4o (default, multimodal) | $3 | $10 | — | $36/mo | — | Try as default → |
| openai GPT-5.2 (balanced) | $2 | $14 | — | $46/mo | +$10/mo | Try as default → |
| anthropic Sonnet 4.6 (general purpose, balanced) | $3 | $15 | — | $52/mo | +$16/mo | Try as default → |
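A raw monthly cost like the ones in this table is typically per-Mtok price times monthly token volume, summed over input and output. The sketch below illustrates that arithmetic. The per-call token counts and the 22 working days per month are assumptions for illustration, not values from this report, and the listed prices appear to be rounded, so the output is not expected to match the table exactly.

```python
def monthly_cost(calls_per_month, in_tokens_per_call, out_tokens_per_call,
                 in_price_per_mtok, out_price_per_mtok):
    """Raw monthly spend in dollars, with no caching or batch discount."""
    in_mtok = calls_per_month * in_tokens_per_call / 1_000_000
    out_mtok = calls_per_month * out_tokens_per_call / 1_000_000
    return in_mtok * in_price_per_mtok + out_mtok * out_price_per_mtok

# ~120 engineers x 4.5 questions/day comes from the build description;
# 22 working days per month is an assumption.
calls = 120 * 4.5 * 22

# Hypothetical per-call token counts, not taken from the report.
IN_TOKENS, OUT_TOKENS = 200, 250

default_cost = monthly_cost(calls, IN_TOKENS, OUT_TOKENS, 3.00, 10.00)   # listed GPT-4o prices
candidate_cost = monthly_cost(calls, IN_TOKENS, OUT_TOKENS, 0.30, 0.50)  # listed deepseek V4 prices

print(f"default:   ${default_cost:.2f}/mo")
print(f"candidate: ${candidate_cost:.2f}/mo")
print(f"saving:    ${default_cost - candidate_cost:.2f}/mo")
```

Swapping in measured token counts from a sample of real traffic should give a better estimate than any pre-deploy guess.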
What we assumed
These are the inputs we used. If anything looks off, re-run the audit with better numbers.
- Call volume is your guess — typical pre-deploy estimates land within ±50% of actual.
- Conversation length is a coarse bucket — actual tokens vary by ±40% per call.
What’s not included
PitCrew forecasts steady-state AI API spend — the dollars the LLM / embedding provider bills for the deterministic workload your wizard described. A production bill carries two kinds of cost on top that PitCrew doesn’t model:
1. Inference overhead — proportional (20–50% on top of steady-state)
- Dev / eval loops (often 10-30% of total spend)
- Retries, error recovery, idempotency replays
- Background batch jobs (summaries, classification of past data)
- A/B traffic on alternate models
- Embeddings + fine-tunes that ride alongside the agent
| Scenario | Steady-state (PitCrew) | With inference overhead |
|---|---|---|
| Default build | $33/mo | $39–$49/mo |
| PitCrew plan | $2/mo | $2–$3/mo |
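The ranges in this table are consistent with multiplying the steady-state figure by 1.2 and 1.5. A minimal sketch of that arithmetic follows. Rounding down to whole dollars is an assumption that happens to reproduce the displayed values, not a documented PitCrew rule.

```python
import math

OVERHEAD_LOW, OVERHEAD_HIGH = 0.20, 0.50  # the 20-50% range quoted above

def with_inference_overhead(steady_state):
    """Return (low, high) monthly dollars after adding proportional overhead."""
    return (steady_state * (1 + OVERHEAD_LOW),
            steady_state * (1 + OVERHEAD_HIGH))

for label, steady in [("Default build", 33), ("PitCrew plan", 2)]:
    low, high = with_inference_overhead(steady)
    # Assumption: the report rounds each bound down to a whole dollar.
    print(f"{label}: ${math.floor(low)}-${math.floor(high)}/mo")
```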
2. Hosting & infra — flat (workload-dependent, typically $10–80/mo)
- Cloud hosting (Vercel / Render / Fly / AWS / etc.)
- Database (Supabase / Postgres / Mongo / etc.)
- Managed vector DB or search — Pinecone, Weaviate, OpenSearch typically $25–100/mo (if not already entered in Step 5)
- CDN, scraping APIs, telephony minutes, transport (Twilio, LiveKit, Zyte, etc.)
- Vendor SaaS margin if going through a wrapper (Cursor, Vapi, Evee, etc.) instead of direct API
The 20–50% inference multiplier comes from public engineering postmortems and the validation cases in docs/accuracy-validation.md. If your team has tight eval loops and minimal retry traffic, target the low end. The hosting/infra range is highly workload-dependent: small RAG bots may spend nothing extra, while voice agents add telephony costs on top.
How sensitive is this forecast?
Pre-deploy estimates are guesses. Here’s how the savings shift if the volume or conversation length you guessed turns out to be off.
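One way to reason about this: with per-token pricing and no caching or batch discounts, spend grows roughly linearly with both call volume and tokens per call, so the savings scale by the product of the two errors. The sketch below assumes that linear model (it is not PitCrew's documented method) and uses the ±50% volume and ±40% length ranges from the assumptions above. The $33 and $2 baselines are the steady-state figures from the overhead table.

```python
from itertools import product

def scaled_savings(default_cost, plan_cost, volume_factor, length_factor):
    """Monthly savings if volume and tokens per call are off by the given factors."""
    return (default_cost - plan_cost) * volume_factor * length_factor

VOLUME_FACTORS = (0.5, 1.0, 1.5)  # call volume off by -50%, 0%, +50%
LENGTH_FACTORS = (0.6, 1.0, 1.4)  # tokens per call off by -40%, 0%, +40%

for volume, length in product(VOLUME_FACTORS, LENGTH_FACTORS):
    savings = scaled_savings(33, 2, volume, length)
    print(f"volume x{volume:.1f}, length x{length:.1f}: ${savings:.2f}/mo saved")
```

Under this model the sign of the savings never flips; only their size changes, from about 0.3x to 2.1x of the forecast.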