Your forecast is in
Slack bot answering HR benefits questions for an 800-person company. People mostly ask about health insurance, 401k, and PTO. Escalates legal/medical questions to a human.
Action plan
The full reasoning behind each recommendation — copy into your build doc.
DeepSeek V3.2 Chat runs the same workload at lower cost (budget tier, one tier below your default). Its spec lists it as good for: general purpose. Verify quality on a sample of your traffic before switching fully; a minimal replay harness is sketched below.
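One way to run that verification, sketched under assumptions: `call_model` is a placeholder for your own chat-completion client, the model ID strings are hypothetical, and the sample questions stand in for real ones pulled from your Slack logs.

```python
import random

def call_model(model_id: str, question: str) -> str:
    """Placeholder: swap in your own chat-completion client here."""
    return f"<answer from {model_id}>"

# Hypothetical stand-ins; pull ~50 real questions from your Slack logs.
logged_questions = [
    "How do I change my 401k contribution?",
    "Does our health plan cover dependents?",
    "How much PTO rolls over at year end?",
]

sample = random.sample(logged_questions, k=min(50, len(logged_questions)))
for q in sample:
    baseline = call_model("sonnet-4.6", q)           # current default
    candidate = call_model("deepseek-v3.2-chat", q)  # proposed swap
    print(f"Q: {q}\n  baseline:  {baseline}\n  candidate: {candidate}\n")
```

Grade the answer pairs by hand or against a rubric before flipping the default.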
Considered, didn’t apply
PitCrew checks every lever — model fit, prompt caching, batch lanes, prompt trimming. Here’s why the rest didn’t make the cut on this build.
- Prompt caching: your system prompt is 62 tokens; caching needs ≥1,024 tokens to amortize the cache-write cost (see the break-even sketch after this list).
- Trim system prompt: no redundancy detected; your 62-token prompt is already tight.
- Batch API: this is a real-time agent (0% async traffic), so there is no work to route to a batch lane.
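For context on the caching math: cached reads are billed at a discount, but every cache write carries a premium, so a prompt under the provider's floor can never pay the write back. A rough break-even check; the 1.25× write and 0.10× read multipliers here are illustrative assumptions, not any provider's published rates.

```python
def caching_saves(prompt_tokens: int, calls_per_cache_window: int,
                  base_input_per_mtok: float = 3.00,
                  write_mult: float = 1.25, read_mult: float = 0.10,
                  min_cacheable: int = 1024) -> bool:
    """True if caching the prompt beats paying full input price every call."""
    if prompt_tokens < min_cacheable:
        return False  # below the cacheable floor, as with this 62-token prompt
    per_tok = base_input_per_mtok / 1e6
    uncached = prompt_tokens * per_tok * calls_per_cache_window
    cached = (prompt_tokens * per_tok * write_mult                 # one cache write
              + prompt_tokens * per_tok * read_mult * (calls_per_cache_window - 1))
    return cached < uncached

print(caching_saves(62, 100))    # False: too short to cache at all
print(caching_saves(2048, 100))  # True: long, frequently reused prompt pays back
```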
Alternative models
Same quality tier, your wizard inputs. No caching or batch applied, so every row is a directly comparable raw monthly cost. Click Try as default to re-render this report with that model as the new baseline. The arithmetic behind each row is sketched below the table.
| Model | Input $/Mtok | Output $/Mtok | Context | Monthly cost | vs default | Open in audit |
|---|---|---|---|---|---|---|
| DeepSeek V4 (coding, reasoning) | $0.30 | $0.50 | — | $2/mo | -$50/mo | Try as default → |
| DeepSeek R1 (complex reasoning) | $0.55 | $2 | — | $8/mo | -$44/mo | Try as default → |
| Mistral Large 2 (multilingual, reasoning) | $2 | $6 | — | $23/mo | -$29/mo | Try as default → |
| Google Gemini 2.5 Pro (long context, multimodal) | $1 | $10 | — | $33/mo | -$19/mo | Try as default → |
| OpenAI GPT-4o (multimodal) | $3 | $10 | — | $36/mo | -$16/mo | Try as default → |
| OpenAI GPT-5.2 (balanced) | $2 | $14 | — | $46/mo | -$6/mo | Try as default → |
| Anthropic Sonnet 4.6 (Default; general purpose, balanced) | $3 | $15 | — | $52/mo | — | Try as default → |
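To sanity-check a row yourself, the math is just per-Mtok price times monthly token volume. A sketch with illustrative volumes; 3.2 Mtok in / 2.8 Mtok out are assumptions for the example, not your audited traffic.

```python
MONTHLY_INPUT_MTOK = 3.2   # assumed input volume, millions of tokens/mo
MONTHLY_OUTPUT_MTOK = 2.8  # assumed output volume, millions of tokens/mo

def monthly_cost(in_per_mtok: float, out_per_mtok: float) -> float:
    """Raw monthly cost: per-Mtok price times monthly Mtok volume."""
    return (in_per_mtok * MONTHLY_INPUT_MTOK
            + out_per_mtok * MONTHLY_OUTPUT_MTOK)

default = round(monthly_cost(3, 15))         # Sonnet 4.6 row
candidate = round(monthly_cost(0.30, 0.50))  # DeepSeek V4 row
print(f"candidate ${candidate}/mo, default ${default}/mo, "
      f"saving ${default - candidate}/mo")
```

The "vs default" column is the difference between rounded monthly costs, which is why the deltas land on whole dollars.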
What we assumed
These are the inputs we used. If anything looks off, re-run the audit with better numbers.
- Call volume is your guess — typical pre-deploy estimates land within ±50% of actual.
- Conversation length is a coarse bucket — actual tokens vary by ±40% per call.
Real-bill expectation
PitCrew forecasts steady-state inference cost: the dollars the LLM provider bills for the deterministic, no-extras workload your wizard described. Real production bills typically run 1.2-1.5× the steady-state figure, because the model excludes:
- Dev / eval loops (often 10-30% of total spend)
- Retries, error recovery, idempotency replays
- Background batch jobs (summaries, classification of past data)
- A/B traffic on alternate models
- Embeddings + fine-tunes that ride alongside the agent
| Scenario | Steady-state (PitCrew) | Expected real bill |
|---|---|---|
| Default build | $52/mo | $62–$78/mo |
| PitCrew plan | $2/mo | $2–$3/mo |
The 1.2-1.5× band comes from public engineering postmortems and the validation cases in docs/accuracy-validation.md. If your team has tight eval loops and minimal retry traffic, target the low end.
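Reproducing the expected-bill column is one multiplication per bound, using the multipliers from the paragraph above:

```python
def real_bill_range(steady_state_monthly: float,
                    low: float = 1.2, high: float = 1.5) -> tuple[float, float]:
    """Expected real-bill band around PitCrew's steady-state forecast."""
    return steady_state_monthly * low, steady_state_monthly * high

lo, hi = real_bill_range(52)       # default build row above
print(f"${lo:.0f}-${hi:.0f}/mo")   # -> $62-$78/mo
```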
How sensitive is this forecast?
Pre-deploy estimates are guesses. Here’s how the savings shift if the volume or conversation length you guessed turns out to be off.
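Both knobs scale cost linearly, so you can bracket the forecast yourself. A sketch using the ±50% volume and ±40% tokens-per-call bands from "What we assumed"; the bounds assume both guesses miss in the same direction.

```python
def forecast_band(point_estimate: float,
                  volume_err: float = 0.50,
                  tokens_err: float = 0.40) -> tuple[float, float]:
    """Monthly-cost band if the volume and token-length guesses both miss."""
    low = point_estimate * (1 - volume_err) * (1 - tokens_err)
    high = point_estimate * (1 + volume_err) * (1 + tokens_err)
    return low, high

lo, hi = forecast_band(2)         # PitCrew plan point estimate
print(f"${lo:.2f}-${hi:.2f}/mo")  # -> $0.60-$4.20/mo
```

Because every model sees the same traffic, the dollar savings between the default and the plan scale by the same factor.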
Run another audit for a different build
Tweak inputs, swap the model, see how the forecast moves.