The three decisions that shape an AI feature
Every production AI integration starts with three decisions. Get them right and the feature earns its place. Get them wrong and the feature ships as a demo, not a product.
- Model choice - Claude, GPT-4o, Gemini, or a local model via Ollama. Picked against the task, the latency budget, and the cost ceiling.
- Eval loop - how you know the model is doing the right thing before, during, and after every change. No eval loop, no production.
- Failure-mode UX - what the user sees when the model refuses, takes 30 seconds, or hallucinates. The UX of failure is the UX of production.
We set these up in week 1. The fancy prompt comes later.
What production AI actually requires
Most “AI features” shipped in 2024 and 2025 were ChatGPT wrappers: API key in the app, one-shot prompt, no memory, no evals, no cost controls. They work for a weekend demo. They break the first Monday a user jailbreaks the prompt, a friend shares the API key, or a model update changes the output shape.
A production AI integration has five moving parts. Every one of them is boring. All of them matter.
The five parts we ship every time
1. A server proxy holding credentials. FastAPI in front of every model call. Your app authenticates a device; our service holds every API key. Rotation, rate limits, and model switching all happen server-side.
2. A prompt middleware layer. The raw user input never reaches the model. We normalize, tag intent, inject the right system prompt, strip obvious jailbreak attempts. Versioned like code, logged like code, tested like code.
3. Retrieval when it matters. If the question is “what does our internal policy say”, the model must see the policy. pgvector on your existing Postgres covers most use cases. Pinecone or Weaviate for scale past a few million vectors. Chunking, embedding, and reranking - we set up the whole pipeline.
4. Structured outputs. JSON schemas via the model’s function-calling API. “Tell me the customer’s address” returns a {street, city, zip} object, not a paragraph. Wrong shapes fail fast; the app never renders hallucinated data.
5. An eval harness. Before we switch models, we replay sampled production queries through both and compare outputs against labelled ground truth. No one finds out a model swap broke accuracy from a customer complaint.
What we integrate
- OpenAI - GPT-4, GPT-4o, Whisper for speech, image generation. Shipped in four production apps.
- Anthropic - Claude 3.5 Sonnet, Claude Opus. Our default for reasoning-heavy tasks.
- Google - Gemini 1.5 Pro for long context, Vertex AI for Google Cloud deployments.
- Local and OSS - Llama 3, Mistral, Qwen via Ollama or vLLM. When data cannot leave the network.
- Embedding models - OpenAI ada, Voyage, Cohere. Picked per language and retrieval target.
- Vector stores - Postgres pgvector (default), Pinecone, Weaviate, Qdrant.
- Observability - Langfuse, OpenLLMetry, or self-hosted traces.
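The retrieval pipeline above starts with chunking. A minimal sliding-window chunker, character-based with overlap, is enough to show the shape of the step; the sizes are illustrative defaults, and production pipelines usually chunk on tokens or sentence boundaries instead.

```python
def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Fixed-size sliding-window chunking with overlap between adjacent chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Each chunk starts `step` characters after the previous one, so
    # consecutive chunks share `overlap` characters of context.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk is then embedded and stored (pgvector by default, per the stack above); the overlap keeps a sentence that straddles a boundary retrievable from either side.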
How we keep keys safe
Every LLM key lives on the server. The app authenticates a device to our FastAPI proxy, which holds every vendor credential. Rotation is a config change. Swapping a model is a flag.
The wrong answer - shipping the key in the mobile bundle - leaks the key on day one. Decompiled APKs, inspected web bundles, intercepted traffic. This is the first thing we fix when we take over an AI project built by someone else.
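The proxy's core move fits in one function. This is a stripped-down sketch, not our FastAPI code: `DEVICE_TOKENS`, `MODEL_FLAG`, and `proxy_chat` are illustrative names, and the dict it returns stands in for the upstream HTTP request. The point is where the key lives.

```python
import os

# Vendor keys live in server config or a secrets manager, never in the app bundle.
VENDOR_KEYS = {"openai": os.environ.get("OPENAI_API_KEY", "sk-server-only")}
MODEL_FLAG = {"chat": "gpt-4o"}   # swapping the model is a config change
DEVICE_TOKENS = {"device-123"}    # issued when the app authenticates a device

def proxy_chat(device_token: str, user_prompt: str) -> dict:
    """Authenticate the device, then build the upstream request server-side."""
    if device_token not in DEVICE_TOKENS:
        return {"error": "unauthorized"}
    # The vendor key is attached here, on the server; the client never sees it.
    return {
        "model": MODEL_FLAG["chat"],
        "headers": {"Authorization": f"Bearer {VENDOR_KEYS['openai']}"},
        "body": {"messages": [{"role": "user", "content": user_prompt}]},
    }
```

Rotation means changing `VENDOR_KEYS`; a model swap means changing `MODEL_FLAG`. Neither touches the shipped app.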
Case studies
- GPT-4 advisor chat in a wealth-management app - client-facing chat tied to GPT-4 inside a regulated (AMFI-registered) Indian wealth app. The OpenAI key lives on our server; the app authenticates and requests through our custom FastAPI endpoint. Model swaps (GPT-4 to GPT-4o) require no app release.
- Voice-first GPT-4 assistant from one Flutter codebase - Didier AI. Voice in via native platform ASR, prompt-engineering layer in front of GPT-4, image generation piped into the same chat view. One Flutter codebase ships to Android, iPad, and web. FastAPI proxy holds the OpenAI key. Freemium subscription via StoreKit and Play Billing.
- Speech-in, speech-out GPT-4 Turbo assistant in 51 languages - Botinfo. Native platform ASR for 51-language voice input, prompt middleware classifying intent before the model call, Google TTS for natural-sounding output. FastAPI + Postgres proxy with per-user rate limits. Shipped to App Store and Google Play.
“Techy Panther’s collaboration on developing the Didier-AI Chat Application was exceptional. Their expertise in Flutter, API integration, and AI technologies resulted in a user-friendly chatting experience.”
- Didier AI, client
Voice input, done right
Voice is not “another AI feature.” It is three products glued together: ASR (speech to text), intent handling, and TTS (text to speech). We use the device’s native ASR - Apple Speech on iOS, SpeechRecognizer on Android - not a third-party SDK. The phones already handle 60+ languages better than any bundled library, and we do not ship megabytes of extra weight to the user.
The transcript goes into the same prompt middleware layer as typed input. The response streams back via SSE. TTS runs on the device or through Google TTS when voice quality matters.
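The SSE leg of that path is just framing. A minimal sketch of how a token stream becomes Server-Sent Events frames, assuming a generator of model tokens; the `[DONE]` sentinel follows the common streaming-API convention and is an assumption, not a requirement of SSE itself.

```python
from typing import Iterable, Iterator

def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each model token in an SSE 'data:' frame; blank line ends a frame."""
    for tok in tokens:
        yield f"data: {tok}\n\n"
    yield "data: [DONE]\n\n"  # sentinel so the client knows the stream is over

frames = list(sse_events(["Hel", "lo", "!"]))
```

The client appends each frame's payload to the chat bubble as it arrives, so the user sees the answer forming instead of a spinner.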
Eval and observability
- Prompt versioning - every prompt has a git SHA and an eval score on the golden set.
- Sampled production traces - Langfuse captures inputs, outputs, latency, and cost on a sample of real traffic.
- Regression evals on every prompt or model change - the golden set runs before merge; a drop in accuracy blocks the PR.
- Drift detection - if the live input distribution drifts from what the eval set covers, we see it before users do.
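The regression gate above reduces to two small functions. A sketch under simplifying assumptions: exact-match scoring against labelled ground truth (real evals often use graded or model-assisted scoring), with `model_fn` standing in for a call through the proxy.

```python
from typing import Callable

def golden_set_accuracy(
    model_fn: Callable[[str], str], golden: list[tuple[str, str]]
) -> float:
    """Replay the golden set through a model and score exact-match accuracy."""
    hits = sum(1 for query, expected in golden if model_fn(query) == expected)
    return hits / len(golden)

def gate(candidate_acc: float, baseline_acc: float, tolerance: float = 0.0) -> bool:
    """Block the merge if the candidate drops below baseline minus tolerance."""
    return candidate_acc + tolerance >= baseline_acc
```

In CI, `gate` returning `False` fails the build: the prompt or model change never merges, and no one finds out from a customer complaint.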
Cost controls are a feature, not polish
- Per-user quotas at the proxy (e.g. 30 chat turns per hour on free tier).
- Per-feature budgets (e.g. image generation capped at a hard daily limit across all users).
- Per-day spend caps that hard-fail requests if exceeded and alert the team.
- Prompt caching where the provider supports it - up to 90% savings on repeated system prompts.
We instrument this from day one, not “when the bill gets scary.”
How we work on an AI engagement
Week 1 - scoping. The use case, the data, the model, the budget, the failure mode. We write it down: which prompt, what context, what failure looks like, what success looks like. If we cannot describe failure precisely, we cannot measure it.
Weeks 2 to 4 - vertical slice. One real flow working from device to proxy to prompt layer to model to structured response to rendered result. Real data. Real users (internal). Real cost tracking.
Weeks 5 to 12 - iterate and broaden. Add surfaces, tune prompts, grow the eval set. Move from “works in 80% of cases” to “works in the 95% that matter.”
When AI is the wrong answer
- Chatbot for your docs - search with a good ranker often wins. Try Algolia or Typesense before you reach for RAG.
- AI summaries of emails - sometimes. Often the email is the summary and users do not want a second layer.
- AI in onboarding - almost always worse than a good form.
- Agent loops for single-step tasks - one model call beats an agent loop on latency, cost, and reliability.
We will talk you out of AI when a simpler approach ships a better product. The goal is the user getting what they want, not the resume getting a bullet point.
Why teams pick our AI delivery
The API key lives on our server from day one. A decompiled APK will not hand your account to a stranger, and a bug in a prompt will not burn a five-figure bill overnight because the budget caps sit at the proxy.
We build provider-agnostic from the first call. Model swaps (GPT-4o today, Claude tomorrow, a local Llama 3 next quarter) are flags, not rewrites. The eval harness ships in week 1 alongside the first prompt, so a regression on a model swap is caught in CI and not by your users. Refusals, timeouts, and hallucinations all get explicit UX, not a blank screen and a dev-tools stack trace.
Senior engineers write the prompts and the proxy. Source, prompts, evals, and proxy code are yours at launch.