
AI Integration Services

What you get

Capabilities included in every engagement.

  • OpenAI (GPT-4, GPT-4o, Whisper) SDK
  • Anthropic (Claude) SDK
  • Google Gemini SDK
  • Local / OSS models (Ollama, vLLM)
  • RAG on pgvector or Pinecone
  • Streaming responses (SSE)
  • Function calling and tool use
  • Structured outputs (JSON schema)
  • Native ASR + Google TTS
  • Eval harness + prompt versioning
  • Cost and rate-limit controls
The stack we default to

What we use. Why we pick it.

  • FastAPI proxy (server-held keys) - Every LLM key lives on the server. Your client authenticates a device; our service holds every vendor credential. Key rotation is a config change, not an app release.
  • OpenAI + Anthropic + Google SDKs - Provider-agnostic from day one. Switching from GPT-4o to Claude for a specific task is a flag, not a rewrite.
  • Prompt middleware layer - Intent tagging, system-prompt injection, transcript normalization before the model sees anything. The thing that keeps voice assistants coherent across languages.
  • pgvector (default) or Pinecone - pgvector lives in your existing Postgres - no new infrastructure. Pinecone for scale past a few million vectors. Weaviate when hybrid search matters.
  • Streaming + Server-Sent Events - Tokens render as they arrive. The user starts reading before the model finishes generating.
  • Structured outputs - JSON schemas via the model's function-calling API. Wrong-shape responses fail fast; the UI never renders hallucinated data.
  • Eval harness - Prompts are code. We snapshot inputs, outputs, and cost per request so regressions on a model swap are caught before shipping.
  • Rate limits + cost meter - Per-user quotas, per-feature budgets, per-day spend caps. "The OpenAI bill exploded overnight" stops being a story.
  • Langfuse (observability) - Every prompt, every completion, every cost traced. Self-hostable when data locality matters.
Reference architecture

How it fits together.

Your client never holds the key. A FastAPI service holds credentials, routes prompts, hits the vector DB, and logs for evals.

[Architecture diagram] The client app (web, chat) streams tokens and renders cards over HTTPS to the FastAPI proxy, which handles auth, rate limiting, prompt middleware, and logging. The proxy calls the LLM (OpenAI, Anthropic, or local - swappable, streaming), runs retrieval lookups against pgvector or Pinecone (chunked docs plus a reranker), and traces every prompt, output, and cost into the eval store for regression tests and the cost meter.

The three decisions that shape an AI feature

Every production AI integration starts with three decisions. Get them right and the feature earns its place. Get them wrong and the feature ships as a demo, not a product.

  • Model choice - Claude, GPT-4o, Gemini, or a local model via Ollama. Picked against the task, the latency budget, and the cost ceiling.
  • Eval loop - how you know the model is doing the right thing before, during, and after every change. No eval loop, no production.
  • Failure-mode UX - what the user sees when the model refuses, takes 30 seconds, or hallucinates. The UX of failure is the UX of production.

We set these up in week 1. The fancy prompt comes later.

What production AI actually requires

Most “AI features” shipped in 2024 and 2025 were ChatGPT wrappers: API key in the app, one-shot prompt, no memory, no evals, no cost controls. They work for a weekend demo. They break on the Monday when a user jailbreaks the prompt, or a friend shares the API key, or a model update changes the output shape.

A production AI integration has five moving parts. Every one of them is boring. All of them matter.

The five parts we ship every time

1. A server proxy holding credentials. FastAPI in front of every model call. Your app authenticates a device; our service holds every API key. Rotation, rate limits, and model switching all happen server-side.
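The routing logic behind that proxy can be sketched in a few lines. This is a minimal illustration, not the production code: the registry names, endpoints, and env-var names are placeholders.

```python
import os

# Hypothetical provider registry - names and endpoints are illustrative.
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1/chat/completions",
               "key_env": "OPENAI_API_KEY"},
    "anthropic": {"base_url": "https://api.anthropic.com/v1/messages",
                  "key_env": "ANTHROPIC_API_KEY"},
}

def resolve_provider(flag: str) -> dict:
    """Pick the upstream vendor for a request from a server-side flag.
    Swapping models is a config change, never an app release."""
    cfg = PROVIDERS.get(flag)
    if cfg is None:
        raise ValueError(f"unknown provider flag: {flag}")
    # The vendor key is read server-side only; the client never sees it.
    return {"url": cfg["base_url"], "key": os.environ.get(cfg["key_env"], "")}
```

Because the flag is resolved on the server, pointing a feature at a different vendor is one config edit.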

2. A prompt middleware layer. The raw user input never reaches the model. We normalize, tag intent, inject the right system prompt, strip obvious jailbreak attempts. Versioned like code, logged like code, tested like code.
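A stripped-down sketch of that middleware, assuming a simple keyword-based intent tagger - real intent tags, system prompts, and jailbreak patterns are project-specific placeholders here:

```python
import re

# Illustrative placeholders, not production prompts or filters.
SYSTEM_PROMPTS = {
    "support": "You are a concise support assistant.",
    "general": "You are a helpful assistant.",
}
JAILBREAK_PATTERNS = [r"ignore (all )?previous instructions"]

def tag_intent(text: str) -> str:
    return "support" if "refund" in text.lower() else "general"

def prepare(raw_input: str) -> list[dict]:
    """Normalize, tag, and wrap user input before it reaches the model."""
    text = " ".join(raw_input.split())           # normalize whitespace
    for pat in JAILBREAK_PATTERNS:               # strip obvious injections
        text = re.sub(pat, "", text, flags=re.IGNORECASE)
    intent = tag_intent(text)
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[intent]},
        {"role": "user", "content": text.strip()},
    ]
```

The returned message list - not the raw input - is what the model call sees, which is also what gets versioned and logged.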

3. Retrieval when it matters. If the question is “what does our internal policy say”, the model must see the policy. pgvector on your existing Postgres covers most use cases. Pinecone or Weaviate for scale past a few million vectors. Chunking, embedding, and reranking - we set up the whole pipeline.
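The first step of that pipeline - chunking - can be sketched as overlapping windows before embedding; the retrieval side is then a pgvector cosine-distance query (`ORDER BY embedding <=> $1 LIMIT k`). Sizes here are illustrative; production chunking is usually token-based and sentence-aware.

```python
def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character windows so each
    embedded chunk carries context from its neighbour."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```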

4. Structured outputs. JSON schemas via the model’s function-calling API. “Tell me the customer’s address” returns a {street, city, zip} object, not a paragraph. Wrong shapes fail fast; the app never renders hallucinated data.
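The fail-fast check on the proxy side amounts to shape validation before anything reaches the UI. A minimal sketch, with the `{street, city, zip}` schema from the example above as the assumed shape:

```python
import json

ADDRESS_SCHEMA = {"street": str, "city": str, "zip": str}  # example shape

def parse_address(raw: str) -> dict:
    """Validate a model response against the expected shape.
    Wrong-shape responses raise instead of rendering."""
    data = json.loads(raw)
    if set(data) != set(ADDRESS_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(data)}")
    for field, typ in ADDRESS_SCHEMA.items():
        if not isinstance(data[field], typ):
            raise ValueError(f"{field} must be {typ.__name__}")
    return data
```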

5. An eval harness. Before we switch models, we replay sampled production queries through both and compare outputs against labelled ground truth. No one finds out a model swap broke accuracy from a customer complaint.
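The core of that replay can be sketched as a comparison gate. `old_model` and `new_model` stand in for real model calls here, and the exact-match grading and delta threshold are illustrative simplifications:

```python
def compare_models(golden, old_model, new_model, min_delta=-0.02):
    """Replay a labelled golden set through both models and block the
    swap if accuracy drops more than `min_delta`."""
    def accuracy(model):
        hits = sum(1 for c in golden if model(c["query"]) == c["expected"])
        return hits / len(golden)
    old_acc, new_acc = accuracy(old_model), accuracy(new_model)
    return {"old": old_acc, "new": new_acc,
            "ship": new_acc - old_acc >= min_delta}
```

In practice grading is rarely exact-match - a rubric or evaluator model scores each answer - but the gate works the same way.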

What we integrate

  • OpenAI - GPT-4, GPT-4o, Whisper for speech, image generation. Shipped in four production apps.
  • Anthropic - Claude 3.5 Sonnet, Claude Opus. Our default for reasoning-heavy tasks.
  • Google - Gemini 1.5 Pro for long context, Vertex AI for Google Cloud deployments.
  • Local and OSS - Llama 3, Mistral, Qwen via Ollama or vLLM. When data cannot leave the network.
  • Embedding models - OpenAI ada, Voyage, Cohere. Picked per language and retrieval target.
  • Vector stores - Postgres pgvector (default), Pinecone, Weaviate, Qdrant.
  • Observability - Langfuse, OpenLLMetry, or self-hosted traces.

How we keep keys safe

Every LLM key lives on the server. The app authenticates a device to our FastAPI proxy, which holds every vendor credential. Rotation is a config change. Swapping a model is a flag.

The wrong answer - shipping the key in the mobile bundle - leaks the key on day one. Decompiled APKs, inspected web bundles, intercepted traffic. This is the first thing we fix when we take over an AI project built by someone else.

Case studies

  • GPT-4 advisor chat in a wealth-management app - client-facing chat tied to GPT-4 inside a regulated (AMFI-registered) Indian wealth app. The OpenAI key lives on our server; the app authenticates and requests through our custom FastAPI endpoint. Model swaps (GPT-4 to GPT-4o) require no app release.
  • Voice-first GPT-4 assistant from one Flutter codebase - Didier AI. Voice in via native platform ASR, prompt-engineering layer in front of GPT-4, image generation piped into the same chat view. One Flutter codebase ships to Android, iPad, and web. FastAPI proxy holds the OpenAI key. Freemium subscription via StoreKit and Play Billing.
  • Speech-in, speech-out GPT-4 Turbo assistant in 51 languages - Botinfo. Native platform ASR for 51-language voice input, prompt middleware classifying intent before the model call, Google TTS for natural-sounding output. FastAPI + Postgres proxy with per-user rate limits. Shipped to App Store and Google Play.

“Techy Panther’s collaboration on developing the Didier-AI Chat Application was exceptional. Their expertise in Flutter, API integration, and AI technologies resulted in a user-friendly chatting experience.”

  • Didier AI, client

Voice input, done right

Voice is not “another AI feature.” It is three products glued together: ASR (speech to text), intent handling, and TTS (text to speech). We use the device’s native ASR - Apple Speech on iOS, Android SpeechRecognizer on Android - not a third-party SDK. The phones already handle 60+ languages better than any bundled library, and we do not ship megabytes of extra weight to the user.

The transcript goes into the same prompt middleware layer as typed input. The response streams back via SSE. TTS runs on the device or through Google TTS when voice quality matters.
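The SSE wire format itself is trivial, which is part of why we use it - each token becomes one `data:` event any browser `EventSource` can consume. A sketch, using the `[DONE]` sentinel popularized by OpenAI's streaming API:

```python
def sse_stream(tokens):
    """Wrap model tokens in the Server-Sent Events wire format so the
    client renders them as they arrive."""
    for tok in tokens:
        yield f"data: {tok}\n\n"      # one SSE event per token
    yield "data: [DONE]\n\n"          # sentinel telling the client to stop
```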

Eval and observability

  • Prompt versioning - every prompt has a git SHA and an eval score on the golden set.
  • Sampled production traces - Langfuse captures inputs, outputs, latency, and cost on a sample of real traffic.
  • Regression evals on every prompt or model change - the golden set runs before merge; a drop in accuracy blocks the PR.
  • Drift detection - if the live input distribution drifts from what the eval set covers, we see it before users do.

Cost controls are a feature, not polish

  • Per-user quotas at the proxy (e.g. 30 chat turns per hour on free tier).
  • Per-feature budgets (e.g. image generation capped at a hard daily limit across all users).
  • Per-day spend caps that hard-fail requests if exceeded and alert the team.
  • Prompt caching where the provider supports it - up to 90% savings on repeated system prompts.

We instrument this from day one, not “when the bill gets scary.”
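The per-user quota is the simplest of these controls - a sliding window enforced at the proxy before any model call. A minimal sketch; the limit and window are illustrative (e.g. 30 turns per hour on free tier):

```python
import time
from collections import defaultdict, deque

class QuotaGuard:
    """Per-user sliding-window quota checked at the proxy."""
    def __init__(self, limit: int, window_s: float):
        self.limit, self.window_s = limit, window_s
        self.hits = defaultdict(deque)

    def allow(self, user_id, now=None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[user_id]
        while q and now - q[0] > self.window_s:   # drop expired requests
            q.popleft()
        if len(q) >= self.limit:
            return False                          # hard-fail, alert, log
        q.append(now)
        return True
```

In production this state lives in Redis or Postgres so every proxy instance sees the same counts, but the logic is the same.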

How we work on an AI engagement

Week 1 - scoping. The use case, the data, the model, the budget, the failure mode. Written: which prompt, what context, what failure looks like, what success looks like. If we cannot describe failure precisely, we cannot measure it.

Weeks 2 to 4 - vertical slice. One real flow working from device to proxy to prompt layer to model to structured response to rendered result. Real data. Real users (internal). Real cost tracking.

Weeks 5 to 12 - iterate and broaden. Add surfaces, tune prompts, grow the eval set. Move from “works in 80% of cases” to “works in the 95% that matter.”

When AI is the wrong answer

  • Chatbot for your docs - search with a good ranker often wins. Try Algolia or Typesense before you reach for RAG.
  • AI summaries of emails - sometimes. Often the email is the summary and users do not want a second layer.
  • AI in onboarding - almost always worse than a good form.
  • Agent loops for single-step tasks - one model call beats an agent loop on latency, cost, and reliability.

We will talk you out of AI when a simpler approach ships a better product. The goal is the user getting what they want, not the resume getting a bullet point.

Why teams pick our AI delivery

The API key lives on our server from day one. A decompiled APK will not hand your account to a stranger, and a bug in a prompt will not burn a five-figure bill overnight because the budget caps sit at the proxy.

We build provider-agnostic from the first call. Model swaps (GPT-4o today, Claude tomorrow, a local Llama 3 next quarter) are flags, not rewrites. The eval harness ships in week 1 alongside the first prompt, so a regression on a model swap is caught in CI and not by your users. Refusals, timeouts, and hallucinations all get explicit UX, not a blank screen and a dev-tools stack trace.

Senior engineers write the prompts and the proxy. Source, prompts, evals, and proxy code are yours at launch.

FAQ

Questions people ask before we start.

Why can't the API key ship in the app?

Because it leaks on day one. Decompiled APKs, inspected web bundles, intercepted traffic - all trivial ways to extract the key. A server proxy is the only correct answer. Cost: half a day of engineering.

Which model should we use?

Claude for most production use - strong reasoning, good instruction following, predictable refusal behaviour. GPT-4o for latency-critical voice and image work. Gemini when you need long context and the workload runs on Google Cloud. Llama 3 or Mistral via Ollama or vLLM when budget, data locality, or on-prem requirements rule out cloud APIs. We pick per task, not per vendor.

What is RAG, and do we need it?

Retrieval-Augmented Generation - you fetch relevant documents from a vector DB, then send them to the LLM as context. Needed when the model must answer about your specific documents (internal wiki, product catalogue, customer tickets). Not needed for a general-purpose chat assistant.

How do you stop hallucinations?

Three layers. (1) RAG so the model sees the actual documents. (2) Structured outputs via JSON schema so wrong-shape responses fail fast. (3) A separate evaluator model that rates answers against ground truth on sampled queries. 100% is not a real target; 95%+ on the critical paths is.

Should we fine-tune a model?

Usually the wrong first step. Prompt engineering + RAG gets you 90% there. Fine-tuning matters when you have a narrow, repeatable task with lots of labelled data (classification, extraction) - then yes. We tell you which case you are in.

Do we need agents?

Agents (multi-step, tool-using) earn their place when the task has branches that depend on intermediate results - research, scheduling, multi-step refactoring. A single prompt beats an agent for anything one model call can answer. Start simple, add loops when evidence forces it.

How do you handle streaming and voice?

Server-Sent Events (SSE) for streaming tokens into any client. For voice, we use the device's native ASR (Apple Speech, Android SpeechRecognizer) - no third-party SDK bloating the app. 51+ languages come for free. Text in, text out, TTS on the device or through Google TTS when voice quality matters.

How do we keep costs under control?

Every request logs model, tokens in and out, latency, and cost. Dashboards per user, per feature, per day. Spend caps enforced at the proxy so a bug or abuse cannot burn a five-figure bill overnight.

Start an AI integration services engagement

Tell us what you’re building.

Tell us the use case, the data, and the latency budget. We reply with a model pick, a prompt strategy, and a scoped proposal for a discovery sprint. NDA signed on request. Response within 1 business day.

The team on the call

Named engineers, not a pool.

You speak to the person who’ll review the architecture. No account-manager layer. No offshore switcheroo.

Founder & Lead Engineer

Sameer Donga

Shipping Flutter, FastAPI, and AI systems since 2019. Reviews the architecture on every engagement.