
AI Integration Services

What you get

Capabilities included in every engagement.

  • OpenAI (GPT-4, GPT-4o, Whisper) SDK
  • Anthropic (Claude) SDK
  • Google Gemini SDK
  • Local / OSS models (Ollama, vLLM)
  • RAG on pgvector or Pinecone
  • Streaming responses (SSE)
  • Function calling and tool use
  • Structured outputs (JSON schema)
  • Native ASR + Google TTS
  • Eval harness + prompt versioning
  • Cost and rate-limit controls
The stack we default to

What we use. Why we pick it.

  • FastAPI proxy (server-held keys) - Every LLM key lives on the server. Your client authenticates a device; our service holds every vendor credential. Key rotation is a config change, not an app release.
  • OpenAI + Anthropic + Google SDKs - Provider-agnostic from day one. Switching from GPT-4o to Claude for a specific task is a flag, not a rewrite.
  • Prompt middleware layer - Intent tagging, system-prompt injection, transcript normalization before the model sees anything. The thing that keeps voice assistants coherent across languages.
  • pgvector (default) or Pinecone - pgvector lives in your existing Postgres - no new infrastructure. Pinecone for scale past a few million vectors. Weaviate when hybrid search matters.
  • Streaming + Server-Sent Events - Tokens render as they arrive. The user starts reading before the model finishes generating.
  • Structured outputs - JSON schemas via the model's function-calling API. Wrong-shape responses fail fast; the UI never renders hallucinated data.
  • Eval harness - Prompts are code. We snapshot inputs, outputs, and cost per request so regressions on a model swap are caught before shipping.
  • Rate limits + cost meter - Per-user quotas, per-feature budgets, per-day spend caps. "The OpenAI bill exploded overnight" stops being a story.
  • Langfuse (observability) - Every prompt, every completion, every cost traced. Self-hostable when data locality matters.
Reference architecture

How it fits together.

Your client never holds the key. A FastAPI service holds credentials, routes prompts, hits the vector DB, and logs for evals.

[Architecture diagram] The client app (web, chat) streams tokens and renders cards over HTTPS to the FastAPI proxy, which handles auth, rate limiting, prompt middleware, and logging. The proxy calls the LLM (OpenAI, Anthropic, or local - swappable, streaming), runs retrieval lookups against pgvector or Pinecone (chunked docs plus a reranker), and traces every prompt, output, and cost into the eval store for regression tests and the cost meter.

The three decisions that shape an AI feature

Every production AI integration starts with three decisions. Get them right and the feature earns its place. Get them wrong and the feature ships as a demo, not a product.

  • Model choice - Claude, GPT-4o, Gemini, or a local model via Ollama. Picked against the task, the latency budget, and the cost ceiling.
  • Eval loop - how you know the model is doing the right thing before, during, and after every change. No eval loop, no production.
  • Failure-mode UX - what the user sees when the model refuses, takes 30 seconds, or hallucinates. The UX of failure is the UX of production.

We set these up in week 1. The fancy prompt comes later.

What production AI actually requires

Most “AI features” shipped in 2024 and 2025 were ChatGPT wrappers: API key in the app, one-shot prompt, no memory, no evals, no cost controls. They work for a weekend demo. They break on the Monday when a user jailbreaks the prompt, or a friend shares the API key, or a model update changes the output shape.

A production AI integration has five moving parts. Every one of them is boring. All of them matter.

The five parts we ship every time

1. A server proxy holding credentials. FastAPI in front of every model call. Your app authenticates a device; our service holds every API key. Rotation, rate limits, and model switching all happen server-side.
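The routing logic behind that proxy can be sketched in a few lines. This is a minimal illustration, not the production code: the registry names, endpoints, and env-var names are placeholders.

```python
import os

# Hypothetical provider registry - names and endpoints are illustrative.
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1/chat/completions",
               "key_env": "OPENAI_API_KEY"},
    "anthropic": {"base_url": "https://api.anthropic.com/v1/messages",
                  "key_env": "ANTHROPIC_API_KEY"},
}

def resolve_provider(flag: str) -> dict:
    """Pick the upstream vendor for a request from a server-side flag.
    Swapping models is a config change, never an app release."""
    cfg = PROVIDERS.get(flag)
    if cfg is None:
        raise ValueError(f"unknown provider flag: {flag}")
    # The vendor key is read server-side only; the client never sees it.
    return {"url": cfg["base_url"], "key": os.environ.get(cfg["key_env"], "")}
```

Because the flag is resolved on the server, pointing a feature at a different vendor is one config edit.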

2. A prompt middleware layer. The raw user input never reaches the model. We normalize, tag intent, inject the right system prompt, strip obvious jailbreak attempts. Versioned like code, logged like code, tested like code.
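A stripped-down sketch of that middleware, assuming a simple keyword-based intent tagger - real intent tags, system prompts, and jailbreak patterns are project-specific placeholders here:

```python
import re

# Illustrative placeholders, not production prompts or filters.
SYSTEM_PROMPTS = {
    "support": "You are a concise support assistant.",
    "general": "You are a helpful assistant.",
}
JAILBREAK_PATTERNS = [r"ignore (all )?previous instructions"]

def tag_intent(text: str) -> str:
    return "support" if "refund" in text.lower() else "general"

def prepare(raw_input: str) -> list[dict]:
    """Normalize, tag, and wrap user input before it reaches the model."""
    text = " ".join(raw_input.split())           # normalize whitespace
    for pat in JAILBREAK_PATTERNS:               # strip obvious injections
        text = re.sub(pat, "", text, flags=re.IGNORECASE)
    intent = tag_intent(text)
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[intent]},
        {"role": "user", "content": text.strip()},
    ]
```

The returned message list - not the raw input - is what the model call sees, which is also what gets versioned and logged.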

3. Retrieval when it matters. If the question is “what does our internal policy say”, the model must see the policy. pgvector on your existing Postgres covers most use cases. Pinecone or Weaviate for scale past a few million vectors. Chunking, embedding, and reranking - we set up the whole pipeline.
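The first step of that pipeline - chunking - can be sketched as overlapping windows before embedding; the retrieval side is then a pgvector cosine-distance query (`ORDER BY embedding <=> $1 LIMIT k`). Sizes here are illustrative; production chunking is usually token-based and sentence-aware.

```python
def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character windows so each
    embedded chunk carries context from its neighbour."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```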

4. Structured outputs. JSON schemas via the model’s function-calling API. “Tell me the customer’s address” returns a {street, city, zip} object, not a paragraph. Wrong shapes fail fast; the app never renders hallucinated data.
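The fail-fast check on the proxy side amounts to shape validation before anything reaches the UI. A minimal sketch, with the `{street, city, zip}` schema from the example above as the assumed shape:

```python
import json

ADDRESS_SCHEMA = {"street": str, "city": str, "zip": str}  # example shape

def parse_address(raw: str) -> dict:
    """Validate a model response against the expected shape.
    Wrong-shape responses raise instead of rendering."""
    data = json.loads(raw)
    if set(data) != set(ADDRESS_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(data)}")
    for field, typ in ADDRESS_SCHEMA.items():
        if not isinstance(data[field], typ):
            raise ValueError(f"{field} must be {typ.__name__}")
    return data
```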

5. An eval harness. Before we switch models, we replay sampled production queries through both and compare outputs against labelled ground truth. No one finds out a model swap broke accuracy from a customer complaint.
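The core of that replay can be sketched as a comparison gate. `old_model` and `new_model` stand in for real model calls here, and the exact-match grading and delta threshold are illustrative simplifications:

```python
def compare_models(golden, old_model, new_model, min_delta=-0.02):
    """Replay a labelled golden set through both models and block the
    swap if accuracy drops more than `min_delta`."""
    def accuracy(model):
        hits = sum(1 for c in golden if model(c["query"]) == c["expected"])
        return hits / len(golden)
    old_acc, new_acc = accuracy(old_model), accuracy(new_model)
    return {"old": old_acc, "new": new_acc,
            "ship": new_acc - old_acc >= min_delta}
```

In practice grading is rarely exact-match - a rubric or evaluator model scores each answer - but the gate works the same way.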

What we integrate

  • OpenAI - GPT-4, GPT-4o, Whisper for speech, image generation. Shipped in four production apps.
  • Anthropic - Claude 3.5 Sonnet, Claude Opus. Our default for reasoning-heavy tasks.
  • Google - Gemini 1.5 Pro for long context, Vertex AI for Google Cloud deployments.
  • Local and OSS - Llama 3, Mistral, Qwen via Ollama or vLLM. When data cannot leave the network.
  • Embedding models - OpenAI ada, Voyage, Cohere. Picked per language and retrieval target.
  • Vector stores - Postgres pgvector (default), Pinecone, Weaviate, Qdrant.
  • Observability - Langfuse, OpenLLMetry, or self-hosted traces.

How we keep keys safe

Every LLM key lives on the server. The app authenticates a device to our FastAPI proxy, which holds every vendor credential. Rotation is a config change. Swapping a model is a flag.

The wrong answer - shipping the key in the mobile bundle - leaks the key on day one. Decompiled APKs, inspected web bundles, intercepted traffic. This is the first thing we fix when we take over an AI project built by someone else.

Case studies

  • GPT-4 advisor chat in a wealth-management app - client-facing chat tied to GPT-4 inside a regulated (AMFI-registered) Indian wealth app. The OpenAI key lives on our server; the app authenticates and requests through our custom FastAPI endpoint. Model swaps (GPT-4 to GPT-4o) require no app release.
  • Voice-first GPT-4 assistant from one Flutter codebase - Didier AI. Voice in via native platform ASR, prompt-engineering layer in front of GPT-4, image generation piped into the same chat view. One Flutter codebase ships to Android, iPad, and web. FastAPI proxy holds the OpenAI key. Freemium subscription via StoreKit and Play Billing.
  • Speech-in, speech-out GPT-4 Turbo assistant in 51 languages - Botinfo. Native platform ASR for 51-language voice input, prompt middleware classifying intent before the model call, Google TTS for natural-sounding output. FastAPI + Postgres proxy with per-user rate limits. Shipped to App Store and Google Play.

“Techy Panther’s collaboration on developing the Didier-AI Chat Application was exceptional. Their expertise in Flutter, API integration, and AI technologies resulted in a user-friendly chatting experience.”

  • Didier AI, client

Voice input, done right

Voice is not “another AI feature.” It is three products glued together: ASR (speech to text), intent handling, and TTS (text to speech). We use the device’s native ASR - Apple Speech on iOS, Android SpeechRecognizer on Android - not a third-party SDK. The phones already handle 60+ languages better than any bundled library, and we do not ship megabytes of extra weight to the user.

The transcript goes into the same prompt middleware layer as typed input. The response streams back via SSE. TTS runs on the device or through Google TTS when voice quality matters.
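The SSE wire format itself is trivial, which is part of why we use it - each token becomes one `data:` event any browser `EventSource` can consume. A sketch, using the `[DONE]` sentinel popularized by OpenAI's streaming API:

```python
def sse_stream(tokens):
    """Wrap model tokens in the Server-Sent Events wire format so the
    client renders them as they arrive."""
    for tok in tokens:
        yield f"data: {tok}\n\n"      # one SSE event per token
    yield "data: [DONE]\n\n"          # sentinel telling the client to stop
```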

Eval and observability

  • Prompt versioning - every prompt has a git SHA and an eval score on the golden set.
  • Sampled production traces - Langfuse captures inputs, outputs, latency, and cost on a sample of real traffic.
  • Regression evals on every prompt or model change - the golden set runs before merge; a drop in accuracy blocks the PR.
  • Drift detection - if the live input distribution drifts from what the eval set covers, we see it before users do.

Cost controls are a feature, not polish

  • Per-user quotas at the proxy (e.g. 30 chat turns per hour on free tier).
  • Per-feature budgets (e.g. image generation capped at a hard daily limit across all users).
  • Per-day spend caps that hard-fail requests if exceeded and alert the team.
  • Prompt caching where the provider supports it - up to 90% savings on repeated system prompts.

We instrument this from day one, not “when the bill gets scary.”
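The per-user quota is the simplest of these controls - a sliding window enforced at the proxy before any model call. A minimal sketch; the limit and window are illustrative (e.g. 30 turns per hour on free tier):

```python
import time
from collections import defaultdict, deque

class QuotaGuard:
    """Per-user sliding-window quota checked at the proxy."""
    def __init__(self, limit: int, window_s: float):
        self.limit, self.window_s = limit, window_s
        self.hits = defaultdict(deque)

    def allow(self, user_id, now=None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[user_id]
        while q and now - q[0] > self.window_s:   # drop expired requests
            q.popleft()
        if len(q) >= self.limit:
            return False                          # hard-fail, alert, log
        q.append(now)
        return True
```

In production this state lives in Redis or Postgres so every proxy instance sees the same counts, but the logic is the same.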

How we work on an AI engagement

Week 1 - scoping. The use case, the data, the model, the budget, the failure mode. Written: which prompt, what context, what failure looks like, what success looks like. If we cannot describe failure precisely, we cannot measure it.

Weeks 2 to 4 - vertical slice. One real flow working from device to proxy to prompt layer to model to structured response to rendered result. Real data. Real users (internal). Real cost tracking.

Weeks 5 to 12 - iterate and broaden. Add surfaces, tune prompts, grow the eval set. Move from “works in 80% of cases” to “works in the 95% that matter.”

When AI is the wrong answer

  • Chatbot for your docs - search with a good ranker often wins. Try Algolia or Typesense before you reach for RAG.
  • AI summaries of emails - sometimes. Often the email is the summary and users do not want a second layer.
  • AI in onboarding - almost always worse than a good form.
  • Agent loops for single-step tasks - one model call beats an agent loop on latency, cost, and reliability.

We will talk you out of AI when a simpler approach ships a better product. The goal is the user getting what they want, not the resume getting a bullet point.

Why teams pick our AI delivery

The API key lives on our server from day one. A decompiled APK will not hand your account to a stranger, and a bug in a prompt will not burn a five-figure bill overnight because the budget caps sit at the proxy.

We build provider-agnostic from the first call. Model swaps (GPT-4o today, Claude tomorrow, a local Llama 3 next quarter) are flags, not rewrites. The eval harness ships in week 1 alongside the first prompt, so a regression on a model swap is caught in CI and not by your users. Refusals, timeouts, and hallucinations all get explicit UX, not a blank screen and a dev-tools stack trace.

Senior engineers write the prompts and the proxy. Source, prompts, evals, and proxy code are yours at launch.

FAQ

Questions people ask before we start.

Why can't the API key ship in the app?

Because it leaks on day one. Decompiled APKs, inspected web bundles, intercepted traffic - all trivial ways to extract the key. A server proxy is the only correct answer. Cost: half a day of engineering.

Which model should we use?

Claude for most production use - strong reasoning, good instruction following, predictable refusal behaviour. GPT-4o for latency-critical voice and image work. Gemini when you need long context and the workload runs on Google Cloud. Llama 3 or Mistral via Ollama or vLLM when budget, data locality, or on-prem requirements rule out cloud APIs. We pick per task, not per vendor.

What is RAG, and do we need it?

Retrieval-Augmented Generation - you fetch relevant documents from a vector DB, then send them to the LLM as context. Needed when the model must answer about your specific documents (internal wiki, product catalogue, customer tickets). Not needed for a general-purpose chat assistant.

How do you stop hallucinations?

Three layers. (1) RAG so the model sees the actual documents. (2) Structured outputs via JSON schema so wrong-shape responses fail fast. (3) A separate evaluator model that rates answers against ground truth on sampled queries. 100% is not a real target; 95%+ on the critical paths is.

Should we fine-tune a model?

Usually the wrong first step. Prompt engineering + RAG gets you 90% there. Fine-tuning matters when you have a narrow, repeatable task with lots of labelled data (classification, extraction) - then yes. We tell you which case you are in.

Do we need agents?

Agents (multi-step, tool-using) earn their place when the task has branches that depend on intermediate results - research, scheduling, multi-step refactoring. A single prompt beats an agent for anything one model call can answer. Start simple, add loops when evidence forces it.

How do you handle streaming and voice?

Server-Sent Events (SSE) for streaming tokens into any client. For voice, we use the device's native ASR (Apple Speech, Android SpeechRecognizer) - no third-party SDK bloating the app. 51+ languages come for free. Text in, text out, TTS on the device or through Google TTS when voice quality matters.

How do we keep costs under control?

Every request logs model, tokens in and out, latency, and cost. Dashboards per user, per feature, per day. Spend caps enforced at the proxy so a bug or abuse cannot burn a five-figure bill overnight.

Start an AI integration services engagement

Tell us what you’re building.

Tell us the use case, the data, and the latency budget. We reply with a model pick, a prompt strategy, and a scoped proposal for a discovery sprint. NDA signed on request. Response within 1 business day.

The team on the call

Named engineers, not a pool.

You speak to the person who’ll review the architecture. No account-manager layer. No offshore switcheroo.

Founder & Lead Engineer

Sameer Donga

Shipping Flutter, FastAPI, and AI systems since 2019. Reviews the architecture on every engagement.