Building AI Products That Respond in Under 2 Seconds

Ravi Subramanian·Mar 30, 2026·11 min read

Engineering

Two seconds is the threshold. Below it, an AI response feels like a conversation. Above it, it feels like waiting. This isn't a guess — it's a well-established principle in conversational UX. Human turn-taking in dialogue operates on a 200-to-500 millisecond gap, and anything beyond about two seconds breaks the cognitive flow of an interaction. Users don't consciously count milliseconds, but they feel the difference between "this thing is fast" and "this thing is thinking."

When we set out to build an AI product that participates in real-time conversations, the 2-second target wasn't aspirational. It was a hard requirement — anything slower and the product wouldn't work.

This post is about how you engineer an AI pipeline to hit that target consistently. Not in a demo environment with a warm cache and a short prompt, but in production, with variable-length audio input, cold starts, and users who don't structure their questions neatly.

The Pipeline: Four Stages, One Budget

A real-time AI response system isn't a single operation — it's a pipeline with four sequential stages. Each stage consumes part of your 2-second budget:

| Stage | What happens | Typical latency |
| --- | --- | --- |
| 1. Transcription | Audio → text | 300–800ms |
| 2. Processing | Context assembly, prompt construction | 50–150ms |
| 3. Inference | LLM generates a response | 500–1,500ms |
| 4. Delivery | Response rendered to the user | 10–50ms |

Add those up naively and you're at 860ms to 2,500ms — right on the edge of your budget, with no room for variance. The engineering challenge is compressing each stage and overlapping them where possible.

Stage 1: Getting Speech to Text Fast

If your input is audio (voice commands, meeting transcription, voice agents), the transcription step is your first latency bottleneck.

The Whisper trap. OpenAI's Whisper is the most widely known speech-to-text model, and for batch transcription it's excellent. But Whisper lacks native streaming support. You have to buffer audio into chunks, send each chunk as a complete request, and stitch the results together. With chunk sizes of 3-5 seconds (needed for accuracy), you're adding 3-5 seconds of latency before the first word is even transcribed. For a real-time product, that's the entire budget gone before you've started.

Streaming transcription APIs. Services like Deepgram's Nova-3 deliver transcripts with sub-300ms latency via WebSocket streaming. Audio goes in continuously; partial transcripts come back as the user speaks. You don't wait for the user to finish — you're transcribing in real time, and by the time they stop speaking, you have most of the text already.
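With a streaming API you juggle two kinds of results: interim hypotheses that may still be revised, and final segments that won't change. A minimal sketch of that bookkeeping (the `is_final`/`text` message shape here is an assumption, not Deepgram's actual schema):

```python
# Sketch: accumulate interim and final transcripts from a streaming STT
# API. The message shape ("is_final", "text") is an assumption, not any
# vendor's actual schema.

class TranscriptAccumulator:
    """Keeps committed (final) segments plus the latest interim hypothesis."""

    def __init__(self):
        self.final_parts = []   # segments the API has committed to
        self.interim = ""       # latest unstable hypothesis, may be revised

    def on_message(self, msg: dict) -> str:
        if msg.get("is_final"):
            self.final_parts.append(msg["text"])
            self.interim = ""   # superseded by the final segment
        else:
            self.interim = msg["text"]
        return self.current()

    def current(self) -> str:
        """Best transcript available right now, usable for pre-fetching."""
        parts = self.final_parts + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

The `current()` text is what downstream stages can start working from before the user finishes speaking.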

The accuracy-latency trade-off. Faster transcription models make more errors. For a product where the transcribed text is user-facing (captions, notes), you need high accuracy and can tolerate slightly more latency. For a product where the transcription is intermediate — feeding into an LLM that's robust to minor errors — you can trade accuracy for speed. A word error rate of 8% might be unacceptable for captions but perfectly fine as LLM input, because the model can infer intent from imperfect text.

Budget target: 300–500ms. With a streaming API and a willingness to accept intermediate-quality transcription, this is achievable.

Stage 2: Processing — The Cheapest Stage You'll Still Mess Up

Between transcription and inference, there's a processing step: assembling the context the LLM needs to generate a useful response. This includes the transcribed user input, conversation history, any retrieved documents or data (RAG), system prompts, and formatting instructions.

This stage should be fast — it's mostly string concatenation and database lookups. But teams frequently blow their latency budget here by doing too much.

The RAG latency trap. Retrieval-augmented generation is powerful, but each retrieval step adds latency. A vector similarity search against a well-indexed database takes 20-50ms. But if you're doing a retrieval, re-ranking the results, then doing a second retrieval based on the re-ranked context, you've added 100-200ms. At real-time scales, that matters.

The fix is to pre-compute where possible. If you know the user's context (their role, the meeting they're in, the documents they've uploaded), you can pre-fetch and cache the most likely relevant context before the user speaks. When the transcription comes in, you're doing a lightweight filter against pre-loaded context, not a cold retrieval.
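A sketch of the pre-fetch pattern, with a toy keyword-overlap filter standing in for cached-embedding similarity (`fetch_docs`, `prefetch_context`, and the scoring are illustrative, not a real retrieval API):

```python
# Sketch: run the expensive retrieval ahead of time, then do a cheap
# in-memory filter when the transcript arrives. Keyword overlap is a
# toy stand-in for similarity over cached embeddings.

def prefetch_context(fetch_docs, session_keys):
    """Fetch likely-relevant docs up front (e.g. on session start)."""
    return {key: fetch_docs(key) for key in session_keys}

def filter_context(cache, transcript, top_k=3):
    """Lightweight filter against pre-loaded docs; no cold retrieval."""
    words = set(transcript.lower().split())
    scored = []
    for docs in cache.values():
        for doc in docs:
            overlap = len(words & set(doc.lower().split()))
            if overlap:
                scored.append((overlap, doc))
    scored.sort(key=lambda s: -s[0])
    return [doc for _, doc in scored[:top_k]]
```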

Token count matters. Every input token adds to inference latency, though the effect is smaller than for output tokens: OpenAI's latency guide notes that cutting your prompt in half typically yields only a 1-5% latency improvement. Pruning still pays off in cost and answer quality, especially when you're sending 3,000 tokens of context and could send 800. Be aggressive about pruning irrelevant context. Send the LLM what it needs, not everything you have.
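Pruning can be mechanical. A sketch that keeps the highest-priority context blocks within a token budget, using a whitespace word count as a crude stand-in for a real tokenizer such as tiktoken:

```python
# Sketch: prune context blocks to fit a token budget, keeping the most
# important blocks whole. Word count approximates token count here.

def prune_context(blocks, budget):
    """blocks: list of (priority, text); lower number = more important.
    Keep whole blocks, highest priority first, within the budget."""
    kept, used = [], 0
    for _, text in sorted(blocks, key=lambda b: b[0]):
        cost = len(text.split())   # crude stand-in for a real token count
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return kept
```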

Budget target: 50–100ms. Pre-fetch context, cache aggressively, prune input tokens.

Stage 3: Inference — Where Most of Your Budget Goes

LLM inference is the single largest latency contributor in the pipeline. There are two metrics that matter:

Time to first token (TTFT). How long before the model starts generating output. This determines when you can begin streaming a response to the user. For GPT-4o, TTFT is typically 200-400ms. For smaller models like GPT-4o-mini or Claude Haiku, it's 100-200ms.

Per-token latency. How long each subsequent token takes. At 30-50ms per token, a 100-token response (roughly two sentences) takes 3-5 seconds to fully generate. But with streaming, the user starts reading after the first token arrives — the perceived latency is the TTFT, not the total generation time.
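The arithmetic behind that distinction is worth making explicit. Using the figures above:

```python
# Latency math for a streamed response, using the figures from the text.

def total_latency_ms(ttft_ms, per_token_ms, n_tokens):
    """Wall-clock time until the last token arrives."""
    return ttft_ms + per_token_ms * n_tokens

def perceived_latency_ms(ttft_ms):
    """With streaming, output starts appearing at the first token."""
    return ttft_ms

# 100 tokens at 40ms/token with a 300ms TTFT:
# total = 4,300ms, but perceived = 300ms
```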

Five Techniques That Actually Move the Needle

1. Stream everything. This is the single most impactful optimization. OpenAI's latency guide ranks it first for a reason. Instead of waiting for the full response, send tokens to the user as they're generated. The user perceives the response as starting in 200-400ms, even if the full response takes 3 seconds to generate.

2. Use the smallest model that works. This sounds obvious, but teams consistently default to their most powerful model for every request. A complex reasoning task might need GPT-4o. A simple classification or reformatting task probably doesn't. Cutting the model size can reduce latency by 50-70% with minimal quality loss for straightforward tasks. Route requests based on complexity.

3. Generate fewer tokens. Latency is roughly proportional to output length. If you can get a useful answer in 50 tokens instead of 200, you've cut generation time by 75%. Instruct the model to be concise. Use structured output formats. Set max_tokens aggressively. OpenAI's data shows that cutting 50% of output tokens cuts roughly 50% of latency.

4. Semantic caching. If users frequently ask similar questions, cache the responses. Not just exact-match caching — use embedding similarity to identify semantically equivalent queries and serve precomputed answers. This drops response time from seconds to milliseconds for repeat or near-repeat queries.
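A sketch of the idea, with cosine similarity over whatever vectors an embedding model returns (`embed` is a placeholder, and the 0.92 threshold is an illustrative guess that would need tuning against real traffic):

```python
# Sketch: a cache keyed by embedding similarity rather than exact text
# match. `embed` is any function mapping text to a vector.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # text -> vector; any embedding model
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (vector, cached_answer)

    def get(self, query):
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best is not None and cosine(qv, best[0]) >= self.threshold:
            return best[1]          # millisecond-scale hit
        return None                 # miss: fall through to the LLM

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```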

5. Speculative execution. For request patterns where you can predict likely queries (e.g., in a meeting context, the user is likely to ask about the topic currently being discussed), start inference before the user finishes speaking. When the final transcription arrives, you either have a head start on the right answer or you discard the speculative result and start fresh. On average, this saves 200-500ms when the prediction hits.
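The keep-or-discard decision at the end of a speculative run can be sketched as follows; matching on normalized text is a simplification, and a production system might compare embeddings instead:

```python
# Sketch: decide whether a speculative inference result is usable once
# the final transcript arrives.

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def resolve_speculation(predicted_query, speculative_answer, final_query):
    """Return (answer, hit). Keep the head start only if the prediction
    matched what the user actually asked; otherwise discard it."""
    if normalize(predicted_query) == normalize(final_query):
        return speculative_answer, True   # hit: head-start latency saved
    return None, False                    # miss: run fresh inference
```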

Budget target: 500–1,000ms perceived (with streaming). TTFT of 200-400ms, with the rest of the generation streaming in progressively.

Stage 4: Delivery — Perceived vs. Actual Latency

The final stage is rendering the response. In a web app, this is trivial. In a more complex UI — an overlay, a mobile app, a voice agent — there are tricks to make the response feel faster than it is.

Progressive rendering. Don't wait for the full response to display it. Show tokens as they arrive. If the response is a paragraph, the user starts reading the first sentence while the third sentence is still generating.

Skeleton states. Show an indication that the response is coming (a typing indicator, a loading shimmer) within 100ms of the request. This doesn't reduce actual latency, but it dramatically reduces perceived latency. Users who see immediate feedback are more tolerant of a 1.5-second wait than users who see nothing for 1.5 seconds.

Priority rendering. If the response has a clear structure (a key answer followed by supporting detail), render the key answer first at full confidence and stream the supporting detail progressively. The user gets the actionable information immediately.

Budget target: 10–50ms. This stage shouldn't be your bottleneck. If it is, you have a rendering problem, not an AI problem.

The Big Lever: Pipeline Parallelism

The latency math changes dramatically when you stop treating the pipeline as strictly sequential.

Overlap transcription and processing. Don't wait for the user to finish speaking before you start assembling context. As partial transcripts arrive via streaming, begin pre-fetching relevant context. By the time the final transcript is ready, your context window is already assembled.

Overlap processing and inference. If your system prompt and context are ready before the user's final input, you can send the partial prompt to the LLM and append the user input when it arrives. Some inference APIs support this via prompt caching or session continuity.

Overlap inference and delivery. This is just streaming — the user receives tokens as they're generated, not after the full response is complete.

With aggressive parallelism, the effective latency isn't the sum of all stages — it's closer to the latency of the slowest stage plus a small overhead for coordination. In practice, this can compress a 2,500ms sequential pipeline down to 1,200-1,500ms.
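The overlap pattern maps naturally onto async tasks. A minimal sketch with stand-in stage functions (the stage bodies and names are illustrative):

```python
# Sketch: overlapping pipeline stages with asyncio. Context pre-fetch
# runs concurrently with transcription; inference streams its output
# straight into delivery.

import asyncio

async def run_pipeline(transcribe, prefetch, infer, render):
    # Stages 1 and 2 overlap: start the context pre-fetch immediately,
    # while transcription streams in concurrently.
    context_task = asyncio.create_task(prefetch())
    transcript = await transcribe()
    context = await context_task        # usually already done by now
    # Stages 3 and 4 overlap via streaming delivery.
    async for token in infer(transcript, context):
        render(token)
```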

When to Skip the LLM Entirely

OpenAI's latency guide includes a recommendation most engineers overlook: "Don't default to an LLM."

Some requests don't need generative AI at all. A lookup question ("what's the pricing for the Enterprise plan?") can be answered by a keyword match against a document store in 20ms. A formatting request ("convert this to bullet points") can be handled by a rule-based transformation. A classification task ("is this question about pricing or about features?") can be done by a lightweight classifier faster than the time it takes to construct an LLM prompt.

Build a routing layer that classifies incoming requests and sends them to the appropriate handler. The LLM is the fallback for complex, open-ended queries — not the first option for every request. This doesn't just reduce latency; it reduces cost proportionally.
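A sketch of such a routing layer; the keyword rules are illustrative placeholders for a real lightweight classifier:

```python
# Sketch: route each request to the cheapest handler that can serve it,
# with the LLM as the fallback for open-ended queries.

def route(request: str) -> str:
    """Return the name of the handler for this request."""
    q = request.lower()
    if "convert" in q or "bullet" in q:
        return "rule_based"   # deterministic transformation, no model call
    if "pricing" in q or "plan" in q:
        return "doc_lookup"   # keyword match against a document store
    return "llm"              # open-ended query: pay the inference cost
```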

The Latency Budget Spreadsheet

Every real-time AI team should maintain a latency budget spreadsheet. It looks like this:

| Stage | P50 target | P99 target | Current P50 | Current P99 |
| --- | --- | --- | --- | --- |
| Transcription | 350ms | 600ms | | |
| Processing | 75ms | 150ms | | |
| Inference (TTFT) | 250ms | 500ms | | |
| Delivery | 20ms | 50ms | | |
| End-to-end | 700ms | 1,300ms | | |

Measure each stage independently. When total latency creeps up, you know which stage is responsible. When a new feature adds context to the prompt (increasing inference latency) or adds a retrieval step (increasing processing latency), you can see the impact immediately and decide whether the quality improvement justifies the latency cost.

The P99 target matters more than the P50. A product that's fast 50% of the time and slow 1% of the time feels unreliable. Users remember the slow responses. Build for the tail, not the median.
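A minimal sketch of the per-stage measurement behind such a spreadsheet, using the nearest-rank percentile method:

```python
# Sketch: record per-stage latencies and read off P50/P99 with the
# nearest-rank percentile method.

import math
from collections import defaultdict

class LatencyBudget:
    def __init__(self):
        self.samples = defaultdict(list)   # stage -> list of ms values

    def record(self, stage: str, ms: float):
        self.samples[stage].append(ms)

    def percentile(self, stage: str, p: float) -> float:
        """Nearest-rank percentile of recorded samples for a stage."""
        data = sorted(self.samples[stage])
        rank = max(1, math.ceil(p / 100 * len(data)))
        return data[rank - 1]
```

Comparing `percentile(stage, 99)` against the budget column per stage is what tells you which stage regressed when end-to-end latency creeps up.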

Latency Is a Product Decision

The engineering is hard, but the harder part is the organizational discipline. Every feature request, every additional context source, every "let's also include X in the prompt" decision is a latency decision. Someone on the team has to be the person who asks "what does this cost us in milliseconds?" before it ships.

At Neothi, the 2-second budget isn't negotiable — the product is a real-time overlay during live conversations, and anything slower makes it useless. That constraint forces every engineering decision through a latency lens, which turns out to be a surprisingly effective way to build a focused product. When you can't afford to do everything, you figure out what actually matters.

Latency isn't a performance metric. It's a feature. Treat it like one.