Question 1

What does 'AI engineer' actually mean in your portfolio?

Accepted Answer

Shipping AI-native products end-to-end. That means model selection with a written tradeoff brief, prompt design, tool-calling architecture, RAG when retrieval matters, voice latency tuning when speech is the surface, evals so you know when it breaks, and cost monitoring so it doesn't blow up your budget. Not just calling chat.completions.create.

Question 2

How do you keep voice agent latency under a second?

Accepted Answer

An explicit per-subsystem budget written into the scope doc, P95 traces from day one, streaming at every stage (STT, LLM, TTS), and a Node/Twilio mediator pattern that overlaps STT-final with LLM start. The voice-agents-latency-budget article on this site walks through the architecture and the 700 ms target.
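
As a rough illustration of what a per-subsystem budget check could look like, here is a minimal sketch. The stage names and the split of the 700 ms target across stages are assumptions made for the example, not figures from the article, and the nearest-rank percentile is just one reasonable way to compute P95.

```python
# Hypothetical per-subsystem P95 budgets in milliseconds. The split is an
# assumption, chosen only so that the parts add up to the 700 ms target.
BUDGET_MS = {"stt_final": 200, "llm_first_token": 300, "tts_first_audio": 200}


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def over_budget(traces: dict[str, list[float]]) -> dict[str, float]:
    """Return every subsystem whose observed P95 exceeds its budget."""
    observed = {stage: p95(samples) for stage, samples in traces.items()}
    return {
        stage: value
        for stage, value in observed.items()
        if value > BUDGET_MS[stage]
    }


# Made-up trace data: two slow LLM calls out of twenty push its P95 over budget.
traces = {
    "stt_final": [150.0] * 20,
    "llm_first_token": [250.0] * 18 + [900.0] * 2,
    "tts_first_audio": [100.0] * 20,
}
violations = over_budget(traces)
```

Per-stage P95s do not simply add up to the end-to-end P95, especially when stages overlap, so a check like this would sit alongside an end-to-end trace rather than replace it.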

Question 3

RAG architecture — where do most teams get it wrong?

Accepted Answer

Three places: chunking by token count instead of by semantic unit, embedding-only retrieval when their data needs keyword recall, and no eval set so they can't tell when retrieval quality regresses. Fix those three and you've solved most of what people blame on the LLM.
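
One common way to combine keyword and embedding retrieval is reciprocal rank fusion (RRF). The sketch below is an illustration of that technique, not a description of any specific engagement; the document IDs are invented, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; earlier positions score higher."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + position)
    # Sort by fused score, descending; break ties by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["doc_sku_4471", "doc_pricing", "doc_faq"]
embedding_hits = ["doc_pricing", "doc_overview", "doc_sku_4471"]
fused = reciprocal_rank_fusion([keyword_hits, embedding_hits])
```

Documents that both retrievers return rise to the top, while a document found only by keyword match (an exact SKU, say) still makes the list instead of being dropped.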

Question 4

How do you evaluate prompt changes without manual QA?

Accepted Answer

A golden set of inputs and expected outputs per task. Deterministic checks for structured outputs; judge-LLM scoring with a calibrated rubric for free-form text. The suite runs in CI on every prompt change. Prompts are versioned in code, with the eval results in the PR description.
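
The deterministic half of that loop can be sketched in a few lines. Everything named here is hypothetical: the golden cases, the field names, and `run_prompt`, which stands in for whatever function calls the model with the prompt under test. Judge-LLM scoring for free-form text is left out.

```python
import json
from typing import Callable

# Hypothetical golden set: each case pairs an input with the fields the
# structured output must contain.
GOLDEN_SET = [
    {"input": "Refund order 1182", "expected": {"intent": "refund", "order_id": "1182"}},
    {"input": "Where is order 77?", "expected": {"intent": "track", "order_id": "77"}},
]


def check_case(raw_output: str, expected: dict) -> bool:
    """Pass if the output is a JSON object that matches every expected field."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(parsed.get(key) == value for key, value in expected.items())


def pass_rate(run_prompt: Callable[[str], str], golden_set: list[dict]) -> float:
    """Fraction of golden cases whose output passes the deterministic check."""
    passed = sum(
        check_case(run_prompt(case["input"]), case["expected"])
        for case in golden_set
    )
    return passed / len(golden_set)
```

In CI, a job could compare `pass_rate` for the changed prompt against the value recorded for the main branch and fail the build on a drop.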

Question 5

Claude vs OpenAI — how do you choose per-task?

Accepted Answer

Capability, latency, cost, and failure modes for the specific task. Claude for long-context reasoning and tool-use reliability. GPT for voice (Realtime) and the broader function-calling ecosystem. Open-source via Groq when latency or per-token cost dominates. Every engagement gets a written tradeoff brief; choices are not theological.

Question 6

What does 'production' mean for an AI feature?

Accepted Answer

Tracing on every model call, P95 latency alerts, golden-set evals in CI, cost dashboards per tenant, structured-output validation, retry/fallback chains for tool calls, human-in-the-loop gating for irreversible actions, and a runbook for the moments when the model starts refusing or drifting. Until those exist, the feature is a demo.
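
Two items on that list, structured-output validation and retry/fallback chains, fit together naturally. The sketch below is one possible shape under stated assumptions: the `tool`/`arguments` schema is invented, and each provider is a plain callable standing in for a real model client.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"tool", "arguments"}  # hypothetical tool-call schema


def parse_tool_call(raw: str) -> dict:
    """Validate a raw model response; raise ValueError if it is unusable."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("response is not valid JSON") from exc
    if not isinstance(payload, dict) or not REQUIRED_KEYS <= payload.keys():
        raise ValueError("response is missing required keys")
    return payload


def call_with_fallback(
    prompt: str,
    providers: list[tuple[str, Callable[[str], str]]],
    attempts_per_provider: int = 2,
) -> tuple[str, dict]:
    """Try each provider in order, retrying on errors or invalid output."""
    errors: list[str] = []
    for name, call in providers:
        for attempt in range(1, attempts_per_provider + 1):
            try:
                return name, parse_tool_call(call(prompt))
            except Exception as exc:  # broad on purpose: any failure triggers the next try
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

A real chain would likely add backoff between attempts and emit a trace per attempt, so that fallback frequency shows up on the same dashboards as latency and cost.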

AI engineer portfolio

Shipped work for the same brief.

Where most AI builds fail, and how I close the gap.

1. Latency budget, written before the first line of code

2. Eval loop with golden sets and regression detection

3. Tool-calling reliability with structured-output validation

4. RAG architecture, where most teams actually get stuck

5. Cost-of-failure tiers and human-in-the-loop gating

What founders ask before reaching out.