OpenAI Realtime · RAG · Voice · Evals1 slot · June 2026

AI engineer portfolio

Freelance AI engineer shipping production AI-native products end-to-end, not just bolting on an OpenAI call. Voice agents on OpenAI Realtime + Twilio with sub-second latency budgets, RAG systems with proper chunking and eval loops, tool-calling architectures, and the eval discipline to know when the model is failing in production. Currently building two AI-native products of my own.

  • OpenAI Realtime API with Twilio for sub-700ms voice agents
  • RAG: chunking strategies, embeddings, retrieval tuning, eval sets
  • Tool-calling architectures and structured outputs
  • Multi-model routing across Claude, GPT, and open-source
  • Prompt design, prompt caching, and cost optimization
  • Eval loops with golden sets, regression testing, and live monitoring
  • Voice agent handoff to humans, with context preservation
  • Vector stores (pgvector, Pinecone) and hybrid search
Book a 30-min call
Selected case studies

Shipped work for the same brief.

Common questions

What founders ask before reaching out.

  • What does 'AI engineer' actually mean in your portfolio?

    Shipping AI-native products end-to-end. That means model selection with a tradeoff brief, prompt design, tool-calling architecture, RAG when retrieval matters, voice latency tuning when speech is the surface, evals so you know when it breaks, and cost monitoring so it doesn't blow up your budget. Not just calling chat.completions.create.

  • Can you build a voice agent on OpenAI Realtime?

    Yes — currently shipping one. The voice-agents-latency-budget article on this site walks through the architecture, the 700ms target, why P95 matters more than median, and the Twilio + Node mediator pattern that hits it.

  • Do you build RAG systems?

    Yes. RAG done right is mostly retrieval engineering, not LLM engineering: chunking strategy for your specific data shape, embedding model selection, hybrid search when keyword recall matters, reranking, and an eval set with golden answers so you know retrieval quality before it ships.

  • Which models do you work with?

    Claude (Anthropic), GPT (OpenAI), and open-source via Together/Groq when latency or cost demands it. Model choice gets a written tradeoff brief on every engagement: capability, latency, cost, and failure modes for the specific task.

  • How do you handle evals and quality?

    Golden set of inputs and expected outputs per task, scored by either heuristic checks or a judge LLM. Regression-tested on every prompt change. For voice, latency P95 and turn-success rate. For RAG, retrieval precision and answer faithfulness. No 'looks fine' shipping.

  • Will you train a custom model?

    Usually no — fine-tuning is rarely the right answer for product work. Better prompts, better retrieval, and tool-calling solve most problems faster and cheaper. I'll fine-tune when there's a clear quality or latency reason and the training data exists.

Related
Next step

Let's see if it's a fit.

30-minute call. No pitch, no slides. Tell me what you're building, including the AI parts, and the constraints. I'll tell you if I can help, and who else to call if I can't.

Book a 30-min call
1 slot · June 2026Usually replies within 24 hoursAsync-friendly · UTC+5:30