Building an AI Interview Coach: Lessons Learned from Prompt Engineering, Memory, and Workflow Design

Software Developer | Data Analyst | AI & Web Development Enthusiast
Building practical applications, exploring Generative AI workflows, and sharing lessons learned from real projects. Interested in AI engineering, backend development, and solving real-world problems through technology.

Introduction

Preparing for technical interviews is a real problem. Reading static question lists gives no feedback. Practising with friends is inconsistent. I wanted to see if a Generative AI workflow could simulate a realistic, adaptive interview without needing an enterprise budget.

While learning more about prompt engineering, conversational memory, and agentic workflows, I became interested in how these concepts could be applied to a practical interview preparation tool. Instead of just reading about them, I decided to build a small project: an AI Interview Coach.

This article covers what I built, what broke, what surprised me, and the lessons I learned along the way.

Key Takeaways

  • Workflow design matters more than model size.

  • Structured outputs reduce parser failures.

  • Short-term memory dramatically improves follow-up quality.

  • Breaking tasks into smaller prompts improves reliability.

  • Hallucinations decrease when prompts include evidence-based evaluation criteria.

Why a Basic Q&A Chatbot Fails

A simple chatbot that asks a question and says "Good answer" is useless for interview practice. I needed:

  • Memory – the system should recall previous answers to ask relevant follow-ups.

  • Structured feedback – not just a score, but specific, role-relevant suggestions.

  • Adaptive flow – vague answers trigger probing; good answers move to new topics.

These requirements map directly to agentic AI workflow concepts: systems that reason, decide, and orchestrate multiple steps. Those workflow patterns ultimately shaped the way I designed the project.

The Workflow I Built

Tech stack used: Python, Flask, the OpenAI API, a speech-to-text model, and a prompt-based evaluation pipeline.

The goal wasn't to build a production-ready system. I wanted a lightweight prototype that would help me understand how workflow design affects AI application behavior.

The system runs through a loop:

  1. User picks a role (Java backend, data science, etc.)

  2. System generates a question

  3. User types or speaks an answer

  4. Speech-to-text transcribes audio if needed

  5. A prompt is assembled – includes question, answer, role criteria, and an evaluation rubric

  6. The LLM returns a scored assessment with strengths and suggestions

  7. Based on answer quality, the system either asks a follow-up or moves to the next question

Breaking this into smaller tasks rather than a single giant prompt made debugging much easier. It also made failures easier to isolate and fix.
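
The loop above can be sketched as a small orchestration function. This is a minimal illustration rather than the project's actual code: `run_interview`, `Assessment`, and the three injected callables are hypothetical names, and the demo stubs stand in for the real LLM and speech-to-text calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Assessment:
    feedback: str
    next_action: str  # "follow_up" or "next_question"


def run_interview(
    role: str,
    generate_question: Callable[[str, List[Dict[str, str]], bool], str],
    get_answer: Callable[[str], str],
    evaluate: Callable[[str, str, str], Assessment],
    max_turns: int = 3,
) -> List[Dict[str, str]]:
    """Run the question -> answer -> evaluation loop and return a transcript."""
    transcript: List[Dict[str, str]] = []
    follow_up = False
    for _ in range(max_turns):
        question = generate_question(role, transcript, follow_up)
        answer = get_answer(question)  # typed text or a transcription
        assessment = evaluate(role, question, answer)
        transcript.append({
            "question": question,
            "answer": answer,
            "feedback": assessment.feedback,
            "next_action": assessment.next_action,
        })
        # Last step of the loop: probe further or move to a new topic.
        follow_up = assessment.next_action == "follow_up"
    return transcript


# Demo stubs (assumptions): a real version would call the model here.
def demo_generate(role, transcript, follow_up):
    if follow_up:
        return f"Can you go deeper on: {transcript[-1]['question']}"
    return f"[{role}] question {len(transcript) + 1}"


def demo_evaluate(role, question, answer):
    vague = len(answer.split()) < 5  # crude stand-in for model judgement
    if vague:
        return Assessment("Add more detail.", "follow_up")
    return Assessment("Solid answer.", "next_question")


answers = iter([
    "It removes objects.",
    "The JVM marks reachable objects and sweeps the rest.",
])
log = run_interview(
    "Java backend", demo_generate, lambda q: next(answers), demo_evaluate, max_turns=2
)
```

Because each step is a separate callable, a failure can be traced to question generation, transcription, or evaluation without rereading one large prompt.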

Example Feedback Output

Question:
Explain Java Garbage Collection.

User Answer:
Garbage Collection removes unused objects from memory.

Feedback:
Clarity: 2/3
Technical Depth: 1/3
Relevance: 3/3

Suggestion:
Mention generations, mark-and-sweep collection, and pause behaviour to improve technical depth.

Before diving into the challenges, here's a high-level view of the workflow architecture.

Figure 1. AI Interview Coach workflow architecture showing orchestration, memory handling, and adaptive interview flow.

Real Technical Problems (And How I Fixed Them)

  1. Hallucinated Feedback That Made No Sense

    Early on, the model praised completely wrong answers. Once, it told me my Python answer was "excellent for a Java memory management question." I also saw it suggest technical solutions that didn't exist.

    What caused it: The initial prompt was too vague. No rubric, no role context, no examples.
    How I fixed it: I added a three-dimensional rubric (clarity, technical accuracy, relevance) and told the model to cite the user's exact words as evidence. This significantly reduced hallucinated responses during testing.
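
A rubric-based prompt along these lines might look like the sketch below. The function name `build_evaluation_prompt`, the criterion descriptions, and the exact wording of the instructions are assumptions for illustration; only the three criteria and the "cite the user's exact words" rule come from the fix described above.

```python
# Criterion names follow the rubric described above; the descriptions
# are illustrative assumptions, not the project's actual wording.
RUBRIC = {
    "Clarity": "Is the answer easy to follow and well organised?",
    "Technical accuracy": "Are the technical statements correct?",
    "Relevance": "Does the answer address the question that was asked?",
}


def build_evaluation_prompt(role: str, question: str, answer: str) -> str:
    """Assemble a role-specific, rubric-based evaluation prompt."""
    criteria = "\n".join(
        f"- {name} (1-3): {description}" for name, description in RUBRIC.items()
    )
    return "\n".join([
        f"You are interviewing a candidate for a {role} role.",
        f"Question: {question}",
        f"Candidate answer: <answer>{answer}</answer>",
        "",
        "Score the answer on each criterion:",
        criteria,
        "",
        "For every score, quote the candidate's exact words as evidence.",
        "If you cannot find a supporting quote, give the lowest score.",
    ])


prompt = build_evaluation_prompt(
    "Java backend",
    "Explain Java Garbage Collection.",
    "Garbage Collection removes unused objects from memory.",
)
```

Requiring a quote for each score gives the model less room to praise content that is not in the answer.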

  2. The Scoring Format Kept Breaking

    I asked for output like "Score: 7/10", but the model sometimes returned "Score: 7 (but only if you consider...)", with extra text inside the field. My parser failed repeatedly.

    Fix: I switched to a strict template:
    Clarity: [1/2/3]
    Technical depth: [1/2/3]
    Relevance: [1/2/3]
    Overall: [3-9]

    Then I added a retry loop: if parsing failed, the system re-prompted once with a reminder to follow the format exactly. That handled most of the malformed outputs.
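
One way to implement the strict template and the single retry is sketched below. It assumes the model replies with one bare digit per line; `parse_scores`, `evaluate_with_retry`, and the `call_model` callable are hypothetical names, and the fake model in the demo replaces a real API call.

```python
import re
from typing import Callable, Dict, Optional

# Allowed digit range for each field of the template.
RANGES = {"Clarity": "1-3", "Technical depth": "1-3", "Relevance": "1-3", "Overall": "3-9"}

# Each field must sit on its own line with nothing after the digit.
PATTERNS = {
    name: re.compile(
        rf"^[ \t]*{re.escape(name)}:[ \t]*([{digits}])[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    for name, digits in RANGES.items()
}

FORMAT_REMINDER = (
    "\n\nYour last reply did not follow the format. Reply with exactly "
    "these four lines and a single digit after each colon:\n"
    "Clarity:\nTechnical depth:\nRelevance:\nOverall:"
)


def parse_scores(text: str) -> Optional[Dict[str, int]]:
    """Return the four scores, or None if any field is missing or malformed."""
    scores = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            return None
        scores[name] = int(match.group(1))
    return scores


def evaluate_with_retry(call_model: Callable[[str], str], prompt: str) -> Optional[Dict[str, int]]:
    """Call the model once and re-prompt a single time if parsing fails."""
    scores = parse_scores(call_model(prompt))
    if scores is None:
        scores = parse_scores(call_model(prompt + FORMAT_REMINDER))
    return scores


# Demo: a fake model whose first reply is malformed and second is clean.
replies = iter([
    "Clarity: 2 (but only if you consider...)\nTechnical depth: 1\nRelevance: 3\nOverall: 6",
    "Clarity: 2\nTechnical depth: 1\nRelevance: 3\nOverall: 6",
])
result = evaluate_with_retry(lambda prompt: next(replies), "Evaluate the answer.")
```

Returning `None` after the second failure leaves the caller free to decide what to show the user instead of crashing on a bad reply.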

  3. The Agent Forgot What It Just Asked

    Without conversation history in the prompt, the model would sometimes ask a follow-up question unrelated to my previous answer. Example: I talked about debugging a memory leak, and the next question was "What's your favourite sorting algorithm?"

    Fix: I implemented a simple sliding window: the prompt now includes the last two exchanges (question + answer) plus the current turn. It is not a full vector database, but it is enough for a short interview. The extra history increased token usage, but I stayed under typical context limits.
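
A sliding window like this can be built on `collections.deque` with a `maxlen`, which drops the oldest exchange automatically. The class name `InterviewMemory` and the prompt labels are assumptions made for this sketch.

```python
from collections import deque


class InterviewMemory:
    """Keep only the most recent question/answer exchanges."""

    def __init__(self, window: int = 2):
        # A deque with maxlen discards the oldest item when it is full.
        self._exchanges = deque(maxlen=window)

    def add(self, question: str, answer: str) -> None:
        self._exchanges.append((question, answer))

    def build_context(self, current_question: str, current_answer: str) -> str:
        """Render the remembered exchanges plus the current turn as prompt text."""
        lines = []
        for question, answer in self._exchanges:
            lines.append(f"Previous question: {question}")
            lines.append(f"Previous answer: {answer}")
        lines.append(f"Current question: {current_question}")
        lines.append(f"Current answer: {current_answer}")
        return "\n".join(lines)


memory = InterviewMemory(window=2)
memory.add("Q1: What is a memory leak?", "A1: Memory that is never released.")
memory.add("Q2: How did you find it?", "A2: With a heap dump.")
memory.add("Q3: How did you fix it?", "A3: I removed a static cache.")
context = memory.build_context("Q4: How did you verify the fix?", "A4: Load testing.")
```

After the third `add`, the first exchange has fallen out of the window, so the prompt size stays bounded no matter how long the interview runs.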

  4. Latency Spikes Made the Flow Feel Clunky

    Each turn required multiple LLM calls: one for evaluation, then another for follow-up generation. Total latency sometimes hit 8-10 seconds, which killed the conversational flow.

    Fix: I combined evaluation and next action decision into a single prompt. The model now returns both the feedback and a flag: next_action: "follow_up" or "next_question". This cut the latency roughly in half.
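
Handling the combined reply could look like the sketch below. It assumes the model is asked to return a JSON object with `feedback` and `next_action` keys; the post does not state the wire format, so JSON, the function name `parse_combined_reply`, and the fallback to `"next_question"` are all assumptions.

```python
import json

VALID_ACTIONS = ("follow_up", "next_question")


def parse_combined_reply(raw: str) -> dict:
    """Extract feedback and the next-action flag from one model reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    action = data.get("next_action")
    if action not in VALID_ACTIONS:
        # Assumption: when the flag is missing or invalid, move on
        # rather than risk looping on follow-ups.
        action = "next_question"
    return {"feedback": str(data.get("feedback", "")), "next_action": action}


good = parse_combined_reply(
    '{"feedback": "Mention generational collection.", "next_action": "follow_up"}'
)
bad = parse_combined_reply("Sure! Here is my evaluation...")
```

One call per turn replaces the separate evaluation and follow-up calls, at the cost of a slightly larger prompt and a reply that needs validating.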

Oracle University Concepts That Actually Helped

Although this project does not run on OCI infrastructure, several concepts I encountered while exploring Oracle University's Generative AI learning resources were directly useful:

  • Prompt engineering – the difference between a generic "evaluate this answer" prompt and a rubric-based, role-specific prompt was significant.

  • Workflow orchestration – breaking a task into smaller, single-purpose steps instead of relying on one massive prompt.

  • Context management – maintaining conversational state without exhausting the context window.

  • Hallucination mitigation – using evidence-based instructions and output constraints to improve response quality.

The biggest value wasn't specific code or implementation details. It was learning a framework for thinking about AI systems and understanding how different workflow components interact.

What I'd Try Next (But Haven't Built Yet)

A few directions that make sense for a future version:

  • RAG for grounded feedback – pull interview examples from a curated knowledge base instead of relying purely on model knowledge.

  • Memory across sessions – store a user's weak areas and retrieve them later using vector search to support long-term improvement.

  • Adaptive difficulty – adjust question complexity automatically based on the user's performance.

  • Automated evaluation – use a second model to validate the quality and consistency of the generated feedback.

These aren't implemented yet, but they feel like logical next steps based on what I learned from building the prototype.

Final Thoughts

This project confirmed something for me: workflow design matters more than model choice.

A modest model with clean, step-by-step orchestration and careful prompt engineering can outperform a more powerful model paired with a single messy prompt.

If you're exploring agentic AI, start small. Build something you'd actually use. Let it break. Fix one thing at a time.

The biggest lesson wasn't about models.

It was about workflows.

In my testing, a reasonably capable model combined with good prompt design, state management, and structured outputs worked better than more complicated approaches.

That's the lesson I'll carry into future AI projects.

Have you run into similar issues with prompt drift, parser failures, or conversational memory? I'd be interested to hear how others are approaching these challenges.