AI Engineer Roadmap

CURRICULUM

The Three Phases

The roadmap is divided into three phases. Each phase has a clear goal and produces real, deployable projects. Do them in order.

PHASE 01 · 5–7 WEEKS

Foundations & Tooling

Build the two systems every future project depends on: RAG and structured data extraction.

PHASE 02 · 7–9 WEEKS

Agentic Systems

Build three agents: one for data analysis, one for research reproduction, and one built on MCP (Model Context Protocol).

PHASE 03 · 9–11 WEEKS

Production Engineering

Ship systems that self-correct, scale to real users, and prove their own quality.

PHASE 01 — FOUNDATIONS & TOOLING

Build the Core Infrastructure

Before you build agents, you need to know how to retrieve information accurately and extract structured data reliably. These are the two most-used skills in AI engineering. Both projects are portfolio-ready and immediately recognized by hiring managers.

RAG ENGINEERING ★ PORTFOLIO

DIFFICULTY

DURATION: 2–3 wks

PROJECT 01

Production RAG Pipeline

In plain English: Turn a pile of documents into a smart search engine that answers questions accurately — and prove with numbers that it works.

You will build a document retrieval system that ingests PDFs, Word files, and web pages, finds the most relevant content for any question, and generates a grounded answer. More importantly, you will build an evaluation framework that measures how good your system actually is — something most tutorial-level projects never do.

RAG Architecture · Vector Databases · Embedding Models · Hybrid Search · RAGAS Evaluation · FastAPI · Streaming

THE REAL-WORLD PROBLEM YOU ARE SOLVING

Every company has thousands of internal documents — manuals, reports, contracts — that nobody can quickly search. A basic keyword search returns too many irrelevant results. This project builds the system that actually understands the meaning of a question and finds the right answer in a large collection of documents. The key differentiator: you will measure your system's accuracy with real metrics, not just "it seems to work."

WHAT YOU WILL BUILD — STEP BY STEP

Document ingestion pipeline. A system that reads PDFs, HTML pages, Markdown files, and Word documents, then splits them into chunks. You will try three splitting strategies: fixed size, recursive (smarter), and semantic (splits by meaning). You will learn why the choice of chunk size matters for accuracy.

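Hybrid retrieval system. You will combine two search methods: dense search (finds documents with similar meaning, using embeddings) and sparse search (BM25, finds documents with matching keywords). Combining both usually retrieves more accurately than either alone, because each catches cases the other misses: dense search handles paraphrases, sparse search handles exact terms such as part numbers and names. Then you add a re-ranker that rescores the combined list and keeps the best results.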

Evaluation dashboard. Using the RAGAS framework, you will measure four things: (1) Faithfulness: is the answer supported by the retrieved documents? (2) Answer relevancy: does the answer actually address the question? (3) Context recall: did retrieval find everything needed to answer? (4) Context precision: are the retrieved chunks relevant, with the best ones ranked first? These numbers will be displayed in a dashboard you can screenshot for your portfolio.

FastAPI service with streaming. You will wrap everything in a REST API with Server-Sent Events (SSE), so the answer streams token by token in real time, like ChatGPT. SSE is a common way to serve LLM responses in production.

Comparison UI. A simple web interface where you can run the same question through different retrieval strategies and see the accuracy and speed side by side. This is the thing you show in interviews to prove your system works.
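To make the ingestion step above concrete, here is a minimal sketch of the simplest of the three splitting strategies: fixed-size chunks with overlap. The function name `chunk_fixed` and its default sizes are assumptions for illustration, not something the roadmap prescribes, and a real pipeline would usually count tokens rather than characters.

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window shares `overlap` characters with the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


example = chunk_fixed("abcdefghij", chunk_size=4, overlap=1)
```

Overlap keeps a sentence that straddles a boundary retrievable from either side, at the cost of storing some text twice. Varying `chunk_size` in a function like this is one way to run the chunk-size comparison the project asks for.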

WHAT MAKES THIS NON-GENERIC

Most RAG tutorials build a chatbot that "seems to work" but has no measurements. Your version has a live benchmark dashboard with real accuracy numbers (RAGAS scores). When an interviewer asks "how do you know it works?", you will have a precise, quantitative answer. This is the difference between a junior and a senior AI engineer's approach.

INTERVIEW QUESTIONS THIS PREPARES
→ Why does chunk size affect retrieval quality?
→ When does dense search beat sparse search, and why?
→ How do you evaluate an AI system when you have no labeled data?

TECH STACK

LangChain · Qdrant · FastAPI · RAGAS · Sentence-Transformers · BM25s · React · Docker

NEW INDUSTRY TERM YOU WILL LEARN

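Hybrid Search: The combination of meaning-based search (dense) and keyword-based search (sparse). On most document collections it outperforms either method alone. RRF (Reciprocal Rank Fusion) is a widely used algorithm for merging the two result lists: each document scores the sum of 1 / (k + rank) over the lists it appears in, so the two retrievers' raw scores never need to be normalized. Hybrid search is standard in many production RAG systems in 2026.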
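Reciprocal Rank Fusion is short enough to write out. The sketch below assumes each retriever returns document IDs best-first. The function name is hypothetical, and `k = 60` is the constant suggested in the original RRF paper, not a value this roadmap specifies.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings into one list.

    A document's score is the sum of 1 / (k + rank) over every ranking
    that contains it, with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # hypothetical embedding-search results
sparse_hits = ["b", "d", "a"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents found by both retrievers (`a` and `b` here) rise above documents found by only one, which is the behaviour hybrid search relies on.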

AGENTS + STRUCTURED EXTRACTION ★ PORTFOLIO ✦ YOUR IDEA

DIFFICULTY

DURATION: 2–3 wks

PROJECT 02

AI Resume Builder + Job Intelligence Agent

In plain English: Type in a job posting URL. The agent reads it, understands what the company wants, and rewrites your resume specifically for that job — in 30 seconds.

You will build an agent that scrapes any job posting, extracts what the company is really looking for (even the unstated requirements), matches those requirements against your experience, and generates a tailored resume with a match score. Your demo audience is recruiters — you can literally demo this in the interview itself.

LangGraph (Basic) · Web Scraping + LLM Extraction · Structured Outputs / Pydantic · Prompt Engineering · RAG as Knowledge Base · PDF Generation

THE REAL-WORLD PROBLEM YOU ARE SOLVING

Customizing a resume for each job application takes 1–2 hours and most people skip it. Those who do it manually miss half the important keywords. This agent automates the whole pipeline: read the job → understand requirements → match to your experience → write a tailored resume → score how well it fits. A 2-hour task becomes 30 seconds. This also teaches you LangGraph at a beginner level, which you will use at an advanced level in Phase 2.

WHAT YOU WILL BUILD — STEP BY STEP

Web scraper tool. Given any job posting URL, the scraper extracts the raw text. An LLM then converts that messy text into clean, structured data: required skills, nice-to-have skills, seniority level, company culture keywords, and the hidden requirements buried in the description.

Candidate knowledge base. Your experience, projects, education, and skills are stored as a RAG corpus (the system you built in Project 1). When writing the resume, the agent retrieves the most relevant parts of your background for each job requirement. This ensures the resume is accurate — the AI only uses real experience you have.

ATS keyword optimizer. ATS stands for Applicant Tracking System — the software most companies use to filter resumes before a human sees them. Your agent compares the generated resume against the job description, flags missing important keywords, and suggests additions. It does this without inventing experience you don't have.

Structured output pipeline. Using Pydantic schemas (a Python tool for defining data structures), every resume section is generated in a consistent, well-formatted structure. This is how you make LLM outputs reliable and predictable — a critical production skill.

Three output formats + match score. The system generates: (1) a formatted PDF resume, (2) an ATS-plain-text version, and (3) a LinkedIn summary. It also gives a 0–100 match score that explains which requirements you meet strongly and which are gaps.

LangGraph orchestration. The full pipeline runs as a LangGraph state machine: Scraper → Extractor → Matcher → Writer → Optimizer nodes. Each step passes its result to the next. You will learn how to build, debug, and trace a basic agent graph — the foundation for everything in Phase 2.
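The ATS keyword check in the steps above can be prototyped before any LLM is involved. This is a minimal sketch under stated assumptions: `keyword_gaps` is a hypothetical helper, matching is plain token overlap, and the score is simple coverage of the required skills. The roadmap does not define how the match score is computed.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and keep word-like tokens, including '+' and '#' (c++, c#)."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def keyword_gaps(required_skills: list[str], resume_text: str) -> dict:
    """Report which required skills appear in the resume and which are missing."""
    resume_tokens = tokenize(resume_text)
    present, missing = [], []
    for skill in required_skills:
        # A multi-word skill counts as present only if all its tokens appear.
        if tokenize(skill) <= resume_tokens:
            present.append(skill)
        else:
            missing.append(skill)
    score = round(100 * len(present) / len(required_skills)) if required_skills else 0
    return {"present": present, "missing": missing, "score": score}


report = keyword_gaps(
    ["Python", "FastAPI", "Docker"],
    "Built REST services in Python with FastAPI.",
)
```

The function only flags gaps. Whether a missing keyword may be added still depends on the candidate knowledge base, which is what keeps the agent from inventing experience.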

WHAT MAKES THIS NON-GENERIC

Your portfolio audience is recruiters. Showing a recruiter a live demo where you type a job URL and watch a tailored resume appear in 30 seconds is immediately understood. It also demonstrates a complete engineering pipeline: data extraction → retrieval → generation → evaluation — all in one coherent product. The match score makes it measurable, not just "it generated something."

INTERVIEW QUESTIONS THIS PREPARES
→ How do you extract structured data reliably from messy web pages?
→ What is Pydantic and why does it matter for LLM outputs?
→ How would you prevent the agent from hallucinating experience you don't have?

TECH STACK

LangGraph · BeautifulSoup · Instructor / Pydantic · LangChain · FastAPI · WeasyPrint · React

NEW INDUSTRY TERM YOU WILL LEARN

Structured Outputs: Getting an LLM to return data in a predictable format (like JSON) instead of free-form text. Instructor is a Python library that wraps LLM clients, validates each response against a Pydantic schema, and re-asks the model when validation fails. Without this kind of validation, LLM outputs are hard to rely on in production systems.
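The validate-then-retry idea behind structured outputs can be shown without any third-party package. The sketch below is a stand-in, not Instructor's or Pydantic's API: it uses a standard-library dataclass where the project would use a Pydantic model, and `call_llm` is a hypothetical function that takes a prompt and returns raw text.

```python
import json
from dataclasses import dataclass


@dataclass
class JobRequirements:
    required_skills: list[str]
    seniority: str


def parse_requirements(raw: str) -> JobRequirements:
    """Parse and validate model output; raise ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    skills = data.get("required_skills")
    seniority = data.get("seniority")
    if not isinstance(skills, list) or not all(isinstance(s, str) for s in skills):
        raise ValueError("required_skills must be a list of strings")
    if seniority not in {"junior", "mid", "senior"}:
        raise ValueError("seniority must be junior, mid, or senior")
    return JobRequirements(skills, seniority)


def extract_with_retry(call_llm, prompt: str, max_attempts: int = 3) -> JobRequirements:
    """Call the model, validate the reply, and feed errors back on failure."""
    last_error = None
    for _ in range(max_attempts):
        feedback = f"\nPrevious output was invalid: {last_error}" if last_error else ""
        try:
            return parse_requirements(call_llm(prompt + feedback))
        except ValueError as err:
            last_error = err
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")


# A scripted fake model: the first reply is malformed, the second is valid.
replies = iter(["not json", '{"required_skills": ["Python"], "seniority": "senior"}'])
result = extract_with_retry(lambda prompt: next(replies), "Extract the requirements.")
```

Feeding the validation error back into the next prompt gives the model a concrete reason its previous reply was rejected.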

PHASE 02 — AGENTIC SYSTEMS

Build Agents That Do Real Work

In Phase 2, you stop building pipelines and start building agents: systems that reason, plan, use tools, and work autonomously. You will build three agents, each teaching a different architectural pattern. Phase 2 also includes your first project on MCP, the most important new protocol in AI engineering as of 2026.

LANGGRAPH AGENTIC LOOP ★ PORTFOLIO

DIFFICULTY

DURATION: 2–3 wks

PROJECT 03

Autonomous Data Analyst Agent

In plain English: Upload any spreadsheet or dataset. The agent figures out what's interesting, writes the code to analyze it, runs the code, fixes any errors by itself, and delivers a report — without you writing a single line of analysis code.

This is where your data science background becomes a superpower. You know exactly what good data analysis looks like. Now you encode that knowledge into a set of tools that an AI agent can use autonomously. The agent does in 2 minutes what used to take you 2 hours.

LangGraph State Machines · Tool Use / Function Calling · ReAct Agent Pattern · Human-in-the-Loop · Code Execution Sandboxing · Streaming UI

THE REAL-WORLD PROBLEM YOU ARE SOLVING

Every company has data analysts who spend hours doing the same exploratory analysis on every new dataset: check distributions, find outliers, look for correlations, test hypotheses, make charts. This is valuable work but it follows a repeatable pattern. This agent automates the pattern while preserving the insight. Because you already know what good analysis looks like, you are uniquely qualified to design this agent's tools correctly.

WHAT YOU WILL BUILD — STEP BY STEP

LangGraph state machine with five nodes. The agent's workflow is: Plan (decide what to analyze) → Code (write Python to do the analysis) → Execute (run the code safely) → Reflect (read the output and decide what to do next) → Report (write a narrative report). Each step is a separate node. LangGraph manages the flow and lets you add conditional logic — for example, if code execution fails, go back to the Code node and fix it.

Data science tool suite. You will build a set of tools the agent can call: pandas profiling (automatic summary of any dataset), statistical tests (t-test, chi-square, correlation), anomaly detection, and Plotly chart generation. These tools encode your data science knowledge — the agent does not need to figure out how to do statistics, it just calls the right tool.

Safe Python code execution. The agent writes and runs real Python code. You will build a sandboxed execution environment with: timeout handling (stops runaway code), stdout and stderr capture (the agent reads its own error messages), and error recovery (the agent rewrites its code when it sees an error). This is how production AI coding systems work.

Human-in-the-loop checkpoints. If the agent detects a problem — unusual data types, very high missing-value rates, conflicting distributions — it pauses and asks you a clarifying question before continuing. You will learn where to put human checkpoints and where to let the agent run freely.

Auto-generated HTML report. The final output is a professional HTML report containing: executive summary, all generated visualizations, statistical findings, and the agent's code. The report looks like something a senior data analyst wrote — because the tools behind it were designed by one (you).
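The timeout and output-capture parts of the code execution step can be sketched with the standard library alone. `run_python` is a hypothetical helper name. A child process with a timeout is not a real sandbox, since it does not restrict file or network access, so treat this as the inner layer that would run inside a container.

```python
import subprocess
import sys


def run_python(code: str, timeout_s: float = 5.0) -> dict:
    """Run code in a separate Python process and capture what it prints.

    Returns stdout, stderr, and flags the agent can use to decide whether
    to rewrite its code.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site and env vars
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s", "timed_out": True}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "timed_out": False,
    }


good = run_python("print(1 + 1)")
bad = run_python("raise ValueError('boom')")
```

When `ok` is false, the captured `stderr` contains the traceback, which is what the agent reads before returning to the Code node.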

WHAT MAKES THIS NON-GENERIC

Most agent demos do trivial tasks like "search the web and summarize results." This agent does complex analytical work on real data. You can walk into any interview, hand them a CSV, run the agent live, and show a complete analysis report in 2 minutes. Your DS background is the reason the tools are actually good — not toy examples. This combination of domain expertise + agent architecture is rare.

INTERVIEW QUESTIONS THIS PREPARES
→ How do you stop an agent from running dangerous or infinite-loop code?
→ When do you add human-in-the-loop versus full automation?
→ How does LangGraph's state machine differ from a simple while loop?

TECH STACK

LangGraph · Claude API · Python REPL Tool · Pandas · Plotly · FastAPI · React · Docker

KEY AGENT PATTERN YOU WILL LEARN

ReAct Pattern: Reason + Act. The agent alternates between thinking about what to do next (reason) and actually doing it (act), then observes the result and reasons again. This back-and-forth loop is how most real-world agents work. LangGraph makes this loop explicit and controllable.
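The loop itself is small. In the sketch below a scripted function stands in for the LLM, so the example runs without an API key. `react_loop`, `scripted_policy`, and `mean_tool` are hypothetical names, and this is plain Python, not LangGraph's API.

```python
def react_loop(decide, tools: dict, question: str, max_steps: int = 5):
    """Alternate reason -> act -> observe until the policy finishes or steps run out."""
    history = []  # (thought, action, observation) triples
    for _ in range(max_steps):
        step = decide(question, history)  # reason: choose the next action
        if step["action"] == "finish":
            return step["input"], history
        tool = tools.get(step["action"])
        if tool is None:
            observation = f"unknown tool: {step['action']}"
        else:
            observation = tool(step["input"])  # act: call the chosen tool
        history.append((step["thought"], step["action"], observation))  # observe
    return None, history  # step budget exhausted without an answer


def mean_tool(csv_numbers: str) -> str:
    values = [float(v) for v in csv_numbers.split(",")]
    return str(sum(values) / len(values))


def scripted_policy(question: str, history: list) -> dict:
    """Stand-in for the model: call the tool once, then answer."""
    if not history:
        return {"thought": "I need the average.", "action": "mean", "input": "3,5,10"}
    return {"thought": "I have the number.", "action": "finish",
            "input": f"The mean is {history[-1][2]}"}


answer, trace = react_loop(scripted_policy, {"mean": mean_tool}, "What is the mean of 3, 5, 10?")
```

The `max_steps` bound is the simplest guard against an agent that never decides to finish.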

MULTI-AGENT PIPELINE ★ PORTFOLIO ✦ YOUR IDEA

DIFFICULTY

DURATION: 3–4 wks

PROJECT 04

Research Paper → Implementation Agent

In plain English: Give the agent a scientific paper PDF. It reads the paper, understands the method, writes the code to reproduce the experiments, runs them, and tells you how close the results are to what the paper claimed.

This is the most distinctive project in the entire roadmap. Reproducing ML research is something PhD students spend weeks doing. Your multi-agent system attempts it autonomously, and when it can't fully reproduce results (due to compute limits), it explains exactly why and estimates what would happen at full scale.

Multi-Agent Orchestration · LangGraph Sub-Agents · PDF Parsing · Automated Code Generation · Compute-Aware Planning · Scientific Benchmarking

THE REAL-WORLD PROBLEM YOU ARE SOLVING

AI research is published at a pace that is impossible to follow manually. Researchers and engineers need to quickly evaluate whether a new technique works, but reproducing the experiments takes days or weeks. This agent compresses that to hours. It also handles the common situation where the paper required 8 powerful GPUs but you only have a laptop — the agent scales the experiment down and tells you how confident the scaled result is. This is recursive proof of AI capability: an AI agent that understands and implements AI research.

WHAT YOU WILL BUILD — STEP BY STEP

Parser Agent. Reads the PDF and extracts structured information: the model architecture, the training procedure, the hyperparameters (learning rate, batch size, etc.), the datasets used, and the metrics reported. Handles figures, tables, and equations. Outputs clean structured JSON.

Planner Agent. Takes the Parser Agent's output and creates an ordered implementation checklist. It identifies what code needs to be written in what order, what Python packages are needed, and which parts of the paper are underspecified (missing details that are common in academic papers).

Coder Agent. Writes modular Python code to implement the paper's method. It uses the paper's exact variable names and notation to make the code traceable back to the paper. It also writes unit tests for each component.

Environment Agent. Reads the dependency list from the Planner, creates a virtual environment, installs all packages, and handles version conflicts. Fully automated — no manual setup required.

Evaluator Agent. Runs the experiments and captures the metrics. Then compares your results to the paper's reported numbers, with statistical analysis: "Our result was X. The paper reported Y. The difference is Z, which is within/outside the expected range of experimental variance."

Compute Estimator. The most distinctive feature. If the paper required more compute than you have, the agent runs a scaled-down version and extrapolates: "At 10% scale, we got X. Based on the scaling curves in the paper, we estimate full-scale would produce Y with ±Z confidence." This handles the real-world situation where most papers are not reproducible on a laptop.

Reporter Agent. Generates a structured reproduction report: what matched, what diverged, why it likely diverged, and what hardware/time would be needed to close the gap. This report is the final portfolio artifact.
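The Evaluator Agent's comparison against the paper's numbers can start from something as simple as the sketch below. The rule used here (the paper's value lies within `k` sample standard deviations of the mean of your runs) is an assumption chosen for illustration, not a formal statistical test and not something the roadmap specifies. `compare_to_paper` is a hypothetical name.

```python
from statistics import mean, stdev


def compare_to_paper(our_runs: list[float], reported: float, k: float = 2.0) -> dict:
    """Compare repeated runs of a metric with the value a paper reports."""
    if len(our_runs) < 2:
        raise ValueError("need at least two runs to estimate variance")
    run_mean = mean(our_runs)
    run_std = stdev(our_runs)  # sample standard deviation across seeds
    difference = run_mean - reported
    return {
        "mean": run_mean,
        "std": run_std,
        "difference": difference,
        "within_expected_variance": abs(difference) <= k * run_std,
    }


runs = [0.81, 0.83, 0.82]  # hypothetical accuracy from three seeds
close = compare_to_paper(runs, reported=0.835)
far = compare_to_paper(runs, reported=0.90)
```

Running several seeds matters here: with a single run there is no variance estimate, and no basis for saying whether a gap is noise or a real divergence.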

WHAT MAKES THIS NON-GENERIC

No one has this in their portfolio. It is technically deep (multi-agent, code generation, scientific evaluation), immediately impressive in a demo, and understood by any ML researcher or AI engineer. The Compute Estimator feature alone is a conversation-stopper in interviews. It shows you understand real-world constraints, not just ideal scenarios.

INTERVIEW QUESTIONS THIS PREPARES
→ How do you handle a paper that omits important implementation details?
→ How do you architect multi-agent systems to handle partial failures?
→ What is the difference between a single agent with many tools vs multiple specialized agents?

TECH STACK

LangGraph · Claude API · PyMuPDF · Instructor / Pydantic · Docker · subprocess · FastAPI · React

KEY PATTERN YOU WILL LEARN

Supervisor + Sub-Agent Architecture: One "boss" agent (the Supervisor) receives the main task and delegates subtasks to specialized workers. Each worker agent has its own tools and focus area. The Supervisor collects results and makes the final decision. This is the standard architecture for complex multi-agent systems.
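A minimal sketch of the supervisor pattern follows, with plain functions standing in for the worker agents. All names are hypothetical, and this is not LangGraph's API. The point it illustrates is partial-failure handling: one worker failing does not discard the results the others produced.

```python
def supervise(task: str, workers: dict, plan: list[str]) -> dict:
    """Run workers in plan order, collecting results and recording failures."""
    results, failures = {}, {}
    for name in plan:
        try:
            # Each worker sees the task and everything produced so far.
            results[name] = workers[name](task, results)
        except Exception as err:  # a failed worker must not stop the pipeline
            failures[name] = str(err)
    status = "complete" if not failures else "partial"
    return {"status": status, "results": results, "failures": failures}


def parser(task, prior):
    return {"title": task}


def coder(task, prior):
    raise RuntimeError("dependency install failed")


def reporter(task, prior):
    return f"completed steps: {sorted(prior)}"


outcome = supervise(
    "Reproduce paper X",
    {"parser": parser, "coder": coder, "reporter": reporter},
    plan=["parser", "coder", "reporter"],
)
```

A real supervisor would also decide whether a later worker can run at all after an earlier one fails. Here the reporter simply summarizes whatever succeeded.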

MCP + CONTEXT ENGINEERING ★ PORTFOLIO ✦ NEW IN 2026

DIFFICULTY

DURATION: 2–3 wks

PROJECT 05

Build Your Own MCP Server + Developer Intelligence Tool

In plain English: Build the "USB-C adapter" that lets any AI model connect to your own tools and data — then use it to create an AI tool that understands a codebase and helps developers work on it.

MCP (Model Context Protocol) is the most important new standard in AI engineering. Anthropic introduced it in late 2024, and OpenAI, Google, and Microsoft adopted it in 2025. Almost no portfolio projects demonstrate it. You will build a custom MCP server from scratch, then use it to power a developer tool that connects Claude to a GitHub repository and helps engineers navigate and improve code.

MCP Protocol · Context Engineering · Claude Code SDK · Tool Design · GitHub API Integration · API Design

WHY MCP IS THE MOST IMPORTANT THING TO LEARN IN 2026

Before MCP, every AI tool that needed to connect to external data (GitHub, databases, Slack, etc.) required its own custom integration. A developer had to build separate code for "Claude connecting to GitHub" and "ChatGPT connecting to GitHub." With MCP, you build one server that connects to GitHub — and any MCP-compatible AI model can use it instantly. MCP was adopted by OpenAI in March 2025, Google DeepMind in April 2025, and is now the de facto standard. In 2026, "running an MCP server has become almost as common as running a web server." Knowing how to build one puts you ahead of most AI engineers.

WHAT YOU WILL BUILD — STEP BY STEP

Custom MCP server (the foundation). Using the Python MCP SDK, you will build a server that exposes three types of primitives: Tools (actions the AI can call, like "search this repo for a function"), Resources (data the AI can read, like "get the content of this file"), and Prompts (pre-written instructions for specific tasks). You will understand the JSON-RPC 2.0 protocol that powers MCP.

GitHub repository connector. Your MCP server will connect to any GitHub repository and expose tools: list files, read file contents, search code, get commit history, list open pull requests, and read PR diffs. Any AI model that supports MCP can now work with any GitHub repo through your server.

Codebase semantic indexer. Beyond basic file access, your server builds a semantic index of the repository: what each module does, how functions call each other, what design patterns are used. This is Context Engineering — you are designing the richest, most useful context an AI can receive about a codebase.

Developer intelligence tool. Using your MCP server, you will build a Claude-powered tool that: (1) reviews pull requests with architecture-level understanding, (2) scans repositories for technical debt and generates prioritized GitHub Issues, and (3) answers questions like "where is the authentication logic?" and "what would break if I changed this function?"

GitHub Action integration. Package the tool as a GitHub Action that automatically runs on every PR. Each PR gets an AI review that understands your codebase's architecture, not just the changed lines.

Context engineering documentation. Document exactly what information you chose to include in the agent's context, what you excluded, and why. This written analysis of your context design decisions is a portfolio artifact that proves senior-level thinking.
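To see what the JSON-RPC 2.0 layer mentioned in the first step looks like on the wire, here is a hand-rolled sketch of the two tool-related methods, `tools/list` and `tools/call`. It deliberately does not use the Python MCP SDK and omits the initialization handshake and transport. The in-memory `REPO` and the `read_file` tool are hypothetical. A real server should be built with the SDK.

```python
import json

REPO = {"README.md": "# Demo repo", "app/auth.py": "def login(user): ..."}  # hypothetical repo contents


def read_file(path: str) -> str:
    return REPO[path]


TOOL_SPECS = [{
    "name": "read_file",
    "description": "Return the contents of a file in the repository.",
    "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]},
}]
HANDLERS = {"read_file": read_file}


def handle(raw: str) -> str:
    """Answer one JSON-RPC 2.0 request with one JSON-RPC 2.0 response."""
    request = json.loads(raw)
    method = request.get("method")
    if method == "tools/list":
        body = {"result": {"tools": TOOL_SPECS}}
    elif method == "tools/call":
        params = request.get("params", {})
        handler = HANDLERS.get(params.get("name"))
        if handler is None:
            body = {"error": {"code": -32602, "message": "Unknown tool"}}
        else:
            try:
                text = handler(**params.get("arguments", {}))
                body = {"result": {"content": [{"type": "text", "text": text}], "isError": False}}
            except Exception as err:
                # Tool failures are reported inside the result so the model can read them.
                body = {"result": {"content": [{"type": "text", "text": str(err)}], "isError": True}}
    else:
        body = {"error": {"code": -32601, "message": "Method not found"}}
    return json.dumps({"jsonrpc": "2.0", "id": request.get("id"), **body})


call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "read_file", "arguments": {"path": "README.md"}}}
response = json.loads(handle(json.dumps(call)))
```

The sketch separates protocol errors (unknown method, unknown tool) from tool execution errors, which travel back as an ordinary result flagged with `isError`.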

WHAT MAKES THIS NON-GENERIC

First-mover advantage. Almost nobody has an MCP server in their portfolio right now. Building one from scratch (not just using existing servers) demonstrates you understand the protocol at the architecture level, not just as a user. The fact that your server works with Claude, ChatGPT, and any future MCP-compatible AI model proves you understand the standardization — a key signal for senior roles.

INTERVIEW QUESTIONS THIS PREPARES
→ What is MCP and why did every major AI company adopt it?
→ What is the difference between a Tool, Resource, and Prompt in MCP?
→ What is Context Engineering and how does it differ from Prompt Engineering?

TECH STACK

Python MCP SDK · Claude Code SDK · GitHub API / PyGithub · FastAPI · PostgreSQL · GitHub Actions · React

2026 INDUSTRY CONTEXT

The Thoughtworks Technology Radar (the industry's most respected technology assessment) placed MCP in "Trial" status in late 2025 — meaning: use it in production now, but with awareness. By March 2026, it is the de facto standard for AI tool integration. Employers will increasingly expect AI engineers to know this protocol.

PHASE 03 — PRODUCTION ENGINEERING

Ship Systems That Prove Themselves

Phase 3 is about production-grade systems: systems that self-correct errors, serve real users, measure their own quality, and are deployed at a real URL. These three projects are what senior AI engineers work on. They require everything you built in Phases 1 and 2.

ADVANCED RAG — RESEARCH → PRODUCTION ★ PORTFOLIO

DIFFICULTY

DURATION: 3 wks

PROJECT 06

Self-Correcting Adaptive RAG

In plain English: Upgrade your RAG system so it checks if the documents it retrieved are actually useful — and if not, it searches again with a better question. It also detects when it's making things up and flags those answers before they reach the user.

You will implement two published research techniques — CRAG (Corrective RAG) and Self-RAG — as a production system. These techniques fix the biggest problem with standard RAG: it blindly trusts whatever it retrieves, even when the documents are irrelevant. Implementing research papers in production is one of the most valued skills at frontier AI companies.

CRAG Algorithm · Self-RAG Tokens · LLM-as-Judge · Query Reformulation · Hallucination Detection (NLI) · Adaptive Retrieval · LangSmith Tracing

THE REAL-WORLD PROBLEM YOU ARE SOLVING

Standard RAG has a serious flaw: it always retrieves something, even when nothing relevant exists in the document collection. Then it generates an answer based on irrelevant documents, and the answer sounds confident but is wrong. CRAG and Self-RAG (research papers published in 2024 and 2023, respectively) address this with two mechanisms: (1) grade the retrieved documents before using them, and (2) let the model decide whether to retrieve at all. You are bringing recent research into a production system and measuring whether it actually improves accuracy.

WHAT YOU WILL BUILD — STEP BY STEP

CRAG (Corrective RAG) implementation. After retrieval, a "grader" LLM reads each retrieved document and gives it a relevance score (0.0 to 1.0). Documents that score below a threshold are discarded. If too many documents are discarded, the system automatically reformulates the original question into a better search query and retrieves again. This loop repeats until useful documents are found or a retry limit is reached; without the limit, a question the collection cannot answer would loop forever.

Self-RAG implementation. Special decision tokens are added to the generation process that let the model decide: "Do I need to search for information, or can I answer from what I already know?" For queries the model can already answer from its own knowledge, skipping retrieval is faster and often just as accurate. Self-RAG can cut unnecessary retrieval by approximately 30%, making the system cheaper and faster; treat that figure as a target to verify on your own query set.

Adaptive chunking. A query complexity classifier reads the incoming question and decides what chunk size to use for retrieval. Simple factual questions ("What year was X founded?") usually work best with small, fine-grained chunks that isolate the fact. Complex multi-hop reasoning ("How did X's decision in 2018 affect Y's strategy in 2021?") usually works best with larger chunks that keep related context together. The system adjusts automatically, and your benchmark should confirm which direction holds on your data.

Hallucination detection layer. Using an NLI (Natural Language Inference) model, every sentence in the generated answer is checked against the retrieved source documents. If a sentence in the answer is not supported by the sources, it is flagged with a confidence score before being shown to the user. NLI is a classical NLP technique that your data science background makes easy to understand.

Technical blog post. Write and publish: "Standard RAG vs CRAG vs Self-RAG — A Benchmarked Comparison." Include your actual RAGAS numbers from all three systems. This post becomes a portfolio artifact and a source of inbound recruiter interest. This is how senior AI engineers establish credibility publicly.
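The corrective loop from the first step above can be sketched with the retriever, grader, and query rewriter passed in as functions. This follows the roadmap's description of the loop, not the exact algorithm in the CRAG paper. All names are hypothetical, and the word-overlap grader is a stand-in for the grader LLM.

```python
def corrective_retrieve(query, retrieve, grade, rewrite,
                        threshold=0.5, min_docs=1, max_rounds=3):
    """Retrieve, grade, and retry with a rewritten query when too little survives."""
    for round_no in range(1, max_rounds + 1):
        docs = retrieve(query)
        kept = [doc for doc in docs if grade(query, doc) >= threshold]
        if len(kept) >= min_docs:
            return {"query": query, "docs": kept, "rounds": round_no}
        if round_no < max_rounds:
            query = rewrite(query)  # reformulate and try again
    # Retry budget exhausted: report that nothing relevant was found.
    return {"query": query, "docs": [], "rounds": max_rounds}


CORPUS = ["CCQ apprenticeship hours for carpenters", "Supplier catalog for plywood"]


def overlap_grade(query: str, doc: str) -> float:
    """Stand-in grader: the fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


outcome = corrective_retrieve(
    "hours woodworkers",
    retrieve=lambda q: CORPUS,
    grade=overlap_grade,
    rewrite=lambda q: q.replace("woodworkers", "carpenters"),
    threshold=0.6,
)
```

Returning an empty document list when the budget runs out gives the generation step a clear signal to say "not found" instead of answering from irrelevant context.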

WHAT MAKES THIS NON-GENERIC

This project is both a GitHub repository and a published technical article. The combination of research-paper implementation + measured benchmarks + written findings is exactly what AI research roles and senior engineering roles look for. Your data science background makes the statistical analysis in the blog post credible and rigorous — you know how to run a proper experiment and present results.

INTERVIEW QUESTIONS THIS PREPARES
→ Walk me through the CRAG algorithm step by step.
→ How do you programmatically detect when an AI is hallucinating?
→ What are the known failure modes of standard RAG?

TECH STACK

LangGraph · LangSmith · Qdrant · Cohere Rerank · RAGAS · HuggingFace (NLI) · FastAPI

NEW TERM: NLI

Natural Language Inference (NLI) is a classical NLP task: given a premise sentence and a hypothesis sentence, determine whether the hypothesis is supported by the premise (entailment), contradicted by it (contradiction), or neither (neutral). Used here to check if generated answers are supported by retrieved documents. You likely already know the underlying ML (it's a classification model).

MLOPS + RESEARCH STUDY ★ PORTFOLIO

DIFFICULTY

DURATION: 3–4 wks

PROJECT 07

Fine-Tuning vs RAG Benchmark + Synthetic Data Engine

In plain English: Empirically answer the question every AI team argues about — "should we fine-tune a model or use RAG?" — by running a controlled experiment on a domain you know, and publishing the results.

You will fine-tune a small open-source model (Mistral or Llama) using your own synthetically generated training data, then run a rigorous 4-way comparison: base model vs fine-tuned vs RAG-only vs fine-tuned+RAG. Your data science background makes the statistical design of this experiment rigorous. The published findings become a career asset.

LoRA / QLoRA Fine-Tuning · Synthetic Data Generation · Evol-Instruct Method · GGUF Quantization · Ollama Local Inference · LLM Benchmarking · W&B Experiment Tracking

THE REAL-WORLD PROBLEM YOU ARE SOLVING

"Should we fine-tune or use RAG?" is asked in every senior AI engineering interview and every enterprise AI project meeting. Most people give an opinion. You will give data. This project is also where your idea for a "Synthetic Data Creator" is implemented — not as a standalone demo but as the engine that generates thousands of training examples you actually need, published as a separate open-source CLI tool.

WHAT YOU WILL BUILD — STEP BY STEP

Synthetic Data Engine (CLI tool, published separately). A command-line tool that generates domain-specific training data using Claude and the Evol-Instruct method. Evol-Instruct takes simple questions and "evolves" them into harder variants (add constraints, reasoning steps, domain jargon). You configure: number of examples, difficulty distribution, domain, and output format (JSONL for fine-tuning). This becomes its own open-source GitHub repository.

Domain dataset generation. Using your Synthetic Data Engine, generate 5,000+ domain-specific Q&A pairs in a domain you know (legal, medical, financial, or whatever you worked on as a data scientist). This gives your fine-tuned model a genuine knowledge advantage in that domain.

LoRA/QLoRA fine-tuning. Fine-tune Mistral 7B or Llama 3.1 8B using the Unsloth library, which markedly speeds up fine-tuning and lowers memory use enough for consumer hardware (a single consumer GPU or Google Colab). LoRA freezes the original weights and trains small low-rank "adapter" matrices alongside them instead of retraining all parameters. Most production fine-tuning is done this way in 2026.

GGUF quantization + Ollama deployment. Convert your fine-tuned model to GGUF (the llama.cpp file format, which stores quantized weights and runs on CPU as well as GPU) and deploy it locally using Ollama. This lets you run benchmarks without paying for cloud GPUs.

4-way automated benchmark. Build a pipeline that runs the same 200 test questions through all four configurations — base model, fine-tuned, RAG-only, fine-tuned+RAG — and scores each answer using an LLM-as-judge and a rubric. Results are tracked in Weights & Biases (W&B) for reproducibility.

Published findings. Write the report: "When Fine-Tuning Beats RAG, and When It Doesn't — With Data." Include your actual benchmark numbers broken down by question type (factual recall, reasoning, synthesis), plus cost per query and knowledge update cost for each approach. Publish on Medium and Hugging Face.

WHAT MAKES THIS NON-GENERIC

The written report is the primary asset. Engineers who can design experiments, run them rigorously, and communicate findings clearly are rare. This study answers a question that companies pay AI consultants to answer. The Synthetic Data Engine, published separately on GitHub, gives you a second portfolio artifact from the same project. Both together show breadth (engineering tool) and depth (scientific study).

INTERVIEW QUESTIONS THIS PREPARES
→ When would you choose fine-tuning over RAG? Give a decision framework.
→ What is LoRA mathematically, and why does it work?
→ How do you generate training data when you have no labeled dataset?

TECH STACK

Unsloth · LoRA/QLoRA · Ollama · HuggingFace Transformers · LangChain · RAGAS · W&B · FastAPI

KEY TERMS YOU WILL LEARN

LoRA (Low-Rank Adaptation): A technique that fine-tunes LLMs by training small low-rank adapter matrices while the original billions of parameters stay frozen. It is often cited as using 80–90% less memory than full fine-tuning; the actual saving depends on model size, rank, and quantization. GGUF: The llama.cpp file format for LLM weights, usually quantized, that runs efficiently on CPU. Evol-Instruct: A method to generate progressively harder training examples from simple seeds using an LLM.
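The arithmetic behind LoRA's savings is easy to check. For a frozen weight matrix W of shape d_out × d_in, LoRA learns an update B·A with B of shape d_out × r and A of shape r × d_in, so the layer computes Wx + (alpha / r)·BAx and only r·(d_in + d_out) values are trained. The sketch below counts parameters for one such matrix. The function names and the 4096 × 4096 example are illustrative assumptions.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in the frozen matrix W."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of weights in the adapters B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


d = 4096  # hypothetical projection size
trainable = lora_trainable_params(d, d, rank=16)
fraction = trainable / full_params(d, d)
```

At rank 16 the adapters for this one matrix hold under 1% as many values as the matrix itself. This is the mathematical core of the interview answer to "what is LoRA and why is it cheap?"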

FULL-STACK PRODUCTION SYSTEM — CAPSTONE ★ PORTFOLIO ✦ YOUR IDEA

DIFFICULTY

DURATION: 4–5 wks

PROJECT 08 — CAPSTONE

Carpenters of Quebec — Production Agentic RAG Platform

In plain English: Deploy a real, working AI platform that Quebec tradespeople can actually use — in French, with multiple organizations each having their own isolated workspace, and an intelligent agent that routes different types of questions to different specialized retrieval systems.

This is the capstone. Every technique from every previous project comes together here: the RAG system from P1, the agent loop from P3, the MCP tools from P5, the self-correcting retrieval from P6 — all deployed at a real URL, serving real users, in French. Domain specificity is a portfolio superpower: "production AI platform for Quebec tradespeople" is unforgettable. "RAG chatbot" is not.

Multi-Tenant Architecture · French NLP / Multilingual Embeddings · Agentic RAG Routing · Auth + Authorization · Production Cloud Deployment · AI Product Design

THE REAL-WORLD PROBLEM YOU ARE SOLVING

Quebec carpenters and tradespeople deal with CCQ (Commission de la construction du Québec) regulations, CNESST (formerly CSST) safety standards, union agreements, and supplier catalogs, all in French and all spread across PDFs and government websites. Finding a specific regulation requires knowing which document to look in, which requires knowing the regulatory structure, which takes years of experience. This platform makes that knowledge instantly accessible to anyone. The specificity of the domain is what makes the project credible and memorable: it solves a real problem for real people.

WHAT YOU WILL BUILD — STEP BY STEP

Multi-tenant ingestion pipeline. Each organization (carpentry company, union, trade school) gets its own isolated partition in the vector database. Organization A cannot see Organization B's documents. You will learn how to implement data isolation in Qdrant, either with one collection per tenant or with a tenant ID stored in each point's payload and applied as a filter on every query, and how to enforce it at the API level with authorization middleware.
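The isolation rule above (one organization never sees another's documents) can be sketched without any database. The toy store below stands in for the vector database, and `org_id` is an assumed claim name. The point it illustrates is that the tenant filter lives inside the storage layer and the tenant comes from the verified token, never from user input.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    tenant_id: str
    text: str


@dataclass
class TenantScopedStore:
    """Toy in-memory store; a real system would delegate to the vector database."""
    chunks: list = field(default_factory=list)

    def add(self, tenant_id: str, text: str) -> None:
        self.chunks.append(Chunk(tenant_id, text))

    def search(self, tenant_id: str, term: str) -> list:
        # The tenant filter is applied unconditionally; callers cannot omit it.
        return [
            c.text
            for c in self.chunks
            if c.tenant_id == tenant_id and term.lower() in c.text.lower()
        ]


def handle_search(store: TenantScopedStore, token_claims: dict, term: str) -> list:
    # The tenant is read from the verified token, not from the request body.
    return store.search(token_claims["org_id"], term)
```

Substring matching here replaces vector similarity purely to keep the sketch runnable. The design choice that carries over is that no code path can query the store without a tenant ID.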

French-language embedding stack. Embedding models trained mostly on English often retrieve poorly on French text. You will use multilingual-e5-large (a strong open multilingual embedding model) and document your comparison against OpenAI's text-embedding-3-small on your own French test queries. This is a technical decision that shows depth: most developers never think about it until retrieval quality drops.
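One way to run that comparison is to score each model with recall@k on a small labeled set of French questions. The sketch below assumes you supply the embedding functions yourself. E5 models expect the `query: ` and `passage: ` prefixes shown here, while the helper names and the evaluation set format are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]


def e5_query(text: str) -> str:
    return f"query: {text}"      # prefix expected by E5 models for queries


def e5_passage(text: str) -> str:
    return f"passage: {text}"    # prefix expected by E5 models for documents


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    embed_query: Callable[[str], Vector],
    embed_passage: Callable[[str], Vector],
    queries: List[str],
    passages: List[str],
    gold: List[int],
    k: int,
) -> float:
    """Fraction of queries whose labeled passage index appears in the top k."""
    passage_vecs = [embed_passage(p) for p in passages]
    hits = 0
    for question, gold_idx in zip(queries, gold):
        qv = embed_query(question)
        ranked = sorted(
            range(len(passages)),
            key=lambda i: cosine(qv, passage_vecs[i]),
            reverse=True,
        )
        if gold_idx in ranked[:k]:
            hits += 1
    return hits / len(queries)
```

Call `recall_at_k` once per model with the same queries and labels, wrapping the E5 model's inputs with `e5_query` and `e5_passage` and passing the other model's embedding function unwrapped.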

Agentic RAG with LangGraph query router. An incoming question is first classified by type: regulatory (needs CCQ documents), safety (needs CNESST, formerly CSST, documents), technical (needs trade manuals), or commercial (needs supplier catalogs). The classifier routes each question to a specialized sub-agent with domain-specific retrieval configuration and prompts. This is the Agentic RAG pattern: retrieval decisions are made by an agent, not a static pipeline.
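The routing step can be sketched in plain Python before wiring it into LangGraph. In this sketch a keyword lookup stands in for the LLM classifier, and the keyword lists and the `make_agent` stub are hypothetical placeholders. Only the four route names come from the description above.

```python
from typing import Callable, Dict

ROUTES = ("regulatory", "safety", "technical", "commercial")

# Hypothetical keyword lists; the project would replace this with an LLM call.
KEYWORDS = {
    "regulatory": ("ccq", "convention", "règlement"),
    "safety": ("sécurité", "harnais", "échafaudage"),
    "commercial": ("prix", "fournisseur", "catalogue"),
}


def classify(question: str) -> str:
    """Placeholder classifier: first route with a matching keyword wins."""
    q = question.lower()
    for route, words in KEYWORDS.items():
        if any(word in q for word in words):
            return route
    return "technical"  # fallback when nothing matches


def make_agent(route: str) -> Callable[[str], str]:
    # Stub sub-agent; a real one would own its retriever settings and prompt.
    def agent(question: str) -> str:
        return f"[{route}] retrieved context for: {question}"
    return agent


AGENTS: Dict[str, Callable[[str], str]] = {r: make_agent(r) for r in ROUTES}


def answer(question: str) -> str:
    return AGENTS[classify(question)](question)
```

Keeping `classify` and the `AGENTS` table separate means the classifier can be swapped for a model call without touching the sub-agents.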

Self-correcting retrieval from P6. The CRAG algorithm from Project 6 is integrated: retrieved documents are graded for relevance, poor results trigger query reformulation, and hallucinations are flagged before responses are sent.
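A simplified sketch of the grade-then-reformulate loop described above. It is not the full CRAG algorithm and it omits the hallucination check. The `retrieve`, `grade`, and `rewrite` callables, the threshold, and the attempt limit are all assumptions that you would back with your retriever and LLM calls.

```python
from typing import Callable, List, Tuple


def corrective_retrieve(
    question: str,
    retrieve: Callable[[str], List[str]],
    grade: Callable[[str, str], float],
    rewrite: Callable[[str], str],
    threshold: float = 0.5,
    max_attempts: int = 3,
) -> Tuple[List[str], int]:
    """Keep documents graded relevant; rewrite the query when none pass.

    Returns the relevant documents and the number of attempts used.
    """
    query = question
    for attempt in range(1, max_attempts + 1):
        docs = retrieve(query)
        # Grade against the original question, not the rewritten query.
        relevant = [d for d in docs if grade(question, d) >= threshold]
        if relevant:
            return relevant, attempt
        query = rewrite(query)
    return [], max_attempts
```

An empty result after `max_attempts` is a useful signal in itself: the caller can answer "no reliable source found" instead of generating from weak context.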

Full authentication system. JWT-based user accounts, organization workspaces, document upload permissions, and usage quotas per subscription tier. This is what makes it multi-tenant: every request is tied to an authenticated user with a specific set of permissions.
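To show what "every request is tied to an authenticated user" means mechanically, here is a minimal HS256 token sketch using only the standard library. It is for illustration; in the project you would use a maintained JWT library. The claim names `org_id` and `tier` and the quota numbers are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

TIER_LIMITS = {"free": 50, "pro": 1000}  # hypothetical daily query quotas


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    sig = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(sig)}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        raise PermissionError("malformed token")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise PermissionError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) < (time.time() if now is None else now):
        raise PermissionError("token expired")
    return claims


def within_quota(claims: dict, used_today: int) -> bool:
    return used_today < TIER_LIMITS[claims["tier"]]
```

Middleware would call `verify_token` on each request, then pass the returned claims to the tenant filter and the quota check.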

Admin dashboard. Organization administrators see: which documents are uploaded, usage analytics (most searched topics, questions with no good answers, response quality ratings), and user management. This data tells you how to improve the system.
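The three analytics named above can all be derived from one query log table. The sketch below assumes a hypothetical log record with a topic label, the best retrieval score, and an optional user rating. The field names and the 0.5 score cutoff are illustrative.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryLog:
    topic: str
    question: str
    top_score: float        # best retrieval relevance score, 0..1
    rating: Optional[int]   # user rating 1..5, None if not rated


def summarize(logs: List[QueryLog], min_score: float = 0.5, top_n: int = 3) -> dict:
    ratings = [log.rating for log in logs if log.rating is not None]
    return {
        # Most searched topics, as (topic, count) pairs.
        "top_topics": Counter(log.topic for log in logs).most_common(top_n),
        # Questions where retrieval found nothing convincing.
        "unanswered": [log.question for log in logs if log.top_score < min_score],
        "avg_rating": sum(ratings) / len(ratings) if ratings else None,
    }
```

The `unanswered` list doubles as a backlog: each entry points to a document that is missing or a chunk that retrieval fails to surface.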

Live deployment at a real URL. Docker + Cloud Run or Railway.app. Always on. A real URL you can send to anyone. This is the difference between a project and a product. Write a one-page technical case study: problem, architecture diagram, key technical decisions, performance metrics, what you would do differently.

WHAT MAKES THIS NON-GENERIC

A live URL at a real domain, for real potential users, with a bilingual technical stack, multi-tenant architecture, and a written case study. Domain specificity is what separates this from the 10,000 generic RAG chatbots in portfolios. "I built a production AI platform serving Quebec tradespeople in French, handling regulatory, safety, and commercial queries through specialized sub-agents" is a story. "I built a RAG chatbot" is not.

INTERVIEW QUESTIONS THIS PREPARES
→ How do you isolate data between different tenants in a vector database?
→ What changes when your embedding and generation language is not English?
→ Walk me through your full system architecture from user query to response.

TECH STACK

LangGraph · LangSmith · Qdrant (payload-based multitenancy) · multilingual-e5-large · FastAPI · Next.js · PostgreSQL · Docker · Cloud Run / Railway

DEPLOYMENT RECOMMENDATION

Use Railway.app for your first deployment: it handles Docker containers, databases, and environment variables with minimal configuration. Check its current pricing before you commit, since free-tier terms change. Once you're comfortable, migrate to Google Cloud Run for auto-scaling. Either platform is sufficient for this project. Keep it running permanently: a dead demo is worse than no demo.

From Data Scientist to AI Engineer.