AI ENGINEERING PORTFOLIO ROADMAP · 2026 EDITION · BOTIDESK.COM

From Data Scientist
to AI Engineer.

A project-by-project curriculum built on your existing data science foundation. Every project you build here produces a real, deployable artifact — something you can demo, share, and talk about in any job interview. Updated for March 2026 industry standards.

8
PROJECTS TOTAL
7
PORTFOLIO-READY
21–27 wks
FULL TIMELINE
3
PHASES
HOW TO USE THIS ROADMAP
① DO THE PHASES IN ORDER

Each phase builds on the previous one. Phase 1 gives you the foundations that Phase 2 relies on. Don't skip ahead.

② EXPAND EACH PROJECT CARD

Click any project to see the full details: what you'll build, what you'll learn, interview questions it prepares, and what tech stack to use.

③ READ THE GLOSSARY FIRST

If you see an industry term you don't know, it is explained in the Glossary section below. Industry vocabulary is important — learn the language.

④ BUILD, DEPLOY, DOCUMENT

Every project should end with a live demo URL and a written technical summary. A working URL is worth 10 GitHub repositories to a recruiter.

INDUSTRY VOCABULARY
Key Terms, Explained Simply

These are the terms you will hear in interviews, job postings, and team meetings. Learn what each one means in plain English before you start building.

RAG
"Give the AI access to documents before it answers"
Retrieval-Augmented Generation. Instead of relying only on what the AI learned during training, you first search your own documents for relevant information, then give that information to the AI along with the question. Result: accurate, up-to-date answers based on your real data.
AI AGENT
"An AI that takes actions by itself to complete a task"
An agent is an AI system that can decide what to do next, use tools (like running code or searching the web), check its results, and repeat until the task is done. Unlike a chatbot that just answers, an agent actually does work.
MCP
"The universal cable connector for AI tools"
Model Context Protocol. A standard created by Anthropic (the makers of Claude) that lets any AI model connect to any external tool or database using the same format. Think of it like USB-C — instead of a different cable for every device, one standard works for everything. Now adopted by OpenAI, Google, and Microsoft.
LangChain / LangGraph
"Toolkits for building AI pipelines and agents"
LangChain is a library that makes it easier to connect LLMs with tools, databases, and APIs. LangGraph is built on top of LangChain and lets you design AI workflows as a flowchart (called a graph) — with branching, looping, and error recovery. Used by 43% of companies building AI systems in 2025.
CONTEXT ENGINEERING
"Designing exactly what information the AI receives"
The 2026 evolution of "prompt engineering." Instead of just writing a good instruction (prompt), context engineering is about designing the entire information package the AI receives: the instructions, the retrieved documents, the conversation history, the available tools, and how all of it is structured.
VECTOR DATABASE
"A database that searches by meaning, not keywords"
Traditional databases search for exact words. A vector database converts text into numbers that represent its meaning, then finds documents that are similar in meaning to your query. Essential for RAG systems. Common tools: Qdrant, Weaviate, Pinecone.
EMBEDDING
"Converting text into a list of numbers that captures its meaning"
An embedding model converts any piece of text into a long list of numbers (called a vector) that represents what that text means. Two sentences with similar meanings will have similar numbers. This is how vector databases can search by meaning.
FINE-TUNING
"Teaching an existing AI model to be better at your specific task"
Instead of building an AI from scratch, fine-tuning takes an existing model (like Llama or Mistral) and trains it further on your specific data. The result is a model that performs much better in your domain. LoRA and QLoRA are efficient techniques that make this possible on regular hardware.
LLM EVALUATION (EVALS)
"Measuring how well your AI system actually works"
Evals are tests you design to measure your AI system's quality. RAGAS is a popular framework that measures things like: "Is the answer faithful to the retrieved document?" and "Is the retrieved document actually relevant to the question?" Without evals, you cannot know if your system is good or bad.
OBSERVABILITY / TRACING
"Seeing exactly what the AI did, step by step"
In production AI systems, you need to log every LLM call: what was the input, what was the output, how long did it take, how much did it cost. LangSmith and Langfuse are tools that do this automatically, so you can debug problems and optimize performance.
HUMAN-IN-THE-LOOP
"The AI pauses and asks a human for approval before continuing"
For high-stakes actions (deleting files, sending emails, making purchases), agents should pause and get human approval before proceeding. Designing where in the workflow to add these checkpoints is one of the most important skills in agent engineering.
MULTI-TENANT
"One system that serves many different users or organizations, each isolated from each other"
If you build a product where Company A and Company B both use it, their data must be completely separated. Company A should never see Company B's documents or history. Building this correctly is a key production engineering skill.
CURRICULUM
The Three Phases

The roadmap is divided into three phases. Each phase has a clear goal and produces real, deployable projects. Do them in order.

PHASE 01 · 5–7 WEEKS
Foundations & Tooling
Build the two systems every future project depends on. RAG and structured data extraction.
PHASE 02 · 7–9 WEEKS
Agentic Systems
Build three agents: one for data analysis, one for research, one that speaks the language of 2026.
PHASE 03 · 9–11 WEEKS
Production Engineering
Ship systems that self-correct, scale to real users, and prove their own quality.
PHASE 01 — FOUNDATIONS & TOOLING
Build the Core Infrastructure

Before you build agents, you need to know how to retrieve information accurately and extract structured data reliably. These are the two most-used skills in AI engineering. Both projects are portfolio-ready and immediately recognized by hiring managers.

5–7
WEEKS
RAG ENGINEERING ★ PORTFOLIO
DIFFICULTY
2–3 wks
DURATION
+
PROJECT 01
Production RAG Pipeline
In plain English: Turn a pile of documents into a smart search engine that answers questions accurately — and prove with numbers that it works.
You will build a document retrieval system that ingests PDFs, Word files, and web pages, finds the most relevant content for any question, and generates a grounded answer. More importantly, you will build an evaluation framework that measures how good your system actually is — something most tutorial-level projects never do.
RAG ArchitectureVector DatabasesEmbedding ModelsHybrid SearchRAGAS EvaluationFastAPIStreaming
THE REAL-WORLD PROBLEM YOU ARE SOLVING

Every company has thousands of internal documents — manuals, reports, contracts — that nobody can quickly search. A basic keyword search returns too many irrelevant results. This project builds the system that actually understands the meaning of a question and finds the right answer in a large collection of documents. The key differentiator: you will measure your system's accuracy with real metrics, not just "it seems to work."

WHAT YOU WILL BUILD — STEP BY STEP
01
Document ingestion pipeline. A system that reads PDFs, HTML pages, Markdown files, and Word documents, then splits them into chunks. You will try three splitting strategies: fixed size, recursive (smarter), and semantic (splits by meaning). You will learn why the choice of chunk size matters for accuracy.
02
Hybrid retrieval system. You will combine two search methods: dense search (finds documents with similar meaning, using embeddings) and sparse search (BM25, finds documents with matching keywords). Combining both is more accurate than either alone. Then you add a re-ranker that picks the best results from the combined list.
03
Evaluation dashboard. Using the RAGAS framework, you will measure four things: (1) Is the answer faithful to the retrieved documents? (2) Are the retrieved documents actually relevant to the question? (3) How complete is the answer? (4) How precise is the context? These numbers will be displayed in a dashboard you can screenshot for your portfolio.
04
FastAPI service with streaming. You will wrap everything in a REST API with Server-Sent Events (SSE), so the answer appears word by word in real time — like ChatGPT. This is the production-standard way to serve LLM responses.
05
Comparison UI. A simple web interface where you can run the same question through different retrieval strategies and see the accuracy and speed side by side. This is the thing you show in interviews to prove your system works.
WHAT MAKES THIS NON-GENERIC

Most RAG tutorials build a chatbot that "seems to work" but has no measurements. Your version has a live benchmark dashboard with real accuracy numbers (RAGAS scores). When an interviewer asks "how do you know it works?", you will have a precise, quantitative answer. This is the difference between a junior and a senior AI engineer's approach.


INTERVIEW QUESTIONS THIS PREPARES
Why does chunk size affect retrieval quality?
When does dense search beat sparse search, and why?
How do you evaluate an AI system when you have no labeled data?
TECH STACK
LangChainQdrantFastAPIRAGASSentence-TransformersBM25sReactDocker

NEW INDUSTRY TERM YOU WILL LEARN

Hybrid Search: The combination of meaning-based search (dense) and keyword-based search (sparse). It consistently outperforms either method alone. RRF (Reciprocal Rank Fusion) is the algorithm that merges the two result lists. This is standard in all serious production RAG systems in 2026.

AGENTS + STRUCTURED EXTRACTION ★ PORTFOLIO ✦ YOUR IDEA
DIFFICULTY
2–3 wks
DURATION
+
PROJECT 02
AI Resume Builder + Job Intelligence Agent
In plain English: Type in a job posting URL. The agent reads it, understands what the company wants, and rewrites your resume specifically for that job — in 30 seconds.
You will build an agent that scrapes any job posting, extracts what the company is really looking for (even the unstated requirements), matches those requirements against your experience, and generates a tailored resume with a match score. Your demo audience is recruiters — you can literally demo this in the interview itself.
LangGraph (Basic)Web Scraping + LLM ExtractionStructured Outputs / PydanticPrompt EngineeringRAG as Knowledge BasePDF Generation
THE REAL-WORLD PROBLEM YOU ARE SOLVING

Customizing a resume for each job application takes 1–2 hours and most people skip it. Those who do it manually miss half the important keywords. This agent automates the whole pipeline: read the job → understand requirements → match to your experience → write a tailored resume → score how well it fits. A 2-hour task becomes 30 seconds. This also teaches you LangGraph at a beginner level, which you will use at an advanced level in Phase 2.

WHAT YOU WILL BUILD — STEP BY STEP
01
Web scraper tool. Given any job posting URL, the scraper extracts the raw text. An LLM then converts that messy text into clean, structured data: required skills, nice-to-have skills, seniority level, company culture keywords, and the hidden requirements buried in the description.
02
Candidate knowledge base. Your experience, projects, education, and skills are stored as a RAG corpus (the system you built in Project 1). When writing the resume, the agent retrieves the most relevant parts of your background for each job requirement. This ensures the resume is accurate — the AI only uses real experience you have.
03
ATS keyword optimizer. ATS stands for Applicant Tracking System — the software most companies use to filter resumes before a human sees them. Your agent compares the generated resume against the job description, flags missing important keywords, and suggests additions. It does this without inventing experience you don't have.
04
Structured output pipeline. Using Pydantic schemas (a Python tool for defining data structures), every resume section is generated in a consistent, well-formatted structure. This is how you make LLM outputs reliable and predictable — a critical production skill.
05
Three output formats + match score. The system generates: (1) a formatted PDF resume, (2) an ATS-plain-text version, and (3) a LinkedIn summary. It also gives a 0–100 match score that explains which requirements you meet strongly and which are gaps.
06
LangGraph orchestration. The full pipeline runs as a LangGraph state machine: Scraper → Extractor → Matcher → Writer → Optimizer nodes. Each step passes its result to the next. You will learn how to build, debug, and trace a basic agent graph — the foundation for everything in Phase 2.
WHAT MAKES THIS NON-GENERIC

Your portfolio audience is recruiters. Showing a recruiter a live demo where you type a job URL and watch a tailored resume appear in 30 seconds is immediately understood. It also demonstrates a complete engineering pipeline: data extraction → retrieval → generation → evaluation — all in one coherent product. The match score makes it measurable, not just "it generated something."


INTERVIEW QUESTIONS THIS PREPARES
How do you extract structured data reliably from messy web pages?
What is Pydantic and why does it matter for LLM outputs?
How would you prevent the agent from hallucinating experience you don't have?
TECH STACK
LangGraphBeautifulSoupInstructor / PydanticLangChainFastAPIWeasyPrintReact

NEW INDUSTRY TERM YOU WILL LEARN

Structured Outputs: Getting an LLM to return data in a predictable format (like JSON) instead of free-form text. Instructor is a Python library that wraps LLMs and forces them to return data that matches a Pydantic schema. Without this, LLM outputs are unreliable in production systems.

PHASE 02 — AGENTIC SYSTEMS
Build Agents That Do Real Work

In Phase 2, you stop building pipelines and start building agents — systems that reason, plan, use tools, and work autonomously. You will build three agents, each teaching a different architectural pattern. Phase 2 also includes your first MCP project, which is the most important new protocol in AI engineering as of 2026.

7–9
WEEKS
LANGGRAPH AGENTIC LOOP ★ PORTFOLIO
DIFFICULTY
2–3 wks
DURATION
+
PROJECT 03
Autonomous Data Analyst Agent
In plain English: Upload any spreadsheet or dataset. The agent figures out what's interesting, writes the code to analyze it, runs the code, fixes any errors by itself, and delivers a report — without you writing a single line of analysis code.
This is where your data science background becomes a superpower. You know exactly what good data analysis looks like. Now you encode that knowledge into a set of tools that an AI agent can use autonomously. The agent does in 2 minutes what used to take you 2 hours.
LangGraph State MachinesTool Use / Function CallingReAct Agent PatternHuman-in-the-LoopCode Execution SandboxingStreaming UI
THE REAL-WORLD PROBLEM YOU ARE SOLVING

Every company has data analysts who spend hours doing the same exploratory analysis on every new dataset: check distributions, find outliers, look for correlations, test hypotheses, make charts. This is valuable work but it follows a repeatable pattern. This agent automates the pattern while preserving the insight. Because you already know what good analysis looks like, you are uniquely qualified to design this agent's tools correctly.

WHAT YOU WILL BUILD — STEP BY STEP
01
LangGraph state machine with five nodes. The agent's workflow is: Plan (decide what to analyze) → Code (write Python to do the analysis) → Execute (run the code safely) → Reflect (read the output and decide what to do next) → Report (write a narrative report). Each step is a separate node. LangGraph manages the flow and lets you add conditional logic — for example, if code execution fails, go back to the Code node and fix it.
02
Data science tool suite. You will build a set of tools the agent can call: pandas profiling (automatic summary of any dataset), statistical tests (t-test, chi-square, correlation), anomaly detection, and Plotly chart generation. These tools encode your data science knowledge — the agent does not need to figure out how to do statistics, it just calls the right tool.
03
Safe Python code execution. The agent writes and runs real Python code. You will build a sandboxed execution environment with: timeout handling (stops runaway code), stdout and stderr capture (the agent reads its own error messages), and error recovery (the agent rewrites its code when it sees an error). This is how production AI coding systems work.
04
Human-in-the-loop checkpoints. If the agent detects a problem — unusual data types, very high missing-value rates, conflicting distributions — it pauses and asks you a clarifying question before continuing. You will learn where to put human checkpoints and where to let the agent run freely.
05
Auto-generated HTML report. The final output is a professional HTML report containing: executive summary, all generated visualizations, statistical findings, and the agent's code. The report looks like something a senior data analyst wrote — because the tools behind it were designed by one (you).
WHAT MAKES THIS NON-GENERIC

Most agent demos do trivial tasks like "search the web and summarize results." This agent does complex analytical work on real data. You can walk into any interview, hand them a CSV, run the agent live, and show a complete analysis report in 2 minutes. Your DS background is the reason the tools are actually good — not toy examples. This combination of domain expertise + agent architecture is rare.


INTERVIEW QUESTIONS THIS PREPARES
How do you stop an agent from running dangerous or infinite-loop code?
When do you add human-in-the-loop versus full automation?
How does LangGraph's state machine differ from a simple while loop?
TECH STACK
LangGraphClaude APIPython REPL ToolPandasPlotlyFastAPIReactDocker

KEY AGENT PATTERN YOU WILL LEARN

ReAct Pattern: Reason + Act. The agent alternates between thinking about what to do next (reason) and actually doing it (act), then observes the result and reasons again. This back-and-forth loop is how most real-world agents work. LangGraph makes this loop explicit and controllable.

MULTI-AGENT PIPELINE ★ PORTFOLIO ✦ YOUR IDEA
DIFFICULTY
3–4 wks
DURATION
+
PROJECT 04
Research Paper → Implementation Agent
In plain English: Give the agent a scientific paper PDF. It reads the paper, understands the method, writes the code to reproduce the experiments, runs them, and tells you how close the results are to what the paper claimed.
This is the most unique project in the entire roadmap. Reproducing ML research is something PhD students spend weeks doing. Your multi-agent system does it autonomously — and when it can't fully reproduce results (due to compute limits), it explains exactly why and estimates what would happen at full scale.
Multi-Agent OrchestrationLangGraph Sub-AgentsPDF ParsingAutomated Code GenerationCompute-Aware PlanningScientific Benchmarking
THE REAL-WORLD PROBLEM YOU ARE SOLVING

AI research is published at a pace that is impossible to follow manually. Researchers and engineers need to quickly evaluate whether a new technique works, but reproducing the experiments takes days or weeks. This agent compresses that to hours. It also handles the common situation where the paper required 8 powerful GPUs but you only have a laptop — the agent scales the experiment down and tells you how confident the scaled result is. This is recursive proof of AI capability: an AI agent that understands and implements AI research.

WHAT YOU WILL BUILD — STEP BY STEP
01
Parser Agent. Reads the PDF and extracts structured information: the model architecture, the training procedure, the hyperparameters (learning rate, batch size, etc.), the datasets used, and the metrics reported. Handles figures, tables, and equations. Outputs clean structured JSON.
02
Planner Agent. Takes the Parser Agent's output and creates an ordered implementation checklist. It identifies what code needs to be written in what order, what Python packages are needed, and which parts of the paper are underspecified (missing details that are common in academic papers).
03
Coder Agent. Writes modular Python code to implement the paper's method. It uses the paper's exact variable names and notation to make the code traceable back to the paper. It also writes unit tests for each component.
04
Environment Agent. Reads the dependency list from the Planner, creates a virtual environment, installs all packages, and handles version conflicts. Fully automated — no manual setup required.
05
Evaluator Agent. Runs the experiments and captures the metrics. Then compares your results to the paper's reported numbers, with statistical analysis: "Our result was X. The paper reported Y. The difference is Z, which is within/outside the expected range of experimental variance."
06
Compute Estimator. The most unique feature. If the paper required more compute than you have, the agent runs a scaled-down version and extrapolates: "At 10% scale, we got X. Based on the scaling curves in the paper, we estimate full-scale would produce Y with ±Z confidence." This handles the real-world situation where most papers are not reproducible on a laptop.
07
Reporter Agent. Generates a structured reproduction report: what matched, what diverged, why it likely diverged, and what hardware/time would be needed to close the gap. This report is the final portfolio artifact.
WHAT MAKES THIS NON-GENERIC

No one has this in their portfolio. It is technically deep (multi-agent, code generation, scientific evaluation), immediately impressive in a demo, and understood by any ML researcher or AI engineer. The Compute Estimator feature alone is a conversation-stopper in interviews. It shows you understand real-world constraints, not just ideal scenarios.


INTERVIEW QUESTIONS THIS PREPARES
How do you handle a paper that omits important implementation details?
How do you architect multi-agent systems to handle partial failures?
What is the difference between a single agent with many tools vs multiple specialized agents?
TECH STACK
LangGraphClaude APIPyMuPDFInstructor / PydanticDockersubprocessFastAPIReact

KEY PATTERN YOU WILL LEARN

Supervisor + Sub-Agent Architecture: One "boss" agent (the Supervisor) receives the main task and delegates subtasks to specialized workers. Each worker agent has its own tools and focus area. The Supervisor collects results and makes the final decision. This is the standard architecture for complex multi-agent systems.

MCP + CONTEXT ENGINEERING ★ PORTFOLIO ✦ NEW IN 2026
DIFFICULTY
2–3 wks
DURATION
+
PROJECT 05
Build Your Own MCP Server + Developer Intelligence Tool
In plain English: Build the "USB-C adapter" that lets any AI model connect to your own tools and data — then use it to create an AI tool that understands a codebase and helps developers work on it.
MCP (Model Context Protocol) is the most important new standard in AI engineering — adopted by Anthropic, OpenAI, Google, and Microsoft in 2025. Almost no portfolio projects demonstrate it. You will build a custom MCP server from scratch, then use it to power a developer tool that connects Claude to a GitHub repository and helps engineers navigate and improve code.
MCP ProtocolContext EngineeringClaude Code SDKTool DesignGitHub API IntegrationAPI Design
WHY MCP IS THE MOST IMPORTANT THING TO LEARN IN 2026

Before MCP, every AI tool that needed to connect to external data (GitHub, databases, Slack, etc.) required its own custom integration. A developer had to build separate code for "Claude connecting to GitHub" and "ChatGPT connecting to GitHub." With MCP, you build one server that connects to GitHub — and any MCP-compatible AI model can use it instantly. MCP was adopted by OpenAI in March 2025, Google DeepMind in April 2025, and is now the de facto standard. In 2026, "running an MCP server has become almost as common as running a web server." Knowing how to build one puts you ahead of most AI engineers.

WHAT YOU WILL BUILD — STEP BY STEP
01
Custom MCP server (the foundation). Using the Python MCP SDK, you will build a server that exposes three types of primitives: Tools (actions the AI can call, like "search this repo for a function"), Resources (data the AI can read, like "get the content of this file"), and Prompts (pre-written instructions for specific tasks). You will understand the JSON-RPC 2.0 protocol that powers MCP.
02
GitHub repository connector. Your MCP server will connect to any GitHub repository and expose tools: list files, read file contents, search code, get commit history, list open pull requests, and read PR diffs. Any AI model that supports MCP can now work with any GitHub repo through your server.
03
Codebase semantic indexer. Beyond basic file access, your server builds a semantic index of the repository: what each module does, how functions call each other, what design patterns are used. This is Context Engineering — you are designing the richest, most useful context an AI can receive about a codebase.
04
Developer intelligence tool. Using your MCP server, you will build a Claude-powered tool that: (1) reviews pull requests with architecture-level understanding, (2) scans repositories for technical debt and generates prioritized GitHub Issues, and (3) answers questions like "where is the authentication logic?" and "what would break if I changed this function?"
05
GitHub Action integration. Package the tool as a GitHub Action that automatically runs on every PR. Each PR gets an AI review that understands your codebase's architecture, not just the changed lines.
06
Context engineering documentation. Document exactly what information you chose to include in the agent's context, what you excluded, and why. This written analysis of your context design decisions is a portfolio artifact that proves senior-level thinking.
WHAT MAKES THIS NON-GENERIC

First-mover advantage. Almost nobody has an MCP server in their portfolio right now. Building one from scratch (not just using existing servers) demonstrates you understand the protocol at the architecture level, not just as a user. The fact that your server works with Claude, ChatGPT, and any future MCP-compatible AI model proves you understand the standardization — a key signal for senior roles.


INTERVIEW QUESTIONS THIS PREPARES
What is MCP and why did every major AI company adopt it?
What is the difference between a Tool, Resource, and Prompt in MCP?
What is Context Engineering and how does it differ from Prompt Engineering?
TECH STACK
Python MCP SDKClaude Code SDKGitHub API / PyGithubFastAPIPostgreSQLGitHub ActionsReact

2026 INDUSTRY CONTEXT

The Thoughtworks Technology Radar (the industry's most respected technology assessment) placed MCP in "Trial" status in late 2025 — meaning: use it in production now, but with awareness. By March 2026, it is the de facto standard for AI tool integration. Employers will increasingly expect AI engineers to know this protocol.

PHASE 03 — PRODUCTION ENGINEERING
Ship Systems That Prove Themselves

Phase 3 is about production-grade systems: systems that self-correct errors, serve real users, measure their own quality, and are deployed at a real URL. These three projects are what senior AI engineers work on. They require everything you built in Phases 1 and 2.

9–11
WEEKS
ADVANCED RAG — RESEARCH → PRODUCTION ★ PORTFOLIO
DIFFICULTY
3 wks
DURATION
+
PROJECT 06
Self-Correcting Adaptive RAG
In plain English: Upgrade your RAG system so it checks if the documents it retrieved are actually useful — and if not, it searches again with a better question. It also detects when it's making things up and flags those answers before they reach the user.
You will implement two published research techniques — CRAG (Corrective RAG) and Self-RAG — as a production system. These techniques fix the biggest problem with standard RAG: it blindly trusts whatever it retrieves, even when the documents are irrelevant. Implementing research papers in production is one of the most valued skills at frontier AI companies.
CRAG AlgorithmSelf-RAG TokensLLM-as-JudgeQuery ReformulationHallucination Detection (NLI)Adaptive RetrievalLangSmith Tracing
THE REAL-WORLD PROBLEM YOU ARE SOLVING

Standard RAG has a serious flaw: it always retrieves something, even when nothing relevant exists in the document collection. Then it generates an answer based on irrelevant documents — and the answer sounds confident but is wrong. CRAG and Self-RAG (published research papers from 2024) solve this with two mechanisms: (1) grade the retrieved documents before using them, and (2) let the model decide whether to retrieve at all. You are bringing cutting-edge research into a production system and measuring whether it actually improves accuracy.

WHAT YOU WILL BUILD — STEP BY STEP
01
CRAG (Corrective RAG) implementation. After retrieval, a "grader" LLM reads each retrieved document and gives it a relevance score (0.0 to 1.0). Documents that score below a threshold are discarded. If too many documents are discarded, the system automatically reformulates the original question into a better search query and retrieves again. This loop continues until useful documents are found.
02
Self-RAG implementation. Special decision tokens are added to the generation process that let the model decide: "Do I need to search for information, or can I answer from what I already know?" For knowledge-rich queries, skipping retrieval entirely is faster and equally accurate. Self-RAG reduces unnecessary retrieval by approximately 30%, making the system cheaper and faster.
03
Adaptive chunking. A query complexity classifier reads the incoming question and decides what chunk size to use for retrieval. Simple factual questions ("What year was X founded?") work best with large chunks. Complex multi-hop reasoning ("How did X's decision in 2018 affect Y's strategy in 2021?") works best with fine-grained chunks. The system adjusts automatically.
04
Hallucination detection layer. Using an NLI (Natural Language Inference) model, every sentence in the generated answer is checked against the retrieved source documents. If a sentence in the answer is not supported by the sources, it is flagged with a confidence score before being shown to the user. NLI is a classical NLP technique that your data science background makes easy to understand.
05
Technical blog post. Write and publish: "Standard RAG vs CRAG vs Self-RAG — A Benchmarked Comparison." Include your actual RAGAS numbers from all three systems. This post becomes a portfolio artifact and a source of inbound recruiter interest. This is how senior AI engineers establish credibility publicly.
WHAT MAKES THIS NON-GENERIC

This project is both a GitHub repository and a published technical article. The combination of research-paper implementation + measured benchmarks + written findings is exactly what AI research roles and senior engineering roles look for. Your data science background makes the statistical analysis in the blog post credible and rigorous — you know how to run a proper experiment and present results.


INTERVIEW QUESTIONS THIS PREPARES
Walk me through the CRAG algorithm step by step.
How do you programmatically detect when an AI is hallucinating?
What are the known failure modes of standard RAG?
TECH STACK
LangGraphLangSmithQdrantCohere RerankRAGASHuggingFace (NLI)FastAPI

NEW TERM: NLI

Natural Language Inference (NLI) is a classical NLP task: given a premise sentence and a hypothesis sentence, determine if the hypothesis is supported by the premise, contradicted by it, or unrelated. Used here to check if generated answers are supported by retrieved documents. You likely already know the underlying ML (it's a classification model).

MLOPS + RESEARCH STUDY ★ PORTFOLIO
DIFFICULTY
3–4 wks
DURATION
+
PROJECT 07
Fine-Tuning vs RAG Benchmark + Synthetic Data Engine
In plain English: Empirically answer the question every AI team argues about — "should we fine-tune a model or use RAG?" — by running a controlled experiment on a domain you know, and publishing the results.
You will fine-tune a small open-source model (Mistral or Llama) using your own synthetically generated training data, then run a rigorous 4-way comparison: base model vs fine-tuned vs RAG-only vs fine-tuned+RAG. Your data science background makes the statistical design of this experiment rigorous. The published findings become a career asset.
LoRA / QLoRA Fine-TuningSynthetic Data GenerationEvol-Instruct MethodGGUF QuantizationOllama Local InferenceLLM BenchmarkingW&B Experiment Tracking
THE REAL-WORLD PROBLEM YOU ARE SOLVING

"Should we fine-tune or use RAG?" is asked in every senior AI engineering interview and every enterprise AI project meeting. Most people give an opinion. You will give data. This project is also where your idea for a "Synthetic Data Creator" is implemented — not as a standalone demo but as the engine that generates thousands of training examples you actually need, published as a separate open-source CLI tool.

WHAT YOU WILL BUILD — STEP BY STEP
01
Synthetic Data Engine (CLI tool, published separately). A command-line tool that generates domain-specific training data using Claude and the Evol-Instruct method. Evol-Instruct takes simple questions and "evolves" them into harder variants (add constraints, reasoning steps, domain jargon). You configure: number of examples, difficulty distribution, domain, and output format (JSONL for fine-tuning). This becomes its own open-source GitHub repository.
02
Domain dataset generation. Using your Synthetic Data Engine, generate 5,000+ domain-specific Q&A pairs in a domain you know (legal, medical, financial, or whatever you worked on as a data scientist). This gives your fine-tuned model a genuine knowledge advantage in that domain.
03
LoRA/QLoRA fine-tuning. Fine-tune Mistral 7B or Llama 3.1 8B using the Unsloth library, which makes fine-tuning 4x faster on consumer hardware (a regular laptop GPU or Google Colab). LoRA works by adding small "adapter" layers to the model instead of retraining all parameters — this is how 99% of production fine-tuning is done in 2026.
04
GGUF quantization + Ollama deployment. Convert your fine-tuned model to GGUF format (a compressed format that runs on CPU) and deploy it locally using Ollama. This makes it fast enough to run benchmarks without expensive cloud GPU costs.
05
4-way automated benchmark. Build a pipeline that runs the same 200 test questions through all four configurations — base model, fine-tuned, RAG-only, fine-tuned+RAG — and scores each answer using an LLM-as-judge and a rubric. Results are tracked in Weights & Biases (W&B) for reproducibility.
06
Published findings. Write the report: "When Fine-Tuning Beats RAG, and When It Doesn't — With Data." Include your actual benchmark numbers broken down by question type (factual recall, reasoning, synthesis), plus cost per query and knowledge update cost for each approach. Publish on Medium and Hugging Face.
WHAT MAKES THIS NON-GENERIC

The written report is the primary asset. Engineers who can design experiments, run them rigorously, and communicate findings clearly are rare. This study answers a question that companies pay AI consultants to answer. The Synthetic Data Engine, published separately on GitHub, gives you a second portfolio artifact from the same project. Both together show breadth (engineering tool) and depth (scientific study).


INTERVIEW QUESTIONS THIS PREPARES
When would you choose fine-tuning over RAG? Give a decision framework.
What is LoRA mathematically, and why does it work?
How do you generate training data when you have no labeled dataset?
TECH STACK
UnslothLoRA/QLoRAOllamaHuggingFace TransformersLangChainRAGASW&BFastAPI

KEY TERMS YOU WILL LEARN

LoRA (Low-Rank Adaptation): A technique that fine-tunes LLMs by adding small adapter layers, instead of modifying all billions of parameters. Uses 80–90% less memory than full fine-tuning. GGUF: A file format for compressed LLM models that can run efficiently on CPU. Evol-Instruct: A method to generate progressively harder training examples from simple seeds using an LLM.

FULL-STACK PRODUCTION SYSTEM — CAPSTONE ★ PORTFOLIO ✦ YOUR IDEA
DIFFICULTY
4–5 wks
DURATION
+
PROJECT 08 — CAPSTONE
Carpenters of Quebec — Production Agentic RAG Platform
In plain English: Deploy a real, working AI platform that Quebec tradespeople can actually use — in French, with multiple organizations each having their own isolated workspace, and an intelligent agent that routes different types of questions to different specialized retrieval systems.
This is the capstone. Every technique from every previous project comes together here: the RAG system from P1, the agent loop from P3, the MCP tools from P5, the self-correcting retrieval from P6 — all deployed at a real URL, serving real users, in French. Domain specificity is a portfolio superpower: "production AI platform for Quebec tradespeople" is unforgettable. "RAG chatbot" is not.
Multi-Tenant ArchitectureFrench NLP / Multilingual EmbeddingsAgentic RAG RoutingAuth + AuthorizationProduction Cloud DeploymentAI Product Design
THE REAL-WORLD PROBLEM YOU ARE SOLVING

Quebec carpenters and tradespeople deal with CCQ (Commission de la construction du Québec) regulations, CSST safety standards, union agreements, and supplier catalogs — all in French, all spread across PDFs and government websites. Finding a specific regulation requires knowing which document to look in, which requires knowing the regulatory structure, which takes years of experience. This platform makes that knowledge instantly accessible to anyone. The specificity of the domain is what makes the project credible and memorable — it is solving a real problem for real people.

WHAT YOU WILL BUILD — STEP BY STEP
01
Multi-tenant ingestion pipeline. Each organization (carpentry company, union, trade school) gets its own isolated namespace in the vector database. Organization A cannot see Organization B's documents. You will learn how to implement data isolation in Qdrant using namespaces, and how to enforce it at the API level with authorization middleware.
02
French-language embedding stack. English embedding models perform poorly on French text. You will use multilingual-e5-large (the current best multilingual embedding model) and document your comparison against OpenAI's text-embedding-3-small. This is a technical decision that shows depth — most developers never think about this until retrieval quality drops.
03
Agentic RAG with LangGraph query router. An incoming question is first classified by type: regulatory (needs CCQ documents), safety (needs CSST documents), technical (needs trade manuals), or commercial (needs supplier catalogs). The classifier routes each question to a specialized sub-agent with domain-specific retrieval configuration and prompts. This is the Agentic RAG pattern — retrieval decisions are made by an agent, not a static pipeline.
04
Self-correcting retrieval from P6. The CRAG algorithm from Project 6 is integrated: retrieved documents are graded for relevance, poor results trigger query reformulation, and hallucinations are flagged before responses are sent.
05
Full authentication system. JWT-based user accounts, organization workspaces, document upload permissions, and usage quotas per subscription tier. This is what makes it multi-tenant: every request is tied to an authenticated user with a specific set of permissions.
06
Admin dashboard. Organization administrators see: which documents are uploaded, usage analytics (most searched topics, questions with no good answers, response quality ratings), and user management. This data tells you how to improve the system.
07
Live deployment at a real URL. Docker + Cloud Run or Railway.app. Always on. A real URL you can send to anyone. This is the difference between a project and a product. Write a one-page technical case study: problem, architecture diagram, key technical decisions, performance metrics, what you would do differently.
WHAT MAKES THIS NON-GENERIC

A live URL at a real domain, for real potential users, with a bilingual technical stack, multi-tenant architecture, and a written case study. Domain specificity is what separates this from the 10,000 generic RAG chatbots in portfolios. "I built a production AI platform serving Quebec tradespeople in French, handling regulatory, safety, and commercial queries through specialized sub-agents" is a story. "I built a RAG chatbot" is not.


INTERVIEW QUESTIONS THIS PREPARES
How do you isolate data between different tenants in a vector database?
What changes when your embedding and generation language is not English?
Walk me through your full system architecture from user query to response.
TECH STACK
LangGraphLangSmithQdrant (namespaces)multilingual-e5-largeFastAPINext.jsPostgreSQLDockerCloud Run / Railway

DEPLOYMENT RECOMMENDATION

Use Railway.app for your first deployment — it handles Docker containers, databases, and environment variables with minimal configuration, and has a generous free tier. Once you're comfortable, migrate to Google Cloud Run for auto-scaling. Both work perfectly for this project. Keep it running permanently: a dead demo is worse than no demo.

NOT STANDALONE PROJECTS
Skills to Build in Parallel

These are not separate projects — they are habits and tools you adopt as you go. Start each one when the timeline says "From Phase X." Each one makes every project better.

FROM PHASE 1
Structured Outputs
Instructor · Pydantic · Zod (JS)

Getting reliable, typed data from LLMs is essential for production. Raw LLM text output is unpredictable. Instructor + Pydantic forces LLMs to return data that matches a strict schema. Every agent tool call in this roadmap depends on this skill.

FROM PHASE 1
LLM Observability
LangSmith · Langfuse (open-source)

You cannot debug, optimize, or explain a system you cannot see. Every LLM call should be logged: input, output, latency, tokens, cost. LangSmith is the industry default (paid). Langfuse is the open-source alternative you can self-host. Start using one from your very first project.

FROM PHASE 2
Async Python
asyncio · httpx · aiofiles

Agent systems run many LLM calls at the same time. Synchronous (blocking) code kills performance because it waits for each call to finish before starting the next. Async Python lets you start multiple calls simultaneously. Learn this before building Phase 2 agents.

FROM PHASE 2
Docker + Cloud Deployment
Docker · Railway · Cloud Run

A live URL is worth 10 GitHub repositories to a recruiter. Docker packages your application so it runs identically anywhere. Railway.app or Cloud Run deploys it to the internet with minimal setup. Learn Docker basics in Phase 2 so every Phase 3 project ships with a live demo.

FROM PHASE 3
AI Red Teaming
Garak · Manual adversarial testing

Red teaming means deliberately trying to break your own AI system — testing for prompt injections, jailbreaks, misleading inputs, and failure modes. Companies building production AI need engineers who think about safety. A documented red-teaming section on any project is an extreme rarity in portfolios and signals senior thinking.

THROUGHOUT
Technical Writing
Medium · Substack · Your own site

AI engineers who write clearly get 3× more inbound recruiter interest. Write one technical post per project — what you built, what was hard, what you measured, what you learned. Projects 6 and 7 produce posts that could genuinely go viral in the AI engineering community. Start writing before you think you're ready.

HOW TO MAXIMIZE IMPACT
Career Strategy

Building the projects is 60% of the work. How you present, document, and position them is the other 40%.

📊
Benchmark Everything — Numbers Are Credibility

Your data science background makes this natural and powerful. Every project should have a metrics section with real numbers. "My RAG system achieves 87% answer relevancy on RAGAS, compared to 61% for a baseline chatbot" is infinitely more compelling than "I built a RAG system." Numbers make you credible. Opinions make you another candidate.

✍️
Write Every Project — Engineers Who Write Get Hired

Publish a technical post for each portfolio project. The fine-tuning benchmark (P7) and the CRAG comparison (P6) are research papers in disguise — treat them that way. Post on Medium, share on LinkedIn, and link from your GitHub. AI engineers who explain their work clearly are rare. Be one of them.

🎯
Own the Carpentry Domain From Day One

Use carpentry and trades regulation documents as your test data for Projects 1, 2, and 6. By the time you reach Project 8, you know the domain deeply and your portfolio tells a coherent story: "I specialized in AI for the trades and construction sector." A focused portfolio reads like expertise. A scattered portfolio reads like curiosity.

🔗
Frame the DS → AI Transition Deliberately

Never hide your data science background — it is a rare advantage. In every interview and LinkedIn post, frame it: "Classical ML gave me strong foundations in statistics and evaluation rigor. AI engineering adds orchestration, retrieval, and agent design. Here is how the two combine in my work." This narrative resonates with senior hiring panels who have seen too many AI engineers who cannot evaluate their own systems.

FULL TIMELINE SUMMARY
Phase 1
5–7 WEEKS
P1: RAG Pipeline
P2: Resume Builder
Phase 2
7–9 WEEKS
P3: Data Analyst Agent
P4: Paper → Code Agent
P5: MCP Server
Phase 3
9–11 WEEKS
P6: Self-Correcting RAG
P7: Fine-Tune Benchmark
P8: Quebec Platform
Total
21–27 WEEKS
8 projects
7 portfolio pieces
1 live product