Reverse-Engineering Exams: A Deep Dive into Professor Profiler

Amit Divekar

I built this during exam season, half out of frustration and half out of curiosity. The frustration was real: I kept watching people grind through entire textbooks when, if you actually looked at five years of past papers side by side, the same three topics showed up in roughly the same format every single time. The insight isn't deep. Professors are creatures of habit. The problem is that nobody has the time or patience to do that analysis manually before every exam.

So I built Professor Profiler, a Hierarchical Multi-Agent System (HMAS) that takes past exam PDFs and reverse-engineers them. Not just "here are the topics" but "here's what this professor actually cares about, here's what they always ask at the analysis level vs. pure recall, and here's where you should spend your last 48 hours."

It runs on Google Gemini 2.5 and the whole thing is structured as a pipeline of specialized agents, each doing one thing well.

The Hub-and-Spoke Architecture

My first instinct was to throw everything into a single prompt. That didn't work. A monolithic prompt trying to classify questions, spot trends, and generate strategy all at once produced output that was confidently vague. It would say things like "thermodynamics appears frequently" without any actual frequency count behind it. Useless.

The fix was to stop treating this as a single-shot problem and start treating it like a team of analysts. I landed on a hub-and-spoke architecture: one Root Agent acting as the coordinator, with specialized sub-agents handling distinct cognitive tasks. Each agent gets a focused job and a Gemini model matched to what that job actually requires.

The whole thing is a directed acyclic graph (DAG) under the hood. The Root Agent owns the execution order.

```mermaid
flowchart TD
    subgraph Orchestration_Layer ["🧠 Orchestration Layer"]
        Runner[<b>Runner</b><br/><i>State Management</i>]
        Memory[(<b>Memory Bank</b><br/><i>JSON Persistence</i>)]
    end
    subgraph Agent_Layer ["🤖 Agent Hierarchy"]
        Root[<b>ROOT AGENT</b><br/><i>Gemini 2.5 Pro</i><br/>The Project Manager]
        subgraph Workers ["Specialized Sub-Agents"]
            Taxonomist[<b>Taxonomist</b><br/><i>Gemini Flash</i><br/>Classification]
            Trend[<b>Trend Spotter</b><br/><i>Gemini Pro</i><br/>Analysis]
            Strat[<b>Strategist</b><br/><i>Gemini Thinking</i><br/>Planning]
        end
    end
    Runner --> Root
    Root --Delegates--> Taxonomist
    Root --Delegates--> Trend
    Root --Delegates--> Strat
```
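Stripped of the framework details, the hub-and-spoke control flow is just the hub calling each spoke in dependency order, feeding each one the previous output. Here's a toy sketch of that shape (plain Python, not ADK code; the lambdas stand in for real agent calls):

```python
# Toy sketch of hub-and-spoke orchestration: the "hub" owns the execution
# order, and each "spoke" is a function of the previous stage's output.
def run_pipeline(questions, taxonomist, trend_spotter, strategist):
    classified = taxonomist(questions)   # classification
    trends = trend_spotter(classified)   # cross-year pattern analysis
    plan = strategist(trends)            # final prioritization
    return plan

# Stand-in spokes; a real system would call LLM-backed agents here.
plan = run_pipeline(
    ["Q1", "Q2"],
    taxonomist=lambda qs: [{"q": q, "topic": "demo"} for q in qs],
    trend_spotter=lambda cs: {"dominant": "demo", "n": len(cs)},
    strategist=lambda t: {"safe_zones": [t["dominant"]], "drop_list": []},
)
print(plan)
```

The point of the shape: no spoke ever talks to another spoke directly, so each one can be tested, swapped, or re-prompted in isolation.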

Meet the Agents

Each agent has a specific model behind it, and that choice was deliberate. Throwing the most expensive model at every task is wasteful and actually slower for the parts that don't need deep reasoning.

The Taxonomist runs on gemini-2.0-flash-exp. Its only job is to read every question and spit out structured JSON: topic label, Bloom's Taxonomy level (Recall, Understand, Apply, Analyze, Evaluate), and question type. Speed matters here because it's processing every question across potentially dozens of papers. I originally had it doing trend analysis too, and the output was a mess. Separating classification from analysis was one of the better decisions I made on this project.
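To make that concrete, here's one plausible shape for a single classified record. The field names are illustrative, not the project's exact schema:

```python
import json

# Hypothetical example of the structured record the Taxonomist emits per
# question. Field names are illustrative, not the exact production schema.
record = {
    "question_id": "2023_Q4",
    "topic": "Thermodynamics",
    "blooms_level": "Analyze",   # one of: Recall, Understand, Apply, Analyze, Evaluate
    "question_type": "numerical",
    "marks": 10,
}

print(json.dumps(record, indent=2))
```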

The Trend Spotter uses gemini-2.0-pro-exp, mainly for the larger context window. This agent receives the Taxonomist's JSON output and looks for patterns across years: which topics dominate, which have appeared only once (potential traps), and where the difficulty distribution sits. The first version of this agent hallucinated frequency counts, just made them up with total confidence. I had to add explicit instructions forcing it to only cite numbers present in the data it received.
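The counting itself is deterministic, which is exactly why hallucinated frequencies were so avoidable. A sketch of the aggregation the Trend Spotter's prompt can be grounded in (sample records are made up for illustration):

```python
from collections import Counter

# Deterministic aggregation over the Taxonomist's records. Feeding these
# counts to the Trend Spotter means it can only cite numbers that exist.
classified = [
    {"year": 2022, "topic": "Thermodynamics", "blooms_level": "Apply"},
    {"year": 2023, "topic": "Thermodynamics", "blooms_level": "Analyze"},
    {"year": 2023, "topic": "Optics", "blooms_level": "Recall"},
    {"year": 2024, "topic": "Thermodynamics", "blooms_level": "Analyze"},
]

topic_counts = Counter(q["topic"] for q in classified)
level_counts = Counter(q["blooms_level"] for q in classified)

print(topic_counts.most_common())  # Thermodynamics dominates this toy sample
```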

The Strategist uses gemini-2.0-flash-thinking-exp because this is the step that actually requires reasoning, not just classification or pattern-matching. It takes the Trend Spotter's analysis and outputs two things: Safe Zones (topics with high historical frequency and well-understood question formats) and a Drop List (topics that have appeared rarely enough that cramming them is a bad expected-value bet). This agent also had a rough early version. It kept generating extremely balanced "study everything" recommendations that were completely useless. I had to push it hard with system prompts to force actual prioritization.
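The Safe Zone / Drop List split can be approximated with a simple frequency threshold, which is roughly the logic the Strategist is prompted toward. The thresholds below are hypothetical, not the values the system uses:

```python
def partition_topics(topic_counts, years_covered, safe_ratio=0.6, drop_ratio=0.2):
    """Hypothetical thresholding: a topic appearing in at least 60% of
    covered years is a Safe Zone; one at or below 20% goes on the Drop List.
    Everything in between stays unclassified for the student to judge."""
    safe, drop = [], []
    for topic, count in topic_counts.items():
        ratio = count / years_covered
        if ratio >= safe_ratio:
            safe.append(topic)
        elif ratio <= drop_ratio:
            drop.append(topic)
    return safe, drop

safe, drop = partition_topics(
    {"Thermodynamics": 5, "Optics": 2, "Relativity": 1}, years_covered=5
)
print(safe, drop)
```

The value of putting this in code rather than prose: it forces an actual cutoff, which is precisely what the "study everything" early version refused to commit to.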

From Raw PDF to Study Plan

The pipeline has four phases. They have to run in order since each one feeds the next.

```mermaid
sequenceDiagram
    autonumber
    actor Student
    participant Root as 🧠 Root Agent
    participant Tax as 🏷️ Taxonomist
    participant Strat as 🎯 Strategist
    Student->>Root: "Analyze Physics_2024.pdf"
    note right of Root: Phase 1 & 2: Classification
    Root->>Tax: "Classify these questions"
    Tax-->>Root: JSON List of Classified Questions
    note right of Root: Phase 3: Visualization
    Root->>Root: Generate Charts (Matplotlib)
    note right of Root: Phase 4: Strategy
    Root->>Strat: "Identify Safe Zones"
    Strat-->>Root: Final Recommendations
    Root-->>Student: Report + Images + Plan
```

Phase 3 is the Root Agent generating charts directly using matplotlib. I went back and forth on whether to make this a separate agent. In the end it didn't need to be. The chart generation is deterministic given the Taxonomist's JSON, so there's no reasoning required. Spinning up an agent for that would've been overkill.
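Since the chart step is a pure function of the classification counts, it reduces to a few lines of matplotlib. A minimal sketch with made-up counts (the real system derives these from the Taxonomist's JSON):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; we only write the image to disk
import matplotlib.pyplot as plt

# Illustrative counts -- in the real pipeline these come from the
# Taxonomist's JSON, so no LLM call is needed for this phase.
topic_counts = {"Thermodynamics": 9, "Optics": 5, "Waves": 4, "Relativity": 1}

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(list(topic_counts.keys()), list(topic_counts.values()))
ax.set_ylabel("Questions across all papers")
ax.set_title("Topic frequency by year range")
fig.tight_layout()
fig.savefig("topic_frequency.png")
```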

Production Considerations

I didn't want this to be a demo that falls apart the moment you feed it anything unexpected. A few things I added to make it more robust:

Structured logging with correlation IDs runs throughout the entire agent hierarchy. When something goes wrong inside a sub-agent call, you can trace exactly which request triggered it. This saved me hours of debugging when the Trend Spotter started failing silently on PDFs with unusual formatting.
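The pattern is standard: carry the ID in a `ContextVar` so it survives async hops, and stamp it onto every log record with a `logging.Filter`. A sketch of that wiring (not the project's exact logger setup):

```python
import logging
import uuid
from contextvars import ContextVar

# A ContextVar survives across awaits, so the same ID follows one request
# through every async sub-agent call it triggers.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True  # never suppress records, only annotate them

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
handler.addFilter(CorrelationFilter())

logger = logging.getLogger("profiler")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Set once at the top of each request; every log line downstream carries it.
cid = uuid.uuid4().hex[:8]
correlation_id.set(cid)
logger.info("delegating to Trend Spotter")
```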

The session management uses ADK's InMemorySessionService, which lets the Runner track state across the full multi-agent call chain. Long-running analyses don't lose their place if something downstream takes longer than expected.

PDF text extraction uses pypdf. This sounds simple until you've seen the range of ways people scan and export exam papers. Scanned image-only PDFs still break things. That's a known limitation and I haven't solved it yet.

Try It Yourself

Here's how to wire it up and run an analysis programmatically:

```python
import asyncio

from google.genai import types
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService

from profiler_agent.agent import root_agent


async def main():
    # Initialize memory
    session = InMemorySessionService()

    # Initialize runner
    runner = Runner(agent=root_agent, session_service=session)

    # Execute
    print("🤖 Agent is thinking...")
    async for event in runner.run_async(
        user_id="prof_user",
        session_id="sess_01",
        new_message=types.Content(
            role="user",
            parts=[types.Part.from_text("Analyze the midterms.")],
        ),
    ):
        if event.is_final_response():
            print(f"\n🎓 Final Answer:\n{event.content.parts[0].text}")


if __name__ == "__main__":
    asyncio.run(main())
```

The thing I'd change if I rebuilt this from scratch: I'd invest more time upfront in the data format that flows between agents. The contract between the Taxonomist's output and the Trend Spotter's input caused more debugging headaches than anything else. Schema validation between agent handoffs would've been worth the extra setup time.
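What that validation could look like with Pydantic, assuming a handoff schema along the lines of the Taxonomist's output (field names are illustrative):

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

# Hypothetical handoff contract: validate the Taxonomist's records before
# the Trend Spotter ever sees them. Field names are illustrative.
class ClassifiedQuestion(BaseModel):
    question_id: str
    year: int
    topic: str
    blooms_level: Literal["Recall", "Understand", "Apply", "Analyze", "Evaluate"]
    question_type: str

raw = {
    "question_id": "2024_Q1",
    "year": 2024,
    "topic": "Optics",
    "blooms_level": "Apply",
    "question_type": "derivation",
}

q = ClassifiedQuestion(**raw)  # well-formed record passes through

# A malformed record now fails loudly at the handoff instead of silently
# corrupting the downstream analysis.
try:
    ClassifiedQuestion(**{**raw, "blooms_level": "Memorize"})
except ValidationError:
    print("rejected malformed record")
```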

Full source code is on GitHub: Professor Profiler.


Connect With Me

I'm always happy to talk through multi-agent systems, agentic architecture decisions, or anything else in this space.