AI-Era Hiring · 9 min read · April 26, 2026

Agentic Assessments Test What Engineers Build. We Evaluate How They Think.

CodeSignal launched agentic coding assessments. Meta is rolling out AI-assisted interviews. The industry agrees AI belongs in the interview. The question is: what do you actually measure once it's there?

Avri Simon
Founder & CEO, Eval-X
[Image: Output versus process - two approaches to AI-era engineering evaluation]

Something shifted in technical hiring this month. Not a subtle shift - a category-level move.

CodeSignal launched what they call "agentic coding assessments." Meta started rolling out AI-assisted coding interviews across backend roles. Karat published data showing 71% of engineering leaders say AI has made technical skills harder to assess. The industry is converging on one conclusion: AI belongs in the interview room. The old "block the AI" approach is dying.

I agree with all of that. I've been saying it for a year. What I disagree with is what most of these platforms measure once AI is in the room.

The agentic assessment premise

Here is what an agentic coding assessment typically looks like. The candidate gets a task. They have access to AI coding tools - Claude Code, Cursor, Codex, whatever the platform supports. They build a working solution. Then they explain what they built to a human reviewer.

The platform scores the output. Did the code work? Did it meet the requirements? Was the solution architecturally sound?
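
To make that concrete, here is roughly what output-only scoring reduces to. This is my own simplified sketch, not any vendor's actual rubric - the inputs and weights are made up - but it captures the shape: run the tests, check the requirements, score the artifact.

```python
# A deliberately minimal sketch of output-only scoring. The inputs and weights
# are illustrative; the point is that nothing here can see how the candidate
# worked, only what they handed in.
def score_output(tests_passed: int, tests_total: int,
                 requirements_met: int, requirements_total: int) -> float:
    """Return a 0-1 score computed from the final deliverable alone."""
    test_score = tests_passed / tests_total
    req_score = requirements_met / requirements_total
    return 0.6 * test_score + 0.4 * req_score  # arbitrary weighting

# Any two candidates with identical deliverables land on identical scores.
print(score_output(tests_passed=12, tests_total=12,
                   requirements_met=5, requirements_total=5))  # 1.0
```

Notice what is absent: nothing in that function knows how the solution came to exist.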

CodeSignal reports that 91% of engineers use AI tools at work and 75% have shipped production code that is at least partially AI-generated in the last six months. Those numbers feel right to me. And the argument follows naturally: if engineers use AI at work, the interview should let them use AI too. I have no quarrel with that logic. It is correct as far as it goes.

The problem is where it stops.

Output is the easy part

When you evaluate what an engineer builds with AI, you are measuring the result. The deliverable. The artifact. This is a meaningful signal - a candidate who cannot produce a working solution, even with AI help, has a problem. But it is not a sufficient signal, and for senior hires it is barely the beginning.

Here is why. I ran over 1,000 technical interviews as CTO across five companies. At the senior level, the question was never "can this person produce working code?" Almost everyone at that level can produce working code - especially now, with AI generating most of it. The question was always "how does this person think when things get ambiguous, when the requirements shift, when the first approach doesn't work?"

That question cannot be answered by looking at the final output. You need to see the process.

Consider two candidates completing the same agentic assessment. Both deliver a working solution that meets the requirements. Same output quality. Same test pass rate.

Candidate A copy-pasted the entire problem into Claude, accepted the first output, made one minor fix, and submitted. Total AI prompts: 2. Time spent reading the AI output before accepting: under 30 seconds. Zero architectural decisions made by the human.

Candidate B read the problem, sketched an approach, asked the AI for a specific implementation of one component, read the output critically, rejected the initial approach because it didn't handle an edge case they anticipated, asked for a revised version with a specific constraint, then integrated it into their own architecture. Total AI prompts: 7. Average time between prompt and next action: 90 seconds. Three explicit architectural decisions visible in the timeline.

Both passed. Same score on an output-based assessment. Same "working solution." But one of them is a senior engineer who uses AI as a power tool. The other is a prompt relay who got lucky on a well-scoped problem. On day one of the job, when the ticket is ambiguous and the AI's first output is wrong, these two people will perform very differently.

An output-based assessment cannot distinguish between them. A behavioral one can.
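
Here is a sketch of why. Assume a simplified event log - the field names, event kinds, and timings below are illustrative, and the two timelines are abbreviated versions of the candidates above, not real captured data. Even three crude metrics pulled from that log separate them immediately.

```python
# A minimal sketch of how a behavioral timeline separates the two candidates.
# Event fields, thresholds, and the sample timelines are illustrative, not
# Eval-X's actual schema or captured data.
from dataclasses import dataclass

@dataclass
class Event:
    t: float          # seconds since task start
    kind: str         # "prompt", "ai_response", "edit", "decision", "submit"
    detail: str = ""

def summarize(timeline: list[Event]) -> dict:
    """Reduce a raw event timeline to a few behavioral metrics."""
    prompts = [e for e in timeline if e.kind == "prompt"]
    decisions = [e for e in timeline if e.kind == "decision"]

    # Time between receiving an AI response and the next action, used here as
    # a rough proxy for how long the candidate actually read the output.
    review_gaps = []
    for i, e in enumerate(timeline[:-1]):
        if e.kind == "ai_response":
            review_gaps.append(timeline[i + 1].t - e.t)

    return {
        "prompt_count": len(prompts),
        "avg_review_seconds": sum(review_gaps) / len(review_gaps) if review_gaps else 0.0,
        "architectural_decisions": len(decisions),
    }

# Candidate A: one big paste, near-instant accept, no visible decisions.
candidate_a = [
    Event(5, "prompt", "entire problem statement"),
    Event(40, "ai_response"),
    Event(60, "edit", "minor fix"),
    Event(70, "prompt", "fix the failing test"),
    Event(95, "ai_response"),
    Event(110, "submit"),
]

# Candidate B: scoped prompts, real review time, explicit decisions.
candidate_b = [
    Event(120, "decision", "sketched module boundaries before coding"),
    Event(300, "prompt", "clarify whether retries should apply to all calls"),
    Event(330, "ai_response"),
    Event(420, "prompt", "implement the retry wrapper only"),
    Event(460, "ai_response"),
    Event(560, "decision", "rejected: misses the timeout edge case"),
    Event(580, "prompt", "same wrapper, but with a per-call deadline"),
    Event(620, "ai_response"),
    Event(740, "decision", "integrated into existing architecture"),
    Event(900, "submit"),
]

print(summarize(candidate_a))  # few prompts, output accepted in seconds, zero decisions
print(summarize(candidate_b))  # scoped prompts, ~100s of review per response, three decisions
```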

[Image: Behavioral timeline showing decision nodes and AI interaction patterns]

What behavioral evaluation actually captures

At Eval-X, we built the evaluation around the process, not the result. Here is what that means concretely.

When a candidate works in our environment, we capture a complete behavioral timeline. Every keystroke, every pause, every prompt sent to the AI, every response received, every edit made to the AI's output, every backspace. The full sequence of decisions the engineer made, in order, with timing.

From that timeline, we score across multiple dimensions. Not a single pass/fail, but a profile:

Problem framing. Did the candidate think before coding? Did they clarify ambiguity or assume the happy path? The timeline shows whether the candidate's first action was to write code or to ask a question. That distinction matters more than most interviewers realize.

AI usage quality. This is the dimension no output-based assessment can measure. Did the candidate drive the AI or follow it? Did they give specific, well-scoped prompts or dump the entire problem and hope? Did they read the AI's output critically or accept it blind? The behavioral timeline makes this visible. You can see the exact moment a candidate stopped thinking and started copy-pasting.

System design. Did the candidate make architectural choices, or did they let the AI choose? When the AI suggested an approach, did the candidate evaluate it against alternatives or take it at face value?

Adaptability. When something broke - a test failed, the AI gave a wrong answer, the requirements shifted mid-task - did the candidate panic, start over, or pivot cleanly? The timeline shows the response to failure in real time. This is one of the hardest things to fake and one of the best predictors of on-the-job performance.

Code quality. Yes, the output matters too. But we evaluate it in context. Code that works because the engineer understood and shaped it is different from code that works because the AI happened to get it right on the first try.

Explanation and ownership. Can the candidate explain what they built and why? Not "the AI suggested this" but "I chose this approach because of X, and I modified the AI's suggestion to handle Y." Ownership of the solution - even when AI helped produce it - is the clearest signal of engineering maturity.
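
Put together, those dimensions read as a profile, not a verdict. Here is a rough sketch of what that report looks like - the field names, 0-to-5 scale, and numbers are illustrative assumptions, not our production rubric.

```python
# A rough sketch of a multi-dimensional behavioral profile. Field names,
# scales, weights, and scores are illustrative, not Eval-X's production rubric.
from dataclasses import dataclass, asdict

@dataclass
class BehavioralProfile:
    problem_framing: float    # clarified ambiguity before coding?
    ai_usage_quality: float   # drove the AI vs. followed it
    system_design: float      # made architectural choices vs. deferred them
    adaptability: float       # response when a test failed or the AI was wrong
    code_quality: float       # the output, evaluated in context
    ownership: float          # can explain what was built and why

def report(profile: BehavioralProfile) -> str:
    """Render the profile as a readable breakdown instead of a single verdict."""
    lines = [f"{dim:>18}: {score:.1f}/5" for dim, score in asdict(profile).items()]
    return "\n".join(lines)

# The two candidates from earlier: same output, very different profiles.
candidate_a = BehavioralProfile(1.5, 1.0, 1.0, 2.0, 3.5, 1.5)
candidate_b = BehavioralProfile(4.5, 4.5, 4.0, 4.0, 4.0, 4.5)

print(report(candidate_a))
print()
print(report(candidate_b))
```

Two candidates, identical outputs, profiles that barely overlap. That is the report a hiring decision actually needs.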

The Meta signal

Meta's move to AI-assisted coding interviews is worth paying attention to, not because Meta invented something new, but because they confirmed the direction at massive scale. Their format uses multi-file projects instead of algorithmic puzzles. Candidates have access to multiple AI models. Interviewers watch how candidates work with AI - their prompting, debugging, judgment, trade-off reasoning.

That is the behavioral evaluation model. Meta did not ship an output-based agentic assessment. They shipped an observation-based process evaluation. The world's largest engineering employer independently arrived at the same conclusion: the signal is in the process, not the deliverable.

Why this distinction matters for your next hire

If you are evaluating agentic assessment platforms right now, the question to ask is simple: what does this platform actually measure?

If the answer is "whether the candidate produced a working solution using AI tools" - that is a real signal, but it is a thin one. It tells you the candidate can use AI. It does not tell you whether they can think.

If the answer is "how the candidate thought, prompted, iterated, adapted, and made decisions throughout the process" - that is a thick signal. It tells you what the candidate will do on day one when the problem is ambiguous and the AI's first answer is wrong.

The difference is not academic. It is the difference between hiring someone who can pass a well-structured assessment with AI help (most competent engineers can) and hiring someone whose engineering judgment makes them effective in the messy, unstructured reality of production work.

At the senior level - where the cost of a wrong hire is $200K to $400K and the interview is supposed to predict performance on problems nobody has pre-scoped - the behavioral signal is the one that matters.

The industry is converging. The question is how far.

Every major player in technical assessment is acknowledging that AI belongs in the interview. HackerRank added "Unguarded AI" mode. CodeSignal launched agentic assessments. Karat's NextGen pairs candidates with AI tools and human interviewers. The debate about whether to allow AI is over.

The next debate is about what to measure once AI is allowed. Output or process. Deliverable or decision-making. Result or reasoning.

We built Eval-X on the thesis that the process is the signal. Every week the market gives us more evidence that this thesis is correct. The question for engineering leaders is not whether your interview should include AI - that's settled. The question is whether your evaluation captures the thinking behind the output, or just the output itself.

If you are making $150K+ hiring decisions based on whether a candidate's AI-assisted code passes tests, you are measuring the AI's capability as much as the engineer's. If you are evaluating how the engineer drives the AI - the prompts they write, the output they reject, the decisions they make, the problems they anticipate - you are measuring the human. That is what you are hiring.

Avri Simon is the founder and CEO of Eval-X. Before Eval-X, he scaled engineering teams from 15 to 120+ at three companies, and ran more than 1,000 technical interviews as CTO and VP R&D. Learn more at eval-x.com.

See the difference between output and process.

Watch how Eval-X captures the full behavioral timeline of how engineers actually think. 20 minutes. No slides.

Book a Demo