No Black Box. Just Evidence.
Every score Storm produces points to a real quote from the candidate’s actual work. The AI does the homework. The hiring manager makes the call. Here’s exactly how it works — step by step, in plain English.
Three things every Storm score guarantees
Most AI hiring tools hand you a number and call it done. We hand you the number, the question that produced it, the quote from the candidate’s work that justifies it, and the reasoning that connects them.
Cited
Every “yes” or “no” the AI gives must point to an exact quote from the candidate’s chat, code, spreadsheet, or slide. If the AI can’t find evidence, it must say so — explicitly.
Deterministic
Same evidence, same prompt, same model: same result. Scores stay reproducible months from now, because the prompt version and model snapshot are stored alongside every score.
Human-decides
The AI scores. You decide. Hiring managers see every observation, every quote, every “Insufficient Evidence” tag — and can override any score with a comment that becomes part of the audit trail.
Six stages, end to end
From defining what “good” looks like for a role, all the way through to the hiring-manager view — here’s every step the platform takes, and what we do at each one to make the result trustworthy.
Turn the role into testable questions
You define what matters for the role — competencies, weights, importance. The AI takes each criterion and writes a small set of yes/no observation questions a hiring manager could ask while watching the candidate work. “Did they distinguish leading from lagging metrics?” “Did they explain the trade-off before picking a path?”
These aren’t trivia questions. They’re the operational definition of “good” for this role. They get versioned, reviewed, and bound to the position.
You can edit, add, or veto any question
- Q1 Did the candidate distinguish “started application” from “submitted application” in their reasoning?
- Q2 Did the candidate verify the metric definition against the source before recommending a fix?
- Q3 Did the candidate connect the funnel mismatch to the post-launch conversion drop?
- Q4 Did the candidate name a concrete remediation rather than a generic “needs investigation”?
Capture what they actually did, not what they said they’d do
As the candidate works through the simulation, every artifact gets captured — their voice-call transcript, their code edits, their spreadsheet formulas, their slide content, every line of chat. We bundle it into a structured evidence package tagged by source and timestamp.
Before the bundle ever reaches an AI evaluator, we run it through a redactor that strips the candidate’s name, location, and identifying details. The model scores the work, not the person.
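As an illustration only (not Storm’s actual code), the capture-and-redact step could be sketched like this. The `EvidenceItem` fields, the `redact` helper, and the `[REDACTED]` placeholder are hypothetical names, and the sketch assumes the identifying strings are already known; a production redactor would likely need entity detection as well.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EvidenceItem:
    """One captured artifact, tagged by source and timestamp (hypothetical shape)."""
    source: str      # e.g. "chat", "code", "spreadsheet", "slides", "voice"
    timestamp: str   # ISO 8601 string
    text: str


def redact(items, identifiers, placeholder="[REDACTED]"):
    """Return copies of the items with every known identifier replaced.

    Assumption: identifiers (name, location, ...) are supplied by the caller.
    Longer terms are matched first so "Jane Doe" wins over "Jane".
    """
    terms = sorted((t for t in identifiers if t), key=len, reverse=True)
    if not terms:
        return list(items)
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return [replace(item, text=pattern.sub(placeholder, item.text)) for item in items]


bundle = [
    EvidenceItem("chat", "2024-01-01T10:00:00Z", "Hi, this is Jane Doe from Austin."),
    EvidenceItem("code", "2024-01-01T10:05:00Z", "# fix funnel query"),
]
clean = redact(bundle, ["Jane Doe", "Austin"])
```

The originals are left untouched and only the redacted copies would be passed on to the evaluator.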
PII-redacted before scoring
Score every question against the evidence
For each criterion’s question, we run an observation check: a focused AI call that gets one yes/no question, the candidate’s full evidence bundle, and a single instruction: “answer the question, but only if you can quote the exact text that justifies your answer.”
The AI must return three things together: an answer, a citation (a literal quote), and a one-sentence reasoning. If it can’t find a quote, the only legal answer is Insufficient Evidence.
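One way to picture that three-part contract is a small parser that refuses to keep a yes/no answer without a quote. This is a minimal sketch under assumptions: the `Answer` enum, the `Observation` fields, and the dict shape of the model response are all hypothetical, not Storm’s real schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Answer(Enum):
    YES = "yes"
    NO = "no"
    INSUFFICIENT_EVIDENCE = "insufficient_evidence"


@dataclass(frozen=True)
class Observation:
    question_id: str
    answer: Answer
    citation: Optional[str]  # literal quote from the evidence, or None
    reasoning: str           # one-sentence justification


def parse_observation(question_id, raw):
    """Build an Observation from a model response dict (assumed shape)."""
    citation = (raw.get("citation") or "").strip()
    reasoning = (raw.get("reasoning") or "").strip()
    try:
        answer = Answer(str(raw.get("answer", "")).strip().lower())
    except ValueError:
        # Anything other than yes / no / insufficient_evidence is not accepted.
        answer = Answer.INSUFFICIENT_EVIDENCE
    if not citation:
        # No quote means the only allowed answer is Insufficient Evidence.
        answer = Answer.INSUFFICIENT_EVIDENCE
    if answer is Answer.INSUFFICIENT_EVIDENCE:
        citation = ""
    return Observation(question_id, answer, citation or None, reasoning)
```

For example, a response of `{"answer": "yes", "citation": ""}` would come back as Insufficient Evidence rather than a yes.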
One question · one citation · one answer
Verify every citation. Punish hallucinations.
AI models can fabricate plausible-sounding quotes. We don’t take their word for it. Every citation goes through a deterministic validator — pure code, no AI — that checks the quoted text actually appears in the candidate’s evidence.
If the validator can’t find the quote, the answer is automatically downgraded to Insufficient Evidence and a citation-violation event is logged. We track those rates per evaluation as a leading indicator of prompt drift.
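A validator of this kind can be very small. The sketch below is an assumption-laden illustration rather than the production check: the function names, the whitespace and quote normalization, and the `min_chars` floor are hypothetical choices, and a stricter version could require a byte-exact match instead.

```python
import re
import unicodedata

# Map curly quotes to straight ones so typography alone never fails a match.
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize(text):
    """Unicode-normalize, unify quotes, collapse whitespace, and casefold."""
    text = unicodedata.normalize("NFKC", text).translate(_QUOTES)
    return re.sub(r"\s+", " ", text).strip().casefold()


def citation_is_valid(citation, evidence_texts, min_chars=8):
    """True if the quote appears in at least one evidence text.

    min_chars (an assumed threshold) rejects trivially short "quotes".
    """
    needle = normalize(citation)
    if len(needle) < min_chars:
        return False
    return any(needle in normalize(text) for text in evidence_texts)


def validate(answer, citation, evidence_texts):
    """Return (final_answer, violation). Unverifiable quotes are downgraded."""
    if answer == "insufficient_evidence":
        return answer, False
    if citation and citation_is_valid(citation, evidence_texts):
        return answer, False
    return "insufficient_evidence", True


evidence = ["I checked the   metric definition against the source table."]
ok = validate("yes", "checked the metric definition", evidence)
bad = validate("yes", "I rebuilt the dashboard from scratch", evidence)
```

Summing the violation flags across an evaluation would give the per-evaluation violation rate described above.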
Hallucinated quotes are auto-downgraded
Aggregate honestly. Never fake confidence.
We don’t average our way to a clean number when the evidence isn’t there. Each criterion gets one of three states based on what we actually saw:
Assessed means we have enough evidence to give a real score. Partially Assessed means we have some signal but missed pieces of the question. Not Assessed means the simulation didn’t surface evidence one way or the other — and we say so plainly, instead of guessing.
Honesty about what we don’t know
- Assessed: 3 of 4 questions confirmed from evidence
- Partially Assessed: voice call evidence; deck evidence missing
- Not Assessed: no evidence surfaced in this simulation
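To make the three states concrete, here is a minimal sketch of one possible aggregation rule. The cutoff is an assumption: the page does not say how much evidence counts as “enough”, so the `assessed_threshold` of 0.75 (chosen only so that 3 of 4 answered questions counts as Assessed) and the function name are hypothetical.

```python
def criterion_state(answers, assessed_threshold=0.75):
    """Map a criterion's observation answers to one of the three states.

    answers: list of "yes", "no", or "insufficient_evidence".
    An answer counts as evidence-backed only if it is "yes" or "no".
    assessed_threshold is an assumed cutoff, not a documented value.
    """
    if not answers:
        return "Not Assessed"
    backed = sum(1 for a in answers if a in ("yes", "no"))
    if backed == 0:
        return "Not Assessed"
    if backed / len(answers) >= assessed_threshold:
        return "Assessed"
    return "Partially Assessed"
```

The point of the design is that missing evidence lowers the state label instead of being averaged into the score.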
Make every score reproducible — forever
Every evaluation writes a structured audit line: which prompt version generated the questions, which prompt version scored the observations, which model snapshot produced the answer, which seed was used, and how many citation violations occurred.
Six months from now, when someone questions a decision, we can replay the exact evaluation against the exact evidence and produce the same score. That’s not a marketing claim — it’s how the pipeline is built.
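An audit line along these lines could be as simple as one JSON object per evaluation. The sketch below is illustrative: the field names are hypothetical, and the evidence hash is an added assumption (the page lists prompt versions, model snapshot, seed, and citation violations, but not a hash) that would let a replay confirm it is running against the same evidence.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    evaluation_id: str
    question_prompt_version: str   # prompt version that generated the questions
    scoring_prompt_version: str    # prompt version that scored the observations
    model_snapshot: str            # pinned model identifier
    seed: int
    citation_violations: int
    evidence_sha256: str           # assumption: fingerprint of the evidence bundle


def evidence_digest(evidence_texts):
    """Order-sensitive SHA-256 over the evidence texts.

    Each text is length-prefixed so ["ab", "c"] and ["a", "bc"] differ.
    """
    h = hashlib.sha256()
    for text in evidence_texts:
        data = text.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def to_audit_line(record):
    """Serialize with sorted keys so identical records give identical lines."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    evaluation_id="eval-001",            # all values here are made-up examples
    question_prompt_version="questions-v3",
    scoring_prompt_version="scoring-v7",
    model_snapshot="example-model-snapshot",
    seed=7,
    citation_violations=0,
    evidence_sha256=evidence_digest(["chat transcript", "slide text"]),
)
line = to_audit_line(record)
```

On replay, recomputing the digest and comparing it with the stored one would be a cheap first check before re-running the scoring calls.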
Full chain of custody
What we built so the AI can’t drift
Five engineering decisions that turn a clever LLM into a defensible evaluator. None of these are toggles or nice-to-haves — they’re the spine of the pipeline.
Versioned prompts
The instructions we give the AI are file-versioned and the version is stamped onto every score. Changing a prompt means a new version — and a new test gauntlet.
Pinned model snapshot
We don’t ride the latest model — we pin to a specific snapshot, recorded alongside every score in the audit trail, so a vendor update can’t silently change a score you already shipped.
Self-consistency sampling
For high-stakes criteria we can run each observation check N times in parallel and take a majority vote, with each call seeded independently so a flake in one lane can’t bias the result. Sampling ties are surfaced as their own signal, not silently averaged away.
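The voting step in self-consistency sampling can be sketched in a few lines. Everything named here is hypothetical: `derive_seeds` is one assumed way to give each of the N calls its own reproducible seed, and `majority_vote` returns a tie flag instead of picking a winner, in keeping with ties being surfaced as their own signal. The model calls themselves are left out.

```python
import hashlib
from collections import Counter


def derive_seeds(base_seed, n):
    """Derive n reproducible per-call seeds from one base seed (assumed scheme)."""
    seeds = []
    for i in range(n):
        digest = hashlib.sha256(f"{base_seed}:{i}".encode("utf-8")).digest()
        seeds.append(int.from_bytes(digest[:4], "big"))
    return seeds


def majority_vote(answers):
    """Return (winner, is_tie). On a tie for first place the winner is None."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    counts = Counter(answers).most_common()
    top = counts[0][1]
    leaders = [answer for answer, count in counts if count == top]
    if len(leaders) > 1:
        return None, True
    return leaders[0], False


seeds = derive_seeds(42, 3)
clear = majority_vote(["yes", "yes", "no"])
split = majority_vote(["yes", "no"])
```

Choosing an odd N reduces two-way ties, though a three-way split across yes, no, and Insufficient Evidence is still possible and would be reported as a tie here.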
PII redaction
Names and locations are stripped from the evidence bundle before it ever reaches the evaluator. The model scores the candidate’s work, not who they are.
Built for AEDT compliance
AI hiring tools now face explicit US regulation: NYC Local Law 144, the Illinois AI Video Interview Act, the Colorado AI Act. Storm is engineered to clear those bars: questions referencing protected characteristics are rejected at generation, the observation-check prompt blocks demographic and stylistic signals, and PII (name, location) is stripped from the evidence bundle before the evaluator sees it. An independent annual bias audit and a published public summary of its results, as NYC Local Law 144 requires, are on the roadmap.
AI scores. You decide.
The pipeline produces evidence — a structured argument for or against each criterion, with citations attached. It does not produce decisions. Hiring managers see every piece of work, every observation, every quote, and every “Insufficient Evidence” tag.
Every override a hiring manager makes is recorded with a comment and becomes part of the audit trail. Over time, those comments are exactly the data we’d use to improve the questions for the role.
- Every observation, citation, and reasoning trail is visible in the evaluation view.
- Hiring managers can override any score, with a required comment that’s stored with the evaluation.
- “Insufficient Evidence” is a first-class outcome — never coerced into a fake number.
- The platform never auto-rejects, auto-rolls-forward, or hides candidates from a human reviewer.
Things buyers ask us, with honest answers
If a question you have isn’t here, tell us — we’d rather answer it directly than wave it away.
How do I know it’s not just a black box?
What happens when the AI gets it wrong?
How do you prevent demographic bias from creeping in?
Can I reproduce a score from six months ago?
Is the AI making the hiring decision?
What if a simulation didn’t surface enough evidence?
Want to see this on a real candidate?
We’ll walk you through a live evaluation — every observation, every citation, every override — for a role you’re hiring for right now.
Book a 30-min walkthrough