A Psychometrician’s Guide to Valid, Defensible AI Assessment
People are using AI to write, analyse, summarise, generate ideas, solve problems, and support decision-making across almost every professional role. Yet most assessment and hiring processes still evaluate candidates as if AI does not exist.
This creates a growing gap between how people are assessed and how work is actually performed.
The most effective way to close that gap is through AI work samples.
An AI work sample is a structured, job-relevant task that requires candidates to interact with AI in a way that reflects real-world performance. It moves beyond abstract testing and focuses directly on what people do when AI is part of their workflow.
At Rob Williams Assessment, AI work samples are increasingly central to defensible assessment design. They provide a direct bridge between construct definition and job performance, which is why they are becoming a critical component of modern, high-stakes evaluation systems.
What Is an AI Work Sample?
An AI work sample is not simply “letting candidates use AI.” That is where many organisations go wrong.
A properly designed AI work sample is a structured assessment task that:
- replicates a realistic job scenario
- requires meaningful interaction with AI tools
- targets clearly defined constructs
- produces observable, scorable behaviour
- links directly to job performance outcomes
This matters because traditional assessments often measure capability in isolation. AI work samples measure capability in context.
That distinction is critical.
Why AI Work Samples Are Now Essential
There are three major shifts driving the rise of AI work samples.
1. AI Has Changed the Nature of Performance
In many roles, performance is no longer about unaided cognition. It is about how effectively individuals:
- frame problems for AI
- interpret AI outputs
- challenge or refine responses
- integrate AI into workflows
- apply judgement under uncertainty
Traditional tests do not capture this.
AI work samples do.
2. Validity Requires Realistic Task Design
One of the strongest forms of validity comes from alignment with real work. AI work samples provide this alignment directly.
This is why they act as a validity anchor within modern assessment systems.
Related RWA work on validation can be found here: Using AI for Validation
3. Defensibility Depends on Observable Behaviour
As outlined in the AI Audit Checklist, defensible assessment requires observable, explainable evidence.
AI work samples provide:
- clear behavioural data
- transparent scoring logic
- direct links to job performance
This makes them far easier to defend than opaque AI scoring systems.
The Core Design Principle: Measure Judgement, Not Tool Use
The most common mistake in AI work sample design is focusing on tool proficiency.
This is the wrong target.
AI tools change rapidly. What matters is not whether someone can use a specific tool, but how they think when using AI.
At RWA, this is framed through capabilities such as:
- output evaluation
- bias recognition
- decision calibration
- information credibility assessment
- structured reasoning
These map directly to the Mosaic AI Skills Framework.
The Five-Step Framework for Designing an AI Work Sample
Step 1: Define the Construct Clearly
Start with precision. What exactly are you trying to measure?
Weak example:
“AI capability”
Strong example:
“Ability to evaluate AI-generated recommendations and identify flawed reasoning in decision contexts”
Without this clarity, the rest of the design will be unstable.
Step 2: Identify a Realistic Job Scenario
The scenario should reflect actual work.
For example:
- reviewing an AI-generated report
- evaluating candidate recommendations
- analysing AI-produced insights
- challenging AI-driven conclusions
This is where many assessments fail. They drift into artificial tasks that do not resemble real decisions.
Step 3: Design the AI Interaction
The AI element must be purposeful.
This could include:
- presenting AI-generated outputs with embedded flaws
- allowing candidates to prompt AI themselves
- requiring critique or refinement of AI responses
The goal is not passive consumption. It is active judgement.
Step 4: Define Observable Behaviours
What exactly will you observe and score?
Examples:
- identification of incorrect assumptions
- ability to challenge AI output
- quality of reasoning
- decision justification
If behaviour cannot be observed, it cannot be scored.
Step 5: Build a Defensible Scoring Framework
This is where psychometric discipline matters.
A strong scoring model should include:
- clear scoring criteria
- defined performance levels
- exemplar responses at each performance level
- consistency checks
This is essential for both reliability and defensibility.
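The scoring elements above can be sketched in code. The following is a minimal illustration, not RWA's actual rubric: the criteria names, level descriptors, and ratings are hypothetical, and the consistency check uses Cohen's kappa, one common chance-corrected measure of agreement between two raters.

```python
# Illustrative sketch of a defensible scoring framework:
# clear criteria, defined performance levels, and a consistency check.
# All criteria, descriptors, and ratings below are hypothetical examples.

RUBRIC = {
    "identifies_flawed_reasoning": {
        0: "Accepts the AI output uncritically",
        1: "Notes a flaw but cannot explain it",
        2: "Identifies and clearly explains the flawed assumption",
    },
    "decision_justification": {
        0: "Recommendation with no supporting rationale",
        1: "Partial rationale, weakly linked to the evidence",
        2: "Recommendation fully grounded in the evidence reviewed",
    },
}

def total_score(ratings: dict) -> int:
    """Sum criterion-level ratings, validating each against the rubric."""
    for criterion, level in ratings.items():
        if level not in RUBRIC[criterion]:
            raise ValueError(f"Invalid level {level} for {criterion}")
    return sum(ratings.values())

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Consistency check: chance-corrected agreement between two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    levels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(lv) / n) * (rater_b.count(lv) / n) for lv in levels
    )
    return (observed - expected) / (1 - expected)

# One candidate scored against both criteria (hypothetical ratings).
print(total_score({"identifies_flawed_reasoning": 2,
                   "decision_justification": 1}))  # prints 3

# Two raters score the same six candidates on one criterion (0-2 scale).
rater_a = [2, 1, 2, 0, 1, 2]
rater_b = [2, 1, 1, 0, 1, 2]
print(round(cohens_kappa(rater_a, rater_b), 3))  # prints 0.739
```

In practice the level descriptors would be anchored with exemplar responses, and agreement would be monitored across all raters and criteria, but even this small sketch shows how explicit criteria and an agreement statistic make the scoring logic transparent and auditable.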
Example AI Work Sample Task
Scenario: A hiring manager must decide whether to shortlist a candidate based on an AI-generated evaluation report.
Task:
- Review the AI-generated summary
- Identify strengths and weaknesses
- Highlight any flawed reasoning
- Make a recommendation with justification
What is being measured:
- critical evaluation of AI output
- decision-making under uncertainty
- bias recognition
- structured reasoning
This type of task directly reflects real-world decision-making.
Where Most Organisations Get This Wrong
Common mistakes include:
- focusing on AI tool use rather than judgement
- using unrealistic or trivial tasks
- failing to define constructs clearly
- weak or inconsistent scoring models
- no link to real job performance
These issues reduce both validity and defensibility.
How AI Work Samples Strengthen Defensibility
AI work samples directly address the core risks identified in an AI Defensibility Audit.
They provide:
- clear construct alignment
- observable evidence
- transparent scoring
- job relevance
This makes them one of the strongest tools for defensible AI assessment.
Integration Into Assessment Systems
AI work samples should not stand alone.
They are most effective when combined with:
- cognitive assessment
- structured interviews
- situational judgement tests
- AI readiness diagnostics
For school-sector parallels, see: AI Readiness in Schools
The Strategic Value of Getting This Right
Organisations that adopt AI work samples effectively gain:
- stronger validity
- better prediction of performance
- greater fairness and transparency
- improved candidate credibility
- reduced decision risk
Design Defensible AI Work Samples
If you are redesigning hiring or assessment for AI-enabled work, RWA can help you build valid, defensible AI work samples tailored to your roles.
Work With Us
We help organisations evaluate validity, fairness, and candidate experience across AI-enabled recruitment processes and assessments. Typical corporate engagement areas include AI-enhanced assessment design (SJTs, simulations, structured interviews), validation strategy, bias and fairness monitoring/audits, and construct definitions.
In addition to designing AI work samples, we offer these aligned services:
- Our organisational AI readiness diagnostic
- Our AI readiness diagnostic for schools
- Our AI readiness diagnostic for individual development
- Our AI career readiness diagnostic
- Our guide to AI leadership readiness diagnostic designs
- Our AI skills framework
- Our AI competency framework for organisations
- Our guide on how to use AI to validate an AI-enabled assessment
- Our guide to AI-enabled situational judgement test designs
(C) 2026 Rob Williams Assessment Ltd. This article is educational and not legal advice. Always align to your local jurisdiction, counsel, and internal governance requirements.