A Psychometrician’s Guide to Valid, Defensible AI Assessment

People are using AI to write, analyse, summarise, generate ideas, solve problems, and support decision-making across almost every professional role. Yet most assessment and hiring processes still evaluate candidates as if AI does not exist.

This creates a growing gap between how people are assessed and how work is actually performed.

The most effective way to close that gap is through AI work samples.

An AI work sample is a structured, job-relevant task that requires candidates to interact with AI in a way that reflects real-world performance. It moves beyond abstract testing and focuses directly on what people do when AI is part of their workflow.

At Rob Williams Assessment, AI work samples are increasingly central to defensible assessment design. They provide a direct bridge between construct definition and job performance, which is why they are becoming a critical component of modern, high-stakes evaluation systems.


What Is an AI Work Sample?

An AI work sample is not simply “letting candidates use AI.” That is where many organisations go wrong.

A properly designed AI work sample is a structured assessment task that:

  • replicates a realistic job scenario
  • requires meaningful interaction with AI tools
  • targets clearly defined constructs
  • produces observable, scorable behaviour
  • links directly to job performance outcomes

This matters because traditional assessments often measure capability in isolation. AI work samples measure capability in context.

That distinction is critical.

Why AI Work Samples Are Now Essential

There are three major shifts driving the rise of AI work samples.

1. AI Has Changed the Nature of Performance

In many roles, performance is no longer about unaided cognition. It is about how effectively individuals:

  • frame problems for AI
  • interpret AI outputs
  • challenge or refine responses
  • integrate AI into workflows
  • apply judgement under uncertainty

Traditional tests do not capture this.

AI work samples do.

2. Validity Requires Realistic Task Design

Content validity, one of the strongest forms of validity evidence, comes from alignment between assessment tasks and real work. AI work samples provide this alignment directly.

This is why they act as a validity anchor within modern assessment systems.

Related RWA work on validation can be found here: Using AI for Validation

3. Defensibility Depends on Observable Behaviour

As outlined in the AI Audit Checklist, defensible assessment requires observable, explainable evidence.

AI work samples provide:

  • clear behavioural data
  • transparent scoring logic
  • direct links to job performance

This makes them far easier to defend than opaque AI scoring systems.

The Core Design Principle: Measure Judgement, Not Tool Use

The most common mistake in AI work sample design is focusing on tool proficiency.

This is the wrong target.

AI tools change rapidly. What matters is not whether someone can use a specific tool, but how they think when using AI.

At RWA, this is framed through capabilities such as:

  • output evaluation
  • bias recognition
  • decision calibration
  • information credibility assessment
  • structured reasoning

These map directly to the Mosaic AI Skills Framework.

The Five-Step Framework for Designing an AI Work Sample

Step 1: Define the Construct Clearly

Start with precision. What exactly are you trying to measure?

Weak example:

“AI capability”

Strong example:

“Ability to evaluate AI-generated recommendations and identify flawed reasoning in decision contexts”

Without this clarity, the rest of the design will be unstable.

Step 2: Identify a Realistic Job Scenario

The scenario should reflect actual work.

For example:

  • reviewing an AI-generated report
  • evaluating candidate recommendations
  • analysing AI-produced insights
  • challenging AI-driven conclusions

This is where many assessments fail. They drift into artificial tasks that do not resemble real decisions.

Step 3: Design the AI Interaction

The AI element must be purposeful.

This could include:

  • presenting AI-generated outputs with embedded flaws
  • allowing candidates to prompt AI themselves
  • requiring critique or refinement of AI responses

The goal is not passive consumption. It is active judgement.

Step 4: Define Observable Behaviours

What exactly will you observe and score?

Examples:

  • identification of incorrect assumptions
  • ability to challenge AI output
  • quality of reasoning
  • decision justification

If behaviour cannot be observed, it cannot be scored.

Step 5: Build a Defensible Scoring Framework

This is where psychometric discipline matters.

A strong scoring model should include:

  • clear scoring criteria
  • defined performance levels
  • examples of responses
  • consistency checks

This is essential for both reliability and defensibility.
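One practical consistency check is inter-rater agreement: have two assessors score the same responses against the rubric and quantify how often they agree beyond chance. A minimal sketch in Python, computing Cohen's kappa from scratch (the rater scores below are illustrative, not real assessment data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired ratings"
    n = len(rater_a)
    # Observed agreement: proportion of responses where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters scored independently.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(counts_a[lab] * counts_b[lab] for lab in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two assessors score ten candidate responses on a 1-4 rubric level.
scores_a = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
scores_b = [3, 2, 4, 2, 1, 2, 3, 4, 2, 3]
print(round(cohens_kappa(scores_a, scores_b), 2))  # → 0.86
```

Values above roughly 0.8 are conventionally read as strong agreement; lower values signal that the scoring criteria or performance-level definitions need tightening before the work sample is used for decisions.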

Example AI Work Sample Task

Scenario: A hiring manager must decide whether to shortlist a candidate based on an AI-generated evaluation report.

Task:

  • Review the AI-generated summary
  • Identify strengths and weaknesses
  • Highlight any flawed reasoning
  • Make a recommendation with justification

What is being measured:

  • critical evaluation of AI output
  • decision-making under uncertainty
  • bias recognition
  • structured reasoning

This type of task directly reflects real-world decision-making.

Where Most Organisations Get This Wrong

Common mistakes include:

  • focusing on AI tool use rather than judgement
  • using unrealistic or trivial tasks
  • failing to define constructs clearly
  • weak or inconsistent scoring models
  • no link to real job performance

These issues reduce both validity and defensibility.

How AI Work Samples Strengthen Defensibility

AI work samples directly address the core risks identified in an AI Defensibility Audit.

They provide:

  • clear construct alignment
  • observable evidence
  • transparent scoring
  • job relevance

This makes them one of the strongest tools for defensible AI assessment.

Integration Into Assessment Systems

AI work samples should not stand alone.

They are most effective when combined with:

  • cognitive assessment
  • structured interviews
  • situational judgement tests
  • AI readiness diagnostics

For school-sector parallels, see: AI Readiness in Schools

The Strategic Value of Getting This Right

Organisations that adopt AI work samples effectively gain:

  • stronger validity
  • better prediction of performance
  • greater fairness and transparency
  • improved candidate credibility
  • reduced decision risk

Design Defensible AI Work Samples

If you are redesigning hiring or assessment for AI-enabled work, RWA can help you build valid, defensible AI work samples tailored to your roles.


Work With Us

We help organisations evaluate validity, fairness, and candidate experience across AI-enabled recruitment processes and assessments. Typical corporate engagement areas include AI-enhanced assessment design (SJTs, simulations, structured interviews), validation strategy, bias and fairness monitoring and audits, and construct definition.

In addition to designing AI work samples, we offer a range of aligned services.

© 2026 Rob Williams Assessment Ltd. This article is educational and not legal advice. Always align to your local jurisdiction, counsel, and internal governance requirements.