Add ANTHROPIC_API_KEY or OPENAI_API_KEY to your environment before using these helpers. askClaudeTextValidation and askChatGPTTextValidation share the same signature and return shape, so either can be swapped in with no other changes to your test.
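As a fail-fast guard, a setup step can verify the key exists before any judge call runs. This is a sketch; assertJudgeKey is our own name, not part of the helpers:

```typescript
// Hypothetical guard: throw early if the judge model's API key is not
// configured, instead of failing mid-test with an opaque HTTP 401.
function assertJudgeKey(envVar: "ANTHROPIC_API_KEY" | "OPENAI_API_KEY"): void {
  if (!process.env[envVar]) {
    throw new Error(
      `${envVar} is not set. Add it to your environment before running judge-model assertions.`,
    );
  }
}

// e.g. call assertJudgeKey("ANTHROPIC_API_KEY") before askClaudeTextValidation
```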

Examples

Assert that an AI response is valid and within the token budget
const { validation, usage } = await askClaudeTextValidation(
  "You are a strict but fair evaluator. Focus on accuracy and completeness.",
  originalPrompt,
  aiResponseText,
);

// Hard gate — judge model must deem the response valid
expect(
  validation.isValidForPrompt,
  `Response invalid.\nMissing: ${JSON.stringify(validation.issues.missingRequirements)}\n` +
  `Incorrect: ${JSON.stringify(validation.issues.incorrectInformation)}\n` +
  `Explanation: ${validation.explanation}`,
).toBe(true);

// Soft gate — tune threshold per client
expect(
  validation.score,
  `Score ${validation.score} below threshold.\nExplanation: ${validation.explanation}`,
).toBeGreaterThanOrEqual(0.8);

// No contradictions
expect(
  validation.issues.incorrectInformation.length,
  `Incorrect info flagged: ${JSON.stringify(validation.issues.incorrectInformation)}`,
).toBe(0);

// Token budget — tune per client and model
expect(usage.input_tokens).toBeLessThanOrEqual(2000);
expect(usage.output_tokens).toBeLessThanOrEqual(500);
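Token budgets can also be expressed in dollars. A small sketch, assuming placeholder per-million-token prices (check your provider's current pricing before relying on these numbers):

```typescript
// Hypothetical cost estimator: converts a usage report into estimated USD
// so a budget assertion can be written against spend rather than raw tokens.
const PRICE_PER_MILLION = { input: 3.0, output: 15.0 }; // placeholder USD rates

function estimateCostUSD(usage: { input_tokens: number; output_tokens: number }): number {
  return (
    (usage.input_tokens / 1_000_000) * PRICE_PER_MILLION.input +
    (usage.output_tokens / 1_000_000) * PRICE_PER_MILLION.output
  );
}

// expect(estimateCostUSD(usage)).toBeLessThanOrEqual(0.02);
```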
Use ChatGPT as the judge model instead
// Drop-in replacement — identical signature and return shape
const { validation, usage } = await askChatGPTTextValidation(
  "You are a strict but fair evaluator. Focus on accuracy and completeness.",
  originalPrompt,
  aiResponseText,
);
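Because the two helpers share a signature, a test can also pick its judge at runtime. A sketch; the JudgeFn type and pickJudge are our own names, not part of the helpers:

```typescript
// Hypothetical judge selector: both validation helpers satisfy this shape,
// so the judge model can be chosen via config with no other test changes.
type JudgeFn = (
  systemPrompt: string,
  originalPrompt: string,
  aiResponseText: string,
) => Promise<{ validation: any; usage: { input_tokens: number; output_tokens: number } }>;

function pickJudge(judges: Record<string, JudgeFn>, name: string): JudgeFn {
  const judge = judges[name];
  if (!judge) {
    throw new Error(`Unknown judge "${name}" (have: ${Object.keys(judges).join(", ")})`);
  }
  return judge;
}

// const judge = pickJudge(
//   { claude: askClaudeTextValidation, chatgpt: askChatGPTTextValidation },
//   process.env.JUDGE ?? "claude",
// );
```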

When to use

  • Your app surfaces AI-generated content (summaries, prep notes, chat responses) that must be checked for accuracy.
  • Your app’s AI feature must not introduce contradictions or hallucinations relative to source material.
  • Your app has a quality bar for AI output that a simple string match cannot enforce.
  • Your app sends AI prompts that could drift in cost and you need to assert token budgets.
  • Your app uses multiple AI providers and you want a consistent validation interface across them.

Helpers

askClaudeTextValidation

Sends the original prompt and candidate response to Claude and returns a structured verdict. Conforms to the Anthropic Messages API.
async function askClaudeTextValidation(
  systemPrompt: string,
  originalPrompt: string,
  aiResponseText: string,
) {
  const SCHEMA_INSTRUCTIONS = `
You MUST respond with ONLY a valid JSON object — no markdown, no explanation, no backticks.
{
  "isValidForPrompt": boolean,
  "score": number (0.00–1.00),
  "issues": {
    "missingRequirements":    [string],
    "incorrectInformation":   [string],
    "offTopicContent":        [string],
    "formattingProblems":     [string],
    "safetyOrPolicyConcerns": [string]
  },
  "explanation": string
}`;

  const resp = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key":         process.env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01",
      "content-type":      "application/json",
    },
    body: JSON.stringify({
      model:      "claude-sonnet-4-6",
      max_tokens: 1024,
      system:     systemPrompt + "\n\n" + SCHEMA_INSTRUCTIONS,
      messages: [{
        role:    "user",
        content:
          "PROMPT:\n" + originalPrompt +
          "\n\nCANDIDATE ANSWER:\n" + aiResponseText +
          "\n\nReturn the JSON verdict.",
      }],
    }),
  });

  if (!resp.ok) throw new Error(`Anthropic error ${resp.status}: ${await resp.text()}`);

  const data       = await resp.json();
  const rawText    = data.content?.find((b: any) => b.type === "text")?.text ?? "";
  const validation = JSON.parse(rawText.replace(/```json|```/g, "").trim());
  const usage      = data.usage; // { input_tokens, output_tokens }

  return { validation, usage };
}
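The JSON.parse call above throws if the model wraps the verdict in prose despite the schema instructions. One defensive option is to fall back to the first balanced {...} span in the text. A sketch; parseJudgeVerdict is our own name:

```typescript
// Hypothetical fallback parser: strip code fences, try a direct parse,
// then fall back to the outermost {...} span before giving up.
function parseJudgeVerdict(rawText: string): unknown {
  const cleaned = rawText.replace(/```json|```/g, "").trim();
  try {
    return JSON.parse(cleaned);
  } catch {
    const start = cleaned.indexOf("{");
    const end = cleaned.lastIndexOf("}");
    if (start === -1 || end <= start) {
      throw new Error(`Judge did not return JSON: ${cleaned.slice(0, 200)}`);
    }
    return JSON.parse(cleaned.slice(start, end + 1));
  }
}
```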

askChatGPTTextValidation

Same signature and return shape as askClaudeTextValidation. Uses the OpenAI Responses API with strict structured JSON output. Requires the openai npm package (npm install openai).
import OpenAI from "openai";

async function askChatGPTTextValidation(
  systemPrompt: string,
  originalPrompt: string,
  aiResponseText: string,
) {
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const resp = await client.responses.create({
    model:        "gpt-4.1",
    instructions: systemPrompt,
    input: [{
      role:    "user",
      content:
        "PROMPT:\n" + originalPrompt +
        "\n\nCANDIDATE ANSWER:\n" + aiResponseText +
        "\n\nReturn the JSON verdict.",
    }],
    text: {
      format: {
        type:   "json_schema",
        name:   "TextPromptValidation",
        strict: true,
        schema: {
          type:                 "object",
          additionalProperties: false,
          required:             ["isValidForPrompt", "score", "issues", "explanation"],
          properties: {
            isValidForPrompt: { type: "boolean" },
            score:            { type: "number", minimum: 0, maximum: 1, multipleOf: 0.01 },
            issues: {
              type:                 "object",
              additionalProperties: false,
              required: ["missingRequirements", "incorrectInformation", "offTopicContent", "formattingProblems", "safetyOrPolicyConcerns"],
              properties: {
                missingRequirements:    { type: "array", items: { type: "string" } },
                incorrectInformation:   { type: "array", items: { type: "string" } },
                offTopicContent:        { type: "array", items: { type: "string" } },
                formattingProblems:     { type: "array", items: { type: "string" } },
                safetyOrPolicyConcerns: { type: "array", items: { type: "string" } },
              },
            },
            explanation: { type: "string" },
          },
        },
      },
    },
  });

  const validation = JSON.parse(resp.output_text);
  const usage = {
    input_tokens:  resp.usage?.input_tokens  ?? 0,
    output_tokens: resp.usage?.output_tokens ?? 0,
  };

  return { validation, usage };
}

Return shape

Both helpers return { validation, usage } with the same structure.
Field                        Type      Description
validation.isValidForPrompt  boolean   Hard gate — did the response satisfy the prompt?
validation.score             number    Quality score 0–1. Assert >= 0.8 as a starting threshold.
validation.issues.*          string[]  Arrays of flagged issues by category.
validation.explanation       string    Step-by-step reasoning from the judge model.
usage.input_tokens           number    Tokens consumed by the prompt.
usage.output_tokens          number    Tokens consumed by the response.
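In TypeScript, the return shape can be expressed as an interface plus a runtime guard, so a malformed verdict fails with a clear structural error rather than a vague assertion. A sketch; these names are ours, not exported by the helpers:

```typescript
// Hypothetical typing for the judge verdict described in the table above.
interface TextPromptValidation {
  isValidForPrompt: boolean;
  score: number; // 0–1
  issues: {
    missingRequirements: string[];
    incorrectInformation: string[];
    offTopicContent: string[];
    formattingProblems: string[];
    safetyOrPolicyConcerns: string[];
  };
  explanation: string;
}

// Runtime guard covering the same fields the structure assertions check.
function isTextPromptValidation(v: any): v is TextPromptValidation {
  return (
    typeof v?.isValidForPrompt === "boolean" &&
    typeof v?.score === "number" &&
    typeof v?.explanation === "string" &&
    ["missingRequirements", "incorrectInformation", "offTopicContent",
     "formattingProblems", "safetyOrPolicyConcerns"]
      .every((k) => Array.isArray(v?.issues?.[k]))
  );
}
```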

Full sample test

import { flow, expect } from "@qawolf/flows/web";
import { llms } from "./helpers/llm-helper.t";

export default flow(
  "AI response is valid, scored, and within token budget",
  { target: "Web - Chrome", launch: true },
  async ({ page }) => {

    //--------------------------------
    // Arrange
    //--------------------------------

    // The prompt sent to the AI feature under test
    const originalPrompt =
      "Summarize the key action items from the meeting transcript below " +
      "in a bulleted list. Be concise and accurate.\n\n" +
      "Transcript:\n" +
      "Alice: We need to ship the new onboarding flow by Friday.\n" +
      "Bob: I'll own the front-end changes.\n" +
      "Alice: Great. Carol, can you handle QA?\n" +
      "Carol: Yes, I'll have test cases ready by Thursday.\n" +
      "Alice: Perfect. Also, let's schedule a retro for next Monday at 10am.";

    const {
      askChatGPTTextValidation,
      askClaudeTextValidation,
    } = await llms();

    //--------------------------------
    // Act
    //--------------------------------

    await page.getByRole("link", { name: "AI Assistant" }).click();

    await page.getByRole("textbox", { name: "Ask anything" }).fill(originalPrompt);

    const [response] = await Promise.all([
      page.waitForResponse(
        (res) =>
          res.url().includes("/api/chat") &&
          res.request().method() === "POST",
        { timeout: 60_000 },
      ),
      page.getByRole("button", { name: "Send" }).click(),
    ]);

    expect(response.status()).toBe(200);

    const aiResponseContainer = page.locator("[data-testid='ai-response']:last-of-type");
    await expect(aiResponseContainer).toBeVisible({ timeout: 15_000 });
    const aiResponseText = await aiResponseContainer.innerText();

    //--------------------------------
    // Assert
    //--------------------------------

    const { validation, usage } = await askClaudeTextValidation(
      "You are a strict but fair evaluator of AI-generated meeting summaries. " +
      "Focus on whether the response captures the correct action items, owners, and deadlines.",
      originalPrompt,
      aiResponseText,
    );

    // Structure
    expect(typeof validation.isValidForPrompt).toBe("boolean");
    expect(typeof validation.score).toBe("number");
    expect(Array.isArray(validation.issues.missingRequirements)).toBe(true);
    expect(Array.isArray(validation.issues.incorrectInformation)).toBe(true);
    expect(Array.isArray(validation.issues.offTopicContent)).toBe(true);
    expect(Array.isArray(validation.issues.formattingProblems)).toBe(true);
    expect(Array.isArray(validation.issues.safetyOrPolicyConcerns)).toBe(true);
    expect(typeof validation.explanation).toBe("string");

    // Hard gate
    expect(
      validation.isValidForPrompt,
      `Response invalid.\n` +
      `Missing: ${JSON.stringify(validation.issues.missingRequirements)}\n` +
      `Incorrect: ${JSON.stringify(validation.issues.incorrectInformation)}\n` +
      `Off-topic: ${JSON.stringify(validation.issues.offTopicContent)}\n` +
      `Explanation: ${validation.explanation}`,
    ).toBe(true);

    // Soft gate
    expect(
      validation.score,
      `Score ${validation.score} below threshold.\nExplanation: ${validation.explanation}`,
    ).toBeGreaterThanOrEqual(0.8);

    // No contradictions
    expect(
      validation.issues.incorrectInformation.length,
      `Incorrect info: ${JSON.stringify(validation.issues.incorrectInformation)}`,
    ).toBe(0);

    expect(
      validation.issues.offTopicContent.length,
      `Off-topic content: ${JSON.stringify(validation.issues.offTopicContent)}`,
    ).toBe(0);

    // Token budget
    expect(
      usage.input_tokens,
      `Input tokens ${usage.input_tokens} exceeded budget of 2000`,
    ).toBeLessThanOrEqual(2000);

    expect(
      usage.output_tokens,
      `Output tokens ${usage.output_tokens} exceeded budget of 500`,
    ).toBeLessThanOrEqual(500);
  },
);
Last modified on April 24, 2026