# @graphorin/evals
Eval framework for Graphorin. Ships scorer libraries (code, LLM-judge, prebuilt rubrics), dataset loaders (JSONL / CSV / from-traces / iterable), reporters (terminal / markdown / JSON / JUnit / HTML), a parallel runner with bounded concurrency, and regression detection that compares the current run against a stored baseline.
## Status
- Published: v0.1.0 (optional sub-pack; the full orchestrator is decoupled from `@graphorin/observability` per RB-17 / DEC-152).
## Install
```sh
pnpm add @graphorin/evals
```

The package depends only on `@graphorin/core` and `@graphorin/observability`; reporters and loaders are part of the same bundle, so consumers do not need additional installs.
## Quickstart
```ts
import {
  runEvals,
  loadJsonlDataset,
  exactMatch,
  renderTerminalReport,
  exitOnFailures,
} from '@graphorin/evals';

const dataset = await loadJsonlDataset('./fixtures/golden.jsonl');

const report = await runEvals({
  agent, // anything with `run(input)`
  dataset,
  scorers: [exactMatch({ caseInsensitive: true })],
  concurrency: 4,
});

console.log(renderTerminalReport(report));
exitOnFailures(report);
```

## Scorers
| Scorer family | Identifiers | Notes |
|---|---|---|
| `code/` | `exactMatch`, `regexMatch`, `jsonPath`, `predicate` | Pure-code grading. No provider call. |
| `llm/` | `llmJudge` | LLM-as-judge. Default gpt-4o-mini-class judge with `temperature: 0`. |
| `prebuilt/` | `toxicityScorer`, `factualityScorer`, `helpfulnessScorer` | Wrap `llmJudge` with a project-tested rubric. |
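Code scorers and LLM judges compose in the same `scorers` array. A minimal sketch of a custom pure-code check via `predicate`; the `(output) => boolean` callback shape is an assumption, not documented in this reference:

```ts
import { exactMatch, predicate } from '@graphorin/evals';

// Hypothetical predicate scorer: passes when the agent output is a
// non-empty string. The callback signature is assumed, not documented.
const nonEmpty = predicate(
  (output: unknown) => typeof output === 'string' && output.length > 0,
);

const scorers = [exactMatch({ caseInsensitive: true }), nonEmpty];
```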
## Dataset loaders
| Loader | Use |
|---|---|
| `loadJsonlDataset(path)` | Read a JSONL file. Each line is a JSON object with `input` plus optional `expected` / `id` / `metadata`. |
| `loadCsvDataset(path)` | Read a CSV file (RFC 4180 strict subset). Columns map by name. |
| `loadDatasetFromTraces(path, { extract })` | Distil a dataset from the framework's replay log. |
| `fromIterable(cases)` | Wrap an in-memory array as a dataset (tests / ad-hoc data). |
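For tests and quick experiments, `fromIterable` skips file I/O entirely. A short sketch; the case shape mirrors the JSONL record described above (`input` plus optional `expected` / `id` / `metadata`):

```ts
import { fromIterable } from '@graphorin/evals';

// In-memory cases using the same record shape as the JSONL loader.
const dataset = fromIterable([
  { id: 'greet-1', input: 'hello', expected: 'Hello!' },
  { id: 'greet-2', input: 'good morning', expected: 'Good morning!' },
]);
```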
## Reporters
| Reporter | Output | Best for |
|---|---|---|
| `renderTerminalReport(report)` | Plain text (no ANSI). | CI logs, local dev. |
| `renderMarkdownReport(report)` | Markdown. | PR descriptions, doc sites. |
| `renderJsonReport(report)` | Canonical JSON. | Dashboards, regression checkers. |
| `renderJunitReport(report)` | JUnit XML. | GitHub Actions / GitLab / CircleCI. |
| `renderHtmlReport(report)` | Self-contained HTML. | Artifact viewers. |
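Each renderer returns a string and leaves the write to the caller (see the `reporters` module note below). A sketch that, given a `report` from `runEvals`, appends the markdown rendering to a GitHub Actions step summary and falls back to stdout locally:

```ts
import { appendFile } from 'node:fs/promises';
import { renderMarkdownReport } from '@graphorin/evals';

const md = renderMarkdownReport(report);
const summaryPath = process.env.GITHUB_STEP_SUMMARY; // set by GitHub Actions
if (summaryPath) {
  await appendFile(summaryPath, md); // shows on the workflow run page
} else {
  console.log(md);
}
```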
## Parallel runner
```ts
const controller = new AbortController(); // lets the caller cancel mid-run

const report = await runEvals({
  agent,
  dataset,
  scorers,
  iterations: 3, // each case runs 3 times for variance estimation
  concurrency: 8, // up to 8 parallel agent.run() calls
  signal: controller.signal,
  onProgress: (e) => console.log(`${e.index}/${e.total} ${e.caseId}`),
});
```
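To make the run cancellable from the terminal, the same `controller` can be wired to a standard Node process event; this sketch relies only on the `signal` option shown above:

```ts
// Abort on Ctrl-C so in-flight agent.run() calls stop early
// and runEvals settles instead of hanging.
process.once('SIGINT', () => controller.abort());
```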
## Regression detection

```ts
import { readFile } from 'node:fs/promises';
import { detectRegressions, exitOnFailures } from '@graphorin/evals';

const baseline = JSON.parse(await readFile('./baselines/golden.json', 'utf8'));
const report = await runEvals(...);

const regression = detectRegressions(report, baseline, {
  maxPassRateDropPct: 5,
  maxAvgScoreDrop: 0.05,
  maxAvgDurationIncreaseMs: 250,
});

if (regression.hasRegressions) {
  for (const f of regression.findings) {
    console.error(`regression — ${f.kind}: ${f.message}`);
  }
}

exitOnFailures(report, regression);
```
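Once a run has been reviewed and accepted, the baseline file can be refreshed from the current report. A minimal sketch, assuming the stored baseline is the same canonical JSON that `renderJsonReport` produces and `detectRegressions` consumes (this round-trip is an assumption, not stated above):

```ts
import { writeFile } from 'node:fs/promises';
import { renderJsonReport } from '@graphorin/evals';

// Promote the accepted run to be the new comparison baseline.
await writeFile('./baselines/golden.json', renderJsonReport(report));
```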
## Multi-format report writing

```ts
import { writeReports } from '@graphorin/evals';

await writeReports({
  report,
  outDir: './eval-out',
  formats: ['terminal', 'markdown', 'json', 'junit', 'html'],
  basename: 'golden',
});
```

## Related decisions
- DEC-152: eval split. Keep evaluation interfaces in `@graphorin/observability`; ship the full eval framework as `@graphorin/evals`.
## License
MIT © 2026 Oleksiy Stepurenko
## Modules
| Module | Description |
|---|---|
| `@graphorin/evals` | Eval framework for Graphorin. |
| `cli` | CLI integration helpers. Convenience wrappers that combine the runner, a reporter, and an exit-code mapping so consumer scripts can stay short. |
| `loaders` | Dataset loaders. Every loader returns a fully materialised `Dataset` that the runner can iterate over without further I/O. Streaming loaders are a post-MVP follow-up. |
| `reporters` | Barrel export for every shipped reporter. Each renderer takes an `EvalReport` and returns the canonical text representation; the caller decides where to write it (`writeFile`, `process.stdout`, GitHub Actions step summary, etc.). |
| `scorers` | Barrel export for every shipped scorer family. |