Notice: This lab provides evidence-based verdicts, not certification. No compliance marks or endorsements are issued.
Build rc-20260131161432 (SSOT 2026-01-31)

Standards

Evaluation Methodology

METHOD-VLAB-01, v0.17.0 (Sealed)

A rigorous, non-certifying approach to verifying MPLP lifecycle invariants using evidence-based verdicts.

Non-Certification & Non-Endorsement

The Validation Lab evaluates evidence packs, not agentic systems. We issue verdicts based on deterministic rulesets, not certifications of quality or safety. A PASS verdict means the submitted evidence satisfies the claimed invariants under the specified Ruleset.

01. How to Read a Verdict

Evidence maturity classifications and cryptographic guarantees.

📊 How to Read This Table

Reproduced: Real evidence pack with a verifiable hash/seal. You can download the pack, re-run the check locally, and compare hashes. (Reproduced = downloadable pack + deterministic recheck + hash matches the release seal.)

Simulated: Synthetic evidence pack, not produced by a real execution. For demo/coverage only; not dispute-ready.

Dispute Ready: FAIL verdict with evidence pointers to the triggered clauses. Arbitration-ready: clause + evidence + FMM pointer.

Declared: Manifest/metadata only, with no downloadable evidence. Cannot be independently verified.

Domain Labels (D1, D2, D3, D4)

D1 Provenance: Identity, environment, provenance integrity
D2 Lifecycle: Execution lifecycle, state transitions
D3 Arbitration: Dispute resolution, evidence pointers
D4 Interop: Cross-framework protocol compliance

Host vs Interop

Host = Orchestration framework running the agent (LangGraph, CrewAI, etc.)
Interop = Protocol stack used for cross-framework communication (MCP, A2A, ACP)

02. What We Evaluate

Four domains of lifecycle inquiry (D1–D4).

D1 Provenance: Are the agent identity and execution environment cryptographically verifiable? (e.g., transparency logs, remote attestation)

D2 Lifecycle: Did the agent respect the state transitions (Initialize → Run → Terminate) without leaking resources or leaving zombie processes?

D3 Arbitration: Does the evidence contain the pointers (lines, diffs, snapshots) needed for human or machine arbitration in case of failure?

D4 Interop: Did communication between components adhere to standard protocols (MCP, A2A) without proprietary side channels?
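As an illustration of the lifecycle question (D2), a minimal replay check might scan the pack's NDJSON event stream and confirm that the states occur in order. This is a sketch under assumptions: the event field names (`ts`, `state`) are illustrative, not taken from the Ruleset.

```python
import json

# Hypothetical event shape: {"ts": ..., "state": "..."}.
# Field names are illustrative assumptions, not defined by METHOD-VLAB.
EXPECTED_ORDER = ["Initialize", "Run", "Terminate"]

def check_lifecycle(ndjson_lines):
    """True if the state-transition events occur exactly in the expected order."""
    events = [json.loads(line) for line in ndjson_lines if line.strip()]
    states = [e["state"] for e in events if "state" in e]
    return states == EXPECTED_ORDER
```

A stream that terminates before reaching Terminate, or that re-enters Run, would fail this check, which is the kind of "zombie process" evidence the domain targets.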

03. Evidence Pack Format

The atomic unit of evaluation.

The Evidence Pack is a ZIP archive containing the minimal set of files needed to satisfy the Ruleset. It serves as a portable proof of conformance.

  • manifest.json: Metadata, claims, and the self-reported verdict.
  • integrity/: SHA-256 checksums for all artifacts in the pack.
  • timeline/: NDJSON event stream of the execution, for replay.
  • artifacts/: Full capture of inputs, outputs, and side effects.
manifest.json structure
{
  "id": "run-xyz-123",
  "tier": "REPRODUCED",
  "verdict": "PASS",
  "ruleset": "ruleset-1.1",
  "claims": {
    "d1.provenance": true,
    "d4.mcp_compliance": true
  },
  "signatures": { 
    "signer": "vlab-signer-01",
    "algo": "ed25519", 
    "value": "a7f...9c2" 
  }
}
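A deterministic recheck of the integrity/ directory can be sketched as follows. The checksum file name (integrity/sha256sums.txt) and its `<hash>  <path>` line format are assumptions for illustration; the Ruleset may specify a different layout.

```python
import hashlib
import zipfile

def verify_pack(pack):
    """Recompute SHA-256 for every artifact listed in the integrity data
    and compare against the recorded digest.

    Assumption: checksums live in integrity/sha256sums.txt, one
    '<hash>  <path>' entry per line (sha256sum-style output).
    """
    with zipfile.ZipFile(pack) as z:
        for line in z.read("integrity/sha256sums.txt").decode().splitlines():
            if not line.strip():
                continue
            recorded, name = line.split(None, 1)
            actual = hashlib.sha256(z.read(name.strip())).hexdigest()
            if actual != recorded:
                return False
    return True
```

Any single tampered byte in an artifact changes its digest, so a pack either replays cleanly or is rejected as a whole.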

04. Self-Audit Path

Independent verification without trusting the Lab UI.

1. Select Run

Find a run marked REPRODUCED in the /runs index.

2. Check Ruleset

Verify the evaluation logic matches the Ruleset ID claimed.

3. Verify Seal

Find the Freeze Date in /releases and get the Seal Hash.

4. Local Hash

Run sha256sum on the pack. The digest must match the Seal Hash.
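The final step can be scripted rather than run by hand. A minimal sketch, equivalent to `sha256sum` on the downloaded pack; the seal value is whatever you copied from /releases:

```python
import hashlib

def local_hash(path, chunk_size=1 << 20):
    """Stream the pack through SHA-256, mirroring `sha256sum <pack>`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_seal(path, seal_hash):
    """Compare the local digest against the published Seal Hash."""
    return local_hash(path) == seal_hash.strip().lower()
```

Reading in chunks keeps memory flat even for large packs, and a match means the bytes you verified are exactly the bytes that were sealed at the Freeze Date.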