Notice: This lab provides evidence-based verdicts, not certification. No compliance marks or endorsements are issued.
Build rc-20260131161432 (SSOT 2026-01-31)

Standards

Evaluation Methodology

METHOD-VLAB-01, v0.17.0 (Sealed)

A rigorous, non-certifying approach to verifying MPLP lifecycle invariants using evidence-based verdicts.

Non-Certification & Non-Endorsement

The Validation Lab evaluates evidence packs, not agentic systems. We issue verdicts based on deterministic rulesets, not certifications of quality or safety. A PASS verdict means the submitted evidence satisfies the claimed invariants under the specified Ruleset.

01. How to Read a Verdict

Evidence maturity classifications and cryptographic guarantees.

📊 How to Read This Table

Reproduced: Real evidence pack with a verifiable hash/seal. You can download the pack, re-run the check locally, and compare hashes. (Reproduced = downloadable pack + deterministic recheck + hash matches the release seal.)

Simulated: Synthetic evidence pack, not produced by a real execution. For demo/coverage only; not dispute-ready.

Dispute Ready: FAIL verdict with evidence pointers to the triggered clauses. Arbitration-ready: clause + evidence + FMM pointer.

Declared: Manifest/metadata only, with no downloadable evidence. Cannot be independently verified.

Domain Labels (D1, D2, D3, D4)

D1 Provenance: Identity, environment, provenance integrity
D2 Lifecycle: Execution lifecycle, state transitions
D3 Arbitration: Dispute resolution, evidence pointers
D4 Interop: Cross-framework protocol compliance

Host vs Interop

Host = Orchestration framework running the agent (LangGraph, CrewAI, etc.)
Interop = Protocol stack used for cross-framework communication (MCP, A2A, ACP)

02. What We Evaluate

Four domains of lifecycle inquiry (D1–D4).

D1 Provenance: Are the agent identity and execution environment cryptographically verifiable? (e.g., transparency logs, remote attestation)

D2 Lifecycle: Did the agent respect the state transitions (Initialize → Run → Terminate) without leaking resources or leaving zombie processes?

D3 Arbitration: Does the evidence contain the pointers (lines, diffs, snapshots) needed for human or machine arbitration in case of failure?

D4 Interop: Did communication between components adhere to standard protocols (MCP, A2A) without proprietary side channels?
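As an illustration of the lifecycle question (D2), a minimal replay check might scan the pack's NDJSON event stream and confirm that the states occur in order. This is a sketch under assumptions: the event field names (`ts`, `state`) are illustrative, not taken from the Ruleset.

```python
import json

# Hypothetical event shape: {"ts": ..., "state": "..."}.
# Field names are illustrative assumptions, not defined by METHOD-VLAB.
EXPECTED_ORDER = ["Initialize", "Run", "Terminate"]

def check_lifecycle(ndjson_lines):
    """True if the state-transition events occur exactly in the expected order."""
    events = [json.loads(line) for line in ndjson_lines if line.strip()]
    states = [e["state"] for e in events if "state" in e]
    return states == EXPECTED_ORDER
```

A stream that terminates before reaching Terminate, or that re-enters Run, would fail this check, which is the kind of "zombie process" evidence the domain targets.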

03. Evidence Pack Format

The atomic unit of evaluation.

The Evidence Pack is a ZIP archive containing the minimal set of files needed to satisfy the Ruleset. It serves as a portable proof of conformance.

  • manifest.json: Metadata, claims, and the self-reported verdict.
  • integrity/: SHA-256 checksums for all artifacts in the pack.
  • timeline/: NDJSON event stream of the execution, for replay.
  • artifacts/: Full capture of inputs, outputs, and side effects.
manifest.json structure
{
  "id": "run-xyz-123",
  "tier": "REPRODUCED",
  "verdict": "PASS",
  "ruleset": "ruleset-1.1",
  "claims": {
    "d1.provenance": true,
    "d4.mcp_compliance": true
  },
  "signatures": { 
    "signer": "vlab-signer-01",
    "algo": "ed25519", 
    "value": "a7f...9c2" 
  }
}
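A deterministic recheck of the integrity/ directory can be sketched as follows. The checksum file name (integrity/sha256sums.txt) and its `<hash>  <path>` line format are assumptions for illustration; the Ruleset may specify a different layout.

```python
import hashlib
import zipfile

def verify_pack(pack):
    """Recompute SHA-256 for every artifact listed in the integrity data
    and compare against the recorded digest.

    Assumption: checksums live in integrity/sha256sums.txt, one
    '<hash>  <path>' entry per line (sha256sum-style output).
    """
    with zipfile.ZipFile(pack) as z:
        for line in z.read("integrity/sha256sums.txt").decode().splitlines():
            if not line.strip():
                continue
            recorded, name = line.split(None, 1)
            actual = hashlib.sha256(z.read(name.strip())).hexdigest()
            if actual != recorded:
                return False
    return True
```

Any single tampered byte in an artifact changes its digest, so a pack either replays cleanly or is rejected as a whole.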

04. Self-Audit Path

Independent verification without trusting the Lab UI.

1. Select Run

Find a run marked REPRODUCED in the /runs index.

2. Check Ruleset

Verify the evaluation logic matches the Ruleset ID claimed.

3. Verify Seal

Find the Freeze Date in /releases and get the Seal Hash.

4. Local Hash

Run sha256sum on the pack. The digest must match the Seal Hash.
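The final step can be scripted rather than run by hand. A minimal sketch, equivalent to `sha256sum` on the downloaded pack; the seal value is whatever you copied from /releases:

```python
import hashlib

def local_hash(path, chunk_size=1 << 20):
    """Stream the pack through SHA-256, mirroring `sha256sum <pack>`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_seal(path, seal_hash):
    """Compare the local digest against the published Seal Hash."""
    return local_hash(path) == seal_hash.strip().lower()
```

Reading in chunks keeps memory flat even for large packs, and a match means the bytes you verified are exactly the bytes that were sealed at the Freeze Date.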