Evaluation Methodology
A rigorous, non-certifying approach to verifying MPLP lifecycle invariants using evidence-based verdicts.
Non-Certification & Non-Endorsement
The Validation Lab evaluates evidence packs, not agentic systems. We issue verdicts based on deterministic rulesets, not certifications of quality or safety. A PASS verdict means the submitted evidence satisfies the claimed invariants under the specified Ruleset.
01. How to Read a Verdict
Evidence maturity classifications and cryptographic guarantees.
📊 How to Read This Table
- Real evidence pack with verifiable hash/seal: can be downloaded, re-run locally, and hash-compared.
  ✓ Reproduced = downloadable pack + deterministic recheck + hash matches release seal
- Synthetic evidence pack (not from real execution): for demo/coverage only, not dispute-ready.
- FAIL verdict with evidence pointers to triggered clauses: arbitration-ready (clause + evidence + FMM pointer).
- Manifest/metadata only, no downloadable evidence: cannot be independently verified.
Domain Labels (D1, D2, D3, D4)
Host vs Interop
02. What We Evaluate
Four domains of lifecycle inquiry (D1–D4).
D1: Are the agent identity and execution environment cryptographically verifiable (e.g., transparency logs, remote attestation)?
D2: Did the agent respect state transitions (Initialize → Run → Terminate) without leaking resources or leaving zombie processes?
D3: Does the evidence contain the pointers (lines, diffs, snapshots) needed for human or machine arbitration in case of failure?
D4: Did the communication between components adhere to standard protocols (MCP, A2A) without proprietary side-channels?
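As an illustration of the state-transition question above, the following minimal sketch walks an NDJSON event stream and rejects any step outside Initialize → Run → Terminate. The `state` field name, the `check_lifecycle` helper, and the allowance for repeated Run events are assumptions made for this example, not part of any published Ruleset. It does not detect leaked resources or zombie processes.

```python
import json

# Hypothetical transition table: keys are the current state, values are the
# states allowed next. Repeated "Run" events are an assumption of this sketch.
ALLOWED = {
    None: {"Initialize"},
    "Initialize": {"Run"},
    "Run": {"Run", "Terminate"},
    "Terminate": set(),
}


def check_lifecycle(ndjson_text):
    """Return (ok, reason) for a stream with one JSON event per line.

    Each event is assumed to carry a "state" field (hypothetical schema).
    """
    state = None
    for lineno, line in enumerate(ndjson_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        next_state = json.loads(line)["state"]
        if next_state not in ALLOWED[state]:
            return False, f"line {lineno}: illegal transition {state} -> {next_state}"
        state = next_state
    if state != "Terminate":
        return False, f"run ended in state {state}, expected Terminate"
    return True, "ok"
```

A stream that skips Run, or that never reaches Terminate, is reported with the line at which the sequence broke, which is the kind of pointer an arbitration step would need.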
03. Evidence Pack Format
The atomic unit of evaluation.
The Evidence Pack is a ZIP archive containing the minimal set of files needed to satisfy the Ruleset. It serves as a portable proof of conformance.
- ✓ manifest.json: Metadata, claims, and self-reported verdict.
- ✓ integrity/: SHA256 checksums for all artifacts in the pack.
- ✓ timeline/: NDJSON event stream of the execution for replay.
- ✓ artifacts/: Full capture of inputs, outputs, and side-effects.
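The layout listed above can be sanity-checked before any rule is evaluated. The sketch below assumes the four entries sit at the root of the ZIP archive and that each directory holds at least one file; both points, along with the `missing_entries` helper name, are assumptions for illustration rather than documented requirements.

```python
import zipfile

REQUIRED_FILE = "manifest.json"
REQUIRED_DIRS = ("integrity/", "timeline/", "artifacts/")


def missing_entries(pack_path):
    """Return the required entries that are absent from the pack.

    Assumes entries live at the archive root (an assumption of this sketch).
    """
    with zipfile.ZipFile(pack_path) as zf:
        names = zf.namelist()
    missing = []
    if REQUIRED_FILE not in names:
        missing.append(REQUIRED_FILE)
    for prefix in REQUIRED_DIRS:
        # A directory counts as present only if it holds at least one file.
        if not any(n.startswith(prefix) and n != prefix for n in names):
            missing.append(prefix)
    return missing
```

An empty return value means only that the expected entries exist; it says nothing about whether their contents satisfy the Ruleset.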
{
  "id": "run-xyz-123",
  "tier": "REPRODUCED",
  "verdict": "PASS",
  "ruleset": "ruleset-1.1",
  "claims": {
    "d1.provenance": true,
    "d4.mcp_compliance": true
  },
  "signatures": {
    "signer": "vlab-signer-01",
    "algo": "ed25519",
    "value": "a7f...9c2"
  }
}
04. Self-Audit Path
Official verification without trusting the Lab UI.
Find a run marked REPRODUCED in the /runs index.
Verify that the evaluation logic matches the claimed Ruleset ID.
Find the Freeze Date in /releases and get the Seal Hash.
Run sha256sum on the pack. The resulting digest must match the Seal Hash.
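For readers who prefer a script to the command line, the checks in this path might be sketched as follows. The sketch assumes that manifest.json sits at the root of the ZIP, that the `tier` and `ruleset` fields look like the sample manifest, and that the Seal Hash is a hex-encoded SHA-256 of the whole pack file. The `self_audit` helper is hypothetical. It compares the Ruleset ID only and neither re-runs the evaluation nor verifies the ed25519 signature.

```python
import hashlib
import json
import zipfile


def sha256_file(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks (same digest sha256sum prints)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def self_audit(pack_path, seal_hash, expected_ruleset):
    """Return one boolean per check; field names follow the sample manifest."""
    with zipfile.ZipFile(pack_path) as zf:
        manifest = json.loads(zf.read("manifest.json"))
    return {
        "tier_is_reproduced": manifest.get("tier") == "REPRODUCED",
        "ruleset_matches": manifest.get("ruleset") == expected_ruleset,
        "hash_matches_seal": sha256_file(pack_path) == seal_hash.strip().lower(),
    }
```

Here `seal_hash` would be the value copied by hand from /releases, so the result does not depend on anything the Lab UI reports about the run.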