Project Context Primer
This book focuses on the Nomos Testing Framework. It assumes familiarity with the Nomos architecture, but for completeness, here is a short primer.
- Nomos is a modular blockchain protocol composed of validators, executors, and a data-availability (DA) subsystem.
- Validators participate in consensus and produce blocks.
- Executors are validators with the DA dispersal service enabled. They perform all validator functions plus submit blob data to the DA network.
- Data Availability (DA) ensures that blob data submitted via channel operations in transactions is published and retrievable by the network.
These roles interact tightly, which is why meaningful testing must be performed in multi-node environments that include real networking, timing, and DA interaction.
What You Will Learn
This book gives you a clear mental model for Nomos multi-node testing, shows how to author scenarios that pair realistic workloads with explicit expectations, and guides you to run them across local, containerized, and cluster environments without changing the plan.
Part I — Foundations
Conceptual chapters that establish the mental model for the framework and how it approaches multi-node testing.
Introduction
The Nomos Testing Framework is a purpose-built toolkit for exercising Nomos in realistic, multi-node environments. It bridges the gap between small, isolated tests and full-system validation by letting teams describe a cluster layout, drive meaningful traffic, and assert the outcomes in one coherent plan.
It is for protocol engineers, infrastructure operators, and QA teams who need repeatable confidence that validators, executors, and data-availability components work together under network and timing constraints.
Multi-node integration testing is required because many Nomos behaviors—block progress, data availability, liveness under churn—only emerge when several roles interact over real networking and time. This framework makes those checks declarative, observable, and portable across environments.
Architecture Overview
The framework follows a clear flow: Topology → Scenario → Deployer → Runner → Workloads → Expectations.
Core Flow
flowchart LR
A(Topology<br/>shape cluster) --> B(Scenario<br/>plan)
B --> C(Deployer<br/>provision & readiness)
C --> D(Runner<br/>orchestrate execution)
D --> E(Workloads<br/>drive traffic)
E --> F(Expectations<br/>verify outcomes)
Components
- Topology describes the cluster: how many nodes, their roles, and the high-level network and data-availability parameters they should follow.
- Scenario combines that topology with the activities to run and the checks to perform, forming a single plan.
- Deployer provisions infrastructure on the chosen backend (local processes, Docker Compose, or Kubernetes), waits for readiness, and returns a Runner.
- Runner orchestrates scenario execution: starts workloads, observes signals, evaluates expectations, and triggers cleanup.
- Workloads generate traffic and conditions that exercise the system.
- Expectations observe the run and judge success or failure once activity completes.
Each layer has a narrow responsibility so that cluster shape, deployment choice, traffic generation, and health checks can evolve independently while fitting together predictably.
Entry Points
The framework is consumed via runnable example binaries in examples/src/bin/:
- `local_runner.rs` — Spawns nodes as local processes
- `compose_runner.rs` — Deploys via Docker Compose (requires `NOMOS_TESTNET_IMAGE` built)
- `k8s_runner.rs` — Deploys via Kubernetes Helm (requires cluster + image)
Run with: `POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin <name>`
Important: All runners require `POL_PROOF_DEV_MODE=true` to avoid expensive Groth16 proof generation that causes timeouts.
These binaries use the framework API (ScenarioBuilder) to construct and execute scenarios.
Builder API
Scenarios are defined using a fluent builder pattern:
```rust
let mut plan = ScenarioBuilder::topology_with(|t| {
    t.network_star()            // Topology configuration
        .validators(3)
        .executors(2)
})
.wallets(50)                    // Wallet seeding
.transactions_with(|txs| {
    txs.rate(5)
        .users(20)
})
.da_with(|da| {
    da.channel_rate(1)
        .blob_rate(2)
})
.expect_consensus_liveness()    // Expectations
.with_run_duration(Duration::from_secs(90))
.build();
```
Key API Points:
- Topology uses the `.topology_with(|t| { t.validators(N).executors(M) })` closure pattern
- Workloads are configured via `_with` closures (`transactions_with`, `da_with`, `chaos_with`)
- Chaos workloads require `.enable_node_control()` and a compatible runner
Deployers
Three deployer implementations:
| Deployer | Backend | Prerequisites | Node Control |
|---|---|---|---|
| `LocalDeployer` | Local processes | Binaries in sibling checkout | No |
| `ComposeDeployer` | Docker Compose | `NOMOS_TESTNET_IMAGE` built | Yes |
| `K8sDeployer` | Kubernetes Helm | Cluster + image loaded | Not yet |
Compose-specific features:
- Includes Prometheus at `http://localhost:9090` (override via `TEST_FRAMEWORK_PROMETHEUS_PORT`)
- Optional OTLP trace/metrics endpoints (`NOMOS_OTLP_ENDPOINT`, `NOMOS_OTLP_METRICS_ENDPOINT`)
- Node control for chaos testing (restart validators/executors)
Assets and Images
Docker Image
Built via `testing-framework/assets/stack/scripts/build_test_image.sh`:
- Embeds KZG circuit parameters from `testing-framework/assets/stack/kzgrs_test_params/`
- Includes runner scripts: `run_nomos_node.sh`, `run_nomos_executor.sh`
- Tagged as `NOMOS_TESTNET_IMAGE` (default: `nomos-testnet:local`)
Circuit Assets
KZG parameters required for DA workloads:
- Default path: `testing-framework/assets/stack/kzgrs_test_params/`
- Override: `NOMOS_KZGRS_PARAMS_PATH=/custom/path`
- Fetch via: `scripts/setup-nomos-circuits.sh v0.3.1 /tmp/circuits`
Compose Stack
Templates and configs in `testing-framework/runners/compose/assets/`:
- `docker-compose.yml.tera` — Stack template (validators, executors, Prometheus)
- Cfgsync config: `testing-framework/assets/stack/cfgsync.yaml`
- Monitoring: `testing-framework/assets/stack/monitoring/prometheus.yml`
Logging Architecture
Two separate logging pipelines:
| Component | Configuration | Output |
|---|---|---|
| Runner binaries | `RUST_LOG` | Framework orchestration logs |
| Node processes | `NOMOS_LOG_LEVEL`, `NOMOS_LOG_FILTER`, `NOMOS_LOG_DIR` | Consensus, DA, mempool logs |
Node logging:
- Local runner: Writes to temporary directories by default (cleaned up). Set `NOMOS_TESTS_TRACING=true` + `NOMOS_LOG_DIR` for persistent files.
- Compose runner: Defaults to container stdout/stderr (`docker logs`). Optional per-node files if `NOMOS_LOG_DIR` is set and mounted.
- K8s runner: Logs to pod stdout/stderr (`kubectl logs`). Optional per-node files if `NOMOS_LOG_DIR` is set and mounted.
File naming: Per-node files use prefix `nomos-node-{index}` or `nomos-executor-{index}` (may include timestamps).
Observability
Prometheus (Compose only):
- Exposed at `http://localhost:9090` (configurable)
- Scrapes all validator and executor metrics
- Accessible in expectations: `ctx.telemetry().prometheus_endpoint()`
Node APIs:
- HTTP endpoints per node for consensus info, network status, DA membership
- Accessible in expectations: `ctx.node_clients().validators().get(0)`
OTLP (optional):
- Trace endpoint: `NOMOS_OTLP_ENDPOINT=http://localhost:4317`
- Metrics endpoint: `NOMOS_OTLP_METRICS_ENDPOINT=http://localhost:4318`
- Disabled by default (no noise if unset)
For detailed logging configuration, see Logging and Observability.
Testing Philosophy
This framework embodies specific principles that shape how you author and run scenarios. Understanding these principles helps you write effective tests and interpret results correctly.
Declarative over Imperative
Describe what you want to test, not how to orchestrate it:
```rust
// Good: declarative
ScenarioBuilder::topology_with(|t| {
    t.network_star()
        .validators(2)
        .executors(1)
})
.transactions_with(|txs| {
    txs.rate(5) // 5 transactions per block
})
.expect_consensus_liveness()
.build();

// Bad: imperative (framework doesn't work this way)
// spawn_validator(); spawn_executor();
// loop { submit_tx(); check_block(); }
```
Why it matters: The framework handles deployment, readiness, and cleanup. You focus on test intent, not infrastructure orchestration.
Protocol Time, Not Wall Time
Reason in blocks and consensus intervals, not wall-clock seconds.
Consensus defaults:
- Slot duration: 2 seconds (NTP-synchronized, configurable via `CONSENSUS_SLOT_TIME`)
- Active slot coefficient: 0.9 (90% block probability per slot)
- Expected rate: ~27 blocks per minute
```rust
// Good: protocol-oriented thinking
let plan = ScenarioBuilder::topology_with(|t| {
    t.network_star()
        .validators(2)
        .executors(1)
})
.transactions_with(|txs| {
    txs.rate(5) // 5 transactions per block
})
.with_run_duration(Duration::from_secs(60)) // Let framework calculate expected blocks
.expect_consensus_liveness()                // "Did we produce the expected blocks?"
.build();

// Bad: wall-clock assumptions
// "I expect exactly 30 blocks in 60 seconds"
// This breaks on slow CI where slot timing might drift
```
Why it matters: Slot timing is fixed (2s by default, NTP-synchronized), so the expected number of blocks is predictable: ~27 blocks in 60s with the default 0.9 active slot coefficient. The framework calculates expected blocks from slot duration and run window, making assertions protocol-based rather than tied to specific wall-clock expectations. Assert on "blocks produced relative to slots" not "blocks produced in exact wall-clock seconds".
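For a quick sanity check of those numbers, the arithmetic can be written down directly. The helper below is a standalone sketch of the formula above, not a framework API:

```rust
use std::time::Duration;

/// Expected block count for a run window under the default consensus
/// parameters quoted above (2s slots, 0.9 active slot coefficient).
fn expected_blocks(run: Duration, slot: Duration, active_slot_coeff: f64) -> f64 {
    (run.as_secs_f64() / slot.as_secs_f64()) * active_slot_coeff
}

fn main() {
    let blocks = expected_blocks(Duration::from_secs(60), Duration::from_secs(2), 0.9);
    println!("expect roughly {blocks:.0} blocks in 60s"); // ~27
}
```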
Determinism First, Chaos When Needed
Default scenarios are repeatable:
- Fixed topology
- Predictable traffic rates
- Deterministic checks
Chaos is opt-in:
```rust
// Separate: functional test (deterministic)
let plan = ScenarioBuilder::topology_with(|t| {
    t.network_star()
        .validators(2)
        .executors(1)
})
.transactions_with(|txs| {
    txs.rate(5) // 5 transactions per block
})
.expect_consensus_liveness()
.build();

// Separate: chaos test (introduces randomness)
let chaos_plan = ScenarioBuilder::topology_with(|t| {
    t.network_star()
        .validators(3)
        .executors(2)
})
.enable_node_control()
.chaos_with(|c| {
    c.restart()
        .min_delay(Duration::from_secs(30))
        .max_delay(Duration::from_secs(60))
        .target_cooldown(Duration::from_secs(45))
        .apply()
})
.transactions_with(|txs| {
    txs.rate(5) // 5 transactions per block
})
.expect_consensus_liveness()
.build();
```
Why it matters: Mixing determinism with chaos creates noisy, hard-to-debug failures. Separate concerns make failures actionable.
Observable Health Signals
Prefer user-facing signals over internal state:
Good checks:
- Blocks progressing at expected rate (liveness)
- Transactions included within N blocks (inclusion)
- DA blobs retrievable (availability)
Avoid internal checks:
- Memory pool size
- Internal service state
- Cache hit rates
Why it matters: User-facing signals reflect actual system health. Internal state can be "healthy" while the system is broken from a user perspective.
Minimum Run Windows
Always run long enough for meaningful block production:
```rust
// Bad: too short
.with_run_duration(Duration::from_secs(5))  // ~2 blocks (with default 2s slots, 0.9 coeff)

// Good: enough blocks for assertions
.with_run_duration(Duration::from_secs(60)) // ~27 blocks (with default 2s slots, 0.9 coeff)
```
Note: Block counts assume default consensus parameters:
- Slot duration: 2 seconds (configurable via `CONSENSUS_SLOT_TIME`)
- Active slot coefficient: 0.9 (90% block probability per slot)
- Formula: `blocks ≈ (duration / slot_duration) × active_slot_coeff`
If upstream changes these parameters, adjust your duration expectations accordingly.
The framework enforces minimum durations (at least 2× slot duration), but be explicit. Very short runs risk false confidence—one lucky block doesn't prove liveness.
Summary
These principles keep scenarios:
- Portable across environments (protocol time, declarative)
- Debuggable (determinism, separation of concerns)
- Meaningful (observable signals, sufficient duration)
When authoring scenarios, ask: "Does this test the protocol behavior or my local environment quirks?"
Scenario Lifecycle
- Build the plan: Declare a topology, attach workloads and expectations, and set the run window. The plan is the single source of truth for what will happen.
- Deploy: Hand the plan to a deployer. It provisions the environment on the chosen backend, waits for nodes to signal readiness, and returns a runner.
- Drive workloads: The runner starts traffic and behaviors (transactions, data-availability activity, restarts) for the planned duration.
- Observe blocks and signals: Track block progression and other high-level metrics during or after the run window to ground assertions in protocol time.
- Evaluate expectations: Once activity stops (and optional cooldown completes), the runner checks liveness and workload-specific outcomes to decide pass or fail.
- Cleanup: Tear down resources so successive runs start fresh and do not inherit leaked state.
flowchart LR
P[Plan<br/>topology + workloads + expectations] --> D[Deploy<br/>deployer provisions]
D --> R[Runner<br/>orchestrates execution]
R --> W[Drive Workloads]
W --> O[Observe<br/>blocks/metrics]
O --> E[Evaluate Expectations]
E --> C[Cleanup]
Design Rationale
- Modular crates keep configuration, orchestration, workloads, and runners decoupled so each can evolve without breaking the others.
- Pluggable runners let the same scenario run on a laptop, a Docker host, or a Kubernetes cluster, making validation portable across environments.
- Separated workloads and expectations clarify intent: what traffic to generate versus how to judge success. This simplifies review and reuse.
- Declarative topology makes cluster shape explicit and repeatable, reducing surprise when moving between CI and developer machines.
- Maintainability through predictability: a clear flow from plan to deployment to verification lowers the cost of extending the framework and interpreting failures.
Part II — User Guide
Practical guidance for shaping scenarios, combining workloads and expectations, and running them across different environments.
Workspace Layout
The workspace focuses on multi-node integration testing and sits alongside a `nomos-node` checkout. Its crates separate concerns to keep scenarios repeatable and portable:
- Configs: prepares high-level node, network, tracing, and wallet settings used across test environments.
- Core scenario orchestration: the engine that holds topology descriptions, scenario plans, runtimes, workloads, and expectations.
- Workflows: ready-made workloads (transactions, data-availability, chaos) and reusable expectations assembled into a user-facing DSL.
- Runners: deployment backends for local processes, Docker Compose, and Kubernetes, all consuming the same scenario plan.
- Runner Examples (`examples/runner-examples`): runnable binaries (`local_runner.rs`, `compose_runner.rs`, `k8s_runner.rs`) that demonstrate complete scenario execution with each deployer.
This split keeps configuration, orchestration, reusable traffic patterns, and deployment adapters loosely coupled while sharing one mental model for tests.
Annotated Tree
Directory structure with key paths annotated:
nomos-testing/
├─ testing-framework/ # Core library crates
│ ├─ configs/ # Node config builders, topology generation, tracing/logging config
│ ├─ core/ # Scenario model (ScenarioBuilder), runtime (Runner, Deployer), topology, node spawning
│ ├─ workflows/ # Workloads (transactions, DA, chaos), expectations (liveness), builder DSL extensions
│ ├─ runners/ # Deployment backends
│ │ ├─ local/ # LocalDeployer (spawns local processes)
│ │ ├─ compose/ # ComposeDeployer (Docker Compose + Prometheus)
│ │ └─ k8s/ # K8sDeployer (Kubernetes Helm)
│ └─ assets/ # Docker/K8s stack assets
│ └─ stack/
│ ├─ kzgrs_test_params/ # KZG circuit parameters (fetch via setup-nomos-circuits.sh)
│ ├─ monitoring/ # Prometheus config
│ ├─ scripts/ # Container entrypoints, image builder
│ └─ cfgsync.yaml # Config sync server template
│
├─ examples/ # PRIMARY ENTRY POINT: runnable binaries
│ └─ src/bin/
│ ├─ local_runner.rs # Local processes demo (POL_PROOF_DEV_MODE=true)
│ ├─ compose_runner.rs # Docker Compose demo (requires image)
│ └─ k8s_runner.rs # Kubernetes demo (requires cluster + image)
│
├─ scripts/ # Helper utilities
│ └─ setup-nomos-circuits.sh # Fetch KZG circuit parameters
│
└─ book/ # This documentation (mdBook)
Key Directories Explained
testing-framework/
Core library crates providing the testing API.
| Crate | Purpose | Key Exports |
|---|---|---|
| `configs` | Node configuration builders | Topology generation, tracing config |
| `core` | Scenario model & runtime | `ScenarioBuilder`, `Deployer`, `Runner` |
| `workflows` | Workloads & expectations | `ScenarioBuilderExt`, `ChaosBuilderExt` |
| `runners/local` | Local process deployer | `LocalDeployer` |
| `runners/compose` | Docker Compose deployer | `ComposeDeployer` |
| `runners/k8s` | Kubernetes deployer | `K8sDeployer` |
testing-framework/assets/stack/
Docker/K8s deployment assets:
- `kzgrs_test_params/`: Circuit parameters (override via `NOMOS_KZGRS_PARAMS_PATH`)
- `monitoring/`: Prometheus config
- `scripts/`: Container entrypoints and image builder
- `cfgsync.yaml`: Configuration sync server template
examples/ (Start Here!)
Runnable binaries demonstrating framework usage:
- `local_runner.rs` — Local processes
- `compose_runner.rs` — Docker Compose (requires `NOMOS_TESTNET_IMAGE` built)
- `k8s_runner.rs` — Kubernetes (requires cluster + image)
Run with: `POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin <name>`
All runners require `POL_PROOF_DEV_MODE=true` to avoid expensive proof generation.
scripts/
Helper utilities:
- `setup-nomos-circuits.sh`: Fetch KZG parameters from releases
Observability
Compose runner includes:
- Prometheus at `http://localhost:9090` (metrics scraping)
- Node metrics exposed per validator/executor
- Access in expectations: `ctx.telemetry().prometheus_endpoint()`
Logging controlled by:
- `NOMOS_LOG_DIR` — Write per-node log files
- `NOMOS_LOG_LEVEL` — Global log level (error/warn/info/debug/trace)
- `NOMOS_LOG_FILTER` — Target-specific filtering (e.g., `consensus=trace,da=debug`)
- `NOMOS_TESTS_TRACING` — Enable file logging for local runner
See Logging and Observability for details.
Navigation Guide
| To Do This | Go Here |
|---|---|
| Run an example | `examples/src/bin/` → `cargo run -p runner-examples --bin <name>` |
| Write a custom scenario | `testing-framework/core/` → Implement using `ScenarioBuilder` |
| Add a new workload | `testing-framework/workflows/src/workloads/` → Implement `Workload` trait |
| Add a new expectation | `testing-framework/workflows/src/expectations/` → Implement `Expectation` trait |
| Modify node configs | `testing-framework/configs/src/topology/configs/` |
| Extend builder DSL | `testing-framework/workflows/src/builder/` → Add trait methods |
| Add a new deployer | `testing-framework/runners/` → Implement `Deployer` trait |
For detailed guidance, see Internal Crate Reference.
Authoring Scenarios
Creating a scenario is a declarative exercise:
- Shape the topology: decide how many validators and executors to run, and what high-level network and data-availability characteristics matter for the test.
- Attach workloads: pick traffic generators that align with your goals (transactions, data-availability blobs, or chaos for resilience probes).
- Define expectations: specify the health signals that must hold when the run finishes (e.g., consensus liveness, inclusion of submitted activity; see Core Content: Workloads & Expectations).
- Set duration: choose a run window long enough to observe meaningful block progression and the effects of your workloads.
- Choose a runner: target local processes for fast iteration, Docker Compose for reproducible multi-node stacks, or Kubernetes for cluster-grade validation. For environment considerations, see Operations.
Keep scenarios small and explicit: make the intended behavior and the success criteria clear so failures are easy to interpret and act upon.
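Putting those five steps together, a minimal plan follows the same builder pattern used throughout this book. The sketch below is illustrative (counts, rates, and duration are arbitrary), with the runner choice deferred to deploy time:

```rust
use testing_framework_core::scenario::ScenarioBuilder;
use testing_framework_workflows::ScenarioBuilderExt;
use std::time::Duration;

// 1. Shape the topology   2. Attach workloads   3. Define expectations
// 4. Set duration         5. Choose a runner when calling deploy()
let mut plan = ScenarioBuilder::topology_with(|t| {
    t.network_star()
        .validators(3)
        .executors(1)
})
.wallets(10)
.transactions_with(|txs| {
    txs.rate(2)
        .users(5)
})
.expect_consensus_liveness()
.with_run_duration(Duration::from_secs(60))
.build();
```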
Core Content: Workloads & Expectations
Workloads describe the activity a scenario generates; expectations describe the signals that must hold when that activity completes. Both are pluggable so scenarios stay readable and purpose-driven.
Workloads
- Transaction workload: submits user-level transactions at a configurable rate and can limit how many distinct actors participate.
- Data-availability workload: drives blob and channel activity to exercise data-availability paths.
- Chaos workload: triggers controlled node restarts to test resilience and recovery behaviors (requires a runner that can control nodes).
Expectations
- Consensus liveness: verifies the system continues to produce blocks in line with the planned workload and timing window.
- Workload-specific checks: each workload can attach its own success criteria (e.g., inclusion of submitted activity) so scenarios remain concise.
Together, workloads and expectations let you express both the pressure applied to the system and the definition of “healthy” for that run.
flowchart TD
I[Inputs<br/>topology + wallets + rates] --> Init[Workload init]
Init --> Drive[Drive traffic]
Drive --> Collect[Collect signals]
Collect --> Eval[Expectations evaluate]
Core Content: ScenarioBuilderExt Patterns
Patterns that keep scenarios readable and reusable:
- Topology-first: start by shaping the cluster (counts, layout) so later steps inherit a clear foundation.
- Bundle defaults: use the DSL helpers to attach common expectations (like liveness) whenever you add a matching workload, reducing forgotten checks.
- Intentional rates: express traffic in per-block terms to align with protocol timing rather than wall-clock assumptions.
- Opt-in chaos: enable restart patterns only in scenarios meant to probe resilience; keep functional smoke tests deterministic.
- Wallet clarity: seed only the number of actors you need; it keeps transaction scenarios deterministic and interpretable.
These patterns make scenario definitions self-explanatory while staying aligned with the framework’s block-oriented timing model.
Best Practices
- State your intent: document the goal of each scenario (throughput, DA validation, resilience) so expectation choices are obvious.
- Keep runs meaningful: choose durations that allow multiple blocks and make timing-based assertions trustworthy.
- Separate concerns: start with deterministic workloads for functional checks; add chaos in dedicated resilience scenarios to avoid noisy failures.
- Reuse patterns: standardize on shared topology and workload presets so results are comparable across environments and teams.
- Observe first, tune second: rely on liveness and inclusion signals to interpret outcomes before tweaking rates or topology.
- Environment fit: pick runners that match the feedback loop you need—local for speed (including fast CI smoke tests), compose for reproducible stacks (recommended for CI), k8s for cluster-grade fidelity.
- Minimal surprises: seed only necessary wallets and keep configuration deltas explicit when moving between CI and developer machines.
Examples
Concrete scenario shapes that illustrate how to combine topologies, workloads, and expectations.
Runnable examples: The repo includes complete binaries in `examples/src/bin/`:
- `local_runner.rs` — Local processes
- `compose_runner.rs` — Docker Compose (requires `NOMOS_TESTNET_IMAGE` built)
- `k8s_runner.rs` — Kubernetes (requires cluster access and image loaded)
Run with: `POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin <name>`
All runners require `POL_PROOF_DEV_MODE=true` to avoid expensive proof generation.
Code patterns below show how to build scenarios. Wrap these in `#[tokio::test]` functions for integration tests, or `#[tokio::main]` for binaries.
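For instance, a binary entry point could simply call one of the async functions defined below (shown here with `simple_consensus`; the wrapper itself is a sketch):

```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Delegate to one of the scenario functions from the examples below.
    simple_consensus().await
}
```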
Simple consensus liveness
Minimal test that validates basic block production:
```rust
use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use testing_framework_workflows::ScenarioBuilderExt;
use std::time::Duration;

async fn simple_consensus() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let mut plan = ScenarioBuilder::topology_with(|t| {
        t.network_star()
            .validators(3)
            .executors(0)
    })
    .expect_consensus_liveness()
    .with_run_duration(Duration::from_secs(30))
    .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await?;
    let _handle = runner.run(&mut plan).await?;
    Ok(())
}
```
When to use: smoke tests for consensus on minimal hardware.
Transaction workload
Test consensus under transaction load:
```rust
use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use testing_framework_workflows::ScenarioBuilderExt;
use std::time::Duration;

async fn transaction_workload() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let mut plan = ScenarioBuilder::topology_with(|t| {
        t.network_star()
            .validators(2)
            .executors(0)
    })
    .wallets(20)
    .transactions_with(|txs| {
        txs.rate(5)
            .users(10)
    })
    .expect_consensus_liveness()
    .with_run_duration(Duration::from_secs(60))
    .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await?;
    let _handle = runner.run(&mut plan).await?;
    Ok(())
}
```
When to use: validate transaction submission and inclusion.
DA + transaction workload
Combined test stressing both transaction and DA layers:
```rust
use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use testing_framework_workflows::ScenarioBuilderExt;
use std::time::Duration;

async fn da_and_transactions() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let mut plan = ScenarioBuilder::topology_with(|t| {
        t.network_star()
            .validators(3)
            .executors(2)
    })
    .wallets(30)
    .transactions_with(|txs| {
        txs.rate(5)
            .users(15)
    })
    .da_with(|da| {
        da.channel_rate(1)
            .blob_rate(2)
    })
    .expect_consensus_liveness()
    .with_run_duration(Duration::from_secs(90))
    .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await?;
    let _handle = runner.run(&mut plan).await?;
    Ok(())
}
```
When to use: end-to-end coverage of transaction and DA layers.
Chaos resilience
Test system resilience under node restarts:
```rust
use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_runner_compose::ComposeDeployer;
use testing_framework_workflows::{ScenarioBuilderExt, ChaosBuilderExt};
use std::time::Duration;

async fn chaos_resilience() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let mut plan = ScenarioBuilder::topology_with(|t| {
        t.network_star()
            .validators(4)
            .executors(2)
    })
    .enable_node_control()
    .wallets(20)
    .transactions_with(|txs| {
        txs.rate(3)
            .users(10)
    })
    .chaos_with(|c| {
        c.restart()
            .min_delay(Duration::from_secs(20))
            .max_delay(Duration::from_secs(40))
            .target_cooldown(Duration::from_secs(30))
            .apply()
    })
    .expect_consensus_liveness()
    .with_run_duration(Duration::from_secs(120))
    .build();

    let deployer = ComposeDeployer::default();
    let runner = deployer.deploy(&plan).await?;
    let _handle = runner.run(&mut plan).await?;
    Ok(())
}
```
When to use: resilience validation and operational readiness drills.
Note: Chaos tests require ComposeDeployer or another runner with node control support.
Advanced Examples
Realistic advanced scenarios demonstrating framework capabilities for production testing.
Summary
| Example | Topology | Workloads | Deployer | Key Feature |
|---|---|---|---|---|
| Load Progression | 3 validators + 2 executors | Increasing tx rate | Compose | Dynamic load testing |
| Sustained Load | 4 validators + 2 executors | High tx + DA rate | Compose | Stress testing |
| Aggressive Chaos | 4 validators + 2 executors | Frequent restarts + traffic | Compose | Resilience validation |
Load Progression Test
Test consensus under progressively increasing transaction load:
```rust
use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_runner_compose::ComposeDeployer;
use testing_framework_workflows::ScenarioBuilderExt;
use std::time::Duration;

async fn load_progression_test() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    for rate in [5, 10, 20, 30] {
        println!("Testing with rate: {}", rate);

        let mut plan = ScenarioBuilder::topology_with(|t| {
            t.network_star()
                .validators(3)
                .executors(2)
        })
        .wallets(50)
        .transactions_with(|txs| {
            txs.rate(rate)
                .users(20)
        })
        .expect_consensus_liveness()
        .with_run_duration(Duration::from_secs(60))
        .build();

        let deployer = ComposeDeployer::default();
        let runner = deployer.deploy(&plan).await?;
        let _handle = runner.run(&mut plan).await?;
    }
    Ok(())
}
```
When to use: Finding the maximum sustainable transaction rate for a given topology.
Sustained Load Test
Run high transaction and DA load for extended duration:
```rust
use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_runner_compose::ComposeDeployer;
use testing_framework_workflows::ScenarioBuilderExt;
use std::time::Duration;

async fn sustained_load_test() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let mut plan = ScenarioBuilder::topology_with(|t| {
        t.network_star()
            .validators(4)
            .executors(2)
    })
    .wallets(100)
    .transactions_with(|txs| {
        txs.rate(15)
            .users(50)
    })
    .da_with(|da| {
        da.channel_rate(2)
            .blob_rate(3)
    })
    .expect_consensus_liveness()
    .with_run_duration(Duration::from_secs(300))
    .build();

    let deployer = ComposeDeployer::default();
    let runner = deployer.deploy(&plan).await?;
    let _handle = runner.run(&mut plan).await?;
    Ok(())
}
```
When to use: Validating stability under continuous high load over extended periods.
Aggressive Chaos Test
Frequent node restarts with active traffic:
```rust
use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_runner_compose::ComposeDeployer;
use testing_framework_workflows::{ScenarioBuilderExt, ChaosBuilderExt};
use std::time::Duration;

async fn aggressive_chaos_test() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let mut plan = ScenarioBuilder::topology_with(|t| {
        t.network_star()
            .validators(4)
            .executors(2)
    })
    .enable_node_control()
    .wallets(50)
    .transactions_with(|txs| {
        txs.rate(10)
            .users(20)
    })
    .chaos_with(|c| {
        c.restart()
            .min_delay(Duration::from_secs(10))
            .max_delay(Duration::from_secs(20))
            .target_cooldown(Duration::from_secs(15))
            .apply()
    })
    .expect_consensus_liveness()
    .with_run_duration(Duration::from_secs(180))
    .build();

    let deployer = ComposeDeployer::default();
    let runner = deployer.deploy(&plan).await?;
    let _handle = runner.run(&mut plan).await?;
    Ok(())
}
```
When to use: Validating recovery and liveness under aggressive failure conditions.
Note: Requires ComposeDeployer for node control support.
Extension Ideas
These scenarios require custom implementations but demonstrate framework extensibility:
Mempool & Transaction Handling
Transaction Propagation & Inclusion Test
Concept: Submit the same batch of independent transactions to different nodes in randomized order/offsets, then verify all transactions are included and final state matches across nodes.
Requirements:
- Custom workload: Generates a fixed batch of transactions and submits the same set to different nodes via `ctx.node_clients()`, with randomized submission order and timing offsets per node
- Custom expectation: Verifies all transactions appear in blocks (order may vary), final state matches across all nodes (compare balances or state roots), and no transactions are dropped
Why useful: Exercises mempool propagation, proposer fairness, and transaction inclusion guarantees under realistic race conditions. Tests that the protocol maintains consistency regardless of which node receives transactions first.
Implementation notes: Requires both a custom Workload implementation (to submit same transactions to multiple nodes with jitter) and a custom Expectation implementation (to verify inclusion and state consistency).
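The per-node randomization is the piece the framework does not provide out of the box. A framework-agnostic sketch of that piece (generic over the transaction type, using the `rand` crate) could look like the following, with the actual submission left to the custom workload:

```rust
use rand::seq::SliceRandom;
use rand::Rng;

/// Build one submission schedule per node: the same batch, shuffled
/// independently, with a random millisecond offset per transaction.
/// `Tx` stands in for whatever transaction type the workload uses.
fn randomized_schedules<Tx: Clone>(batch: &[Tx], node_count: usize) -> Vec<Vec<(u64, Tx)>> {
    let mut rng = rand::thread_rng();
    (0..node_count)
        .map(|_| {
            let mut order: Vec<Tx> = batch.to_vec();
            order.shuffle(&mut rng);
            order
                .into_iter()
                .map(|tx| (rng.gen_range(0..250), tx)) // per-submission jitter in ms
                .collect()
        })
        .collect()
}
```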
Cross-Validator Mempool Divergence & Convergence
Concept: Drive different transaction subsets into different validators (or differing arrival orders) to create temporary mempool divergence, then verify mempools/blocks converge to contain the union (no permanent divergence).
Requirements:
- Custom workload: Targets specific nodes via `ctx.node_clients()` with disjoint or jittered transaction batches
- Custom expectation: After a convergence window, verifies that all transactions appear in blocks (order may vary) or that mempool contents converge across nodes
- Run normal workloads during convergence period
Expectations:
- Temporary mempool divergence is acceptable (different nodes see different transactions initially)
- After convergence window, all transactions appear in blocks or mempools converge
- No transactions are permanently dropped despite initial divergence
- Mempool gossip/reconciliation mechanisms work correctly
Why useful: Exercises mempool gossip and reconciliation under uneven input or latency. Ensures no node "drops" transactions seen elsewhere, validating that mempool synchronization mechanisms correctly propagate transactions across the network even when they arrive at different nodes in different orders.
Implementation notes: Requires both a custom Workload implementation (to inject disjoint/jittered batches per node) and a custom Expectation implementation (to verify mempool convergence or block inclusion). Uses existing ctx.node_clients() capability—no new infrastructure needed.
Adaptive Mempool Pressure Test
Concept: Ramp transaction load over time to observe mempool growth, fee prioritization/eviction, and block saturation behavior, detecting performance regressions and ensuring backpressure/eviction work under increasing load.
Requirements:
- Custom workload: Steadily increases transaction rate over time (optional: use fee tiers)
- Custom expectation: Monitors mempool size, evictions, and throughput (blocks/txs per slot), flagging runaway growth or stalls
- Run for extended duration to observe pressure buildup
Expectations:
- Mempool size grows predictably with load (not runaway growth)
- Fee prioritization/eviction mechanisms activate under pressure
- Block saturation behavior is acceptable (blocks fill appropriately)
- Throughput (blocks/txs per slot) remains stable or degrades gracefully
- No stalls or unbounded mempool growth
Why useful: Detects performance regressions in mempool management. Ensures backpressure and eviction mechanisms work correctly under increasing load, preventing memory exhaustion or unbounded growth. Validates that fee prioritization correctly selects high-value transactions when mempool is full.
Implementation notes: Can be built with current workload model (ramping rate). Requires custom Expectation implementation that reads mempool metrics (via node HTTP APIs or Prometheus) and monitors throughput to judge behavior. No new infrastructure needed—uses existing observability capabilities.
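For the metrics side, one lightweight option is to query Prometheus's HTTP API from the expectation. The sketch below assumes the Compose runner's Prometheus endpoint and a placeholder metric name; the actual metric would need to match whatever the nodes export:

```rust
use serde_json::Value;

/// Query Prometheus for the current value of a metric (instant query).
/// `prometheus` is the base URL, e.g. the value returned by
/// ctx.telemetry().prometheus_endpoint(); `metric` is a placeholder name.
async fn query_metric(prometheus: &str, metric: &str) -> Result<Value, reqwest::Error> {
    reqwest::Client::new()
        .get(format!("{prometheus}/api/v1/query"))
        .query(&[("query", metric)])
        .send()
        .await?
        .json::<Value>()
        .await
}
```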
Invalid Transaction Fuzzing
Concept: Submit malformed transactions and verify they're rejected properly.
Implementation approach:
- Custom workload that generates invalid transactions (bad signatures, insufficient funds, malformed structure)
- Expectation verifies mempool rejects them and they never appear in blocks
- Test mempool resilience and filtering
Why useful: Ensures mempool doesn't crash or include invalid transactions under fuzzing.
Network & Gossip
Gossip Latency Gradient Scenario
Concept: Test consensus robustness under skewed gossip delays by partitioning nodes into latency tiers (tier A ≈10ms, tier B ≈100ms, tier C ≈300ms) and observing propagation lag, fork rate, and eventual convergence.
Requirements:
- Partition nodes into three groups (tiers)
- Apply per-group network delay via chaos: `netem`/`iptables` in compose; NetworkPolicy + `netem` sidecar in k8s
- Run standard workload (transactions/block production)
- Optional: Remove delays at end to check recovery
Expectations:
- Propagation: Messages reach all tiers within acceptable bounds
- Safety: No divergent finalized heads; fork rate stays within tolerance
- Liveness: Chain keeps advancing; convergence after delays relaxed (if healed)
Why useful: Real networks have heterogeneous latency. This stress-tests proposer selection and fork resolution when some peers are "far" (high latency), validating that consensus remains safe and live under realistic network conditions.
Current blocker: Runner support for per-group delay injection (network delay via netem/iptables) is not present today. Would require new chaos plumbing in compose/k8s deployers to inject network delays per node group.
Byzantine Gossip Flooding (libp2p Peer)
Concept: Spin up a custom workload/sidecar that runs a libp2p host, joins the cluster's gossip mesh, and publishes a high rate of syntactically valid but useless/stale messages to selected topics, testing gossip backpressure, scoring, and queue handling under a "malicious" peer.
Requirements:
- Custom workload/sidecar that implements a libp2p host
- Join the cluster's gossip mesh as a peer
- Publish high-rate syntactically valid but useless/stale messages to selected gossip topics
- Run alongside normal workloads (transactions/block production)
Expectations:
- Gossip backpressure mechanisms prevent message flooding from overwhelming nodes
- Peer scoring correctly identifies and penalizes the malicious peer
- Queue handling remains stable under flood conditions
- Normal consensus operation continues despite malicious peer
Why useful: Tests Byzantine behavior (malicious peer) which is critical for consensus protocol robustness. More realistic than RPC spam since it uses the actual gossip protocol. Validates that gossip backpressure, peer scoring, and queue management correctly handle adversarial peers without disrupting consensus.
Current blocker: Requires adding gossip-capable helper (libp2p integration) to the framework. Would need a custom workload/sidecar implementation that can join the gossip mesh and inject messages. The rest of the scenario can use existing runners/workloads.
Network Partition Recovery
Concept: Test consensus recovery after network partitions.
Requirements:
- Needs `block_peer()`/`unblock_peer()` methods in `NodeControlHandle`
- Partition subsets of validators, wait, then restore connectivity
- Verify chain convergence after partition heals
Why useful: Tests the most realistic failure mode in distributed systems.
Current blocker: Node control doesn't yet support network-level actions (only process restarts).
Time & Timing
Time-Shifted Blocks (Clock Skew Test)
Concept: Test consensus and timestamp handling when nodes run with skewed clocks (e.g., +1s, −1s, +200ms jitter) to surface timestamp validation issues, reorg sensitivity, and clock drift handling.
Requirements:
- Assign per-node time offsets (e.g., +1s, −1s, +200ms jitter)
- Run normal workload (transactions/block production)
- Observe whether blocks are accepted/propagated and the chain stays consistent
Expectations:
- Blocks with skewed timestamps are handled correctly (accepted or rejected per protocol rules)
- Chain remains consistent across nodes despite clock differences
- No unexpected reorgs or chain splits due to timestamp validation issues
Why useful: Clock skew is a common real-world issue in distributed systems. This validates that consensus correctly handles timestamp validation and maintains safety/liveness when nodes have different clock offsets, preventing timestamp-based attacks or failures.
Current blocker: Runner ability to skew per-node clocks (e.g., privileged containers with libfaketime/chrony or time-offset netns) is not available today. Would require a new chaos/time-skew hook in deployers to inject clock offsets per node.
Block Timing Consistency
Concept: Verify block production intervals stay within expected bounds.
Implementation approach:
- Custom expectation that consumes `BlockFeed`
- Collect block timestamps during run
- Assert intervals stay within the expected inter-block time derived from `slot_duration / active_slot_coeff`, plus a tolerance
Why useful: Validates consensus timing under various loads.
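The interval check itself is independent of how the timestamps are collected. A minimal sketch, assuming the default parameters and treating `slot_duration / active_slot_coeff` as the expected average inter-block time:

```rust
/// Return true if every gap between consecutive block timestamps (seconds)
/// stays within the expected inter-block time plus a tolerance.
fn intervals_within_bounds(timestamps: &[f64], slot_secs: f64, coeff: f64, tolerance: f64) -> bool {
    let expected = slot_secs / coeff; // ~2.2s with 2s slots and a 0.9 coefficient
    timestamps
        .windows(2)
        .all(|pair| (pair[1] - pair[0]) <= expected + tolerance)
}

fn main() {
    let observed = [0.0, 2.1, 4.4, 6.2, 8.5];
    assert!(intervals_within_bounds(&observed, 2.0, 0.9, 2.0));
}
```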
Topology & Membership
Dynamic Topology (Churn) Scenario
Concept: Nodes join and leave mid-run (new identities/addresses added; some nodes permanently removed) to exercise peer discovery, bootstrapping, reputation, and load balancing under churn.
Requirements:
- Runner must be able to spin up new nodes with fresh keys/addresses at runtime
- Update peer lists and bootstraps dynamically as nodes join/leave
- Optionally tear down nodes permanently (not just restart)
- Run normal workloads (transactions/block production) during churn
Expectations:
- New nodes successfully discover and join the network
- Peer discovery mechanisms correctly handle dynamic topology changes
- Reputation systems adapt to new/removed peers
- Load balancing adjusts to changing node set
- Consensus remains safe and live despite topology churn
Why useful: Real networks experience churn (nodes joining/leaving). Unlike restarts (which preserve topology), churn changes the actual topology size and peer set, testing how the protocol handles dynamic membership. This exercises peer discovery, bootstrapping, reputation systems, and load balancing under realistic conditions.
Current blocker: Runner support for dynamic node addition/removal at runtime is not available today. Chaos today only restarts existing nodes; churn would require the ability to spin up new nodes with fresh identities/addresses, update peer lists/bootstraps dynamically, and permanently remove nodes. Would need new topology management capabilities in deployers.
API & External Interfaces
API DoS/Stress Test
Concept: Adversarial workload floods node HTTP/WS APIs with high QPS and malformed/bursty requests; expectation checks nodes remain responsive or rate-limit without harming consensus.
Requirements:
- Custom workload: Targets node HTTP/WS API endpoints with mixed valid/invalid requests at high rate
- Custom expectation: Monitors error rates, latency, and confirms block production/liveness unaffected
- Run alongside normal workloads (transactions/block production)
Expectations:
- Nodes remain responsive or correctly rate-limit under API flood
- Error rates/latency are acceptable (rate limiting works)
- Block production/liveness unaffected by API abuse
- Consensus continues normally despite API stress
Why useful: Validates API hardening under abuse and ensures control/telemetry endpoints don't destabilize the node. Tests that API abuse is properly isolated from consensus operations, preventing DoS attacks on API endpoints from affecting blockchain functionality.
Implementation notes: Requires custom Workload implementation that directs high-QPS traffic to node APIs (via ctx.node_clients() or direct HTTP clients) and custom Expectation implementation that monitors API responsiveness metrics and consensus liveness. Uses existing node API access—no new infrastructure needed.
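The flooding half can be sketched with any async HTTP client. The endpoint below is a placeholder (a real scenario would resolve it from ctx.node_clients() or the runner's port mappings), and the request count is illustrative:

```rust
use std::time::Duration;

/// Fire `requests` concurrent GETs at a node API endpoint and ignore
/// individual failures; under stress, errors are data, not test failures.
async fn flood_endpoint(endpoint: &str, requests: usize) {
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(2))
        .build()
        .expect("http client");
    let tasks: Vec<_> = (0..requests)
        .map(|_| {
            let client = client.clone();
            let url = endpoint.to_string();
            tokio::spawn(async move {
                let _ = client.get(url).send().await;
            })
        })
        .collect();
    for task in tasks {
        let _ = task.await;
    }
}
```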
State & Correctness
Wallet Balance Verification
Concept: Track wallet balances and verify state consistency.
Description: After transaction workload completes, query all wallet balances via node API and verify total supply is conserved. Requires tracking initial state, submitted transactions, and final balances. Validates that the ledger maintains correctness under load (no funds lost or created). This is a state assertion expectation that checks correctness, not just liveness.
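The conservation check at the end is plain arithmetic once balances have been queried. A minimal sketch, assuming fees are redistributed rather than burned and balances are keyed by wallet identifier:

```rust
use std::collections::HashMap;

/// Verify that total supply is unchanged between two balance snapshots.
/// Values are balances in the smallest denomination.
fn supply_conserved(initial: &HashMap<String, u64>, end: &HashMap<String, u64>) -> bool {
    let before: u64 = initial.values().sum();
    let after: u64 = end.values().sum();
    before == after
}
```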
Running Scenarios
Running a scenario follows the same conceptual flow regardless of environment:
- Select or author a scenario plan that pairs a topology with workloads, expectations, and a suitable run window.
- Choose a deployer aligned with your environment (local, compose, or k8s) and ensure its prerequisites are available.
- Deploy the plan through the deployer, which provisions infrastructure and returns a runner.
- The runner orchestrates workload execution for the planned duration; keep observability signals visible so you can correlate outcomes.
- The runner evaluates expectations and captures results as the primary pass/fail signal.
Use the same plan across different deployers to compare behavior between local development and CI or cluster settings. For environment prerequisites and flags, see Operations.
Runners
Runners turn a scenario plan into a live environment while keeping the plan unchanged. Choose based on feedback speed, reproducibility, and fidelity. For environment and operational considerations, see Operations.
Important: All runners require POL_PROOF_DEV_MODE=true to avoid expensive Groth16 proof generation that causes timeouts.
Local runner
- Launches node processes directly on the host.
- Fastest feedback loop and minimal orchestration overhead.
- Best for development-time iteration and debugging.
- Can run in CI for fast smoke tests.
- Node control: Not supported (chaos workloads not available)
Docker Compose runner
- Starts nodes in containers to provide a reproducible multi-node stack on a single machine.
- Discovers service ports and wires observability for convenient inspection.
- Good balance between fidelity and ease of setup.
- Recommended for CI pipelines (isolated environment, reproducible).
- Node control: Supported (can restart nodes for chaos testing)
Kubernetes runner
- Deploys nodes onto a cluster for higher-fidelity, longer-running scenarios.
- Suits CI with cluster access or shared test environments where cluster behavior and scheduling matter.
- Node control: Not supported yet (chaos workloads not available)
Common expectations
- All runners require at least one validator and, for transaction scenarios, access to seeded wallets.
- Readiness probes gate workload start so traffic begins only after nodes are reachable.
- Environment flags can relax timeouts or increase tracing when diagnostics are needed.
flowchart TD
Plan[Scenario Plan] --> RunSel{"Runner<br/>(local | compose | k8s)"}
RunSel --> Provision[Provision & readiness]
Provision --> Runtime[Runtime + observability]
Runtime --> Exec[Workloads & Expectations execute]
Operations
Operational readiness focuses on prerequisites, environment fit, and clear signals:
- Prerequisites: keep a sibling `nomos-node` checkout available; ensure the chosen runner's platform needs are met (local binaries for host runs, Docker for compose, cluster access for k8s).
- Artifacts: DA scenarios require KZG parameters (circuit assets) located at `testing-framework/assets/stack/kzgrs_test_params`. Fetch them via `scripts/setup-nomos-circuits.sh` or override the path with `NOMOS_KZGRS_PARAMS_PATH`.
- Environment flags: `POL_PROOF_DEV_MODE=true` is required for all runners (local, compose, k8s); without it, expensive Groth16 proof generation will cause tests to time out. Configure logging via `NOMOS_LOG_DIR`, `NOMOS_LOG_LEVEL`, and `NOMOS_LOG_FILTER` (see Logging and Observability for details). Note that nodes ignore `RUST_LOG` and only respond to `NOMOS_*` variables.
- Failure triage: map failures to missing prerequisites (wallet seeding, node control availability), runner platform issues, or unmet expectations. Start with liveness signals, then dive into workload-specific assertions.
Treat operational hygiene—assets present, prerequisites satisfied, observability reachable—as the first step to reliable scenario outcomes.
CI Usage
Both LocalDeployer and ComposeDeployer work in CI environments:
LocalDeployer in CI:
- Faster (no Docker overhead)
- Good for quick smoke tests
- Trade-off: Less isolation (processes share host)
ComposeDeployer in CI (recommended):
- Better isolation (containerized)
- Reproducible environment
- Includes Prometheus/observability
- Trade-off: Slower startup (Docker image build)
- Trade-off: Requires Docker daemon
See .github/workflows/compose-mixed.yml for a complete CI example using ComposeDeployer.
Running Examples
Local Runner
POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin local_runner
Optional environment variables:
- `LOCAL_DEMO_VALIDATORS=3` — Number of validators (default: 1)
- `LOCAL_DEMO_EXECUTORS=2` — Number of executors (default: 1)
- `LOCAL_DEMO_RUN_SECS=120` — Run duration in seconds (default: 60)
- `NOMOS_TESTS_TRACING=true` — Enable persistent file logging (required with `NOMOS_LOG_DIR`)
- `NOMOS_LOG_DIR=/tmp/logs` — Directory for per-node log files (only with `NOMOS_TESTS_TRACING=true`)
- `NOMOS_LOG_LEVEL=debug` — Set log level (default: info)
- `NOMOS_LOG_FILTER=consensus=trace,da=debug` — Fine-grained module filtering
Note: The default local_runner example includes DA workload, so circuit assets in testing-framework/assets/stack/kzgrs_test_params/ are required (fetch via scripts/setup-nomos-circuits.sh).
Compose Runner
Prerequisites:
- Docker daemon running
- Circuit assets in `testing-framework/assets/stack/kzgrs_test_params` (fetched via `scripts/setup-nomos-circuits.sh`)
- Test image built (see below)
Build the test image:
# Fetch circuit assets first
chmod +x scripts/setup-nomos-circuits.sh
scripts/setup-nomos-circuits.sh v0.3.1 /tmp/nomos-circuits
cp -r /tmp/nomos-circuits/* testing-framework/assets/stack/kzgrs_test_params/
# Build image (embeds assets)
chmod +x testing-framework/assets/stack/scripts/build_test_image.sh
testing-framework/assets/stack/scripts/build_test_image.sh
Run the example:
NOMOS_TESTNET_IMAGE=nomos-testnet:local \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin compose_runner
Required environment variables:
- `NOMOS_TESTNET_IMAGE=nomos-testnet:local` — Image tag (must match built image)
- `POL_PROOF_DEV_MODE=true` — Critical: without this, proof generation is CPU-intensive and tests will time out
Optional environment variables:
- `COMPOSE_NODE_PAIRS=1x1` — Topology: "validators×executors" (default varies by example)
- `TEST_FRAMEWORK_PROMETHEUS_PORT=9091` — Override Prometheus port (default: 9090)
- `COMPOSE_RUNNER_HOST=127.0.0.1` — Host address for port mappings (default: 127.0.0.1)
- `COMPOSE_RUNNER_PRESERVE=1` — Keep containers running after test (for debugging)
- `NOMOS_LOG_DIR=/tmp/compose-logs` — Write logs to files inside containers (requires copy-out or volume mount)
- `NOMOS_LOG_LEVEL=debug` — Set log level
Compose-specific features:
- Node control support: Only runner that supports chaos testing (`.enable_node_control()` + `.chaos()` workloads)
- Prometheus observability: Metrics at `http://localhost:9090`
Important: Chaos workloads (random restarts) only work with ComposeDeployer. LocalDeployer and K8sDeployer do not support node control.
K8s Runner
Prerequisites:
- Kubernetes cluster with `kubectl` configured and working
- Circuit assets in `testing-framework/assets/stack/kzgrs_test_params`
- Test image built (same as Compose: `testing-framework/assets/stack/scripts/build_test_image.sh`)
- Image available in cluster (loaded via `kind`, `minikube`, or pushed to a registry)
- `POL_PROOF_DEV_MODE=true` environment variable set
Load image into cluster:
# For kind clusters
export NOMOS_TESTNET_IMAGE=nomos-testnet:local
kind load docker-image nomos-testnet:local
# For minikube
minikube image load nomos-testnet:local
# For remote clusters (push to registry)
docker tag nomos-testnet:local your-registry/nomos-testnet:local
docker push your-registry/nomos-testnet:local
export NOMOS_TESTNET_IMAGE=your-registry/nomos-testnet:local
Run the example:
export NOMOS_TESTNET_IMAGE=nomos-testnet:local
export POL_PROOF_DEV_MODE=true
cargo run -p runner-examples --bin k8s_runner
Important:
- K8s runner mounts `testing-framework/assets/stack/kzgrs_test_params` as a hostPath volume. Ensure this directory exists and contains circuit assets on the node where pods will be scheduled.
- No node control support yet: Chaos workloads (`.enable_node_control()`) will fail. Use ComposeDeployer for chaos testing.
Circuit Assets (KZG Parameters)
DA workloads require KZG cryptographic parameters for polynomial commitment schemes.
Asset Location
Default path: `testing-framework/assets/stack/kzgrs_test_params`
Override: Set `NOMOS_KZGRS_PARAMS_PATH` to use a custom location:
NOMOS_KZGRS_PARAMS_PATH=/path/to/custom/params cargo run -p runner-examples --bin local_runner
Getting Circuit Assets
Option 1: Use helper script (recommended):
# From the repository root
chmod +x scripts/setup-nomos-circuits.sh
scripts/setup-nomos-circuits.sh v0.3.1 /tmp/nomos-circuits
# Copy to default location
cp -r /tmp/nomos-circuits/* testing-framework/assets/stack/kzgrs_test_params/
Option 2: Build locally (advanced):
# Requires Go, Rust, and circuit build tools
make kzgrs_test_params
CI Workflow
The CI automatically fetches and places assets:
- name: Install circuits for host build
run: |
scripts/setup-nomos-circuits.sh v0.3.1 "$TMPDIR/nomos-circuits"
cp -a "$TMPDIR/nomos-circuits"/. testing-framework/assets/stack/kzgrs_test_params/
When Are Assets Needed?
| Runner | When Required |
|---|---|
| Local | Always (for DA workloads) |
| Compose | During image build (baked into NOMOS_TESTNET_IMAGE) |
| K8s | During image build + deployed to cluster via hostPath volume |
Error without assets:
Error: missing KZG parameters at testing-framework/assets/stack/kzgrs_test_params
Logging and Observability
Node Logging vs Framework Logging
Critical distinction: Node logs and framework logs use different configuration mechanisms.
| Component | Controlled By | Purpose |
|---|---|---|
| Framework binaries (`cargo run -p runner-examples --bin local_runner`) | `RUST_LOG` | Runner orchestration, deployment logs |
| Node processes (validators, executors spawned by runner) | `NOMOS_LOG_LEVEL`, `NOMOS_LOG_FILTER`, `NOMOS_LOG_DIR` | Consensus, DA, mempool, network logs |
Common mistake: Setting RUST_LOG=debug only increases verbosity of the runner binary itself. Node logs remain at their default level unless you also set NOMOS_LOG_LEVEL=debug.
Example:
# This only makes the RUNNER verbose, not the nodes:
RUST_LOG=debug cargo run -p runner-examples --bin local_runner
# This makes the NODES verbose:
NOMOS_LOG_LEVEL=debug cargo run -p runner-examples --bin local_runner
# Both verbose (typically not needed):
RUST_LOG=debug NOMOS_LOG_LEVEL=debug cargo run -p runner-examples --bin local_runner
Logging Environment Variables
| Variable | Default | Effect |
|---|---|---|
| `NOMOS_LOG_DIR` | None (console only) | Directory for per-node log files. If unset, logs go to stdout/stderr. |
| `NOMOS_LOG_LEVEL` | `info` | Global log level: `error`, `warn`, `info`, `debug`, `trace` |
| `NOMOS_LOG_FILTER` | None | Fine-grained target filtering (e.g., `consensus=trace,da=debug`) |
| `NOMOS_TESTS_TRACING` | `false` | Enable tracing subscriber for local runner file logging |
| `NOMOS_OTLP_ENDPOINT` | None | OTLP trace endpoint (optional, no OTLP noise if unset) |
| `NOMOS_OTLP_METRICS_ENDPOINT` | None | OTLP metrics endpoint (optional) |
Example: Full debug logging to files:
NOMOS_TESTS_TRACING=true \
NOMOS_LOG_DIR=/tmp/test-logs \
NOMOS_LOG_LEVEL=debug \
NOMOS_LOG_FILTER="nomos_consensus=trace,nomos_da_sampling=debug" \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin local_runner
Per-Node Log Files
When NOMOS_LOG_DIR is set, each node writes logs to separate files:
File naming pattern:
- Validators: Prefix `nomos-node-0`, `nomos-node-1`, etc. (may include timestamp suffix)
- Executors: Prefix `nomos-executor-0`, `nomos-executor-1`, etc. (may include timestamp suffix)
Local runner caveat: By default, the local runner writes logs to temporary directories in the working directory. These are automatically cleaned up after tests complete. To preserve logs, you MUST set both NOMOS_TESTS_TRACING=true AND NOMOS_LOG_DIR=/path/to/logs.
Filter Target Names
Common target prefixes for NOMOS_LOG_FILTER:
| Target Prefix | Subsystem |
|---|---|
| `nomos_consensus` | Consensus (Cryptarchia) |
| `nomos_da_sampling` | DA sampling service |
| `nomos_da_dispersal` | DA dispersal service |
| `nomos_da_verifier` | DA verification |
| `nomos_mempool` | Transaction mempool |
| `nomos_blend` | Mix network/privacy layer |
| `chain_network` | P2P networking |
| `chain_leader` | Leader election |
Example filter:
NOMOS_LOG_FILTER="nomos_consensus=trace,nomos_da_sampling=debug,chain_network=info"
Accessing Logs Per Runner
Local Runner
Default (temporary directories, auto-cleanup):
POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin local_runner
# Logs written to temporary directories in working directory
# Automatically cleaned up after test completes
Persistent file output:
NOMOS_TESTS_TRACING=true \
NOMOS_LOG_DIR=/tmp/local-logs \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin local_runner
# After test completes:
ls /tmp/local-logs/
# Files with prefix: nomos-node-0*, nomos-node-1*, nomos-executor-0*
# May include timestamps in filename
Both flags required: You MUST set both NOMOS_TESTS_TRACING=true (enables tracing file sink) AND NOMOS_LOG_DIR (specifies directory) to get persistent logs.
Compose Runner
Via Docker logs (default, recommended):
# List containers (note the UUID prefix in names)
docker ps --filter "name=nomos-compose-"
# Stream logs from specific container
docker logs -f <container-id-or-name>
# Or use name pattern matching:
docker logs -f $(docker ps --filter "name=nomos-compose-.*-validator-0" -q | head -1)
Via file collection (advanced):
Setting NOMOS_LOG_DIR writes files inside the container. To access them, you must either:
- Copy files out after the run:
NOMOS_LOG_DIR=/logs \
NOMOS_TESTNET_IMAGE=nomos-testnet:local \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin compose_runner
# After test, copy files from containers:
docker ps --filter "name=nomos-compose-"
docker cp <container-id>:/logs/nomos-node-0* /tmp/
- Mount a host volume (requires modifying compose template):
volumes:
- /tmp/host-logs:/logs # Add to docker-compose.yml.tera
Recommendation: Use docker logs by default. File collection inside containers is complex and rarely needed.
Keep containers for debugging:
COMPOSE_RUNNER_PRESERVE=1 \
NOMOS_TESTNET_IMAGE=nomos-testnet:local \
cargo run -p runner-examples --bin compose_runner
# Containers remain running after test—inspect with docker logs or docker exec
Note: Container names follow pattern nomos-compose-{uuid}-validator-{index}-1 where {uuid} changes per run.
K8s Runner
Via kubectl logs (use label selectors):
# List pods
kubectl get pods
# Stream logs using label selectors (recommended)
kubectl logs -l app=nomos-validator -f
kubectl logs -l app=nomos-executor -f
# Stream logs from specific pod
kubectl logs -f nomos-validator-0
# Previous logs from crashed pods
kubectl logs --previous -l app=nomos-validator
Download logs for offline analysis:
# Using label selectors
kubectl logs -l app=nomos-validator --tail=1000 > all-validators.log
kubectl logs -l app=nomos-executor --tail=1000 > all-executors.log
# Specific pods
kubectl logs nomos-validator-0 > validator-0.log
kubectl logs nomos-executor-1 > executor-1.log
Specify namespace (if not using default):
kubectl logs -n my-namespace -l app=nomos-validator -f
OTLP and Telemetry
OTLP exporters are optional. If you see errors about unreachable OTLP endpoints, it's safe to ignore them unless you're actively collecting traces/metrics.
To enable OTLP:
NOMOS_OTLP_ENDPOINT=http://localhost:4317 \
NOMOS_OTLP_METRICS_ENDPOINT=http://localhost:4318 \
cargo run -p runner-examples --bin local_runner
To silence OTLP errors: Simply leave these variables unset (the default).
Observability: Prometheus and Node APIs
Runners expose metrics and node HTTP endpoints for expectation code and debugging:
Prometheus (Compose only):
- Default: http://localhost:9090
- Override: TEST_FRAMEWORK_PROMETHEUS_PORT=9091
- Access from expectations: ctx.telemetry().prometheus_endpoint()
Node APIs:
- Access from expectations: ctx.node_clients().validators().get(0)
- Endpoints: consensus info, network info, DA membership, etc.
- See testing-framework/core/src/nodes/api_client.rs for available methods, and the sketch below for how these accessors fit into expectation code
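As orientation, here is a minimal sketch of expectation code that touches both kinds of endpoints. It uses the accessors named above (ctx.node_clients(), ctx.telemetry().prometheus_endpoint()); consensus_info() is a hypothetical client method used only for illustration, so check api_client.rs for the actual API surface.

```rust
use async_trait::async_trait;
use testing_framework_core::scenario::{DynError, Expectation, RunContext};

pub struct EndpointProbe;

#[async_trait]
impl Expectation for EndpointProbe {
    fn name(&self) -> &str {
        "endpoint_probe"
    }

    async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
        // Node HTTP API: grab the first validator client and issue a query.
        // `consensus_info()` is a placeholder; see api_client.rs for real methods.
        let validator = ctx
            .node_clients()
            .validators()
            .get(0)
            .ok_or("no validator client available")?;
        let _info = validator
            .consensus_info()
            .await
            .map_err(|e| format!("consensus info query failed: {e}"))?;

        // Prometheus endpoint (Compose only): handy for ad-hoc queries.
        let prometheus_url = ctx.telemetry().prometheus_endpoint();
        println!("prometheus available at {prometheus_url}");
        Ok(())
    }
}
```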
flowchart TD
Expose[Runner exposes endpoints/ports] --> Collect[Runtime collects block/health signals]
Collect --> Consume[Expectations consume signals<br/>decide pass/fail]
Consume --> Inspect[Operators inspect logs/metrics<br/>when failures arise]
Part III — Developer Reference
Deep dives for contributors who extend the framework, evolve its abstractions, or maintain the crate set.
Scenario Model (Developer Level)
The scenario model defines clear, composable responsibilities:
- Topology: a declarative description of the cluster—how many nodes, their roles, and the broad network and data-availability characteristics. It represents the intended shape of the system under test.
- Scenario: a plan combining topology, workloads, expectations, and a run window. Building a scenario validates prerequisites (like seeded wallets) and ensures the run lasts long enough to observe meaningful block progression.
- Workloads: asynchronous tasks that generate traffic or conditions. They use shared context to interact with the deployed cluster and may bundle default expectations.
- Expectations: post-run assertions. They can capture baselines before workloads start and evaluate success once activity stops.
- Runtime: coordinates workloads and expectations for the configured duration, enforces cooldowns when control actions occur, and ensures cleanup so runs do not leak resources.
Developers extending the model should keep these boundaries strict: topology describes, scenarios assemble, deployers provision, runners orchestrate, workloads drive, and expectations judge outcomes. For guidance on adding new capabilities, see Extending the Framework.
Extending the Framework
Adding a workload
- Implement testing_framework_core::scenario::Workload:
  - Provide a name and any bundled expectations.
  - In init, derive inputs from GeneratedTopology and RunMetrics; fail fast if prerequisites are missing (e.g., wallet data, node addresses).
  - In start, drive async traffic using the RunContext clients.
- Expose the workload from a module under testing-framework/workflows and consider adding a DSL helper for ergonomic wiring.
Adding an expectation
- Implement testing_framework_core::scenario::Expectation:
  - Use start_capture to snapshot baseline metrics.
  - Use evaluate to assert outcomes after workloads finish; return all errors so the runner can aggregate them.
- Export it from testing-framework/workflows if it is reusable.
Adding a runner
- Implement testing_framework_core::scenario::Deployer for your backend:
  - Produce a RunContext with NodeClients, metrics endpoints, and an optional NodeControlHandle.
  - Guard cleanup with CleanupGuard to reclaim resources even on failures.
- Mirror the readiness and block-feed probes used by the existing runners so workloads can rely on consistent signals.
Adding topology helpers
- Extend testing_framework_core::topology::TopologyBuilder with new layouts or configuration presets (e.g., specialized DA parameters). Keep defaults safe: ensure at least one participant and clamp dispersal factors as the current helpers do. A hypothetical preset is sketched below.
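As a concrete illustration, a preset could be packaged as an extension trait over TopologyBuilder. This is a hypothetical sketch, not an existing helper: small_da_cluster is an invented name, and it assumes the same builder methods (network_star, validators, executors) used by the topology DSL elsewhere in this book.

```rust
use testing_framework_core::topology::TopologyBuilder;

/// Hypothetical preset trait; `small_da_cluster` is an invented name.
pub trait TopologyPresets {
    /// Star network with DA enabled and safe defaults.
    fn small_da_cluster(self, validators: usize) -> Self;
}

impl TopologyPresets for TopologyBuilder {
    fn small_da_cluster(self, validators: usize) -> Self {
        self.network_star()
            // Keep defaults safe: never configure fewer than one validator.
            .validators(validators.max(1))
            .executors(1)
    }
}
```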
Example: New Workload & Expectation (Rust)
A minimal, end-to-end illustration of adding a custom workload and matching expectation. This shows the shape of the traits and where to plug into the framework; expand the logic to fit your real test.
Workload: simple reachability probe
Key ideas:
- name: identifies the workload in logs.
- expectations: workloads can bundle defaults so callers don’t forget checks.
- init: derive inputs from the generated topology (e.g., pick a target node).
- start: drive async activity using the shared RunContext.
```rust
use std::sync::Arc;

use async_trait::async_trait;
use testing_framework_core::scenario::{
    DynError, Expectation, RunContext, RunMetrics, Workload,
};
use testing_framework_core::topology::GeneratedTopology;

pub struct ReachabilityWorkload {
    target_idx: usize,
    bundled: Vec<Box<dyn Expectation>>,
}

impl ReachabilityWorkload {
    pub fn new(target_idx: usize) -> Self {
        Self {
            target_idx,
            bundled: vec![Box::new(ReachabilityExpectation::new(target_idx))],
        }
    }
}

#[async_trait]
impl Workload for ReachabilityWorkload {
    fn name(&self) -> &'static str {
        "reachability_workload"
    }

    fn expectations(&self) -> Vec<Box<dyn Expectation>> {
        self.bundled.clone()
    }

    fn init(
        &mut self,
        topology: &GeneratedTopology,
        _metrics: &RunMetrics,
    ) -> Result<(), DynError> {
        if topology.validators().get(self.target_idx).is_none() {
            return Err("no validator at requested index".into());
        }
        Ok(())
    }

    async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
        let client = ctx
            .clients()
            .validators()
            .get(self.target_idx)
            .ok_or("missing target client")?;
        // Pseudo-action: issue a lightweight RPC to prove reachability.
        client.health_check().await.map_err(|e| e.into())
    }
}
```
Expectation: confirm the target stayed reachable
Key ideas:
- start_capture: snapshot baseline if needed (not used here).
- evaluate: assert the condition after workloads finish.
```rust
use async_trait::async_trait;
use testing_framework_core::scenario::{DynError, Expectation, RunContext};

pub struct ReachabilityExpectation {
    target_idx: usize,
}

impl ReachabilityExpectation {
    pub fn new(target_idx: usize) -> Self {
        Self { target_idx }
    }
}

#[async_trait]
impl Expectation for ReachabilityExpectation {
    fn name(&self) -> &str {
        "target_reachable"
    }

    async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
        let client = ctx
            .clients()
            .validators()
            .get(self.target_idx)
            .ok_or("missing target client")?;
        client.health_check().await.map_err(|e| {
            format!("target became unreachable during run: {e}").into()
        })
    }
}
```
How to wire it
- Build your scenario as usual and call .with_workload(ReachabilityWorkload::new(0)).
- The bundled expectation is attached automatically; you can add more with .with_expectation(...) if needed.
- Keep the logic minimal and fast for smoke tests; grow it into richer probes for deeper scenarios. A wiring sketch follows below.
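Putting it together, here is a sketch of how the custom workload could be wired into a runnable plan. It reuses the builder and deployer APIs from the Builder API Quick Reference; the topology shape and duration are arbitrary choices for illustration.

```rust
use std::time::Duration;

use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;

async fn run_reachability() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // ReachabilityWorkload is the type defined above.
    let mut plan = ScenarioBuilder::topology_with(|t| t.network_star().validators(2).executors(0))
        .with_workload(ReachabilityWorkload::new(0)) // bundled expectation attaches automatically
        .with_run_duration(Duration::from_secs(60))
        .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await?;
    let _handle = runner.run(&mut plan).await?;
    Ok(())
}
```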
Internal Crate Reference
High-level roles of the crates that make up the framework:
- Configs (testing-framework/configs/): Prepares reusable configuration primitives for nodes, networking, tracing, data availability, and wallets, shared by all scenarios and runners. Includes topology generation and circuit asset resolution.
- Core scenario orchestration (testing-framework/core/): Houses the topology and scenario model, runtime coordination, node clients, and readiness/health probes. Defines the Deployer and Runner traits, ScenarioBuilder, and RunContext.
- Workflows (testing-framework/workflows/): Packages workloads (transaction, DA, chaos) and expectations (consensus liveness) into reusable building blocks. Offers fluent DSL extensions (ScenarioBuilderExt, ChaosBuilderExt).
- Runners (testing-framework/runners/{local,compose,k8s}/): Implements deployment backends (local host, Docker Compose, Kubernetes) that all consume the same scenario plan. Each provides a Deployer implementation (LocalDeployer, ComposeDeployer, K8sDeployer).
- Runner Examples (examples/runner-examples): Runnable binaries demonstrating framework usage and serving as living documentation. These are the primary entry point for running scenarios (local_runner.rs, compose_runner.rs, k8s_runner.rs).
Where to Add New Capabilities
| What You're Adding | Where It Goes | Examples |
|---|---|---|
| Node config parameter | testing-framework/configs/src/topology/configs/ | Slot duration, log levels, DA params |
| Topology feature | testing-framework/core/src/topology/ | New network layouts, node roles |
| Scenario capability | testing-framework/core/src/scenario/ | New capabilities, context methods |
| Workload | testing-framework/workflows/src/workloads/ | New traffic generators |
| Expectation | testing-framework/workflows/src/expectations/ | New success criteria |
| Builder API | testing-framework/workflows/src/builder/ | DSL extensions, fluent methods |
| Deployer | testing-framework/runners/ | New deployment backends |
| Example scenario | examples/src/bin/ | Demonstration binaries |
Extension Workflow
Adding a New Workload
1. Define the workload in testing-framework/workflows/src/workloads/your_workload.rs:

   ```rust
   use async_trait::async_trait;
   use testing_framework_core::scenario::{Workload, RunContext, DynError};

   pub struct YourWorkload {
       // config fields
   }

   #[async_trait]
   impl Workload for YourWorkload {
       fn name(&self) -> &'static str {
           "your_workload"
       }

       async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
           // implementation
           Ok(())
       }
   }
   ```

2. Add a builder extension in testing-framework/workflows/src/builder/mod.rs:

   ```rust
   pub trait ScenarioBuilderExt {
       fn your_workload(self) -> YourWorkloadBuilder;
   }
   ```

3. Use it in an example under examples/src/bin/your_scenario.rs:

   ```rust
   let mut plan = ScenarioBuilder::topology_with(|t| {
       t.network_star()
           .validators(3)
           .executors(0)
   })
   .your_workload_with(|w| {
       // Your new DSL method with closure
       w.some_config()
   })
   .build();
   ```
Adding a New Expectation
1. Define the expectation in testing-framework/workflows/src/expectations/your_expectation.rs:

   ```rust
   use async_trait::async_trait;
   use testing_framework_core::scenario::{Expectation, RunContext, DynError};

   pub struct YourExpectation {
       // config fields
   }

   #[async_trait]
   impl Expectation for YourExpectation {
       fn name(&self) -> &str {
           "your_expectation"
       }

       async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
           // implementation
           Ok(())
       }
   }
   ```

2. Add a builder extension in testing-framework/workflows/src/builder/mod.rs:

   ```rust
   pub trait ScenarioBuilderExt {
       fn expect_your_condition(self) -> Self;
   }
   ```
Adding a New Deployer
1. Implement the Deployer trait in testing-framework/runners/your_runner/src/deployer.rs:

   ```rust
   use async_trait::async_trait;
   use testing_framework_core::scenario::{Deployer, Runner, Scenario};

   pub struct YourDeployer;

   #[async_trait]
   impl Deployer for YourDeployer {
       type Error = YourError;

       async fn deploy(&self, scenario: &Scenario) -> Result<Runner, Self::Error> {
           // Provision infrastructure
           // Wait for readiness
           // Return Runner
           todo!()
       }
   }
   ```

2. Provide cleanup and handle node control if supported.

3. Add an example in examples/src/bin/your_runner.rs, along the lines of the sketch below.
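A sketch of what that example binary might look like. YourDeployer is the placeholder type from step 1, and the tokio entry point is an assumption made here for a self-contained snippet, not something the framework mandates.

```rust
use std::time::Duration;

use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_workflows::ScenarioBuilderExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Build a small plan that exercises the new backend end to end.
    let mut plan = ScenarioBuilder::topology_with(|t| t.network_star().validators(3).executors(1))
        .expect_consensus_liveness()
        .with_run_duration(Duration::from_secs(90))
        .build();

    let deployer = YourDeployer; // placeholder from the steps above
    let runner = deployer.deploy(&plan).await?;
    let _handle = runner.run(&mut plan).await?;
    Ok(())
}
```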
For detailed examples, see Extending the Framework and Custom Workload Example.
Part IV — Appendix
Quick-reference material and supporting guidance to keep scenarios discoverable, debuggable, and consistent.
Builder API Quick Reference
Quick reference for the scenario builder DSL. All methods are chainable.
Imports
```rust
use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use testing_framework_runner_compose::ComposeDeployer;
use testing_framework_runner_k8s::K8sDeployer;
use testing_framework_workflows::{ScenarioBuilderExt, ChaosBuilderExt};
use std::time::Duration;
```
Topology
```rust
ScenarioBuilder::topology_with(|t| {
    t.network_star()   // Star topology (all connect to seed node)
        .validators(3) // Number of validator nodes
        .executors(2)  // Number of executor nodes
})                     // Finish topology configuration
```
Wallets
```rust
.wallets(50) // Seed 50 funded wallet accounts
```
Transaction Workload
```rust
.transactions_with(|txs| {
    txs.rate(5)    // 5 transactions per block
        .users(20) // Use 20 of the seeded wallets
})                 // Finish transaction workload config
```
DA Workload
```rust
.da_with(|da| {
    da.channel_rate(1) // 1 channel operation per block
        .blob_rate(2)  // 2 blob dispersals per block
})                     // Finish DA workload config
```
Chaos Workload (Requires enable_node_control())
```rust
.enable_node_control() // Enable node control capability
.chaos_with(|c| {
    c.restart()                                   // Random restart chaos
        .min_delay(Duration::from_secs(30))       // Min time between restarts
        .max_delay(Duration::from_secs(60))       // Max time between restarts
        .target_cooldown(Duration::from_secs(45)) // Cooldown after restart
        .apply()                                  // Required for chaos configuration
})
```
Expectations
```rust
.expect_consensus_liveness() // Assert blocks are produced continuously
```
Run Duration
```rust
.with_run_duration(Duration::from_secs(120)) // Run for 120 seconds
```
Build
```rust
.build() // Construct the final Scenario
```
Deployers
```rust
// Local processes
let deployer = LocalDeployer::default();

// Docker Compose
let deployer = ComposeDeployer::default();

// Kubernetes
let deployer = K8sDeployer::default();
```
Execution
```rust
let runner = deployer.deploy(&plan).await?;
let _handle = runner.run(&mut plan).await?;
```
Complete Example
```rust
use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use testing_framework_workflows::ScenarioBuilderExt;
use std::time::Duration;

async fn run_test() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let mut plan = ScenarioBuilder::topology_with(|t| {
        t.network_star()
            .validators(3)
            .executors(2)
    })
    .wallets(50)
    .transactions_with(|txs| {
        txs.rate(5) // 5 transactions per block
            .users(20)
    })
    .da_with(|da| {
        da.channel_rate(1) // 1 channel operation per block
            .blob_rate(2)  // 2 blob dispersals per block
    })
    .expect_consensus_liveness()
    .with_run_duration(Duration::from_secs(90))
    .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await?;
    let _handle = runner.run(&mut plan).await?;
    Ok(())
}
```
Troubleshooting Scenarios
Prerequisites for All Runners:
- POL_PROOF_DEV_MODE=true MUST be set for all runners (local, compose, k8s) to avoid expensive Groth16 proof generation that causes timeouts
- KZG circuit assets must be present at testing-framework/assets/stack/kzgrs_test_params/ for DA workloads (fetch via scripts/setup-nomos-circuits.sh)
Quick Symptom Guide
Common symptoms and likely causes:
- No or slow block progression: missing POL_PROOF_DEV_MODE=true, missing KZG circuit assets for DA workloads, a too-short run window, port conflicts, or resource exhaustion. Set the required env vars, verify assets, extend the duration, and check node logs for startup errors.
- Transactions not included: unfunded or misconfigured wallets (check .wallets(N) vs .users(M)) or a transaction rate that exceeds block capacity. Reduce the rate, increase the wallet count, and verify wallet setup in the logs.
- Chaos stalls the run: chaos (node control) only works with ComposeDeployer; LocalDeployer and K8sDeployer don't support it (they won't "stall", they just can't execute chaos workloads). With Compose, an aggressive restart cadence can prevent consensus recovery; widen the restart intervals.
- Observability gaps: metrics or logs unreachable because ports clash or services are not exposed—adjust observability ports and confirm runner wiring.
- Flaky behavior across runs: mixing chaos with functional smoke tests or inconsistent topology between environments—separate deterministic and chaos scenarios and standardize topology presets.
Where to Find Logs
Log Location Quick Reference
| Runner | Default Output | With NOMOS_LOG_DIR + Flags | Access Command |
|---|---|---|---|
| Local | Temporary directories (cleaned up) | Per-node files with prefix nomos-node-{index} (requires NOMOS_TESTS_TRACING=true) | cat $NOMOS_LOG_DIR/nomos-node-0* |
| Compose | Docker container stdout/stderr | Per-node files inside containers (if path is mounted) | docker ps then docker logs <container-id> |
| K8s | Pod stdout/stderr | Per-node files inside pods (if path is mounted) | kubectl logs -l app=nomos-validator |
Important Notes:
- Local runner: Logs go to system temporary directories (NOT the working directory) by default and are automatically cleaned up after tests. To persist logs, you MUST set both NOMOS_TESTS_TRACING=true AND NOMOS_LOG_DIR=/path/to/logs.
- Compose/K8s: Per-node log files only exist inside containers/pods if NOMOS_LOG_DIR is set AND the path is writable inside the container/pod. By default, rely on docker logs or kubectl logs.
- File naming: Log files use the prefix nomos-node-{index}* or nomos-executor-{index}* with timestamps, e.g., nomos-node-0.2024-12-01T10-30-45.log (NOT just a .log suffix).
- Container names: Compose containers include a project UUID, e.g., nomos-compose-<uuid>-validator-0-1, where <uuid> is randomly generated per run.
Accessing Node Logs by Runner
Local Runner
Console output (default):
POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin local_runner 2>&1 | tee test.log
Persistent file output:
NOMOS_TESTS_TRACING=true \
NOMOS_LOG_DIR=/tmp/debug-logs \
NOMOS_LOG_LEVEL=debug \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin local_runner
# Inspect logs (note: filenames include timestamps):
ls /tmp/debug-logs/
# Example: nomos-node-0.2024-12-01T10-30-45.log
tail -f /tmp/debug-logs/nomos-node-0* # Use wildcard to match timestamp
Compose Runner
Stream live logs:
# List running containers (note the UUID prefix in names)
docker ps --filter "name=nomos-compose-"
# Find your container ID or name from the list, then:
docker logs -f <container-id>
# Or filter by name pattern:
docker logs -f $(docker ps --filter "name=nomos-compose-.*-validator-0" -q | head -1)
# Show last 100 lines
docker logs --tail 100 <container-id>
Keep containers for post-mortem debugging:
COMPOSE_RUNNER_PRESERVE=1 \
NOMOS_TESTNET_IMAGE=nomos-testnet:local \
cargo run -p runner-examples --bin compose_runner
# After test failure, containers remain running:
docker ps --filter "name=nomos-compose-"
docker exec -it <container-id> /bin/sh
docker logs <container-id> > debug.log
Note: Container names follow the pattern nomos-compose-{uuid}-validator-{index}-1 or nomos-compose-{uuid}-executor-{index}-1, where {uuid} is randomly generated per run.
K8s Runner
Important: Always verify your namespace and use label selectors instead of assuming pod names.
Stream pod logs (use label selectors):
# Check your namespace first
kubectl config view --minify | grep namespace
# All validator pods (add -n <namespace> if not using default)
kubectl logs -l app=nomos-validator -f
# All executor pods
kubectl logs -l app=nomos-executor -f
# Specific pod by name (find exact name first)
kubectl get pods -l app=nomos-validator # Find the exact pod name
kubectl logs -f <actual-pod-name> # Then use it
# With explicit namespace
kubectl logs -n my-namespace -l app=nomos-validator -f
Download logs from crashed pods:
# Previous logs from crashed pod
kubectl get pods -l app=nomos-validator # Find crashed pod name first
kubectl logs --previous <actual-pod-name> > crashed-validator.log
# Or use label selector for all crashed validators
for pod in $(kubectl get pods -l app=nomos-validator -o name); do
kubectl logs --previous $pod > $(basename $pod)-previous.log 2>&1
done
Access logs from all pods:
# All pods in current namespace
for pod in $(kubectl get pods -o name); do
echo "=== $pod ==="
kubectl logs $pod
done > all-logs.txt
# Or use label selectors (recommended)
kubectl logs -l app=nomos-validator --tail=500 > validators.log
kubectl logs -l app=nomos-executor --tail=500 > executors.log
# With explicit namespace
kubectl logs -n my-namespace -l app=nomos-validator --tail=500 > validators.log
Debugging Workflow
When a test fails, follow this sequence:
1. Check Framework Output
Start with the test harness output—did expectations fail? Was there a deployment error?
Look for:
- Expectation failure messages
- Timeout errors
- Deployment/readiness failures
2. Verify Node Readiness
Ensure all nodes started successfully and became ready before workloads began.
Commands:
# Local: check process list
ps aux | grep nomos
# Compose: check container status (note UUID in names)
docker ps -a --filter "name=nomos-compose-"
# K8s: check pod status (use label selectors, add -n <namespace> if needed)
kubectl get pods -l app=nomos-validator
kubectl get pods -l app=nomos-executor
kubectl describe pod <actual-pod-name> # Get name from above first
3. Inspect Node Logs
Focus on the first node that exhibited problems or the node with the highest index (often the last to start).
Common error patterns:
- "Failed to bind address" → port conflict
- "Connection refused" → peer not ready or network issue
- "Proof verification failed" or "Proof generation timeout" → missing
POL_PROOF_DEV_MODE=true(REQUIRED for all runners) - "Failed to load KZG parameters" or "Circuit file not found" → missing KZG circuit assets at
testing-framework/assets/stack/kzgrs_test_params/ - "Insufficient funds" → wallet seeding issue (increase
.wallets(N)or reduce.users(M))
4. Check Log Levels
If logs are too sparse, increase verbosity:
NOMOS_LOG_LEVEL=debug \
NOMOS_LOG_FILTER="nomos_consensus=trace,nomos_da_sampling=debug" \
cargo run -p runner-examples --bin local_runner
5. Verify Observability Endpoints
If expectations report observability issues:
Prometheus (Compose):
curl http://localhost:9090/-/healthy
Node HTTP APIs:
curl http://localhost:18080/consensus/info # Adjust port per node
6. Compare with Known-Good Scenario
Run a minimal baseline test (e.g., 2 validators, consensus liveness only). If it passes, the issue is in your workload or topology configuration.
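For reference, a minimal baseline plan along those lines (local deployer, two validators, consensus liveness only; adjust the duration to taste):

```rust
use std::time::Duration;

use testing_framework_core::scenario::{Deployer, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use testing_framework_workflows::ScenarioBuilderExt;

async fn baseline() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Smallest useful plan: no workloads, just block progression.
    let mut plan = ScenarioBuilder::topology_with(|t| t.network_star().validators(2).executors(0))
        .expect_consensus_liveness()
        .with_run_duration(Duration::from_secs(60))
        .build();

    let runner = LocalDeployer::default().deploy(&plan).await?;
    let _handle = runner.run(&mut plan).await?;
    Ok(())
}
```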
Common Error Messages
"Consensus liveness expectation failed"
- Cause: Not enough blocks produced during the run window, missing POL_PROOF_DEV_MODE=true (causes slow proof generation), or missing KZG assets for DA workloads
- Fix:
  - Verify POL_PROOF_DEV_MODE=true is set (REQUIRED for all runners)
  - Verify KZG assets exist at testing-framework/assets/stack/kzgrs_test_params/ (for DA workloads)
  - Extend with_run_duration() to allow more blocks
  - Check node logs for proof generation or DA errors
  - Reduce transaction/DA rate if nodes are overwhelmed
"Wallet seeding failed"
- Cause: Topology doesn't have enough funded wallets for the workload
- Fix: Increase the .wallets(N) count or reduce .users(M) in the transaction workload (ensure N ≥ M)
"Node control not available"
- Cause: Runner doesn't support node control (only ComposeDeployer does), or enable_node_control() wasn't called
- Fix:
  - Use ComposeDeployer for chaos tests (LocalDeployer and K8sDeployer don't support node control)
  - Ensure .enable_node_control() is called in the scenario before .chaos()
"Readiness timeout"
- Cause: Nodes didn't become responsive within expected time (often due to missing prerequisites)
- Fix:
  - Verify POL_PROOF_DEV_MODE=true is set (REQUIRED for all runners; without it, proof generation is too slow)
  - Check node logs for startup errors (port conflicts, missing assets)
  - Verify network connectivity between nodes
  - For DA workloads, ensure KZG circuit assets are present
"Port already in use"
- Cause: Previous test didn't clean up, or another process holds the port
- Fix: Kill orphaned processes (pkill nomos-node), wait for Docker cleanup (docker compose down), or restart Docker
"Image not found: nomos-testnet:local"
- Cause: Docker image not built for Compose/K8s runners, or KZG assets not baked into image
- Fix:
  - Fetch KZG assets: scripts/setup-nomos-circuits.sh v0.3.1 /tmp/nomos-circuits
  - Copy to assets: cp -r /tmp/nomos-circuits/* testing-framework/assets/stack/kzgrs_test_params/
  - Build the image: testing-framework/assets/stack/scripts/build_test_image.sh
"Failed to load KZG parameters" or "Circuit file not found"
- Cause: DA workload requires KZG circuit assets that aren't present
- Fix:
  - Fetch assets: scripts/setup-nomos-circuits.sh v0.3.1 /tmp/nomos-circuits
  - Copy to the expected path: cp -r /tmp/nomos-circuits/* testing-framework/assets/stack/kzgrs_test_params/
  - For Compose/K8s: rebuild the image with assets baked in
For detailed logging configuration and observability setup, see Operations.
FAQ
Why block-oriented timing?
Slots advance at a fixed rate (NTP-synchronized, 2s by default), so reasoning
about blocks and consensus intervals keeps assertions aligned with protocol
behavior rather than arbitrary wall-clock durations.
Can I reuse the same scenario across runners?
Yes. The plan stays the same; swap runners (local, compose, k8s) to target
different environments.
When should I enable chaos workloads?
Only when testing resilience or operational recovery; keep functional smoke
tests deterministic.
How long should runs be?
The framework enforces a minimum of 2× slot duration (4 seconds with default 2s slots), but practical recommendations:
- Smoke tests: 30s minimum (~14 blocks with default 2s slots, 0.9 coefficient)
- Transaction workloads: 60s+ (~27 blocks) to observe inclusion patterns
- DA workloads: 90s+ (~40 blocks) to account for dispersal and sampling
- Chaos tests: 120s+ (~54 blocks) to allow recovery after restarts
Very short runs (< 30s) risk false confidence—one or two lucky blocks don't prove liveness.
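The block counts above come from a simple estimate: blocks ≈ (run duration ÷ slot duration) × the 0.9 coefficient noted above. A small sketch of that arithmetic, assuming the default 2s slots:

```rust
// Back-of-the-envelope block estimate for sizing run windows.
fn expected_blocks(run_secs: f64, slot_secs: f64, coefficient: f64) -> f64 {
    (run_secs / slot_secs) * coefficient
}

fn main() {
    // 30s / 2s * 0.9 ≈ 13.5 → ~14 blocks, matching the smoke-test figure.
    println!("{:.1}", expected_blocks(30.0, 2.0, 0.9));
    // 90s → ~40 blocks, the DA workload guidance.
    println!("{:.1}", expected_blocks(90.0, 2.0, 0.9));
}
```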
Do I always need seeded wallets?
Only for transaction scenarios. Data-availability or pure chaos scenarios may
not require them, but liveness checks still need validators producing blocks.
What if expectations fail but workloads “look fine”?
Trust expectations first—they capture the intended success criteria. Use the
observability signals and runner logs to pinpoint why the system missed the
target.
Glossary
- Validator: node role responsible for participating in consensus and block production.
- Executor: a validator node with the DA dispersal service enabled. Executors can submit transactions and disperse blob data to the DA network, in addition to performing all validator functions.
- DA (Data Availability): subsystem ensuring blobs or channel data are published and retrievable for validation.
- Deployer: component that provisions infrastructure (spawns processes, creates containers, or launches pods), waits for readiness, and returns a Runner. Examples: LocalDeployer, ComposeDeployer, K8sDeployer.
- Runner: component returned by deployers that orchestrates scenario execution—starts workloads, observes signals, evaluates expectations, and triggers cleanup.
- Workload: traffic or behavior generator that exercises the system during a scenario run.
- Expectation: post-run assertion that judges whether the system met the intended success criteria.
- Topology: declarative description of the cluster shape, roles, and high-level parameters for a scenario.
- Scenario: immutable plan combining topology, workloads, expectations, and run duration.
- Blockfeed: stream of block observations used for liveness or inclusion signals during a run.
- Control capability: the ability for a runner to start, stop, or restart nodes, used by chaos workloads.
- Slot duration: time interval between consensus rounds in Cryptarchia. Blocks are produced at multiples of the slot duration based on lottery outcomes.
- Block cadence: observed rate of block production in a live network, measured in blocks per second or seconds per block.
- Cooldown: waiting period after a chaos action (e.g., node restart) before triggering the next action, allowing the system to stabilize.
- Run window: total duration a scenario executes, specified via with_run_duration(). The framework auto-extends it to at least 2× the slot duration.
- Readiness probe: health check performed by runners to ensure nodes are reachable and responsive before starting workloads. Prevents false negatives from premature traffic.
- Liveness: property that the system continues making progress (producing blocks) under specified conditions. Contrasts with safety/correctness which verifies that state transitions are accurate.
- State assertion: expectation that verifies specific values in the system state (e.g., wallet balances, UTXO sets) rather than just progress signals. Also called "correctness expectations."
- Mantle transaction: transaction type in Nomos that can contain UTXO transfers (LedgerTx) and operations (Op), including channel data (ChannelBlob).
- Channel: logical grouping for DA blobs; each blob belongs to a channel and references a parent blob in the same channel, creating a chain of related data.
- POL_PROOF_DEV_MODE: environment variable that disables expensive Groth16 zero-knowledge proof generation for leader election. Required for all runners (local, compose, k8s) for practical testing—without it, proof generation causes timeouts. Should never be used in production environments.