Messaging

All inter-service communication in Custody happens over NATS. This page describes the role messaging plays, why we chose NATS, and the patterns we rely on.

Role of the messaging layer

Custody is internally asynchronous. The API does not call MPC nodes directly; instead, every signing, DKG, or refresh request becomes a message on a subject. MPC nodes subscribe to that subject as competing consumers.

This shape gives us two properties for free:

  • Horizontal scale. Adding an MPC node to the cluster increases throughput linearly. There is no shard map to update or partition rebalance to schedule.
  • Loose coupling between request acceptance and signing capacity. The API can keep accepting requests during a brief surge or partial outage of the signing tier; messages queue and are picked up when capacity returns.
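
The competing-consumer shape can be sketched in-process. This is not NATS client code: it is a minimal model using Python's standard `queue` and `threading` modules, and the names `run_workers` and `sign-N` are hypothetical. In NATS terms the same effect would usually come from a queue group or a shared JetStream consumer; which of the two Custody uses is not stated on this page.

```python
import queue
import threading


def run_workers(messages, worker_count):
    """Deliver each message to exactly one of `worker_count` competing workers.

    Returns a dict mapping worker id -> list of messages that worker handled.
    """
    work = queue.Queue()
    handled = {i: [] for i in range(worker_count)}

    def worker(worker_id):
        while True:
            msg = work.get()
            if msg is None:  # sentinel: no more work for this worker
                break
            # Stand-in for the real work (signing, DKG, refresh).
            handled[worker_id].append(msg)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for msg in messages:
        work.put(msg)
    for _ in threads:
        work.put(None)  # one sentinel per worker, queued after all real work
    for t in threads:
        t.join()
    return handled
```

Calling `run_workers` with a larger `worker_count` requires no change on the publishing side, which is the property the first bullet describes. Which worker handles which message is not deterministic.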

Both NATS Core (pub/sub, in-memory) and NATS JetStream (durable streams) are used: Core for ephemeral notifications and health, JetStream for any work item that must survive a restart.
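
The difference between the two modes comes down to whether a message is retained until a consumer acknowledges it. The toy model below illustrates only that idea; `DurableWorkLog` and its methods are hypothetical names, nothing here is the JetStream API, and real streams add persistence, redelivery timers, and limits that are omitted.

```python
class DurableWorkLog:
    """Toy model of ack-based delivery: a message is kept until acknowledged."""

    def __init__(self):
        self._next_seq = 1
        self._unacked = {}  # seq -> payload, in publish order

    def publish(self, payload):
        """Store a work item and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._unacked[seq] = payload
        return seq

    def pending(self):
        """Items a (re)connecting consumer would still be handed, oldest first."""
        return list(self._unacked.items())

    def ack(self, seq):
        """Mark an item as done; only now is it forgotten."""
        self._unacked.pop(seq, None)
```

If a consumer acknowledges the first item and then crashes, a replacement consumer calling `pending()` still sees the second. With plain pub/sub, a message published while no subscriber is listening is simply gone, which is acceptable for health pings but not for a signing request.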

Why NATS

The selection criteria were:

  • Operational footprint. A single Go binary, no JVM, no external ZooKeeper or BookKeeper. Start-up to working cluster is minutes.
  • Zero-trust posture by default. NATS authenticates with NKeys (Ed25519 keypairs), JWTs, or mTLS. NATS also offers token and username/password authentication, but neither is part of our configuration. This aligns with the rest of the platform's "no shared secrets in the data plane" stance.
  • Native multi-tenancy. NATS Accounts isolate subjects, message streams, and routing at the broker level. Two accounts cannot leak messages to each other even if a subject collision exists.
  • Wildcard subjects. Subscribers can express their interest as mpc.node.> rather than enumerating every concrete subject; this maps naturally onto our routing topology without per-subscriber config sprawl.
  • CNCF backing. The project is open source, and its governance is not bound to a single vendor.
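
To make the wildcard bullet concrete, here is a minimal sketch of NATS subject matching as documented: subjects are dot-separated tokens, `*` matches exactly one token, and `>` matches one or more trailing tokens and must be the last token of the pattern. The function name `subject_matches` and every subject other than `mpc.node.>` are hypothetical; the real subject names live in the proto definitions.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if `subject` matches the NATS-style `pattern`."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the final token and needs at least
            # one subject token left to consume.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # pattern is longer than the subject
        if p != "*" and p != s_tokens[i]:
            return False  # literal token mismatch
    # No '>' seen: token counts must agree exactly.
    return len(p_tokens) == len(s_tokens)
```

For example, `subject_matches("mpc.node.>", "mpc.node.1.sign")` is true, while `subject_matches("mpc.node.>", "mpc.node")` is false because `>` requires at least one further token.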

The systems we explicitly weighed NATS against (Kafka, Redpanda, Pulsar, RabbitMQ, Redis Streams, ZeroMQ, Aeron, MQTT) each lost on at least one axis: operational complexity, JVM dependency, password-only auth, or single-DC assumptions.

The trade-offs we accepted:

  • Slow consumers are penalised. NATS prefers to drop or disconnect a slow subscriber rather than queue indefinitely. This is intentional and matches our preference for failing closed rather than degrading silently.
  • Subject routing is string-based. No Exchange-style content routing as in RabbitMQ. We have not needed it.
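
The slow-consumer trade-off can be illustrated with a bounded per-subscriber buffer. This is a simplified model, not broker code: the `Subscriber` class and the `max_pending` parameter are hypothetical, and the real thresholds (pending limits, write deadlines) are broker and client settings that this page does not specify.

```python
from collections import deque


class Subscriber:
    """Model of a subscriber that is cut off instead of buffered without bound."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = deque()
        self.connected = True
        self.dropped = 0

    def deliver(self, msg) -> bool:
        """Try to hand `msg` to the subscriber; return True if it was buffered."""
        if not self.connected:
            return False
        if len(self.pending) >= self.max_pending:
            # Buffer full: fail closed by dropping the message and
            # disconnecting, rather than growing the queue.
            self.dropped += 1
            self.connected = False
            return False
        self.pending.append(msg)
        return True

    def consume(self):
        """Take the oldest buffered message, or None if the buffer is empty."""
        return self.pending.popleft() if self.pending else None
```

The point of the design is that overload becomes visible immediately as a disconnect, instead of surfacing later as unbounded memory growth or stale work.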

Transport security

In-cluster traffic runs through Cilium with mTLS at the network layer. Enabling mTLS again at the broker therefore encrypts the same hop twice. We keep broker-level mTLS on as a defense-in-depth measure but do not rely on it; the broker authenticates clients with NKeys regardless.

Out of scope here

  • Concrete subject naming and message formats — these live in the proto definitions.
  • Failure semantics for in-flight signing protocols when a node disconnects — these are properties of the signing layer, not of the broker.