Architecture · Updated May 27, 2026 · 22 min read

CadencIA: from browser to native node as a distributed compute layer

CadencIA starts from a practical question: what if Cadences.app did not always have to depend on a single remote backend to run AI work? Instead of treating every device as a passive client, we can treat it as potential compute: a browser with WebGPU, a foreground PWA, a native node on the LAN, a quota-governed cloud fallback or a machine with local models.

Cloudflare Durable Objects · Rust node · PWA · WebGPU · Web-safe jobs · Result tokens · WebLLM · Transformers.js · Swarm memory

Actual state, May 27, 2026

The important new piece is no longer just that a PWA can measure itself. The console Assistant now has a Distribute compute mode: the sending browser creates a public job, receives a temporary jobId and resultToken, then polls for the result while another visible device claims, runs and completes it.

This works beyond the local network because the control plane lives on Cloudflare. It is not direct P2P yet: the current flow is Samsung/iPhone/PC -> CadencIA API -> PWA worker -> CadencIA API -> Assistant. The deliberate constraint is that the receiving device must be paired, foreground, accepting web jobs and already loaded with the required model when the job asks for one.

The deployed smoke test validates the whole cycle: worker heartbeat, model-specific job announce, capability-based claim, complete and tokenized result read. The physical multi-device test is still the next product validation, but the backend contract is no longer a mock.
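The cycle described above (announce, claim, complete, tokenized read) can be sketched with an in-memory stand-in for the control plane. This is an illustration under stated assumptions, not the CadencIA API: the class `ControlPlaneSketch`, its method names and the token format are all hypothetical, and the real flow runs over HTTP against Cloudflare.

```typescript
// In-memory sketch of the announce -> claim -> complete -> read cycle.
// Every name here is hypothetical; the real control plane is an HTTP API.
type JobState = "queued" | "claimed" | "done";

interface JobRecord {
  jobId: string;
  resultToken: string; // returned only to the sending browser
  model?: string;
  payload: unknown;
  state: JobState;
  claimedBy?: string;
  result?: unknown;
}

class ControlPlaneSketch {
  private jobs = new Map<string, JobRecord>();
  private seq = 0;

  // Sender: create a job and receive the pair used later for polling.
  announce(payload: unknown, model?: string): { jobId: string; resultToken: string } {
    const jobId = `job-${++this.seq}`;
    // Placeholder token for the sketch; a real token needs a CSPRNG.
    const resultToken = `rt-${this.seq}-${Math.random().toString(36).slice(2)}`;
    this.jobs.set(jobId, { jobId, resultToken, model, payload, state: "queued" });
    return { jobId, resultToken };
  }

  // Worker: claim the first queued job whose model is already loaded.
  // The worker never sees the resultToken.
  claim(nodeId: string, loadedModels: string[]): { jobId: string; payload: unknown } | undefined {
    for (const job of Array.from(this.jobs.values())) {
      const modelOk = job.model === undefined || loadedModels.includes(job.model);
      if (job.state === "queued" && modelOk) {
        job.state = "claimed";
        job.claimedBy = nodeId;
        return { jobId: job.jobId, payload: job.payload };
      }
    }
    return undefined;
  }

  // Worker: only the node holding the claim may complete the job.
  complete(jobId: string, nodeId: string, result: unknown): boolean {
    const job = this.jobs.get(jobId);
    if (!job || job.state !== "claimed" || job.claimedBy !== nodeId) return false;
    job.state = "done";
    job.result = result;
    return true;
  }

  // Sender: poll with the token; a wrong token reveals nothing about the job.
  readResult(jobId: string, resultToken: string): { state: JobState; result?: unknown } | undefined {
    const job = this.jobs.get(jobId);
    if (!job || job.resultToken !== resultToken) return undefined;
    return { state: job.state, result: job.result };
  }
}
```

The sender would call `readResult` on an interval until `state` is `"done"`. Keeping the `resultToken` out of the claim response means a worker can produce a result without being able to read anyone else's.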

1 The problem: AI no longer lives in one place

For years the dominant pattern has been simple: the app gathers context, calls a remote provider and waits for a response. That model is still useful, but it becomes narrow when a product needs privacy, low latency, cost control or execution close to the user.

In Cadences, many kinds of compute coexist: Workers AI for cloud tasks, browsers that already have WebGPU, desktop apps with access to disk and peripherals, native nodes on the LAN and local models that should not leave a machine. The question shifts from "which provider do we use?" to "which device can run this specific job, now, under these constraints?"

The thesis

CadencIA is the distributed compute layer for Cadences: it records capabilities, chooses routes, assigns jobs and lets each runtime contribute what it can do without turning the product into a rigid dependency on a single backend.

2 Cadences AI Gateway and CadencIA are not the same thing

The important separation is conceptual. The Gateway belongs to the product and the business. CadencIA belongs to execution.

Cadences AI Gateway

Auth, customers, organizations and permissions.
Rate limits, billing, traces and provider cost.
Product policies: what can be requested, who asks for it and under which plan.
Abstraction over cloud providers, cheap/free quotas and external APIs.

CadencIA compute layer

Node identity, health vectors and capabilities.
Available models, cached models, runtimes and execution constraints.
Swarm registry, job coordinator, leases and completion.
Local, private, web-safe, LAN or governed fallback execution when it fits the job.

3 Current architecture: small control plane, heterogeneous workers

Cloudflare Worker API

Exposes public and authenticated endpoints, validates tokens, applies CORS and routes to Durable Objects. It is the front door of the control plane.

Swarm Registry

A Durable Object tracks live nodes, last heartbeat, health vector, announced models and runtime metadata.

Job Coordinator

Another Durable Object stores the queue, applies requirement filters, chooses the best compatible free peer and emits dynamic leases.

Swarm memory

The next layer records telemetry, completed jobs, failures, latencies and cached models as queryable memory for the swarm.

cadencia-node

Native Rust worker: Ed25519 identity, health/GPU probes, LAN local server, job loop, executor and optional local inference.

PWA Worker

Foreground browser worker that detects WebGPU/Wasm, runs a short benchmark, loads web models on demand and can claim web-safe jobs with a temporary claimToken.
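To make the split between registry and coordinator concrete, here is a minimal sketch of how a coordinator might pick the "best compatible free peer" from registry entries. The entry shape, the 60-second liveness window and the reduction of the health vector to a single `healthScore` are assumptions made for this illustration; the real schema and scoring are not shown in this article.

```typescript
// Hypothetical registry entry; field names are illustrative only.
interface NodeEntry {
  nodeId: string;
  lastHeartbeatMs: number;
  healthScore: number; // assumed scalar summary of the health vector, 0..1
  models: string[];    // models the node has announced
  activeJobs: number;
  maxJobs: number;
}

const HEARTBEAT_TTL_MS = 60_000; // assumed liveness window

// A node counts as live only if its last heartbeat is recent enough.
function isLive(node: NodeEntry, nowMs: number): boolean {
  return nowMs - node.lastHeartbeatMs <= HEARTBEAT_TTL_MS;
}

// Keep live nodes with spare capacity that announce the model,
// then prefer the healthiest; ties go to the node with more free slots.
function pickBestFreePeer(
  nodes: NodeEntry[],
  model: string | undefined,
  nowMs: number,
): NodeEntry | undefined {
  const candidates = nodes.filter(
    (n) =>
      isLive(n, nowMs) &&
      n.activeJobs < n.maxJobs &&
      (model === undefined || n.models.includes(model)),
  );
  candidates.sort(
    (a, b) =>
      b.healthScore - a.healthScore ||
      (b.maxJobs - b.activeJobs) - (a.maxJobs - a.activeJobs),
  );
  return candidates[0];
}
```

Passing `nowMs` explicitly keeps the selection deterministic and easy to test, which matters when the same logic has to run inside a single-threaded coordinator.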

4 How a job is routed

A job is not assigned by intuition. The producer declares a payload and, when needed, a requires block. The worker that claims it sends its capabilities: backend, runtime, memory, available models, whether it is in foreground, whether it is web-safe and how much work it can accept.

{
  "kind": "web_safe",
  "payload": { "prompt": "Classify this message as positive or negative." },
  "requires": {
    "web_safe": true,
    "foreground": true,
    "runtime_in": ["webgpu", "wasm"],
    "model": "distilbert-sentiment",
    "max_job_secs_lte": 30,
    "max_tokens_lte": 128
  }
}

Field	What it protects
`web_safe`	Prevents a public PWA from claiming work with side effects or requirements outside the browser.
`foreground`	Ensures the user keeps the page alive and visible while the job runs.
`runtime_in`	Distinguishes WebGPU, Wasm, native, CUDA or Metal without coupling the job to a specific device.
`model`	Only assigns the job to workers that have already announced that model as available.
`max_job_secs_lte`	Declares the job's time limit so the coordinator can calculate a realistic lease: a 30s base, a margin over the declared limit and a controlled maximum.
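The filtering and lease rules in the table can be sketched as two small functions. The `Requires` fields mirror the JSON sample; the `Capabilities` shape is hypothetical, and so are the 1.5x margin and the 300-second ceiling, since the article only states the 30-second base. The two `_lte` fields are treated here as limits declared by the job rather than as worker filters, which is also an assumption.

```typescript
// Mirrors the "requires" block of the JSON sample.
interface Requires {
  web_safe?: boolean;
  foreground?: boolean;
  runtime_in?: string[];
  model?: string;
  max_job_secs_lte?: number;
  max_tokens_lte?: number;
}

// Hypothetical shape of what a worker sends when it claims.
interface Capabilities {
  runtime: string;
  webSafe: boolean;
  foreground: boolean;
  models: string[];
}

// True only if the worker meets every requirement the job declares.
function satisfies(req: Requires, caps: Capabilities): boolean {
  if (req.web_safe && !caps.webSafe) return false;
  if (req.foreground && !caps.foreground) return false;
  if (req.runtime_in && !req.runtime_in.includes(caps.runtime)) return false;
  if (req.model !== undefined && !caps.models.includes(req.model)) return false;
  return true;
}

const BASE_LEASE_SECS = 30;  // stated in the table
const LEASE_MARGIN = 1.5;    // assumed margin over the declared limit
const MAX_LEASE_SECS = 300;  // assumed controlled maximum

// Lease = declared limit plus margin, never below the base or above the cap.
function leaseSeconds(req: Requires): number {
  const declared = req.max_job_secs_lte ?? 0;
  const withMargin = Math.ceil(declared * LEASE_MARGIN);
  return Math.min(MAX_LEASE_SECS, Math.max(BASE_LEASE_SECS, withMargin));
}
```

Under these assumed constants, the sample job above (a 30-second limit) would get a 45-second lease, while a job that declares no limit falls back to the 30-second base.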

5 The remote PWA: useful, but deliberately limited

The most interesting part of the current prototype is that the browser is no longer just an interface. The PWA detects capabilities, runs a short benchmark, loads models with Transformers.js or WebLLM, pairs with the control plane and appears in the swarm as a web worker.

Still, one rule matters: the PWA does not download models after claiming a job. It always announces rule-web-safe-triage and, only when the user has already loaded one, an allowed model such as distilbert-sentiment, minilm-embeddings or an experimental WebLLM model. The API filters that list through an explicit allowlist.

The web console now separates three states that used to blur together: a model downloaded in the browser, a model loaded in RAM and a model available for remote job claims. That distinction is the important part: a cached model improves startup, but only a loaded, allowed model announced as a real capability can enter routing. The lab chat is also remembered per device through the browser nodeId, not as a single global conversation.
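The three-state distinction can be expressed in a few lines: only models that are both loaded and allowlisted reach the announced list. This is a sketch under assumptions: the function name, the `ModelState` type and the allowlist contents are illustrative (the allowlist here is a subset built from the model names mentioned above), not the production configuration.

```typescript
// The three states the console tracks separately.
type ModelState = "supported" | "cached" | "loaded";

// Illustrative allowlist; the real one lives in the API.
const WEB_MODEL_ALLOWLIST = new Set(["distilbert-sentiment", "minilm-embeddings"]);

// The rule-based capability is announced regardless of loaded models.
const ALWAYS_ANNOUNCED = "rule-web-safe-triage";

// A cached model speeds up startup, but only a loaded, allowlisted
// model is advertised as a capability that routing may rely on.
function announcedModels(states: Record<string, ModelState>): string[] {
  const loaded = Object.keys(states)
    .filter((id) => states[id] === "loaded" && WEB_MODEL_ALLOWLIST.has(id))
    .sort();
  return [ALWAYS_ANNOUNCED, ...loaded];
}
```

A model that is loaded but missing from the allowlist is dropped just like a cached one, so the browser cannot advertise a capability the API would refuse anyway.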

Heartbeat

Publishes runtime, foreground state, benchmark, supported models, cached models and already loaded models.

Claim

Requires a temporary claimToken, foreground and web_safe. Only browser-suitable jobs enter.

Complete

Revalidates foreground/web_safe, limits result size and completes the assigned lease.
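The checks performed at completion time can be sketched as a single validation function. The request and lease shapes, the outcome names and the 16,384-character cap are assumptions for this example; the article only states that foreground/web_safe are revalidated, the result size is limited and the assigned lease is completed.

```typescript
// Hypothetical shapes for the sketch.
interface Lease {
  jobId: string;
  nodeId: string;
  expiresAtMs: number;
}

interface CompleteRequest {
  jobId: string;
  nodeId: string;
  foreground: boolean;
  webSafe: boolean;
  result: string;
}

// Assumed cap, measured in UTF-16 code units for brevity.
const MAX_RESULT_CHARS = 16 * 1024;

type CompleteOutcome =
  | "ok"
  | "not_foreground"
  | "not_web_safe"
  | "result_too_large"
  | "lease_invalid";

// Re-checks the claim-time conditions, then the size and the lease.
function validateComplete(
  req: CompleteRequest,
  lease: Lease | undefined,
  nowMs: number,
): CompleteOutcome {
  if (!req.foreground) return "not_foreground";
  if (!req.webSafe) return "not_web_safe";
  if (req.result.length > MAX_RESULT_CHARS) return "result_too_large";
  const leaseOk =
    lease !== undefined &&
    lease.jobId === req.jobId &&
    lease.nodeId === req.nodeId &&
    lease.expiresAtMs > nowMs;
  return leaseOk ? "ok" : "lease_invalid";
}
```

Revalidating at completion matters because the page can lose foreground between claim and complete; a result produced after that point is rejected rather than trusted.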

6 Security: do not trust the browser more than necessary

Conservative design

The visible clientNodeId is not secret: the server derives an isolated nodeId with sha256("web-pwa:<clientNodeId>").
The browser receives an ephemeral five-minute claimToken.
Multiple tokens can remain alive by TTL so a heartbeat does not invalidate an in-flight complete.
Public endpoints are rate-limited by IP and do not open the general queue.
Results are size-limited and complete requires foreground/web_safe again.

This is not an anonymous network for arbitrary execution. It is a control plane with clear limits: web-safe jobs first, stronger pairing next, real organization scopes and native workers wherever the browser cannot reach.
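Two of the measures listed above, the derived nodeId and the overlapping five-minute claim tokens, can be sketched as follows. The sketch uses Node's `node:crypto` module for brevity; the hex encoding of the hash, the `ClaimTokens` class and the 32-byte token size are assumptions, not details confirmed by the article.

```typescript
import { createHash, randomBytes } from "node:crypto";

const CLAIM_TOKEN_TTL_MS = 5 * 60 * 1000; // five minutes, as stated above

// The client-visible id is not secret; the server derives an isolated
// id from it. Hex output is an assumption of this sketch.
function deriveNodeId(clientNodeId: string): string {
  return createHash("sha256").update(`web-pwa:${clientNodeId}`).digest("hex");
}

// Hypothetical token store: each heartbeat issues a new token, and older
// tokens stay valid until their own TTL runs out, so an in-flight
// complete is not invalidated by a later heartbeat.
class ClaimTokens {
  private tokens = new Map<string, { nodeId: string; expiresAtMs: number }>();

  issue(nodeId: string, nowMs: number): string {
    const token = randomBytes(32).toString("hex");
    this.tokens.set(token, { nodeId, expiresAtMs: nowMs + CLAIM_TOKEN_TTL_MS });
    return token;
  }

  isValid(token: string, nodeId: string, nowMs: number): boolean {
    const entry = this.tokens.get(token);
    if (!entry) return false;
    if (entry.expiresAtMs <= nowMs) {
      this.tokens.delete(token); // lazily drop expired tokens
      return false;
    }
    return entry.nodeId === nodeId;
  }
}
```

Binding each token to the derived nodeId means a leaked token is useless to a different browser identity, and expiry by TTL alone avoids any revoke-on-heartbeat race.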

7 What already works and what we are not pretending

Already works

Public PWA paired through heartbeat and claimToken.
Swarm registry with visible native nodes and PWAs.
global-org-best-free-v1 scheduler with best compatible free node selection.
Claim/complete for web-safe jobs and already loaded web models.
Assistant with Distribute compute mode and safe polling through a temporary resultToken.
Local PWA memory with MiniLM embeddings isolated from the chat runtime.
Separate tracking for supported, cached and loaded models so the node does not advertise fake capabilities.
Lab chat persisted per browser/device.
API and PWA deployed to production with an end-to-end scheduler smoke test.

Still lab territory

The scheduler is polling-based, not push yet.
The PWA only works in foreground.
Full multi-org scopes are not in production yet.
QR/deep-link pairing is still pending.
Swarm memory is not synchronized across nodes yet.
The physical Samsung/iPhone test still needs both devices visible at the same time.
Heavy jobs still belong more naturally to the native node.

8 Why this matters for Cadences

The promise is not that everything runs in the browser, or that everything is decentralized for aesthetic reasons. The promise is more sober: an execution layer that can choose between cloud, edge, browser, desktop and local node depending on the work.

For a CRM, that can mean classifying sensitive text without sending it to an external provider. For a voice system, it can mean routing a transcript to a local worker. For an AI gateway, it can mean choosing between cost, latency, privacy and availability on each request. For an organization, it can mean using its own machines before paying for remote tokens.

CadencIA turns that idea into infrastructure: health, capabilities, models, leases, results and policies. Still small. Already running. Real enough to stop being a demo and start looking like a platform.

Next step

The natural path is secure QR/deep-link pairing, organization scopes, more expressive WebLLM jobs, explicit tool/internet capabilities and a Cadences AI Gateway adapter so the product can route to CadencIA as another provider, but one backed by real capability memory and provider cost limits.

9 The next vector: hybrid routing and swarm memory

There is a powerful middle ground between “everything local” and “everything cloud”: nodes authenticated in cadences.app that can ask for help from powerful Cadences models (Cloudflare Workers AI, DeepSeek, Gemini, Groq or others) within explicit quotas. The node should not hold provider keys. It asks for a job, declares context and constraints, and the Gateway decides whether it runs locally, in the PWA, on a native node or through a cheap cloud fallback.

That gives us a realistic bootstrap path: use small models or local embeddings for triage, privacy and memory; reserve powerful providers for steps where they add real value; and keep budgets per organization, node, user and task type. CadencIA does not replace the Gateway: it gives the Gateway execution signals so it can spend better.
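One way to picture the Gateway's decision is a routing function that prefers owned compute and falls back to the cloud only when privacy and budget allow. This is a deliberately simple sketch: the types, the preference order, the budget units and the idea of taking the tightest budget across scopes are assumptions for illustration, not the Gateway's actual policy.

```typescript
type Route = "local" | "pwa" | "native" | "cloud" | "rejected";

// Hypothetical request descriptor.
interface RouteRequest {
  privateData: boolean;       // data must stay on owned devices
  estimatedCloudCost: number; // in arbitrary budget units
}

// Which owned runtimes can run this specific job right now.
interface OwnedCapacity {
  local: boolean;
  pwa: boolean;
  native: boolean;
}

// Budgets are kept per organization, node, user and task type;
// the tightest remaining budget is the one that binds.
function remainingBudget(scopes: number[]): number {
  return Math.min(...scopes);
}

// Owned compute first; cloud only for non-private jobs within budget.
function chooseRoute(req: RouteRequest, canRun: OwnedCapacity, budget: number): Route {
  if (canRun.local) return "local";
  if (canRun.pwa) return "pwa";
  if (canRun.native) return "native";
  if (!req.privateData && req.estimatedCloudCost <= budget) return "cloud";
  return "rejected";
}
```

A real policy would also weigh latency and model quality, but even this ordering captures the intent: the cloud is a governed fallback, never the default.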

The other half is memory. The “mempalace”-style indexer already living in whatsapp-local-agent (SQLite, chunks, local embeddings, FTS5, vector search and reciprocal rank fusion, RRF) fits swarm telemetry almost directly. Instead of indexing only messages, a node can index events: health vectors, cached models, completed leases, errors, latency, battery, GPU, stability and results. With that, the scheduler stops seeing a snapshot and starts querying history.
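For readers unfamiliar with RRF, the fusion step combines several ranked lists (for example a full-text ranking and a vector-search ranking) by summing 1 / (k + rank) for each item across lists. The sketch below is a generic implementation of that formula, not code from whatsapp-local-agent; the function name is hypothetical and k = 60 is the conventional default rather than a value taken from the project.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// with ranks starting at 1. Items ranked well in several lists rise.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Example: a keyword ranking and a vector ranking over telemetry events.
const keywordHits = ["node-a", "node-b", "node-c"];
const vectorHits = ["node-b", "node-d", "node-a"];
const fused = reciprocalRankFusion([keywordHits, vectorHits]);
// fused order: node-b, node-a, node-d, node-c
```

Because RRF uses only ranks, it needs no score normalization between FTS5 and vector distances, which is what makes it a convenient fit for mixing the two over telemetry events.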

Local first

Triage, embeddings and private jobs stay in the browser or owned node when that is enough.

Cloud with quota

Powerful models enter as a fallback governed by budget, provider and plan.

Operational memory

Telemetry embeddings let us ask “which nodes usually satisfy this kind of job?”