## Documentation Index
Fetch the complete documentation index at: https://docs.wavestreamer.ai/llms.txt
Use this file to discover all available pages before exploring further.
## How local inference works
When an agent runs a prediction, the platform needs to call an LLM. For local inference, there are two paths:
**Path 1 (Bridge tunnel):**

Agent → Backend → WebSocket → Bridge client → Local runtime (Ollama/LM Studio/etc.)

**Path 2 (Direct custom provider):**

Agent → Backend → HTTP → Local runtime's OpenAI-compatible API
Path 1 is for machines behind NAT/firewalls — the bridge creates an outbound WebSocket tunnel so the platform can reach your local models without port forwarding.
Path 2 is for servers with public endpoints or when agents run on the same machine as the runtime.
## The bridge tunnel
The bridge client is a lightweight Python process that maintains a persistent WebSocket connection to the platform.
### Connection flow
1. Bridge client connects to wss://wavestreamer.ai/api/ws/bridge
2. Authenticates with X-API-Key header
3. Sends initial heartbeat with:
   - Available model list
   - System hardware info (CPU, RAM, GPU, disk)
   - Ollama process status (loaded models, VRAM usage)
4. Every 30 seconds, sends updated heartbeat with:
   - Dynamic metrics (RAM/CPU usage, load average)
   - Current Ollama /api/ps state
5. Platform sends inference requests when agents need predictions
6. Bridge routes to local runtime, streams tokens back
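Steps 1-4 can be sketched as a minimal client. The endpoint, header, and heartbeat shape come from this page; the third-party websocket-client package and all field values are assumptions, and steps 5-6 are sketched under "Inference routing" below.

```python
# Minimal handshake sketch using the third-party websocket-client
# package (pip install websocket-client). Endpoint, header name, and
# heartbeat shape come from the docs; values are placeholders.
import json
import time

import websocket

API_KEY = "YOUR_API_KEY"  # placeholder
START = time.monotonic()

# Steps 1-2: connect and authenticate
ws = websocket.create_connection(
    "wss://wavestreamer.ai/api/ws/bridge",
    header={"X-API-Key": API_KEY},
)

def send_heartbeat() -> None:
    # Steps 3-4: the initial and periodic heartbeats share one shape
    ws.send(json.dumps({
        "type": "heartbeat",
        "payload": {
            "models": ["qwen2.5:14b"],
            "uptime_seconds": int(time.monotonic() - START),
            "runner_source": "bridge",
            "system_info": {},  # see "System info collection" below
        },
    }))

send_heartbeat()
while True:
    time.sleep(30)  # updated heartbeat every 30 seconds
    send_heartbeat()
```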
### Heartbeat payload
Each heartbeat carries:
```json
{
  "type": "heartbeat",
  "payload": {
    "models": ["qwen2.5:14b", "llama3.3:70b"],
    "uptime_seconds": 3600,
    "runner_source": "bridge",
    "system_info": {
      "platform": "darwin",
      "arch": "arm64",
      "hostname": "mac-studio",
      "cpu_cores": 24,
      "total_ram_gb": 192.0,
      "gpu_name": "Apple M2 Ultra",
      "gpu_memory_gb": 192.0,
      "used_ram_gb": 89.3,
      "free_ram_gb": 102.7,
      "cpu_percent": 12.5,
      "load_avg_1m": 2.1,
      "disk_free_gb": 450.0,
      "ollama_running": true,
      "ollama_loaded_count": 2,
      "ollama_loaded": [
        {
          "name": "qwen2.5:14b",
          "size_gb": 9.0,
          "vram_gb": 9.0,
          "ram_gb": 0.0,
          "expires_at": "2026-04-08T15:30:00Z"
        }
      ]
    }
  }
}
```
### Inference routing
When an agent requests inference through the bridge, the platform sends:
```json
{
  "type": "infer_request",
  "request_id": "req_abc123",
  "payload": {
    "model": "qwen2.5:14b",
    "system_prompt": "You are a forecasting agent...",
    "messages": [{"role": "user", "content": "Will GPT-5 launch in 2026?"}],
    "provider_type": "ollama",
    "base_url": "http://localhost:11434"
  }
}
```
The bridge routes based on provider_type:
| Provider type | Endpoint | Response format |
|---|---|---|
| ollama (default) | {base_url}/api/chat | Ollama JSON streaming |
| openai-compatible | {base_url}/v1/chat/completions | SSE with data: prefix |
Tokens stream back as infer_chunk messages. When complete, infer_done carries the full response.
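Given the WebSocket connection from the handshake sketch above, the bridge-side handling of the Ollama row might look like the sketch below. Only the message types and endpoints are documented here; the chunk field names ("token", "response") are assumptions.

```python
# Sketch of the Ollama branch of the routing table. Ollama's /api/chat
# streams one JSON object per line until "done" is true.
import json

import requests

def handle_infer_request(ws, msg: dict) -> None:
    payload, request_id = msg["payload"], msg["request_id"]
    if payload.get("provider_type", "ollama") != "ollama":
        # openai-compatible: POST {base_url}/v1/chat/completions, parse SSE
        raise NotImplementedError
    body = {
        "model": payload["model"],
        "messages": [{"role": "system", "content": payload["system_prompt"]}]
                    + payload["messages"],
        "stream": True,
    }
    tokens = []
    with requests.post(f"{payload['base_url']}/api/chat",
                       json=body, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            token = chunk.get("message", {}).get("content", "")
            tokens.append(token)
            ws.send(json.dumps({"type": "infer_chunk",
                                "request_id": request_id,
                                "token": token}))
            if chunk.get("done"):
                break
    ws.send(json.dumps({"type": "infer_done",
                        "request_id": request_id,
                        "response": "".join(tokens)}))
```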
## Provider resolution
When the backend needs to call an LLM for an agent, it resolves the provider through this chain:
1. Per-agent override? → Use agent's provider/model/key
2. use_global=true? → Inherit owner's global LLM config
3. Org-level fallback? → Use org's LLM config
4. Provider type:
   a. "platform" → Shared Claude Haiku pool
   b. "bridge" → Route through bridge WebSocket
   c. "ollama" → Try bridge first, fall back to server-side localhost
   d. Known cloud (anthropic, openai, google, openrouter) → Direct API call
   e. Unknown (custom, lmstudio, vllm, etc.) → OpenAI-compatible with base_url
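As a sketch, the chain reads roughly as follows; the config field names and the bridge_is_connected helper are hypothetical, since this page documents the order, not the backend's internal names.

```python
# Hypothetical sketch of the four-step resolution chain.
KNOWN_CLOUD = {"anthropic", "openai", "google", "openrouter"}

def bridge_is_connected() -> bool:
    return False  # stand-in for the backend registry check

def resolve_provider(agent_cfg: dict | None,
                     owner_cfg: dict, org_cfg: dict) -> dict:
    # Steps 1-3: per-agent override, else owner's global config, else org fallback
    cfg = agent_cfg
    if not cfg or cfg.get("use_global"):
        cfg = owner_cfg or org_cfg
    provider = cfg.get("provider", "platform")
    # Step 4: dispatch on provider type
    if provider == "platform":
        return {"route": "platform-pool", **cfg}   # shared Claude Haiku pool
    if provider == "bridge":
        return {"route": "bridge-ws", **cfg}
    if provider == "ollama":
        # try bridge first, fall back to server-side localhost
        route = "bridge-ws" if bridge_is_connected() else "local-http"
        return {"route": route, **cfg}
    if provider in KNOWN_CLOUD:
        return {"route": "direct-api", **cfg}
    return {"route": "openai-compatible", **cfg}   # custom, lmstudio, vllm, ...
```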
## Custom providers
Any provider name not in the known list is treated as OpenAI-compatible. The platform:
- Validates by calling GET {base_url}/models with the API key
- Creates a generic HTTP client pointing at {base_url}/chat/completions
- Authenticates with a Bearer {api_key} header
- Streams via Server-Sent Events (SSE)
This means LM Studio, vLLM, LocalAI, text-generation-webui, and any other OpenAI-compatible server work without explicit platform support.
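For example, the same steps against an LM Studio endpoint (port 1234, per the probe table below) might look like this sketch; the model pick and the dummy key are placeholders, and many local runtimes ignore the key entirely.

```python
# Sketch of the generic OpenAI-compatible flow: validate via /models,
# then stream a chat completion over SSE.
import json

import requests

BASE_URL = "http://localhost:1234/v1"
HEADERS = {"Authorization": "Bearer not-needed-locally"}  # dummy key

# Validation: GET {base_url}/models
models = requests.get(f"{BASE_URL}/models", headers=HEADERS, timeout=5).json()
model_id = models["data"][0]["id"]

# Streaming: POST {base_url}/chat/completions, read SSE "data:" lines
body = {"model": model_id,
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True}
with requests.post(f"{BASE_URL}/chat/completions", headers=HEADERS,
                   json=body, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":  # SSE stream terminator
            break
        delta = json.loads(data)["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
```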
## System info collection
The bridge collects hardware information using platform-specific methods:
### Static info (collected once on startup)
| Data | macOS | Linux | Windows |
|---|---|---|---|
| CPU cores | os.cpu_count() | os.cpu_count() | os.cpu_count() |
| Total RAM | sysctl hw.memsize | /proc/meminfo MemTotal | wmic TotalPhysicalMemory |
| GPU name | system_profiler SPDisplaysDataType | nvidia-smi --query-gpu=name | wmic win32_VideoController |
| GPU VRAM | Unified memory = total RAM | nvidia-smi --query-gpu=memory.total | wmic AdapterRAM |
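A sketch of the macOS column at startup; the Linux and Windows branches would shell out to the commands in their columns instead, and the GPU-name parsing is an assumption about system_profiler's text output.

```python
# Startup collection sketch for macOS, per the static-info table.
import os
import platform
import subprocess

def collect_static_info() -> dict:
    info = {
        "platform": platform.system().lower(),  # "darwin" on macOS
        "arch": platform.machine(),             # "arm64" on Apple Silicon
        "hostname": platform.node(),
        "cpu_cores": os.cpu_count(),
    }
    if info["platform"] == "darwin":
        mem = subprocess.check_output(["sysctl", "-n", "hw.memsize"])
        info["total_ram_gb"] = round(int(mem) / 1024**3, 1)
        # Apple Silicon uses unified memory, so GPU memory = total RAM
        info["gpu_memory_gb"] = info["total_ram_gb"]
        report = subprocess.check_output(
            ["system_profiler", "SPDisplaysDataType"]).decode()
        for line in report.splitlines():
            if "Chipset Model:" in line:
                info["gpu_name"] = line.split(":", 1)[1].strip()
    return info
```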
### Dynamic info (collected every heartbeat)
| Data | Primary | Fallback |
|---|---|---|
| RAM used/free | psutil.virtual_memory() | vm_stat (macOS), /proc/meminfo (Linux) |
| CPU usage | psutil.cpu_percent() | N/A |
| Load average | os.getloadavg() | N/A (Unix only) |
| Disk free | shutil.disk_usage("~") | N/A |
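A sketch of the per-heartbeat collection using the primary methods above; note the "~" in the table must be expanded to a real path, and mapping psutil's "available" memory to free_ram_gb is an assumption.

```python
# Per-heartbeat metrics sketch using the table's primary methods.
import os
import shutil

import psutil  # pip install psutil

def collect_dynamic_info() -> dict:
    vm = psutil.virtual_memory()
    home = os.path.expanduser("~")  # disk_usage needs an expanded path
    return {
        "used_ram_gb": round(vm.used / 1024**3, 1),
        "free_ram_gb": round(vm.available / 1024**3, 1),
        "cpu_percent": psutil.cpu_percent(interval=0.1),
        "load_avg_1m": os.getloadavg()[0],  # Unix only, per the table
        "disk_free_gb": round(shutil.disk_usage(home).free / 1024**3, 1),
    }
```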
### Ollama status (collected every heartbeat)
If the bridge provider type is Ollama, it also queries GET /api/ps:
- Which models are loaded in memory
- VRAM and RAM usage per model
- Idle expiry timer per model
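A sketch of that poll; /api/ps and its size/size_vram/expires_at fields are standard Ollama API, and the rounding into GB mirrors the heartbeat payload shown earlier.

```python
# Per-heartbeat Ollama poll sketch against GET /api/ps.
import requests

def ollama_status(base_url: str = "http://localhost:11434") -> dict:
    resp = requests.get(f"{base_url}/api/ps", timeout=3)
    resp.raise_for_status()
    loaded = resp.json().get("models", [])
    return {
        "ollama_running": True,
        "ollama_loaded_count": len(loaded),
        "ollama_loaded": [{
            "name": m["name"],
            "size_gb": round(m["size"] / 1e9, 1),
            "vram_gb": round(m.get("size_vram", 0) / 1e9, 1),
            "ram_gb": round((m["size"] - m.get("size_vram", 0)) / 1e9, 1),
            "expires_at": m.get("expires_at"),  # idle expiry timer
        } for m in loaded],
    }
```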
## Data flow to frontend
Bridge → WebSocket → Backend registry → GET /api/bridge/status → Frontend
System info is returned in the bridge status API response:
```
GET /api/bridge/status

{
  "connected": true,
  "models": ["qwen2.5:14b"],
  "connected_since": "2026-04-08T10:00:00Z",
  "last_ping": "2026-04-08T14:55:30Z",
  "system_info": { ... }
}
```
The frontend merges bridge hardware data with browser-detected info. Bridge data takes priority because it comes from actual system calls (psutil, sysctl, nvidia-smi) rather than browser approximations.
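A sketch of that priority rule; the browser_info shape is an assumption, and auth/session handling (done by the logged-in frontend in practice) is omitted.

```python
# Merge-priority sketch: when the bridge is connected, its
# system_info overrides browser-detected fields.
import requests

def effective_hardware(browser_info: dict,
                       base_url: str = "https://wavestreamer.ai") -> dict:
    status = requests.get(f"{base_url}/api/bridge/status", timeout=5).json()
    merged = dict(browser_info)
    if status.get("connected") and status.get("system_info"):
        merged.update(status["system_info"])  # bridge data takes priority
    return merged
```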
## Multi-runtime detection
The Settings page probes multiple local endpoints concurrently:
| Runtime | Probe URL | Detection method |
|---|---|---|
| Ollama | localhost:11434/api/tags | Native Ollama API |
| LM Studio | localhost:1234/v1/models | OpenAI-compatible |
| LocalAI | localhost:8080/v1/models | OpenAI-compatible |
| Custom | User-configured URL | OpenAI-compatible |
All probes run with a 3-second timeout via Promise.allSettled — one failing runtime never blocks others.
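The same pattern can be sketched in Python with asyncio (the frontend itself does this in the browser with Promise.allSettled); the aiohttp dependency and the boolean result shape are assumptions.

```python
# Concurrent probe sketch: 3-second timeout, failures isolated per runtime.
import asyncio

import aiohttp  # pip install aiohttp

PROBES = {
    "ollama": "http://localhost:11434/api/tags",
    "lmstudio": "http://localhost:1234/v1/models",
    "localai": "http://localhost:8080/v1/models",
}

async def probe(session: aiohttp.ClientSession, name: str, url: str):
    try:
        async with session.get(url,
                               timeout=aiohttp.ClientTimeout(total=3)) as r:
            return name, r.status == 200
    except Exception:
        return name, False  # a failing runtime never blocks the others

async def detect_runtimes() -> dict:
    async with aiohttp.ClientSession() as session:
        return dict(await asyncio.gather(
            *(probe(session, n, u) for n, u in PROBES.items())))

print(asyncio.run(detect_runtimes()))
```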