Headscale Series: Build a Headscale Cluster That Scales to (Almost) Unlimited Devices

Audience: engineers, architects, and SREs already running headscale + Tailscale as a self-hosted control plane.
TL;DR: With clusterID sharding + shared Postgres + an admission/control service + query isolation + incremental netmap, you can evolve single-process headscale into a horizontally scalable clustered service—functionally close to Tailscale’s SaaS model.


1) Current State & Problems

In many real deployments (especially with default/monolithic setups), headscale shows these limits:

  1. Single tailnet; weak tenant isolation
    By default, there isn’t first-class multi-tenancy. Strong “tenant boundaries” (isolation, quotas, policies) are hard to achieve.

  2. Single-process design; hard to scale horizontally
    As a monolith, headscale handles thousands of devices, but at tens of thousands the control-plane load (heartbeats, registration, policy changes, push fan-out) drives CPU/scheduling spikes.

  3. Netmap recomputed on every device/state change
    Each device online/offline/heartbeat/ACL change triggers (near) full netmap recomputation. That O(N) cost at high event rates makes CPU the bottleneck.

These issues narrow the path to “SaaS-like headscale that serves 100k+ devices.”


2) Common Approaches: Pros & Cons

  • Approach A: One tenant = one headscale + one database

    • Pros: excellent isolation; minimal code changes.
    • Cons: massive resource waste; painful ops (N× upgrades/monitoring/alerts/certs/backups).
  • Approach B: Make upstream headscale truly multi-tailnet

    • Pros: aligns with the goal.
    • Cons: deep refactors across auth, data model, queries, caching, event bus, ACL interpreter, and netmap generator;
      even with multi-tailnet, a single process remains a bottleneck.

3) Our Scalable Solution (Linear Growth)

Core idea: Turn headscale into a shardable worker node. Each node has a clusterID. All nodes share one Postgres (or a PG cluster), but each node only sees rows for its clusterID. In front, run an Admission/Control Service (ACS) that assigns tenants/users to a clusterID on first contact (sticky), enabling horizontal scale-out.

3.1 Architecture (ASCII)


                         +-----------------------+
Register/Login --->      |   Admission/Control   | <--- Admin API/Billing/Quotas
    / Scale              |     Service (ACS)     |
                         +-----------+-----------+
                                     |
                             choose clusterID
                                     v
          +--------------------------+--------------------------+
          |                          |                          |
    +-----+-----+              +-----+-----+              +-----+-----+
    | headscale |              | headscale |              | headscale |
    |  node A   |              |  node B   |              |  node C   |
    |clusterID=A|              |clusterID=B|              |clusterID=C|
    +-----+-----+              +-----+-----+              +-----+-----+
           \                         |                         /
            \                        |                        /
             \-----------------------+-----------------------/
                                     |
                   +-----------------------------------+
                   |   Shared Postgres / PG Cluster    |
                   |      (cluster_id dimension)       |
                   +-----------------------------------+

Clients:
On first join -> call ACS -> receive "available headscale address + clusterID + registration info"
Then all traffic sticks to that node. New tenants are placed by ACS during scale-out.

Key points

  • Shared Postgres, but every read/write is filtered by cluster_id, which logically shards one big DB.
  • ACS handles tenant creation, quotas, clusterID assignment, and first-time registration steering.
  • Headscale nodes are “as stateless as possible”: durable state lives in PG; caches accelerate but are reproducible.
  • DERP map, OIDC, ACLs, keys, etc. are stored per tenant / per cluster; queries are isolated.

4) Critical Implementation Details

4.1 Database Changes

4.1.1 Add cluster_id to all tenant/device tables

Add cluster_id to tables for tenants/users/devices/keys/routes/ACLs/sessions/netmap metadata and create composite indexes.

Example (conceptual):

-- 1) Columns & defaults
ALTER TABLE nodes ADD COLUMN cluster_id TEXT NOT NULL DEFAULT 'default';
ALTER TABLE users ADD COLUMN cluster_id TEXT NOT NULL DEFAULT 'default';
ALTER TABLE api_keys ADD COLUMN cluster_id TEXT NOT NULL DEFAULT 'default';
ALTER TABLE routes ADD COLUMN cluster_id TEXT NOT NULL DEFAULT 'default';
ALTER TABLE acl_policies ADD COLUMN cluster_id TEXT NOT NULL DEFAULT 'default';
ALTER TABLE preauth_keys ADD COLUMN cluster_id TEXT NOT NULL DEFAULT 'default';
ALTER TABLE device_sessions ADD COLUMN cluster_id TEXT NOT NULL DEFAULT 'default';

-- 2) Typical indexes
CREATE INDEX idx_nodes_cluster_id ON nodes(cluster_id);
CREATE INDEX idx_users_cluster_id ON users(cluster_id);
CREATE INDEX idx_routes_cluster_u_drt ON routes(cluster_id, user_id, dest_prefix);
CREATE INDEX idx_acl_cluster_u_rev ON acl_policies(cluster_id, user_id, revision DESC);

If you’re currently “one DB per headscale,” you can keep DB-level sharding for hard isolation—but shared PG + cluster_id gives better resource reuse and elasticity.

4.1.2 Enforce cluster_id in the data access layer

  • On process start, resolve CLUSTER_ID (env/flag).
  • Inject WHERE cluster_id = ? into every query via the ORM/DAO (e.g., GORM scoped query) or handwritten SQL.
  • Guard against “unscoped” queries (fail fast).

Go sketch:

type ClusterDB struct {
    db        *gorm.DB
    clusterID string
}

// Scoped returns a session that always filters by this node's cluster_id.
func (c *ClusterDB) Scoped() *gorm.DB {
    return c.db.Where("cluster_id = ?", c.clusterID)
}

// Usage
func (c *ClusterDB) ListNodes(ctx context.Context) ([]Node, error) {
    var nodes []Node
    return nodes, c.Scoped().WithContext(ctx).Find(&nodes).Error
}

4.2 Server-Side Changes

4.2.1 Process startup with clusterID

# Each headscale node starts with its own cluster_id
CLUSTER_ID=A \
PG_DSN="postgres://..." \
DERP_MAP_URL="https://derp.example.com/map.json" \
./headscale --config ./config.yaml
  • Propagate CLUSTER_ID to logs, metric labels, event topics, etc. (see the metrics sketch after this list).
  • All internal logic operates only on this cluster’s data.
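
One low-friction way to propagate the shard identity into observability is to bake CLUSTER_ID in as a constant metric label at registration time. A minimal sketch, assuming the Prometheus Go client (client_golang); the metric name here is illustrative, not an existing headscale metric:

import "github.com/prometheus/client_golang/prometheus"

// NewNetmapRebuilds returns a counter permanently labeled with this node's cluster_id.
// clusterID is the value resolved from CLUSTER_ID at startup.
func NewNetmapRebuilds(clusterID string) *prometheus.CounterVec {
    return prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name:        "netmap_rebuild_total",
            Help:        "Incremental netmap rebuilds, by scope.",
            ConstLabels: prometheus.Labels{"cluster_id": clusterID},
        },
        []string{"scope"}, // e.g., user, tag, or route domain
    )
}

Register the vector once and increment it in the rebuild path; per-shard dashboards then become a simple cluster_id filter.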

4.2.2 Incremental netmap + convergence

Problem: Full netmap recomputation is O(N); frequent triggers burn CPU.
Fix:

  1. Graph decomposition: model netmap as “nodes + ACL edges + route prefixes,” so any change maps to a limited impact set.

  2. Impact tracking: when device X / ACL Y changes, only recompute for the affected subset (same user/tag/route domain).

  3. Cache + versioning:

    • Maintain a netmap version per (cluster_id, tenant/user).
    • Cache results in memory (LRU) or Redis with signatures.
    • On push, only send when “subscriber_version < latest_version” (delta); see the versioned-cache sketch after this list.
  4. PG LISTEN/NOTIFY: triggers on change tables emit NOTIFY netmap_dirty (cluster_id, scope_key). Nodes LISTEN and queue work with debounce (50–200 ms).
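
For the versioned cache in item 3, a minimal sketch of the bookkeeping (type and method names are assumptions, not existing headscale code): track the latest version per scope and the version each device last received, and push only when a subscriber is behind.

import "sync"

// netmapVersions tracks the latest netmap version per scope and what each
// subscriber (device) last received; a push happens only when a device is behind.
type netmapVersions struct {
    mu     sync.Mutex
    latest map[string]uint64            // scope key -> latest netmap version
    sent   map[string]map[string]uint64 // scope key -> device ID -> pushed version
}

func newNetmapVersions() *netmapVersions {
    return &netmapVersions{
        latest: map[string]uint64{},
        sent:   map[string]map[string]uint64{},
    }
}

// Bump marks a scope dirty and returns its new version.
func (v *netmapVersions) Bump(scope string) uint64 {
    v.mu.Lock()
    defer v.mu.Unlock()
    v.latest[scope]++
    return v.latest[scope]
}

// NeedsPush reports whether the device is behind and, if so, records the push.
func (v *netmapVersions) NeedsPush(scope, deviceID string) (uint64, bool) {
    v.mu.Lock()
    defer v.mu.Unlock()
    latest := v.latest[scope]
    if v.sent[scope] == nil {
        v.sent[scope] = map[string]uint64{}
    }
    if v.sent[scope][deviceID] >= latest {
        return latest, false
    }
    v.sent[scope][deviceID] = latest
    return latest, true
}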

Dirty-event handling sketch:

type DirtyEvent struct {
    ClusterID string
    ScopeKey  string // e.g., userID, tagID, or route domain
}

func (s *Server) onDirty(e DirtyEvent) {
    if e.ClusterID != s.clusterID {
        return // only handle my cluster
    }
    s.debouncer.Push(e.ScopeKey) // merge changes over 50–200 ms
}

func (s *Server) rebuild(scopeKey string) {
    // 1) Find affected devices
    affected := s.index.LookupDevices(scopeKey)
    // 2) Rebuild the netmap only for affected devices
    for _, d := range affected {
        nm := s.builder.BuildForDevice(d)
        s.cache.Put(d, nm)
        s.pushToDevice(d, nm)
    }
}

This is the CPU game-changer. In real clusters we’ve seen 60–90% average CPU reduction.
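
The debouncer used in that sketch is what performs the 50–200 ms merge. A minimal sketch of one possible shape (an assumption, not existing headscale code); the flush callback would call rebuild for each coalesced scope key:

import (
    "sync"
    "time"
)

// Debouncer coalesces dirty scope keys and flushes them after a quiet window.
type Debouncer struct {
    mu      sync.Mutex
    pending map[string]struct{}
    window  time.Duration
    flush   func(scopeKeys []string) // e.g., calls Server.rebuild for each key
    timer   *time.Timer
}

func NewDebouncer(window time.Duration, flush func([]string)) *Debouncer {
    return &Debouncer{pending: map[string]struct{}{}, window: window, flush: flush}
}

// Push records a dirty scope key; the first key of a burst arms the flush timer.
func (d *Debouncer) Push(scopeKey string) {
    d.mu.Lock()
    defer d.mu.Unlock()
    d.pending[scopeKey] = struct{}{}
    if d.timer == nil {
        d.timer = time.AfterFunc(d.window, d.fire)
    }
}

func (d *Debouncer) fire() {
    d.mu.Lock()
    keys := make([]string, 0, len(d.pending))
    for k := range d.pending {
        keys = append(keys, k)
    }
    d.pending = map[string]struct{}{}
    d.timer = nil
    d.mu.Unlock()

    d.flush(keys) // run outside the lock so rebuilds don't block new pushes
}

Wiring it up looks like s.debouncer = NewDebouncer(100*time.Millisecond, func(keys []string) { for _, k := range keys { s.rebuild(k) } }).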

4.3 Admission/Control Service (ACS)

Responsibilities

  • First-time admission and clusterID assignment (weighted by plan, geography, live load).
  • Returns the target headscale URL and registration parameters (e.g., derived AuthKey, OIDC URL).
  • Sticky placement: the same tenant/account consistently maps to the same cluster unless migrated.

Typical API

POST /api/v1/bootstrap
{
  "account": "acme-inc",
  "plan": "pro",
  "want_region": "ap-sg"
}

200 OK
{
  "cluster_id": "A",
  "headscale_url": "https://hs-a.example.com",
  "derp_map": "https://derp.example.com/map.json",
  "auth_mode": "oidc",
  "note": "stick to cluster A"
}

Placement algorithm: consistent hashing + load factors (CPU, session count, netmap backlog).
Rebalance: ACS maintains “migration plans”; new devices go to new clusters; existing devices migrate in batches (see 6.3).
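
A minimal sketch of that placement idea: a small consistent-hash ring gives sticky default placement, and a load override spills hot shards to the least-loaded cluster. The load map and maxLoad threshold are assumptions the ACS would feed from its own monitoring:

import (
    "fmt"
    "hash/fnv"
    "sort"
)

// Ring is a tiny consistent-hash ring: each cluster contributes vnodes points.
type Ring struct {
    points []uint32
    owner  map[uint32]string
}

func NewRing(clusterIDs []string, vnodes int) *Ring {
    r := &Ring{owner: map[uint32]string{}}
    for _, id := range clusterIDs {
        for v := 0; v < vnodes; v++ {
            h := fnv.New32a()
            fmt.Fprintf(h, "%s#%d", id, v)
            p := h.Sum32()
            r.points = append(r.points, p)
            r.owner[p] = id
        }
    }
    sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
    return r
}

// Lookup walks clockwise from the account's hash to the next ring point.
func (r *Ring) Lookup(account string) string {
    if len(r.points) == 0 {
        return ""
    }
    h := fnv.New32a()
    h.Write([]byte(account))
    k := h.Sum32()
    i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= k })
    if i == len(r.points) {
        i = 0
    }
    return r.owner[r.points[i]]
}

// PlaceTenant is sticky by default; if the hashed target is over maxLoad,
// it spills over to the least-loaded cluster instead.
func PlaceTenant(r *Ring, account string, load map[string]float64, maxLoad float64) string {
    id := r.Lookup(account)
    if load[id] < maxLoad {
        return id
    }
    best, bestLoad := id, load[id]
    for c, l := range load {
        if l < bestLoad {
            best, bestLoad = c, l
        }
    }
    return best
}

Because the ring only moves keys adjacent to an added or removed point, adding a node shifts a bounded share of new placements instead of reshuffling every tenant.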


5) End-to-End Flow (ASCII Sequence)

5.1 First-time device join

Client -> ACS: POST /bootstrap (account, plan, region)
ACS -> PG : read tenant/quota/bindings
ACS -> Client: {cluster_id=A, headscale_url=https://hs-a...}

Client -> Headscale(A): OIDC/registration/heartbeat
Headscale(A) -> PG: write nodes/users/... (cluster_id=A)
Headscale(A) -> Client: pushes netmap
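
The client side of the first two arrows is just an HTTP call to the ACS before the normal headscale login. A minimal sketch against the /api/v1/bootstrap shape from section 4.3 (the ACS base URL and error handling are assumptions):

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
)

type BootstrapRequest struct {
    Account    string `json:"account"`
    Plan       string `json:"plan"`
    WantRegion string `json:"want_region"`
}

type BootstrapResponse struct {
    ClusterID    string `json:"cluster_id"`
    HeadscaleURL string `json:"headscale_url"`
    DERPMap      string `json:"derp_map"`
    AuthMode     string `json:"auth_mode"`
}

// Bootstrap asks the ACS where this account should register.
func Bootstrap(ctx context.Context, acsURL string, req BootstrapRequest) (*BootstrapResponse, error) {
    body, err := json.Marshal(req)
    if err != nil {
        return nil, err
    }
    httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, acsURL+"/api/v1/bootstrap", bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    httpReq.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(httpReq)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("bootstrap failed: %s", resp.Status)
    }
    var out BootstrapResponse
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return nil, err
    }
    return &out, nil
}

The device then uses the returned headscale_url as its login server, and all subsequent control traffic sticks to that node.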

5.2 Device change triggers incremental netmap

Client(Device X) -> Headscale(A): heartbeat/route change
Headscale(A) -> PG: update device_sessions/routes (cluster_id=A)
PG Trigger -> NOTIFY netmap_dirty (A, scope=user:acme)
Headscale(A) -> rebuild scope=user:acme (incremental)
Headscale(A) -> push updated netmaps to affected devices

6) Operations & Evolution

6.1 Deployment tips

  • PG/PG cluster: primary/replica with streaming replication; partition tables by cluster_id or use partial indexes to handle hot tenants at high concurrency.

  • Caching: local memory + optional Redis (share hot sets across instances; faster recovery).

  • TLS/DERP: one DERP map entry point; deploy multiple DERPs by region; all headscale nodes share the same DERP map.

  • Observability:

    • Metrics: netmap_rebuild_qps{cluster_id}, netmap_rebuild_latency_ms{scope}, notify_backlog, sql_qps{table}, push_failures.
    • Logs: label by cluster_id to isolate shard anomalies.

6.2 Quotas & Productization

  • Free tier: enforce “max devices per account” at ACS; deny extra registrations (see the quota sketch after this list).
  • Commercial: assign a dedicated clusterID (exclusive headscale node) per tenant to guarantee performance.
  • Billing: track device online time and/or traffic (if you account for it) in PG; aggregate by clusterID + account.
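
For the free-tier cap above, a sketch of a quota check in the ACS bootstrap path; the plan limits, the *sql.DB handle, and the nodes/users join are assumptions about the shared schema:

import (
    "context"
    "database/sql"
    "errors"
    "fmt"
)

var ErrDeviceQuotaExceeded = errors.New("device quota exceeded for plan")

// planLimits is illustrative; real limits would come from billing/config.
var planLimits = map[string]int{"free": 5, "pro": 100}

type ACS struct {
    db *sql.DB // shared Postgres
}

// CheckDeviceQuota rejects a bootstrap when the account already holds its quota of devices.
func (a *ACS) CheckDeviceQuota(ctx context.Context, account, plan string) error {
    limit, ok := planLimits[plan]
    if !ok {
        return fmt.Errorf("unknown plan %q", plan)
    }
    var count int
    // Assumed schema: nodes belong to users, users carry the account name.
    err := a.db.QueryRowContext(ctx,
        `SELECT COUNT(*)
           FROM nodes n
           JOIN users u ON u.id = n.user_id AND u.cluster_id = n.cluster_id
          WHERE u.name = $1`, account).Scan(&count)
    if err != nil {
        return err
    }
    if count >= limit {
        return ErrDeviceQuotaExceeded
    }
    return nil
}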

6.3 Migration & Rebalancing (no downtime)

  • Add node C (clusterID=C); ACS routes new signups to C.

  • Gradual tenant migration:

    1. Mark tenant T’s “target cluster = C”;
    2. Issue hints so T’s devices switch to headscale(C) on the next reconnect;
    3. Devices naturally flip during heartbeat timeouts/reconnects;
    4. After confirmation, move T’s rows from A to C (same DB: UPDATE ... SET cluster_id='C' WHERE tenant_id=T; cross-DB: ETL).
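
For the same-DB case in step 4, doing the flip in a single transaction keeps tenant T from straddling two clusters mid-move. A minimal sketch, assuming each sharded table carries a tenant_id column:

import (
    "context"
    "database/sql"
    "fmt"
)

// MoveTenant flips every row of one tenant from cluster `from` to cluster `to`.
func MoveTenant(ctx context.Context, db *sql.DB, tenantID, from, to string) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op once Commit succeeds

    // Fixed allowlist of sharded tables; table names are never user input.
    for _, table := range []string{"users", "nodes", "routes", "acl_policies", "preauth_keys", "device_sessions"} {
        q := fmt.Sprintf(`UPDATE %s SET cluster_id = $1 WHERE tenant_id = $2 AND cluster_id = $3`, table)
        if _, err := tx.ExecContext(ctx, q, to, tenantID, from); err != nil {
            return err
        }
    }
    return tx.Commit()
}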

7) Code Drop (Samples)

7.1 Inject clusterID at startup

func main() {
    cfg := loadConfig()
    clusterID := mustGetEnv("CLUSTER_ID")

    db := mustOpenPG(cfg.PGDSN)
    cdb := &ClusterDB{db: db, clusterID: clusterID}

    srv := NewServer(cdb, clusterID)
    srv.Run()
}

7.2 Guard against unscoped queries

// Every handler/usecase must use Server.scoped(), which enforces cluster_id
func (s *Server) scoped() *gorm.DB {
    return s.cdb.Scoped() // internally adds WHERE cluster_id = ?
}

// Panic in CI/tests if someone uses the raw DB without a cluster filter
var rawDBUsed = errors.New("raw DB use forbidden; must use scoped db")

// MustScoped is a coarse guard: it only checks that *some* WHERE clause has been
// attached (as Scoped() does); pair it with code review or a linter for full coverage.
func MustScoped(db *gorm.DB) *gorm.DB {
    if _, ok := db.Statement.Clauses["WHERE"]; !ok {
        panic(rawDBUsed)
    }
    return db
}

7.3 LISTEN/NOTIFY for incremental work

-- Change trigger: tell listeners which (cluster, scope) just became dirty
CREATE OR REPLACE FUNCTION notify_netmap_dirty() RETURNS trigger AS $$
DECLARE
  rec RECORD;
BEGIN
  -- NEW is NULL on DELETE, so fall back to OLD there
  IF TG_OP = 'DELETE' THEN
    rec := OLD;
  ELSE
    rec := NEW;
  END IF;
  PERFORM pg_notify('netmap_dirty',
    json_build_object('cluster_id', rec.cluster_id,
                      'scope_key',  rec.user_id::text)::text);
  RETURN NULL; -- return value is ignored for AFTER row triggers
END; $$ LANGUAGE plpgsql;

CREATE TRIGGER trg_routes_dirty
AFTER INSERT OR UPDATE OR DELETE ON routes
FOR EACH ROW EXECUTE FUNCTION notify_netmap_dirty();
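
On the Go side, one way to consume these notifications is lib/pq's listener (pgx's WaitForNotification would also work); the payload fields match the json_build_object above, and onDirty is the handler from section 4.2.2:

import (
    "encoding/json"
    "log"
    "time"

    "github.com/lib/pq"
)

type dirtyPayload struct {
    ClusterID string `json:"cluster_id"`
    ScopeKey  string `json:"scope_key"`
}

// listenDirty feeds netmap_dirty notifications for this cluster into the debouncer.
func (s *Server) listenDirty(dsn string) error {
    l := pq.NewListener(dsn, 10*time.Second, time.Minute, nil)
    if err := l.Listen("netmap_dirty"); err != nil {
        return err
    }
    go func() {
        for n := range l.Notify {
            if n == nil { // nil signals a reconnect; state will resync on next event
                continue
            }
            var p dirtyPayload
            if err := json.Unmarshal([]byte(n.Extra), &p); err != nil {
                log.Printf("bad netmap_dirty payload: %v", err)
                continue
            }
            s.onDirty(DirtyEvent{ClusterID: p.ClusterID, ScopeKey: p.ScopeKey})
        }
    }()
    return nil
}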

8) Security & Isolation

  • Tenant boundaries: dual filters—cluster_id + tenant_id. Audit sensitive ops (ACL updates, key lifecycle).
  • Cross-tenant access: ACL interpreter forbids it by default; allow only via explicit “tenant-to-tenant allowlist.”
  • Key material: encrypt at rest in PG (KMS/HSM). Headscale nodes do not persist secrets locally.
  • Least privilege: ACS can only read/write tenant↔cluster mappings; no direct access to device sessions.

9) Aligning with a Tailscale-like SaaS

This design offers the SaaS core: multi-node control plane + shared state store + admission front-end.

  • ACS orchestrates new tenant onboarding;
  • Paid tiers map to different cluster strategies (shared vs. dedicated);
  • Failure domains and capacity scale by node;
  • Centralized monitoring, billing, and compliance.

10) Closing Thoughts

Turning headscale into a sharded cluster isn’t just “add more boxes.” The keys are:

  • Keep the data plane simple, make the control plane incremental;
  • Each node owns its shard; Postgres is shared but logically isolated;
  • The Admission/Control Service handles tenant orchestration and load placement;
  • Incremental netmap pulls CPU out of the “full-rebuild hell.”

With this in place, both single-tenant and multi-tenant deployments can achieve near-linear scaling—SaaS-like in practice. If you’re modifying source and aiming for production, share your load curves and topology—especially netmap hit rates and CPU deltas. Those usually decide whether you can push toward six-figure device counts.