
Arvandor HCI

A self-hosted hyper-converged infrastructure platform built on HashiCorp (Consul, Vault, Nomad) and Ory (Kratos, Hydra, Keto, Oathkeeper) stacks, unified by a Nebula mesh overlay network.

This is a sanitized version of the production IaC that runs across three geographic sites. All IPs, domains, and credentials have been replaced with examples.

Internet
    │
    ├── .10 Web ──→ Caddy → Oathkeeper → Apps (via Consul DNS)
    ├── .11 Nebula ──→ Lighthouse (mesh entry point)
    ├── .12 Mail ──→ Stalwart (SMTP/IMAP)
    ├── .13 Boundary ──→ Kratos OIDC → Vault SSH certs → Nodes
    └── .14 Games ──→ Minecraft / Palworld / Game Streaming
                │
    ┌───────────┴────────────────────────────────────────────┐
    │                  Arvandor Mesh                         │
    │             (Nebula 10.100.0.0/16)                     │
    │                                                        │
    │  Site A ──── server-01..09 (Proxmox VMs, cattle)       │
    │  Site B ──── home-01 (bare metal, control replica)     │
    │  Site C ──── remote-01 (cloud VPS, DR)                 │
    └────────────────────────────────────────────────────────┘

Architecture

Two-Tier Node Model

Every node boots from the same base image and runs the same platform stack (Nebula, Consul client, Vault agent, Garage, Nomad client). On top of that common base, nodes split into two tiers:

  • Backbone nodes (5, odd quorum) — additionally run Consul server, Vault, Nomad server, and Patroni. Span all three sites for fault tolerance.
  • Platform nodes — compute capacity. Services are scheduled by Nomad based on zone constraints rather than being hardcoded to specific nodes (see the sketch below).
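
A rough sketch of how the tier split surfaces in configuration: each node advertises placement metadata in its Nomad client stanza, and jobs constrain on that metadata instead of node names. The exact keys (zone, site) are assumptions here; the nomad-client role templates the real values.

```hcl
# /etc/nomad.d/client.hcl (illustrative; templated by Ansible in practice)
client {
  enabled = true

  # Placement metadata; jobs constrain on ${meta.zone} instead of node names
  meta {
    zone = "workload"
    site = "primary"
  }
}
```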

Trust Zone Isolation

Proxmox bridges provide L2 isolation per security domain. No VLANs needed — each bridge is an isolated Linux bridge with no physical uplink:

| Bridge | Zone           | Purpose                                    | Nodes             |
|--------|----------------|--------------------------------------------|-------------------|
| vmbr1  | infrastructure | Control plane, monitoring                  | Backbone, monitor |
| vmbr2  | dmz            | Public-facing traffic (DNAT)               | Edge              |
| vmbr3  | workload       | Tenant applications                        | Workload          |
| vmbr4  | untrusted      | Game servers, GPU                          | Games, AI         |
| vmbr5  | transport      | Local Nebula peering (encrypted UDP only)  | All VMs           |

A compromised game server cannot ARP the control plane. The transport bridge (vmbr5) provides sub-millisecond local Nebula peering without exposing trust zone boundaries.
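
For illustration, attaching a VM to its trust zone bridge plus the shared transport bridge might look like this with the bpg/proxmox Terraform provider. The provider choice and resource shape are assumptions; the repo's modules/node encapsulates the real definition.

```hcl
resource "proxmox_virtual_environment_vm" "workload_vm" {
  vm_id     = 301
  node_name = "pve"

  # Trust zone NIC: isolated Linux bridge, no physical uplink
  network_device {
    bridge = "vmbr3" # workload zone
  }

  # Transport NIC: carries only encrypted Nebula UDP
  network_device {
    bridge = "vmbr5"
  }
}
```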

Deterministic IP Derivation

All IPs are computed from VMID — no manual assignment, no spreadsheets:

  • Bridge: 192.168.{zone}.{vmid}
  • Transport: 192.168.5.{vmid}
  • Nebula: 10.100.{vmid/100}.{vmid%100}

VMID 101 (infrastructure zone) → bridge 192.168.1.101, nebula 10.100.1.1.
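
In Terraform terms, the derivation is a handful of locals. This is a sketch; the real logic lives in modules/node and variable names may differ.

```hcl
locals {
  vmid = 101
  zone = 1 # infrastructure

  bridge_ip    = "192.168.${local.zone}.${local.vmid}" # 192.168.1.101
  transport_ip = "192.168.5.${local.vmid}"             # 192.168.5.101

  # 10.100.{vmid/100}.{vmid%100} -> 10.100.1.1
  nebula_ip = "10.100.${floor(local.vmid / 100)}.${local.vmid % 100}"
}
```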

Multi-Site Mesh

Three geographic sites connected by Nebula, an encrypted overlay network built on the Noise protocol:

| Site    | Location                   | Role                                         |
|---------|----------------------------|----------------------------------------------|
| Primary | Dedicated server (Proxmox) | 9 VMs, full production                       |
| Home    | Bare metal                 | Control replica (quorum participant)         |
| Remote  | Cloud VPS                  | DR (quorum participant, async Patroni replica) |

The five-node odd quorum (3 primary + 1 home + 1 remote) tolerates the loss of any two voters: the cluster keeps quorum through the failure of the home or remote site, and if the primary site goes down, the two surviving voters lose quorum but preserve intact replicas for recovery.
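
On the Consul side, the quorum is simply five servers expecting each other. A sketch with example mesh addresses; the consul-server role renders the real file.

```hcl
# Backbone node Consul agent config (addresses are examples)
server           = true
bootstrap_expect = 5
retry_join = [
  "10.100.1.1", "10.100.1.2", "10.100.1.3", # primary site
  "10.100.2.1",                             # home
  "10.100.3.1",                             # remote
]
```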

Deployment Pipeline

Terraform (provision)  →  Ansible (infrastructure)  →  Nomad (workloads)
       │                         │                          │
  Proxmox VMs              Consul, Vault,            Apps, monitoring,
  Linode VPS              Nomad, Patroni,            games, AI inference
                          Garage, Nebula
  • Terraform provisions Proxmox VMs and cloud instances. Outputs feed Ansible inventory generation (output shape sketched after this list).
  • Ansible converges infrastructure: mesh networking, service discovery, secrets management, HA database, object storage. Single entrypoint (site.yml) with role-based targeting.
  • Nomad schedules everything else. If a service doesn't need to exist before Nomad can function, it's a Nomad job.
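
The handoff between the first two stages is plain data: Terraform emits node facts as JSON and generate.sh turns them into hosts.ini. The output might be shaped roughly like this; the attribute names are assumptions.

```hcl
# terraform/outputs.tf (hypothetical shape)
output "nodes" {
  description = "Node facts consumed by ansible/inventory/generate.sh"
  value = {
    for name, node in module.node : name => {
      nebula_ip = node.nebula_ip
      zone      = node.zone
      backbone  = node.backbone
    }
  }
}
```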

Platform Services

| Service           | Tool                        | Purpose                                                |
|-------------------|-----------------------------|--------------------------------------------------------|
| Service Discovery | Consul                      | DNS-based discovery (.service.consul), health checking |
| Secrets           | Vault                       | Dynamic credentials, PKI, SSH CA, Nomad Workload Identity |
| Scheduling        | Nomad                       | Workload orchestration with zone-aware placement       |
| Database          | Patroni + PostgreSQL        | HA with automatic failover, Consul DCS                 |
| Object Storage    | Garage                      | S3-compatible, RF3 across backbone nodes               |
| Cache             | Valkey                      | Redis-compatible with Sentinel HA                      |
| Identity          | Ory Kratos                  | Self-service auth, passwordless, social login          |
| OAuth2            | Ory Hydra                   | OpenID Connect provider for all tenant apps            |
| Authorization     | Ory Keto                    | Zanzibar-style permission checks                       |
| API Gateway       | Ory Oathkeeper              | Zero-trust reverse proxy with JWT mutation             |
| Ingress           | Caddy                       | TLS termination, automatic ACME, Consul-aware routing  |
| Git Hosting       | Forgejo                     | Self-hosted Git with CI/CD                             |
| Mail              | Stalwart                    | SMTP/IMAP with DKIM, SPF, DMARC                        |
| Secure Access     | Boundary                    | SSH brokering via Kratos OIDC + Vault SSH CA           |
| Monitoring        | Prometheus + Grafana + Loki | Metrics, dashboards, log aggregation                   |
| AI Inference      | Ollama                      | GPU-accelerated LLM inference (RTX 4090)               |
| Game Streaming    | Sunshine + Moonlight        | Low-latency game streaming over mesh                   |
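
Service discovery is the glue between these: anything registered with Consul resolves over DNS. In this cluster Nomad normally registers services from the job spec, but a standalone Consul service definition with the same effect looks roughly like this (name, port, and health endpoint are placeholders):

```hcl
service {
  name = "myapp" # resolvable at myapp.service.consul
  port = 8080

  check {
    http     = "http://localhost:8080/healthz"
    interval = "10s"
    timeout  = "2s"
  }
}
```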

Multi-Tenancy

All platform resources are tracked in a declarative tenant registry (tenants/registry.toml). Each tenant declares its Consul services, Vault secrets, databases, S3 buckets, OAuth2 clients, gateway routes, and Boundary access.

A tenant provisioning service (Gatekeeper) reconciles the registry against 7 providers hourly, provisioning and deprovisioning resources automatically.

Vault database role pattern — each tenant gets three dynamic credential roles:

  • {tenant}-app (runtime, DML only)
  • {tenant}-migrate (prestart migrations, DDL+DML)
  • {tenant}-owner (developer access via Boundary)
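
For a feel of the pattern, here is roughly what the app role looks like expressed with the Terraform Vault provider. The tenant name, statements, and TTLs are placeholders; Gatekeeper provisions the real roles.

```hcl
resource "vault_database_secret_backend_role" "app" {
  backend = "database"
  name    = "myapp-app" # {tenant}-app
  db_name = "postgres"

  # DML only: runtime credentials cannot alter schema
  creation_statements = [
    "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
    "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";",
  ]

  default_ttl = 3600 # 1h leases for runtime credentials
  max_ttl     = 86400
}
```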

GPU Passthrough

One node has an RTX 4090 passed through via PCIe (q35 machine type, IOMMU). Used for:

  • AI inference: Ollama with quantized models (fits in 24GB VRAM)
  • Game streaming: Sunshine (native Linux, Proton for Windows games) → Moonlight clients

GPU access is sequential — nomad job stop ollama && nomad job run sunshine to switch. Both are Nomad raw_exec jobs constrained to the GPU node via meta.gpu = true.
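
The skeleton of either GPU job is small. A sketch, with the datacenter name and binary path as placeholders:

```hcl
job "ollama" {
  datacenters = ["dc1"]

  # Pin to the single RTX 4090 node
  constraint {
    attribute = "${meta.gpu}"
    value     = "true"
  }

  group "ollama" {
    task "serve" {
      driver = "raw_exec"

      config {
        command = "/usr/local/bin/ollama"
        args    = ["serve"]
      }
    }
  }
}
```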

Directory Structure

arvandor-hci/
├── terraform/                     # Node provisioning (Proxmox VMs, Linode)
│   ├── providers.tf               # Proxmox + S3 backend
│   ├── cluster.tf                 # Node definitions (single source of truth)
│   ├── vars.tf / outputs.tf
│   ├── terraform.tfvars.example
│   ├── modules/node/              # Reusable HCI node module
│   └── linode/                    # Remote site provisioning
│
├── ansible/                       # Configuration management
│   ├── inventory/
│   │   ├── generate.sh            # Terraform output → hosts.ini
│   │   └── hosts.ini.example
│   ├── playbooks/
│   │   ├── site.yml               # Single entrypoint (all roles)
│   │   ├── bootstrap.yml          # Phase 0: pre-Nebula onboarding
│   │   ├── host.yml               # Proxmox host bootstrap
│   │   └── ...
│   ├── roles/                     # 1:1 with infrastructure services
│   │   ├── base/                  # Every node: nebula, consul, vault-agent, firewall
│   │   ├── consul-server/         # Backbone: Consul server cluster
│   │   ├── vault/                 # Backbone: Vault HA Raft
│   │   ├── nomad-server/          # Backbone: Nomad server
│   │   ├── nomad-client/          # Platform: Nomad client with zone metadata
│   │   ├── patroni/               # Backbone: PostgreSQL HA
│   │   ├── garage/                # Backbone: S3 storage (RF3)
│   │   ├── valkey/                # Cache + Sentinel HA
│   │   ├── gateway/               # Ory stack (Kratos, Hydra, Keto, Oathkeeper)
│   │   ├── boundary/              # SSH access brokering
│   │   ├── forgejo/               # Git hosting
│   │   ├── stalwart/              # Mail server
│   │   └── nvidia/                # GPU passthrough + Sunshine
│   ├── docs/boot-chain.md         # Service dependency layers
│   └── vault/README.md            # Secrets setup guide
│
├── nomad/                         # Workload scheduling
│   └── jobs/
│       ├── example.nomad.hcl      # Reference job with all patterns
│       ├── edge/                  # Caddy, Gatekeeper
│       ├── monitor/               # Prometheus, Grafana, Loki
│       ├── workloads/             # Tenant applications
│       ├── personal/              # Ollama, Sunshine, ComfyUI
│       └── games/                 # Palworld
│
├── nebula/                        # Mesh overlay (docs only, no keys)
│   └── README.md
│
├── network/                       # IP schema + edge routing
│   ├── ip-schema.example
│   └── edge-routing.sh.example
│
└── tenants/                       # Multi-tenancy registry
    └── registry.toml.example

Planned / In Progress

  • Packer node abstraction — atomic node images built with Packer, replacing per-node Ansible convergence for the base layer
  • Seamless tenant enrollment — self-service tenant onboarding via Gatekeeper API (currently semi-manual)

Getting Started

  1. Provision nodes: Copy terraform/terraform.tfvars.example to terraform.tfvars, fill in credentials, run terraform apply
  2. Generate inventory: cd terraform && terraform output -json | ../ansible/inventory/generate.sh > ../ansible/inventory/hosts.ini
  3. Bootstrap: ansible-playbook playbooks/bootstrap.yml --limit server-01 -e @vars/onboarding.yml
  4. Converge: ansible-playbook playbooks/site.yml
  5. Deploy workloads: nomad job run nomad/jobs/workloads/myapp.nomad.hcl

See ansible/docs/boot-chain.md for the full service dependency graph and convergence order.

  • Gatekeeper — Tenant resource provisioning across 7 providers (Vault, Hydra, Keto, Boundary, PostgreSQL, Garage, Valkey)
  • Nebula — Mesh overlay network
  • Ory — Identity and access management
  • HashiCorp — Infrastructure runtime (Consul, Vault, Nomad)