# Arvandor HCI
A self-hosted hyper-converged infrastructure platform built on HashiCorp (Consul, Vault, Nomad) and Ory (Kratos, Hydra, Keto, Oathkeeper) stacks, unified by a Nebula mesh overlay network.
This is a sanitized version of the production IaC that runs across three geographic sites. All IPs, domains, and credentials have been replaced with examples.
```
Internet
    │
    ├── .10  Web       ──→  Caddy → Oathkeeper → Apps (via Consul DNS)
    ├── .11  Nebula    ──→  Lighthouse (mesh entry point)
    ├── .12  Mail      ──→  Stalwart (SMTP/IMAP)
    ├── .13  Boundary  ──→  Kratos OIDC → Vault SSH certs → Nodes
    └── .14  Games     ──→  Minecraft / Palworld / Game Streaming
            │
┌───────────┴────────────────────────────────────────────────┐
│                       Arvandor Mesh                         │
│                   (Nebula 10.100.0.0/16)                    │
│                                                              │
│  Site A ──── server-01..09  (Proxmox VMs, cattle)            │
│  Site B ──── home-01        (bare metal, control replica)    │
│  Site C ──── remote-01      (cloud VPS, DR)                  │
└──────────────────────────────────────────────────────────────┘
```
## Architecture

### Two-Tier Node Model
Every node boots from the same base image and runs the same platform stack (Nebula, Consul client, Vault agent, Garage, Nomad client). Two tiers separate consensus from compute:
- Backbone nodes (5, odd quorum) — additionally run Consul server, Vault, Nomad server, and Patroni. Span all three sites for fault tolerance.
- Platform nodes — compute capacity. Services are scheduled by Nomad based on zone constraints, not hardcoded to specific nodes.
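For illustration, zone-aware placement relies on Nomad node metadata. The stanza below is a minimal sketch of a client config, assuming keys like `zone` and `gpu`; the actual keys set by the `nomad-client` role may differ.

```hcl
# Hypothetical excerpt of a Nomad client agent config (nomad-client role).
# Jobs can then constrain on ${meta.zone} instead of targeting node names.
client {
  enabled = true

  meta {
    zone = "workload"   # trust zone this node lives in
    site = "primary"    # geographic site, useful for spread/affinity
    gpu  = "false"      # flipped to "true" on the GPU node
  }
}
```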
### Trust Zone Isolation
Proxmox bridges provide L2 isolation per security domain. No VLANs needed — each bridge is an isolated Linux bridge with no physical uplink:
| Bridge | Zone | Purpose | Nodes |
|---|---|---|---|
| vmbr1 | infrastructure | Control plane, monitoring | Backbone, monitor |
| vmbr2 | dmz | Public-facing traffic (DNAT) | Edge |
| vmbr3 | workload | Tenant applications | Workload |
| vmbr4 | untrusted | Game servers, GPU | Games, AI |
| vmbr5 | transport | Local Nebula peering (encrypted UDP only) | All VMs |
A compromised game server cannot ARP the control plane. The transport bridge (vmbr5) provides sub-millisecond local Nebula peering without exposing trust zone boundaries.
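To make the wiring concrete, a node definition could select its trust-zone bridge alongside the shared transport bridge roughly as below; the module interface shown (variable names, values) is a guess based on the directory layout, not the real inputs of `modules/node/`.

```hcl
# Hypothetical call into modules/node/ -- variable names are illustrative.
module "games_01" {
  source = "./modules/node"

  vmid             = 401
  hostname         = "games-01"
  zone_bridge      = "vmbr4"   # untrusted zone (game servers, GPU)
  transport_bridge = "vmbr5"   # encrypted Nebula peering only
}
```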
### Deterministic IP Derivation
All IPs are computed from VMID — no manual assignment, no spreadsheets:
- Bridge: `192.168.{zone}.{vmid}`
- Transport: `192.168.5.{vmid}`
- Nebula: `10.100.{vmid/100}.{vmid%100}`
VMID 101 (infrastructure zone) → bridge 192.168.1.101, nebula 10.100.1.1.
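Expressed as Terraform locals, the derivation might look like this sketch; the local names are illustrative, not the repo's actual code.

```hcl
# Illustrative VMID -> IP derivation for a single node (names are assumptions).
locals {
  vmid = 101
  zone = 1   # infrastructure

  bridge_ip    = "192.168.${local.zone}.${local.vmid}"                    # 192.168.1.101
  transport_ip = "192.168.5.${local.vmid}"                                # 192.168.5.101
  nebula_ip    = "10.100.${floor(local.vmid / 100)}.${local.vmid % 100}"  # 10.100.1.1
}
```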
### Multi-Site Mesh
Three geographic sites connected by Nebula, a certificate-based encrypted UDP overlay mesh:
| Site | Location | Role |
|---|---|---|
| Primary | Dedicated server (Proxmox) | 9 VMs — full production |
| Home | Bare metal | Control replica (quorum participant) |
| Remote | Cloud VPS | DR (quorum participant, async Patroni replica) |
Five-node odd quorum (3 primary + 1 home + 1 remote) ensures the cluster survives any single site failure.
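On the Consul side, that expectation would typically appear in the server configuration; the snippet below is a sketch under that assumption, with placeholder Nebula addresses.

```hcl
# Sketch of a backbone node's Consul server config (addresses are placeholders).
server           = true
bootstrap_expect = 5                      # 3 primary + 1 home + 1 remote
bind_addr        = "10.100.1.1"           # this node's Nebula address
retry_join       = ["10.100.1.2", "10.100.1.3", "10.100.2.1", "10.100.3.1"]
```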
## Deployment Pipeline
```
Terraform (provision)  →  Ansible (infrastructure)  →  Nomad (workloads)
          │                          │                         │
     Proxmox VMs              Consul, Vault,            Apps, monitoring,
     Linode VPS               Nomad, Patroni,           games, AI inference
                              Garage, Nebula
```
- Terraform provisions Proxmox VMs and cloud instances. Outputs feed Ansible inventory generation (see the sketch after this list).
- Ansible converges infrastructure: mesh networking, service discovery, secrets management, HA database, object storage. Single entrypoint (`site.yml`) with role-based targeting.
- Nomad schedules everything else. If a service doesn't need to exist before Nomad can function, it's a Nomad job.
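The Terraform output consumed by `ansible/inventory/generate.sh` might look roughly like the sketch below; the output name and per-node attributes are assumptions, not the module's documented interface.

```hcl
# Hypothetical terraform/outputs.tf excerpt feeding inventory generation.
output "nodes" {
  value = {
    for name, node in module.nodes : name => {
      vmid      = node.vmid
      zone      = node.zone
      nebula_ip = node.nebula_ip
    }
  }
}
```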
## Platform Services
| Service | Tool | Purpose |
|---|---|---|
| Service Discovery | Consul | DNS-based discovery (.service.consul), health checking |
| Secrets | Vault | Dynamic credentials, PKI, SSH CA, Nomad Workload Identity |
| Scheduling | Nomad | Workload orchestration with zone-aware placement |
| Database | Patroni + PostgreSQL | HA with automatic failover, Consul DCS |
| Object Storage | Garage | S3-compatible, RF3 across backbone nodes |
| Cache | Valkey | Redis-compatible with Sentinel HA |
| Identity | Ory Kratos | Self-service auth, passwordless, social login |
| OAuth2 | Ory Hydra | OpenID Connect provider for all tenant apps |
| Authorization | Ory Keto | Zanzibar-style permission checks |
| API Gateway | Ory Oathkeeper | Zero-trust reverse proxy with JWT mutation |
| Ingress | Caddy | TLS termination, automatic ACME, Consul-aware routing |
| Git Hosting | Forgejo | Self-hosted Git with CI/CD |
| Mail | Stalwart | SMTP/IMAP with DKIM, SPF, DMARC |
| Secure Access | Boundary | SSH brokering via Kratos OIDC + Vault SSH CA |
| Monitoring | Prometheus + Grafana + Loki | Metrics, dashboards, log aggregation |
| AI Inference | Ollama | GPU-accelerated LLM inference (RTX 4090) |
| Game Streaming | Sunshine + Moonlight | Low-latency game streaming over mesh |
## Multi-Tenancy
All platform resources are tracked in a declarative tenant registry (`tenants/registry.toml`). Each tenant declares its Consul services, Vault secrets, databases, S3 buckets, OAuth2 clients, gateway routes, and Boundary access.
A tenant provisioning service (Gatekeeper) reconciles the registry against 7 providers hourly, provisioning and deprovisioning resources automatically.
Vault database role pattern — each tenant gets three dynamic credential roles:
- `{tenant}-app` (runtime, DML only)
- `{tenant}-migrate` (prestart migrations, DDL+DML)
- `{tenant}-owner` (developer access via Boundary)
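A hedged sketch of how one such role could be declared with the Vault Terraform provider follows; the mount path, role name, and SQL statements are illustrative, and the repo may provision these through Gatekeeper or the Vault CLI instead.

```hcl
# Illustrative dynamic-credential role for a tenant's runtime user (DML only).
resource "vault_database_secret_backend_role" "acme_app" {
  backend = "database"                 # assumed mount path
  name    = "acme-app"                 # {tenant}-app
  db_name = "acme"                     # connection configured elsewhere

  creation_statements = [
    "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
    "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";",
  ]

  default_ttl = 3600     # 1 hour leases for app credentials
  max_ttl     = 86400
}
```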
## GPU Passthrough
One node has an RTX 4090 passed through via PCIe (q35 machine type, IOMMU). Used for:
- AI inference: Ollama with quantized models (fits in 24GB VRAM)
- Game streaming: Sunshine (native Linux, Proton for Windows games) → Moonlight clients
GPU access is sequential — `nomad job stop ollama && nomad job run sunshine` to switch. Both are Nomad `raw_exec` jobs constrained to the GPU node via `meta.gpu = true`.
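A minimal sketch of such a job follows, assuming a `dc1` datacenter and an `ollama serve` entrypoint; none of these values are taken from the repo's actual job files.

```hcl
# Hypothetical nomad/jobs/personal/ollama.nomad.hcl (values are assumptions).
job "ollama" {
  datacenters = ["dc1"]
  type        = "service"

  # Pin to the single node that exposes the passed-through RTX 4090.
  constraint {
    attribute = "${meta.gpu}"
    value     = "true"
  }

  group "ollama" {
    task "ollama" {
      driver = "raw_exec"        # runs directly on the host, no isolation

      config {
        command = "/usr/local/bin/ollama"
        args    = ["serve"]
      }

      resources {
        cpu    = 4000
        memory = 16384
      }
    }
  }
}
```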
## Directory Structure
```
arvandor-hci/
├── terraform/                    # Node provisioning (Proxmox VMs, Linode)
│   ├── providers.tf              # Proxmox + S3 backend
│   ├── cluster.tf                # Node definitions (single source of truth)
│   ├── vars.tf / outputs.tf
│   ├── terraform.tfvars.example
│   ├── modules/node/             # Reusable HCI node module
│   └── linode/                   # Remote site provisioning
│
├── ansible/                      # Configuration management
│   ├── inventory/
│   │   ├── generate.sh           # Terraform output → hosts.ini
│   │   └── hosts.ini.example
│   ├── playbooks/
│   │   ├── site.yml              # Single entrypoint (all roles)
│   │   ├── bootstrap.yml         # Phase 0: pre-Nebula onboarding
│   │   ├── host.yml              # Proxmox host bootstrap
│   │   └── ...
│   ├── roles/                    # 1:1 with infrastructure services
│   │   ├── base/                 # Every node: nebula, consul, vault-agent, firewall
│   │   ├── consul-server/        # Backbone: Consul server cluster
│   │   ├── vault/                # Backbone: Vault HA Raft
│   │   ├── nomad-server/         # Backbone: Nomad server
│   │   ├── nomad-client/         # Platform: Nomad client with zone metadata
│   │   ├── patroni/              # Backbone: PostgreSQL HA
│   │   ├── garage/               # Backbone: S3 storage (RF3)
│   │   ├── valkey/               # Cache + Sentinel HA
│   │   ├── gateway/              # Ory stack (Kratos, Hydra, Keto, Oathkeeper)
│   │   ├── boundary/             # SSH access brokering
│   │   ├── forgejo/              # Git hosting
│   │   ├── stalwart/             # Mail server
│   │   └── nvidia/               # GPU passthrough + Sunshine
│   ├── docs/boot-chain.md        # Service dependency layers
│   └── vault/README.md           # Secrets setup guide
│
├── nomad/                        # Workload scheduling
│   └── jobs/
│       ├── example.nomad.hcl     # Reference job with all patterns
│       ├── edge/                 # Caddy, Gatekeeper
│       ├── monitor/              # Prometheus, Grafana, Loki
│       ├── workloads/            # Tenant applications
│       ├── personal/             # Ollama, Sunshine, ComfyUI
│       └── games/                # Palworld
│
├── nebula/                       # Mesh overlay (docs only, no keys)
│   └── README.md
│
├── network/                      # IP schema + edge routing
│   ├── ip-schema.example
│   └── edge-routing.sh.example
│
└── tenants/                      # Multi-tenancy registry
    └── registry.toml.example
```
## Planned / In Progress
- Packer node abstraction — atomic node images built from Packer, replacing per-node Ansible convergence for base layer
- Seamless tenant enrollment — self-service tenant onboarding via Gatekeeper API (currently semi-manual)
## Getting Started

- Provision nodes: Copy `terraform/terraform.tfvars.example` → `terraform.tfvars`, fill in credentials, run `terraform apply`
- Generate inventory: `cd terraform && terraform output -json | ../ansible/inventory/generate.sh > ../ansible/inventory/hosts.ini`
- Bootstrap: `ansible-playbook playbooks/bootstrap.yml --limit server-01 -e @vars/onboarding.yml`
- Converge: `ansible-playbook playbooks/site.yml`
- Deploy workloads: `nomad job run nomad/jobs/workloads/myapp.nomad.hcl`
See `ansible/docs/boot-chain.md` for the full service dependency graph and convergence order.
## Related
- Gatekeeper — Tenant resource provisioning across 7 providers (Vault, Hydra, Keto, Boundary, PostgreSQL, Garage, Valkey)
- Nebula — Mesh overlay network
- Ory — Identity and access management
- HashiCorp — Infrastructure runtime (Consul, Vault, Nomad)