# How I Maintain a Multi-Site Proxmox Homelab Without Losing My Mind
I run five Proxmox nodes spread across two physical sites connected by a WireGuard site-to-site VPN. One site is a cluster. Between the two sites, I manage around 50 virtual machines and containers, two Proxmox Backup Servers, two TrueNAS instances, GPU passthrough for local AI workloads, and Veeam Agent backups for Windows workstations.
It sounds like a lot. It is a lot. But after a recent multi-day deep dive into every node, every ZFS pool, every backup job, and every SMART report, I built a maintenance workflow that keeps things healthy in about 15-20 minutes per week.
This post covers what I learned, what broke, and the weekly checklist and scripts I now use to prevent it from breaking again.
## The Environment

### MAC Site

- pve5 – RTX 3060 12GB, runs Ollama/OpenWebUI for local AI inference, hosts TrueNAS as a VM, Ryzen 7 2700 with 32 GB RAM
- pve6 – GTX 1070, AI workloads, hosts PBS6 (Proxmox Backup Server) as an LXC container with 4x 2.4 TB SAS drives in RAIDZ on TrueNAS Scale for a total of 9.4 TB of usable space, i7-8700 with 32 GB RAM
### GW Site

- Cluster
  - pve2 – RTX 4070, AI workloads, hosts TrueNAS as a VM with 8x HGST 8TB drives in RAIDZ2 on TrueNAS Scale for a total of 42 TB of usable space, Ryzen 9 5900X with 128 GB RAM
  - pve3 – Hosts GW PBS (Proxmox Backup Server) as a VM with 3x 2TB drives in RAIDZ for a total of 4 TB, Xeon E3-1240 with 32 GB RAM
  - pve4 – Dedicated to pfSense, Intel Celeron N5105 with 32 GB RAM
## Backup Architecture

- MAC site VMs back up nightly to PBS6 (LXC on pve6)
- GW site VMs back up nightly to GW PBS (VM on pve3)
- Windows workstations at both sites run Veeam Agent Community Edition; these back up to their local TrueNAS
## What Went Wrong (And How I Found It)
I started by reviewing nightly backup logs and quickly discovered failures across multiple nodes. That led to a full infrastructure audit that uncovered a chain of interconnected issues.
### ZFS Data Corruption on NVMe
Two containers on pve5’s M.2-ZFS pool were failing backups with I/O errors. The ZFS scrub came back clean, SMART showed zero media errors, and memtest passed. The corruption was occurring on different files in different containers on consecutive nights.
The root cause turned out to be ZFS ARC memory pressure. The ARC (Adaptive Replacement Cache) was uncapped and consuming over 5 GiB on a 32 GiB system already running 25 guests. Under heavy swap conditions, ZFS was occasionally writing corrupted data to the pool.
The fix: Cap the ARC to a reasonable size relative to total RAM.
```bash
# Immediate effect
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max

# Persistent across reboots
echo "options zfs zfs_arc_max=2147483648" > /etc/modprobe.d/zfs.conf
update-initramfs -u
```
On a 32 GiB system, 2 GiB of ARC is sufficient for VM workloads. On pve2 with 128 GiB RAM, I set it to 8 GiB. The results were dramatic:
| Node | ARC Before | ARC After | RAM Available Before | RAM Available After |
|---|---|---|---|---|
| pve5 | 5.4 GiB | 1.5 GiB | 3 GiB | 17 GiB |
| pve3 | 15.6 GiB | 1.3 GiB | 6 GiB | 21 GiB |
| pve2 | 64 GiB | 7 GiB | ~30 GiB | 83 GiB |
After capping the ARC, the recurring ZFS corruption stopped.
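To confirm the cap is actually holding, you can read the live ARC size straight from the kernel. This is a small sketch using the standard OpenZFS `arcstats` interface; the "size" field is the current ARC size in bytes:

```shell
# Sketch: verify the ARC cap took effect. "size" in arcstats is the
# current ARC size in bytes; zfs_arc_max is the configured ceiling
# (0 means uncapped).
awk '$1 == "size" {printf "ARC size: %.1f GiB\n", $3 / 2^30}' /proc/spl/kstat/zfs/arcstats
cat /sys/module/zfs/parameters/zfs_arc_max
```

If the reported size sits well above `zfs_arc_max` long after boot, the modprobe option did not load and the initramfs likely needs regenerating.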
### LVM Thin Pool Overprovisioning
Two nodes had LVM thin pools where the sum of provisioned thin volumes exceeded the physical pool capacity. This was not causing data loss, but it triggered warnings during backup snapshot creation and represented a ticking time bomb.
The fix involved migrating VM disks to ZFS storage, extending thin pools where possible, and removing stale snapshots. On pve2, I converted an entire ext4 NVMe partition to ZFS, moved all VM disks onto it, and emptied the thin pool completely.
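Spotting this condition is a one-liner. The sketch below uses stock lvm2 reporting fields (`lv_attr` of a thin pool starts with "t", `data_percent` is how full the pool's data space is); the 80% threshold is illustrative, not a hard rule:

```shell
# Sketch: flag thin pools running hot before snapshot creation starts
# warning about them. A full thin pool errors out writes on every
# volume it backs at once.
lvs --noheadings -o lv_name,data_percent --select 'lv_attr=~"^t"' \
  | awk '$2 + 0 > 80 {print "WARNING: thin pool " $1 " at " $2 "%"}'
```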
### TrueNAS Snapshot Explosion
The GW TrueNAS had accumulated 67,659 ZFS snapshots. Three periodic snapshot tasks were running, two of them hourly, each creating roughly 90 snapshots per run (one per dataset). The retention engine was not pruning old snapshots.
The fix was to delete two of the three tasks (keeping only a single daily recursive task with 4-week retention), then batch-delete old snapshots via cron jobs on the TrueNAS shell. The steady-state target is approximately 2,520 snapshots (90 datasets times 28 days).
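A sketch of the cleanup approach, assuming a shell on the TrueNAS box: first rank datasets by snapshot count to find the runaway tasks, then batch-destroy the old ones. The `@auto-` pattern is a placeholder for your own task's naming scheme, and the `echo` keeps the second command a dry run until you remove it:

```shell
# Which datasets are accumulating snapshots?
zfs list -H -t snapshot -o name \
  | awk -F@ '{n[$1]++} END {for (d in n) print n[d], d}' | sort -rn | head

# Dry run: print the destroy commands for the 500 oldest matches.
zfs list -H -t snapshot -o name -s creation | grep '@auto-' \
  | head -n 500 | xargs -r -n1 echo zfs destroy
```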
### Failing TrueNAS Drives
The health check revealed three HGST 8TB drives in the GW TrueNAS RAIDZ2 pool showing degradation:
- One drive declared SMART FAILED
- One with 14 reallocated sectors
- One with 8 pending sectors
All eight drives are the same model with approximately 78,000 power-on hours (9 years). The pool is RAIDZ2, so it tolerates two simultaneous failures. Replacement drives (Seagate IronWolf 8TB) are on order.
TrueNAS showed the pool as “ONLINE” with zero errors because ZFS reports pool-level health, not individual drive SMART status. This is why external SMART monitoring is essential.
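A minimal version of that external check, using standard ATA SMART attribute names as reported by smartctl (drive paths will differ on your hardware):

```shell
# Sketch: per-drive SMART signals that pool-level "ONLINE" hides.
# Reallocated and pending sector counts are the early-failure
# indicators that flagged the three degrading HGST drives.
for d in /dev/sd?; do
  echo "== $d =="
  smartctl -H -A "$d" \
    | grep -E 'overall-health|Reallocated_Sector_Ct|Current_Pending_Sector'
done
```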
### ZFS Userspace/Kernel Module Mismatch

After routine package updates, the ZFS userspace tools were upgraded to version 2.4.0 while the running kernel still had the 2.2.x module loaded. This prevented ZFS scrubs from running with the error:

```
cannot scrub: the loaded zfs module does not support an option for this operation
```
The fix was to downgrade the userspace tools to match the kernel module and hold the packages to prevent re-upgrade:
```bash
apt install zfsutils-linux=2.3.4-pve1 zfs-zed=2.3.4-pve1 zfs-initramfs=2.3.4-pve1
apt-mark hold zfsutils-linux zfs-zed zfs-initramfs
```
This was necessary on pve2 and pve3 because both nodes have GPU passthrough with manually installed NVIDIA drivers. Upgrading to a kernel that ships the matching ZFS 2.4.0 module would require rebuilding the NVIDIA DKMS modules, which do not compile against the newer kernel's API changes.
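Detecting this drift early is cheap. On OpenZFS 2.x, `zfs version` prints both the userspace and kernel-module versions, so a sketch like this can warn before any pool operation:

```shell
# Sketch: compare ZFS userspace and kernel-module versions. They should
# report the same release line; a mismatch means scrubs and newer pool
# features may fail.
us=$(zfs version | sed -n 's/^zfs-\([0-9].*\)/\1/p')
km=$(zfs version | sed -n 's/^zfs-kmod-\(.*\)/\1/p')
[ "$us" = "$km" ] || echo "MISMATCH: userspace $us vs kernel module $km"
```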
## The Health Check Script
I run a bash script on each node that checks filesystem utilization, LVM thin pool status, ZFS pool health, drive SMART data, backup status, system resources, Proxmox services, pending updates, dmesg errors, and EFI certificate expiry.
The script outputs a report with OK, WARNING, and CRITICAL indicators. Running it takes about 20 seconds per node.
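To give a feel for the structure, here is a sketch of one such check (filesystem utilization) in the same OK/WARNING/CRITICAL style; the 80/90 thresholds are illustrative, not the script's actual values:

```shell
# Sketch: classify mounted filesystems by utilization. -P forces POSIX
# single-line output so awk fields are stable; tmpfs/devtmpfs are noise.
df -P -x tmpfs -x devtmpfs | awk 'NR > 1 {
  use = $5 + 0
  lvl = use >= 90 ? "CRITICAL" : use >= 80 ? "WARNING" : "OK"
  printf "%-8s %3d%%  %s\n", lvl, use, $6
}'
```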
You can download it here: pve-health-check.sh
I also have a quick memory monitor script for pve that shows RAM/swap, ZFS ARC usage, top swap consumers, and running VM/CT memory allocations:
Download: pve-memcheck.sh
## Config Audit

This script checks the configuration and produces a report with recommendations. Run it once, and again every 6 to 12 months if you do hardware upgrades like I do.

You can download it here: pve-config-audit.sh
## The Weekly Maintenance Routine
The goal is 15-20 minutes every Sunday. Catching problems early prevents multi-day emergencies.
### Every Week (15-20 minutes)
Review backup logs (5 min): Check the last backup run on every node in the PVE web UI. Verify both PBS servers show healthy datastore usage. A single nightly failure is worth investigating. Two consecutive failures on the same VM is a red flag.
Run health check scripts (4 min): SSH into each node and run the script. Scan for CRITICAL and WARNING items. Address criticals immediately and log warnings for the monthly maintenance window.
Check storage health (4 min): Run zpool status on every ZFS pool across all nodes and both TrueNAS instances. Look for non-zero read/write/checksum error counts. Check SMART on the known-degrading GW TrueNAS drives.
Check memory (2 min): Verify ARC caps are holding, swap is under 30%, and no node is above 85% RAM. The memcheck script on pve5 makes this quick.
Check TrueNAS snapshots (1 min): Run zfs list -t snapshot | wc -l on the GW TrueNAS. The count should be around 2,520 and stable. If it is growing, the snapshot task configuration has drifted.
Review updates (1 min): Check for pending updates in the PVE UI. Note any kernel updates but do not apply them without verifying DKMS compatibility on GPU nodes.
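Most of the per-node checks above can be driven from one shell. A sketch, using this lab's hostnames (substitute your own) and assuming key-based SSH to root with pve-health-check.sh in root's home on each node:

```shell
# Sketch: weekly sweep across all nodes from one terminal.
for node in pve2 pve3 pve4 pve5 pve6; do
  echo "===== $node ====="
  ssh "root@$node" '
    zpool status -x                      # healthy pools print one line
    free -h | awk "/^Mem:|^Swap:/"       # RAM and swap pressure
    ./pve-health-check.sh | grep -E "CRITICAL|WARNING" || echo "no findings"
  '
done
```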
### Every Month
- Run ZFS scrubs on all pools across all nodes and both TrueNAS instances
- Check scrub results the next day (zero errors expected)
- Review SMART data trends, especially on older drives
- Apply non-kernel PVE updates
- Check PBS garbage collection status and datastore utilization
- Review LVM thin pool usage trends
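The monthly scrub pass is short enough to script on each node. A sketch:

```shell
# Sketch: start a scrub on every imported pool; results are checked the
# next day. `zpool list -H -o name` enumerates pools without headers.
for pool in $(zpool list -H -o name); do
  zpool scrub "$pool"
done

# The next day, the "scan:" lines show repaired bytes and any errors:
zpool status | grep -A 2 'scan:'
```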
### Every Quarter
- Reboot nodes with extended uptimes to apply kernel updates (verify DKMS first on GPU nodes)
- Test a PBS restore (pick a non-critical VM, restore to a temp ID, verify it boots, delete it)
- Test a Veeam restore (restore a single file from a workstation backup)
- Audit running VMs and stop anything that does not need to run
- Review drive SMART trends compared to the previous quarter
- Check EFI certificate expiry dates
## Key Lessons Learned
Cap your ZFS ARC. On systems running VMs, an uncapped ARC will consume all available RAM, pushing guest memory into swap and potentially causing data corruption under extreme pressure. Set zfs_arc_max in /etc/modprobe.d/zfs.conf on every node with ZFS pools.
SMART monitoring is not optional. ZFS will report a pool as healthy even when individual drives are failing. External SMART checks catch degradation before ZFS notices, and the health check script includes them.
Snapshot tasks need auditing. A recursive snapshot task on a pool with 90 datasets creates 90 snapshots per run. If retention is broken or misconfigured, you can accumulate tens of thousands of snapshots in weeks, degrading pool performance and complicating management.
Hold ZFS packages on pinned kernels. If you cannot upgrade the kernel (due to NVIDIA DKMS or other constraints), hold the ZFS userspace packages to prevent them from upgrading beyond what the running kernel module supports.
Test your backups. A backup that has never been restored is a hypothesis, not a backup. Quarterly restore tests take 20 minutes and confirm the entire chain works.
15 minutes of prevention beats 15 hours of emergency repair. Every issue I found during this audit could have been caught weeks earlier with a simple weekly check. The health check script and weekly routine exist specifically to make that happen.
## Downloads
- pve-health-check.sh – Proxmox VE health check script
- pve-memcheck.sh – Quick memory and ZFS ARC monitor
- pve-config-audit.sh – Configuration audit script
- weekly-maintenance-checklist.docx – Printable weekly/monthly/quarterly checklist
## Final Thoughts
Running a multi-site Proxmox homelab is rewarding but demands respect. The infrastructure does not maintain itself. A disciplined weekly routine, good scripts, and a willingness to dig into logs before they become emergencies are what separate a homelab that hums along from one that wakes you up at 2 AM.
If you are running Proxmox at home or for your small business, take 20 minutes this weekend to run a health check. You might be surprised what you find.