How We Set Up Our KVM Hypervisor: From Bare Metal to Production-Ready VM Host

We recently built a dedicated KVM/libvirt hypervisor for running virtual machines: development environments, test labs, and container hosts. The goal was simple. Take a bare metal server, tune every layer of the stack, and turn it into a production-ready VM host that doesn't waste a single CPU cycle on overhead.

The Hardware

Component   Spec
CPU         AMD EPYC 7351P, 16 cores / 32 threads, 4 NUMA nodes
RAM         125 GiB DDR4 ECC
Network     Intel X520 10GbE (ixgbe driver), single active port + bridge
Storage     2x WDC SN720 NVMe 512 GB (PCIe 3.0 x4)
OS          Debian 12, kernel 6.12.90

Layer 1: Storage (XFS on NVMe)

Mount options

For the VM image filesystem (/var/lib/libvirt):

noatime,allocsize=1m,largeio,inode64,logbufs=8,logbsize=32k,noquota

noatime: Under the kernel's default relatime behavior, reads of a qcow2 file can still trigger periodic atime updates on the host. With noatime, those metadata writes are gone entirely.

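allocsize=1m: qcow2 images grow in small steps (64 KiB clusters by default), which can scatter extents when many images grow side by side. allocsize=1m fixes the end-of-file preallocation for buffered writes at 1 MiB, 256 times the 4 KiB filesystem block size, so a growing image gets fewer, larger extents and fragments less. Setting a fixed allocsize replaces XFS's default dynamic preallocation heuristic.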
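To confirm that the options took effect after a remount, a small check along these lines can help. It is a sketch: the helper names mount_opts and has_opt are ours, not part of any tool, and the kernel may print some values in a normalized form, so compare against what /proc/mounts actually shows.

```shell
# Print the option string for a mount point from a /proc/mounts-style file.
# $1 = mount point, $2 = mounts file (assumed default: /proc/mounts)
mount_opts() {
    awk -v mp="$1" '$2 == mp { print $4 }' "${2:-/proc/mounts}"
}

# Succeed if option $1 appears in the comma-separated option string $2.
has_opt() {
    case ",$2," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (hypothetical):
#   opts=$(mount_opts /var/lib/libvirt)
#   has_opt noatime "$opts" || echo "noatime missing" >&2
```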

What we deliberately did NOT use

nobarrier: The WDC SN720 is a consumer NVMe drive with no power-loss protection. Barriers (FUA writes) keep the XFS journal consistent if the server suddenly loses power.

discard: Synchronous TRIM on every unlink destroys write latency. Instead, we enabled fstrim.timer for a weekly batch TRIM.

The second NVMe was formatted with mkfs.xfs -m reflink=1. This enables reflink copies: instant, copy-on-write clones of VM images.

cp --reflink=always debian12-base.qcow2 test-vm.qcow2
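For day-to-day use, a small wrapper can guard against overwriting an existing image. This is a minimal sketch, not our deployment script: clone_image and the REFLINK_MODE variable are hypothetical names, and it assumes GNU cp.

```shell
# Clone a base image without overwriting an existing destination.
# $1 = base image, $2 = new image
# REFLINK_MODE is a hypothetical knob: "always" fails loudly when the
# filesystem cannot reflink, "auto" falls back to a full copy.
clone_image() {
    if [ -e "$2" ]; then
        echo "refusing to overwrite $2" >&2
        return 1
    fi
    cp --reflink="${REFLINK_MODE:-always}" -- "$1" "$2"
}

# Usage (hypothetical):
#   clone_image debian12-base.qcow2 test-vm.qcow2
```

Keeping the default at "always" means a clone on a non-reflink filesystem fails immediately instead of silently turning into a slow full copy.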

Layer 2: Memory (Hugepages)

We allocated 121 GiB of the 125 GiB total RAM to hugepages, leaving 4 GiB for the host OS.
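The page count follows directly from the reservation size. The sketch below assumes the default 2 MiB hugepage size on x86-64. The function name hugepages_for_gib is ours.

```shell
# Number of hugepages needed for a reservation.
# $1 = reservation in GiB, $2 = hugepage size in MiB (assumed default: 2)
hugepages_for_gib() {
    echo $(( $1 * 1024 / ${2:-2} ))
}

hugepages_for_gib 121      # prints 61952
```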

The crucial detail: kernel cmdline

GRUB_CMDLINE_LINUX="... hugepages=61952"

After update-grub and a reboot:

HugePages_Total:   61952
HugePages_Free:    61952
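A boot-time check can catch a short allocation before any VM starts. The following is a sketch with helper names of our own choosing (meminfo_field, hugepages_ok). It only parses the /proc/meminfo format shown above.

```shell
# Read a field from a /proc/meminfo-style file.
# $1 = field name, $2 = file (assumed default: /proc/meminfo)
meminfo_field() {
    awk -v f="$1:" '$1 == f { print $2 }' "${2:-/proc/meminfo}"
}

# Succeed only if the hugepage pool reached the requested size.
# $1 = expected page count, $2 = meminfo file (optional)
hugepages_ok() {
    [ "$(meminfo_field HugePages_Total "$2")" = "$1" ]
}

# Usage (hypothetical):
#   hugepages_ok 61952 || echo "hugepage pool is short" >&2
```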

zswap: the safety net

We enabled zswap (lz4 compression, zsmalloc allocator, 20% pool limit) via kernel cmdline.
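On the kernel cmdline these settings would typically map to zswap.enabled=1 zswap.compressor=lz4 zswap.zpool=zsmalloc zswap.max_pool_percent=20. That mapping is our reading of the kernel's documented zswap parameters, not a copy of the actual cmdline. After a reboot, the live values can be read back from sysfs. A sketch, with zswap_report as a hypothetical helper name:

```shell
# Print name=value for each zswap parameter found in a directory.
# $1 = parameter dir (assumed default: /sys/module/zswap/parameters)
zswap_report() {
    dir=${1:-/sys/module/zswap/parameters}
    for name in enabled compressor zpool max_pool_percent; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}

# Usage (hypothetical):
#   zswap_report
```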

Layer 3: Network (10GbE Tuning)

Socket buffers

# rmem_max / wmem_max: 128 MiB
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
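One way to sanity-check the 16 MiB TCP maximum is the bandwidth-delay product: a single flow needs roughly link speed times round-trip time of buffer to keep the pipe full. The sketch below is our own illustration, and the 10 ms RTT is an assumed example value, not a measurement from this host.

```shell
# Bandwidth-delay product in bytes.
# $1 = link speed in Gbit/s, $2 = round-trip time in ms
# 1 Gbit/s for 1 ms is 125000 bytes.
bdp_bytes() {
    echo $(( $1 * 125000 * $2 ))
}

bdp_bytes 10 10     # prints 12500000
```

At 10 Gbit/s and 10 ms the product is about 12.5 MB, which fits under the 16777216-byte ceiling set above.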

Busy polling

net.core.busy_read = 50
net.core.busy_poll = 50

Bridge bypass

net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
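With both keys at 0, bridged frames skip the iptables and ip6tables hooks. One caveat worth checking: the net.bridge.* keys only exist while the br_netfilter module is loaded, so a missing key is not the same as a key set to 0. The helpers below (sysctl_path, sysctl_get) are a sketch with names of our own.

```shell
# Map a dotted sysctl key to its path under a proc root.
# $1 = key, $2 = root (assumed default: /proc/sys)
sysctl_path() {
    printf '%s/%s\n' "${2:-/proc/sys}" "$(printf '%s' "$1" | tr . /)"
}

# Print the current value, or "absent" if the key does not exist
# (for net.bridge.* that usually means br_netfilter is not loaded).
sysctl_get() {
    p=$(sysctl_path "$1" "$2")
    if [ -r "$p" ]; then
        cat "$p"
    else
        echo absent
    fi
}

# Usage (hypothetical):
#   sysctl_get net.bridge.bridge-nf-call-iptables
```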

TCP keepalives

net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
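Together these three values bound how long a dead peer can hold an idle connection open: the idle time before the first probe, plus one interval per unanswered probe. A quick sketch of the arithmetic (keepalive_timeout is our own helper name). It applies only to sockets that enable SO_KEEPALIVE.

```shell
# Seconds of silence before TCP declares an idle peer dead.
# $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
keepalive_timeout() {
    echo $(( $1 + $2 * $3 ))
}

keepalive_timeout 300 30 5     # prints 450
```

With the kernel defaults (7200, 75, 9) the same calculation gives 7875 seconds, a little over two hours.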

Layer 4: CPU & Scheduler

Governor and C-states

The performance governor keeps every core requesting its maximum frequency instead of scaling down when idle.
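Since other tools can change the governor behind your back, it helps to verify it per core. The following sketch lists any core that is not on the expected governor. governor_mismatches is a hypothetical helper, and it assumes the standard cpufreq sysfs layout.

```shell
# List CPUs whose governor differs from the expected one.
# $1 = expected governor, $2 = cpu sysfs root (assumed default below)
governor_mismatches() {
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        g=$(cat "$f")
        if [ "$g" != "$1" ]; then
            echo "$f: $g"
        fi
    done
}

# Usage (hypothetical): no output means every core matches.
#   governor_mismatches performance
```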

NUMA discipline

kernel.numa_balancing = 0
kernel.timer_migration = 0
kernel.sched_autogroup_enabled = 0

Halt polling

/sys/module/kvm/parameters/halt_poll_ns = 200000
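When a vCPU goes idle, KVM can poll for up to halt_poll_ns nanoseconds before descheduling it, trading some host CPU for lower wake-up latency. The value can be written at runtime through sysfs. The sketch below uses a hypothetical helper, set_halt_poll_ns, and needs root when pointed at the real file.

```shell
# Set KVM's halt-polling window at runtime.
# $1 = nanoseconds, $2 = parameter file (assumed default below)
set_halt_poll_ns() {
    target=${2:-/sys/module/kvm/parameters/halt_poll_ns}
    case "$1" in
        ''|*[!0-9]*)
            echo "not a number: $1" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$1" > "$target"
}

# To persist across reboots, a modprobe option is one approach
# (the file name kvm.conf is an assumption):
#   echo "options kvm halt_poll_ns=200000" > /etc/modprobe.d/kvm.conf
```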

KSM: disabled

Kernel Samepage Merging adds unpredictable CPU overhead.

Layer 5: Security (SSH & Firewall)

UFW: whitelist-only SSH

ufw default deny incoming
ufw default allow outgoing
ufw allow from 127.0.0.0/8 to any port 22
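With more than one allowed source range, generating the rules from a list keeps the whitelist easy to review. The sketch below only prints the commands. ssh_allow_rules is a hypothetical helper, 192.0.2.0/24 is a documentation range standing in for a real management network, and proto tcp is our addition to narrow the rule to TCP.

```shell
# Print (not run) the ufw rules for a list of allowed source ranges.
# Arguments: one or more CIDR ranges
ssh_allow_rules() {
    echo "ufw default deny incoming"
    echo "ufw default allow outgoing"
    for cidr in "$@"; do
        echo "ufw allow from $cidr to any port 22 proto tcp"
    done
}

# 192.0.2.0/24 is a placeholder, not one of our networks.
ssh_allow_rules 127.0.0.0/8 192.0.2.0/24
```

Review the printed rules first, then pipe them to a root shell once they look right.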

sshd hardening

PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2
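Before restarting sshd, it is worth validating the file with sshd -t, and sshd -T prints the effective configuration. For a quick scripted check of individual directives, something like the following sketch can work. sshd_setting_is is our own helper name, and it deliberately ignores Match blocks and Include files.

```shell
# Succeed if a config file sets a directive to the expected value.
# Uses the first uncommented occurrence, since sshd keeps the first
# value it sees. Keywords are case-insensitive in sshd_config.
# $1 = directive, $2 = expected value, $3 = file
sshd_setting_is() {
    awk -v k="$1" -v v="$2" '
        tolower($1) == tolower(k) { found = 1; ok = ($2 == v); exit }
        END { exit !(found && ok) }
    ' "$3"
}

# Usage (hypothetical):
#   sshd_setting_is PasswordAuthentication no /etc/ssh/sshd_config \
#       || echo "password logins are still enabled" >&2
```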

Layer 6: Automation (Scripts & Ansible)

Three deployment scripts:

  • tune-xfs.sh
  • tune-kvm.sh
  • harden-ssh.sh

Plus matching Ansible playbooks for repeatable deployment.

Deployment Order

  1. sudo bash tune-xfs.sh
  2. sudo bash tune-kvm.sh
  3. sudo bash harden-ssh.sh
  4. Edit /etc/default/grub, add hugepages=61952
  5. sudo update-grub && sudo reboot
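The first three steps can be chained so that a failure stops the run before later scripts touch the system. This is a sketch rather than part of our tooling: deploy and the DRY_RUN switch are hypothetical names, and the GRUB edit and reboot are left as manual steps.

```shell
# Run the deployment scripts in order and stop at the first failure.
# DRY_RUN=1 (a hypothetical switch) prints the commands instead.
deploy() {
    for script in tune-xfs.sh tune-kvm.sh harden-ssh.sh; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "sudo bash $script"
        else
            sudo bash "$script" || { echo "$script failed" >&2; return 1; }
        fi
    done
}

# Usage (hypothetical):
#   DRY_RUN=1 deploy    # show what would run
#   deploy              # run for real
```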

What We Learned

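Hugepages should be reserved at boot. Runtime allocation can fall short once memory has fragmented, so for a reservation this large the kernel cmdline was the only approach we found reliable.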

nobarrier on consumer NVMe is Russian roulette. Without barriers, a sudden power loss could corrupt the XFS journal.

Cockpit pulls in tuned, so be ready for it. We fixed it by migrating everything into a custom tuned profile.

Reflink cloning on XFS is a superpower. Clone 40 GB VM images in under a second.