systemd Masterclass

This guide goes deep on systemd: the init system, service manager, timer scheduler, logging system, socket activator, and cgroup controller that underpins most mainstream Linux distributions. If you have been copying service unit files from Stack Overflow without fully understanding them, this is the page that changes that. By the end you will be writing production-grade unit files from scratch, tuning resource limits, hardening services against privilege escalation, and running container workloads without a separate daemon.

What this page covers: unit types and the override pattern, service unit deep dive, timer units replacing cron, socket activation for zero-downtime restarts, security hardening with sandboxing directives, cgroup resource control, structured logging with journald, targets and boot analysis, Podman containers as native systemd services via Quadlet, template units for fleet management, and a complete troubleshooting reference — all grounded in the kldload stack you already have running.

Prerequisites: a kldload system on any supported distro. All examples work on CentOS Stream 9, Debian 13, Ubuntu 24.04, Fedora 41, RHEL 9, and Rocky Linux 9. Arch Linux users: the unit file syntax is identical, but some default units differ.

1. systemd Runs Everything

systemd is PID 1. It is the first userspace process the kernel starts after mounting the root filesystem, and it is the direct or indirect parent of every other process on your system. When your kldload node boots, systemd imports the ZFS pool, mounts datasets, starts WireGuard, brings up networking, launches libvirtd, starts your containers, and then hands control to a login prompt or a graphical session. It does all of this in parallel, with dependency ordering, and with structured logging of every event.

systemd is the init system, service manager, logging system, timer scheduler, and device manager for most mainstream Linux distributions, and it can supervise containers as ordinary units. On kldload, every service (libvirtd, WireGuard, ZFS, Docker, Podman, Prometheus, your applications) is a systemd unit. Understanding systemd means understanding how your entire system starts, stops, recovers, and reports.

The architecture is deliberately monolithic. systemd replaced a dozen separate tools with one unified system: SysV init scripts, cron, inetd, /etc/fstab, syslog, logrotate, and chkconfig. One tool, one configuration language, one dependency model, one log store. The complaints about complexity are valid — it IS complex. But the alternative was a dozen separate tools with twelve different configuration languages that did not know about each other.

On kldload, systemd manages ZFS mounts, WireGuard interfaces, KVM VMs, container lifecycles, snapshot timers, and replication schedules. Learning it well is the single highest-ROI investment for Linux infrastructure. When something breaks at 3am, you want one tool to answer all of your questions: what failed, when did it fail, what was it waiting for, what was its output, and why did the restart policy not catch it. systemd gives you all of that. Init scripts gave you a return code and silence.

2. Units — the Building Blocks

Everything systemd manages is a unit. A unit is a configuration file that describes a resource: a service to run, a timer to fire, a socket to listen on, a filesystem to mount, a device to activate. The unit type is determined by the file suffix.

.service

A daemon or one-shot process. The most common unit type. Describes how to start, stop, and restart a process, what environment it runs in, and what resources it is allowed to consume.

// nginx.service, postgresql.service, sshd.service

.timer

A scheduled trigger. Activates an associated .service unit on a calendar schedule or after a time interval. The systemd replacement for cron entries. Survives reboots, supports randomized delay.

// sanoid.timer, certbot.timer, fstrim.timer

.socket

A listening socket (TCP, UDP, or Unix domain). systemd holds the socket open and starts the associated service only when a connection arrives. Enables on-demand activation and zero-downtime restarts.

// sshd.socket, cockpit.socket, docker.socket

.mount

A filesystem mount point. The systemd equivalent of an /etc/fstab entry, but with full dependency support — mount only after the ZFS pool is imported, mount before the services that need it.

// data.mount, mnt-backup.mount

.target

A synchronization point. Groups units together and provides a named milestone in the boot sequence. Replaces SysV runlevels. multi-user.target is the equivalent of runlevel 3.

// multi-user.target, network-online.target, graphical.target

.path

A filesystem watch. Activates a service when a file or directory changes. Useful for triggering jobs when a file lands in a watched directory — without polling.

// spool watcher, config reload trigger

.slice

A cgroup hierarchy node. Groups services into resource partitions. All VM processes belong to machine.slice. User processes belong to user.slice. Apply CPU and memory limits to the whole group.

// system.slice, user.slice, machine.slice

.scope

An externally-managed group of processes. Created at runtime (not from a unit file) to organize processes started outside systemd — for example, a shell session that forks children.

// session-1.scope, init.scope

Where Units Live

systemd loads unit files from three locations, in priority order (highest to lowest):

Path	Owner	Notes
`/etc/systemd/system/`	Admin	Highest priority. Your files. Overrides vendor units. Never touched by package updates.
`/run/systemd/system/`	Runtime	Volatile. Created at runtime. Cleared on reboot. Used by systemd itself and container runtimes.
`/usr/lib/systemd/system/`	Vendor	Lowest priority. Shipped by packages. Overwritten on updates. Never edit these directly.

The Override Pattern

Never edit files in /usr/lib/systemd/system/. Package updates silently overwrite them. Instead, use drop-ins:

# Open an editor for a drop-in (creates the directory and file automatically)
systemctl edit nginx.service

# This creates: /etc/systemd/system/nginx.service.d/override.conf
# The drop-in MERGES with the original — you only specify what you want to change

# Example drop-in: increase the open file limit for nginx
[Service]
LimitNOFILE=65536

# After saving, reload systemd and restart the service
systemctl daemon-reload
systemctl restart nginx.service

# Verify the override is active
systemctl cat nginx.service  # shows merged unit with drop-ins annotated
systemctl show nginx.service --property LimitNOFILE

The override pattern is the most important systemd skill. Never edit vendor unit files — they get overwritten on package updates. Always use drop-ins: systemctl edit creates /etc/systemd/system/foo.service.d/override.conf that merges with the original. Your changes survive updates. The drop-in only needs to contain the sections and keys you want to change — everything else is inherited. To completely replace a setting that has list semantics (like ExecStart), write the key empty first to clear it, then write the new value: ExecStart= followed by ExecStart=/usr/bin/mynginx -c /etc/mynginx.conf. This is the correct way to change the command a vendor unit runs.
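
The reset-then-assign rule for list-type settings is easier to see in a toy model. The Python sketch below is not how systemd parses unit files; it only imitates the layering described above. The LIST_KEYS set and the merge_settings helper are names invented for this illustration, and the vendor ExecStart path is a made-up example.

```python
# Illustrative subset of settings that accumulate instead of overwrite.
LIST_KEYS = {"ExecStart", "ExecStartPre", "Environment"}


def merge_settings(vendor, dropin):
    """Layer drop-in (key, value) pairs over vendor pairs and return the result."""
    merged = {}
    for key, value in list(vendor) + list(dropin):
        if key in LIST_KEYS:
            if value == "":
                merged[key] = []  # an empty assignment resets the list
            else:
                merged.setdefault(key, []).append(value)
        else:
            merged[key] = [value]  # scalar settings: the last assignment wins
    return merged


vendor_unit = [("ExecStart", "/usr/sbin/nginx"), ("LimitNOFILE", "1024")]
override = [
    ("ExecStart", ""),
    ("ExecStart", "/usr/bin/mynginx -c /etc/mynginx.conf"),
    ("LimitNOFILE", "65536"),
]
result = merge_settings(vendor_unit, override)
```

Without the empty ExecStart= line, the model ends up with two commands in the list, which mirrors why a drop-in that only adds a second ExecStart= to an ordinary service is rejected.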

3. Service Units Deep Dive

Service units are the most common unit type. A service unit has three sections: [Unit] (metadata and dependencies), [Service] (how to run the process), and [Install] (how to enable/disable at boot).

[Unit] Section

[Unit]
Description=My Application Server
Documentation=https://myapp.example.com/docs

# Ordering only: start after these units (and stop before them at shutdown)
After=network-online.target postgresql.service
Before=nginx.service

# Dependencies (pull other units in; combine with After= to also order them):
# Requires=: hard dependency. If postgresql cannot be started, this unit is not started,
#            and if postgresql is stopped, this unit is stopped too
# Wants=: soft dependency. The listed unit is started alongside, but its failure does not block this unit
# BindsTo=: like Requires=, but this unit also stops if the dependency goes away unexpectedly
# PartOf=: if the listed unit is stopped or restarted, this unit is stopped or restarted too
# Conflicts=: starting this unit stops the listed unit, and vice versa
Wants=network-online.target
Requires=postgresql.service
# myapp-stack.target is the custom target defined in section 9
PartOf=myapp-stack.target
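
Ordering directives form a graph, and systemd starts every unit whose predecessors are done in parallel. One rough way to picture that is a topological sort. The Python sketch below assumes a hypothetical set of After= relationships (redis.service is included only to show two units starting side by side); the AFTER table and start_batches helper are illustrative names, not anything systemd exposes.

```python
from graphlib import TopologicalSorter

# Hypothetical After= relationships: unit -> units it must start after.
AFTER = {
    "network-online.target": [],
    "postgresql.service": ["network-online.target"],
    "redis.service": ["network-online.target"],
    "myapp.service": ["network-online.target", "postgresql.service"],
    "nginx.service": ["myapp.service"],
}


def start_batches(after):
    """Return groups of units that could start in parallel, in start order."""
    sorter = TopologicalSorter(after)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose predecessors are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = start_batches(AFTER)
```

Note that the model only covers ordering. Whether a unit is pulled in at all is decided separately by Wants= and Requires=.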

[Service] Section — Type=

The Type= directive tells systemd what "ready" means for this service:

Type=simple (default)

The process started by ExecStart IS the service. systemd considers it ready immediately after the process starts. No readiness signal. Fine for simple daemons that are ready as soon as they start listening.

// Most basic services, one-process daemons

Type=exec

Like simple, but systemd waits until the exec() call succeeds — meaning the binary was found and started. Catches "binary not found" errors that Type=simple misses.

// Slightly safer default than simple

Type=forking

The process daemonizes: the initial process forks and exits, leaving a child as the daemon. systemd follows the child. Set PIDFile= so systemd can reliably identify which PID is the real daemon.

// Old-style daemons: apache2, some databases

Type=oneshot

Run once and exit. systemd waits for the process to finish before starting units ordered after it. Without RemainAfterExit=yes the unit returns to "inactive" once the process exits; add RemainAfterExit=yes to keep it "active" after a successful run.

// Setup scripts, iptables loaders, module loading

Type=notify

The process sends sd_notify(READY=1) to tell systemd it is ready. systemd waits for this signal before considering the unit active. The best option for any application that supports it.

// PostgreSQL, systemd-aware apps

Type=dbus

Ready when the service acquires a D-Bus bus name. Used by desktop services and some system daemons that register on D-Bus as part of initialization.

// NetworkManager, bluetooth, desktop services

ExecStart and friends

[Service]
# ExecStart: the main command. Use an absolute path to the binary.
ExecStart=/usr/bin/myapp --config /etc/myapp/config.toml

# ExecStartPre: run before ExecStart. If one of these fails, ExecStart is skipped.
# Prefix with - to tolerate a failure without aborting the start:
ExecStartPre=-/usr/bin/myapp --check-config /etc/myapp/config.toml
ExecStartPre=/usr/bin/mkdir -p /run/myapp

# ExecStartPost: run after the main process has started (for Type=notify, after READY=1).
# The unit is not considered started until these finish, and a failure here fails the unit.
ExecStartPost=/usr/bin/curl -sf http://localhost:8080/health

# ExecStop: clean shutdown command. If omitted, systemd sends SIGTERM.
ExecStop=/usr/bin/myapp --graceful-shutdown

# ExecReload: the command run by "systemctl reload"; here it sends SIGHUP to the main process
ExecReload=/bin/kill -HUP $MAINPID

Restart Policies

[Unit]
# Rate-limit starts: allow up to 5 starts within 60 seconds.
# After that, the unit enters a failed state and stops retrying.
# Both settings belong in [Unit]; StartLimitInterval= is the legacy name
# for StartLimitIntervalSec=, so set only one of them.
StartLimitIntervalSec=60s
StartLimitBurst=5

[Service]
# Restart=always      : restart on any exit (clean, failure, signal, timeout)
# Restart=on-failure  : restart on non-zero exit, signal death, timeout, or watchdog timeout
# Restart=on-abnormal : restart on signal death, timeout, or watchdog timeout (not on exit codes)
Restart=on-failure

# How long to wait before restarting
RestartSec=5s
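
To get a feel for how the restart delay interacts with the start rate limit, the following Python sketch models the limit as a simple window counter: a window opens at the first start, and once the burst is used up, further starts inside that window are refused. This is a simplified approximation written for this guide, not systemd's implementation, and simulate_start_limit is a hypothetical helper name.

```python
def simulate_start_limit(start_times, interval_sec, burst):
    """Return one bool per start attempt: True if allowed, False if refused."""
    window_begin = None
    count = 0
    results = []
    for t in start_times:
        if window_begin is None or t - window_begin > interval_sec:
            # First start ever, or the previous window has expired: open a new one.
            window_begin, count = t, 1
            results.append(True)
        elif count < burst:
            count += 1
            results.append(True)
        else:
            results.append(False)  # burst exhausted inside the window
    return results


# A service that crashes immediately and is restarted every 5 seconds:
crash_loop = simulate_start_limit([0, 5, 10, 15, 20, 25], interval_sec=60, burst=5)
# A service that fails only once every 70 seconds never hits the limit:
slow_failures = simulate_start_limit([0, 70, 140], interval_sec=60, burst=5)
```

In the crash-loop case the sixth attempt is the one that gets refused, which is the point where a real unit would be marked failed.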

Environment Variables

[Service]
# Inline environment variables
Environment=APP_ENV=production
Environment=PORT=8080
Environment=LOG_LEVEL=warn

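# Load from a file (key=value format, one per line)
EnvironmentFile=/etc/myapp/env
# A leading - means: skip silently if the file is missing.
# (Unit files do not support trailing comments; keep comments on their own lines.)
EnvironmentFile=-/etc/myapp/env.local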

# Pass variables from systemd's own environment
PassEnvironment=HOME LANG TZ

Production Application Unit — Complete Example

# /etc/systemd/system/myapp.service
[Unit]
Description=My Production Application
Documentation=https://myapp.example.com/docs
After=network-online.target postgresql.service redis.service
Wants=network-online.target
Requires=postgresql.service
# Start rate limit: give up after 5 starts within 2 minutes
StartLimitIntervalSec=120s
StartLimitBurst=5

[Service]
Type=notify
User=myapp
Group=myapp
WorkingDirectory=/opt/myapp

# Configuration
EnvironmentFile=/etc/myapp/env
Environment=APP_ENV=production

# Runtime directory: systemd creates /run/myapp owned by User=/Group= before start.
# (An ExecStartPre= command cannot do this, because it runs as the unprivileged user.)
RuntimeDirectory=myapp
RuntimeDirectoryMode=0750

# Pre-flight: check migrations (the - prefix tolerates a failure)
ExecStartPre=-/usr/bin/myapp migrate --check

# Main process
ExecStart=/usr/bin/myapp serve --config /etc/myapp/config.toml
ExecReload=/bin/kill -HUP $MAINPID

# Restart policy: retry on failure after a 10 second pause
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096

# Security hardening (see section 6)
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
# These directories must already exist; /run/myapp is writable via RuntimeDirectory=
ReadWritePaths=/var/lib/myapp /var/log/myapp

[Install]
WantedBy=multi-user.target

Type=notify is the best option for any application that supports it (PostgreSQL and other systemd-aware daemons). The application calls sd_notify(3) with READY=1 when it is actually ready to serve traffic, not just when the process started. Type=simple assumes the service is ready as soon as the process has been launched. The difference matters for services that depend on each other. When postgresql.service is itself Type=notify, Requires=postgresql.service together with After=postgresql.service means your app does not start until PostgreSQL has reported that it is ready, not merely until the postgres process launched. This is the difference between a working dependency graph and a race condition. Many Go, Rust, and Python frameworks have sd_notify support built in or available as a library. Use it.
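
For applications without a ready-made library, the readiness message is small enough to send by hand: it is a datagram written to the Unix socket named in the NOTIFY_SOCKET environment variable. The Python sketch below is a minimal stand-in for the real sd_notify(3) call, assuming a service started with Type=notify; the function name and its environ parameter are choices made for this example.

```python
import os
import socket


def sd_notify(state, environ=os.environ):
    """Send a state string such as "READY=1" to the service manager.

    Returns False when NOTIFY_SOCKET is unset, for example when the
    program is started by hand rather than by systemd.
    """
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract socket; it maps to a NUL byte.
    address = "\0" + path[1:] if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(address)
        sock.sendall(state.encode("utf-8"))
    return True


# Typical call site, once the application can actually serve requests:
#     sd_notify("READY=1")
```

Calling it before the listener is bound or the database connection is up defeats the purpose, so the call belongs at the very end of startup.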

4. Timers — Replacing Cron

A systemd timer is a .timer unit that activates an associated .service unit on a schedule. Timers are better than cron in almost every way: they have dependencies, they catch up on runs missed while the machine was off (with Persistent=true), they support randomized delay to prevent a thundering herd, they log to journald, and they participate in cgroup resource control.

Timer Unit Structure

# /etc/systemd/system/myapp-cleanup.timer
[Unit]
Description=Clean up old myapp temporary files daily

[Timer]
# Calendar expression: when to fire
OnCalendar=daily

# If the timer was missed (system was off), run it when the system comes back
Persistent=true

# Add up to 5 minutes of random delay — prevent all nodes firing simultaneously
RandomizedDelaySec=5min

# Specify a different unit to activate (default: same name with .service suffix)
Unit=myapp-cleanup.service

[Install]
WantedBy=timers.target

OnCalendar Syntax

# Shorthand and its expanded form:
#   hourly    *-*-* *:00:00
#   daily     *-*-* 00:00:00
#   weekly    Mon *-*-* 00:00:00
#   monthly   *-*-01 00:00:00
OnCalendar=daily

# Specific times. Comments go on their own line: unit files have no trailing comments.
# Weekdays at 9am
OnCalendar=Mon-Fri *-*-* 09:00:00
# Every day at 02:30
OnCalendar=*-*-* 02:30:00
# Saturday at 3am
OnCalendar=Sat *-*-* 03:00:00
# Every 15 minutes
OnCalendar=*-*-* *:0/15:00

# Verify your expression
systemd-analyze calendar "Mon-Fri *-*-* 09:00:00"
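
The start/step notation in expressions like *:0/15 is the part that trips people up. As a rough illustration, the Python sketch below expands a single numeric field into the values it matches. It handles only a small subset of the syntax (wildcard, single values, comma lists, a..b ranges, and start/step), and expand_field is a hypothetical helper; systemd-analyze calendar remains the authoritative check.

```python
def expand_field(spec, lo, hi):
    """Expand one numeric calendar field into the sorted values it matches."""
    values = set()
    for part in spec.split(","):
        base, _, step = part.partition("/")
        if base == "*":
            start, end = lo, hi
        elif ".." in base:
            first, last = base.split("..")
            start, end = int(first), int(last)
        else:
            start = int(base)
            # "start/step" repeats up to the field maximum;
            # a bare value matches only itself.
            end = hi if step else start
        values.update(range(start, end + 1, int(step) if step else 1))
    return sorted(values)


minutes = expand_field("0/15", 0, 59)   # the minute field of *:0/15
hours = expand_field("9..11", 0, 23)    # an hour range
```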

Relative Timers

[Timer]
# Fire once, N seconds after system boot
OnBootSec=5min

# Fire N seconds after the last time this unit was activated
# (creates a repeating interval from the last run, not a wall-clock schedule)
OnUnitActiveSec=15min

# Fire N seconds after systemd itself started
OnStartupSec=10s

Real-World Examples for kldload

# Weekly ZFS scrub — Saturday at 3am with random delay to spread across a fleet
# /etc/systemd/system/zfs-scrub.timer
[Timer]
OnCalendar=Sat *-*-* 03:00:00
RandomizedDelaySec=30min
Persistent=true

# Hourly ZFS snapshot via sanoid
# /etc/systemd/system/sanoid.timer
[Timer]
OnCalendar=hourly
RandomizedDelaySec=2min
Persistent=true

# Replication every 15 minutes via syncoid
# /etc/systemd/system/syncoid.timer
[Timer]
OnCalendar=*-*-* *:0/15:00
RandomizedDelaySec=60s
Persistent=true

# Certificate renewal check daily at 2am
# /etc/systemd/system/certbot.timer
[Timer]
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=1h
Persistent=true

# Useful timer commands
systemctl list-timers --all          # show all timers and next trigger time
systemctl status sanoid.timer        # show timer status and last run
journalctl -u sanoid.service -b      # logs from the current boot

The kldload KVM profile uses systemd timers for hourly snapshots. The sanoid timer runs snapshot policies. The syncoid timer runs replication. All of these used to be cron jobs, but timers are better: they have dependencies (wait for network-online.target before replicating to a remote host), randomized delay (do not scrub all nodes at the same time), and persistent execution (if a scheduled run is missed because the system was off, it runs when the system comes back). They also have structured log output in journald: you can see every run, its exit status, and when it started and finished with journalctl -u syncoid.service. With cron, you got an email or silence.

5. Socket Activation

With socket activation, systemd holds the listening socket open and starts the associated service only when the first connection arrives. The service does not need to be running at all times. And when you restart the service, systemd holds the socket so incoming connections are queued rather than rejected.

How It Works

# Two unit files for every socket-activated service:
# sshd.socket   : systemd owns the socket and listens on port 22
# sshd@.service : template service, one instance started per connection (because Accept=yes)

# sshd.socket
[Unit]
Description=OpenSSH Server Socket
Conflicts=sshd.service

[Socket]
ListenStream=22
# Accept=yes: systemd accepts each connection and spawns one service instance for it
# Accept=no (default): pass the listening socket to one long-running service
Accept=yes

[Install]
WantedBy=sockets.target

Zero-Downtime Restart Pattern

# /etc/systemd/system/myapp.socket
[Unit]
Description=myapp listening socket

[Socket]
# systemd holds this socket. Connections are queued during service restarts.
ListenStream=8080
# Socket backlog size during restart
Backlog=128

[Install]
WantedBy=sockets.target

# /etc/systemd/system/myapp.service
[Unit]
Description=myapp web server
Requires=myapp.socket
After=myapp.socket

[Service]
Type=notify
ExecStart=/usr/bin/myapp serve
# The service inherits the socket FD from systemd automatically
# Your application reads from $LISTEN_FDS and $LISTEN_PID

# Zero-downtime restart: systemd holds the socket while the service restarts
# Incoming connections queue up, not fail. Client sees a brief pause.
Restart=on-failure

[Install]
WantedBy=multi-user.target

# Enable socket, not service — the socket starts the service on demand
systemctl enable --now myapp.socket

# Restart the service without refusing new connections:
systemctl restart myapp.service
# While myapp restarts, the socket is still held by systemd.
# New clients see a brief pause in responses, not a connection refused.
# Requests already in flight still depend on myapp shutting down gracefully.

On-Demand Activation Example

# Cockpit web console: only starts when someone connects to port 9090
systemctl status cockpit.socket   # listening, but cockpit.service is inactive

# Connect with a browser to https://host:9090
# systemd starts cockpit.service, handles the connection
# After idle timeout, cockpit.service stops — socket remains listening

# This pattern is used for:
# - Admin interfaces that should not run all the time
# - Services that consume RAM when idle
# - Dev/staging environments with many services, limited memory

Socket activation is how you get service restarts that do not refuse connections, without a load balancer. systemd keeps the listening socket open, the kernel queues incoming connections in the socket backlog while the service restarts, and the new process picks them up once it is running. The client sees a brief pause, not a connection refused. Requests that the old process was already handling still need a graceful shutdown to finish cleanly. For single-server deployments where you cannot afford a load balancer, this is invaluable. It is also how you run dozens of services on a single server without paying RAM costs for idle ones: only start them when someone actually connects. Cockpit, the kldload web UI, and many monitoring sidecars use this pattern. sd_listen_fds(3) is the C interface; Go, Python, and Rust have equivalent libraries, or the application can read $LISTEN_FDS and $LISTEN_PID directly.
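
On the application side, picking up the socket means reading two environment variables and wrapping the inherited file descriptors, which start at number 3. The Python sketch below shows the idea under the assumption that the service is started from a .socket unit with Accept=no; listen_fds and inherited_sockets are illustrative helper names, not a published API.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited descriptor, per the sd_listen_fds(3) convention


def listen_fds(environ=os.environ, pid=None):
    """Return the descriptor numbers passed in by systemd, or [] if there are none."""
    pid = os.getpid() if pid is None else pid
    if environ.get("LISTEN_PID") != str(pid):
        return []  # the variables were meant for a different process
    count = int(environ.get("LISTEN_FDS", "0"))
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


def inherited_sockets(environ=os.environ):
    """Wrap each inherited descriptor in a socket object ready for accept()."""
    return [socket.socket(fileno=fd) for fd in listen_fds(environ)]
```

A server would call inherited_sockets() at startup and fall back to binding its own port when the list is empty, so the same binary works with and without socket activation.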

6. Security Hardening

systemd unit files support a comprehensive set of security directives that sandbox services using kernel namespaces, seccomp filters, and cgroup policies. No code changes, no application modifications — just unit file directives. A service running under strict sandboxing cannot read your home directory, cannot write outside its designated paths, cannot gain new privileges, and cannot make most dangerous system calls.

Core Sandboxing Directives

[Service]
# Filesystem protection
# true: mount /usr and the boot loader directories (/boot, /efi) read-only
# full: same as true, plus /etc read-only
# strict: mount the entire file system hierarchy read-only, except /dev, /proc, and /sys.
#         The service can only write where you explicitly allow.
ProtectSystem=strict

# Prevent access to /home, /root, /run/user
# true: make them inaccessible (empty dirs)
# read-only: allow reading but not writing
ProtectHome=true

# Give the service its own private /tmp and /var/tmp — invisible to other processes
PrivateTmp=true

# Prevent the service from gaining new privileges via setuid/sudo/capabilities
NoNewPrivileges=true

# Fine-grained filesystem access on top of ProtectSystem=strict:
ReadOnlyPaths=/etc /usr
ReadWritePaths=/var/lib/myapp /var/log/myapp /var/run/myapp
InaccessiblePaths=/etc/shadow /etc/sudoers /root

Namespace Isolation

[Service]
# Give the service its own network namespace — no access to the host network
# WARNING: if you set this, the service cannot make outbound connections either
# Only use for services that communicate exclusively via Unix sockets
PrivateNetwork=true

# Give the service its own /dev with only safe devices (null, zero, random, urandom, tty)
PrivateDevices=true

# Run the service with a private set of user/group IDs
# Requires kernel user namespace support
PrivateUsers=true

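# Hide processes owned by other users from the service's view of /proc
ProtectProc=invisible

# Make kernel tunables (/proc/sys, /sys, and similar) read-only,
# deny loading kernel modules, and deny access to the kernel log buffer
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true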

# Prevent the service from modifying the system clock
ProtectClock=true

# Prevent access to hostname/domainname modification
ProtectHostname=true

Capability Dropping

[Service]
# Capabilities let a process perform specific privileged operations without full root
# CAP_NET_BIND_SERVICE: bind to ports < 1024
# CAP_NET_ADMIN: configure network interfaces
# CAP_SYS_PTRACE: ptrace other processes
# CAP_CHOWN: change file ownership

# Option A: an empty assignment drops ALL capabilities from the bounding set
CapabilityBoundingSet=

# Option B: limit the bounding set to only what the service needs
CapabilityBoundingSet=CAP_NET_BIND_SERVICE

# Grant the capability to a service running as a non-root User=
# (ambient capabilities are kept across exec)
AmbientCapabilities=CAP_NET_BIND_SERVICE

System Call Filtering (seccomp)

[Service]
# Allow only syscalls in the given set (predefined groups available)
# @system-service: the recommended baseline for most daemons
# @network-io: socket operations
# @file-system: file operations
# @process: process management
SystemCallFilter=@system-service

# Deny whole groups even if the allow-list above includes them (prefix with ~ to deny)
SystemCallFilter=~@debug @mount @privileged @reboot

# By default a process that makes a filtered syscall is killed with SIGSYS.
# Set SystemCallErrorNumber= to make the call fail with an errno instead:
SystemCallErrorNumber=EPERM

Hardened Nginx Unit

# /etc/systemd/system/nginx.service.d/hardened.conf
[Service]
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictRealtime=true
RestrictSUIDSGID=true
LockPersonality=true

# nginx needs to bind port 80/443 and write to its log/pid paths
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_SETUID CAP_SETGID CAP_DAC_OVERRIDE
AmbientCapabilities=CAP_NET_BIND_SERVICE
ReadWritePaths=/var/log/nginx /var/run /var/cache/nginx

SystemCallFilter=@system-service
SystemCallFilter=~@debug @reboot @swap @obsolete

Auditing Unit Security

# Score a unit's exposure (0 = fully sandboxed, 10 = completely exposed; lower is better)
systemd-analyze security nginx.service

# Example output:
# NAME                                                        DESCRIPTION
# PrivateNetwork=                                             Service has access to the host's network   UNSAFE ☓
# NoNewPrivileges=                                            Service processes cannot acquire new       OK ✓
# ProtectSystem=strict                                        Service has strict read-only access        OK ✓
# ...
# Overall exposure level for nginx.service: 7.8 EXPOSED

# After adding the hardened drop-in:
# Overall exposure level for nginx.service: 4.2 OK

# List all services with their security scores
systemd-analyze security

systemd-analyze security rates each unit's exposure from 0 (fully sandboxed) to 10 (completely exposed), so lower is better. Services with no hardening typically land above 9 (UNSAFE). Adding ProtectSystem=strict, NoNewPrivileges=true, and PrivateTmp=true brings the number down immediately. This is free security hardening: no code changes, no application modifications, just unit file directives. On a kldload system running public-facing services, running systemd-analyze security with no arguments lists every service with its exposure level. Anything above 6.0 that is listening on a public interface should get a hardening drop-in. Do it in an afternoon. Start with NoNewPrivileges and PrivateTmp, which rarely break anything, then add ProtectSystem=strict together with ReadWritePaths= for the directories the service writes to, and add the rest one at a time.
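
When the list of services is long, it can help to filter the report by exposure value, where higher numbers mean less sandboxing. The Python sketch below parses text shaped like the summary table that systemd-analyze security prints with no arguments. The SAMPLE text and its numbers are made up for illustration, the column layout is an assumption that may differ between systemd versions, and units_above is a hypothetical helper.

```python
# Illustrative sample only; real output comes from: systemd-analyze security
SAMPLE = """\
UNIT                EXPOSURE PREDICATE
myapp.service            9.6 UNSAFE
nginx.service            4.2 OK
sshd.service             7.8 EXPOSED
"""


def units_above(report, threshold):
    """Return (unit, exposure) pairs whose exposure exceeds threshold, worst first."""
    found = []
    for line in report.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "UNIT":
            continue  # skip the header and blank or short lines
        try:
            exposure = float(fields[1])
        except ValueError:
            continue  # not a data row
        if exposure > threshold:
            found.append((fields[0], exposure))
    return sorted(found, key=lambda item: item[1], reverse=True)


needs_hardening = units_above(SAMPLE, 6.0)
```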

7. Resource Control with cgroups

systemd uses Linux cgroups v2 to enforce resource limits on every unit. Limits are hierarchical: a limit on a slice applies to everything in that slice. A VM process cannot consume more CPU than its parent machine.slice allows, regardless of what the individual .service unit says.

Per-Service Resource Limits

[Service]
# CPU: limit to 200% of one core, that is, two full cores' worth of CPU time
CPUQuota=200%

# CPU weight (relative scheduling priority, 1-10000, default 100)
# 50 gives this service a lower priority than the default
CPUWeight=50

# Memory: soft limit (above this the kernel throttles the cgroup and reclaims aggressively)
MemoryHigh=3G

# Memory: hard limit (OOM kill if exceeded)
MemoryMax=4G

# Memory: minimum guaranteed reservation
MemoryMin=256M

# IO weight (relative priority for block IO, 1-10000)
IOWeight=50

# Maximum number of tasks (processes + threads)
TasksMax=256

The Slice Hierarchy

# Default slices:
# system.slice    — all system services (sshd, nginx, postgresql, etc.)
# user.slice      — all user login sessions
# machine.slice   — all VMs and containers (libvirt, podman, nspawn)

# Move a service into a different slice
# (the slice is created automatically if it does not exist):
[Service]
Slice=myapp.slice

# Create a custom slice with limits that apply to everything in it:
# /etc/systemd/system/myapp.slice
[Unit]
Description=myapp services resource partition

[Slice]
MemoryMax=8G
CPUQuota=400%
IOWeight=100

Limiting All VMs to 70% of System Resources

# /etc/systemd/system/machine.slice.d/limits.conf
[Slice]
# All libvirt VMs and containers combined cannot exceed 70% CPU
# On a 10-core machine, 70% of 10 cores is 700%
CPUQuota=700%

# All VMs combined: hard limit 70% of installed RAM
# On a 32GB machine: ~22GB for all VMs, leaving ~10GB for host + ZFS ARC
MemoryMax=22G

# Soft limit: start throttling at 18GB
MemoryHigh=18G

# Apply after saving:
systemctl daemon-reload
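
The numbers in that drop-in follow from simple arithmetic: CPUQuota= counts 100% per core, and MemoryMax= is a share of installed RAM. A small Python sketch of the calculation, using integer math and rounding memory down to whole gigabytes; slice_limits is a hypothetical helper written for this guide.

```python
def slice_limits(cores, ram_gib, percent):
    """Return CPUQuota= and MemoryMax= values for a percentage share of the host."""
    cpu_quota = f"{cores * percent}%"            # 100% equals one full core
    memory_max = f"{ram_gib * percent // 100}G"  # rounded down to whole GiB
    return {"CPUQuota": cpu_quota, "MemoryMax": memory_max}


# The 10-core, 32GB host from the example above, with 70% reserved for VMs:
limits = slice_limits(cores=10, ram_gib=32, percent=70)
```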

Real-Time Monitoring

# Real-time cgroup resource usage (like top, but for cgroups)
systemd-cgtop

# Show cgroup tree with resource consumption
systemctl status machine.slice

# Show resource accounting for a specific service
# (--property takes a comma-separated list; separate words would be read as unit names)
systemctl show myapp.service --property=CPUUsageNSec,MemoryCurrent,TasksCurrent

# Enable per-service cgroup accounting (add to [Service])
# CPUAccounting=true
# MemoryAccounting=true
# IOAccounting=true
# TasksAccounting=true

cgroup resource control is how you prevent one service from consuming all system resources. MemoryMax= is a hard limit — OOM kill if exceeded. MemoryHigh= is a soft limit — the kernel throttles the cgroup when reached. For a kldload KVM host, putting all VMs in machine.slice with a MemoryMax ensures they can never starve the host OS or ZFS ARC. The ZFS ARC competes with everything else for RAM. If a runaway VM consumes all available memory, the ARC shrinks to nearly zero and I/O performance craters. A MemoryMax on machine.slice prevents this. On a 64GB KVM host running kldload, a reasonable allocation is: 20GB reserved for ZFS ARC, 4GB for host OS and services, 40GB for VMs via machine.slice MemoryMax=40G. This is enforced by the kernel and cannot be exceeded no matter what the VMs do.

8. Journal — Structured Logging

journald is the systemd logging daemon. It captures stdout and stderr from every service, kernel messages, audit events, and structured log entries from applications that use the journal API. All logs are stored in a binary indexed format that allows fast filtering without grep.

journalctl Basics

# Follow logs for a specific unit (like tail -f)
journalctl -u nginx.service -f

# Show logs since last boot
journalctl -b

# Show logs from two boots ago
journalctl -b -2

# Show only errors and above (emerg, alert, crit, err)
journalctl -p err

# Combine: nginx errors in the last hour
journalctl -u nginx.service -p err --since "1 hour ago"

# Exact time range
journalctl --since "2026-04-01 03:00:00" --until "2026-04-01 04:00:00"

# Show error-level messages from one unit during this boot
journalctl -b -p err _SYSTEMD_UNIT=myapp.service

# List units that are currently in a failed state
systemctl --failed

# Show kernel messages only
journalctl -k

# Show logs with full metadata
journalctl -o verbose -u myapp.service

Structured Fields

# journald logs are structured key=value records, not flat text.
# Filter on any field:

# By unit name
journalctl _SYSTEMD_UNIT=postgresql.service

# By PID
journalctl _PID=12345

# By transport (journal, syslog, stdout, kernel, audit, driver); stderr is captured under stdout
journalctl _TRANSPORT=kernel

# By priority number (0=emerg, 1=alert, 2=crit, 3=err, 4=warning, 5=notice, 6=info, 7=debug)
journalctl PRIORITY=3

# Combine arbitrary fields
journalctl _SYSTEMD_UNIT=myapp.service PRIORITY=3 _TRANSPORT=stdout

JSON Output

# Full structured output as JSON (one object per line)
journalctl -u myapp.service -o json | jq .

# Pretty-printed JSON (slower, but readable)
journalctl -u myapp.service -o json-pretty | head -100

# Extract specific fields with jq (PRIORITY is a string, so convert it before comparing)
journalctl -u myapp.service -o json \
  | jq -r 'select((.PRIORITY // "7" | tonumber) <= 4) | [.__REALTIME_TIMESTAMP, .MESSAGE] | @tsv'

# Export logs for external analysis
journalctl -u myapp.service --since today -o json > /tmp/myapp-today.json
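
Once the logs are exported as JSON lines, any language can process them. The Python sketch below filters entries by priority and converts __REALTIME_TIMESTAMP, which is microseconds since the Unix epoch, into a readable UTC time. It assumes one JSON object per line as produced by -o json; the function name and the sample entries are made up for this example.

```python
import json
from datetime import datetime, timezone


def warnings_and_worse(lines, max_priority=4):
    """Yield (timestamp, message) for entries at or above the given severity."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # PRIORITY is a string "0".."7"; entries without it are treated as debug.
        if int(entry.get("PRIORITY", "7")) > max_priority:
            continue
        micros = int(entry["__REALTIME_TIMESTAMP"])
        stamp = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
        yield stamp.isoformat(), entry.get("MESSAGE", "")


# Made-up sample entries in the shape journalctl -o json emits:
sample_lines = [
    '{"PRIORITY": "3", "__REALTIME_TIMESTAMP": "1700000000000000", "MESSAGE": "boom"}',
    '{"PRIORITY": "6", "__REALTIME_TIMESTAMP": "1700000001000000", "MESSAGE": "fine"}',
]
errors = list(warnings_and_worse(sample_lines))
```

To run it against a real export, pass an open file object for /tmp/myapp-today.json instead of the sample list.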

Storage Configuration

# /etc/systemd/journald.conf
[Journal]
# Maximum disk space for persistent journal
SystemMaxUse=2G

# Maximum single journal file size
SystemMaxFileSize=128M

# Maximum retention period (delete logs older than this)
MaxRetentionSec=3month

# Compress log data (default: yes)
Compress=yes

# Sync to disk after this long (higher value = better performance, more data loss risk on crash)
SyncIntervalSec=5m

# After editing:
systemctl restart systemd-journald

Log Forwarding

# Forward to the local syslog daemon (rsyslog, syslog-ng), which can relay to a remote server
# /etc/systemd/journald.conf
[Journal]
ForwardToSyslog=yes

# Or install systemd-journal-remote for structured forwarding over HTTPS:
# Sending side:
systemctl enable --now systemd-journal-upload.service

# Receiving side:
systemctl enable --now systemd-journal-remote.service

# Forward to Loki (via promtail or alloy reading the journal files):
# promtail scrape config:
# - job_name: systemd-journal
#   journal:
#     json: false
#     max_age: 12h
#     labels:
#       job: systemd-journal
#     path: /var/log/journal

journald stores logs in a structured binary format, not flat text files. This means you can filter by PID, by unit, by time range, by priority, all without grep. journalctl -u nginx -p err --since "1 hour ago" gives you nginx errors from the last hour. Try doing that with /var/log/nginx/error.log across multiple rotated files. The binary format is also indexed, so filtering on a field stays fast as log volume grows. The tradeoff is that you need journalctl to read the logs, and you need to configure forwarding if you want logs in an external system. On kldload with Prometheus and Grafana, an agent such as promtail or alloy reads the journal directly and ships entries to the Loki log aggregator with the structured metadata intact.

9. Targets and Boot Ordering

Targets are synchronization points — named milestones in the boot sequence that units can declare they want to start before or after. They replaced SysV runlevels with a more flexible dependency-based model. A target can depend on other targets, which depend on other targets, creating a directed graph that systemd traverses in parallel.

Standard Targets

Target	SysV equiv	Description
`poweroff.target`	0	Shut down the system
`rescue.target`	1	Single-user mode, minimal services, root shell
`multi-user.target`	3	Fully operational, multi-user, no GUI
`graphical.target`	5	Multi-user with display manager
`reboot.target`	6	Reboot the system
`emergency.target`	N/A	Minimal environment: root shell with only the root filesystem mounted, read-only. For recovery.
`network-online.target`	N/A	Network is online and configured. Services that need connectivity use both Wants=network-online.target and After=network-online.target.

Custom Application Stack Target

# /etc/systemd/system/myapp-stack.target
[Unit]
Description=myapp application stack
# Pull in all components of the stack
Wants=postgresql.service redis.service myapp.service nginx.service
After=postgresql.service redis.service myapp.service nginx.service

[Install]
WantedBy=multi-user.target

# Enable the whole stack with one command:
systemctl enable --now myapp-stack.target

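# Stop and start the whole stack.
# Stopping a target does not by itself stop the units it pulled in via Wants=.
# This works only if each member unit declares PartOf=myapp-stack.target.
systemctl stop myapp-stack.target
systemctl start myapp-stack.target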
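
Stop and restart requests reach the members of a target only when each member carries PartOf= pointing back at it. The sketch below writes that drop-in for a list of units. The function name write_partof_dropins and the file name 10-stack.conf are hypothetical, and the unit directory is passed as an argument so the helper can be tried against a scratch directory first.

```shell
#!/bin/sh
# Hypothetical helper: write a PartOf= drop-in for every member of a stack target.
#   $1 = unit directory (normally /etc/systemd/system)
#   $2 = target name
#   remaining arguments = member units
write_partof_dropins() {
    unit_dir=$1
    target=$2
    shift 2
    for unit in "$@"; do
        mkdir -p "$unit_dir/$unit.d" || return 1
        # PartOf= propagates stop and restart from the target to this unit
        printf '[Unit]\nPartOf=%s\n' "$target" \
            > "$unit_dir/$unit.d/10-stack.conf" || return 1
    done
}

# Usage on a real host (as root), followed by a reload:
#   write_partof_dropins /etc/systemd/system myapp-stack.target \
#       postgresql.service redis.service myapp.service nginx.service
#   systemctl daemon-reload
```

PartOf= is one-directional: stopping the target stops the members, but stopping a single member leaves the target and its siblings alone.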

Boot Analysis

# Show total boot time breakdown
systemd-analyze

# Show time taken by each unit to start (sorted by duration)
systemd-analyze blame

# Show the critical path: which chain of dependencies determined total boot time
systemd-analyze critical-chain

# Show critical chain for a specific unit
systemd-analyze critical-chain myapp.service

# Generate a full boot timeline as an SVG (open in a browser)
systemd-analyze plot > /tmp/boot-timeline.svg

# Verify unit file syntax without reloading
systemd-analyze verify /etc/systemd/system/myapp.service

# Show security score for all units
systemd-analyze security

systemd-analyze blame shows you which units take the longest to start. On a kldload system, ZFS pool import and DKMS module builds are usually the longest: ZFS import can take 15-30 seconds on large pools with many disks, and DKMS rebuilds on first boot after a kernel update can take 2-5 minutes. systemd-analyze critical-chain shows the dependency chain: which unit waited for which. This is how you diagnose slow boots. If your app takes 90 seconds to start, blame tells you the app itself took 5 seconds, and critical-chain tells you it waited 85 seconds for network-online.target, which waited 80 seconds for NetworkManager to get a DHCP lease. Fix the network, not the app. On kldload, the ZFS import is almost always on the critical chain. This is expected and cannot be eliminated, because the pool must be fully imported before datasets can mount. What you can do is import from the pool cache file (zfs-import-cache.service) instead of scanning every device (zfs-import-scan.service), which usually shortens the import on systems with many disks.
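
To pick out the expensive units from a long blame listing, the durations have to be normalised first, because systemd prints them in mixed units (1min 30.066s, 15.312s, 512ms). A minimal sketch, assuming the usual two-part layout of duration fields followed by the unit name; the function name slow_units is hypothetical and durations expressed in days are not handled.

```shell
#!/bin/sh
# slow_units THRESHOLD_SECONDS
# Reads `systemd-analyze blame` output on stdin and prints "seconds unit"
# for every unit whose start time is at or above the threshold.
slow_units() {
    awk -v min="$1" '
    {
        total = 0
        # Every field except the last is a duration component
        for (i = 1; i < NF; i++) {
            v = $i + 0                          # leading numeric part of the field
            if      ($i ~ /ms$/)  total += v / 1000
            else if ($i ~ /us$/)  total += v / 1000000
            else if ($i ~ /min$/) total += v * 60
            else if ($i ~ /h$/)   total += v * 3600
            else if ($i ~ /s$/)   total += v
        }
        if (total >= min) printf "%.3f %s\n", total, $NF
    }'
}

# Usage:
#   systemd-analyze blame | slow_units 10
```

Treat the result as a list of candidates only: a unit that took a long time may have been running in parallel with everything else, which is what critical-chain is for.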

10. Podman + systemd (Containers as Services)

Podman integrates with systemd natively. Containers run as systemd units, with all the restart policies, resource limits, security hardening, and logging that applies to any other service. There are two approaches: auto-generated units from running containers, and Quadlet — declarative .container files that systemd manages directly.

podman generate systemd

# Start a container, then generate a unit file from it
podman run -d --name myapp -p 8080:8080 myregistry/myapp:latest

# Generate a systemd unit file (this command is deprecated upstream in favor of Quadlet, but still works)
podman generate systemd --new --name myapp > /etc/systemd/system/myapp-container.service

# --new: recreate the container on start instead of reusing the stopped one
# The generated unit wraps podman run with all the flags used when the container was created

# Enable and start
systemctl daemon-reload
systemctl enable --now myapp-container.service

Quadlet — Declarative Container Units

Quadlet is the modern approach. Write a .container file in /etc/containers/systemd/ and Podman's systemd generator turns it into a service unit on every daemon-reload. No manual unit file maintenance.

# /etc/containers/systemd/myapp.container
[Unit]
Description=myapp container
After=network-online.target

[Container]
Image=myregistry/myapp:latest
PublishPort=8080:8080
Volume=/var/lib/myapp:/data:Z
Environment=APP_ENV=production
EnvironmentFile=/etc/myapp/env

# Run as a specific user inside the container
User=app
Group=app

# Health check
HealthCmd=curl -sf http://localhost:8080/health || exit 1
HealthInterval=30s
HealthRetries=3

# Auto-update label
Label=io.containers.autoupdate=registry

[Service]
Restart=always
RestartSec=10s

# Resource limits (all systemd cgroup directives work here)
MemoryMax=2G
CPUQuota=200%

[Install]
WantedBy=multi-user.target

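# After saving, reload systemd so the Quadlet generator runs and creates the unit
systemctl daemon-reload

# The generated unit is named myapp.service.
# Generated units cannot be enabled with systemctl enable; the [Install] section
# in the .container file is what makes the container start at boot.
systemctl start myapp.service
systemctl status myapp.service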

Full Stack with Quadlet

# /etc/containers/systemd/myapp-network.network
[Network]
NetworkName=myapp-net

# /etc/containers/systemd/myapp-db.container
[Container]
Image=docker.io/library/postgres:16
ContainerName=myapp-db
# Reference the Quadlet files by name so the generated service depends on them
Network=myapp-network.network
Volume=myapp-db-data.volume:/var/lib/postgresql/data:Z
Environment=POSTGRES_DB=myapp
EnvironmentFile=/etc/myapp/db.env
HealthCmd=pg_isready -U postgres
HealthInterval=10s

# /etc/containers/systemd/myapp-db-data.volume
[Volume]
VolumeName=myapp-db-data

# /etc/containers/systemd/myapp.container
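[Unit]
# After= only orders the units; Requires= is what pulls the database in
Requires=myapp-db.service
After=myapp-db.service

[Container]
Image=myregistry/myapp:latest
Network=myapp-network.network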
PublishPort=8080:8080
EnvironmentFile=/etc/myapp/env
Environment=DATABASE_URL=postgresql://myapp:secret@myapp-db:5432/myapp

# Quadlet handles all the container lifecycle: create, start, stop, remove
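
Quadlet derives the service name from the file name, and the rule differs by file type, which matters when you write After= lines or call journalctl -u. The sketch below encodes the mapping for the three file types used above; the function name quadlet_service_name is hypothetical, and other Quadlet file types are deliberately left out.

```shell
#!/bin/sh
# Hypothetical helper: print the service unit that Quadlet generates for a file.
#   foo.container -> foo.service
#   foo.volume    -> foo-volume.service
#   foo.network   -> foo-network.service
quadlet_service_name() {
    file=${1##*/}                       # strip any leading directory
    case $file in
        *.container) printf '%s.service\n' "${file%.container}" ;;
        *.volume)    printf '%s-volume.service\n' "${file%.volume}" ;;
        *.network)   printf '%s-network.service\n' "${file%.network}" ;;
        *)           echo "unsupported Quadlet file: $file" >&2; return 1 ;;
    esac
}

# Usage:
#   quadlet_service_name /etc/containers/systemd/myapp-db-data.volume
#
# To see what would be generated without reloading systemd, Podman ships a dry-run mode:
#   /usr/lib/systemd/system-generators/podman-system-generator --dryrun
```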

Auto-Update

# Enable auto-update for all containers labelled with io.containers.autoupdate=registry
systemctl enable --now podman-auto-update.timer

# Manual trigger
podman auto-update

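# The timer runs daily by default. Customize:
systemctl edit podman-auto-update.timer
# [Timer]
# An empty OnCalendar= clears the default schedule; without it the new time is added to it
# OnCalendar=
# OnCalendar=*-*-* 03:00:00
# RandomizedDelaySec=30min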

Quadlet fills the role for Podman that docker-compose plays for Docker. Instead of a YAML file handed to a separate tool that talks to the Docker daemon, you write .container files in /etc/containers/systemd/ and systemd manages them directly. Each container is a native systemd unit: dependencies, health checks, restart policies, and resource limits all work the same as for any other service. This is the most "Linux-native" way to run containers. No container daemon, no Docker Desktop, no compose runtime to maintain. Just systemd doing what it already does, with containers as the workload. On kldload, this integrates with the same resource accounting, security hardening, journal logging, and timer scheduling as everything else. You get journalctl -u myapp.service, systemctl status myapp.service, and systemd-analyze security myapp.service for free.

11. Fleet Management Patterns

At scale, systemd unit files are configuration. They live in version control, are deployed by Salt or Ansible, and follow predictable patterns. Template units multiply one unit file into many instances. Drop-ins layer site-specific configuration over common vendor defaults.

Deploying Unit Files with Ansible

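# In your Ansible playbook (hosts: all is a placeholder for your inventory group):
- hosts: all
  become: true
  tasks:
    - name: Deploy myapp service unit
      ansible.builtin.copy:
        src: files/myapp.service
        dest: /etc/systemd/system/myapp.service
        mode: '0644'
      notify:
        - reload systemd
        - restart myapp

    # copy does not create missing parent directories
    - name: Create the drop-in directory
      ansible.builtin.file:
        path: /etc/systemd/system/myapp.service.d
        state: directory
        mode: '0755'

    - name: Deploy myapp hardening drop-in
      ansible.builtin.copy:
        src: files/myapp-hardened.conf
        dest: /etc/systemd/system/myapp.service.d/hardened.conf
        mode: '0644'
      notify:
        - reload systemd
        - restart myapp

  # Handlers run in the order they are defined: reload first, then restart
  handlers:
    - name: reload systemd
      ansible.builtin.systemd:
        daemon_reload: true

    - name: restart myapp
      ansible.builtin.systemd:
        name: myapp.service
        state: restarted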

Template Units

A template unit has @ in its name before the suffix: wg-quick@.service. When you enable or start wg-quick@wg0.service, the %i specifier in the unit file is replaced with wg0. One unit file, infinite instances.

# Example: wg-quick@.service template (shipped with wireguard-tools)
[Unit]
Description=WireGuard via wg-quick(8) for %I
After=network-online.target nss-lookup.target
Wants=network-online.target nss-lookup.target
Documentation=man:wg-quick(8) man:wg(8)

[Service]
Type=oneshot
RemainAfterExit=yes
# %i expands to the instance name: wg0, wg1, etc.
# (unit files do not support trailing comments on a directive line)
ExecStart=/usr/bin/wg-quick up %i
ExecStop=/usr/bin/wg-quick down %i

[Install]
WantedBy=multi-user.target

# Enable three WireGuard interfaces from one template:
systemctl enable --now wg-quick@wg0.service   # /etc/wireguard/wg0.conf
systemctl enable --now wg-quick@wg1.service   # /etc/wireguard/wg1.conf
systemctl enable --now wg-quick@vpn-client.service  # /etc/wireguard/vpn-client.conf

Template for Containerized Services

# /etc/systemd/system/app-container@.service
# Run any application as a container by enabling app-container@appname.service
# Expects: /etc/containers/app-configs/%i.env and image name in that file

[Unit]
Description=Container service for %I
After=network-online.target
Wants=network-online.target

[Service]
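# With --sdnotify=conmon, readiness is reported as soon as the container has started.
# Use --sdnotify=container only if the application in the image sends READY=1 itself;
# otherwise the unit stays in "activating" until the start timeout expires.
Type=notify
NotifyAccess=all
EnvironmentFile=/etc/containers/app-configs/%i.env
# IMAGE, PORT, and DATA_DIR come from the env file
ExecStartPre=-/usr/bin/podman pull ${IMAGE}
ExecStart=/usr/bin/podman run \
    --rm \
    --name %i \
    --sdnotify=conmon \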
    -p ${PORT}:${PORT} \
    -v /var/lib/containers/%i:${DATA_DIR}:Z \
    --env-file /etc/containers/app-configs/%i.env \
    ${IMAGE}
ExecStop=/usr/bin/podman stop %i
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target

# /etc/containers/app-configs/grafana.env
IMAGE=docker.io/grafana/grafana:latest
PORT=3000
DATA_DIR=/var/lib/grafana

# Enable:
systemctl enable --now app-container@grafana.service
systemctl enable --now app-container@prometheus.service
systemctl enable --now app-container@loki.service
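
Each instance of this template fails at start if its env file is missing one of the three variables the unit expands. A small pre-flight check, sketched under the assumption that the file uses plain KEY=value lines as in the grafana.env example; the function name check_app_env is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-flight check for an app-container@ instance env file.
# Succeeds only if IMAGE, PORT, and DATA_DIR are all set to non-empty values.
check_app_env() {
    env_file=$1
    missing=0
    for key in IMAGE PORT DATA_DIR; do
        if ! grep -Eq "^${key}=.+" "$env_file" 2>/dev/null; then
            echo "missing $key in $env_file" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_app_env /etc/containers/app-configs/grafana.env \
#       && systemctl enable --now app-container@grafana.service
```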

Template Specifiers Reference

Specifier	Expands to
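`%i`	Instance name (the part after @), escaped exactly as it appears in the unit name
`%I`	Instance name, unescaped ("-" becomes "/"); intended for paths and display
`%n`	Full unit name including instance
`%H`	Hostname
`%m`	Machine ID
`%u`	User running the service manager (root for system units); not affected by User=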

Template units are underused. wg-quick@.service is a template — the part after @ becomes the instance name, passed as %i. You enable wg-quick@wg0, wg-quick@wg1, wg-quick@wg2 — three instances of one template, each reading a different config file. Use this pattern for anything that runs multiple identical instances with different configs: WireGuard interfaces, PHP-FPM pools, backup jobs for different datasets, per-customer application instances. The instance name becomes a variable that flows through the entire unit. You can name logfiles after it, config directories after it, socket paths after it. One unit file replaces dozens of nearly-identical service files. On a kldload fleet managed by Salt, you push one template unit to every node and then enable whichever instances each node needs via the Salt state. The unit file never changes; only the enabled instances differ per host.
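
When scripting around template units, for example to find the config file that belongs to an instance, it helps to split a unit name the same way systemd does. A minimal sketch using only shell parameter expansion; the function names unit_prefix and unit_instance are hypothetical, and the instance is returned in its escaped form, which corresponds to %i.

```shell
#!/bin/sh
# Split a template instance name such as wg-quick@wg0.service.

# Part before the @ (what systemd exposes as %p)
unit_prefix() {
    name=${1%.*}                 # drop the type suffix (.service, .timer, ...)
    printf '%s\n' "${name%%@*}"
}

# Part between the @ and the type suffix (what systemd exposes as %i)
unit_instance() {
    name=${1%.*}
    case $name in
        *@?*) printf '%s\n' "${name#*@}" ;;
        *)    return 1 ;;        # not an instantiated template unit
    esac
}

# Usage:
#   conf="/etc/wireguard/$(unit_instance wg-quick@wg0.service).conf"
```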

12. Troubleshooting

First Checks

# Show the status of a unit with recent log lines
systemctl status myapp.service

# Show all failed units
systemctl --failed

# Show the dependency tree for a unit
systemctl list-dependencies myapp.service

# Show units that are waiting (in activating state)
systemctl list-units --state=activating

# Show units that are masked (deliberately disabled, cannot be started)
systemctl list-unit-files --state=masked

Log Investigation

# Last N lines of logs with context (the -x adds catalog explanations)
journalctl -xe -u myapp.service

# Logs from the current boot at priority warning or higher
journalctl -b -p warning -u myapp.service

# All logs for a unit, most recent first
journalctl -u myapp.service -r | head -50

# Logs across multiple units (show dependency failures together)
journalctl -b -u myapp.service -u postgresql.service -u redis.service

Unit File Validation

# Check unit file syntax without reloading
systemd-analyze verify /etc/systemd/system/myapp.service

# Show the unit file followed by every drop-in, in the order systemd applies them
systemctl cat myapp.service

# Show all properties of a unit (the complete runtime state)
systemctl show myapp.service

# Show a specific property (separate multiple properties with commas)
systemctl show myapp.service --property=Restart
systemctl show myapp.service --property=ActiveState,SubState

Common Problems and Fixes

ExecStart path not absolute

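Use absolute paths in ExecStart. Older systemd releases reject a relative command outright, and newer ones resolve a bare command name only against a fixed built-in search path, not your shell's PATH, so ExecStart=myapp can fail with "Failed to locate executable". Arguments such as script paths are never searched for. Use ExecStart=/usr/bin/myapp, or find the path with which myapp.

# Wrong: ExecStart=python3 app.py
# Right: ExecStart=/usr/bin/python3 /opt/myapp/app.py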
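
A quick way to audit existing unit files for this problem is to scan the Exec lines for commands that do not start with a slash. The sketch below is an approximation: the function name relative_exec_lines is hypothetical, it strips the special prefixes (-, @, :, +, !) before checking, and it does not follow line continuations or expand variables.

```shell
#!/bin/sh
# Hypothetical audit helper: print every Exec* line whose command is not an
# absolute path, as "file:line: directive".
relative_exec_lines() {
    awk '
    /^Exec(Start|StartPre|StartPost|Stop|StopPost|Reload)=/ {
        cmd = $0
        sub(/^[^=]*=/, "", cmd)       # drop the directive name and "="
        sub(/^[-@:+!]+/, "", cmd)     # drop special executable prefixes
        # An empty value resets the list and is fine
        if (cmd != "" && cmd !~ /^\//) print FILENAME ":" FNR ": " $0
    }' "$@"
}

# Usage:
#   relative_exec_lines /etc/systemd/system/*.service
```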

Type=forking but process doesn't fork

If you declare Type=forking but the process stays in the foreground (most modern daemons do), systemd keeps waiting for the original process to exit. When TimeoutStartSec expires (90 seconds by default), it marks the start as failed and kills the process. Switch to Type=simple or Type=exec for foreground processes.

# Most modern apps: Type=simple or Type=notify
# Old-style daemons only: Type=forking + PIDFile=

Dependency cycle

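A is ordered after B, and B is ordered after A. systemd detects the ordering cycle when it builds the start transaction and breaks it by deleting one of the jobs, so one of the units is skipped, often without an obvious error. Use systemd-analyze verify to catch cycles before they cause problems. Check for circular After= / Before= pairs, including ones introduced indirectly through targets.

# systemd-analyze verify catches this at syntax check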
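
The idea behind cycle detection can be tried outside systemd: an ordering cycle is exactly the case where no topological order exists. The sketch below leans on tsort for that, assuming the GNU coreutils version, which exits non-zero when its input contains a loop. The function name has_ordering_cycle and the two-column input format are hypothetical; this illustrates the concept and is not how systemd itself does it.

```shell
#!/bin/sh
# Hypothetical check: stdin holds lines "UNIT DEP", meaning UNIT has After=DEP.
# Returns 0 if the ordering contains a cycle, 1 if a valid start order exists.
has_ordering_cycle() {
    # tsort wants "first second" pairs, so the dependency goes first
    if awk 'NF == 2 { print $2, $1 }' | tsort >/dev/null 2>&1; then
        return 1    # tsort produced an order: no cycle
    fi
    return 0        # tsort reported a loop (or is not installed)
}

# Usage:
#   printf '%s\n' 'myapp.service postgresql.service' \
#                 'postgresql.service myapp.service' | has_ordering_cycle \
#       && echo "ordering cycle"
```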

Unit masked

A masked unit is symlinked to /dev/null and cannot be started by anything, including dependencies. Run systemctl unmask myapp.service to unmask it. Masking is intentional — check why it was masked before unmasking.

systemctl unmask myapp.service

StartLimitBurst exceeded

The service restarted too many times and hit the rate limit. It stops retrying and stays in a failed state. Check the logs, fix the underlying problem, then: systemctl reset-failed myapp.service && systemctl start myapp.service.

systemctl reset-failed myapp.service

daemon-reload not run after editing

Editing a unit file on disk does not take effect until systemd reloads its configuration. Always run systemctl daemon-reload after any unit file change. Then restart the affected service to apply changes.

systemctl daemon-reload && systemctl restart myapp

The Nuclear Options

# Restart systemd itself without rebooting
# This re-executes the init process, reloading its binary and all unit configurations
# Running services are NOT stopped — only systemd itself is restarted
systemctl daemon-reexec

# Reload all unit files (less drastic than daemon-reexec)
systemctl daemon-reload

# Emergency: if the system is in a boot loop and you cannot log in
# Append to kernel command line at GRUB:
# systemd.unit=rescue.target
# or for absolute minimal environment:
# systemd.unit=emergency.target

# Debug boot: enable verbose logging
# Append to kernel command line:
# systemd.log_level=debug systemd.log_target=console

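# Show jobs that are still queued or running (useful while boot is in progress)
systemctl list-jobs

# Scan the status output of every unit for failures
systemctl status --all 2>&1 | grep -E "failed|error|warning"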

The full picture: every process on a kldload system is a leaf node in systemd's dependency tree. The boot sequence is a directed acyclic graph that systemd traverses in parallel, respecting ordering constraints. Services declare what they need (Requires=, Wants=) and when they start (After=). systemd computes the correct order, starts everything in parallel where possible, and provides structured logs, resource accounting, and security sandboxing for every unit — vendor-shipped or custom.

The investment in learning systemd properly pays off every time you need to understand why a service is not starting, why a server is slow to boot, why a job did not run, or why a service is consuming more memory than expected. One tool, one log store, one dependency model, one security framework. The complexity is real — but it is organized complexity with excellent tooling, and it is the same on every Linux distribution kldload supports.

KVM Virtual Machines — libvirtd as a systemd service, VM autostart via systemd
Docker on ZFS — dockerd as a systemd service, socket activation
WireGuard Basics — wg-quick@ template unit in action
WireGuard Masterclass — advanced WireGuard managed by systemd
Snapshots & Replication — sanoid and syncoid timer units
Monitoring Stack — Prometheus, Grafana, and Loki as systemd services
Cloud & Packer — systemd in golden image workflows
Automation — Salt and Ansible deploying unit files to a fleet
