Why a server is slow when CPU and RAM are idle

WILD

If monitoring shows that the CPU is idle most of the time and RAM usage stays within reasonable limits, the bottleneck is usually elsewhere. This article covers the most common causes of Linux server sluggishness in such situations and how to check whether they apply to your system.

Disk subsystem and I/O

In practice, the most frequent cause of latency with a free CPU is disk I/O. One of the key indicators here is high iowait: the processor is idle, but only because tasks are waiting for read or write operations to complete. If iostat -x (the %iowait column) or vmstat (the wa column) reports a sustained value above roughly 10–15%, this is a strong sign of I/O-related issues.

iostat -x -z 1 10

Most often, the bottleneck is the slowest element in the I/O chain. If the server uses multiple storage devices (for example, an HDD-based RAID array alongside SSDs or NVMe), the HDD array will limit every workload that reads from or writes to it.

Another clear symptom is processes stuck in state D (uninterruptible sleep). Such processes do not respond to signals until the I/O operation completes. The load average rises in this situation even though the CPU is idle, because Linux counts tasks in uninterruptible sleep toward the load average. You can check this with top -H (to display threads) and sort by the S column. If you see many threads stuck in state D, this almost always points to disk or remote filesystem problems.

top -H
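
For a non-interactive check, the same state information is available from ps. The following is a minimal sketch, assuming a procps-style ps; the function name d_state_tasks is only illustrative.

```shell
# List tasks currently in uninterruptible sleep (state D).
# Reads "STATE PID COMMAND" lines on stdin so it can be fed from ps.
# d_state_tasks is an illustrative name, not a standard tool.
d_state_tasks() {
    awk '$1 ~ /^D/ { print }'
}

# Typical use:
#   ps -eo state,pid,comm | d_state_tasks
```

Running it a few times in a row helps distinguish tasks that are briefly in state D (normal) from tasks that stay there.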

I/O queues and schedulers also matter. Storage controllers maintain request queues, and if too many read/write operations accumulate, even fast SSDs can start to lag. This is typically visible as high await values (r_await/w_await) and long queue lengths (aqu-sz, or avgqu-sz in older sysstat versions) in iostat -x, and as a non-zero b (blocked) column in vmstat.
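
If sysstat is not installed, a rough view of queued requests can be taken directly from /proc/diskstats. This is a sketch rather than a replacement for iostat: it shows a single snapshot, and the function name inflight_io is an assumption made for the example.

```shell
# Print block devices that currently have requests in flight.
# Field 12 of /proc/diskstats is the number of I/Os in progress.
# inflight_io is an illustrative name; the optional argument lets
# the function read a saved copy instead of the live file.
inflight_io() {
    awk '$12 > 0 { printf "%s %s\n", $3, $12 }' "${1:-/proc/diskstats}"
}

# Typical use:
#   inflight_io
```

A device that shows a persistently high in-flight count across repeated runs is a candidate for closer inspection with iostat -x.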

If the issue is confirmed, you can move critical workloads to faster storage or change the I/O scheduler. For SSDs and NVMe devices, mq-deadline or none is usually a good choice. Critical volumes should reside on SSD/NVMe, while background jobs can be placed on a separate disk or even a different server. Also consider running background jobs under ionice so that their writes do not compete with latency-sensitive workloads; note that I/O priorities only take effect with a scheduler that implements them, such as BFQ, and are ignored with none.
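
The active scheduler is exposed per device in sysfs, with the current choice shown in square brackets. The sketch below extracts it; active_scheduler is an illustrative helper name, and sda and the backup script path are placeholders.

```shell
# Extract the active I/O scheduler from a queue/scheduler line,
# where the kernel marks the active one with square brackets,
# e.g. "mq-deadline kyber [none]".
# active_scheduler is an illustrative helper name.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Typical use (sda and the script path are placeholders):
#   active_scheduler < /sys/block/sda/queue/scheduler
#   echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
#   ionice -c 3 /path/to/backup.sh    # idle class for a background job
```

A scheduler changed this way lasts until reboot; making it permanent is usually done with a udev rule or a kernel parameter, depending on the distribution.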

Swap and memory policy

Even when there appears to be plenty of RAM available, swap usage can still be the root of the problem. Check the kernel parameter vm.swappiness (via sysctl or /proc); in most distributions, it defaults to 60. With this setting, the kernel is fairly willing to swap out anonymous pages when it reclaims memory, for example to make room for a growing page cache, well before RAM is exhausted by applications. This can significantly degrade performance.

cat /proc/sys/vm/swappiness

This issue often surfaces during large background operations, especially on newer Linux kernels where memory behavior has become more aggressive. For example, during large-scale backups to fast NVMe storage, the kernel may cache large amounts of data, while some memory pages get swapped out to a slow HDD, causing noticeable slowdowns.

If the problem is detected, first make sure there is no “dead” swap partition located on slow storage. Second, tune vm.swappiness to reduce unnecessary swapping. For server environments, values in the range of 1–20 are commonly used. A value set with sysctl -w lasts only until reboot; to make it permanent, add it to a file under /etc/sysctl.d/.

sudo sysctl -w vm.swappiness=10

Note the trade-off: a lower value makes the kernel keep application (anonymous) memory in RAM and reclaim page cache first, so file caching becomes less effective. Fully disabling swap is generally not recommended.
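
To see whether swapping actually affects the processes you care about, the per-process VmSwap value in /proc can be listed. A minimal sketch, assuming the usual /proc/<pid>/status layout; swap_by_process is an illustrative name.

```shell
# Show which processes have pages in swap, largest first.
# VmSwap in /proc/<pid>/status is reported in kB.
# swap_by_process is an illustrative name; the optional argument is
# the proc root, which defaults to /proc.
swap_by_process() {
    root=${1:-/proc}
    for f in "$root"/[0-9]*/status; do
        [ -r "$f" ] || continue
        awk '/^Name:/ { name = $2 }
             /^VmSwap:/ && $2 > 0 { printf "%s kB %s\n", $2, name }' "$f" 2>/dev/null
    done | sort -rn
}

# Typical use:
#   swap_by_process | head
```

Swapped-out pages alone are not a problem; they matter when the si/so columns in vmstat show them being moved in and out continuously.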

Also keep zswap in mind: it keeps a compressed cache of swapped-out pages in RAM and writes them to the swap device only when that cache fills up. You can check whether it is enabled with:

cat /sys/module/zswap/parameters/enabled

A value of Y means it is enabled, N means disabled. Be aware that compression itself consumes CPU cycles.

Virtualization and CPU steal time

Even if CPU usage inside the guest OS is low, latency can still be introduced by the hypervisor. This can be observed via the %st (steal) column in top, which shows how much time the virtual machine was ready to run but the hypervisor scheduled another guest instead.

If you see nearly 100% idle CPU but a non-zero (especially double-digit) %st, this indicates excessive load on the host. In such cases, the issue is outside your control: either host oversubscription or a “noisy neighbor” is consuming shared resources.

The first step is to capture metrics. Check %st in top, vmstat 1, or mpstat and confirm that steal time increases while application load and CPU user/system time remain low. If that is the case, the only viable solution is to bring these numbers to your provider.
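
When you need a number to attach to a support ticket, steal time can also be computed from two samples of /proc/stat. The sketch below assumes the standard field order of the aggregate cpu line; steal_pct is an illustrative name.

```shell
# Compute steal time as a percentage of total CPU time between two
# samples of the aggregate "cpu" line from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal ...
# steal_pct is an illustrative name; it reads exactly two lines on stdin.
steal_pct() {
    awk '{
             total = 0
             for (i = 2; i <= 9; i++) total += $i
             t[NR] = total; s[NR] = $9
         }
         END {
             dt = t[2] - t[1]
             if (dt > 0) printf "%.1f\n", 100 * (s[2] - s[1]) / dt
         }'
}

# Typical use, sampling five seconds apart:
#   { grep '^cpu ' /proc/stat; sleep 5; grep '^cpu ' /proc/stat; } | steal_pct
```

Logging this value together with a timestamp over several hours makes it easier to show the provider that steal correlates with the slow periods.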

If the provider exposes hypervisor-level metrics, also compare physical core utilization. A reasonable threshold for escalation is sustained steal time above ~10% that coincides with increased latency and timeouts in application or APM logs.

Network subsystem

Heavy network activity (database clusters, REST APIs, bulk data transfers) can also manifest as system “freezes.” Link saturation or firewall overload leads to queues building up in the network stack.

Drop and retransmission counters can be examined via netstat -s, while iftop shows which connections are consuming bandwidth. If the network link is fully saturated, even simple SSH or HTTP requests will lag, despite an apparently idle CPU.

netstat -s | less

or

sudo iftop -i eth0
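
The TCP counters behind netstat -s are also available in /proc/net/snmp, which makes it easy to turn them into a single retransmission ratio. This is a sketch under the assumption that the file has the usual pair of Tcp: header and value lines; tcp_retrans_pct is an illustrative name.

```shell
# Print retransmitted TCP segments as a percentage of all sent segments.
# /proc/net/snmp holds a "Tcp:" header line followed by a "Tcp:" value
# line; columns are located by header name rather than by position.
# tcp_retrans_pct is an illustrative name.
tcp_retrans_pct() {
    awk '$1 == "Tcp:" && !seen {            # first Tcp: line is the header
             for (i = 2; i <= NF; i++) col[$i] = i
             seen = 1; next
         }
         $1 == "Tcp:" {                     # second Tcp: line holds values
             sent = $(col["OutSegs"]); re = $(col["RetransSegs"])
             if (sent > 0) printf "%.2f\n", 100 * re / sent
         }' "${1:-/proc/net/snmp}"
}

# Typical use:
#   tcp_retrans_pct
```

The counters are cumulative since boot, so the result is a long-term average; comparing two readings taken a few minutes apart gives a better picture of the current state.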

Packet loss is another common cause. Processes wait for acknowledgments during transmission, even though CPU remains idle. You can inspect error and drop counters here:

cat /proc/net/dev

or

ethtool -S eth0 | grep -Ei 'drop|err|fail|miss|over|timeout|abort|reset'
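
Because /proc/net/dev is a wide table, it can help to print only the interfaces with non-zero error or drop counters. A minimal sketch, assuming the standard layout of eight receive fields followed by eight transmit fields; nic_errors is an illustrative name.

```shell
# Print interfaces whose error or drop counters are non-zero.
# After the interface name, /proc/net/dev has 8 receive fields
# followed by 8 transmit fields; errs and drop are the 3rd and 4th
# in each group. nic_errors is an illustrative name.
nic_errors() {
    awk 'NR > 2 {
             sub(/:/, " ")                  # "eth0:123" -> "eth0 123"
             if ($4 + $5 + $12 + $13 > 0)
                 printf "%s rx_errs=%s rx_drop=%s tx_errs=%s tx_drop=%s\n",
                        $1, $4, $5, $12, $13
         }' "${1:-/proc/net/dev}"
}

# Typical use:
#   nic_errors
```

As with other counters in /proc, what matters is whether the values keep growing between two runs, not their absolute size.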

Another subtle issue is excessive interrupts from network interfaces. If all interrupts are handled by a single CPU core (for example, due to missing IRQ balancing), other tasks may starve. Check /proc/interrupts: if one NIC or disk generates most interrupts on one core, consider enabling irqbalance or manually tuning IRQ affinity.

Virtual network adapters can also introduce small but noticeable latency, especially in transaction-heavy workloads. If a VM has multiple interfaces, check statistics for each one (ip -s link, ethtool -S).

Scheduling, CPU frequencies, and interrupts

There are also internal OS-level factors that affect responsiveness. One of them is CPU frequency management. If the processor runs in an aggressive power-saving mode, sudden load spikes may incur latency while the frequency ramps up. For latency-sensitive production systems, the performance governor is often used to avoid such delays. In some cases, frequencies are even pinned manually, though this is a more specialized scenario.
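
The governor currently in use can be read from sysfs. The sketch below summarizes it across all CPUs; governor_summary is an illustrative name, and the cpupower command in the comment assumes that tool is installed.

```shell
# Count how many CPUs use each frequency governor.
# governor_summary is an illustrative name; the optional argument is
# the sysfs CPU directory, which defaults to /sys/devices/system/cpu.
governor_summary() {
    root=${1:-/sys/devices/system/cpu}
    cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Typical use:
#   governor_summary
# To switch governor (assumes the cpupower tool is installed):
#   sudo cpupower frequency-set -g performance
```

On many virtual machines the cpufreq directory does not exist, in which case the function prints nothing and frequency scaling is handled by the host.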

Interrupt handling is another area to watch. If a network card or storage controller generates too many interrupts and they are processed by a single core, overall performance suffers. Again, /proc/interrupts helps identify such imbalances.

head -n 40 /proc/interrupts
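
On machines with many cores the raw table is hard to read, so summing the counters per CPU makes an imbalance easier to spot. A minimal sketch, assuming the usual /proc/interrupts layout with a CPU header row; irq_totals_per_cpu is an illustrative name.

```shell
# Sum interrupt counts per CPU column of /proc/interrupts.
# irq_totals_per_cpu is an illustrative name; the optional argument
# is a file to read instead of the live table.
irq_totals_per_cpu() {
    awk 'NR == 1 {                          # header: CPU0 CPU1 ...
             ncpu = NF
             for (i = 1; i <= NF; i++) name[i] = $i
             next
         }
         {
             for (i = 2; i <= ncpu + 1 && i <= NF; i++) {
                 if ($i !~ /^[0-9]+$/) break
                 sum[i - 1] += $i
             }
         }
         END { for (c = 1; c <= ncpu; c++) printf "%s %.0f\n", name[c], sum[c] }
        ' "${1:-/proc/interrupts}"
}

# Typical use:
#   irq_totals_per_cpu
```

If one CPU carries most of the total, look at the individual rows to find which device is responsible.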

In simple cases, enabling irqbalance is sufficient:

sudo systemctl enable --now irqbalance
systemctl status irqbalance --no-pager

Under heavy load or strict requirements, IRQ affinity is usually configured manually to distribute devices across cores.
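
Manual affinity is set through /proc/irq/<N>/smp_affinity_list, which accepts plain CPU numbers. The sketch below finds the IRQ numbers of a device so they can be spread over cores; irqs_for is an illustrative name, eth0 is a placeholder, and the loop in the comment assumes there are at least as many CPUs as queues.

```shell
# Print the IRQ numbers whose /proc/interrupts row mentions a device name.
# irqs_for is an illustrative name; the optional second argument is
# a file to read instead of /proc/interrupts.
irqs_for() {
    awk -v dev="$1" 'NR > 1 && index($0, dev) { sub(/:$/, "", $1); print $1 }' \
        "${2:-/proc/interrupts}"
}

# Example: give each queue of eth0 (placeholder) its own CPU.
#   cpu=0
#   for irq in $(irqs_for eth0); do
#       echo "$cpu" | sudo tee "/proc/irq/$irq/smp_affinity_list"
#       cpu=$((cpu + 1))
#   done
```

Keep in mind that a running irqbalance may move these interrupts again, so manual pinning is normally combined with stopping irqbalance or excluding the affected IRQs from it.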

Locks, contention, and kernel issues

At the application or driver level, lock contention (mutexes, spinlocks) can cause threads to idle while waiting, even though CPU resources are available. This is typically diagnosed with perf, increased context switch rates in vmstat, and characteristic futex or spin_lock calls in profiles. Fixes involve architectural changes, limiting parallelism, or updating problematic components.

vmstat 1 10
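
To compare context switch rates before and after a change, the cs column can be averaged over a run. This is a sketch that locates the column by its header name; avg_context_switches is an illustrative name.

```shell
# Average the context-switch rate ("cs" column) from vmstat output.
# The column is located by its header name rather than by position.
# The first data row is skipped because vmstat reports averages
# since boot there. avg_context_switches is an illustrative name.
avg_context_switches() {
    awk '!col { for (i = 1; i <= NF; i++) if ($i == "cs") col = i; next }
         $col ~ /^[0-9]+$/ { if (seen++) { sum += $col; n++ } }
         END { if (n) printf "%.0f\n", sum / n }'
}

# Typical use:
#   vmstat 1 10 | avg_context_switches
```

There is no universal threshold for this number; it is most useful as a baseline to compare against when latency changes.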

Outdated or poorly written drivers for storage, networking, or chipsets can also introduce latency. In such cases, updating drivers and the kernel often helps.

OS-specific behavior and observability

Beyond obvious bottlenecks, each OS has its own quirks. Linux and FreeBSD make heavy use of the page cache, and under intensive write workloads the kernel may spend significant time flushing dirty pages (writeback).
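
On Linux, the amount of data waiting for writeback is visible in /proc/meminfo. A minimal sketch; dirty_pages is an illustrative name, and the sysctl names in the comment are the standard thresholds that control when flushing starts.

```shell
# Show how much memory is dirty or currently being written back.
# dirty_pages is an illustrative name; the optional argument is a
# file to read instead of /proc/meminfo.
dirty_pages() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2, $3 }' "${1:-/proc/meminfo}"
}

# Typical use:
#   dirty_pages
# Related thresholds:
#   sysctl vm.dirty_background_ratio vm.dirty_ratio
```

A Dirty value that grows to hundreds of megabytes and then drops sharply during a stall suggests that writeback bursts are involved.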


Delays can also stem from auxiliary mechanisms. For example, overly frequent polling of /proc by monitoring tools can generate noticeable overhead. Network filesystems (NFS/SMB) and rare I/O error conditions may also lead to unpredictable stalls.

For this reason, it is always worth checking dmesg and system logs, where disk timeouts, network driver errors, and other hardware-related issues are often recorded.

dmesg -T | tail -n 200

and

journalctl -b -p err..alert --no-pager
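
Kernel logs are noisy, so a simple filter for stall-related messages can save time. The pattern list below is a starting point chosen for this example, not an exhaustive set, and kernel_io_errors is an illustrative name.

```shell
# Keep only kernel log lines that commonly accompany I/O stalls.
# The patterns are an assumed starting set and will need tuning;
# kernel_io_errors is an illustrative name.
kernel_io_errors() {
    grep -Ei 'i/o error|timed out|timeout|blocked for more than|link is down|reset'
}

# Typical use:
#   dmesg -T | kernel_io_errors
#   journalctl -k -b --no-pager | kernel_io_errors
```

Messages of the form "blocked for more than N seconds" come from the kernel's hung task detector and usually line up with the D-state processes described earlier.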

Finally, collect metrics regularly. Run iostat 1, sar, vmstat, or integrate APM and tracing. Open-source tools worth mentioning include distributed tracing systems like Jaeger (despite its usability challenges) or Zipkin; the OpenTelemetry Collector for metrics, logs, and traces; Grafana Tempo for cost-efficient long-term trace storage; Apache SkyWalking as a full-featured APM platform; and SigNoz as an open-source alternative to commercial APM solutions.