In this guide, our goal is to learn about the tools and environment provided by a typical GNU/Linux system to be able to start troubleshooting even on an unknown machine.
In this tutorial you will learn:
- How to check disk space
- How to check memory size
- How to check system load
- How to find and kill system processes
- How to use logs to find relevant system troubleshooting information
Software Requirements and Conventions Used
|Category||Requirements, Conventions or Software Version Used|
|System||Ubuntu 20.04, Fedora 31|
|Other||Privileged access to your Linux system as root or via the |
|Conventions|| # - requires given linux commands to be executed with root privileges either directly as a root user or by use of |
While GNU/Linux is well-known for its stability and robustness, there are cases where something can go wrong. The source of the problem may be internal or external. For example, there can be a malfunctioning process running on the system that eats up resources, or an old hard drive may be faulty, resulting in reported I/O errors.
In any case, we need to know where to look and what to do to get information about the situation, and this guide tries to provide just that: a general way of finding out what went wrong. Resolving any problem begins with noticing the issue, gathering the details, finding the root cause, and then fixing it. As with any task, GNU/Linux provides countless tools to help, and troubleshooting is no exception. The following tips and methods are just a few common ones that can be used on many distributions and versions.
Suppose we have a nice laptop that we work on. It is running the latest Ubuntu, CentOS or Red Hat Linux, with updates always in place to keep everything fresh. The laptop is for everyday general use: we process emails, chat, browse the Internet, maybe produce some spreadsheets on it, etc. Nothing special is installed: an office suite, a browser, an email client, and so on. From one day to the next, the machine suddenly becomes extremely slow. We have already been working on it for about an hour, so it's not a problem that appears right after boot. What's happening...?
Checking system resources
GNU/Linux does not become slow without a reason, and it will most likely tell us where it hurts, as long as it is able to answer. As with any program running on a computer, the operating system uses system resources, and when those run thin, operations have to wait until there is enough of them to proceed. This causes responses to get slower and slower, so if there is a problem, it is always useful to check the state of the system resources. In general our (local) system resources consist of disk, memory, and CPU. Let's check all of them.
If the running operating system is out of disk space, that's bad news. Services that can't write their logfiles will mostly crash if they are running, or will not start if the disks are already full. Apart from logfiles, sockets and PID (Process IDentifier) files also need to be written to disk, and while these are small, they can't be created if there is absolutely no space left.
To check available disk space we can use df in the terminal, adding the -h argument to see the sizes in human-readable units such as megabytes and gigabytes. The entries of interest are volumes with a Use% of 100%, which means the volume in question is full. The following example output shows we are fine regarding disk space:
$ df -h
Filesystem           Size  Used Avail Use% Mounted on
devtmpfs             1.8G     0  1.8G   0% /dev
tmpfs                1.8G     0  1.8G   0% /dev/shm
tmpfs                1.8G  1.3M  1.8G   1% /run
/dev/mapper/lv-root   49G   11G   36G  24% /
tmpfs                1.8G     0  1.8G   0% /tmp
/dev/sda2            976M  261M  649M  29% /boot
/dev/mapper/lv-home  173G   18G  147G  11% /home
tmpfs                361M  4.0K  361M   1% /run/user/1000
So we have space on the disk(s). Note that in our case of the slow laptop, disk space exhaustion isn't likely to be the root cause: when disks are full, programs crash or do not start at all. In extreme cases, even login will fail after boot.
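Spotting nearly full volumes can also be scripted. The following is a minimal sketch rather than a definitive tool: it assumes the POSIX output format of df -P, and the function name full_filesystems and the default threshold of 90% are our own illustrative choices.

```shell
# full_filesystems: read `df -P` output on stdin and print the mount point and
# usage of every filesystem at or above a threshold (default: 90 percent).
# Assumption: device names and mount points contain no whitespace.
full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" 'NR > 1 {
        use = $5                # "Capacity" column, e.g. 24%
        sub(/%$/, "", use)      # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Typical use on a live system:
#   df -P | full_filesystems 90
```

With the usage figures shown above, this would print nothing, as the fullest volume is at 29%.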
Memory is a vital resource too. If we are short of it, the operating system may need to temporarily write currently unused pieces of it to disk (also called "swapping out") to give the freed memory to the next process, then read it back when the process owning the swapped content needs it again. This whole mechanism is called swapping, and it will indeed slow down the system, as writing to and reading from disk is much slower than working within RAM.
To check memory usage we have the handy free command, which accepts arguments to show the results in megabytes (-m) or gigabytes:
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7886        3509        1547        1231        2829        2852
Swap:          8015           0        8015
In the above example we have 8 GB of memory, 1.5 GB of it free, and around 3 GB in caches. The free command also reports the state of the swap: in this case it is completely empty, meaning the operating system has most likely not needed to write any memory content to disk since startup, not even at peak load. This usually means we have more memory than we actually use. So regarding memory we are more than good; we have plenty of it.
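The figures that free prints come from the kernel's /proc/meminfo file, which is easy to process in a script. Below is a small sketch, assuming a Linux kernel recent enough to report MemAvailable (3.14 or later); the function name mem_summary and its output format are hypothetical.

```shell
# mem_summary: read /proc/meminfo-style text on stdin and print the available
# memory as a whole-number percentage of the total, plus the swap in use (kB).
mem_summary() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         $1 == "SwapTotal:"    { swap_total = $2 }
         $1 == "SwapFree:"     { swap_free = $2 }
         END {
             if (total > 0)
                 printf "available=%d%% swap_used_kB=%d\n", avail * 100 / total, swap_total - swap_free
         }'
}

# Typical use on a live system:
#   mem_summary < /proc/meminfo
```

A low available percentage together with a growing swap figure is the pattern that would point to memory pressure.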
As processors do the actual calculations, running out of processor time can again slow the system down: pending calculations have to wait until a processor has free time to run them. The easiest way to see the load on our processors is the uptime command:
$ uptime
12:18:24 up 4:19, 8 users, load average: 4,33, 2,28, 1,37
The three numbers after load average are the averages over the last 1, 5 and 15 minutes (printed here with decimal commas because of the system's locale). In this example the machine has 4 CPU cores, so we are trying to use more than our actual capacity. Also notice that the historical values show the load has gone up significantly in the last few minutes. Maybe we found the culprit?
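To compare the load with the number of cores without parsing the locale-dependent output of uptime, we can read the same numbers from /proc/loadavg, which always uses a decimal point. This is a minimal sketch, assuming a Linux system with /proc and getconf available; load_status is a made-up helper name.

```shell
# load_status: compare a load average with the number of CPU cores.
# Prints "overloaded" when the load exceeds the core count, "ok" otherwise.
load_status() {
    # LC_ALL=C makes awk parse "4.33" with a decimal point in any locale.
    LC_ALL=C awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 > cores + 0) print "overloaded"; else print "ok"
    }'
}

# Typical use on a live system (first field of /proc/loadavg is the
# 1-minute average; getconf reports the number of online processors):
#   load_status "$(cut -d " " -f 1 /proc/loadavg)" "$(getconf _NPROCESSORS_ONLN)"
```

With the values from the example, a 1-minute load of 4.33 on 4 cores is reported as overloaded, while the 15-minute value of 1.37 is not.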
Top consumer processes
Let's see the whole picture of CPU and memory consumption, with the top processes using these resources. We can execute the top command to see system load in (near) real time:
The first line of top is identical to the output of uptime. Next we can see the number of tasks running, sleeping, etc. Note the count of zombie (defunct) processes; in this case it is 0, but if there were processes in zombie state, they should be investigated. The next line shows the load on the CPUs in percentages, broken down by what exactly the processors are busy with. Here we can see that the processors are busy serving userspace programs.
Next are two lines that may look familiar from the free output: the memory usage of the system. Below these are the top processes, sorted by CPU usage. Now we can see what is eating our processors: it is Firefox in our case.
How do I know that, since the top consuming process is shown as "Web Content" in my top output? By using ps to query the process table with the PID shown next to the top process, which in this case is 5785:
$ ps -ef | grep 5785 | grep -v "grep"
sandmann 5785 2528 19 18:18 tty2 00:00:54 /usr/lib/firefox/firefox -contentproc -childID 13 -isForBrowser -prefsLen 9825 -prefMapSize 226230 -parentBuildID 20200720193547 -appdir /usr/lib/firefox/browser 2528 true tab
With this step we found the root cause of our situation: Firefox is eating our CPU time to the point that the system starts to respond to our actions more slowly. This isn't necessarily the browser's fault, because Firefox is designed to display pages from the World Wide Web. To create a CPU issue for the purpose of demonstration, all I did was open a few dozen instances of a stress test page in distinct browser tabs until the CPU shortage surfaced. So I shouldn't blame my browser, but myself for opening resource-hungry pages and letting them run in parallel. After closing some of them, my CPU usage returns to normal.
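The same hunt for the busiest processes can be done non-interactively with ps, which is handy over a slow connection or in a script. The sketch below assumes the three-column output of ps -eo pid,pcpu,comm; top_cpu is a hypothetical helper name, not an existing command.

```shell
# top_cpu: read "PID %CPU COMMAND" lines (as printed by `ps -eo pid,pcpu,comm`)
# on stdin and print the N busiest processes, highest CPU usage first.
top_cpu() {
    count="${1:-5}"
    # Drop the header line, sort numerically (descending) on the second
    # column, and keep only the first N lines.
    sed 1d | LC_ALL=C sort -k 2,2nr | head -n "$count"
}

# Typical use on a live system:
#   ps -eo pid,pcpu,comm | top_cpu 3
```

Note that ps reports CPU usage averaged over the lifetime of each process, so the ranking can differ from the near real-time view of top. Once we have a PID, ps -fp 5785 prints the full entry for that single process without the two grep filters.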
The issue and the solution are uncovered above, but what if I am not able to access the browser to close some tabs? Let's say my graphical session is locked and I can't log back in, or a process that has gone wild does not even have an interface where we can change its behavior. In such a case we can ask the operating system to shut the process down. We already know the PID of the rogue process from ps, and we can use the kill command to shut it down:
$ kill 5785
By default kill sends the TERM signal, which asks the process to exit. Well-behaved processes will do so, but some may not. If so, adding the -9 flag (the KILL signal) will force the process to terminate:
$ kill -9 5785
Note, however, that this may cause data loss, because the process gets no time to close open files or finish writing its results to disk. But in the case of a repeatable task, system stability may take priority over losing some of our results.
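The two-step approach, a polite request first and force only if needed, can be wrapped in a small function. This is a sketch under assumptions: stop_process is a hypothetical helper, the 5-second default grace period is arbitrary, and the caller needs permission to signal the process.

```shell
# stop_process: ask a process to exit with SIGTERM, wait up to a number of
# seconds, and fall back to SIGKILL only if the process is still alive.
stop_process() {
    pid="$1"
    wait_seconds="${2:-5}"
    # Fails if the PID does not exist or we may not signal it.
    kill -TERM "$pid" 2>/dev/null || return 1
    while [ "$wait_seconds" -gt 0 ]; do
        # kill -0 sends no signal; it only tests whether the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        wait_seconds=$((wait_seconds - 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0
    # Last resort: the process gets no chance to clean up.
    kill -KILL "$pid" 2>/dev/null
}

# Typical use:
#   stop_process 5785 10
```

Giving the process a grace period keeps the risk of data loss as low as possible while still guaranteeing that a stuck process goes away.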
Finding related information
Not every process offers an interface to interact with. Many applications only have basic commands that control their behavior, namely start, stop, reload, and the like, because their internal workings are determined by their configuration. The above example was more of a desktop one; let's see a server-side example, where we have an issue with a webserver.
Suppose we have a webserver that serves some content to the world. It is popular, so it isn't good news when we get a call that our service is not available. We can check the webpage in a browser only to get an error message saying "unable to connect". Let's see the machine that runs the webserver!
Our machine hosting the webserver is a Fedora box. This is important because of the filesystem paths we need to follow: Fedora and all other Red Hat variants store the Apache webserver's logfiles under /var/log/httpd. Here we can check the error log with view, but we do not find any related information on what the issue might be. Checking the access logs also does not show any problems at first glance, but thinking twice gives us a hint: on a webserver with good enough traffic the last entries of the access log should be very recent, but the last entry is already an hour old. We know from experience that the website gets visitors every minute.
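The "is the access log still being written?" check can be automated by looking at the modification time of the file. Below is a minimal sketch, assuming a find implementation that supports -mmin (GNU and BSD find both do); is_stale is a hypothetical helper, and the log file name in the usage comment is an assumption based on the directory mentioned above.

```shell
# is_stale: succeed (exit status 0) if FILE was last modified more than
# MINUTES minutes ago. A missing or unreadable file is reported as not stale.
is_stale() {
    file="$1"
    minutes="$2"
    # find prints the path only when the age test matches.
    [ -n "$(find "$file" -mmin +"$minutes" 2>/dev/null)" ]
}

# Typical use on a live system (path is an assumption):
#   is_stale /var/log/httpd/access_log 10 && echo "no requests logged in the last 10 minutes"
```

The threshold only makes sense if we know the normal traffic of the site; a quiet internal server may legitimately go hours without a request.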
Our Fedora installation uses systemd as its init system. Let's query for some information about the webserver:
# systemctl status httpd
● httpd.service - The Apache HTTP Server
     Loaded: loaded (/usr/lib/systemd/system/httpd.service; disabled; vendor preset: disabled)
    Drop-In: /usr/lib/systemd/system/httpd.service.d
             └─php-fpm.conf
     Active: failed (Result: signal) since Sun 2020-08-02 19:03:21 CEST; 3min 5s ago
       Docs: man:httpd.service(8)
    Process: 29457 ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND (code=killed, signal=KILL)
   Main PID: 29457 (code=killed, signal=KILL)
     Status: "Total requests: 0; Idle/Busy workers 100/0;Requests/sec: 0; Bytes served/sec: 0 B/sec"
        CPU: 74ms

aug 02 19:03:21 mywebserver1.foobar systemd: httpd.service: Killing process 29665 (n/a) with signal SIGKILL.
aug 02 19:03:21 mywebserver1.foobar systemd: httpd.service: Killing process 29666 (n/a) with signal SIGKILL.
aug 02 19:03:21 mywebserver1.foobar systemd: httpd.service: Killing process 29667 (n/a) with signal SIGKILL.
aug 02 19:03:21 mywebserver1.foobar systemd: httpd.service: Killing process 29668 (n/a) with signal SIGKILL.
aug 02 19:03:21 mywebserver1.foobar systemd: httpd.service: Killing process 29669 (n/a) with signal SIGKILL.
aug 02 19:03:21 mywebserver1.foobar systemd: httpd.service: Killing process 29670 (n/a) with signal SIGKILL.
aug 02 19:03:21 mywebserver1.foobar systemd: httpd.service: Killing process 29671 (n/a) with signal SIGKILL.
aug 02 19:03:21 mywebserver1.foobar systemd: httpd.service: Killing process 29672 (n/a) with signal SIGKILL.
aug 02 19:03:21 mywebserver1.foobar systemd: httpd.service: Killing process 29673 (n/a) with signal SIGKILL.
aug 02 19:03:21 mywebserver1.foobar systemd: httpd.service: Failed with result 'signal'.
The above example is again a simple one: the httpd main process is down because it received a KILL signal. There may be another sysadmin who has the privilege to do so, so we can check who is logged in (or was at the time of the forceful shutdown of the webserver), and ask her/him about the issue (a graceful service stop would have been less brutal, so there must be a reason behind this event). If we are the only admins on the server, we can check where that signal came from: we may have a security breach, or the operating system itself sent the kill signal. In both cases we can use the server's logfiles, because ssh logins are logged to the security logs (/var/log/secure in Fedora's case), and there are also audit entries to be found in the main log (which is /var/log/messages in this case). There is an entry in the latter that tells us what happened:
Aug 2 19:03:21 mywebserver1.foobar audit: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=httpd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
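Entries like this one can be pulled out of a busy logfile with a simple filter. This is a sketch, assuming the audit records appear in the main log in the format shown above; service_events is a hypothetical helper name.

```shell
# service_events: read syslog-style lines on stdin and print the audit
# SERVICE_START / SERVICE_STOP records that mention the given systemd unit.
service_events() {
    unit="$1"
    # First keep only audit start/stop records, then match the unit name
    # literally; the trailing space avoids matching longer unit names.
    grep -E 'audit.*SERVICE_(START|STOP)' | grep -F "unit=$unit "
}

# Typical use on a live system (reading the log usually requires root):
#   service_events httpd < /var/log/messages
```

On systemd machines, journalctl -u httpd offers a complementary view, showing the messages logged for the unit itself.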
For demonstration purposes I killed my own lab webserver's main process in this example. In a server-related issue, the fastest help we can get comes from checking the logfiles, querying the system for running processes (or their absence), and checking their reported state, to get closer to the issue. To do so effectively, we need to know the services we are running: where they write their logfiles and how we can get information about their state. Knowing what is logged during normal operation also helps a lot in identifying an issue, maybe even before the service itself experiences problems.
There are many tools that help us automate most of these things, like monitoring subsystems and log aggregation solutions, but these all start with us, the admins who know how the services we run work, and where and what to check to know if they are healthy. The simple tools demonstrated above are available in any distribution, and with their help we can help solve issues on systems we are not even familiar with. That is an advanced level of troubleshooting, but the tools and their usage shown here are some of the bricks anyone can use to start building their troubleshooting skills on GNU/Linux.