
System Debugging

Computer system debugging benefits from both experience with and knowledge of the system. It also benefits from asking many questions until the cause is known, or at least until potential causes have been eliminated. For example, a junior admin may notice that a filesystem fails to unmount, and eventually ask a senior admin for help.

(As an aside, learning how to ask smart questions can help avoid time wasted over an “it doesn’t work” exchange.)

Possible workflow for a filesystem failing to unmount:

A good admin will ask many questions, and try to answer all of them, either by asking or, where possible, by running the commands. The exact order of steps matters less than checking a wide variety of symptoms. Another admin may check df to see whether the filesystem is actually mounted before running sudo lsof | grep mountpoint to look for open files. These commands are (usually) quick to run, and (usually) additional terminals can be opened to continue should one command be slow or block.
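The two checks mentioned above (is the filesystem mounted, and what holds files open on it) could be scripted roughly as follows. This is only a sketch under assumptions: it reads the Linux-specific /proc/mounts, the mount point /mnt/data is hypothetical, and the helper names is_mounted and run_with_timeout are made up for illustration. It calls lsof with the mount point as an argument rather than piping through grep.

```python
import subprocess


def is_mounted(mountpoint, mounts_text):
    """Return True if mountpoint is a mount target in /proc/mounts-style text."""
    target = mountpoint.rstrip("/") or "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        # Note: /proc/mounts writes spaces in paths as \040; not handled here.
        if len(fields) >= 2 and fields[1] == target:
            return True
    return False


def run_with_timeout(cmd, timeout=10):
    """Run cmd (a list of arguments) and return (succeeded, output_or_reason)."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "timed out after %s seconds" % timeout
    except OSError as exc:
        return False, "could not run %s: %s" % (cmd[0], exc)
    return result.returncode == 0, result.stdout


if __name__ == "__main__":
    mountpoint = "/mnt/data"  # hypothetical mount point
    try:
        with open("/proc/mounts") as fh:  # Linux-specific path
            mounted = is_mounted(mountpoint, fh.read())
    except OSError:
        mounted = None  # could not tell, e.g. not a Linux system
    print("mounted:", mounted)
    if mounted:
        # May need root privileges to see other users' processes.
        ok, output = run_with_timeout(["lsof", mountpoint])
        print("lsof succeeded:", ok)
        print(output)
```

The timeout helps with commands that are merely slow. A process stuck in uninterruptible sleep (for example on a dead NFS mount) may ignore the kill that follows the timeout, so running such checks from a spare terminal remains the safer habit.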

Junior admins should spend as much time learning the system as possible. Books such as the UNIX System Administration Handbook should help, as will delving through the manuals or other available vendor documentation. Consider also reading Unix Debugging Tips.

Quick System Overview

Be aware that an error message may not come from the problem itself, but from something affected by the real problem. This is why some initial time spent getting a feel for the system will almost always help:

  • How long has the system been running? Use the uptime command, which also shows the system load. (Old RedHat 7.2 systems, if left running long enough, would eventually show 100% CPU usage due to a procps bug.)

  • Abnormal CPU, disk, or memory use? top, free on Linux, vmstat 1 or iostat can all be consulted in a matter of seconds. Processes stuck in D state or high I/O wait times may indicate an overburdened disk system.

    Note: df and similar commands can hang on a broken NFS mount. With multiple SSH sessions or a screen session, this is probably not a concern. If one only has a single console link, the blocked df process could hamper diagnostics.

  • Anything strange in the system logs? This is more relevant should the system have recently rebooted. Check dmesg, and wherever else the logging daemon sends this information.

  • RAID or SAN or LVM state? More relevant on database and file server systems than on throwaway desktop or service nodes. Due to the plethora of different arrays and devices possible here, it is best to write a wrapper script that figures out what the system has and emits appropriate diagnostics.
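A rough sketch of a wrapper for the checklist above follows. It makes several assumptions: a Linux-style /proc/uptime, a ps that accepts -eo pid,stat,comm, and only two storage checks (Linux software RAID through /proc/mdstat and LVM through the lvs command). Hardware arrays and SANs need their vendor tools, which are not guessed at here. All function names are made up for illustration.

```python
import os
import shutil
import subprocess


def format_uptime(proc_uptime_text):
    """Turn the first field of /proc/uptime (seconds) into days, hours, minutes."""
    seconds = int(float(proc_uptime_text.split()[0]))
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    return "%dd %dh %dm" % (days, hours, rem // 60)


def d_state_processes(ps_text):
    """Return (pid, command) pairs in D state from `ps -eo pid,stat,comm` output."""
    stuck = []
    for line in ps_text.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[1].startswith("D"):
            stuck.append((int(fields[0]), fields[2].strip()))
    return stuck


def storage_checks(has_command=shutil.which, path_exists=os.path.exists):
    """Pick diagnostic commands based on what this system appears to have."""
    checks = []
    if path_exists("/proc/mdstat"):
        checks.append(["cat", "/proc/mdstat"])  # Linux software RAID status
    if has_command("lvs"):
        checks.append(["lvs"])  # LVM logical volumes; usually needs root
    return checks


if __name__ == "__main__":
    try:
        with open("/proc/uptime") as fh:
            print("uptime:", format_uptime(fh.read()))
    except OSError:
        print("uptime: /proc/uptime not available")
    print("load averages:", os.getloadavg())
    try:
        ps = subprocess.run(["ps", "-eo", "pid,stat,comm"],
                            capture_output=True, text=True, timeout=10)
        print("D state:", d_state_processes(ps.stdout))
    except (OSError, subprocess.TimeoutExpired) as exc:
        print("ps failed:", exc)
    for cmd in storage_checks():
        print("would run:", " ".join(cmd))
```

The parsing functions take text rather than running commands themselves, so they can be tried against saved output from a misbehaving host.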

Monitoring can help collect and alarm on abnormal CPU, memory, or disk usage. However, these alarms should not page anyone, as a busy but perfectly operational system may show high CPU, and waking someone up for false alarms hurts their ability to work on real issues.
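As a small illustration of usage alarms that are recorded but never page, consider the sketch below. The metric names and threshold values are hypothetical, and a real monitoring system would express this in its own configuration rather than in Python.

```python
# Hypothetical thresholds, in percent; tune these for the systems in question.
USAGE_THRESHOLDS = {
    "cpu_percent": 90.0,
    "memory_percent": 90.0,
    "disk_percent": 85.0,
}


def usage_alarms(sample):
    """Return alarms for metrics at or over threshold; none of them page."""
    alarms = []
    for metric, limit in sorted(USAGE_THRESHOLDS.items()):
        value = sample.get(metric)
        if value is not None and value >= limit:
            # page is always False: usage alone does not justify waking anyone.
            alarms.append({"metric": metric, "value": value,
                           "limit": limit, "page": False})
    return alarms


if __name__ == "__main__":
    for alarm in usage_alarms({"cpu_percent": 97.0, "disk_percent": 40.0}):
        print(alarm)
```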