Reliability Monitor is a built-in part of Windows that's been around since the introduction of Windows Vista back in January 2007. It's always been a somewhat hidden feature of the Windows operating system, and therefore easy for users and admins alike to overlook. Nevertheless, it's a great tool that provides all kinds of interesting insight into system history and stability (see Figure 1). Reliability Monitor is particularly useful when troubleshooting glitchy systems, and can provide insights into possible causes as well as important clues to fixing things.
Understanding Reliability Monitor
Reliability Monitor is part and parcel of the Reliability & Performance Monitor snap-in for the Microsoft Management Console (MMC). That said, Reliability Monitor comes pre-defined with all modern Windows versions, so there's no need to launch MMC, and then to start adding and configuring snap-ins to make Reliability Monitor work.
Instead, Reliability Monitor taps into the Windows Event Manager to elicit data about your system, with a focus on events that impact reliability, as well as performance counters and configuration data. Reliability Monitor tracks five different categories of information, namely:
- Application failures: Tracks application failures or errors (e.g., "MS Outlook … stopped working")
- Windows failures: Tracks OS failures or errors (e.g., "Windows hardware error")
- Miscellaneous failures: Tracks other failures or errors, typically peripherals (e.g., "Disk failure")
- Warnings: Tracks failures or errors that don't necessarily impact system behavior (e.g., "Unsuccessful driver installation")
- Information: Tracks system changes and updates (e.g., "Successful Windows Update" and "Successful driver installation")
Monitoring results are compiled over time: trouble-free operation increases the stability index, and errors or failures decrease it as they occur. A value of 10 is as high as Reliability Monitor goes, and a value of 1 is as low as things get. In actual practice, values of 10 are common on stable, lightly exercised systems, while heavily exercised and somewhat abused test systems will throw readings of about 1.7 or thereabouts.
Interestingly, though Reliability Monitor visually tracks errors in the five categories already discussed, it provides details in only three categories in text form at the bottom of its console window, where details or solution lookup is available on an item-by-item basis. Those three categories are:
- Critical events: Lumps Application failures, Windows failures and Miscellaneous failures together in chronological order
- Warnings: All warning messages (marked with a yellow exclamation warning flag) together in chronological order
- Informational events: All information messages (marked with a lower-case "i" on a blue circle) also in chronological order
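The five-to-three mapping described above can be sketched in a few lines of Python. This is only an illustration of the grouping logic, not how Reliability Monitor is implemented: the `DETAIL_GROUP` table, the `group_events` function, and the sample events are all hypothetical, and events are assumed to arrive as (timestamp, category, message) tuples with ISO-style timestamps.

```python
from collections import defaultdict

# Hypothetical lookup table: the five tracked categories (left) mapped to the
# three groups shown in the detail pane (right), as described in the article.
DETAIL_GROUP = {
    "Application failures": "Critical events",
    "Windows failures": "Critical events",
    "Miscellaneous failures": "Critical events",
    "Warnings": "Warnings",
    "Information": "Informational events",
}


def group_events(events):
    """Group (timestamp, category, message) tuples into the three detail groups.

    Events are sorted first, so each group comes out in chronological order
    as long as the timestamps sort correctly as strings (ISO format does).
    """
    grouped = defaultdict(list)
    for timestamp, category, message in sorted(events):
        grouped[DETAIL_GROUP[category]].append((timestamp, message))
    return dict(grouped)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ("2016-09-24T10:15", "Miscellaneous failures", "Disk failure"),
        ("2016-09-23T08:00", "Information", "Successful Windows Update"),
        ("2016-09-22T17:30", "Application failures", "Stopped working"),
        ("2016-09-23T09:45", "Warnings", "Unsuccessful driver installation"),
    ]
    for group, items in group_events(sample).items():
        print(group, items)
```

Note that the three failure categories lose their individual labels once they are lumped under "Critical events," which matches what the detail pane does.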
Reliability Monitor stores reliability history in its own internal file format, but you can use the "Save reliability history…" button at the lower left in the console window at any time to save a snapshot of that data in XML format. This saves only the hourly values for the Reliability Index that the program calculates while a PC is running (not the event data from which the index is calculated) in highly human-readable form, as the brief snippet in Figure 2 shows:
As you can see, for each hour of trouble-free operation, the Reliability Index gains 0.03 in value. Losses for errors vary by severity, but typically fall within a range of -0.2 to -1.0.
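To make that arithmetic concrete, here is a toy model in Python. It is not the calculation Windows actually performs; it uses only the figures quoted in this article (a 0.03 gain per trouble-free hour, a floor of 1, and a ceiling of 10). The `simulate_index` function is hypothetical, and the rule that a failure hour applies only its penalty, with no gain, is an assumption.

```python
HOURLY_GAIN = 0.03      # gain per trouble-free hour, as quoted in the article
MIN_INDEX = 1.0         # lowest value the index can take
MAX_INDEX = 10.0        # highest value the index can take


def simulate_index(start, hourly_penalties):
    """Return the index after each hour, rounded to two decimal places.

    hourly_penalties holds one number per hour: 0 for a trouble-free hour,
    or a positive penalty (for example 0.5) for an hour with a failure.
    Assumption: a failure hour subtracts its penalty and earns no gain.
    """
    index = start
    history = []
    for penalty in hourly_penalties:
        if penalty == 0:
            index += HOURLY_GAIN
        else:
            index -= penalty
        # Keep the running value inside the 1-to-10 range.
        index = max(MIN_INDEX, min(MAX_INDEX, index))
        history.append(round(index, 2))
    return history


if __name__ == "__main__":
    # Two quiet hours, one failure costing 0.5, then another quiet hour.
    print(simulate_index(9.0, [0, 0, 0.5, 0]))
```

Under these assumptions, a single 0.5 penalty takes roughly 17 trouble-free hours to earn back, which is why a machine with frequent failures stays near the bottom of the scale.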
Launching Reliability Monitor
As is the case with many Windows tools and utilities, there are many ways to launch Reliability Monitor on a PC. My favorite is simply to type "reli" in the search box, and let Windows produce the "View reliability history" prompt that launches this console in response. The explicit, step-by-step way to get to this program is as follows:
- Open Control Panel.
- Open Security and Maintenance.
- Expand the Maintenance category, then select "View reliability history" under the heading that reads "Check for solutions to problem reports."
Either way, you'll be presented with the Reliability Monitor interface for the local PC. For access to remote PCs, you can establish an RDP session with the target PC, then run Reliability Monitor within that window. It works just as well through RDP (or other remote access tools) as it does locally.
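For those who prefer a script or a shortcut, Reliability Monitor can also be opened from the command line with perfmon /rel. The Python sketch below wraps that command; the function names are hypothetical, and the platform check is there only so the script does nothing harmful on a non-Windows machine.

```python
import platform
import subprocess


def reliability_monitor_command():
    """Return the command line that opens Reliability Monitor."""
    # perfmon.exe with the /rel switch opens the reliability view directly.
    return ["perfmon", "/rel"]


def launch_reliability_monitor():
    """Start Reliability Monitor; return False when not running on Windows."""
    if platform.system() != "Windows":
        return False
    # Popen returns immediately, leaving the console window open on its own.
    subprocess.Popen(reliability_monitor_command())
    return True


if __name__ == "__main__":
    if not launch_reliability_monitor():
        print("Reliability Monitor is only available on Windows.")
```

Run inside an RDP session, the same script opens Reliability Monitor on the remote PC rather than the local one.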
Using Reliability Monitor for troubleshooting
For a proper demonstration of what Reliability Monitor can tell you, and how it points to causes and possible cures, let's take a look at Reliability Monitor data for one of my most heavily used and abused test machines:
As the figure shows, on September 24 this machine had an unusual occurrence of a miscellaneous failure. Clicking the arrow at the left side of the graph moved the timeline back to include that day in the view. Double-clicking the "Disk failure" item in the detail list below produced the following Description text:
Windows Disk Diagnostic detected a S.M.A.R.T. fault on disk JMicron H/W RAID0 (volumes Unknown). This disk might fail; back up your computer now. All data on the hard disk, including files, documents, pictures, programs, and settings might be lost if your hard disk fails. To determine if the hard disk needs to be repaired or replaced, contact the manufacturer of your computer. If you can't back up (for example, you have no CDs or other backup media), you should shut down your computer and restart when you have backup media available. In the meantime, do not save any critical files to this disk.
This event signaled a hardware problem with a SATA device on my test PC that was serious enough to cause an immediate loss of access to the drive's contents. Looking further back in the history, the failure was not foreshadowed by the less severe SMART errors that might have signaled imminent drive failure on a conventional hard disk (this device was a synthetic SSD that consisted of a RAID 0 array of two identical mSATA SSDs, and the controller card itself failed). Failing conventional drives would typically provide warning with increasing (and increasingly severe) SMART errors before failing outright, and you'd be able to pick this up in Reliability Monitor.
You can also see that this PC has serious stability problems with the built-in Photos app. As a result of ongoing errors with that program, I've switched to a different application for viewing photos and images on that machine (IrfanView, which I've also set as the default app for image files).
Although you can't always fix the problems that Reliability Monitor will catch, you can apply the punchline of the well-known joke to guiding (or channeling) user behavior: "Patient says to doctor: 'It hurts when I do this' (demonstrates by action). Doctor says to patient: 'Don't do that!'"
Sometimes, managing reliability boils down to managing the behavior of system users. When fixing somebody else's unstable software is not a viable option, counseling avoidance and providing alternatives (along with proper defaults) steers users clear of the problematic software.
In general, working with Reliability Monitor requires looking at the causes of errors, and deciding what might be done to address them. When fixes are possible, they will usually be fairly easy to figure out. Often, though, one must simply steer clear of programs or features that don't work the way they should so as to avoid unnecessary errors. As is so often the case with Windows: "If you can't fix it, avoid it," is a watchword to live by.