Monitoring: Detecting Flaps
Monitoring software typically supports raising an alarm after a certain number of consecutive breaches: for example, three unsuccessful HTTP requests to a web server, or two consecutive ping failures. However, this misses conditions where the check flaps. That is, the check alternates between failing and succeeding on consecutive runs, or recovers often enough that the required number of consecutive breaches is never met.
Given two connections, a ping check could be made on each every five minutes, with an alarm raised after three consecutive connection failures. Here is an example over 40 minutes of monitoring:
| Time (min) | Conn #1 ping OK | Conn #1 consecutive failures | Conn #2 ping OK | Conn #2 consecutive failures |
|---|---|---|---|---|
| 0 | true | 0 | true | 0 |
| 5 | true | 0 | false | 1 |
| 10 | false | 1 | true | 0 |
| 15 | false | 2 | false | 1 |
| 20 | false | 3 | true | 0 |
| 25 | false | 4 | false | 1 |
| 30 | false | 5 | true | 0 |
| 35 | false | 6 | false | 1 |
| 40 | true | 0 | true | 0 |
An alarm is raised for connection #1 at 20 minutes, after three consecutive failures. However, connection #2 never raises an alarm, as it never accumulates enough consecutive failures. Note that between the 5th and 35th minute inclusive, it tallied four failures to only three successes.
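The consecutive-failure rule described above can be sketched in a few lines of Python. This is only an illustration built from the table data, not the code of the test-monitoring script; the function name `simple_alarms` and its parameters are assumptions made for this example.

```python
from typing import Iterable, List


def simple_alarms(checks: Iterable[bool], threshold: int = 3,
                  interval: int = 5) -> List[int]:
    """Return the times (in minutes) at which the detector alarms.

    `checks` holds one ping result per interval (True = ping OK).
    Any success resets the failure counter immediately.
    """
    alarms = []
    failures = 0
    for i, ok in enumerate(checks):
        failures = 0 if ok else failures + 1
        if failures >= threshold:
            alarms.append(i * interval)
    return alarms


# Ping results from the table, one entry per five-minute check.
conn1 = [True, True, False, False, False, False, False, False, True]
conn2 = [True, False, True, False, True, False, True, False, True]

print(simple_alarms(conn1))  # [20, 25, 30, 35]
print(simple_alarms(conn2))  # []
```

Connection #2 fails on four of its nine checks, yet the counter never gets past one, so the list of alarm times stays empty.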
Experimental alarm-raising code in the test-monitoring script shows that delaying the alarm reset lets an otherwise simple consecutive-failure detector catch flaps. Using a running average to detect flapping seems too complicated.

    $ perl test-monitoring
    conn#1 at 20 alarm on simple
    conn#1 at 20 alarm on simple_delayreset
    conn#1 at 25 alarm on simple
    conn#1 at 25 alarm on simple_delayreset
    conn#1 at 30 alarm on simple
    conn#1 at 30 alarm on simple_delayreset
    conn#1 at 35 alarm on simple
    conn#1 at 35 alarm on simple_delayreset
    conn#2 at 25 alarm on simple_delayreset
    conn#2 at 35 alarm on simple_delayreset

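One way a delayed reset could work is sketched below in Python. The reset rule in the test-monitoring script is not shown here, so this is a guess: the failure counter is cleared only after `reset_after` consecutive successes (a hypothetical parameter), and an alarm fires on any failed check once the counter reaches the threshold. With `reset_after=2`, this rule happens to give the same alarm times as the `simple_delayreset` lines in the output above.

```python
from typing import Iterable, List


def delayreset_alarms(checks: Iterable[bool], threshold: int = 3,
                      reset_after: int = 2, interval: int = 5) -> List[int]:
    """Return alarm times (minutes) for a detector with a delayed reset.

    Assumption: the failure counter is cleared only after `reset_after`
    consecutive successful checks, so a single success between failures
    does not hide a flapping connection.
    """
    alarms = []
    failures = 0
    successes = 0
    for i, ok in enumerate(checks):
        if ok:
            successes += 1
            if successes >= reset_after:
                failures = 0
        else:
            successes = 0
            failures += 1
            if failures >= threshold:
                alarms.append(i * interval)
    return alarms


# Ping results from the table, one entry per five-minute check.
conn1 = [True, True, False, False, False, False, False, False, True]
conn2 = [True, False, True, False, True, False, True, False, True]

print(delayreset_alarms(conn1))  # [20, 25, 30, 35]
print(delayreset_alarms(conn2))  # [25, 35]
```

The trade-off is that recovery takes longer to register: a connection that fails twice, recovers for one check, and fails once more will alarm, which may or may not be wanted depending on the check.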
Without flap detection, monitoring will usually alarm only minutes or hours into an incident, typically once a backlog or some other condition created by the flapping reaches alarm levels. Therefore, flap detection should be added to network and other connectivity tests, where sporadic failures over time show up as flapping.
Update: T_____ pointed out Detection and Handling of State Flapping in Nagios 3.