System and network issues can be detected by alarming on a single breach of a threshold, though this would generate many false alarms. Network hiccups, database flips, and other non-events would trigger these alarms again and again. False alarms must be minimized, as otherwise humans will quickly burn out. On-call shifts must not be frittered away on false alarms; as much time and mental energy as possible should be preserved for real production issues.
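One common way to cut down on single-breach false alarms is to alarm only when several recent checks have breached the threshold. The following is a minimal sketch of that idea, not a description of any particular monitoring product; the function `breach_alarms`, its parameters, and the latency figures are all hypothetical.

```python
from collections import deque
from typing import Iterable, List


def breach_alarms(samples: Iterable[float], threshold: float,
                  required: int = 1, window: int = 1) -> List[int]:
    """Return the sample indexes at which an alarm would be raised.

    An alarm is raised when at least `required` of the last `window`
    samples exceed `threshold` and the alarm is not already active.
    """
    recent = deque(maxlen=window)  # True for each recent breaching sample
    active = False
    raised = []
    for index, value in enumerate(samples):
        recent.append(value > threshold)
        breaching = sum(recent) >= required
        if breaching and not active:
            raised.append(index)
        active = breaching
    return raised


# Hypothetical latency samples in milliseconds: three brief hiccups,
# the last of which turns into a sustained problem.
latency_ms = [20, 500, 22, 21, 480, 23, 510, 530, 560, 540, 25, 24]

# Alarm on any single breach (the defaults).
single = breach_alarms(latency_ms, threshold=400)
# Alarm only after three consecutive breaches.
sustained = breach_alarms(latency_ms, threshold=400, required=3, window=3)
```

With this made-up data the single-breach rule raises three alarms (at indexes 1, 4, and 6), while the three-in-a-row rule raises one (at index 8). The cost is delay: the sustained problem that begins at index 6 is not reported until two checks later.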
Duplicate alarms are a similar annoyance: one incident reported by many tickets. One of the alarms may best describe or point to the problem, while the rest are at best false alarms, at worst false leads that waste time.
Monitoring systems should be set up with aggregation and suppression. Aggregation groups the common alarms for a service into a single parent alarm; suppression silences alarms across different systems or services when one service depends on another. With suppression, should a backend database fail, the alarms for services that rely on that database are suppressed. With aggregation, issues for a database or web service are reported by a single alarm, with the on-call then researching which components are in alarm. This should be faster than the on-call researching one issue, then being interrupted time and time again by subsequent related components raising alarms.
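The two ideas can be sketched together in a few lines. This is an illustration under stated assumptions rather than how any given monitoring system works: the function `pages_to_send`, the service names, and the shape of the `firing` and `depends_on` mappings are all hypothetical.

```python
from typing import Dict, List, Set


def pages_to_send(firing: Dict[str, List[str]],
                  depends_on: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Reduce raw component alarms to one page per unsuppressed service.

    firing:     service name -> component alarms currently firing
    depends_on: service name -> services it directly depends on
    """
    in_alarm: Set[str] = {svc for svc, alarms in firing.items() if alarms}

    def has_failing_dependency(service: str, seen: Set[str]) -> bool:
        # Walk the dependency graph; `seen` guards against cycles.
        for dep in depends_on.get(service, []):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in in_alarm or has_failing_dependency(dep, seen):
                return True
        return False

    pages = {}
    for service in sorted(in_alarm):
        if has_failing_dependency(service, {service}):
            continue  # suppression: an upstream service is already in alarm
        # aggregation: one parent alarm listing the components in alarm
        pages[service] = sorted(firing[service])
    return pages


# Hypothetical state: the database is failing, and so is everything above it.
firing = {
    "orders-db": ["replication_lag", "connections_refused"],
    "orders-api": ["http_5xx", "latency"],
    "web": ["http_5xx"],
    "mail": [],
}
depends_on = {"orders-api": ["orders-db"], "web": ["orders-api"]}

pages = pages_to_send(firing, depends_on)
```

Here five component alarms across three services collapse into a single page for `orders-db`, which lists its two components in alarm. The suppressed alarms are not discarded in a real system; they would normally remain visible on a dashboard for the on-call to consult.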
Monitors must be reviewed and, if necessary, adjusted after each alarm. Was the alarm a false positive? How much higher would the threshold have to be to prevent a repeat? Would that higher threshold miss anything? If monitoring missed something, how much lower would the threshold have to be? At that lower level, how many more false alarms would one expect? Could any other condition also be used to monitor the service? Is the software buggy? If the software cannot be fixed, will an automated restart work around the issue well enough?
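The threshold questions can often be answered by replaying historical data against candidate thresholds. The sketch below assumes that past metric samples are available and that the samples belonging to real incidents have been marked by hand; `review_threshold` and the error-rate figures are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def review_threshold(samples: Sequence[float], incident_indexes: Set[int],
                     candidates: Sequence[float]) -> List[Dict[str, float]]:
    """For each candidate threshold, count false alarms and missed incidents.

    samples:          historical metric values, one per check interval
    incident_indexes: positions in `samples` confirmed as real incidents
    """
    report = []
    for threshold in candidates:
        fired = {i for i, value in enumerate(samples) if value > threshold}
        report.append({
            "threshold": threshold,
            "false_alarms": len(fired - incident_indexes),
            "missed": len(incident_indexes - fired),
        })
    return report


# Hypothetical error rates (percent) with real incidents at indexes 4 and 6.
error_rate_pct = [1, 2, 6, 1, 9, 3, 12, 2, 7, 1]
report = review_threshold(error_rate_pct, {4, 6}, candidates=[5, 8, 10])
```

For this made-up data a threshold of 5 yields two false alarms, 8 yields none and misses nothing, and 10 misses one of the two incidents. Real data is rarely this clean, so the report informs the decision rather than making it.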
Convincing management that time must be spent on incremental monitoring improvements is likely the largest hurdle. Start by tracking the number of alarms and the root cause recorded on each ticket, so that false alarms versus other issues can be graphed over time. After monitoring improvements, both the total number of alarms and the number of false alarms should go down, once changes in the number of real incidents are accounted for. The well-being of the on-call team, while harder to gauge than ticket metrics, is also critical. Addressing individual or group concerns about pager loads is best done before the entire group quits!
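The ticket tracking described above needs very little tooling. As a minimal sketch, assuming each ticket can be exported as a month and a root-cause label (the `false_alarm` label, the function name, and the sample tickets are hypothetical), the false alarm share per month can be computed and then graphed:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def false_alarm_rate_by_month(
        tickets: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of tickets per month that were false alarms.

    tickets: (month as 'YYYY-MM', root_cause) pairs; a root_cause of
             'false_alarm' marks a ticket as a false alarm.
    """
    totals = Counter()
    false_alarms = Counter()
    for month, root_cause in tickets:
        totals[month] += 1
        if root_cause == "false_alarm":
            false_alarms[month] += 1
    return {month: false_alarms[month] / totals[month]
            for month in sorted(totals)}


# Hypothetical tickets before and after a round of monitoring improvements.
tickets = [
    ("2024-01", "false_alarm"), ("2024-01", "false_alarm"),
    ("2024-01", "disk_full"), ("2024-01", "false_alarm"),
    ("2024-02", "false_alarm"), ("2024-02", "bad_deploy"),
    ("2024-02", "disk_full"), ("2024-02", "bad_deploy"),
]
rates = false_alarm_rate_by_month(tickets)
```

Reporting a ratio rather than a raw count is one way to adjust for months with more or fewer real incidents; in this example the false alarm share drops from 0.75 to 0.25.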
Splits between the team that receives the page and a secondary escalation team responsible for the monitoring configuration can also cause stress: the first tier may feel that an alarm triggers too often, but the second tier may not think it a problem, or may not consider adjusting the alarm a priority. Solutions include giving the first tier an appropriate degree of control over monitoring, or a process by which alarms can be flagged as problematic and routed directly to the second-tier team until the condition is addressed.