Alerting & Thresholds
Alert Processing Flow
A typical flow: a metric sample is collected, the threshold condition is evaluated, the persistence window is checked, a severity is assigned, and notifications are dispatched to the configured channels.
Proactive alerting is the bridge between monitoring (knowing what is happening) and action (fixing it). Without alerts, dashboards are just pretty pictures nobody watches 24/7.
Alert Severity Levels
- Critical: service-impacting; requires immediate response
- Warning: degraded but still functional; warrants attention within hours
- Info: no immediate action; logged for awareness and trend review
Common Threshold Configurations
| Rule Name | Metric | Condition | Persistence | Severity | Status |
|---|---|---|---|---|---|
| High CPU Sustained | CPU Utilization | > 90% | 10 minutes | Critical | ● Enabled |
| Low Disk Space | Disk Free % | < 10% | Immediate | Critical | ● Enabled |
| High Memory | Memory Utilization | > 95% | 5 minutes | Warning | ● Enabled |
| Slow Boot Time | Boot Duration | > 120 sec | Immediate | Warning | ● Enabled |
| Stale Reboot | System Uptime | > 30 days | Immediate | Info | ● Enabled |
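The table above can be expressed as data. Here is a minimal sketch of the same five rules as rule objects; the field and metric names are illustrative, not a product schema:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    metric: str
    operator: str          # ">" or "<"
    threshold: float
    persistence_min: int   # minutes the condition must hold; 0 = fire immediately
    severity: str
    enabled: bool = True

# The threshold table above, expressed as rule objects.
RULES = [
    AlertRule("High CPU Sustained", "cpu_utilization", ">", 90, 10, "Critical"),
    AlertRule("Low Disk Space", "disk_free_pct", "<", 10, 0, "Critical"),
    AlertRule("High Memory", "memory_utilization", ">", 95, 5, "Warning"),
    AlertRule("Slow Boot Time", "boot_duration_sec", ">", 120, 0, "Warning"),
    AlertRule("Stale Reboot", "uptime_days", ">", 30, 0, "Info"),
]

def breaches(rule: AlertRule, value: float) -> bool:
    """True if a single sample violates the rule's threshold."""
    return value > rule.threshold if rule.operator == ">" else value < rule.threshold
```

Keeping rules as data rather than hard-coded logic is what makes the enable/disable/duplicate operations in the exercise below cheap.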
Persistence Windows
A persistence window is the duration a condition must remain true before the alert fires. Without it, a brief CPU spike from opening Excel triggers a Critical alert.
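A sketch of how a persistence window might be evaluated (the timestamped-sample representation is an assumption, not a specific product's implementation):

```python
from datetime import datetime, timedelta

def should_fire(samples, threshold, persistence):
    """Fire only if every sample in the trailing persistence window
    breaches the threshold. `samples` is a list of (timestamp, value)
    pairs in chronological order, newest last."""
    if not samples:
        return False
    cutoff = samples[-1][0] - persistence
    if samples[0][0] > cutoff:
        return False  # not enough history yet to cover a full window
    window = [value for ts, value in samples if ts >= cutoff]
    return all(value > threshold for value in window)
```

With a 10-minute window, the brief Excel spike breaches the threshold for one sample and never fires; only sustained load does.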
Notification Channels
Email Notifications
Simplest channel. Configure SMTP, add distribution groups. Best for Warning and Info alerts.
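A minimal sketch using Python's standard library; the host and addresses are placeholders, and in practice the To: line is a distribution group:

```python
import smtplib
from email.message import EmailMessage

def build_alert_email(alert, to_addr="desktop-ops@example.com"):
    """Build the notification message from an alert dict (keys illustrative)."""
    msg = EmailMessage()
    msg["Subject"] = f"[{alert['severity']}] {alert['rule']} on {alert['hostname']}"
    msg["From"] = "alerts@example.com"
    msg["To"] = to_addr
    msg.set_content(f"{alert['metric']} = {alert['value']} (duration: {alert['duration']})")
    return msg

def send_alert_email(msg, smtp_host="smtp.example.com"):
    """Relay via the configured SMTP host (assumed to accept plain relay)."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```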
ServiceNow Integration
Auto-create incidents with severity-to-priority mapping, affected CI, and diagnostic details in work notes.
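One way the mapping might look; the priority scheme and field names below are illustrative, and the real mapping is whatever your ServiceNow instance defines:

```python
# Illustrative severity-to-priority mapping.
SEVERITY_TO_PRIORITY = {
    "Critical": 1,  # P1: immediate response
    "Warning": 3,   # P3: attention within hours
    "Info": 4,      # P4: log and review
}

def build_incident(alert):
    """Sketch of an auto-created incident record (hypothetical alert keys)."""
    return {
        "short_description": f"[{alert['severity']}] {alert['rule']} on {alert['hostname']}",
        "priority": SEVERITY_TO_PRIORITY[alert["severity"]],
        "cmdb_ci": alert["hostname"],        # affected CI
        "work_notes": alert["diagnostics"],  # diagnostic details
    }
```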
Webhook / API
Send alert payloads to Slack, Teams, PagerDuty, or custom automation. Full alert context in JSON.
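A sketch of such a payload; the field names are illustrative, not any specific receiver's schema:

```python
import json

def webhook_payload(alert):
    """Serialize the full alert context as JSON for a generic
    webhook receiver (field names are illustrative)."""
    return json.dumps({
        "hostname": alert["hostname"],
        "severity": alert["severity"],
        "metric": alert["metric"],
        "value": alert["value"],
        "duration": alert["duration"],
        "timestamp": alert["timestamp"],
    })
```

The receiver (a Slack/Teams app, PagerDuty, or a custom automation endpoint) parses this and routes or remediates as it sees fit.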
Alert Fatigue: The Silent Killer
- Team members dismiss alerts without investigating
- Alert email folders have thousands of unread messages
- Critical incidents discovered by users, not alerts
- Engineers auto-archive alert emails
- New hires told "just ignore most of those"
Start with 5-10 high-confidence rules. Review volume weekly. Require action for every alert — if the response is always "nothing to do," eliminate or downgrade it. Use suppression windows during maintenance.
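Suppression windows can be as simple as a list of time ranges checked before any notification goes out. A minimal sketch, with a hypothetical window for an overnight patch deployment:

```python
from datetime import datetime

# Hypothetical maintenance window: suppress alert noise
# between 01:00 and 05:00 during a patch deployment.
SUPPRESSION_WINDOWS = [
    (datetime(2024, 6, 5, 1, 0), datetime(2024, 6, 5, 5, 0), "patch deployment"),
]

def is_suppressed(alert_time, windows=SUPPRESSION_WINDOWS):
    """True if the alert fired inside a declared maintenance window."""
    return any(start <= alert_time <= end for start, end, _reason in windows)
```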
Simulated Alert Dashboard
Review the following alert dashboard. Identify which alerts need immediate action and which are noise.
| Hostname | Severity | Metric | Value | Duration | Timestamp |
|---|---|---|---|---|---|
| CAEI-445521 | CRITICAL | Disk Free % | 2% | 3 hours | 08:12 AM |
| CAEI-778432 | CRITICAL | CPU Utilization | 99% | 45 min | 10:30 AM |
| CAEI-332198 | WARNING | Memory | 89% | 12 min | 10:55 AM |
| CAEI-990015 | WARNING | Boot Time | 142 sec | N/A | 09:01 AM |
| CAEI-112847 | INFO | Uptime | 34 days | N/A | 07:00 AM |
| CAEI-556310 | INFO | Uptime | 21 days | N/A | 07:00 AM |
Scenario: Alert Storm After Patch Deployment
Wednesday morning. Your team deployed a cumulative Windows update to 2,000 endpoints last night. You arrive to find 347 Warning alerts and 28 Critical alerts for high CPU. Alerts started at 2:00 AM and are still coming in. ServiceNow has 375 auto-generated incidents.
What is the best first step?
A. Roll back the patch from all affected endpoints
B. Disable the CPU alert rules until the volume subsides
C. Verify the cause of the spikes, suppress non-critical alerts for the maintenance window, and monitor
D. Bulk-close the auto-generated ServiceNow incidents
Correct: C. Post-patch, it is common for TrustedInstaller (the Windows Modules Installer service) to spike CPU temporarily. Verify the cause, suppress non-critical alerts for the maintenance window, and monitor for resolution within 1-2 hours. Rolling back (A) is premature, disabling alerts (B) leaves you blind, and bulk-closing tickets (D) destroys the audit trail.
Exercise: Configure an Alert Rule
Walk through the 9-step process to create an alert rule:
1. Navigate: Performance > Alerts in the left-hand menu
2. Create: click "Create Alert Rule" to open the rule builder
3. Select Metric: choose CPU, memory, disk, boot time, or network latency
4. Define Condition: set the operator (greater than, less than) and threshold value
5. Set Persistence: how long the condition must remain true (1 min – 24 hours)
6. Choose Severity: Critical, Warning, or Informational
7. Define Scope: all endpoints, a computer group, or an OS type
8. Notifications: select channels (email, ServiceNow, webhook)
9. Save & Enable: rules can be enabled, disabled, or duplicated at any time
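The finished rule carries one field per step. A hypothetical result of the builder (field names and the computer group are illustrative):

```python
# Hypothetical output of the rule builder: one field for each
# of the nine steps above.
rule = {
    "name": "High CPU Sustained",
    "metric": "cpu_utilization",                      # step 3
    "condition": {"operator": ">", "threshold": 90},  # step 4
    "persistence": "10m",                             # step 5 (1 min - 24 hours)
    "severity": "Critical",                           # step 6
    "scope": {"computer_group": "Finance-Laptops"},   # step 7 (hypothetical group)
    "notifications": ["email", "servicenow"],         # step 8
    "enabled": True,                                  # step 9
}
```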
Knowledge Check
1. What is the primary purpose of a persistence window in an alert rule?
Correct: B. A persistence window requires the condition to remain true for a specified duration, filtering out momentary spikes and ensuring only sustained, actionable conditions trigger notifications.
2. Which of the following is a sign of alert fatigue in your team?
Correct: C. Auto-archiving alert emails is a clear sign the team has given up on processing them, meaning critical alerts are likely buried in the noise.
3. A Warning alert should typically result in which of the following actions?
Correct: B. Warning alerts indicate a degraded but functional state. They warrant a medium-priority ticket and attention within hours — not the immediate response of a Critical, nor the passive logging of an Informational.