Alerting & Thresholds
Alert Processing Flow
A typical flow: a metric sample is collected, the threshold condition is evaluated, the persistence window is checked, a severity is assigned, and notifications are dispatched to the configured channels.
Proactive alerting is the bridge between monitoring (knowing what is happening) and action (fixing it). Without alerts, dashboards are just pretty pictures nobody watches 24/7.
Alert Severity Levels
- Critical: service-impacting; requires immediate response
- Warning: degraded but still functional; warrants attention within hours
- Info: no immediate action; logged for awareness and trend review
Common Threshold Configurations
| Rule Name | Metric | Condition | Persistence | Severity | Status |
|---|---|---|---|---|---|
| High CPU Sustained | CPU Utilization | > 90% | 10 minutes | Critical | ● Enabled |
| Low Disk Space | Disk Free % | < 10% | Immediate | Critical | ● Enabled |
| High Memory | Memory Utilization | > 95% | 5 minutes | Warning | ● Enabled |
| Slow Boot Time | Boot Duration | > 120 sec | Immediate | Warning | ● Enabled |
| Stale Reboot | System Uptime | > 30 days | Immediate | Info | ● Enabled |
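The table above can be expressed as data. Here is a minimal sketch of the same five rules as rule objects; the field and metric names are illustrative, not a product schema:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    metric: str
    operator: str          # ">" or "<"
    threshold: float
    persistence_min: int   # minutes the condition must hold; 0 = fire immediately
    severity: str
    enabled: bool = True

# The threshold table above, expressed as rule objects.
RULES = [
    AlertRule("High CPU Sustained", "cpu_utilization", ">", 90, 10, "Critical"),
    AlertRule("Low Disk Space", "disk_free_pct", "<", 10, 0, "Critical"),
    AlertRule("High Memory", "memory_utilization", ">", 95, 5, "Warning"),
    AlertRule("Slow Boot Time", "boot_duration_sec", ">", 120, 0, "Warning"),
    AlertRule("Stale Reboot", "uptime_days", ">", 30, 0, "Info"),
]

def breaches(rule: AlertRule, value: float) -> bool:
    """True if a single sample violates the rule's threshold."""
    return value > rule.threshold if rule.operator == ">" else value < rule.threshold
```

Keeping rules as data rather than hard-coded logic is what makes the enable/disable/duplicate operations in the exercise below cheap.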
Persistence Windows
A persistence window is the duration a condition must remain true before the alert fires. Without it, a brief CPU spike from opening Excel triggers a Critical alert.
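A sketch of how a persistence window might be evaluated (the timestamped-sample representation is an assumption, not a specific product's implementation):

```python
from datetime import datetime, timedelta

def should_fire(samples, threshold, persistence):
    """Fire only if every sample in the trailing persistence window
    breaches the threshold. `samples` is a list of (timestamp, value)
    pairs in chronological order, newest last."""
    if not samples:
        return False
    cutoff = samples[-1][0] - persistence
    if samples[0][0] > cutoff:
        return False  # not enough history yet to cover a full window
    window = [value for ts, value in samples if ts >= cutoff]
    return all(value > threshold for value in window)
```

With a 10-minute window, the brief Excel spike breaches the threshold for one sample and never fires; only sustained load does.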
Notification Channels
Email Notifications
Simplest channel. Configure SMTP, add distribution groups. Best for Warning and Info alerts.
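A minimal sketch using Python's standard library; the host and addresses are placeholders, and in practice the To: line is a distribution group:

```python
import smtplib
from email.message import EmailMessage

def build_alert_email(alert, to_addr="desktop-ops@example.com"):
    """Build the notification message from an alert dict (keys illustrative)."""
    msg = EmailMessage()
    msg["Subject"] = f"[{alert['severity']}] {alert['rule']} on {alert['hostname']}"
    msg["From"] = "alerts@example.com"
    msg["To"] = to_addr
    msg.set_content(f"{alert['metric']} = {alert['value']} (duration: {alert['duration']})")
    return msg

def send_alert_email(msg, smtp_host="smtp.example.com"):
    """Relay via the configured SMTP host (assumed to accept plain relay)."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```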
ServiceNow Integration
Auto-create incidents with severity-to-priority mapping, affected CI, and diagnostic details in work notes.
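One way the mapping might look; the priority scheme and field names below are illustrative, and the real mapping is whatever your ServiceNow instance defines:

```python
# Illustrative severity-to-priority mapping.
SEVERITY_TO_PRIORITY = {
    "Critical": 1,  # P1: immediate response
    "Warning": 3,   # P3: attention within hours
    "Info": 4,      # P4: log and review
}

def build_incident(alert):
    """Sketch of an auto-created incident record (hypothetical alert keys)."""
    return {
        "short_description": f"[{alert['severity']}] {alert['rule']} on {alert['hostname']}",
        "priority": SEVERITY_TO_PRIORITY[alert["severity"]],
        "cmdb_ci": alert["hostname"],        # affected CI
        "work_notes": alert["diagnostics"],  # diagnostic details
    }
```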
Webhook / API
Send alert payloads to Slack, Teams, PagerDuty, or custom automation. Full alert context in JSON.
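A sketch of such a payload; the field names are illustrative, not any specific receiver's schema:

```python
import json

def webhook_payload(alert):
    """Serialize the full alert context as JSON for a generic
    webhook receiver (field names are illustrative)."""
    return json.dumps({
        "hostname": alert["hostname"],
        "severity": alert["severity"],
        "metric": alert["metric"],
        "value": alert["value"],
        "duration": alert["duration"],
        "timestamp": alert["timestamp"],
    })
```

The receiver (a Slack/Teams app, PagerDuty, or a custom automation endpoint) parses this and routes or remediates as it sees fit.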
Alert Fatigue: The Silent Killer
- Team members dismiss alerts without investigating
- Alert email folders have thousands of unread messages
- Critical incidents discovered by users, not alerts
- Engineers auto-archive alert emails
- New hires told "just ignore most of those"
Start with 5-10 high-confidence rules. Review volume weekly. Require action for every alert — if the response is always "nothing to do," eliminate or downgrade it. Use suppression windows during maintenance.
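Suppression windows can be as simple as a list of time ranges checked before any notification goes out. A minimal sketch, with a hypothetical window for an overnight patch deployment:

```python
from datetime import datetime

# Hypothetical maintenance window: suppress alert noise
# between 01:00 and 05:00 during a patch deployment.
SUPPRESSION_WINDOWS = [
    (datetime(2024, 6, 5, 1, 0), datetime(2024, 6, 5, 5, 0), "patch deployment"),
]

def is_suppressed(alert_time, windows=SUPPRESSION_WINDOWS):
    """True if the alert fired inside a declared maintenance window."""
    return any(start <= alert_time <= end for start, end, _reason in windows)
```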
Simulated Alert Dashboard
Review the following alert dashboard. Identify which alerts need immediate action and which are noise.
| Hostname | Severity | Metric | Value | Duration | Timestamp |
|---|---|---|---|---|---|
| CAEI-445521 | CRITICAL | Disk Free % | 2% | 3 hours | 08:12 AM |
| CAEI-778432 | CRITICAL | CPU Utilization | 99% | 45 min | 10:30 AM |
| CAEI-332198 | WARNING | Memory | 89% | 12 min | 10:55 AM |
| CAEI-990015 | WARNING | Boot Time | 142 sec | N/A | 09:01 AM |
| CAEI-112847 | INFO | Uptime | 34 days | N/A | 07:00 AM |
| CAEI-556310 | INFO | Uptime | 21 days | N/A | 07:00 AM |
Scenario: Alert Storm After Patch Deployment
Wednesday morning. Your team deployed a cumulative Windows update to 2,000 endpoints last night. You arrive to find 347 Warning alerts and 28 Critical alerts for high CPU. Alerts started at 2:00 AM and are still coming in. ServiceNow has 375 auto-generated incidents.
What is the best first step?
A. Roll back the patch from all affected endpoints
B. Disable the CPU alert rules until the volume subsides
C. Verify the cause of the spikes, suppress non-critical alerts for the maintenance window, and monitor
D. Bulk-close the auto-generated ServiceNow incidents
Correct: C. Post-patch, it is common for TrustedInstaller (the Windows Modules Installer service) to spike CPU temporarily. Verify the cause, suppress non-critical alerts for the maintenance window, and monitor for resolution within 1-2 hours. Rolling back (A) is premature, disabling alerts (B) leaves you blind, and bulk-closing tickets (D) destroys the audit trail.
Exercise: Configure an Alert Rule
Walk through the 9-step process to create an alert rule:
1. Navigate: Performance > Alerts in the left-hand menu
2. Create: click "Create Alert Rule" to open the rule builder
3. Select Metric: choose CPU, memory, disk, boot time, or network latency
4. Define Condition: set the operator (greater than, less than) and threshold value
5. Set Persistence: how long the condition must remain true (1 min – 24 hours)
6. Choose Severity: Critical, Warning, or Informational
7. Define Scope: all endpoints, a computer group, or an OS type
8. Notifications: select channels (email, ServiceNow, webhook)
9. Save & Enable: rules can be enabled, disabled, or duplicated at any time
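The finished rule carries one field per step. A hypothetical result of the builder (field names and the computer group are illustrative):

```python
# Hypothetical output of the rule builder: one field for each
# of the nine steps above.
rule = {
    "name": "High CPU Sustained",
    "metric": "cpu_utilization",                      # step 3
    "condition": {"operator": ">", "threshold": 90},  # step 4
    "persistence": "10m",                             # step 5 (1 min - 24 hours)
    "severity": "Critical",                           # step 6
    "scope": {"computer_group": "Finance-Laptops"},   # step 7 (hypothetical group)
    "notifications": ["email", "servicenow"],         # step 8
    "enabled": True,                                  # step 9
}
```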
Knowledge Check
1. What is the primary purpose of a persistence window in an alert rule?
Correct: B. A persistence window requires the condition to remain true for a specified duration, filtering out momentary spikes and ensuring only sustained, actionable conditions trigger notifications.
2. Which of the following is a sign of alert fatigue in your team?
Correct: C. Auto-archiving alert emails is a clear sign the team has given up on processing them, meaning critical alerts are likely buried in the noise.
3. A Warning alert should typically result in which of the following actions?
Correct: B. Warning alerts indicate a degraded but functional state. They warrant a medium-priority ticket and attention within hours — not the immediate response of a Critical, nor the passive logging of an Informational.