Module 1 — Lesson 6 of 8

Root Cause Analysis

Learn to systematically trace endpoint performance problems back to their origin using Tanium Performance data, timeline analysis, and change correlation.
Steps in the RCA Process: 5
Pattern Types (Systemic vs. Isolated): 2
Documentation Rate Goal: 100%

Five-Step RCA Workflow

1. Identify Symptom: "Slow computer"
2. Timeline: When did it start? Gradual vs. sudden
3. Correlate: What changed? Patch? Deploy? GPO?
4. Isolate: Who is affected? 1 device or 500?
5. Root Cause: Determine and document. Prevent recurrence.

Remember

RCA is not about assigning blame. It is about understanding what happened so you can prevent it from happening again. A blameless RCA culture encourages transparency and better outcomes.

Systemic vs. Isolated Issues

Systemic
Multiple endpoints are affected. Usually caused by a change deployed at scale (patch, GPO, deployment). Fix once, fix all.

Isolated
A single endpoint or very few are affected. Typical causes are hardware failure, user-installed software, or a unique configuration. Requires per-device investigation.

Watch Out

Do not assume a problem is isolated because you have only one ticket. Many users suffer silently. Always check whether the same pattern exists on other endpoints before concluding it is isolated.
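
The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a Tanium API: the endpoint names, the CPU values, the 70% symptom threshold, and the 10% cutoff for calling a pattern systemic are all hypothetical assumptions. The point is that scope is decided from metrics across the whole group, not from the ticket count.

```python
def endpoints_matching_symptom(cpu_by_endpoint, cpu_threshold=70.0):
    """Return every endpoint whose average CPU meets the symptom threshold.

    cpu_threshold is an assumed value; tune it to the symptom under investigation.
    """
    return sorted(
        name for name, cpu in cpu_by_endpoint.items() if cpu >= cpu_threshold
    )


def classify_scope(matching, group_size, systemic_share=0.10):
    """Label the pattern 'systemic' or 'isolated' from how much of the group matches.

    systemic_share is a hypothetical cutoff, not a Tanium default.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    share = len(matching) / group_size
    if len(matching) > 1 and share >= systemic_share:
        return "systemic"
    return "isolated"


# Hypothetical data: only PC-001 has a ticket, but other endpoints show the same pattern.
cpu_by_endpoint = {
    "PC-001": 81.0,
    "PC-002": 34.0,
    "PC-003": 77.5,
    "PC-004": 79.0,
    "PC-005": 30.0,
}
matching = endpoints_matching_symptom(cpu_by_endpoint)
print(matching, classify_scope(matching, group_size=len(cpu_by_endpoint)))
```

With this sample data, one ticket would have suggested an isolated issue, yet three of the five endpoints match the symptom.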

Timeline Analysis in Tanium

[Figure: Tanium Console, Endpoint Timeline for CAEI-445521. CPU Utilization (%) from Monday to Thursday, with a "Patch deployed Tue 2:00 AM" marker separating the Normal and Post-change periods.]

Reading the timeline: A sudden spike at a specific time (especially off-hours) strongly suggests an event-driven cause. A gradual creep over days/weeks suggests resource exhaustion.
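
One rough way to separate the two shapes is to ask how much of the total rise happened in a single step between consecutive samples. The sketch below assumes evenly spaced samples exported from the metric history; the function name, the 60% jump share, and the 10-point minimum rise are hypothetical choices, not Tanium settings.

```python
def classify_onset(samples, jump_share=0.6, min_rise=10.0):
    """Classify a metric series (oldest first) as 'sudden', 'gradual', or 'flat'.

    'sudden'  : one step between consecutive samples covers at least
                jump_share of the total rise (event-driven pattern).
    'gradual' : the rise is spread across many steps (creep pattern).
    'flat'    : the net rise is below min_rise, so there is nothing to explain.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total_rise = samples[-1] - samples[0]
    if total_rise < min_rise:
        return "flat"
    largest_step = max(later - earlier for earlier, later in zip(samples, samples[1:]))
    return "sudden" if largest_step / total_rise >= jump_share else "gradual"


# Hypothetical CPU utilization series (%), one sample per interval.
spike = [35, 36, 34, 78, 79, 77]
creep = [35, 42, 50, 57, 64, 71, 78]
print(classify_onset(spike), classify_onset(creep))
```

A real series is noisier than these samples, so smoothing or a longer comparison window may be needed before applying a rule like this.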

Change Correlation Sources

Tanium Patch

Was a patch deployed around the time the problem started?

Tanium Deploy

Was new software installed or updated on the affected group?

Group Policy

Were GPO changes applied to these endpoints?

ServiceNow Changes

Check the change calendar for infrastructure changes at that time.
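
Once change records from these sources are gathered in one place, correlation reduces to a time-window filter around the symptom onset. The sketch below assumes the records have already been exported into plain dictionaries; the field names, the sample dates, and the six-hour look-back window are hypothetical, and no Tanium or ServiceNow API is called.

```python
from datetime import datetime, timedelta


def changes_before_onset(changes, onset, window=timedelta(hours=6)):
    """Return changes made within `window` before the onset, nearest first.

    Changes made after the onset are excluded because they cannot have caused it.
    """
    in_window = [
        change for change in changes
        if timedelta(0) <= onset - change["time"] <= window
    ]
    return sorted(in_window, key=lambda change: onset - change["time"])


# Hypothetical records; dates and descriptions are made up for illustration.
onset = datetime(2024, 1, 10, 2, 0)
changes = [
    {"source": "Tanium Patch", "what": "patch deployment", "time": datetime(2024, 1, 10, 1, 30)},
    {"source": "Group Policy", "what": "GPO update", "time": datetime(2024, 1, 3, 14, 0)},
    {"source": "ServiceNow", "what": "network maintenance", "time": datetime(2024, 1, 10, 9, 0)},
]
for change in changes_before_onset(changes, onset):
    print(change["source"], change["what"], onset - change["time"])
```

A change that survives this filter is only a candidate. Timing shows correlation, and the isolation step is still needed to establish cause.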

Change Impact Analysis

Compare endpoint performance before and after a known change event. Tanium's continuous metric history makes this possible.

Health Score Before: 82
Health Score After: 48
Control Group (No Change): 80

The control group confirms that the change — not an environment-wide factor — is the cause of the degradation.
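
The comparison can be written as simple arithmetic: subtract the control group's movement from the changed group's movement, and what remains is the part attributable to the change. In the sketch below, the scores 82, 48, and 80 come from the example above. The control group's earlier score of 81 is an assumed value added for illustration, since only its current score is given.

```python
def change_impact(before, after, control_before, control_after):
    """Compare a changed group's health-score movement against a control group."""
    changed_delta = after - before
    control_delta = control_after - control_before
    return {
        "changed_delta": changed_delta,
        "control_delta": control_delta,
        # Movement left over once the environment-wide drift is removed.
        "attributable_to_change": changed_delta - control_delta,
    }


# control_before=81 is an assumption; the other three values are from the example.
impact = change_impact(before=82, after=48, control_before=81, control_after=80)
print(impact)
```

If the control group had dropped by a similar amount, the attributable figure would be close to zero, pointing to an environment-wide factor instead of the change.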

Walkthrough: Post-Patch Slowness

The Situation

Thursday morning. Claims Department (120 endpoints) reports widespread slowness. Tickets started around 9 AM. Tuesday night patch cycle targeted this group.

Identify the Symptom

Health scores dropped 82 → 54 overnight. CPU avg 78% (up from 35%), disk I/O latency doubled. Memory normal.

Check the Timeline

Sharp inflection at 2:00 AM Wednesday — sudden, not gradual. Event-driven.

Correlate with Changes

KB5034441 deployed at 1:30 AM Wednesday, 30 minutes before the inflection. No other deployments in 7 days. No infrastructure changes.

Isolate the Group

Underwriting (same hardware, same OS, NOT patched) — stable at 80+. Problem is specific to the patched group.

Root Cause Found

TiWorker.exe stuck in a retry loop due to a .NET conflict. Fix: restart TrustedInstaller service via Tanium action.
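
Because the final step is to document the finding, it can help to capture each RCA in a fixed structure so that missing steps are easy to spot. The record below is a hypothetical sketch filled in with the walkthrough's findings; the class and field names are assumptions, not a Tanium or ServiceNow schema.

```python
from dataclasses import dataclass, fields


@dataclass
class RcaRecord:
    """One field per step of the five-step workflow, plus the fix."""
    symptom: str
    timeline: str
    correlated_change: str
    scope: str
    root_cause: str
    remediation: str

    def missing_fields(self):
        """Names of fields left blank, so an incomplete RCA is not filed."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


record = RcaRecord(
    symptom="Health score 82 to 54; CPU avg 78% (up from 35%); disk I/O latency doubled",
    timeline="Sharp inflection at 2:00 AM Wednesday; sudden, not gradual",
    correlated_change="KB5034441 deployed at 1:30 AM",
    scope="Claims Department (120 endpoints); unpatched Underwriting group stable at 80+",
    root_cause="TiWorker.exe stuck in a retry loop due to a .NET conflict",
    remediation="Restart TrustedInstaller service via Tanium action",
)
print(record.missing_fields())
```

A check like `missing_fields()` is one simple way to support a 100% documentation goal: a record is only filed when every step has an entry.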

Exercise: Order the RCA Steps

Put the five RCA steps in their correct order (1 through 5):

Correlate with changes
Identify the symptom
Determine root cause
Isolate affected group
Check the timeline

Correct order: (1) Identify the symptom, (2) Check the timeline, (3) Correlate with changes, (4) Isolate the affected group, (5) Determine the root cause. Each step narrows down possibilities until you reach the definitive cause.

Knowledge Check

1. An endpoint's health score dropped suddenly at 3:00 AM. What does this timing pattern most likely indicate?

Correct: B. A sudden, sharp decline at a specific time (especially outside business hours) strongly suggests an event such as a scheduled deployment or patch.

2. You suspect a Tuesday night patch caused performance degradation. What is the strongest evidence to confirm the patch is the root cause?

Correct: C. A controlled test is the strongest evidence of causation: reverse the suspected change on a subset and verify that those endpoints recover while the endpoints that keep the patch still show the problem. Timing alone (A) is correlation, not causation.