Root Cause Analysis
Five-Step RCA Workflow
RCA is not about assigning blame. It is about understanding what happened so you can prevent it from happening again. A blameless RCA culture encourages transparency and better outcomes.
Systemic vs. Isolated Issues
Do not assume a problem is isolated because you have only one ticket. Many users suffer silently. Always check whether the same pattern exists on other endpoints before concluding it is isolated.
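The check described above can be sketched in a few lines. This is a minimal illustration, not Tanium's API: the EndpointSnapshot record, the field names, and the 70% CPU threshold are all hypothetical, and the data is made up to show how one ticket can hide several affected endpoints.

```python
from dataclasses import dataclass


@dataclass
class EndpointSnapshot:
    """Hypothetical per-endpoint record; not a Tanium data structure."""
    name: str
    cpu_avg_pct: float   # average CPU utilization over the sample window
    has_ticket: bool     # True if a user on this endpoint filed a ticket


def find_silent_sufferers(snapshots, cpu_threshold_pct=70.0):
    """Return names of endpoints that show the symptom but have no ticket."""
    return [
        s.name
        for s in snapshots
        if s.cpu_avg_pct >= cpu_threshold_pct and not s.has_ticket
    ]


def is_systemic(snapshots, cpu_threshold_pct=70.0, min_affected=2):
    """Treat the issue as systemic if at least min_affected endpoints match."""
    affected = [s for s in snapshots if s.cpu_avg_pct >= cpu_threshold_pct]
    return len(affected) >= min_affected


# Illustrative data: one ticket, but three endpoints show the same pattern.
snapshots = [
    EndpointSnapshot("claims-001", 81.0, True),
    EndpointSnapshot("claims-002", 77.5, False),
    EndpointSnapshot("claims-003", 79.0, False),
    EndpointSnapshot("claims-004", 33.0, False),
]

print(find_silent_sufferers(snapshots))  # ['claims-002', 'claims-003']
print(is_systemic(snapshots))            # True
```

In practice the threshold and the metric would come from whatever symptom the original ticket describes.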
Timeline Analysis in Tanium
Reading the timeline: A sudden spike at a specific time (especially off-hours) strongly suggests an event-driven cause. A gradual creep over days/weeks suggests resource exhaustion.
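One rough way to express the spike-versus-creep distinction in code is to ask how much of the total decline happened in a single step. The sketch below assumes equally spaced health-score samples exported from the timeline. The function name and both cutoffs (sudden_share, min_drop) are assumptions chosen for illustration, not values defined by Tanium.

```python
def classify_decline(scores, sudden_share=0.6, min_drop=5.0):
    """Classify a health-score series as 'sudden', 'gradual', or 'stable'.

    scores: equally spaced samples, oldest first.
    sudden_share: assumed cutoff; if one step accounts for at least this
        share of the total decline, the decline is called sudden.
    min_drop: assumed minimum total decline worth classifying at all.
    """
    if len(scores) < 2:
        return "stable"
    total_drop = scores[0] - scores[-1]
    if total_drop < min_drop:
        return "stable"
    # Largest decline between any two consecutive samples.
    largest_step = max(a - b for a, b in zip(scores, scores[1:]))
    return "sudden" if largest_step / total_drop >= sudden_share else "gradual"


# Illustrative series only.
print(classify_decline([82, 82, 81, 55, 54, 54]))          # sudden
print(classify_decline([82, 78, 74, 70, 66, 62, 58, 54]))  # gradual
print(classify_decline([80, 81, 80, 82]))                  # stable
```

A noisy series would need smoothing first; this sketch only captures the shape of the reasoning.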
Change Correlation Sources
Tanium Patch: Was a patch deployed around the time the problem started?
Tanium Deploy: Was new software installed or updated on the affected group?
Group Policy: Were GPO changes applied to these endpoints?
ServiceNow Changes: Check the change calendar for infrastructure changes at that time.
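Once change records from these sources are gathered in one place, correlation comes down to a time-window filter. The sketch below assumes the records have already been exported into plain dictionaries. The record layout, the six-hour lookback, and the sample dates are hypothetical, and no Tanium or ServiceNow API is called.

```python
from datetime import datetime, timedelta


def changes_near_onset(changes, onset, lookback=timedelta(hours=6)):
    """Return changes made within `lookback` before the onset, nearest first.

    Changes made after the onset are excluded, since they cannot have
    caused a problem that had already started.
    """
    candidates = [
        c for c in changes
        if timedelta(0) <= onset - c["time"] <= lookback
    ]
    return sorted(candidates, key=lambda c: onset - c["time"])


# Hypothetical change records collected by hand from the four sources.
changes = [
    {"source": "Tanium Patch", "time": datetime(2024, 1, 10, 1, 30)},
    {"source": "Tanium Deploy", "time": datetime(2024, 1, 3, 22, 0)},
    {"source": "Group Policy", "time": datetime(2024, 1, 10, 9, 0)},
    {"source": "ServiceNow", "time": datetime(2024, 1, 9, 23, 0)},
]
onset = datetime(2024, 1, 10, 2, 0)

for change in changes_near_onset(changes, onset):
    print(change["source"], change["time"])
# Tanium Patch 2024-01-10 01:30:00
# ServiceNow 2024-01-09 23:00:00
```

A match here is only a candidate cause. Timing is correlation, so the shortlist still has to be confirmed against an unaffected group.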
Change Impact Analysis
Compare endpoint performance before and after a known change event. Tanium's continuous metric history makes this possible.
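Pair the before/after comparison with a control group of similar endpoints that did not receive the change. If the control group stays stable while the changed group degrades, the change, rather than an environment-wide factor, is the likely cause of the degradation.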
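The before/after comparison with a control group can be written as a simple difference of differences. The sketch below is illustrative only: the function name is an assumption, the health scores are made-up samples, and a real analysis would pull per-endpoint history for both groups over matching time windows.

```python
from statistics import mean


def change_impact(changed_before, changed_after, control_before, control_after):
    """Compare the shift in the changed group against the control group.

    Returns the mean shift in each group and the portion of the changed
    group's shift that the control group does not share.
    """
    changed_delta = mean(changed_after) - mean(changed_before)
    control_delta = mean(control_after) - mean(control_before)
    return {
        "changed_delta": changed_delta,
        "control_delta": control_delta,
        "attributable": changed_delta - control_delta,
    }


# Made-up health scores for a few endpoints in each group.
result = change_impact(
    changed_before=[82.0, 83.0, 81.0],
    changed_after=[54.0, 55.0, 53.0],
    control_before=[81.0, 80.0, 82.0],
    control_after=[80.0, 81.0, 82.0],
)
print(result)
# {'changed_delta': -28.0, 'control_delta': 0.0, 'attributable': -28.0}
```

If the control group had dropped by a similar amount, the attributable figure would be near zero, pointing to an environment-wide factor instead of the change.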
Walkthrough: Post-Patch Slowness
Wednesday morning. The Claims Department (120 endpoints) reports widespread slowness. Tickets started around 9 AM. The Tuesday night patch cycle targeted this group.
Identify the Symptom
Health scores dropped 82 → 54 overnight. CPU avg 78% (up from 35%), disk I/O latency doubled. Memory normal.
Check the Timeline
Sharp inflection at 2:00 AM Wednesday — sudden, not gradual. Event-driven.
Correlate with Changes
KB5034441 deployed at 1:30 AM. No other deployments in 7 days. No infrastructure changes.
Isolate the Group
Underwriting (same hardware, same OS, NOT patched) — stable at 80+. Problem is specific to the patched group.
Root Cause Found
TiWorker.exe stuck in a retry loop due to a .NET conflict. Fix: restart TrustedInstaller service via Tanium action.
Exercise: Order the RCA Steps
Arrange the five RCA steps in their correct order (1 through 5).
Correct order: (1) Identify the symptom, (2) Check the timeline, (3) Correlate with changes, (4) Isolate the affected group, (5) Determine the root cause. Each step narrows down possibilities until you reach the definitive cause.
Knowledge Check
1. An endpoint's health score dropped suddenly at 3:00 AM. What does this timing pattern most likely indicate?
Correct: B. A sudden, sharp decline at a specific time (especially outside business hours) strongly suggests an event such as a scheduled deployment or patch.
2. You suspect a Tuesday night patch caused performance degradation. What is the strongest evidence to confirm the patch is the root cause?
Correct: C. A controlled test is the strongest evidence of causation: reverse the suspected change on a subset of endpoints and verify that the problem resolves there while the endpoints that keep the patch still show it. Timing alone (A) is correlation, not causation.