Module 1 — Lesson 3 of 8

Dashboards & Metrics

Navigate the Performance dashboards, understand key metrics, and learn how to drill down from fleet-wide views to individual endpoints.

📚 Overview

🔧 Deep Dive

🛠 Hands-On

✅ Check


Dashboard Drill-Down Hierarchy

The Performance dashboards are organized as a funnel: start broad at the fleet level, narrow to a computer group, then drill into specific endpoints that need attention. This top-down approach ensures you focus time on the machines that matter most.

Six categories of metrics are tracked: CPU utilization, memory pressure, disk I/O and space, boot/login times, application crashes/hangs, and network latency. Each tells a different story about the endpoint experience.

Simulated Performance Overview Dashboard

Tanium Console -- Performance Overview

Tabs: Overview | By Group | Endpoints | Alerts

Fleet Health Score: 7.6
Good/Excellent: 72%
Fair: 22%
Poor: not shown

Fleet Averages -- Key Metrics

Metric	Fleet Average
CPU Usage	35%
Memory Pressure	62%
Disk Usage	48%
Avg Boot Time	2m 45s

Top Issues Detected

Issue	Affected Endpoints	Severity
Disk space below 10%	47	High
Boot time > 5 minutes	38	High
High memory pressure (>90%)	29	Medium
Frequent app crashes (>3/day)	15	Medium
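
The thresholds in the table above can be read as simple rules applied to each endpoint. The following is a minimal sketch of that idea, not how Performance computes the table internally. The `EndpointSnapshot` class, its field names, and `summarize_issues` are hypothetical names chosen for illustration; only the four thresholds and severities come from the table.

```python
from dataclasses import dataclass


@dataclass
class EndpointSnapshot:
    """Hypothetical per-endpoint record; field names are illustrative only."""
    name: str
    disk_free_pct: float        # free disk space, percent
    boot_seconds: float         # last boot duration, seconds
    memory_pressure_pct: float  # memory pressure, percent
    crashes_per_day: float      # application crashes per day


# Thresholds and severities copied from the "Top Issues Detected" table.
ISSUE_RULES = [
    ("Disk space below 10%", "High", lambda e: e.disk_free_pct < 10),
    ("Boot time > 5 minutes", "High", lambda e: e.boot_seconds > 5 * 60),
    ("High memory pressure (>90%)", "Medium", lambda e: e.memory_pressure_pct > 90),
    ("Frequent app crashes (>3/day)", "Medium", lambda e: e.crashes_per_day > 3),
]


def summarize_issues(endpoints):
    """Return (issue, affected_count, severity) rows, most affected first."""
    rows = []
    for issue, severity, rule in ISSUE_RULES:
        affected = sum(1 for e in endpoints if rule(e))
        if affected:
            rows.append((issue, affected, severity))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Keeping the rules in one list makes it easy to see which threshold produced each row and to adjust a threshold without touching the counting logic.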

The Six Metric Categories

1. CPU Utilization

What it measures: Percentage of CPU capacity used, tracking average and peak utilization.

Why it matters: Sustained high CPU (>80% average) causes application slowness and input lag. Short spikes are normal; persistent high usage suggests a runaway process, insufficient hardware, or a misconfigured application.

Healthy range: Average below 60%, peak spikes returning to baseline quickly.
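
The difference between a short spike and sustained load can be sketched by classifying a series of CPU samples on their average rather than their peak. This is an illustrative sketch only: the function name `classify_cpu` and the "elevated" label for the 60-80% band are assumptions, while the 60% and 80% figures come from the text above.

```python
from statistics import mean


def classify_cpu(samples, sustained_avg=80.0, healthy_avg=60.0):
    """Classify CPU utilization samples (percent) by their average.

    Returns a (label, average, peak) tuple. The "elevated" label for the
    band between the two thresholds is an assumption made for this sketch.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    avg = mean(samples)
    peak = max(samples)
    if avg > sustained_avg:
        label = "sustained-high"
    elif avg < healthy_avg:
        label = "healthy"
    else:
        label = "elevated"
    return label, avg, peak
```

A series such as `[20, 25, 95, 22, 21]` has a high peak but a low average, so it is classified as healthy; a series that stays near 90% is classified as sustained-high.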

2. Memory (RAM) Usage

What it measures: Total physical memory, memory in use, available memory, and page file utilization. Calculates memory pressure percentage.

Why it matters: When RAM is exhausted, Windows pages to disk (swapping), which is orders of magnitude slower. High memory pressure makes endpoints feel sluggish even with low CPU. Common causes: too many browser tabs, memory leaks, insufficient RAM.
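
As a rough illustration of a memory pressure percentage, the sketch below divides memory in use by total physical memory. This formula is an assumption for teaching purposes; the lesson does not state how Performance calculates memory pressure, and the real calculation may also account for page file utilization. The function name is hypothetical.

```python
def memory_pressure_pct(total_mb, in_use_mb):
    """Approximate memory pressure as in-use RAM over total RAM, in percent.

    Assumed formula for illustration; not confirmed as the product's method.
    """
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    if not 0 <= in_use_mb <= total_mb:
        raise ValueError("in_use_mb must be between 0 and total_mb")
    return round(100.0 * in_use_mb / total_mb, 1)
```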

3. Disk I/O and Space

What it measures: IOPS, throughput (MB/s), disk queue length, and free disk space.

Why it matters: High queue lengths mean the disk cannot keep up with requests. Low free space (<10%) causes update failures, crashes, and instability. HDD-equipped machines consistently score worse than SSD machines on this metric.

4. Boot and Login Times

What it measures: Total time from power-on to usable desktop, broken into phases: BIOS/UEFI, OS kernel, service startup, user profile, Group Policy, startup apps.

Why it matters: Boot time is the most visible employee experience metric. An 8-minute boot vs. 90 seconds is immediately felt. Slow boots often indicate too many startup programs, a failing HDD, or a Group Policy misconfiguration.
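
Because boot time is broken into phases, a slow boot can be narrowed down by finding the phase that takes the largest share of the total. The sketch below assumes the per-phase durations are available as a dictionary of seconds; the function name and the sample durations are hypothetical, and only the phase names come from the text above.

```python
def slowest_boot_phase(phase_seconds):
    """Return (phase, seconds, share_pct, total_seconds) for the slowest phase."""
    total = sum(phase_seconds.values())
    if total <= 0:
        raise ValueError("phase durations must sum to a positive value")
    phase, seconds = max(phase_seconds.items(), key=lambda item: item[1])
    return phase, seconds, round(100.0 * seconds / total, 1), total


# Hypothetical durations for one slow boot, in seconds.
example_boot = {
    "BIOS/UEFI": 12,
    "OS kernel": 25,
    "Service startup": 60,
    "User profile": 40,
    "Group Policy": 175,
    "Startup apps": 60,
}
```

With these made-up numbers, `slowest_boot_phase(example_boot)` points at Group Policy, which would suggest reviewing policy processing before looking at hardware.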

5. Application Crashes and Hangs

What it measures: Number and frequency of crash events (unhandled exceptions) and hang events ("Not Responding") from Windows Event Log.

Why it matters: Frequent crashes suggest compatibility issues, corrupt installations, or driver conflicts. A sudden spike across many endpoints often correlates with a recent update or config change.
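
One simple way to spot a sudden spike is to compare each day's count of crashing endpoints with a typical day. The sketch below uses the median as the baseline and flags days above a multiple of it. The function name, the input shape, and the factor of 3 are assumptions for illustration, not product behavior.

```python
from statistics import median


def find_crash_spikes(daily_counts, factor=3.0):
    """Return the days whose count exceeds factor times the median count.

    daily_counts maps an ISO date string to the number of endpoints that
    reported at least one crash that day. The baseline is floored at 1 so
    that a median of zero does not flag every nonzero day.
    """
    if not daily_counts:
        return []
    baseline = max(median(daily_counts.values()), 1)
    return [day for day, count in sorted(daily_counts.items())
            if count > factor * baseline]
```

Any flagged day is a good starting point for checking which update or configuration change was deployed shortly before it.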

6. Network Latency

What it measures: Round-trip time (RTT) to domain controllers, file servers, and cloud service gateways.

Why it matters: Separates "slow machine" from "slow network." An endpoint with healthy CPU, memory, and disk but high network latency will still feel slow to the user. Helps route issues to the right team (endpoint vs. network).
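
The routing decision described above can be sketched as a small triage function. The CPU (>80%) and memory (>90%) limits come from this lesson; the disk queue and RTT limits are placeholder assumptions that would need tuning for a real environment, and the function name is hypothetical.

```python
def triage(cpu_pct, memory_pct, disk_queue, rtt_ms,
           cpu_limit=80.0, memory_limit=90.0,
           disk_queue_limit=2.0, rtt_limit_ms=100.0):
    """Suggest which team should look at a slow endpoint first.

    disk_queue_limit and rtt_limit_ms are assumed defaults for this sketch.
    """
    endpoint_busy = (cpu_pct > cpu_limit
                     or memory_pct > memory_limit
                     or disk_queue > disk_queue_limit)
    network_slow = rtt_ms > rtt_limit_ms
    if endpoint_busy and network_slow:
        return "endpoint and network teams"
    if endpoint_busy:
        return "endpoint team"
    if network_slow:
        return "network team"
    return "no obvious bottleneck"
```

For example, an endpoint at 35% CPU and 62% memory with a 250 ms round trip to its file server would be routed to the network team.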

Drill-Down Workflow: Fleet to Endpoint

1. Spot the Trend: On the Overview dashboard, notice Claims Department health dropped from 7.8 to 6.2 this week.

2. Drill into Group: Click Claims Department -- see that 45 of 200 endpoints are now "Fair" or "Poor."

3. Sort by Score: Sort the endpoint list ascending -- worst-performing machines appear at the top.

4. Examine Endpoint: Click the lowest-scoring endpoint. Detail shows: CPU 92% (SearchIndexer.exe), memory 95%, boot 6 min.

5. Take Action: Initiate remediation -- restart the process, deploy a fix, or flag for technician visit.
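
The sorting step in this workflow amounts to ordering endpoints by health score, lowest first. A minimal sketch, assuming the scores have been exported as a name-to-score mapping; the function name and every endpoint name other than CAEI778234 are hypothetical.

```python
def worst_endpoints(scores, limit=5):
    """Return up to `limit` (endpoint, score) pairs, lowest score first."""
    return sorted(scores.items(), key=lambda item: item[1])[:limit]


# Hypothetical export of health scores for one computer group.
claims_scores = {
    "CAEI778234": 4.2,
    "EXAMPLE-001": 7.9,
    "EXAMPLE-002": 5.1,
    "EXAMPLE-003": 6.6,
}
```

Calling `worst_endpoints(claims_scores, limit=2)` returns the two machines to examine first, which mirrors sorting the endpoint list in ascending order in the console.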

Time Range Filters

Time Range	Best For
Last 1 hour	Real-time troubleshooting during an active incident
Last 24 hours	Day-over-day comparison
Last 7 days	Weekly trend analysis (most common view)
Last 30 days	Monthly reporting and long-term trends
Custom range	Before/after comparisons around specific changes

Tip: Before/After Comparisons

To investigate the impact of a change (for example, "Did Tuesday's Windows update affect boot times?"), use the custom time range to compare the week before the change with the week after. The trend graphs in Performance make this visual comparison straightforward.
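
The same before/after comparison can be sketched numerically for exported data. The example below assumes a list of `(date, value)` samples, such as daily average boot time in seconds, and compares the mean of the window before a change date with the mean of the window starting on it. The function name and data shape are assumptions for illustration.

```python
from datetime import date, timedelta
from statistics import mean


def before_after(samples, change_day, window_days=7):
    """Compare mean values in equal windows before and after change_day.

    samples is a list of (date, value) pairs. Returns a tuple of
    (mean_before, mean_after, difference), where a positive difference
    means the metric increased after the change.
    """
    start = change_day - timedelta(days=window_days)
    end = change_day + timedelta(days=window_days)
    before = [value for day, value in samples if start <= day < change_day]
    after = [value for day, value in samples if change_day <= day < end]
    if not before or not after:
        raise ValueError("need at least one sample on each side of the change")
    mean_before, mean_after = mean(before), mean(after)
    return mean_before, mean_after, mean_after - mean_before
```

A positive difference for boot time after an update is a signal to drill into the affected groups rather than proof that the update caused it.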

Simulated: Endpoint Detail View

Tanium Console -- Endpoint Detail: CAEI778234

Tabs: Overview | By Group | Endpoints | Detail

Metric	Value
Health Score	4.2
CPU Avg	92%
Memory Used	95%
Last Boot	6m 12s

Resource Breakdown

Resource	Value
CPU	92%
Memory	95%
Disk I/O Queue	4.5
Disk Free	12%

Top Processes

Process	CPU %	Memory MB
SearchIndexer.exe	48%	312
chrome.exe (12 tabs)	22%	1,840
outlook.exe	8%	420
Teams.exe	6%	380

🤔 What Would You Do?

It's Monday morning and you open the Performance Overview dashboard. You notice that the fleet-wide health score dropped from 7.5 to 5.8 over the weekend. The Health Distribution chart shows that 30% of endpoints moved from "Good" to "Fair." The trend line shows the drop happened gradually between Saturday 2:00 AM and Saturday 6:00 AM.

What is the most likely cause, and what should you investigate first?

A. Employees were working overtime on Saturday and overloaded their machines
B. A scheduled maintenance window (Windows Update, software deployment, or Group Policy change) ran during that timeframe
C. The Tanium Server experienced an outage and data is inaccurate
D. Ignore it -- scores fluctuate on weekends when machines are idle

Answer: A scheduled maintenance window ran during that timeframe. The timing (Saturday 2:00-6:00 AM) and the gradual spread across 30% of endpoints are a classic signature of scheduled maintenance activity -- Windows Update, a software deployment, or a Group Policy change. Check your WSUS/SCCM/Intune update history and any change management records for that window.
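
Pinning down when a drop happened, as in this scenario, can be sketched as finding the longest-falling stretch in a score time series. The function below returns the start, end, and size of the strictly decreasing run with the largest total drop. The function name and the sample readings are hypothetical; the readings are only shaped to resemble the scenario.

```python
def largest_decline(series):
    """Find the strictly decreasing run with the largest total drop.

    series is a time-ordered list of (label, score) pairs. Returns
    (start_label, end_label, drop), or (None, None, 0.0) if the score
    never falls.
    """
    best = (None, None, 0.0)
    run_start = 0
    for i in range(1, len(series) + 1):
        # A run ends at the end of the series or when the score stops falling.
        if i == len(series) or series[i][1] >= series[i - 1][1]:
            drop = series[run_start][1] - series[i - 1][1]
            if drop > best[2]:
                best = (series[run_start][0], series[i - 1][0], round(drop, 2))
            run_start = i
    return best


# Hypothetical fleet health readings around the weekend.
weekend = [
    ("Fri 22:00", 7.5), ("Sat 00:00", 7.5), ("Sat 02:00", 7.5),
    ("Sat 03:00", 7.0), ("Sat 04:00", 6.5), ("Sat 05:00", 6.1),
    ("Sat 06:00", 5.8), ("Sat 08:00", 5.8), ("Sat 10:00", 5.9),
]
```

Knowing the start and end of the decline gives you the exact window to compare against change management records.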

Match the Metric to Its Description

Drag each metric on the left to its correct description on the right.

CPU Utilization

Memory Pressure

Disk Queue Length

Boot Time

Application Crashes

Duration from power-on to a usable desktop

Percentage of processing capacity in use

Process terminations from unhandled exceptions

Number of pending disk read/write operations

How close the system is to exhausting physical RAM

All matches correct! You have a solid understanding of the key Performance metrics.

Some matches are incorrect. Review the metrics descriptions in the Deep Dive tab and try again.

Walkthrough: Drill Down to Root Cause

Follow this step-by-step simulated walkthrough to practice the drill-down workflow.

Tanium Console -- Computer Group View

Tabs: Overview | By Group | Endpoints | Alerts

Health Scores by Department

Computer Group	Endpoints	Avg Score	Trend (7d)
IT Department	85	8.2	+0.1
Underwriting	320	7.1	0.0
Customer Service	450	6.8	-0.3
Claims Processing	200	5.4	-1.6
Remote Workers	280	7.4	+0.2

Action: Claims Processing stands out with a 1.6-point drop. Click into this group to investigate.
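
Picking out the group that needs attention can be sketched as sorting the table above by its 7-day trend. The rows below are copied from the table; the function name and the -0.3 cutoff are assumptions chosen for illustration.

```python
# (computer group, endpoints, average score, 7-day trend), from the table above.
GROUPS = [
    ("IT Department", 85, 8.2, 0.1),
    ("Underwriting", 320, 7.1, 0.0),
    ("Customer Service", 450, 6.8, -0.3),
    ("Claims Processing", 200, 5.4, -1.6),
    ("Remote Workers", 280, 7.4, 0.2),
]


def declining_groups(groups, cutoff=-0.3):
    """Return groups whose trend is at or below cutoff, steepest drop first."""
    flagged = [group for group in groups if group[3] <= cutoff]
    return sorted(flagged, key=lambda group: group[3])
```

With the table's values, Claims Processing is listed first, followed by Customer Service, which matches the order in which you would investigate them.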

✍ Knowledge Check

1. In the Performance Overview dashboard, what does the Health Distribution chart show?

A. A list of the top 10 worst-performing endpoints
B. The percentage of endpoints in each health category (Excellent, Good, Fair, Poor)
C. Network bandwidth utilization across the fleet
D. A comparison of Tanium versus SCCM data accuracy

Answer: The Health Distribution chart shows the percentage of endpoints categorized as Excellent, Good, Fair, or Poor -- giving you a quick visual breakdown of overall fleet health.

2. Which metric would help you distinguish between a "slow machine" problem and a "slow network" problem?

A. CPU Utilization
B. Boot Time
C. Network Latency (RTT to infrastructure endpoints)
D. Application Crashes

Answer: Network Latency (round-trip time) is the metric that separates local machine problems from network problems. If CPU, memory, and disk are healthy but network RTT is high, the slowness is network-related.


DEX Training
