At Microsoft, we embrace open standards by adopting and supporting OpenTelemetry metrics stored in Azure Monitor workspaces (AMW), with Prometheus Query Language (PromQL) as our foundational metrics query language across all AMW metrics.
Before reading this article, users should first understand the difference between Host OS and Guest OS performance counters on virtual machines.
This article covers Guest OS performance counters, which users must opt in to collecting through Azure Monitor Agent with a data collection rule (DCR), VM Insights with a DCR, or their own OTelCollector as part of OTel instrumentation libraries. Users should store all metrics in the metrics-optimized Azure Monitor workspace, where they are cheaper and faster to query than in Log Analytics workspaces.
This article provides users with the following information:
- Overview of performance counters
- Benefits of using OpenTelemetry system metrics
- Benefits of using Azure Monitor Workspace for metrics
- Comparison of OpenTelemetry naming convention to traditional performance counters
- Resource Attributes
OpenTelemetry Guest OS Performance Counters are currently in public preview.
Performance Counters
Both Windows and Linux provide users with OS-level metrics related to CPU usage, memory consumption, disk I/O, networking, and more to help diagnose performance issues. To see an example on your local machine, use Performance Monitor (perfmon) on Windows or the perf command on Linux.
The total number of available OS performance counters is dynamic. Windows provides ~1846 OS performance counters by default, and more are available depending on the local machine's hardware, software, and tracepoint events.
A subset of OpenTelemetry metrics is known as system metrics. System metrics are essentially another name for performance counters: they are an open-source standard for consistent naming and formatting of performance counters, and they do not add any net-new OS performance counters.
Benefits of OpenTelemetry
Cross-OS observability: The OpenTelemetry semantic convention for system metrics streamlines the cross-OS experience by converging Windows and Linux performance counters into a consistent naming convention and metric data model. Users can manage the virtual machines and nodes across their fleet with a single set of queries that works for both Windows and Linux OS images. The same configuration-as-code (ARM/Bicep templates, Terraform, and so on) with the same PromQL queries can be used for any hosting resource that adopts OpenTelemetry system metrics.
More performance counters: The OpenTelemetry Collector Host Metrics Receiver collects many more performance counters than Azure Monitor currently makes available for collection via DCR with a Log Analytics workspace as the destination. For example, users can now monitor per-process CPU utilization, disk I/O, memory usage, and more.
Fewer performance counters: In many scenarios, several existing performance counters have been consolidated into a single OTel system metric with metric dimensions (Resource Attributes), which simplifies the user experience.
For example, the CPU time in different states can surface as the following three performance counters in Windows:
- \Processor Information(_Total)\% Processor Time
- \Processor Information(_Total)\% Privileged Time
- \Processor Information(_Total)\% User Time
or as the following seven performance counters in Linux:
- Cpu/usage_user
- Cpu/usage_system
- Cpu/usage_idle
- Cpu/usage_active
- Cpu/usage_nice
- Cpu/usage_iowait
- Cpu/usage_irq
In OpenTelemetry, all of these counters become a single metric, system.cpu.time, and the time spent in each state (such as user, system, or idle) can be found by filtering on the state dimension.
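As a minimal sketch of what filtering on the state dimension means, the following Python snippet models a handful of system.cpu.time data points and selects or groups them by state. The data points, their values, and the helper names `seconds_in_state` and `seconds_by_state` are hypothetical and exist only for illustration; they are not part of any Azure Monitor or OpenTelemetry API.

```python
from collections import defaultdict

# Hypothetical system.cpu.time data points (cumulative seconds).
# The values are invented for illustration; the attribute names
# (cpu, state) follow the dimensions described in this article.
DATA_POINTS = [
    {"cpu": "cpu0", "state": "user", "value": 120.0},
    {"cpu": "cpu0", "state": "system", "value": 45.0},
    {"cpu": "cpu0", "state": "idle", "value": 800.0},
    {"cpu": "cpu1", "state": "user", "value": 100.0},
    {"cpu": "cpu1", "state": "system", "value": 30.0},
    {"cpu": "cpu1", "state": "idle", "value": 850.0},
]


def seconds_in_state(points, state):
    """Sum system.cpu.time across all logical CPUs for a single state."""
    return sum(p["value"] for p in points if p["state"] == state)


def seconds_by_state(points):
    """Group system.cpu.time by state, summing across logical CPUs."""
    totals = defaultdict(float)
    for p in points:
        totals[p["state"]] += p["value"]
    return dict(totals)


if __name__ == "__main__":
    print(seconds_in_state(DATA_POINTS, "user"))
    print(seconds_by_state(DATA_POINTS))
```

One metric with a state dimension replaces the separate per-state counters listed above, so the same selection logic applies whether the data came from a Windows or a Linux machine.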
Benefits of Azure Monitor Workspace
Metrics stored in Azure Monitor workspaces are cheaper and faster to query than when stored in Log Analytics workspaces, due to the different data models backing these different data stores.
In addition to those general benefits, users no longer experience schema mismatches between the Perf and Insights tables. VM Insights (v2) sending to AMW uses a subset of the OpenTelemetry system metrics we make available to users, providing seamless compatibility across user cohorts. Large enterprises with application teams that use a mix of VM Insights and non-VM Insights Guest OS performance counter monitoring can use the same PromQL queries, dashboards, and alerts for the same OTel metrics.
Performance Counter Names
The following performance counters are collected by the Azure Monitor Agent for Windows and Linux virtual machines. The default sampling frequency is 60 seconds, but this frequency can be changed when creating or updating the data collection rule.
| OTel Performance Counter | Type | Unit | Aggregation | Monotonic | Dimensions | Description |
|---|---|---|---|---|---|---|
| system.cpu.utilization | Gauge | 1 | N/A | FALSE | cpu: Logical CPU number starting at 0 (values: Any Str); state: Breakdown of CPU usage by type (values: idle, interrupt, nice, softirq, steal, system, user, wait) | Difference in system.cpu.time since the last measurement per logical CPU, divided by the elapsed time (0–1). |
| system.cpu.time | Sum | s | Cumulative | TRUE | cpu: Logical CPU number starting at 0 (values: Any Str); state: Breakdown of CPU usage by type (values: idle, interrupt, nice, softirq, steal, system, user, wait) | Total seconds each logical CPU spent on each mode. |
| system.cpu.physical.count | Sum | {cpu} | Cumulative | FALSE | (none) | Number of available physical CPUs. |
| system.cpu.logical.count | Sum | {cpu} | Cumulative | FALSE | cpu: Logical CPU number starting at 0 (values: Any Str) | Number of available logical CPUs. |
| system.cpu.load_average.5m | Gauge | {thread} | N/A | FALSE | (none) | Average CPU Load over 5 minutes. |
| system.cpu.load_average.1m | Gauge | {thread} | N/A | FALSE | (none) | Average CPU Load over 1 minute. |
| system.cpu.load_average.15m | Gauge | {thread} | N/A | FALSE | (none) | Average CPU Load over 15 minutes. |
| system.cpu.frequency | Gauge | Hz | N/A | FALSE | (none) | Current frequency of the CPU core in Hz. |
| process.uptime | Gauge | s | N/A | FALSE | (none) | Time the process has been running. |
| process.threads | Sum | {threads} | Cumulative | FALSE | (none) | Process threads count. |
| process.signals_pending | Sum | {signals} | Cumulative | FALSE | (none) | Number of pending signals for the process (Linux only). |
| process.paging.faults | Sum | {faults} | Cumulative | TRUE | type: Type of fault (values: major, minor) | Number of page faults the process has made (Linux only). |
| process.open_file_descriptors | Sum | {count} | Cumulative | FALSE | (none) | Number of file descriptors in use by the process. |
| process.memory.virtual | Sum | By | Cumulative | FALSE | (none) | Virtual memory size. |
| process.memory.utilization | Gauge | 1 | N/A | FALSE | (none) | Percentage of total physical memory used by the process. |
| process.memory.usage | Sum | By | Cumulative | FALSE | (none) | Amount of physical memory in use. |
| system.disk.weighted_io_time | Sum | s | Cumulative | FALSE | device: Name of the disk (values: Any Str) | Time disk spent activated multiplied by queue length. |
| system.disk.pending_operations | Sum | {operations} | Cumulative | FALSE | device: Name of the disk (values: Any Str) | Queue size of pending I/O operations. |
| system.disk.operations | Sum | {operations} | Cumulative | TRUE | device: Name of the disk (values: Any Str); direction: Direction of flow (values: read, write) | Disk operations count. |
| system.disk.operation_time | Sum | s | Cumulative | TRUE | device: Name of the disk (values: Any Str); direction: Direction of flow (values: read, write) | Time spent in disk operations. |
| system.disk.merged | Sum | {operations} | Cumulative | TRUE | device: Name of the disk (values: Any Str); direction: Direction of flow (values: read, write) | Disk reads/writes merged into single physical operations. |
| system.disk.io_time | Sum | s | Cumulative | TRUE | device: Name of the disk (values: Any Str) | Time disk spent activated. |
| system.disk.io | Sum | By | Cumulative | TRUE | device: Name of the disk (values: Any Str); direction: Direction of flow (values: read, write) | Disk bytes transferred. |
| process.handles | Sum | {count} | Cumulative | FALSE | (none) | Number of open handles (Windows only). |
| process.disk.operations | Sum | {operations} | Cumulative | TRUE | direction: Direction of flow (values: read, write) | Disk operations performed by the process. |
| process.disk.io | Sum | By | Cumulative | TRUE | direction: Direction of flow (values: read, write) | Disk bytes transferred. |
| process.cpu.utilization | Gauge | 1 | N/A | FALSE | state: Breakdown of CPU usage (values: system, user, wait) | Percentage of total CPU time used by the process since last scrape (0–1). |
| process.cpu.time | Sum | s | Cumulative | TRUE | state: Breakdown of CPU usage (values: system, user, wait) | Total CPU seconds broken down by states. |
| process.context_switches | Sum | {count} | Cumulative | TRUE | type: Type of context switch (values: Any Str) | Number of times the process has been context switched (Linux only). |
| system.memory.utilization | Gauge | 1 | N/A | FALSE | state: Breakdown of memory usage (values: buffered, cached, inactive, free, slab_reclaimable, slab_unreclaimable, used) | Percentage of memory bytes in use. |
| system.memory.usage | Sum | By | Cumulative | FALSE | state: Breakdown of memory usage (values: buffered, cached, inactive, free, slab_reclaimable, slab_unreclaimable, used) | Bytes of memory in use. |
| system.memory.page_size | Gauge | By | N/A | FALSE | (none) | System's configured page size. |
| system.memory.limit | Sum | By | Cumulative | FALSE | (none) | Total bytes of memory available. |
| system.linux.memory.dirty | Sum | By | Cumulative | FALSE | (none) | Amount of dirty memory (/proc/meminfo). |
| system.linux.memory.available | Sum | By | Cumulative | FALSE | (none) | Estimate of available memory (Linux only). |
| system.network.packets | Sum | {packets} | Cumulative | TRUE | device: Network interface name (values: Any Str); direction: Direction of flow (values: receive, transmit) | Number of packets transferred. |
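| system.network.io | Sum | By | Cumulative | TRUE | device: Network interface name (values: Any Str); direction: Direction of flow (values: receive, transmit) | Bytes transmitted and received. |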
| system.network.errors | Sum | {errors} | Cumulative | FALSE | device: Network interface name (values: Any Str); direction: Direction of flow (values: receive, transmit) | Number of errors encountered. |
| system.network.dropped | Sum | {packets} | Cumulative | TRUE | device: Network interface name (values: Any Str); direction: Direction of flow (values: receive, transmit) | Number of packets dropped. |
| system.network.conntrack.max | Sum | {entries} | Cumulative | FALSE | (none) | Limit for entries in conntrack table. |
| system.network.conntrack.count | Sum | {entries} | Cumulative | FALSE | (none) | Count of entries in conntrack table. |
| system.network.connections | Sum | {connections} | Cumulative | FALSE | protocol: Network protocol (values: tcp); state: Connection state (values: Any Str) | Number of connections. |
| system.uptime | Gauge | s | N/A | FALSE | (none) | Time the system has been running. |
| system.processes.created | Sum | {processes} | Cumulative | TRUE | (none) | Total number of created processes. |
| system.processes.count | Sum | {processes} | Cumulative | FALSE | status: Process status (values: blocked, daemon, detached, idle, locked, orphan, paging, running, sleeping, stopped, system, unknown, zombies) | Total number of processes in each state. |
| system.paging.utilization | Gauge | 1 | N/A | FALSE | device: Page file name (values: Any Str); state: Paging usage type (values: cached, free, used) | Swap (Unix) or pagefile (Windows) utilization. |
| system.paging.usage | Sum | By | Cumulative | FALSE | device: Page file name (values: Any Str); state: Paging usage type (values: cached, free, used) | Swap (Unix) or pagefile (Windows) usage. |
| system.paging.operations | Sum | {operations} | Cumulative | TRUE | direction: Page flow (values: page_in, page_out); type: Fault type (values: major, minor) | Paging operations. |
| system.paging.faults | Sum | {faults} | Cumulative | TRUE | type: Fault type (values: major, minor) | Number of page faults. |
| system.filesystem.utilization | Gauge | 1 | N/A | FALSE | device: Filesystem identifier; mode: Mount mode (values: ro, rw); mountpoint: Path; type: Filesystem type (values: ext4, tmpfs, etc.) | Fraction of filesystem bytes used. |
| system.filesystem.usage | Sum | By | Cumulative | FALSE | device: Filesystem identifier; mode: Mount mode; mountpoint: Path; type: Filesystem type; state: Usage type (values: free, reserved, used) | Filesystem bytes used. |
| system.filesystem.inodes.usage | Sum | {inodes} | Cumulative | FALSE | device: Filesystem identifier; mode: Mount mode; mountpoint: Path; type: Filesystem type; state: Usage type (values: free, reserved, used) | Filesystem inodes used. |
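
Many metrics in the table are cumulative, monotonic sums (for example, system.cpu.time and system.disk.io), so a single sample is rarely useful on its own; what matters is the change between two samples. The table describes system.cpu.utilization in exactly those terms: the difference in system.cpu.time since the last measurement, divided by the elapsed time. The following Python sketch illustrates that arithmetic. The function name `rate_per_second`, the sample values, and the counter-reset handling are assumptions made for illustration, not a description of how Azure Monitor or the OpenTelemetry Collector implements it.

```python
def rate_per_second(previous, current, elapsed_seconds):
    """Per-second rate of a cumulative, monotonic sum between two samples.

    Applied to system.cpu.time for one logical CPU and one state, the
    result is a fraction between 0 and 1, matching the description of
    system.cpu.utilization in the table above.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = current - previous
    if delta < 0:
        # Assumption for this sketch: a decrease means the counter
        # restarted from zero (for example, after a reboot), so only
        # the new value is counted.
        delta = current
    return delta / elapsed_seconds


if __name__ == "__main__":
    # Hypothetical samples taken 60 seconds apart (the default
    # sampling frequency mentioned in this article).
    print(rate_per_second(120.0, 135.0, 60.0))          # CPU seconds in the user state
    print(rate_per_second(1_000_000, 7_000_000, 60.0))  # system.disk.io bytes
```

The same calculation applies to any row marked Cumulative and TRUE in the Monotonic column. Gauges such as system.cpu.utilization already report a point-in-time value and need no such conversion.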