Top Kubernetes Health Metrics You Must Monitor

DevOps, Kubernetes, Monitoring, Observability
February 11, 2021

Kubernetes is one of the most popular choices for container management and automation today. A busy Kubernetes cluster generates an enormous number of metrics every day, which makes monitoring cluster health challenging. You might find yourself sifting through many different metrics without being sure which ones are the most insightful and deserve the most attention.

As daunting as this may seem, you can hit the ground running by knowing which metrics give you real insight into the health of your Kubernetes clusters. Observability platforms can help you collect and monitor these metrics, but knowing exactly which ones to watch will help you stay on top of your monitoring needs. In this article, we take you through the Kubernetes health metrics that top our list.

Crash Loops

A crash loop is the last thing you’d want to go undetected. During a crash loop, a pod starts, crashes, and restarts over and over, and your application breaks down as a result. Kubernetes reports this state as CrashLoopBackOff. Many different problems can lead to a crash loop, which makes the root cause tricky to identify. Being alerted as soon as a crash loop occurs helps you narrow down the list of causes quickly and take emergency measures to keep your application available.
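As a rough illustration of crash loop detection, the sketch below scans pod data shaped like the output of `kubectl get pods --all-namespaces -o json` and flags containers that are waiting with the reason CrashLoopBackOff or that have restarted many times. The function name `find_crash_looping`, the restart threshold of 5, and the sample pods are assumptions made for this example, not part of any Apica or Kubernetes tooling.

```python
def find_crash_looping(pod_list, restart_threshold=5):
    """Return (namespace, pod, container, restarts) for suspicious containers.

    A container is flagged when Kubernetes reports it as waiting with the
    reason CrashLoopBackOff, or when its restart count reaches the
    (hypothetical) threshold.
    """
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            restarts = cs.get("restartCount", 0)
            if waiting.get("reason") == "CrashLoopBackOff" or restarts >= restart_threshold:
                hits.append((meta.get("namespace"), meta.get("name"), cs.get("name"), restarts))
    return hits


# Made-up sample data in the shape of a Kubernetes PodList.
sample_pods = {
    "items": [
        {
            "metadata": {"namespace": "shop", "name": "cart-7d9f"},
            "status": {"containerStatuses": [
                {"name": "cart", "restartCount": 12,
                 "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            ]},
        },
        {
            "metadata": {"namespace": "shop", "name": "web-5c2a"},
            "status": {"containerStatuses": [
                {"name": "web", "restartCount": 0, "state": {"running": {}}},
            ]},
        },
    ]
}

for namespace, pod, container, restarts in find_crash_looping(sample_pods):
    print(f"{namespace}/{pod} container={container} restarts={restarts}")
```

To try it against a real cluster, you could pipe the `kubectl` output into the script and replace `sample_pods` with `json.load(sys.stdin)`.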

Cluster State Metrics

Another critical set of metrics to keep an eye on is your cluster state. You should be able to track aggregated resource usage across all the nodes in your cluster, along with node status and the number of desired, current, available, and unavailable pods. Monitoring cluster state gives you a topline view of your cluster’s overall health and keeps you apprised of issues with your nodes and pods. Based on the state metrics, you can decide whether you need to investigate a larger problem or scale your cluster.

Using these metrics, you can also evaluate the amount of resources your nodes are using. You’ll see how many nodes you have and how many of them are still available, which in turn tells you precisely what you’re paying for and whether you need to adjust the number and size of your nodes.
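One way to picture cluster state checks is to compare desired and available replicas per Deployment and to count how many nodes report Ready. The sketch below does this on data shaped like `kubectl get deployments -o json` and `kubectl get nodes -o json`. The helper names `deployment_gaps` and `node_readiness` and the sample objects are hypothetical and only serve this example.

```python
def deployment_gaps(deployment_list):
    """Map "namespace/name" to the number of missing available replicas."""
    gaps = {}
    for dep in deployment_list.get("items", []):
        desired = dep.get("spec", {}).get("replicas", 1)  # the API defaults replicas to 1
        available = dep.get("status", {}).get("availableReplicas", 0)
        if available < desired:
            meta = dep["metadata"]
            gaps[f'{meta["namespace"]}/{meta["name"]}'] = desired - available
    return gaps


def node_readiness(node_list):
    """Return (ready_nodes, total_nodes) based on each node's Ready condition."""
    nodes = node_list.get("items", [])
    ready = sum(
        1
        for node in nodes
        for cond in node.get("status", {}).get("conditions", [])
        if cond.get("type") == "Ready" and cond.get("status") == "True"
    )
    return ready, len(nodes)


# Made-up sample data in the shape of a DeploymentList and a NodeList.
sample_deployments = {
    "items": [
        {"metadata": {"namespace": "shop", "name": "api"},
         "spec": {"replicas": 3}, "status": {"availableReplicas": 1}},
        {"metadata": {"namespace": "shop", "name": "web"},
         "spec": {"replicas": 2}, "status": {"availableReplicas": 2}},
    ]
}
sample_nodes = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
        {"metadata": {"name": "node-c"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
}

print(deployment_gaps(sample_deployments))
print(node_readiness(sample_nodes))
```

A gap that persists over several checks is usually a better alert signal than a single snapshot, since replicas are briefly unavailable during normal rollouts.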

Disk and Memory Pressure

Disk pressure is a node condition that indicates whether your nodes are running low on disk space or consuming it too quickly, based on the thresholds you’ve set in your configuration. Monitoring it enables you to determine when you need to add disk space. It could also indicate that your application isn’t functioning as designed and is using more disk space than it should.

Memory pressure is a node condition that indicates a node is running low on available memory. Monitoring it, together with memory usage, helps you keep nodes from running out of memory and helps you spot nodes with over-allocated memory that unnecessarily increase your infrastructure spend. Persistent memory pressure can also be a sign that your applications are leaking memory.
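Kubernetes exposes disk and memory pressure as node conditions named DiskPressure and MemoryPressure, each with a status of "True", "False", or "Unknown". The following sketch, assuming input shaped like `kubectl get nodes -o json`, lists the nodes on which either condition is currently active. The function name `nodes_under_pressure` and the sample nodes are made up for illustration.

```python
PRESSURE_TYPES = ("DiskPressure", "MemoryPressure")


def nodes_under_pressure(node_list):
    """Map node name to the list of pressure conditions whose status is "True"."""
    result = {}
    for node in node_list.get("items", []):
        active = [
            cond["type"]
            for cond in node.get("status", {}).get("conditions", [])
            if cond.get("type") in PRESSURE_TYPES and cond.get("status") == "True"
        ]
        if active:
            result[node["metadata"]["name"]] = active
    return result


# Made-up sample data in the shape of a Kubernetes NodeList.
sample_nodes = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [
             {"type": "MemoryPressure", "status": "False"},
             {"type": "DiskPressure", "status": "True"},
             {"type": "Ready", "status": "True"},
         ]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [
             {"type": "MemoryPressure", "status": "False"},
             {"type": "DiskPressure", "status": "False"},
             {"type": "Ready", "status": "True"},
         ]}},
    ]
}

print(nodes_under_pressure(sample_nodes))
```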

Network Unavailable

You’ll want to know immediately when something is wrong with your network. After all, your nodes and applications need network connectivity to function. The NetworkUnavailable node condition lets you know when issues are hampering the network connectivity of your nodes. These issues could be the result of improper network configuration or a physical connection problem with your hardware.
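A minimal sketch of this check, assuming node data shaped like `kubectl get nodes -o json`: it returns every node whose NetworkUnavailable condition has the status "True", along with the message the condition carries. Nodes that do not report the condition at all are treated as healthy here, which is a simplification, since whether the condition is set depends on the cluster's network setup. The function name `nodes_without_network` and the sample data are hypothetical.

```python
def nodes_without_network(node_list):
    """Return (node_name, message) for nodes reporting NetworkUnavailable=True."""
    affected = []
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == "NetworkUnavailable" and cond.get("status") == "True":
                affected.append((node["metadata"]["name"], cond.get("message", "")))
    return affected


# Made-up sample data in the shape of a Kubernetes NodeList.
sample_nodes = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [
             {"type": "NetworkUnavailable", "status": "False",
              "message": "network is configured"},
         ]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [
             {"type": "NetworkUnavailable", "status": "True",
              "message": "no route created"},
         ]}},
        # This node does not report the condition at all.
        {"metadata": {"name": "node-c"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    ]
}

for name, message in nodes_without_network(sample_nodes):
    print(f"{name}: {message}")
```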

CPU Utilization

Knowing how many CPU cycles your nodes use is vital to ensure that they use their allocated CPU resources efficiently. If your applications or nodes use up all of their allocated processing resources, you’ll have to increase your CPU allocation or add nodes to your cluster. If your nodes or applications are using fewer CPU cycles than you’re paying for, re-evaluate the CPU allocation and scale down if necessary. Monitoring CPU utilization helps you stay on top of such scenarios and run your deployments more efficiently.
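To make the comparison between used and allocated CPU concrete, here is a small sketch that converts Kubernetes CPU quantities to millicores and computes a utilization ratio. Kubernetes writes CPU as whole cores ("2"), millicores ("250m"), or, in metrics output, nanocores ("1500000000n"). The helper names and the 20% and 80% thresholds are assumptions chosen for illustration, so tune them to your own workloads.

```python
def cpu_to_millicores(quantity):
    """Convert a Kubernetes CPU quantity string to millicores."""
    q = str(quantity)
    if q.endswith("n"):  # nanocores, as reported by the metrics API
        return int(q[:-1]) / 1_000_000
    if q.endswith("u"):  # microcores
        return int(q[:-1]) / 1_000
    if q.endswith("m"):  # millicores
        return float(q[:-1])
    return float(q) * 1000  # whole or fractional cores


def cpu_utilization(usage, allocatable):
    """Return usage as a fraction of the allocatable CPU."""
    return cpu_to_millicores(usage) / cpu_to_millicores(allocatable)


def classify(utilization, low=0.20, high=0.80):
    """Label a utilization ratio using hypothetical thresholds."""
    if utilization >= high:
        return "saturated"
    if utilization <= low:
        return "underused"
    return "ok"


# Example: a node with 2 allocatable cores currently using 1.5 cores.
ratio = cpu_utilization("1500000000n", "2")
print(f"{ratio:.0%} -> {classify(ratio)}")
```

On a cluster with a metrics server installed, per-node usage is typically available from the `metrics.k8s.io` API, and the allocatable CPU from each node's `status.allocatable.cpu` field.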

Job Failures

Kubernetes Jobs are controllers that run pods until a specified number of them complete successfully, and stop once the work is done. There are times when Jobs don’t complete successfully, whether because nodes reboot, pods go into crash loops, or resources are exhausted. Either way, you’ll want to know about job failures as soon as they occur.

Job failures don’t necessarily mean that your application is inaccessible, but ignoring them could lead to more significant issues for your deployments down the line. Monitoring job failures closely helps you recover in time and avoid the same issues in the future.
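A Job's status records how many of its pods have failed (`status.failed`) and, once the Job gives up, a condition of type Failed. The sketch below, which assumes input shaped like `kubectl get jobs --all-namespaces -o json`, separates Jobs that are still retrying from Jobs that have failed for good. The function name `failed_jobs` and the sample Jobs are hypothetical.

```python
def failed_jobs(job_list):
    """Return (namespace/name, failed_pod_count, gave_up) for Jobs with failures.

    gave_up is True when the Job carries a Failed condition with status "True",
    meaning Kubernetes has stopped retrying it.
    """
    report = []
    for job in job_list.get("items", []):
        status = job.get("status", {})
        failed_pods = status.get("failed", 0)
        gave_up = any(
            cond.get("type") == "Failed" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if failed_pods > 0 or gave_up:
            meta = job["metadata"]
            report.append((f'{meta["namespace"]}/{meta["name"]}', failed_pods, gave_up))
    return report


# Made-up sample data in the shape of a Kubernetes JobList.
sample_jobs = {
    "items": [
        {"metadata": {"namespace": "batch", "name": "nightly-report"},
         "status": {"failed": 6,
                    "conditions": [{"type": "Failed", "status": "True"}]}},
        {"metadata": {"namespace": "batch", "name": "db-backup"},
         "status": {"succeeded": 1,
                    "conditions": [{"type": "Complete", "status": "True"}]}},
        {"metadata": {"namespace": "batch", "name": "cache-warm"},
         "status": {"active": 1, "failed": 1}},
    ]
}

for name, failed_pods, gave_up in failed_jobs(sample_jobs):
    state = "failed permanently" if gave_up else "still retrying"
    print(f"{name}: {failed_pods} failed pod(s), {state}")
```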

DaemonSets

DaemonSets ensure that all nodes in your Kubernetes cluster (or a selected subset of them) run a copy of a specific pod. DaemonSets are especially useful when you’d like to run a monitoring service pod on all your existing nodes and on any new nodes added to your cluster.

Monitoring DaemonSets helps you understand the health of your clusters. Ideally, the number of DaemonSet pods that are ready in a cluster should match the number of DaemonSet pods desired. If these numbers aren’t identical, at least one of your DaemonSet pods has likely failed.
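The desired-versus-ready comparison can be sketched using two fields from a DaemonSet's status, `desiredNumberScheduled` and `numberReady`. The example assumes input shaped like `kubectl get daemonsets --all-namespaces -o json`. The function name `daemonset_gaps` and the sample DaemonSets are made up for illustration.

```python
def daemonset_gaps(daemonset_list):
    """Map "namespace/name" to the number of desired pods that are not ready."""
    gaps = {}
    for ds in daemonset_list.get("items", []):
        status = ds.get("status", {})
        desired = status.get("desiredNumberScheduled", 0)
        ready = status.get("numberReady", 0)
        if ready < desired:
            meta = ds["metadata"]
            gaps[f'{meta["namespace"]}/{meta["name"]}'] = desired - ready
    return gaps


# Made-up sample data in the shape of a Kubernetes DaemonSetList.
sample_daemonsets = {
    "items": [
        {"metadata": {"namespace": "monitoring", "name": "node-exporter"},
         "status": {"desiredNumberScheduled": 5, "numberReady": 5}},
        {"metadata": {"namespace": "monitoring", "name": "log-agent"},
         "status": {"desiredNumberScheduled": 5, "numberReady": 3}},
    ]
}

for name, missing in daemonset_gaps(sample_daemonsets).items():
    print(f"{name}: {missing} pod(s) not ready")
```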

Monitoring Kubernetes Health Metrics

Staying on top of all Kubernetes health metrics is crucial to ensure early detection, prevention, and timely diagnosis of issues that can bring down your clusters. Arming yourself with the right monitoring strategy, knowledge of which Kubernetes health metrics to focus on, and the right set of monitoring tools is the best way to ensure that your production environment is always up and running.

We at Apica have built a monitoring tool that helps you monitor Kubernetes clusters of all sizes, ensures that nothing goes undetected, and keeps costs to a minimum while providing deep observability for Kubernetes. Talk to us about your Kubernetes infrastructure and what you’re looking to monitor. We can get you set up in under five minutes and walk you through how Apica can be a key pillar for your monitoring needs.


Madhu Bilagi
