As companies embrace digital transformation, IT environments grow more complex. Hybrid infrastructure, microservices, containerization, and multi-cloud deployments create distributed systems that are dynamic and hard to monitor.
This definitive guide aims to equip engineers, site reliability teams, and technical leaders with best practices for monitoring these modern environments with the open-source platform Checkmk.
What is Checkmk and Why It Matters
Created by German DevOps company tribe29, Checkmk builds on top of the popular Nagios monitoring engine to create an enterprise-ready, scalable platform. Some key capabilities:
- Automatic discovery of assets and granular services
- Custom dashboards with rich visualizations
- Flexible alerting rules and predictive anomaly detection
- Over 2,400 plugins to monitor all infrastructure and apps
- Scales to monitor 10,000s of assets and metrics
Checkmk creates unified visibility across the entire IT stack.
Industry leaders like American Express, UPS, Uber, Univision and more trust Checkmk to monitor their business-critical infrastructure.
"We needed very flexible, customizable dashboards and comprehensive monitoring of our complete hybrid infrastructure, both inside and outside our own data centers. Checkmk gives us all of these capabilities together in one efficient platform." – Michael Renz, Head of Managed Infrastructure Services, Volkswagen Financial Services
So why does infrastructure monitoring matter? Here are 3 core reasons:
- Reduce MTTR (mean time to resolution) when issues occur through fast, accurate alerts
- Optimize costs by right-sizing overprovisioned capacity based on actual usage
- Improve SLAs and uptime by establishing baselines and measuring against KPIs
Now let's dive into unlocking these benefits and more by implementing Checkmk for robust monitoring.
Installation and Configuration
Checkmk offers flexible deployment models to meet your architecture needs, from on-prem VM images to Docker containers to cloud marketplace templates. This guide covers installing Checkmk via Docker.
Prerequisites
To follow along, you'll need:
- Linux host (tested on Ubuntu 20.04)
- Docker Engine installed
- Dedicated storage volume
# Create Docker volume
sudo docker volume create checkmk_data
Run Checkmk Container
Pull the official Checkmk Raw Edition image and run:
sudo docker run -d \
--name checkmk \
-p 8080:5000 \
--mount source=checkmk_data,target=/omd/sites \
--restart unless-stopped \
checkmk/check-mk-raw:2.0.0-latest
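This launches a Checkmk container and publishes the web UI on host port 8080, which maps to port 5000 inside the container. Agents do not use this port: by default the Checkmk server polls the agent on TCP port 6556 of each monitored host. Container data persists on the mounted volume, so it survives upgrades.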
Initial setup takes 1-2 minutes. Get your login credentials by viewing container logs:
sudo docker logs checkmk
Now access the web UI at http://server:8080/cmk and log in as cmkadmin with the auto-generated password provided.
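If you want to script this step, the password can be filtered out of the log output. The following is a minimal sketch, assuming the log contains a line ending in "with password: <value>"; that wording and the helper name `extract_cmk_password` are assumptions and may differ between Checkmk versions.

```shell
#!/usr/bin/env bash
# Sketch: pull the initial cmkadmin password out of the container log.
# Assumption: the log contains a line like "... with password: <value>".

# Reads log text on stdin and prints the first password found (if any).
extract_cmk_password() {
    sed -n 's/.*with password: \([^[:space:]]*\).*/\1/p' | head -n 1
}

# Usage (requires the running container from the previous step):
#   sudo docker logs checkmk 2>&1 | extract_cmk_password
```

If nothing is printed, inspect the full log manually; the message format is not guaranteed.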
Initial Configuration
Upon your first login, navigate to Setup > Users to create additional user accounts per best practices:
- Read-only users to view dashboards and alerts
- Monitoring admins to configure checks and alerts
- Site admins for full control of all Checkmk settings
Also configure each user's profile settings, such as password, time zone, and notification preferences.
With that foundation established, let's start monitoring infrastructure!
Automatically Discover and Monitor Assets
A key advantage of Checkmk is automatic discovery of assets – both network devices and custom applications. This simplifies setup and provides flexibility as environments change.
Let's walk through adding systems for monitoring.
Monitoring Network Devices
First we'll add networking gear like switches, routers, and firewalls.
In Setup > Hosts, open the properties of a folder and enable the network scan. Here you define the IP ranges to probe on your target subnets.
Checkmk then periodically pings those ranges and adds any responsive devices to the folder as new hosts. For SNMP devices, assign the matching SNMP credentials to the folder or host so that their services can be discovered.
You can also manually Add Host if you know the hostname/IP and access details.
Now configure monitoring checks for discovered infrastructure:
- Navigate to each host and enable core checks like interfaces, traffic, CPU, memory, disks, etc.
- Set up role-based access so network engineers only see their gear
- Configure alert thresholds based on utilization and error rates
With hosts added and checks activated, Checkmk will pull availability data and performance metrics to provide visibility!
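Adding hosts can also be scripted. The sketch below builds a request for the Checkmk REST API that ships with version 2.0. The server name, site name `cmk`, automation user `automation`, and the `CMK_SECRET` variable are placeholders, and the endpoint path reflects my understanding of the API; verify it against the interactive REST API documentation in your own site.

```shell
#!/usr/bin/env bash
# Sketch: create a host through the Checkmk REST API.
# Assumptions: site "cmk" on http://server:8080, an automation user named
# "automation" whose secret is in CMK_SECRET. All of these are placeholders.
API_URL="http://server:8080/cmk/check_mk/api/1.0"

# Build the JSON body for a new host in the main folder.
build_host_payload() {
    local name="$1" ip="$2"
    printf '{"host_name": "%s", "folder": "/", "attributes": {"ipaddress": "%s"}}\n' "$name" "$ip"
}

# Send the request (not called here; needs a reachable Checkmk site).
create_host() {
    curl -s -X POST "$API_URL/domain-types/host_config/collections/all" \
        -H "Authorization: Bearer automation $CMK_SECRET" \
        -H "Content-Type: application/json" \
        -H "Accept: application/json" \
        -d "$(build_host_payload "$1" "$2")"
}

# Usage: CMK_SECRET=... create_host switch01 192.0.2.10
```

Hosts created this way still need a service discovery and an activation of changes before they show up in monitoring.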
Monitoring Applications
In addition to infrastructure, monitor availability of any application service by installing the Checkmk agent.
Agents are available for Linux, Windows, and other platforms, and Checkmk also covers Docker and Kubernetes clusters. Depending on the target, data is collected in different ways:
- Checkmk agent (TCP port 6556 by default) for Linux/Unix and Windows systems
- SNMP for networking devices
- SSH as an optional secured transport for Linux/Unix agents
- REST APIs for cloud platforms
This provides a unified view across your entire infrastructure.
The agent will report host resources like CPU, memory, and disks out of the box. You can define custom service checks to monitor anything:
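#!/bin/bash
# Local check: homepage response time (place in the agent's local/ directory)
response_time=$(curl -s -o /dev/null -w "%{time_total}" example.com)
# Output format: <state> "<service name>" <metric>=<value> <summary text>
echo "0 \"Homepage response time\" response_time=$response_time Response time: $response_time seconds"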
This simple plugin checks webpage load time. Get creative with scripts to validate certificates, tail logs, query APIs, pull business KPIs, and more.
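A check becomes more useful once it reports a state instead of always succeeding. The sketch below extends the idea with warning and critical thresholds and prints a line in the local check format (state, service name, metric, summary). The threshold values and function names are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch of a Checkmk local check with thresholds.
# WARN_SECS and CRIT_SECS are illustrative values, not recommendations.
WARN_SECS=1.0
CRIT_SECS=3.0

# Map a response time in seconds to a state: 0=OK, 1=WARN, 2=CRIT.
state_for() {
    awk -v t="$1" -v w="$WARN_SECS" -v c="$CRIT_SECS" \
        'BEGIN { if (t >= c) print 2; else if (t >= w) print 1; else print 0 }'
}

# Print one result line: <state> "<service name>" <metric>=<value> <summary>
report() {
    local t="$1"
    echo "$(state_for "$t") \"Homepage response time\" response_time=$t Response time: $t seconds"
}

# Usage: report "$(curl -s -o /dev/null -w '%{time_total}' example.com)"
```

Keeping the threshold logic in its own function makes it easy to test without network access.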
Now let's explore managing alerting and dashboards.
Defining Alert Rules and Integrating Machine Learning
Infrastructure monitoring is of limited use without robust alerting. Checkmk provides flexible options to notify teams when it detects adverse conditions or service degradation.
Configuring Notifications
Start by creating notification channels as rules under Setup > Events > Notifications, using notification methods such as:
- Slack
- PagerDuty
- Webhooks
- Microsoft Teams
Channels like PagerDuty can trigger automatic escalations and 24/7 incident response workflows.
Consider creating different channels per team, environment tier, time of day/week, etc. This segmentation allows precise alert targeting.
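For a channel that has no built-in method, Checkmk can call a custom notification script and hands it the alert context in `NOTIFY_*` environment variables. The following is a minimal sketch of such a script; the message layout and the idea of passing a webhook URL as the first script parameter are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a custom notification script. Checkmk passes the alert context
# to notification scripts in NOTIFY_* environment variables.
# The message layout below is an illustrative choice.

format_message() {
    if [ "${NOTIFY_WHAT:-}" = "SERVICE" ]; then
        echo "[${NOTIFY_SERVICESTATE:-UNKNOWN}] ${NOTIFY_HOSTNAME:-?}/${NOTIFY_SERVICEDESC:-?}: ${NOTIFY_SERVICEOUTPUT:-}"
    else
        echo "[${NOTIFY_HOSTSTATE:-UNKNOWN}] ${NOTIFY_HOSTNAME:-?}: ${NOTIFY_HOSTOUTPUT:-}"
    fi
}

# Usage: post to a hypothetical webhook given as the first script parameter
#   curl -s -X POST -d "$(format_message)" "$NOTIFY_PARAMETER_1"
```

The script distinguishes host and service notifications because they carry different state and output variables.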
Defining Alert Rules
With notifications configured, define some alerting rules based on conditions like:
- Services or hosts changing state
- Breaching metrics thresholds
- Repeated recoveries or fluctuations
- Missing monitoring data
Best practice is to notify early on degradation, not just on complete failure. This leaves time to investigate and prevent a full outage.
Checkmk offers several advanced alerting capabilities:
Anomaly Detection with Machine Learning
Identify statistical anomalies in performance metrics through machine learning algorithms. This achieves predictive monitoring by detecting unusual metric deviations and changes in seasonal patterns.
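To make the idea of a statistical anomaly concrete, here is a small sketch that flags a value lying more than k standard deviations from the mean of a history window. It only illustrates the principle and is not Checkmk's algorithm; the function name and argument order are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: flag a value that deviates more than K standard deviations from
# the mean of a history window. Illustration only, not Checkmk's algorithm.

# Usage: is_anomaly <value> <k> <history values...>
# Prints "anomaly" or "normal".
is_anomaly() {
    [ "$#" -ge 3 ] || { echo "usage: is_anomaly <value> <k> <history...>" >&2; return 2; }
    local value="$1" k="$2"
    shift 2
    printf '%s\n' "$@" | awk -v v="$value" -v k="$k" '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0          # guard against rounding error
            sd = sqrt(var)
            diff = v - mean
            if (diff < 0) diff = -diff
            if (diff > k * sd) print "anomaly"; else print "normal"
        }'
}

# Usage: is_anomaly 95 3 40 42 41 39 43
```

A real predictive setup would also account for time of day and weekday patterns, which this sketch ignores.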
Log Monitoring Alerts
Ingest logs from devices and applications to trigger alerts on specific patterns using regex, without needing to define exact thresholds.
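As a rough illustration of pattern-based log alerting, the sketch below counts lines that match an extended regex and prints a local-check style result. The service name, the rule "any match is critical", and the function name are assumptions; unlike a real log monitoring setup, it rescans the whole file on every run.

```shell
#!/usr/bin/env bash
# Sketch: count log lines matching a regex and emit a local-check style line.
# Service name and the "any match is CRIT" rule are illustrative assumptions.

# Usage: log_pattern_check <logfile> <extended-regex>
log_pattern_check() {
    local file="$1" pattern="$2" count state=0
    # grep -c exits non-zero when nothing matches, so ignore its exit code.
    count=$(grep -E -c -- "$pattern" "$file" 2>/dev/null || true)
    count=${count:-0}
    if [ "$count" -gt 0 ]; then
        state=2
    fi
    echo "$state \"Log pattern\" matches=$count $count line(s) match /$pattern/"
}

# Usage: log_pattern_check /var/log/syslog 'ERROR|FATAL'
```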
Message Packs for Alert Storms
Group escalating alerts during major outages into a single notification, suppressing the flood of individual alerts. This prevents overwhelming responders.
Leveraging these smart notification behaviors allows faster assessment and coordination during crisis events.
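The grouping idea can be sketched in a few lines: collapse a burst of alerts into one summary line per host. The input format `host;service;state` and the function name are assumptions made for illustration; this is not how Checkmk implements its bundling internally.

```shell
#!/usr/bin/env bash
# Sketch: collapse a burst of alerts into one summary line per host.
# The input format "host;service;state" is an assumption for illustration.

# Reads alerts on stdin, prints "<host>: <n> alert(s) [<services>]" per host.
bundle_alerts() {
    awk -F';' '
        {
            count[$1]++
            list[$1] = (list[$1] == "" ? $2 : list[$1] ", " $2)
        }
        END {
            for (h in count)
                printf "%s: %d alert(s) [%s]\n", h, count[h], list[h]
        }
    ' | sort
}

# Usage: printf '%s\n' "db01;CPU load;CRIT" "db01;Disk /;WARN" | bundle_alerts
```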
Building Custom Dashboards and Visualizations
While Checkmk offers default dashboards, custom views are key to focus monitoring on organizational goals. Dashboards provide specialized one-stop visibility for each stakeholder group.
Creating Views
Navigate to Monitoring > Dashboards and click Create Dashboard. Give it a name and select desired widgets:
- Metrics/graphs
- Topology maps
- Tables/lists
- Custom URLs
Checkmk renders graphs with its own built-in graphing system, so polished graphs are available without a separate Grafana deployment. If you already run Grafana, the Checkmk data source plugin for Grafana lets it query Checkmk metrics.
Example dashboard types:
- Network operations – Maps, device health lists
- Application monitoring – Performance graphs, uptime stats
- Executive – Business KPIs, cloud utilization
- Geographic – Maps to display infrastructure status by region
- 3rd party tools – Embed links via custom URLs
Best practice is to design dashboards for specific personas like developers, DevOps, security teams, executives, etc. This establishes information radiators aligned to each group's focus area and concerns.
Sharing and Securing Dashboards
Under Dashboard Settings you control:
- Visibility – Public, private or restricted per user group
- Interactivity – Choose drill-down ability
- Auto refresh – Set interval to update contents
Take advantage of visibility rules and [dashboard user roles](https://checkmk.com/cms_dashboard_permissions.html) for precise access control. This ensures separation for security compliance.
Now let's explore expansion options…
Expanding Monitoring Scope
A major benefit of Checkmk is its more than 2,400 plugins for monitoring every layer of infrastructure and popular platforms, plus custom metrics not covered out of the box.
Let's discuss options to enrich monitoring visibility:
Integrating External Data Sources
Checkmk can ingest time-series data from specialized monitoring systems into its metrics database.
For example, application metrics collected by APM tools like Datadog or Dynatrace stay in those tools' own datastores and can then be pulled into Checkmk as another dimension to correlate against infrastructure data.
This delivers multidimensional monitoring visibility across apps, network, security tools, etc. providing holistic context when investigating issues.
You get enriched infrastructure monitoring combined with deep application telemetry in a single pane of glass!
Author Custom Plugins
Checkmk makes extending monitoring easy: you can write custom scripts in any language, such as Python, PowerShell, or Bash.
Example plugin use cases:
- Pull business KPIs
- Check app log for patterns
- Validate certificate expirations
- Monitor queue depth
- Fetch data from internal APIs
- Integrate with service desk
The plugins architecture provides endless possibilities to ingest metrics not covered out of the box. Before coding your own, search the expansive library of existing plugins at Checkmk Exchange. You can likely find one handling that MySQL, Docker or SSL check!
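As one example from the use cases above, a certificate expiry check can be sketched as a local check. The thresholds, service name, and function names are illustrative assumptions, and the date arithmetic assumes GNU `date`.

```shell
#!/usr/bin/env bash
# Sketch of a custom local check for certificate expiry.
# WARN_DAYS and CRIT_DAYS are illustrative values.
WARN_DAYS=30
CRIT_DAYS=7

# Print a local-check line for a certificate with <days> remaining.
cert_check_line() {
    local host="$1" days="$2" state=0
    if [ "$days" -le "$CRIT_DAYS" ]; then
        state=2
    elif [ "$days" -le "$WARN_DAYS" ]; then
        state=1
    fi
    echo "$state \"Cert $host\" days_left=$days Certificate expires in $days days"
}

# Fetch the expiry date with openssl and convert it to days (GNU date assumed).
days_left() {
    local host="$1" end
    end=$(echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2)
    echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Usage: cert_check_line example.com "$(days_left example.com)"
```

Separating the network lookup from the state logic keeps the thresholds easy to verify on their own.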
Securing Checkmk Deployments
As Checkmk provides central visibility across potentially sensitive infrastructure, it's critical that the platform itself is secured against compromise or unauthorized access.
Here are 5 best practices to secure Checkmk:
Authenticated Access
- Require user authentication via SSO or credentials
- Enforce complex passwords and multi-factor authentication
- Create user groups that follow the principle of least privilege
Encryption
- Enable transport layer security (HTTPS/TLS)
- Encrypt stored credentials and custom variable values
Network Controls
- Place Checkmk server in DMZ with firewall rules
- Open only the ports you need: HTTPS to the web UI, and TCP port 6556 from the Checkmk server to the agents on monitored hosts
Monitoring and Alerting
- Checkmk monitors itself – track performance and availability
- Create alerts for warning signs like:
  - Unauthorized login attempts
  - Configuration changes
  - Disk space usage
  - System resource usage
Regular Review
- Audit user accounts and roles quarterly
- Pen-test to validate controls and find gaps
Following security best practices ensures Checkmk provides infrastructure visibility without introducing risk.
Comparing Checkmk to Other Monitoring Tools
With so many infrastructure monitoring platforms now available, such as Zabbix, Datadog, Nagios, and Prometheus, let's compare how Checkmk stacks up against key alternatives:
Feature | Checkmk | Zabbix | Datadog | Nagios | Prometheus |
---|---|---|---|---|---|
Open source | Yes | Yes | No | Yes | Yes |
Auto discovery | Yes | Partial | Partial | No | No |
Dashboards | Custom + Grafana | Custom | Custom | Limited | Grafana |
Alerting | Robust | Decent | Custom | Rigid | Limited |
Plugins | 2,400+ | Fewer | Tons | Many | Some |
Scalability | 10,000s nodes | 100,000s | Any size | 1,000s | 10,000s |
Cloud support | Growing | Minimal | Excellent | Minimal | Kubernetes focus |
Container focus | Mid | Low | High | Low | Mid |
While open source options like Zabbix offer breadth, they lack polish and deep capabilities found in commercial tools like Datadog. Checkmk delivers a leading open source platform balanced by an enterprise-ready commercial offering at a fair TCO.
The auto discovery, custom dashboards, flexible alerting rules, and breadth of plugins make Checkmk very compelling as a consolidated monitoring system.
Closing Recommendations
With over 90% of Checkmk costs incurred in labor, monitoring efficiency is critical not just for visibility but also for making the most of limited engineering resources.
Hopefully this guide has equipped you to get a robust Checkmk implementation in place leveraging:
- Automatic discovery to reduce manual configuration
- Custom dashboards providing focus for each team
- Intelligent alerting rules detecting anomalies
- Grafana integrations for beautiful visualizations from application to infrastructure
- Monitoring as code to decrease configuration drift
We highly recommend deploying Checkmk Raw Edition for unified infrastructure visibility supporting optimized deployments all the way to enterprise scale.
Checkmk delivers an exceptional open source monitoring platform, complemented by commercial features, a stellar UI, and integrations delivering comprehensive observability across modern IT environments.
What aspects of Checkmk are you looking to leverage in your stack? How can a monitoring platform provide more value within your systems? We welcome you to join hundreds of organizations relying on Checkmk for peace of mind ensuring their critical business services deliver nonstop.