Two-Layer Monitoring: What Your Provider Covers vs What You Cover
When you deploy a VPS on MassiveGRID, there are two monitoring layers. Understanding who watches what prevents both blind spots and duplicated effort.
What MassiveGRID Monitors (Infrastructure Layer)
- Physical hardware — CPU temperature, disk health (SMART), memory ECC errors, power supply status
- Network infrastructure — switch health, BGP sessions, link utilization, packet loss between nodes
- Storage cluster — Ceph cluster health, OSD status, replication state, storage capacity
- DDoS mitigation — 12 Tbps scrubbing capacity, automatic detection and filtering
- Hypervisor health — Proxmox node status, HA cluster state, automatic VM failover
If a physical disk fails, a network link goes down, or a DDoS attack hits your IP, MassiveGRID detects and responds automatically. You do not need to monitor any of this.
What You Monitor (Application Layer)
- Application health — is your website responding? Is your API returning correct data?
- Service status — is Nginx running? Is MySQL accepting connections? Is PHP-FPM healthy?
- Resource usage — CPU utilization, RAM consumption, disk space, I/O throughput
- Logs — error rates, authentication failures, suspicious activity
- SSL certificates — expiration dates, renewal status
The gap between these two layers is where most outages hide. Your hardware is fine, your network is fine, but MySQL crashed because it ran out of memory. Only application-layer monitoring catches this.
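A quick way to confirm that specific failure mode after the fact: the kernel logs every OOM kill, so you can search for it directly. This is a minimal sketch; the journalctl path assumes a systemd journal (standard on Ubuntu 24.04).

```shell
# Check the kernel log for OOM-killer activity in the last 24 hours.
# On systems without a systemd journal, grep /var/log/kern.log instead.
journalctl -k --since "24 hours ago" 2>/dev/null | grep -i "out of memory" \
  || echo "no OOM events found in the last 24 hours"
```

If this prints a "Killed process" line naming mysqld, you have found the application-layer failure that infrastructure monitoring could never see.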
Installing Netdata: Real-Time System Monitoring
Netdata is an excellent real-time monitoring tool for a single VPS. It installs with one command, auto-detects your services, and has minimal performance overhead (typically under 2% CPU).
# One-line installation on Ubuntu 24.04
curl -fsSL https://get.netdata.cloud/kickstart.sh > /tmp/netdata-kickstart.sh && \
sh /tmp/netdata-kickstart.sh --stable-channel
# Verify it is running
systemctl status netdata
# Access the dashboard
# http://your-server-ip:19999
Securing Netdata Access
By default, Netdata listens on all interfaces on port 19999. Do not leave this open to the internet:
# Option 1: Restrict to localhost only (access via SSH tunnel)
# Edit /etc/netdata/netdata.conf
[web]
bind to = 127.0.0.1
# Then access via SSH tunnel:
# ssh -L 19999:localhost:19999 user@your-server-ip
# Open http://localhost:19999 in your browser
# Option 2: Restrict with UFW firewall (UFW evaluates rules in order,
# so the allow rule must be added before the deny)
ufw allow from YOUR_IP to any port 19999
ufw deny 19999
# Restart Netdata after configuration changes
systemctl restart netdata
Netdata Dashboard Overview
Once installed, the Netdata dashboard automatically shows:
- System Overview — CPU usage per core, total RAM/swap usage, load average, uptime
- Disk I/O — read/write throughput, IOPS, latency, disk space per mount
- Network — bandwidth in/out per interface, packet rates, errors and drops
- Applications — per-process CPU/RAM/disk/network usage (groups by application)
- Services — auto-detected MySQL, Nginx, PHP-FPM, Redis, and more with dedicated dashboards
Netdata retains detailed metrics for hours and summarized data for days, all stored locally with minimal disk usage.
Key Metrics and What They Mean
Monitoring is useless if you do not know what the numbers mean. Here are the metrics that matter and when to worry:
| Metric | Normal Range | Warning | Critical | What It Means |
|---|---|---|---|---|
| CPU usage | 10–50% | 70–80% | 90%+ | Processing capacity consumed |
| Load average | < vCPU count | 1.5x vCPUs | 2x+ vCPUs | Processes running or waiting (on Linux, includes uninterruptible I/O) |
| RAM usage | 40–70% | 80–90% | 95%+ | Memory consumed (includes cache) |
| Swap usage | 0 MB | Any usage | 50%+ of swap | System ran out of RAM |
| Disk space | < 60% | 70–85% | 90%+ | Available storage remaining |
| Disk I/O wait | < 5% | 10–20% | 30%+ | CPU waiting on slow storage |
| Network errors | 0 | Any | Sustained errors | Network problems or misconfig |
Important note about RAM: Linux uses free RAM for disk caching. A server showing 80% RAM usage might actually have plenty of available memory because the cache can be freed instantly. Use free -h and look at the "available" column, not "used."
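To turn that into a number you can script against, the "available" column can be extracted with awk. Column positions assume the standard procps free output (total, used, free, shared, buff/cache, available):

```shell
# Print truly available memory in MB: the 7th column of the "Mem:" row.
free -m | awk '/^Mem:/ {print $7 " MB available"}'
```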
Setting Up Uptime Kuma: External Service Monitoring
Netdata tells you what is happening inside your server. Uptime Kuma tells you what your users experience from the outside. This distinction matters—your server might show normal CPU/RAM but your website could still be returning 500 errors.
Installing Uptime Kuma with Docker
# Install Docker if not already present
curl -fsSL https://get.docker.com | sh
systemctl enable docker
# Run Uptime Kuma
docker run -d \
--name uptime-kuma \
--restart=always \
-p 3001:3001 \
-v uptime-kuma:/app/data \
louislam/uptime-kuma:1
# Access at http://your-server-ip:3001
# Create your admin account on first access
Pro tip: Ideally, run Uptime Kuma on a different server than the one you are monitoring. If the monitored server goes down and Uptime Kuma is on the same server, you will not get an alert. A separate small Cloud VPS dedicated to monitoring is a good investment.
Monitors to Configure
Set up these monitors for a typical web server:
- HTTP(S) monitor — check your website URL, expect a 200 response, verify the response contains expected text (keyword monitoring)
- TCP monitor — check that MySQL (port 3306), Redis (port 6379), and other services are accepting connections
- Ping monitor — basic connectivity check to the server IP
- DNS monitor — verify your domain resolves correctly
- SSL certificate monitor — alerts before your certificate expires (set to 14 days)
# Recommended check intervals:
# HTTP/HTTPS: every 60 seconds
# TCP services: every 120 seconds
# Ping: every 60 seconds
# DNS: every 300 seconds
# SSL certificate: every 86400 seconds (daily)
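Uptime Kuma handles the expiry alert for you, but it is worth knowing how to check a certificate's expiry date by hand. A minimal sketch using openssl, with example.com standing in for your domain:

```shell
# Print the notAfter (expiry) date of the certificate a site serves.
# Replace example.com with your domain; -servername handles SNI.
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -enddate 2>/dev/null \
  || echo "could not retrieve certificate"
```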
Configuring Alert Thresholds
Alerts should be actionable. If an alert fires and your response is "that is normal, ignore it," the threshold is wrong. Here are sensible defaults:
Netdata Alert Configuration
# Netdata alert configs are in /etc/netdata/health.d/
# Edit or create custom alerts:
# /etc/netdata/health.d/custom.conf
# CPU usage alert
alarm: cpu_usage_high
on: system.cpu
lookup: average -5m percentage foreach user,system
every: 1m
warn: $this > 80
crit: $this > 95
info: CPU usage is high
# Disk space alert
alarm: disk_space_low
on: disk.space
lookup: min -1m percentage of used
every: 1m
warn: $this > 85
crit: $this > 95
info: Disk space running low
# RAM alert (using available, not used)
alarm: ram_available_low
on: system.ram
lookup: min -5m MB of available
every: 1m
warn: $this < 256
crit: $this < 128
info: Available RAM is critically low
# Reload Netdata alerts
netdatacli reload-health
Is It Real Load or Noisy Neighbors?
This is one of the most common diagnostic questions on a shared VPS. Your monitoring shows high CPU or I/O wait, but your application has not changed and traffic is normal. Is the problem yours or someone else's?
How to Tell
# Check steal time (CPU cycles taken by the hypervisor for other VMs)
top -bn1 | head -5
# Look for "%st" (steal time) in the CPU line
# Normal: 0-2%
# Concerning: 5-10%
# Problem: 10%+
# Check I/O wait patterns
iostat -xz 1 10
# If await (average wait time) spikes without your traffic increasing,
# it may be contention on the storage backend
# Compare performance at different times
# Run the same benchmark at 3 AM and 3 PM
sysbench cpu --threads=2 --time=30 run
# If results differ significantly, resource contention is likely
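If you want just the steal number for scripting or quick checks, it can be pulled out of top's summary line. The field layout assumes procps-ng top output, where the CPU line's last comma-separated field is the steal value (e.g. "0.3 st"):

```shell
# Extract the steal-time field from top's CPU summary line.
top -bn1 | grep '%Cpu' | awk -F',' '{for (i = 1; i <= NF; i++) if ($i ~ /st *$/) print $i}'
```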
When to Upgrade to VDS
If steal time is regularly above 5% or your performance varies significantly by time of day without corresponding traffic changes, you are experiencing resource contention. No amount of tuning on your end will fix this—the solution is dedicated resources with a Cloud VDS.
With a VDS, your CPU cores and RAM are exclusively allocated to you. No other tenant can affect your performance. You get the same self-managed control, same Ceph NVMe storage, same 12 Tbps DDoS protection—just without the noisy neighbors.
Alert Channels: Getting Notified Where You Actually Look
Alerts are only useful if you see them in time. Configure notifications through channels you actually check:
Email Alerts (Netdata)
# /etc/netdata/health_alarm_notify.conf
SEND_EMAIL="YES"
DEFAULT_RECIPIENT_EMAIL="your@email.com"
EMAIL_SENDER="netdata@yourdomain.com"
# Requires a working MTA (postfix, msmtp, etc.)
# Quick msmtp setup:
apt install msmtp msmtp-mta
# Configure /etc/msmtprc with your SMTP provider
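As a sketch, a minimal /etc/msmtprc for a generic SMTP provider looks like the following. The host, from address, user, and password file are placeholders you must replace with your provider's values; keep the file readable only by root (chmod 600 /etc/msmtprc):

```
# /etc/msmtprc -- minimal example; replace host/user with your provider's values
defaults
auth on
tls on
tls_trust_file /etc/ssl/certs/ca-certificates.crt
logfile /var/log/msmtp.log

account default
host smtp.example.com
port 587
from netdata@yourdomain.com
user alerts@example.com
passwordeval "cat /etc/msmtp-password"
```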
Telegram Alerts
# 1. Create a bot via @BotFather in Telegram
# 2. Get your chat ID from @userinfobot
# 3. Configure Netdata:
# /etc/netdata/health_alarm_notify.conf
SEND_TELEGRAM="YES"
TELEGRAM_BOT_TOKEN="your-bot-token"
DEFAULT_RECIPIENT_TELEGRAM="your-chat-id"
Slack Webhook Alerts
# /etc/netdata/health_alarm_notify.conf
SEND_SLACK="YES"
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
DEFAULT_RECIPIENT_SLACK="#alerts"
Discord Alerts
# /etc/netdata/health_alarm_notify.conf
SEND_DISCORD="YES"
DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/YOUR/WEBHOOK"
DEFAULT_RECIPIENT_DISCORD="alerts"
Pick one or two channels maximum. Having alerts in five different places means you ignore all of them.
Log Monitoring with Logwatch
Real-time monitoring shows you what is happening now. Log analysis shows you what happened and what is trending. Logwatch gives you a daily email summary of your server's log activity.
# Install Logwatch
apt update && apt install -y logwatch
# Run a manual report to see what it looks like
logwatch --detail High --range Today --output stdout
# Configure daily email reports
# /etc/logwatch/conf/logwatch.conf
Output = mail
MailTo = your@email.com
MailFrom = logwatch@yourdomain.com
Detail = Med
Range = yesterday
# Logwatch runs automatically via /etc/cron.daily/00logwatch
What Logwatch Reports Include
- SSH activity — successful and failed login attempts (critical for security)
- PAM authentication — sudo usage, su attempts
- Disk usage — filesystem usage summary
- Package updates — what was installed or updated via apt
- Nginx/Apache — request counts, error codes, top URLs
- MySQL — connections, errors, slow queries
- Postfix/mail — email delivery status if you run a mail server
Spend two minutes each morning reviewing the Logwatch email. You will spot trends (like increasing failed SSH attempts or growing disk usage) before they become problems.
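Between daily reports, one of the most useful ad hoc checks is counting failed SSH password attempts per source IP. The log path assumes Debian/Ubuntu; RHEL-family systems use /var/log/secure instead:

```shell
# Top source IPs for failed SSH password attempts.
grep "Failed password" /var/log/auth.log 2>/dev/null \
  | awk '{for (i = 1; i <= NF; i++) if ($i == "from") print $(i+1)}' \
  | sort | uniq -c | sort -rn | head
```

A handful of scattered attempts is normal internet background noise; hundreds from one IP is a brute-force attempt worth blocking.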
MySQL and Nginx Monitoring
Netdata auto-detects MySQL and Nginx if they are running, but you need to enable status endpoints for detailed metrics.
Nginx Status
# Add to your Nginx configuration
server {
listen 127.0.0.1:80;
server_name 127.0.0.1;
location /nginx_status {
stub_status on;
allow 127.0.0.1;
deny all;
}
}
# Reload Nginx
nginx -t && systemctl reload nginx
# Netdata will automatically pick up the stats
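To confirm the endpoint works before Netdata polls it, query it locally. The awk step pulls out just the active connection count from stub_status output (assumes the /nginx_status location configured above):

```shell
# Fetch the stub_status page and print the active connection count.
curl -s http://127.0.0.1/nginx_status | awk '/Active connections/ {print $3}'
```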
MySQL Performance Schema
# Create a Netdata monitoring user for MySQL
mysql -e "CREATE USER 'netdata'@'localhost' IDENTIFIED BY 'complex-password-here';"
mysql -e "GRANT USAGE, REPLICATION CLIENT, PROCESS ON *.* TO 'netdata'@'localhost';"
mysql -e "FLUSH PRIVILEGES;"
# Configure Netdata MySQL monitoring
# /etc/netdata/go.d/mysql.conf
jobs:
- name: local
dsn: netdata:complex-password-here@tcp(127.0.0.1:3306)/
# Restart Netdata
systemctl restart netdata
Building a Dashboard You Will Actually Check
The biggest monitoring failure is not missing tools—it is alert fatigue and dashboards nobody opens. Here is how to build a monitoring setup that actually works:
- Start with just three metrics: CPU, RAM, and disk space. Get comfortable checking these daily before adding more complexity.
- Set alerts only for things you will act on. If CPU hitting 60% does not require action, do not alert on it. Alert on 80% where you would investigate and 95% where you would immediately respond.
- Use one notification channel. Telegram or Slack, not both. Every additional channel splits your attention.
- Review Logwatch daily. Two minutes, every morning, same time. Make it a habit.
- Check Uptime Kuma weekly. Review uptime percentages, response time trends, and any incidents from the past week.
A monitoring system you actually use beats a sophisticated one you ignore.
When Monitoring Reveals You Need More
Good monitoring often leads to an important realization: you need more resources or less operational burden. If your alerts keep firing for legitimate load (not noisy neighbors), it is time to scale. On MassiveGRID, Cloud VPS supports independent scaling of CPU, RAM, and storage. If performance is inconsistent, Cloud VDS eliminates shared-resource contention. And if monitoring and responding to alerts has become a part-time job, Managed Cloud Dedicated Servers include 24/7 professional monitoring and incident response as part of the service.
MassiveGRID Ubuntu VPS includes: Ubuntu 24.04 LTS pre-installed · Proxmox HA cluster with automatic failover · Ceph 3x replicated NVMe storage · Independent CPU/RAM/storage scaling · 12 Tbps DDoS protection · 4 global datacenter locations · 100% uptime SLA · 24/7 human support rated 9.5/10
→ Deploy a self-managed VPS — from $1.99/mo
→ Need dedicated resources? — from $8.30/mo
→ Want fully managed hosting? — we handle everything
Automated Health Checks with a Simple Script
For servers where you do not want to install a full monitoring stack, a simple bash script can check critical services and send an alert if something is wrong:
#!/bin/bash
# /usr/local/bin/health-check.sh
# Run via cron every 5 minutes: */5 * * * * /usr/local/bin/health-check.sh
ALERT_EMAIL="your@email.com"
HOSTNAME=$(hostname)
check_service() {
local service=$1
if ! systemctl is-active --quiet "$service"; then
echo "ALERT: $service is down on $HOSTNAME" | \
mail -s "[$HOSTNAME] Service Down: $service" "$ALERT_EMAIL"
# Attempt auto-restart
systemctl restart "$service"
fi
}
# Check critical services
check_service nginx
check_service mysql
check_service php8.3-fpm
check_service redis-server
# Check disk space (alert if above 85%)
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$DISK_USAGE" -gt 85 ]; then
echo "ALERT: Disk usage at ${DISK_USAGE}% on $HOSTNAME" | \
mail -s "[$HOSTNAME] Disk Space Warning" "$ALERT_EMAIL"
fi
# Check RAM (alert if available RAM below 200MB)
AVAIL_RAM=$(free -m | awk 'NR==2 {print $7}')
if [ "$AVAIL_RAM" -lt 200 ]; then
echo "ALERT: Available RAM at ${AVAIL_RAM}MB on $HOSTNAME" | \
mail -s "[$HOSTNAME] Low Memory Warning" "$ALERT_EMAIL"
fi
# Make executable and add to cron (appending, so existing crontab entries survive;
# piping a bare echo into "crontab -" would overwrite the whole crontab)
chmod +x /usr/local/bin/health-check.sh
( crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/health-check.sh" ) | crontab -
This script provides basic monitoring with zero dependencies beyond a working mail setup. It is not a replacement for Netdata, but it is better than nothing, and it takes five minutes to set up.
Monitoring Strategy by Server Role
Different workloads need different monitoring focus areas. Here is what to prioritize based on what your server does:
Web Server (Nginx/Apache + PHP)
- Primary metric: HTTP response time and error rate
- Secondary: PHP-FPM pool status (active/idle processes), CPU usage
- Alert on: Response time above 2 seconds, 5xx error rate above 1%, PHP-FPM max_children reached
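A quick manual version of the response-time check, using curl's write-out variables (the URL is a placeholder for your site):

```shell
# Measure status code and total response time for a single request.
curl -o /dev/null -s -w 'HTTP %{http_code}  total %{time_total}s\n' https://example.com/ \
  || echo "request failed"
```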
Database Server (MySQL/MariaDB)
- Primary metric: Query response time, connection count
- Secondary: InnoDB buffer pool hit ratio, slow query rate, disk I/O wait
- Alert on: Connections above 80% of max, buffer pool hit ratio below 99%, replication lag (if applicable)
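The buffer pool hit ratio is derived from two status counters: 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests), where "reads" are requests the pool could not serve from memory and had to fetch from disk. A minimal sketch, assuming the mysql CLI can connect:

```shell
# Compute the InnoDB buffer pool hit ratio from global status counters.
mysql -N -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'" 2>/dev/null \
  | awk '$1 == "Innodb_buffer_pool_read_requests" {req = $2}
         $1 == "Innodb_buffer_pool_reads"         {disk = $2}
         END {if (req > 0) printf "hit ratio: %.4f\n", 1 - disk / req}'
```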
Application Server (Node.js, Python, etc.)
- Primary metric: Application response time, memory usage trend
- Secondary: Event loop lag (Node.js), worker process count, CPU per process
- Alert on: Memory growth over time (leak detection), response time degradation, process crashes
Docker Host
- Primary metric: Container health status, resource usage per container
- Secondary: Docker daemon health, disk space (images and volumes grow fast)
- Alert on: Container restarts, disk usage above 80%, any container in unhealthy state
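Both health status and restart counts can be checked ad hoc with Docker's own CLI. Note the health filter only matches containers that define a HEALTHCHECK:

```shell
# List containers currently reporting an unhealthy health check, then show
# restart counts (a climbing count means a crash-looping container).
if command -v docker >/dev/null 2>&1; then
  docker ps --filter health=unhealthy --format '{{.Names}}: {{.Status}}'
  docker ps -q | xargs -r docker inspect -f '{{.Name}} restarts={{.RestartCount}}'
else
  echo "docker not installed"
fi
```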
Focus your monitoring on the metrics that matter for your specific workload. Monitoring everything equally is the same as monitoring nothing—the important signals get lost in noise.