Two-Layer Monitoring: What Your Provider Covers vs What You Cover
When you deploy a VPS on MassiveGRID, there are two monitoring layers. Understanding who watches what prevents both blind spots and duplicated effort.
What MassiveGRID Monitors (Infrastructure Layer)
- Physical hardware — CPU temperature, disk health (SMART), memory ECC errors, power supply status
- Network infrastructure — switch health, BGP sessions, link utilization, packet loss between nodes
- Storage cluster — Ceph cluster health, OSD status, replication state, storage capacity
- DDoS mitigation — 12 Tbps scrubbing capacity, automatic detection and filtering
- Hypervisor health — Proxmox node status, HA cluster state, automatic VM failover
If a physical disk fails, a network link goes down, or a DDoS attack hits your IP, MassiveGRID detects and responds automatically. You do not need to monitor any of this.
What You Monitor (Application Layer)
- Application health — is your website responding? Is your API returning correct data?
- Service status — is Nginx running? Is MySQL accepting connections? Is PHP-FPM healthy?
- Resource usage — CPU utilization, RAM consumption, disk space, I/O throughput
- Logs — error rates, authentication failures, suspicious activity
- SSL certificates — expiration dates, renewal status
The gap between these two layers is where most outages hide. Your hardware is fine, your network is fine, but MySQL crashed because it ran out of memory. Only application-layer monitoring catches this.
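A quick way to confirm that specific failure mode after the fact: the kernel logs every OOM kill, so you can search for it directly. This is a minimal sketch; the journalctl path assumes a systemd journal (standard on Ubuntu 24.04).

```shell
# Check the kernel log for OOM-killer activity in the last 24 hours.
# On systems without a systemd journal, grep /var/log/kern.log instead.
journalctl -k --since "24 hours ago" 2>/dev/null | grep -i "out of memory" \
  || echo "no OOM events found in the last 24 hours"
```

If this prints a "Killed process" line naming mysqld, you have found the application-layer failure that infrastructure monitoring could never see.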
Installing Netdata: Real-Time System Monitoring
Netdata is an excellent real-time monitoring tool for a single VPS. It installs with one command, auto-detects your services, and has minimal performance overhead (typically under 2% CPU).
# One-line installation on Ubuntu 24.04
curl -fsSL https://get.netdata.cloud/kickstart.sh > /tmp/netdata-kickstart.sh && \
sh /tmp/netdata-kickstart.sh --stable-channel
# Verify it is running
systemctl status netdata
# Access the dashboard
# http://your-server-ip:19999
Securing Netdata Access
By default, Netdata listens on all interfaces on port 19999. Do not leave this open to the internet:
# Option 1: Restrict to localhost only (access via SSH tunnel)
# Edit /etc/netdata/netdata.conf
[web]
bind to = 127.0.0.1
# Then access via SSH tunnel:
# ssh -L 19999:localhost:19999 user@your-server-ip
# Open http://localhost:19999 in your browser
# Option 2: Restrict with UFW firewall (UFW evaluates rules in order,
# so the allow rule must be added before the deny)
ufw allow from YOUR_IP to any port 19999
ufw deny 19999
# Restart Netdata after configuration changes
systemctl restart netdata
Netdata Dashboard Overview
Once installed, the Netdata dashboard automatically shows:
- System Overview — CPU usage per core, total RAM/swap usage, load average, uptime
- Disk I/O — read/write throughput, IOPS, latency, disk space per mount
- Network — bandwidth in/out per interface, packet rates, errors and drops
- Applications — per-process CPU/RAM/disk/network usage (groups by application)
- Services — auto-detected MySQL, Nginx, PHP-FPM, Redis, and more with dedicated dashboards
Netdata retains detailed metrics for hours and summarized data for days, all stored locally with minimal disk usage.
Key Metrics and What They Mean
Monitoring is useless if you do not know what the numbers mean. Here are the metrics that matter and when to worry:
| Metric | Normal Range | Warning | Critical | What It Means |
|---|---|---|---|---|
| CPU usage | 10–50% | 70–80% | 90%+ | Processing capacity consumed |
| Load average | < vCPU count | 1.5x vCPUs | 2x+ vCPUs | Processes running or waiting (on Linux, includes uninterruptible I/O) |
| RAM usage | 40–70% | 80–90% | 95%+ | Memory consumed (includes cache) |
| Swap usage | 0 MB | Any usage | 50%+ of swap | System ran out of RAM |
| Disk space | < 60% | 70–85% | 90%+ | Available storage remaining |
| Disk I/O wait | < 5% | 10–20% | 30%+ | CPU waiting on slow storage |
| Network errors | 0 | Any | Sustained errors | Network problems or misconfig |
Important note about RAM: Linux uses free RAM for disk caching. A server showing 80% RAM usage might actually have plenty of available memory because the cache can be freed instantly. Use free -h and look at the "available" column, not "used."
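To turn that into a number you can script against, the "available" column can be extracted with awk. Column positions assume the standard procps free output (total, used, free, shared, buff/cache, available):

```shell
# Print truly available memory in MB: the 7th column of the "Mem:" row.
free -m | awk '/^Mem:/ {print $7 " MB available"}'
```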
Setting Up Uptime Kuma: External Service Monitoring
Netdata tells you what is happening inside your server. Uptime Kuma tells you what your users experience from the outside. This distinction matters—your server might show normal CPU/RAM but your website could still be returning 500 errors.
Installing Uptime Kuma with Docker
# Install Docker if not already present
curl -fsSL https://get.docker.com | sh
systemctl enable docker
# Run Uptime Kuma
docker run -d \
--name uptime-kuma \
--restart=always \
-p 3001:3001 \
-v uptime-kuma:/app/data \
louislam/uptime-kuma:1
# Access at http://your-server-ip:3001
# Create your admin account on first access
Pro tip: Ideally, run Uptime Kuma on a different server than the one you are monitoring. If the monitored server goes down and Uptime Kuma is on the same server, you will not get an alert. A separate small Cloud VPS dedicated to monitoring is a good investment.
Monitors to Configure
Set up these monitors for a typical web server:
- HTTP(S) monitor — check your website URL, expect a 200 response, verify the response contains expected text (keyword monitoring)
- TCP monitor — check that MySQL (port 3306), Redis (port 6379), and other services are accepting connections
- Ping monitor — basic connectivity check to the server IP
- DNS monitor — verify your domain resolves correctly
- SSL certificate monitor — alerts before your certificate expires (set to 14 days)
# Recommended check intervals:
# HTTP/HTTPS: every 60 seconds
# TCP services: every 120 seconds
# Ping: every 60 seconds
# DNS: every 300 seconds
# SSL certificate: every 86400 seconds (daily)
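Uptime Kuma handles the expiry alert for you, but it is worth knowing how to check a certificate's expiry date by hand. A minimal sketch using openssl, with example.com standing in for your domain:

```shell
# Print the notAfter (expiry) date of the certificate a site serves.
# Replace example.com with your domain; -servername handles SNI.
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -enddate 2>/dev/null \
  || echo "could not retrieve certificate"
```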
Configuring Alert Thresholds
Alerts should be actionable. If an alert fires and your response is "that is normal, ignore it," the threshold is wrong. Here are sensible defaults:
Netdata Alert Configuration
# Netdata alert configs are in /etc/netdata/health.d/
# Edit or create custom alerts:
# /etc/netdata/health.d/custom.conf
# CPU usage alert
alarm: cpu_usage_high
on: system.cpu
lookup: average -5m percentage foreach user,system
every: 1m
warn: $this > 80
crit: $this > 95
info: CPU usage is high
# Disk space alert
alarm: disk_space_low
on: disk.space
lookup: min -1m percentage of used
every: 1m
warn: $this > 85
crit: $this > 95
info: Disk space running low
# RAM alert (using available, not used)
alarm: ram_available_low
on: system.ram
lookup: min -5m MB of available
every: 1m
warn: $this < 256
crit: $this < 128
info: Available RAM is critically low
# Reload Netdata alerts
netdatacli reload-health
Is It Real Load or Noisy Neighbors?
This is one of the most common diagnostic questions on a shared VPS. Your monitoring shows high CPU or I/O wait, but your application has not changed and traffic is normal. Is the problem yours or someone else's?
How to Tell
# Check steal time (CPU cycles taken by the hypervisor for other VMs)
top -bn1 | head -5
# Look for "%st" (steal time) in the CPU line
# Normal: 0-2%
# Concerning: 5-10%
# Problem: 10%+
# Check I/O wait patterns
iostat -xz 1 10
# If await (average wait time) spikes without your traffic increasing,
# it may be contention on the storage backend
# Compare performance at different times
# Run the same benchmark at 3 AM and 3 PM
sysbench cpu --threads=2 --time=30 run
# If results differ significantly, resource contention is likely
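If you want just the steal number for scripting or quick checks, it can be pulled out of top's summary line. The field layout assumes procps-ng top output, where the CPU line's last comma-separated field is the steal value (e.g. "0.3 st"):

```shell
# Extract the steal-time field from top's CPU summary line.
top -bn1 | grep '%Cpu' | awk -F',' '{for (i = 1; i <= NF; i++) if ($i ~ /st *$/) print $i}'
```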
When to Upgrade to VDS
If steal time is regularly above 5% or your performance varies significantly by time of day without corresponding traffic changes, you are experiencing resource contention. No amount of tuning on your end will fix this—the solution is dedicated resources with a Cloud VDS.
With a VDS, your CPU cores and RAM are exclusively allocated to you. No other tenant can affect your performance. You get the same self-managed control, same Ceph NVMe storage, same 12 Tbps DDoS protection—just without the noisy neighbors.
Alert Channels: Getting Notified Where You Actually Look
Alerts are only useful if you see them in time. Configure notifications through channels you actually check:
Email Alerts (Netdata)
# /etc/netdata/health_alarm_notify.conf
SEND_EMAIL="YES"
DEFAULT_RECIPIENT_EMAIL="your@email.com"
EMAIL_SENDER="netdata@yourdomain.com"
# Requires a working MTA (postfix, msmtp, etc.)
# Quick msmtp setup:
apt install msmtp msmtp-mta
# Configure /etc/msmtprc with your SMTP provider
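As a sketch, a minimal /etc/msmtprc for a generic SMTP provider looks like the following. The host, from address, user, and password file are placeholders you must replace with your provider's values; keep the file readable only by root (chmod 600 /etc/msmtprc):

```
# /etc/msmtprc -- minimal example; replace host/user with your provider's values
defaults
auth on
tls on
tls_trust_file /etc/ssl/certs/ca-certificates.crt
logfile /var/log/msmtp.log

account default
host smtp.example.com
port 587
from netdata@yourdomain.com
user alerts@example.com
passwordeval "cat /etc/msmtp-password"
```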
Telegram Alerts
# 1. Create a bot via @BotFather in Telegram
# 2. Get your chat ID from @userinfobot
# 3. Configure Netdata:
# /etc/netdata/health_alarm_notify.conf
SEND_TELEGRAM="YES"
TELEGRAM_BOT_TOKEN="your-bot-token"
DEFAULT_RECIPIENT_TELEGRAM="your-chat-id"
Slack Webhook Alerts
# /etc/netdata/health_alarm_notify.conf
SEND_SLACK="YES"
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
DEFAULT_RECIPIENT_SLACK="#alerts"
Discord Alerts
# /etc/netdata/health_alarm_notify.conf
SEND_DISCORD="YES"
DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/YOUR/WEBHOOK"
DEFAULT_RECIPIENT_DISCORD="alerts"
Pick one or two channels maximum. Having alerts in five different places means you ignore all of them.
Log Monitoring with Logwatch
Real-time monitoring shows you what is happening now. Log analysis shows you what happened and what is trending. Logwatch gives you a daily email summary of your server's log activity.
# Install Logwatch
apt update && apt install -y logwatch
# Run a manual report to see what it looks like
logwatch --detail High --range Today --output stdout
# Configure daily email reports
# /etc/logwatch/conf/logwatch.conf
Output = mail
MailTo = your@email.com
MailFrom = logwatch@yourdomain.com
Detail = Med
Range = yesterday
# Logwatch runs automatically via /etc/cron.daily/00logwatch
What Logwatch Reports Include
- SSH activity — successful and failed login attempts (critical for security)
- PAM authentication — sudo usage, su attempts
- Disk usage — filesystem usage summary
- Package updates — what was installed or updated via apt
- Nginx/Apache — request counts, error codes, top URLs
- MySQL — connections, errors, slow queries
- Postfix/mail — email delivery status if you run a mail server
Spend two minutes each morning reviewing the Logwatch email. You will spot trends (like increasing failed SSH attempts or growing disk usage) before they become problems.
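Between daily reports, one of the most useful ad hoc checks is counting failed SSH password attempts per source IP. The log path assumes Debian/Ubuntu; RHEL-family systems use /var/log/secure instead:

```shell
# Top source IPs for failed SSH password attempts.
grep "Failed password" /var/log/auth.log 2>/dev/null \
  | awk '{for (i = 1; i <= NF; i++) if ($i == "from") print $(i+1)}' \
  | sort | uniq -c | sort -rn | head
```

A handful of scattered attempts is normal internet background noise; hundreds from one IP is a brute-force attempt worth blocking.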
MySQL and Nginx Monitoring
Netdata auto-detects MySQL and Nginx if they are running, but you need to enable status endpoints for detailed metrics.
Nginx Status
# Add to your Nginx configuration
server {
listen 127.0.0.1:80;
server_name 127.0.0.1;
location /nginx_status {
stub_status on;
allow 127.0.0.1;
deny all;
}
}
# Reload Nginx
nginx -t && systemctl reload nginx
# Netdata will automatically pick up the stats
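To confirm the endpoint works before Netdata polls it, query it locally. The awk step pulls out just the active connection count from stub_status output (assumes the /nginx_status location configured above):

```shell
# Fetch the stub_status page and print the active connection count.
curl -s http://127.0.0.1/nginx_status | awk '/Active connections/ {print $3}'
```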
MySQL Performance Schema
# Create a Netdata monitoring user for MySQL
mysql -e "CREATE USER 'netdata'@'localhost' IDENTIFIED BY 'complex-password-here';"
mysql -e "GRANT USAGE, REPLICATION CLIENT, PROCESS ON *.* TO 'netdata'@'localhost';"
mysql -e "FLUSH PRIVILEGES;"
# Configure Netdata MySQL monitoring
# /etc/netdata/go.d/mysql.conf
jobs:
- name: local
dsn: netdata:complex-password-here@tcp(127.0.0.1:3306)/
# Restart Netdata
systemctl restart netdata
Building a Dashboard You Will Actually Check
The biggest monitoring failure is not missing tools—it is alert fatigue and dashboards nobody opens. Here is how to build a monitoring setup that actually works:
- Start with just three metrics: CPU, RAM, and disk space. Get comfortable checking these daily before adding more complexity.
- Set alerts only for things you will act on. If CPU hitting 60% does not require action, do not alert on it. Alert on 80% where you would investigate and 95% where you would immediately respond.
- Use one notification channel. Telegram or Slack, not both. Every additional channel splits your attention.
- Review Logwatch daily. Two minutes, every morning, same time. Make it a habit.
- Check Uptime Kuma weekly. Review uptime percentages, response time trends, and any incidents from the past week.
A monitoring system you actually use beats a sophisticated one you ignore.
When Monitoring Reveals You Need More
Good monitoring often leads to an important realization: you need more resources or less operational burden. If your alerts keep firing for legitimate load (not noisy neighbors), it is time to scale. On MassiveGRID, Cloud VPS supports independent scaling of CPU, RAM, and storage. If performance is inconsistent, Cloud VDS eliminates shared-resource contention. And if monitoring and responding to alerts has become a part-time job, Managed Cloud Dedicated Servers include 24/7 professional monitoring and incident response as part of the service.
MassiveGRID Ubuntu VPS includes: Ubuntu 24.04 LTS pre-installed · Proxmox HA cluster with automatic failover · Ceph 3x replicated NVMe storage · Independent CPU/RAM/storage scaling · 12 Tbps DDoS protection · 4 global datacenter locations · 100% uptime SLA · 24/7 human support rated 9.5/10
→ Deploy a self-managed VPS — from $1.99/mo
→ Need dedicated resources? — from $8.30/mo
→ Want fully managed hosting? — we handle everything
Automated Health Checks with a Simple Script
For servers where you do not want to install a full monitoring stack, a simple bash script can check critical services and send an alert if something is wrong:
#!/bin/bash
# /usr/local/bin/health-check.sh
# Run via cron every 5 minutes: */5 * * * * /usr/local/bin/health-check.sh
ALERT_EMAIL="your@email.com"
HOSTNAME=$(hostname)
check_service() {
local service=$1
if ! systemctl is-active --quiet "$service"; then
echo "ALERT: $service is down on $HOSTNAME" | \
mail -s "[$HOSTNAME] Service Down: $service" "$ALERT_EMAIL"
# Attempt auto-restart
systemctl restart "$service"
fi
}
# Check critical services
check_service nginx
check_service mysql
check_service php8.3-fpm
check_service redis-server
# Check disk space (alert if above 85%)
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$DISK_USAGE" -gt 85 ]; then
echo "ALERT: Disk usage at ${DISK_USAGE}% on $HOSTNAME" | \
mail -s "[$HOSTNAME] Disk Space Warning" "$ALERT_EMAIL"
fi
# Check RAM (alert if available RAM below 200MB)
AVAIL_RAM=$(free -m | awk 'NR==2 {print $7}')
if [ "$AVAIL_RAM" -lt 200 ]; then
echo "ALERT: Available RAM at ${AVAIL_RAM}MB on $HOSTNAME" | \
mail -s "[$HOSTNAME] Low Memory Warning" "$ALERT_EMAIL"
fi
# Make executable and add to cron (appending, so existing crontab entries survive;
# piping a bare echo into "crontab -" would overwrite the whole crontab)
chmod +x /usr/local/bin/health-check.sh
( crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/health-check.sh" ) | crontab -
This script provides basic monitoring with zero dependencies beyond a working mail setup. It is not a replacement for Netdata, but it is better than nothing, and it takes five minutes to set up.
Monitoring Strategy by Server Role
Different workloads need different monitoring focus areas. Here is what to prioritize based on what your server does:
Web Server (Nginx/Apache + PHP)
- Primary metric: HTTP response time and error rate
- Secondary: PHP-FPM pool status (active/idle processes), CPU usage
- Alert on: Response time above 2 seconds, 5xx error rate above 1%, PHP-FPM max_children reached
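A quick manual version of the response-time check, using curl's write-out variables (the URL is a placeholder for your site):

```shell
# Measure status code and total response time for a single request.
curl -o /dev/null -s -w 'HTTP %{http_code}  total %{time_total}s\n' https://example.com/ \
  || echo "request failed"
```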
Database Server (MySQL/MariaDB)
- Primary metric: Query response time, connection count
- Secondary: InnoDB buffer pool hit ratio, slow query rate, disk I/O wait
- Alert on: Connections above 80% of max, buffer pool hit ratio below 99%, replication lag (if applicable)
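The buffer pool hit ratio is derived from two status counters: 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests), where "reads" are requests the pool could not serve from memory and had to fetch from disk. A minimal sketch, assuming the mysql CLI can connect:

```shell
# Compute the InnoDB buffer pool hit ratio from global status counters.
mysql -N -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'" 2>/dev/null \
  | awk '$1 == "Innodb_buffer_pool_read_requests" {req = $2}
         $1 == "Innodb_buffer_pool_reads"         {disk = $2}
         END {if (req > 0) printf "hit ratio: %.4f\n", 1 - disk / req}'
```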
Application Server (Node.js, Python, etc.)
- Primary metric: Application response time, memory usage trend
- Secondary: Event loop lag (Node.js), worker process count, CPU per process
- Alert on: Memory growth over time (leak detection), response time degradation, process crashes
Docker Host
- Primary metric: Container health status, resource usage per container
- Secondary: Docker daemon health, disk space (images and volumes grow fast)
- Alert on: Container restarts, disk usage above 80%, any container in unhealthy state
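Both health status and restart counts can be checked ad hoc with Docker's own CLI. Note the health filter only matches containers that define a HEALTHCHECK:

```shell
# List containers currently reporting an unhealthy health check, then show
# restart counts (a climbing count means a crash-looping container).
if command -v docker >/dev/null 2>&1; then
  docker ps --filter health=unhealthy --format '{{.Names}}: {{.Status}}'
  docker ps -q | xargs -r docker inspect -f '{{.Name}} restarts={{.RestartCount}}'
else
  echo "docker not installed"
fi
```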
Focus your monitoring on the metrics that matter for your specific workload. Monitoring everything equally is the same as monitoring nothing—the important signals get lost in noise.