Metrics tell you that something went wrong. Logs tell you why. If you have followed our Prometheus and Grafana monitoring guide, you already have metrics covering CPU, memory, disk, and application performance. But when Prometheus fires an alert at 3 AM — CPU spike, increased error rate, disk I/O saturation — you need logs to diagnose the root cause. That is where Loki fits into the stack: a log aggregation system designed by the Grafana team to work seamlessly alongside Prometheus, using the same label-based approach and the same Grafana dashboards you already know.
Unlike the ELK stack (Elasticsearch, Logstash, Kibana), Loki does not index the full text of every log line. It indexes only the metadata labels (job, host, container name) and stores the raw log content compressed on disk. This architectural decision makes Loki dramatically more resource-efficient — ideal for VPS environments where RAM and CPU are shared resources. A complete observability stack (Prometheus + Loki + Promtail + Grafana) runs comfortably on hardware that would struggle with Elasticsearch alone.
MassiveGRID Ubuntu VPS includes: Ubuntu 24.04 LTS pre-installed · Proxmox HA cluster with automatic failover · Ceph 3x replicated NVMe storage · Independent CPU/RAM/storage scaling · 12 Tbps DDoS protection · 4 global datacenter locations · 100% uptime SLA · 24/7 human support rated 9.5/10
Deploy a self-managed VPS — from $1.99/mo
Need dedicated resources? — from $19.80/mo
Want fully managed hosting? — we handle everything
The Three Pillars of Observability
Modern observability relies on three complementary data types:
- Metrics (Prometheus) — Numerical measurements over time: CPU usage at 85%, request latency at 200ms, error rate at 2.3%. Metrics answer "what is happening" and "how much."
- Logs (Loki) — Timestamped text records from applications and systems: error messages, access logs, audit trails. Logs answer "why did it happen" and provide the context behind metric anomalies.
- Traces (Tempo, Jaeger) — Request paths through distributed systems: this HTTP request hit service A, then service B, then the database. Traces answer "where did time go" in multi-service architectures.
This guide adds the second pillar — logs — to your existing Prometheus and Grafana setup. With metrics and logs in the same Grafana instance, you can click from a metric alert directly into the relevant logs without switching tools.
Why Loki Over Elasticsearch
The ELK stack (Elasticsearch + Logstash + Kibana) is the incumbent solution for log aggregation. It is powerful but resource-hungry. Here is why Loki is the better choice for VPS environments:
- Memory efficiency — Elasticsearch needs 4-8GB of heap memory minimum. Loki runs with 256-512MB for moderate log volumes.
- No full-text indexing — Elasticsearch indexes every word in every log line, consuming CPU and storage. Loki indexes only labels, storing raw logs compressed.
- Same query patterns — If you know PromQL (Prometheus), LogQL (Loki) feels familiar. Same label selectors, same concepts.
- Native Grafana integration — Loki is a first-class data source in Grafana. No separate Kibana installation needed.
- Simpler operations — No cluster management, no shard rebalancing, no index lifecycle policies. Single binary, single configuration file.
The trade-off: Loki's grep-style searches are slower than Elasticsearch's indexed searches for ad-hoc full-text queries. But for operational use — finding error logs within a time range, filtering by service name, correlating with metrics — Loki is fast enough and far more efficient.
Prerequisites
This guide assumes you have Prometheus and Grafana running via Docker Compose, as described in our monitoring guide. Loki is resource-efficient. Added to your existing Grafana stack, Promtail uses approximately 50MB RAM and Loki uses 256-512MB. A Cloud VPS with 4 vCPU and 8GB RAM runs the complete observability stack — Prometheus, Grafana, Loki, and Promtail — with room for your application workloads.
Verify your existing stack is running:
docker compose -f /opt/monitoring/docker-compose.yml ps
Docker Compose Setup
Add Loki and Promtail to your existing monitoring Docker Compose file. If you are starting fresh, create the full stack:
# /opt/monitoring/docker-compose.yml
# Add these services to your existing Prometheus/Grafana compose file

services:
  # ... existing prometheus and grafana services ...

  loki:
    image: grafana/loki:3.3.0
    container_name: loki
    restart: unless-stopped
    ports:
      - "127.0.0.1:3100:3100"
    volumes:
      - ./loki/config.yml:/etc/loki/config.yml:ro
      - loki-data:/loki
    command: -config.file=/etc/loki/config.yml
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:3.3.0
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail/config.yml:/etc/promtail/config.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /run/docker.sock:/run/docker.sock:ro
    command: -config.file=/etc/promtail/config.yml
    depends_on:
      - loki
    networks:
      - monitoring

volumes:
  loki-data:

networks:
  monitoring:
    driver: bridge
Create the configuration directories:
mkdir -p /opt/monitoring/loki /opt/monitoring/promtail
Promtail Configuration
Promtail is the log collection agent. It tails log files, labels them, and ships them to Loki. Configure it to collect system logs, Nginx access and error logs, Docker container logs, and application logs:
# /opt/monitoring/promtail/config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # ── System logs ──────────────────────────────────────────────
  - job_name: syslog
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          host: myserver
          __path__: /var/log/syslog

  - job_name: auth
    static_configs:
      - targets:
          - localhost
        labels:
          job: auth
          host: myserver
          __path__: /var/log/auth.log

  # ── Nginx logs ───────────────────────────────────────────────
  - job_name: nginx-access
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          type: access
          host: myserver
          __path__: /var/log/nginx/access.log
    pipeline_stages:
      - regex:
          expression: '^(?P<remote_addr>[\w.]+) - (?P<remote_user>\S+) \[(?P<time_local>.+)\] "(?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+)" (?P<status>\d+) (?P<body_bytes_sent>\d+)'
      - labels:
          method:
          status:

  - job_name: nginx-error
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          type: error
          host: myserver
          __path__: /var/log/nginx/error.log

  # ── Docker container logs ────────────────────────────────────
  - job_name: docker
    docker_sd_configs:
      - host: unix:///run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: stream
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: compose_service

  # ── Application logs ─────────────────────────────────────────
  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: myapp
          host: myserver
          __path__: /var/log/myapp/*.log
    pipeline_stages:
      - multiline:
          firstline: '^\d{4}-\d{2}-\d{2}'
          max_wait_time: 3s
      - regex:
          expression: '^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.*)'
      - labels:
          level:
Key configuration details: the positions file tracks where Promtail left off in each log file, so restarts neither duplicate nor miss entries. Note that /tmp/positions.yaml lives inside the container; to survive container recreation as well, mount it on a volume (for example, a promtail-positions volume mapped to /tmp in the compose file). Pipeline stages parse log lines and extract labels for efficient querying in Loki.
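The positions file itself is just a small YAML map from file path to byte offset. Its contents look roughly like this (the paths come from the scrape config above; the offset values here are illustrative):

```yaml
# /tmp/positions.yaml — written and updated by Promtail (illustrative values)
positions:
  /var/log/syslog: "10485760"
  /var/log/nginx/access.log: "5242880"
```

If an entry's offset ever points past the end of a rotated file, Promtail simply starts reading from the beginning of the new file.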
Loki Configuration
Configure Loki for single-instance deployment with filesystem storage:
# /opt/monitoring/loki/config.yml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache

limits_config:
  retention_period: 30d
  max_query_series: 5000
  max_query_parallelism: 2
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem
The retention_period: 30d setting automatically deletes logs older than 30 days. Adjust it based on your compliance requirements and available disk space. The compactor runs every 10 minutes to merge small index files and enforce retention.
Start the updated stack:
cd /opt/monitoring
docker compose up -d loki promtail
Verify Loki is accepting data:
# Check Loki readiness
curl -s http://127.0.0.1:3100/ready
# Check Promtail targets
curl -s http://127.0.0.1:9080/targets
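As a further sanity check, you can push a log line straight into Loki's HTTP API and query it back. This is a sketch assuming Loki is reachable on 127.0.0.1:3100 as configured above; the smoke-test job label is arbitrary:

```shell
# Loki's push API expects nanosecond timestamps as strings
TS="$(date +%s%N)"
PAYLOAD=$(printf '{"streams":[{"stream":{"job":"smoke-test"},"values":[["%s","hello loki"]]}]}' "$TS")
echo "$PAYLOAD"

# Push the entry (|| true keeps the script going if Loki is not up yet)
curl -s -H "Content-Type: application/json" \
  -X POST -d "$PAYLOAD" http://127.0.0.1:3100/loki/api/v1/push || true

# Query it back with the query_range endpoint
curl -s -G http://127.0.0.1:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="smoke-test"}' || true
```

If the push succeeded, the query response contains your "hello loki" line under the smoke-test stream.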
Adding Loki as a Data Source in Grafana
Open Grafana (typically at https://grafana.yourdomain.com), navigate to Connections > Data Sources > Add data source, and select Loki. Configure the connection:
- URL: http://loki:3100 (the Docker network hostname)
- Timeout: 60s (for large queries)
- Leave authentication disabled (Loki is on the internal Docker network)
Click Save & Test. Grafana will verify the connection and confirm "Data source successfully connected." You can now query logs from the Explore view alongside your Prometheus metrics.
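If you prefer configuration-as-code, the same data source can be provisioned from a file instead of the UI. The sketch below assumes Grafana's provisioning directory is mounted into the container; the file path is illustrative:

```yaml
# /opt/monitoring/grafana/provisioning/datasources/loki.yml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      timeout: 60
```

Grafana reads this directory at startup, so the data source appears automatically on a fresh deployment without any clicking.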
LogQL Basics
LogQL is Loki's query language. If you know PromQL, the syntax feels familiar. Queries start with a log stream selector (curly braces with labels) and optionally add filter expressions:
# ── Stream selectors ───────────────────────────────────────────
# All Nginx access logs
{job="nginx", type="access"}
# All logs from a specific Docker container
{container="myapp"}
# All syslog entries
{job="syslog"}
# ── Line filters ───────────────────────────────────────────────
# Nginx 500 errors
{job="nginx", type="access"} |= "500"
# Error-level application logs
{job="myapp"} |= "ERROR"
# Exclude health check noise
{job="nginx", type="access"} != "/health"
# Regex filter — 4xx and 5xx status codes
{job="nginx", type="access"} |~ "\" [45]\\d{2} "
# ── Parsing and filtering ─────────────────────────────────────
# Parse Nginx logs and filter by status
{job="nginx", type="access"} | pattern `<ip> - - [<timestamp>] "<method> <path> <_>" <status> <bytes>` | status >= 400
# ── Metric queries (aggregation) ──────────────────────────────
# Error rate per minute
rate({job="nginx", type="access"} |= "500" [1m])
# Log volume by container
sum by (container) (rate({job="docker"} [5m]))
# Count failed SSH attempts per hour
sum(count_over_time({job="auth"} |= "Failed password" [1h]))
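LogQL's |~ filter uses RE2 regular expressions, which for simple patterns like the status-code match above behave the same as extended grep. You can sanity-check the pattern locally against a sample access-log line before relying on it in a dashboard:

```shell
# Sample Nginx combined-log line with a 502 status
LINE='203.0.113.7 - - [10/Jan/2025:14:32:01 +0000] "GET /api/orders HTTP/1.1" 502 512'

# Same pattern as the LogQL filter: closing quote, space, 4xx/5xx code, space
echo "$LINE" | grep -E '" [45][0-9]{2} ' && echo "matches"
```

The trailing space in the pattern matters: it prevents the body-bytes field from matching as a status code.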
The power of LogQL becomes apparent when combined with Prometheus metrics. A spike in HTTP error rate (Prometheus) leads you to filter Nginx error logs (Loki) for the same time window — all within the same Grafana dashboard.
Building Log Dashboards in Grafana
Create a new dashboard that combines Prometheus metrics and Loki logs. Here is a practical layout for a web server monitoring dashboard:
Row 1 — Metrics overview (Prometheus):
- Request rate (requests/second)
- Error rate (4xx + 5xx percentage)
- Response latency (p50, p95, p99)
Row 2 — Log panels (Loki):
- Recent error logs: {job="nginx", type="error"}
- HTTP 5xx access logs: {job="nginx", type="access"} |~ "\" 5\\d{2} "
- Application exceptions: {job="myapp"} |= "Exception"
Row 3 — Log volume graphs (Loki metric queries):
- Error log volume over time: sum(rate({job="nginx", type="error"} [5m]))
- Log volume by container: sum by (compose_service) (rate({job="docker"} [5m]))
The key technique is Grafana's shared time range: every panel on a dashboard follows the same time picker, so when you spot a metric anomaly and zoom into it, all log panels update to show only logs from that window. Dashboard variables (for example a host variable mapped to the host label) let you additionally filter metrics and logs for a single server in one click.
Alert Rules on Log Patterns
Grafana alerting works with Loki queries just as it does with Prometheus. Create alerts on log patterns that indicate problems:
# Alert: More than 10 HTTP 500 errors in 5 minutes
# LogQL expression for Grafana alert rule:
sum(count_over_time({job="nginx", type="access"} |= " 500 " [5m])) > 10
# Alert: Failed SSH login attempts (brute force detection)
sum(count_over_time({job="auth"} |= "Failed password" [15m])) > 20
# Alert: Application out-of-memory errors
count_over_time({job="myapp"} |= "OutOfMemoryError" [10m]) > 0
# Alert: Disk space warnings in syslog
count_over_time({job="syslog"} |= "No space left on device" [5m]) > 0
In Grafana, navigate to Alerting > Alert Rules > New Alert Rule. Select Loki as the data source, enter the LogQL expression, set the threshold condition, and configure notification channels. This gives you proactive alerting on log patterns without manually watching dashboards.
Log Retention and Storage Management
Logs consume disk space proportionally to your traffic and verbosity. A moderately busy web server generates 1-5GB of raw logs per day. Loki compresses logs efficiently (typically 10:1), but storage still grows over time.
Monitor Loki's disk usage:
# Check Loki storage size
du -sh /var/lib/docker/volumes/*loki*
# Check overall disk usage
df -h /
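To turn the numbers above into a rough capacity plan, a quick back-of-envelope calculation helps. The daily volume below is an assumption; substitute your own measurement:

```shell
RAW_GB_PER_DAY=3       # assumed raw log volume — measure yours with du
COMPRESSION_RATIO=10   # Loki's typical ~10:1 compression, per the text above
RETENTION_DAYS=30      # matches retention_period: 30d in the Loki config

# Compressed storage needed over the full retention window
NEEDED_GB=$(( RAW_GB_PER_DAY * RETENTION_DAYS / COMPRESSION_RATIO ))
echo "Estimated Loki storage: ${NEEDED_GB} GB"
```

Leave generous headroom on top of the estimate: the TSDB index, the compactor's working directory, and ingestion spikes all add to the raw chunk storage.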
Strategies for managing log storage:
- Tune retention — Set retention_period in Loki's config based on your needs: 7d for development, 30d for production, 90d for compliance.
- Filter verbose logs in Promtail — Drop health check logs and other noise before they reach Loki:
# Add to Promtail pipeline_stages for nginx-access:
pipeline_stages:
  - drop:
      expression: '.*"GET /health HTTP.*"'
      drop_counter_reason: health_check
  - drop:
      expression: '.*"GET /favicon.ico HTTP.*"'
      drop_counter_reason: favicon
- Scale storage independently — With MassiveGRID's independent resource scaling, add disk space without changing CPU or RAM allocation.
For comprehensive disk management strategies, see our disk space management guide.
Practical Workflow: From Alert to Root Cause
Here is how the complete observability stack works together in a real incident:
- Prometheus alert fires — "HTTP error rate above 5% for 5 minutes."
- Open Grafana dashboard — Metrics panel shows error rate spike starting at 14:32. CPU is normal. Memory is normal. Disk I/O has a spike.
- Check Loki logs — Filter Nginx error logs for the 14:30-14:40 window:
{job="nginx", type="error"} |= "upstream"
Result: "upstream timed out (110: Connection timed out) while reading response header from upstream."
- Drill deeper — Check application logs for the same window:
{job="myapp"} |~ "timeout|slow query"
Result: "Slow query detected: SELECT * FROM orders WHERE... took 12.4s"
- Root cause identified — A missing database index caused full table scans, which saturated disk I/O, which caused application timeouts, which caused Nginx upstream timeouts, which caused HTTP 500 errors.
Without Loki, step 3 requires SSH-ing into the server and manually grepping log files. With Loki, you stay in Grafana and correlate metrics with logs in seconds. LogQL queries that scan log data on disk can be I/O intensive. Dedicated VPS resources ensure your log queries complete quickly without competing with your application workloads.
Connecting Loki Alerts to ntfy
If you have set up ntfy for push notifications (see our ntfy self-hosting guide), connect Grafana alerts to ntfy for instant mobile notifications on log anomalies.
In Grafana, go to Alerting > Contact Points > New Contact Point and add a webhook:
# Webhook URL for ntfy
https://ntfy.yourdomain.com/alerts
# Or use the Grafana alerting webhook with a script
# /opt/monitoring/scripts/ntfy-notify.sh
#!/bin/bash
TITLE="$1"
MESSAGE="$2"
curl -s \
-H "Title: $TITLE" \
-H "Priority: high" \
-H "Tags: warning" \
-d "$MESSAGE" \
https://ntfy.yourdomain.com/server-alerts
Now log-based alerts — error spikes, failed login attempts, application exceptions — trigger push notifications to your phone. Combined with Prometheus metric alerts, you get comprehensive coverage.
Prefer Managed Observability?
A complete observability stack gives you visibility into every layer of your infrastructure. But visibility is only valuable if someone reads the dashboards and acts on the alerts. MassiveGRID's fully managed hosting includes 24/7 monitoring by a human team that investigates alerts, diagnoses issues, and takes action — even at 3 AM. You get the dashboards and the peace of mind that someone is watching them when you are not.