Disaster recovery is not about preventing disasters — it is about surviving them. Hardware fails, humans make mistakes, attackers find vulnerabilities, and software has bugs. The question is not whether something will go wrong, but whether you can recover when it does. A good disaster recovery plan is a document you hope you never need, tested regularly so that when you do need it, it works.
This guide covers what can go wrong, what your hosting provider already handles, what you are responsible for, how to build a recovery plan, and step-by-step recovery procedures for the three most common disaster scenarios.
MassiveGRID Ubuntu VPS includes: Ubuntu 24.04 LTS pre-installed · Proxmox HA cluster with automatic failover · Ceph 3x replicated NVMe storage · Independent CPU/RAM/storage scaling · 12 Tbps DDoS protection · 4 global datacenter locations · 100% uptime SLA · 24/7 human support rated 9.5/10
Deploy a self-managed VPS — from $1.99/mo
Need dedicated resources? — from $19.80/mo
Want fully managed hosting? — we handle everything
What "Disaster" Means for a VPS
Not all disasters are equal. Classify them by likelihood and impact to prioritize your preparation:
| Disaster Type | Likelihood | Impact | Who Handles It |
|---|---|---|---|
| Hardware failure (CPU, memory, motherboard) | Low | Total outage | Hosting provider (HA failover) |
| Disk failure | Low | Data loss | Hosting provider (Ceph replication) |
| DDoS attack | Medium | Service unavailability | Hosting provider (DDoS protection) |
| Network outage (datacenter) | Low | Total outage | Hosting provider |
| Accidental data deletion (DROP TABLE) | High | Data loss | You |
| Security breach / server compromise | Medium | Data loss + exposure | You |
| Bad deployment / application corruption | High | Service degradation | You |
| Configuration mistake (firewall, permissions) | High | Lockout or outage | You |
| Ransomware / data encryption | Low-Medium | Total data loss | You |
Notice the pattern: the disasters most likely to happen are the ones you are responsible for handling.
What MassiveGRID Already Handles
On a MassiveGRID Cloud VPS, three of the most common infrastructure disasters are already handled — before you do anything:
Hardware Failure: Proxmox HA Cluster
Your VPS runs on a Proxmox high-availability cluster. If the physical server hosting your VPS fails, the cluster automatically migrates your VM to a healthy node. This happens automatically, typically within seconds, with no data loss.
Disk Failure: Ceph 3x Replication
Your VPS storage uses Ceph with 3x replication across NVMe drives. Every block of data is written to three independent disks on different physical hosts. A single disk failure (or even an entire storage node failure) causes zero data loss — Ceph continues serving data from the remaining replicas while rebuilding the lost copy.
DDoS Attacks: 12 Tbps Protection
Volumetric DDoS attacks are absorbed by MassiveGRID's 12 Tbps DDoS mitigation infrastructure. Your VPS continues operating normally while attack traffic is filtered upstream.
Key insight: Infrastructure-level disasters are handled by your hosting provider's architecture. Application-level disasters are your responsibility. Everything in this guide focuses on the disasters that you must plan for.
What You Must Handle
The disasters your hosting provider cannot protect you from:
- Accidental data deletion: rm -rf /var/www, DROP TABLE users;, deleted Docker volumes
- Security breaches: Compromised SSH keys, exploited application vulnerabilities, stolen database credentials
- Application corruption: Bad migrations, corrupted configuration, broken deployments
- Human error: Wrong server, wrong database, wrong command
- Ransomware: Encrypted files with demands for payment
The common thread: all of these require backups, documentation, and tested recovery procedures.
The Disaster Recovery Plan
A disaster recovery plan is a document, not software. It should be stored outside your VPS (in a Git repository, a shared document, or a password manager note) and be accessible even when your server is completely unavailable.
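One low-friction way to keep the plan off the server is a small Git repository; a minimal sketch, assuming a local clone at ~/dr-plan with a remote hosted outside the VPS (the path, filename, and branch are placeholders):

```shell
# Version the DR plan in Git so it survives the server itself.
# ~/dr-plan, DR-PLAN.md, and the remote are placeholders for your own repo.
cd ~/dr-plan
git add DR-PLAN.md
git commit -m "Update DR plan $(date +%F)" || echo "No changes to commit"
git push origin main  # remote lives OUTSIDE this VPS (GitHub, GitLab, etc.)
```

Every edit to the plan then gets a timestamped history, and anyone with repo access can reach it while the server is down.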
DR Plan Template
====================================
DISASTER RECOVERY PLAN
Application: [Your App Name]
Last Updated: [Date]
Last Tested: [Date]
====================================
## 1. CONTACTS
- Primary admin: [Name] [Phone] [Email]
- Secondary admin: [Name] [Phone] [Email]
- Hosting support: MassiveGRID 24/7 support
- Domain registrar: [Name] [Login URL]
## 2. INFRASTRUCTURE INVENTORY
- VPS Provider: MassiveGRID
- Server IP: [IP Address]
- Datacenter: [Location]
- VPS Specs: [vCPU/RAM/Storage]
- OS: Ubuntu 24.04 LTS
- Domain: [yourdomain.com]
- DNS Provider: [Provider]
- SSL: Let's Encrypt (auto-renew)
## 3. APPLICATION STACK
- Web server: Nginx 1.x
- Runtime: Node.js 20.x / Python 3.12
- Database: PostgreSQL 16
- Cache: Redis 7
- Process manager: PM2 / systemd
- Containerized: Yes/No (Docker Compose)
## 4. BACKUP LOCATIONS
- Database backups: [Location, retention period]
- File backups: [Location, retention period]
- Configuration backups: [Git repo URL]
- Backup encryption key: [Stored in password manager]
## 5. RECOVERY OBJECTIVES
- RTO (Recovery Time Objective): [Target]
- RPO (Recovery Point Objective): [Target]
## 6. RECOVERY PROCEDURES
- [See sections below]
## 7. TEST SCHEDULE
- Full recovery test: Quarterly
- Backup verification: Monthly
- Last test results: [Date] [Pass/Fail]
RTO and RPO Explained
Two numbers define your disaster recovery requirements:
RTO (Recovery Time Objective): How long can your application be down before it causes unacceptable damage?
RPO (Recovery Point Objective): How much data can you afford to lose? This determines your backup frequency.
| Application Type | Typical RTO | Typical RPO | Backup Strategy |
|---|---|---|---|
| Personal blog | 24 hours | 7 days | Weekly backups |
| Company website | 4 hours | 24 hours | Daily backups |
| SaaS application | 1 hour | 1 hour | Hourly database backups + WAL archiving |
| E-commerce store | 30 minutes | 15 minutes | Continuous replication + frequent snapshots |
| Financial application | 15 minutes | 0 (zero data loss) | Synchronous replication + WAL shipping |
Be honest about your RTO/RPO. Saying "zero downtime, zero data loss" when you have daily backups and no replication is a fantasy, not a plan.
Backup Verification: Can You Actually Restore?
A backup you have never restored from is not a backup — it is a hope. Follow our automatic backups guide to set up backups, then verify them regularly.
#!/bin/bash
# verify-backup.sh — Monthly backup verification script
set -euo pipefail
BACKUP_DIR="/backups"
TEST_DIR="/tmp/backup-test"
LOG_FILE="/var/log/backup-verification.log"
echo "=== Backup Verification $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" | tee -a "$LOG_FILE"
# Find the most recent backup
# "|| true" prevents set -e/pipefail from aborting before the empty check below
LATEST_BACKUP=$(ls -t "$BACKUP_DIR"/db-backup-*.sql.gz 2>/dev/null | head -1 || true)
if [ -z "$LATEST_BACKUP" ]; then
echo "FAIL: No backup files found" | tee -a "$LOG_FILE"
exit 1
fi
echo "Testing backup: $LATEST_BACKUP" | tee -a "$LOG_FILE"
# Check backup file integrity
if ! gzip -t "$LATEST_BACKUP" 2>/dev/null; then
echo "FAIL: Backup file is corrupted (gzip integrity check failed)" | tee -a "$LOG_FILE"
exit 1
fi
echo "PASS: File integrity check" | tee -a "$LOG_FILE"
# Check backup age
BACKUP_AGE_HOURS=$(( ($(date +%s) - $(stat -c %Y "$LATEST_BACKUP")) / 3600 ))
MAX_AGE_HOURS=25 # Should be less than 25 hours for daily backups
if [ "$BACKUP_AGE_HOURS" -gt "$MAX_AGE_HOURS" ]; then
echo "FAIL: Backup is ${BACKUP_AGE_HOURS} hours old (max: ${MAX_AGE_HOURS})" | tee -a "$LOG_FILE"
exit 1
fi
echo "PASS: Backup age check (${BACKUP_AGE_HOURS}h old)" | tee -a "$LOG_FILE"
# Test restore to a temporary database
TEST_DB="backup_test_$(date +%s)"
sudo -u postgres createdb "$TEST_DB"
if gunzip -c "$LATEST_BACKUP" | sudo -u postgres psql "$TEST_DB" > /dev/null 2>&1; then
echo "PASS: Database restore successful" | tee -a "$LOG_FILE"
# Verify data integrity
TABLE_COUNT=$(sudo -u postgres psql -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public'" "$TEST_DB" | tr -d ' ')
echo "PASS: Restored database has $TABLE_COUNT tables" | tee -a "$LOG_FILE"
ROW_COUNT=$(sudo -u postgres psql -t -c "SELECT sum(n_tup_ins) FROM pg_stat_user_tables" "$TEST_DB" | tr -d ' ')
echo "INFO: Approximate row count: $ROW_COUNT" | tee -a "$LOG_FILE"
else
echo "FAIL: Database restore failed" | tee -a "$LOG_FILE"
sudo -u postgres dropdb "$TEST_DB"
exit 1
fi
# Cleanup
sudo -u postgres dropdb "$TEST_DB"
echo "=== Verification complete ===" | tee -a "$LOG_FILE"
# Schedule monthly verification
sudo crontab -e
# Add:
0 3 1 * * /home/admin/scripts/verify-backup.sh 2>&1 | mail -s "Backup Verification Report" admin@yourdomain.com
Recovery Scenario #1: "I Accidentally Deleted My Database"
This is the most common disaster. Someone runs DROP TABLE on the wrong database, a migration script deletes data instead of transforming it, or a bulk delete operation has a missing WHERE clause.
Immediate Response (First 5 Minutes)
# 1. STOP THE APPLICATION immediately
# Prevent new writes from making recovery harder
sudo systemctl stop myapp
# or
pm2 stop all
# or
docker compose stop app
# 2. DO NOT restart PostgreSQL
# The WAL (Write-Ahead Log) may still contain the deleted data
# 3. Assess the damage
sudo -u postgres psql -d myapp -c "\dt" # List remaining tables
sudo -u postgres psql -d myapp -c "SELECT count(*) FROM users;" # Check specific tables
Recovery from Backup
# 4. Identify the most recent backup
ls -la /backups/db-backup-*.sql.gz
# 5. Create a recovery database (don't overwrite the damaged one yet)
sudo -u postgres createdb myapp_recovery
# 6. Restore the backup
gunzip -c /backups/db-backup-2026-02-28-0300.sql.gz | sudo -u postgres psql myapp_recovery
# 7. Verify the restored data
sudo -u postgres psql myapp_recovery -c "SELECT count(*) FROM users;"
sudo -u postgres psql myapp_recovery -c "SELECT max(created_at) FROM users;"
# ^ This tells you the RPO — how much data you'll lose
# 8. If the restoration looks good, swap databases
sudo -u postgres psql -c "ALTER DATABASE myapp RENAME TO myapp_damaged;"
sudo -u postgres psql -c "ALTER DATABASE myapp_recovery RENAME TO myapp;"
# 9. Restart the application
sudo systemctl start myapp
# or
pm2 start all
# 10. Verify the application works
curl -s https://yourdomain.com/api/health/ready | jq .
Point-in-Time Recovery (If Using WAL Archiving)
If you have PostgreSQL WAL archiving configured, you can recover to a specific point in time — just before the accidental deletion:
# postgresql.conf settings for PITR (PostgreSQL 12+; the old recovery.conf no longer exists)
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '2026-02-28 14:55:00 UTC' # Just before the DELETE
recovery_target_action = 'promote'
# Step-by-step PITR
# 1. Stop PostgreSQL
sudo systemctl stop postgresql
# 2. Back up the current (damaged) data directory
sudo cp -r /var/lib/postgresql/16/main /var/lib/postgresql/16/main.damaged
# 3. Restore base backup
sudo rm -rf /var/lib/postgresql/16/main
sudo tar xzf /backups/base-backup-latest.tar.gz -C /var/lib/postgresql/16/
# 4. Create recovery signal file
sudo touch /var/lib/postgresql/16/main/recovery.signal
# 5. Configure recovery target in postgresql.conf
echo "restore_command = 'cp /backups/wal_archive/%f %p'" | sudo tee -a /var/lib/postgresql/16/main/postgresql.conf
echo "recovery_target_time = '2026-02-28 14:55:00 UTC'" | sudo tee -a /var/lib/postgresql/16/main/postgresql.conf
# 6. Start PostgreSQL (it will replay WAL up to the target time)
sudo chown -R postgres:postgres /var/lib/postgresql/16/main
sudo systemctl start postgresql
# 7. Check logs for recovery progress
sudo tail -f /var/log/postgresql/postgresql-16-main.log
Recovery Scenario #2: "My Server Was Compromised"
You discover unauthorized access — unusual processes, modified files, unfamiliar SSH keys, or alerts from your monitoring system. This is a security incident requiring a structured response.
Phase 1: Containment (First 15 Minutes)
# 1. DO NOT shut down the server yet — preserve evidence
# 2. Record what's running RIGHT NOW
ps auxf > /tmp/forensics-processes.txt
ss -tlnp > /tmp/forensics-listening-ports.txt
last -50 > /tmp/forensics-login-history.txt
cat /etc/passwd > /tmp/forensics-passwd.txt
crontab -l > /tmp/forensics-crontab.txt 2>/dev/null
sudo cat /var/log/auth.log > /tmp/forensics-auth.txt
# 3. Check for unauthorized SSH keys
find / -name "authorized_keys" 2>/dev/null -exec echo "=== {} ===" \; -exec cat {} \; > /tmp/forensics-ssh-keys.txt
# 4. Check for recently modified files
find / -mtime -1 -type f -not -path "/proc/*" -not -path "/sys/*" 2>/dev/null > /tmp/forensics-recent-files.txt
# 5. Copy forensics files OFF the server
scp /tmp/forensics-*.txt admin@safe-machine:/incident-response/
Phase 2: Isolation
# 6. Block all incoming connections except your IP
# (From the hosting provider's console, not from the compromised server)
# Or use iptables if you must do it from the server:
sudo iptables -I INPUT -s YOUR_IP -j ACCEPT
sudo iptables -I INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
sudo iptables -A INPUT -j DROP
# 7. Change all passwords and revoke all API keys
# Do this from a DIFFERENT machine:
# - Hosting provider password
# - Database passwords
# - API keys for external services
# - SSH keys (generate new ones)
# 8. Revoke compromised SSH keys
# On every server that trusted the compromised key:
# Remove the key from ~/.ssh/authorized_keys
Phase 3: Rebuild from Clean State
Never trust a compromised server. Rebuild from scratch.
# 9. Deploy a NEW VPS
# From the MassiveGRID control panel, create a fresh Ubuntu 24.04 VPS
# 10. Set up the new server using your configuration management
# If you followed our Ansible guide:
ansible-playbook -i inventory/production site.yml
# 11. Restore data from the MOST RECENT backup that predates the compromise
# Identify when the breach occurred from forensics logs
# Restore the backup from BEFORE that timestamp
# 12. Restore the application
cd /home/deploy/app
git clone https://github.com/yourorg/yourapp.git .
npm ci --production
pm2 start ecosystem.config.js
# 13. Update DNS to point to the new server
# Lower TTL first, then update the A record
# 14. Verify everything works on the new server before decommissioning the old one
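The DNS cutover in step 13 can be verified with a short polling loop rather than refreshing a browser. A sketch, assuming getent is available (it ships with glibc on every Ubuntu install); the domain and IP are placeholders:

```shell
#!/bin/bash
# wait-for-dns.sh — poll until the A record resolves to the new server.
# DOMAIN and NEW_IP are placeholders for your own cutover.
DOMAIN="${1:-yourdomain.com}"
NEW_IP="${2:-203.0.113.10}"
# Resolve the first IPv4 address the local resolver returns
resolve() { getent ahostsv4 "$1" | awk '{print $1; exit}'; }
for i in $(seq 1 30); do
current=$(resolve "$DOMAIN")
if [ "$current" = "$NEW_IP" ]; then
echo "DNS now resolves to the new server ($current)"
exit 0
fi
echo "Attempt $i: still seeing ${current:-no answer}; retrying in 60s"
sleep 60
done
echo "DNS has not propagated after 30 minutes — check the record and TTL" >&2
exit 1
```

Note this checks the resolver the script's host uses; other networks may lag behind until the old TTL expires, which is why lowering the TTL beforehand matters.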
Phase 4: Post-Incident
# 15. Analyze how the breach occurred
# Review forensics files from Phase 1
# Common entry points:
# - Weak SSH passwords (use key-only auth)
# - Unpatched application vulnerabilities
# - Exposed database ports
# - Stolen credentials from another breach
# - Insecure application code (SQL injection, RCE)
# 16. Harden the new server
# Follow the security hardening guide for EVERY item
For detailed hardening procedures, see our Ubuntu VPS security hardening guide.
Recovery Scenario #3: "My Application Is Corrupted After a Bad Deploy"
A deployment goes wrong: a database migration is destructive and irreversible, a configuration change breaks the application, or a code bug corrupts user data.
Immediate Rollback
# Option A: Git-based rollback (code issues)
cd /home/deploy/app
# See what changed
git log --oneline -5
# Roll back to the previous commit
git checkout HEAD~1
# Reinstall dependencies if they changed
npm ci --production
# Restart
pm2 reload all
# Verify
curl -s https://yourdomain.com/api/health/ready
# Option B: Docker-based rollback
# If using tagged Docker images:
# See running image
docker ps --format "{{.Image}}"
# myapp:v2.3.1
# Roll back to previous version
cd /home/deploy/app
# Edit docker-compose.yml to use previous image tag
sed -i 's/myapp:v2.3.1/myapp:v2.3.0/' docker-compose.yml
docker compose up -d
# Verify
docker compose logs --tail 50 app
Database Migration Rollback
# If the migration has a down/rollback function:
npm run db:migrate:undo
# or
python manage.py migrate previous_migration_name
# If the migration is irreversible (dropped column, deleted data):
# You MUST restore from backup
# 1. Stop the app
pm2 stop all
# 2. Restore database
sudo -u postgres dropdb myapp
sudo -u postgres createdb myapp
gunzip -c /backups/db-backup-pre-deploy.sql.gz | sudo -u postgres psql myapp
# 3. Roll back the code
cd /home/deploy/app
git checkout HEAD~1
npm ci --production
# 4. Restart
pm2 start all
Pre-Deploy Backup Script
Always create a backup immediately before deploying. Automate it:
#!/bin/bash
# pre-deploy-backup.sh — Run before every deployment
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/pre-deploy"
mkdir -p "$BACKUP_DIR"
echo "Creating pre-deploy backup: $TIMESTAMP"
# Database backup
sudo -u postgres pg_dump -F c myapp > "$BACKUP_DIR/db-${TIMESTAMP}.dump"
# Application files backup
tar czf "$BACKUP_DIR/app-${TIMESTAMP}.tar.gz" -C /home/deploy app/
# Keep only last 10 pre-deploy backups
ls -t "$BACKUP_DIR"/db-*.dump | tail -n +11 | xargs rm -f 2>/dev/null
ls -t "$BACKUP_DIR"/app-*.tar.gz | tail -n +11 | xargs rm -f 2>/dev/null
echo "Pre-deploy backup complete: $BACKUP_DIR"
echo " Database: db-${TIMESTAMP}.dump ($(du -sh "$BACKUP_DIR/db-${TIMESTAMP}.dump" | cut -f1))"
echo " Application: app-${TIMESTAMP}.tar.gz ($(du -sh "$BACKUP_DIR/app-${TIMESTAMP}.tar.gz" | cut -f1))"
Integrate it into your deployment script:
#!/bin/bash
# deploy.sh
# Step 1: Pre-deploy backup
./pre-deploy-backup.sh || { echo "Backup failed, aborting deploy"; exit 1; }
# Step 2: Deploy
git pull origin main
npm ci --production
npm run db:migrate
pm2 reload all
# Step 3: Verify
sleep 5
STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://yourdomain.com/api/health/ready)
if [ "$STATUS" != "200" ]; then
echo "Deploy verification FAILED. Run rollback manually."
echo "Backup location: /backups/pre-deploy/"
exit 1
fi
echo "Deployment successful"
Testing Your Recovery Plan
An untested recovery plan is just a document. Schedule quarterly recovery drills on a test VPS.
Test recovery on a temporary Cloud VDS — dedicated resources give accurate recovery time estimates without affecting production.
# Quarterly DR test procedure
# 1. Spin up a test VPS (MassiveGRID, same specs as production)
# 2. Time yourself: can you rebuild from scratch?
START_TIME=$(date +%s)
# 3. Follow your recovery procedures EXACTLY as documented
# Don't improvise — the point is to test the documentation
# 4. Restore from the most recent backup
# Time the database restore specifically
# 5. Verify the application works
# Run automated tests or manual verification
# 6. Record the results
END_TIME=$(date +%s)
RECOVERY_TIME=$(( (END_TIME - START_TIME) / 60 ))
echo "Recovery completed in ${RECOVERY_TIME} minutes"
echo "RTO target: 60 minutes"
echo "Result: $([ $RECOVERY_TIME -le 60 ] && echo 'PASS' || echo 'FAIL')"
# 7. Document what went wrong and update the plan
# 8. Destroy the test VPS
Common discoveries during DR tests:
- The backup is there but the restore command is wrong
- A dependency was installed manually and is not in the configuration scripts
- Database connection strings are hardcoded and point to the old server
- SSL certificates cannot be re-issued quickly because DNS propagation takes time
- The recovery procedure references a tool version that no longer exists
Each of these discoveries should result in an update to your recovery plan.
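The hardcoded-connection-string discovery in particular is cheap to catch before a drill. A grep sweep like the following finds literal IPs lurking in config files; the application path and file globs are assumptions to adapt to your stack:

```shell
# Sweep app config for hardcoded IP addresses before a recovery surprises you.
# /home/deploy/app and the extensions are assumptions — adjust to your layout.
grep -rnE '([0-9]{1,3}\.){3}[0-9]{1,3}' /home/deploy/app \
--include='*.env' --include='*.yml' --include='*.yaml' --include='*.conf' \
| grep -v '127.0.0.1' || echo "No hardcoded IPs found"
```

Anything this prints is a value that will silently point at the dead server after a rebuild — move it into environment variables or your configuration management.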
Documenting Your Infrastructure
If you were hit by a bus tomorrow, could someone else rebuild your server? Infrastructure documentation should answer: what is installed, how it is configured, and where the data is stored.
#!/bin/bash
# infrastructure-audit.sh — generate an infrastructure inventory automatically
# Run monthly, store output in your DR plan
echo "=== INFRASTRUCTURE AUDIT $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
echo ""
echo "## System"
echo "Hostname: $(hostname)"
echo "OS: $(lsb_release -ds)"
echo "Kernel: $(uname -r)"
echo "CPU: $(nproc) cores"
echo "RAM: $(free -h | awk '/Mem:/ {print $2}')"
echo "Disk: $(df -h / | awk 'NR==2 {print $2 " total, " $3 " used"}')"
echo ""
echo "## Installed Services"
systemctl list-units --type=service --state=running --no-pager | grep -v systemd
echo ""
echo "## Listening Ports"
sudo ss -tlnp | grep LISTEN
echo ""
echo "## Docker Containers"
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}" 2>/dev/null || echo "Docker not installed"
echo ""
echo "## Nginx Sites"
ls /etc/nginx/sites-enabled/ 2>/dev/null || echo "No Nginx sites"
echo ""
echo "## Cron Jobs"
crontab -l 2>/dev/null || echo "No cron jobs for $(whoami)"
sudo crontab -l 2>/dev/null || echo "No cron jobs for root"
echo ""
echo "## SSL Certificates"
sudo certbot certificates 2>/dev/null || echo "Certbot not installed"
echo ""
echo "## Firewall Rules"
sudo ufw status verbose 2>/dev/null || echo "UFW not active"
echo ""
echo "## Backup Configuration"
ls -la /backups/ 2>/dev/null || echo "No /backups directory"
echo "Latest backup:"
ls -lt /backups/*.gz 2>/dev/null | head -3 || echo "No backup files found"
echo ""
echo "## Package Versions"
nginx -v 2>&1
node --version 2>/dev/null || echo "Node.js not installed"
python3 --version 2>/dev/null || echo "Python not installed"
psql --version 2>/dev/null || echo "PostgreSQL not installed"
docker --version 2>/dev/null || echo "Docker not installed"
Ansible as a Disaster Recovery Tool
The best disaster recovery tool is infrastructure as code. If your entire server configuration is defined in Ansible playbooks, rebuilding from scratch is a single command.
# Your Ansible repository IS your disaster recovery plan
# Directory structure
ansible/
├── inventory/
│ ├── production
│ └── staging
├── playbooks/
│ ├── site.yml # Full server setup
│ ├── app-deploy.yml # Application deployment
│ └── db-restore.yml # Database restoration
├── roles/
│ ├── base/ # SSH hardening, UFW, fail2ban
│ ├── nginx/ # Nginx + SSL
│ ├── postgresql/ # PostgreSQL setup
│ ├── app/ # Application deployment
│ └── monitoring/ # Uptime Kuma, logging
└── group_vars/
└── all.yml # Encrypted variables (ansible-vault)
# Full server rebuild: one command
ansible-playbook -i inventory/production playbooks/site.yml
# Then restore the database from backup
ansible-playbook -i inventory/production playbooks/db-restore.yml \
-e "backup_file=/backups/db-backup-latest.sql.gz"
# Example db-restore.yml playbook
---
- hosts: database
  become: yes
  vars:
    db_name: myapp
    db_user: appuser
  # backup_file is supplied at runtime via -e "backup_file=..."
  tasks:
    - name: Copy backup file to server
      copy:
        src: "{{ backup_file }}"
        dest: /tmp/restore.sql.gz
    - name: Stop application
      systemd:
        name: myapp
        state: stopped
      delegate_to: "{{ groups['webservers'][0] }}"
    - name: Drop and recreate database
      become_user: postgres
      shell: |
        dropdb --if-exists {{ db_name }}
        createdb -O {{ db_user }} {{ db_name }}
    - name: Restore from backup
      become_user: postgres
      shell: gunzip -c /tmp/restore.sql.gz | psql {{ db_name }}
    - name: Start application
      systemd:
        name: myapp
        state: started
      delegate_to: "{{ groups['webservers'][0] }}"
    - name: Verify application health
      uri:
        url: https://yourdomain.com/api/health/ready
        status_code: 200
      register: health
      until: health.status == 200
      retries: 10
      delay: 5
      delegate_to: "{{ groups['webservers'][0] }}"
With Ansible, your disaster recovery procedure becomes: provision a new VPS, point Ansible at it, and run the playbook. Everything — from SSH hardening to SSL certificates to application deployment — is automated and reproducible.
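For completeness, "point Ansible at it" means nothing more than editing the inventory file. A minimal production inventory for a freshly provisioned replacement VPS might look like this (the hostname, IP, and user are placeholders):

```ini
# inventory/production — pointing at the replacement VPS
[webservers]
new-vps ansible_host=203.0.113.10 ansible_user=deploy

[database]
new-vps ansible_host=203.0.113.10 ansible_user=deploy
```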
Offsite Backup Strategy
Your backups must exist outside the server they protect. If the server is compromised or destroyed, local backups are useless.
#!/bin/bash
# offsite-backup.sh — send backups to a remote location
set -euo pipefail
BACKUP_DIR="/backups"
REMOTE_DEST="backup-server:/offsite-backups/$(hostname)/"
RETENTION_DAYS=30
# Sync recent backups to remote server
rsync -avz --progress \
"$BACKUP_DIR/" \
"$REMOTE_DEST"
# Clean up old backups on remote (keep 30 days)
ssh backup-server "find /offsite-backups/$(hostname)/ -type f -mtime +${RETENTION_DAYS} -delete"
# Alternatively, upload to object storage (S3-compatible)
# aws s3 sync "$BACKUP_DIR/" "s3://your-backup-bucket/$(hostname)/" \
# --storage-class STANDARD_IA \
# --exclude "*.tmp"
echo "Offsite backup sync complete"
# Schedule daily offsite sync
sudo crontab -e
# Add:
30 4 * * * /home/admin/scripts/offsite-backup.sh >> /var/log/offsite-backup.log 2>&1
Follow the 3-2-1 backup rule:
- 3 copies of your data
- 2 different storage media or locations
- 1 copy offsite (different datacenter, different provider)
Disaster Recovery Is Included with Managed Hosting
If building, testing, and maintaining a disaster recovery plan is more operational overhead than you want to handle, MassiveGRID Managed Dedicated Cloud Servers include:
- Automated daily backups with verified restoration
- 24/7 server monitoring and incident response
- Security hardening and patch management
- DDoS protection and firewall management
- Full infrastructure documentation
You focus on your application. MassiveGRID handles the infrastructure, the backups, the security, and the disaster recovery.
Summary: The Disaster Recovery Checklist
Use this checklist to verify your disaster readiness:
DISASTER RECOVERY READINESS CHECKLIST
======================================
[ ] Automated daily database backups running
[ ] Backup files stored OFFSITE (not just on the VPS)
[ ] Backup restoration tested within the last 90 days
[ ] Recovery time measured and within RTO target
[ ] Data loss window acceptable for RPO target
[ ] Pre-deploy backups automated
[ ] Application rollback procedure documented and tested
[ ] Infrastructure documented (installed services, versions, config)
[ ] Configuration managed in code (Ansible, scripts, or Git)
[ ] Security hardening applied (SSH, firewall, updates)
[ ] Monitoring running on a SEPARATE server
[ ] Contact information documented and accessible
[ ] DR plan stored OUTSIDE the VPS it protects
[ ] DR plan tested quarterly
[ ] All team members know where to find the DR plan
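A few of these checklist items can be spot-checked automatically. A minimal sketch — the specific paths and script names are assumptions borrowed from earlier examples in this guide, not requirements:

```shell
#!/bin/bash
# dr-spot-check.sh — automate a few checklist items; paths are assumptions.
ok=0; fail=0
# check LABEL CONDITION — run CONDITION, tally pass/fail
check() {
if eval "$2" >/dev/null 2>&1; then echo "PASS: $1"; ok=$((ok+1)); else echo "FAIL: $1"; fail=$((fail+1)); fi
}
check "Backup newer than 24h exists" '[ -n "$(find /backups -name "*.gz" -mtime -1 2>/dev/null | head -1)" ]'
check "Offsite sync in root crontab" 'sudo crontab -l 2>/dev/null | grep -q offsite-backup'
check "Pre-deploy backup script present" '[ -x /home/admin/scripts/pre-deploy-backup.sh ]'
echo "Result: $ok passed, $fail failed"
[ "$fail" -eq 0 ]
```

The exit status makes it cron-friendly: schedule it and mail yourself the output, and a drifting checklist becomes an alert instead of a surprise.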
Disaster recovery is the difference between a bad day and a catastrophe. The infrastructure handled by your hosting provider (hardware, disk, DDoS) is the foundation. Everything you build on top — backups, documentation, tested recovery procedures, infrastructure as code — determines whether a disaster means 15 minutes of recovery or 15 days of rebuilding from memory.