Recently, I have started backing up my servers to Amazon S3 (Simple Storage Service) using Duplicity. So far, I am pleased with the process and the costs involved. I still do regular disk dumps to my own backup server but may slowly phase down the frequency of these dumps.
What is Amazon S3?
Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers.
What is Duplicity?
Duplicity backs up directories by producing encrypted tar-format volumes and uploading them to a remote or local file server. Because duplicity uses librsync, the incremental archives are space efficient and only record the parts of files that have changed since the last backup. Because duplicity uses GnuPG to encrypt and/or sign these archives, they will be safe from spying and/or modification by the server.
Why move from the old method?
The main reason is that my backup server, located in my home, regularly loses connectivity during the disk dump. My ISP is Time Warner and while I am generally quite pleased with their service, it seems to regularly disconnect me during the early morning hours. I intend to continue using disk dumps to my backup server but now augment this with using Amazon S3.
There is an important difference between the two methods.
Dump backs up an entire file system, not individual files. Dump does not care what file system is on the hard drive, or even whether there are files in the file system. It examines the files on a file system, determines which ones need to be backed up, and copies those files to a specified disk, tape, file, or other storage medium, handling one file system at a time quickly and efficiently. Unfortunately, it cannot back up individual directories, so it consumes a great deal more storage space than tar.
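For illustration, dumping one file system to a file on a backup server looks something like the following (the flags and paths here are only an example, not my exact invocation):

```
## Level 0 (full) dump of /usr to a file, updating /etc/dumpdates
> dump -0au -f /backup/servername/usr.dump /usr
```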
Duplicity uses the rsync library to determine changes to files and directories. It then uses tar to create a compressed archive of all changes it has found.
With dump, you are able to restore your filesystem back to a snapshot of any previous point in time for which a dump exists. With Duplicity, the default restore brings back your files as of the most recent backup, although earlier states in the backup chain can also be recovered.
There are some advantages to including this method among a repertoire of backup methods. These include:
- Distributed storage
- Encrypted storage
- Space efficient
- Low cost
- Easy and unlimited expansion
- Multiple backup methodologies
How to implement?
The main assumption for my implementation instructions is that the servers are running FreeBSD version 6.x or greater. Since all the software is open source, it should be a simple matter to get Duplicity running on your brand of server.
Implementation is really quite easy and involves mainly some initial installation, setup, and learning a few commands.
> cd /usr/ports/sysutils/duplicity
> make install clean

## The py-boto port is needed for duplicity to work with S3.
> cd /usr/ports/devel/py-boto
> make install clean
> rehash
Register for Amazon S3
In order to use this service, you must register for an account and provide credit card information. See http://aws.amazon.com/s3/ to view pricing, terms of service and to register for an account.
Once you have an account, you need to login to get your Access Identifiers. You need both the regular Access Key and your Secret Access Key. Make a note of these.
Prepare to backup
I strongly suggest you read the man pages for duplicity and boto. Beyond that, I have found it convenient to set, and later unset, some environment variables. How you set environment variables depends on your shell. Using the csh shell:
## Keys from Amazon
> setenv AWS_ACCESS_KEY_ID regular_Access_Key
> setenv AWS_SECRET_ACCESS_KEY your_Secret_Access_Key

## Encryption password of your choice
## This must be the same for each incremental backup
## If you do not set the variable here, the shell will request it
> setenv PASSPHRASE GnuPG_Encryption_Password
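If you use an sh-compatible shell (sh, bash) instead of csh, the equivalent would be the following sketch, with the same placeholder values:

```shell
# Keys from Amazon (placeholder values -- substitute your own)
export AWS_ACCESS_KEY_ID="regular_Access_Key"
export AWS_SECRET_ACCESS_KEY="your_Secret_Access_Key"
# Encryption password of your choice; it must be the same for each
# incremental backup, or duplicity cannot read the existing archives
export PASSPHRASE="GnuPG_Encryption_Password"
```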
Perform the backup
I have adopted some conventions to ease my use of duplicity and to ensure uniformity. Mainly, wherever Amazon or duplicity speaks of a bucket name, I use the name of the server and the directory being backed up. Here are my commands:
## Backup my entire FreeBSD system
> duplicity /var s3://s3.amazonaws.com/servername/var
> duplicity /home s3://s3.amazonaws.com/servername/home
> duplicity /usr s3://s3.amazonaws.com/servername/usr
> duplicity / --exclude=/home --exclude=/usr --exclude=/var --exclude=/sys --exclude=/dev --exclude=/proc --exclude=/tmp --exclude=/mnt s3://s3.amazonaws.com/servername/root

## List files in one of the buckets
> duplicity list-current-files s3://s3.amazonaws.com/servername/var

## Restore files from one of the buckets to the specified directory
> duplicity s3://s3.amazonaws.com/servername/home /backup/servername/duplicity/home
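Duplicity can also restore selectively. The `-t`/`--restore-time` and `--file-to-restore` options are documented in the duplicity man page; the file path below is only a hypothetical example, assuming the bucket names above:

```
## Restore a single file as it existed three days ago
> duplicity -t 3D --file-to-restore username/somefile.txt s3://s3.amazonaws.com/servername/home /tmp/somefile.txt
```

Note that the path given to --file-to-restore is relative to the directory that was backed up, not an absolute path.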
Unset environment variables
For security reasons, it is important to unset your environment variables. Here’s how:
## Unset environment variables
> unsetenv AWS_ACCESS_KEY_ID
> unsetenv AWS_SECRET_ACCESS_KEY
> unsetenv PASSPHRASE
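Again, for sh-compatible shells the equivalent is:

```shell
# sh/bash equivalent: clear the credentials from the environment
unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY PASSPHRASE
```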
Build a Shell Script
From this point, it is a simple matter to build a shell script so that these commands may be run from cron or periodic. I’ll leave this as an exercise for you.
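As one possible starting point, here is a minimal sh sketch under the assumptions above. The servername, the key values, and the list of file systems are all placeholders, and the `echo` prefix prints each duplicity command instead of running it, so you can inspect the result before committing to it:

```shell
#!/bin/sh
# Nightly duplicity backup -- sketch only.
# All credential values and 'servername' are placeholders.

AWS_ACCESS_KEY_ID="regular_Access_Key";         export AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY="your_Secret_Access_Key"; export AWS_SECRET_ACCESS_KEY
PASSPHRASE="GnuPG_Encryption_Password";         export PASSPHRASE

DUP="echo duplicity"   # remove 'echo' once the printed commands look right

# Back up each file system to its own bucket path
for fs in /var /home /usr; do
    $DUP "$fs" "s3://s3.amazonaws.com/servername$fs"
done

# Everything else, minus the file systems above and the volatile directories
$DUP / --exclude=/home --exclude=/usr --exclude=/var --exclude=/sys \
       --exclude=/dev --exclude=/proc --exclude=/tmp --exclude=/mnt \
       s3://s3.amazonaws.com/servername/root

# Do not leave the credentials in the environment
unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY PASSPHRASE
```

Once the printed commands look right, drop the `echo`, make the script executable, and point a crontab or periodic entry at it.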