Backing up a Server to Amazon S3

Motivation

When deploying a server on the internet, you always have to deal with security issues. You harden your server by setting up encrypted connections, configuring a tight firewall, and putting critical services in a chroot jail. However, what happens if an intruder hacks into your server and deletes your content? Or if you make a mistake and erase some data? The last line of defense is having a good backup strategy.

The question arises: what to back up, and where to back it up to? For me, something like the holy grail would be a fast, reliable, file-system-based backup solution like the snapshot feature in ZFS. It should then be possible to sync these snapshots in a bandwidth-efficient manner to a remote location.

Using Duplicity for Backup

There are a lot of different solutions around; they differ in security, price, and reliability. Duplicity is one of them. It is relatively easy to set up if you have a simple backup problem, such as backing up a web server and its corresponding database. The neat thing about Duplicity is that it makes encrypted, incremental backups using standard file formats. For the incremental part it relies on librsync, which implements the rdiff delta algorithm, and it uses GPG to encrypt the backups with a public/private key pair.

Another benefit is that Duplicity supports Amazon S3 out of the box, so you can store your backups in the cloud in a safe manner. Because only incremental backups are performed after the first run, the costs for traffic and storage are minimized. In my case, a daily backup of the server's configuration and this blog, I have never paid anything, because the charges stay below Amazon's minimum monthly billing amount.
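
Getting started thus needs little more than a GPG key pair and the S3 credentials. A rough sketch of the preparation; the credentials shown are of course placeholders:

    # Generate a GPG key pair for the backups and note its ID
    gpg --gen-key
    gpg --list-keys

    # Duplicity's S3 backend picks the credentials up from the environment
    export AWS_ACCESS_KEY_ID='your-access-key'
    export AWS_SECRET_ACCESS_KEY='your-secret-key'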

Using a file server in the cloud as the backup destination has its benefits, especially in the restore case: you can rely on Amazon's bandwidth to perform a fast restore, instead of being limited by your home cable or DSL connection.
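
To give an idea of that restore case: fetching the latest backup from the bucket onto a machine is a single Duplicity command. The bucket name and target directory here are placeholders, and s3+http:// is the URL scheme of the boto-based S3 backend:

    # Restore the most recent backup into a local directory
    duplicity restore s3+http://my-backup-bucket/server /tmp/restore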

What Happens During a Backup?

In the beginning, when there are no previous backups to build on, Duplicity performs a full backup. The backup files use the standard tar format; the archive is encrypted with GPG using your public key and then uploaded to the remote server.
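
The very first run against an empty bucket produces exactly such a full backup. A minimal invocation might look like this, with a placeholder GPG key ID and bucket name:

    # Force a full backup of /etc, encrypted against the given GPG key
    duplicity full --encrypt-key DEADBEEF /etc s3+http://my-backup-bucket/etc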

The next backup is an incremental one. Duplicity first checks whether its local cache of metadata (the signatures of previous backups) is up to date with the remote repository. If it is not, it downloads the missing files, because it needs them to compute the delta against the current state of the files. It then calculates the diff, encrypts it, and uploads it together with a hash to the remote server.
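
In practice you simply run the same command again without the full keyword; Duplicity notices the existing chain and performs an incremental backup on its own. The collection-status action then shows the full and incremental backups that make up the chain:

    # Subsequent runs automatically become incremental backups
    duplicity --encrypt-key DEADBEEF /etc s3+http://my-backup-bucket/etc

    # Inspect the resulting backup chain
    duplicity collection-status s3+http://my-backup-bucket/etc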

Backing up a Database

With a web application there are often not only files but also databases to back up, so that the web page can be fully restored from the backup. In my case I am using a MySQL server. My approach is to first perform a mysqldump and then back up the resulting dump file. I have written a small helper script in Perl that is kicked off by a cron job.
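
The underlying idea fits in a few lines of shell. This is only a sketch, not the published Perl script, and the database credentials, dump directory, and bucket name are made up:

    #!/bin/sh
    # Dump the database to a file, then hand the dump directory to Duplicity
    DUMPDIR=/var/backups/mysql
    mkdir -p "$DUMPDIR"

    # --single-transaction gives a consistent dump of InnoDB tables
    mysqldump --user=backup --password=secret --single-transaction blog \
        > "$DUMPDIR/blog.sql"

    duplicity --encrypt-key DEADBEEF "$DUMPDIR" s3+http://my-backup-bucket/mysql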

I have been using this setup for quite a while now and it is working very nicely. As a starting point, I have published the backup wrapper script on GitHub.
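
For the daily schedule, a crontab entry along the following lines is enough; the path to the wrapper script is a placeholder:

    # Run the backup wrapper every night at 03:30
    30 3 * * * /usr/local/bin/backup-to-s3.sh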