Ryan Finnie
on 4 August 2015
It’s a topic many people don’t like to think about: backups. In addition to making sure your cloud environments are correctly deployed, highly available, secured and monitored, you need to make sure they are backed up for disaster recovery purposes. (And no, replication is not the same as backups.)
Canonical’s IS team is responsible for thousands of machines and instances, and over the years we have been a part of the shift from statically-deployed bare-metal environments to embracing dynamic environment deployments on private and public clouds. As this shift has occurred, we’ve needed to adjust our thinking about how to back up these environments. This has led to the development of Turku, a decentralized, cloud- and Juju-friendly backup system.
Old and Busted
Traditional backup systems tend to follow a similar deployment workflow:
- A backup agent is installed on the machine to be backed up.
- A centralized server is configured with information about the client machine, what to back up, and when.
- At scheduled times, the server connects to the client agent and performs backups.
This workflow has several disadvantages. Primarily, it relies on a centralized configuration mechanism. This may be fine if you only have a few static machines to back up, but the act of manually configuring backups on a backup server does not scale well.
In addition, most backup systems require ingress access to the machine to be backed up. While this may seem logical at first, it becomes a problem when the concept of the service unit is no longer tightly coupled to a machine’s hostname or IP. Not to mention the security aspect of allowing one machine direct access to all of your infrastructure.
Most of our environments are deployed via Juju, which abstracts the concept of networking, especially for services which are not at the front-end layer and do not have floating IPs. Your typical database / store unit is never going to have a floating IP, in most cases is not reachable from most of our networks, and in some cases isn’t even likely to be in the same location tomorrow. Having a backup server being able to reach this sort of unit is usually just not possible.
New Hotness
After struggling with the limitations of these sorts of backup systems, Canonical’s IS team put together a decentralized, cloud-friendly backup system called Turku. Turku takes a different approach to backup management:
- A backup agent (turku-agent) is installed on the machine to be backed up. This can be installed manually, or be deployed as a Juju subordinate service.
- The agent is configured with the location and API key of an API server (turku-api), and with sources of data to be backed up (where, when, what to exclude, how long to keep snapshots, etc). In a Juju subordinate charm setup, this is as easy as the master charm dropping configuration files in /etc/turku-agent/sources.d/ and running
turku-update-config
. - The agent registers itself with the API server and sends its configuration. It then regularly checks in with the API server (every 5 minutes by default). If the API server determines it’s time for a backup (using scheduling data provided by the agent), it tells the agent to check in with a particular storage unit (turku-storage).
- The agent checks in with the storage unit by SSHing to it, using a unit-specific public key relayed from the agent to the storage unit via the API server. This SSH session includes a reverse tunnel to a local Turku-specific rsync daemon on the agent machine.
- The storage unit connects to this rsync daemon over the reverse tunnel and rsyncs the scheduled data modules. It then handles snapshotting of the data. The preferred method is using attic, a deduplication program, but it can also use hardlink trees or even no snapshotting, depending on the nature of the source of data to be backed up.
- Storage units occasionally expire snapshots using retention policies, again, as configured by the agent.
This workflow gives most of the power to the client, and avoids needing to configure a centralized server every time a client unit is added or removed. In almost all situations, no configuration is needed on any server systems. And because of the reverse tunnel, no ingress access is required to each client machine; only egress to the API server and storage units are required.
Schedule and retention information is defined in the agent using natural language expressions. For example, a typical daily backup source may be configured with the schedule “daily, 0200-1400”. As we have thousands of machines being backed up, we found that it’s best to configure the source with a schedule as wide as possible, to allow the API server’s scheduler the most freedom to determine when a backup should start. In most cases, service units are not time-constrained, so most schedule definitions are simply “daily”.
You can be specific, such as “sunday, 0630” for a weekly run, and the API scheduler will try to be as accommodating as possible, but again, it’s recommended to be as open as possible when it comes to backup times.
Similarly, a typical retention definition is “last 5 days, earliest of 1 month, earliest of 2 months”. For example, if today is December 15 and a backup is made, 7 snapshots would exist: December 11 through 15, December 1 and November 1.
Restores
A backup system is useless if you can’t be confident you can restore the data. When Turku was handed over to the IS Operations team to begin deployment and migrations from our previous backup systems, the first thing they did was test restores in a variety of situations. They came up with some interesting scenarios and helped improve usability of the restore mechanism.
In most cases, when doing a restore, you usually don’t want to restore in place. At first this seems counter-intuitive, but when dealing with a disaster recovery situation, it’s usually a matter of getting data from a previous point in time and re-integrating it with the live data in some way, depending on the exact nature of the disaster.
You may remember from above that the Turku agent runs its own local rsync daemon which is served over the reverse SSH tunnel. Most of this daemon’s modules are read-only sources of data to be backed up, but it also includes a writable restore module. When you run “turku-agent-ping –restore” on the machine to restore data to, it connects to the storage unit and establishes the reverse tunnel as normal, but then just sits there until cancelled. You then log into the storage unit, pick a snapshot to restore, then rsync it to the writable module over the tunnel. (As the tunnel ports and credentials are randomized, “turku-agent-ping –restore” helpfully gives you a sample rsync invocation using the actual port/credentials.) This is one of the only times you’ll need to log into a Turku infrastructure machine, but it gives the administrator the most flexibility, especially in a time of crisis.
Scalability
Turku is designed for easy scale-out when deployed via Juju. turku-api is a standard Django application and can be easily horizontally scaled through juju add-unit turku-api
(though in practice we’ve had thousands of units checking in to a pair of turku-api units with almost no load). turku-storage is also horizontally scalable, which is more important as your backup infrastructure grows. To expand storage, you can simply add more block storage to an existing turku-storage unit (they’re managed in an LVM volume on each unit), or add more units with juju add-unit turku-storage
, plus block storage.
When more storage is added to Turku, either through raw block storage or new storage units, the API scheduler automatically takes care of proportionally allocating new agents/sources depending on the storage split. For example, if you have two 5TB turku-storage units, one of which is half full and the other is empty, a new registered source will be twice as likely to be assigned to the empty storage unit. When storage units reach an 80% high water mark, they stop accepting new sources altogether, but will continue to back up existing registered sources. Actively rebalancing storage units is not currently supported as the proportional registration system plus the high water mark is sufficient for most situations, but it is planned for the future.
Current Status
Most of our backup infrastructure has been migrated to Turku, which has been in operation for approximately 6 months. We’re releasing the code in the state we have been using it, but this is a very early public release. Documentation will be ported over from our internal wikis, and it’s possible there is code or functionality specific to our infrastructure (though unlikely, as Turku was developed with the goal of eventually being open sourced).
Please take a look at the Turku project page on Launchpad, download the software, take a look, file bugs… We’re excited to hear from you!
N.B.: Turku is a city on the southwest coast of Finland at the mouth of the Aura River, in the region of Finland Proper. To the Swedes, it is known by its original Latin name: Åbo, or Aboa. One of Canonical’s first server naming schemes over 10 years ago was Antarctic bases, and our first backup server was aboa, named after the Finnish research station. Backup systems since then have tended to be a play on the name Aboa.