HOWTO: Use Amazon S3 + s3backer + Linux LVM2 for unlimited, flexible remote storage
Nathan Gardiner, Feb 2 2009
These days, it is easy and inexpensive to host small to medium-sized operations on the Internet, thanks to commodity hardware and powerful software. A virtual or dedicated server can be purchased for tens to hundreds of dollars a month and serve as many as tens of thousands of websites to hundreds of thousands of clients.
Your operation doesn't have to be nearly that big to benefit from Amazon S3 storage. In fact, I host my own business as well as personal servers for mail, DNS, database, web and applications on four virtual servers for less than $40 a month.
This $40 a month gets me a total of 384MB of RAM, enough to comfortably host all the services I need, and a total of 20GB of disk space.
The problem? 20GB of space is far from enough. Once you subtract 4GB in Operating System use, and then 10GB for database storage, 2GB for applications, 4.3GB for web storage and 2GB for CVS/Subversion, I've already exhausted all additional space and ensured that my resources aren't able to be used to their full potential.
Unfortunately, disk space is at a premium. Most virtual hosting companies can't provide it efficiently and require you to upgrade to a higher-spec plan to get the space you need, which inevitably leads back to the same problem. What if I told you, however, that my 128MB VPS with 8GB of disk space currently hosts over 30GB of content, at near-native performance, with 6GB of "system disk" free and unlimited potential to scale this storage out to terabytes or even exabytes?
Please Note: At the time this article was written, there was a need to keep the s3backer block size close to the system page size (getconf PAGESIZE) which is generally 4K. This is no longer a requirement of s3backer, and in fact is discouraged due to the cost of storing data at such a small block size. Use the calculator to determine the most effective block size for your needs.
Using s3backer to host filesystems remotely on Amazon S3
Many projects exist which attempt to present Amazon S3 keys and values as a filesystem. All suffer from the same unfortunate flaws: performance issues when accessing large numbers of files, listing directories or performing filesystem operations which don't easily "translate" from the traditional UNIX filesystem layer to Amazon S3 operations. Making changes to large files can also be slow and costly.
s3backer is a concept which changes much of this - it works by creating what is in effect a large sparse file in an Amazon S3 bucket, divided into blocks of a given size and stored as a single key/value pair per block.
Instead of consuming large amounts of space on creation, the s3backer file effectively consumes no space at all until it is used. The file begins as a variable sized collection of zeros, which are not stored on Amazon S3 until they're finally populated with data. As an additional space conservation mechanism, the blocks can be compressed with zlib to consume less Amazon S3 storage.
Generally, the advice for s3backer users is to create a filesystem (ext3, reiser, etc) on the file which s3backer exposes, and then mount it using the -o loop option, which connects it to a "loopback" block device node and acts as if it were a physical block device. While this works quite well, it makes the process of hosting several different filesystems difficult and inefficient, and also complicates projecting the eventual size of a filesystem.
Advantages of using LVM2 + s3backer vs multiple s3backer instances
- Lower resource usage. Each instance of s3backer is a userspace process which consumes anywhere from 5MB to 15MB of RAM, and adds another daemon to start, schedule and shut down.
- Flexibility. A single Physical Volume of 500GB (for example) could be created and added to the volume group. As each logical volume needs space, lvextend can be used to extend the logical volume's size on the fly. When running individual incarnations of s3backer, the filesystem needs to be unmounted and s3backer restarted with the new filesystem size.
- Space Efficiency. Logical Volumes which are no longer required can be removed and their extents returned to the volume group pool. With individual filesystems, the underlying S3 keys would need to be removed (with cost attached). By allowing Logical Volumes to be kept at a minimum size, there is less "data spreading" and retention of deleted data, which could be costly on a large over-dimensioned s3backer filesystem.
Setting up s3backer as a LVM Physical Volume
Step 1: Create s3backer file
Before you can complete this step, you'll need to be somewhat familiar with s3backer and how it works. Make sure you have completed the following steps first:
- Put your AWS access key ID and AWS secret access key in a file called /etc/s3.crd. They should be on a single line, separated by a colon. It's possible to specify them on the s3backer command line with the --accessId and --accessKey arguments, but this can be insecure in a multi-user environment.
- Create a bucket for this data, or decide on an existing bucket (and optional prefix) to use.
- Work out how big you'd like the file to be and, even more importantly, how big the block size should be. The correct values for these settings depend greatly on what you intend to do with the space. See my crude s3backer cost calculator to determine the cost of your chosen settings under different conditions.
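As a rough sketch of the first prerequisite, the credentials file can be written so that only its owner can read it. The helper name write_s3_credentials and the placeholder values are hypothetical; the one-line id:key layout is the one described in the list above.

```shell
#!/bin/sh
# write_s3_credentials FILE ACCESS_ID ACCESS_KEY
# Hypothetical helper: writes "id:key" on a single line and restricts the
# file to its owner, so other local users cannot read the secret key.
write_s3_credentials() {
    (
        umask 077                           # new files start as rw-------
        printf '%s:%s\n' "$2" "$3" > "$1"
    ) && chmod 600 "$1"                     # also tighten a pre-existing file
}

# Example call (run as root; the values are placeholders, not real keys):
#   write_s3_credentials /etc/s3.crd MYACCESSKEYID MYSECRETACCESSKEY
```

Passing the keys as arguments still exposes them briefly in the process list, so on a shared machine it may be safer to create the file in an editor and only apply the chmod 600.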
s3backer --accessFile=/etc/s3.crd --blockSize=4K --compress --filename=file --prefix=some/prefix/ --size=30G [bucketname] /mnt/s3backer
This example will create a 30 gigabyte s3backer file stored under [bucketname]:some/prefix/[blockno], where blockno is a hexadecimal representation of which particular 4K block is being stored at that location. The file will be accessible at /mnt/s3backer/file on the client side.
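To illustrate how a byte offset inside the file maps to a key, here is a minimal sketch. It assumes the block number is written as eight zero-padded lowercase hexadecimal digits appended directly to the prefix; that width and padding are assumptions made for illustration, so check a listing of a real bucket before relying on them. block_key is a hypothetical helper, not part of s3backer.

```shell
#!/bin/sh
# block_key PREFIX BLOCK_BYTES OFFSET_BYTES
# Prints the key that would hold the block containing OFFSET_BYTES.
# ASSUMPTION: key = prefix + block number as 8 zero-padded hex digits.
block_key() {
    printf '%s%08x\n' "$1" $(( $3 / $2 ))
}

# With 4K blocks, byte offset 1 MiB falls in block 256 (hex 100):
block_key some/prefix/ 4096 1048576
```

The point of the sketch is only that each block is an independent key, so a write to one 4K region touches exactly one object.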
Given that a maximum of 30G can be stored and the block size is 4K, up to 7,864,320 keys would be required to store the data. At a cost of $0.01 per 1,000 PUT requests, writing every block once comes to about $78. Ensure that you factor this calculation in when choosing the volume size and block size for the data you are storing. Remember that my s3 block-size calculator can help.
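The arithmetic above can be reproduced with a few lines of shell. This is only a sketch: it assumes the $0.01 per 1,000 PUT requests price quoted in this article and that every block is written exactly once, and it ignores storage and transfer charges. The function names are hypothetical.

```shell
#!/bin/sh
# block_count SIZE_KIB BLOCK_KIB -> number of blocks (keys), rounded up
block_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# put_cost_cents BLOCKS -> PUT cost in whole cents, rounded down
# ASSUMPTION: 1 cent per 1,000 PUT requests, each block written once
put_cost_cents() {
    echo $(( $1 / 1000 ))
}

size_kib=$(( 30 * 1024 * 1024 ))        # 30G expressed in KiB
blocks=$(block_count "$size_kib" 4)     # 4K block size
cents=$(put_cost_cents "$blocks")
printf '%s blocks, about $%d.%02d in PUT requests\n' \
    "$blocks" $(( cents / 100 )) $(( cents % 100 ))
```

For the 30G / 4K example this prints 7864320 blocks and about $78.64. Re-running block_count with a larger block size shows how quickly the key count, and therefore the PUT cost, drops.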
Step 2: Create loopback device
Please Note: This example uses loop0 as the loopback device. If you have mounted any other filesystems with the -o loop option or have other s3backer instances running, you will need to substitute this for the first available loopback device. losetup -f will return the first available device.
losetup /dev/loop0 /mnt/s3backer/file
Step 3: Configure LVM Volume Group and Physical Volume
Designate the loopback device as an LVM Physical Volume. This will write LVM metadata to the s3backer file and allow you to create a Volume Group to allocate this storage from.
pvcreate /dev/loop0
Note: If you receive an error at this point indicating that the pvcreate command does not exist, you do not have the LVM2 package installed on your host. To install it on Debian Linux, run apt-get install lvm2
Create a volume group which you will use to allocate storage to Logical Volumes:
vgcreate s3 /dev/loop0
Note: If you receive an error at this point similar to: "/proc/misc: No entry for device-mapper found", you do not have the dm_mod kernel module loaded. You can load the module with the modprobe dm_mod command
You should now have a 30G LVM Volume Group named s3. You can confirm this with the following command:
vgdisplay
--- Volume group ---
VG Name s3
Format lvm2
Metadata Areas 2
Metadata Sequence No 8
VG Access read/write
VG Status resizable
VG Size 30.00 GB
PE Size 4.00 MB
Alloc PE / Size 0 / 0.00 GB
Free PE / Size 7860 / 30.00 GB
Creating Logical Volumes
From here, you're free to create, resize and delete Logical Volumes. This is achieved by the lvcreate, lvresize and lvremove commands demonstrated below. There are man pages with further information on how to use these commands.
The following commands will create a 2GB Logical Volume called scratch, create a reiserfs filesystem with 4K block size, and mount it under /mnt/scratch.
lvcreate -L 2G -n scratch s3
mkreiserfs -b 4096 /dev/s3/scratch
mount /dev/s3/scratch /mnt/scratch
Allocating More Space to a Logical Volume
Most filesystems (including ext3 and reiserfs) will allow you to do an online resize of a filesystem without needing to unmount the volume, as long as the resize is positive (ie. adding more space, not reducing space). The following example shows 500MB being added to our "scratch" Logical Volume on the fly:
lvresize -L +500M /dev/s3/scratch
resize_reiserfs /dev/s3/scratch
Removing a Logical Volume
When we're done with a Logical Volume, we can remove it from the Volume Group and return the Physical Extents which it consumed to the Volume Group for re-assignment. The following example shows the /mnt/scratch volume being unmounted and removed.
umount /mnt/scratch
lvremove /dev/s3/scratch
Which filesystem to use?
- Any journalled filesystem such as XFS, JFS, ext3, ReiserFS or ZFS.
- I recommend ReiserFS as it is able to tail pack files, which improves space efficiency for small files, especially if you increase the block size. The downside is fragmentation, although this does not affect me much as the majority of operations on my filesystems are reads. YMMV.
- At the moment, it is highly recommended that you keep the block size on variable block size filesystems the same as your s3backer block size.
Tuning BlockCacheSize
The larger the blocksize that you choose, the more important it becomes to tune the blockCacheSize parameter for s3backer. By default, the blockCacheSize is set to 1000, which means that 1000 of the most recent or most requested blocks will be cached in memory. For a small blocksize such as 4K, this means that only 4K * 1000 = 4MB of memory will be consumed for block cache. For a blocksize of 256K, however, this means 256K * 1000 = 256MB, which may consume all of the host's memory for the block cache, and will almost certainly cause any host with less than 256MB of memory to run out of memory and crash.
The following is an oversimplified starting point which shows a number of different blocksizes, and a Small (S), Medium (M) and Large (L) blockCacheSize setting. The Small, Medium and Large headings denote the amount of activity on the machine which s3backer is running on, and therefore how much memory you are willing to dedicate to the block cache. Each cell shows the blockCacheSize value, with the approximate memory it consumes in parentheses.
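The same multiplication can be sketched in shell. This assumes, as the paragraph above does, that cache memory is roughly blocksize times blockCacheSize with 1MB taken as 1000K, and it ignores any per-block overhead inside s3backer. The function names are hypothetical.

```shell
#!/bin/sh
# cache_mb BLOCK_K CACHE_BLOCKS -> approximate cache memory in MB
# ASSUMPTION: memory ~= blocksize * blockCacheSize, 1MB = 1000K, no overhead
cache_mb() {
    echo $(( $1 * $2 / 1000 ))
}

# max_cache_blocks BUDGET_MB BLOCK_K -> largest blockCacheSize that fits
max_cache_blocks() {
    echo $(( $1 * 1000 / $2 ))
}

echo "4K blocks, default cache of 1000:   $(cache_mb 4 1000)MB"
echo "256K blocks, default cache of 1000: $(cache_mb 256 1000)MB"
echo "256K blocks within a 32MB budget:   --blockCacheSize=$(max_cache_blocks 32 256)"
```

Working backwards from a memory budget with max_cache_blocks is usually the safer direction on a small VPS, since it keeps the cache bounded regardless of the blocksize chosen.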
| 64MB RAM | S | M | L |
| 64K | 500 (32MB) | 250 (16MB) | 100 (6.4MB) |
| 128K | 250 (32MB) | 125 (16MB) | 63 (8MB) |
| 256K | 125 (32MB) | 63 (16MB) | 32 (8MB) |
| 512K | 63 (32MB) | 32 (16MB) | 16 (8MB) |
| 128MB RAM | S | M | L |
| 64K | 1000 (64MB) | 500 (32MB) | 250 (16MB) |
| 128K | 500 (64MB) | 256 (32MB) | 128 (16MB) |
| 256K | 125 (32MB) | 63 (16MB) | 32 (8MB) |
| 512K | 63 (32MB) | 32 (16MB) | 16 (8MB) |
| 256MB RAM | S | M | L |
| 64K | 2000 (128MB) | 1000 (64MB) | 500 (32MB) |
| 128K | 1000 (128MB) | 500 (64MB) | 256 (32MB) |
| 256K | 256 (64MB) | 128 (32MB) | 63 (16MB) |
| 512K | 128 (64MB) | 63 (32MB) | 32 (16MB) |
To set the intended blockCacheSize for s3backer, you must pass the --blockCacheSize=xx command line argument. If you are using the init script provided at the end of this document, place the --blockCacheSize=xx argument into the OPTIONS variable, which will be passed to s3backer.
How can I make this all work on boot?
There may be better ways to do this, but it works well for me on Debian Etch (4.0). The first step is to enable the dm_mod module to start on boot:
Step 1: Open /etc/modules in $editor.
Step 2: Add "dm_mod" (without the quotes) on a new line.
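If you would rather script these two steps, the following sketch appends the line only when it is not already present, so it is safe to run more than once. ensure_line is a hypothetical helper, and MODULES_FILE in the example comment stands for the file you opened in Step 1.

```shell
#!/bin/sh
# ensure_line FILE LINE
# Appends LINE to FILE unless an identical whole line is already there.
ensure_line() {
    # -x: match the whole line, -F: treat LINE as a fixed string
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example call (run as root):
#   ensure_line "$MODULES_FILE" dm_mod
```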
Now, we need to author an init script which will mount the s3backer file, create a loopback device, scan for physical volumes and activate the LVM volume group. The following script does so, however it requires some configuration at the top of the file.
Place the following contents into "/etc/init.d/s3lvm":
#!/bin/sh
### BEGIN INIT INFO
# Provides:          s3fs
# Required-Start:    $network $local_fs $remote_fs
# Required-Stop:     $network $local_fs $remote_fs
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Mount remote S3 filesystem
# Description:       This script will use s3backer to mount a loopback file
#                    from Amazon S3, create a loopback device (/dev/loop0),
#                    import it as an LVM physical volume and mount all
#                    available logical volumes
### END INIT INFO

# Author: Nathan Gardiner

# Names used in the log messages and usage text below
NAME="s3lvm"
DESC="S3 LVM storage"
SCRIPTNAME="/etc/init.d/$NAME"

# Define the LSB log_* functions used in the case statement
. /lib/lsb/init-functions

### Configuration ###
BACKERDIR="/mnt/s3backer"   # The location s3backer mounts to
BLOCKSIZE="4K"              # The blocksize we use
BUCKET="some.s3.bucket"     # Bucket Name
FILENAME="file"             # The s3backer filename
OPTIONS="--compress"        # Any additional options
PREFIX="some/prefix/"       # The prefix we use
SIZE="30G"                  # Total size

#
# Function that starts the daemon/service
#
do_start()
{
    # Mount S3 filesystem. $OPTIONS is deliberately unquoted so that it
    # can hold several space-separated arguments.
    s3backer --accessFile=/etc/s3.crd --blockSize="$BLOCKSIZE" \
        --filename="$FILENAME" --prefix="$PREFIX" --size="$SIZE" \
        $OPTIONS "$BUCKET" "$BACKERDIR"

    # Check that the file has mounted. If not, throw error
    if [ ! -f "$BACKERDIR/$FILENAME" ]; then
        echo "Error! s3backer has failed to mount $BACKERDIR/$FILENAME."
        exit 1
    fi

    # Get name of first available loopback device
    LOOP=$(losetup -f)

    # Create loopback device
    losetup "$LOOP" "$BACKERDIR/$FILENAME"

    # Check that block-special loopback device exists
    if [ ! -b "$LOOP" ]; then
        echo "Error! $LOOP does not exist."
        exit 1
    fi

    # Scan physical volumes
    pvscan

    # Enable all volume groups
    for vg in $(vgdisplay | grep 'VG Name' | awk '{ print $3 }'); do
        vgchange -a y "$vg"
    done

    # Get a list of all logical volumes, and mount them.
    # Each volume needs an /etc/fstab entry for "mount <device>" to work.
    for fs in $(lvdisplay | grep 'LV Name' | awk '{ print $3 }'); do
        mount "$fs"
    done

    return 0
}

#
# Function that stops the daemon/service
#
do_stop()
{
    # First, we unmount all mounted logical volumes.
    for fs in $(lvdisplay | grep 'LV Name' | awk '{ print $3 }'); do
        umount "$fs"
    done

    # Now, we disable each volume group
    for vg in $(vgdisplay | grep 'VG Name' | awk '{ print $3 }'); do
        vgchange -a n "$vg"
    done

    # Disconnect all loopback devices. Note that this also detaches any
    # loopback devices which are not backed by s3backer.
    for lo in $(losetup -a | cut -d ':' -f 1); do
        losetup -d "$lo"
    done

    # Unmount our s3backer directory
    umount "$BACKERDIR"

    return 0
}

case "$1" in
    start)
        [ "$VERBOSE" != no ] && log_daemon_msg "Starting $DESC" "$NAME"
        do_start
        case "$?" in
            0|1) [ "$VERBOSE" != no ] && log_end_msg 0 ;;
            2) [ "$VERBOSE" != no ] && log_end_msg 1 ;;
        esac
        ;;
    stop)
        [ "$VERBOSE" != no ] && log_daemon_msg "Stopping $DESC" "$NAME"
        do_stop
        case "$?" in
            0|1) [ "$VERBOSE" != no ] && log_end_msg 0 ;;
            2) [ "$VERBOSE" != no ] && log_end_msg 1 ;;
        esac
        ;;
    *)
        echo "Usage: $SCRIPTNAME {start|stop}" >&2
        exit 3
        ;;
esac

:
Now, you can start this script at boot with the following command (on Debian or Ubuntu systems):
update-rc.d s3lvm defaults
FAQ
Help.. I'm getting "No matching physical volumes found" and vgdisplay shows no volume groups!
This is almost certainly because one of the configuration details which you have provided for the init script does not match the parameters you first used to create your s3backer storage file. Ensure that the bucket name, block size and prefix match.
Exactly how important is this block size thing you continually mention?
I cannot state enough how important it is. It's the one most critical mistake I've made so far when implementing s3backer, and it's enough to send you broke if you don't factor everything in beforehand. I originally used a block size of 4K for my storage, thinking it was more efficient as it was smaller (so small changes wouldn't require as large an upload) and matched the kernel page size. The second issue is now negated, and the first is irrelevant as I spent MUCH more money paying for initial upload than incremental changes. What's worse, the US$100-odd I spent on uploads for 16GB is now gone, and it has cost me money to move that data across to a new instance.
On my current blocksize (256K), I would pay only $40 for 100GB of upload, a much better cost/benefit outcome. Hopefully, by making a big deal about this, I can persuade people to use the calculator to find a suitable block size.
Is it possible to change the block size of my s3backer storage?
It is possible, but isn't cheap. Still, if you would like to do this, the LVM2 tool set contains a mechanism for moving Physical Extents (PEs) from one device to another. The way to achieve this would be:
- Create another s3backer file with the block size that you desire
- Using losetup, attach it to a new loopback device (eg. /dev/loop1)
- Add this new device to the existing volume group:
- pvcreate /dev/loop1
- vgextend [volume group] /dev/loop1
- Using the pvmove command, you can move either the entire volume group, or only certain volumes at a time to the new physical volume.
- Examples:
- pvmove -n photos /dev/loop0 /dev/loop1 - would move data from loop0 to loop1 for logical volume photos.
- pvmove /dev/loop0:1 /dev/loop1 - would only move physical extent 1 from loop0 to loop1.
- pvmove /dev/loop0 /dev/loop1 - would move the entire contents of loop0 to loop1.