The Book of Xen
No matter what, though, all storage backends look the same from within the Xen virtual domain. The hypervisor exports a Xen VBD (virtual block device) to the domU, which in turn presents the device to the guest OS with an administrator-defined mapping to traditional Unix device nodes. Usually this will be a device of the form hdx or sdx, although many distros now use xvdx, for Xen virtual disk. (The hd and sd devices generally work as well.)

We recommend blktap (a specialized form of file backend) and LVM for storage backends. These both work, offer good manageability, can be resized and moved freely, and support some mechanism for the sort of things we expect of filesystems now that we Live In The Future. blktap is easy to set up and good for testing, while LVM is scalable and good for production.
None of this is particularly Xen-specific. LVM is actually used (outside of Xen) by default for the root device on many distros, notably Red Hat, because of the management advantages that come with an abstracted storage layer. blktap is simply a Xen-specific mechanism for using a file as a block device, just like the traditional block loop driver. It's superior to the loop mechanism because it allows for vastly improved performance and more versatile image formats, such as QCOW, but it's not fundamentally different from the administrator's perspective.
Let's get to it.
Basic Setup: Files

For people who don't want the hassle and overhead of LVM, Xen supports fast and efficient file-backed block devices using the blktap driver and library.
blktap (blk being the worn-down stub of "block" after being typed hundreds of times) includes a kernel driver and a userspace daemon. The kernel driver directly maps the blocks contained by the backing file, avoiding much of the indirection involved in mounting a file via loopback. It works with many file formats used for virtual block devices, including the basic "raw" image format obtainable by dd-ing a block device.
You can create a file using the dd command:

    # dd if=/dev/zero of=/opt/xen/anthony.img bs=1M count=1024

Note: Your version of dd might require slightly different syntax; for example, it might require you to specify the block size in bytes.
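To make the arithmetic behind those dd arguments concrete, here is a minimal sketch that builds a much smaller image the same way and checks that the file size comes out to block size times count. The temporary directory, the file name example.img, and the 4MB size are our assumptions for illustration, chosen so the script runs quickly and without root. The block size is written in bytes, for dd versions that don't accept the 1M suffix.

```shell
#!/bin/sh
# Sketch only: a scaled-down version of the dd command above.
# The path and sizes are illustrative assumptions, not values from the text.
set -eu

workdir=$(mktemp -d)
img="$workdir/example.img"

# Block size in bytes (1048576 bytes = 1MB) and number of blocks.
bs=1048576
count=4

# Write count blocks of zeroes into the image file.
dd if=/dev/zero of="$img" bs="$bs" count="$count" 2>/dev/null

# The file should be exactly bs * count bytes long.
expected=$((bs * count))
actual=$(wc -c < "$img" | tr -d ' ')
echo "expected=$expected actual=$actual"

# Clean up the scratch directory.
rm -rf "$workdir"
```

For the 1GB image in the text, the same check would expect 1024 x 1048576 = 1073741824 bytes, which is the byte count dd reports when it finishes.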
Now dd will chug away for a bit, copying zeroes to a file. Eventually it'll finish:

    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB) copied, 15.1442 seconds, 70.9 MB/s

Thus armed with an image file, you can attach it using the tap driver, make a filesystem on it, and mount it as usual with the mount command.
    # xm block-attach 0 tap:aio:/opt/xen/anthony.img /dev/xvda1 w 0
    # mkfs /dev/xvda1
    # mount /dev/xvda1 /mnt/

First, we use the xm(8) command to attach the block device to domain 0. In this case the xm command is followed by the block-attach subcommand, with the arguments, in order: the domain to attach to (0), the backend device (tap:aio:/opt/xen/anthony.img), the frontend device (/dev/xvda1), the mode (w), and the backend domain (0).

Now that it's mounted, you can put something in it. (See Chapter 3 for details.) In this case, we'll just copy over a filesystem tree that we happen to have lying around:

    # cp -a /opt/xen/images/centos-4.4/* /mnt/

Add a disk= line to the domU config (in our example, /etc/xen/anthony) to reflect the filesystem:

    disk = ['tap:aio:/opt/xen/anthony.img']

Now you should be able to start the domain with its new root device:

    # xm create -c anthony

Watch the console and bask in its soothing glow.

MOUNTING PARTITIONS WITHIN A FILE-BACKED VBD

There's nothing that keeps you from partitioning a virtual block device as if it were a hard drive. However, if something goes wrong and you need to mount the subpartitions from within dom0, it can be harder to recover. The standard mount -o loop filename /mnt won't work, and neither will something like mount /dev/xvda1 /mnt (even if the device is attached as /dev/xvda, Xen will not automatically scan for a partition table and create appropriate devices).

kpartx will solve this problem. It reads the partition table of a block device and adds mappings for the device mapper, which then provides device file-style interfaces to the partitions. After that, you can mount them as usual.
Let's say you've got an image with a partition table that describes two partitions:

    # xm block-attach 0 tap:aio:/path/to/anthony.img /dev/xvda w 0
    # kpartx -av /dev/xvda

kpartx will then find the two partitions and create /dev/mapper/xvda1 and /dev/mapper/xvda2. Now you should be able to mount and use the newly created device nodes as usual.

LVM: Device-Independent Physical Devices

Flat files are well and good, but they're not as robust as simply providing each domain with its own physical volume (or volumes). The best way to use Xen's physical device support is, in our opinion, LVM.

LVM, short for logical volume management, is Linux's answer to VxFS's storage pools or Windows Dynamic Disks. It is what the marketing people call enterprise grade. In keeping with the software mantra that "all problems can be solved by adding another layer of abstraction," LVM aims to abstract away the idea of "disks" to improve manageability.

Instead, LVM (as one might guess from the name) operates on logical volumes. This higher-level view allows the administrator much more flexibility: storage can be moved around and reallocated with near impunity. Even better, from Xen's perspective, there's no difference between an LVM logical volume and a traditional partition.

Sure, setting up LVM is a bit more work up front, but it'll save you some headaches down the road when you have eight domUs on that box and you are trying to erase the partition for the third one.
Using LVM and naming the logical volume to correspond to the domU name makes it quite a bit harder to embarrass yourself by erasing the wrong partition.[27]

QCOW

Up to this point, we've talked exclusively about the "raw" file format, but it's not the only option. One possible replacement is the QCOW format used by the QEMU project. It's got a lot to recommend it: a fast, robust format that supports sparse allocation, encryption, compression, and copy-on-write. We like it, but support isn't quite mature yet, so we're not recommending it as your primary storage option.

Nonetheless, it might be fun to try. To start working with QCOW, it'll be convenient to have QEMU. (While Xen includes some of the QEMU tools, the full package includes more functionality.) Download it from http://www.nongnu.org/qemu/download.html. As usual, we recommend the source install, especially because the QEMU folks eschew standard package management for their binary distribution.

Install QEMU via the standard process:

    # tar zxvf

Basic Setup: LVM

The high-level unit that LVM operates on is the volume group, or VG. Each group maps physical extents (disk regions of configurable size) to logical extents. The physical extents are hosted on what LVM refers to as physical volumes, or PVs. Each VG can contain one or more of these, and the PVs themselves can be any sort of block device supported by the kernel. The logical extents, reasonably enough, are on logical volumes, abbreviated LVs. These are the devices that LVM actually presents to the system as usable block devices.

As we're fond of saying, there really is no substitute for experience. Here's a five-minute illustrated tutorial in setting up logical volumes (see Figure 4-1).

Figure 4-1. This diagram shows a single VG with two PVs. From this VG, we've carved out three logical volumes, lv1, lv2, and lv3.
lv1 and lv3 are being used by domUs, one of which treats the entire volume as a single partition and one of which breaks the LV into subpartitions for / and /var.

Begin with some hard drives. In this example, we'll use two SATA disks.

Note: Given that Xen is basically a server technology, it would probably be most sensible to use RAID-backed redundant storage, rather than actual hard drives. They could also be partitions on drives, network block devices, UFS-formatted optical media ... whatever sort of block device you care to mention. We're going to give instructions using one partition on each of two hard drives, however. These instructions will also hold if you're just using one drive.

Warning: Note that we are going to repartition and format these drives, which will destroy all data on them.

First, we partition the drives and set the type to Linux LVM. Although this isn't strictly necessary (you can use the entire drive as a PV, if desired), it's generally considered good Unix hygiene. Besides, you'll need to partition if you want to use only a portion of the disk for LVM, which is a fairly common scenario. (For example, if you want to boot from one of the physical disks that you're using with LVM, you will need a separate /boot partition.)

So, in this example, we have two disks, sda and sdb. We want the first 4GB of each drive to be used as LVM physical volumes, so we'll partition them with fdisk and set the type to 8e (Linux LVM).

If any partitions on the disk are in use, you will need to reboot to get the kernel to reread the partition table. (We think this is ridiculous, by the way. Isn't this supposed to be the future?)

Next, make sure that you've got LVM and that it's LVM2, because LVM1 is deprecated.[28]

    # vgscan --version
      LVM version:     2.02.23 (2007-03-08)
      Library version: 1.02.18 (2007-02-13)
      Driver version:  4.5.0

You might need to load the driver.
If vgscan complains that the driver is missing, run:

    # modprobe dm_mod

In this case, dm stands for device mapper, which is a low-level volume manager that functions as the backend for LVM. Having established that all three of these components are working, create physical volumes as illustrated in Figure 4-2.

    # pvcreate /dev/sda1
    # pvcreate /dev/sdb1

Figure 4-2. This diagram shows a single block device after pvcreate has been run on it. It's mostly empty, except for a small identifier on the front.

Bring these components together into a volume group by running vgcreate. Here we'll create a volume group named cleopatra on the devices sda1 and sdb1:

    # vgcreate cleopatra /dev/sda1 /dev/sdb1

Finally, make volumes from the volume group using lvcreate, as shown in Figure 4-3. Think of it as a more powerful and versatile form of partitioning.

    # lvcreate -L

Figure 4-3. lvcreate creates a logical volume, /dev/vg/lvol, by chopping some space out of the VG, which is transparently mapped to possibly discontinuous physical extents on PVs.

Create a filesystem using your favorite filesystem-creation tool:

    # mkfs /dev/cleopatra/menas

At this point, the LV is ready to mount and access, just as if it were a normal disk.

    # mount /dev/cleopatra/menas /mnt/hd

To make the new device a suitable root for a Xen domain, copy a filesystem into it. We used one from http://stacklet.com/; we just mounted their root filesystem and copied it over to our new volume.

    # mount -o loop gentoo.img /mnt/tmp/
    # cp -a /mnt/tmp/* /mnt/hd

Finally, to use it with Xen, we can specify the logical volume to the guest domain just as we would any physical device. (Note that here we're back to the same example we started the chapter with.)

    disk = ['phy:/dev/cleopatra/menas,sda1,w']

At this point, start the machine. Cross your fingers, wave a dead chicken, perform the accustomed ritual.
In this case our deity is propitiated by an xm create. Standards have come down in the past few millennia.

    # xm create menas

[27] This example is not purely academic.
[28] This is unlikely to be a problem unless you are using Slackware.

Enlarge Your Disk

Both file-backed images and LVM disks can be expanded transparently from the dom0. We're going to assume that disk space is so plentiful that you will never need to shrink an image.

Be sure to stop the domain before attempting to resize its underlying filesystem. For one thing, all of the user-space resize tools that we know of won't attempt to resize a mounted filesystem. For another, the Xen hypervisor won't pass along changes to the underlying block device's size without restarting the domain. Most important, even if you were able to resize the backing store with the domain running, data corruption would almost certainly result.

File-Backed Images

The principle behind augmenting file-backed images is simple: We append more bits to the file, then expand the filesystem.

First, make sure that nothing is using the file. Stop any domUs that have it mounted. Detach it from the dom0. Failure to do this will likely result in filesystem corruption.

Next, use dd to add some bits to the end. In this case we're directing 1GB from our /dev/zero bit hose to anthony.img. (Note that not specifying an output file causes dd to write to stdout.)

    # dd if=/dev/zero bs=1M count=1024 >> /opt/xen/anthony.img

Use resize2fs to extend the filesystem (or the equivalent tool for your choice of filesystem).
    # e2fsck -f /opt/xen/anthony.img
    # resize2fs /opt/xen/anthony.img

resize2fs will default to making the filesystem the size of the underlying device if there's no partition table.

If the image contains partitions, you'll need to rearrange those before resizing the filesystem. Use fdisk to delete the partition that you wish to resize and recreate it, making sure that the starting cylinder remains the same.

LVM

It's just as easy, or perhaps even easier, to use LVM to expand storage. LVM was designed from the beginning to increase the flexibility of storage devices, so it includes an easy mechanism to extend a volume (as well as shrink and move).

If there's free space in the volume group, simply issue the command:

    # lvextend -L +1G /dev/cleopatra/charmian

If the volume group is full, you'll need to expand it. Just add a disk to the machine and extend the VG:

    # vgextend /dev/cleopatra /dev/sdc1

Finally, just as in the previous example, handle the filesystem-level expansion. We'll present this one using ReiserFS.

    # resize_reiserfs -s +1G /dev/cleopatra/charmian

Copy-on-Write and Snapshots

One of the other niceties that a real storage option gives you is copy-on-write, which means that, rather than the domU overwriting a file when it's changed, the backend instead transparently writes a copy elsewhere.[29] As a corollary, the original filesystem remains as a snapshot, with all modifications directed to the copy-on-write clone.

This snapshot provides the ability to save a filesystem's state, taking a snapshot of it at a given time or at set intervals.
There are two useful things about snapshots: for one, they allow for easy recovery from user error.[30] For another, they give you a checkpoint that's known to be consistent; it's something that you can conveniently back up and move elsewhere. This eliminates the need to take servers offline for backups, such as we had to do in the dark ages.

CoW likewise has a bunch of uses. Of these, the most fundamental implication for Xen is that it can dramatically reduce the on-disk overhead of each virtual machine: rather than using a simple file as a block device or a logical volume, many machines can share a single base filesystem image, only requiring disk space to write their changes to that filesystem.

CoW also comes with its own disadvantages. First, there's a speed penalty. The CoW infrastructure slows disk access down quite a bit compared with writing directly to the device, for both reading and writing.

If you're using sparse allocation for CoW volumes, the speed penalty becomes greater due to the overhead of allocating and remapping blocks. This leads to fragmentation, which carries its own set of performance penalties. CoW can also lead to the administrative problem of oversubscription; by making it possible to oversubscribe disk space, it makes life much harder if you accidentally run out. You can avoid all of this by simply allocating space in advance.

There's also a trade-off in terms of administrative complexity, as with most interesting features. Ultimately, you, the Xen administrator, have to decide how much complexity is worth having.

We'll discuss device mapper snapshots, as used by LVM, because they're the implementation that we're most familiar with.
For shared storage, we'll focus on NFS and go into more detail on shared storage systems in Chapter 9. We also outline a CoW solution with UnionFS in Chapter 7. Finally, you might want to try QCOW block devices; although we haven't had much luck with them, your mileage may vary.

[29] This is traditionally abbreviated CoW, partly because it's shorter, but mostly because "cow" is an inherently funny word. Just ask Wikipedia.
[30] It's not as hard as you might suppose to rm your home directory.

LVM and Snapshots

LVM snapshots are designed more to back up and checkpoint a filesystem than as a means of long-term storage. It's important to keep LVM snapshots relatively fresh, or, in other words, make sure to drop them when your backup is done.[31]

Snapshot volumes can also be used as read-write backing store for domains, especially in situations where you just want to generate a quick domU for testing, based on some preexisting disk image. The LVM documentation notes that you can create a basic image, snapshot it multiple times, and modify each snapshot slightly for another domain. In this case, LVM snapshots would act like a block-level UnionFS. However, note that when a snapshot fills up, it's immediately dropped by the kernel. This may lead to data loss.

The basic procedure for adding an LVM snapshot is simple: Make sure that you have some unused space in your volume group, and create a snapshot volume for it.

THE XEN LIVECD REVISITED: COPY-ON-WRITE IN ACTION

The Xen LiveCD actually is a pretty nifty release.
One of its neatest features is the ability to automatically create copy-on-write block devices when a Xen domain starts, based on read-only images on the CD.

The implementation uses the device mapper to set up block devices and snapshots based on flat files, and is surprisingly simple.

First, the basic storage is defined with a line like this in the domain config file:

    disk = ['cow:/mnt/cdrom/rootfs.img 30,sda1,w']

Note the use of the cow: prefix, which we haven't mentioned yet. This is actually a custom prefix rather than part of the normal Xen package.

We can add custom prefixes like cow: because /etc/xen/scripts/create_block_device falls through to a script with a name of the form block-[type] if it finds an unknown device type, in this case cow. The block-cow script expects one argument, either create or destroy, which the domain builder provides when it calls the script. block-cow then calls either the create_cow or destroy_cow script, as appropriate.

The real setup takes place in a script, /usr/sbin/create_cow. This script essentially uses the device mapper to create a copy-on-write device based on an LVM snapshot,[32] which it presents to the domain. We won't reproduce it here, but it's a good example of how standard Linux features can form the basis for complex, abstracted functions. In other words, a good hack.
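The create/destroy dispatch described in the sidebar can be sketched as a small shell function. To be clear, this is not the LiveCD's block-cow script, which we haven't reproduced; the function name block_cow and the messages it prints are hypothetical, and a real helper would invoke create_cow or destroy_cow rather than just naming them.

```shell
#!/bin/sh
# Hypothetical skeleton of a block-[type] helper. It mirrors only the
# create/destroy dispatch described above and prints the script it would
# run instead of touching any devices.
block_cow() {
    case "${1:-}" in
        create)
            echo "would run: create_cow"
            ;;
        destroy)
            echo "would run: destroy_cow"
            ;;
        *)
            # Anything other than create or destroy is a usage error.
            echo "usage: block-cow create|destroy" >&2
            return 1
            ;;
    esac
}

block_cow create
block_cow destroy
```

The point of the pattern is that the domain builder only needs to know the one-argument convention; everything specific to the device type stays inside the helper.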
First, check to see whether you have the driver dm_snapshot. Most modern distros ship with this driver built as a loadable module. (If it's not built, go to your Linux kernel source tree and compile it.)

    # locate dm_snapshot.ko

Manually load it if necessary.

    # modprobe dm_snapshot

Create the snapshot using the lvcreate command with the -s option to indicate "snapshot." The other parameters specify a length and name as in an ordinary logical volume. The final parameter specifies the origin, or volume being snapshotted.

    # lvcreate -s -L 100M -n pompei.snap /dev/cleopatra/pompei

This snapshot then appears to be a frozen image of the filesystem: writes will happen as normal on the original volume, but the snapshot will retain changed files as they were when the snapshot was taken, up to the maximum capacity of the snapshot.

When making a snapshot, the length indicates the maximum amount of changed data that the snapshot will be able to store. If the snapshot fills up, it'll be dropped automatically by the kernel driver and will become unusable.

For a sample script that uses an LVM snapshot to back up a Xen instance, see Chapter 7.

[31] Even if you add no data to the snapshot itself, it can run out of space (and corrupt itself) just keeping up with changes in the main LV.
[32] More properly, a device mapper snapshot, which LVM snapshots are based on. LVM snapshots are device mapper snapshots, but device mapper snapshots can be based on any pair of block devices, LVM or not. The LVM tools provide a convenient frontend to the arcane commands used by dmsetup.

Storage and Migration

These two storage techniques, flat files and LVM, lend themselves well to easy and automated cold migration, in which the administrator halts the domain, copies the domain's config file and backing storage to another physical machine, and restarts the domain.

Copying over a file-based backend is as simple as copying any file over the network. Just drop it onto the new box in its corresponding place in the filesystem, and start the machine.

Copying an LVM volume is a bit more involved, but it is still straightforward: Make the target device, mount it, and move the files in whatever fashion you care to.

Check Chapter 9 for more details on this sort of migration.

Network Storage

These two storage methods only apply to locally accessible storage. Live migration, in which a domain is moved from one machine to another without being halted, requires one other piece of this puzzle: The filesystem must be accessible over the network to multiple machines. This is an area of active development, with several competing solutions. Here we'll discuss NFS-based storage. We will address other solutions, including ATA over Ethernet and iSCSI, in Chapter 9.

NFS

NFS is older than we are, and it is used by organizations of all sizes. It's easy to set up and relatively easy to administer. Most operating systems can interact with it. For these reasons, it's probably the easiest, cheapest, and fastest way to set up a live migration-capable Xen domain.

The idea is to marshal Xen's networking metaphor: The domains are connected (in the default setup) to a virtual network switch. Because the dom0 is also attached to this switch, it can act as an NFS server for the domUs.
In this case we're exporting a directory tree, neither a physical device nor a file. NFS server setup is quite simple, and it's cross platform, so you can use any NFS device you like. (We prefer FreeBSD-based NFS servers, but NetApp and several other companies produce fine NFS appliances. As we might have mentioned, we've had poor luck using Linux as an NFS server.) Simply export your OS image. In our example, on the FreeBSD NFS server at 192.0.2.7, we have a full Slackware image at /usr/xen/images/slack. Our /etc/exports looks a bit like this:

    /usr/xen/images/slack -maproot=0 192.0.2.222

We leave further server-side setup to your doubtless extensive experience with NFS. One easy refinement would be to make / read-only and shared, then export read-write VM-specific /var and /home partitions, but in the simplest case, just export a full image.

Note: Although NFS does imply a performance hit, it's important to recall that Xen's network buffers and disk buffers are provided by the same paravirtualized device infrastructure, and so the actual network hardware is not involved. There is increased overhead in traversing the networking stack, but performance is usually better than gigabit Ethernet, so it is not as bad as you might think.

Now configure the client. First, you'll need to make some changes to the domU's kernel to enable root on NFS. Turn on kernel-level IP autoconfiguration (CONFIG_IP_PNP=y):

    networking -> networking options -> ip: kernel level autoconfiguration

If you want to do everything via DHCP (although you should probably still specify a MAC address in your domain config file), add DHCP support under that tree: CONFIG_IP_PNP_DHCP, or CONFIG_IP_PNP_BOOTP if you're old school. If you are okay specifying the IP in your domU config file, skip that step.
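If you'd rather confirm these options from the command line than hunt through menuconfig, a check along the following lines may help. It is only a sketch: config_enabled is a helper name we made up, and the sample config it reads is fabricated for illustration, so point the function at your real kernel .config instead.

```shell
#!/bin/sh
# Sketch: report whether an option is built in (=y) in a kernel config file.
# config_enabled is a hypothetical helper, not part of any kernel tooling.
config_enabled() {
    # $1 = path to the config file, $2 = option name (e.g. CONFIG_IP_PNP)
    grep -q "^$2=y\$" "$1"
}

# Made-up sample config so the example runs anywhere.
sample=$(mktemp)
cat > "$sample" <<'EOF'
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
EOF

for opt in CONFIG_IP_PNP CONFIG_IP_PNP_DHCP CONFIG_IP_PNP_BOOTP; do
    if config_enabled "$sample" "$opt"; then
        echo "$opt: built in"
    else
        echo "$opt: not enabled"
    fi
done

rm -f "$sample"
```

Because the pattern requires =y exactly, an option built as a module (=m) is reported as not enabled, which is the distinction that matters for anything the kernel needs before it can mount its root.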
Now you need to enable support for root on NFS. Make sure NFS support is Y and not M; that is, CONFIG_NFS_FS=y. Next, enable root over NFS: CONFIG_ROOT_NFS=y. In menuconfig, you can find that option under:

    Filesystems -> Network File Systems -> NFS file system support -> Root over NFS

Note that menuconfig won't give you the option of selecting root over NFS until you select kernel-level IP autoconfiguration.

Build the kernel as normal and install it somewhere where Xen can load it. Most likely this isn't what you want for a dom0 kernel, so make sure to avoid overwriting the boot kernel.

Now configure the domain that you're going to boot over NFS. Edit the domain's config file:

    # Root device for nfs.
    root = "/dev/nfs"
    # The nfs server.
    nfs_server = '192.0.2.7'
    # Root directory on the nfs server.
    nfs_root = '/usr/xen/images/slack'
    netmask = "255.255.255.0"