Headless VirtualBox, dm-crypt, and VRDP

Posted by dave
on Wednesday, March 10

So we just got a new server machine into the office, and it’s got some major stonk so that we can move towards virtualised services in a big way. The machine is called gort, and it dominates.

As a result, we’ve got a few new file servers for specific purposes, and they’re running Samba on Linux on top of VirtualBox rather than SMB on Windows. Some of them hold sensitive data, which we want to protect in case of theft, so I’ve encrypted the crap out of everything using dm-crypt.

I was initially stumped as to how to make this whole setup work properly. A dm-crypt disk volume needs its encryption passphrase entered before the system can even begin to boot. However, the VirtualBox machine image needs to run using VBoxHeadless, without any kind of graphical interface, because we want to be able to boot the whole setup remotely via a console if a machine image goes down for some reason. It seemed like there was going to be a problem – either we could run the setup headless and not have to come into the office if a machine crashed, or have dm-crypt encryption as an anti-theft strategy, but not both.

The solution turned out to be remarkably simple.

Any VirtualBox VM image started using a command like

VBoxHeadless -startvm "FooBox"

automatically gets its own display server running something called “VRDP” (which I assume stands for “VirtualBox Remote Desktop Protocol”). By default the first VM listens on port 3389 of the host server, the standard RDP port, and you need to explicitly tell any further VMs to listen on a different port, perhaps like so:

VBoxHeadless --startvm "Some Machine Image" -v on -vrdpport 5000 &

This tells VirtualBox to start the VM called “Some Machine Image” and turn on the VRDP server on port 5000 (the trailing ampersand runs the command in the background, so you get your shell prompt back). Assuming that the host server has a hostname of “foo.bar.blah” on your network, you can then use your regular Remote Desktop Protocol (RDP) client to connect to the host server on port 5000. You’ll be shown the current state of the guest VM right from the moment it initially boots, allowing you to type your encryption passphrase in.
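If you have several encrypted guests, each one needs its own VRDP port. Here is a minimal sketch of one way to hand the ports out, using only the VBoxHeadless options shown above. The VM names and the starting port are made up for illustration, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: give each headless VM its own VRDP port.
# BASE_PORT and the VM names below are illustrative assumptions.

BASE_PORT=5000

# Print the VBoxHeadless command for VM "$1" on VRDP port "$2".
vrdp_command() {
    printf 'VBoxHeadless --startvm "%s" -v on -vrdpport %s\n' "$1" "$2"
}

# Print one command per VM name, assigning consecutive ports from BASE_PORT.
plan_vms() {
    port=$BASE_PORT
    for vm in "$@"; do
        vrdp_command "$vm" "$port"
        port=$((port + 1))
    done
}

# Dry run: nothing is started, the commands are only printed.
plan_vms "File Server A" "File Server B"
```

Once the printed commands look right, run each one with a trailing ampersand as in the example above, and note down which port belongs to which guest.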

So the process basically looks like this:

dave@gort:~$ VBoxHeadless --startvm "Some Machine Image" -v on -vrdpport 5000 &
[1] 8905
dave@gort:~$ Sun VirtualBox Headless Interface 3.1.4
(C) 2008-2010 Sun Microsystems, Inc.
All rights reserved.
Listening on port 5000.
dave@gort:~$

At that point, the VM image has fired up and is running, but it isn’t booting, because it needs its passphrase before it can access its system files and start to boot.

So, we can connect to the host machine (don’t bother trying to connect to the guest machine yet, since it hasn’t booted). One very important thing to note is that VRDP only worked for me using RDPv5.
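Any RDP client that speaks version 5 of the protocol should do. As a sketch, assuming you use the rdesktop client (my choice for illustration, not something the setup depends on), the connection command can be built like this. The hostname and port are the example values from earlier, and the helper function name is hypothetical.

```shell
#!/bin/sh
# Sketch: build an rdesktop command line for a VRDP server.
# Assumes the rdesktop client; -5 asks it to use RDP version 5.

# Print the command to connect to host "$1" on VRDP port "$2".
rdp_command() {
    printf 'rdesktop -5 %s:%s\n' "$1" "$2"
}

# Example values from this post; substitute your own host and port.
rdp_command foo.bar.blah 5000
```

The function only prints the command, so you can check it before running it on a machine with a graphical display.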

This gets us right on to the VM image’s GRUB screen, and then to the dm-crypt passphrase entry.

Entering the passphrase allows boot to continue normally. I’ve got to say, I’ve been pleasantly surprised by how easy VirtualBox has made everything. The admin tools are first-rate and convenient to use, networking has been easy to set up, and the learning curve has been low. Now I can take all the time I’ve saved and use it to mess with KVM.

Demystifying the Cloud

Posted by dave
on Tuesday, January 19

I don’t like talking about “the cloud” for the same reason I don’t like talking about “web 2.0” – although it’s suggestive of a certain set of ideas, it’s too vague to really do anything creative with. I figure anybody who can talk about “the cloud” for more than 30 seconds without mentioning more specific technologies is thinking at far too general a level.

Whether I’m right or wrong about that, there are a lot of people gasbagging about “the cloud” just now. Let’s take a few steps back and think about applying the level of detail implied by such a non-specific phrase to stuff that isn’t software. Duke Ellington wouldn’t write a jazz tune by thinking about “woodwind instruments” – he’d do it by thinking something like “what kind of song structure would really allow Johnny Hodges to let it rip? What key should I write it in? What should the drummer do?”. Although there’s no evidence to support my claim, I doubt that Leonardo da Vinci got anything useful out of thinking about anything as general as “oil painting”, even though that was the hot new thing at the time.

Anyone who’s serious about being creative and making cool stuff needs to think seriously about the conceptual and technical tools which they have at their disposal, and overly-generalized marketing buzzwords only get in the way of this kind of thinking. It doesn’t matter whether you’re making music, painting, writing code, or writing a novel: without clear thinking about what you’re trying to do and how to achieve it, you’re just going to make a mess.

Some conceptual tools

So what are people talking about when they talk about “the cloud”? Let’s try and break it down, because there are some interesting ideas in there.

A common breakdown of the cloud is to think of it as being made up of the following categories:

  • Software as a Service
  • Platform as a Service
  • Utility Computing

This categorization scheme is focused on marketing concerns rather than technical and creative ones, and it doesn’t do much to help me think about cool stuff I could build. I prefer a categorization that looks more like this:

  • On-demand services – You only pay for the computing power you use, and the costs of both entry and exit are low (although the running costs may be significantly higher on a per-unit basis than for “normal” systems). Capital costs are borne by the service provider.
  • Virtualization – Running your systems inside a software container instead of running them directly on physical hardware. This could mean virtualizing the entire operating system or virtualizing an application in a sandbox or VM.
  • Parallel processing – Using networked computers as one big computer to solve computationally expensive problems more quickly, also known as “grid computing”.
  • Distributed architectures – Linking a bunch of machines together on a network and setting up applications that run across multiple machines.
  • Self-managing systems – Software systems that are designed to take intelligent actions based on what’s happening in their environment.

These concepts are like Lego blocks. We can chain them together to make interesting software systems, depending on what we want to achieve.

Amazon’s EC2, for example, is basically on-demand services + operating system virtualization. Once we’ve got those two blocks snapped together, we’ve got a very interesting base to work from. We could be content with on-demand virtualization and use it to just deal with traffic spikes without massive capital investment, or we could snap a few more concepts into place.

On-demand virtualized operating systems + distributed architectures, applied to the problem of storage, equals a virtual Storage Area Network which can store petabytes of information. Examples of this in the real world are things like Amazon S3 or Ubuntu One. I’m currently experimenting with a system that’s in the same general category, a CouchDB-based network-storage system called adhd which I’m working on with a friend. It adds the concept of self-management and can make its own semi-intelligent decisions about replication, although it’s still very much a work in progress.

Sometimes it’s interesting to take a Lego block away (instead of adding one) and see what happens. Thinking about distributed architectures and parallel processing without OS virtualization, for example, is interesting to me when I consider that even a relatively poor neighbourhood of London like the one I live in (Brixton/Stockwell) probably has more spare processing cycles in people’s houses than many data centres. This is a computing issue (what could we build if we could harness that processing power?) but it’s also a social and environmental issue (it’s wasteful to have all those machines sitting idle).

One thing which probably isn’t in anybody’s “cloud” categorization, but which I’m interested in, is the notion of event-driven systems. Polling sucks, and we should be building systems that can notify other systems when something happens, instead of repeatedly asking “has this thing happened yet?”. Google’s PubSubHubbub protocol is an interesting attempt to apply an event-driven approach to RSS feed publish notifications. Any time we can use events instead of polling, we should. Eventually we’ll hit a point where polling systems are seen for the slow, primitive beasts they are, and we’ll have a real-time internet. Until then, setTimeout() and AJAX polling will continue to rule the internetz.
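To make the difference concrete, here is a small single-machine analogy in shell. It is not PubSubHubbub or anything networked, just a sketch of the two styles: one function keeps asking whether a marker file exists yet, the other blocks on a named pipe until somebody writes to it. The function names, file names, and the “feed-updated” message are all made up for illustration.

```shell
#!/bin/sh
# Polling: keep asking "has this thing happened yet?" until the marker
# file "$1" appears. Prints how many times we had to ask.
wait_by_polling() {
    marker=$1
    asks=1
    until [ -e "$marker" ]; do
        sleep 1
        asks=$((asks + 1))
    done
    echo "$asks"
}

# Event-driven: block on the named pipe "$1" until a publisher writes a
# line to it, then print that line. No repeated asking is involved.
wait_for_event() {
    fifo=$1
    IFS= read -r event < "$fifo"
    echo "$event"
}

# Demo: a hypothetical publisher fires one event after a second, and the
# subscriber prints it once it arrives.
demo_dir=$(mktemp -d)
mkfifo "$demo_dir/events"
( sleep 1; echo "feed-updated" > "$demo_dir/events" ) &
wait_for_event "$demo_dir/events"
wait
rm -rf "$demo_dir"
```

The polling version wakes up every second whether or not anything has changed, and notices the change up to a second late. The event-driven version does no work at all until the publisher speaks.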

If these are the basic suppositions, what tools are going to grow in importance and what kinds of systems are we going to see in the next little while? What are the implications for the popular tools of today?

Technical Tools

The increased ease of operating system virtualization could make application virtualization and portability a lot less compelling. Who cares if Java can run on every possible machine architecture and operating system, if we can virtualize operating systems and specific applications really easily?

PHP will either get a decent threading model or die – too many of our Lego blocks rely on the idea that a language runtime can walk and chew gum at the same time, or at least fake it convincingly, and PHP’s approach (fork an entirely new operating system process) just won’t cut it much longer. Java, the .NET languages, and the dynamic duo (Python and Ruby) won’t be hurt by this. With Ruby specifically, the JRuby implementation is going to keep rising in popularity since it’s got Java’s tried-and-tested concurrency code going for it and doesn’t have the Global Interpreter Lock (GIL) that MRI 1.8 and 1.9 have. For the kinds of multi-threaded distributed systems I’m talking about, Erlang and Scala seem like they can only grow in popularity, seeing as how they were designed to deal with exactly this problem set. Lastly, for persistence, distributed block storage systems and the NoSQL movement are going to be increasingly important.

Proprietary vendors will need to come up with a compelling license-management story for on-demand services or risk being bypassed by a whole generation of geeks who can’t be bothered to keep track of what licenses they’ve used. So far the massive first-mover advantage of Amazon in the virtualized on-demand operating system space, built largely on a Free Software stack (Xen and Linux), has obscured this.

If I want to set up my own OS virtualization and app infrastructure based on technologies which require license payments, and I want to bring instances up and down 300 times a month, how many licenses have I used? How much hassle will it be for me to keep track of? Why should I bother? Proprietary software vendors can mitigate this problem to some extent by cutting deals with hosting companies and getting them to take care of the bean-counting at a layer below that of the app developer.

The world’s biggest existing “cloud”?

Turning from tools to stuff we can build, what sorts of things are possible if we try snapping some of our conceptual Lego blocks together in different ways? Well, let’s take a look at the largest cloud system that everyone ignores. Sometimes people snap the conceptual Lego together in some pretty interesting ways, and it’s worth paying attention when they do.

The biggest publicly-accessible redundant storage system on the planet is one which we don’t normally think of as being a “cloud” system: the global BitTorrent network. This system uses a mix of on-demand services, a distributed storage and addressing infrastructure, and software self-management to swap movies and music, and back it all up in a redundant system running on tens of millions of computers. It’s mostly ignored in the “legitimate” application development literature because the vast majority of the content inside the network infringes copyright law, but it’s an amazing system nonetheless. To copyright owners, it’s a Frankenstein’s Monster which so far seems unkillable.

Compared to BitTorrent infrastructure, Amazon’s EC2 is a fairly minuscule system. Realizing this made me think a bit about how snobby we programmers can be. Despite a lot of rhetoric about “revolutionary” and “disruptive” technologies in blogs the world over, the fact that a bunch of 15-year-old kids can distribute content more efficiently than the world’s largest media corporations mostly escapes our notice. The technology that accounts for about half the world’s TCP/IP traffic gets little mention from app development bloggers, because we’re all focused on Infrastructure as a Service discussions and flapping around with our iPhones. To his credit, George Reese does mention BitTorrent (once) in his book Cloud Application Architectures. It’d be interesting to see it get a mention in the software architecture literature. Ironically, the book Essential Software Architecture is available as a torrent but doesn’t mention BitTorrent.

Some experiments

Despite all the hype about it, all of this cloud-related stuff is actually one of the most interesting things to happen in the last three years. Getting in on the game, my friend Manos and I have been messing around with distributed storage architectures, on-demand services, and self-managing systems in our adhd project. Even though it doesn’t really do enough yet to be useful, the sheer fun and challenge of building such a weird system (using CouchDB on top of an EventMachine base) has been a blast.

I’ve also been building a little distributed video encoder called the Enigmamachine (again using EventMachine as a base, but this time with Sinatra in a supporting role). There’s a whole area of programming just waiting to be explored. While none of this stuff is new (all of the stuff on our “Cloud Lego” list has been part of the programmer’s toolbox for decades), it does seem that the pervasiveness of HTTP and the mass availability of high-bandwidth networks is opening up a lot of new prospects.

I’ll be giving a talk about all of this stuff at a British Interactive Media Association event at the swanky Hospital Club (the guys over at Ultraspeed invited me, thanks for that!).