“ShellShock” … a truly stunning example of an ill-considered feature.

For those who live under a rock, or weren’t paying attention: the so-called ShellShock bug, as usually stated, is that if you create an environment variable of the form name='() { :; } ; command' and start Bash, command will be executed unconditionally when Bash starts. That isn’t normally a problem, but if Bash is the default shell, and (say) a web script executes a system() call to run a system command, it’s going to run Bash. And since CGI scripts (and things that behave like CGI) put data taken from the original web client’s HTTP headers into the environment, that basically provides a means of running whatever you want in the context of the web application. Ugly.

Of course there are now patches, now that the white hats know about the problem, although how long the black hats have known and were exploiting it, no-one can say.

So let’s look at the problem in detail. (If you aren’t familiar with Unix-like OSes and shell programming, you can stop reading now.)

Bash has a feature that allows a function to be exported in the environment and imported from the environment. For example,

$ foo() { echo i am foo ; }        # Define a function foo
$ foo                              # Execute it
i am foo
$ bash                             # Start a subshell
$ foo                              # foo is not defined in the subshell
bash: foo: command not found
$ exit                             # Return to the outer level
$ export -f foo                    # export foo to the environment
$ bash                             # Start another subshell
$ foo                              # foo is now available to the subshell
i am foo

Now the mechanism that Bash uses to implement this feature is simple. Too simple. Internally, Bash maintains separate tables of variables and functions. On starting, it imports the environment into the list of variables. This is true of all Bourne-compatible shells like Bash. But Bash has a couple of special cases, one of which is that it can place functions into the environment too. (It doesn’t by default; you have to use export -f function-name to do this.)

The environment is pretty straightforward; it is simply a list of strings in the form name=value. So how does Bash store and retrieve a function?

It’s simple. Too simple. It looks for the string “() {” (that is, open paren, close paren, space, open curly). In our example, foo() is exported as “foo=() { echo i am foo ; }“. When Bash starts, it recognises the “() {“, rewrites the line as “foo() { echo i am foo ; }“, and hands it straight to its command interpreter for execution, just as if it had been entered like the first line of the example.

Prior to the patches coming out, that’s all it did. It didn’t check to see if the definition had anything after the closing curly bracket. So if you put anything in the environment that looked like “function-name='() { function-definition } ; other-commands“, other-commands would be unconditionally run. The patches attempt to stop other-commands from being executed.
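The one-liner that circulated widely at the time makes a handy check. On a pre-patch Bash the injected command runs while the environment is imported; on a patched Bash nothing is imported and only the -c command runs:

```shell
# The widely circulated check: a pre-patch Bash runs the trailing
# 'echo vulnerable' while importing x; a patched Bash imports nothing
# and just runs the -c command.
env x='() { :;}; echo vulnerable' bash -c 'echo hello from the subshell'
```

On any Bash patched since late 2014 you should see only the second message.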

As I write this, most patches out there are flawed, because there are other things that can go badly awry with this. And that’s not a surprise, because the basic action is still, fundamentally, hand a piece of arbitrary text, of unknown source, to the command interpreter. How could that possibly go wrong?

Let’s step back a bit here. The environment is a place for programs to put bits of data for programs running in sub-processes to pick up. Usually, this is benign; the sub-processes generally only look for variables they want, and can take or leave the data. There are of course many examples of shell scripts executing environment variables as shell code, because they haven’t quoted the expansions properly, but generally, you can write secure shell script.

But Bash’s function export/import feature fundamentally changes that model. It allows the code that the script is executing to be changed by data inherited from outside its control, and before the script takes control.

For example, let’s just assume that all the patches to Bash work, and the functionality is reduced to only ever allowing a function to be imported, and never having any other nasty side effect. I can still do this:

$ cd() { echo I am a bad man ; }  # Redefine the cd shell builtin
$ export -f cd                    # Export it
$ cat x.sh                        # x.sh is just a script that does a cd
cd /home
$ ./x.sh                          # And run the script
I am a bad man

The implications? If I can control the environment, I can control the commands executed by any Bash script run from my session, including, for example, any script launched by a privileged program. And if /bin/sh is linked to Bash, any shell command launched via a system() call is also a “bash script”, since system(“command“) simply spawns a sub-process and, in it, executes /bin/sh -c command.
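A script that cannot trust its environment can at least strip inherited functions before doing real work. A sketch (defensive.sh is my own hypothetical script; whether the bogus cd is imported at all depends on your Bash’s patch level, but either way the cd that runs is the builtin):

```shell
# Build a script that discards every function imported from the
# environment, then run it with a hijacked cd exported to it.
cat > /tmp/defensive.sh <<'EOF'
#!/bin/bash
# drop all functions (imported or otherwise) before using any of them
unset -f $(declare -F | awk '{print $3}') 2>/dev/null
cd /tmp && pwd
EOF
chmod +x /tmp/defensive.sh
bash -c 'cd() { echo I am a bad man ; } ; export -f cd ; /tmp/defensive.sh'
# prints /tmp, not "I am a bad man"
```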

When I look at the function import feature of Bash, my reaction is, why the hell did anyone think this was a good idea?

I’m usually not keen on removing features. In my experience, if you think nobody would do it that way, you’re probably wrong (see don’s law). But this one is just bad for so many reasons it’s ridiculous. It’s not needed for /bin/sh compatibility. As far as I can make out, it’s rarely if ever used at all. So if there’s a candidate for a featurectomy, this is it. (If you want to do this, the offending code is in the function initialize_shell_variables() in the file variables.c of the Bash source code at ftp.gnu.org/gnu/bash/.)

Or perhaps we should all just do what FreeBSD and Debian Linux have already done, and use a smaller, lighter shell (such as Dash) for shell scripts (installed/linked as /bin/sh), and relegate Bash to interactive command interpreter duties only.

Band-aid patching around this bug without removing the underlying issue – that Bash imports code from an untrusted source – is only addressing part of the problem.

Edit: There are of course now patches in play which do a few things; the band-aids referred to above, and a new one to move the exported functions into environment variables named BASH_FUNC_functionname. I’m not sure that the latter significantly improves security of the “feature”.
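You can see the new encoding for yourself; the exact name-mangling varies between patch versions (some use a trailing %%, some use ()), so this is a sketch rather than a guarantee:

```shell
# Peek at how a post-patch Bash encodes an exported function in the
# environment; the exact name-mangling varies between patch versions.
bash -c '
  foo() { echo i am foo ; }
  export -f foo
  env | grep "^BASH_FUNC_foo"
'
```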

However, there is one way to deal with commands being passed to /bin/sh. Bash recognises when it is executed as “sh”, and makes some assumptions. This patch (to Bash 4.3 patch 27) makes Bash refuse to import functions when executed as “sh”. The advantage of this is that commands invoked from system(), and scripts that specify their interpreter as “#!/bin/sh” (and therefore should not expect Bash-isms to be present) will not be vulnerable to any abuse of the function export/import feature.

Don’t get me wrong, I am still advocating a complete featurectomy. But this might be more acceptable to those who think importing random functions from who knows where is somehow a good idea…

Back in the Dark Ages, when dinosaurs ruled the earth … yeah, say the mid 1990s, early ISPs tended to offer “free” email service as part of their connection plans. It was cheap to do; the email usually just took the form of a POP email box, via which you downloaded your email with a client such as Eudora, or POPmail for those reprobate MS-DOS users who loved their text-mode clients.

Your email address was usually something like your-dialup-username@the-isp’s-domain-name, e.g. don@netlink.co.nz. There were a bunch of reasons for this.

  1. Email was an early “killer app” for the Internet. Giving the customers an email address got them on and able to do something useful with the ‘net, back when the Web was still in its infancy.
  2. Domain name services were offered as a premium service, and were often expensive in terms of the effort and domain name fees required to provide; corporate customers with permanent connections would usually provide their own email service.
  3. Having the ISP’s domain name in all its customers’ email addresses provided brand recognition.
  4. The requirement to change email address created a disincentive for customers to change provider.

The world has changed a bit since then.

There are a bunch of email providers, like Hotmail, Yahoo! and GMail, which will happily give you an email address, and a very nice web interface, which you can use to get at your email from anywhere. There’s absolutely no need to get yourself tied to a specific provider. (There is one caveat though, and that’s that if you’re not paying for the service, you are not a customer.)

Point 1 no longer applies. You don’t need the ISP’s email service.

There are now many commercial domain hosting services available. Granted, they are not free, but many are cost-effective, and they provide good email services, including hosted IMAP service (far superior to the old POP service, which assumed you’d only ever get your email on one computer), server-side filtering, spam removal and so forth, as well as web hosting options. The days of the ISP manually configuring a “virtual domain” onto its web and email servers, and charging a premium price for it, are long gone.

The game of providing email has changed. The service isn’t a case of holding mail in a temporary spool for later download by a single desktop computer. A decent email service stores, and backs up, all email, so that it can be retrieved from multiple desktop, portable and mobile clients. Spam processing is a major drain on resources; many folk don’t understand that it’s war out there – spam is driven by large commercial interests who pay highly organised criminals to spam, and to attack computers to create the means to spam. So not only do you not want to be the target for these gangs, not being that target is actually cost effective. Automation makes configuring domain names, email and web hosting easy and cheap for suitably organised providers, and domain name registration fees are down to very low prices. For prices in the low hundreds per year, or less, you can have your own domain name, as many email addresses as you need within it, and a smart web host running easy to operate software (such as WordPress, which I’m using to write this).

The last two reasons for ISPs providing email are for their benefit, not yours. They get the brand recognition. They get to keep you as a customer, or at least on their customer list, long after their use-by date has passed.

Email has never, ever, been a “free” service; somewhere, somehow, the providers of the service have been making a buck out of it. Maybe it’s in customer retention, maybe it’s in the brand recognition. (It was Telecom Xtra’s explicitly stated goal in its early days to make “xtra.co.nz” a recognised brand.) Maybe it’s in advertising. When you buy that fancy domain / web hosting package with email? Well, the provider has probably spent as much if not more on the email part than on the web hosting part. Which brings me to a simple question. If domain hosting is so cheap, why do I still see @xtra.co.nz email addresses painted on the sides of vans, on billboards and on business cards? The money you spent on that isn’t promoting your business, it’s promoting Telecom’s. Why would you do that?

What is your email worth to you?

What would you do if xtra.co.nz was no longer available? If you’re no longer a Telecom customer, you’re likely to see your @xtra.co.nz email address axed in the near future, unless you pay them to keep it. If changing your address means reprinting your stationery and repainting signs, and losing email from customers that haven’t noticed that your email address has changed, that’s a high price to pay for a “free” service.

So, c’mon. In NZ, we have a domain registration system that’s the envy of the world (and I’m proud to say I’ve had a bit to do with that). Hosting your email has never been so easy or so cheap, at a time that trying to do it yourself has never been so difficult. How you present yourself to an increasingly digital world is important to how others see you, and whether they want to do business with you.

So once again, what is your email worth to you?

A recent posting on an InternetNZ mailing list reminded me of just how far we have come. In March 1995, I took the minutes of the New Zealand Internet Society steering group meeting.

Just so we’re clear on what the Internet was back then, the Web was only just beginning to get traction; typical data rates were 48 kbps; establishments such as universities had rates of up to 256 kbps, the total amount of Internet bandwidth out of the country was (I think) 384 kbps. That’s the same amount as six phone calls. Most traffic was email and file downloads using FTP. Interactive services usually required a terminal session using Telnet. Dial-up Internet was only just becoming available; most services that you could use from home required you to dial into an Internet-connected computer service using a terminal emulator, and running your mail program and FTP downloads from there; if you wanted to download a file to your computer, you used a file transfer program like Zmodem to suck it down from the service provider’s computer after the FTP download had finished.

So the Internet was still a new thing. We were still trying to get to grips with how things should be done. So far, all the officialness required was being done through the Tuia Society, which was simply not equipped to address interests outside the immediate research and education community. It did, to its credit, recognise that this baton needed to be passed onto a more broad-based organisation. The March meeting was to explore the possibility of creating a New Zealand Internet Society, possibly as a chapter of the international Internet Society.

The technology to record this ground breaking meeting? Pen and paper.

Here, then, are the minutes to that meeting:

At NZNOG 2012, I presented our work on applying point-to-point semantics to Ethernet-like interfaces, described in my earlier post, Broadcast Interface Addressing Considered Harmful.

The slides are available here.

We’ve done a bit more work on this since the original article. One thing that occurred to us was that if you are prepared to keep making ARP requests for a client, you know whether the link is alive or not. In fact, you can ARP for a host even if you’re not really talking to it.

Consider: we have an IP host on a /24, and we tell it that its default gateway is the .1 address of that /24. We answer all ARP requests the host makes, except for its own address (see the earlier paper).

But now instead of one upstream router, we have two. Furthermore, the two routers each use a distinct address in the subnet as their local IP address, i.e. the address they will put in their ARP packets (and in any ICMP packets generated from the interfaces). We still tell the client that its default gateway is the shared .1 address.

The two routers both ARP for the host, so both know whether they can reach it. Between them, via a “back channel” (i.e. a protocol running over the backbone), they agree which of them should be the “active” router for that host.

The active router simply behaves as the upstream router as previously described. The inactive router does nothing more than make ARP requests for the host, and report its availability. This way, if the active router stops participating in the information protocol (i.e. dies), or loses contact with the host while the inactive router can still reach it, the inactive router can take over the active role.

As it does this, it can generate an unsolicited unicast ARP reply to the host, to inform it that the hardware address behind the default gateway address has changed. Other addresses will sort themselves out depending on the host’s ARP caching strategy. Ideally, the client host will have a fairly rapid ARP time-out and will retry its broadcast ARP for any such addresses.
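The takeover rule itself is tiny. Here is a sketch in shell, purely for illustration — next_role and its arguments are my own invention, and all the arping and back-channel plumbing is left out:

```shell
# next_role ACTIVE_OK CLIENT_REACHABLE -> the standby router's next role.
# ACTIVE_OK is "yes" while the active router is alive on the back
# channel and still reports the client reachable; CLIENT_REACHABLE is
# "yes" while our own periodic ARPs of the client are being answered.
next_role() {
    if [ "$1" = no ] && [ "$2" = yes ]; then
        echo active    # take over (and send the unsolicited ARP reply)
    else
        echo standby
    fi
}
next_role yes yes   # prints standby
next_role no yes    # prints active
```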

This approach has advantages over protocols like VRRP. VRRP works by changing the interface MAC address to a “shared” address, so that IP clients don’t know that there has been a change when the active router swaps over. While that makes for a potentially more rapid fail-over, it comes with a number of disadvantages:

  • Moving the shared MAC address requires an update to the MAC tables on layer 2 switches;
  • There is some risk of MAC address collisions, especially in Q-in-Q (stacked VLAN) configurations;
  • The VRRP protocol is visible (multicasted) on the client VLAN.

But the major advantage of this approach is that there is a handshake with the end client. VRRP and similar protocols have no such handshake; they’re fine for detecting and replacing a failed router, but where the failed component is intervening layer 2 infrastructure, VRRP has no way of knowing that the host is not reachable from the active router, but is reachable from the inactive one. For example:

  • Switch X connects to Y, and Y to Z
  • Client C connects to switch Y
  • Client D connects to switch X
  • Router A connects to switch X, and is active for clients C & D
  • Router B connects to switch Z, and is inactive for client C & D

If the link between switches X and Y fails, Router A loses connectivity to Client C. With ARP handshaking, this loss of connectivity is detected and handled by failing over advertisement of Client C’s address to Router B. Furthermore, Client D remains reachable from Router A (and indeed connectivity is lost from Router B), but since each client IP address is processed independently, the active router for Client D does not change.

We believe this is applicable to a number of situations, especially Internet access networks, be they in a data centre or layer-2 metropolitan access networks.

Juha Saarinen dropped me a note a week or two back, asking for an update to my last post for inclusion in NZCS Newsline, in the wake of the IANA IP address pool finally running out and Microsoft’s recently announced successful bid for Nortel Networks’ IP address space.

The published article can be found here, and is different enough from the previous version to warrant re-posting.


The IPocalypse is upon us. There are seven /8 IPv4 address blocks left! Soon there will be six. Then five.

On that fateful day, when the sixth to last /8 block is assigned, the five Regional Internet Registries (RIRs) will receive one each of the remaining five /8s for final allocation. This will probably happen in the next month or two.

Then there will be no more! Oh woe is us!

Or not. There are a bunch of ways that we can measure IP address space usage. They include:

  1. The number of addresses available. Formally, this is 2^32 minus the 588,514,560 addresses (or just over 35 /8 blocks) that are assigned for special uses (multicast, reserved, private addressing, etc.), leaving 3,706,452,736 addresses (the equivalent of just over 220.9 /8 blocks) available for present or future end-user assignment.
  2. The amount of addresses assigned by IANA to RIRs for allocation. Currently, this stands at pretty much all of the above space, less the aforementioned seven /8s (or 117,440,512 addresses).
  3. The amount of address space allocated by RIRs. According to Geoff Huston, this is likely, at current rates of assignment, to run out in mid-late 2011.
  4. The amount of address space that is actually advertised. Right now, a little under 2/3rds of the allocatable address space (that is, excluding private, multicast and reserved address space) is actually advertised to the global routing table. That’s right, 1/3rd of the IP address space is unequivocally dark.
  5. The amount of address space actually allocated to infrastructure. Now things get murky. Is a /8 advertisement actually representing a /8 worth of allocation? Or is the holder of that /8 advertising it simply because they can?
  6. The amount of address space actually in use. This too is largely unmeasurable. Many advertisements, especially smaller ones, are there to achieve multihoming, in which case a /24 may have a very small number of hosts actually assigned to it. The nature of IP address assignment is that you always have to allocate a larger subnet than you plan to use, unless you can do single-IP-address-per-client allocations, e.g. using PPP & friends, my ARP hack or layer-3 VLAN schemes.
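The arithmetic in point 1, and the linear-growth estimate a little further down, are easy to check with shell arithmetic:

```shell
# Point 1: total IPv4 space minus the special-use assignments.
avail=$(( (1 << 32) - 588514560 ))
echo "$avail addresses available"                     # 3706452736
echo "$(( 588514560 / 16777216 )) whole /8s special"  # 35 ("just over 35")
echo "$(( avail / 16777216 )) whole /8s available"    # 220 (the 220.9, truncated)
# And the exhaustion estimate: ~1.3 billion unadvertised addresses
# consumed at ~176 million per year.
echo "$(( 1300000000 / 176000000 )) years, give or take"   # 7
```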

Measurements 1 through 4 are easy; 5 and 6 are hard. All we can say for sure is that each measurement will give a smaller number of addresses in use than the one above it. If an address appears on the global routing table, we can follow it to its associated autonomous system, but beyond that, we have to look at individual addresses, and even then an assigned and in-use address may be behind a firewall or something, and effectively invisible but nonetheless actively in play.

It did occur to me to look at reverse map entries, but experience suggests that these are unhelpful, being fairly universally badly managed.

So, the question of when IP address space will run out remains difficult to answer. Geoff’s IPv4 Address Report shows a curve in address advertisements (fig. 11c) which, although initially exponential, seems to have settled to a linear growth of about 176,000,000 addresses per year in actual advertisements since 2006. If that rate is maintained, the 1.3 billion or so unadvertised addresses should run out in about seven years.

But I suspect that as RIR space becomes unavailable, we’ll start to see address space that is currently advertised but not actually in use being re-allocated (read: sold). For starters, there are about 200 million addresses tied up in non-carrier addresses that are currently advertised as /8s. Admittedly, a goodly chunk of that space may actually be in use, but one suspects that a significant proportion isn’t. There are a lot of equally historical /16 assignments and smaller blocks assigned under multihoming policies that are similarly underutilised, and could shed a large proportion of their advertised allocation as their holders discover it’s worth more to them in someone else’s hands than in their own.

So I’m going to lick my finger and stick it in the wind. I think we have ten years or so before we really, genuinely run out of IPv4 addresses, and that ignores the transition to IPv6 completely. In reality, as IPv4 addresses become scarce (read: expensive), we’ll see folks making do with less and looking harder at IPv6 transition, so I doubt we’ll ever actually run out. Sure, there’s a whole bunch of stuff you can’t do without lots of addresses, but those applications will simply have to go to IPv6.

Don’t get me wrong; I’m not suggesting for a moment that we don’t have to worry. The single thing that will prevent exhaustion is money. Scarce resources have value; the more scarcity, the more value. RIRs have some really hard choices ahead of them; they’re going to be in the firing line to manage the emerging market in IPv4 address space. Pretending that organisations don’t “own” their address space will stop being an option; the court cases haven’t started in earnest yet, but unless the RIRs urgently awake from the fantasy that IP address space is not a tradable asset, they will.

Either they will rise to the challenge, or they’ll be swept into irrelevancy. I rather hope the latter doesn’t happen, because the alternative is anarchy. The best we can hope for is that enough wiser heads prevail to ensure that the emerging IP address bourses have sufficient support to ensure that the fabric of the Internet isn’t torn apart by the conflict between those who long for a non-commercial Internet where everyone plays nice, and the immediate needs of a market where folks need to get stuff done.

This is a picture of my keyboard:

Yes, it’s grubby. And yes, this keyboard really is old enough to not have Windows keys. Actually, it’s about twice that old. Twenty years ago I needed a new keyboard, so I bought a cheap one. (Back then, $200 or more was cheap for a keyboard.) I’m not really sure what I’m going to do when it expires because I’ve never used a keyboard since that I liked. They don’t make keyboards like this any more, with discrete key switches and a distinct tactile click when the key goes down. (Well, they do, but they’re big heavy IBM keyboards that are so noisy they can be heard three blocks away.)

And yes, that key between the Ctrl and Alt keys is labelled “Any”.

The true irony of this is that this key doesn’t actually do anything. No key-code is generated when you press it, so pressing the “Any” key in response to “Press any key to continue” will result in a distinct lack of continuation.

We all have our favourite tech support stories, the “my cup holder is broken” cases, the “it works better if you plug it in” cases. So I wonder how many of us have actually had someone ask where the Any key was?

Once I got called out to look at a printer that apparently wasn’t working. The data plug was upside down. It was a D-shell plug, and they only go in one way, but there it was.

I turned it over and it worked fine.

I know you don’t believe me. I wouldn’t believe me. But it did happen – the male plug was a wee bit bigger than it should have been, and only had a few pins installed which in turn were a bit loose, and the combination of these faults meant it actually went together and seated tightly.

Many, many moons ago, back in the days of serial terminals and multiplexors, the boss came by, saying, “I just had a call from the Auckland office. They say all their terminals are down.” I muttered something unprintable, and wandered into the comms room.

Looking at the multiplexor, I noted the “RA” light flashing. Remote Alarm, meaning the mux couldn’t see the mux at the other end. Probably a comms fault, hardly the first time. Moving up the rack, the NTU on the data circuit to Auckland indicated that it couldn’t see its partner at the other end.

That could just about explain it.

So I ambled off in the direction of the technicians’ office. Back then the telco stuff was handled by the people who looked after the phones, and that meant the techs. So I told Evans, the head tech, of my findings, and he picked up the phone to put through a fault call.

Later that day, I ran into Evans in the corridor. “What’s up with that Auckland circuit?” I asked.

“Oh the fault man went out there. There’s no power.”

“What, to the NTU?”

“Nah, to the building.”

I hate IPv4 link broadcast interface (e.g. Ethernet) addressing semantics.  To recap, if I have two boxes on each end of a point-to-point link (say between a gateway and an end host), we address as follows (for example):

  • Network address (reserved)
  • Host 1 (gateway)
  • Host 2 (end host)
  • Broadcast address.

That’s four IP addresses, for a link to a single host.  Hello?  Haven’t you heard the news?  IP addresses are running out!

Some folks manage to get away with using /31 masks, e.g.

  • Host 1 (gateway)
  • Host 2 (end host)

which is just wrong.  Better in terms of address usage (two addresses instead of four), but still just plain wrong. And you’re still wasting addresses.

The PPP folks a long time ago figured that a session, particularly in client to concentrator type configurations, only needs one IP address. A “point to point” interface has a local address, and a remote address, of which only the remote address needs to be stuffed in the routing table.  The local address can be the address of the concentrator, and doesn’t even need to be in the same subnet.

So why can’t my Ethernet interfaces work the same way?

A point to point link really doesn’t have broadcast semantics.  Apart from stuff like DHCP, you never really need to broadcast — after all, our PPP friends don’t see a need for a “broadcast” address.

Well, we decided we had to do something about this.  The weapon of choice is NetGraph on FreeBSD.  NetGraph basically provides a bunch of kernel modules that can be linked together.  It’s been described as “network Lego”.  I like it because it’s easy to slip new kernel modules into the network stack in a surprising number of places. This isn’t a NetGraph post, so I won’t spend more verbiage on it, but it’s way cool. Google it.

In a real point-to-point interface, both ends of the link know the semantics of the link.  For Ethernet point-to-point addressing, we can still do this (and my code happily supports this configuration), but obviously both ends have to agree to do so. “Normal” clients won’t know what we’re up to, so we have to do this in such a way that we don’t upset their assumptions.

So we cheat. And we lie. And worst of all, we do proxy ARP!

What we do is tell our clients that they are on a /24 network. Their IP address is a single address somewhere in that /24, and the gateway is the .1 address. Any time we get a packet for the client’s address, we’ll send it out that interface, doing ARP as normal to resolve the remote host’s MAC address.

Going the other way, we answer ARP requests for any IP address in the /24, except the client’s own, with our own MAC address.  That means that if they ARP for any other address in the subnet, we’ll answer the ARP request, which directs that packet to us, where we can use our interior routes to route it correctly.  In our world, two “adjacent” IP addresses could be on opposite sides of the network, or on different VLANs on the same interface.

The result is one IP address per customer.  We “waste” three addresses per 256, the network (.0), gateway (.1) and broadcast (.255), and we have to be a bit careful about what we do with the .1 address — it could appear on every router that is playing with that /24.  But we can give a user a single IP address, and put it anywhere in the network.
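The bookkeeping above is easy to verify; a quick tally of what one /24 serves under each scheme:

```shell
# Classic addressing burns a /30 (four addresses) per single-host link;
# the proxy-ARP scheme reserves only .0, .1 and .255 per /24.
echo "classic /30 links per /24:   $(( 256 / 4 ))"   # 64
echo "proxy-ARP customers per /24: $(( 256 - 3 ))"   # 253
```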

We can actually have multiple IP addresses on the same interface; we do this by having the NetGraph module have a single Ethernet interface but multiple virtual point-to-point interfaces.  So if we want to give someone two IP addresses, we can do that as two, not necessarily adjacent, /32 addresses.  We don’t answer ARPs for any of the assigned addresses, but do answer everything else. The module maintains a mapping of point-to-point interface to associated MAC address.
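The ARP answering rule amounts to a few lines of logic. A sketch in shell, purely to state the rule — the real thing lives in the NetGraph node’s C code, and should_answer and the 192.0.2.x placeholder addresses are my own:

```shell
# should_answer TARGET ASSIGNED... -> "yes" if we should reply with our
# own MAC address: answer for everything in the /24 except addresses
# assigned to point-to-point interfaces (including the requester's own).
should_answer() {
    target=$1; shift
    for a in "$@"; do
        [ "$target" = "$a" ] && { echo no; return; }
    done
    echo yes
}
should_answer 192.0.2.7 192.0.2.5 192.0.2.9   # yes: route it via us
should_answer 192.0.2.5 192.0.2.5 192.0.2.9   # no: an assigned client address
```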

Don’t shout at your disk drives. Seriously.  They don’t like it.  They sulk.

Brendan Gregg, of the Sun Microsystems Fishworks engineering team, has written up this effect, with video, at http://blogs.sun.com/brendan/entry/unusual_disk_latency

Moreover, don’t vibrate your drives.  Why am I saying this?

Because, three months ago we took delivery of three 1U pizza boxes. They’re small Supermicro boxes, with room for a normal ATX motherboard and a hard drive.  We equipped these with terabyte drives, fairly normal Supermicro motherboards, 3 GHz Core2 Duo CPUs and 8GB memory each.

They just didn’t run right.  Occasionally, one wouldn’t even make it through an OS install, and the ones that did wouldn’t put through as much work as a much lower spec machine.

We suspected the drives; we suspected the power supply.  Actually, we really thought it was the power supply, but even though the PSUs on these chassis were small, and the 12V rail seemed to be running slightly low, at 11.85V, no amount of bashing the numbers suggested that the systems were actually underpowered.

The first breakthrough was running “hdparm -t --direct /dev/sda” on the drive, which showed wildly fluctuating numbers, consistent with the behaviour we were seeing.  So it was something to do with the disk subsystem.
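A crude way to quantify “wildly fluctuating” is to look at the spread over a handful of runs; a sketch, with made-up throughput figures for illustration:

```shell
# spread N... -> print the min and max of a list of throughput figures
# (MB/s); a wide spread points at an environmental problem rather than
# a steady bottleneck.
spread() {
    min=$1; max=$1
    for v in "$@"; do
        [ "$v" -lt "$min" ] && min=$v
        [ "$v" -gt "$max" ] && max=$v
    done
    echo "min=${min} max=${max}"
}
spread 98 101 100 99 102    # healthy drive: min=98 max=102
spread 12 95 40 101 7       # our vibrating drive: min=7 max=101
```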

The next breakthrough was when we discovered that if we unplugged the chassis fan (an ugly centrifugal thing) from the motherboard, the problem went away.  The hdparm numbers stabilised at 100MB/s or more.

We saw small changes in power supply volts when we did this, so we were still suspecting the power supply.  I put an ammeter on the fan power line, to see how much power the fan was pulling.  1.2A at full speed.

We played with the fan speed in the BIOS; at its lowest speed, it would pull 0.25A, and the drive would perform well; at the “server” setting, with the server otherwise unloaded, it would pull about 0.6A.  At that rate, it was starting to have an effect on performance.

This was a PSU that was supposed to be able to deliver 18A on the 12V rail, and 260W total.  I really couldn’t see how the 12V would be at the edge when the PSU was pulling less than 100W (measured at the AC feed) and was running three fans and a hard drive and a few minor bits and pieces like the serial port and network interface, all of which should have summed to maybe 5A.  The numbers didn’t add up.

Finally, I had a brainwave.  I removed the fan from the chassis, still running.  The problem went away.  I touched the fan to the drive.  The drive throughput dropped through the floor.

After a few more experiments, the conclusion is that with the fan mounted close to the drive, the vibrations were enough to upset the performance of the drive, consistently.  Two different terabyte drives (one Seagate, one Western Digital) exhibited the same problem.

I duplicated this by applying abnormal vibration to the case of my desktop PC (half-terabyte Seagate), and even the grotty little thing I have at home (a Seagate 160GB PATA drive).

Conclusion: all modern drives are subject to potentially serious performance issues when faced with abnormal vibration.  The Supermicro chassis exacerbated the problem because of the placement of the fan with respect to the drive, and the fact that the drive is mounted directly to the chassis.  Also, the placement of cables up against the fan meant that vibrations were being transferred directly through the connectors from the fan; something that could be partially alleviated by re-routing the power cable under the fan.

The fact that right angle SATA power connectors are so darned hard to get made this more of an issue than it should have been.

I think a bit of judicious use of closed-cell foam packing, turning the fan speed down, and re-routing cables away from the fan will finally solve the problem.


The following is a technique I’ve used over the last decade or so for distributing web traffic (or potentially any service) across multiple servers, using just DNS.  Being an old DNS hack, I’ve called this technique Poor Man’s Anycast, although it doesn’t really use anycasting.

But before we get into the technique, we need to make a brief diversion into the little-known but rather neat feature of the DNS, or more accurately, DNS forwarders, which makes this a cool way to do stuff. The feature is name server selection.

Most DNS clients, and by this I include your home PC, make use of a DNS forwarder.  The forwarder is the thing that handles (and caches) DNS requests from end clients, while a DNS server carries authoritative information about a limited set of domains and only answers queries for them. These two functions have historically been conflated rather severely, mainly due to the use of BIND for both, and why this is a bad thing is the subject for a whole other post.

Moving right along. A DNS forwarder gets to handle lots of queries for any domain that its clients ask for. When you ask for foo.example.net, it asks one of the root servers (a.root-servers.net, b.root-servers.net et al) for that full domain name (let’s assume it’s just come up and doesn’t have anything cached).  It gets back a delegation from the root servers, saying basically, “I don’t know, but the GTLD (.com, .net) servers will”, and telling you where to find the GTLD servers (a.gtld-servers.net et al).  You ask one of the GTLD servers, and get back an answer that says that they don’t know either, but ns1.example.net and ns2.example.net do.

You then ask (say) ns1.example.net, and hopefully you’ll get the answer you want (e.g. the IP address).

Now, along the way, the forwarder has been caching everything it got. Every time it asks a name server for data, it stores the time it took to reply. That means that when looking up names in example.net, the forwarder has been collecting timing and reliability data which it uses to choose which name server to ask next time, as well as the answers it received.  So if ns1.example.net answers in 20 ms, but ns2.example.net answers in 10 ms, roughly two thirds of the queries for something.example.net will be sent to ns2.example.net. If the timing difference is much greater, the split of queries will be even more marked. Similarly, if a name server fails to respond at all, that fact will be reflected in the accumulated preference assigned to that server, and it will get very few queries in future; just enough so that we know we can start sending it queries again when it comes back.
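The selection behaviour just described can be modelled in a few lines.  What follows is a toy sketch, not BIND’s (or any real resolver’s) actual algorithm: it keeps a smoothed round-trip time per name server and weights its random choice by the inverse of that figure, so a 10 ms server ends up with roughly twice the queries of a 20 ms one, and a timeout is booked as a very slow answer so the server is strongly de-preferred but still occasionally re-probed.

```python
import random

class ForwarderSelector:
    """Toy model of RTT-based name server selection by a DNS forwarder."""

    def __init__(self, servers, alpha=0.3):
        self.alpha = alpha                      # smoothing factor
        self.srtt = {s: 0.05 for s in servers}  # start with a 50 ms guess

    def record(self, server, rtt):
        # Exponentially smooth the observed round-trip time.
        self.srtt[server] = (1 - self.alpha) * self.srtt[server] + self.alpha * rtt

    def record_timeout(self, server, penalty=1.0):
        # A non-answer counts as a one-second "reply", so the server
        # gets very few queries until it starts answering again.
        self.record(server, penalty)

    def choose(self):
        # Weight each server by the inverse of its smoothed RTT, then
        # pick one at random in proportion to those weights.
        weights = {s: 1.0 / srtt for s, srtt in self.srtt.items()}
        total = sum(weights.values())
        r = random.uniform(0, total)
        for s, w in weights.items():
            r -= w
            if r <= 0:
                return s
        return s
```

Feed it two servers at 20 ms and 10 ms and the faster one settles at roughly two thirds of the queries, matching the split described above.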

This is a powerful effect, and is of particular use when distributing servers over a wide geographical area. DNS specialists know about it, because poor DNS performance affects everything, and DNS people don’t like adversely affecting everything. (They’re really quite paranoid about it. Trust me, I’m one.) But it can also be used to pick the closest server for other things as well.

After all, closeness (in terms of round-trip time) is very important in network performance (see my post on bandwidth delay products).

The technique is as follows.  Let’s say we have three web servers, carrying static content. Call them, say, auckland.example.net, chicago.example.net and london.example.net. Let’s say that they’re widely dispersed geographically. All three servers carry content for http://www.example.com/.

So, we start by configuring, on the example.com name servers:

$ORIGIN example.com.
$TTL 86400
www     IN      NS      auckland.example.net.
        IN      NS      chicago.example.net.
        IN      NS      london.example.net.

We then run a DNS server on all three web servers.  We configure the servers with a zone for www.example.com along the lines of:

$ORIGIN www.example.com.
$TTL 86400                       ; Long (24 hour) TTL on NS records etc
@       IN      SOA     auckland.example.net. webmaster.example.com. (
                                2009112900 3600 900 3600000 300 )
        IN      NS      auckland.example.net.
        IN      NS      chicago.example.net.
        IN      NS      london.example.net.
$TTL 300                         ; Short (five minute) TTL on A record
@       IN      A ; Set this to host IP address

Now the key is that each web server serves up its own IP address. When a DNS forwarder makes a query for www.example.com, it will be directed to one of auckland.example.net, chicago.example.net or london.example.net. But as more and more queries get made, one of those three will start handling the bulk of the queries, at least if that one is significantly closer than the other two. And if auckland.example.net gets the query, it answers with its own IP address, meaning that it also gets the subsequent HTTP request or other services directed to it. The short DNS TTL (5 minutes in the example) means that the address gets queried moderately often, allowing the name server selection to “get up to speed”. Much longer TTLs on the name servers mean the data doesn’t get forgotten too quickly.

The result is that in many cases, the best server gets the request.

The technique works best if there are lots of domains being handled by the same set of servers, and there are lots of requests coming through. That way the preferences get set quickly in the major ISPs’ DNS forwarders. The down side of the technique is that far away servers will still get some queries. This non-determinism may be a reason for not deploying this technique.  If you want determinism, you’ll need to look at more industrial grade techniques.

Now, this isn’t what players like Akamai do, and it isn’t what anycasting is about. Akamai and (some) other content distribution networks work by maintaining a map of the Internet, and returning DNS answers based on the requester’s IP address. But this is a fairly heavyweight answer to the problem. It’s not something you can implement with just BIND alone.

Anycasting on the other hand relies on advertising the same IP address in multiple places, and letting BGP sort out the nearest path. This has three disadvantages:

  1. It potentially breaks TCP. If there are equal-cost paths to a given anycast node, it’s possible one packet from a stream might go one way, while the next packet might be sent to a completely different host (at the same IP address). In practice, this has proven to be less of a problem than might be expected, but there is still scope for surprises.
  2. Each of your nodes has to be separately BGP peered with its upstream network(s). That’s a lot more administration than many ISPs will do for free.
  3. Most importantly, being close in BGP terms is not the same as being close physically or in terms of round-trip time. Many providers have huge reach within a single AS, so a short AS-path (the main metric for BGP) may actually be a geographically long distance, with a correspondingly long round-trip.

The other nice thing about poor man’s anycast is that it’s dynamic; if a node falls off the world, as long as its DNS goes away too, it’ll just disappear from the cloud as soon as the TTLs time out. If a path to it gets congested, name server selection will notice the increased round-trip time and de-prefer that server.
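That failover behaviour falls straight out of ordinary TTL handling.  A minimal sketch of a DNS-style cache (hypothetical, not any real resolver’s code) shows why a dead node stops being handed out within one TTL:

```python
import time

class TtlCache:
    """Minimal DNS-style cache: each answer expires after its TTL."""

    def __init__(self):
        self.store = {}

    def put(self, name, value, ttl):
        # Remember the answer and the absolute time it stops being valid.
        self.store[name] = (value, time.monotonic() + ttl)

    def get(self, name):
        # An expired entry is as good as no entry: the forwarder must
        # re-query, and a dead server's answer simply never comes back.
        value, expires = self.store.get(name, (None, 0.0))
        return value if time.monotonic() < expires else None
```

With the 300-second A-record TTL from the zone above, a node that dies (taking its name server with it) stops being handed out within five minutes, with no operator action at all.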

And of course you don’t need to be a DNS or BGP guru, or buy/build expensive, complex software systems to set it up.