Why should you consider purchasing a support contract for OpenNMS or another true open-source product? The answer may surprise you.
My employer, The OpenNMS Group, maintains OpenNMS as a truly and unquestionably open-source project. The software is really free, as in beer, speech, and freedom — we do not sell an enterprise version with more features. Every penny of my paycheck represents revenue from the sale of support contracts, professional services, and training for OpenNMS, which we offer at darned reasonable prices.
One of our support customers, a service provider in the United Kingdom, reported a problem last Friday with their OpenNMS server. The OpenNMS daemons had stopped unexpectedly, and after our customer’s staff had started them back up, the daemons stopped again after a few minutes. This happened repeatedly, so we had a fairly hot support ticket on our hands. It turned out that the customer had added RAM to the OpenNMS server at our suggestion (the amount of stuff they are now managing with OpenNMS is larger than they initially expected), and it was immediately after bringing up the server with the new memory that the daemons started crashing.
In a previous life working on commercial software, if a ticket like this had gotten escalated to me I would have requested some logs and recommended that the customer run an exhaustive memory test on the server, since a bad memory module can easily cause crashes. In this case, I requested the memory test. Then I logged in to the customer’s server and set to work looking at the system. After bad RAM, my second suspicion was that the daemons were trying to breathe more deeply in the newly expanded physical memory, but were running out of heap space since the Java VM under which they were running was constrained to a maximum heap size that was appropriate for the pre-upgrade system. As soon as the workday was over in the customer’s timezone, I adjusted the maximum heap size setting and restarted the daemons. I continued to monitor the system throughout the weekend even though this customer does not pay for 24×7 support. The good news was that the daemons were no longer crashing after just a few minutes. The bad news was that they were instead stopping after a few hours.
On Monday morning, I came back to this ticket in earnest. Logging in again and looking closely at the OpenNMS logs, I saw no telltale signs of bad memory being the culprit. In fact, the logs painted a picture of the daemons having been shut down in a somewhat orderly fashion — shutdown hooks were being called, bean contexts were being destroyed, and resources were being freed. This is not what a memory-related daemon crash looks like at all. I saved the logs and started the daemons again, and went about my morning, checking back periodically to see whether the daemons were still running.
An hour or so later on our daily scrum call, I brought up this issue. Everybody agreed that the scenario was odd, and I got a couple of additional pairs of eyeballs committed to have a look. Dave stuck his head in and quickly found a message in the system’s kernel logs indicating that something called oom_killer had actually sent a shutdown signal to the OpenNMS daemons. That explained the orderly shutdown I had observed in the daemon logs, but neither of us was familiar with the OOM Killer. It turns out that this dreaded beast comes into play when the Linux kernel is Out Of Memory (hence OOM), sacrificing a running process in order to free up enough resources for the rest of the system to continue running. But why was the system running out of memory? After the memory upgrade, it should have had plenty of RAM.
It turns out that the server was exhausting its physical memory for short periods when OpenNMS detected critical outages and queued up e-mail notifications. When physical memory is exhausted, the kernel turns to paging for a respite, using swap space specially allocated on the system’s disks as a place to “page out” some less critical processes that are using the physical memory that the system needs to handle the tasks at hand. Most of the time this strategy works very well, but the customer had not adjusted the amount of swap space allocated on the server’s disks to compensate for the just-added physical memory. A common rule of thumb is that a server should have an amount of swap space equal to twice the size of its physical memory, but now this server’s swap size was just 25% of its physical memory size. This comparatively tiny amount of swap was quickly exhausted, but the kernel still needed to make more room, so it dispatched the OOM Killer to assassinate a process whose demise would free up a ton of virtual memory. That unlucky process was the Java VM under which the OpenNMS daemons were running.
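For readers who want to check their own servers against that rule of thumb, here is a minimal sketch in Python. It is not what we ran on the customer's system; it simply assumes a Linux host whose /proc/meminfo reports MemTotal and SwapTotal in kB, and the function names and the default target ratio of 2.0 are my own choices for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Fields such as MemTotal and SwapTotal are reported by the kernel in kB.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_ratio(meminfo_text):
    """Return configured swap as a fraction of physical memory."""
    info = parse_meminfo(meminfo_text)
    return info["SwapTotal"] / info["MemTotal"]


def swap_shortfall_kb(meminfo_text, target_ratio=2.0):
    """Return how many kB of swap are missing to reach target_ratio times RAM.

    The default of 2.0 is the twice-physical-memory guideline from the text;
    it is an assumption, not a kernel requirement. Returns 0 if swap is enough.
    """
    info = parse_meminfo(meminfo_text)
    wanted = int(info["MemTotal"] * target_ratio)
    return max(0, wanted - info["SwapTotal"])
```

On a Linux box you could feed these functions the contents of /proc/meminfo, for example `swap_shortfall_kb(open("/proc/meminfo").read())`. A server in the state described above would show a ratio of about 0.25 and a shortfall of roughly 1.75 times its physical memory.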
Fairly confident that we had a handle on the problem, I suggested to Dave that adding swap space on the customer’s server was a good next step. He agreed, and since for this customer we manage the server and operating system as well as the OpenNMS installation, I brought additional swap online to bring the system in line with the twice-physical-memory guideline. The daemons have not crashed since, and I’m crossing my fingers that clicking the “Publish” button on this blog post will not cause them to fall over.
Now, ask yourself whether you would expect this level of service from a commercial software vendor. In a previous life at such a company, my approach to this same issue would have been very different. I would have requested logs from the customer, and once I had spotted the appearance of an orderly shutdown in the log messages, the customer would have been on their own to track down the OOM Killer’s involvement in the problem. The ticket would have dragged on for weeks or months. Managers and directors would have wrung their hands in one weekly meeting after another at the pernicious red-highlighted row in a spreadsheet. The customer would have become disgruntled but been powerless to do anything besides escalate the issue, because as much as they might want to cancel their product maintenance (which typically costs from 17% to 25% of the list price of the software), the vendor would cut them off from getting new product versions if they did so.
Because the software that we support at The OpenNMS Group is available to anybody free of charge, we cannot say to a customer, “oh, you didn’t pay your support bill, so you can’t have the new version of the product.” Therefore our support has to go far beyond what most people expect from support that costs five or ten times as much. We also cannot build a revenue model around charging for support according to the number of nodes, interfaces, or agents being managed, or even according to the number of OpenNMS servers that are running. Everybody who buys a given level of support pays the same amount for that support.
So why buy support for free software? Because it’s very likely that you will get much more and much better support for your money from a company that does not have a software license revenue cash cow, and has therefore figured out how to transform support from a “cost center” to the star of its show.