Archive for January, 2008

The best kind of love

January 31st, 2008

Picking up on Dave’s recent post, I’d like to spend a moment discussing love.

My mother tells the story of a day when I was being a kid, getting dirty playing in the garden as she and her own mother watched. Suddenly I just put my fun-having on hold, ran over to Gran, and gave her a giant muddy hug. She turned to my mother and said, “that’s the best kind of love, when you don’t even have to ask for it.”

Back in grown-up land, a guy named Neil Watson has been hanging out for a while on the OpenNMS discussion list, asking and answering questions, and generally being part of the community. Neil is not a network management guru, but he’s displayed a keen interest in OpenNMS and has always been appreciative of the help he’s received and willing to give back. That alone is good enough to call love, especially when contrasted with some of the help vampires who have plagued our community over the years.

Neil didn’t stop at giving back on the mailing list, though. He’s written a very nice and accessible review of OpenNMS 1.3.9. Nobody on the mailing list asked him to do this, and as far as I can tell his employer did not pay him to do it. He just did it. It’s like a big unsolicited hug for the OGP and everyone else who loves OpenNMS, and we didn’t even have to get muddy.

So thanks, Neil. If I ever get to Toronto, I owe you a beer and a big hug.

OpenNMS, Software

Why buy support for free software?

January 23rd, 2008

Why should you consider purchasing a support contract for OpenNMS or another true open-source product? The answer may surprise you.

My employer, The OpenNMS Group, maintains OpenNMS as a truly and unquestionably open-source project. The software is really free, as in beer, speech, and freedom — we do not sell an enterprise version with more features. Every penny of my paycheck represents revenue from the sale of support contracts, professional services, and training for OpenNMS, which we offer at darned reasonable prices.

One of our support customers, a service provider in the United Kingdom, reported a problem last Friday with their OpenNMS server. The OpenNMS daemons had stopped unexpectedly, and after our customer’s staff had started them back up, the daemons stopped again after a few minutes. This happened repeatedly, so we had a fairly hot support ticket on our hands. It turned out that the customer had added RAM to the OpenNMS server at our suggestion (the amount of stuff they are now managing with OpenNMS is larger than they initially expected), and it was immediately after bringing up the server with the new memory that the daemons started crashing.

In a previous life working on commercial software, if a ticket like this had gotten escalated to me I would have requested some logs and recommended that the customer run an exhaustive memory test on the server, since a bad memory module can easily cause crashes. In this case, I requested the memory test. Then I logged in to the customer’s server and set to work looking at the system. After bad RAM, my second suspicion was that the daemons were trying to breathe more deeply in the newly expanded physical memory, but were running out of heap space since the Java VM under which they were running was constrained to a maximum heap size that was appropriate for the pre-upgrade system. As soon as the workday was over in the customer’s timezone, I adjusted the maximum heap size setting and restarted the daemons. I continued to monitor the system throughout the weekend even though this customer does not pay for 24×7 support. The good news was that the daemons were no longer crashing after just a few minutes. The bad news was that they were instead stopping after a few hours.
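For readers who have not run into this limit: a Java VM will not grow its heap past a ceiling fixed at startup (commonly set with the -Xmx option), no matter how much RAM the machine has. A minimal sketch for checking the ceiling from inside a running JVM follows. The class name HeapReport is made up for illustration and is not part of OpenNMS.

```java
// Minimal sketch: report the heap ceiling this JVM was started with.
// HeapReport is a hypothetical name, not an OpenNMS class.
final class HeapReport {
    static long maxHeapMegabytes() {
        // maxMemory() returns the most memory the JVM will attempt to use, in bytes.
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }
}
```

If the figure this returns is far below the machine's physical memory, the extra RAM is not available to the Java heap until the startup setting is raised.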

On Monday morning, I came back to this ticket in earnest. Logging in again and looking closely at the OpenNMS logs, I saw no telltale signs of bad memory being the culprit. In fact, the logs painted a picture of the daemons having been shut down in a somewhat orderly fashion — shutdown hooks were being called, bean contexts were being destroyed, and resources were being freed. This is not what a memory-related daemon crash looks like at all. I saved the logs and started the daemons again, and went about my morning, checking back periodically to see whether the daemons were still running.

An hour or so later on our daily scrum call, I brought up this issue. Everybody agreed that the scenario was odd, and I got a couple of additional pairs of eyeballs committed to have a look. Dave stuck his head in and quickly found a message in the system’s kernel logs indicating that something called oom_killer had actually sent a shutdown signal to the OpenNMS daemons. That explained the orderly shutdown I had observed in the daemon logs, but neither of us was familiar with the OOM Killer. It turns out that this dreaded beast comes into play when the Linux kernel is Out Of Memory (hence OOM), sacrificing a running process in order to free up enough resources for the rest of the system to continue running. But why was the system running out of memory? After the memory upgrade, it should have had plenty of RAM.

It turns out that the server was exhausting its physical memory for short periods when OpenNMS detected critical outages and queued up e-mail notifications. When physical memory is exhausted, the kernel turns to paging for a respite, using swap space specially allocated on the system’s disks as a place to “page out” some less critical processes that are using the physical memory that the system needs to handle the tasks at hand. Most of the time this strategy works very well, but the customer had not adjusted the amount of swap space allocated on the server’s disks to compensate for the just-added physical memory. The rule of thumb is that a server should have an amount of swap space equal to twice the size of its physical memory, but now this server’s swap size was just 25% of its physical memory size. This comparatively tiny amount of swap was quickly exhausted, but the kernel needed to make more room, so it dispatched the OOM Killer to assassinate a process whose demise would free up a ton of virtual memory. That unlucky process was the Java VM under which the OpenNMS daemons were running.
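The arithmetic behind that rule of thumb can be sketched in a few lines. This is only an illustration: the class name SwapCheck is hypothetical, and it assumes input in the format of Linux's /proc/meminfo, where MemTotal and SwapTotal are reported in kB.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares swap size with physical memory,
// using text in the format of Linux's /proc/meminfo.
final class SwapCheck {
    private static long readKb(String meminfo, String key) {
        Matcher m = Pattern.compile("(?m)^" + Pattern.quote(key) + ":\\s+(\\d+)\\s+kB")
                .matcher(meminfo);
        if (!m.find()) {
            throw new IllegalArgumentException("missing " + key);
        }
        return Long.parseLong(m.group(1));
    }

    // Ratio of swap to physical memory; 2.0 matches the rule of thumb.
    static double swapRatio(String meminfo) {
        return (double) readKb(meminfo, "SwapTotal") / readKb(meminfo, "MemTotal");
    }

    // Extra swap (in kB) needed to reach twice physical memory, never negative.
    static long swapShortfallKb(String meminfo) {
        long wanted = 2 * readKb(meminfo, "MemTotal");
        return Math.max(0, wanted - readKb(meminfo, "SwapTotal"));
    }
}
```

On a real system the input string could be read from /proc/meminfo. A ratio of 0.25 corresponds to the situation described above.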

Fairly confident that we had a handle on the problem, I suggested to Dave that adding swap space on the customer’s server was a good next step. He agreed, and since for this customer we manage the server and operating system as well as the OpenNMS installation, I brought online additional swap to bring the system in line with the twice-physical-memory guideline. The daemons have not crashed since, and I’m crossing my fingers that clicking the “Publish” button on this blog post will not cause them to fall over :)

Now, ask yourself whether you would expect this level of service from a commercial software vendor. In a previous life at such a company, my approach to this same issue would have been very different. I would have requested logs from the customer and, after spotting the signs of an orderly shutdown in the log messages, left the customer on his own to track down the OOM Killer’s involvement in the problem. The ticket would have dragged on for weeks or months. Managers and directors would have wrung their hands in one weekly meeting after another at the pernicious red-highlighted row in a spreadsheet. The customer would have become disgruntled but been powerless to do anything besides escalate the issue, because as much as they might want to cancel their product maintenance (which typically costs from 17% to 25% of the list price of the software), the vendor would cut them off from getting new product versions if they did so.

Because the software that we support at The OpenNMS Group is available to anybody free of charge, we cannot say to a customer, “oh, you didn’t pay your support bill, so you can’t have the new version of the product.” Therefore our support has to go far beyond what most people expect from support that costs five or ten times as much. We also cannot build a revenue model around charging for support according to the number of nodes, interfaces, or agents being managed, or even according to the number of OpenNMS servers that are running. Everybody who buys a given level of support pays the same amount for that support.

So why buy support for free software? Because it’s very likely that you will get much more and much better support for your money from a company that does not have a software license revenue cash cow, and has therefore figured out how to transform support from a “cost center” to the star of its show.

OpenNMS, Software

Injections and reflections and web apps, oh my!

January 19th, 2008

I love application security. It has always been a part of my work, whether as an explicit part of my job or as something my curious nature just couldn’t resist spilling over into.

I hate tickingclock.gif. A long time ago, there was a horrible animated GIF image of a ticking clock in the web interface of my employer’s main product. When you ran a report, you saw the ticking clock. There was no progress indicator, no cancel button, just the ticking clock.

In the 5.0 release, engineering added a Cancel button, and there was much rejoicing. The cancel button was just the submit button of an HTML form that held in a hidden field the PID of the running report process, which would be sent a kill signal on the server. Of course, since the PID was unchecked on the server side, anybody who had a little cunning and permission to run reports could kill any process running as NH_USER, including the server processes themselves. Oops. They fixed that issue a year or so later.
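One way to close that kind of hole is to have the server remember which report processes each session started and refuse to act on any PID it did not hand out itself. The sketch below is an assumption about how such a check could look, not a description of how the product was actually fixed. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry: the server remembers which report PIDs each
// session started, instead of trusting a PID sent back by the browser.
final class ReportRegistry {
    private final Map<String, Set<Long>> pidsBySession = new ConcurrentHashMap<>();

    // Called by the server when it launches a report for a session.
    void register(String sessionId, long pid) {
        pidsBySession.computeIfAbsent(sessionId, k -> ConcurrentHashMap.newKeySet()).add(pid);
    }

    // True only if this session started the process. The entry is removed,
    // so the same PID cannot be cancelled twice.
    boolean mayCancel(String sessionId, long pid) {
        Set<Long> owned = pidsBySession.get(sessionId);
        return owned != null && owned.remove(pid);
    }
}
```

The Cancel handler would send the kill signal only when mayCancel returns true, so a forged PID in the hidden field gets the attacker nowhere.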

Fast-forward to the present. A good samaritan notified my new employer of an SQL injection vulnerability in the asset information part of the OpenNMS web app. We were building SQL queries directly out of data received from the web UI, which is high on the list of ways to get your application cracked. Remember how I said I can’t resist app security? I spent a few hours changing all the code in the web app AssetModel classes to use prepared statements and parameter binding, which sends user-supplied values to the database separately from the SQL text, so you get safe handling of SQL special characters for free. It also gives a performance boost in some situations.
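For anyone who has not used them, here is roughly what a prepared statement with a bound parameter looks like in JDBC. This is a minimal sketch, not the actual AssetModel code: the class name, table name, and column names are invented for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the table and column names are made up for this sketch
// and are not taken from the OpenNMS schema or its AssetModel classes.
final class AssetLookup {
    // The query text is fixed; the "?" is a placeholder for the bound value.
    static final String QUERY = "SELECT nodeid FROM assets WHERE manufacturer = ?";

    static List<Integer> findNodeIds(Connection conn, String manufacturer) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(QUERY)) {
            // The value is passed as data and never concatenated into the SQL.
            stmt.setString(1, manufacturer);
            try (ResultSet rs = stmt.executeQuery()) {
                List<Integer> ids = new ArrayList<>();
                while (rs.next()) {
                    ids.add(rs.getInt(1));
                }
                return ids;
            }
        }
    }
}
```

The vulnerable pattern is the same query built with string concatenation, where a value such as x' OR '1'='1 becomes part of the SQL itself.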

I knew at the time I was doing these fixes that it would be impossible to plug every hole. The web app is something we plan to redo from whole cloth in the next year or two, so this is work that will be thrown away. As if to remind me of these twin facts, I got a follow-up e-mail from our good samaritan just moments after checking in the SQL injection fixes. He had now found a reflected cross-site scripting (XSS) vulnerability. I was fried and my wife had come home, so I shelved the XSS problem until this morning.

Reflected XSS vulnerabilities are the less nasty kind (the nastier being the stored variety), and their potential for wreaking havoc is sometimes difficult to see at first. An attacker has to be somewhat crafty in order to get a victim to bite, but complacency and e-mails composed in 24-point purple MS Comic Sans font (with importance set to high, please) make the attacker’s job easier. Once the victim clicks on the attacker’s link (and logs in if not already authenticated) the fun can begin. It’s almost no work at all to craft a URL that sends the user’s session ID cookie to a drop-box where the attacker can retrieve it, allowing him to hijack the victim’s web app session. If the victim is assigned administrative privileges, it’s game over.

Fixing a reflected XSS vulnerability is harder than the technical part of exploiting one. When I started banging on this bug this morning, I did a proof-of-concept fix that I felt confident would work: I escaped the offending parameter string before putting it into the exception that our code throws when it’s unable to parse the string as an integer. Still, the alert pop-up that I had crafted into my test URL came right back. Why didn’t that work? I did a clean build and a fresh new install and tried again. Still no change. Another cup of coffee helped. The problem was that the JavaScript encoded into one of the URL parameters was being reflected not once but twice. Even though I had escaped the dirty data before putting it into my exception’s message, the next exception up the stack contained the dirty data verbatim. Building a nice tailored exception class and a corresponding error page fixed this particular XSS issue, and had the added benefit of making it look much nicer when a typo or an old link triggered the same exception-handling code that the XSS attack targeted.
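The escaping step itself is simple. The hard part, as described above, is making sure every place that echoes the value applies it. Here is a minimal sketch of HTML-escaping untrusted text. The class and method names are hypothetical and this is not the code that went into OpenNMS.

```java
// Minimal sketch of HTML-escaping untrusted text before echoing it into a page.
// HtmlEscaper is a hypothetical name, not an OpenNMS class.
final class HtmlEscaper {
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Hypothetical message builder: every layer that echoes the value must escape it.
    static String badParameterMessage(String name, String rawValue) {
        return "Parameter '" + escape(name) + "' is not a valid integer: " + escape(rawValue);
    }
}
```

Escaping at the point where the error page renders the message, rather than only where the first exception is built, avoids the double-reflection trap.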

Almost every web app vulnerability exploits the fact that some code uses data received from the browser without validating the data first. I went through the remainder of the web app code and replaced about a million Integer.parseInt method calls with an equivalent method that sanitizes the data first, removing anything that’s not a decimal digit. There were also a dozen or so Long.parseLong calls and a handful of Double.parseDouble calls, all of them now using safe equivalent methods. I also ran across one more SQL injection vulnerability that I had previously missed because the offending code was hidden in inner classes instead of being in plain sight in its own package.
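A sanitizing stand-in for Integer.parseInt might look something like the sketch below. The class name SafeParse and the exact behavior are assumptions for illustration, not the methods actually added to the web app. Note the trade-off: because everything except decimal digits is dropped, a leading minus sign is dropped too, so this only suits values that are never negative.

```java
// Hypothetical stand-in for Integer.parseInt on request parameters.
final class SafeParse {
    // Drops every character that is not an ASCII decimal digit, then parses.
    // Throws NumberFormatException if nothing is left or the value overflows int.
    static int parseInt(String raw) {
        if (raw == null) {
            throw new NumberFormatException("null input");
        }
        String digits = raw.replaceAll("[^0-9]", "");
        return Integer.parseInt(digits);
    }
}
```

Whatever the attacker puts in the parameter, only digits can reach the rest of the code, and input with no digits at all fails with an exception whose message contains none of the original text.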

There are really two lessons from all of this, I think. First, never trust data received from the user. That’s elementary, but too often overlooked by even good programmers who just want to finish a project. Second, never fall into the trap of thinking you’ve plugged all the holes. There are always more out there.

OpenNMS, Security, Software