Vendors, Open Source, and Hypocrisy
A couple weeks ago, one of our support customers requested help in configuring OpenNMS to collect performance data from a network storage server in their environment. I was not familiar with the storage server vendor, but the vendor’s web site touts their rating among the fastest-growing tech companies in the U.S. The vendor’s MIB was good and provided plenty of useful objects. With the data collection definitions in place, we restarted OpenNMS, discovered one of the storage servers, and… nothing. Data collection failed, and we started seeing some new SNMP-related messages in the logs:
ERROR [DefaultUDPTransportMapping_127.0.0.1/0] org.snmp4j.MessageDispatcherImpl: java.io.IOException: Only 32bit unsigned integers are supported at position 52
Anybody who does enterprise management for a living knows that there are plenty of lame SNMP agents out there. I did a few walks and learned that the storage server is running a Linux kernel and is using version 5.2.1.2 of the open-source UCD-SNMP agent. That’s a pretty old version of Net-SNMP, so I was not too surprised that it was giving us trouble. After taking some packet traces from the customer’s system and spending some time with Wireshark, the SNMP4J source code, and William Stallings’ SNMP, SNMPv2, SNMPv3, and RMON 1 and 2, Third Edition, I tracked down the exact problem. I’ll just quote my notes from the ticket here, redacting to protect the guilty.
This device is definitely exhibiting buggy SNMP protocol behavior that is preventing OpenNMS from collecting interface statistics from it. This reply and the several comments that precede it should be adequate documentation to open a bug report with the vendor.
…
By manually decoding the varbinds in this response PDU, one can see the problem. Starting at the 52nd octet in the dump we see the value of the first varbind (for ifInOctets.3) described:
41:05:01:0f:1c:a5:26
The first two octets (41:05) identify the type (Counter, tag 0x41) and the encoded length (5). The next five octets (01:0f:1c:a5:26) should encode the actual value of the counter. The problem is that a counter is defined as a 32-bit unsigned integer. All integers (regardless of signedness) are stored as two’s-complement according to the ASN.1 BER, so a 32-bit unsigned value only needs a fifth octet when its high bit is set, and that extra leading octet must be 00. Here the leading octet is 01, which puts the value beyond 32 bits. A quick inspection of the SNMP4J code confirms the issue, in file org/snmp4j/asn1/BER.java:
public static final long decodeUnsignedInteger(BERInputStream is, MutableByte type)
    throws IOException
/* snippage - jeffg was here */
  // check for legal uint size
  int b = is.read();
  if ((length > 5) || ((length > 4) && (b != 0x00))) {
    throw new IOException("Only 32bit unsigned integers are supported"+
                          getPositionMessage(is));
  }
As an additional reference, please see the attached excerpt from SNMP, SNMPv2, SNMPv3, and RMON 1 and 2, 3rd Ed., William Stallings (Addison Wesley, 1999), p. 591.
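To see that size check in action without pulling in SNMP4J, here is a minimal standalone Java sketch that applies the same rule to the seven octets quoted above. The names (Counter32Check, decodeCounter32, parseHex, isLegal) are made up for illustration and are not part of SNMP4J. The sketch assumes short-form BER lengths and throws an unchecked exception where SNMP4J throws an IOException.

```java
class Counter32Check {
    /** Parses colon-separated hex octets such as "41:05:00:ff". */
    static byte[] parseHex(String s) {
        String[] parts = s.split(":");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    /**
     * Decodes one Counter TLV (tag 0x41, short-form length octet).
     * Rejects values that cannot fit in 32 unsigned bits, using the same
     * length and leading-octet test as the SNMP4J excerpt.
     */
    static long decodeCounter32(byte[] tlv) {
        if (tlv.length < 3 || (tlv[0] & 0xFF) != 0x41) {
            throw new IllegalArgumentException("not a Counter TLV");
        }
        int length = tlv[1] & 0xFF;
        if (length != tlv.length - 2) {
            throw new IllegalArgumentException("length octet does not match data");
        }
        int first = tlv[2] & 0xFF;
        if (length > 5 || (length > 4 && first != 0x00)) {
            throw new IllegalArgumentException("Only 32bit unsigned integers are supported");
        }
        long value = 0;
        for (int i = 2; i < tlv.length; i++) {
            value = (value << 8) | (tlv[i] & 0xFF);
        }
        return value;
    }

    /** Convenience wrapper: true if the TLV decodes without complaint. */
    static boolean isLegal(String hex) {
        try {
            decodeCounter32(parseHex(hex));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The octets from the packet trace: five value octets, leading 01.
        System.out.println(isLegal("41:05:01:0f:1c:a5:26"));
        // A legal five-octet counter: high bit set, padded with 00.
        System.out.println(isLegal("41:05:00:8f:1c:a5:26"));
    }
}
```

The traced octets fail the check because the fifth, leading octet is 01 rather than 00, while the second example decodes normally.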
The upshot of all this analysis is that we won’t be able to collect performance data from these storage server nodes until the vendor can provide a software update that resolves the counter-encoding bug. This will probably take the form of a newer Net-SNMP agent (the 5.2.1.2 version currently loaded is from mid-2005) that addresses this issue. I’ve spent some time trying to track down what release fixed this problem but can’t find a reference to it in the Net-SNMP changelog. I’m certain it’s fixed in some later release, though, because we don’t see this problem with modern Net-SNMP agents.
I thought the above would be plenty of ammunition for our customer to go straight to developer-level support with the storage vendor. Today our customer contact (Mike) updated the support ticket with the storage vendor’s reply:
Mike,
Not sure what you’re looking for here but we only support SNMP agent for HP OpenView, CA Unicenter, IBM Tivoli NetView and BMC Patrol. [We provide] SNMP support to integrate [storage server product] management into an existing enterprise management solution such as HP OpenView, CA Unicenter, IBM Tivoli NetView and BMC Patrol.
No support for OpenNMS to my knowledge…
Rgds,
Jacques
Did I miss something? OpenNMS exists. It’s an enterprise management solution. It’s bested HP OpenView and IBM Tivoli NetView in at least one survey of actual users. The storage server’s SNMP agent is clearly and demonstrably in violation of the encoding rules specified for the SNMP SMI, a fact that is likely to cause interoperability problems with any reasonably strict implementation of SNMP. Why should a storage vendor dictate which enterprise management products its customers should use to manage their storage servers?
After an appeal from our customer, Jacques grudgingly agreed to escalate the issue, but not without being snooty about it:
Mike,
Sorry for that but I’m not making the rules…
Will however escalate your ‘concerns’ to higher level support.
You also might wanna ask for an RFE (Request for enhancement) thru your sales or SA.
Rgds,
Jacques
Nobody had used the word “concerns” up to this point, so Jacques’ use of quote marks around it is pretty clearly for the sheer contempt of it. I do hope that Mike will contact his sales or SA, not to request an enhancement, but to suggest a proper fix as a great way to keep commission checks coming.
This story would not bother me nearly as much if the storage vendor were not standing on the shoulders of two open-source giants while thumbing its nose at a third open-source project — and at one of its own paying customers! By using the Linux kernel and the UCD-SNMP agent in its storage servers, the vendor eliminated the huge cost of developing these components in-house. Likewise, by choosing OpenNMS over the very expensive commercial management products in the storage vendor’s anointed list, our mutual customer has a far larger slice of budget available to buy network storage servers. Given these facts, I fully expect that the answer from the escalation team will be “Bien sûr! Tout de suite!”
Update 18 June 2008
Despite my faith that the vendor would act appropriately, they came back and said that the only supported management platforms are the ones called out above (HPOV NNM, Unicenter, Tivoli NetView, and BMC Patrol). I spent a little time and built Net-SNMP 5.2.1.2 on an x86_64 Linux system, shoved enough traffic through its interface to trigger the BER bug when I requested ifInOctets, and tried hitting it with xnmgraph from OpenView Network Node Manager. Just like OpenNMS, NNM’s SNMP library discards the invalid response PDUs and reports a timeout. I’m recommending that our customer install the HPOV NNM demo, discover the storage server, and see how the vendor feels about the situation now.
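For contrast, here is a rough Java sketch of what a conforming agent would put on the wire for a 32-bit counter. It rests on an assumption that the post hints at but does not confirm: that the raw value comes from a counter wider than 32 bits. Under the SMI a 32-bit counter wraps at 2^32, so the sketch keeps only the low 32 bits and then emits the shortest two's-complement encoding. The names Counter32Encode, encodeCounter32, and toHex are hypothetical and do not come from Net-SNMP or SNMP4J.

```java
class Counter32Encode {
    /**
     * Encodes a raw counter value as a BER Counter TLV (tag 0x41).
     * Assumption: the raw value may exceed 32 bits, so it is wrapped
     * modulo 2^32 before encoding.
     */
    static byte[] encodeCounter32(long raw) {
        long value = raw & 0xFFFFFFFFL;

        // Smallest octet count whose top bit stays clear, so the
        // two's-complement encoding is read as non-negative.
        int len = 1;
        while ((value >> (8 * len - 1)) != 0) {
            len++;
        }

        byte[] out = new byte[2 + len];
        out[0] = 0x41;          // Counter tag
        out[1] = (byte) len;    // short-form length (at most 5 here)
        for (int i = 0; i < len; i++) {
            out[2 + i] = (byte) (value >> (8 * (len - 1 - i)));
        }
        return out;
    }

    /** Formats octets as colon-separated lowercase hex. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(':');
            }
            sb.append(String.format("%02x", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The out-of-range value from the original packet trace.
        System.out.println(toHex(encodeCounter32(0x010f1ca526L)));
        // The largest legal value needs the 00 padding octet.
        System.out.println(toHex(encodeCounter32(0xFFFFFFFFL)));
    }
}
```

With wrapping applied, the value from the original trace encodes in four octets, and the only time a fifth octet appears is as 00 padding in front of a value whose high bit is set.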
Update 25 June 2008
Our customer brought up a VM for us to install NNM 7.53. As expected, the NNM tools choke on the mangled Counter in the response PDUs that the storage server sends:
[root@nnmserver ~]# /opt/OV/bin/snmpget -d -v 1 -c public -r 1 10.11.12.13 ifInOctets.1
Transmitted 45 bytes to 10.11.12.13 port 161:
Initial Timeout: 0.80 seconds
...
Received 50 bytes from 10.11.12.13 port 161:
0: 30 30 02 01 00 04 06 70 75 62 6c 69 63 a2 23 02 00.....public.#.
16: 04 5c f8 55 1d 02 01 00 02 01 00 30 15 30 13 06 .\.U.......0.0..
32: 0a 2b 06 01 02 01 02 02 01 0a 01 41 05 02 d3 c0 .+.........A....
48: 65 fe -- -- -- -- -- -- -- -- -- -- -- -- -- -- e...............
0: SNMP MESSAGE (0x30): 48 bytes
2: INTEGER VERSION (0x2) 1 bytes: 0 (SNMPv1)
5: OCTET-STR COMMUNITY (0x4) 6 bytes: "public"
13: RESPONSE-PDU (0xa2): 35 bytes
15: INTEGER REQUEST-ID (0x2) 4 bytes: 1559778589
21: INTEGER ERROR-STATUS (0x2) 1 bytes: noError(0)
24: INTEGER ERROR-INDEX (0x2) 1 bytes: 0
27: SEQUENCE VARBIND-LIST (0x30): 21 bytes
29: SEQUENCE VARBIND (0x30): 19 bytes
31: OBJ-ID (0x6) 10 bytes: .1.3.6.1.2.1.2.2.1.10.1
43: error parsing number
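The offset NNM complains about can be cross-checked by walking the TLVs in the dump by hand. The Java sketch below does that walk over the 50 octets shown above and flags any Counter whose value cannot fit in 32 bits, using the same rule as the SNMP4J excerpt. It is only an illustration: the names BerWalk, findOversizedCounters, and parseHex are hypothetical, and it handles short-form lengths only, which is all this packet uses.

```java
import java.util.ArrayList;
import java.util.List;

class BerWalk {
    // The 50 octets from the NNM dump.
    static final String DUMP =
        "30 30 02 01 00 04 06 70 75 62 6c 69 63 a2 23 02 " +
        "04 5c f8 55 1d 02 01 00 02 01 00 30 15 30 13 06 " +
        "0a 2b 06 01 02 01 02 02 01 0a 01 41 05 02 d3 c0 " +
        "65 fe";

    /** Parses whitespace-separated hex octets. */
    static byte[] parseHex(String s) {
        String[] parts = s.trim().split("\\s+");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    /** Returns the offsets of Counter TLVs whose value needs more than 32 bits. */
    static List<Integer> findOversizedCounters(byte[] msg) {
        List<Integer> bad = new ArrayList<>();
        walk(msg, 0, msg.length, bad);
        return bad;
    }

    private static void walk(byte[] msg, int start, int end, List<Integer> bad) {
        int pos = start;
        while (pos + 2 <= end) {
            int tag = msg[pos] & 0xFF;
            int length = msg[pos + 1] & 0xFF;
            if (length >= 0x80) {
                throw new IllegalArgumentException("long-form length not handled in this sketch");
            }
            int valueStart = pos + 2;
            int valueEnd = valueStart + length;
            if (valueEnd > end) {
                throw new IllegalArgumentException("truncated TLV at offset " + pos);
            }
            if ((tag & 0x20) != 0) {
                // Constructed type (SEQUENCE, PDU): descend into its contents.
                walk(msg, valueStart, valueEnd, bad);
            } else if (tag == 0x41 && (length > 5 || (length == 5 && msg[valueStart] != 0))) {
                // Counter with too many octets, or five octets without 00 padding.
                bad.add(pos);
            }
            pos = valueEnd;
        }
    }

    public static void main(String[] args) {
        System.out.println(findOversizedCounters(parseHex(DUMP)));
    }
}
```

Run against the dump, the walk flags the TLV at offset 43, the same position where NNM reports "error parsing number".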
Our customer has asked the storage vendor to take another look at the problem since their SNMP agent is demonstrably incompatible with one of their anointed management platforms.