Monitoring: one ring to rule them all?

I recently had occasion to do a little research into the world of open source monitoring for my team (Advanced Technology Group at HP). This was my first focused exposure to monitoring tools and issues.

If you are at all involved in this space, you probably have two or more tools stitched together to cover the multiple needs of a comprehensive monitoring solution (collectd + Nagios + Graphite, for example).

My initial response to this was "there is a need for one great end-to-end solution!" But after further reflection, I'm not sure anyone would ever actually tackle that, and maybe it's not needed.

ERP vs. integrated specialized products

In my previous jobs, I've had experience with ERP (Enterprise Resource Planning) systems, and definitely know the benefits of having a system that spans the organization and covers multiple functions (billing, HR, finance, customers, and so on). Things work so much better when everything is designed to work together, and data can flow from billing to finance (for example) with minimal friction and data transformation needed. I also know the painful, painful work of integrating data from one system into another. And if it also has to come back to the originating system? So many challenges and failure points.

But it's also true that the ERP is going to have some things it does better than others. Some modules will be exactly what you would want and some will have puzzling deficiencies. And those deficiencies may lead you to pick products from other vendors to augment your ERP experience.

There does not seem to be an ERP equivalent for open source monitoring today. Products tend to cover one or two (maybe three) of the following areas:

  • Metrics collection / transport / storage
  • Checks (active / passive)
  • Alerting (dashboard and/or notifications)
  • Graphing

And maybe that's ok. Maybe you want to get best-of-breed specialization in these various buckets. A full end-to-end product may be more ambitious than anyone wants to tackle, and if we've got really good solutions for the pieces in the chain...

A side note here - what is missing from this list (and from most if not all of the products) is a really sophisticated thresholding engine that allows you to crunch the raw data and get beyond the basic state-based checks into correlation and formulas. We're starting to tread into big data land here, and that's probably a whole separate post.
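To make the difference between a state-based check and a formula-based one a little more concrete, here is a minimal Python sketch. The function names, thresholds, and sample numbers are all made up for illustration; this is not how any particular product does it.

```python
def state_check(value, warn, crit):
    """Basic state-based check: compare a single sample to fixed thresholds."""
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"


def error_ratio_check(error_counts, request_counts, max_ratio):
    """Formula-based check (hypothetical): combine two metrics over a window.

    Alerts on errors as a fraction of requests, not on the raw error count.
    """
    total_requests = sum(request_counts)
    if total_requests == 0:
        return "UNKNOWN"  # no traffic in the window, so the ratio is undefined
    ratio = sum(error_counts) / total_requests
    return "CRITICAL" if ratio > max_ratio else "OK"


# Same error counts, very different traffic levels (made-up numbers).
errors = [5, 5, 5]
busy = error_ratio_check(errors, [1000, 1000, 1000], max_ratio=0.01)
quiet = error_ratio_check(errors, [100, 100, 100], max_ratio=0.01)
raw = state_check(errors[-1], warn=10, crit=20)
```

With these numbers the raw count of 5 errors looks fine to the state-based check in both cases, while the ratio flags the low-traffic window, where 5% of requests are failing.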

But, as anyone who has had to work on integrating different products from different vendors knows, the devil is in the data standard differences.

Standardizing monitoring data

At the network layer, SNMP (Simple Network Management Protocol) is well-defined and broadly used for network monitoring. Moving up the stack into compute nodes and applications, things start to get a little more wild and woolly. But compute and applications are the bread and butter of cloud computing. Lack of a standard protocol means every product has to create plug-ins to interact with the other product(s). Not the end of the world - plug-ins add a lot to the tools. But getting a robust, agreed-upon standard data protocol for the bits and pieces of things we care about in the cloud would definitely help with integration of products and also the development of new tools trying to tackle cloud monitoring. Maybe there is room to start to define monitoring protocols in the OASIS project?
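As a rough illustration of what a shared data format buys you, here is a small Python sketch. The MetricRecord fields are my own assumption and not taken from any existing standard; the only real format used is Graphite's plaintext line of "path value timestamp".

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class MetricRecord:
    # Hypothetical common fields, invented for this example only.
    source: str      # which tool produced the sample
    name: str        # metric name or path
    value: float
    timestamp: int   # seconds since the Unix epoch


def from_graphite_line(line, source="graphite"):
    """Parse one Graphite plaintext line: '<path> <value> <timestamp>'."""
    path, value, timestamp = line.split()
    return MetricRecord(source=source, name=path,
                        value=float(value), timestamp=int(timestamp))


def to_json(record):
    """Serialize the common record so any other tool could consume it."""
    return json.dumps(asdict(record), sort_keys=True)


record = from_graphite_line("web01.cpu.load 0.75 1400000000")
payload = to_json(record)
```

If every tool spoke something like this common record natively, each product would need one adapter to the shared format instead of a plug-in per partner product.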

Perhaps we won't ever see a full end-to-end open source monitoring tool that does it all. (Although I suspect folks would use it if it existed!) But getting to a standard protocol seems much more achievable, and could bring significant benefit to the work of designing and integrating different tools. An ecosystem of easily interoperable tools is the next best thing to that one big product that does (or tries to do) it all.

Other voices

Some opinions I came across during my research that I found particularly interesting:
  • Laurie Denness advocating for Nagios and multiple tools, in response to:
  • Andy Sykes arguing for the death of Nagios and using different (but still multiple) tools (it's a slide deck...)
  • Portertech comparing Sensu and Nagios
  • Research from Dataloop on what tools are being used (confirming the ubiquity of Nagios, as the previous three links illustrate...)