Two weekends ago I spoke at FOSDEM on the systems and devops track, presenting my experience as a sysadmin going all the way to development and back. I’ve put the slides online and embedded them in this post. The video of the talk is now available. Between the Q&A and some post-talk conversations I’ve put together a few notes that I wanted to share.

What tools have you used in your development?

I do most of my development in Python, so a lot of the libraries and tools are Python related. Here’s a list:

How did you store the metrics? What options are out there?

In the beginning I was just running post-commit jobs and storing the results in sqlite. That still works well for the number of commits and runs I do. You could certainly use mysql or postgresql, which you’d probably end up using anyway if you run Hudson. Other than that, any key/value store would work too. They match the use case quite well: all you have is a value with a label at a point in time. For the system metrics I used to use rrdtool and recently moved to graphite. Both scale pretty well, especially now that rrdtool ships rrdcached. Also, since someone commented on it, the usual complaint about data loss and lack of granularity with rrds is inaccurate. Most applications based on rrd ship with very conservative defaults, which most people mistake for a limitation of rrd itself. For more details on this topic see vvuksan’s blog post Misconceptions about RRD storage.
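As a rough illustration of the "value with a label at a point in time" idea, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the function names (open_store, record, history) are assumptions made for this example, not the schema used for the talk.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the metrics database. The schema is illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        " commit_id TEXT NOT NULL,"
        " label TEXT NOT NULL,"
        " value REAL NOT NULL,"
        " recorded_at REAL NOT NULL)"
    )
    return conn


def record(conn, commit_id, label, value, recorded_at=None):
    """Store one labelled value for a commit, timestamped now by default."""
    if recorded_at is None:
        recorded_at = time.time()
    conn.execute(
        "INSERT INTO metrics VALUES (?, ?, ?, ?)",
        (commit_id, label, float(value), recorded_at),
    )
    conn.commit()


def history(conn, label):
    """Return (commit_id, value) pairs for one label, oldest first."""
    rows = conn.execute(
        "SELECT commit_id, value FROM metrics"
        " WHERE label = ? ORDER BY recorded_at, rowid",
        (label,),
    )
    return rows.fetchall()
```

A post-commit job would call record() once per metric after each test run, and history() gives the series to graph for a given label.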

How can you handle a lot of metrics?

  • chained rrdcached with shards, quite new and still experimental but working and certainly pretty interesting.
  • a key/value store, probably redis. This is a somewhat arbitrary recommendation based on my experience with it; most likely mongodb or cassandra would also scale well. I believe twitter has actually released something based on the latter to store their metrics.
  • mysql, most likely with shards, probably on hostname, but I’m not a fan.
  • openTSDB by stumbleupon, an interesting timeseries database built on top of hbase.
In the pre-nosql/rrdcached/graphite era I successfully used memcached and in-house webglue to rrd to scale metrics systems. The idea is to have something like rrdtool to store aggregated long-term data and something you can easily query for quick retrieval of recent metrics.
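The two-tier idea above (aggregated long-term data plus a quickly queryable window of recent points) can be sketched in plain Python. This is only an in-memory illustration: the TwoTierStore class and its parameters are hypothetical, and in a real setup the recent tier would live in something like memcached or redis and the aggregated tier in rrd files.

```python
from collections import defaultdict, deque


class TwoTierStore:
    """In-memory stand-in for a recent-points cache plus aggregated history."""

    def __init__(self, recent_size=300, bucket_seconds=3600):
        self.bucket_seconds = bucket_seconds
        # Recent tier: the last `recent_size` raw points per label.
        self.recent = defaultdict(lambda: deque(maxlen=recent_size))
        # Long-term tier: label -> {bucket_start: [sum, count]}.
        self.buckets = defaultdict(dict)

    def add(self, label, timestamp, value):
        self.recent[label].append((timestamp, value))
        start = int(timestamp // self.bucket_seconds) * self.bucket_seconds
        acc = self.buckets[label].setdefault(start, [0.0, 0])
        acc[0] += value
        acc[1] += 1

    def latest(self, label, n=10):
        """Quick retrieval of the most recent raw points."""
        return list(self.recent[label])[-n:]

    def averages(self, label):
        """Aggregated view: one average per time bucket, oldest first."""
        return [
            (start, total / count)
            for start, (total, count) in sorted(self.buckets[label].items())
        ]
```

Old raw points fall out of the recent tier automatically, while the per-bucket averages are kept, which mirrors how rrd consolidates data over time.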

Developers don’t want to be ops.

This was more of a statement than a question from someone in the audience. He commented that, while he appreciated the goal, he didn’t see it as feasible because it requires one group to acquire competencies it has no interest in, i.e. developers don’t want to know about operational details.

But there is no such requirement. A lot of value can be gained just by listening and contributing from one’s own realm. The work that devs and ops do should be considered not just on its own terms, but also in terms of the value it generates for the other group. To follow on from the talk, developers should not be required to understand how monitoring systems work or how to collect metrics from their test runs; they should get that as a service provided by operations. That service becomes a point of contact from which discussions can be had to improve the application and collaborate further on that project and others. In turn, developers can better understand the needs of a production application and enhance the code to expose selftests and metrics via some kind of API that operations can tie into a monitoring system.
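To make the "selftests and metrics via some kind of API" idea more concrete, here is one possible shape for it: a tiny WSGI endpoint, using only the standard library, that runs registered checks and returns them with the current metrics as JSON. The names (SELFTESTS, METRICS, selftest, status_app) and the JSON layout are assumptions for this sketch, not an interface described in the talk.

```python
import json

SELFTESTS = {}  # name -> callable that raises on failure
METRICS = {}    # name -> current numeric value, updated by the application


def selftest(name):
    """Decorator that registers a check under a name."""
    def register(func):
        SELFTESTS[name] = func
        return func
    return register


def run_selftests():
    """Run every registered check and report 'ok' or the failure message."""
    results = {}
    for name, check in SELFTESTS.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = "failed: %s" % exc
    return results


def status_app(environ, start_response):
    """WSGI app: 200 when all selftests pass, 503 otherwise."""
    results = run_selftests()
    payload = {"selftests": results, "metrics": dict(METRICS)}
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    healthy = all(outcome == "ok" for outcome in results.values())
    status = "200 OK" if healthy else "503 Service Unavailable"
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Developers register checks with @selftest("some_name") and update METRICS from the application code; operations can then poll the endpoint from whatever monitoring system they run. For a quick local try, status_app can be served with wsgiref.simple_server.make_server.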

Testing becomes a point of contact where the two groups meet and have a conversation in a common language.

How do you prove the value of metrics like these to management?

This is an evergreen, and for good reasons. Without management buy-in, getting the time to implement proper testing environments is difficult, albeit not impossible. I must say that of all the topics you can bring to a manager for approval, metrics is probably one of the easiest to communicate the value of. That said, implementing something like what has been discussed is not cheap, and it is not an easy task to get approved.

Retro-analysis can be very powerful in this instance. Companies that care tend to have some kind of record of outages and at least a rough estimate of the impact. If the code was tracked in a version control system, then you can run a set of tests (think integration tests on virtual systems simulating production) on whatever commit was released at the time of an incident. It is possible that by graphing and analysing various metrics for a set of commits up to the day of an outage, you could have predicted it. This becomes strong leverage for gaining buy-in for the project.
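A retro-analysis like this can start very simply. The sketch below assumes you already have one metric value per commit, in commit order, and flags commits whose value jumps well above the average of the preceding few. The function name, the window size and the 1.5x threshold are arbitrary choices for illustration, and a real analysis would likely look at several metrics at once.

```python
from statistics import mean


def flag_regressions(history, window=3, threshold=1.5):
    """Flag commits whose metric exceeds `threshold` times the mean of the
    previous `window` commits.

    `history` is a list of (commit_id, value) pairs in commit order.
    Returns a list of (commit_id, value, baseline) tuples.
    """
    flagged = []
    for i in range(window, len(history)):
        baseline = mean([value for _, value in history[i - window:i]])
        commit_id, value = history[i]
        if baseline > 0 and value > baseline * threshold:
            flagged.append((commit_id, value, baseline))
    return flagged
```

Running it over the commits leading up to a recorded outage shows whether a metric such as response time or memory use had already moved before the release went out.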

Thanks everybody, it was a blast, I loved FOSDEM!


4 Responses to Slides and notes from my FOSDEM talk – I’m going M.A.D. – monitoring aided development

  1. [...] This post was mentioned on Twitter by jtimberman and patrickdebois, Spike Morelli. Spike Morelli said: http://bit.ly/gjKofs slides&notes from my #fosdem talk on monitoring aided dev, trust your code and your ppl #devops [...]

  2. feth says:

    Just watched your talk at http://video.fosdem.org/2011/ -precisely http://video.fosdem.org/2011/maintracks/mad.xvid.avi
    It was inspiring, thank you.

    • spike says:

      Hi Feth,
      thank you very much for your comment. Feel free to reach out if you’d like to chat about it, I’d love to hear from people applying metrics to their workflow.

  3. Raymond Tay says:

    Excellent Spike :) Love the anecdotes and i especially feel it for TDD.
