
Monitoring modern real-time distributed infrastructure is complex and expensive. In this talk we explore Riemann and, specifically, how its low latency helped us get real-time metrics from our distributed systems.
Large-scale real-time distributed systems emit hundreds of thousands of metrics per second for effective monitoring. A significant portion of these metrics is either of no use or poorly understood. As infrastructure grows rapidly, monitoring it in real time and getting accurate metrics becomes challenging, especially with an in-house monitoring setup.
Most monitoring systems are pull- or poll-based: the monitoring system queries the components being monitored. Pull-based monitoring, where the system checks some x values every y minutes, is effectively dead.
Riemann is a monitoring tool that aggregates events from hosts, servers and applications and feeds them into a stream processing language, where they can be manipulated, summarized or acted upon. Riemann is fast and highly configurable. Most importantly, it follows an event-centric push model.
We use Riemann to monitor distributed systems. Catching problems in real time requires low-latency monitoring tools, so that errors are detected quickly and we can see immediately whether a fix is working. Riemann provides this, along with a transient shared state for systems with many moving parts.
Riemann is written in Clojure and leverages its core concepts. Riemann configs are Clojure code.
We will walk through the concepts of Riemann:
- Events
- Streams
- Indexes
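
To give a rough feel for how these three concepts relate, here is a minimal Python sketch. It is an illustrative model only, not Riemann's API: Riemann itself is configured in Clojure, and the names `Event`, `Index`, `where` and `push`, as well as the latency threshold and host names, are assumptions made for this example. The event fields mirror the ones Riemann events carry (host, service, state, time, metric, ttl).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """A single pushed measurement (hypothetical stand-in for a Riemann event)."""
    host: str
    service: str
    metric: float
    state: str = "ok"
    time: float = 0.0   # seconds; when the event was emitted
    ttl: float = 60.0   # seconds the event stays valid in the index


# A stream is just a function that receives each event.
Stream = Callable[[Event], None]


class Index:
    """Transient shared state: the latest event per (host, service)."""

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, str], Event] = {}

    def update(self, event: Event) -> None:
        self._latest[(event.host, event.service)] = event

    def lookup(self, host: str, service: str) -> Optional[Event]:
        return self._latest.get((host, service))

    def expire(self, now: float) -> List[Event]:
        """Drop and return events whose ttl has elapsed."""
        expired = [e for e in self._latest.values() if e.time + e.ttl < now]
        for e in expired:
            del self._latest[(e.host, e.service)]
        return expired


def where(predicate: Callable[[Event], bool], *children: Stream) -> Stream:
    """Pass an event on to the child streams only if the predicate holds."""
    def stream(event: Event) -> None:
        if predicate(event):
            for child in children:
                child(event)
    return stream


index = Index()
alerts: List[Event] = []

# Every event is indexed; slow latency events are also collected as alerts.
streams: List[Stream] = [
    index.update,
    where(lambda e: e.service == "zookeeper latency" and e.metric > 100.0,
          alerts.append),
]


def push(event: Event) -> None:
    """Push model: the monitored component sends events as they happen."""
    for stream in streams:
        stream(event)


push(Event("zk1", "zookeeper latency", 12.0, time=0.0, ttl=10.0))
push(Event("zk2", "zookeeper latency", 250.0, time=1.0, ttl=10.0))
```

In this sketch nothing is polled: components call `push` when something happens, streams decide what to do with each event, and the index holds only the most recent state until its ttl runs out.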
We will also go over how to run Riemann in a production environment and how to write Riemann Clojure configs.
We will conclude the talk with a demo of monitoring a distributed system such as Apache ZooKeeper.