Open source logging, analysis and monitoring tools
An attempt to structure what open source logging and monitoring tools are available. I've just started checking out this area.
This article will first put structure to logging and monitoring needs and then list what is available, with short descriptions, categorized.
The use case for my analysis is an outfit of a dozen or so publicly reachable machines, with in-house custom-built services reachable over HTTP as REST, JSON-RPC and web pages. Supporting these services there are database servers holding hundreds of gigabytes of data, and a couple of other servers specific to the business.
A high-level overview of the field may look like this:
Logs and metrics -> Aggregation -> Monitoring -> Notification
Logs and metrics -> Aggregation -> Storage -> Log analysis
Availability, sanity and fixing
So, why should you monitor servers and log data from them? The reasons can be divided into ensuring the availability of your systems, ensuring their sanity, and fixing them:
Availability (Monitoring)
Are the servers on-line and the components working? How would you know? You could have:
- Alarms sent when the monitoring system detects services not working at all or other critical conditions
- As an aside, you could also consider a bit of "monitor-less monitoring": let the customers do the monitoring, and give them a way to quickly indicate that something isn't running smoothly. For example, a form that submits the problem report together with an automatic indication of which machine/service it concerns, or simply a note telling them where to file a ticket.
- There is probably a minimum good set of monitoring info you want from any system: CPU, memory, disk space and open file descriptors (see the sketch after this list).
- There should be a place where you can see graphs of the last seven days of monitoring output.
- Monitoring of application-level services, for example those running under a process manager like pm2 or supervisord. At a minimum: memory consumption per process and process status.
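To make the "minimum good set" concrete, here is a minimal sketch of collecting those four numbers in Python. It assumes the third-party psutil library and Linux's /proc filesystem, neither of which is mandated by anything above; in practice a collector like collectd, Diamond or munin-node (listed further down) would do this for you.

```python
# A minimal sketch of the "minimum good set" of host metrics.
# Assumes the third-party psutil library and a Linux /proc filesystem.
import os
import psutil

def collect_basic_metrics():
    # CPU utilization in percent, sampled over one second
    cpu = psutil.cpu_percent(interval=1)
    # Memory and root-disk usage in percent
    mem = psutil.virtual_memory().percent
    disk = psutil.disk_usage("/").percent
    # Open file descriptors for this process (Linux-only, via /proc)
    fds = len(os.listdir("/proc/self/fd"))
    return {"cpu_percent": cpu, "mem_percent": mem,
            "disk_percent": disk, "open_fds": fds}

if __name__ == "__main__":
    print(collect_basic_metrics())
```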
Sanity (Monitoring)
Even if a system is available and responding to the customer's actions, the responses may not be correct.
- No instrumentation is needed on the servers for this: simply monitor the services from another machine by making HTTP requests, and check response time and correctness of the result (a sketch follows below). This will also catch network connectivity issues. It is similar to end-to-end, integration and regression tests, but run against live data.
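As an illustration, here is a minimal sketch of such an external check in Python, using only the standard library. The URL, the expected substring and the timeout are placeholders of my own, not anything prescribed by the tools discussed here.

```python
# A minimal sketch of an external sanity check: run it from a different
# machine than the one hosting the service.
import time
import urllib.request

URL = "http://example.com/api/status"   # hypothetical endpoint
EXPECTED = b'"status": "ok"'            # hypothetical expected substring
TIMEOUT = 5                             # seconds

def check(url=URL):
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
            body = resp.read()
            ok = resp.status == 200 and EXPECTED in body
    except Exception as exc:
        return {"ok": False, "error": str(exc), "seconds": time.time() - start}
    return {"ok": ok, "seconds": time.time() - start}

if __name__ == "__main__":
    print(check())
```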
Fixing (Logging)
- Why did the problem come about? Traceback and error logging, and comparing logs from different subsystems. There ought to be ready-made instrumentation for the standard services on a machine, such as PostgreSQL, MongoDB and Nginx. It is important to make sure the systems log enough info, especially your own software. If the space requirements grow large, be aggressive with log rotation. There are a number of standardized log formats:
Standardized log records
There are a couple of standards with regard to the format of log records. RFC 5424 is the more modern of the two syslog formats and obsoletes RFC 3164, and GELF is becoming a bit of a de facto standard in newer systems (Graylog, log4j), with log data encoded in JSON.
- RFC 5424
- RFC 3164
- GELF - log records encoded as JSON (see the example after this list): GELF - Description and examples
- Heka - uses its own standardized log objects
- CEE - work on this format may have been discontinued. About Common Event Expression: Frequently Asked Questions — Archive
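To give a feel for the GELF format, here is a minimal sketch of building a GELF 1.1 record and sending it as plain (uncompressed) JSON over UDP, which Graylog accepts on port 12201 by default. The destination host and the extra "_app" field are my own placeholders.

```python
# A minimal sketch of a GELF 1.1 message sent as plain JSON over UDP.
import json
import socket
import time

def send_gelf(short_message, host="graylog.example.com", port=12201):
    record = {
        "version": "1.1",              # GELF spec version
        "host": socket.gethostname(),  # originating host
        "short_message": short_message,
        "timestamp": time.time(),      # seconds since the epoch
        "level": 6,                    # syslog severity: 6 = informational
        "_app": "my-service",          # custom fields are prefixed with "_"
    }
    payload = json.dumps(record).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

if __name__ == "__main__":
    send_gelf("service started")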
Logging
Logging should answer the following questions, in increasing order of ambition:
- When and for how long? - When did the problem occur and how long did it persist?
- How? - How did the problem manifest itself, e.g. out of memory or out of file descriptors?
- Why? - Why did the problem come about?
Data interfaces/aggregators
Ok, going back to the diagrams:
Logs and metrics -> Aggregation -> Monitoring -> Notification
Logs and metrics -> Aggregation -> Storage -> Log analysis
First, data needs to be made available for aggregation. In some cases this is about making already-produced log messages accessible. In other cases it means introducing new data-collecting services (metrics).
Writing logs to a log file that nobody knows about does not count as making data available. Writing to a well-known logging service does, and so does a process that finds log files and reads from them.
Data interfaces/aggregators software
- writing to syslog - Syslog is a central and well-known aggregating logging facility on Unix-like systems. Most programming languages and many frameworks have support for writing to syslog (see the sketch after this list). Syslog exists in at least three flavors:
- syslog (sysklogd)
- syslog-ng - HOWTO: JSON processing with syslog-ng - Asylum
Syslog-ng filters log messages and can output them to local log files, or send them over the network to Riemann, MongoDB and many other systems. It can break a log entry into smaller parts if it is bigger than the maximum message size; on the other hand, the log-msg-size setting in syslog-ng can be raised all the way up to 256MB.
A way of handling JSON and other messages intermingled, and outputting JSON: CEE prototype and a show-case for the new 3.4 features « Bazsi's blog
- rsyslog - similar to syslog-ng
- collectd - the system statistics collection daemon; collects and stores metrics for a local server. Can forward to Logstash and to syslog. Monitoring with collectd and Riemann - Asylum
- Diamond is a python daemon that collects system metrics and publishes them to Graphite (and others)
- The Host sFlow agent exports physical and virtual server performance metrics using the sFlow protocol
- Performance Co-Pilot - collects performance metrics from your systems efficiently.
- etsy/statsd - aggregates but does not store data. Does not have plugins for generating metrics.
- fluentd - unified log format and storage. Does not do analysis and graphs afaict. Fluentd | Open Source Data Collector | Unified Logging Layer
- Logstash: Collect, Parse, Transform Logs | Elastic - has a lot of plugins
- Apache Flume - distributed service for collecting, aggregating, and moving large amounts of log data
- Heka — Heka 0.9.2 documentation - also features monitoring
- Tensor is a modular gateway and event router for Riemann, built using the Twisted framework - it also seems to include metric collection
- munin-node is the agent process running on each node that the Munin server monitors. It is the server-monitoring part of Munin, but can also be used by other tools, such as Riemann.
- Supermann monitors processes running under Supervisor and sends metrics to Riemann.
- Ganglia - "Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids", it says. The web frontend is highly targeted at managing uniform computing resources. Ganglia itself generates data and aggregates it. Introduction to Ganglia on Ubuntu 14.04 | DigitalOcean
Articles
- The State of Highly-Available Syslog - Ian Unruh
- Fluentd vs Logstash
- Has anyone done any comparative study on syslog-ng, rsyslog, Scribe, and Flume in regards to throughput? - Quora
Analyzers and visualizers - monitoring
After you have the data, you may want to monitor it and react to events and unusual circumstances. A monitoring tool can react when thresholds are reached, and can often calculate and compare values, also over some (limited) time window. Many of these tools also offer visualizations.
- Riemann - very cool analyzer, easy to get started with, uses Clojure as its configuration language (which is nice imho) and has a web interface which is also very nice. Can aggregate, change and forward log data to log analyzers and log storage systems.
- Icinga is a monitoring system which checks the availability of your resources, notifies users of outages and provides BI data
Analyzers and visualizers - logging
There are basically two kinds of analysis: one of time series data, where graphs are of help; the other of events such as errors, which is more textual data.
- RRDtool - About RRDtool - does not handle text
- ElasticSearch - handles text
- Kibana: Explore, Visualize, Discover Data | Elastic - often used in combination with ElasticSearch.
- Graphite Documentation — Graphite 0.10.0 documentation
Graphite does two things (a sketch of feeding it data over the plaintext protocol follows after this list):
- Store numeric time-series data
- Render graphs of this data on demand
- Grafana is most commonly used for visualizing time series data for Internet infrastructure and application analytics. It works with Graphite as one of its data sources.
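As a small illustration of feeding Graphite, here is a sketch of its plaintext protocol: one "metric.path value timestamp" line per data point, sent to the carbon listener, which listens on port 2003 by default. The host and metric names are placeholders of my own.

```python
# A minimal sketch of sending a data point to Graphite's carbon listener
# using the plaintext protocol ("path value timestamp\n" on TCP port 2003).
import socket
import time

CARBON_HOST = "graphite.example.com"   # hypothetical Graphite/carbon host
CARBON_PORT = 2003

def send_metric(path, value, timestamp=None):
    timestamp = int(timestamp if timestamp is not None else time.time())
    line = "%s %s %d\n" % (path, value, timestamp)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

if __name__ == "__main__":
    send_metric("servers.web01.load.shortterm", 0.42)
```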
Protocol brokers
These translate one protocol into another, or aggregate (which makes them a bit like the aggregator category further up). The contents of this category are just a selection of tools I found; they are mostly for inspiration if/when I need to fit together pieces that may need some programming/adapting.
- Bucky is a small server for collecting and translating metrics for Graphite. It can currently collect metric data from collectd daemons and from StatsD clients (the StatsD wire format is sketched after this list). (Python)
- graphite-ng/carbon-relay-ng - a relay for carbon streams, in Go (golang). Duplicates, filters and aggregates carbon data streams over time windows. Similar functionality plus sharding is available from the Graphite project itself: The Carbon Daemons — Graphite 0.9.9 documentation
- estatsd_server - standalone estatsd server written in Erlang
- pystatsd is a front end/proxy for the Graphite stats collection and graphing server. (Python) It also shows how to build a Debian package of itself.
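For reference, here is a minimal sketch of the StatsD wire format that etsy/statsd and Bucky consume: small "name:value|type" datagrams sent over UDP, port 8125 by default. The host and metric names below are placeholders of my own.

```python
# A minimal sketch of emitting metrics in the StatsD line format over UDP.
import socket

STATSD_HOST = "statsd.example.com"   # hypothetical statsd/Bucky host
STATSD_PORT = 8125

def send(payload):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (STATSD_HOST, STATSD_PORT))

send("api.requests:1|c")           # counter: one more request served
send("api.response_time:320|ms")   # timer: milliseconds for one request
send("api.open_connections:17|g")  # gauge: current value
```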
Storage back ends
Usually bundled with, or required by, log analyzers.
- PostgreSQL
- MongoDB
- InfluxDB - event database
- RRD - Round robin database
- Whisper - part of Graphite; a fixed-size, file-based time-series database similar in design to RRD.
Mega systems - all in one