Open source logging, analysis and monitoring tools
An attempt to structure what open source logging and monitoring tools are available. I've just started checking out this area.
This article will first put structure to logging and monitoring needs and then list what is available, with short descriptions, categorized.
The use case for my analysis is an outfit of a dozen or so publicly reachable machines, with in-house custom-built services reachable over HTTP as REST, JSON-RPC and web pages. Supporting these services there are database servers holding hundreds of gigabytes of data, and a couple of other servers specific to the business.
A high-level overview of the field may look like this:
Logs and metrics -> Aggregation -> Monitoring -> Notification
Logs and metrics -> Aggregation -> Storage -> Log analysis
Availability, sanity and fixing
So, why should you monitor servers and log data from them? The reasons can be divided into ensuring the availability of your systems, ensuring their sanity, and fixing them:
Availability (Monitoring)
Are the servers on-line and the components working? How would you know? You could have:
- Alarms sent when the monitoring system detects services not working at all or other critical conditions
- As an aside, you could also consider a bit of "monitor-less monitoring": let the customers do the monitoring, and give them a way to quickly indicate that something isn't running smoothly. For example, a form that submits the problem report together with an automatic indication of which machine/service it concerns, or simply a note telling them where to file a ticket.
- There is probably a minimum good set of monitoring info you want from any system: CPU, memory, disk space and open file descriptors (see the sketch after this list).
- There should be a place where you can see graphs of the last seven days of monitoring output.
- Monitoring of application-level services, for example those running under a process manager like pm2 or supervisord. At a minimum: memory consumption per process and process status.
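To make the "minimum good set" concrete, here is a minimal sketch of collecting those four numbers in Python. It assumes the third-party psutil library and Linux's /proc filesystem, neither of which is mandated by anything above; in practice a collector like collectd, Diamond or munin-node (listed further down) would do this for you.

```python
# A minimal sketch of the "minimum good set" of host metrics.
# Assumes the third-party psutil library and a Linux /proc filesystem.
import os
import psutil

def collect_basic_metrics():
    # CPU utilization in percent, sampled over one second
    cpu = psutil.cpu_percent(interval=1)
    # Memory and root-disk usage in percent
    mem = psutil.virtual_memory().percent
    disk = psutil.disk_usage("/").percent
    # Open file descriptors for this process (Linux-only, via /proc)
    fds = len(os.listdir("/proc/self/fd"))
    return {"cpu_percent": cpu, "mem_percent": mem,
            "disk_percent": disk, "open_fds": fds}

if __name__ == "__main__":
    print(collect_basic_metrics())
```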
Sanity (Monitoring)
Even if a system is available and responding to the customer's actions, the responses may not be correct.
- No instrumentation is needed on the servers for this: simply monitor the services from another machine by making HTTP requests, and check response time and correctness of the result (a sketch follows below). This will also catch network connectivity issues. It is similar to end-to-end, integration and regression tests, but run against live data.
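As an illustration, here is a minimal sketch of such an external check in Python, using only the standard library. The URL, the expected substring and the timeout are placeholders of my own, not anything prescribed by the tools discussed here.

```python
# A minimal sketch of an external sanity check: run it from a different
# machine than the one hosting the service.
import time
import urllib.request

URL = "http://example.com/api/status"   # hypothetical endpoint
EXPECTED = b'"status": "ok"'            # hypothetical expected substring
TIMEOUT = 5                             # seconds

def check(url=URL):
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
            body = resp.read()
            ok = resp.status == 200 and EXPECTED in body
    except Exception as exc:
        return {"ok": False, "error": str(exc), "seconds": time.time() - start}
    return {"ok": ok, "seconds": time.time() - start}

if __name__ == "__main__":
    print(check())
```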
Fixing (Logging)
- Why did the problem come about? Traceback and error logging, and comparing logs from different subsystems. There ought to be ready-made instrumentation for the standard services on a machine, such as PostgreSQL, MongoDB and Nginx. It is important to make sure the systems log enough info, especially your own software. If the space requirements grow large, be aggressive with log rotation. There are a number of standardized log formats:
Standardized log records
There are a couple of standards with regard to the format of log records. RFC 5424 is the more modern of the two syslog formats and obsoletes RFC 3164, and GELF is becoming a bit of a de facto standard in newer systems (Graylog, log4j), with log data encoded in JSON.
- RFC 5424
- RFC 3164
- GELF - log records encoded as JSON (see the example after this list): GELF - Description and examples
- Heka - uses its own standardized log objects
- CEE - work on this format may have been discontinued. About Common Event Expression: Frequently Asked Questions — Archive
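To give a feel for the GELF format, here is a minimal sketch of building a GELF 1.1 record and sending it as plain (uncompressed) JSON over UDP, which Graylog accepts on port 12201 by default. The destination host and the extra "_app" field are my own placeholders.

```python
# A minimal sketch of a GELF 1.1 message sent as plain JSON over UDP.
import json
import socket
import time

def send_gelf(short_message, host="graylog.example.com", port=12201):
    record = {
        "version": "1.1",              # GELF spec version
        "host": socket.gethostname(),  # originating host
        "short_message": short_message,
        "timestamp": time.time(),      # seconds since the epoch
        "level": 6,                    # syslog severity: 6 = informational
        "_app": "my-service",          # custom fields are prefixed with "_"
    }
    payload = json.dumps(record).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

if __name__ == "__main__":
    send_gelf("service started")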
Logging
Logging should answer the following questions, in increasing order of ambition:
- When and for how long? - When did the problem occur and how long did it persist?
- How? - How did the problem manifest itself, e.g. out of memory or out of file descriptors?
- Why? - Why did the problem come about?
Data interfaces/aggregators
Ok, going back to the diagrams:
Logs and metrics -> Aggregation -> Monitoring -> Notification
Logs and metrics -> Aggregation -> Storage -> Log analysis
First, data needs to be made available for aggregation. In some cases this is about making already-produced log messages accessible. In other cases it means introducing new data-collecting services (metrics).
Writing logs to a log file that nobody knows about does not count as making data available. Writing to a well-known logging service does, and so does a process that finds log files and reads from them.
Data interfaces/aggregators software
- writing to syslog - Syslog is a central and well-known aggregating logging facility on Unix-like systems. Most programming languages and many frameworks have support for writing to syslog (see the sketch after this list). Syslog exists in at least three flavors:
- syslog (sysklogd)
- syslog-ng - HOWTO: JSON processing with syslog-ng - Asylum
Syslog-ng filters log messages and can output them to local log files, or send them over the network to Riemann, MongoDB and many other systems. It can break a log entry into smaller parts if it is bigger than the maximum message size; on the other hand, the log-msg-size setting in syslog-ng can be raised all the way up to 256MB.
A way of handling JSON and other messages intermingled, and outputting JSON: CEE prototype and a show-case for the new 3.4 features « Bazsi's blog
- rsyslog - similar to syslog-ng
- collectd - the system statistics collection daemon; collects and stores metrics for a local server. Can forward to Logstash and to syslog. Monitoring with collectd and Riemann - Asylum
- Diamond is a python daemon that collects system metrics and publishes them to Graphite (and others)
- The Host sFlow agent exports physical and virtual server performance metrics using the sFlow protocol
- Performance Co-Pilot - collects performance metrics from your systems efficiently.
- etsy/statsd - aggregates but does not store data. Does not have plugins for generating metrics.
- fluentd - unified log format and storage. Does not do analysis and graphs afaict. Fluentd | Open Source Data Collector | Unified Logging Layer
- Logstash: Collect, Parse, Transform Logs | Elastic - has a lot of plugins
- Apache Flume - distributed service for collecting, aggregating, and moving large amounts of log data
- Heka — Heka 0.9.2 documentation - also features monitoring
- Tensor is a modular gateway and event router for Riemann, built using the Twisted framework - it also seems to include metric collection
- munin-node is the agent process running on each node that the Munin server monitors. It is the server-monitoring part of Munin, but can also be used by other tools, such as Riemann.
- Supermann monitors processes running under Supervisor and sends metrics to Riemann.
- Ganglia - "Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids", it says. The web frontend is highly targeted at managing uniform computing resources. Ganglia itself generates data and aggregates it. Introduction to Ganglia on Ubuntu 14.04 | DigitalOcean
Articles
- The State of Highly-Available Syslog - Ian Unruh
- Fluentd vs Logstash
- Has anyone done any comparative study on syslog-ng, rsyslog, Scribe, and Flume in regards to throughput? - Quora
Analyzers and visualizers - monitoring
After you have the data, you may want to monitor it and react to events and unusual circumstances. A monitoring tool can react when thresholds are reached, and can often calculate and compare values, also over some (limited) time window. Many of these tools also offer visualizations.
- Riemann - very cool analyzer, easy to get started with, uses Clojure as its configuration language (which is nice imho) and has a web interface which is also very nice. Can aggregate, change and forward log data to log analyzers and log storage systems.
- Icinga is a monitoring system which checks the availability of your resources, notifies users of outages and provides BI data
Analyzers and visualizers - logging
There are basically two kinds of analysis: one of time series data, where graphs are of help; the other of events such as errors, which is more textual data.
- RRDtool - About RRDtool - does not handle text
- ElasticSearch - handles text
- Kibana: Explore, Visualize, Discover Data | Elastic - often used in combination with ElasticSearch.
- Graphite Documentation — Graphite 0.10.0 documentation
Graphite does two things (a sketch of feeding it data over the plaintext protocol follows after this list):
- Store numeric time-series data
- Render graphs of this data on demand
- Grafana is most commonly used for visualizing time series data for Internet infrastructure and application analytics. It works with Graphite as one of its data sources.
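As a small illustration of feeding Graphite, here is a sketch of its plaintext protocol: one "metric.path value timestamp" line per data point, sent to the carbon listener, which listens on port 2003 by default. The host and metric names are placeholders of my own.

```python
# A minimal sketch of sending a data point to Graphite's carbon listener
# using the plaintext protocol ("path value timestamp\n" on TCP port 2003).
import socket
import time

CARBON_HOST = "graphite.example.com"   # hypothetical Graphite/carbon host
CARBON_PORT = 2003

def send_metric(path, value, timestamp=None):
    timestamp = int(timestamp if timestamp is not None else time.time())
    line = "%s %s %d\n" % (path, value, timestamp)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

if __name__ == "__main__":
    send_metric("servers.web01.load.shortterm", 0.42)
```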
Protocol brokers
These translate one protocol into another, or aggregate (which makes them a bit like the aggregator category further up). The contents of this category are just a selection of tools I found; they are mostly for inspiration if/when I need to fit together pieces that may need some programming/adapting.
- Bucky is a small server for collecting and translating metrics for Graphite. It can currently collect metric data from collectd daemons and from StatsD clients (the StatsD wire format is sketched after this list). (Python)
- graphite-ng/carbon-relay-ng - a relay for carbon streams, in Go (golang). Duplicates, filters and aggregates carbon data streams over time windows. Similar functionality plus sharding is available from the Graphite project itself: The Carbon Daemons — Graphite 0.9.9 documentation
- estatsd_server - standalone estatsd server written in Erlang
- pystatsd is a front end/proxy for the Graphite stats collection and graphing server. (Python) It also shows how to build a Debian package of itself.
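For reference, here is a minimal sketch of the StatsD wire format that etsy/statsd and Bucky consume: small "name:value|type" datagrams sent over UDP, port 8125 by default. The host and metric names below are placeholders of my own.

```python
# A minimal sketch of emitting metrics in the StatsD line format over UDP.
import socket

STATSD_HOST = "statsd.example.com"   # hypothetical statsd/Bucky host
STATSD_PORT = 8125

def send(payload):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (STATSD_HOST, STATSD_PORT))

send("api.requests:1|c")           # counter: one more request served
send("api.response_time:320|ms")   # timer: milliseconds for one request
send("api.open_connections:17|g")  # gauge: current value
```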
Storage back ends
Usually bundled with, or required by, log analyzers.
- PostgreSQL
- MongoDB
- InfluxDB - event database
- RRD - Round robin database
- Whisper - part of Graphite; a fixed-size, file-based time-series database similar in design to RRD.
Mega systems - all in one