Open source logging, analysis and monitoring tools

published Nov 14, 2015 12:20 by admin (last modified Nov 14, 2015 12:19)

[Logos: Kibana, Flume, syslog-ng, rsyslog, Munin, RRDtool, MongoDB, Graphite, PostgreSQL, Elasticsearch, InfluxDB, Riak, Riemann, Ganglia, PCP, Fluentd, collectd, StatsD, Logstash]

An attempt to structure what open source logging and monitoring tools are available. I've just started checking out this area.

This article will first put structure to logging and monitoring needs and then list what is available, with short descriptions, categorized.

The use case for my analysis is an outfit of a dozen or so publicly reachable machines, running in-house custom-built services exposed over HTTP as REST APIs, JSON-RPC endpoints and web pages. Supporting these services are database servers holding hundreds of gigabytes of data, and a couple of other servers specific to the business.

A high-level overview of the field may look like this:

Logs and metrics -> Aggregation -> Monitoring -> Notification

Logs and metrics -> Aggregation -> Storage ->  Log analysis

Availability, sanity and fixing

So, why should you monitor servers and log data from them? The needs can be divided into ensuring the availability of your systems, verifying the sanity of your systems, and fixing the systems:

Availability (Monitoring)

Are the servers on-line and the components working? How would you know? You could have:

  • Alarms sent when the monitoring system detects services not working at all, or other critical conditions
  • As an aside, you could also consider a bit of "monitor-less monitoring": let the customers do the monitoring, and give them a way to quickly indicate that something isn't running smoothly. For example, a form that submits the problem together with an automatic indication of which machine/service the message comes from, or simply a note pointing out where to file a ticket.
  • There is probably a minimum good set of monitoring info you want from the system in general: CPU, memory, disk space, open file descriptors (a sketch of collecting these follows this list).
  • There should be a place where you can see graphs of the last seven days of monitoring output.
  • Monitoring of application-level services, such as those running under a process manager like pm2 or supervisord. At a minimum, memory consumption per process and process status.
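
To make that minimum set concrete, here is a minimal sketch, assuming the third-party psutil library and a StatsD daemon listening on its usual UDP port 8125; the metric names are made up:

    # Minimal sketch: collect basic host metrics with psutil and push them
    # to a StatsD daemon as gauges. Host, port and metric names are
    # assumptions; adapt them to whatever aggregator you actually run.
    import socket
    import psutil  # third-party: pip install psutil

    STATSD_ADDR = ("127.0.0.1", 8125)  # default StatsD UDP port (assumed)

    def collect():
        proc = psutil.Process()  # this process; use Process(pid) for others
        return {
            "host.cpu.percent": psutil.cpu_percent(interval=1),
            "host.mem.percent": psutil.virtual_memory().percent,
            "host.disk.root.percent": psutil.disk_usage("/").percent,
            "app.open_fds": proc.num_fds(),  # Unix only
        }

    def send(metrics):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for name, value in metrics.items():
            # StatsD gauge line format: <name>:<value>|g
            sock.sendto(f"{name}:{value}|g".encode(), STATSD_ADDR)

    if __name__ == "__main__":
        send(collect())

Run from cron or a loop; the monitoring side then only has to graph and alert on the gauges.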


Sanity (Monitoring)

Even if a system is available and responding to the customer's actions, it may not be returning accurate results.

  • No instrumentation is needed on the servers for this; simply monitor the services from another machine by making HTTP requests, and check both response time and accuracy of the result. This will also catch network connectivity issues. It is similar to end-to-end tests, integration tests and regression tests, but run against live data. A sketch of such an external check follows below.
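
As an illustration, a minimal external check might look like this; the URL, the expected "status" field and the latency threshold are all assumptions, not a real service:

    # Minimal sketch of an external "sanity" check: hit a service endpoint
    # from another machine, time the request and verify the payload.
    import time
    import urllib.request
    import json

    URL = "https://example.com/api/health"   # hypothetical endpoint
    MAX_SECONDS = 2.0                        # acceptable response time (assumed)

    def check(url=URL):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read()
        elapsed = time.monotonic() - start

        ok = True
        if elapsed > MAX_SECONDS:
            print(f"SLOW: {elapsed:.2f}s")
            ok = False
        try:
            data = json.loads(body)
            if data.get("status") != "ok":   # accuracy check on live data
                print(f"BAD PAYLOAD: {data!r}")
                ok = False
        except ValueError:
            print("NOT JSON")
            ok = False
        return ok

    if __name__ == "__main__":
        raise SystemExit(0 if check() else 1)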

Fixing (Logging)

  • Why did the problem come about? Traceback and error logging, and comparing logs from different subsystems. There is usually ready-made instrumentation for the services used on the machine: PostgreSQL, MongoDB, Nginx and such. It is important to make sure the systems log enough info, especially your own software. If the space requirements get big, be aggressive with log rotation. There are a number of standardized log formats:

Standardized log records

There are a couple of standards with regard to the format of log records. I believe RFC 5424 is more modern than RFC 3164, and GELF, with log data encoded in JSON, is becoming a bit of a de facto standard in newer systems (Graylog, log4j). A minimal example of a GELF-style record is sketched below.
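
For a feel of what a GELF record contains, here is a minimal sketch that sends an uncompressed JSON record over UDP, assuming a GELF UDP input (such as Graylog's) on its common default port 12201; the underscore-prefixed custom fields are made up:

    # Minimal sketch of a GELF-style log record sent as JSON over UDP.
    import json
    import socket
    import time

    def send_gelf(short_message, level=6, host="web-01",
                  addr=("127.0.0.1", 12201)):
        record = {
            "version": "1.1",
            "host": host,                 # the machine the message comes from
            "short_message": short_message,
            "timestamp": time.time(),     # Unix timestamp, seconds
            "level": level,               # syslog severity: 6 = informational
            "_service": "orders-api",     # custom fields start with "_"
            "_request_id": "abc123",
        }
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(json.dumps(record).encode(), addr)

    if __name__ == "__main__":
        send_gelf("Payment gateway timeout", level=3)  # 3 = error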

Logging

Logging should answer the following questions, in increasing order of ambition:

  • When and for how long? - When did the problem occur and how long did it persist?
  • How? - How did the problem manifest itself, i.e. out of memory, out of file descriptors
  • Why? - Why did the problem come about?

Data interfaces/aggregators

Ok, going back to the diagrams:

Logs and metrics -> Aggregation -> Monitoring -> Notification

Logs and metrics -> Aggregation -> Storage ->  Log analysis

First, data needs to be made available for aggregation. In some cases this means making accessible log messages that are already being produced. In other cases it means introducing new data-collecting services (metrics).

Writing logs to a log file that nobody knows about does not count as making data available. Writing to a well-known logging service does, and so does a process that finds log files and reads from them; a sketch of such a log shipper follows below.
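
As an illustration of the last point, a minimal log shipper could look like this; the log path and collector address are assumptions, and real shippers (Filebeat, Fluentd, Logstash, rsyslog) also handle rotation, batching and back-pressure:

    # Minimal sketch of a log shipper: follow a log file and forward new
    # lines over UDP to an aggregator.
    import socket
    import time

    LOG_PATH = "/var/log/myapp/app.log"   # hypothetical log file
    AGGREGATOR = ("logs.internal", 5140)  # hypothetical collector address

    def follow(path):
        """Yield new lines appended to the file, like `tail -f`."""
        with open(path, "r") as fh:
            fh.seek(0, 2)                 # start at the end of the file
            while True:
                line = fh.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                yield line.rstrip("\n")

    def ship():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for line in follow(LOG_PATH):
            sock.sendto(line.encode(), AGGREGATOR)

    if __name__ == "__main__":
        ship()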

Data interfaces/aggregators software


Analyzers and visualizers - monitoring

After you have the data, you may want to monitor it and react to events and unusual circumstances. A monitoring tool can react when thresholds are reached, and it can often calculate and compare values over some (limited) window of time. Many of these tools can also do visualizations. A sketch of a simple threshold rule follows below.
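
The core of such a tool is a rule evaluated against incoming samples. A minimal sketch of a rolling-average threshold rule, with a made-up window size and threshold, could look like this:

    # Minimal sketch of the kind of rule a monitoring tool evaluates: keep a
    # short window of recent samples and alert when the average crosses a
    # threshold.
    from collections import deque

    WINDOW = 12          # e.g. 12 samples = one minute at 5 s intervals
    THRESHOLD = 90.0     # alert when average CPU percent exceeds this

    class ThresholdRule:
        def __init__(self, window=WINDOW, threshold=THRESHOLD):
            self.samples = deque(maxlen=window)
            self.threshold = threshold

        def feed(self, value):
            """Add a sample; return True if the rule should fire."""
            self.samples.append(value)
            if len(self.samples) < self.samples.maxlen:
                return False                      # not enough data yet
            avg = sum(self.samples) / len(self.samples)
            return avg > self.threshold

    # Usage: rule.feed(latest_cpu_percent) on every sample; notify when True.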

Analyzers and visualizers - logging

There are basically two kinds of analysis: one of time-series data, where graphs are of help, and one of events such as errors, which is more textual data. A time-series tool typically needs to do two things (a sketch of feeding such a tool follows this list):

  1. Store numeric time-series data
  2. Render graphs of this data on demand
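
As an example of feeding such a tool, Graphite's carbon daemon accepts a plaintext protocol of one "path value timestamp" line per data point on TCP port 2003 by default; the host and metric name below are assumptions:

    # Minimal sketch of sending a data point to a time-series store using
    # Graphite's plaintext protocol.
    import socket
    import time

    CARBON = ("graphite.internal", 2003)   # hypothetical carbon-cache address

    def send_metric(path, value, timestamp=None):
        timestamp = int(timestamp if timestamp is not None else time.time())
        line = f"{path} {value} {timestamp}\n"
        with socket.create_connection(CARBON, timeout=5) as sock:
            sock.sendall(line.encode())

    if __name__ == "__main__":
        send_metric("servers.web-01.requests_per_second", 42)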


Protocol brokers

These translate one protocol into another, or aggregate (which makes them a bit like the aggregation category further up). The contents of this category are just a selection of tools I found, mostly for inspiration if/when I need to fit pieces together, which may require some programming/adapting. A sketch of such a small adapter follows below.
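
As a sketch of the kind of glue code this can involve, the snippet below accepts StatsD-style gauge lines over UDP and forwards each one as a Graphite plaintext line; the addresses are assumptions, and real brokers (statsd, Riemann, Fluentd) add aggregation, batching and many more protocols:

    # Minimal sketch of a protocol broker: translate StatsD gauge lines
    # ("name:value|g") into Graphite plaintext lines.
    import socket
    import time

    LISTEN = ("0.0.0.0", 8125)             # where StatsD-style packets arrive
    CARBON = ("graphite.internal", 2003)   # hypothetical Graphite address

    def run():
        udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        udp.bind(LISTEN)
        while True:
            packet, _ = udp.recvfrom(4096)
            for line in packet.decode().splitlines():
                try:
                    name, rest = line.split(":", 1)
                    value, kind = rest.split("|", 1)
                except ValueError:
                    continue                     # ignore malformed lines
                if kind != "g":                  # only translate gauges here
                    continue
                out = f"{name} {value} {int(time.time())}\n"
                with socket.create_connection(CARBON, timeout=5) as tcp:
                    tcp.sendall(out.encode())

    if __name__ == "__main__":
        run()

Opening a connection per line is wasteful; it keeps the sketch short, but a real adapter would keep the connection open and batch writes.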

Storage back ends

These are usually bundled in or required by log analyzers.

  • PostgreSQL
  • MongoDB
  • InfluxDB - Event database
  • RRD - Round robin database (the round-robin idea is sketched after this list)
  • Whisper - part of Graphite, which in its turn uses storage back ends.
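
As an aside, the "round robin" idea behind RRD and Whisper is simply a fixed number of slots written in a circle, so the database never grows. A toy illustration (not how the real tools are implemented; they add time normalization, aggregation and multiple archives of different resolutions):

    # Illustration only: a fixed-size ring of samples, so storage stays
    # constant no matter how long the collector runs.
    from collections import deque

    class RoundRobinArchive:
        def __init__(self, slots=288):        # e.g. 24 h at 5-minute resolution
            self.points = deque(maxlen=slots) # oldest point dropped when full

        def update(self, timestamp, value):
            self.points.append((timestamp, value))

        def fetch(self):
            return list(self.points)

    # Usage: archive.update(time.time(), cpu_percent)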

Mega systems - all in one