
Graylog2 Optimization for High-Log Environments

07.24.2012
Rich McDonough is the original author of this article.


As an organization grows from a startup with a handful of systems into a sophisticated environment comprising hundreds of servers, dozens of subnets, and a multitude of integration points, the benefits of automation become painfully clear. Specifically, once a system has grown beyond a handful of hosts, a number of log-related problems need to be tackled:

• Moving logs to a secure, remote location (i.e. off the original host)
• Permitting non-administrators to search logs using Role Based Access Control
• Filtering, obfuscation and sanitization of log messages
• Analysis of logs from across multiple hosts

Take, for example, a purchase made by an end user on one of our web applications. Typically, they will hit our landing page, log in, perhaps view several profiles and send messages to other users, purchase a subscription renewal, and then exit. In that time, their actions will have created log messages across several web servers, updated records on our master database and several read-only slaves, pushed tasks to our background queues for later processing, sent email, and caused several hosts to interact with our payment processors and other third-party integrations.

When something goes wrong somewhere in this chain of activity, how does one search through the activity of over a dozen hosts to determine which actions caused which events? How do you correlate reams of data back to meaningful events that correspond to this single user?

One of the commercial products that meets the needs of our organization is Splunk. However, Splunk is priced such that it makes sense for small environments with a small budget or large environments with a large budget for these types of tools. The volume of logging data that we produce makes a Splunk installation cost prohibitive.

At Avid Life Media, we are consumers of, and believers in, the value of open source software. Our applications are almost exclusively written using freely available tools, and we continue to realize enormous positive returns by strategically selecting tools that receive support and development from the OSS community. Naturally, our attention turned to Graylog2.

Graylog2 is a free, open source log collection and correlation system. While it does not match Splunk's feature list, it provides all of the basic log correlation capabilities that we need.

The steps for installing Graylog2 are well documented elsewhere. Here I will highlight some of the measures that were necessary to operationalize the system for production use. As our environment receives anywhere from 2,000 to 6,000 log messages per minute (with occasional peaks of 40,000!), our dataset grows extremely fast, and we found that our storage backend was filling up very quickly. We needed a way to archive data to disk in flat files – a feature that is not presently part of the application. Therefore, archiving data from MongoDB is discussed below, as is our custom log forwarding script: a handy Python tool for sending messages from any log file on your server, based on matching (or not matching) a regular expression.

This article deals specifically with Graylog2 version 0.9.5p2. Our upgrade to 0.9.6 has been delayed, for reasons I'll discuss at the end of this post.

Application Stack

The application stack consists of the following components:

• graylog2-server – the Graylog2 syslog/GELF server
• graylog2-web-interface – a Rails 3 GUI for accessing the Graylog2 database
• MongoDB – the persistence engine for the Graylog2 application; MongoDB is a NoSQL database that is well suited to storing logging data
• GELF Daemon – a Python script that tails log files and forwards them to the GELF server

Of these, only the GELF Daemon is written in-house, with the source discussed a little later on.

All servers in this particular production environment run a mix of RHEL 5 and RHEL 6 releases; however, there is nothing preventing the same technologies from working on any other *NIX system.

Architecture

Logging/Back-end

All clients in the environment forward their logs using either their native syslog configuration or GELF. While the former is readily available for most UNIX services, GELF is a more feature-rich protocol developed by the Graylog2 community that can provide distinct fields for Graylog2 to index on.
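
To make that distinction concrete, here is a minimal sketch of what a GELF message looks like on the wire: a zlib-compressed JSON document sent over UDP. The hostname, message text, and _user_id field below are made up for illustration; the port matches the GELF default used later in our config.

# Minimal GELF sketch: a zlib-compressed JSON document sent via UDP.
# Hostname, message text, and the _user_id field are illustrative only.
import json
import socket
import time
import zlib

msg = {
    'version': '1.0',
    'host': 'web-01',
    'short_message': 'Payment accepted',
    'full_message': 'Payment accepted for subscription renewal',
    'timestamp': time.time(),
    'level': 6,                 # INFO
    'facility': 'billing',
    '_user_id': 12345,          # fields prefixed with "_" become extra indexed fields
}

# Small messages fit in a single datagram; larger ones require GELF chunking.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(zlib.compress(json.dumps(msg)), ('amazing-graylog2-server', 12201))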

This requires that two ports be open to the logging host from all clients in the environment.
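
On a RHEL host that might look something like the following; the subnet is illustrative, 514/udp is the standard syslog port, and 12201/udp is the GELF port from our config:

# On the logging host: accept syslog and GELF traffic from client subnets
iptables -A INPUT -p udp --dport 514 -s 10.0.0.0/8 -j ACCEPT    # syslog
iptables -A INPUT -p udp --dport 12201 -s 10.0.0.0/8 -j ACCEPT  # GELF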

Data is parsed by the Graylog2 server and stored immediately in the MongoDB database. In our case we are using a dedicated NFS volume from our NetApp NAS; however, there is no reason why local storage won't do the trick equally well.

Carefully consider the RAM requirements of your server! MongoDB uses memory-mapped files, so you will need at least the size of your database collection *free in RAM* on your host; otherwise you risk swapping! In our case we have 8GB of RAM on a single host with a 5GB capped collection.
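
A quick way to see how much memory the collection will want is to ask MongoDB directly. A small pymongo sketch (connection details assumed to match the archiving script further down):

# Quick sanity check: how big is the collection that needs to fit in RAM?
from pymongo import Connection

db = Connection('localhost', 27017).graylog2
stats = db.command('collstats', 'messages')
print 'messages: %.2f GB on disk' % (stats['storageSize'] / 1024.0 ** 3)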

Front-end/Web

Unlike most Rails applications, there is no MySQL requirement: all data for the entire application is stored in MongoDB. Other than that, the web interface is a very typical Rails project running under Passenger. Performance has never been a major issue, even with the Rails app running on the same host as the Graylog2 server (which, incidentally, has 8 virtual CPU cores assigned to it). This has proven more than adequate for processing log information from several hundred systems.

Capped Storage

Graylog2 uses the capped collection capabilities of MongoDB. A capped collection is essentially a database table that uses a FIFO mechanism to limit storage use. Our logging collection is capped at 5GB of total storage, which should be capable of holding more than a week of logs. We archive data to file after 24 hours and allow MongoDB to remove old data without interference.
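
Graylog2 creates this collection itself, but should you ever need to recreate it by hand, a pymongo sketch (the size mirrors our 5GB cap; adjust to taste):

# Recreate the capped messages collection by hand (Graylog2 normally
# does this itself). size is in bytes; 5 GB mirrors our production cap.
from pymongo import Connection

db = Connection('localhost', 27017).graylog2
db.create_collection('messages', capped=True, size=5 * 1024 ** 3)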

Archiving

Data passes out of the Graylog2 message database fairly quickly in our environment. Consequently, we aggressively archive data to file. MongoDB is *very* amenable to this approach. Below is our archiving script, which runs nightly on the Graylog2 host.

Note that the pymongo package is available from EPEL. Please test first as your mileage may vary!

#!/usr/bin/env python
# Nightly archiver: writes the last 24 hours of Graylog2 messages to a
# gzipped flat file before MongoDB's capped collection discards them.
from pymongo import Connection
import gzip
import time
import traceback

connection = Connection('localhost', 27017)
db = connection.graylog2
collection = db.messages

now = int(time.time())
yesterday = time.localtime(now - 86400)
# The file is named for yesterday's month and day, e.g. messages_2_1.gz
fileName = '/data/dumps/messages_%s_%s.gz' % (yesterday[1], yesterday[2])
handle = gzip.open(fileName, 'wb')

for doc in collection.find():
  # Only archive messages created within the last 24 hours
  age = now - int(doc['created_at'])
  if 0 < age < 86400:
    try:
      handle.write(u'%s %s %s "%s" message: "%s" full_message: "%s"\n' % (
        doc[u'host'],
        doc[u'facility'],
        doc[u'level'],
        time.ctime(int(doc['created_at'])),
        str(doc[u'message']).strip(),
        str(doc[u'full_message']).strip()))
    except Exception:
      traceback.print_exc()

handle.close()
print "all done"

Running this script is as easy as adding it to your local crontab.
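
For example (the install path below is hypothetical; use wherever you placed the script):

# /etc/crontab entry: archive yesterday's messages shortly after midnight
5 0 * * * root /usr/local/bin/graylog2_archive.py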

Client Setup

We are *heavy* believers in automation. For this reason we use Puppet for the *vast* majority of our server configs. Setting up several hundred servers with a new log host becomes as easy as adding a few lines to our Puppet config.

class graylog_clients {
  # Push a standard syslog config that forwards to the central log host
  file { "/etc/syslog.conf":
    owner => root,
    group => root,
    mode => 644,
    source => "puppet://puppet/puppetfiles/graylog_clients/etc/syslog.conf"
  }
  # Restart syslog whenever its config file changes
  service { "syslog":
    ensure => running,
    subscribe => File["/etc/syslog.conf"]
  }
  # The GELF daemon: its config, the script itself, and an init script
  file { "/etc/gelfDaemon.conf":
    owner => root,
    group => root,
    mode => 644,
    source => "puppet://puppet/puppetfiles/graylog_clients/etc/gelfDaemon.conf"
  }
  file { "/usr/local/bin/gelfDaemon.py":
    owner => root,
    group => root,
    mode => 755,
    source => "puppet://puppet/puppetfiles/graylog_clients/usr/local/bin/gelfDaemon.py"
  }
  file { "/etc/init.d/gelfDaemon":
    owner => root,
    group => root,
    mode => 750,
    source => "puppet://puppet/puppetfiles/graylog_clients/etc/init.d/gelfDaemon"
  }
}
node /.*awesome-host-[0-9]$/ {
  include all_clients, app_servers, snort_servers, prod_servers, graylog_clients
}

The config for our local syslog daemons, supplied by the graylog_clients class, is very simple:

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none    /var/log/messages
# The authpriv file has restricted access.
authpriv.*            /var/log/secure
# Log cron stuff
cron.*              /var/log/cron
# Everybody gets emergency messages
*.emerg             *
# Save news errors of level crit and higher in a special file.
uucp,news.crit            /var/log/spooler
# Save boot messages also to boot.log
local7.*            /var/log/boot.log
*.emerg         @awesome-graylog2-host
*.alert         @awesome-graylog2-host
*.crit          @awesome-graylog2-host
*.warning       @awesome-graylog2-host

As you can see, we've done very little in the way of modifications. Our "Awesome Graylog2 Host" now receives every message of warning severity or higher, just like any other standard syslog server would.

The GELF Daemon

Our GELF daemon is a log collection script that tails logs in real time and forwards them to the Graylog2 GELF listener if they meet certain criteria. We wrote it to address some needs that are not easily met with the standard syslog process:

  • Forwarding logs that match a regular expression (and skipping those that don’t)
  • Concatenating many lines from a single event into a single log message
  • Creating multiple tails on the same file

The system that instigated the creation of this script is one of our main web applications – a very, *very* noisy app that likes to create a seemingly endless stream of separate log lines for a single request. Using this script we get a single, succinct record in Graylog2 when certain exceptions occur, instead of the stream of separate log entries we would get from a simple syslog forwarder that does not handle newlines.
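
The core of the daemon is a simple tail-and-filter loop. Here is a stripped-down sketch of the idea – not the production script (see the GitHub link below for that), and the log path is just our earlier example:

# Stripped-down sketch of a tail-and-filter loop.
import re
import time

def follow(path):
    f = open(path)
    f.seek(0, 2)  # start at the end of the file, like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(0.5)  # nothing new yet; poll again shortly
            continue
        yield line

pattern = re.compile(r'[eE]xception')
for line in follow('/data/awesome-app/current/log/production.log'):
    if pattern.search(line):
        print line.strip()  # the real daemon wraps this in a GELF message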

This approach also works wonders for things like MySQL slow logs. We can turn this:

# Time: 120201 18:53:33
# User@Host: avidlifemedia_db[avidlifemedia_db] @ amazing-server [10.0.10.199]
# Thread_id: 1668392  Schema: awesome_production
# Query_time: 7.728352  Lock_time: 0.000049  Rows_sent: 1
# Rows_examined: 2475275  Rows_affected: 0  Rows_read: 1
SELECT things.* FROM blah...

Into this:
Time: 120201 18:53:33
User@Host: avidlifemedia_db[avidlifemedia_db] @ amazing-server [10.0.10.199]
Thread_id: 1668392  Schema: awesome_production
Query_time: 7.728352  Lock_time: 0.000049  Rows_sent: 1
Rows_examined: 2475275  Rows_affected: 0  Rows_read: 1
SELECT things.* FROM blah...

… and only forward it if the slow query time exceeds a threshold that we care about for that particular host.
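
The threshold check itself is a one-liner once a regex captures the query time. A sketch, using the slow-log line from above (the 5-second cutoff is arbitrary):

# Sketch: extract Query_time from a slow-log header line and apply a cutoff.
import re

slow = re.compile(r'Query_time:\s+(\d+\.\d+)')
line = '# Query_time: 7.728352  Lock_time: 0.000049  Rows_sent: 1'
match = slow.search(line)
if match and float(match.group(1)) >= 5.0:  # 5s cutoff is arbitrary
    print 'forward this entry'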

The GELF daemon is installed by Puppet, as is its config file. The only additional steps are to install the python-json module, then enable and start the service:

yum install python-json -y
chkconfig gelfDaemon on
service gelfDaemon start

These tasks can also be done by Puppet, or one of your own deploy scripts, or one of a hundred different ways of performing process automation in your own environment.

Note that the gelfDaemon.py script is safe to execute from the command line. This is useful if you run into issues and need to debug what the GELF process is forwarding to the Graylog2 server (any log messages sent to the Graylog2 server are also dumped to the console).

Note that the script presented here is using the RHEL 5.x version of the python-json module.

Configuring the GELF Daemon

Configuring the GELF daemon is very easy. A config file, uncreatively called /etc/gelfDaemon.conf, controls its behaviour. A code sample should make its use clear:

# /etc/gelfDaemon.conf
#
# levels are defined as these:
#            CRITICAL: 2
#            ERROR: 3
#            WARNING: 4
#            NOTICE: 5
#            INFO: 6
#            DEBUG: 7
[default]
gelfServer = amazing-graylog2-server
gelfPort = 12201
gelfMaxChunkSize = 8154
[awesome-app]
path = /data/awesome-app/current/log/production.log
level = 3
regex = .*[eE]xception.*
facility = awesome-app
short_message_regex = ^.{1,60}
concatenate_on = ^\ \~$
[sphinx]
path = /data/awesome-app/sphinx/log/searchd.log
level = 5
regex = .*
facility = searchd_log
short_message_regex = ^.{1,60}

Of the two examples here, one is obviously more complicated. The awesome-app stanza takes lines that contain the word "exception", concatenates onto them all subsequent lines that begin with "~", and sends the result to the Graylog2 server as a single message. Any lines that do not contain the word "exception" are simply skipped.

Our Sphinx example takes all lines (the regex .* matches every line) and forwards them without any concatenation.

Note that the “facility” is *arbitrary*. GELF log facilities are not identical to syslog facilities and may be whatever you want them to be. This makes working with Graylog2 much, *much* easier and allows you to create your streams far more intelligently.

The GitHub source for the GELF daemon is here:

https://github.com/rmcdonough/GELF-Daemon/

Upgrading to Graylog2 version 0.9.6

Last month we attempted to upgrade to the latest build of Graylog2. Unfortunately, there is a critical bug in that build of graylog2-server that causes it to silently drop any syslog message from which it cannot parse a date.

http://jira.graylog2.org/browse/SERVER-90

The issue has been resolved, but the fix is only available in 0.9.6p1-RC (a release candidate). We're waiting for it to make its way into a stable release before we try again.

Additionally, with the move away from MongoDB as the persistence engine, a new archiving script needed to be created. Extracting data from Elasticsearch is unfortunately not as easy, and the application would certainly benefit from an archiving mechanism. We have started a new script for this, but it is not yet complete.
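
For the curious, a very rough starting point using nothing but the standard library. The "graylog2" index name and the "created_at" field are assumptions here – check them against your own Elasticsearch mapping before relying on this:

# Rough sketch: pull the last day of messages out of Elasticsearch.
# Index name and field names are assumptions; verify against your mapping.
import json
import time
import urllib2

day_ago = int(time.time()) - 86400
query = json.dumps({
    'size': 1000,
    'query': {'range': {'created_at': {'from': day_ago}}},
})
response = urllib2.urlopen('http://localhost:9200/graylog2/_search', query)
for hit in json.loads(response.read())['hits']['hits']:
    print hit['_source']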
