I recently became interested in Logstash, and after playing with it for a while, I decided to create my own custom plugin for learning purposes. I chose to pull data from Reddit because a) I use it often and b) there’s no existing plugin that offers that.

The Elasticsearch site offers quite exhaustive documentation on creating one’s own Logstash plugin. Such an endeavour requires Ruby skills - not only the language syntax but also the ecosystem. As expected, the site assumes the reader is familiar with both. Unfortunately, that’s not my case. I’ve been developing in Java a lot, I’ve dabbled somewhat in Scala, I’m quite interested in Kotlin - in the end, I’m just a JVM developer (plus some JavaScript here and there). Long story short, I’m starting from scratch in Ruby.

At this stage, there are two possible approaches:

  1. Read documentation and tutorials about Ruby, Gems, Bundler, the whole nine yards, and come back in a few months (or more)
  2. Or learn on the spot by diving right into development

Given that I don’t have months, and that whatever I learn along the way will be good enough, I opted for the 2nd option. This post is a summary of the steps I went through, in the hope it might benefit others who find themselves in the same situation.

The first step is not the hardest

Though new Logstash plugins can be started from scratch, the documentation advises starting from a template. This is explained in the online procedure. The generation yields the following structure:

$ tree logstash-input-reddit
├── Gemfile
├── LICENSE
├── README.md
├── Rakefile
├── lib
│   └── logstash
│       └── inputs
│           └── reddit.rb
├── logstash-input-reddit.gemspec
└── spec
    └── inputs
        └── reddit_spec.rb

Not so obviously for a Ruby newbie, this is the structure of a Ruby gem. In general, dependencies are declared in the associated Gemfile:

source 'https://rubygems.org'
gemspec

However, in this case, the gemspec directive adds a level of indirection: both the dependencies and the gem’s meta-data are declared in the associated gemspec file. This is a feature of the Bundler utility gem.
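To make that indirection more concrete, here is a rough sketch of what such a gemspec can look like. The name, version, and dependency names match what appears elsewhere in this post, but the summary, author, file list, and version constraint are placeholders of mine, not the exact content the template generates.

```ruby
# logstash-input-reddit.gemspec (illustrative sketch; several values are placeholders)
spec = Gem::Specification.new do |s|
  s.name          = 'logstash-input-reddit'
  s.version       = '0.1.0'
  s.licenses      = ['Apache-2.0']
  s.summary       = 'Reads posts from a subreddit' # placeholder
  s.authors       = ['Jane Doe']                   # placeholder
  s.require_paths = ['lib']
  s.files         = Dir['lib/**/*', 'spec/**/*', '*.gemspec', 'Gemfile', 'LICENSE']

  # Flags the gem as a Logstash input plugin
  s.metadata = { 'logstash_plugin' => 'true', 'logstash_group' => 'input' }

  # Dependencies that the `gemspec` line of the Gemfile hands over to Bundler
  s.add_runtime_dependency 'logstash-core-plugin-api', '~> 2.0' # assumed constraint
  s.add_runtime_dependency 'logstash-codec-plain'
  s.add_development_dependency 'logstash-devutils'
end
```

With a file along those lines, the two-line Gemfile is enough: Bundler reads the gemspec and resolves whatever it declares.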

To install dependencies, the bundler gem first needs to be installed. Aye, there’s the rub…

Ruby is the limit

Trying to install the gem yields the following:

gem install bundler
Fetching: bundler-1.13.6.gem (100%)
ERROR:  While executing gem ... (TypeError)
    no implicit conversion of nil into String

The first realization - and it took a lot of time browsing and reading - is that there are different flavours of Ruby runtimes. Plain Ruby is not enough for Logstash plugin development: it requires a dedicated runtime that runs on the JVM, namely JRuby.

The second realization is that while it’s easy to install multiple Ruby runtimes on a machine, out of the box they don’t coexist well. While Homebrew makes the jruby package available, it seems there’s a single gem repository per system, and it reacts very poorly to being managed by different runtimes.

After some more browsing, I found the solution: rbenv. It not only manages ruby itself, but also all associated executables (gem, irb, rake, etc.) by isolating every runtime. This makes it possible to run my Jekyll site with the latest 2.2.3 Ruby runtime and build the plugin with JRuby on the same machine. rbenv is available via Homebrew.

This is how it goes:

# Install rbenv
brew install rbenv
# Configure the PATH
echo 'eval "$(rbenv init -)"' >> ~/.bash_profile
# Source the bash profile script
. ~/.bash_profile
# List all available runtimes
rbenv install -l
Available versions:
  1.8.5-p113
  1.8.5-p114
  ...
  ...
  ...
  ree-1.8.7-2012.02
  topaz-dev
# Install the desired runtime
rbenv install jruby-9.1.6.0
# Configure the project to use the desired runtime
cd logstash-input-reddit
rbenv local jruby-9.1.6.0
# Check it's configured
ruby --version
jruby-9.1.6.0
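
As an extra sanity check, the active runtime can also be inspected from Ruby itself. The small script below is only a sketch, and the helper names are mine; it relies on the built-in RUBY_ENGINE constant and on JRUBY_VERSION, which is only defined when running on JRuby.

```ruby
# Describes the Ruby implementation running this script.
def runtime_summary
  version = defined?(JRUBY_VERSION) ? JRUBY_VERSION : RUBY_VERSION
  "#{RUBY_ENGINE} #{version}"
end

# True only when the script is executed by JRuby.
def jruby?
  RUBY_ENGINE == 'jruby'
end

puts runtime_summary
warn 'Not running on JRuby: check the output of `rbenv local`' unless jruby?
```

Run from the plugin folder, it should report jruby once rbenv local has been applied there.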

Finally, bundler can be installed:

gem install bundler
Successfully installed bundler-1.13.6
1 gem installed

And from this point on, all required gems can be installed as well:

bundle install
Fetching gem metadata from https://rubygems.org/.........
Fetching version metadata from https://rubygems.org/..
Fetching dependency metadata from https://rubygems.org/.
Resolving dependencies...
Installing rake 12.0.0
Installing public_suffix 2.0.4
...
...
...
Installing rspec-wait 0.0.9
Installing logstash-core-plugin-api 2.1.17
Installing logstash-codec-plain 3.0.2
Installing logstash-devutils 1.1.0
Using logstash-input-reddit 0.1.0 from source at `.`
Bundle complete! 2 Gemfile dependencies, 57 gems now installed.
Use `bundle show [gemname]` to see where a bundled gem is installed.
Post-install message from jar-dependencies:

if you want to use the executable lock_jars then install ruby-maven gem before using lock_jars 

   $ gem install ruby-maven -v '~> 3.3.11'

or add it as a development dependency to your Gemfile

   gem 'ruby-maven', '~> 3.3.11'

Plugin development proper

With those requirements finally addressed, proper plugin development can start. Let’s skip over finding the right API to make an HTTP request in Ruby and addressing Bundler warnings when installing dependencies. The final code is quite terse:

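# encoding: utf-8
require 'logstash/inputs/base'
require 'logstash/namespace'
require 'stud/interval'
require 'socket'
require 'net/http'
require 'json'

class LogStash::Inputs::Reddit < LogStash::Inputs::Base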

  config_name 'reddit'
  default :codec, 'plain'
  config :subreddit, :validate => :string, :default => 'elastic'
  config :interval, :validate => :number, :default => 10

  public
  def register
    @host = Socket.gethostname
    @http = Net::HTTP.new('www.reddit.com', 443)
    @get = Net::HTTP::Get.new("/r/#{@subreddit}/.json")
    @http.use_ssl = true
  end

  def run(queue)
    # we can abort the loop if stop? becomes true
    while !stop?
      response = @http.request(@get)
      json = JSON.parse(response.body)
      json['data']['children'].each do |child|
        event = LogStash::Event.new('message' => child, 'host' => @host)
        decorate(event)
        queue << event
      end
      Stud.stoppable_sleep(@interval) { stop? }
    end
  end
end

The plugin defines two configuration parameters: which subreddit will be parsed for data, and the interval between two calls (in seconds).

The register method initializes the instance variables, while the run method loops over:

  • Making the HTTP call to Reddit
  • Parsing the response body as JSON
  • Making dedicated fragments from the JSON, one for each post. This is particularly important because we want to index each post separately.
  • Sending each fragment as a Logstash event for indexing
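
The splitting step can be tried outside Logstash. The sketch below runs it against a tiny made-up payload that only mimics the data/children shape the plugin relies on; the ids and titles are invented, and split_posts is a name of mine.

```ruby
require 'json'

# Made-up, heavily abbreviated payload with the shape the plugin expects
SAMPLE_BODY = <<~JSON
  {
    "kind": "Listing",
    "data": {
      "children": [
        { "kind": "t3", "data": { "id": "abc123", "title": "First post" } },
        { "kind": "t3", "data": { "id": "def456", "title": "Second post" } }
      ]
    }
  }
JSON

# Returns one fragment per post, as the run method does before building events.
def split_posts(body)
  JSON.parse(body)['data']['children']
end

split_posts(SAMPLE_BODY).each do |post|
  puts "#{post['kind']} #{post['data']['id']} #{post['data']['title']}"
end
```

Each element of the returned array is what ends up in the message field of its own event.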

Of course, it’s very crude: there’s no error handling, it doesn’t save the timestamp of the last read post to prevent indexing duplicates, etc. In its current state, the plugin offers a lot of room for improvement, but at least it works from an MVP point of view.
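
To give an idea of what those two improvements could look like, here is a minimal sketch in plain Ruby. It is not part of the plugin: safe_children and LastSeenFilter are hypothetical names, and the last-seen timestamp is only kept in memory, so it would not survive a Logstash restart.

```ruby
require 'json'

# Returns the list of posts, or an empty list when the body does not have
# the expected JSON shape (for example an HTML error page).
def safe_children(body)
  parsed = JSON.parse(body)
  children = parsed.is_a?(Hash) ? parsed.dig('data', 'children') : nil
  children.is_a?(Array) ? children : []
rescue JSON::ParserError, TypeError
  []
end

# Remembers the newest created_utc returned so far (in memory only).
class LastSeenFilter
  def initialize
    @last_seen = 0.0
  end

  # Keeps only the posts strictly newer than anything returned before.
  def fresh(children)
    newer = children.select { |c| c['data']['created_utc'] > @last_seen }
    @last_seen = newer.map { |c| c['data']['created_utc'] }.max unless newer.empty?
    newer
  end
end

filter = LastSeenFilter.new
body = '{"data":{"children":[{"data":{"created_utc":10.0}},{"data":{"created_utc":20.0}}]}}'
puts filter.fresh(safe_children(body)).size # both posts are new on the first call
puts filter.fresh(safe_children(body)).size # nothing new on the second call
```

One known limitation of this approach: a post sharing the exact created_utc of the newest one already seen would be skipped, so tracking post ids might be a safer choice.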

Building and installing

As written above, the plugin is a Ruby gem. It can be built like any other gem:

gem build logstash-input-reddit

This creates a binary file named logstash-input-reddit-0.1.0.gem - name and version both come from the gemspec. It can be installed using the standard Logstash plugin installation procedure:

bin/logstash-plugin install logstash-input-reddit-0.1.0.gem

Downstream processing

One huge benefit of Logstash is the power of its processing pipeline. The plugin is designed to produce raw data, but the indexing should handle each field separately. Extracting fields from another field can be achieved with the mutate filter.

Here’s an example Logstash configuration snippet that fills some relevant fields from message, then removes message itself:

filter{
  mutate {
    add_field => {
      "kind" => "%{message[kind]}"
      "subreddit" => "%{message[data][subreddit]}"
      "domain" => "%{message[data][domain]}"
      "selftext" => "%{message[data][selftext]}"
      "url" => "%{message[data][url]}"
      "title" => "%{message[data][title]}"
      "id" => "%{message[data][id]}"
      "author" => "%{message[data][author]}"
      "score" => "%{message[data][score]}"
      "created_utc" => "%{message[data][created_utc]}"
    }
    remove_field => [ "message" ]
  }
}

Once the plugin has been built and installed, Logstash can be run with a config file that includes the previous snippet. It should yield something akin to the following - when used in conjunction with the rubydebug codec:

{
       "selftext" => "",
           "kind" => "t3",
         "author" => "nfrankel",
          "title" => "Structuring data with Logstash",
      "subreddit" => "elastic",
            "url" => "https://blog.frankel.ch/structuring-data-with-logstash/",
           "tags" => [],
          "score" => "9",
     "@timestamp" => 2016-12-07T22:32:03.320Z,
         "domain" => "blog.frankel.ch",
           "host" => "LSNM33795267A",
       "@version" => "1",
             "id" => "5f66bk",
    "created_utc" => "1.473948927E9"
}
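
Note that score and created_utc come out as strings in this output. Inside Logstash, the mutate filter’s convert option is probably the natural tool to fix that; purely as an illustration of what those string values contain, here is how they can be interpreted in plain Ruby (typed_fields is a name I made up).

```ruby
# Turns the string values from the sample output into typed values.
def typed_fields(event)
  {
    'score'       => Integer(event['score']),
    'created_utc' => Time.at(Float(event['created_utc'])).utc
  }
end

sample = { 'score' => '9', 'created_utc' => '1.473948927E9' }
typed = typed_fields(sample)
puts typed['score']       # an Integer
puts typed['created_utc'] # a Time in UTC, built from the epoch seconds
```

The created_utc value is simply the post’s creation time in epoch seconds, written in scientific notation.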

Conclusion

Starting from near-zero knowledge about the Ruby ecosystem, I’m quite happy with the result.

The only thing I couldn’t achieve was adding third-party libraries (like rest-client): Logstash kept complaining about not being able to install the plugin because of a missing dependency. Falling back to standard HTTP calls solved the issue.

Also, note that the default template has some warnings on install, but they can be fixed quite easily:

  • The license should read Apache-2.0 instead of Apache License (2.0)
  • Dependency versions are open-ended ('>= 0') whereas they should be more limited, e.g. '~> 2'
  • Some meta-data is missing, like the homepage
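
To see the difference between the two kinds of constraints, they can be evaluated with the RubyGems classes directly. This is only a small sketch: allowed? is a helper name of mine, and the version numbers are arbitrary examples.

```ruby
# Tells whether a version satisfies a dependency constraint.
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

%w[1.9.0 2.1.17 3.0.0].each do |version|
  puts "#{version}: '>= 0' -> #{allowed?('>= 0', version)}, '~> 2' -> #{allowed?('~> 2', version)}"
end
```

The open-ended constraint accepts every version, including a future major release that might break the plugin, whereas '~> 2' only accepts versions from 2 up to, but not including, 3.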

I hope this post will be useful to other Java developers wanting to develop their own Logstash plugin.