• Open your classes and methods in Kotlin

    Kotlin icon

    Though Kotlin and Spring Boot play well together, there are some friction areas between the two. IMHO, chief among them is the fact that Kotlin classes and methods are final by default.

    The Kotlin docs cite the following reason:

    The open annotation on a class is the opposite of Java’s final: it allows others to inherit from this class. By default, all classes in Kotlin are final, which corresponds to Effective Java, Item 17: Design and document for inheritance or else prohibit it.

    Kotlin designers took this advice seriously by making all classes final by default. To make a class (respectively a method) inheritable (respectively overridable), it has to be marked with the open keyword. That’s unfortunate, because a lot of existing Java libraries require classes and methods to be non-final. In the Spring framework, those mainly include @Configuration classes and @Bean methods, but there’s also Mockito and countless other frameworks and libraries. What they all have in common is the use of the cglib library to generate a child class of the referenced class. Ergo: final implies no cglib, which implies no Spring, Mockito, etc.
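
    For instance, without the compiler plugin described below, a configuration class and its bean methods have to be opened explicitly - a minimal sketch (the class and bean names are made up):

    import org.springframework.context.annotation.Bean
    import org.springframework.context.annotation.Configuration

    class SampleController

    @Configuration
    open class KotlindemoConfiguration {

        // without the all-open plugin, the @Bean method must be open as well
        @Bean
        open fun controller() = SampleController()
    }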

    That means one has to remember to mark every required class and method as open. This is not only rather tedious, but also quite error-prone. As an example, the following is the message received when the modifier is forgotten on a Spring configuration class:

    org.springframework.beans.factory.parsing.BeanDefinitionParsingException: Configuration problem:
    @Configuration class 'KotlindemoApplication' may not be final. Remove the final modifier to continue.
    

    Here’s what happens when the @Bean method is not made open:

    org.springframework.beans.factory.parsing.BeanDefinitionParsingException: Configuration problem:
    @Bean method 'controller' must not be private or final; change the method's modifiers to continue
    

    The good thing is that the rules are quite straightforward: if a class is annotated with @Configuration or a method with @Bean, they should be marked open as well. Starting with Kotlin 1.0.6, there’s a compiler plugin to automate this process, available for both Maven and Gradle builds. Here’s the full Maven configuration snippet:

    <plugin>
      <artifactId>kotlin-maven-plugin</artifactId>
      <groupId>org.jetbrains.kotlin</groupId>
      <version>${kotlin.version}</version>
      <configuration>
        <compilerPlugins>
          <plugin>all-open</plugin>
        </compilerPlugins>
        <pluginOptions>
          <option>all-open:annotation=org.springframework.boot.autoconfigure.SpringBootApplication</option>
          <option>all-open:annotation=org.springframework.context.annotation.Bean</option>
        </pluginOptions>
      </configuration>
      <dependencies>
        <dependency>
          <groupId>org.jetbrains.kotlin</groupId>
          <artifactId>kotlin-maven-allopen</artifactId>
          <version>${kotlin.version}</version>
        </dependency>
      </dependencies>
      <executions>
        <execution>
          <id>compile</id>
          <phase>compile</phase>
          <goals>
            <goal>compile</goal>
          </goals>
        </execution>
        <execution>
          <id>test-compile</id>
          <phase>test-compile</phase>
          <goals>
            <goal>test-compile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    

    Note the two all-open:annotation options: they list the annotations for which the open keyword is not mandatory anymore.

    Even better, there’s an alternative plugin dedicated to Spring projects that makes listing Spring-specific annotations unnecessary. The <configuration> element above can be replaced with the following, shorter version:

    <configuration>
      <compilerPlugins>
        <plugin>spring</plugin>
      </compilerPlugins>
    </configuration>
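
    For Gradle builds, the setup relies on the same kotlin-allopen artifact together with the kotlin-spring plugin id - a sketch, assuming a kotlin_version property is defined in the build script:

    buildscript {
        dependencies {
            classpath "org.jetbrains.kotlin:kotlin-allopen:$kotlin_version"
        }
    }

    apply plugin: 'kotlin-spring'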
    

    Whether using Spring, Mockito or any other cglib-based framework/library, the all-open plugin is a great way to streamline development with Kotlin.

    Categories: Technical Tags: kotlin, spring framework, final
  • Feedback on Feeding Spring Boot metrics to Elasticsearch

    Logstash logo

    Some weeks ago, I wrote a post detailing how to send JMX metrics from a Spring Boot app to Elasticsearch by developing another Spring Boot app.

    Creating such an app is not always the right idea, but developers are makers - software makers - and developing new apps is quite alluring to them. However, in the overall scheme of things, this means time is not only spent on development, but also on maintenance during the entire lifetime of the app. Before going down the development path, one should thoroughly check whether out-of-the-box alternatives exist.

    Back to JMX metrics: only the straightforward Logstash jmx plugin was tried before calling it quits because of an incompatibility with Elasticsearch 5. But an alternative exists with the Logstash http_poller plugin.

    This Logstash input plugin allows you to call an HTTP API, decode the output of it into event(s), and send them on their merry way. The idea behind this plugins came from a need to read springboot metrics endpoint, instead of configuring jmx to monitor my java application memory/gc/ etc.

    Jolokia is already in place, offering HTTP access to JMX. The only remaining step is to configure the HTTP poller plugin, which is fairly straightforward. The URL is composed of the standard actuator URL, appended with /jolokia/read/ and the JMX ObjectName of the desired object. Here’s a sample configuration snippet, with URLs configured for:

    1. Response time for the root page
    2. HTTP 200 status code counter for the root page
    3. Operating system metrics
    4. Garbage collector metrics

    With the help of the jconsole, adding more metrics is a no-brainer.

    input {
      http_poller {
        urls => {
          "200.root" =>
            "http://localhost:8080/manage/jolokia/read/org.springframework.metrics:name=status,type=counter,value=200.root"
          "root" =>
            "http://localhost:8080/manage/jolokia/read/org.springframework.metrics:name=response,type=gauge,value=root"
          "OperatingSystem" => "http://localhost:8080/manage/jolokia/read/java.lang:type=OperatingSystem"
          "PS Scavenge" => "http://localhost:8080/manage/jolokia/read/java.lang:type=GarbageCollector,name=PS%20Scavenge"
        }
        request_timeout => 60
        schedule => { every => "10s"}
        codec => "json"
      }
    }
    

    This should output something akin to:

    {
              "request" => {
                "mbean" => "org.springframework.metrics:name=response,type=gauge,value=root",
                 "type" => "read"
              },
           "@timestamp" => 2016-01-01T10:19:45.914Z,
             "@version" => "1",
                "value" => {
                "Value" => 4163.0,
          "LastUpdated" => "2016-12-30T16:29:28+01:00"
                },
            "timestamp" => 1483121985,
               "status" => 200
    }
    

    Beyond the raw output, there are a couple of possible improvements:

    • value and request fields are inner objects, for no added value. Flattening the structure can go a long way toward making writing queries easier.
    • Adding tags depending on the type improves categorization. A possible alternative would be to parse the JMX compound name into dedicated fields with grok or dissect.
    • The @timestamp field can be replaced with the LastUpdated value and interpreted as a date.

    Filters to the rescue:

    filter {
      mutate { add_field => { "[mbean][objectName]" => "%{request[mbean]}" }}
      mutate { remove_field => "request" }
    }
    
    filter {
      if [mbean][objectName] == "java.lang:type=OperatingSystem" {
        dissect { 
          mapping => {
            "mbean[objectName]" => "%{[mbean][prefix]}:type=%{[mbean][type]}"
          }
        }
        mutate { remove_field => "value[ObjectName]" }
      } else if [mbean][objectName] == "java.lang:name=PS Scavenge,type=GarbageCollector" {
        dissect { 
          mapping => {
            "mbean[objectName]" => "%{[mbean][prefix]}:name=%{[mbean][name]},type=%{[mbean][type]}"
          }
        }
        mutate { remove_field => "value[ObjectName]" }
      } else if [mbean][objectName] =~ "^.*,type=gauge,.*$" or [mbean][objectName] =~ "^.*,type=counter,.*$" {
        date { match => [ "[value][LastUpdated]", "ISO8601" ] }
        mutate { replace => { "value" => "%{value[Value]}" }}
        mutate { convert => { "value" => "float" }}
        if [mbean][objectName] =~ "^.*,type=gauge,.*$" {
          dissect { 
            mapping => {
              "mbean[objectName]" =>
                "%{[mbean][prefix]}:name=%{[mbean][name]},type=%{[mbean][type]},value=%{[mbean][page]}"
            }
          }
        } else if [mbean][objectName] =~ "^.*,type=counter,.*$" {
          dissect { 
            mapping => {
              "mbean[objectName]" => 
                "%{[mbean][prefix]}:name=%{[mbean][name]},type=%{[mbean][type]},value=%{[mbean][status]}.%{[mbean][page]}"
            }
          }
        }
      }
    }
    

    A little explanation might be in order.

    Lines 1-4
    Move the initial request->mbean field to the mbean->objectName field.
    Lines 7-13
    For OS metrics, create mbean nested fields out of the objectName nested field and remove it from the value field.
    Lines 14-20
    For GC metrics, create mbean nested fields out of the objectName nested field using a slightly different pattern and remove it from the value field.
    Lines 21-24
    For gauge or counter metrics, interpret the value->lastUpdated nested field as a date, move the nested value->Value field to the root and interpret as a float value.
    Lines 25-38
    For gauge or counter metrics, create mbean nested fields using a pattern specific for each metric.

    Coupled with the initial configuration, this outputs to the following (for the gauge):

    {
        "@timestamp" => 2016-01-01T10:54:15.359Z,
             "mbean" => {
            "prefix" => "org.springframework.metrics",
              "name" => "response",
        "objectName" => "org.springframework.metrics:name=response,type=gauge,value=root",
              "type" => "gauge",
             "value" => "root"
             },
          "@version" => "1",
             "value" => 4163.0,
         "timestamp" => 1483206855,
            "status" => 200,
              "tags" => []
    }
    

    Given the current complexity of the configuration, the next logical step is to split the single snippet into multiple files. Logstash loads them in lexicographical order of their file names. Name the files accordingly:

    • 00-app-input.conf
    • 10-global-filter.conf
    • 20-specific-filter.conf
    • 99-console-output.conf

    Note: gaps in the numbering scheme allow adding intermediary filters in the future without any renaming.
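
    Logstash can then be pointed at the whole directory - it reads the files it contains in alphabetical order (the path below is illustrative):

    bin/logstash -f /etc/logstash/conf.d/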

    All in all, everything required is available through configuration, without any coding. Always check whether that’s the case before reinventing the wheel.

    Categories: Technical Tags: elasticsearch, logstash, elastic
  • Exploratory Infrastructure projects

    Exploring the jungle

    Nowadays, most companies use one Agile methodology or another for their software development projects. That makes people involved in software development projects at least aware of agile principles - whether they truly try to follow agile practices or just pay lip service to them for a variety of reasons remains debatable. To avoid any association with tainted practices, I’d rather use the name "Exploratory Development". As with software development, exploration starts with a vague idea of the final destination and a more or less detailed understanding of how to get there.[1] Plus, on a map, the plotted path is generally not a straight line.

    However, even with the rise of the DevOps movement, the operational side of projects still seems to remain oblivious to what happens on the other side of the curtain. This post aims to provide food for thought about how true Agile can be applied to Ops.

    The legacy approach

    As an example, let’s take an e-commerce application. Sometimes, bad stuff happens and the development team needs to access logs to analyze what happened. In general, for "security" reasons, developers cannot directly access the system and need to ask the operations team to send the log file(s). This all-too-common process results in frustration and wasted time, and contributes to building even taller walls between Dev and Ops. To improve the situation, an option could be to set up a datastore to store the logs in, and a webapp to make them available to developers.

    Here are some hypotheses regarding the source architecture components:

    • Load-balancer
    • Apache Web server
    • Apache Tomcat servlet container, where the e-commerce application is deployed
    • Solr server, which provides search and faceting features to the app

    In turn, relevant data/logs that need to be stored include:

    • Web server logs
    • Application logs proper
    • Tomcat technical logs
    • JMX data
    • Solr logs

    Let’s implement those requirements with the Elastic stack. The target infrastructure could look like this:

    Target architecture

    Defining the architecture is generally not enough. Unless one works for a dream company that empowers employees to improve the situation on their own initiative (please send it my résumé), chances are there’s a need for some estimates regarding the setup of this architecture. And that goes double in case it’s done for a customer company. You might push back, stating the obvious:

    We will ask for estimates and then treat them as deadlines

    Engineers will probably think that managers are asking them to stick their necks out to produce estimates for no real reason, and push back. But the latter will kindly remind the former to "just" base estimates on assumptions, as if that were a real solution instead of plausible deniability. In other words, an assumption is a way to escape blame for a wrong estimate - and yes, it sounds much closer to contract law than to engineering concepts. That means that if any of the listed assumptions is not fulfilled, then it’s acceptable for estimates to be wrong. It also implies estimates are then not considered a deadline anymore - which they were never supposed to be in the first place, if only from a semantic viewpoint.[2]

    Notwithstanding all that, and for the sake of the argument, let’s try to review possible assumptions pertaining to the proposed architecture that might impact implementation time:

    • Hardware location: on-premise, cloud-based or a mix of both?
    • Underlying operating system(s): *nix, Windows, something else, or a mix?
    • Infrastructure virtualization degree: is the infrastructure physical, virtualized, both?
    • Co-location of the OS: are there requirements regarding the location of components on physical systems?
    • Hardware availability: should there be a need to purchase physical machines?
    • Automation readiness: is there any solution already in place to automate infrastructure management? If not, in how many environments will the implementation need to be set up, and if more than 2, will replication be handled manually?
    • Clustering: is any component clustered? Which one(s)? For the application, is there a session replication solution in place? Which one?
    • Infrastructure access: needs to be on-site? Needs to get a security token hardware? From software?

    And those are quite basic items regarding hardware only. Other wide areas include software (volume of logs, criticality, hardware/software mis/match, versions, etc.), people (in-house support, etc.), planning (vacation seasons, etc.), and I’m probably forgetting some important ones too. Given the sheer number of items - and assuming they have all been listed - it stands to reason that at least one assumption will prove wrong, hence making the final estimates dead wrong. In that case, playing the estimate game is just another way to provide plausible deniability. A much more useful alternative would be to create an n-dimensional matrix of all items and estimate all possible combinations. But as in software projects, the event space just has too many parameters to do that in an acceptable timeframe.

    Proposal for an alternative

    That said, what about a real working alternative that aims to satisfy not dashboard managers but the underlying business? It would start by implementing the most basic requirement, and then add more features until it’s good enough, or until enough budget has been spent. Here are some possible steps for the above example:

    Foundation setup

    The initial goal of the setup is to enable log access, and the most important logs are application logs. Hence, the first setup is the following:

    Foundation architecture
    More JVM logs

    From this point on, scraping Tomcat’s logs as well is a near-zero effort, and helps incident analysis by adding correlation.

    More JVM logs
    Machine decoupling

    The next logical step is to move the Elasticsearch instance to its own dedicated machine, to add an extra level of modularity to the overall architecture.

    Machine decoupling
    Even more logs

    At this point, additional logs from other components - load balancer, Solr server, etc. can be sent to Elasticsearch to improve issue-solving involving different components.

    Performance improvement

    Given that Logstash is written in Ruby, there might be some performance issues when running Logstash directly alongside a component, depending on each machine’s specific load and performance. Elastic realized this some time ago and now proposes better performance via dedicated Beats. Every Logstash instance can be replaced by Filebeat.

    Not only logs

    With the Jolokia library, it’s possible to expose JMX beans through an HTTP interface. Unfortunately, there are only a few available Beats and none of them handles HTTP. However, Logstash with the http_poller plugin gets the job done.

    Reliability

    In order to improve reliability, Elasticsearch can be clustered.

    The good thing about those steps is that they can be implemented in (nearly) any order. This means that after laying the foundation - the first step - the stakeholder can decide which one makes sense for their specific context, or stop because the added value is already sufficient.

    At this point, estimates still might make sense regarding the first step. But after eliminating most complexity (and its related uncertainty), it feels much more comfortable estimating the setup of an Elastic stack in a specific context.

    Conclusion

    As stated above, whether Agile principles are implemented in software development projects can be subject to debate. However, my feeling is that they have not reached the Ops sphere yet. That’s a shame, because as in development, projects can truly benefit from real Agile practices. To prevent association with the Agile cargo cult, I proposed the term "Exploratory Infrastructure". This post described a proposal to apply such an exploratory approach to a sample infrastructure project. The main drawback of such an approach is that it will cost more, as the path is not a straight line; the main benefit is that at every step, the stakeholder can choose to continue or stop, taking into account the law of diminishing returns.


    1. This is an HTML joke, because "emphasis on less"
    2. If an estimate needs to be really taken as an estimate, the compound word guesstimate is available. Aren’t enterprise semantics wonderful?
    Categories: Technical Tags: agile, infrastructure, ops, devops
  • Starting Beats for Java developers

    Beats logo

    Last week, I wrote about how one could start developing one’s Logstash plugin coming from a Java developer background. However, with the acquisition of Packetbeat, Logstash now has help from Beats to push data to Elasticsearch. Beats are developed in Go, another challenge for traditional Java developers. This week, I tried porting my Logstash Reddit plugin to a dedicated Beat. This post documents my findings (spoiler: I found it much easier than with Ruby).

    Setting up the environment

    On OSX, installing the go executable is easy as pie:

    brew install go
    

    While Java or Ruby (or any language I know for that matter) can reside anywhere on the local filesystem, Go projects must all be situated in one single dedicated location, available under the $GOPATH environment variable.
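
    As an illustration, on my machine this boils down to something along these lines (the exact location is up to you):

    export GOPATH=$HOME/projects/private/go
    # the generated Beat project then lives under
    # $GOPATH/src/github.com/nfrankel/redditbeat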

    Creating the project

    As for Logstash plugins, Beats projects can be created from a template. The documentation to do that is quite straightforward. Given Go’s rigid requirements regarding location on the filesystem, just following instructions yields a new ready-to-use Go project.

    The default template code will repeatedly send an event with an incremented counter in the console:

    ./redditbeat -e -d "*"
    2016/12/13 22:55:56.013362 beat.go:267: INFO
      Home path: [/Users/i303869/projects/private/go/src/github.com/nfrankel/redditbeat]
      Config path: [/Users/i303869/projects/private/go/src/github.com/nfrankel/redditbeat]
      Data path: [/Users/i303869/projects/private/go/src/github.com/nfrankel/redditbeat/data]
      Logs path: [/Users/i303869/projects/private/go/src/github.com/nfrankel/redditbeat/logs]
    2016/12/13 22:55:56.013390 beat.go:177: INFO Setup Beat: redditbeat; Version: 6.0.0-alpha1
    2016/12/13 22:55:56.013402 processor.go:43: DBG  Processors: 
    2016/12/13 22:55:56.013413 beat.go:183: DBG  Initializing output plugins
    2016/12/13 22:55:56.013417 logp.go:219: INFO Metrics logging every 30s
    2016/12/13 22:55:56.013518 output.go:167: INFO Loading template enabled. Reading template file:
      /Users/i303869/projects/private/go/src/github.com/nfrankel/redditbeat/redditbeat.template.json
    2016/12/13 22:55:56.013888 output.go:178: INFO Loading template enabled for Elasticsearch 2.x. Reading template file:
      /Users/i303869/projects/private/go/src/github.com/nfrankel/redditbeat/redditbeat.template-es2x.json
    2016/12/13 22:55:56.014229 client.go:120: INFO Elasticsearch url: http://localhost:9200
    2016/12/13 22:55:56.014272 outputs.go:106: INFO Activated elasticsearch as output plugin.
    2016/12/13 22:55:56.014279 publish.go:234: DBG  Create output worker
    2016/12/13 22:55:56.014312 publish.go:276: DBG  No output is defined to store the topology.
      The server fields might not be filled.
    2016/12/13 22:55:56.014326 publish.go:291: INFO Publisher name: LSNM33795267A
    2016/12/13 22:55:56.014386 async.go:63: INFO Flush Interval set to: 1s
    2016/12/13 22:55:56.014391 async.go:64: INFO Max Bulk Size set to: 50
    2016/12/13 22:55:56.014395 async.go:72: DBG  create bulk processing worker (interval=1s, bulk size=50)
    2016/12/13 22:55:56.014449 beat.go:207: INFO redditbeat start running.
    2016/12/13 22:55:56.014459 redditbeat.go:38: INFO redditbeat is running! Hit CTRL-C to stop it.
    2016/12/13 22:55:57.370781 client.go:184: DBG  Publish: {
      "@timestamp": "2016-12-13T22:54:47.252Z",
      "beat": {
        "hostname": "LSNM33795267A",
        "name": "LSNM33795267A",
        "version": "6.0.0-alpha1"
      },
      "counter": 1,
      "type": "redditbeat"
    }
    

    Regarding command-line parameters: -e logs to the standard err, while -d "*" enables all debugging selectors. For the full list of parameters, type ./redditbeat --help.

    The code

    Go code is located in .go files (what a surprise…) in the project sub-folder of the $GOPATH/src folder.

    Configuration type

    The first interesting file is config/config.go, which defines a struct declaring the possible parameters for the Beat. As for the former Logstash plugin, let’s add a subreddit parameter and set its default value:

    type Config struct {
    	Period time.Duration `config:"period"`
    	Subreddit string `config:"subreddit"`
    }
    
    var DefaultConfig = Config {
    	Period: 15 * time.Second,
    	Subreddit: "elastic",
    }
    

    Beater type

    Code for the Beat itself is found in beater/redditbeat.go. The default template creates a struct for the Beat and three functions:

    1. The Beat constructor - it reads the configuration:
      func New(b *beat.Beat, cfg *common.Config) (beat.Beater, error) { ... }
      
    2. The Run function - should loop over the main feature of the Beat:
      func (bt *Redditbeat) Run(b *beat.Beat) error { ... }
      
    3. The Stop function handles graceful shutdown:
      func (bt *Redditbeat) Stop() { ... }
      

    Note 1: there’s no explicit interface implementation in Go. Simply implementing all methods of an interface creates an implicit implementation relationship. For documentation purposes, here’s the Beater interface:

    type Beater interface {
    	Run(b *Beat) error
    	Stop()
    }
    

    Hence, since the Beat struct implements both Run and Stop, it is a Beater.

    Note 2: there’s no concept of class in Go, so methods cannot be declared inside a type declaration. However, Go has something akin to extension functions: functions that add behavior to a type (within a single package). They need to declare the receiver type: this is done between the func keyword and the function name - here, it’s the Redditbeat type (or more correctly, a pointer to the Redditbeat type, but there’s an implicit conversion).

    The constructor and the Stop function can stay as they are; whatever feature needs to be developed must go into the Run function. In this case, the feature is to call the Reddit REST API and send a message for every Reddit post.

    The final code looks like the following:

    func (bt *Redditbeat) Run(b *beat.Beat) error {
    	bt.client = b.Publisher.Connect()
    	ticker := time.NewTicker(bt.config.Period)
    	reddit := "https://www.reddit.com/r/" + bt.config.Subreddit + "/.json"
    	client := &http.Client {}
    	for {
    		select {
    		case <-bt.done:
    			return nil
    		case <-ticker.C:
    		}
    		req, reqErr := http.NewRequest("GET", reddit, nil)
    		req.Header.Add("User-Agent", "Some existing header to bypass 429 HTTP")
    		if (reqErr != nil) {
    			panic(reqErr)
    		}
    		resp, getErr := client.Do(req)
    		if (getErr != nil) {
    			panic(getErr)
    		}
    		body, readErr := ioutil.ReadAll(resp.Body)
    		defer resp.Body.Close()
    		if (readErr != nil) {
    			panic(readErr)
    		}
    		trimmedBody := body[len(prefix):len(body) - len(suffix)]
    		messages := strings.Split(string(trimmedBody), separator)
    		for i := 0; i < len(messages); i ++ {
    			event := common.MapStr{
    				"@timestamp": common.Time(time.Now()),
    				"type":       b.Name,
    				"message":    "{" + messages[i] + "}",
    			}
    			bt.client.PublishEvent(event)
    		}
    	}
    }
    

    Here’s an explanation of the most important pieces:

    • line 4: creates the Reddit REST URL by concatenating strings, including the Subreddit configuration parameter. Remember that its default value has been defined in the config.go file.
    • line 5: gets a reference to a new HTTP client
    • line 12: create a new HTTP request. Note that Go allows for multiple return values.
    • line 13: if no standard request header is set, Reddit’s API will return a 429 status code
    • line 14: Go standard errors are not handled through exceptions but are returned alongside regular return values. According to the Golang wiki:

      Indicating error conditions to callers should be done by returning error value

    • line 15: the panic() function is similar to throwing an exception in Java, climbing up the stack until it’s handled. For more information, check the relevant documentation.
    • line 17: executes the HTTP request
    • line 21: reads the response body into a byte array
    • line 22: closes the body stream. Note the defer keyword:

      A defer statement defers the execution of a function until the surrounding function returns.

    • line 26: creates a slice - a reference to a part of an array - over the whole response body byte array. In essence, it removes the prefix and the suffix to keep only the relevant JSON value. It would be overkill to parse the byte array into JSON.
    • line 27: splits the slice to get each JSON fragment separately
    • line 29: creates the message as a simple dictionary structure
    • line 34: sends it

    Configure, Build, and launch

    Default configuration parameters can be found in the redditbeat.yml file at the root of the project. Note that additional common Beat parameters are listed in the redditbeat.full.yml, along with relevant comments.

    The interesting thing about Beats is that their messages can be sent directly to Elasticsearch or to Logstash for further processing. This is configured in the aforementioned configuration file.

    redditbeat:
      period: 10s
    output.elasticsearch:
      hosts: ["localhost:9200"]
    output.logstash:
      hosts: ["localhost:5044"]
      enabled: true
    

    This configuration snippet will loop over the Run method every 10 seconds and send messages to the Logstash instance running on localhost on port 5044. This can be overridden when running the Beat (see below).

    Note: for Logstash to accept messages from Beats, the Logstash Beat plugin must be installed and Logstash input must be configured for Beats:

    input {
      beats {
        port => 5044
      }
    }
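
    If the Beats input plugin is not already bundled with the Logstash distribution, installing it goes through the standard plugin manager:

    bin/logstash-plugin install logstash-input-beats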
    

    To build the project, type make at the project’s root. This creates an executable that can then be run:

    ./redditbeat -e -E redditbeat.subreddit=java
    

    The -E flag may override parameters found in the embedded redditbeat.yml configuration file (see above). Here, it sets the subreddit to be read to “java” instead of the default “elastic”.

    The output looks like the following:

    2016/12/17 14:51:19.748329 client.go:184: DBG  Publish: {
      "@timestamp": "2016-12-17T14:51:19.748Z",
      "beat": {
        "hostname": "LSNM33795267A",
        "name": "LSNM33795267A",
        "version": "6.0.0-alpha1"
      },
      "message": "{
        \"kind\": \"t3\", \"data\": {
          \"contest_mode\": false, \"banned_by\": null, 
          \"domain\": \"blogs.oracle.com\", \"subreddit\": \"java\", \"selftext_html\": null, 
          \"selftext\": \"\", \"likes\": null, \"suggested_sort\": null, \"user_reports\": [], 
          \"secure_media\": null, \"saved\": false, \"id\": \"5ipzgq\", \"gilded\": 0, 
          \"secure_media_embed\": {}, \"clicked\": false, \"report_reasons\": null, 
          \"author\": \"pushthestack\", \"media\": null, \"name\": \"t3_5ipzgq\", \"score\": 11, 
          \"approved_by\": null, \"over_18\": false, \"removal_reason\": null, \"hidden\": false, 
          \"thumbnail\": \"\", \"subreddit_id\": \"t5_2qhd7\", \"edited\": false, 
          \"link_flair_css_class\": null, \"author_flair_css_class\": null, \"downs\": 0, 
          \"mod_reports\": [], \"archived\": false, \"media_embed\": {}, \"is_self\": false, 
          \"hide_score\": false, \"spoiler\": false, 
          \"permalink\": \"/r/java/comments/5ipzgq/jdk_9_will_no_longer_bundle_javadb/\", 
          \"locked\": false, \"stickied\": false, \"created\": 1481943248.0, 
          \"url\": \"https://blogs.oracle.com/java-platform-group/entry/deferring_to_derby_in_jdk\", 
          \"author_flair_text\": null, \"quarantine\": false, 
          \"title\": \"JDK 9 will no longer bundle JavaDB\", \"created_utc\": 1481914448.0, 
          \"link_flair_text\": null, \"distinguished\": null, \"num_comments\": 4, 
          \"visited\": false, \"num_reports\": null, \"ups\": 11
        }
      }",
      "type": "redditbeat"
    }
    

    Conclusion

    Strangely enough, I found developing a Beat easier than a Logstash plugin. Go is more low-level and some concepts feel really foreign (like implicit interface implementation), but the ecosystem is much simpler - as the language is more recent. Also, Beats are more versatile, in that they can send to Elasticsearch and/or Logstash.

    Categories: Development Tags: Logstash, Elasticsearch, Beat, Go
  • Starting Logstash plugin development for Java developers

    Logstash old logo

    I recently became interested in Logstash, and after playing with it for a while, I decided to create my own custom plugin for learning purpose. I chose to pull data from Reddit because a) I use it often and b) there’s no existing plugin that offers that.

    The Elasticsearch site offers quite exhaustive documentation on creating one’s own Logstash plugin. Such an endeavour requires Ruby skills - not only the language syntax but also the ecosystem. Expectedly, the site assumes the reader is familiar with both. Unfortunately, that’s not my case. I’ve been developing in Java a lot, I’ve dabbled somewhat in Scala, I’m quite interested in Kotlin - in the end, I’m just a JVM developer (plus some Javascript here and there). Long story short, I started from scratch in Ruby.

    At this stage, there are two possible approaches:

    1. Read documentation and tutorials about Ruby, Gems, Bundler, the whole nine yards, and come back in a few months (or more)
    2. Or learn on the spot by diving right into development

    Given that I don’t have months, and that whatever I learn along the way is good enough, I opted for the 2nd option. This post is a sum-up of the steps I went through, in the hope it might benefit others who find themselves in the same situation.

    The first step is not the hardest

    Though new Logstash plugins can be started from scratch, the documentation advises starting from a template. This is explained in the online procedure. The generation yields the following structure:

    $ tree logstash-input-reddit
    ├── Gemfile
    ├── LICENSE
    ├── README.md
    ├── Rakefile
    ├── lib
    │   └── logstash
    │       └── inputs
    │           └── reddit.rb
    ├── logstash-input-reddit.gemspec
    └── spec
        └── inputs
            └── reddit_spec.rb
    

    Not so obviously for a Ruby newbie, this structure is that of a Ruby Gem. In general, dependencies are declared in the associated Gemfile:

    source 'https://rubygems.org'
    gemspec
    

    However, in this case, the gemspec directive adds one additional indirection level. Not only dependencies, but also meta-data, are declared in the associated gemspec file. This is a feature of the Bundler utility gem.

    To install dependencies, the bundler gem first needs to be installed. Aye, there’s the rub…

    Ruby is the limit

    Trying to install the gem yields the following:

    gem install bundler
    Fetching: bundler-1.13.6.gem (100%)
    ERROR:  While executing gem ... (TypeError)
        no implicit conversion of nil into String
    

    The first realization - and it took a lot of time (browsing and reading) - is that there are different flavours of Ruby runtimes. Plain Ruby is not enough for Logstash plugin development: it requires a dedicated runtime that runs on the JVM, aka JRuby.

    The second realization is that while it’s easy to install multiple Ruby runtimes on a machine, it’s impossible to have them run at the same time. While Homebrew makes the jruby package available, it seems there’s only one single gem repository per system and it reacts very poorly to being managed by different runtimes.

    After some more browsing, I found the solution: rbenv. It not only manages Ruby itself, but also all associated executables (gem, irb, rake, etc.) by isolating every runtime. This makes it possible to run my Jekyll site with the latest 2.2.3 Ruby runtime and build the plugin with JRuby on the same machine. rbenv is available via Homebrew.

    This is how it goes:

    Install rbenv
    brew install rbenv
    Configure the PATH
    echo 'eval "$(rbenv init -)"' >> ~/.bash_profile
    Source the bash profile script
    . ~/.bash_profile
    List all available runtimes
    rbenv install -l
    Available versions:
      1.8.5-p113
      1.8.5-p114
      ...
      ...
      ...
      ree-1.8.7-2012.02
      topaz-dev
    
    Install the desired runtime
    rbenv install jruby-9.1.6.0
    Configure the project to use the desired runtime
    cd logstash-input-reddit
    rbenv local jruby-9.1.6.0
    
    Check it's configured
    ruby --version
    jruby-9.1.6.0
    

    Finally, bundler can be installed:

    gem install bundler
    Successfully installed bundler-1.13.6
    1 gem installed
    

    And from this point on, all required gems can be installed as well:

    bundle install
    Fetching gem metadata from https://rubygems.org/.........
    Fetching version metadata from https://rubygems.org/..
    Fetching dependency metadata from https://rubygems.org/.
    Resolving dependencies...
    Installing rake 12.0.0
    Installing public_suffix 2.0.4
    ...
    ...
    ...
    Installing rspec-wait 0.0.9
    Installing logstash-core-plugin-api 2.1.17
    Installing logstash-codec-plain 3.0.2
    Installing logstash-devutils 1.1.0
    Using logstash-input-reddit 0.1.0 from source at `.`
    Bundle complete! 2 Gemfile dependencies, 57 gems now installed.
    Use `bundle show [gemname]` to see where a bundled gem is installed.
    Post-install message from jar-dependencies:
    
    if you want to use the executable lock_jars then install ruby-maven gem before using lock_jars 
    
       $ gem install ruby-maven -v '~> 3.3.11'
    
    or add it as a development dependency to your Gemfile
    
       gem 'ruby-maven', '~> 3.3.11'
    

    Plugin development proper

    With those requirements finally addressed, proper plugin development can start. Let’s skip finding the right API to make an HTTP request in Ruby, or addressing Bundler warnings when installing dependencies; the final code is quite terse:

    class LogStash::Inputs::Reddit < LogStash::Inputs::Base
    
      config_name 'reddit'
      default :codec, 'plain'
      config :subreddit, :validate => :string, :default => 'elastic'
      config :interval, :validate => :number, :default => 10
    
      public
      def register
        @host = Socket.gethostname
        @http = Net::HTTP.new('www.reddit.com', 443)
        @get = Net::HTTP::Get.new("/r/#{@subreddit}/.json")
        @http.use_ssl = true
      end
    
      def run(queue)
        # we can abort the loop if stop? becomes true
        while !stop?
          response = @http.request(@get)
          json = JSON.parse(response.body)
          json['data']['children'].each do |child|
            event = LogStash::Event.new('message' => child, 'host' => @host)
            decorate(event)
            queue << event
          end
          Stud.stoppable_sleep(@interval) { stop? }
        end
      end
    end
    

    The plugin defines two configuration parameters: which subreddit will be parsed for data, and the interval between 2 calls (in seconds).
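
    Once built and installed (see below), using the plugin from a Logstash pipeline might look like the following sketch - the values are arbitrary:

    input {
      reddit {
        subreddit => "java"
        interval => 30
      }
    }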

    The register method initializes the class attributes, while the run method loops over:

    • Making the HTTP call to Reddit
    • Parsing the response body as JSON
    • Making dedicated fragments from the JSON, one for each post. This is particularly important because we want to index each post separately.
    • Sending each fragment as a Logstash event for indexing

    Of course, it’s very crude: there’s no error handling, it doesn’t save the timestamp of the last read post to prevent indexing duplicates, etc. In its current state, the plugin offers a lot of room for improvement, but at least it works from an MVP point of view.

    Building and installing

    As written above, the plugin is a Ruby gem. It can be built like any other gem:

    gem build logstash-input-reddit.gemspec
    

    This creates a binary file named logstash-input-reddit-0.1.0.gem - name and version both come from the gemspec. It can be installed using the standard Logstash plugin installation procedure:

    bin/logstash-plugin install logstash-input-reddit-0.1.0.gem
    

    Downstream processing

    One huge benefit of Logstash is the power of its processing pipeline. The plugin is designed to produce raw data, but the indexing should handle each field separately. Extracting fields from another field can be achieved with the mutate filter.

    Here’s an example Logstash configuration snippet that fills some relevant fields (and removes message):

    filter{
      mutate {
        add_field => {
          "kind" => "%{message[kind]}"
          "subreddit" => "%{message[data][subreddit]}"
          "domain" => "%{message[data][domain]}"
          "selftext" => "%{message[data][selftext]}"
          "url" => "%{message[data][url]}"
          "title" => "%{message[data][title]}"
          "id" => "%{message[data][id]}"
          "author" => "%{message[data][author]}"
          "score" => "%{message[data][score]}"
          "created_utc" => "%{message[data][created_utc]}"
        }
        remove_field => [ "message" ]
      }
    }
    

    Once the plugin has been built and installed, Logstash can be run with a config file that includes the previous snippet. It should yield something akin to the following - when used in conjunction with the rubydebug codec:

    {
           "selftext" => "",
               "kind" => "t3",
             "author" => "nfrankel",
              "title" => "Structuring data with Logstash",
          "subreddit" => "elastic",
                "url" => "https://blog.frankel.ch/structuring-data-with-logstash/",
               "tags" => [],
              "score" => "9",
         "@timestamp" => 2016-12-07T22:32:03.320Z,
             "domain" => "blog.frankel.ch",
               "host" => "LSNM33795267A",
           "@version" => "1",
                 "id" => "5f66bk",
        "created_utc" => "1.473948927E9"
    }
    

    Conclusion

    Starting from near-zero knowledge of the Ruby ecosystem, I’m quite happy with the result.

    The only thing I couldn’t achieve was to add 3rd-party libraries (like rest-client): Logstash kept complaining about not being able to install the plugin because of a missing dependency. Falling back to standard HTTP calls solved the issue.

    Also, note that the default template has some warnings on install, but they can be fixed quite easily:

    • The license should read Apache-2.0 instead of Apache License (2.0)
    • Dependency versions are open-ended ('>= 0') whereas they should be more constrained, e.g. '~> 2'
    • Some meta-data is missing, like the homepage
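
    For reference, the corresponding gemspec entries could look something like this fragment - the homepage URL is only a placeholder:

    s.licenses = ['Apache-2.0']
    s.homepage = 'https://example.com/logstash-input-reddit'
    s.add_runtime_dependency 'logstash-core-plugin-api', '~> 2.0'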

    I hope this post will be useful to other Java developers wanting to develop their own Logstash plugin.

    Categories: Development Tags: Logstash, plugin, ruby
  • Feeding Spring Boot metrics to Elasticsearch

    Elasticsearch logo

    This week’s post aims to describe how to send JMX metrics taken from the JVM to an Elasticsearch instance.

    Business app requirements

    The business app(s) has some minor requirements.

    The easiest use-case is to start from a Spring Boot application. In order for metrics to be available, just add the Actuator dependency to it:

    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    

    Note that when inheriting from spring-boot-starter-parent, setting the version is not necessary, as it’s inherited from the parent POM.
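
    For reference, inheriting from the parent looks like the following - the version number is only an example:

    <parent>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-parent</artifactId>
      <version>1.4.2.RELEASE</version>
    </parent>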

    To send data to JMX, configure a brand-new @Bean in the context:

    @Bean @ExportMetricWriter
    MetricWriter metricWriter(MBeanExporter exporter) {
        return new JmxMetricWriter(exporter);
    }
    

    To-be architectural design

    There are several options to put JMX data into Elasticsearch.

    Possible options

    1. The most straightforward way is to use Logstash with the JMX plugin
    2. Alternatively, one can hack his own micro-service architecture:
      • Let the application send metrics to JMX - there’s the Spring Boot Actuator for that, and the overhead is pretty limited
      • Have a feature expose JMX data on an HTTP endpoint using Jolokia
      • Have a dedicated app poll the endpoint and send data to Elasticsearch

      This way, every component has its own responsibility, there’s not much performance overhead and the metric-handling part can fail while the main app is still available.

    3. An alternative would be to directly poll the JMX data from the JVM

    Unfortunate setback

    Any architect worth his salt (read: lazy) should always consider the out-of-the-box option first. The Logstash JMX plugin looks promising. After installing the plugin, the jmx input can be configured in the Logstash configuration file:

    input {
      jmx {
        path => "/var/logstash/jmxconf"
        polling_frequency => 5
        type => "jmx"
      }
    }
    
    output {
      stdout { codec => rubydebug }
    }
    

    The plugin is designed to read JVM parameters (such as host and port), as well as the metrics to handle, from JSON configuration files. In the above example, these are watched in the /var/logstash/jmxconf folder. Moreover, they can be added, removed and updated on the fly.

    Here’s an example of such configuration file:

    {
      "host" : "localhost",
      "port" : 1616,
      "alias" : "petclinic",
      "queries" : [
      {
        "object_name" : "org.springframework.metrics:name=*,type=*,value=*",
        "object_alias" : "${type}.${name}.${value}"
      }]
    }
    

    An MBean’s ObjectName can be determined from inside jconsole:

    JConsole screenshot

    The plugin allows wildcards in the metric’s name and the usage of captured values in the alias. Also, by default, all attributes will be read (they can be restricted if necessary).

    Note: when starting the business app, it’s highly recommended to set the JMX port through the com.sun.management.jmxremote.port system property.
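
    For example, the business app could be launched along these lines - the jar name is made up, and disabling authentication and SSL is only acceptable for local testing:

    java -Dcom.sun.management.jmxremote.port=1616 \
         -Dcom.sun.management.jmxremote.authenticate=false \
         -Dcom.sun.management.jmxremote.ssl=false \
         -jar petclinic.jar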

    Unfortunately, at the time of this writing, running the above configuration fails with messages of this kind:

    [WARN][logstash.inputs.jmx] Failed retrieving metrics for attribute Value on object blah blah blah
    [WARN][logstash.inputs.jmx] undefined method `event' for #<LogStash::Inputs::Jmx:0x70836e5d>
    

    For reference purposes, the GitHub issue can be found here.

    The do-it-yourself alternative

    Considering it’s easier to poll HTTP endpoints than JMX - and that implementations already exist - let’s go for option 2 above. Libraries will include:

    • Spring Boot for the business app
    • With the Actuator starter to provide metrics
    • Configured with the JMX exporter for sending data
    • Also with the dependency to expose JMX beans on an HTTP endpoint
    • Another Spring Boot app for the “poller”
    • Configured with a scheduled service to regularly poll the endpoint and send it to Elasticsearch

    Draft architecture component diagram

    Additional business app requirement

    To expose the JMX data over HTTP, simply add the Jolokia dependency to the business app:

    <dependency>
      <groupId>org.jolokia</groupId>
      <artifactId>jolokia-core</artifactId>
     </dependency>
    

    From this point on, one can query for any JMX metric via the HTTP endpoint exposed by Jolokia - by default, the full URL looks like /jolokia/read/<JMX_ObjectName>.
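
    For example, assuming the management context path is set to /manage (as in the poller code further below), the operating system metrics can be fetched with:

    curl http://localhost:8080/manage/jolokia/read/java.lang:type=OperatingSystem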

    Custom-made broker

    The broker app responsibilities include:

    • reading JMX metrics from the business app through the HTTP endpoint at regular intervals
    • sending them to Elasticsearch for indexing

    My initial move was to use Spring Data, but it seems the current release is not compatible with the latest Elasticsearch 5 version, as I got the following exception:

    java.lang.IllegalStateException:
      Received message from unsupported version: [2.0.0] minimal compatible version is: [5.0.0]
    

    Besides, Spring Data is based on entities, which implies deserializing from HTTP and serializing back again to Elasticsearch: that has a negative impact on performance for no real added value.

    The code itself is quite straightforward:

    @SpringBootApplication
    @EnableScheduling
    open class JolokiaElasticApplication {
    
      @Autowired lateinit var client: JestClient
    
      @Bean open fun template() = RestTemplate()
    
      @Scheduled(fixedRate = 5000)
      open fun transfer() {
        val result = template().getForObject(
          "http://localhost:8080/manage/jolokia/read/org.springframework.metrics:name=status,type=counter,value=beans",
          String::class.java)
        val index = Index.Builder(result).index("metrics").type("metric").id(UUID.randomUUID().toString()).build()
        client.execute(index)
      }
    }
    
    fun main(args: Array<String>) {
      SpringApplication.run(JolokiaElasticApplication::class.java, *args)
    }
    

    Of course, it’s a Spring Boot application (line 1). To poll at regular intervals, it must be annotated with @EnableScheduling (line 2) and have the polling method annotated with @Scheduled and parameterized with the interval in milliseconds (line 9).

    In a Spring Boot application, calling HTTP endpoints is achieved through the RestTemplate. Once created (line 7) - it’s a singleton - it can be (re)used throughout the application. The call result is deserialized into a String (line 11).

    The client to use is Jest. Jest offers a dedicated indexing API: it just requires the JSON string to be sent, along with the index name, the type name, and the id (line 14). With the Spring Boot Elastic starter on the classpath, a JestClient instance is automatically registered in the bean factory. Just autowire it in the configuration (line 5) to use it (line 15).

    At this point, launching the Spring Boot application will poll the business app at regular intervals for the specified metrics and send it to Elasticsearch. It’s of course quite crude, everything is hard-coded, but it gets the job done.

    Conclusion

    Despite the failing plugin, we managed to get the JMX data from the business application to Elasticsearch by using a dedicated Spring Boot app.

  • Structuring data with Logstash

    Logstash old logo

    Given the trend around microservices, it has become mandatory to be able to follow a transaction across multiple microservices. Spring Cloud Sleuth is such a distributed tracing system fully integrated into the Spring Boot ecosystem. By adding the spring-cloud-starter-sleuth into a project’s POM, it instantly becomes Sleuth-enabled and every standard log call automatically adds additional data, such as spanId and traceId to the usual data.

    2016-11-25 19:05:53.221  INFO [demo-app,b4d33156bc6a49ec,432b43172c958450,false] 25305 ---\n
    [nio-8080-exec-1] ch.frankel.blog.SleuthDemoApplication      : this is an example message
    

    (broken on 2 lines for better readability)

    Now, instead of sending the data to Zipkin, let’s say I need to store it in Elasticsearch instead. A product is only as good as the way it’s used, and indexing unstructured log messages is not very useful. Logstash configuration allows pre-parsing unstructured data and sending structured data instead.

    Grok

    Grokking data is the usual way to structure data with pattern matching.

    Last week, I wrote about some hints for the configuration. Unfortunately, the hard part comes in writing the matching pattern itself, and those hints don’t help. While it might be possible to write a perfect Grok pattern on the first draft, the above log is complicated enough that it’s far from a certainty, and chances are high to stumble upon such a message when starting Logstash with an unfit Grok filter:

    "tags" => [
        [0] "_grokparsefailure"
    ]
    

    However, there’s “an app for that” (sounds familiar?). It offers three fields:

    1. The first field accepts one (or more) log line(s)
    2. The second the grok pattern
    3. The 3rd is the result of filtering the 1st by the 2nd

    The process is now to match fields one by one, from left to right. The first data field, e.g. 2016-11-25 19:05:53.221, is obviously a timestamp. Among the common grok patterns, it looks as if the TIMESTAMP_ISO8601 pattern would be the best fit.

    Enter %{TIMESTAMP_ISO8601:timestamp} into the Pattern field. The result is:

    {
      "timestamp": [
        [
          "2016-11-25 17:05:53.221"
        ]
      ]
    }
    

    The next field to handle looks like the log level. Among the patterns, there’s one LOGLEVEL. The Pattern now becomes %{TIMESTAMP_ISO8601:timestamp} *%{LOGLEVEL:level} and the result:

    {
      "timestamp": [
        [
          "2016-11-25 17:05:53.221"
        ]
      ],
      "level": [
        [
          "INFO"
        ]
      ]
    }
    

    Rinse and repeat until all fields have been structured. Given the initial log line, the final pattern should look something along those lines:

    %{TIMESTAMP_ISO8601:timestamp} *%{LOGLEVEL:level} \[%{DATA:application},%{DATA:traceId},%{DATA:spanId},%{DATA:zipkin}]\n
    %{DATA:pid} --- *\[%{DATA:thread}] %{JAVACLASS:class} *: %{GREEDYDATA:log}
    

    (broken on 2 lines for better readability)

    And the associated result:

    {
            "traceId" => "b4d33156bc6a49ec",
             "zipkin" => "false",
              "level" => "INFO",
                "log" => "this is an example message",
                "pid" => "25305",
             "thread" => "nio-8080-exec-1",
               "tags" => [],
             "spanId" => "432b43172c958450",
               "path" => "/tmp/logstash.log",
         "@timestamp" => 2016-11-26T13:41:07.599Z,
        "application" => "demo-app",
           "@version" => "1",
               "host" => "LSNM33795267A",
              "class" => "ch.frankel.blog.SleuthDemoApplication",
          "timestamp" => "2016-11-25 17:05:53.221"
    }
    

    Dissect

    The Grok filter gets the job done. But it seems to suffer from performance issues, especially if the pattern doesn’t match. An alternative is to use the dissect filter instead, which is based on separators.

    Unfortunately, there’s no app for that - but it’s much easier to write a separator-based filter than a regex-based one. The mapping equivalent to the above is:

    %{timestamp} %{+timestamp} %{level}[%{application},%{traceId},%{spanId},%{zipkin}]\n
    %{pid} %{}[%{thread}] %{class}:%{log}
    

    (broken on 2 lines for better readability)

    This outputs the following:

    {
            "traceId" => "b4d33156bc6a49ec",
             "zipkin" => "false",
                "log" => " this is an example message",
              "level" => "INFO ",
                "pid" => "25305",
             "thread" => "nio-8080-exec-1",
               "tags" => [],
             "spanId" => "432b43172c958450",
               "path" => "/tmp/logstash.log",
         "@timestamp" => 2016-11-26T13:36:47.165Z,
        "application" => "demo-app",
           "@version" => "1",
               "host" => "LSNM33795267A",
              "class" => "ch.frankel.blog.SleuthDemoApplication      ",
          "timestamp" => "2016-11-25 17:05:53.221"
    }
    

    Notice the slight differences: by moving from a regex-based filter to a separator-based one, some strings end up padded with spaces. There are 2 ways to handle that:

    • change the logging pattern in the application - which might make direct log reading harder
    • strip additional spaces with Logstash

    With the second option, the final filter configuration snippet is:

    filter {
      dissect {
        mapping => { "message" => ... }
      }
      mutate {
        strip => [ "log", "class" ]
      }
    }
    

    Conclusion

    In order to structure data, the grok filter is powerful and used by many. However, depending on the specific log format to parse, writing the filter expression might be quite complex a task. The dissect filter, based on separators, is an alternative that makes it much easier - at the price of some additional handling. It also is an option to consider in case of performance issues.

    Categories: Development Tags: elk, logstash, elastic, parsing, data
  • Debugging hints for Logstash

    Logstash old logo

    As a Java developer, when you are first shown how to run the JVM in debug mode, attach to it and then set a breakpoint, you really feel like you've reached a step on your developer journey. Well, at least I did. Now the world is going full microservice, and knowing that trick means less and less every day.

    This week, I was playing with Logstash to see how I could send all of an application's exceptions to an Elasticsearch instance, so I could display them on a Kibana dashboard for analytics purposes. Of course, nothing showed up in Elasticsearch at first. This post describes what helped me make it work in the end.

    The setup

    Components are the following:

    • The application. Since a lot of exceptions were necessary, I made use of the Java Bullshifier. The only adaptation was to wire in some code to log exceptions in a log file.
      public class ExceptionHandlerExecutor extends ThreadPoolExecutor {
      
          private static final Logger LOGGER = LoggerFactory.getLogger(ExceptionHandlerExecutor.class);
      
          public ExceptionHandlerExecutor(int corePoolSize, int maximumPoolSize,
                  long keepAliveTime, TimeUnit unit, BlockingQueue<Runnable> workQueue) {
              super(corePoolSize, maximumPoolSize, keepAliveTime, unit, workQueue);
          }
      
          @Override
          protected void afterExecute(Runnable r, Throwable t) {
              if (r instanceof FutureTask) {
                  FutureTask<Exception> futureTask = (FutureTask<Exception>) r;
                  if (futureTask.isDone()) {
                      try {
                          futureTask.get();
                      } catch (InterruptedException | ExecutionException e) {
                          LOGGER.error("Uncaught error", e);
                      }
                  }
              }
          }
      }
    • Logstash
    • Elasticsearch
    • Kibana

    The first bump

    Before being launched, Logstash needs to be configured - especially its input and its output.

    There’s no out-of-the-box input focused on exception stack traces. Those are multi-line messages: only lines starting with a timestamp mark the beginning of a new message. Logstash achieves that with a dedicated multiline codec that relies on a regex pattern.

    Some people, when confronted with a problem, think "I know, I'll use regular expressions."
    Now they have two problems.
    -- Jamie Zawinski

    Some people are better at regular expressions than others, but nobody learned them as their mother tongue. Hence, errors are not rare. In that case, one should use one of the many online regex validators available: they are priceless for understanding why a pattern doesn't match.

    The relevant Logstash configuration snippet becomes:

    input {
      file {
        path => "/tmp/logback.log"
        start_position => "beginning"
        codec => multiline {
          pattern => "^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
          negate => true
          what => "previous"
        }
      }
    }
    

    The second bump

    Now, data found its way into Elasticsearch, but messages were not in the expected format. In order to analyze where this problem came from, messages can be printed on the console instead of being indexed in Elasticsearch.

    That’s quite easy with the following snippet:

    output {
      stdout { codec => rubydebug }
    }
    

    With messages printed on the console, it’s possible to understand where the issue occurs. In that case, I was able to tweak the input configuration (and add the forgotten negate => true bit).

    Finally, I got the expected result:

    Conclusion

    With more and more tools appearing every passing day, the toolbelt of the modern developer needs to grow as well. Unfortunately, there's no one-size-fits-all solution: in order to know a tool's every nook and cranny, one needs to use and re-use it, be creative, and search on Google… a lot.

    Categories: Development Tags: elk, logstash, elastic, debugging
  • Another post-processor for Spring Boot

    Spring Boot logo

    Most Spring developers know about the BeanPostProcessor and the BeanFactoryPostProcessor interfaces. The former enables changes to new bean instances before they can be used, while the latter lets you modify bean definitions - the metadata used to create the beans. Common use-cases include:

    • Bootstrapping processing of @Configuration classes, via ConfigurationClassPostProcessor
    • Resolving ${...} placeholders, through PropertyPlaceholderConfigurer
    • Autowiring of annotated fields, setter methods and arbitrary config methods - AutowiredAnnotationBeanPostProcessor
    • And so on, and so forth…

    Out-of-the-box and custom post-processors are enough to meet most requirements regarding the Spring Framework proper.
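
    To put the first one in perspective, here's a minimal sketch of a custom BeanPostProcessor - the class name and the logging are made up for illustration purposes:

    import org.springframework.beans.BeansException;
    import org.springframework.beans.factory.config.BeanPostProcessor;
    import org.springframework.stereotype.Component;

    @Component
    public class LoggingBeanPostProcessor implements BeanPostProcessor {

        @Override
        public Object postProcessBeforeInitialization(Object bean, String beanName) throws BeansException {
            // Called before any initialization callback: return the instance, possibly wrapped
            System.out.println("About to initialize bean " + beanName);
            return bean;
        }

        @Override
        public Object postProcessAfterInitialization(Object bean, String beanName) throws BeansException {
            // Nothing to change here: return the bean as-is
            return bean;
        }
    }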

    Then comes Spring Boot, with its convention over configuration approach. Among its key features is the ability to read configuration from different sources, such as the default application.properties, the default application.yml, another configuration file, and/or System properties passed on the command line. What happens behind the scenes is that they all are merged into an Environment instance. This object can then be injected into any bean and queried for any value by passing the key.
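
    As an illustration, a bean could query the merged environment like below - the property key and the default value are purely hypothetical:

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.core.env.Environment;
    import org.springframework.stereotype.Component;

    @Component
    public class GreetingService {

        private final Environment environment;

        @Autowired
        public GreetingService(Environment environment) {
            this.environment = environment;
        }

        public String greeting() {
            // Returns the merged value, wherever it was defined, or the fallback
            return environment.getProperty("app.greeting", "Hello");
        }
    }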

    As a starter designer, how does one define a default configuration value? Obviously, it cannot be set via a Spring @Bean-annotated method. Let's analyze the Spring Cloud Sleuth starter as an example:

    Spring Cloud Sleuth implements a distributed tracing solution for Spring Cloud, borrowing heavily from Dapper, Zipkin and HTrace. For most users Sleuth should be invisible, and all your interactions with external systems should be instrumented automatically. You can capture data simply in logs, or by sending it to a remote collector service.

    Regarding the configuration, the starter changes the default log format to display additional information (to be specific, span and trace IDs but that’s not relevant to the post). Let’s dig further.

    As for auto-configuration classes, the magic starts in the META-INF/spring.factories file in the Spring Cloud Sleuth starter JAR:

    # Environment Post Processor
    org.springframework.boot.env.EnvironmentPostProcessor=\
    org.springframework.cloud.sleuth.autoconfig.TraceEnvironmentPostProcessor
    

    The definition of the interface looks like the following:

    public interface EnvironmentPostProcessor {
      void postProcessEnvironment(ConfigurableEnvironment environment, SpringApplication application);
    }
    

    And the implementation looks like this:

    public class TraceEnvironmentPostProcessor implements EnvironmentPostProcessor {
    
      private static final String PROPERTY_SOURCE_NAME = "defaultProperties";
    
      @Override
      public void postProcessEnvironment(ConfigurableEnvironment environment, SpringApplication application) {
        Map<String, Object> map = new HashMap<String, Object>();
        map.put("logging.pattern.level",
          "%clr(%5p) %clr([${spring.application.name:},%X{X-B3-TraceId:-},%X{X-B3-SpanId:-},%X{X-Span-Export:-}]){yellow}");
        map.put("spring.aop.proxyTargetClass", "true");
        addOrReplace(environment.getPropertySources(), map);
      }
    
      private void addOrReplace(MutablePropertySources propertySources, Map<String, Object> map) {
        MapPropertySource target = null;
        if (propertySources.contains(PROPERTY_SOURCE_NAME)) {
          PropertySource<?> source = propertySources.get(PROPERTY_SOURCE_NAME);
          if (source instanceof MapPropertySource) {
            target = (MapPropertySource) source;
            for (String key : map.keySet()) {
              if (!target.containsProperty(key)) {
                target.getSource().put(key, map.get(key));
              }
            }
          }
        }
        if (target == null) {
          target = new MapPropertySource(PROPERTY_SOURCE_NAME, map);
        }
        if (!propertySources.contains(PROPERTY_SOURCE_NAME)) {
          propertySources.addLast(target);
        }
      }
    }
    

    As can be seen, the implementation adds both the logging.pattern.level and the spring.aop.proxyTargetClass properties (with relevant values) to the defaultProperties source - but only if they are not set yet. If that property source doesn't exist, it's created and appended at the end of the sources list, so any user-defined value still takes precedence over these defaults.

    With @Conditional, starters can provide default beans in auto-configuration classes, while with EnvironmentPostProcessor, they can provide default property values as well. Using both in conjunction can go a long way toward offering a great convention over configuration Spring Boot experience when designing your own starter.

  • Travis CI tutorial for Java projects

    Travis CI logo

    As a consultant, I’ve done a lot of Java projects in different “enterprise” environments. In general, the Continuous Integration stack - when there is one - is composed of:

    • Github Enterprise or Atlassian Stash for source version control
    • Jenkins as the CI server, sometimes but rarely Atlassian Bamboo
    • Maven for the build tool
    • JaCoCo for code coverage
    • Artifactory as the artifacts repository - once I had Nexus

    Recently, I started to develop Kaadin, a Kotlin DSL to design Vaadin applications. I naturally hosted it on Github, and wanted the same features as for my real-life projects above. This post describes how to achieve all desired features using a whole new stack, one that might not be familiar to enterprise Java developers.

    Github was a perfect match. Then I went on to search for a Jenkins cloud provider to run my builds… to no avail. This wasn’t such a surprise, as I had already searched for one last year for a course on Continuous Integration, without any success. I definitely could have installed my own instance on any IaaS platform, but it always pays to be idiomatic. There are plenty of Java projects hosted on Github, and they mostly use Travis CI, so I went down this road.

    Travis CI registration

    Registering is as easy as signing in using Github credentials, accepting all conditions and setting the project(s) that need to be built.

    Travis CI projects choice

    Basics

    The project configuration is read from a .travis.yml file at its root. Historically, the platform was aimed at Ruby, but nowadays different kinds of projects in different languages can be built. The most important configuration part is to define the language. Since mine is a Java project, the second most important is which JDK to use:

    language: java
    jdk: oraclejdk8
    

    From that point on, every push to the Github repository will trigger a build - branches included.

    Compilation

    If a POM exists at the root of the project, it’s automatically detected and Maven (or the Maven wrapper) will be used as the build tool. It’s a good idea to use the wrapper, as it pins the Maven version; barring that, there’s no hint about which version is used.

    As written above, Travis CI was originally designed for Ruby projects. In Ruby, dependencies are installed system-wide, not per project as for Maven. Hence, the lifecycle is composed of:

    1. an install dependencies phase, translated by default into ./mvnw install -DskipTests=true -Dmaven.javadoc.skip=true -B -V
    2. a build phase, explicitly defined by each project

    This two-phase mapping is not relevant for Maven projects, as Maven uses a lazy approach: if dependencies are not in the local repository, they are downloaded when required. For more idiomatic handling, the first phase can be bypassed, while the second one just needs to be set:

    install: true
    script: ./mvnw clean install
    

    Improve build speed

    In order to speed up future builds, it’s a good idea to keep the Maven local repository between different runs, as it would be the case on Jenkins or a local machine. The following configuration achieves just that:

    cache:
      directories:
      - $HOME/.m2
    

    Tests and failing the build

    As with standard Maven builds, phases are run sequentially, so calling install will first compile and then test. A single failing test, and the build fails. A report is sent by email when that happens.

    A build failed in Travis CI

    Most Open Source projects display their build status badge on their homepage to build trust with their users. Travis CI provides this badge; it just needs to be hot-linked, as seen below:

    Kaadin build status
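
    For the record, hot-linking boils down to referencing the image served by Travis CI. Assuming a Markdown README and a hypothetical nfrankel/kaadin repository slug, it could look like this:

    [![Build Status](https://travis-ci.org/nfrankel/kaadin.svg?branch=master)](https://travis-ci.org/nfrankel/kaadin)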

    Code coverage

    Travis CI doesn’t use the JaCoCo reports generated during the Maven build; one has to use another tool. There are several available: I chose https://codecov.io/, for no other reason than that Mockito also uses it. The drill is the same: register using Github, accept the conditions, and it’s a go.

    Codecov happily uses JaCoCo coverage reports, but chooses to display only line coverage - the least meaningful metric IMHO. Yet, it’s widespread enough, so let’s display it to users via a nice badge that can be hot-linked:

    Codecov

    The build configuration file needs to be updated:

    script:
      - ./mvnw clean install
      - bash <(curl -s https://codecov.io/bash)
    

    On every build, Travis CI calls the online Codecov shell script, which will somehow update the code coverage value based on the JaCoCo report generated during the previous build command.

    Deployment to Bintray

    Companies generally deploy built artifacts on an internal repository. Open Source projects should be deployed on public repositories for users to download - this means Bintray and JCenter. Fortunately, Travis CI provides a lot of different remotes to deploy to, including Bintray.

    Usually, only a dedicated branch - e.g. release - should deploy to the remote. That parameter is available in the configuration file:

    deploy:
      -
        on:
          branch: release
        provider: bintray
        skip_cleanup: true
        file: target/bin/bintray.json
        user: nfrankel
        key: $BINTRAY_API_KEY
    

    Note the $BINTRAY_API_KEY variable above. Travis CI offers environment variables to allow for some flexibility. Each of them can be configured so that it’s not displayed in the logs. Beware that in that case, it’s treated as a secret and cannot be displayed again in the user interface.

    Travis CI environment variables

    For Bintray, it means getting hold of the API key on Bintray, creating a variable with a relevant name and setting its value. Of course, other variables can be created, as many as required.

    Most of the deployment configuration is delegated to a dedicated Bintray-specific JSON file. Among other information, it contains both the artifactId and version values from the Maven POM. Maven filtering is configured to automatically pull these values from the POM into the descriptor: notice that the referenced path lies under the generated target folder.
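
    Putting all the snippets above together, the resulting .travis.yml looks something along these lines (values taken verbatim from the previous sections):

    language: java
    jdk: oraclejdk8
    install: true
    script:
      - ./mvnw clean install
      - bash <(curl -s https://codecov.io/bash)
    cache:
      directories:
      - $HOME/.m2
    deploy:
      -
        on:
          branch: release
        provider: bintray
        skip_cleanup: true
        file: target/bin/bintray.json
        user: nfrankel
        key: $BINTRAY_API_KEY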

    Conclusion

    Open Source development on Github requires a different approach than in-house enterprise development. In particular, the standard build pipeline is based on a completely different stack. Getting up to speed with that stack is quite time-consuming because of different defaults and all the documentation that needs to be read, but it’s far from impossible.