Archive

Posts Tagged ‘elastic’
  • Exploring data sets with Kibana


    In this post, I’d like to explore a sample data set using Kibana.

    This requires some data to start with: let’s index some tweets. It’s quite straightforward to achieve by following the explanations in my good friend David’s http://david.pilato.fr/blog/2015/06/01/indexing-twitter-with-logstash-and-elasticsearch/[blog post^] and waiting for some time for the index to fill with data.
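
    For reference, a minimal Logstash pipeline in the spirit of David’s post might look like the sketch below. The credentials and keywords are placeholders, and the exact options may differ from his setup - consider it an assumption-laden starting point rather than his actual configuration.

    input {
      twitter {
        # placeholder credentials - use your own Twitter application keys
        consumer_key       => "<consumer_key>"
        consumer_secret    => "<consumer_secret>"
        oauth_token        => "<access_token>"
        oauth_token_secret => "<access_token_secret>"
        # index tweets matching these keywords
        keywords           => ["elastic", "kibana"]
        # keep the whole tweet, including coordinates and user data
        full_tweet         => true
      }
    }
    output {
      elasticsearch {
        hosts => ["http://localhost:9200"]
        index => "twitter"
      }
    }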

    == Basic metric

    Let’s start with something basic, the number of tweets indexed so far.

    In Kibana, go to menu:Visualize[Metric], then choose the twitter index. For the Aggregation field, choose “Count”; then click on btn:[Save] and name the visualization accordingly, e.g. “Number of tweets”.

    image:basic-metric-create.png[Create a basic metric,351,146] image:basic-metric-display.png[Display the number of tweets,320,202]
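
    Behind the scenes, this metric is nothing more than a document count on the index. As a rough sketch, the same figure can be fetched directly from Elasticsearch by sending a match_all query body to the twitter index’s _count endpoint (shown here as an illustration, not as what Kibana literally sends):

    {
      "query": {
        "match_all": {}
      }
    }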

    == Geo-map

    Another simple visualization is to display the tweets based on their location on a world map.

    In Kibana, go to menu:Visualize[Tile map], then choose the twitter index.

    Select Geo Coordinates for the bucket type and keep the default values: Geohash for Aggregation and coordinates.coordinates for Field.

    image::geo-map-display.png[Localized map of tweets,637,402,align=center]

    == Bucket metric

    For this kind of metric, suppose a business requirement is to display the top 5 users. Unfortunately, as some (most?) business requirements go, this is not deterministic enough: it misses both the time range and the aggregation period. Let’s agree on the time range being a sliding window over the last day, and on the period being an hour.

    In Kibana, go to menu:Visualize[Vertical bar chart], then choose the twitter index. Then:

    • For the Y-Axis, keep Count for the Aggregation field
    • Choose X-Axis for the buckets type:
      • Select Date histogram for the Aggregation field
      • Keep the value @timestamp for the Field field
      • Set the Interval field to Hourly
    • Click on btn:[Add sub-buckets]
    • Choose Split bars for the buckets type:
      • Select Terms for the Sub Aggregation field
      • Select user.screen.name for the Field field
      • Keep the default values for the other fields
    • Don’t forget to click on the btn:[Apply changes] button
    • Click on btn:[Save] and name the visualization accordingly, e.g. “Top 5 users hourly”.

    image:bucket-metric-create.png[Create a bucket metric,234,472] image:bucket-metric-display.png[Display the top 5 users hourly,637,202]

    === Equivalent visualizations

    Other visualizations can be used with the exact same configuration: Area chart and Data table.

    For this particular data set, the output of the Area chart is not as readable, but the Data table offers interesting options.

    From a visualization, click on the bottom right arrow icon to display a table view of the data instead of a graphic.

    image::tabular-metric.png[Alternative tabular metric display,635,200,align=center]

    Visualizations make use of the Elasticsearch public API. From the tabular view, the JSON request can also be displayed by clicking on the btn:[Request] button (oh, surprise…). This way, Kibana can be used as a playground to quickly prototype requests before using them in one’s own applications.
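
    As an illustration only - the exact request Kibana generates depends on its version and on the selected time range - the aggregation behind the “Top 5 users hourly” visualization looks roughly like the following sketch:

    {
      "size": 0,
      "query": {
        "range": {
          "@timestamp": { "gte": "now-1d" }
        }
      },
      "aggs": {
        "per_hour": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "hour"
          },
          "aggs": {
            "top_users": {
              "terms": {
                "field": "user.screen.name",
                "size": 5
              }
            }
          }
        }
      }
    }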

    image::request-metric.png[Executed API request,635,200,align=center]

    === Changing requirements a bit

    The above visualization picks out, for each hour, the 5 users who tweeted the most, and displays them over the last day. That’s the reason why more than 5 users are displayed. But the requirement can be interpreted in another way: take the top 5 users over the course of the last day, and break down their number of tweets by hour.

    To do that, just move the X-Axis bucket below the Split bars bucket. This will change the output accordingly.

    image::another-bucket-metric-display.png[Display the top 5 users over the last day,637,203,align=center]

    === Filtering irrelevant data

    As can be seen in the above histogram, the top users are mostly about recruiting and/or job offers. This is not really what was wanted in the first place. It’s possible to remove this noise by adding a filter: in the Split bars section, click on btn:[Advanced] to display additional parameters and type the desired regex in the Exclude field.

    image::filtered-bucket-metric-create.png[Filter out a bucket metric,342,69,align=center]
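
    Under the hood, the Exclude field maps to the exclude parameter of the underlying terms aggregation. The regex itself depends entirely on the data; the one below is made up purely for illustration:

    {
      "terms": {
        "field": "user.screen.name",
        "size": 5,
        "exclude": ".*[Jj]obs?.*|.*[Rr]ecrut.*"
      }
    }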

    The new visualization is quite different:

    image::filtered-bucket-metric-display.png[Display the top 5 users hourly without any recruitment-related user,637,202,align=center]

    == Putting it all together

    With the above visualizations available and configured, it’s time to put them together on a dedicated dashboard. Go to menu:Dashboard[Add] to list all available visualizations.

    image::add-visualization-dashboard.png[Add visualizations to a dashboard,732,391,align=center]

    It’s as simple as clicking on the desired one, laying it out on the board and resizing it. Rinse and repeat until happy with the result and then click on btn:[Save].

    image::configured-dashboard.png[A configured dashboard,910,429,align=center]

    Icing on the cake: using the btn:[Rectangle] tool on the map visualization automatically adds a filter that restricts all visualizations on the dashboard to the data bounded by the rectangle’s coordinates.
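
    In Elasticsearch terms, such a filter boils down to a geo_bounding_box query on the tweet coordinates. A rough sketch of what gets added to every request - the corner values here are made up:

    {
      "geo_bounding_box": {
        "coordinates.coordinates": {
          "top_left":     { "lat": 48.95, "lon": 2.15 },
          "bottom_right": { "lat": 48.75, "lon": 2.55 }
        }
      }
    }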

    image::filtered-dashboard.png[A filtered dashboard,910,453,align=center]

    That trick is not limited to the map visualization (try playing with other ones) but filtering on location quickly gives insights when exploring data sets.

    == Conclusion

    While this post only scratches the surface of what Kibana has to offer, there are more visualizations available, as well as Timelion, the new and powerful (but sadly under-documented) “time series expression interface”. In any case, even the basic features shown above already provide plenty of options to make sense of one’s data sets.

    Categories: Technical Tags: elasticsearch, elastic, kibana, big data
  • Feedback on Feeding Spring Boot metrics to Elasticsearch

    Logstash logo

    Some weeks ago, I wrote a post detailing how to send JMX metrics from a Spring Boot app to Elasticsearch by developing another Spring Boot app.

    Creating such an app is not always the right idea, but developers are makers - software makers - and developing new apps is quite alluring to them. However, in the overall scheme of things, this means time is spent not only on development, but also on maintenance during the entire lifetime of the app. Before going down the development path, one should thoroughly check whether out-of-the-box alternatives exist.

    Back to JMX metrics: only the straightforward Logstash jmx plugin had been tried before calling it quits, because of an incompatibility with Elasticsearch 5. But an alternative exists: the Logstash http_poller plugin.

    This Logstash input plugin allows you to call an HTTP API, decode its output into events, and send them on their merry way. The idea behind this plugin came from a need to read the Spring Boot metrics endpoint, instead of configuring JMX to monitor my Java application’s memory, GC, etc.

    Jolokia is already in place, offering HTTP access to JMX. The last and only step is to configure the http_poller plugin, which is fairly straightforward. The URL is composed of the standard actuator URL, appended with /jolokia/read/ and the ObjectName of the desired JMX bean. Here’s a sample configuration snippet, with URLs configured for:

    1. Response time for the root page
    2. HTTP 200 status code counter for the root page
    3. Operating system metrics
    4. Garbage collector metrics

    With the help of JConsole, adding more metrics is a no-brainer.

    input {
      http_poller {
        urls => {
          "200.root" =>
            "http://localhost:8080/manage/jolokia/read/org.springframework.metrics:name=status,type=counter,value=200.root"
          "root" =>
            "http://localhost:8080/manage/jolokia/read/org.springframework.metrics:name=response,type=gauge,value=root"
          "OperatingSystem" => "http://localhost:8080/manage/jolokia/read/java.lang:type=OperatingSystem"
          "PS Scavenge" => "http://localhost:8080/manage/jolokia/read/java.lang:type=GarbageCollector,name=PS%20Scavenge"
        }
        request_timeout => 60
        schedule => { every => "10s"}
        codec => "json"
      }
    }
    

    This should output something akin to:

    {
              "request" => {
                "mbean" => "org.springframework.metrics:name=response,type=gauge,value=root",
                 "type" => "read"
              },
           "@timestamp" => 2016-01-01T10:19:45.914Z,
             "@version" => "1",
                "value" => {
                "Value" => 4163.0,
          "LastUpdated" => "2016-12-30T16:29:28+01:00"
                },
            "timestamp" => 1483121985,
               "status" => 200
    }
    

    Beyond the raw output, there are a couple of possible improvements:

    • value and request fields are inner objects, for no added value. Flattening the structure can go a long way toward making writing queries easier.
    • Adding tags depending on the type improves categorization. A possible alternative would be to parse the JMX compound name into dedicated fields with grok or dissect.
    • The @timestamp field can be replaced with the LastUpdated value and interpreted as a date.

    Filters to the rescue:

    {% highlight raw linenos %}
    filter {
      mutate { add_field => { "[mbean][objectName]" => "%{request[mbean]}" }}
      mutate { remove_field => "request" }
    }

    filter {
      if [mbean][objectName] == "java.lang:type=OperatingSystem" {
        dissect {
          mapping => { "mbean[objectName]" => "%{[mbean][prefix]}:type=%{[mbean][type]}" }
        }
        mutate {
          remove_field => "value[ObjectName]"
        }
      } else if [mbean][objectName] == "java.lang:name=PS Scavenge,type=GarbageCollector" {
        dissect {
          mapping => { "mbean[objectName]" => "%{[mbean][prefix]}:name=%{[mbean][name]},type=%{[mbean][type]}" }
        }
        mutate {
          remove_field => "value[ObjectName]"
        }
      } else if [mbean][objectName] =~ "^.*,type=gauge,.*$" or [mbean][objectName] =~ "^.*,type=counter,.*$" {
        date { match => [ "%{value[lastUpdated]}", "ISO8601" ] }
        mutate { replace => { "value" => "%{value[Value]}" }}
        mutate { convert => { "value" => "float" }}
        if [mbean][objectName] =~ "^.*,type=gauge,.*$" {
          dissect {
            mapping => {
              "mbean[objectName]" => "%{[mbean][prefix]}:name=%{[mbean][name]},type=%{[mbean][type]},value=%{[mbean][page]}"
            }
          }
        } else if [mbean][objectName] =~ "^.*,type=counter,.*$" {
          dissect {
            mapping => {
              "mbean[objectName]" => "%{[mbean][prefix]}:name=%{[mbean][name]},type=%{[mbean][type]},value=%{[mbean][status]}.%{[mbean][page]}"
            }
          }
        }
      }
    }
    {% endhighlight %}

    A little explanation might be in order.

    Lines 1-4
    Move the initial request->mbean field to the mbean->objectName field.
    Lines 7-13
    For OS metrics, create mbean nested fields out of the objectName nested field and remove it from the value field.
    Lines 14-20
    For GC metrics, create mbean nested fields out of the objectName nested field using a slightly different pattern and remove it from the value field.
    Lines 21-24
    For gauge or counter metrics, interpret the value->lastUpdated nested field as a date, move the nested value->Value field to the root and interpret as a float value.
    Lines 25-38
    For gauge or counter metrics, create mbean nested fields using a pattern specific for each metric.

    Coupled with the initial configuration, this outputs the following (for the gauge):

    {
        "@timestamp" => 2016-01-01T10:54:15.359Z,
             "mbean" => {
            "prefix" => "org.springframework.metrics",
              "name" => "response",
        "objectName" => "org.springframework.metrics:name=response,type=gauge,value=root",
              "type" => "gauge",
             "value" => "root"
             },
          "@version" => "1",
             "value" => 4163.0,
         "timestamp" => 1483206855,
            "status" => 200,
              "tags" => []
    }
    

    Given the current complexity of the configuration, the next logical step is to split the single snippet into multiple files. Logstash loads them in lexicographical order of their file names, so name the files accordingly:

    • 00-app-input.conf
    • 10-global-filter.conf
    • 20-specific-filter.conf
    • 99-console-output.conf

    Note: gaps in the numbering scheme allow adding intermediary filters in the future without renaming existing files.
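
    As an illustration, and assuming a console output is used during development, the 99-console-output.conf file would contain nothing but the output section, e.g.:

    output {
      stdout {
        # print events in a human-readable form on the console
        codec => rubydebug
      }
    }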

    All in all, everything required is available through configuration, without any coding. Always check whether that’s the case before reinventing the wheel.

    Categories: Technical Tags: elasticsearch, logstash, elastic
  • Structuring data with Logstash

    Logstash old logo

    Given the trend around microservices, it has become mandatory to be able to follow a transaction across multiple microservices. Spring Cloud Sleuth is such a distributed tracing system, fully integrated into the Spring Boot ecosystem. By adding the spring-cloud-starter-sleuth dependency to a project’s POM, the project instantly becomes Sleuth-enabled, and every standard log call automatically adds data such as the spanId and traceId to the usual output.

    2016-11-25 19:05:53.221  INFO [demo-app,b4d33156bc6a49ec,432b43172c958450,false] 25305 ---\n
    [nio-8080-exec-1] ch.frankel.blog.SleuthDemoApplication      : this is an example message
    

    (broken on 2 lines for better readability)

    Now, instead of sending the data to Zipkin, let’s say I need to store it in Elasticsearch. A product is only as good as the way it’s used, and indexing unstructured log messages is not very useful. Logstash configuration allows pre-parsing unstructured data and sending structured data instead.

    Grok

    Grokking data is the usual way to structure data with pattern matching.

    Last week, I wrote about some hints for the configuration. Unfortunately, the hard part is writing the matching pattern itself, and those hints don’t help there. While it might be possible to write a perfect Grok pattern on the first draft, the above log is complicated enough that it’s far from a certainty, and chances are high to stumble upon such a message when starting Logstash with an unfit Grok filter:

    "tags" => [
        [0] "_grokparsefailure"
    ]
    

    However, there’s “an app for that” (sound familiar?). It offers three fields:

    1. The first field accepts one (or more) log line(s)
    2. The second accepts the Grok pattern
    3. The third displays the result of applying the pattern from the second field to the line(s) in the first

    The process is now to match fields one by one, from left to right. The first data field, e.g. 2016-11-25 19:05:53.221, is obviously a timestamp. Among common grok patterns, it looks as if the TIMESTAMP_ISO8601 pattern would be the best fit.

    Enter %{TIMESTAMP_ISO8601:timestamp} into the Pattern field. The result is:

    {
      "timestamp": [
        [
          "2016-11-25 17:05:53.221"
        ]
      ]
    }
    

    The next field to handle looks like the log level. Among the patterns, there’s one LOGLEVEL. The Pattern now becomes %{TIMESTAMP_ISO8601:timestamp} *%{LOGLEVEL:level} and the result:

    {
      "timestamp": [
        [
          "2016-11-25 17:05:53.221"
        ]
      ],
      "level": [
        [
          "INFO"
        ]
      ]
    }
    

    Rinse and repeat until all fields have been structured. Given the initial log line, the final pattern should look something along those lines:

    %{TIMESTAMP_ISO8601:timestamp} *%{LOGLEVEL:level} \[%{DATA:application},%{DATA:traceId},%{DATA:spanId},%{DATA:zipkin}]\n
    %{DATA:pid} --- *\[%{DATA:thread}] %{JAVACLASS:class} *: %{GREEDYDATA:log}
    

    (broken on 2 lines for better readability)
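
    As a sketch of how this pattern would be used in practice, it goes inside a grok filter - shown here with the pattern joined back onto a single line, and assuming the input already delivers each log event as a single message field:

    filter {
      grok {
        match => {
          "message" => "%{TIMESTAMP_ISO8601:timestamp} *%{LOGLEVEL:level} \[%{DATA:application},%{DATA:traceId},%{DATA:spanId},%{DATA:zipkin}] %{DATA:pid} --- *\[%{DATA:thread}] %{JAVACLASS:class} *: %{GREEDYDATA:log}"
        }
      }
    }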

    And the associated result:

    {
            "traceId" => "b4d33156bc6a49ec",
             "zipkin" => "false",
              "level" => "INFO",
                "log" => "this is an example message",
                "pid" => "25305",
             "thread" => "nio-8080-exec-1",
               "tags" => [],
             "spanId" => "432b43172c958450",
               "path" => "/tmp/logstash.log",
         "@timestamp" => 2016-11-26T13:41:07.599Z,
        "application" => "demo-app",
           "@version" => "1",
               "host" => "LSNM33795267A",
              "class" => "ch.frankel.blog.SleuthDemoApplication",
          "timestamp" => "2016-11-25 17:05:53.221"
    }
    

    Dissect

    The Grok filter gets the job done. But it seems to suffer from performance issues, especially if the pattern doesn’t match. An alternative is to use the dissect filter instead, which is based on separators.

    Unfortunately, there’s no app for that - but it’s much easier to write a separator-based filter than a regex-based one. The mapping equivalent to the above is:

    %{timestamp} %{+timestamp} %{level}[%{application},%{traceId},%{spanId},%{zipkin}]\n
    %{pid} %{}[%{thread}] %{class}:%{log}
    

    (broken on 2 lines for better readability)

    This outputs the following:

    {
            "traceId" => "b4d33156bc6a49ec",
             "zipkin" => "false",
                "log" => " this is an example message",
              "level" => "INFO ",
                "pid" => "25305",
             "thread" => "nio-8080-exec-1",
               "tags" => [],
             "spanId" => "432b43172c958450",
               "path" => "/tmp/logstash.log",
         "@timestamp" => 2016-11-26T13:36:47.165Z,
        "application" => "demo-app",
           "@version" => "1",
               "host" => "LSNM33795267A",
              "class" => "ch.frankel.blog.SleuthDemoApplication      ",
          "timestamp" => "2016-11-25 17:05:53.221"
    }
    

    Notice the slight differences: by moving from a regex-based filter to a separator-based one, some strings end up padded with spaces. There are two ways to handle that:

    • change the logging pattern in the application - which might make direct log reading harder
    • strip additional spaces with Logstash

    With the second option, the final filter configuration snippet is:

    filter {
      dissect {
    	mapping => { "message" => ... }
      }
      mutate {
        strip => [ "log", "class" ]
      }
    }
    

    Conclusion

    In order to structure data, the grok filter is powerful and used by many. However, depending on the specific log format to parse, writing the filter expression might be quite a complex task. The dissect filter, based on separators, is an alternative that makes it much easier - at the price of some additional handling. It is also an option to consider in case of performance issues.

    Categories: Development Tags: elk, logstash, elastic, parsing, data
  • Debugging hints for Logstash

    Logstash old logo

    As a Java developer, when you are first shown how to run the JVM in debug mode, attach to it and then set a breakpoint, you really feel like you’ve reached a milestone on your developer journey. Well, at least I did. Now that the world is going full microservice, knowing that trick means less and less every day.

    This week, I was playing with Logstash to see how I could send all of an application’s exceptions to an Elasticsearch instance, so I could display them on a Kibana dashboard for analytics purposes. Of course, nothing showed up in Elasticsearch at first. This post describes what helped me make it work in the end.

    The setup

    Components are the following:

    • The application. Since a lot of exceptions were necessary, I made use of the Java Bullshifier. The only adaptation was to wire in some code to log exceptions to a log file.

      {% highlight java %}
      public class ExceptionHandlerExecutor extends ThreadPoolExecutor {

          private static final Logger LOGGER = LoggerFactory.getLogger(ExceptionHandlerExecutor.class);

          public ExceptionHandlerExecutor(int corePoolSize, int maximumPoolSize,
                                          long keepAliveTime, TimeUnit unit, BlockingQueue<Runnable> workQueue) {
              super(corePoolSize, maximumPoolSize, keepAliveTime, unit, workQueue);
          }

          @Override
          protected void afterExecute(Runnable r, Throwable t) {
              if (r instanceof FutureTask) {
                  FutureTask<?> futureTask = (FutureTask<?>) r;
                  if (futureTask.isDone()) {
                      try {
                          futureTask.get();
                      } catch (InterruptedException | ExecutionException e) {
                          LOGGER.error("Uncaught error", e);
                      }
                  }
              }
          }
      }
      {% endhighlight %}
    • Logstash
    • Elasticsearch
    • Kibana
    The first bump

    Before being launched, Logstash needs to be configured, especially its input and its output. There’s no out-of-the-box input focused on exception stack traces. Those are multi-line messages: only lines starting with a timestamp mark the beginning of a new message. Logstash achieves that with a specific codec including a regex pattern.

      Some people, when confronted with a problem, think "I know, I'll use regular expressions."
      Now they have two problems.
      -- Jamie Zawinski

    Some are better than others at regular expressions, but nobody learned them as their mother tongue. Hence, it's not rare to make mistakes. In that case, one should use one of the many available online regex validators. They are just priceless for understanding why some pattern doesn't match. The relevant Logstash configuration snippet becomes:

    ```javascript
    input {
      file {
        path => "/tmp/logback.log"
        start_position => "beginning"
        codec => multiline {
          pattern => "^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
          negate => true
          what => "previous"
        }
      }
    }
    ```

    The second bump

    Now, data found its way into Elasticsearch, but messages were not in the expected format. In order to analyze where this problem came from, messages can be printed on the console instead of being indexed in Elasticsearch. That's quite easy with the following snippet:

    ```javascript
    output {
      stdout {
        codec => rubydebug
      }
    }
    ```

    With messages printed on the console, it's possible to understand where the issue occurs. In that case, I was able to tweak the `input` configuration (and add the forgotten `negate => true` bit). Finally, I got the expected result.

    Conclusion

    With more and more tools appearing every passing day, the tool belt of the modern developer needs to grow as well. Unfortunately, there's no one-size-fits-all solution: in order to know a tool's every nook and cranny, one needs to use and re-use it, be creative, and search on Google... a lot.
    Categories: Development Tags: elk, logstash, elastic, debugging