Nowadays, most companies use one or another Agile methodology for their software development projects. That makes people involved in software development projects at least aware of agile principles - whether they truly try to follow agile practices or just pay lip service to them for a variety of reasons remains debatable. To avoid any association with tainted practices, I’d rather use the name “Exploratory Development”. As with software development, exploration has a vague feeling of the final target destination, and a more or less detailed understanding on how to get there.footnoteref:[joke, This an HTML joke, because “emphasis on less”] Plus on a map, the plotted path is generally not a straight line.
However, even with the rise of the DevOps movement, the operational side of projects still seems to remain oblivious to what happens on the other side of the curtain. This post aims to provide food for thoughts to think about how true Agile can be applied to Ops.
== The legacy approach
As an example, let’s have an e-commerce application. Sometimes, bad stuff happens and the development team needs to access logs to analyze what happened. In general, due to “security” reasons, developers cannot directly access the system, and needs to ask the operations team to send the log file(s). This too common process results in frustration, time wasting, and contributes to build even taller walls between Dev and Ops. To improve the situation, an option could be to setup a datastore to store the logs into, and a webapp to make them available to developers.
Here are some hypotheses regarding the source architecture components:
- Apache Web server
- Apache Tomcat servlet container, where the e-commerce application is deployed
- Solr server, that provide search and faceting features to the app
In turn, relevant data/logs that needs to be stored include:
- Web server logs
- Application logs proper
- Tomcat technical logs
- JMX data
- Solr logs
Let’s implement those requirements with the Elastic stack. The target infrastructure could look like this:
Defining the architecture is generally not enough. Unless one work for a dream company that empowers employees to improve the situation on their own initiative (please send it my resume), chances are there’s a need for some estimates regarding the setup of this architecture. And that goes double in case it’s done for a customer company. You might push back, stating the evidence:
image::wewillaskfortestimates.jpg[We will ask for estimates and then treat them as deadlines,502,362,align=”center”]
Engineers will probably think that managers are asking them to stick their neck out to produce estimates for no real reasons, and push back. But the later will kindly remind the former to “just” base estimates on assumptions, as if it was a real solution instead of plausible deniability. In other words, an assumption is a way to escape blame for a wrong estimate, and yes, it sounds much closer to contract law than to engineering concepts. That means that if any of the listed assumption is not fulfilled, then it’s acceptable for estimates to be wrong. It also implies estimates are then not considered a deadline anymore - which they never were supposed to be in the first place, if only from a semantical viewpoint.footnoteref:[guesstimate, If an estimate needs to be really taken as an estimate, the compound word guesstimate is available. Aren’t enterprise semantics wonderful?]
Notwithstanding all that, and for the sake of the argument, let’s try to review possible assumptions pertaining to the proposed architecture that might impact implementation time:
- Hardware location: on-premise, cloud-based or a mix of both?
- Underlying operating system(s): Nix, Windows, something else, or a mix?
- Infrastructure virtualization degree: is the infrastructure physical, virtualized, both?
- Co-location of the OS: are there requirements regarding the location of components on physical systems?
- Hardware availability: should there be a need to purchase physical machines?
- Automation readiness: is there any solution already in place to automate infrastructure management? If not, in how many environments will the implementation need to be setup and if more than 2, will replication be handled manually?
- Clustering: is any component clustered? Which one(s)? For the application, is there a session replication solution in place? Which one?
- Infrastructure access: needs to be on-site? Needs to get a security token hardware? From software?
And those are quite basic items regarding hardware only. Other wide areas include software (volume of logs, criticality, hardware/software mis/match, versions, etc.), people (in-house support, etc.), planning (vacation seasons, etc.), and I’m probably forgetting some important ones too. Given the sheer number of available items - and assuming they all have been listed, it stands to reason that at least one assumption would prove wrong, hence making final estimates dead wrong. In that case, playing the estimate game is just another way to provide plausible deniability. A much more useful alternative would be to create a n-dimension matrix of all items, and estimate all possible combinations. But as in software projects, the event space has just too many parameters to do that in an acceptable timeframe.
== Proposal for an alternative
That said, what about a real working alternative that might not be to satisfy dashboard managers but the underlying business? It would start by implementing the most basic requirement, and to add more features until it’s good enough, or enough budget has been spent. Here are some possible steps from the above example:
Foundation setup:: The initial goal of the setup is to enable log access and the most important logs are applications logs. Hence, the first setup is the following: + image::foundation-architecture.png[Foundation architecture,450,294,align=”center”] + More JVM logs:: From this point on, a near-zero effort is to add scraping Tomcat’s log, to help with incident analysis by adding correlation. + image::more-jvm-logs.png[More JVM logs,608,294,align=”center”] + Machine decoupling:: The next logical step is to move the Elasticsearch instance to its own dedicated machine, to add an extra level of modularity to the overall architecture. + image::machine-decoupling.png[Machine decoupling,739,365,align=”center”] + Even more logs:: At this point, additional logs from other components - load balancer, Solr server, etc. can be sent to Elasticsearch to improve issue-solving involving different components. Performance improvement:: Given that Logstash is written in Ruby, there might be some performance issues on running Logstash directly along the component, depending on each machine specific load and performances. Elastic realized it some time ago and now propose better performance via dedicated Beat. Every Logstash instance can be replaced by Filebeats. Not only logs:: With the Jolokia library, it’s possible to expose JMX beans through an HTTP interface. Unfortunately, there are only a few available Beats and none of them handle HTTP. However, Logstash with the http-poller plugin gets the job done. Reliability:: In order to improve reliability, Elasticsearch can be cluster-ized.
The good thing about those steps is that they can implemented in (nearly) any order. This means that after laying out the base foundation - the first step, the stakeholder can decide which on makes sense for its specific context, or to stop because it’s enough regarding the added value.
At this point, estimates still might make sense regarding the first step. But after eliminating most complexity (and its related uncertainty), it feels much more comfortable estimating the setup of an Elastic stack in a specific context.
As stated above, whether Agile principles are implemented in software development projects can be subject to debate. However, my feeling is that they have not reached the Ops sphere yet. That’s a shame, because as in Development, projects can truly benefit from real Agile practices. However, to prevent association with Agile cargo cult, I proposed the use of the term “Exploratory Infrastructure”. This post described a proposal to apply such an exploratory approach to a sample infrastructure project. The main drawback of such approach is that it will cost more, as the path straight; the main benefit is that at every step, the stakeholder can choose to pursue or stop, taking into account the law of diminishing returns.
This week, I was tasked to create a development infrastructure with the following components:
- Subversion for SCM
- Trac for bugtracking
- Hudson for CI
- Sonar for Quality Reporting
- Nexus, a Maven repository
I added 2 more components:
- A database in order to store data from each application in the same datastore
- A LDAP in order to authentify oneself in each application with the same credentials
Since we had no server to begin from, I had to install the infrastructure on my development machine, meaning Windows and 1 Gb RAM. The following is a personal feedback I have from this week.
Subversion means Apache HTTP Server. Since I already installed Apache a few times, installing (and configuring it) one more time went like a breeze. I really love this software.
Likewise, once one’s got the Subversion installer, installing it and configuring the underlying Apache instance can be done in less than 1 hour.
Now comes the fun part. Trac was a choice not made by myself (I even didn’t know this product until this week). This choice was made from Trac’s features. Little did we know that Trac runs on Python.
I (and my company) have no Python skills, so please, don’t make fun of me because of what follows.
I installed Python and then configured Apache to use it. That was the easy part.
First lesson, Trac doesn’t run on the latest Python (v3.2). I learnt that the hard way.
Second lesson, Trac needs a whole stack of component connectors:
- since Trac can display you Subversion repository, it needs to communicate with Subversion. It is done through Python so here come Python svn bindings. And if you installed Python v2.6 which is compatible with Trac, that’s too bad because bindings are only avalaible for Python 2.5
- Trac uses SQLLite (first time I heard of this, I’m beginning to feel like a complete idiot) out-of-the-box to store its data. Since all the other applications can run on MySQL and I happen to know a little about MySQL, I wanted Trac to use MySQL (a logical step). Not surprisingly, Python needs drivers to use MySQL, like in Java. This is done through MySQL Python. Completely unlike in Java, drivers are not put in the classpath of the running application, but installed in the Python distribution. Python 2.6+ is not supported though Trac supports it. Life is fun.
- Trac needs a installation component called setuptools that manages Python eggs (a specific format for Python extensions).
- Trac uses a templating engine. Version 0.10 of Trac needed ClearSilver, version 0.11 needs Genshi. Since I used Trac v0.11 and completely forgot to install Genshi, it complained about… not finding ClearSilver. Thanks for the misleading error message, I lost about 1 hour.
Third lesson: in order to use Eclipse Mylyn (an Eclipse plugin that manages to connect to a bugtracker and displays issues in Eclipse), Trac needs the XML-RPC plugin to be installed. Using the previously installed setuptools, it only needs a command-line instruction.
Fourth lesson, a project is linked to an environment. An environment is created using a command-line tool and then configured in Apache.
Fifth lesson, it is impossible to create sub-projects though a patch exists to modify the code in order to do so.
Hudson comes into two flavors:
- Hudson standalone as a Java Web Start installer
- a war file
The standalone runs nicely even though I can only guess at the underlying complexity. It can be configured to run as a Windows service. I am amazed: this is too good to be true. My deepest respect to the developement team. Finally, I uninstall Hudson (equally as easy) because…
I will have to run 3 Java applications: Hudson, Sonar and Nexus. If each runs in standalone mode, this means 3 JVM running concurrently and I’m bound to my crappy 1Gb RAM. Since each of these applications is delivered in a WAR archive, better have a lightweight server such as Tomcat (or Jetty). Since I’v used Tomcat since the dawn of time and since I see Jetty more like an embedded container, I installed Tomcat (this is a no-op, even for first time users).
Deploying Hudson in Tomcat is easy enough. Just don’t forget to configure Tomcat’s JVM with the
-DHUDSON_HOMEJava property. This tells Hudson to use this folder for storing its files (runs, builds, etc.).
Sonar was deployed easily in Tomcat. I configured it to use MySQL and everything went fine. The only comment I have about Sonar is that it comes in a whooping 45+ Mb archive. I mentioned this to the development team: it will likely be curbed in future versions.
Nexus / Artifactory
Nearly finished I thought and went on to deploy Nexus. Deploying Nexus war in Tomcat went fine. Configuring Nexus went not. I wasn’t able to find a single line of documentation regarding using Nexus not run in standalone mode, even though it is available for download in war format. Having lost too much time with Track, I decided to use Artifactory instead. I already installed this Maven repository on a customer’s site and it gave me complete satisfaction.
Just don’t forget to configure Tomcat’s JVM with the
-artifactory.homeJava property. This tells Artifactory to use this folder for storing its internal files.
Yet, I ran into an error trying to use MySQL instead of JavaDB. Whatever the configuration I used, Artifactory was not able to use the password I provided to acces MySQL: I kept having
Access denied for user 'artifactory'@'localhost' (using password: NO)messages. Having spent a considerable time trying to correct the thing, I went back to using JavaDB. Since Artifactory provides an repository backup functionality out-of-the-box, I can live with it.
Finally, having Artifactory up and running, I kept running into
OutOfMemoryError: PermGenSpaceerrors. Tomcat was using 512Mb (which is a really low value for 3 important applications) but it has no effect on permanent generation space. I tried different
-XX:MaxPermSizeconfigurations without any luck. So in a bold move, I switched from a Sun JVM to a JRockit. Since then, I got no such errors anymore, though I have very high Garbage Collection time. Nothing a little more RAM and correct tuning can cure.
I wanted all my applications to use the same authentication component. I used OpenLDAP a couple of times before but it was always ankward to install and use (I’m no LDAP guru) . I browsed the web a little to find OpenSource and free (as a beer) LDAP servers and stumbled upon OpenDS. Easy to install, easy to configure and last but not least, easy to manage LDAP entities. Everything done through its GUI.
Configuring Apache and Tomcat to use LDAP is done through XML editing, for Artifactory and Hudson, it is done through the webapp. Artifactory even provides a test button to check whether the connection is correct.
All in all, I’m pretty happy with the components installed, save one. Trac does provide many functionalities but it has not its place in our infrastructure since Python is not my friend and it needs to many components I won’t be able to debug if we run across a problem. JIRA would be the bugtracker of choice, but since it is not free, it has no chance. I’ve already used MantisBT, and though it uses a PHP stack (and not Java), I feel more at ease with it. In its latest version, it can even integrate third party PHP Wikis. Any thought about it ?
Posts Tagged ‘infrastructure’