Nowadays, most companies use one or another Agile methodology for their software development projects. That makes people involved in software development projects at least aware of agile principles - whether they truly try to follow agile practices or just pay lip service to them for a variety of reasons remains debatable. To avoid any association with tainted practices, I’d rather use the name “Exploratory Development”. As with software development, exploration has a vague feeling of the final target destination, and a more or less detailed understanding on how to get there.footnoteref:[joke, This an HTML joke, because “emphasis on less”] Plus on a map, the plotted path is generally not a straight line.
However, even with the rise of the DevOps movement, the operational side of projects still seems to remain oblivious to what happens on the other side of the curtain. This post aims to provide food for thoughts to think about how true Agile can be applied to Ops.
== The legacy approach
As an example, let’s have an e-commerce application. Sometimes, bad stuff happens and the development team needs to access logs to analyze what happened. In general, due to “security” reasons, developers cannot directly access the system, and needs to ask the operations team to send the log file(s). This too common process results in frustration, time wasting, and contributes to build even taller walls between Dev and Ops. To improve the situation, an option could be to setup a datastore to store the logs into, and a webapp to make them available to developers.
Here are some hypotheses regarding the source architecture components:
- Apache Web server
- Apache Tomcat servlet container, where the e-commerce application is deployed
- Solr server, that provide search and faceting features to the app
In turn, relevant data/logs that needs to be stored include:
- Web server logs
- Application logs proper
- Tomcat technical logs
- JMX data
- Solr logs
Let’s implement those requirements with the Elastic stack. The target infrastructure could look like this:
Defining the architecture is generally not enough. Unless one work for a dream company that empowers employees to improve the situation on their own initiative (please send it my resume), chances are there’s a need for some estimates regarding the setup of this architecture. And that goes double in case it’s done for a customer company. You might push back, stating the evidence:
image::wewillaskfortestimates.jpg[We will ask for estimates and then treat them as deadlines,502,362,align=”center”]
Engineers will probably think that managers are asking them to stick their neck out to produce estimates for no real reasons, and push back. But the later will kindly remind the former to “just” base estimates on assumptions, as if it was a real solution instead of plausible deniability. In other words, an assumption is a way to escape blame for a wrong estimate, and yes, it sounds much closer to contract law than to engineering concepts. That means that if any of the listed assumption is not fulfilled, then it’s acceptable for estimates to be wrong. It also implies estimates are then not considered a deadline anymore - which they never were supposed to be in the first place, if only from a semantical viewpoint.footnoteref:[guesstimate, If an estimate needs to be really taken as an estimate, the compound word guesstimate is available. Aren’t enterprise semantics wonderful?]
Notwithstanding all that, and for the sake of the argument, let’s try to review possible assumptions pertaining to the proposed architecture that might impact implementation time:
- Hardware location: on-premise, cloud-based or a mix of both?
- Underlying operating system(s): Nix, Windows, something else, or a mix?
- Infrastructure virtualization degree: is the infrastructure physical, virtualized, both?
- Co-location of the OS: are there requirements regarding the location of components on physical systems?
- Hardware availability: should there be a need to purchase physical machines?
- Automation readiness: is there any solution already in place to automate infrastructure management? If not, in how many environments will the implementation need to be setup and if more than 2, will replication be handled manually?
- Clustering: is any component clustered? Which one(s)? For the application, is there a session replication solution in place? Which one?
- Infrastructure access: needs to be on-site? Needs to get a security token hardware? From software?
And those are quite basic items regarding hardware only. Other wide areas include software (volume of logs, criticality, hardware/software mis/match, versions, etc.), people (in-house support, etc.), planning (vacation seasons, etc.), and I’m probably forgetting some important ones too. Given the sheer number of available items - and assuming they all have been listed, it stands to reason that at least one assumption would prove wrong, hence making final estimates dead wrong. In that case, playing the estimate game is just another way to provide plausible deniability. A much more useful alternative would be to create a n-dimension matrix of all items, and estimate all possible combinations. But as in software projects, the event space has just too many parameters to do that in an acceptable timeframe.
== Proposal for an alternative
That said, what about a real working alternative that might not be to satisfy dashboard managers but the underlying business? It would start by implementing the most basic requirement, and to add more features until it’s good enough, or enough budget has been spent. Here are some possible steps from the above example:
Foundation setup:: The initial goal of the setup is to enable log access and the most important logs are applications logs. Hence, the first setup is the following: + image::foundation-architecture.png[Foundation architecture,450,294,align=”center”] + More JVM logs:: From this point on, a near-zero effort is to add scraping Tomcat’s log, to help with incident analysis by adding correlation. + image::more-jvm-logs.png[More JVM logs,608,294,align=”center”] + Machine decoupling:: The next logical step is to move the Elasticsearch instance to its own dedicated machine, to add an extra level of modularity to the overall architecture. + image::machine-decoupling.png[Machine decoupling,739,365,align=”center”] + Even more logs:: At this point, additional logs from other components - load balancer, Solr server, etc. can be sent to Elasticsearch to improve issue-solving involving different components. Performance improvement:: Given that Logstash is written in Ruby, there might be some performance issues on running Logstash directly along the component, depending on each machine specific load and performances. Elastic realized it some time ago and now propose better performance via dedicated Beat. Every Logstash instance can be replaced by Filebeats. Not only logs:: With the Jolokia library, it’s possible to expose JMX beans through an HTTP interface. Unfortunately, there are only a few available Beats and none of them handle HTTP. However, Logstash with the http-poller plugin gets the job done. Reliability:: In order to improve reliability, Elasticsearch can be cluster-ized.
The good thing about those steps is that they can implemented in (nearly) any order. This means that after laying out the base foundation - the first step, the stakeholder can decide which on makes sense for its specific context, or to stop because it’s enough regarding the added value.
At this point, estimates still might make sense regarding the first step. But after eliminating most complexity (and its related uncertainty), it feels much more comfortable estimating the setup of an Elastic stack in a specific context.
As stated above, whether Agile principles are implemented in software development projects can be subject to debate. However, my feeling is that they have not reached the Ops sphere yet. That’s a shame, because as in Development, projects can truly benefit from real Agile practices. However, to prevent association with Agile cargo cult, I proposed the use of the term “Exploratory Infrastructure”. This post described a proposal to apply such an exploratory approach to a sample infrastructure project. The main drawback of such approach is that it will cost more, as the path straight; the main benefit is that at every step, the stakeholder can choose to pursue or stop, taking into account the law of diminishing returns.
Disclaimer: this post touches bits and pieces of finance, management and sociology for which I’m far from qualified. However, I’ve plenty of experience working in companies where they had big effects and I couldn’t resist drawing my own conclusions. I’ll happily listen to realistic solutions.
I’ve been working for more than a decade in the software industry, always as a consultant. Most of my employers were pure consulting companies, with only my latest employer being a software provider. During all these years, my customers were traditional companies in various sectors: banking, insurance, telco, public administration, retail, …
Though my work always revolved around software, in various positions, I couldn’t help notice that it could have been much easier if the context had been different. I’m not talking about personal disputes between two persons, or such trivial things, but about pervasive ideas that originate from very up above - most of the time outside the company, and permeate all hierarchy layers in a organised fashion: the closer to upper management, the stronger its presence.
== On short-term objectives
Like it or not, most companies are on the stock exchange. Such companies have to legally publish their results on a quarterly basis. To simplify, the better the results, the higher the share value. This tends to promote objectives based on each quarter to increase the latter. This leads to short-term decisions to increase share value that have very definite consequences on the long-term.
Interestingly enough, the GAFA seem to have a whole another strategy, as they aim for the long-term. Google invests massively in bio-technology and self-driven cars, Amazon makes no benefits since its inception and Apple has a tradition of not being really interested in giving its money to shareholders. I guess despite those respective strategies, none of them could be viewed as unsuccessful.
== On productivity
Part of our core values is about always doing “better”. This is of course felt in the private sector, but is also ingrained in us as individuals. I ran twice the half-marathon and the second time, I just had to run it faster than the first time. I didn’t think about it, it was just… expected. For vacations also, most of us want to spend better vacations than the year before, even though they might have been the best so far.
If it so in the personal sphere, what is it like in the professional one? Well, I guess anyone having ever worked knows about it. It’s all about defining metrics, measuring them, then finding ways to improve them in order to improve productivity. This is not bad per se, but there are definitely issues. IMHO, the greatest is the term “productivity” itself. I can understand it when applied to simple basic tasks, but I’ve never heard about judge or policeman productivity. People might comment that the productivity of public service cannot be easily evaluated. Perhaps, but what about the productivity of private jobs such as lawyers? Or engineers?
Trying to measure the productivity of developers is based on the the thinking that they have more in common with manual labourers than engineers.
== On the nature of development
Though impossible to measure, different developers definitely have different productivity. For example, when one developer commits a compile error, it impacts the work of all other developers working on the same project. Thus, one developer can have a negative productivity, as it is the case with engineers and lawyers.
In order for one to understand that, one must probably have worked as engineers or with them. Thus, one could in the best of case understood the nature of development work, or in the worst, realized the difference in productivity of different people.
For people who never had a chance to get familiar with engineering work, all engineers are the same. The only difference then would be on the cost of each different one.
== On quality
Continuous improvement goes through measurement. Measuring some metrics and not others have definite incidence on behaviour of professionals, as qualities not measured exist in neither retrospectives nor in annual reviews. Engineers pride themselves on delivering quality software. But what is quality? Regarding software, there are two facets of quality:
- On one hand, external quality can be defined as the gap between implementation and specifications. The highest quality possible is when they are both fully aligned i.e. the software just reflects specifications
- On the other hand, internal quality is defined as the relative cost to add new features. With the highest internal quality, the cost of adding new features is linear. With the lowest internal quality, the cost of adding new features is so high that it’s les expensive to rewrite the software from scratch.
Internal quality is pretty hard to measure (read, I don’t know about any) since it’s defined as a gap - between the current current cost and the “best” cost. To compute such quality, one would have to do 2 softwares… and it’s not something that is likely to happen. The logical conclusion to this inability is that internal quality is at best vaguely talked about - and at worst completely ignored, while measurable metrics such as costs and planning are focused upon. When any of those start to get threatened, then quality becomes the fifth wheel.
== On individual objectives
The agreed upon way to improve - in general and specifically profits, is by defining objectives. Those objectives are agreed by C-levels which in turn dispatch more specifics objectives into their respective departments, and this goes down until the lowest levels of the organisational hierarchy. Management does like that because it acts upon an assumption: in this case, the assumption is that optimising every single place in the organisation will likely improve the organisation as a whole. Interesting, but completely wrong as only the simplest of systems work like this. Systems like human organisations are much more complex, so they have plenty of local optimums: optimising one area probably has negative effect in another.
In one of my previous job, managers/sales persons had no individual incentives. Their bonuses was solely based on the performance of the company as a whole. Some of you might wonder how it could possible. They might even be suspicious that they didn’t go that extra mile to get more money flowing in given there was no incentive. I cannot be the judge of that but what I know is that collaboration was very high and they achieved much in that regard. Besides, there was a very good work atmosphere and much less of the political games seen in traditionally-managed companies.
== On Return Over Investment
As an engineer, I always expected a company and all its employees would try to to maximize the benefits of their investments. My experience has been completely different.
A colleague told me the following story: imagine a new bank coming on the market. For a limited time, they propose for every $20 bill you bring to exchange it for a $50 bill. As a company, what’s the budget you should allocate to such a project? $100, $1,000, 100k? The answer is quite easy: as many as you afford as the return over investment is very high, and probably higher than any other project running in the company.
Most companies official policy is to at least describe the expected benefits for a planned project and even better approximate a ROI. Most of the times, business want to push a project based on more or less imaginary assumptions, and the benefits/ROI paragraph is just a nicely wrapping of those.
In Software Development, many seemingly strange (i.e. stupid) decisions happening are the direct results of the conjunction of two or more of the previous points.
- Outsourcing: those who take decisions to outsource have no clues about development, they think all developers have the same productivity and then why hire a local developer when you can have 3 remote ones for the same price? And of course, development is such a simple process that direct communication has no place in it.
- No attention to quality: as productivity, when a property is hard to measure, it’s better forgotten. If you add to this the fact that project manager have individual objectives on the budget and the planning, it’s no mystery only engineers that pride themselves about their craft care about quality. Besides, lack of quality only shows in the long-term…
- Death-march projects: no understanding of software development plus individual objectives
And this might goes on ad nauseam.
What I noticed however is that successful recent technology companies have completely different strategies. For sure, they fully understand software development but it’s not only that. They definitely have long-term plans. Proofs of that include amazon still making no benefits and Google investing in biotech. I’ve never heard any of them trying to measure productivity, but surely they know about differences in developers. They also care about quality… a lot, as they realize it’s an asset which maintenance costs are directly related to quality. Finally, they are not ruled by finance.
I don’t think it’s all about company politics as in the introducing cartoons, as I see everyday people who are genuinely interested in teamwork. However, I believe management focusing on short-term financial KPIs is a dead-end and companies blindly following those old ways are doomed to fail. I you happen to find different ones, I suggest you stick to them.
When I was a young software programmer, I had to develop features with estimates given by more senior programmers. If more time was required for the task, I had to explain the reasons - and I’d better be convincing about that. After some years, I became the one who had to provide feature estimates, but this did no mean it was easier: if the development team took more time to develop, I had to justify it to my management. Now, after even more years, I have to provide estimates for entire projects, not just fine-grained features.
But in essence, what do we need estimates for? For big enough companies, those are required by internal processes. Whatever the process, it goes somewhat along these lines: in order to start a project, one needs estimate in order to request a budget, then approved (or not) by management.
Guess what? It doesn’t work, it never did and I’m pretty sure it never will.
Note: some organizations are smart enough to realize this and couple estimates with a confidence factor. Too bad this factor has no place in an Excel sheets and that it is lost at some point during data aggregation :-(
Unfortunately, my experience is the following: estimates are nearly always undervalued! Most of the time, this has the following consequences, (that might not be exclusive):
- In general, the first option is to cancel all planned vacations of team members. The second step is to make members work longer hours, soon followed by cutting on week-ends so they work 6/7. While effective in the very short-term, it brings down the team productivity very soon afterwards. People need rest and spirit - and developers are people too!
- After pressure on the development team, it's time to negotiate. In this phase, project management goes to the customer and try to strike a deal to remove parts of the project scope (if it was ever defined...). However, even if the scope is reduced, it generally is not enough to finish on budget and on time.
- The final and last step is the most painful: go back to the powers that be, and ask for more budget. It is painful for the management, because it's acknowledging failure. At this point, forget the initial schedule.
- Meanwhile and afterwards, management will communicate everything is fine and goes according to the plan. Human nature...
In some cases (e.g. a bid), that’s even worse, as the project manager will frequently (always?) complain that those estimates are too high and pressuring you to get lower ones, despite the fact that lowering estimates never lowers workload.
You could question why estimates are always wrong. Well, this is not the point of this post but my favorite answer is that Civil Engineering is around 5,000 years old and civil engineers also rarely get their estimates right. Our profession is barely 50 years old and technology and methodologies keep changing all the time.
I’m not a methodologist, not a Project Manager, not a Scrum Master… only a Software Architect. I don’t know if Agile will save the world; however, I witnessed first-hand every upfront estimate attempt as a failure. I can only play this game of fools for so long, because we’re all doomed to loose by participating in it.
Posts Tagged ‘agile’