Thursday, June 21, 2018

Don't hire a DevOps coach - hire a DevOps change agent

Each day I get between three and five unsolicited emails from staffing firms who have spotted that I have "DevOps" in my LinkedIn profile or Dice/Indeed/etc. resume, and who are trying to fill a "DevOps coach" position.

But when I ask a few questions, I realize that they are really trying to find a team coach, and that what they really want is what I would call a "tool jockey". There are lots of people who have learned some of the tools that are associated with DevOps today - AWS EC2 and S3, or maybe Azure or GCE, Chef or Puppet, Docker, maybe Lambda - and who know a scripting language or two.

That's not a DevOps coach.

A coach is an expert: not someone who is new to this stuff, but someone who has used a range of tools, so that they are more than a one-trick pony. They have also been around since before DevOps - long before - so that they have perspective and remember why DevOps came about. Otherwise, they don't really understand the problem that DevOps is trying to solve.

In a shift to continuous delivery and other DevOps practices, it is absolutely essential to have an experienced person guiding the effort. There are too many ways to get into serious trouble. I have seen things completely collapse under the weight of bad decisions, in the form of unmaintainable and brittle automation.

If you need tool engineers, hire them. But don't call them DevOps coaches. Get a real DevOps coach.

A very important reality is this: Very smart people just out of college who know the latest tools can rapidly create a mountain of unmaintainable code and "paint you into a corner" so that there is no way out.

The choices that are made are very important. Should we use a cloud framework? Should we eliminate our middle tier? Should we use Ruby, Java, or JavaScript for our back end? Should we have a DevOps team? Should we have a QA function? How should Security work with our teams? Should teams manage their own environments and deployments? Should they support their own apps? Should we have project managers? It goes on and on.

A tool jockey will not know where to even start with these questions. An experienced DevOps coach will.

Wednesday, April 18, 2018

Creating a lightweight Java microservice

In my recent post Creating a recommender microservice in Java 9, I lamented that if you build your Java app using Maven, it will include the entirety of all of the projects that your app uses—typically hundreds of Jars—making your microservice upwards of a GB in size—not very “micro”. To solve that problem, one needs a tool that scans the code and removes the unused parts. There actually is a Maven plugin that tries to do that, called shade, but shade does not provide the control that is needed, especially if your app uses reflection somewhere within some library, which most do.
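The reflection problem is easy to see with a small sketch. In the following example (the property name is made up; the default class is a real JDK class so the snippet actually runs), the class to load is only known at runtime, so a static class-dependency scan sees no reference to it and may wrongly strip it from the minimized Jar:

public class ReflectiveLoad {
    public static void main(String[] args) throws Exception {
        // The class name comes from configuration at runtime - a pattern common
        // in JDBC drivers, logging frameworks, and dependency-injection containers.
        // There is no compile-time reference to the loaded class anywhere here,
        // so a minimizing tool cannot see that the class is needed.
        String className = System.getProperty("app.plugin", "java.util.ArrayList");
        Object plugin = Class.forName(className).getDeclaredConstructor().newInstance();
        System.out.println("Loaded " + plugin.getClass().getName());
    }
}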

In this article I am going to show how to solve this for Java 8 and earlier. Java 9 and later support Java modules, including for the Java runtime, and that is a very different build process.

To reduce the Jar file footprint, I created a tool that I call jarcon, for Jar Consolidator. Jarcon uses a library known as the Class Dependency Analyzer (CDA), to perform the actual class dependency analysis: my tool merely wraps that functionality in a set of functions that let us call CDA from a command line, assemble just the needed classes, and write them all to a single Jar file, which we can then deploy.
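For the Jar-writing half, nothing exotic is needed: the JDK's java.util.jar API is sufficient. Here is a simplified sketch of the idea (not jarcon's actual code; the method and parameter names are illustrative): given the class files that the dependency analysis identified as needed, copy each one into a single output Jar with a suitable manifest.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class JarWriterSketch {
    /** Write the retained classes to a single Jar. Keys are entry names
        such as "com/example/Foo.class"; values are the class file paths. */
    public static void writeJar(Map<String, Path> neededClasses,
                                Path outputJar,
                                String version, String title) throws IOException {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.put(Attributes.Name.IMPLEMENTATION_VERSION, version);
        attrs.put(Attributes.Name.IMPLEMENTATION_TITLE, title);
        try (JarOutputStream out = new JarOutputStream(
                Files.newOutputStream(outputJar), manifest)) {
            for (Map.Entry<String, Path> e : neededClasses.entrySet()) {
                out.putNextEntry(new JarEntry(e.getKey()));
                Files.copy(e.getValue(), out);  // copy the class bytes into the Jar
                out.closeEntry();
            }
        }
    }
}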

Note that while Maven is the standard build tool for Java apps, I always wrap my Maven calls in a makefile. I do that because when creating a microservice, I need to call command-line tools such as Docker, Compose, Kubernetes, and many others, and make is the most general-purpose tool on Unix/Linux systems. It is also language agnostic, and many of my projects are multi-language, especially when I use machine learning components.

To call jarcon to consolidate your Jar files, this is the basic syntax (as in the example below, your app's main class serves as the root from which the dependency analysis starts):
java -cp tools-classpath \
    com.cliffberg.jarcon.JarConsolidator \
    [--verbose] \
    your-app-classpath \
    your-main-class \
    output-jar-name \
    jar-manifest-version \
    jar-manifest-name


Here is a sample makefile snippet that calls jarcon to consolidate all of my project's Jar files into a single Jar file, containing only the classes that are actually used:

consolidate:
    java -cp "$(JARCON_ROOT):$(CDA_ROOT)/lib/*" \
        com.cliffberg.jarcon.JarConsolidator \
        --verbose \
        "$(IMAGEBUILDDIR)/$(APP_JAR_NAME):$(IMAGEBUILDDIR)/jars/*" \
        scaledmarkets.recommenders.mahout.UserSimilarityRecommender \
        $(ALL_JARS_NAME) \
        "1.0.0" "Cliff Berg"


The above example produces a 6.5MB Jar file containing 3837 classes. This is in contrast to the 97.7MB collection of 114 Jar files that would be included in the container image if jarcon were not used.

The components of the microservice container image are,
  1. Our application Jars, and the Jars used by our application.
  2. Java runtime.
  3. Base OS.
In the example above, we have compressed #1 from 97.7MB down to 6.5MB, but the Java runtime still consumes many tens of MB. The OS can vary a great deal: if we use, say, CentOS, we are talking about 300MB just for the OS. If instead we use Alpine Linux, then #3 is only about 20MB. That leaves the Java runtime. To shrink that, we need the Java module system, which requires Java 9 or later. Java 9 also requires some different considerations for Maven. I will leave that for a future article.

Saturday, February 24, 2018

A deep learning DevOps pipeline

Many organizations today are using deep learning and other machine learning techniques to analyze customer behavior, recommend products to customers, detect fraud and other patterns, and generally use data to improve their business.
Still more organizations are dabbling with machine learning, but have difficulty moving those experiments into the mainstream of their IT processes: that is, how do they create a machine learning DevOps “pipeline”?

The difficulty is that machine learning programs do not implement business logic as testable units, and so “code coverage” cannot be measured in the normal way: a neural network logically consists of “neurons”—aka “nodes”—instead of lines of functional code. Nor is it possible to deterministically define “correct” outputs for some of the input cases. Worse, creating test data can be laborious, since a proper test data set for a machine learning application is generally about 20% of the size of the data used to train the application—often tens or hundreds of thousands of cases. A final challenge—and this is arguably the worst problem of all—is that neural networks often work fine for “normal” data but can produce very wrong results if the data is slightly off, and identifying these “slightly off” cases can be difficult. To summarize, the problems pertain to,
  1. Measuring coverage.
  2. Generating test data.
  3. Identifying apparently normal test cases that generate incorrect results.
In a paper last September, four researchers (Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana) explained how they solved these problems, in a system they call DeepXplore, by treating test generation as an optimization problem. We have used their techniques to create a DevOps style automated testing pipeline. For the purpose of this article, I will confine the discussion to multilayer—so-called “deep”—neural networks.
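Roughly, and as I understand their approach (this is my paraphrase, not the paper's exact notation), the generator performs gradient ascent on the test input x itself, maximizing a joint objective that simultaneously induces disagreement among a set of networks and activates a previously inactive neuron:

obj_{joint}(x) = \Big( \sum_{k \ne j} F_k(x)[c] - \lambda_1 F_j(x)[c] \Big) + \lambda_2 f_n(x)

Here the F_k are the networks under test, c is the class on which they currently agree, F_j is the network being pushed to deviate, f_n(x) is the activation of a chosen inactive neuron, and lambda_1 and lambda_2 are weighting constants. The maximizing inputs are precisely the "slightly off" cases described above.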

[Figure: the automated testing pipeline for neural-network models, steps 1 through 5]

Developing a deep neural network is an iterative process: one must first analyze the data to decide how to organize it: for example, whether it should be clustered or categorized, whether additional attributes should be derived, and whether temporal or spatial correlations need to be accounted for (e.g., through convolution). After that, the neural network architecture must be chosen: how many layers and what layer sizes, as well as the way in which the network learns (e.g., through back propagation). This process takes quite a long time, with each iteration measured by the success rate of the network when tested against an independent data set.

Thus, developing a neural network model is an exploratory process. However, many real neural network applications entail multiple networks, and often functional software code as well, making the networks part of a larger application system, with an entire team working together. In addition, before a change to a network can be tested, the modified network must be trained, and training is extremely compute intensive, often requiring hours of GPU time. These factors make an automated testing “pipeline” useful.

In the figure above, step 1 represents the process of network adjustment: think of it as a developer or data scientist making a change to the network to see if the change improves performance. Most likely the developer will test the change locally, using a local GPU; but that test is probably only cursory, possibly using a somewhat simplified network architecture, or trained with a small dataset. To do a real test, a much higher capacity run must be performed. The change must also be tested in the context of the entire application. To enable these things, the programmer saves their changes to a shared model definition repository, which is then accessed by an automated testing process.

What is unique about our process is that in step 2 we use the algorithm developed by Pei et al. to generate additional test cases. These additional test cases are derived by analyzing the network and finding cases that produce anomalous results. We then execute the test suite (step 3), including the additional test cases, and we record which neural nodes were activated, producing a coverage metric (step 4). Finally, in step 5 we execute the test cases again (probably in parallel with the first execution), but using an independent network implementation: our expectation is that the results will be extremely similar, within a certain percentage, say one percent. Any differences are examined manually to determine which is the “correct” result, which is then recorded and used as the expected result the next time around. This process avoids having to manually inspect and label a large number of test cases.
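To make steps 4 and 5 concrete, here is a minimal sketch of the two checks in plain Java. The class and method names are hypothetical (in our pipeline, activations are recorded inside the model-execution harness), but the logic is the same: coverage is the fraction of neurons whose activation exceeded a threshold for at least one test case, and the differential check flags any output that differs from the independent implementation by more than the tolerance.

import java.util.List;

public class PipelineChecks {

    /** Step 4: neuron coverage - the fraction of neurons whose activation
        exceeded the threshold for at least one test case. Each element of
        activationsPerTest holds one test case's activations, one per neuron. */
    public static double neuronCoverage(List<double[]> activationsPerTest,
                                        double threshold) {
        int numNeurons = activationsPerTest.get(0).length;
        boolean[] covered = new boolean[numNeurons];
        for (double[] activations : activationsPerTest)
            for (int n = 0; n < numNeurons; n++)
                if (activations[n] > threshold) covered[n] = true;
        int count = 0;
        for (boolean c : covered) if (c) count++;
        return (double) count / numNeurons;
    }

    /** Step 5: differential check - true if every output of the primary
        implementation is within the relative tolerance (e.g., 0.01 for
        one percent) of the independent implementation's output. */
    public static boolean withinTolerance(double[] primary,
                                          double[] independent,
                                          double tolerance) {
        for (int i = 0; i < primary.length; i++) {
            double scale = Math.max(Math.abs(primary[i]), Math.abs(independent[i]));
            if (Math.abs(primary[i] - independent[i]) > tolerance * scale)
                return false;  // flag this case for manual inspection
        }
        return true;
    }
}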

We are still in the early days of using this process, but we have found that it dramatically improves test case management and test case completeness assessment, and reduces the turnaround time for testing model changes.

Tuesday, February 6, 2018

Low code and DevOps

So-called low-code and no-code frameworks enable non-programmers to create business applications without having to know programming languages such as Java, Python, and C Sharp. People sometimes ask me if this means that DevOps is not relevant.

The amount of code is not the issue. The issues are,
  1. Do your apps interact?
  2. Do your apps share data stores?
  3. Do your apps change frequently?
  4. Do you have multiple app teams?
  5. Do you have severe lead time pressure?
  6. Do you have very high pressure to ensure that things work?
  7. Do you have large scale usage?
If any of these are true, then you begin to have a need for DevOps, and the more that are true, the greater the case for DevOps.

Whether your code is created via a drag-and-drop GUI or through careful hand coding of Java, C Sharp, or some other language does not matter. Low-code platforms provide a runtime platform, which can be a SaaS service or an on-premises server, and so it is little different from a PaaS arrangement that is either in a cloud or on premises.

To illustrate, consider a low-code application, developed using a low-code tool such as Appian. Two of the Appian app types are site and Web API. Suppose that we create one of each: a site, and a separate Web API which the site uses. If we assume that our user base is a few thousand users who work 9-5, five days a week, then we do not need 24x7 availability, so we can use weekends to perform upgrades; nor do we need to worry about scale, because handling a few thousand users is easy without a sophisticated scaling architecture. So far, it sounds like we do not need DevOps.

Now consider what happens when we add a few more apps, and a few more teams. Suppose some of those other apps use our Web API, and suppose our app uses some of their Web APIs. Also suppose that we need a new feature in our app, and it requires that one of the other Web APIs add a new method. Now suppose that we want to be able to turn around new features in two week sprints, so that every two weeks we have a deployable new release, which we will deploy on the weekend. Do we need DevOps?

We do, if we want to stay sane. In particular, we will want to have,
  • Continuous automated unit level integration testing for each team.
  • Automated regression tests against all of the Web APIs.
  • The ability of each team to perform automated integration tests that verify inter-dependent API changes and app changes.
  • The ability to stand up full stack test environments on demand, so that the above tests can be run whenever needed, without having to wait for an environment.
This is starting to sound a lot like DevOps - and it is. At this point, we are well on our way to fully automated continuous delivery pipelines: we are doing DevOps, and the fact that code is created by drag-and-drop tools does not matter one bit, except that it means that our developers are more productive and hence we probably have an even higher rate of feature development - making the case for DevOps even greater.

What if non-programmers are creating their own apps? In that case, the question is, are they impacting each other? For example, are they modifying database schemas, or data structures in NoSQL stores? If so, then you will be in serious trouble if you do not have DevOps practices in place. Are those non-programmers writing Web APIs? If so, then you have the same considerations.

It is an integration question: if you have lots of things, and they need to work together, and you cannot afford to take your time about it, then you need DevOps.