Saturday, January 21, 2017

Inserting DevOps Into a Not-Very-Agile Organization

This is an account of a recent experience at a very large company that has a mix of traditional (waterfall) IT projects, some partly Agile projects, and a tiny bit of DevOps here and there. Tricia Ratliff and I were asked to take a new project and help the project's manager to “do Agile right”, meaning use DevOps and anything else that makes sense.

It worked. In two months, a team that had almost no Agile experience and no knowledge of any automation tools was able to:
  1. Learn Acceptance Test-Driven Development (ATDD) and the associated tools (Cucumber, Selenium).
  2. Learn to use virtualization and Linux (Docker) containers, both on their laptops and in our data center.
  3. Learn to use OpenShift—Red Hat's enhanced version of the Kubernetes container orchestration framework.
  4. Learn a NoSQL database (Cassandra; the project has since switched to MongoDB), having had no prior experience with NoSQL databases.
  5. Get productive and start delivering features in two-week Agile iterations, using a fully “left-shifted” testing process in which developers get all integration tests to pass before they merge their code into the main development branch.
All team members received all training: for example, analysts learned the same tools that testers and developers learned, and everyone received access to all repositories and servers (TeamForge, Jenkins, OpenShift, and our container image registry). This turned out to be very important, because it established a relationship between the analysts, testers, and developers very early: the developers helped the analysts and testers set up and start using the tools, including git, GitEye, and Eclipse. That relationship is what makes things work so well on this team.

The Benefits We Have Seen

One of the benefits we have seen is that the team is not dependent on anyone or anything outside our team except for our git server (TeamForge). For example, our OpenShift cluster and Jenkins server were down for two weeks a while back, but the team was able to continue working and delivering completed stories, because they were able to perform integration tests locally on their laptops.

Another benefit that we have seen is that work is going faster and faster, as the team becomes more proficient in using the tools and refines its process. It is almost impossible to compare the productivity of different Agile teams, but I personally estimate that this team is at least two times as productive as the other non-DevOps “Agile” team that I am working with at the same customer location—even though this team's application is more difficult to test because it has a UI, whereas the other team's application does not. (I actually think that the DevOps team might be three or four times as productive as other teams at this client.)

What makes the team so productive is that the team is able to work in a “red-green cycle” (see here, and here), whereby they code, test, code, test, code, and test until the tests pass, and then move on to the next Agile story. The red-green cycle is known to the Test-Driven Development (TDD) community, but we use it for integration testing, locally (on our laptops). It is made possible by the use of the ATDD process, in which automated tests are written before the associated application code. (See here for a comparison of TDD and ATDD.) I will note that the team's application is a Web application, with automated tests written for each bit of UI functionality before the associated application code is written.

Figure 1: The red-green cycle of test-first development.
This is important: the ability of developers to use a red-green cycle is a game changer, in terms of productivity, code quality, and code maintainability; and we are doing this for integration testing—shifting it “left” into a red-green cycle on the developer's laptop.
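
To make the red-green cycle concrete, here is a minimal sketch of what one of the team's Cucumber-backed Selenium tests might look like. The scenario, page URL, element ids, and class names are illustrative only, and the sketch assumes a recent cucumber-java and Selenium WebDriver rather than the project's actual code. The point is that the test exists, and fails, before the application code that satisfies it is written.

```java
// Hypothetical step definitions for a Gherkin scenario such as:
//
//   Scenario: Analyst searches for an existing case
//     Given I am on the case search page
//     When I search for case number "12345"
//     Then I see the case detail page for case "12345"
//
// The page URL, element ids, and case numbers are illustrative only.
package example.steps;

import io.cucumber.java.After;
import io.cucumber.java.en.Given;
import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

import static org.junit.Assert.assertTrue;

public class CaseSearchSteps {

    // Defaults to the developer's laptop; CI overrides this property to point
    // at the system instance deployed to the OpenShift cluster.
    private final String baseUrl =
            System.getProperty("app.baseUrl", "http://localhost:8080");

    private WebDriver driver;

    @Given("I am on the case search page")
    public void openSearchPage() {
        driver = new ChromeDriver();
        driver.get(baseUrl + "/cases/search");
    }

    @When("I search for case number {string}")
    public void searchForCase(String caseNumber) {
        driver.findElement(By.id("caseNumber")).sendKeys(caseNumber);
        driver.findElement(By.id("searchButton")).click();
    }

    @Then("I see the case detail page for case {string}")
    public void verifyCaseDetail(String caseNumber) {
        assertTrue(driver.getPageSource().contains("Case " + caseNumber));
    }

    @After
    public void quitBrowser() {
        if (driver != null) {
            driver.quit();
        }
    }
}
```

A developer runs the scenario locally, watches it fail (red), writes just enough application code to make it pass (green), and then moves on to the next scenario.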

How We Did It

The shift to proper Agile and DevOps is not a small adjustment or substitution. It is a very large set of interdependent changes, involving new methods, new mindsets, and new tools. More on this later.

It is important to emphasize that there is no single way to do DevOps. By its nature, all of this is contextual. It is not possible to define a “standard process” for how teams should provision, develop/test, and deploy. However, it is possible to devise common patterns that can be re-used, as long as a team feels free to adjust the pattern. If you want a learning organization (and a learning organization is essential for effective Agile and DevOps), then teams need the freedom to experiment and to tailor their process.

We did not use “baby steps” in our adoption of DevOps for this project. Rather, we undertook to change everything from the outset: ours was a greenfield project, and we did not want to entrench legacy practices. We used an Agile coach (Tricia) and a DevOps coach (me) to explain the new techniques to the team and make a compelling case for each engineering practice, but nothing was mandated. The coaches facilitated discussions on the various topics, and the team developed its own processes as a group.

By far the most impactful thing we did was to establish a complete end-to-end testing strategy from the outset. We included the entire team and all extended team members in developing this strategy. The team consisted of all of the developers and testers (including the technical lead and test lead), as well as the team's manager. The extended team members whom we included were the team's application architect, an expert from the organization's test automation group, and testing managers representing integration testing and user acceptance testing. (We would have liked to include someone from the infrastructure side, but at that time it was not clear who that should be.) Devising the testing strategy began with a four-hour session at a whiteboard. The output of that session was a table of testing categories, with columns as follows: (1) category of test, (2) who writes the tests, (3) where and when the tests will be run, and (4) how we plan to measure sufficiency or “coverage” for the tests. (The format of this planning artifact was based on prior work of mine and others, which is described in this four-part article.)

We did not try to fill in all of the cells in our testing strategy table in our first session, but having the rows defined provided the foundation for our DevOps “pipeline”. Some of the testing categories that we included in our table were (a) unit tests, (b) behavioral integration tests, (c) end-to-end use case level tests, (d) failure recovery tests, (e) performance and stress tests, (f) security tests and scans, (g) browser compatibility tests, and (h) exploratory tests. Our “shift left” testing philosophy dictated that we run every kind of test as early as possible (on our laptops if that makes sense), rather than waiting for it to be run only downstream by a Jenkins job. In practice, we run the unit tests, behavioral integration tests, use case level tests, and basic security scans on our laptops before committing code to the main branch. Committing to the main branch triggers a Jenkins “continuous integration” (CI) job that re-runs the unit tests, builds a container image, deploys a system instance (service/pod) to our OpenShift cluster, and then re-runs the behavioral and end-to-end tests against that system instance. Our Jenkins CI job therefore verifies that the system, as deployed, behaves the same as the system that we tested on our laptops, because we use the same OpenShift template for deploying a test instance as we use for deploying a production instance. We plan to add Jenkins jobs for non-functional testing such as performance testing, deep security scanning, and automated penetration testing. Each of these jobs will begin by deploying a system instance to test and will end by destroying that instance, so no system instance is re-used across multiple tests, and we can perform any of these tests in parallel.
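
To show how the same behavioral tests can run both on a laptop and from the Jenkins CI job, here is a minimal sketch of a JUnit runner for the Cucumber feature files. It assumes the cucumber-junit library and a Maven-style project layout; the package names and paths are illustrative, not the project's actual layout. A developer runs this class from the IDE or the local build, and the CI job runs exactly the same class, pointing it (via a system property such as the hypothetical app.baseUrl used above) at the system instance it just deployed to OpenShift.

```java
// Hypothetical JUnit runner that executes the Cucumber feature files.
// The same runner is used on developer laptops and in the Jenkins CI job;
// only the target URL (supplied as a system property) differs.
package example;

import io.cucumber.junit.Cucumber;
import io.cucumber.junit.CucumberOptions;
import org.junit.runner.RunWith;

@RunWith(Cucumber.class)
@CucumberOptions(
        features = "src/test/resources/features",  // the analysts' and testers' .feature files
        glue = "example.steps",                     // step definitions such as the sketch shown earlier
        plugin = {"pretty", "html:target/cucumber-report.html"}
)
public class BehavioralIntegrationTest {
}
```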

The issue of running tests locally is an important one. Too often DevOps is depicted or described as a sequence of tests run by a server such as Jenkins. However, that model re-creates batch processing, whereby developers submit jobs and wait. Shifting left is about avoiding the waiting: it is about creating a red-green cycle for each developer. Thus, real DevOps is actually about shortening that sequence of Jenkins jobs—perhaps even eliminating it. In an ideal DevOps process, there would be no Jenkins jobs: those jobs are a necessary evil, and they exist only because some tests are impractical to perform on one's laptop.

We spent two months preparing for our first Agile development iteration. During the two-month startup period we requested tools, defined our work process, met with business and technical stakeholders, and received training. Most meetings involved the entire team, although there were also many one-on-one behind-the-scenes meetings with various stakeholders to talk through concerns. We arranged for an all-day, all-team, hands-on OpenShift training from Red Hat, and we also arranged for a four-hour hands-on Cucumber and Selenium training session from an internal test automation group. Both of those training sessions were essential, and we saw an immense jump in the team's overall understanding after each one.

However, it was not all smooth.

Obstacles We Overcame

We encountered many institutional obstacles. Our project manager was committed to using a DevOps approach, and he was key because he was always ready to escalate an issue and get blockers removed. One of the myths in the Agile community is that project managers are not needed in Agile projects; yet in my experience a project manager is extremely important when the setting is a large IT organization with a lot of centralization, because a project manager has indisputable authority within the organization, and a person with authority is needed to advocate effectively for the team.

One mistake we made is that we did not make it clear to the Testing resources unit that the test programmers would need to arrive at the same time as our developers. The late arrival of the test programmers was a significant problem during the initial iterations, because when they arrived they were not up to speed on how we were doing things or on the application stories that the team and Product Owner had collaboratively developed. The problem was exacerbated by the fact that the new arrivals could not participate in the actual work until they had access to the team's source code repo, but per organization policy they could not be granted that access until they had taken corporate git training, and no git training was scheduled until the next month. Fortunately the project manager was eventually able to escalate the issue and get the test programmers access, but during iteration one the application developers wrote most of the automated tests because the test programmers still could not reach the repo.

Bureaucratic obstacles like this were the norm, so it was extremely important that the team coaches and the project manager stay on top of what was in the way each day and escalate issues to the appropriate manager, explaining that in an Agile project such as ours a two-week delay was an entire iteration, and so obstacles had to be removed in hours or days, not weeks. Some of the functional areas with which we had extended conversations were application architecture, systems engineering, data architecture, and the infrastructure engineering team, because not only did each of these functions have authority over different aspects of our project pipeline, but we also needed their help to connect our application into the enterprise infrastructure.

For example, I recall explaining to the assigned infrastructure engineer that we did not need “an environment”, nor did we need him to install an application server for us: we were going to receive access to a cluster that was being created, into which we would dynamically create containers built from base images that contain all of the components we need (such as an application server). It took three conversations before he grasped that. I still have trouble explaining to people in the organization that this team does not have any “environments”, because we use dynamic environment provisioning via an OpenShift cluster. This has presented communication challenges and policy confusion, because much of the organization's procedures and governance rules are phrased in terms of control of environment types, such as the “integration test environment” and the “user acceptance environment”. For us, however, an environment does not exist until we perform a deployment, and after we run the tests we destroy the environment to free up the resources.

Some of the tools that we needed were not available right away, because various enterprise groups were busy setting them up or testing them for security, so team members had to download them at home to try them out. These external groups included (1) the new OpenShift cluster team; (2) the Infrastructure Engineering team, which was creating the JBoss base image that we were to use; (3) the team that manages the internal software library repository; and (4) the team that was setting up the container image registry (Artifactory) that we were to use. Once they understood what we were trying to do, all of these groups did their best to get us what we needed, but it took time, since these things were being done for the first time.

Another significant challenge was that the organization requires its software developers to work on its “production” network, using “production” laptops. That practice makes things very difficult for software engineers, who need to download new tools on a frequent basis to try them out, because one is not allowed to download software from the Internet into the production network. (I have advocated for having a less secure sandbox network that all developers can access via a remote desktop.) It also turned out that the laptops, which run Windows, have a browser security tool called Bromium that uses virtualization, and it was preventing our team from running virtual machines on their laptops, something that is required in order to run Linux containers under Windows. We had to have Bromium technical support spend several weeks at our site, coming up with a configuration that would allow our team to run virtual machines. Diagnosing that problem and arranging for the solution delayed the team's learning about Linux containers and OpenShift by almost two months. (Also, using Windows laptops to build software that will be deployed in Linux containers makes absolutely no sense, and has the side effect that developers do not become familiar with the target environment, Linux.)

DevOps Is Not a Small Change

I mentioned earlier that the shift to proper Agile and DevOps is not a small adjustment or substitution. This is interesting, because we asked for some changes to the IT process controls in order to enable us to do things the way we wanted to do them. From the point of view of the controls group, our process was almost the same as the standard process, but that is only because the IT control view of a software development process is so disconnected from reality: our process is vastly different from how other Agile teams at this client work. Some of the differences are:
  1. We created an integrated automated testing strategy without handoffs: other Agile projects at this customer have a sequential testing process, with a different team responsible for each stage.
  2. We use what is called a “build once, deploy many” approach, whereby we build a deployable image and store that, and that is deployed automatically to successive test environments for automated testing—many times per day.
  3. Developer laptops are their “DEV” environment—we do not have a shared “DEV” environment as other Agile teams at this client do. Thus, each developer tests with their own full stack, including their own test database instance, eliminating interference between testing by developers.
  4. We replaced manual integration testing with an industry standard “continuous integration” (CI) automated testing approach. This is unusual for this client.
  5. Analysts and testers write automated test specs using the well-known Cucumber tool; they do not run the tests themselves. The analysts focus mainly on the logical functions of the application, and the testers focus on nuances such as error cases and data requirements, and encode their understanding in the Cucumber test specs (feature files). This is widely known in the industry as Acceptance Test-Driven Development (ATDD).
  6. Test programmers and developers write automated tests that implement the test specs.
  7. Team members are allowed to switch roles, and one of our analysts has written test code. The only restriction that we have is that if you have written the test code for a story, someone else must write the application code for that story. This ensures that for a story to pass its tests, two different people must have the same understanding of the story's requirements.
  8. Our test lead supervises this process to make sure that this rule is followed, and that tests are written in a timely manner. Our Scrum Master makes sure that test coding and app coding tasks are maintained on our physical planning wall, as well as in our Agile tool (VersionOne).
  9. As a result of the above, a developer never waits for a tester: tests are written and developers run the tests—not the testers. Thus, a developer codes a story, runs tests, fixes the defects, re-runs the tests, and so on until there are no defects. In other words, we use a red/green cycle.
  10. The organization's “User Acceptance Testing” (UAT) team is a testing partner—not a handoff. They provided two members of our team, and those testers write our end-to-end use case level tests, which get run alongside all of our other tests.
  11. UAT adds its tests to the project repo, and so the developers can run the UAT tests locally on their laptops and in their CI environment—there is no delay.
  12. The development team writes its own CI build scripts. Other projects at this client typically receive their Jenkins build scripts from an integration team.
  13. The development team writes the production deployment OpenShift templates and scripts, tests those, and uses those to deploy to each test environment.
  14. All functional tests are run in each environment type (laptop, CI).
  15. Developers run security scans and code quality scans locally before they check in their code, and review the scan results produced by the CI build as an integrated task for the development of each story.
  16. All of our CI test environments are created from scratch for each test run, using the OpenShift template that the team coded. Thus, CI tests are always “clean”.
  17. The CI test database is cleared and data is loaded from scratch prior to each behavioral test case (see the sketch after this list). Thus, there is never an issue with data left over from a prior test, or with a tester waiting for someone to be done testing a database table. (Eventually we plan to integrate the database into our OpenShift pod configuration, so that a fresh database will be deployed each time we deploy the application for a test run, but we are waiting for an approved container image for our database.)
  18. We don't have a DBA on our team, because we are using a NoSQL database, which has no schema. We work with an analyst who maintains a logical data model, but the model is maintained during development—not ahead of time. Our CI tests are all regression tests, so any breaks caused by changes to the data model are caught immediately.
  19. We have a very minimal need for “defect management”, because most tests are automated, and so most test results are visible in a dashboard in our Jenkins server. A developer does not check in a story’s code until it passes all of the behavioral tests locally for that story, and all affected tests are still passing. However, our exploratory testing is manual (by definition), and we record issues found during those test sessions. Exploratory testing is for the purpose of discovering things that the test designers did not anticipate, as well as for assessing the overall usability of the application.
  20. We don’t accept any story as done unless it has zero known defects. Thus, our build images that are marked as deployable typically have zero known defects.
  21. Everyone on the team (analysts, test programmers, developers, coaches, Scrum Master, PM) has write access to all of the team's tools and repositories (git repos, images, Jenkins project, OpenShift project, VersionOne project, Sharepoint project), and everyone received training in all of those tools. However, we feel the process is still secure, because it leverages the built-in change history that these tools provide.
  22. Our Product Owner reviews test scenarios as written in the “Cucumber” test specs, to ensure that they meet the intent of the associated Agile story's acceptance criteria. To do this, the Product Owner accesses test specs from the git source code control system. There are no spreadsheets—all artifacts are “executable”. To learn how to do this, the Product Owner—who is a manager who works in a business area—attended git training.
  23. Our iterations are two weeks: automated testing makes a three-week iteration unnecessary (three weeks is the norm for other Agile teams at this client). Yet our team seems to produce significantly more completed work per iteration than other teams.
  24. We only need minimal data-item-level tests, since our acceptance-test-driven process, for which we measure coverage, actually covers data items.
  25. We revise governance artifacts during each iteration, so that they are always up to date.
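
As promised in item 17 above, here is a minimal sketch of a per-scenario database reset, written as a Cucumber @Before hook. It assumes the MongoDB Java sync driver (the project moved from Cassandra to MongoDB); the database name, collection, and seed document are hypothetical placeholders for the team's real test data loading.

```java
// Hypothetical Cucumber hook that resets the test database before every
// behavioral scenario, so no scenario depends on data left over from a
// previous one. All names and documents below are illustrative.
package example.steps;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import io.cucumber.java.Before;
import org.bson.Document;

public class DatabaseReset {

    // Points at the developer's local database by default; CI overrides it to
    // point at the database instance used by the deployed system under test.
    private static final String MONGO_URI =
            System.getProperty("test.mongoUri", "mongodb://localhost:27017");

    @Before
    public void resetTestData() {
        try (MongoClient client = MongoClients.create(MONGO_URI)) {
            MongoDatabase db = client.getDatabase("caseapp_test");

            // Drop whatever the previous scenario left behind ...
            db.getCollection("cases").drop();

            // ... and load a known starting state for this scenario.
            db.getCollection("cases").insertOne(
                    new Document("caseNumber", "12345").append("status", "OPEN"));
        }
    }
}
```

Because every scenario starts from the same known state, test results do not depend on the order in which scenarios run, whether on a laptop or in the CI job.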
These are not small changes, and when you put them all together, the combined change is huge; yet we were able to implement all of this in a very short period of time, starting with a team that had no experience with any of these techniques. It is also important to note that we did not receive “exceptions” for our processes: we have official approval for everything we did. This shows that it can be done, and that adopting these approaches does not have to be gradual; but doing it requires a commitment from the project's management, as well as the insertion of people who know how these techniques work.

2 comments:

  1. In the second benefits paragraph, you first speak of a "code, test..." cycle, but the rest of the paragraph describes a "test, code..." cycle.

    1. Hi Huet - I was speaking loosely, but thanks for pointing it out! - Cliff
