Saturday, February 24, 2018

A deep learning DevOps pipeline

Many organizations today are using deep learning and other machine learning techniques to analyze customer behavior, recommend products to customers, detect fraud and other patterns, and generally use data to improve their business.
Still more organizations are dabbling with machine learning, but have difficulty moving those experiments into the mainstream of their IT processes: that is, how do they create a machine learning DevOps “pipeline”?

The difficulty is that machine learning programs do not implement business logic as testable units, so “code coverage” cannot be measured in the normal way: a neural network logically consists of “neurons”—aka “nodes”—rather than lines of functional code. Nor is it possible to deterministically define “correct” outputs for some of the input cases. Worse, creating test data can be laborious, since a proper test data set for a machine learning application is typically about 20% of the size of the training data, which often amounts to tens or hundreds of thousands of cases. A final challenge—and this is arguably the worst problem of all—is that neural networks often work fine for “normal” data but can produce very wrong results when the data is slightly off, and identifying these “slightly off” cases is difficult. To summarize, the problems are:
  1. Measuring coverage.
  2. Generating test data.
  3. Identifying apparently normal test cases that generate incorrect results.
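To make the first of these problems concrete, the sketch below (Python with NumPy; illustrative rather than our production code) shows one way to define coverage over neurons instead of lines of code: a neuron counts as covered if at least one test input drives its activation above a threshold. The function name, the threshold value, and the assumption that activations arrive as one array per layer are choices made for the example.

    import numpy as np

    def neuron_coverage(layer_activations, threshold=0.25):
        """Fraction of neurons whose activation exceeds `threshold` for at
        least one test input.

        layer_activations: list of arrays, one per layer, each shaped
        (num_test_inputs, num_neurons_in_layer).
        """
        covered = 0
        total = 0
        for acts in layer_activations:
            # A neuron is "covered" if any test input pushes it past the threshold.
            covered += int(np.sum(np.any(acts > threshold, axis=0)))
            total += acts.shape[1]
        return covered / total

    # Example: two layers, three test inputs.
    layer1 = np.array([[0.9, 0.0, 0.1],
                       [0.2, 0.0, 0.3],
                       [0.1, 0.0, 0.6]])
    layer2 = np.array([[0.0, 0.8],
                       [0.0, 0.1],
                       [0.0, 0.4]])
    print(neuron_coverage([layer1, layer2]))  # 3 of 5 neurons exceed 0.25 -> 0.6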
In a paper published last September (the “DeepXplore” paper), four researchers (Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana) explained how they addressed these problems by treating test case generation as an optimization problem. We have used their techniques to create a DevOps-style automated testing pipeline. For the purpose of this article, I will confine the discussion to multilayer—so-called “deep”—neural networks.


Developing a deep neural network is an iterative process. One must first analyze the data to decide how to organize it: for example, whether it should be clustered or categorized, whether additional attributes should be derived, and whether temporal or spatial correlations need to be accounted for (e.g., through convolution). After that, the network architecture must be chosen: how many layers, the size of each layer, and the way in which the network learns (e.g., through back propagation). This process takes quite a long time, with each iteration judged by the success rate of the network when tested against an independent data set.
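To give a sense of what these choices look like in practice, here is a minimal sketch of a small multilayer network in Keras; the number of layers, the layer sizes, the activation functions, and the optimizer are placeholder values, and iterating over exactly these choices is where the time goes. The random toy data stands in for a real training set and an independent validation set.

    import numpy as np
    import tensorflow as tf

    # Illustrative architecture: two hidden layers of 64 units each.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),                      # 32 input features
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),  # 10 output classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Toy data in place of the real training and held-out validation sets.
    x_train = np.random.rand(1000, 32).astype("float32")
    y_train = np.random.randint(0, 10, size=1000)
    x_val = np.random.rand(200, 32).astype("float32")
    y_val = np.random.randint(0, 10, size=200)

    # Each architecture iteration is judged by accuracy on the independent set.
    model.fit(x_train, y_train, epochs=5, validation_data=(x_val, y_val))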

Thus, developing a neural network model is an exploratory process. However, many real neural network applications entail multiple networks, and often conventional functional code as well, so the networks become part of a larger application system built by an entire team. In addition, before a change to a network can be tested, the modified network must be trained, and training is extremely compute-intensive, often requiring hours of GPU time. These factors make an automated testing “pipeline” useful.

In the figure above, step 1 represents the process of network adjustment: think of it as a developer or data scientist making a change to the network to see if the change improves performance. Most likely the developer will test the change locally, using a local GPU; but that test is probably only cursory, perhaps using a somewhat simplified network architecture or training with a small dataset. A real test requires a much higher-capacity run, and the change must also be tested in the context of the entire application. To enable this, the programmer saves their changes to a shared model definition repository, which is then accessed by an automated testing process.
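The automated side of step 1 might be driven by something as simple as the following sketch, in which a job pulls the latest model definitions from the shared repository before launching the later stages. The repository URL, the working directory, and the commented-out stage functions are hypothetical placeholders, not part of any particular CI tool or of the paper.

    import subprocess

    MODEL_REPO = "git@example.com:team/model-definitions.git"  # placeholder URL
    WORKDIR = "/tmp/model-definitions"

    def fetch_latest_definitions():
        """Clone the shared model definition repository, or update it if present."""
        subprocess.run(["git", "clone", "--depth", "1", MODEL_REPO, WORKDIR],
                       check=False)  # ignore the error if the clone already exists
        subprocess.run(["git", "-C", WORKDIR, "pull"], check=True)

    def run_pipeline():
        fetch_latest_definitions()
        # Placeholders for the real stages: full-capacity training, coverage-guided
        # test case generation, and the differential test run (steps 2-5 below).
        # train_full(WORKDIR)
        # extra_cases = generate_test_cases(WORKDIR)
        # run_test_suite(WORKDIR, extra_cases)

    if __name__ == "__main__":
        run_pipeline()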

What is unique about our process is that in step 2 we use the algorithm developed by Pei et al. to generate additional test cases. These additional cases are derived by analyzing the network and finding inputs that produce anomalous results. We then execute the test suite, including the additional test cases (step 3), and record which neural nodes were activated, producing a coverage metric (step 4). Finally, in step 5 we execute the test cases again (probably in parallel with the first execution), but using an independent implementation of the network. Our expectation is that the results will agree within a small tolerance, say one percent; any differences are examined manually to determine which is the “correct” result, which is then recorded and used as the expected output in subsequent runs. This process avoids having to manually inspect and label a large number of test cases.
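Step 5 is essentially differential testing: run the same inputs through two independently built implementations of the network and flag any case where the outputs differ by more than the tolerance (the one percent mentioned above). Below is a minimal sketch, assuming each implementation exposes a predict function that returns an array of outputs per test case; the function names and the tolerance are illustrative, not taken from the paper.

    import numpy as np

    def differential_test(test_inputs, predict_a, predict_b, rel_tol=0.01):
        """Return indices of test cases on which two independent implementations
        of the same network disagree by more than `rel_tol` (relative difference).
        Flagged cases are queued for manual review; the agreed result is then
        recorded as the expected output for subsequent runs.
        """
        out_a = predict_a(test_inputs)   # shape: (num_cases, num_outputs)
        out_b = predict_b(test_inputs)
        denom = np.maximum(np.abs(out_a), np.abs(out_b))
        denom[denom == 0] = 1.0          # guard against division by zero
        rel_diff = np.abs(out_a - out_b) / denom
        return np.where(np.any(rel_diff > rel_tol, axis=1))[0]

    # Usage sketch: predict_a and predict_b might wrap, say, a TensorFlow model
    # and an independent re-implementation of the same architecture.
    # flagged = differential_test(x_test, model_a.predict, model_b_predict)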

We are still in the early days of using this process, but we have found that it dramatically improves test case management and our ability to assess test case completeness, and it reduces the turnaround time for testing model changes.
