Saturday, December 16, 2017

Creating a recommender microservice in Java 9

Even if you don't know what a recommender is, you have no doubt used one: if you have bought products online, most sites now recommend purchases based on your prior purchases, and if you use Netflix, it recommends movies and TV shows that you might like. Recommenders are very important for businesses because they stimulate additional sales. They also help consumers by suggesting things the consumer is likely to want. Most people view that positively, in contrast to unrelated ads that show up while they are viewing content.

Traditional recommenders

Traditional recommenders are based on statistical algorithms. These fall into two categories: (1) user similarity: algorithms that compare you to other users and guess that you might like what those other users like, and (2) item similarity: algorithms that compare catalog items and guess that if you liked one item, you might like other similar items. In the first type of algorithm, the task is to compare you to other customers, usually based on your purchase history: customers are clustered or categorized. In the second, purchase items are compared and categorized. Combining these two techniques is especially powerful: it turns out that similar users can be used to recommend an array of similar items. There are more sophisticated approaches as well, including “singular value decomposition” (SVD), which reduces the dimensionality of the item space by mathematically grouping items that are dependent on each other. There are also neural network algorithms. In this article I will focus on a user similarity approach, since what I most want to demonstrate is the microservice-oriented implementation of a recommender. (In a future article, I will show how this approach can be applied to a neural-network-based recommender.)

Traditional recommenders use a purely statistical approach. For example, under an item similarity approach, a customer who purchases an item with categories “book”, “adventure”, and “historical” would be assumed to be potentially interested in other books that are also categorized as “adventure” and “historical”. In practice, matches are graded based on a similarity metric, which is actually a distance in the multi-dimensional category space: those items that are the shortest “distance” from a purchase item are deemed to be most similar.

For a user similarity approach, customers are profiled based on the categories of their purchases, and matched up with other customers who are most similar. The purchases of those other similar customers are then recommended to the current customer. Again, similarity is measured using “distance” in category space.
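To make “distance in category space” concrete, here is a toy sketch (my own illustration, not part of any library): each user or item is represented as a vector of weights, one per category, and the Euclidean distance between two vectors measures how dissimilar they are.
// Toy illustration only: one weight per category (e.g. "book", "adventure", "historical").
static double distance(double[] a, double[] b) {
    double sumOfSquares = 0.0;
    for (int i = 0; i < a.length; i++) {
        double diff = a[i] - b[i];
        sumOfSquares += diff * diff;
    }
    return Math.sqrt(sumOfSquares);  // smaller distance = more similar
}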

The challenge of building a traditional recommender therefore reduces to three aspects: (1) finding or creating a statistical library that can categorize items or users and measure metrics like “distance” between them, (2) finding a data processing system that can handle the volume of users and catalog items, and (3) tying those things together. In the open source world, the tools most commonly used for these tasks are the Lucene, Mahout, and Solr packages from Apache.

The basic user similarity recommendation algorithm

I am focusing on user similarity because it is really the first step for a recommender: once you figure out the kinds of things someone might like, item similarity is then applicable. Item similarity alone is kind of limited because it has no way of expanding someone’s exposure to other kinds of items that they might like.

The basic algorithm for a user similarity recommender is as follows:
Neighborhood analysis:
Given user U,
For every other user Uother,
1. Compute a similarity S between U and Uother.
2. Retain the top users, ranked by similarity, as a neighborhood N.

Then,
User similarity, by item, scoped to the neighborhood:
For every item I that a user in N has a preference for, but for which U has no preference yet,
1. For every user Un in N that has a preference for I, compute the similarity S between U and Un.
2. Incorporate Un’s preference for I, weighted by S, into a running average.
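In plain Java, the item-scoring step amounts to something like the sketch below. The helper methods similarity(), itemsPreferredBy(), hasPreference() and preference() are hypothetical; Mahout provides the real machinery, as we will see shortly.
// Sketch only: score each candidate item as a similarity-weighted running average.
Map<Long, Double> weightedSums = new HashMap<>();   // java.util.Map, java.util.HashMap
Map<Long, Double> weightTotals = new HashMap<>();
for (long un : neighborhood) {                       // the top users by similarity to U
    double s = similarity(u, un);
    for (long item : itemsPreferredBy(un)) {
        if (hasPreference(u, item)) continue;        // skip items U already has a preference for
        weightedSums.merge(item, s * preference(un, item), Double::sum);
        weightTotals.merge(item, s, Double::sum);
    }
}
// Estimated preference for each item = sum(S * preference) / sum(S)
weightedSums.replaceAll((item, sum) -> sum / weightTotals.get(item));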

Below we will see what this looks like with Mahout’s APIs.

Search versus another real-time algorithm

Many recommenders (most?) today rely on a two-step process, whereby a batch job (for example, using Mahout or Spark) pre-computes a set of similarities, and then indexes that via a search engine. The search engine is then used as the real time algorithm to fetch recommended (similar) items.

I am not going to use that approach here. The reason is that this article is really preparing for the next article on the recommender topic, which will use a neural network based algorithm, such that a pre-trained neural network is used to make real time recommendations. That is a much more powerful approach, and in this article I am laying the groundwork for that.

Why a microservice?

A microservice is a lightweight Web service designed to be massively scalable and easy to maintain. The microservice approach is highly applicable to a recommender:
  1. Multiple recommenders can be tried and compared, being selected merely by their URL.
  2. The approach can handle large volumes of requests, because microservices are stateless and so can be replicated, leveraging container technology or serverless technology.
  3. The recommenders can be fine-tuned frequently, being re-deployed each time with minimal complexity.
In the interest of scalability, it is desirable to keep the footprint of a microservice small. Also, many organizations have in-house expertise in Java, and so if Java can be used, that widens the scope of who can work on the recommender.

The design

There are many technology stack choices for a Java Web application. The most popular is Spring Boot, and it is an excellent framework. However, I will use SparkJava (not to be confused with Apache Spark) because it is extremely lightweight, and because it has a wonderfully understandable API. Note that Spring Boot has things that SparkJava does not, such as a persistence framework, but for our machine learning microservice the mathematical framework we will be using (Mahout, discussed below) has persistence, so that's covered. We are also going to address scaling in a way that suits the unique needs of a recommender, which must perform very heavy offline data processing, so we would not be using Spring Boot's scaling mechanisms anyway.

To give you an idea of how simple it is to use SparkJava to create a Web service, this is all I had to add (the route registration goes in the main method):
import com.google.gson.Gson;
import spark.Request;
import spark.Response;
import spark.ResponseTransformer;

// In the main method: register the route and start listening on port 8080.
spark.Spark.port(8080);
spark.Spark.get("/recommend", "application/json", (Request request, Response response) -> {
    ...handler code...
    // return any POJO message object
    return new MyResponseObject(...results...);
}, new JsonTransformer());

// Serializes whatever POJO the handler returns into JSON:
static class JsonTransformer implements ResponseTransformer {
    private Gson gson = new Gson();
    public String render(Object responseObject) {
        return gson.toJson(responseObject);
    }
}
No weird XML jazz, no creating some funky directory structure: just call a few intuitive methods.

In order to build a recommender, one also has to decide how to perform the statistical calculations. Again, there are many choices, including writing your own. I chose the Apache Mahout framework because it is rich and powerful. The downside is that its documentation is fragmented and incomplete: if you use Mahout, expect to have to dig around to find things (for example, the API docs for MySQLJDBCDataModel are not with the other API docs), and expect to have to look at the framework's source code on GitHub. Most (but not all) of the APIs can be found here, but the API docs also do not tell you much: they are full of the notoriously unhelpful kind of programmer comment such as "getValue() - gets the value". Then again, it is open source, so it cannot be expected to be as well documented as, say, AWS's APIs.

I also chose MySQL, because it is simple to set up and many people are familiar with it, and because Mahout has a driver for MySQL, so use of something like Hibernate is not necessary. (Mahout supports other database types as well, including some NoSQL databases.)

Creating a recommender with Mahout

Creating a recommender with Mahout is actually pretty simple. Consider the code below.
// Pearson correlation measures how users' preference values vary together.
UserSimilarity similarity = new PearsonCorrelationSimilarity(this.model);
// The neighborhood is every user whose similarity to the target user exceeds 0.1.
UserNeighborhood neighborhood =
    new ThresholdUserNeighborhood(0.1, similarity, this.model);
UserBasedRecommender recommender =
    new GenericUserBasedRecommender(model, neighborhood, similarity);
// Recommend up to 2 items for the user whose ID is 2.
List<RecommendedItem> recommendations = recommender.recommend(2, 2);
This instantiates a recommender given a model, and implements the algorithm shown earlier. More on the model in a moment. Right now, notice the choice of similarity algorithm: PearsonCorrelationSimilarity. That algorithm measures how strongly each pair of users' item preferences are correlated (essentially a normalized covariance). Some common alternatives are cosine similarity and Euclidean distance similarity: these are geometric approaches based on the idea that similar users (or items) will be "close together" in preference space. The range of similarity algorithms supported by Mahout can be found here:
http://apache.github.io/mahout/0.10.1/docs/mahout-mr/org/apache/mahout/cf/taste/impl/similarity/package-frame.html
Note also the use of ThresholdUserNeighborhood. This selects, as the neighborhood, all users whose similarity to the given user exceeds a threshold; an alternative neighborhood algorithm is NearestNUserNeighborhood, which selects a specified number of the most similar users.
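Swapping in the nearest-N variant is a one-line change (the neighborhood size of 10 below is an arbitrary choice):
// Take the 10 users most similar to the target user, however similar they happen to be.
UserNeighborhood neighborhood =
    new NearestNUserNeighborhood(10, similarity, this.model);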

Back to the model: To create a model, you need to prepare your reference data: the recommender does statistical analysis on this data to compare users, based on their preferences. For example, I created a model as follows:
File csvFile = new File("TestBasic.csv");
PrintWriter pw = new PrintWriter(csvFile);
// Each row is { user-id, item-id, preference }.
Object[][] data = {
    {1,100,3.5},
    ...
    {10,101,2.8}
};
printData(pw, data);  // helper that writes each row as a CSV line
pw.close();
this.model = new org.apache.mahout.cf.taste.impl.model.file.FileDataModel(csvFile);
Note that above, I used a file-based model, whereas I will use MySQL for the full example.
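For the MySQL variant, constructing the model looks roughly like the sketch below. The table and column names here are my own invention; the real code would read its connection settings from configuration.
import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;
import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

// Point Mahout at a table of (user-id, item-id, preference, timestamp) rows.
MysqlDataSource dataSource = new MysqlDataSource();
dataSource.setServerName("localhost");
dataSource.setDatabaseName("recommender");
dataSource.setUser("recommender");
dataSource.setPassword("secret");
DataModel model = new MySQLJDBCDataModel(
    dataSource, "taste_preferences", "user_id", "item_id", "preference", "timestamp");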

The data rows consist of values for [ user-id, item-id, preference ]. The first two fields are obvious; “preference” is a float value from 1 through 5 that indicates the user’s preference for the item, where 1 is the lowest preference.

To use a recommender, you have to first prepare it by allowing it to analyze the data. Thus, the general pattern for a recommender has two parts: a data analysis part, and a usage part. The analysis is usually done offline on a recurring basis, for example nightly or weekly. That part is not a microservice: it is a heavy-duty data processing program, typically running on a cluster computing platform such as Hadoop. In the sample code above, only the following line should be in the microservice:
List<RecommendedItem> recommendations = recommender.recommend(2, 2);
All of the lines that precede it perform the preparatory analysis: these would not be called as a microservice.

In the sample microservice that I show here, the preparatory analysis steps are performed when the microservice starts.
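Putting the two parts together, the skeleton of the service looks roughly like this (a sketch with error handling omitted; the userId and howMany query parameters are names I invented for illustration, and imports are as in the earlier snippets):
// At startup: build the model and recommender once (the preparatory analysis).
DataModel model = ...;  // FileDataModel or MySQLJDBCDataModel, as above
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
UserBasedRecommender recommender =
    new GenericUserBasedRecommender(model, neighborhood, similarity);

// Per request: only the recommend() call runs in the handler.
spark.Spark.get("/recommend", "application/json", (request, response) -> {
    long userId = Long.parseLong(request.queryParams("userId"));
    int howMany = Integer.parseInt(request.queryParams("howMany"));
    return recommender.recommend(userId, howMany);  // List<RecommendedItem>, rendered as JSON
}, new JsonTransformer());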

Persisting the analyzed model—or not

For a trivial recommender, you can simply perform the analysis calculations in the main method of your application, and then publish a web service in that main method that can then use the trained model. However, that is not a very scalable approach: each deployed instance of your recommender would need to go through the data analysis.

To avoid that, you can move the data analysis into a separate program (the “Data Preparation” block in the architecture shown earlier), and persist the trained model to a distributed file system such as Hadoop’s HDFS. Each microservice instance can then simply load the trained model at startup.

Mahout refers to the ability to persist a trained model as a persistence strategy. Unfortunately, at present the SVDRecommender is the only recommender in Mahout that has implemented a persistence strategy. An SVD recommender is an important class of recommender, based on a mathematical technique for identifying redundant degrees of freedom in the data and collapsing them out, so that one ends up with a more compact model. This is highly applicable for product recommenders when the product catalog is large. The mathematics for performing SVD are time intensive, and that is why a persistence strategy was implemented for the SVDRecommender. The others need one too, however.
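For completeness, the SVD persistence strategy is used roughly like this (a sketch; the factorizer parameters and file path are arbitrary placeholders):
// Persist the factorization so newly launched instances can load it rather than re-compute it.
PersistenceStrategy persistence =
    new FilePersistenceStrategy(new File("/shared/svd-factorization.bin"));
// ALSWRFactorizer arguments: number of latent features, lambda, number of iterations.
Factorizer factorizer = new ALSWRFactorizer(model, 10, 0.05, 20);
Recommender svdRecommender = new SVDRecommender(model, factorizer, persistence);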

Mahout also has an API called Refreshable: All of the Mahout Recommender classes (package org.apache.mahout.cf.taste.impl.recommender) implement the Refreshable interface, providing them with the refresh(Collection<Refreshable>) method. It is intended for updating models on the fly. It could in theory be used for injecting pre-analyzed matrices, but the current implementations do not support that: they re-compute the entire model based on updated source data. Thus, each newly launched container will have to go through a model data prep computation.

It’s not that bad: the model data prep calculations don’t take very long (SVD computation is the exception, which is why it got a special persistence implementation). So all we need to do is call a recommender’s refresh() method with a null argument: it will purge all derived objects, as well as the cache maintained by the MySQLJDBCDataModel class, and then lazily re-load data as needed to re-compute them. A newly launched container simply computes the derived objects from scratch.
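A refresh therefore amounts to a single call, which could even be exposed as its own route (a sketch; the /refresh path is my own choice, and in production it would need to be protected):
spark.Spark.get("/refresh", (request, response) -> {
    // Null means "nothing has been refreshed yet": the recommender purges its
    // derived objects and lazily re-computes them from the source data.
    recommender.refresh(null);
    return "ok";
});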

Packaging as a container

To package the microservice as a container image, you must create a Dockerfile. (See here.)

If you create the container image based on CentOS 7 and include JDK 8, the resulting image is 776MB. That's not a very "micro" microservice. Much of the size comes from the CentOS 7 Linux distro that we added to the container image. If you build from Alpine Linux and include only JRE 8, the resulting image is 468MB. The remaining size is due to two things: (1) all of the 113 third-party Java jars that Maven thinks need to be added to the image, and (2) the size of the JRE.

It is a certainty that the application does not actually need all 113 jars. Maven determines what is needed based on the dependencies declared in pom files. However, the number of jars actually needed is usually much smaller: your code usually calls only a small percentage of the methods of each dependent project, and many projects ship multiple jar files.

What we need is a Java method traverser that can remove uncalled Java methods from the class files and then package all that as a single jar. I do not know of such a tool, however. The tool would also have to allow one to manually add method signatures for those that are known to be called by reflection. Or, if the tool works at runtime instead of through static analysis, it could gather that information automatically. Test tools such as Cobertura instrument the class files and track which code gets used at runtime: such a tool could easily track which methods never get called and then strip those from the class files, including those in the third-party JARs. I wish someone would write a tool like that (I don't have time to do so): we could have Java applications that are 20MB instead of 200MB.

I should mention that Java 9 introduces a very useful new feature for minimizing the footprint of a deployed Java application: the module system (formerly known as “Jigsaw”). The module system makes it possible to deploy only the pieces of the JDK runtime that are needed by an application, greatly reducing the deployed footprint. I did not use the module feature for this demonstration because it would not have made much of a difference: it might have saved 10Mb in Java standard modules, but none of the required external JARs are packaged as modules at this point, so our footprint would still be essentially what it is without modules.
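For what it's worth, if the application and its dependencies were modularized, the module descriptor would look something like this (illustrative only; the module name below is hypothetical):
// module-info.java (hypothetical: the Mahout and SparkJava jars are not yet modules)
module com.example.recommender {
    requires java.sql;      // JDBC, used by the MySQL data model
    requires java.logging;
}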

Launching the container

In the sample code I launch MySQL and the recommender container with the Docker Compose tool, using these two Compose files:
test-bdd/docker-compose-mysql.yml
docker-compose.yml

I used Compose because it is a great tool for development. Normally you would launch the container using an orchestration system such as Kubernetes or native cloud infrastructure that provides elastic scaling. I will not go into that here because it is beyond the scope of this article, which is really about creating a recommender.

Full code

The full code for this sample microservice can be found in github at https://github.com/ScaledMarkets/recommender-tfidf