Creating an on-line recommender system with Apache Mahout

Recently we’ve been implementing a recommender system for Yap.TV: you can see it in action after installing the app and going to the “Just for you” tab. We’re using Apache Mahout as the base for doing recommendations. Mahout is a “scalable machine learning library” and contains both local and distributed implementations of user- and item-based recommenders using collaborative filtering algorithms.

For now we’ll focus on the local, single-machine implementation. It should work well if you have up to tens of millions of preference values. Above that, you should probably consider the Hadoop-based implementation, as the data simply won’t fit into memory.

Writing a basic recommender with Mahout is quite simple. As Mahout is very configurable, there are usually several implementations to choose from; I’ll just describe what I think are “good starting points”.

Basics

First you need a file with the input data. The format is quite simple: comma-separated lines, each either a (user id, item id) pair or a (user id, item id, preference value) triple. This expresses what you already know: which users like which items, and optionally how much (e.g. on a 1-5 scale). The ids must be integers; the preference value is treated as a float.
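To make the format concrete, here is a small sketch that parses such lines in plain Scala. It only illustrates the shape of the data and is not how Mahout reads the file; `PreferenceLines`, `Preference` and `parse` are names I made up for the example.

```scala
object PreferenceLines {
  // One parsed line: user id, item id, and an optional preference value.
  final case class Preference(userId: Long, itemId: Long, value: Option[Float])

  // Parses "user,item" or "user,item,value"; returns None for malformed lines.
  def parse(line: String): Option[Preference] =
    line.split(',').map(_.trim) match {
      case Array(u, i) =>
        for {
          user <- u.toLongOption
          item <- i.toLongOption
        } yield Preference(user, item, None)
      case Array(u, i, v) =>
        for {
          user  <- u.toLongOption
          item  <- i.toLongOption
          value <- v.toFloatOption
        } yield Preference(user, item, Some(value))
      case _ => None
    }
}
```

A pair such as `1,10` parses to a preference without a value (the unary case), while a triple such as `2,11,4.5` carries the rating as a float.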

Let’s first create a user-based recommender: that is, a recommender which, when asked for recommendations for user A, first looks up users “similar” to A, and then tries to find the best items which these similar users have rated but A hasn’t. To do that, we need to create 4 components:

  • data model: this will use the file
  • user similarity: a measure which, given two users, returns a number representing how similar they are
  • neighborhood: for finding the neighborhood of a given user
  • recommender: which puts these pieces together to produce recommendations
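To show how the four pieces fit together, here is a toy, dependency-free sketch. It is not Mahout's implementation: the similarity is a simple Jaccard (Tanimoto) coefficient, the data model is just an in-memory map, and all names (`ToyUserBasedRecommender` and its methods) are made up for illustration.

```scala
object ToyUserBasedRecommender {
  // Data model: for each user id, the set of item ids the user likes (unary data).
  type DataModel = Map[Long, Set[Long]]

  // Orders (id, score) pairs by score descending, then by id ascending.
  private def byScoreDesc(x: (Long, Double), y: (Long, Double)): Boolean =
    x._2 > y._2 || (x._2 == y._2 && x._1 < y._1)

  // User similarity: Jaccard coefficient of the two users' item sets.
  def similarity(model: DataModel, a: Long, b: Long): Double = {
    val itemsA = model.getOrElse(a, Set.empty[Long])
    val itemsB = model.getOrElse(b, Set.empty[Long])
    val union = (itemsA union itemsB).size
    if (union == 0) 0.0 else (itemsA intersect itemsB).size.toDouble / union
  }

  // Neighborhood: up to n users most similar to the given one, with their similarity.
  def neighborhood(model: DataModel, user: Long, n: Int): Seq[(Long, Double)] =
    model.keys.toSeq
      .filter(_ != user)
      .map(other => other -> similarity(model, user, other))
      .filter(_._2 > 0.0)
      .sortWith(byScoreDesc)
      .take(n)

  // Recommender: scores items liked by neighbors but not yet by the user,
  // summing the similarities of the neighbors who like each item.
  def recommend(model: DataModel, user: Long, n: Int, howMany: Int): Seq[(Long, Double)] = {
    val known = model.getOrElse(user, Set.empty[Long])
    val scores = for {
      (neighbor, sim) <- neighborhood(model, user, n)
      item <- model(neighbor).toSeq if !known(item)
    } yield item -> sim
    scores.groupBy(_._1).toSeq
      .map { case (item, hits) => item -> hits.map(_._2).sum }
      .sortWith(byScoreDesc)
      .take(howMany)
  }
}
```

The Mahout classes used below play the same roles, only with much better similarity measures and data structures.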

For unary input data (where we either know that a user likes an item, or know nothing about it), a good starting point is:

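import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity
val dataModel = new FileDataModel(file)
val userSimilarity = new LogLikelihoodSimilarity(dataModel)
val neighborhood = new NearestNUserNeighborhood(25, userSimilarity, dataModel)
val recommender = new GenericBooleanPrefUserBasedRecommender(dataModel, neighborhood, userSimilarity)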

If we have preference values (triples in the input data):

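import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity
val dataModel = new FileDataModel(file)
val userSimilarity = new PearsonCorrelationSimilarity(dataModel)
val neighborhood = new NearestNUserNeighborhood(25, userSimilarity, dataModel)
val recommender = new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity)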
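For some intuition about what a Pearson-based user similarity measures, here is a plain-Scala sketch of the textbook formula, computed over the items both users have rated. It is an illustration only and not Mahout's code; `PearsonSketch` and `pearson` are hypothetical names.

```scala
object PearsonSketch {
  // Pearson correlation of two users' ratings over their co-rated items.
  // Returns None when it is undefined (fewer than 2 common items, or no variance).
  def pearson(a: Map[Long, Double], b: Map[Long, Double]): Option[Double] = {
    val common = (a.keySet intersect b.keySet).toSeq
    if (common.size < 2) None
    else {
      val xs = common.map(a)
      val ys = common.map(b)
      val meanX = xs.sum / xs.size
      val meanY = ys.sum / ys.size
      val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
      val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
      if (varX == 0.0 || varY == 0.0) None
      else Some(cov / math.sqrt(varX * varY))
    }
  }
}
```

Two users whose ratings rise and fall together get a value close to 1, users with opposite tastes get a value close to -1, and users with too little overlap get no similarity at all.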

Now we are ready to get some recommendations; this is as simple as:

import scala.collection.JavaConverters._

// Gets 10 recommendations
val result = recommender.recommend(userId, 10)

// We get back a Java list of (item, estimated preference value) entries,
// sorted from the highest score; asScala lets us iterate it the Scala way
result.asScala.foreach(r => println(r.getItemID() + ": " + r.getValue()))

On-line

What about the on-line aspect? The above will work great for existing users, but what about new users who register in the service? We certainly want to provide some reasonable recommendations for them as well. Creating a recommender instance is expensive (it certainly takes longer than a “normal” network request), so we can’t just create a new recommender each time.

Luckily, Mahout makes it possible to add temporary users to a data model. The general setup is then:

  • periodically re-create the whole recommender using current data (e.g. each day or each hour – depending on how long it takes)
  • when doing a recommendation, check if the user exists in the system
  • if yes, do the recommendation as always
  • if no, create a temporary user, fill in the preferences, and do the recommendation

The first part (periodically re-creating the recommender) may actually be quite tricky if you are limited on memory: when creating the new recommender, you need to hold two copies of the data in memory (to still be able to serve requests from the old one). But as that doesn’t really have anything to do with recommendations, I won’t go into details here.
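Just to give a rough idea of the swap itself, one possible approach (an assumption on my side, not something Mahout provides) is to keep the current instance in an atomic reference and replace it only once the new one is fully built. `Reloadable` is a hypothetical helper; `build` stands for whatever code creates your recommender.

```scala
import java.util.concurrent.atomic.AtomicReference

// Holds the currently served instance and swaps it once a rebuild has finished.
final class Reloadable[A](build: () => A) {
  private val current = new AtomicReference[A](build())

  // Request threads call this; they always see a fully built instance.
  def get: A = current.get()

  // Called from a scheduler: builds the replacement first, then publishes it.
  // While build() runs, both the old and the new instance are held in memory.
  def reload(): Unit = current.set(build())
}
```

Requests would then go through `get`, while a scheduled job calls `reload()` every hour or every day.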

As for the temporary users, we can wrap our data model with a PlusAnonymousConcurrentUserDataModel instance. This class allows us to obtain a temporary user id; the id must later be released so that it can be re-used (there’s a limited number of such ids). After obtaining the id, we have to fill in the preferences, and then we can proceed with the recommendation as always:

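import org.apache.mahout.cf.taste.impl.model.{BooleanUserPreferenceArray, PlusAnonymousConcurrentUserDataModel}
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel

val dataModel = new PlusAnonymousConcurrentUserDataModel(
    new FileDataModel(file),
    100)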
 
val recommender: org.apache.mahout.cf.taste.recommender.Recommender = ...
 
// we are assuming a unary model: we only know which items a user likes
def recommendFor(userId: Long, userPreferences: List[Long]) = {
  if (userExistsInDataModel(userId)) {
    recommendForExistingUser(userId)
  } else {
    recommendForNewUser(userPreferences)
  }
}
 
def recommendForNewUser(userPreferences: List[Long]) = {
  val tempUserId = dataModel.takeAvailableUser()
 
  try {
    // filling in a Mahout data structure with the user's preferences
    val tempPrefs = new BooleanUserPreferenceArray(userPreferences.size)
    tempPrefs.setUserID(0, tempUserId)
    userPreferences.zipWithIndex.foreach { case (preference, idx) => 
      tempPrefs.setItemID(idx, preference) 
    }
    dataModel.setTempPrefs(tempPrefs, tempUserId)
 
    recommendForExistingUser(tempUserId)
  } finally {
    dataModel.releaseUser(tempUserId)
  }
}
 
def recommendForExistingUser(userId: Long) = {
  recommender.recommend(userId, 10)
}

Incorporating business logic

It often happens that we want to boost the score of selected items because of some business rules. In our use-case, for example, if a show has a new episode, we want to give it a higher score. That’s possible using Mahout’s IDRescorer interface. An instance of a rescorer is passed when invoking Recommender.recommend. For example:

import org.apache.mahout.cf.taste.recommender.IDRescorer

val rescorer = new IDRescorer {
  def rescore(id: Long, originalScore: Double) = {
    if (showIsNew(id)) {
      originalScore * 1.2 
    } else {
      originalScore
    }
  }
 
  def isFiltered(id: Long) = false
}
 
// Gets 10 recommendations
val result = recommender.recommend(userId, 10, rescorer)
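To see what such a rescorer does to a ranking, here is a dependency-free sketch of the idea: filtered items are dropped, the remaining scores are adjusted, and only then are the top items picked. It mimics the effect and is not Mahout's internal code; `RescoringSketch`, its `Rescorer` trait and `topItems` are made-up names.

```scala
object RescoringSketch {
  // Mirrors the two methods of the rescorer used above, without depending on Mahout.
  trait Rescorer {
    def rescore(id: Long, originalScore: Double): Double
    def isFiltered(id: Long): Boolean
  }

  // Drops filtered items, rescores the rest, and returns the top `howMany` by new score.
  def topItems(
      candidates: Seq[(Long, Double)],
      rescorer: Rescorer,
      howMany: Int): Seq[(Long, Double)] =
    candidates
      .filterNot { case (id, _) => rescorer.isFiltered(id) }
      .map { case (id, score) => id -> rescorer.rescore(id, score) }
      .sortWith(_._2 > _._2)
      .take(howMany)
}
```

With a 1.2 boost for new shows, an item with an original score of 3.5 overtakes one with 4.0, which is exactly the kind of reordering we are after. Returning true from isFiltered removes an item from the results altogether.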

Summary

Mahout is a great basis for creating recommenders. It’s very configurable and provides many extension points. There’s still quite a lot of work in picking the right configuration parameter values, setting up rescoring and evaluating the recommendation results, but the algorithms are solid, so there’s one less thing to worry about.

There’s also a very good book, Mahout in Action, which covers recommender systems and other components of Mahout. It’s based on version 0.5 (current one is 0.8), but the code examples mostly work and the main logic of the project is the same.

Adam

  • Marcin Kubala

    Hi Adam,
    it seems that you have a (minor) typo in the penultimate code listing – the signatures at lines #12 and #16 don’t match.

  • http://www.warski.org/ Adam Warski

    Fixed, thanks!

  • Jordi Aranda

    Hi Adam, first of all thanks for the post, it was pretty interesting. Actually I’ve been also playing a little bit around with Apache Mahout on some basic recommenders and the most compromising step for me is the matrix update process. I would be very grateful if you could write something about it (e.g.: multiple update strategies and the corresponding implementations). Thanks!

  • http://www.warski.org/ Adam Warski

    I suppose you are referring to the Hadoop-based item-item algorithm (where co-occurences are counted)? Or do you mean updating the matrix in some other way?

  • Jordi Aranda

    Probably I did not explain myself well. What I would like to know is how to update the user-item matrix (data model): every time a user rates a certain item, the data model is actually changing and hence, recommendations might also change. All examples I have found so far work with a “static” dataset. So I would like to know what the main strategies for updating the data model are. Is it always a batch process where one must reload the data model in memory with the new ratings? I did not find any blog post talking about this issue, which to me seems to be quite important. Thanks for your time!

  • http://www.warski.org/ Adam Warski

    Ah, I understand. I don’t recall any specific algorithms which would deal with updating the user-item matrix (though my knowledge is certainly limited ;) ). What you do instead is periodically throw away the old model and build a new one.

    The assumption we make here is that individual changes in user preferences don’t impact the overall behaviour of the algorithm *that* much. So recomputing the preferences can be done e.g. once every hour, once every day, or once every week, depending on the data size and the rate at which preferences change.

    To actually make recommendations to new users/users with changed preferences, I apply what’s described in the “on-line” section of the blog – temporarily create a new user, and make recommendations based only on the information from which the recommender was created.