Classifying web pages using non-negative matrix factorization

Posted on 11 May 2018

Here is a list of articles selected from the BBC Sport website.

Champions League: Liverpool played crazy football, says Jurgen Klopp
World Snooker Championship 2018: Barry Hawkins dominates Ding Junhui in one-sided quarter-final
Islam Slimani: Newcastle forward given three-game ban
World Snooker Championship 2018: Mark Williams joins Ding Junhui and Judd Trump in last eight
Champions League: When Liverpool need to keep the ball and keep their shape - Lawrenson
World Snooker Championship 2018: John Higgins wins epic against Judd Trump to complete semis line-up
Atletico Madrid 1-0 Arsenal (agg 2-1)
World Snooker Championship 2018: Mark Williams eyes 'unbelievable' world crown
West Ham striker Andy Carroll apologises to manager David Moyes following row
World Snooker Championship 2018: Judd Trump finds no positives despite unlucky Crucible exit
Real Madrid: Are the Champions League finalists lucky or ruthless?
World Snooker Championship 2018: Judd Trump 'fired up' for title quest after Ricky Walden win
Champions League final: Liverpool given 16,626 tickets
World Snooker Championship 2018: Ali Carter buoyed by run to quarter-finals

You might notice that some of the articles are about football (at the time of writing, Liverpool had just qualified for the Champions League final), and the others are about snooker (the snooker world championship was approaching its climax as I wrote this blog). It would be trivial for anyone to open the links and categorise the articles accordingly. The question is, can we write a computer program to do this automatically? Admittedly, this is a contrived toy problem, but it touches upon the sorts of things that data analysts do every day, and is a nice way of introducing the concept of non-negative matrix factorization (more on this shortly).

I wrote a Python script to parse the web pages and count word frequencies (discarding common words such as 'a' and 'the'). The result was a $1113 \times 14$ matrix: $1113$ rows corresponding to the distinct words detected in the articles, and $14$ columns, one for each article, containing the corresponding word counts. We'll return to this matrix in a little while, but first I'd like to talk about data matrices in more general terms.
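Here is a minimal sketch of how such a term-document matrix can be built. This is not my original script: the two placeholder articles stand in for the stripped text of the $14$ pages, and scikit-learn's CountVectorizer handles the stop-word removal and counting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder texts; in the real example these would be the stripped
# contents of the 14 BBC Sport pages.
articles = [
    "Liverpool reach the Champions League final",
    "Mark Williams wins his snooker quarter-final at the Crucible",
]

# CountVectorizer discards common English stop words ('a', 'the', ...)
# and counts the remaining words. It returns an articles-by-words
# matrix, so we transpose to get the words-by-articles layout used in
# this post.
vectorizer = CountVectorizer(stop_words="english")
A = vectorizer.fit_transform(articles).T.toarray()

print(A.shape)  # (number of distinct words, number of articles)
```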

A data matrix is a set of observations of a number of variables. In our matrix of word counts above, the observations were individual web pages and the variables were the frequencies of the different words on those pages, but there are numerous other examples. For instance, the observations might be pixels in an image, with spectrometry data as the variables. Or observations might correspond to different people, with various test results as the variables. In this blog post we are going to assume that each column of our data matrix corresponds to an observation and each row corresponds to a variable.

Data matrices often have the following properties.

  1. Their entries are non-negative. This isn't always the case but it is true for many important applications.
  2. They can be very large and may be sparse. I recently encountered a data matrix from seismic tomography which had size $87000 \times 67000$, but with only $0.23\%$ of its entries nonzero (the sketch after this list gives a feel for the saving that sparse storage offers).
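To put some numbers on that, here is a small sketch using SciPy's sparse machinery. The matrix is randomly generated, with the same size and density as the tomography matrix mentioned above.

```python
import scipy.sparse as sp

# A random 87000 x 67000 matrix with roughly 0.23% nonzero entries,
# stored in compressed sparse row (CSR) format. Storing the same matrix
# densely in double precision would take 87000 * 67000 * 8 bytes, about
# 47 GB; CSR keeps only the nonzeros and their positions.
A = sp.random(87000, 67000, density=0.0023, format="csr")
print(f"{A.nnz} nonzeros, {A.data.nbytes / 1e9:.2f} GB of values")
```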

Large matrices are cumbersome to deal with. So it is natural to ask whether we can encapsulate the data using a smaller matrix, especially if our data matrix contains many zeros. Various techniques exist to do this, for example principal components analysis, or linear discriminant analysis. The drawback of these methods is that they do not preserve the non-negativity of the original data matrix, making the results potentially difficult to interpret. This is where non-negative matrix factorization comes in.

Non-negative matrix factorization (NMF) takes a non-negative data matrix $A$ and attempts to factor it into the product of a tall, skinny matrix $W$ (known as the features matrix) and a short, wide matrix $H$ (the coefficients matrix). Both $W$ and $H$ are non-negative. This is shown in the graphic below. Note the presence of $\approx$ rather than $=$: an exact NMF may not exist. In practice NMF algorithms iterate towards acceptable solutions, rather than obtaining the optimal solution. We have two new routines for computing non-negative matrix factorizations coming to the NAG Library soon.
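In symbols (using my own notation rather than anything from the NAG documentation): if $A$ has $m$ rows (variables) and $n$ columns (observations), we choose a rank $k \ll \min(m, n)$ and seek

$$W \in \mathbb{R}_{\ge 0}^{m \times k}, \quad H \in \mathbb{R}_{\ge 0}^{k \times n} \quad \text{such that} \quad A \approx WH,$$

typically by minimising the reconstruction error $\|A - WH\|_F$ subject to the non-negativity constraints.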

[Figure: graphical representation of non-negative matrix factorization, $A \approx WH$]

The strength of NMF is that the preservation of non-negativity makes it easier to interpret the factors $W$ and $H$. In general $W$ tells us how the different variables can be grouped together into $k$ features that in some way represent the data. The matrix $H$ tells us how our original observations are built from these features, with the non-negativity ensuring this is done in a purely additive manner.

The best way of understanding this is to go back to our original example. Recall that we had a $1113 \times 14$ matrix of word frequencies for our $14$ web pages. I used one of the upcoming NAG Library routines to factorize it, choosing $k=2$. This resulted in a $1113 \times 2$ features matrix $W$ and a $2 \times 14$ coefficients matrix $H$. Let's discuss them in turn.
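The NAG routines are not yet public, so as a stand-in here is the same computation sketched with scikit-learn's NMF, continuing from the matrix $A$ built in the earlier sketch.

```python
from sklearn.decomposition import NMF

# Factorize A (words x articles) with k = 2. On the blog's data this
# yields a 1113 x 2 features matrix W and a 2 x 14 coefficients matrix
# H. init="nndsvd" gives a deterministic SVD-based starting point; like
# all NMF solvers, this iterates to a local rather than global minimum.
model = NMF(n_components=2, init="nndsvd", max_iter=500)
W = model.fit_transform(A)  # features matrix
H = model.components_       # coefficients matrix
```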

Each column of $W$ corresponds to a particular weighted grouping of the $1113$ distinct words from the BBC Sport articles. The larger the entries in the column, the more important the corresponding word is deemed to be. Rather than displaying $W$ in its entirety, I looked at the 10 largest entries in each column to see what the most important words were. The results are shown in this screenshot from my Python script below.

[Figure: the ten most heavily weighted words in each of the two columns of the features matrix]
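To give a flavour of how such a list is extracted, here is a sketch that prints the ten most heavily weighted words in each column of $W$, reusing the vectorizer and W objects from the sketches above.

```python
import numpy as np

# For each of the k columns of W, list the words with the largest
# weights; 'words' holds the row labels of W.
words = vectorizer.get_feature_names_out()
for j in range(W.shape[1]):
    top = np.argsort(W[:, j])[::-1][:10]
    print(f"feature {j}: {', '.join(words[i] for i in top)}")
```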

If you're at all familiar with significant terms and people in the snooker and football worlds, you'll hopefully agree that the first column corresponds to snooker and the second to football. It seems that our non-negative matrix factorization has successfully detected the two categories of web page. Let's label these two categories snooker and football. Can we now use the NMF to categorise the individual pages accurately? To do this we need to look at the coefficients matrix $H$.

This coefficients matrix is of size $2 \times 14$. The entries in each column show us how well that particular web page fits into our two categories. I assigned each page to a category by simply selecting the largest entry in the column. The results are below, with the label in brackets after each headline showing how the page was categorised by the NMF. I will let you judge for yourself whether the categorisations are correct!
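A sketch of that selection step, continuing with the matrix $H$ from earlier:

```python
import numpy as np

# Each column of H scores one page against the two features; the row
# index of the largest entry decides the category. In the blog's
# example, feature 0 turned out to be snooker and feature 1 football.
labels = np.array(["snooker", "football"])
for j, row in enumerate(np.argmax(H, axis=0)):
    print(f"article {j}: {labels[row]}")
```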

Champions League: Liverpool played crazy football, says Jurgen Klopp [football]
World Snooker Championship 2018: Barry Hawkins dominates Ding Junhui in one-sided quarter-final [snooker]
Islam Slimani: Newcastle forward given three-game ban [football]
World Snooker Championship 2018: Mark Williams joins Ding Junhui and Judd Trump in last eight [snooker]
Champions League: When Liverpool need to keep the ball and keep their shape - Lawrenson [football]
World Snooker Championship 2018: John Higgins wins epic against Judd Trump to complete semis line-up [snooker]
Atletico Madrid 1-0 Arsenal (agg 2-1) [football]
World Snooker Championship 2018: Mark Williams eyes 'unbelievable' world crown [snooker]
West Ham striker Andy Carroll apologises to manager David Moyes following row [football]
World Snooker Championship 2018: Judd Trump finds no positives despite unlucky Crucible exit [snooker]
Real Madrid: Are the Champions League finalists lucky or ruthless? [football]
World Snooker Championship 2018: Judd Trump 'fired up' for title quest after Ricky Walden win [snooker]
Champions League final: Liverpool given 16,626 tickets [football]
World Snooker Championship 2018: Ali Carter buoyed by run to quarter-finals [snooker]

This was a small example, designed for illustrative purposes only. However, non-negative matrix factorization has become an important tool in the analysis of high-dimensional data. Real-world applications of NMF include facial recognition, image processing, the detection of exoplanets in astronomy, mass spectrometry, text mining, speech denoising and bioinformatics, to name but a few.

Look out for our new non-negative matrix factorization routines f01sa (for dense data matrices) and f01sb (for large, sparse data matrices) coming to the NAG Library soon.