Comparing Brazilian and US university theses using natural language processing

Topic Modeling

Another strategy I used to explore the themes from theses of both Universities was topic modeling with Latent Semantic Indexing (LSI).

Latent Semantic Indexing

This algorithm takes the tf-idf matrix as input and uses matrix decomposition to group documents into topics. We will need some linear algebra to understand this, so let’s start.

Singular Value Decomposition (SVD)

First we need to define how to do this matrix decomposition. We will use Singular Value Decomposition (SVD). Given a matrix M of dimensions m x n, M can be described as:

M = UDV*

Where U and V* are orthonormal bases (V* represents the transpose of matrix V). A basis is orthonormal when two things hold (normal + orthogonal):

  • when all vectors are of length 1
  • when all vectors are mutually orthogonal (they make an angle of 90°)

D is a diagonal matrix (the entries outside the main diagonal are all zero).
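A quick way to see the decomposition in action is numpy’s `svd` (a small sketch; the example matrix is arbitrary):

```python
import numpy as np

# An arbitrary example matrix
M = np.array([[3.0, 0.0],
              [0.0, 1.0]])

# full_matrices=False gives the compact form M = U @ D @ V*
U, s, Vt = np.linalg.svd(M, full_matrices=False)
D = np.diag(s)

print(np.allclose(U.T @ U, np.eye(2)))  # True: the columns of U are orthonormal
print(np.allclose(M, U @ D @ Vt))       # True: M is recovered exactly
```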

To get a sense of how all of this works together, we will use the brilliant geometric explanation from this article by David Austin.

Let’s say we have a matrix M:

M = | 3  0 |
    | 0  1 |

We can take a point (x, y) in the plane and transform it into another point using matrix multiplication: M maps (x, y) to (3x, y).

The effect of this transformation is shown below:

As we can see, the plane is horizontally stretched by a factor of 3, while there is no vertical change.
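Assuming M = [[3, 0], [0, 1]] (the horizontal stretch by a factor of 3), a quick check in code:

```python
import numpy as np

M = np.array([[3, 0],
              [0, 1]])

point = np.array([1, 2])   # a point (x, y) in the plane
print(M @ point)           # -> [3 2]: x is tripled, y is unchanged
```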

Now, if we take another matrix, M’:

M’ = | 2  1 |
     | 1  2 |

The effect is:

It is not so clear how to simply describe the geometric effect of the transformation. However, let’s rotate our grid through a 45 degree angle and see what happens.

We see now that this new grid is transformed in the same way that the original grid was transformed by the diagonal matrix: the grid is stretched by a factor of 3 in one direction.

Now let’s use some definitions. M is a diagonal matrix (the entries outside the main diagonal are all zero) and both M and M’ are symmetric (if we take the columns and use them as the new rows, we get the same matrix back).

Multiplying by a diagonal matrix results in a scaling effect (a linear transformation that enlarges or shrinks objects by a scale factor).

The effect we saw (the same result for both M and M’) is a very special situation that results from the fact that the matrix M’ is symmetric. If we have a symmetric 2 x 2 matrix, it turns out that we may always rotate the grid in the domain so that the matrix acts by stretching and perhaps reflecting in the two directions. In other words, symmetric matrices behave like diagonal matrices. - David Austin

“This is the geometric essence of the singular value decomposition for 2 x 2 matrices: for any 2 x 2 matrix, we may find an orthogonal grid that is transformed into another orthogonal grid.” - David Austin

We will express this fact using vectors: with an appropriate choice of orthogonal unit vectors v1 and v2, the vectors Mv1 and Mv2 are orthogonal.

We will use n1 and n2 to denote unit vectors in the direction of Mv1 and Mv2. The lengths of Mv1 and Mv2, denoted by σ1 and σ2, describe the amount that the grid is stretched in those particular directions.

Now that we have a geometric essence, let’s go back to the formula M = UDV*:

  • U is a matrix whose columns are the vectors n1 and n2 (unit vectors of the ‘new’ grid, in the directions of Mv1 and Mv2)
  • D is a diagonal matrix whose entries are σ1 and σ2 (the length of each vector)
  • V* is a matrix whose columns are v1 and v2 (vectors of the ‘old’ grid)
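We can check this picture numerically. Assuming M’ = [[2, 1], [1, 2]] (the symmetric matrix from the example), its singular values should be the stretch factors, 3 and 1:

```python
import numpy as np

M2 = np.array([[2.0, 1.0],
               [1.0, 2.0]])

U, s, Vt = np.linalg.svd(M2)
print(s)   # -> [3. 1.]: stretched by 3 in one direction, unchanged in the other

# v1 and v2 (the rows of Vt) are orthogonal, and so are M2 @ v1 and M2 @ v2
v1, v2 = Vt[0], Vt[1]
print(np.isclose(np.dot(M2 @ v1, M2 @ v2), 0.0))  # True
```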

Now that we understand a little about how SVD works, let’s see how LSI makes use of the technique to group texts. As Ian Soboroff shows on his Information Retrieval course slides:

  • U is a matrix for transforming new documents
  • D is the diagonal matrix that gives relative importance of dimensions (we will talk more about these dimensions in a minute)
  • V* is the representation of M in k dimensions

To see how this works we will use document titles from two domains (Human Computer Interaction and Graph Theory). These examples are from the paper An Introduction to Latent Semantic Analysis.

c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: System and human system engineering testing of EPS
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors: A survey

The first step is to create a matrix with the number of times each term appears:

| term      | c1 | c2 | c3 | m1 | m2 | m3 |
| --------- | -- | -- | -- | -- | -- | -- |
| human | 1 | 0 | 1 | 0 | 0 | 0 |
| interface | 1 | 0 | 0 | 0 | 0 | 0 |
| computer | 1 | 1 | 0 | 0 | 0 | 0 |
| user | 0 | 1 | 0 | 0 | 0 | 0 |
| system | 0 | 1 | 2 | 0 | 0 | 0 |
| survey | 0 | 1 | 0 | 0 | 0 | 1 |
| trees | 0 | 0 | 0 | 1 | 1 | 0 |
| graph | 0 | 0 | 0 | 0 | 1 | 1 |
| minors | 0 | 0 | 0 | 0 | 0 | 1 |

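We can also compute the decomposition directly with numpy. A sketch using the term-document matrix above (note that SVD column signs are not unique, so whole columns may come out with flipped signs compared to any published result):

```python
import numpy as np

# rows: human, interface, computer, user, system, survey, trees, graph, minors
# cols: c1, c2, c3, m1, m2, m3
M = np.array([
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(U.shape, s.shape, Vt.shape)           # (9, 6) (6,) (6, 6)
print(np.allclose(M, U @ np.diag(s) @ Vt))  # True: the decomposition recovers M
```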
Decomposing the matrix we have this (you can use this online tool to apply the SVD):

# U Matrix (to transform new documents)
-0.386  0.222 -0.096 -0.458  0.357 -0.105
-0.119  0.055 -0.434 -0.379  0.156 -0.040
-0.345 -0.062 -0.615 -0.089 -0.264  0.135
-0.226 -0.117 -0.181  0.290 -0.420  0.175
-0.760  0.218  0.493  0.133 -0.018  0.044
-0.284 -0.498 -0.176  0.374  0.033 -0.311
-0.013 -0.321  0.289 -0.571 -0.582 -0.386
-0.069 -0.621  0.185 -0.252  0.236  0.675
-0.057 -0.382  0.005  0.085  0.453 -0.485

Matrix that gives relative importance of dimensions:

# D Matrix (relative importance of dimensions)

Representation of M in k dimensions (in this case, we have k documents):

# V* Matrix (representation of M in k dimensions)

Okay, we have the matrices. But now the matrix is not 2 x 2. Do we really need all the dimensions that this term-document matrix has? Are all dimensions important features for each term and each document?

Let’s go back to David Austin’s example. Let’s say now we have M”:

Now M” is no longer a symmetric matrix. For this matrix, the value of σ2 is zero. On the grid, the result of the multiplication is:

If a value on the main diagonal of D is zero, that term does not appear in the decomposition of M.

In this way, we see that the rank of M, which is the dimension of the image of the linear transformation, is equal to the number of non-zero values. - David Austin
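To see this rank idea in code, take a non-symmetric rank-1 matrix (an illustrative example, not necessarily Austin’s M”):

```python
import numpy as np

# A rank-1 matrix: the second row is 3 times the first
A = np.array([[1.0, 2.0],
              [3.0, 6.0]])

s = np.linalg.svd(A, compute_uv=False)
print(s)                         # the second singular value is (numerically) 0
print(np.linalg.matrix_rank(A))  # -> 1: rank = number of non-zero singular values
```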

What LSI does is to change the dimensionality of the terms.

In the original matrix terms are k-dimensional (k is the number of documents). The new space has lower dimensionality, so the dimensions are now groups of terms that tend to co-occur in the same documents. - Ian Soboroff

Now we can go back to the example. Let’s create a space with two dimensions. For this we will use only two values of the diagonal matrix D:

# D2 Matrix

As Alex Thomo explains in this tutorial, terms are represented by the row vectors of U2 x D2 (U2 is U with only 2 dimensions) and documents are represented by the column vectors of D2 x V2* (V2* is V* with only 2 dimensions). We multiply by D2 because D is the diagonal matrix that gives relative importance of dimensions, remember?

Then we calculate the coordinates of each term and each document through these multiplications. The result is:

human     = (-1.031,  0.440)
interface = (-0.318,  0.109)
computer  = (-0.922, -0.123)
user      = (-0.604, -0.232)
system    = (-2.031, -0.232)
survey    = (-0.759, -0.988)
trees     = (-0.035, -0.637)
graph     = (-0.184, -1.231)
minors    = (-0.152, -0.758)
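These coordinates come from truncating the SVD at k = 2. A sketch of the computation (numpy’s sign conventions may flip whole columns, so individual coordinates can differ from the values above by a sign):

```python
import numpy as np

# The same term-document matrix as before
M = np.array([
    [1, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0], [0, 1, 2, 0, 0, 0], [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
U2, D2, V2t = U[:, :k], np.diag(s[:k]), Vt[:k, :]

term_coords = U2 @ D2       # one 2-D point per term (rows of U2 x D2)
doc_coords = (D2 @ V2t).T   # one 2-D point per document (columns of D2 x V2*)

terms = ['human', 'interface', 'computer', 'user', 'system',
         'survey', 'trees', 'graph', 'minors']
for term, (x, y) in zip(terms, term_coords):
    print(f'{term:9s} = ({x:.3f}, {y:.3f})')
```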

Using matplotlib to visualize this, we have:

Cool, right? The vectors in red are Human Computer Interaction documents and the blue ones are of Graph Theory documents.

What about the choice of the number of dimensions?

The number of dimensions retained in LSI is an empirical issue. Because the underlying principle is that the original data should not be perfectly regenerated but, rather, an optimal dimensionality should be found that will cause correct induction of underlying relations, the customary factor-analytic approach of choosing a dimensionality that most parsimoniously represent the true variance of the original data is not appropriate. - Source

The measure of similarity computed in the reduced dimensional space is usually, but not always, the cosine between vectors.
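A minimal cosine-similarity sketch for vectors in the reduced space (the example vectors are made up):

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# e.g. two documents in the 2-D LSI space
print(cosine(np.array([-1.0, 0.4]), np.array([-1.0, 0.4])))  # ~1.0 (same direction)
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))    # ~0.0 (orthogonal)
```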

And now we can go back to the dataset of theses from the Universities. I used the LSI model from gensim. I did not find many differences between the works of the two Universities (all seemed to belong to the same cluster). The topic that most differentiated the works of the Universities was this one:

y topic:
[('object', 0.29383227033104375),
('software', -0.22197520420133632),
('algorithm', 0.20537550622495102),
('robot', 0.18498675015157251),
('model', -0.17565360130127983),
('project', -0.164945961528315),
('busines', -0.15603883815175643),
('management', -0.15160458583774569),
('process', -0.13630070297362168),
('visual', 0.12762128292042879)]

Visually we have:

In the image the y topic is on the y-axis. We can see that Carnegie Mellon theses are more associated with ‘object’, ‘robot’, and ‘algorithm’ and the theses from UFPE are more associated with ‘software’, ‘project’, and ‘business’.

You can see the Jupyter notebook with the code here.

