Natural Language Processing Using Stanford’s CoreNLP

Analyzing Text Data in Just Two Lines of Code

Introduction

Stanford’s CoreNLP makes text data analysis easy and efficient. With just a few lines of code, CoreNLP allows for the extraction of all kinds of text properties, such as named entities or part-of-speech tags. CoreNLP is written in Java and requires Java to be installed on your device, but it offers programming interfaces for several popular programming languages, including Python, which I will be using in this demonstration. Additionally, it supports five languages other than English: Arabic, Chinese, German, French, and Spanish.

I. How To Install CoreNLP

First, we have to download CoreNLP. If you’re using a MacBook, open the terminal and run the following command:

wget https://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip https://nlp.stanford.edu/software/stanford-english-corenlp-2018-10-05-models.jar

This will start the download of CoreNLP’s latest version (3.9.2 as of February 2019).

Downloading CoreNLP will take a while depending on your internet connection. When the download is complete, all that’s left is unzipping the file with the following commands:

unzip stanford-corenlp-full-2018-10-05.zip
mv stanford-english-corenlp-2018-10-05-models.jar stanford-corenlp-full-2018-10-05

The command mv A B moves file A to folder B or alternatively changes the filename from A to B.

II. Starting the Server and Installing Python API

To use CoreNLP, you first have to start the server. All you have to do is move into the folder created in step I and use Java to run CoreNLP. Let’s look at the commands we need for that:

cd stanford-corenlp-full-2018-10-05
java -mx6g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 5000

The cd command opens the folder we created. Then, to run the server, we use Java. The parameter -mx6g specifies the amount of memory that CoreNLP is allowed to use. In this case, it’s six gigabytes. The -timeout 5000 parameter specifies the timeout in milliseconds.

Once the server is running, its startup output includes the port it is listening on (9000 in this demonstration).

This port number is going to be important when using CoreNLP in Python.
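Before moving on, it can help to confirm that something is actually listening on that port. The snippet below is a minimal sketch using only Python’s standard library; the helper name is_port_open is my own and not part of CoreNLP, and the default port 9000 is assumed from the URL used later in this demonstration. It only checks that a TCP connection can be opened, not that the process behind it is CoreNLP.

```python
import socket


def is_port_open(host="localhost", port=9000, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; host and port are assumptions
    and should match wherever the CoreNLP server was started.
    """
    try:
        # create_connection raises OSError if nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Server reachable:", is_port_open())
```

If this prints False, the server has most likely not finished starting or is running on a different port.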

The last thing needed before starting to analyze text is to install a Python API.

I’m going to use py-corenlp, which can be installed with pip install pycorenlp, but there are other Python packages that you can check out here. If you happen to be an avid user of NLTK, there is also an API for NLTK that lets you use CoreNLP. The full instructions can be found here.

III. NLP with CoreNLP

With CoreNLP installed, we can finally start analyzing text data in Python. First, let’s import py-corenlp and initialize CoreNLP. This is where the port number from above comes into play:

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

The syntax in NLTK is very similar:

from nltk.parse import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')

The rest of this demonstration is going to focus on py-corenlp, but you could also use NLTK as pointed out above. The main difference between the two is that py-corenlp returns the raw JSON output, from which you can extract whatever you’re specifically interested in, while NLTK provides you with functions that do so for you.

The only other function needed to conduct NLP using py-corenlp is nlp.annotate(). Inside the function, one can specify what kind of analysis CoreNLP should execute. In this demonstration, I will take a look at four sentences with different sentiments. All of this can be done in a single line of code, but for readability it’s better to stretch it over several lines.

text = "This movie was actually neither that funny, nor super witty. The movie was meh. I liked watching that movie. If I had a choice, I would not watch that movie again."
result = nlp.annotate(text,
                      properties={
                          'annotators': 'sentiment, ner, pos',
                          'outputFormat': 'json',
                          'timeout': 1000,
                      })

The annotators parameter specifies what kind of analyses CoreNLP is going to do. In this case, I’ve specified that I want CoreNLP to do sentiment analysis as well as named-entity recognition and part-of-speech tagging. The JSON output format will allow me to easily index into the results for further analysis.
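If you switch between different combinations of annotators, it can be convenient to build the properties dictionary from a list. The following is only a sketch of one way to do that; build_properties is a hypothetical helper of my own, not part of py-corenlp, and it simply reproduces the three keys used above.

```python
def build_properties(annotators, timeout_ms=1000):
    """Build a properties dict for nlp.annotate() (hypothetical helper).

    annotators: list of annotator names such as ["sentiment", "ner", "pos"].
    timeout_ms: timeout in milliseconds, mirroring the 'timeout' key above.
    """
    return {
        "annotators": ",".join(annotators),  # comma-separated string
        "outputFormat": "json",
        "timeout": timeout_ms,
    }


props = build_properties(["sentiment", "ner", "pos"])
print(props)

# With a running server, this dict could be passed along as:
# result = nlp.annotate(text, properties=props)
```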

Sentiment Analysis

Sentiment analysis with CoreNLP is very straightforward. After having run the code chunk above, no further computation is needed. Let’s look at the results for the four sentences defined above:

for s in result["sentences"]:
    print("{}: '{}': {} (Sentiment Value) {} (Sentiment)".format(
        s["index"],
        " ".join([t["word"] for t in s["tokens"]]),
        s["sentimentValue"], s["sentiment"]))

Running this for-loop outputs the results of the sentiment analysis:

0: 'This movie was actually neither that funny , nor super witty.': 1 (Sentiment Value) Negative (Sentiment)
1: 'The movie was meh.': 2 (Sentiment Value) Neutral (Sentiment)
2: 'I liked watching that movie.': 3 (Sentiment Value) Positive (Sentiment)
3: 'If I had a choice , I would not watch that movie again.': 1 (Sentiment Value) Negative (Sentiment)

The scale for sentiment values ranges from zero to four. Zero means that the sentence is very negative, while four means it’s very positive. As you can see, CoreNLP did a very good job. The first sentence is tricky since it includes positive words like ‘funny’ and ‘witty’, but CoreNLP correctly recognized that they are negated. The easier sentences were classified correctly too.
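When working with more than a handful of sentences, a short summary is often more useful than the individual lines. The sketch below counts the labels and averages the sentiment values; summarize_sentiment is a hypothetical helper, and the sample list simply re-types the four results printed above so that the snippet runs without a server. It assumes the sentiment value may arrive as a string and converts it with int().

```python
from collections import Counter


def summarize_sentiment(sentences):
    """Return (label counts, mean sentiment value) for annotated sentences."""
    labels = Counter(s["sentiment"] for s in sentences)
    # int() handles the value whether it is stored as a string or a number.
    values = [int(s["sentimentValue"]) for s in sentences]
    return labels, sum(values) / len(values)


# Stand-in for result["sentences"], copied from the output shown above.
sample = [
    {"sentimentValue": "1", "sentiment": "Negative"},
    {"sentimentValue": "2", "sentiment": "Neutral"},
    {"sentimentValue": "3", "sentiment": "Positive"},
    {"sentimentValue": "1", "sentiment": "Negative"},
]

labels, mean_value = summarize_sentiment(sample)
print(labels)
print(mean_value)
```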

Another option when trying to understand these classifications is to take a look at the sentiment distribution, which again ranges from zero to four.

For the second sentence, the distribution peaks around a sentiment value of two, which is why the sentence is classified as neutral.
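The peak of a distribution can also be read off programmatically. The sketch below assumes that each sentence in the JSON output carries its class probabilities under a key named sentimentDistribution; that key name is an assumption you should verify against your own output, peak_sentiment is a hypothetical helper, and the numbers in mock_sentence are made up purely for illustration.

```python
def peak_sentiment(sentence):
    """Return (index, probability) of the most likely sentiment class.

    Assumes sentence["sentimentDistribution"] is a list of five
    probabilities for the classes 0 (very negative) to 4 (very positive).
    """
    dist = sentence["sentimentDistribution"]
    peak = max(range(len(dist)), key=dist.__getitem__)
    return peak, dist[peak]


# Made-up distribution, NOT actual CoreNLP output.
mock_sentence = {"sentimentDistribution": [0.05, 0.25, 0.55, 0.10, 0.05]}

peak, prob = peak_sentiment(mock_sentence)
print(peak, prob)
```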

POS Tagging

Part-of-speech tagging, just like sentiment analysis, does not require any additional computation. As with the code chunk above, all that’s needed to retrieve the desired information is a for-loop:

pos = []
for word in result["sentences"][2]["tokens"]:
    pos.append('{} ({})'.format(word["word"], word["pos"]))

" ".join(pos)

Running the code returns the following:

'I (PRP) liked (VBD) watching (VBG) that (IN) movie (NN) . (.)'

The abbreviations in parentheses represent the POS tags and follow the Penn Treebank POS tagset, which you can find here. To give you an intuition, PRP stands for personal pronoun, VBD for a verb in past tense, and NN for a noun.
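Because related Penn Treebank tags share a prefix (VBD and VBG both start with VB, for example), filtering tokens by tag prefix is a simple way to pull out all verbs or all nouns. Here is a minimal sketch; words_with_tag is a hypothetical helper, and sample_tokens re-types the tagged sentence from above in place of result["sentences"][2]["tokens"].

```python
def words_with_tag(tokens, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [t["word"] for t in tokens if t["pos"].startswith(prefix)]


# Stand-in for result["sentences"][2]["tokens"], copied from the output above.
sample_tokens = [
    {"word": "I", "pos": "PRP"},
    {"word": "liked", "pos": "VBD"},
    {"word": "watching", "pos": "VBG"},
    {"word": "that", "pos": "IN"},
    {"word": "movie", "pos": "NN"},
    {"word": ".", "pos": "."},
]

verbs = words_with_tag(sample_tokens, "VB")
nouns = words_with_tag(sample_tokens, "NN")
print(verbs, nouns)
```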

Named-Entity Recognition

Another possible use of CoreNLP is named-entity recognition. Let’s create a new sentence in order for named-entity recognition to make more sense:

“The earphones Jim bought for Jessica while strolling through the Apple store at the airport in Chicago, USA, were great.”

After running nlp.annotate() again on the new text, all we need to do is define a for-loop:

pos = []
for word in result["sentences"][1]['tokens']:
    pos.append('{} ({})'.format(word['word'], word['ner']))

" ".join(pos)

Running the code above gives us the following result:

'The (O) earphones (O) Jim (PERSON) bought (O) for (O) Jessica (PERSON) while (O) strolling (O) through (O) the (O) Apple (ORGANIZATION) store (O) at (O) the (O) airport (O) in (O) Chicago (CITY) , (O) USA (COUNTRY) , (O) was (O) meh (O) . (O)'

As we can see, CoreNLP has correctly identified that Jim and Jessica are people, Apple is an organization, Chicago is a city, and the United States is a country.
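In practice, you usually want just the entities rather than a tag for every token. One possible approach, sketched below, is to group consecutive tokens that share the same tag and drop everything tagged O. extract_entities is a hypothetical helper, and sample_tokens re-types part of the output above so that the snippet runs without a server.

```python
from itertools import groupby


def extract_entities(tokens):
    """Merge consecutive tokens with the same NER tag into (text, tag) pairs."""
    entities = []
    for tag, group in groupby(tokens, key=lambda t: t["ner"]):
        if tag != "O":  # "O" marks tokens outside any named entity
            entities.append((" ".join(t["word"] for t in group), tag))
    return entities


# Stand-in for the annotated tokens, copied from part of the output above.
sample_tokens = [
    {"word": "Jim", "ner": "PERSON"},
    {"word": "bought", "ner": "O"},
    {"word": "for", "ner": "O"},
    {"word": "Jessica", "ner": "PERSON"},
    {"word": "Apple", "ner": "ORGANIZATION"},
    {"word": "store", "ner": "O"},
    {"word": "Chicago", "ner": "CITY"},
    {"word": ",", "ner": "O"},
    {"word": "USA", "ner": "COUNTRY"},
]

entities = extract_entities(sample_tokens)
print(entities)
```

Note that this simple grouping would also merge two distinct entities of the same type if they appear directly next to each other.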

Before explaining how to shut down the server, I’d like to point out that CoreNLP provides many other functionalities (lemmatization, stemming, tokenization, etc.) that can all be accessed without having to run any additional computations. A complete list of all parameters for nlp.annotate() can be found here.

IV. Shutting Down the Server & Conclusion

If you would like to shut down the server, navigate to the terminal window you used to start the server earlier and press Ctrl + C.

To summarize, CoreNLP’s efficiency is what makes it so convenient. You only have to specify what analysis you’re interested in once and avoid unnecessary computations that might slow you down when working with larger data sets. If you’d like to know more about the details of how CoreNLP works and what options there are, I’d recommend reading the documentation.
