Deep Learning and Its Applications to Natural Language Processing

  • Haiqin Yang
  • Linkai Luo
  • Lap Pong Chueng
  • David Ling
  • Francis Chin

Part of the Cognitive Computation Trends book series (COCT, volume 2)


Natural language processing (NLP), utilizing computer programs to process large amounts of language data, is a key research area in artificial intelligence and computer science. Deep learning technologies have been well developed and applied in this area. However, the literature still lacks a succinct survey, which would allow readers to get a quick understanding of (1) how the deep learning technologies apply to NLP and (2) what the promising applications are. In this survey, we try to investigate the recent developments of NLP, centered around natural language understanding, to answer these two questions. First, we explore the newly developed word embedding or word representation methods. Then, we describe two powerful learning models, Recurrent Neural Networks and Convolutional Neural Networks. Next, we outline five key NLP applications, including (1) part-of-speech tagging and named entity recognition, two fundamental NLP applications; (2) machine translation and automatic English grammatical error correction, two applications with prominent commercial value; and (3) image description, an application requiring technologies of both computer vision and NLP. Moreover, we present a series of benchmark datasets which would be useful for researchers to evaluate the performance of models in the related applications.

4.1 Introduction

Deep learning has revived neural networks and artificial intelligence technologies to effectively learn data representation from the original data (LeCun et al. ; Goodfellow et al. ). Excellent performance has been reported in speech recognition (Graves et al. ) and computer vision (Krizhevsky et al. ). Much effort has now turned to the area of natural language processing.

Natural language processing (NLP), utilizing computer programs to process large amounts of language data, is a key research area in artificial intelligence and computer science. Challenges of NLP include speech recognition, natural language understanding, and natural language generation. Though much effort has been devoted in this area, the literature still lacks a succinct survey, which would allow readers to get a quick understanding of how the deep learning technologies apply to NLP and what the interesting applications are.

In this survey, we try to investigate recent development of NLP to answer the above two questions. We mainly focus on the topics that tackle the challenge of natural language understanding. We will divide the introduction into the following three aspects:

  • summarizing the neural language models to learn word vector representations, including Word2vec and Glove (Mikolov et al. ,; Pennington et al. ),

  • introducing the powerful tools of the recurrent neural networks (RNNs) (Elman ; Chung et al. ; Hochreiter and Schmidhuber ) and the convolutional neural networks (CNNs) (Kim ; dos Santos and Gatti ; Gehring et al. ), for language models to capture dependencies in languages. More specifically, we will introduce two popular extensions of RNNs, i.e., the long short-term memory (LSTM) (Hochreiter and Schmidhuber ) network and the Gated Recurrent Unit (GRU) (Chung et al. ) network, and briefly discuss the efficiency of CNNs for NLP.

  • outlining and sketching the development of five key NLP applications, including part-of-speech (POS) tagging (Collobert et al. ; Toutanova et al. ), named entity recognition (NER) (Collobert et al. ; Florian et al. ), machine translation (Bahdanau et al. ; Sutskever et al. ), automatic English grammatical error correction (Bhirud et al. ; Hoang et al. ; Manchanda et al. ; Ng et al. ), and image description (Bernardi et al. ; Hodosh et al. ; Karpathy and Fei-Fei ).

Finally, we present a series of benchmark datasets which are popularly applied in the above models and applications, while concluding the whole article with some discussions. We hope this short review of the recent progress of NLP can help researchers new to the area to quickly enter this field.



The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. UGC/IDS14/16).


    Marcus M, Santorini B, Marcinkiewicz M, Taylor A (1999) Treebank-3 LDC99T42. Web Download. Linguistic Data Consortium, Philadelphia.

4.2 Learning Word Representations

A critical issue of NLP is to effectively represent the features from the original text data. Traditionally, the numerical statistics, such as term frequency or term frequency inverse document frequency (tf-idf), are utilized to determine the importance of a word. However, in NLP, the goal is to extract the semantic meaning from the given corpus. In the following, we will introduce the state-of-the-art word embedding methods, including word2vec (Mikolov et al. ) and Glove (Pennington et al. ).

Word embeddings (or word representations) are arguably the most widely known technique in the recent history of NLP. Formally, a word embedding or a word representation is a vector of real numbers assigned to each word in the vocabulary. There are various approaches to learn word embeddings, which force similar words to be as close as possible in the semantic space. Among them, word2vec and Glove have attracted a great amount of attention in recent years. These two methods are based on the distributional hypothesis (Harris ), where words appearing in similar contexts tend to have similar meaning, and the concept that one can know a word by the company it keeps (Firth ).

Word2vec (Mikolov et al. 2013a) is not a new concept; however, it gained popularity only after two important papers, Mikolov et al. (,), were published in 2013. Word2vec models are shallow (only two-layer) feedforward neural networks trained to reconstruct the linguistic contexts of words. The networks are fed a large corpus of text and then produce a vector space that is shown to carry semantic meaning. In Mikolov et al. (), two word2vec models, i.e., Continuous Bag of Words (CBOW) and skip-gram, are introduced. In CBOW, the word embeddings are constructed through a supervised learning approach by considering the auxiliary ("fake") learning task of predicting a word from its surrounding context, which is usually restricted to a small window of words. In skip-gram, the model utilizes the current word to predict its surrounding context words. Both approaches take the values of a fixed-size inner layer as the embedding. Note that the order of context words does not influence the prediction in either setting. According to Mikolov et al. (), CBOW trains faster than skip-gram, but skip-gram does a better job of representing infrequent words.
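To make the two training tasks concrete, the following minimal sketch builds the training examples each model would see from a tokenized sentence. The function names and the `window` parameter are illustrative assumptions, not part of the original word2vec code, and no network is trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram view: the current word predicts each word in its window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW view: the bag of surrounding words predicts the current word."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Sorting the context reflects that word order inside the window is ignored.
        context = sorted(tokens[j] for j in range(lo, hi) if j != i)
        examples.append((context, target))
    return examples


if __name__ == "__main__":
    sentence = ["the", "cat", "sat"]
    print(skipgram_pairs(sentence, window=1))
    print(cbow_examples(sentence, window=1))
```

Skip-gram yields one example per (word, neighbor) pair, while CBOW yields one example per word, which is consistent with CBOW being the cheaper of the two to train.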

One main issue of word2vec is the high computational cost of training on huge corpora. In Mikolov et al. (), hierarchical softmax and negative sampling are proposed to address this issue. Moreover, to enhance computational efficiency, several tricks are adopted, including (1) eliminating the most frequent words, such as “a” and “the”, as they provide less informational value than rare words; and (2) learning common phrases and treating them as single words, e.g., “New York” is replaced by “New_York”. More details about the algorithms and the tricks can be found in Rong (). An implementation of word2vec in C is available in the Google Code Archive, and a Python implementation is available in gensim.

Glove (Pennington et al. 2014) is based on the hypothesis that related words often appear in the same documents, and it looks at the ratio of the co-occurrence probabilities of two words rather than at the co-occurrence probabilities themselves. That is, the Glove algorithm involves collecting word co-occurrence statistics in the form of a word co-occurrence matrix X, whose element \(X_{ij}\) represents how often word i appears in the context of word j. It then defines a weighted cost function to yield the final word vectors for all the words in the vocabulary. The source code for the model and pre-trained word vectors are publicly available.
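As a rough illustration of the weighted cost function, the sketch below evaluates a weighted least-squares objective of the form used by Glove: each nonzero count contributes a squared error between a dot product plus biases and the log count. The weighting function with `x_max = 100` and `alpha = 0.75` follows the commonly cited defaults; treat these values, the variable names, and the sparse-dictionary layout as assumptions for illustration rather than the reference implementation.

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare co-occurrences and cap the weight of frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_cost(X, W, W_ctx, b, b_ctx):
    """Weighted least-squares cost over the nonzero co-occurrence counts.

    X      : dict mapping (i, j) -> co-occurrence count X_ij > 0
    W      : word vectors, W[i] is a list of floats
    W_ctx  : context vectors, W_ctx[j] is a list of floats
    b, b_ctx : word and context biases
    """
    total = 0.0
    for (i, j), x_ij in X.items():
        dot = sum(wi * wj for wi, wj in zip(W[i], W_ctx[j]))
        err = dot + b[i] + b_ctx[j] - math.log(x_ij)
        total += glove_weight(x_ij) * err * err
    return total
```

Training would minimize this cost with respect to the vectors and biases; only the evaluation of the cost is shown here.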

Word embeddings are widely adopted in a variety of NLP tasks. In Kim (), the pre-trained word2vec is directly employed for sentence-level classifications. In Hu et al. (, ), the pre-trained word2vec is tested in predicting the quality of online health expert question-answering services. It is noted that the choice of word vector dimensionality is mostly task-dependent. For example, a smaller dimensionality works better for more syntactic tasks such as named entity recognition (Melamud et al. ) or part-of-speech (POS) tagging (Plank et al. ), while a larger dimensionality is more effective for more semantic tasks such as sentiment analysis (Ruder et al. ).

4.3 Learning Models

A long-running challenge of NLP models is to capture dependencies, especially the long-distance dependencies, of sentences. A natural idea is to apply the powerful sequence data learning models, i.e., the recurrent neural networks (RNNs) (Elman ), in language models. Hence, in the following, we will introduce RNNs and, more specifically, the famous long short-term memory (LSTM) network (Hochreiter and Schmidhuber ) and the recently proposed Gated Recurrent Unit (GRU) (Chung et al. ). Moreover, we will briefly describe convolutional neural networks (CNNs) in NLP, which can be efficiently trained.

4.3.1 Recurrent Neural Networks (RNNs)

RNNs are powerful tools for language models, since they have the ability to capture long-distance dependencies in sequence data. The idea to model long-distance dependencies is quite straightforward, that is, to simply use the previous hidden state \({\mathbf{h}}_{t-1}\) as input when calculating the current hidden state \({\mathbf{h}}_t\). See Fig. for an illustration, where the recursive node can be unfolded into a sequence of nodes.

Mathematically, an RNN can be defined by the following equation:

$$\displaystyle \begin{aligned} {\mathbf{h}}_t = \left \{ \begin{array}{l@{\quad }l} \tanh \left({\mathbf{W}}_{xh} {\mathbf{x}}_t + {\mathbf{W}}_{hh} {\mathbf{h}}_{t-1} + {\mathbf{b}}_h \right) & t \geq 1, \\ \boldsymbol{0} & \mathrm{otherwise.} \end{array} \right. \end{aligned} $$

where \({\mathbf{x}}_t\) is the t-th sequence input, \({\mathbf{W}}_{xh}\) and \({\mathbf{W}}_{hh}\) are the weight matrices, and \({\mathbf{b}}_h\) is the bias vector. At the t-th (t ≥ 1) time step, the only difference between an RNN and a standard neural network lies in the additional connection \({\mathbf{W}}_{hh}{\mathbf{h}}_{t-1}\) from the hidden state at time step t − 1 to that at time step t.
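The recurrence above can be sketched in a few lines of plain Python. The helper `matvec` and the list-of-lists matrix layout are assumptions made to keep the example dependency-free; a practical implementation would use a numerical library.

```python
import math


def matvec(M, v):
    """Multiply a list-of-lists matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_forward(xs, W_xh, W_hh, b_h):
    """Unroll h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) over a sequence, with h_0 = 0."""
    h = [0.0] * len(b_h)
    states = []
    for x in xs:
        from_input = matvec(W_xh, x)
        from_state = matvec(W_hh, h)  # the extra connection from step t-1
        h = [math.tanh(a + r + b) for a, r, b in zip(from_input, from_state, b_h)]
        states.append(h)
    return states
```

Because each `h` is computed from the previous one, the loop cannot be parallelized across time steps, which is the limitation that motivates the convolutional alternative discussed later.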

Though RNNs are simple and easy to compute, they encounter the vanishing gradient problem, which results in little change in the weights and thus no training, or the exploding gradient problem, which results in large changes in the weights and thus unstable training. These problems typically arise in the back-propagation algorithm for updating the weights of the networks (Pascanu et al. ). In Pascanu et al. (), a gradient norm clipping strategy is proposed to deal with exploding gradients and a soft constraint is proposed for the vanishing gradient problem. The proposed method does not utilize the information as a whole.
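Gradient norm clipping is simple to state in code: when the norm of the gradient exceeds a threshold, the gradient is rescaled so that its norm equals the threshold while its direction is preserved. The sketch below assumes a flat list of gradient values and a hypothetical `threshold` argument.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only intervenes in the exploding case.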

RNNs are very effective for sequence processing, especially for short-term dependencies, i.e., neighboring contexts. However, if the sequence is long, the long-term information is lost. One successful and popular remedy is to modify the RNN architecture, producing the long short-term memory (LSTM) (Hochreiter and Schmidhuber ) network. The key innovation of the LSTM is to introduce the memory cell \({\mathbf{c}}_t\) and gates that control the signal flows in the architecture. See the illustrated architecture in Fig. a and the corresponding formulas as follows:

$$\displaystyle \begin{aligned} {\mathbf{f}}_t & = \sigma \left({\mathbf{W}}_{xf} {\mathbf{x}}_t + {\mathbf{W}}_{hf}{\mathbf{h}}_{t-1} + {\mathbf{b}}_f \right) {} \end{aligned} $$

$$\displaystyle \begin{aligned} {\mathbf{i}}_t & = \sigma \left({\mathbf{W}}_{xi} {\mathbf{x}}_t + {\mathbf{W}}_{hi}{\mathbf{h}}_{t-1} + {\mathbf{b}}_i \right) {} \end{aligned} $$

$$\displaystyle \begin{aligned} {\mathbf{o}}_t & = \sigma \left({\mathbf{W}}_{xo} {\mathbf{x}}_t + {\mathbf{W}}_{ho}{\mathbf{h}}_{t-1} + {\mathbf{b}}_o \right) {} \end{aligned} $$

$$\displaystyle \begin{aligned} \tilde{\mathbf{c}}_t & = \tanh \left({\mathbf{W}}_{xc} {\mathbf{x}}_t + {\mathbf{W}}_{hc}{\mathbf{h}}_{t-1} + {\mathbf{b}}_c \right) \end{aligned} $$

$$\displaystyle \begin{aligned} {\mathbf{c}}_t & = {\mathbf{f}}_t \odot {\mathbf{c}}_{t-1} + {\mathbf{i}}_t \odot \tilde{\mathbf{c}}_t \end{aligned} $$

$$\displaystyle \begin{aligned} {\mathbf{h}}_t & = {\mathbf{o}}_t \odot \tanh({\mathbf{c}}_t). {} \end{aligned} $$

Here \({\mathbf{f}}_t\), \({\mathbf{i}}_t\), and \({\mathbf{o}}_t\) are the forget gate, the input gate, and the output gate, respectively. σ is the logistic function, whose output lies in the range [0, 1], W and b are the weight matrices and bias vectors, respectively, and ⊙ is the element-wise multiplication operator. The function of these gates, as their names indicate, is either to allow all signal information to pass through (the gate output equals 1) or to block it from passing (the gate output equals 0).
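A single LSTM step following the six formulas above can be sketched as follows. The parameter dictionary `p` with keys such as `"W_xf"` and `"b_f"` mirrors the subscripts in the equations; this layout and the helper names are assumptions for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W_x, x, W_h, h, b):
    """Compute W_x x + W_h h + b for list-of-lists matrices."""
    return [
        sum(w * xi for w, xi in zip(rx, x)) + sum(w * hi for w, hi in zip(rh, h)) + bi
        for rx, rh, bi in zip(W_x, W_h, b)
    ]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and memory cell."""
    f = [sigmoid(a) for a in affine(p["W_xf"], x, p["W_hf"], h_prev, p["b_f"])]  # forget gate
    i = [sigmoid(a) for a in affine(p["W_xi"], x, p["W_hi"], h_prev, p["b_i"])]  # input gate
    o = [sigmoid(a) for a in affine(p["W_xo"], x, p["W_ho"], h_prev, p["b_o"])]  # output gate
    c_tilde = [math.tanh(a) for a in affine(p["W_xc"], x, p["W_hc"], h_prev, p["b_c"])]
    # Keep part of the old cell (f) and write part of the candidate (i).
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every gate outputs 0.5, so the cell simply halves its previous content, which is a convenient sanity check.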

In addition to the standard LSTM model described above, a few LSTM variants have been proposed and proven to be effective. Among them, the Gated Recurrent Unit (GRU) (Chung et al. ) network is one of the most popular ones. GRU is simpler than a standard LSTM as it combines the input gate and the forget gate into a single update gate. See the illustrated architecture in Fig. b and the corresponding formulas as follows:

$$\displaystyle \begin{aligned} {\mathbf{r}}_t & = \sigma \left({\mathbf{W}}_{xr} {\mathbf{x}}_t + {\mathbf{W}}_{hr}{\mathbf{h}}_{t-1} + {\mathbf{b}}_r \right) \end{aligned} $$

$$\displaystyle \begin{aligned} {\mathbf{z}}_t & = \sigma \left({\mathbf{W}}_{xz} {\mathbf{x}}_t + {\mathbf{W}}_{hz}{\mathbf{h}}_{t-1} + {\mathbf{b}}_z \right) \end{aligned} $$

$$\displaystyle \begin{aligned} \tilde{\mathbf{h}}_t & = \tanh \left({\mathbf{W}}_{xh} {\mathbf{x}}_t + {\mathbf{W}}_{hh} ({\mathbf{r}}_t \odot {\mathbf{h}}_{t-1}) + {\mathbf{b}}_h \right) \end{aligned} $$

$$\displaystyle \begin{aligned} {\mathbf{h}}_t & = (1 - {\mathbf{z}}_t)\odot {\mathbf{h}}_{t-1} + {\mathbf{z}}_t\odot \tilde{\mathbf{h}}_t. {} \end{aligned} $$

Compared to the LSTM, the GRU has slightly fewer parameters and also does not have a separate “cell” to store intermediate information. Due to its simplicity, the GRU has been extensively used in many sequence learning tasks to conserve memory or computation time. Besides the GRU, there are a few other variants whose architectures are similar to, but slightly different from, that of the LSTM. More details can be found in Gers and Schmidhuber (), Koutník et al. (), Graves et al. (), and Józefowicz et al. ().
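For comparison, a single GRU step following the four GRU formulas can be sketched in the same style. As before, the parameter dictionary `p` and the helper names are illustrative assumptions.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W_x, x, W_h, h, b):
    """Compute W_x x + W_h h + b for list-of-lists matrices."""
    return [
        sum(w * xi for w, xi in zip(rx, x)) + sum(w * hi for w, hi in zip(rh, h)) + bi
        for rx, rh, bi in zip(W_x, W_h, b)
    ]


def gru_step(x, h_prev, p):
    """One GRU time step; there is no separate memory cell, only h."""
    r = [sigmoid(a) for a in affine(p["W_xr"], x, p["W_hr"], h_prev, p["b_r"])]  # reset gate
    z = [sigmoid(a) for a in affine(p["W_xz"], x, p["W_hz"], h_prev, p["b_z"])]  # update gate
    gated = [rt * hp for rt, hp in zip(r, h_prev)]
    h_tilde = [math.tanh(a) for a in affine(p["W_xh"], x, p["W_hh"], gated, p["b_h"])]
    # The update gate interpolates between the old state and the candidate.
    return [(1.0 - zt) * hp + zt * ht for zt, hp, ht in zip(z, h_prev, h_tilde)]
```

The GRU needs three weight groups (r, z, h) where the LSTM needs four (f, i, o, c), which accounts for its smaller parameter count.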

4.3.2 Convolutional Neural Networks (CNNs)

While RNNs are the ideal choice for many NLP tasks, they have an inherent limitation. Most RNNs rely on bi-directional encoders to build representations of both past and future contexts (Bahdanau et al. ; Zhou et al. ), and they can only process one word at a time. This makes it less natural to exploit the parallelism of GPU computation during training and to build hierarchical representations over the input sequence (Gehring et al. ). To tackle these challenges, researchers have proposed a convolutional architecture for neural machine translation (Gehring et al. ). The work borrows the idea of CNNs, which utilize layers with convolving filters to extract local features and have been successfully applied in image processing (LeCun et al. ). In the convolutional architecture, the input elements \(\mathbf{x} = (x_1, x_2, \dots, x_m)\) are embedded in a distributional space as \(\mathbf{w} = (w_1, w_2, \dots, w_m)\), where \(w_j\in \mathbb {R}^f\). The final input element representation is computed by \(\mathbf{e} = (w_1 + p_1, w_2 + p_2, \dots, w_m + p_m)\), where \(\mathbf{p} = (p_1, p_2, \dots, p_m)\) is the embedded representation of the absolute position of input elements with \(p_j\in \mathbb {R}^f\). A convolutional block structure is applied to the input elements to output the decoder network \(\mathbf{g} = (g_1, g_2, \dots, g_n)\). The proposed architecture is reported to outperform the previous best result by 1.9 BLEU on WMT’16 English-Romanian translation (Zhou et al. ).
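The input representation and the locality of a convolution can be illustrated with a small sketch. It adds position embeddings to word embeddings and then applies a plain one-dimensional convolution with a scalar kernel shared across dimensions. This is a deliberate simplification for illustration: the convolutional blocks in the cited architecture are more elaborate, and the function names and zero-padding choice here are assumptions.

```python
def add_positions(word_vecs, pos_vecs):
    """Compute e_j = w_j + p_j for every input position j."""
    return [[w + p for w, p in zip(wj, pj)] for wj, pj in zip(word_vecs, pos_vecs)]


def conv1d(e, kernel):
    """Slide a scalar kernel of odd width over the sequence (zero padding).

    Each output position depends only on a fixed window of inputs, so all
    positions could be computed independently and in parallel.
    """
    half = len(kernel) // 2
    dim = len(e[0])
    out = []
    for j in range(len(e)):
        acc = [0.0] * dim
        for d, weight in enumerate(kernel):
            src = j + d - half
            if 0 <= src < len(e):
                for t in range(dim):
                    acc[t] += weight * e[src][t]
        out.append(acc)
    return out
```

Unlike the recurrent loop, nothing in `conv1d` carries state from one position to the next; stacking several such layers widens the effective context window.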

CNNs can not only compute all words simultaneously by taking advantage of GPU parallelism, which yields much faster training than RNNs, but also show better performance than the LSTM models (Zhou et al. ). Other NLP tasks, such as sentence-level sentiment analysis (Kim ; dos Santos and Gatti ), character-level machine translation (Costa-Jussà and Fonollosa ), and simple question answering (Yin et al. ), also demonstrate the effectiveness of CNNs.

4.4 Applications

In the following, we present the development of five key NLP applications: part-of-speech (POS) tagging and named entity recognition (NER) are two fundamental NLP applications, which can enrich the analysis of other NLP applications (Collobert et al. ; Florian et al. ; Toutanova et al. ); machine translation and automatic English grammatical error correction are two applications containing direct commercial value (Bahdanau et al. ; Bhirud et al. ; Hoang et al. ; Manchanda et al. ; Ng et al. ; Sutskever et al. ); and image description, an attractive and significant application requiring the techniques of both computer vision and NLP (Bernardi et al. ; Hodosh et al. ; Karpathy and Fei-Fei ).

4.4.1 Part-of-Speech (POS) Tagging

Part-of-speech (POS) tagging (Collobert et al. ) aims at labeling (associating) each word with a unique tag that indicates its syntactic role, e.g., plural noun, adverb, etc. The POS tags are usually utilized as common input features for various NLP tasks, e.g., information retrieval, machine translation (Ueffing and Ney ), grammar checking (Ng et al. ), etc.

Nowadays, the most commonly used POS tag set is that of the Penn Treebank Project, which defines 48 different tags (Marcus et al. ). These tags are commonly used in various NLP libraries, such as NLTK in Python, the Stanford tagger, and Apache OpenNLP.

The existing algorithms for tagging can be generally categorized into two groups, the rule-based group and the stochastic group. The rule-based methods, such as Eric Brill's tagger (Brill ) and the disambiguation rules in LanguageTool, are usually hand-crafted, derived from corpora, or developed collaboratively (e.g., for LanguageTool). The rule-based methods can achieve a fairly low error rate (Brill ), but they are generally still less sophisticated than stochastic taggers. In contrast, stochastic taggers, such as the Hidden Markov Model (HMM) (Brants ) and the Maximum Entropy Markov Model (MEMM) (McCallum et al. ), model the sequence of POS tags as hidden states, which can be learned from the observed word sequences of sentences. The HMM (Brants ) models the joint probability of words and tags, while the MEMM (McCallum et al. ) models the conditional probability of tags given the words, to output the corresponding tags.
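To illustrate how an HMM tagger turns its probabilities into tags, the sketch below uses Viterbi decoding, a standard dynamic-programming method for finding the most probable hidden-state sequence. The text above does not prescribe a decoding algorithm, so this choice, the function name, and the toy probabilities in the usage example are all assumptions.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under an HMM.

    start_p[tag]       : probability that a sentence starts with tag
    trans_p[prev][tag] : probability of tag following prev
    emit_p[tag][word]  : probability of tag emitting word (0 if unseen)
    """
    # best[t][tag] = (probability of the best path ending in tag at step t, previous tag)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None) for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            column[tag] = max(
                (best[-1][prev][0] * trans_p[prev][tag] * emit_p[tag].get(word, 0.0), prev)
                for prev in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda tag: best[-1][tag][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


if __name__ == "__main__":
    tags = ["N", "V"]
    start_p = {"N": 0.8, "V": 0.2}
    trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
    emit_p = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.7}}
    print(viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p))
```

In a real tagger the three probability tables would be estimated from a tagged corpus, and log probabilities would be used to avoid underflow on long sentences.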

Later, more advanced methods were proposed to improve both HMM and MEMM. These methods include utilizing a bidirectional cyclic dependency network tagger (Manning ) and using other linguistic features (Jurafsky and Martin ). More than 96% accuracy was reported by both HMM (Brants ) and MEMM (Manning ). More state-of-the-art results can be found online.

4.4.2 Named Entity Recognition (NER)

Named entity recognition (NER) is a classic NLP task that seeks to locate and classify named entities, such as person names, organizations, locations, numbers, and dates, in text corpora. Most existing NER taggers are built on linear statistical models, such as Hidden Markov Models (McCallum et al. ) and Conditional Random Fields (Lafferty et al. ). Traditional NER techniques rely heavily on hand-crafted features for the taggers and apply only to small corpora (Chieu and Ng ).

4.4.3 Neural Machine Translation

The objective of machine translation (MT) is to translate text or speech from one language to another. Conventional MT utilizes statistical models whose parameters are inferred from bilingual text corpora. Recently, a major development in MT has been the adoption of sequence-to-sequence learning models, leading to the state-of-the-art technique called neural machine translation (NMT) (Wu et al. ; Gehring et al. ; Vaswani et al. ). NMT has achieved great success owing to the rapid development of deep learning technologies; its architecture comprises an encoder-decoder model (Sutskever et al. ) and an attention mechanism (Bahdanau et al. ).

An encoder RNN provides a representation of the source sentence by taking as input a sequence of source words \(\mathbf {x}=\left ( x_1, \dots , x_m \right )\) and producing a sequence of hidden states \(\mathbf {h}=\left ( h_1, \dots , h_m \right )\). According to Sutskever et al. (), a bidirectional RNN is usually favored to reduce long sentence dependencies, and the final state \(\mathbf{h}\) is the concatenation of the states produced by the forward and backward RNNs, \(\mathbf {h} = \left [ \overrightarrow {\mathbf {h}}; \overleftarrow {\mathbf {h}} \right ]\). The decoder is also a recurrent neural network, which predicts the probability of a target word \(y_k\) of a sentence \(\mathbf{y}\), based on the hidden state \(\mathbf{h}\), the previous words \({\mathbf {y}}_{<k} = \left ( y_1, \dots , y_{k-1} \right )\), the recurrent hidden state \({\mathbf{s}}_k\) in the decoder RNN, and the context vector \({\mathbf{c}}_k\). The context vector \({\mathbf{c}}_k\) is also called the attention vector, which is computed as a weighted sum of the source hidden states: \({\mathbf{c}}_k = \sum _{j=1}^{m} \alpha _{kj} h_j\), where m is the length of the source sentence and \(\alpha_{kj}\) is the attention weight. The attention weight can be calculated in the fashion of concatenation of the bi-directional encoder states (Bahdanau et al. ) or by a simpler version with a location-based function on the target hidden state (Luong et al. ). Finally, the decoder outputs a distribution over a fixed-size vocabulary through a softmax:

$$\displaystyle \begin{aligned} P(y_k | {\mathbf{y}}_{<k}, \mathbf{x}) = \mathrm{softmax} \left( g(y_{k-1}, {\mathbf{c}}_k, {\mathbf{s}}_k ) \right) \end{aligned} $$

where g is a non-linear function. The encoder-decoder and attention-driven model is trained end-to-end by minimizing the negative log-likelihood of the target words using stochastic gradient descent (SGD).
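The computation of the context vector can be sketched as follows. For simplicity the attention scores here are dot products between the decoder state and each encoder state; this scoring function is an assumption chosen for brevity and is not the additive or location-based scoring cited above. The weights are normalized with a softmax and then used to average the encoder states.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(s_k, H):
    """Return attention weights and the context vector c_k = sum_j alpha_kj * h_j.

    s_k : decoder hidden state at target position k
    H   : list of encoder hidden states h_1..h_m (same dimension as s_k)
    """
    scores = [sum(a * b for a, b in zip(s_k, h_j)) for h_j in H]
    alpha = softmax(scores)
    dim = len(H[0])
    c_k = [sum(alpha[j] * H[j][d] for j in range(len(H))) for d in range(dim)]
    return alpha, c_k
```

The weights sum to one, so the context vector always lies within the span of the encoder states, leaning toward the source positions most relevant to the current target word.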

Next, we summarize some aspects in advancing NMT. The first issue is to restrict the size of the vocabulary. Though NMT is an open vocabulary problem, the number of target words of NMT must be limited, because the complexity of training an NMT model increases as the number of target words increases. In practice, the target vocabulary size K is often in the range of 30k (Bahdanau et al. ) to 80k (Sutskever et al. ). Any word out of the vocabulary is represented as an unknown word, denoted by unk. The traditional NMT model works well if there are few unknown words in the target sentences, but it has been observed that the performance of translation degrades dramatically if there are too many unknown words (Jean et al. ). An intuitive solution to address this problem is to use a larger vocabulary, while simultaneously reducing the computational complexity using sampling approximations (Jean et al. ; Mi et al. ; Ji et al. ). Other researchers reported that the unknown word problem can alternatively be addressed without expanding the vocabulary. For example, one can replace the unknown word with the special token unk, and then post-process the target sentence by copying the corresponding word from the source sentence or applying word translation to the unknown word (Luong et al. ). Instead of implementing word-based neural machine translation, other researchers proposed using character-based NMT to eliminate unknown words (Costa-Jussà and Fonollosa ; Chung et al. ), or using a hybrid method, i.e., a combination of word-level and character-level NMT models (Luong and Manning ). The use of subword units also shows significant effectiveness in reducing the vocabulary size (Sennrich et al. ). The algorithm, called byte pair encoding (BPE), starts with a vocabulary of characters and iteratively replaces the most frequent pair of n-grams with a new, merged n-gram. To summarize, the word-level, BPE-level and character-level vocabularies form the fundamental treatments in neural machine translation practice.
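The merge procedure of byte pair encoding can be sketched as below. Words start as sequences of characters, and on each iteration the most frequent adjacent pair of symbols is merged into one new symbol. This is a minimal sketch: the function names are hypothetical, the end-of-word marker commonly used in practice is omitted, and ties are broken lexicographically only to make the result deterministic.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of pair in every word by the merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a word-frequency dictionary."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab
```

The number of merges directly controls the vocabulary size: zero merges gives a character-level vocabulary, while a very large number approaches a word-level one.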

Another issue is the implementation of neural machine translation. To deploy neural machine translation systems, one needs to build the encoder-decoder model (with attention mechanism) and to train the end-to-end model on GPUs. Nowadays, there are many toolkits publicly available for research, development and deployment.

4.4.4 Automatic English Grammatical Error Correction

Since English is not the first language of many people in the world, grammar checkers have been developed to facilitate writing. Commercial products and freeware, such as Microsoft Word, Grammarly, LanguageTool, Apache Wave, and Ginger, provide grammar checking services. However, due to the various exceptions and rules in natural languages, these grammar checkers still fall far short of human English teachers.

To boost the development of grammatical error checking and correction, various shared tasks and focused sessions were launched to attract researchers’ interest and contributions. The tasks include the Helping Our Own (HOO) Shared Task in 2011 (Dale and Kilgarriff ), the CoNLL Shared Tasks in 2013 (Ng et al. ) and 2014 (Ng et al. ), and the AESW Shared Task in 2016 (Daudaravicius et al. ). Each of the shared tasks provided the original text corpus and the corresponding text corrected by human editors. The dataset of the CoNLL Shared Tasks 2013 and 2014 is a collection of 1,414 marked student essays from the National University of Singapore, where all the students are non-native English speakers. The detected grammatical errors are classified into 28 types. Meanwhile, the datasets of the HOO and the AESW shared tasks are extracted from published papers and proceedings of conferences. The HOO dataset is a collection of text fragments from 19 published papers, while the AESW one is a collection of shuffled sentences generated from 9,919 published papers (mainly from physics and mathematics).

4.5 Datasets for Natural Language Processing

Many datasets have been published in different research domains for natural language processing. We try to provide the basic ones mentioned in previous sections.

4.6 Conclusions and Discussions

In this survey, we have provided a succinct review of the recent development of NLP, including word representation, learning models, and key applications. Nowadays, word2vec and Glove are the two main successful methods to learn word representations in the semantic space. RNNs and CNNs are the two mainstream families of learning models for training NLP models. After exploring the five key applications, we envision the following interesting research topics. First, it is effective to include additional features or results (e.g., POS tagging and NER) to improve the performance of other applications, such as machine translation and automatic grammar correction. Second, it is worth investigating end-to-end models, which may further improve model performance. For example, nowadays, the embedded word representation is learned independently of the applications. One may explore new representations that fit the downstream applications, e.g., sentiment analysis and text matching. Third, it is promising to explore the advancement of multidisciplinary approaches. For example, in the image description application, one needs technologies from both computer vision and natural language processing. It is important to understand both areas in order to make breakthroughs.

