Named Entity Recognition: A Brief Overview


Named entity recognition (NER) is the task of identifying mentions of rigid designators in text that belong to predefined semantic types such as person, location, and organization. NER often serves as the foundation for many natural language applications, such as question answering, text summarization, and machine translation.

Despite the various definitions of a named entity (NE), researchers have reached a consensus on the types of NEs to recognize. NEs are generally divided into two categories:

  • Generic NEs, e.g., person and location

  • Domain-specific NEs, e.g., proteins, enzymes, and genes

There are 4 mainstream approaches used in NER:

  1. Rule-Based Approaches: Do not need annotated data, as they rely on hand-crafted rules

  2. Unsupervised Learning Approaches: Rely on unsupervised algorithms without hand-labelled training examples

  3. Feature-based Supervised Learning: Rely on supervised algorithms with a lot of feature engineering involved

  4. Deep Learning Approaches: Automatically discover representations from raw input
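
As a rough illustration of the first approach in the list, the sketch below tags entities with two hand-crafted rules: a tiny gazetteer lookup and a regular expression for an honorific followed by a capitalized word. The gazetteer entries, the pattern, and the function name `rule_based_ner` are illustrative assumptions, not rules taken from any particular system.

```python
import re

# Hypothetical hand-crafted resources: a tiny gazetteer and one pattern.
GAZETTEER = {"Paris": "LOCATION", "Google": "ORGANIZATION"}
PERSON_PATTERN = re.compile(r"\b(?:Mr|Ms|Dr)\.\s+([A-Z][a-z]+)")


def rule_based_ner(text):
    """Return (surface form, type) pairs found by the rules above."""
    entities = []
    # Rule 1: a capitalized word after an honorific is a person name.
    for match in PERSON_PATTERN.finditer(text):
        entities.append((match.group(1), "PERSON"))
    # Rule 2: any word listed in the gazetteer gets the listed type.
    for word in re.findall(r"[A-Za-z]+", text):
        if word in GAZETTEER:
            entities.append((word, GAZETTEER[word]))
    return entities


print(rule_based_ner("Dr. Smith joined Google in Paris."))
```

No annotated data is involved, but every new entity type or domain needs its own rules, which is the usual cost of this approach.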

Formal Definition

A named entity (NE) is a word or a phrase that clearly identifies one item from a set of other items that have similar attributes. Examples include organization, person, and location names. NER is the process of locating named entities in text and classifying them into predefined entity categories.

Formally, given a sequence of tokens $s = \{w_1, w_2, \ldots, w_N\}$, NER outputs a list of tuples $\{I_s, I_e, t\}$, each of which is a named entity mentioned in $s$. Here, $I_s \in [1, N]$ and $I_e \in [1, N]$ are the start and end indexes of a named entity mention, and $t$ is the entity type from a predefined category set.
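
To make the output format concrete, the following sketch converts per-token tags into (start, end, type) tuples with 1-based, inclusive indices. The BIO tagging scheme and the helper name `bio_to_spans` are assumptions for illustration; the definition above does not prescribe how the tuples are produced.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (start, end, type) tuples, 1-based inclusive."""
    spans = []
    start, current_type = None, None
    for i, tag in enumerate(tags, start=1):
        # An entity continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues:
            spans.append((start, i - 1, current_type))
            start, current_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and current_type is None):
            start, current_type = i, tag[2:]
    if current_type is not None:
        spans.append((start, len(tags), current_type))
    return spans


tokens = ["Michael", "Jordan", "was", "born", "in", "Brooklyn"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(bio_to_spans(tags))  # [(1, 2, 'PER'), (6, 6, 'LOC')]
```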


NER systems are usually evaluated by comparing their outputs against human annotations. The comparison can be quantified by either exact-match or relaxed-match evaluation.

Exact-Match Evaluation

NER essentially involves two subtasks: boundary detection and type identification. In exact-match evaluation, an instance counts as correctly recognized only if the system identifies both its boundary and its type correctly.
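
Exact-match scoring can be sketched as a set comparison over (start, end, type) tuples. Reporting precision, recall, and F1 is a common convention that is assumed here; the function name and the toy annotations are illustrative.

```python
def exact_match_scores(gold, predicted):
    """Precision, recall and F1 where both boundary and type must match."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(1, 2, "PER"), (6, 6, "LOC")]
predicted = [(1, 2, "PER"), (6, 6, "ORG")]  # right boundary, wrong type
print(exact_match_scores(gold, predicted))  # (0.5, 0.5, 0.5)
```

The second prediction gets no credit at all, even though its boundary is right.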

Relaxed-Match Evaluation

In relaxed-match evaluation, a correct type is credited if an entity is assigned its correct type, regardless of its boundaries, as long as it overlaps with the ground-truth boundaries; a correct boundary is credited regardless of the type assigned to the entity.
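
The two kinds of partial credit can be sketched as follows, again over (start, end, type) tuples with inclusive indices. The helper names and the choice to return raw counts are assumptions for illustration, not a reference scorer.

```python
def overlaps(a, b):
    """True if two inclusive (start, end, ...) spans share a token."""
    return a[0] <= b[1] and b[0] <= a[1]


def relaxed_credits(gold, predicted):
    """Count predictions credited for type and for boundary separately."""
    type_credit = sum(
        any(p[2] == g[2] and overlaps(p, g) for g in gold) for p in predicted
    )
    boundary_credit = sum(
        any(p[:2] == g[:2] for g in gold) for p in predicted
    )
    return type_credit, boundary_credit


gold = [(1, 2, "PER"), (6, 6, "LOC")]
predicted = [(2, 2, "PER"), (6, 6, "ORG")]
print(relaxed_credits(gold, predicted))  # (1, 1)
```

Here the first prediction earns type credit (right type, overlapping span) and the second earns boundary credit (right span, wrong type).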


Deep Learning Techniques for NER

There are three core strengths of applying deep learning techniques to NER.

  1. NER benefits from non-linear transformations, which generate non-linear mappings from input to output. Compared with linear models (log-linear HMM, linear-chain CRF), DL models are able to learn complex and intricate features from data.

  2. DL saves a significant amount of effort on designing NER features. Traditional feature-based models require a considerable amount of engineering skill and domain expertise.

  3. Deep NER models can be trained in an end-to-end paradigm, which makes it possible to build complex NER systems.

General Deep Learning Architecture for NER

  • Distributed representations for input combine word- and character-level embeddings, optionally with additional features.

  • The context encoder captures context dependencies using a CNN, an RNN, or another network.

  • The tag decoder predicts tags for the tokens in the input sentence.

Distributed Representations for Input

A distributed representation represents words as low-dimensional, real-valued dense vectors in which each dimension represents a latent feature. Learned automatically from text, distributed representations capture semantic and syntactic properties of words.

Word-Level Representation

Some studies employ word-level representations, which are typically pre-trained over large collections of text with unsupervised algorithms such as continuous bag-of-words (CBOW) and continuous skip-gram models. Recent studies have shown the importance of such pre-trained word embeddings. When used as input, the pre-trained word embeddings can either be kept fixed or be fine-tuned further during NER model training. Commonly used word embeddings include Google Word2Vec, Stanford GloVe, Facebook fastText, and SENNA.
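
The difference between fixed and fine-tuned embeddings can be sketched with a plain lookup table. The 3-dimensional vectors, the `<unk>` fallback entry, and the function names below are made up for illustration; a real system would load a table such as Word2Vec or GloVe and let the training framework compute the gradients.

```python
# Toy stand-in for a pre-trained embedding table.
PRETRAINED = {
    "paris": [0.2, -0.1, 0.7],
    "london": [0.3, -0.2, 0.6],
    "<unk>": [0.0, 0.0, 0.0],
}


def embed(tokens, table):
    """Look up one vector per token, falling back to <unk>."""
    return [list(table.get(tok.lower(), table["<unk>"])) for tok in tokens]


def fine_tune_step(table, token, gradient, learning_rate=0.1):
    """One gradient-descent update of a single embedding (fine-tuning)."""
    vector = table[token]
    table[token] = [v - learning_rate * g for v, g in zip(vector, gradient)]


vectors = embed(["Paris", "Gotham"], PRETRAINED)
print(vectors)  # the second token is out of vocabulary, so it maps to <unk>
```

Keeping the embeddings fixed simply means never calling `fine_tune_step` on the table during training.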

Character-Level Representation

Instead of considering only word-level representations as the basic input, several studies incorporate character-based word representations learned by an end-to-end neural model. Character-level representations have been found useful for exploiting explicit sub-word-level information such as prefixes and suffixes. Another advantage of character-level representations is that they naturally handle out-of-vocabulary tokens.
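
The out-of-vocabulary point can be illustrated with a deliberately simplified sketch: each character has a vector, and a word vector is obtained by max-pooling over its characters. This is an assumption made for brevity. Published models typically run a CNN or RNN over the characters and learn the character vectors end to end, whereas the vectors here are random and untrained.

```python
import random

DIM = 4
_rng = random.Random(0)
_char_vectors = {}  # filled lazily; a trained model would learn these


def char_vector(ch):
    """Return the (here randomly initialised) embedding of one character."""
    if ch not in _char_vectors:
        _char_vectors[ch] = [_rng.uniform(-1, 1) for _ in range(DIM)]
    return _char_vectors[ch]


def char_level_representation(word):
    """Max-pool character embeddings into one fixed-size word vector."""
    rows = [char_vector(ch) for ch in word]
    return [max(column) for column in zip(*rows)]


# Any string gets a vector, including words never seen in training.
print(len(char_level_representation("Zyxwvutia")))  # 4
```

Plain max-pooling ignores character order, which is one reason real models use convolutional or recurrent layers instead.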

Some studies (such as CharNER) report that taking characters as the primary representation is superior to taking words as the basic input unit.

Hybrid Representation

Besides word-level and character-level representations, some studies also incorporate additional information (e.g., gazetteers, lexical similarity, linguistic dependency, and visual features) into the final representations of words before feeding them into the context encoding layers. In other words, the DL-based representation is combined with a feature-based approach in a hybrid manner. Adding such information may improve NER performance, at the price of hurting the generality of these systems.
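
A minimal sketch of such a hybrid input, assuming the simplest possible combination: the learned vectors are concatenated with a single hand-crafted gazetteer flag. The gazetteer contents, the vector values, and the function name are hypothetical.

```python
GAZETTEER = {"paris", "london"}  # hypothetical list of known locations


def hybrid_representation(token, word_vector, char_vector):
    """Concatenate learned vectors with a hand-crafted gazetteer flag."""
    in_gazetteer = 1.0 if token.lower() in GAZETTEER else 0.0
    return list(word_vector) + list(char_vector) + [in_gazetteer]


features = hybrid_representation("Paris", [0.2, -0.1], [0.5, 0.4, 0.9])
print(features)  # [0.2, -0.1, 0.5, 0.4, 0.9, 1.0]
```

The gazetteer flag is only as good as the list behind it, which is where the loss of generality comes from.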

Context Encoders

Convolutional Neural Networks

Some studies propose a sentence approach network in which a word is tagged with the whole sentence taken into consideration. After the input representation stage, each word in the input sequence is embedded as an N-dimensional vector. A convolutional layer then produces local features around each word, so the size of the convolutional output depends on the number of words in the sentence. A global feature vector of fixed size is constructed by combining the local feature vectors, for example with a max or averaging operation over the positions in the sentence. Finally, these fixed-size global features are fed into a tag decoder to compute a distribution of scores over all possible tags for the words in the network input.
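
The step from variable-length local features to a fixed-size global vector can be sketched in plain Python. Max-pooling over positions is assumed as the combining operation, and the window size, filters, and toy word vectors are illustrative values rather than trained parameters.

```python
def conv1d(sentence, filters, window=3):
    """Slide each filter over the sentence; one local feature vector per word."""
    dim = len(sentence[0])
    pad = [[0.0] * dim] * (window // 2)
    padded = pad + sentence + pad
    local = []
    for i in range(len(sentence)):
        # Flatten the window of word vectors centred on word i.
        patch = [x for vec in padded[i:i + window] for x in vec]
        local.append([sum(w * x for w, x in zip(f, patch)) for f in filters])
    return local


def max_over_time(local):
    """Combine local features into a fixed-size global vector."""
    return [max(column) for column in zip(*local)]


sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 words, dim 2
filters = [[1.0] * 6, [0.0, 0.0, 1.0, -1.0, 0.0, 0.0]]  # 2 filters
local = conv1d(sentence, filters)
print(len(local), len(max_over_time(local)))  # 4 2
```

The number of local vectors follows the sentence length, while the pooled vector always has one entry per filter.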

Recurrent Neural Networks

Recurrent neural networks, together with their variants such as the gated recurrent unit (GRU) and long short-term memory (LSTM), have demonstrated remarkable achievements in modelling sequential data. In particular, bidirectional RNNs efficiently make use of past information (via forward states) and future information (via backward states) at a specific time step. A token encoded by a bidirectional RNN therefore contains evidence from the whole input sentence.
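
The forward and backward passes can be sketched with a plain tanh RNN. This is a simplification: the weights are random and untrained, and the GRU or LSTM cells used in practice are replaced by the simplest recurrent cell. All function names are illustrative.

```python
import math
import random


def rnn_pass(inputs, w_in, w_rec):
    """Run a simple tanh RNN and return the hidden state at every step."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(w_rec))
        ]
        states.append(hidden)
    return states


def bidirectional_rnn(inputs, forward_params, backward_params):
    """Concatenate forward (past) and backward (future) states per token."""
    forward = rnn_pass(inputs, *forward_params)
    backward = rnn_pass(inputs[::-1], *backward_params)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def random_params(rng, input_dim, hidden_dim):
    """Untrained weights, only so that the sketch runs."""
    w_in = [[rng.uniform(-1, 1) for _ in range(input_dim)]
            for _ in range(hidden_dim)]
    w_rec = [[rng.uniform(-1, 1) for _ in range(hidden_dim)]
             for _ in range(hidden_dim)]
    return w_in, w_rec


rng = random.Random(0)
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = bidirectional_rnn(
    sentence, random_params(rng, 2, 3), random_params(rng, 2, 3)
)
print(len(encoded), len(encoded[0]))  # 3 6
```

Each token's vector has a forward half that depends on the tokens up to and including it, and a backward half that depends on the tokens from it to the end of the sentence.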


Neural sequence labelling models are typically based on complex convolutional or recurrent networks that consist of encoders and decoders. The Transformer, proposed by Vaswani et al., dispenses with recurrence and convolutions entirely. It uses stacked self-attention and point-wise, fully connected layers to build the basic blocks of the encoder and decoder.
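
The self-attention operation at the heart of these blocks can be sketched as scaled dot-product attention. The sketch assumes a single head and uses the raw token vectors as queries, keys, and values; a real Transformer first applies learned linear projections and uses several heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: each token attends to all tokens."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output vector per query.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens, tokens, tokens)
print(len(contextual), len(contextual[0]))  # 3 2
```

Every output vector is a weighted average over all positions, so each token sees the whole sentence in a single step without any recurrence.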

Tag Decoder

The tag decoder is the final stage of an NER model. It takes context-dependent representations as input and produces a sequence of tags corresponding to the input sequence.

MLP + Softmax

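With a multi-layer perceptron followed by a softmax layer as the tag decoder, sequence labelling is cast as a multi-class classification problem. The tag for each word is predicted independently from its context-dependent representation, without taking the tags of neighbouring words into account.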
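
A softmax decoder of this kind can be sketched as a per-token classifier. For brevity, a single linear layer stands in for the MLP, and the tag set, weights, and encoder outputs below are hand-picked assumptions rather than trained values.

```python
import math

TAGS = ["O", "B-PER", "B-LOC"]


def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_independently(encoded, weights, biases):
    """Pick the most probable tag for each token on its own."""
    tags = []
    for vector in encoded:
        scores = [
            sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)
        ]
        probabilities = softmax(scores)
        tags.append(TAGS[probabilities.index(max(probabilities))])
    return tags


# Hand-picked weights (one row per tag), not trained values.
weights = [[0.0, 0.0], [2.0, -1.0], [-1.0, 2.0]]
biases = [0.5, 0.0, 0.0]
encoded = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
print(decode_independently(encoded, weights, biases))  # ['B-PER', 'O', 'B-LOC']
```

Because each token is decoded in isolation, nothing prevents an inconsistent tag sequence from being produced.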

Conditional Random Fields

A conditional random field (CRF) is a random field globally conditioned on the observation sequence. CRFs have been widely used in feature-based supervised learning approaches, and many deep learning-based NER models use a CRF layer as the tag decoder. The CRF is the most common choice of tag decoder, and state-of-the-art performance on CoNLL03 and OntoNotes5.0 has been reported with a CRF tag decoder.
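
What a CRF layer adds at prediction time can be sketched with Viterbi decoding over per-token emission scores and tag-to-tag transition scores. Only decoding is shown, not training, and the tag set, scores, and large negative penalty are illustrative assumptions.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]: score of tag j at position t (from the encoder).
    transitions[i][j]: score of moving from tag i to tag j.
    """
    num_tags = len(transitions)
    best = list(emissions[0])
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = [], []
        for j in range(num_tags):
            candidates = [best[i] + transitions[i][j] for i in range(num_tags)]
            i_best = max(range(num_tags), key=candidates.__getitem__)
            new_best.append(candidates[i_best] + scores[j])
            pointers.append(i_best)
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(num_tags), key=best.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["O", "B-PER", "I-PER"]
FORBIDDEN = -1e4  # large penalty: I-PER may not follow O
transitions = [[0.0, 0.0, FORBIDDEN], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
emissions = [[1.0, 0.8, 0.0], [0.0, 0.0, 1.0]]
print([TAGS[i] for i in viterbi(emissions, transitions)])  # ['B-PER', 'I-PER']
```

Choosing the best tag per token from the emissions alone would give O followed by I-PER, which is not a valid sequence; scoring the sequence as a whole selects B-PER followed by I-PER instead.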