Transformers and 'Embed, Encode, Attend, Predict' in NER #6910
-
I've based my understanding of how SpaCy's NER works on Matthew's video on YouTube (https://www.youtube.com/watch?v=sqDHBH9IjRU&t=2502s). In the new release it's possible to extract features from words using Transformers. As far as I've understood how transformers work, the features they extract are already embedded and encoded in their context. They also hold a notion of attention. If I decide to use a Transformer in combination with the NER component, surely the 'embed, encode and attend' steps won't all be repeated for the feature vectors coming out of the Transformer component? How then are these feature vectors used to make NER predictions?
-
A transformer model like BERT does indeed come with both the "embed" and "encode" steps. In the case of transformers, multi-head self-attention is the contextual encoding strategy. This is what the paper title "Attention is all you need" refers to: the attention layer can be used as the contextual encoder, instead of a BiLSTM or CNN.

My terminology in that blog post should probably be updated to be less confusing, given that attention is used in so many ways. I would now call the third step reduce: it's the step where you select the information relevant to a decision, and then blend that information into a single vector you can pass to the prediction step.

So all that said, the transformer-based NER model uses the pretrained transformer as its embed and encode steps. The actual entity recognition process is transition-based, which means the modelling task is to enter a while loop, at each step predicting a state-transformation action from the current state, until a termination state is reached, at which point the predicted structure is read off the state. The "reduce" part of each step consists of extracting a particular set of tokens to represent the state, concatenating their transformer-encoded vectors, and passing the result through a feed-forward network to compute a single vector representing the state. That state vector is then passed through another feed-forward network to predict the action.

It's not really necessary to have a fully detailed mental model of this algorithm to use the NER, or even to customize its configuration in many ways. But that's a quick sketch of what's going on.
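To make the loop concrete, here's a rough sketch in plain numpy of the predict cycle described above. It is not spaCy's actual code: the action set, the state features, the state-update logic, and the network shapes are all invented for illustration; only the structure (reduce the state to a vector, score actions, apply the best one, repeat until done) reflects what's described.

```python
# Illustrative sketch of a transition-based NER prediction loop.
# Token vectors are assumed to come from a pretrained transformer
# (the "embed" + "encode" steps happen before this function is called).
import numpy as np

ACTIONS = ["SHIFT", "BEGIN-PER", "IN-PER", "LAST-PER", "OUT"]  # toy action set

def feed_forward(x, W, b):
    # One hidden layer: linear map followed by ReLU.
    return np.maximum(0.0, W @ x + b)

def predict_entities(token_vectors, params):
    """token_vectors: (n_tokens, width) array from the transformer."""
    state = {"buffer": list(range(len(token_vectors))), "stack": [], "ents": []}
    history = []
    while state["buffer"]:  # loop until a terminal state (empty buffer)
        # "Reduce": pick a fixed set of tokens that represent the state
        # (here: the next buffer token and the top of the stack, if any).
        feats = [state["buffer"][0]]
        feats.append(state["stack"][-1] if state["stack"] else feats[0])
        state_input = np.concatenate([token_vectors[i] for i in feats])

        # Blend the concatenated vectors into a single state vector,
        # then score the candidate actions and pick the best one.
        state_vector = feed_forward(state_input, params["W_state"], params["b_state"])
        scores = params["W_act"] @ state_vector + params["b_act"]
        action = ACTIONS[int(np.argmax(scores))]
        history.append(action)

        # Apply the action to produce the next state (greatly simplified).
        token = state["buffer"].pop(0)
        if action.startswith(("BEGIN", "IN", "LAST")):
            state["stack"].append(token)
        if action.startswith("LAST"):
            state["ents"].append((state["stack"][0], state["stack"][-1], "PER"))
            state["stack"] = []
    # The predicted structure is read off the final state.
    return state["ents"], history
```

In practice you never write any of this by hand: loading a transformer pipeline such as `en_core_web_trf` (or training from a config that uses a transformer tok2vec) wires the transformer's output into the transition-based NER for you, so the transformer is consulted once per document and the per-step work is just the reduce-and-predict part.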