Transformers and 'Embed, Encode, Attend, Predict' in NER #6910
-
I've based my understanding of how SpaCy's NER works on Matthew's video on YouTube (https://www.youtube.com/watch?v=sqDHBH9IjRU&t=2502s). In the new release it's possible to extract features from words using Transformers. As far as I've understood how transformers work, the features they extract are already embedded and encoded in their context. They also hold a notion of attention. If I decide to use a Transformer in combination with the NER component, surely the 'embed, encode and attend' steps won't all be repeated for the feature vectors coming out of the Transformer component? How then are these feature vectors used to make NER predictions?
-
A transformer model like BERT does indeed come with both the "embed" and "encode" steps. In the case of transformers, multi-head self-attention is the contextual encoding strategy. This is what the paper title "Attention is all you need" refers to: the attention layer can be used as the contextual encoder, instead of a BiLSTM or CNN.

My terminology in that blog post should probably be updated to be less confusing, given that attention is used in so many ways. I would now call the third step reduce: it's the step where you select the information relevant to a decision, and then blend that information into a single vector you can pass to the prediction step.

So all that said, the transformer-based NER model uses the pretrained transformer as its embed and encode steps. The actual entity recognition process is transition-based, which means the modelling task is to enter a while loop, at each step predicting a state-transformation action from the current state, until a termination state is reached, at which point the predicted structure is read off the state. The "reduce" part of each step consists of extracting a particular set of tokens to represent the state, concatenating their transformer-encoded vectors, and passing the result through a feed-forward network to compute a single vector representing the state. That state vector is then passed through another feed-forward network to predict the action.

It's not really necessary to have a fully detailed mental model of this algorithm to use the NER, or even to customize its configuration in many ways. But that's a quick sketch of what's going on.
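To make the loop concrete, here's a rough sketch in plain numpy of the predict cycle described above. It is not spaCy's actual code: the action set, the state features, the state-update logic, and the network shapes are all invented for illustration; only the structure (reduce the state to a vector, score actions, apply the best one, repeat until done) reflects what's described.

```python
# Illustrative sketch of a transition-based NER prediction loop.
# Token vectors are assumed to come from a pretrained transformer
# (the "embed" + "encode" steps happen before this function is called).
import numpy as np

ACTIONS = ["SHIFT", "BEGIN-PER", "IN-PER", "LAST-PER", "OUT"]  # toy action set

def feed_forward(x, W, b):
    # One hidden layer: linear map followed by ReLU.
    return np.maximum(0.0, W @ x + b)

def predict_entities(token_vectors, params):
    """token_vectors: (n_tokens, width) array from the transformer."""
    state = {"buffer": list(range(len(token_vectors))), "stack": [], "ents": []}
    history = []
    while state["buffer"]:  # loop until a terminal state (empty buffer)
        # "Reduce": pick a fixed set of tokens that represent the state
        # (here: the next buffer token and the top of the stack, if any).
        feats = [state["buffer"][0]]
        feats.append(state["stack"][-1] if state["stack"] else feats[0])
        state_input = np.concatenate([token_vectors[i] for i in feats])

        # Blend the concatenated vectors into a single state vector,
        # then score the candidate actions and pick the best one.
        state_vector = feed_forward(state_input, params["W_state"], params["b_state"])
        scores = params["W_act"] @ state_vector + params["b_act"]
        action = ACTIONS[int(np.argmax(scores))]
        history.append(action)

        # Apply the action to produce the next state (greatly simplified).
        token = state["buffer"].pop(0)
        if action.startswith(("BEGIN", "IN", "LAST")):
            state["stack"].append(token)
        if action.startswith("LAST"):
            state["ents"].append((state["stack"][0], state["stack"][-1], "PER"))
            state["stack"] = []
    # The predicted structure is read off the final state.
    return state["ents"], history
```

In practice you never write any of this by hand: loading a transformer pipeline such as `en_core_web_trf` (or training from a config that uses a transformer tok2vec) wires the transformer's output into the transition-based NER for you, so the transformer is consulted once per document and the per-step work is just the reduce-and-predict part.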