
Transformers and 'Embed, Encode, Attend, Predict' in NER #6910


A transformer model like BERT does indeed cover both the "embed" and "encode" steps. Transformers use multi-head self-attention as their contextual encoding strategy. That is what the paper title "Attention Is All You Need" refers to: the attention layer can serve as the contextual encoder, instead of a BiLSTM or CNN.
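To make that concrete, here's a minimal NumPy sketch of scaled dot-product self-attention used as a contextual encoder. It's a single head with toy random weights and no masking, so it's an illustration of the idea rather than spaCy's or BERT's actual implementation; the function name and shapes are made up for the example.

```python
import numpy as np

def self_attention_encode(X, Wq, Wk, Wv):
    """Contextually encode a sequence of token vectors with
    scaled dot-product self-attention (single head, no masking)."""
    Q = X @ Wq  # queries, shape (n_tokens, d_head)
    K = X @ Wk  # keys,    shape (n_tokens, d_head)
    V = X @ Wv  # values,  shape (n_tokens, d_head)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n_tokens, n_tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V  # each row now mixes in information from every token

# Toy usage: 5 embedded tokens of width 16, projected to a 16-dim head
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
contextual = self_attention_encode(X, Wq, Wk, Wv)
print(contextual.shape)  # (5, 16): one contextual vector per token
```

The key point is that the output has the same shape as the input: it's still one vector per token, but each vector now reflects its context, which is exactly the job the "encode" step assigns to a BiLSTM or CNN.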

My terminology in that blog post should probably be updated to be less confusing, given that attention is used in so many ways. I would now call the third step "reduce": it's the step where you select the information relevant to a decision, and then blend that information into a single vector you can pass into the predict step.
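Here's an equally minimal sketch of that "attend"/"reduce" step followed by a toy predict step. It assumes attention pooling against a single learned query vector; the names (`attend_reduce`, `predict`) and the linear-softmax classifier are hypothetical, chosen only to show how a sequence of contextual vectors gets blended into one vector for a decision.

```python
import numpy as np

def attend_reduce(H, q):
    """The 'attend'/'reduce' step: score each contextual token vector
    against a learned query, then take the weighted sum so the whole
    sequence is summarized as a single vector."""
    scores = H @ q                        # (n_tokens,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax attention weights
    return weights @ H                    # (width,) pooled vector

def predict(pooled, W_out, b_out):
    """A toy predict step: one linear layer with a softmax over labels."""
    logits = pooled @ W_out + b_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Toy usage on contextual vectors like those from the encode sketch above
rng = np.random.default_rng(1)
H = rng.normal(size=(5, 16))   # 5 contextually encoded tokens
q = rng.normal(size=16)        # hypothetical learned query vector
pooled = attend_reduce(H, q)
probs = predict(pooled, rng.normal(size=(16, 3)), np.zeros(3))
print(pooled.shape, probs.round(3))  # (16,) plus a distribution over 3 labels
```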

Answer selected by svlandeg