BERT embeddings are contextual numerical representations of words or subword tokens, learned by the Bidirectional Encoder Representations from Transformers (BERT) model. They capture the semantic and syntactic meaning of text and, crucially for language understanding, reflect how a word's meaning changes with its surrounding context.
Understanding Embeddings
At its core, an embedding is a trained numerical representation of a categorical feature. In the context of natural language processing (NLP), this means a list of floating-point values that are learned during model training. The number of these values, also called the embedding size (e.g., 768 for `bert-base-uncased`), can vary from model to model. For BERT, these embeddings are rich, high-dimensional vectors that encode the meaning of words or subword units within a specific sentence.
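For a concrete sense of these sizes, the embedding dimension can be read directly from a model's configuration. The short sketch below assumes the Hugging Face `transformers` library, which is also used later in this article:

```python
from transformers import BertConfig

# Load only the configuration (no model weights) to inspect embedding-related sizes.
config = BertConfig.from_pretrained('bert-base-uncased')
print(config.hidden_size)   # 768: the embedding size for bert-base-uncased
print(config.vocab_size)    # 30522: number of WordPiece tokens in the vocabulary
```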
The Contextual Advantage of BERT Embeddings
Unlike earlier word embeddings (e.g., Word2Vec or GloVe), which assign a single, static vector to each word regardless of its usage, BERT generates contextualized embeddings. This means the numerical representation of a word changes depending on the other words around it in a sentence.
Why Context Matters
Many English words have multiple meanings (polysemy). BERT's contextual nature allows it to differentiate between these meanings:
- Example 1: "I went to the bank to deposit money."
  - Here, "bank" refers to a financial institution.
- Example 2: "The boat was tied to the river bank."
  - Here, "bank" refers to the side of a river.
BERT produces distinct embeddings for "bank" in these two sentences, reflecting their different meanings, a capability vital for accurate language understanding.
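A minimal sketch of this effect, assuming the Hugging Face `transformers` library introduced more fully below (the helper name `bank_embedding` is only for illustration): it extracts the contextual vector for "bank" from each sentence and compares the two.

```python
from transformers import BertModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def bank_embedding(sentence):
    # Tokenize and run the sentence through BERT without tracking gradients.
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    # Locate the 'bank' token and return its contextual vector.
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
    return hidden[tokens.index('bank')]

vec_finance = bank_embedding("I went to the bank to deposit money.")
vec_river = bank_embedding("The boat was tied to the river bank.")

# A cosine similarity noticeably below 1.0 shows the two 'bank' vectors differ by context.
print(torch.cosine_similarity(vec_finance, vec_river, dim=0).item())
```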
How BERT Generates Embeddings
BERT's architecture, based on the Transformer's encoder mechanism, is key to its ability to create these sophisticated contextual embeddings.
- Tokenization: Input text is first broken down into subword units (using WordPiece tokenization), and special tokens such as `[CLS]` (used for classification tasks) and `[SEP]` (to separate sentences or parts of sentences) are added.
- Initial Embeddings: Each token is initially converted into a fixed-size vector. This initial vector is a sum of three components:
  - Token Embeddings: Represent the intrinsic meaning of the token itself.
  - Segment Embeddings: Indicate which sentence the token belongs to (useful for tasks involving multiple sentences, like question answering).
  - Position Embeddings: Capture the order of tokens in the sequence, as Transformers process tokens in parallel without inherent positional information.
- Transformer Layers: These initial embeddings are then fed through multiple layers of Transformer encoders. Each layer uses self-attention mechanisms to weigh the importance of all other tokens in the sentence when processing a specific token. This process allows information to flow bidirectionally, learning the context from both left and right directions simultaneously.
- Final Output: The output of the final Transformer layer for each input token is its BERT embedding – a dense, high-dimensional vector (e.g., 768 dimensions for `bert-base-uncased`).
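These steps can be inspected directly. A short sketch, again assuming the `transformers` library: it prints the WordPiece tokens with `[CLS]` and `[SEP]` added, and the three lookup tables whose outputs BERT sums before the encoder layers.

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Step 1: WordPiece tokenization adds [CLS]/[SEP] and splits rare words into '##' subwords.
encoded = tokenizer("BERT embeddings are useful.", return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'][0].tolist()))

# Step 2: the three embedding tables summed to form each token's initial vector.
print(model.embeddings.word_embeddings)        # token embeddings    (vocab_size x 768)
print(model.embeddings.position_embeddings)    # position embeddings (max positions x 768)
print(model.embeddings.token_type_embeddings)  # segment embeddings  (2 x 768)
```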
Characteristics and Benefits
- Contextual: Accurately captures polysemy and nuanced meanings based on surrounding words.
- High-Dimensional: Provides rich semantic and syntactic information, enabling fine-grained distinctions between meanings.
- Pre-trained: Learned from vast amounts of text data (e.g., Wikipedia, BookCorpus), allowing for effective transfer learning to downstream tasks with less labeled data (see the sketch after this list).
- Bidirectional: Processes text by considering the full context from both directions, leading to a deeper understanding.
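As a sketch of that transfer-learning setup, using standard `transformers` classes (the two-label sentiment task here is only an assumed example), the pre-trained encoder is reused and only a small classification head starts from scratch:

```python
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# The encoder weights are pre-trained; only the classification head on top is newly initialized.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# From here, the model would be fine-tuned on a (typically small) labeled dataset,
# e.g. sentiment-labeled sentences, using a standard PyTorch or Trainer training loop.
```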
Applications of BERT Embeddings
BERT embeddings serve as a powerful foundation for various NLP tasks, significantly improving performance:
- Semantic Search: Finding documents or passages that are semantically similar to a given query, rather than just matching keywords (see the sketch after this list).
- Text Classification: Accurately categorizing text into predefined classes (e.g., sentiment analysis, spam detection, topic classification).
- Question Answering: Identifying precise answers within a given text or corpus in response to a natural language query.
- Named Entity Recognition (NER): Locating and classifying named entities (like persons, organizations, locations, dates) in unstructured text.
- Machine Translation: Aiding in the understanding of source text and improving the quality of translations.
- Feature Extraction: Used as robust input features for other machine learning models or statistical methods for various custom NLP applications.
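To make the semantic search case concrete, the following sketch mean-pools token embeddings into sentence vectors and ranks candidate passages by cosine similarity to a query. This is one possible approach, and the query and passages are invented for illustration:

```python
from transformers import BertModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def sentence_vector(text):
    # Mean-pool the token embeddings, using the attention mask to ignore padding.
    inputs = tokenizer(text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (1, seq_len, 768)
    mask = inputs['attention_mask'].unsqueeze(-1)          # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (1, 768)

query = "How do I open a savings account?"
passages = [
    "Visit a branch with identification to open a new savings account.",
    "The river bank eroded after the spring floods.",
]

q_vec = sentence_vector(query)
scores = [torch.cosine_similarity(q_vec, sentence_vector(p)).item() for p in passages]
print(sorted(zip(scores, passages), reverse=True))  # highest-scoring passage first
```

In practice, encoders fine-tuned for sentence similarity (e.g., Sentence-BERT-style models) usually retrieve more accurately than raw BERT embeddings, but the mechanics are the same.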
Extracting BERT Embeddings (Practical Insight)
Developers can easily extract BERT embeddings using popular libraries like Hugging Face's `transformers`. The output of the last hidden layer provides the contextualized embeddings for each token.
```python
from transformers import BertModel, BertTokenizer
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

text = "BERT embeddings are highly useful for NLP tasks."
encoded_input = tokenizer(text, return_tensors='pt')

# Get the model output (without computing gradients)
with torch.no_grad():
    output = model(**encoded_input)

# The 'last_hidden_state' contains the contextual embeddings for each token.
# Its shape is (batch_size, sequence_length, hidden_size);
# for 'bert-base-uncased', hidden_size is 768.
token_embeddings = output.last_hidden_state

# The embedding of the [CLS] token (first token) is often used for sentence-level tasks
cls_embedding = token_embeddings[:, 0, :]

print(f"Text: '{text}'")
print(f"Shape of all token embeddings: {token_embeddings.shape} (batch_size, sequence_length, hidden_size)")
print(f"Shape of [CLS] token embedding: {cls_embedding.shape} (batch_size, hidden_size)")
```
This code snippet demonstrates how to obtain the 768-dimensional vector representation for each token in the input sentence, or a single vector for the entire sentence (from the `[CLS]` token), which can then be used in downstream applications.
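A common variant, continuing from the snippet above, is to combine several of the final encoder layers instead of using only the last one; layer-combination strategies of this kind were explored for feature extraction in the original BERT paper. The simple averaging below is one option, not the only one:

```python
# Request all hidden states (embedding layer + 12 encoder layers for bert-base-uncased).
with torch.no_grad():
    outputs = model(**encoded_input, output_hidden_states=True)

# outputs.hidden_states is a tuple of tensors, each shaped (batch_size, sequence_length, 768).
# Average the last four encoder layers into a single per-token representation.
last_four = torch.stack(outputs.hidden_states[-4:])   # (4, batch, seq_len, 768)
token_embeddings_avg = last_four.mean(dim=0)          # (batch, seq_len, 768)
print(token_embeddings_avg.shape)
```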
BERT Embeddings vs. Traditional Embeddings
To highlight the advancements, here's a comparison:
| Feature | Traditional Word Embeddings (e.g., Word2Vec) | BERT Embeddings |
|---|---|---|
| Contextuality | Static (one vector per word) | Dynamic (vector changes based on context) |
| Polysemy | Struggles with words having multiple meanings | Handles multiple meanings effectively |
| Architecture | Shallow neural networks | Deep Transformer encoder |
| Training Objective | Predict surrounding words or count-based statistics | Masked Language Model (MLM), Next Sentence Prediction (NSP) |
| Representation | Word-level | Subword-level, contextualized at the token level |
| Directionality | Unidirectional or fixed-window (e.g., skip-gram) | Fully bidirectional (attention across the entire sequence) |