hugger.mapper module
- class hugger.mapper.HuggingMapper(model_name: str = 'cambridgeltl/SapBERT-from-PubMedBERT-fulltext', *, pooling: str = 'mean_pooling', padding: Annotated[bool, Strict(strict=True)] = True, truncation: Annotated[bool, Strict(strict=True)] = True, return_tensors: Literal['pt', 'np', 'tf', 'jax', None] = 'pt', max_length: int = 512, **tokenizer_kwargs)
Bases: object
A class for mapping text to embeddings using a Hugging Face model. This class provides methods to load a pre-trained model and tokenizer, embed text, and configure pooling methods for generating embeddings.
- Parameters:
model_name (str) – The name of the pre-trained model to be used for generating embeddings (default is “cambridgeltl/SapBERT-from-PubMedBERT-fulltext”).
tokenizer_kwargs (dict) – Additional keyword arguments to be passed to the tokenizer (default is {‘padding’: True, ‘truncation’: True, ‘return_tensors’: ‘pt’, ‘max_length’: 512}).
pooling (str) – The pooling method to be used for generating embeddings (default is “mean_pooling”).
- tokenizer
The pre-trained tokenizer instance.
- model
The pre-trained model instance.
- Type: AutoModel
- embed_text(text_input: str) → Tensor
Embeds a given text using the pre-trained model and pooling function.
- Parameters:
text_input (str) – The text to be embedded.
- Returns:
The normalized embedding of the input text.
- Return type: torch.Tensor
- property model: AutoModel
Returns the pre-trained model instance.
- Returns:
The loaded model instance.
- Return type: AutoModel
- property pooling: str
Returns the pooling method used for generating embeddings.
- Returns:
The pooling method.
- Return type: str
- property tokenizer: AutoTokenizer
Returns the pre-trained tokenizer instance.
- Returns:
The loaded tokenizer instance.
- Return type: AutoTokenizer
- class hugger.mapper.NodeMapper(df: DataFrame, text_col: str, id_col: str = 'id', model_name: str = 'cambridgeltl/SapBERT-from-PubMedBERT-fulltext', *, pooling: str = 'mean_pooling', padding: Annotated[bool, Strict(strict=True)] = True, truncation: Annotated[bool, Strict(strict=True)] = True, return_tensors: Literal['pt', 'np', 'tf', 'jax', None] = 'pt', max_length: int = 512, **tokenizer_kwargs)
Bases: HuggingMapper
A class for mapping nodes to their corresponding text embeddings using a Hugging Face model. This class extends the HuggingMapper class to handle a DataFrame containing node IDs and their associated text. It provides methods to generate embeddings for each node and find similar nodes based on a given input text.
- Parameters:
df (pandas.DataFrame) – A DataFrame containing the node IDs and their corresponding text.
text_col (str) – The name of the column in the DataFrame that contains the text to be embedded.
id_col (str) – The name of the column in the DataFrame that contains the node IDs (default is “id”).
model_name (str) – The name of the pre-trained model to be used for generating embeddings (default is “cambridgeltl/SapBERT-from-PubMedBERT-fulltext”).
tokenizer_kwargs (dict) – Additional keyword arguments to be passed to the tokenizer (default is {‘padding’: True, ‘truncation’: True, ‘return_tensors’: ‘pt’, ‘max_length’: 512}).
pooling (str) – The pooling method to be used for generating embeddings (default is “mean_pooling”).
- df
The DataFrame containing the node IDs and their corresponding text.
- Type: pandas.DataFrame
- get_similar(input_text: str, threshold: float = 0.8, metric: str = 'cosine') → dict
Finds similar items in the mapping based on a similarity threshold.
- get_match(input_text: str, threshold: float = 0.8, metric: str = 'cosine') → tuple
Finds the best match for the input text from the mapping based on a similarity threshold.
- property embeddings_df: DataFrame
Returns a DataFrame containing the node IDs and their corresponding embeddings. The DataFrame is constructed from the mapping of node IDs to their embeddings, with the node IDs as the index.
- Returns:
A DataFrame where the index consists of node IDs and the columns contain the corresponding embeddings.
- Return type: pandas.DataFrame
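- get_match(input_text: str, *, threshold: float = 0, metric: str = 'cosine') → list
Finds the best match for the input text from the mapping based on a similarity threshold.
- Parameters:
input_text (str) – The input text to find the best match for.
threshold (float) – The minimum similarity score required to consider an item a match (default is 0).
metric (str) – The similarity metric to use for comparison (default is “cosine”).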
- Returns:
A tuple containing the ID of the best match and its corresponding metadata. The metadata includes the text of the match and its similarity score. If no match is found above the threshold, returns (None, None).
- Return type:
- Raises:
TypeError – If input_text is not a string or if metric is not a string.
ValueError – If metric is not one of the supported similarity metrics (“cosine” or “jaccard”).
Examples
>>> import pandas as pd
>>> from hugger.mapper import NodeMapper
>>> df = pd.DataFrame({"id": ["n1", "n2"], "text": ["hello", "world"]})
>>> mapper = NodeMapper(df, text_col='text', id_col='id')
Loading tokenizer for model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext
Loading model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext
Generating embeddings for 2 nodes ...
>>> best_match_id, metadata = mapper.get_match("earth", threshold=0.8, metric="cosine")
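The selection rule described in Returns above (best score wins, or (None, None) when nothing reaches the threshold) can be sketched in plain Python. This is an illustration only, not the library's code: `get_match_sketch`, its arguments, and the precomputed `scores` dictionary are hypothetical, and treating a score equal to the threshold as a match is an assumption.

```python
def get_match_sketch(scores, texts, threshold=0.0):
    """Pick the highest-scoring node, or (None, None) if it misses the threshold.

    scores: hypothetical dict mapping node ID -> similarity to the query.
    texts:  hypothetical dict mapping node ID -> node text.
    """
    if not scores:
        return None, None
    # The best candidate is the node ID with the largest similarity score.
    best_id = max(scores, key=scores.get)
    # Assumption: a score exactly equal to the threshold still counts as a match.
    if scores[best_id] < threshold:
        return None, None
    return best_id, {"text": texts[best_id], "score": scores[best_id]}


scores = {"n1": 0.42, "n2": 0.91}
texts = {"n1": "hello", "n2": "world"}
print(get_match_sketch(scores, texts, threshold=0.8))
# ('n2', {'text': 'world', 'score': 0.91})
print(get_match_sketch(scores, texts, threshold=0.95))
# (None, None)
```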
- get_similar(input_text: str, *, threshold: float = 0, top_k: int | None = None, metric: str = 'cosine') → list
Finds similar items in the mapping based on a similarity threshold.
- Parameters:
input_text (str) – The input text to find similar items for.
threshold (float) – The minimum similarity score required to consider an item similar (default is 0).
top_k (Optional[int]) – The maximum number of similar items to return (default is None, meaning all similar items).
metric (str) – The similarity metric to use for comparison (default is “cosine”).
- Returns:
A dictionary containing the IDs of similar items as keys and their corresponding metadata (text and similarity score) as values. The dictionary is sorted in descending order by score.
- Return type:
- Raises:
TypeError – If input_text is not a string or if metric is not a string.
ValueError – If metric is not one of the supported similarity metrics (“cosine” or “jaccard”).
Examples
>>> import pandas as pd
>>> from hugger.mapper import NodeMapper
>>> df = pd.DataFrame({"id": ["n1", "n2"], "text": ["hello", "world"]})
>>> mapper = NodeMapper(df, text_col='text', id_col='id')
Loading tokenizer for model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext
Loading model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext
Generating embeddings for 2 nodes ...
>>> similar_items = mapper.get_similar("planet", threshold=0.8, metric="cosine")
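How threshold and top_k might interact can be sketched with cosine similarity over toy vectors. This is a minimal illustration under stated assumptions, not the library's implementation: `cosine`, `get_similar_sketch`, and the toy dictionaries are hypothetical, the threshold is assumed to be inclusive, and the result is shown as a list of (node ID, metadata) pairs sorted by descending score.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def get_similar_sketch(query_vec, id_to_vec, id_to_text, threshold=0.0, top_k=None):
    """Keep nodes scoring at least `threshold`, best first, truncated to `top_k`."""
    scored = []
    for node_id, vec in id_to_vec.items():
        score = cosine(query_vec, vec)
        if score >= threshold:  # assumption: the threshold is inclusive
            scored.append((node_id, {"text": id_to_text[node_id], "score": score}))
    # Sort in descending order by similarity score.
    scored.sort(key=lambda item: item[1]["score"], reverse=True)
    return scored if top_k is None else scored[:top_k]


vectors = {"n1": [1.0, 0.0], "n2": [0.0, 1.0], "n3": [1.0, 1.0]}
texts = {"n1": "hello", "n2": "world", "n3": "hello world"}
hits = get_similar_sketch([1.0, 0.0], vectors, texts, threshold=0.5, top_k=2)
print([node_id for node_id, _ in hits])  # ['n1', 'n3']
```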
- property mapping: dict
Returns the mapping of node IDs to their corresponding text.
- Returns:
A dictionary where keys are node IDs and values are the corresponding text.
- Return type: dict
- property mapping_embeddings: dict
Returns the mapping of node IDs to their corresponding embeddings.
- Returns:
A dictionary where keys are node IDs and values are the corresponding embeddings.
- Return type: dict
- plot_tsne(random_state: int = 42, title: str = 't-SNE of Node Embeddings', labels: dict | None = None, tsne_kwargs: dict | None = None, px_scatter_kwargs: dict | None = None)
Quick t-SNE visualization of the node embeddings.
- Parameters:
random_state (int) – The random seed for reproducibility (default is 42).
title (str) – The title of the plot (default is “t-SNE of Node Embeddings”).
labels (Optional[dict]) – A dictionary mapping node IDs to labels for the plot. If None, the axes are labeled “t-SNE 1” and “t-SNE 2” (default is None).
tsne_kwargs (Optional[dict]) – Additional keyword arguments to pass to the TSNE constructor (default is None).
px_scatter_kwargs (Optional[dict]) – Additional keyword arguments to pass to the Plotly Express scatter function (default is None).
- hugger.mapper.attention_pooling(model_output: Tensor, attention_scores: Tensor) → Tensor
Applies attention-based pooling to aggregate token embeddings. This function computes a weighted sum of token embeddings using provided attention scores. The attention scores are normalized using softmax to obtain attention weights, which are then used to pool the token embeddings along the sequence dimension.
- Parameters:
model_output (tuple or torch.Tensor) – The output from a model, where the first element (or the tensor itself) contains token embeddings of shape (batch_size, sequence_length, embedding_dim).
attention_scores (torch.Tensor) – Attention scores for each token, of shape (batch_size, sequence_length).
- Returns:
The pooled embeddings of shape (batch_size, embedding_dim), obtained by applying attention-based weighted sum over the token embeddings.
- Return type: torch.Tensor
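The arithmetic described above (softmax over the sequence dimension, then a weighted sum of token embeddings) can be restated in NumPy. This is a sketch for illustration, not the library's torch implementation; `attention_pooling_sketch` and the toy arrays are hypothetical.

```python
import numpy as np


def attention_pooling_sketch(token_embeddings, attention_scores):
    """Softmax the scores over the sequence axis, then take the weighted sum.

    token_embeddings: array of shape (batch_size, sequence_length, embedding_dim)
    attention_scores: array of shape (batch_size, sequence_length)
    """
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = attention_scores - attention_scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)  # each row now sums to 1
    # Weight every token embedding and sum over the sequence dimension.
    return (weights[:, :, None] * token_embeddings).sum(axis=1)


# Toy batch: 1 sentence, 2 tokens, 3-dimensional embeddings.
tokens = np.array([[[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]])
scores = np.array([[0.0, 0.0]])  # equal scores give equal weights of 0.5
pooled = attention_pooling_sketch(tokens, scores)
print(pooled.shape)  # (1, 3)
```

With equal scores the result reduces to the plain average of the token embeddings; unequal scores shift the average toward the higher-scored tokens.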
- hugger.mapper.get_embeddings(model: AutoModel, encoded_input: BatchEncoding, pooling_function=<function attention_pooling>) → Tensor
Generates sentence embeddings using a Hugging Face model and a specified pooling function.
This function takes a pre-trained Hugging Face model and a batch of encoded sentences, computes their embeddings, applies a pooling function to obtain sentence-level representations, and normalizes the resulting embeddings.
- Parameters:
model (transformers.AutoModel) – The Hugging Face model used to generate token embeddings.
encoded_input (transformers.BatchEncoding) – The batch of tokenized sentences to embed.
pooling_function (Callable) – The pooling function to aggregate token embeddings into sentence embeddings. Defaults to attention_pooling.
- Returns:
The normalized sentence embeddings as a tensor.
- Return type: torch.Tensor
- Raises:
AssertionError – If encoded_input is not an instance of transformers.BatchEncoding.
Examples
>>> from transformers import AutoTokenizer, AutoModel
>>> from hugger.mapper import get_tokens, get_embeddings
>>> huggingface_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
>>> tokenizer = AutoTokenizer.from_pretrained(huggingface_model_name)
>>> model = AutoModel.from_pretrained(huggingface_model_name)
>>> sentences = ["dogs are happy", "cats are cute"]
>>> encoded = get_tokens(tokenizer, sentences)
>>> embeddings = get_embeddings(model, encoded)
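The description above says the pooled embeddings are normalized but does not name the norm. Assuming L2 normalization (an assumption, not confirmed by this page), the final step could look like the NumPy sketch below; `l2_normalize` and the toy array are hypothetical.

```python
import numpy as np


def l2_normalize(embeddings, eps=1e-12):
    """Scale each row to unit Euclidean length (assumed normalization scheme)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return embeddings / np.maximum(norms, eps)


pooled = np.array([[3.0, 4.0], [0.0, 2.0]])  # toy pooled embeddings
unit = l2_normalize(pooled)
print(np.linalg.norm(unit, axis=1))  # every row has length 1
```

One consequence of unit-length rows is that the dot product of two embeddings equals their cosine similarity, which is convenient for threshold-based matching.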
- hugger.mapper.get_tokens(tokenizer: AutoTokenizer, input: list[str] | str, *, padding: Annotated[bool, Strict(strict=True)] = True, truncation: Annotated[bool, Strict(strict=True)] = True, return_tensors: Literal['pt', 'np', 'tf', 'jax', None] = 'pt', max_length: int = 512, **tokenizer_kwargs) → BatchEncoding
Encodes a sentence or a list of sentences using a Hugging Face tokenizer.
- Parameters:
tokenizer (transformers.AutoTokenizer) – The tokenizer instance from Hugging Face’s transformers library.
input (list[str] | str) – A single sentence or a list of sentences to be tokenized.
tokenizer_kwargs (dict) – Additional keyword arguments to pass to the tokenizer (default is {'padding': True, 'truncation': True, 'return_tensors': 'pt', 'max_length': 512}).
- Returns:
The encoded inputs as a BatchEncoding object, suitable for model input.
- Return type: transformers.BatchEncoding
Examples
>>> from transformers import AutoTokenizer
>>> from hugger.mapper import get_tokens
>>> tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L12-v2')
>>> sentences = ["dogs are happy", "cats are cute"]
>>> encoded = get_tokens(tokenizer, sentences)
- hugger.mapper.map_pooling(pooling: str)
Maps a string representing the pooling type to the corresponding pooling function.
- Parameters:
pooling (str) – The type of pooling to be used. Must be one of ‘mean_pooling’ or ‘attention_pooling’.
- Returns:
The corresponding pooling function.
- Return type:
Callable
- Raises:
TypeError – If the input is not a string.
ValueError – If the pooling type is not recognized.
Examples
>>> from hugger.mapper import map_pooling
>>> map_pooling('mean_pooling')
<function mean_pooling at 0x...>
>>> map_pooling('attention_pooling')
<function attention_pooling at 0x...>
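The lookup-and-validate behavior documented above (TypeError for non-strings, ValueError for unknown names) can be sketched with a dictionary dispatch. This is one plausible shape, not the library's source: `map_pooling_sketch`, `_POOLING_FUNCTIONS`, and the two stub functions are hypothetical stand-ins.

```python
def mean_pooling_stub(model_output, attention_mask):
    """Placeholder standing in for hugger.mapper.mean_pooling."""
    raise NotImplementedError


def attention_pooling_stub(model_output, attention_scores):
    """Placeholder standing in for hugger.mapper.attention_pooling."""
    raise NotImplementedError


# Hypothetical registry from pooling name to pooling function.
_POOLING_FUNCTIONS = {
    "mean_pooling": mean_pooling_stub,
    "attention_pooling": attention_pooling_stub,
}


def map_pooling_sketch(pooling):
    """Return the pooling function registered under `pooling`."""
    if not isinstance(pooling, str):
        raise TypeError(f"pooling must be a string, got {type(pooling).__name__}")
    try:
        return _POOLING_FUNCTIONS[pooling]
    except KeyError:
        valid = ", ".join(sorted(_POOLING_FUNCTIONS))
        raise ValueError(f"unknown pooling {pooling!r}; expected one of: {valid}") from None


print(map_pooling_sketch("mean_pooling").__name__)  # mean_pooling_stub
```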
- hugger.mapper.mean_pooling(model_output: Tensor, attention_mask: Tensor) → Tensor
Computes the mean pooled sentence embedding from token embeddings and an attention mask.
Given the output of a transformer model and the corresponding attention mask, this function calculates a single embedding vector for each sentence by averaging the token embeddings, taking into account only the tokens that are not masked (i.e., valid tokens).
- Parameters:
model_output (torch.Tensor or tuple of torch.Tensor) – The output from a transformer model. The first element should contain the token embeddings with shape (batch_size, sequence_length, embedding_dim).
attention_mask (torch.Tensor) – A mask indicating valid tokens (1 for valid, 0 for padding) with shape (batch_size, sequence_length).
- Returns:
A tensor of shape (batch_size, embedding_dim) containing the mean pooled embeddings for each sentence.
- Return type: torch.Tensor
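The masked average described above can be restated in NumPy to make the role of the attention mask concrete. This is a sketch for illustration, not the library's torch implementation; `mean_pooling_sketch`, the toy arrays, and the small clamp on the token count are assumptions.

```python
import numpy as np


def mean_pooling_sketch(token_embeddings, attention_mask):
    """Average token embeddings over valid (unmasked) positions only.

    token_embeddings: array of shape (batch_size, sequence_length, embedding_dim)
    attention_mask:   array of shape (batch_size, sequence_length), 1 = valid, 0 = padding
    """
    # Expand the mask so it broadcasts across the embedding dimension.
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)  # padding rows contribute zero
    # Clamp the count so a fully masked sentence does not divide by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy batch: 1 sentence, 3 tokens (the last one is padding), 2-dimensional embeddings.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pooling_sketch(tokens, mask)
print(pooled)  # [[2. 3.]]
```

The padded token's large values do not affect the result, because the mask zeroes it out before the sum and excludes it from the count.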