hugger package


class hugger.HuggingMapper(model_name: str = 'cambridgeltl/SapBERT-from-PubMedBERT-fulltext', *, pooling: str = 'mean_pooling', padding: Annotated[bool, Strict(strict=True)] = True, truncation: Annotated[bool, Strict(strict=True)] = True, return_tensors: Literal['pt', 'np', 'tf', 'jax', None] = 'pt', max_length: int = 512, **tokenizer_kwargs)

Bases: object

A class for mapping text to embeddings using a Hugging Face model.

This class provides methods to load a pre-trained model and tokenizer, embed text, and configure pooling methods for generating embeddings.

Parameters:

model_name (str, optional) – The name of the pre-trained model to be used for generating embeddings (default is “cambridgeltl/SapBERT-from-PubMedBERT-fulltext”).
pooling (str, optional) – The pooling method to be used for generating embeddings (“mean_pooling” or “attention_pooling”, default is “mean_pooling”).
padding (bool, optional) – Whether to pad sequences to the same length (default is True).
truncation (bool, optional) – Whether to truncate sequences to the maximum length (default is True).
return_tensors (str or None, optional) – The type of tensors to return (“pt”, “np”, “tf”, “jax”, or None, default is “pt”).
max_length (int, optional) – The maximum sequence length (default is 512).
**tokenizer_kwargs – Additional keyword arguments to be passed to the tokenizer.

model_name

The name of the pre-trained model.

Type: str

padding

Padding option for tokenization.

Type: bool

truncation

Truncation option for tokenization.

Type: bool

return_tensors

The type of tensors returned by the tokenizer.

Type: str or None

max_length

Maximum sequence length for tokenization.

Type: int

pooling

The pooling method used for generating embeddings.

Type: str

tokenizer

The pre-trained tokenizer instance.

Type: transformers.AutoTokenizer

model

The pre-trained model instance.

Type: transformers.AutoModel

embed_text(text_input: str) → torch.Tensor: Embeds a given text using the pre-trained model and pooling function.

embed_text(text_input: str) → Tensor

Embeds a given text using the pre-trained model and pooling function.

Parameters: text_input (str) – The text to be embedded.
Returns: The normalized embedding of the input text.
Return type: torch.Tensor
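
The description above mentions a pooling function and a normalized result. As a rough illustration of what mean pooling followed by L2 normalization usually means, here is a minimal pure-Python sketch. It is not taken from the hugger source: the helper names mean_pool and l2_normalize, the use of an attention mask to skip padding, and the toy vectors are all assumptions made for this example.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Hypothetical token vectors: two real tokens followed by one padding position.
token_vectors = [[1.0, 2.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)  # padding row is excluded
embedding = l2_normalize(pooled)                   # unit-length sentence vector
```

With real models the same idea is normally applied to the last hidden state as tensor operations; the list-based version here only shows the arithmetic.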

property model: AutoModel

Returns the pre-trained model instance.

Returns: The loaded model instance.
Return type: transformers.AutoModel

property pooling: str

Returns the pooling method used for generating embeddings.

Returns: The pooling method.
Return type: str

property tokenizer: AutoTokenizer

Returns the pre-trained tokenizer instance.

Returns: The loaded tokenizer instance.
Return type: transformers.AutoTokenizer

class hugger.NodeMapper(df: DataFrame, text_col: str, id_col: str = 'id', model_name: str = 'cambridgeltl/SapBERT-from-PubMedBERT-fulltext', *, pooling: str = 'mean_pooling', padding: Annotated[bool, Strict(strict=True)] = True, truncation: Annotated[bool, Strict(strict=True)] = True, return_tensors: Literal['pt', 'np', 'tf', 'jax', None] = 'pt', max_length: int = 512, **tokenizer_kwargs)

Bases: HuggingMapper

A class for mapping nodes to their corresponding text embeddings using a Hugging Face model.

This class extends HuggingMapper to handle a pandas DataFrame containing node IDs and their associated text. It provides methods to generate embeddings for each node, find similar nodes based on a given input text, and visualize embeddings.

Parameters:

df (pandas.DataFrame) – DataFrame containing the node IDs and their corresponding text.
text_col (str) – The name of the column in the DataFrame that contains the text to be embedded.
id_col (str, optional) – The name of the column in the DataFrame that contains the node IDs (default is “id”).
model_name (str, optional) – The name of the pre-trained model to be used for generating embeddings (default is “cambridgeltl/SapBERT-from-PubMedBERT-fulltext”).
pooling (str, optional) – The pooling method to be used for generating embeddings (“mean_pooling” or “attention_pooling”, default is “mean_pooling”).
padding (bool, optional) – Whether to pad sequences to the same length (default is True).
truncation (bool, optional) – Whether to truncate sequences to the maximum length (default is True).
return_tensors (str or None, optional) – The type of tensors to return (“pt”, “np”, “tf”, “jax”, or None, default is “pt”).
max_length (int, optional) – The maximum sequence length (default is 512).
**tokenizer_kwargs – Additional keyword arguments to be passed to the tokenizer.

df

The DataFrame containing the node IDs and their corresponding text.

Type: pandas.DataFrame

text_col

The name of the column in the DataFrame that contains the text to be embedded.

Type: str

id_col

The name of the column in the DataFrame that contains the node IDs.

Type: str

mapping

A dictionary mapping node IDs to their corresponding text.

Type: dict

mapping_embeddings

A dictionary mapping node IDs to their corresponding embeddings.

Type: dict

get_similar(input_text: str, threshold: float = 0, top_k: int | None = None, metric: str = 'cosine') → dict: Finds similar items in the mapping based on a similarity threshold.

get_match(input_text: str, threshold: float = 0, metric: str = 'cosine') → tuple: Finds the best match for the input text from the mapping based on a similarity threshold.

to_numpy() → dict: Converts the mapping embeddings to a dictionary of NumPy arrays.

plot_tsne(random_state: int = 42, title: str = 't-SNE of Node Embeddings', labels: dict | None = None, tsne_kwargs: dict | None = None, px_scatter_kwargs: dict | None = None): Visualizes the node embeddings using t-SNE and Plotly.

embeddings_df: pandas.DataFrame: Returns a DataFrame containing the node IDs and their corresponding embeddings.

property embeddings_df: DataFrame

Returns a DataFrame containing the node IDs and their corresponding embeddings. The DataFrame is constructed from the mapping of node IDs to their embeddings, with the node IDs as the index.

Returns: A DataFrame where the index consists of node IDs and the columns contain the corresponding embeddings.
Return type: pandas.DataFrame

get_match(input_text: str, *, threshold: float = 0, metric: str = 'cosine') → list

Finds the best match for the input text from the mapping based on a similarity threshold.

Parameters:

input_text (str) – The input text to find a match for.
threshold (float) – The minimum similarity score required to consider a match valid (default is 0).
metric (str) – The similarity metric to use for comparison (default is “cosine”).

Returns:

A tuple containing the ID of the best match and its corresponding metadata. The metadata includes the text of the match and its similarity score. If no match is found above the threshold, returns (None, None).

Return type:

tuple

Raises:

TypeError – If input_text is not a string or if metric is not a string.
ValueError – If metric is not one of the supported similarity metrics (“cosine” or “jaccard”).

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"id": ["n1", "n2"], "text": ["hello", "world"]})
>>> mapper = NodeMapper(df, text_col='text', id_col='id')
Loading tokenizer for model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext
Loading model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext
Generating embeddings for 2 nodes ...
>>> best_match_id, metadata = mapper.get_match("earth", threshold=0.8, metric="cosine")
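
To make the documented return contract concrete (best ID plus metadata, or (None, None) when nothing clears the threshold), here is a small standalone sketch using cosine similarity on hand-written vectors. It does not call hugger and is not its implementation: the function best_match, the metadata keys "text" and "score", and the inclusive threshold comparison are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec, texts, embeddings, threshold=0.0):
    """Return (id, metadata) for the highest-scoring node, or (None, None)."""
    best_id, best_score = None, None
    for node_id, vec in embeddings.items():
        score = cosine(query_vec, vec)
        # Assumption: a score equal to the threshold still counts as a match.
        if score >= threshold and (best_score is None or score > best_score):
            best_id, best_score = node_id, score
    if best_id is None:
        return None, None
    return best_id, {"text": texts[best_id], "score": best_score}


# Toy data standing in for real node texts and their embeddings.
texts = {"n1": "hello", "n2": "world"}
embeddings = {"n1": [1.0, 0.0], "n2": [0.0, 1.0]}

match_id, metadata = best_match([1.0, 0.2], texts, embeddings, threshold=0.8)
no_match = best_match([1.0, 0.2], texts, embeddings, threshold=0.99)
```

Callers that unpack the result into two names, as in the example above, should check the first element for None before using the metadata.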

get_similar(input_text: str, *, threshold: float = 0, top_k: int | None = None, metric: str = 'cosine') → list

Finds similar items in the mapping based on a similarity threshold.

Parameters:

input_text (str) – The input text to find similar items for.
threshold (float) – The minimum similarity score required to consider an item similar (default is 0).
top_k (Optional[int]) – The maximum number of similar items to return (default is None, meaning all similar items).
metric (str) – The similarity metric to use for comparison (default is “cosine”).

Returns:

A dictionary containing the IDs of similar items as keys and their corresponding metadata (text and similarity score) as values. The dictionary is sorted in descending order by score.

Return type:

dict

Raises:

TypeError – If input_text is not a string or if metric is not a string.
ValueError – If metric is not one of the supported similarity metrics (“cosine” or “jaccard”).

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"id": ["n1", "n2"], "text": ["hello", "world"]})
>>> mapper = NodeMapper(df, text_col='text', id_col='id')
Loading tokenizer for model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext
Loading model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext
Generating embeddings for 2 nodes ...
>>> similar_items = mapper.get_similar("planet", threshold=0.8, metric="cosine")
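
The interaction of threshold, top_k, and the descending sort can be sketched without loading a model. The following is an illustrative stand-in, not the hugger implementation: rank_similar, the metadata keys "text" and "score", and the choice to filter by threshold before truncating to top_k are assumptions, and the vectors are hand-written toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query_vec, texts, embeddings, threshold=0.0, top_k=None):
    """Return {id: {"text", "score"}} for nodes at or above threshold, best first."""
    scored = [
        (node_id, {"text": texts[node_id], "score": cosine(query_vec, vec)})
        for node_id, vec in embeddings.items()
    ]
    # Assumption: filter by threshold first, then sort, then keep the top_k best.
    kept = [item for item in scored if item[1]["score"] >= threshold]
    kept.sort(key=lambda item: item[1]["score"], reverse=True)
    if top_k is not None:
        kept = kept[:top_k]
    return dict(kept)  # dicts preserve insertion order, so the ranking survives


# Toy data standing in for real node texts and their embeddings.
texts = {"n1": "hello", "n2": "world", "n3": "planet"}
embeddings = {"n1": [1.0, 0.0], "n2": [0.0, 1.0], "n3": [1.0, 1.0]}

similar = rank_similar([1.0, 0.0], texts, embeddings, threshold=0.5)
top_one = rank_similar([1.0, 0.0], texts, embeddings, threshold=0.5, top_k=1)
```

Here "n2" is orthogonal to the query and falls below the threshold, so only "n1" and "n3" are returned, in that order.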

property mapping: dict

Returns the mapping of node IDs to their corresponding text.

Returns: A dictionary where keys are node IDs and values are the corresponding text.
Return type: dict

property mapping_embeddings: dict

Returns the mapping of node IDs to their corresponding embeddings.

Returns: A dictionary where keys are node IDs and values are the corresponding embeddings.
Return type: dict

plot_tsne(random_state: int = 42, title: str = 't-SNE of Node Embeddings', labels: dict | None = None, tsne_kwargs: dict | None = None, px_scatter_kwargs: dict | None = None)

Quick t-SNE visualization of the node embeddings.

Parameters:

random_state (int) – The random seed for reproducibility (default is 42).
title (str) – The title of the plot (default is “t-SNE of Node Embeddings”).
labels (Optional[dict]) – A dictionary mapping node IDs to labels for the plot. If None, the axes are labeled “t-SNE 1” and “t-SNE 2” (default is None).
tsne_kwargs (Optional[dict]) – Additional keyword arguments to pass to the TSNE constructor (default is None).
px_scatter_kwargs (Optional[dict]) – Additional keyword arguments to pass to the Plotly Express scatter function (default is None).

to_numpy()

Converts the mapping embeddings to a dictionary of NumPy arrays.

Returns: A dictionary where keys are node IDs and values are the corresponding embeddings as NumPy arrays.
Return type: dict
