HuggingMapper Tutorial

In this notebook, we demonstrate how to use the HuggingMapper class to generate text embeddings with pre-trained Hugging Face transformer models.

The HuggingMapper provides a simple interface for loading transformer models, tokenizing text, and extracting normalized embeddings.

Here we cover:

Initializing the HuggingMapper with a pre-trained model
Generating embeddings for individual texts and for lists of texts

Let’s get started!

# uncomment if running in Colab
#!pip install pandas hugging-mapper

Instantiating HuggingMapper loads the given Hugging Face model right away, so the constructor call may take a moment.

More info on the following warning, which may appear while the model loads:
"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads."

What is an HF_TOKEN (Hugging Face user access token)?

User Access Tokens are the preferred way to authenticate an application or notebook to Hugging Face services. You can manage your access tokens in your settings.

If you have an HF_TOKEN, you can add it to your environment variables or repository secrets, or you can save it in a .env file and load it inside your virtual environment with the python-dotenv package.

For example:

Get a Hugging Face user access token (see the Hugging Face docs for details)
Save it under the variable name “HF_TOKEN”

Example .env file:
```
HF_TOKEN=hf***...
```

Load the .env variables with python-dotenv, e.g.:

from dotenv import load_dotenv
load_dotenv()
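
To confirm that the token was picked up without printing the secret itself, a check along these lines may help. This is a minimal sketch using only the standard library; `hf_token_status` is a hypothetical helper written for this example, not part of hugging-mapper or python-dotenv.

```python
import os


def hf_token_status(env=os.environ):
    """Report whether HF_TOKEN is available, without revealing its value."""
    token = env.get("HF_TOKEN")
    if not token:
        return "HF_TOKEN is not set; requests to the Hub will be unauthenticated."
    # Only the length is reported so the secret never ends up in notebook output.
    return f"HF_TOKEN is set ({len(token)} characters)."


print(hf_token_status())
```

Run this after `load_dotenv()` so that any value stored in the .env file is already present in `os.environ`.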

from hugger.mapper import HuggingMapper

# init
mapper = HuggingMapper(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.

Get embeddings for given text

# generate embedding for a single text
embedding = mapper.embed_text("Good morning")
print(embedding.shape)

# generate embeddings for a list of texts
embeddings = mapper.embed_text(["Hello world", "Good evening", "Lunch time!"])
print(embeddings.shape)

torch.Size([1, 384])
torch.Size([3, 384])
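
Since the embeddings are described as normalized, the dot product of two embeddings should equal their cosine similarity. The sketch below illustrates that relationship with plain Python and small stand-in vectors rather than real HuggingMapper output. It assumes that "normalized" means unit L2 norm, which this tutorial does not state explicitly; `l2_normalize` and `dot` are hypothetical helpers for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector so that its L2 (Euclidean) norm is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Stand-in 3-dimensional vectors (hypothetical values, not model output).
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([4.0, 3.0, 0.0])

# For unit-length vectors, the dot product is the cosine similarity.
similarity = dot(a, b)
print(round(similarity, 2))  # 0.96
```

With real embeddings the same idea applies row by row: each row of the `[3, 384]` result is one text's vector, and higher dot products indicate more similar texts.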