HuggingMapper Tutorial
In this notebook, we demonstrate how to use the HuggingMapper class to generate text embeddings using state-of-the-art Hugging Face transformer models.
The HuggingMapper provides a simple interface for loading transformer models, tokenizing text, and extracting normalized embeddings.
Here we cover:
- Initializing the HuggingMapper with a pre-trained model
- Generating embeddings for individual texts and lists of texts
Let’s get started!
# uncomment if running on Colab
#!pip install pandas hugging-mapper
Instantiating HuggingMapper immediately loads the given Hugging Face model.
More info on the following warning:
"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads."
What is an HF_TOKEN / huggingface user access token?
If you have an HF_TOKEN, you can add it to your environment variables or repository secrets, or you can save it in a .env file and load it in your virtual environment with the python-dotenv package.
More information on getting a Hugging Face user access token: their docs
For example:
- Save the token to an "HF_TOKEN" variable. Example .env file:
HF_TOKEN=hf***...
- Access the .env variables via python-dotenv, e.g.
from dotenv import load_dotenv
load_dotenv()
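To confirm that the token described above is actually visible to Python before loading a model, something like the following could be used. This is only a minimal sketch: the helper name `hf_token_status` is hypothetical (it is not part of HuggingMapper), it assumes the token is stored under the name HF_TOKEN, and python-dotenv is treated as optional.

```python
import os


def hf_token_status(env=None):
    """Report whether HF_TOKEN is set (hypothetical helper, not part of HuggingMapper)."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    if not token:
        return "HF_TOKEN is not set; requests to the Hub will be unauthenticated."
    # Never print the full token; show only a short masked prefix.
    return f"HF_TOKEN is set ({token[:3]}***)."


try:
    # python-dotenv is optional here; load_dotenv() reads a local .env file if one exists.
    from dotenv import load_dotenv

    load_dotenv()
except ImportError:
    pass

print(hf_token_status())
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.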
from hugger.mapper import HuggingMapper
# init
mapper = HuggingMapper(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key | Status | |
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED | |
Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Get embeddings for given text
# generate embedding for a single text
embedding = mapper.embed_text("Good morning")
print(embedding.shape)
# generate embeddings for a list of texts
embeddings = mapper.embed_text(["Hello world", "Good evening", "Lunch time!"])
print(embeddings.shape)
torch.Size([1, 384])
torch.Size([3, 384])
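The shapes printed above have one row per input text and 384 columns, so each row can be compared with any other row. A common next step is cosine similarity between two embeddings. The following is a dependency-free sketch, not part of the HuggingMapper API: the function `cosine_similarity` and the stand-in vectors are illustrative assumptions, and real rows would first need to be converted to plain lists (for a torch tensor, `.tolist()` does this).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in vectors for illustration only. With real embeddings, one could pass
# rows such as embeddings[0].tolist() and embeddings[1].tolist() (assumes the
# torch tensors printed above).
v1 = [1.0, 0.0, 0.0]
v2 = [0.6, 0.8, 0.0]
similarity = cosine_similarity(v1, v2)
print(round(similarity, 3))
```

Because the embeddings are described as normalized, the two norms should be close to 1, in which case the result is essentially the dot product of the two rows.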