HuggingMapper Tutorial



This notebook demonstrates how to use the HuggingMapper class to generate text embeddings with Hugging Face transformer models.

The HuggingMapper provides a simple interface for loading transformer models, tokenizing text, and extracting normalized embeddings.

This notebook covers:

  • Initializing the HuggingMapper with a pre-trained model

  • Generating an embedding for a single text

  • Generating embeddings for a list of texts

Let’s get started!

# Uncomment the next line when running in Colab
#!pip install pandas hugging-mapper

Instantiating HuggingMapper loads the given Hugging Face model right away.

More info on the following warning:
"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads."

What is an HF_TOKEN / huggingface user access token?

If you have an HF_TOKEN, you can add it to your environment variables or repository secrets. In a virtual environment, you can also save it in a .env file and load it with the python-dotenv package.

For example:

  1. Get a Hugging Face user access token (see the Hugging Face docs for details)

  2. Save it in a variable named HF_TOKEN

    Example .env file:

    HF_TOKEN=hf***...
    
  3. Load the .env variables with python-dotenv:

    from dotenv import load_dotenv
    load_dotenv()
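To confirm that the token is actually visible to Python after these steps, a small check like the one below may help. This is a minimal sketch: describe_hf_token is a hypothetical helper written for this tutorial, not part of hugging-mapper or python-dotenv, and the import is guarded so the cell still runs when python-dotenv is not installed.

```python
import os


def describe_hf_token(env=None):
    """Return a short, non-secret status string for HF_TOKEN.

    Hypothetical helper for this tutorial; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    if not token:
        return "HF_TOKEN is not set"
    # Show only the first three characters so the secret never lands in notebook output.
    return f"HF_TOKEN is set ({token[:3]}..., {len(token)} characters)"


try:
    from dotenv import load_dotenv  # third-party package: python-dotenv

    load_dotenv()  # loads variables from a .env file, if one is found
except ImportError:
    pass  # python-dotenv is not installed; rely on the existing environment

print(describe_hf_token())
```

Printing only a prefix and the length keeps the full token out of saved notebook outputs, which matters if the notebook is later shared or committed.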
    

from hugger.mapper import HuggingMapper

# init
mapper = HuggingMapper(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.

Get embeddings for given text

# generate embedding for a single text
embedding = mapper.embed_text("Good morning")
print(embedding.shape)

# generate embeddings for a list of texts
embeddings = mapper.embed_text(["Hello world", "Good evening", "Lunch time!"])
print(embeddings.shape)
torch.Size([1, 384])
torch.Size([3, 384])
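A common next step is to compare the embeddings with cosine similarity. The sketch below is illustrative only: it uses small hand-written vectors in place of the real 384-dimensional output, the helper names l2_normalize and cosine_matrix are hypothetical, and it assumes that "normalized" means L2-normalized (unit length). Under that assumption, the cosine similarity of two embeddings is simply their dot product.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_matrix(rows):
    """Pairwise cosine similarity: dot products of unit-length rows."""
    unit = [l2_normalize(r) for r in rows]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


# Toy stand-ins for three embeddings (assumption: real ones have 384 dimensions).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 2.0],
]

similarities = cosine_matrix(toy_embeddings)
for row in similarities:
    print([round(s, 2) for s in row])
```

With the tensors returned above, the same pairwise comparison could be written as embeddings @ embeddings.T, provided the rows are already unit length.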