NodeMapper tutorial


In this demonstration, we learn how to use the NodeMapper class to

  • create a map from ids to text embeddings and

  • perform similarity search using Hugging Face models.

NodeMapper extends HuggingMapper by allowing you to:

  • Find similar nodes based on their texts, returning associated ids

  • Retrieve the best match or top-k matches for a given input

  • Visualize the embeddings with t-SNE

Let’s get started!

# uncomment if colab
#!pip install pandas hugging-mapper

First, we generate demo data for the tutorial.

import pandas as pd

# Generate example ids and texts
ids = ["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"]
texts = [
    "They are happy",
    "I would like to order a doughnut",
    "The grass is green",
    "They are sad",
    "Have you poured the foundation?",
    "I am feeling grey",
    "blue",
    "home",
]
# to dataframe
df = pd.DataFrame({"id": ids, "text": texts})

Initializing NodeMapper will

  • load the given Hugging Face model

  • generate embeddings for the text column

  • create a dictionary mapping node ids to text embeddings
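
To make the last step concrete, here is a minimal sketch of an id-to-embedding dictionary. It does not use NodeMapper or a Hugging Face model: `toy_embed` is a hypothetical stand-in that hashes words into a small vector so the example runs without any downloads. How NodeMapper stores its dictionary internally may differ.

```python
import math
import zlib


def toy_embed(text, dim=8):
    """Hypothetical stand-in for an embedding model (not part of hugging-mapper).

    Hashes each word into one of `dim` buckets and L2-normalises the counts.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = zlib.crc32(word.encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


ids = ["id1", "id2", "id3"]
texts = ["They are happy", "The grass is green", "They are sad"]

# Map each node id to the embedding of its text
id_to_embedding = {node_id: toy_embed(text) for node_id, text in zip(ids, texts)}

print(list(id_to_embedding))        # the node ids
print(len(id_to_embedding["id1"]))  # embedding dimension of the toy model
```

With a real sentence-embedding model the values would be model outputs rather than hashed word counts, but the lookup structure is the same idea: one vector per id.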

More information on the warning:
"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads."

What is an HF_TOKEN / huggingface user access token?

If you have an HF_TOKEN, you can add it to your environment variables or repository secrets. In a virtual environment, you can also save the HF_TOKEN in a .env file and load it with the python-dotenv package.

For example:

  1. Get a Hugging Face user access token (see the Hugging Face docs for details)

  2. Save it in a variable named “HF_TOKEN”

    Example .env file:

    HF_TOKEN=hf***...
    
  3. Access the .env variables via python-dotenv

    e.g.

    from dotenv import load_dotenv
    load_dotenv()
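
After loading the .env file, one quick way to confirm that the token is visible to Python is to look it up in `os.environ`. The helper name `hf_token_status` is made up for this sketch. It only reports whether the variable is present and never prints the value.

```python
import os


def hf_token_status(env=None):
    """Report whether HF_TOKEN is set, without revealing the token itself."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    return "HF_TOKEN is set" if token else "HF_TOKEN is not set"


print(hf_token_status())
```

If this reports that the token is not set, check that the .env file is in the working directory and that `load_dotenv()` ran before the check.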
    

from hugger.mapper import NodeMapper
# init
mapper = NodeMapper(
    df=df,
    text_col="text",
    id_col="id",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Generating embeddings for 8 nodes ...

Like HuggingMapper, NodeMapper can also return the embeddings for a given text or list of texts.

# generate embedding for a single text
embedding = mapper.embed_text("Good morning")
print(embedding.shape)

# generate embeddings for a list of texts
embeddings = mapper.embed_text(["Hello world", "Good evening", "Lunch time!"])
print(embeddings.shape)
torch.Size([1, 384])
torch.Size([3, 384])

The main purpose of NodeMapper, however, is to find the texts most similar to a given input and return them with their corresponding ids.

# retrieve those most similar to given text, above threshold
mapper.get_similar("concrete", threshold=0)  # threshold 0 returns all
{'id5': {'text': 'Have you poured the foundation?',
  'score': 0.40414732694625854},
 'id8': {'text': 'home', 'score': 0.2964898943901062},
 'id7': {'text': 'blue', 'score': 0.27229541540145874},
 'id3': {'text': 'The grass is green', 'score': 0.1960599422454834},
 'id6': {'text': 'I am feeling grey', 'score': 0.16133837401866913},
 'id2': {'text': 'I would like to order a doughnut',
  'score': 0.1378764808177948},
 'id1': {'text': 'They are happy', 'score': 0.12416304647922516},
 'id4': {'text': 'They are sad', 'score': 0.11256895959377289}}
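
The tutorial does not say how the scores are computed. A common choice for sentence embeddings is cosine similarity, so the sketch below assumes that. It uses tiny hand-written vectors instead of model embeddings, and the function names `cosine` and `similar` are illustrative, not part of hugging-mapper. Whether the real threshold is inclusive is also an assumption here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def similar(query_vec, id_to_vec, threshold=0.0):
    """Score every id against the query, keep scores >= threshold, sort descending."""
    scores = {node_id: cosine(query_vec, vec) for node_id, vec in id_to_vec.items()}
    kept = {node_id: s for node_id, s in scores.items() if s >= threshold}
    return dict(sorted(kept.items(), key=lambda kv: kv[1], reverse=True))


vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
result = similar([1.0, 0.0], vectors, threshold=0.5)
print(result)  # "c" is orthogonal to the query and falls below the threshold
```
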
# retrieve top match, above threshold
print(mapper.get_match("joyful", threshold=0.4), "\n")
print(mapper.get_match("we are crying", threshold=0.4), "\n")
print(mapper.get_match("eatting a donut", threshold=0.4), "\n")
('id1', {'text': 'They are happy', 'score': 0.5418257713317871}) 

('id4', {'text': 'They are sad', 'score': 0.5035915374755859}) 

('id2', {'text': 'I would like to order a doughnut', 'score': 0.5166528224945068})
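
Conceptually, picking the top match amounts to taking the highest score and checking it against the threshold. The sketch below applies that idea to the scores printed earlier for the query "concrete". The function `best_match` is hypothetical, and returning `None` when nothing clears the threshold is an assumption, since the tutorial does not show that case.

```python
# Scores copied from the get_similar("concrete", threshold=0) output earlier
scores = {
    "id5": 0.40414732694625854,
    "id8": 0.2964898943901062,
    "id7": 0.27229541540145874,
    "id3": 0.1960599422454834,
    "id6": 0.16133837401866913,
    "id2": 0.1378764808177948,
    "id1": 0.12416304647922516,
    "id4": 0.11256895959377289,
}


def best_match(scores, threshold=0.0):
    """Return (id, score) for the highest score, or None if it is below threshold."""
    best_id = max(scores, key=scores.get)
    if scores[best_id] < threshold:
        return None  # assumed behaviour; the library may handle this differently
    return best_id, scores[best_id]


print(best_match(scores, threshold=0.4))
print(best_match(scores, threshold=0.5))
```
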
# retrieve top k matches, above threshold
print(mapper.get_similar("yellow", threshold=0.3, top_k=2), "\n")
print(mapper.get_similar("laughter", top_k=3), "\n")
{'id7': {'text': 'blue', 'score': 0.6618930697441101}, 'id6': {'text': 'I am feeling grey', 'score': 0.3539319336414337}} 

{'id1': {'text': 'They are happy', 'score': 0.42021429538726807}, 'id4': {'text': 'They are sad', 'score': 0.36576390266418457}, 'id8': {'text': 'home', 'score': 0.3427163362503052}}
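
Combining `threshold` and `top_k` means the result can hold fewer than k entries when not enough scores clear the threshold. Here is a minimal sketch of that interaction using `heapq.nlargest` on a subset of the "concrete" scores shown earlier. The function name `top_k_above` is made up for illustration and is not the library's implementation.

```python
import heapq

# A subset of the scores from the get_similar("concrete", threshold=0) output
scores = {
    "id5": 0.40414732694625854,
    "id8": 0.2964898943901062,
    "id7": 0.27229541540145874,
    "id3": 0.1960599422454834,
    "id6": 0.16133837401866913,
}


def top_k_above(scores, threshold=0.0, top_k=None):
    """Keep scores >= threshold, then return at most top_k of them, best first."""
    kept = [(node_id, s) for node_id, s in scores.items() if s >= threshold]
    if top_k is None:
        ranked = sorted(kept, key=lambda kv: kv[1], reverse=True)
    else:
        ranked = heapq.nlargest(top_k, kept, key=lambda kv: kv[1])
    return dict(ranked)


print(top_k_above(scores, threshold=0.25, top_k=5))  # only three ids clear 0.25
print(top_k_above(scores, top_k=2))
```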