NodeMapper tutorial

In this demonstration we learn how to use the NodeMapper class to

create a map from ids to text embeddings and
perform similarity search using Hugging Face models.

NodeMapper extends HuggingMapper by allowing you to:

Find similar nodes based on their texts, returning associated ids
Retrieve the best match or top-k matches for a given input
Visualize the embeddings in a t-SNE plot

Let’s get started!

# uncomment if colab
#!pip install pandas hugging-mapper

First, we generate demo data for the tutorial.

import pandas as pd

# Example data: node ids and their texts
ids = ["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"]
texts = [
    "They are happy",
    "I would like to order a doughnut",
    "The grass is green",
    "They are sad",
    "Have you poured the foundation?",
    "I am feeling grey",
    "blue",
    "home",
]
# to dataframe
df = pd.DataFrame({"id": ids, "text": texts})

Initializing NodeMapper will

load the given Hugging Face model
generate embeddings for the text column
create a dictionary mapping node ids to text embeddings
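
The three steps above can be pictured with a small standalone sketch. It does not load a Hugging Face model: `toy_embed` is a hypothetical stand-in that simply counts letters, and `build_embedding_map` is an illustrative name, not part of the library. The point is only the shape of the result, a dictionary from node ids to embedding vectors.

```python
from collections import Counter
import string


def toy_embed(text):
    """Hypothetical stand-in for a model: a 26-dim letter-count vector."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return [float(counts[ch]) for ch in string.ascii_lowercase]


def build_embedding_map(ids, texts, embed=toy_embed):
    """Map each node id to the embedding of its text (illustrative only)."""
    if len(ids) != len(texts):
        raise ValueError("ids and texts must have the same length")
    return {node_id: embed(text) for node_id, text in zip(ids, texts)}


demo_map = build_embedding_map(["id7", "id8"], ["blue", "home"])
print(sorted(demo_map))      # ['id7', 'id8']
print(len(demo_map["id7"]))  # 26
```

A real embedding model replaces `toy_embed`; the id-to-vector dictionary stays the same idea.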

More info on the following warning:
"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads."

What is an HF_TOKEN / Hugging Face user access token?

User Access Tokens are the preferred way to authenticate an application or notebook to Hugging Face services. You can manage your access tokens in your settings.

If you have an HF_TOKEN, you can add it to your environment variables or repository secrets, and/or you can save it in a .env file and load it in your venv with the python-dotenv package.

For example:

Get a Hugging Face user access token (see the Hugging Face docs for more information)
Save it to an “HF_TOKEN” variable

Example .env file:
```
HF_TOKEN=hf***...
```

Access the .env variables via python-dotenv, e.g.:

from dotenv import load_dotenv
load_dotenv()
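
After `load_dotenv()` runs, the token should be visible as an ordinary environment variable. A minimal check, assuming the variable is named `HF_TOKEN` as above; the helper name `has_hf_token` is made up for this sketch, and it accepts the environment as an argument so it can be tried without touching real secrets.

```python
import os


def has_hf_token(environ=None):
    """Return True if a non-empty HF_TOKEN is present in the environment."""
    env = os.environ if environ is None else environ
    return bool(env.get("HF_TOKEN", "").strip())


if has_hf_token():
    print("HF_TOKEN found")
else:
    print("HF_TOKEN not set; Hub requests will be unauthenticated")
```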

from hugger.mapper import NodeMapper
# init
mapper = NodeMapper(
    df=df,
    text_col="text",
    id_col="id",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)

Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
WARNING:huggingface_hub.utils._http:Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.

Generating embeddings for 8 nodes ...

Like HuggingMapper, NodeMapper can also generate embeddings for a given text or list of texts.

# generate embedding for a single text
embedding = mapper.embed_text("Good morning")
print(embedding.shape)

# generate embeddings for a list of texts
embeddings = mapper.embed_text(["Hello world", "Good evening", "Lunch time!"])
print(embeddings.shape)

torch.Size([1, 384])
torch.Size([3, 384])
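
Each text becomes one 384-dimensional row, so comparing two texts comes down to comparing two vectors. The tutorial does not say which metric NodeMapper uses for its scores; cosine similarity is a common choice for sentence embeddings and is assumed here purely for illustration. The sketch uses plain Python lists and short made-up vectors rather than real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```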

The main purpose of NodeMapper, however, is to find similar texts and their corresponding ids for a given text input.

# retrieve those most similar to given text, above threshold
mapper.get_similar("concrete", threshold=0)  # threshold 0 returns all

{'id5': {'text': 'Have you poured the foundation?',
  'score': 0.40414732694625854},
 'id8': {'text': 'home', 'score': 0.2964898943901062},
 'id7': {'text': 'blue', 'score': 0.27229541540145874},
 'id3': {'text': 'The grass is green', 'score': 0.1960599422454834},
 'id6': {'text': 'I am feeling grey', 'score': 0.16133837401866913},
 'id2': {'text': 'I would like to order a doughnut',
  'score': 0.1378764808177948},
 'id1': {'text': 'They are happy', 'score': 0.12416304647922516},
 'id4': {'text': 'They are sad', 'score': 0.11256895959377289}}
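
To see what `threshold` does to a result like the one above, the filter can be reproduced on a returned dictionary. This is a sketch of the presumed behaviour, not the library's code: `filter_by_threshold` is a hypothetical helper, and whether NodeMapper keeps scores exactly equal to the threshold is an assumption (here it does). The scores are the top four from the output above, rounded to three decimals.

```python
# Scores from the "concrete" query above, rounded for readability.
results = {
    "id5": {"text": "Have you poured the foundation?", "score": 0.404},
    "id8": {"text": "home", "score": 0.296},
    "id7": {"text": "blue", "score": 0.272},
    "id3": {"text": "The grass is green", "score": 0.196},
}


def filter_by_threshold(scored, threshold):
    """Keep entries scoring at least `threshold`, highest score first."""
    kept = [(k, v) for k, v in scored.items() if v["score"] >= threshold]
    kept.sort(key=lambda item: item[1]["score"], reverse=True)
    return dict(kept)  # dicts preserve insertion order in Python 3.7+


print(list(filter_by_threshold(results, 0.25)))  # ['id5', 'id8', 'id7']
```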

# retrieve top match, above threshold
print(mapper.get_match("joyful", threshold=0.4), "\n")
print(mapper.get_match("we are crying", threshold=0.4), "\n")
print(mapper.get_match("eatting a donut", threshold=0.4), "\n")

('id1', {'text': 'They are happy', 'score': 0.5418257713317871}) 

('id4', {'text': 'They are sad', 'score': 0.5035915374755859}) 

('id2', {'text': 'I would like to order a doughnut', 'score': 0.5166528224945068})

# retrieve top k matches, above threshold
print(mapper.get_similar("yellow", threshold=0.3, top_k=2), "\n")
print(mapper.get_similar("laughter", top_k=3), "\n")

{'id7': {'text': 'blue', 'score': 0.6618930697441101}, 'id6': {'text': 'I am feeling grey', 'score': 0.3539319336414337}} 

{'id1': {'text': 'They are happy', 'score': 0.42021429538726807}, 'id4': {'text': 'They are sad', 'score': 0.36576390266418457}, 'id8': {'text': 'home', 'score': 0.3427163362503052}}
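
The two retrieval styles differ only in how many ranked entries they keep. A sketch under stated assumptions: `top_k_matches` and `best_match` are hypothetical helpers that operate on an already-scored dictionary, and returning `None` when nothing clears the threshold is a guess, since the tutorial does not show that case. The scores are taken from the "laughter" output above, rounded to three decimals.

```python
def top_k_matches(scored, top_k=None, threshold=0.0):
    """Return up to `top_k` entries with score >= threshold, best first."""
    ranked = sorted(
        ((k, v) for k, v in scored.items() if v["score"] >= threshold),
        key=lambda item: item[1]["score"],
        reverse=True,
    )
    return dict(ranked if top_k is None else ranked[:top_k])


def best_match(scored, threshold=0.0):
    """Return (id, entry) for the single best entry, or None (assumed)."""
    top = top_k_matches(scored, top_k=1, threshold=threshold)
    return next(iter(top.items()), None)


scored = {
    "id4": {"text": "They are sad", "score": 0.366},
    "id1": {"text": "They are happy", "score": 0.420},
    "id8": {"text": "home", "score": 0.343},
}
print(list(top_k_matches(scored, top_k=2)))  # ['id1', 'id4']
print(best_match(scored, threshold=0.4)[0])  # id1
print(best_match(scored, threshold=0.5))     # None
```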