NodeMapper tutorial
In this demonstration we learn how to use the NodeMapper class to
create a map from ids to text embeddings and
perform similarity search using Hugging Face models.
NodeMapper extends HuggingMapper by allowing you to:
Find similar nodes based on their texts, returning associated ids
Retrieve the best match or top-k matches for a given input
Visualize the embeddings in a t-SNE plot
Let’s get started!
# uncomment if colab
#!pip install pandas hugging-mapper
First, we generate demo data for the tutorial:
import pandas as pd
# An example dataframe
# generate data
ids = ["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"]
texts = [
"They are happy",
"I would like to order a doughnut",
"The grass is green",
"They are sad",
"Have you poured the foundation?",
"I am feeling grey",
"blue",
"home",
]
# to dataframe
df = pd.DataFrame({"id": ids, "text": texts})
Initializing NodeMapper will:
load the given Hugging Face model
generate embeddings for the text column
create a dictionary mapping node ids to text embeddings
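To make the steps above concrete, here is a minimal sketch of the idea behind the id-to-embedding dictionary. It is not NodeMapper's implementation: `toy_embed` and `build_id_to_embedding` are hypothetical names, and the toy function stands in for a real Hugging Face model so that the snippet runs without any downloads.

```python
from typing import Callable, Dict, List, Sequence


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for a model: buckets characters into a fixed-size vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    return vec


def build_id_to_embedding(
    ids: Sequence[str],
    texts: Sequence[str],
    embed: Callable[[str], List[float]] = toy_embed,
) -> Dict[str, List[float]]:
    """Pair each id with the embedding of its text."""
    if len(ids) != len(texts):
        raise ValueError("ids and texts must have the same length")
    return {node_id: embed(text) for node_id, text in zip(ids, texts)}


# Two rows from the demo data above
id_to_embedding = build_id_to_embedding(
    ["id1", "id4"], ["They are happy", "They are sad"]
)
print(len(id_to_embedding), len(id_to_embedding["id1"]))
```

With a real model, only the `embed` callable would change; the dictionary structure stays the same.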
More info on the message "Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.":
What is an HF_TOKEN? It is a Hugging Face user access token. For more information on getting one, see the Hugging Face docs.
If you have an HF_TOKEN, you can add it to your environment variables or repository secrets, or you can save it in an .env file and load it in your venv with the python-dotenv package.
For example:
Save the token under the name “HF_TOKEN” in an .env file:
HF_TOKEN=hf***...
Then load the .env variables via python-dotenv:
from dotenv import load_dotenv
load_dotenv()
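Once the variable has been loaded, a quick check that it is actually visible to Python can save some confusion. The sketch below uses only the standard library; `get_hf_token` is a hypothetical helper, not part of hugging-mapper or huggingface_hub.

```python
import os
from typing import Mapping, Optional


def get_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the HF_TOKEN value if it is set and non-empty, else None."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if get_hf_token() is None:
    print("HF_TOKEN is not set; requests to the HF Hub will be unauthenticated.")
else:
    print("HF_TOKEN found.")
```

The helper never prints the token itself, which keeps it out of notebook output.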
from hugger.mapper import NodeMapper
# init
mapper = NodeMapper(
df=df,
text_col="text",
id_col="id",
model_name="sentence-transformers/all-MiniLM-L6-v2",
)
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key | Status | |
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED | |
Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Generating embeddings for 8 nodes ...
Like HuggingMapper, NodeMapper can generate embeddings for a given text or list of texts:
# generate embedding for a single text
embedding = mapper.embed_text("Good morning")
print(embedding.shape)
# generate embeddings for a list of texts
embeddings = mapper.embed_text(["Hello world", "Good evening", "Lunch time!"])
print(embeddings.shape)
torch.Size([1, 384])
torch.Size([3, 384])
The main purpose of NodeMapper, however, is to find similar texts and their corresponding ids for a given text input:
# retrieve those most similar to given text, above threshold
mapper.get_similar("concrete", threshold=0) # threshold 0 returns all
{'id5': {'text': 'Have you poured the foundation?',
'score': 0.40414732694625854},
'id8': {'text': 'home', 'score': 0.2964898943901062},
'id7': {'text': 'blue', 'score': 0.27229541540145874},
'id3': {'text': 'The grass is green', 'score': 0.1960599422454834},
'id6': {'text': 'I am feeling grey', 'score': 0.16133837401866913},
'id2': {'text': 'I would like to order a doughnut',
'score': 0.1378764808177948},
'id1': {'text': 'They are happy', 'score': 0.12416304647922516},
'id4': {'text': 'They are sad', 'score': 0.11256895959377289}}
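For intuition about how such a ranking can be produced, the sketch below scores toy vectors against a query and keeps those at or above a threshold. It assumes the score is a cosine similarity, which the tutorial does not state. `cosine_similarity` and `rank_by_similarity` are hypothetical names, and the 2-D vectors are invented for illustration.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: List[float],
    id_to_embedding: Dict[str, List[float]],
    id_to_text: Dict[str, str],
    threshold: float = 0.0,
) -> Dict[str, dict]:
    """Score every node against the query, filter by threshold, sort descending."""
    scored = {}
    for node_id, emb in id_to_embedding.items():
        score = cosine_similarity(query, emb)
        if score >= threshold:  # assumption: the threshold is inclusive
            scored[node_id] = {"text": id_to_text[node_id], "score": score}
    return dict(
        sorted(scored.items(), key=lambda item: item[1]["score"], reverse=True)
    )


# Toy 2-D vectors (real embeddings in this tutorial have 384 dimensions)
toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_texts = {"a": "east", "b": "north", "c": "north-east"}
ranked = rank_by_similarity([1.0, 0.1], toy_embeddings, toy_texts, threshold=0.5)
print(list(ranked))
```

The returned dictionary mirrors the shape of the output above: ids as keys, each mapped to its text and score.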
# retrieve top match, above threshold
print(mapper.get_match("joyful", threshold=0.4), "\n")
print(mapper.get_match("we are crying", threshold=0.4), "\n")
print(mapper.get_match("eatting a donut", threshold=0.4), "\n")
('id1', {'text': 'They are happy', 'score': 0.5418257713317871})
('id4', {'text': 'They are sad', 'score': 0.5035915374755859})
('id2', {'text': 'I would like to order a doughnut', 'score': 0.5166528224945068})
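Picking a single best match from a result dictionary can be sketched as follows. `best_match` is a hypothetical helper, and returning `None` when nothing reaches the threshold is an assumption; the tutorial does not show what `get_match` does in that case.

```python
from typing import Dict, Optional, Tuple


def best_match(
    results: Dict[str, dict], threshold: float = 0.0
) -> Optional[Tuple[str, dict]]:
    """Return (id, info) for the highest score, or None if it is below threshold."""
    if not results:
        return None
    best_id = max(results, key=lambda node_id: results[node_id]["score"])
    if results[best_id]["score"] < threshold:
        return None  # assumption: no match is reported as None
    return best_id, results[best_id]


# Subset of the get_similar("concrete") output shown earlier, scores rounded
results = {
    "id5": {"text": "Have you poured the foundation?", "score": 0.404},
    "id8": {"text": "home", "score": 0.296},
    "id7": {"text": "blue", "score": 0.272},
}
top = best_match(results, threshold=0.4)
nothing = best_match(results, threshold=0.5)
print(top, nothing)
```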
# retrieve top k matches, above threshold
print(mapper.get_similar("yellow", threshold=0.3, top_k=2), "\n")
print(mapper.get_similar("laughter", top_k=3), "\n")
{'id7': {'text': 'blue', 'score': 0.6618930697441101}, 'id6': {'text': 'I am feeling grey', 'score': 0.3539319336414337}}
{'id1': {'text': 'They are happy', 'score': 0.42021429538726807}, 'id4': {'text': 'They are sad', 'score': 0.36576390266418457}, 'id8': {'text': 'home', 'score': 0.3427163362503052}}
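Limiting a result dictionary to its k highest scores can be sketched with the standard library's `heapq.nlargest`. `top_k_results` is a hypothetical helper for illustration, not NodeMapper's code, and the input reuses rounded scores from the "concrete" query shown earlier in a shuffled order.

```python
import heapq
from typing import Dict


def top_k_results(results: Dict[str, dict], k: int) -> Dict[str, dict]:
    """Keep the k entries with the highest scores, ordered best first."""
    top_ids = heapq.nlargest(
        k, results, key=lambda node_id: results[node_id]["score"]
    )
    return {node_id: results[node_id] for node_id in top_ids}


# Rounded scores from the "concrete" query, deliberately out of order
scores = {
    "id7": {"text": "blue", "score": 0.272},
    "id5": {"text": "Have you poured the foundation?", "score": 0.404},
    "id3": {"text": "The grass is green", "score": 0.196},
    "id8": {"text": "home", "score": 0.296},
}
print(top_k_results(scores, 2))
```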