NodeMapper tutorial
# uncomment if colab
#!pip install pandas hugging-mapper
NodeMapper returns node ids based on the similarity between a query text and the text embeddings of the nodes.
Start by importing NodeMapper:
from hugger.mapper import NodeMapper
Demo data for the tutorial
# An example dataframe
import pandas as pd
# generate data
ids = ["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"]
texts = [
"They are happy",
"I would like to order a doughnut",
"The grass is green",
"They are sad",
"Have you poured the foundation?",
"I am feeling grey",
"blue",
"home",
]
# to dataframe
df = pd.DataFrame({"id": ids, "text": texts})
Initializing NodeMapper will:
- load the given Hugging Face model
- generate embeddings for the text column
- create a dictionary that maps each node id to its text embedding
# init
mapper = NodeMapper(
df=df,
text_col="text",
id_col="id",
model_name="sentence-transformers/all-MiniLM-L6-v2",
)
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key | Status | |
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED | |
Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Generating embeddings for 8 nodes ...
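The initialization steps above can be pictured as building a plain dictionary keyed by node id. The sketch below is not the library's implementation: `toy_embed` and `build_node_embeddings` are hypothetical names, and `toy_embed` counts letters instead of calling a Hugging Face model so that the example runs without downloading anything.

```python
from collections import Counter
import string


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a sentence-embedding model.

    Returns a 26-dimensional vector of relative letter frequencies.
    """
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values()) or 1  # avoid division by zero for empty text
    return [counts[ch] / total for ch in string.ascii_lowercase]


def build_node_embeddings(ids, texts, embed=toy_embed):
    """Map each node id to the embedding of its text."""
    if len(ids) != len(texts):
        raise ValueError("ids and texts must have the same length")
    return {node_id: embed(text) for node_id, text in zip(ids, texts)}


node_embeddings = build_node_embeddings(["id7", "id8"], ["blue", "home"])
print(sorted(node_embeddings))      # ['id7', 'id8']
print(len(node_embeddings["id7"]))  # 26
```

With a real model, only the `embed` callable would change; the id-to-embedding mapping keeps the same shape.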
Like HuggingMapper, NodeMapper can generate embeddings for a single text or a list of texts:
# generate embedding for a single text
embedding = mapper.embed_text("Good morning")
print(embedding.shape)
# generate embeddings for a list of texts
embeddings = mapper.embed_text(["Hello world", "Good evening", "Lunch time!"])
print(embeddings.shape)
torch.Size([1, 384])
torch.Size([3, 384])
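Embeddings such as these are usually compared with cosine similarity. The tutorial does not state which measure NodeMapper uses, so treat that choice as an assumption. A minimal pure-Python sketch, where `cosine_similarity` is an illustrative helper and not part of the library:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Because the measure depends only on direction, vectors of different magnitude that point the same way still score 1.0.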
The main purpose of NodeMapper, however, is to find similar texts and return their corresponding ids.
# retrieve those most similar to given text, above threshold
mapper.get_similar("concrete", threshold=0) # threshold 0 returns all
{'id5': {'text': 'Have you poured the foundation?',
'score': 0.4041473865509033},
'id8': {'text': 'home', 'score': 0.2964898943901062},
'id7': {'text': 'blue', 'score': 0.27229535579681396},
'id3': {'text': 'The grass is green', 'score': 0.19606004655361176},
'id6': {'text': 'I am feeling grey', 'score': 0.1613382250070572},
'id2': {'text': 'I would like to order a doughnut',
'score': 0.1378765106201172},
'id1': {'text': 'They are happy', 'score': 0.124162957072258},
'id4': {'text': 'They are sad', 'score': 0.1125689372420311}}
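To see what a higher threshold would do to the result above, the filtering can be sketched on the reported scores (rounded to three decimals). `filter_by_threshold` is a hypothetical helper, and the strict `>` comparison is an assumption; the tutorial does not say whether a score equal to the threshold is kept.

```python
def filter_by_threshold(scores, threshold):
    """Keep entries scoring above `threshold`, highest score first."""
    kept = {node_id: s for node_id, s in scores.items() if s > threshold}
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


# Scores reported for the query "concrete", rounded to three decimals.
concrete_scores = {
    "id1": 0.124, "id2": 0.138, "id3": 0.196, "id4": 0.113,
    "id5": 0.404, "id6": 0.161, "id7": 0.272, "id8": 0.296,
}

print(filter_by_threshold(concrete_scores, threshold=0.25))
# {'id5': 0.404, 'id8': 0.296, 'id7': 0.272}
```

With `threshold=0` every entry survives here because all eight scores are positive.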
# retrieve top match, above threshold
print(mapper.get_match("joyful", threshold=0.4), "\n")
print(mapper.get_match("we are crying", threshold=0.4), "\n")
print(mapper.get_match("eatting a donut", threshold=0.4), "\n")
('id1', {'text': 'They are happy', 'score': 0.5418257713317871})
('id4', {'text': 'They are sad', 'score': 0.5035915374755859})
('id2', {'text': 'I would like to order a doughnut', 'score': 0.5166527628898621})
# retrieve top k matches, above threshold
print(mapper.get_similar("yellow", threshold=0.3, top_k=2), "\n")
print(mapper.get_similar("laughter", top_k=3), "\n")
{'id7': {'text': 'blue', 'score': 0.6618931293487549}, 'id6': {'text': 'I am feeling grey', 'score': 0.35393184423446655}}
{'id1': {'text': 'They are happy', 'score': 0.42021429538726807}, 'id4': {'text': 'They are sad', 'score': 0.3657638728618622}, 'id8': {'text': 'home', 'score': 0.34271639585494995}}
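Top-k retrieval and single best-match retrieval can both be viewed as a ranking followed by a cut-off. The sketch below illustrates that idea on the three scores reported for "laughter" (rounded); `top_k_above` and `best_match` are hypothetical helpers, and returning `None` when nothing clears the threshold is a choice made for this sketch, not documented library behavior.

```python
import heapq


def top_k_above(scores, threshold=0.0, top_k=None):
    """Return (id, score) pairs above `threshold`, best first, at most `top_k`."""
    candidates = [(node_id, s) for node_id, s in scores.items() if s > threshold]
    if top_k is None:
        return sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return heapq.nlargest(top_k, candidates, key=lambda pair: pair[1])


def best_match(scores, threshold=0.0):
    """Return the single best (id, score) pair, or None if none qualifies."""
    ranked = top_k_above(scores, threshold, top_k=1)
    return ranked[0] if ranked else None


# Scores reported for the query "laughter", rounded to three decimals.
laughter_scores = {"id1": 0.42, "id4": 0.366, "id8": 0.343}

print(top_k_above(laughter_scores, top_k=2))      # [('id1', 0.42), ('id4', 0.366)]
print(best_match(laughter_scores, threshold=0.4))  # ('id1', 0.42)
print(best_match(laughter_scores, threshold=0.5))  # None
```

`heapq.nlargest` avoids sorting the full candidate list, which only matters when there are many nodes.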