NodeMapper tutorial

# uncomment if colab
#!pip install pandas hugging-mapper

NodeMapper returns node ids based on how similar each node's text embedding is to a query text.

Start by importing NodeMapper

from hugger.mapper import NodeMapper

Demo data for the tutorial

# An example dataframe
import pandas as pd

# generate data
ids = ["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"]
texts = [
    "They are happy",
    "I would like to order a doughnut",
    "The grass is green",
    "They are sad",
    "Have you poured the foundation?",
    "I am feeling grey",
    "blue",
    "home",
]
# to dataframe
df = pd.DataFrame({"id": ids, "text": texts})

Initializing NodeMapper will:

- load the given Hugging Face model
- generate embeddings for the text column
- create a dictionary mapping node ids to text embeddings

# init
mapper = NodeMapper(
    df=df,
    text_col="text",
    id_col="id",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)

Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.

Generating embeddings for 8 nodes ...
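To make the initialization steps concrete, here is a minimal sketch of the last two: embedding each text and storing the result under its node id. It does not use NodeMapper's internals, which are not shown in this tutorial. `toy_embed` and `build_node_embeddings` are hypothetical names, and the hashed bag-of-words embedding is only a stand-in for the real sentence-transformer model.

```python
import hashlib
import math

DIM = 16  # toy dimensionality; the model used in this tutorial returns 384-d vectors


def toy_embed(text):
    """Hypothetical stand-in for a sentence-embedding model.

    Hashes each word into one of DIM buckets and L2-normalises the counts.
    """
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard against empty text
    return [v / norm for v in vec]


def build_node_embeddings(ids, texts):
    """Map each node id to the embedding of its text (assumed structure)."""
    return {node_id: toy_embed(text) for node_id, text in zip(ids, texts)}


ids = ["id1", "id2", "id3"]
texts = ["They are happy", "The grass is green", "They are sad"]
node_embeddings = build_node_embeddings(ids, texts)
print(len(node_embeddings), len(node_embeddings["id1"]))  # 3 16
```

Computing the embeddings once at initialization means later queries only need to embed the query text itself.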

Like HuggingMapper, NodeMapper can also generate embeddings for a given text or list of texts

# generate embedding for a single text
embedding = mapper.embed_text("Good morning")
print(embedding.shape)

# generate embeddings for a list of texts
embeddings = mapper.embed_text(["Hello world", "Good evening", "Lunch time!"])
print(embeddings.shape)

torch.Size([1, 384])
torch.Size([3, 384])
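The two shapes above suggest that a single string is handled as a batch of one, so the first dimension is always the number of texts. A minimal sketch of that kind of input normalization follows. `as_batch` is a hypothetical helper, not part of the library, and the library may do this differently.

```python
def as_batch(texts):
    """Wrap a single string in a list so downstream code always sees a batch.

    Hypothetical helper for illustration only.
    """
    return [texts] if isinstance(texts, str) else list(texts)


print(len(as_batch("Good morning")))                                  # 1
print(len(as_batch(["Hello world", "Good evening", "Lunch time!"])))  # 3
```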

But the main purpose of NodeMapper is to find similar texts and their corresponding ids

# retrieve those most similar to given text, above threshold
mapper.get_similar("concrete", threshold=0)  # threshold 0 returns all

{'id5': {'text': 'Have you poured the foundation?',
  'score': 0.4041473865509033},
 'id8': {'text': 'home', 'score': 0.2964898943901062},
 'id7': {'text': 'blue', 'score': 0.27229535579681396},
 'id3': {'text': 'The grass is green', 'score': 0.19606004655361176},
 'id6': {'text': 'I am feeling grey', 'score': 0.1613382250070572},
 'id2': {'text': 'I would like to order a doughnut',
  'score': 0.1378765106201172},
 'id1': {'text': 'They are happy', 'score': 0.124162957072258},
 'id4': {'text': 'They are sad', 'score': 0.1125689372420311}}
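The ranking above can be pictured as scoring every stored embedding against the query embedding, dropping scores below the threshold, and sorting what remains. The sketch below assumes cosine similarity and a `>=` comparison against the threshold. Neither is confirmed by this tutorial, and `cosine` and `rank_by_similarity` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, node_vecs, threshold=0.0):
    """Return {node_id: score} for scores >= threshold, highest score first."""
    scores = {node_id: cosine(query_vec, vec) for node_id, vec in node_vecs.items()}
    kept = {node_id: s for node_id, s in scores.items() if s >= threshold}
    return dict(sorted(kept.items(), key=lambda kv: kv[1], reverse=True))


# Tiny 2-d vectors instead of real 384-d embeddings
node_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = rank_by_similarity([1.0, 0.0], node_vecs, threshold=0.5)
print(list(ranked))  # ['a', 'c']
```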

# retrieve top match, above threshold
print(mapper.get_match("joyful", threshold=0.4), "\n")
print(mapper.get_match("we are crying", threshold=0.4), "\n")
print(mapper.get_match("eatting a donut", threshold=0.4), "\n")

('id1', {'text': 'They are happy', 'score': 0.5418257713317871}) 

('id4', {'text': 'They are sad', 'score': 0.5035915374755859}) 

('id2', {'text': 'I would like to order a doughnut', 'score': 0.5166527628898621})

# retrieve top k matches, above threshold
print(mapper.get_similar("yellow", threshold=0.3, top_k=2), "\n")
print(mapper.get_similar("laughter", top_k=3), "\n")

{'id7': {'text': 'blue', 'score': 0.6618931293487549}, 'id6': {'text': 'I am feeling grey', 'score': 0.35393184423446655}}
{'id1': {'text': 'They are happy', 'score': 0.42021429538726807}, 'id4': {'text': 'They are sad', 'score': 0.3657638728618622}, 'id8': {'text': 'home', 'score': 0.34271639585494995}}
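How `threshold` and `top_k` combine can be sketched on plain scores: filter by threshold first, then keep the k highest. The scores below are the rounded values from the "concrete" query earlier in this tutorial. `select` and `best_match` are hypothetical helpers, and returning `None` when nothing clears the threshold is an assumption, since the tutorial does not show that case.

```python
import heapq

# Rounded scores from the "concrete" query output earlier in this tutorial
scores = {
    "id5": 0.404, "id8": 0.296, "id7": 0.272, "id3": 0.196,
    "id6": 0.161, "id2": 0.138, "id1": 0.124, "id4": 0.113,
}


def select(scores, threshold=0.0, top_k=None):
    """Keep scores >= threshold, then return the top_k highest as (id, score) pairs."""
    kept = [(node_id, s) for node_id, s in scores.items() if s >= threshold]
    k = len(kept) if top_k is None else top_k
    return heapq.nlargest(k, kept, key=lambda pair: pair[1])


def best_match(scores, threshold=0.0):
    """Return the single best (id, score) pair, or None if nothing passes (assumed)."""
    top = select(scores, threshold=threshold, top_k=1)
    return top[0] if top else None


print(select(scores, threshold=0.2, top_k=2))  # [('id5', 0.404), ('id8', 0.296)]
print(best_match(scores, threshold=0.4))       # ('id5', 0.404)
print(best_match(scores, threshold=0.5))       # None
```

Filtering before truncating means `top_k` acts as an upper bound: fewer than k results come back when not enough nodes clear the threshold.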