Biome ontology tagging demo#
# uncomment if colab
#!pip install pandas hugging-mapper
More info on the warning:
"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads."
What is an HF_TOKEN / Hugging Face user access token?
An HF_TOKEN is a Hugging Face user access token. If you have one, you can add it to your environment variables or repository secrets, or you can make it available in your venv by saving it in a .env file and loading it with the python-dotenv package.
For more information on getting a Hugging Face user access token, see their docs.
For example, save the token to an "HF_TOKEN" variable in a .env file:
HF_TOKEN=hf***...
Then load the .env variables via python-dotenv, e.g.
from dotenv import load_dotenv
load_dotenv()
# from dotenv import load_dotenv
# load_dotenv()
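To confirm that the token was actually picked up (from the environment or from a loaded .env file), a quick check like the one below can help. This is a minimal sketch: the helper name `describe_hf_token` is hypothetical and not part of any library, and it only reads the `HF_TOKEN` environment variable mentioned above.

```python
import os


def describe_hf_token(env=None):
    """Return a short, non-secret description of the HF_TOKEN setting.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    if not token:
        return "HF_TOKEN is not set; requests to the HF Hub will be unauthenticated."
    # Show only a short prefix so the token never ends up in notebook output.
    return f"HF_TOKEN is set ({token[:3]}***, {len(token)} characters)."


print(describe_hf_token())
```

Printing only a masked prefix avoids leaking the token if the notebook is shared with its outputs.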
Prep the data#
First, we encode each of the texts/terms in the ontology that we want to search against, e.g. the GOLD Biome Ontology.
The GOLD Biome Ontology#
We will read it in as a pandas DataFrame.
import pandas as pd
gold = pd.read_csv("https://github.com/cmungall/gold-ontology/raw/refs/heads/main/gold_definitions.csv")
gold.head()
| | id | label | level | parent | mixs_extension | env_broad | env_local | env_medium | host_taxon | anatomical_site | other | interpretation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID | LABEL | A oboInOwl:inSubset | C % | C RO:0001000 some % | C RO:0001025 some % | C RO:0002507 some % | C RO:0002219 some % | C RO:0002162 some % | C BFO:0000050 some % | C RO:0002321 some % | CLASS_TYPE |
| 1 | GOLDTERMS:3901 | Host-associated > Arthropoda | 2 | GOLDTERMS:4086 | NaN | NaN | NaN | NaN | NCBITaxon:6656 | NaN | NaN | equivalent |
| 2 | GOLDTERMS:3902 | Host-associated > Fish > Circulatory system > ... | 4 | GOLDTERMS:4788 | NaN | NaN | NaN | UBERON:0000178 | NaN | NaN | NaN | equivalent |
| 3 | GOLDTERMS:3903 | Host-associated > Mammals > Excretory system | 3 | GOLDTERMS:4118 | NaN | NaN | NaN | NaN | NaN | UBERON:8450002 | NaN | equivalent |
| 4 | GOLDTERMS:3905 | Host-associated > Mammals > Gastrointestinal t... | 4 | GOLDTERMS:Host-associated-Mammals-Gastrointest... | NaN | NaN | NaN | NaN | NaN | UBERON:0000160 | NaN | equivalent |
We will generate embeddings for the text in the 'label' column using NodeMapper. Instantiating NodeMapper automatically starts embedding the text.
You can try out different model_names from Hugging Face.
from hugger.mapper import NodeMapper
# from huggingface
model_name = "sentence-transformers/all-MiniLM-L6-v2"
mapper = NodeMapper(
    # skip row 0, which holds template strings (ID, LABEL, ...) rather than a term
    df=gold.iloc[1:],
    text_col="label",
    id_col="id",
    model_name=model_name,
)
# you can access embeddings via the mapping_embeddings attribute,
# which is a dictionary mapping from node ID to embedding tensor
# mapper.mapping_embeddings
# or as a df
mapper.embeddings_df.head()
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key | Status | |
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED | |
Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Generating embeddings for 1114 nodes ...
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GOLDTERMS:3901 | 0.009514 | 0.086892 | 0.023316 | 0.026116 | -0.023097 | -0.050559 | 0.033452 | -0.037260 | -0.039984 | 0.064949 | ... | -0.110169 | -0.024626 | -0.007807 | 0.007168 | 0.035057 | 0.027889 | 0.046115 | 0.066802 | 0.056781 | 0.056555 |
| GOLDTERMS:3902 | -0.008052 | 0.038370 | -0.063252 | -0.005456 | -0.049487 | -0.031551 | 0.019530 | 0.052514 | -0.038635 | 0.009836 | ... | -0.100109 | 0.033774 | 0.000423 | 0.024461 | -0.016237 | 0.023719 | -0.009798 | 0.146847 | 0.091117 | 0.000485 |
| GOLDTERMS:3903 | -0.026248 | 0.038573 | -0.061714 | 0.008909 | 0.041057 | -0.040874 | 0.054731 | 0.021568 | 0.014194 | 0.073426 | ... | -0.025986 | -0.034963 | -0.019086 | 0.015453 | 0.083072 | 0.061473 | 0.023279 | 0.144654 | 0.121220 | -0.076814 |
| GOLDTERMS:3905 | 0.055233 | -0.005709 | -0.046615 | 0.011348 | -0.006670 | -0.066701 | 0.026809 | 0.014501 | -0.018533 | 0.012538 | ... | -0.032097 | -0.035911 | -0.055973 | -0.017228 | 0.079648 | -0.000671 | 0.003997 | 0.148112 | 0.104614 | -0.048857 |
| GOLDTERMS:3906 | -0.026600 | -0.016657 | -0.051515 | 0.030630 | -0.037302 | -0.031447 | 0.033980 | -0.079638 | 0.029881 | 0.015700 | ... | -0.117798 | 0.065548 | -0.001040 | 0.062581 | 0.079624 | -0.015832 | 0.096679 | 0.082751 | 0.074077 | -0.055275 |
5 rows × 384 columns
The embeddings are stored in the mapping_embeddings attribute, which is a dictionary mapping from node ID to embedding tensor.
Of course, a better way to inspect them is to visualize them in 2D, for which we can do a quick plot with plot_tsne().
mapper.plot_tsne(title="t-SNE of GOLD embeddings")
The searching texts#
In this example we will use the text in sample metadata, such as names, descriptions and project titles.
For each sample we will find the most semantically similar GOLD biome term(s) by comparing the encoded GOLD biome term vectors with the encoded sample text metadata vector (i.e., the concatenated project name, sample title, sample description, etc.).
The most similar terms are those with the highest cosine similarity between the vectors.
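To make the ranking criterion concrete, here is a minimal pure-Python sketch of cosine similarity and a top-k lookup. It is an illustration only, not NodeMapper's internal implementation: the function names, the three-dimensional toy vectors and the term IDs are all made up for this example (the real embeddings above have 384 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


def top_k_similar(query, term_vectors, k=3):
    """Return the k (term_id, score) pairs with the highest cosine similarity."""
    scored = [(term_id, cosine_similarity(query, vec))
              for term_id, vec in term_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy data: three "term" vectors and one "sample text" vector.
toy_terms = {
    "term_a": [1.0, 0.0, 0.0],
    "term_b": [0.0, 1.0, 0.0],
    "term_c": [1.0, 1.0, 0.0],
}
toy_query = [0.9, 0.1, 0.0]

for term_id, score in top_k_similar(toy_query, toy_terms, k=2):
    print(term_id, round(score, 3))
```

Cosine similarity depends only on the direction of the vectors, not their length, so a long and a short text can still score as close matches.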
# read in sample metadata
# TODO replace with repo link
df = pd.read_csv("https://raw.githubusercontent.com/angelphanth/hugging-mapper/refs/heads/main/docs/tutorial/assets/biosamples-marine-sample.tsv", sep="\t")
# rename the second column to "text" and replace separators with spaces
df = df.rename(columns={df.columns[1]: "text"})
df['text'] = df['text'].str.replace(";", " ", regex=False)
df['text'] = df['text'].str.replace("_", " ", regex=False)
df['text'] = df['text'].str.replace("-", " ", regex=False)
# sanity check
df.head()
| | sample_accession | text | tag |
|---|---|---|---|
| 0 | SAMEA112526651 | Traversing European Coastlines (TREC) expediti... | env_tax:marine;env_geo;env_tax;env_geo:marine |
| 1 | SAMEA8514111 | DSMP Metabarcodes from the global deepsea with... | env_tax:marine;env_geo;env_tax;env_geo:marine |
| 2 | SAMEA11192074 | Biosynthetic potential of the global ocean mic... | env_tax:marine;env_geo;env_tax;env_geo:marine |
| 3 | SAMN08327646 | CTD47.S.2 CTD47.S.2 marine metagenome | env_tax:marine;env_geo;env_tax;env_geo:marine |
| 4 | SAMN50655180 | Metagenome or environmental sample from estua... | env_tax:marine;env_geo;env_geo:coastal;env_tax... |
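The separator cleanup applied to the text column (semicolons, underscores and hyphens replaced by spaces) can also be expressed as a single pass with a translation table. This is a sketch of an equivalent alternative, not the tutorial's own code; `normalize_separators` is a hypothetical helper name.

```python
# Map each separator character to a space in one pass.
SEPARATORS = str.maketrans({";": " ", "_": " ", "-": " "})


def normalize_separators(text):
    """Replace ';', '_' and '-' with spaces, leaving everything else untouched."""
    return text.translate(SEPARATORS)


print(normalize_separators("CTD47.S.2;marine_metagenome-x"))
```

With pandas, the same helper could be applied column-wise via `df['text'].map(normalize_separators)`, assuming the column contains only strings.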
Search#
For this demo we will take an even smaller subset of the df, as the original was 700K rows long.
# get subset
subset = df.sample(50, random_state=42, ignore_index=True)
print("New subset shape:", subset.shape)
New subset shape: (50, 3)
# init
trial = {}
counter = 0
for i in range(len(subset)):
    # init sample dict
    trial[i] = {}
    # sample accession and text
    trial[i]['sample_accession'] = subset.loc[i, "sample_accession"]
    trial[i]['text'] = subset.loc[i, "text"]
    # get top 3 predictions
    top_ks = mapper.get_similar(trial[i]['text'], top_k=3)
    top_k_ids = list(top_ks.keys())
    for j, k in enumerate(top_k_ids, start=1):
        trial[i][f'predicted_{j}'] = top_ks[k]['text']
        trial[i][f'score_{j}'] = top_ks[k]['score']
    # also get actual tag for comparison
    trial[i]['actual'] = subset.loc[i, "tag"]
    # counter for progress tracking
    counter += 1
    if counter % 10 == 0:
        # verbose
        print(f"Processed {counter} examples")
# write to json
# with open("assets/trial_results.json", "w") as f:
#     json.dump(trial, f, indent=4)
Processed 10 examples
Processed 20 examples
Processed 30 examples
Processed 40 examples
Processed 50 examples
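The inner loop above turns the top-k result into flat `predicted_j` / `score_j` columns. That step can be isolated in a small helper, sketched below. It assumes the result has the shape used in the loop (a dict keyed by term ID whose values hold `'text'` and `'score'`) and that the dict is already ordered best match first; the helper name, term IDs, labels and scores in the toy input are hypothetical.

```python
def flatten_top_k(top_ks):
    """Flatten {term_id: {'text': ..., 'score': ...}} into rank-numbered columns."""
    row = {}
    # Relies on dict insertion order matching rank order (best match first).
    for rank, match in enumerate(top_ks.values(), start=1):
        row[f"predicted_{rank}"] = match["text"]
        row[f"score_{rank}"] = match["score"]
    return row


# Hypothetical stand-in for a top-k result, for illustration only.
toy_top_ks = {
    "TERM:1": {"text": "first label", "score": 0.9},
    "TERM:2": {"text": "second label", "score": 0.8},
}
print(flatten_top_k(toy_top_ks))
```

Each per-sample dict could then be extended with `trial[i].update(flatten_top_k(top_ks))` instead of the nested loop.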
# check it out
result_df = pd.DataFrame.from_dict(trial, orient="index")
result_df.head()
# save to tsv
#result_df.to_csv(f"assets/gold-trial-{model_name.split('/')[-1]}.tsv", sep="\t")
| | sample_accession | text | predicted_1 | score_1 | predicted_2 | score_2 | predicted_3 | score_3 | actual |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SAMN43483370 | Keywords: GSC:MIxS MIMS:6.0 102.100.100/40223... | Engineered > Lab enrichment > Defined media > ... | 0.479996 | Engineered > Lab enrichment > Defined media > ... | 0.473488 | Engineered > Lab enrichment > Defined media > ... | 0.463391 | env_tax:marine;env_geo;env_geo:coastal;env_tax... |
| 1 | SAMN36679984 | Metagenome or environmental sample from marin... | Environmental > Aquatic > Marine > Oceanic > P... | 0.565467 | Environmental > Aquatic > Marine > River plume... | 0.544373 | Host-associated > Microbial > Dinoflagellates | 0.533412 | env_tax:marine;env_geo;env_tax;env_geo:marine |
| 2 | SAMN19289410 | Model organism or animal sample from Rhinogob... | Host-associated > Mammals > Respiratory system... | 0.432447 | Engineered > Modeled > Simulated communities (... | 0.422206 | Host-associated > Mammals | 0.415564 | env_tax:marine;env_geo;env_geo:coastal;env_tax... |
| 3 | SAMN40082646 | Larvae sample of Montipora capitata offspring... | Host-associated > Mollusca > Larvae | 0.503204 | Host-associated > Invertebrates > Cnidaria > C... | 0.488396 | Host-associated > Invertebrates > Cnidaria > C... | 0.476329 | env_tax:marine;env_geo;env_tax;env_geo:marine |
| 4 | SAMN04156407 | Metagenome or environmental sample from marin... | Environmental > Aquatic > Marine > Oceanic > M... | 0.494691 | Environmental > Aquatic > Marine > Intertidal ... | 0.493328 | Host-associated > Microbial > Dinoflagellates | 0.478134 | env_tax:marine;env_geo;env_tax;env_geo:marine |
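Since the `actual` column holds tag strings such as `env_tax:marine;env_geo;env_tax;env_geo:marine`, one rough way to compare predictions against it is to check whether any tag value appears in the predicted label. The sketch below is a crude, hypothetical heuristic (both helper names are made up) and not an evaluation method from this tutorial; substring matching will miss synonyms and can produce false positives.

```python
def tag_values(tag):
    """Extract the values from 'key:value' entries, ignoring bare keys."""
    values = set()
    for entry in tag.split(";"):
        key, sep, value = entry.partition(":")
        if sep and value:
            values.add(value.strip().lower())
    return values


def label_mentions_tag(predicted_label, tag):
    """True if any tag value occurs as a substring of the predicted label."""
    label = predicted_label.lower()
    return any(value in label for value in tag_values(tag))


example_tag = "env_tax:marine;env_geo;env_tax;env_geo:marine"
print(tag_values(example_tag))
print(label_mentions_tag("Environmental > Aquatic > Marine > Oceanic", example_tag))
print(label_mentions_tag("Host-associated > Mammals", example_tag))
```

Applied row-wise to `predicted_1` and `actual`, this gives a quick sanity signal for spotting obviously off-target matches, such as the mammal predictions for marine samples in the table above.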