Biome Ontology Tagging Demo

In this notebook, we demonstrate how to use the NodeMapper class to perform semantic tagging of sample metadata with ontology terms from the GOLD Biome Ontology, leveraging Hugging Face transformer models.

With NodeMapper, you can:

  • Load and embed all ontology terms as vectors using a pre-trained transformer model

  • Efficiently encode and search sample metadata for the most semantically similar ontology terms

  • Retrieve top matches and visualize embeddings for exploration and quality control

This workflow enables scalable, automated annotation of biological samples with ontology terms, making it easier to organize, search, and analyze large datasets.

Let’s get started!

# uncomment if colab
#!pip install pandas hugging-mapper
More info on the warning:
"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads."

What is an HF_TOKEN / huggingface user access token?

If you have an HF_TOKEN, you can add it to your environment variables or repository secrets. In a virtual environment, you can also save it in a .env file and load it with the python-dotenv package.

For example:

  1. Get a Hugging Face user access token (more information: their docs)

  2. Save it to an “HF_TOKEN” variable

    Example .env file:

    HF_TOKEN=hf***...
    
  3. Access the .env variables via python-dotenv

    e.g.

    from dotenv import load_dotenv
    load_dotenv()
    


# from dotenv import load_dotenv
# load_dotenv()
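
After loading the .env file, it can be useful to confirm that the token is visible to the process without printing the secret itself. The following is a minimal sketch using only the standard library; `hf_token_status` is a hypothetical helper written for this illustration and is not part of hugging-mapper or huggingface_hub.

```python
import os


def hf_token_status(env=None):
    """Describe whether HF_TOKEN is set, without revealing the token.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to try out. (Hypothetical helper, for illustration only.)
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    if not token:
        return "HF_TOKEN not set: requests to the HF Hub will be unauthenticated"
    # Show only a short masked prefix so the secret never lands in notebook output.
    return f"HF_TOKEN set ({token[:3]}***)"


print(hf_token_status())
```

If the message reports that the token is not set, the unauthenticated-requests warning quoted above is to be expected.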

Prep the data

First, we encode each text/term in the ontology that we want to search against, for example:

The GOLD Biome Ontology

We will read it in as a pandas DataFrame.

import pandas as pd

gold = pd.read_csv("https://github.com/cmungall/gold-ontology/raw/refs/heads/main/gold_definitions.csv")

gold.head()
  | id | label | level | parent | mixs_extension | env_broad | env_local | env_medium | host_taxon | anatomical_site | other | interpretation
0 | ID | LABEL | A oboInOwl:inSubset | C % | C RO:0001000 some % | C RO:0001025 some % | C RO:0002507 some % | C RO:0002219 some % | C RO:0002162 some % | C BFO:0000050 some % | C RO:0002321 some % | CLASS_TYPE
1 | GOLDTERMS:3901 | Host-associated > Arthropoda | 2 | GOLDTERMS:4086 | NaN | NaN | NaN | NaN | NCBITaxon:6656 | NaN | NaN | equivalent
2 | GOLDTERMS:3902 | Host-associated > Fish > Circulatory system > ... | 4 | GOLDTERMS:4788 | NaN | NaN | NaN | UBERON:0000178 | NaN | NaN | NaN | equivalent
3 | GOLDTERMS:3903 | Host-associated > Mammals > Excretory system | 3 | GOLDTERMS:4118 | NaN | NaN | NaN | NaN | NaN | UBERON:8450002 | NaN | equivalent
4 | GOLDTERMS:3905 | Host-associated > Mammals > Gastrointestinal t... | 4 | GOLDTERMS:Host-associated-Mammals-Gastrointest... | NaN | NaN | NaN | NaN | NaN | UBERON:0000160 | NaN | equivalent

We will generate embeddings for the text in the ‘label’ column using NodeMapper. Instantiating NodeMapper automatically starts embedding the text.

You can try out different model_name values from Hugging Face.

from hugger.mapper import NodeMapper

# from huggingface
model_name = "sentence-transformers/all-MiniLM-L6-v2"

mapper = NodeMapper(
    # skip row 0: it holds template directives (ID, LABEL, ...), not an ontology term
    df=gold.iloc[1:],
    text_col="label",
    id_col="id",
    model_name=model_name,
)

# you can access embeddings via the mapping_embeddings attribute,
# which is a dictionary mapping from node ID to embedding tensor
# mapper.mapping_embeddings

# or as a df
mapper.embeddings_df.head()
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Generating embeddings for 1114 nodes ...
0 1 2 3 4 5 6 7 8 9 ... 374 375 376 377 378 379 380 381 382 383
GOLDTERMS:3901 0.009514 0.086892 0.023316 0.026116 -0.023097 -0.050559 0.033452 -0.037260 -0.039984 0.064949 ... -0.110169 -0.024626 -0.007807 0.007167 0.035056 0.027889 0.046115 0.066802 0.056781 0.056555
GOLDTERMS:3902 -0.008052 0.038370 -0.063252 -0.005456 -0.049487 -0.031551 0.019530 0.052514 -0.038635 0.009836 ... -0.100109 0.033774 0.000423 0.024461 -0.016237 0.023719 -0.009798 0.146847 0.091117 0.000485
GOLDTERMS:3903 -0.026248 0.038573 -0.061714 0.008909 0.041057 -0.040874 0.054731 0.021568 0.014194 0.073426 ... -0.025986 -0.034963 -0.019086 0.015453 0.083072 0.061473 0.023279 0.144654 0.121220 -0.076814
GOLDTERMS:3905 0.055233 -0.005709 -0.046615 0.011348 -0.006670 -0.066701 0.026809 0.014501 -0.018533 0.012538 ... -0.032097 -0.035911 -0.055973 -0.017228 0.079648 -0.000671 0.003997 0.148112 0.104614 -0.048857
GOLDTERMS:3906 -0.026599 -0.016657 -0.051515 0.030630 -0.037302 -0.031447 0.033980 -0.079638 0.029881 0.015700 ... -0.117798 0.065548 -0.001040 0.062581 0.079624 -0.015832 0.096679 0.082751 0.074077 -0.055275

5 rows × 384 columns

The embeddings are stored in the mapping_embeddings attribute, which is a dictionary mapping from node ID to embedding tensor. The embeddings_df attribute exposes the same vectors as a DataFrame with one row per node ID.
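
To make the relationship between a node-ID-to-vector dictionary and a one-row-per-node table concrete, here is a small sketch with pandas. It is not taken from NodeMapper's source: the IDs and 3-dimensional vectors are made up, and the vectors are assumed to be plain Python lists (tensors would first need converting to lists or arrays).

```python
import pandas as pd

# Toy stand-in for a {node ID: embedding} dictionary. The IDs and values are
# invented for illustration; the real embeddings above have 384 dimensions.
toy_embeddings = {
    "TERM:1": [0.1, 0.2, 0.3],
    "TERM:2": [0.4, 0.5, 0.6],
}

# orient="index" turns each dictionary key into a row label and each
# vector position into an integer-named column (0, 1, 2, ...).
toy_df = pd.DataFrame.from_dict(toy_embeddings, orient="index")

print(toy_df.shape)
```

The resulting layout matches the shape of the table shown above: node IDs as the index and one column per embedding dimension.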

A better way to inspect the embeddings is to view them in 2D, which we can do with a quick plot using plot_tsne().

import plotly.io as pio
pio.renderers.default = "notebook_connected" # for readthedocs

mapper.plot_tsne(title="t-SNE of GOLD embeddings")

The searching texts

  • In this example, we will use the text in sample metadata, such as names, descriptions, and project titles.

  • For each sample, we will find the most semantically similar GOLD biome term(s)

  • by comparing the encoded GOLD biome term vectors with the encoded sample text metadata vector (i.e., the concatenated project name, sample title, sample description, etc.).

  • The most similar terms are those with the highest cosine similarity between vectors.
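
The ranking idea in the bullets above can be sketched in a few lines of plain Python. This is an illustration of cosine-similarity ranking in general, not NodeMapper's implementation; the term IDs and 3-dimensional vectors are made up, and `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, term_vecs, k=2):
    """Return the k (term_id, similarity) pairs most similar to query_vec."""
    scored = [(term_id, cosine(query_vec, vec)) for term_id, vec in term_vecs.items()]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors (assumed values; real embeddings have hundreds of dimensions).
term_vecs = {
    "TERM:A": [1.0, 0.0, 0.0],
    "TERM:B": [0.0, 1.0, 0.0],
    "TERM:C": [1.0, 1.0, 0.0],
}
sample_vec = [2.0, 0.1, 0.0]

matches = top_k(sample_vec, term_vecs, k=2)
print(matches)
```

Because cosine similarity depends only on the angle between vectors, the length of the sample vector does not affect the ranking; here the sample points mostly along the first axis, so TERM:A ranks first and TERM:C second.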

# read in sample metadata
# TODO replace with repo link
df = pd.read_csv("https://raw.githubusercontent.com/angelphanth/hugging-mapper/refs/heads/main/docs/tutorial/assets/biosamples-marine-sample.tsv", sep="\t")
# rename the second column to "text"
df = df.rename(columns={df.columns[1]: "text"})
# replace semicolons, underscores, and hyphens with spaces (literal match, not regex)
for sep in (";", "_", "-"):
    df["text"] = df["text"].str.replace(sep, " ", regex=False)
# sanity check
df.head()
  | sample_accession | text | tag
0 | SAMEA112526651 | Traversing European Coastlines (TREC) expediti... | env_tax:marine;env_geo;env_tax;env_geo:marine
1 | SAMEA8514111 | DSMP Metabarcodes from the global deepsea with... | env_tax:marine;env_geo;env_tax;env_geo:marine
2 | SAMEA11192074 | Biosynthetic potential of the global ocean mic... | env_tax:marine;env_geo;env_tax;env_geo:marine
3 | SAMN08327646 | CTD47.S.2 CTD47.S.2 marine metagenome | env_tax:marine;env_geo;env_tax;env_geo:marine
4 | SAMN50655180 | Metagenome or environmental sample from estua... | env_tax:marine;env_geo;env_geo:coastal;env_tax...
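
The separator cleanup applied to the text column can also be written as a small reusable function, which helps when the search text has to be assembled from several metadata fields first. The sketch below is an assumption-laden illustration: `normalize_text` and `build_search_text` are hypothetical helpers, the example field values are invented, and collapsing repeated whitespace is an extra step that the cell above does not perform.

```python
import re


def normalize_text(raw):
    """Replace ';', '_' and '-' with spaces, then collapse repeated whitespace."""
    spaced = re.sub(r"[;_\-]", " ", raw)
    # Extra step (not in the notebook cell): squeeze runs of whitespace.
    return re.sub(r"\s+", " ", spaced).strip()


def build_search_text(fields):
    """Join the non-empty metadata fields into one normalized search string."""
    return normalize_text(" ".join(field for field in fields if field))


# Invented example values standing in for project name, sample title, description.
example = build_search_text(["TREC expedition", None, "sea_water;deep-sea"])
print(example)
```

Keeping the normalization in one function makes it easier to apply exactly the same cleanup to any new text that is later compared against the ontology embeddings.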