Biome ontology tagging demo


# Uncomment if running in Colab
#!pip install pandas hugging-mapper
More information on the following warning:

"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads."

What is an HF_TOKEN (Hugging Face user access token)?

If you have an HF_TOKEN, you can add it to your environment variables or repository secrets. In a virtual environment, you can also save the HF_TOKEN in a .env file and load it with the python-dotenv package.

For example:

  1. For more information on getting a Hugging Face user access token, see the Hugging Face docs.

  2. Save the token to an "HF_TOKEN" variable.

    Example .env file:

    HF_TOKEN=hf***...
    
  3. Load the .env variables via python-dotenv,

    e.g.

    from dotenv import load_dotenv
    load_dotenv()
    


# from dotenv import load_dotenv
# load_dotenv()
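To confirm that the token is actually visible to Python after either approach, a quick check of the process environment can help. The sketch below uses only the standard library; the helper name `hf_token_is_set` is hypothetical and is not part of hugging-mapper or huggingface_hub.

```python
import os


def hf_token_is_set(env=None):
    """Return True if a non-empty HF_TOKEN is present in the given mapping."""
    # Default to the real process environment; a plain dict can be passed instead.
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


if __name__ == "__main__":
    # Report only whether the token is set; never print the token itself.
    print("HF_TOKEN set:", hf_token_is_set())
```

If this reports False after calling load_dotenv(), the .env file is probably not in the working directory or the variable name is misspelled.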

Prep the data#

First, we encode each of the terms (texts) in the ontology that we want to search against, e.g.

The GOLD Biome Ontology#

We will read it in as a pandas DataFrame.

import pandas as pd

gold = pd.read_csv("https://github.com/cmungall/gold-ontology/raw/refs/heads/main/gold_definitions.csv")

gold.head()
| | id | label | level | parent | mixs_extension | env_broad | env_local | env_medium | host_taxon | anatomical_site | other | interpretation |
| 0 | ID | LABEL | A oboInOwl:inSubset | C % | C RO:0001000 some % | C RO:0001025 some % | C RO:0002507 some % | C RO:0002219 some % | C RO:0002162 some % | C BFO:0000050 some % | C RO:0002321 some % | CLASS_TYPE |
| 1 | GOLDTERMS:3901 | Host-associated > Arthropoda | 2 | GOLDTERMS:4086 | NaN | NaN | NaN | NaN | NCBITaxon:6656 | NaN | NaN | equivalent |
| 2 | GOLDTERMS:3902 | Host-associated > Fish > Circulatory system > ... | 4 | GOLDTERMS:4788 | NaN | NaN | NaN | UBERON:0000178 | NaN | NaN | NaN | equivalent |
| 3 | GOLDTERMS:3903 | Host-associated > Mammals > Excretory system | 3 | GOLDTERMS:4118 | NaN | NaN | NaN | NaN | NaN | UBERON:8450002 | NaN | equivalent |
| 4 | GOLDTERMS:3905 | Host-associated > Mammals > Gastrointestinal t... | 4 | GOLDTERMS:Host-associated-Mammals-Gastrointest... | NaN | NaN | NaN | NaN | NaN | UBERON:0000160 | NaN | equivalent |
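Row 0 in the output above holds column directives (ID, LABEL, CLASS_TYPE and so on) rather than a GOLD term, so it should be left out before any text is embedded. The sketch below illustrates the idea with the standard-library csv module on an inline excerpt of the first two rows, restricted to four of the twelve columns. The prefix check is an assumption made for illustration; slicing off the first row has the same effect on this table.

```python
import csv
import io

# Two-row excerpt of the table above (only four of the twelve columns).
excerpt = io.StringIO(
    "id,label,level,parent\n"
    "ID,LABEL,A oboInOwl:inSubset,C %\n"
    "GOLDTERMS:3901,Host-associated > Arthropoda,2,GOLDTERMS:4086\n"
)

rows = list(csv.DictReader(excerpt))

# Keep only rows whose id looks like a GOLD term, which drops the directive row.
terms = [row for row in rows if row["id"].startswith("GOLDTERMS:")]

for row in terms:
    print(row["id"], "->", row["label"])
```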

We will generate embeddings for the text in the ‘label’ column using NodeMapper. Instantiating NodeMapper automatically starts embedding the text.

You can try out different model_name values from Hugging Face.

from hugger.mapper import NodeMapper

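# model name from the Hugging Face Hub
model_name = "sentence-transformers/all-MiniLM-L6-v2"

mapper = NodeMapper(
    # skip row 0, which holds column directives (ID, LABEL, ...) rather than a GOLD term
    df=gold.iloc[1:],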
    text_col="label",
    id_col="id",
    model_name=model_name,
)

# you can access embeddings via the mapping_embeddings attribute,
# which is a dictionary mapping from node ID to embedding tensor
# mapper.mapping_embeddings

# or as a df
mapper.embeddings_df.head()
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
WARNING:huggingface_hub.utils._http:Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Generating embeddings for 1114 nodes ...
0 1 2 3 4 5 6 7 8 9 ... 374 375 376 377 378 379 380 381 382 383
GOLDTERMS:3901 0.009514 0.086892 0.023316 0.026116 -0.023097 -0.050559 0.033452 -0.037260 -0.039984 0.064949 ... -0.110169 -0.024626 -0.007807 0.007168 0.035057 0.027889 0.046115 0.066802 0.056781 0.056555
GOLDTERMS:3902 -0.008052 0.038370 -0.063252 -0.005456 -0.049487 -0.031551 0.019530 0.052514 -0.038635 0.009836 ... -0.100109 0.033774 0.000423 0.024461 -0.016237 0.023719 -0.009798 0.146847 0.091117 0.000485
GOLDTERMS:3903 -0.026248 0.038573 -0.061714 0.008909 0.041057 -0.040874 0.054731 0.021568 0.014194 0.073426 ... -0.025986 -0.034963 -0.019086 0.015453 0.083072 0.061473 0.023279 0.144654 0.121220 -0.076814
GOLDTERMS:3905 0.055233 -0.005709 -0.046615 0.011348 -0.006670 -0.066701 0.026809 0.014501 -0.018533 0.012538 ... -0.032097 -0.035911 -0.055973 -0.017228 0.079648 -0.000671 0.003997 0.148112 0.104614 -0.048857
GOLDTERMS:3906 -0.026600 -0.016657 -0.051515 0.030630 -0.037302 -0.031447 0.033980 -0.079638 0.029881 0.015700 ... -0.117798 0.065548 -0.001040 0.062581 0.079624 -0.015832 0.096679 0.082751 0.074077 -0.055275

5 rows × 384 columns

The embeddings are stored in the mapping_embeddings attribute, which is a dictionary mapping from node ID to embedding tensor.

The embeddings are easier to inspect in 2D, and plot_tsne() provides a quick plot for that.

mapper.plot_tsne(title="t-SNE of GOLD embeddings")

The search texts#

  • In this example we will use the text in sample metadata, such as names, descriptions and project titles.

  • For each sample, we will find the most semantically similar GOLD biome term(s) by comparing the encoded GOLD biome term vectors against the encoded sample text vector (i.e., the concatenated project name, sample title, sample description, etc.).

  • "Most similar" means the highest cosine similarity between vectors.
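Cosine similarity between two vectors a and b is their dot product divided by the product of their lengths, so it depends on direction rather than magnitude. The sketch below shows the ranking idea in plain Python. The function names, term IDs and 3-dimensional vectors are invented for illustration; real embeddings from the model used here have 384 dimensions, and this is not necessarily how hugging-mapper computes its matches.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query, term_vectors, k=1):
    """Return the k (term_id, score) pairs most similar to the query vector."""
    scored = [
        (term_id, cosine_similarity(query, vec))
        for term_id, vec in term_vectors.items()
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, invented for illustration only.
term_vectors = {
    "term_a": [1.0, 0.0, 0.0],
    "term_b": [0.0, 1.0, 0.0],
    "term_c": [0.7, 0.7, 0.0],
}
sample_vector = [0.9, 0.1, 0.0]

best = top_matches(sample_vector, term_vectors, k=2)
print(best)
```

Here the sample vector points almost in the same direction as term_a, so term_a ranks first and term_c second.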

# read in sample metadata
# TODO replace with repo link
df = pd.read_csv("https://raw.githubusercontent.com/angelphanth/hugging-mapper/refs/heads/main/docs/tutorial/assets/biosamples-marine-sample.tsv", sep="\t")
# rename the second column to "text"
df = df.rename(columns={df.columns[1]: "text"})
# replace semicolons, underscores and hyphens with spaces
for sep in (";", "_", "-"):
    df["text"] = df["text"].str.replace(sep, " ", regex=False)
# sanity check
df.head()
| | sample_accession | text | tag |
| 0 | SAMEA112526651 | Traversing European Coastlines (TREC) expediti... | env_tax:marine;env_geo;env_tax;env_geo:marine |
| 1 | SAMEA8514111 | DSMP Metabarcodes from the global deepsea with... | env_tax:marine;env_geo;env_tax;env_geo:marine |
| 2 | SAMEA11192074 | Biosynthetic potential of the global ocean mic... | env_tax:marine;env_geo;env_tax;env_geo:marine |
| 3 | SAMN08327646 | CTD47.S.2 CTD47.S.2 marine metagenome | env_tax:marine;env_geo;env_tax;env_geo:marine |
| 4 | SAMN50655180 | Metagenome or environmental sample from estua... | env_tax:marine;env_geo;env_geo:coastal;env_tax... |
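The separator cleanup applied to the text column can be checked on a single string without pandas. The sketch below does the same three substitutions with str.translate; the helper name `clean_text` and the raw string are hypothetical and only serve to show the effect.

```python
# Map each separator character to a single space.
SEPARATORS = str.maketrans({";": " ", "_": " ", "-": " "})


def clean_text(raw):
    """Replace semicolons, underscores and hyphens with spaces."""
    return raw.translate(SEPARATORS)


# Hypothetical raw metadata string, invented for illustration.
raw = "marine_metagenome;deep-sea sample"
print(clean_text(raw))
```

Replacing the separators with spaces rather than deleting them keeps neighbouring words apart, so the embedding model sees separate tokens.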