Biome ontology tagging demo#

# uncomment if colab
#!pip install pandas hugging-mapper

More info on the following warning:

"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads."

What is an HF_TOKEN / Hugging Face user access token?

User Access Tokens are the preferred way to authenticate an application or notebook to Hugging Face services. You can manage your access tokens in your settings.

If you have an HF_TOKEN, you can add it to your environment variables or repository secrets, or you can access it in your venv by saving the HF_TOKEN in a .env file and loading it with the python-dotenv package.

For example:

Get a Hugging Face user access token (see the Hugging Face docs for details).
Save it to the “HF_TOKEN” variable.

Example .env file:
```
HF_TOKEN=hf***...
```

Load the .env variables with python-dotenv, e.g.:

from dotenv import load_dotenv
load_dotenv()

# from dotenv import load_dotenv
# load_dotenv()
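
To confirm that the token is actually visible to the notebook before a model is downloaded, a quick check along these lines can help. This is a minimal sketch using only the standard library; the helper name `token_status` is made up for illustration and is not part of any Hugging Face package.

```python
import os


def token_status(env=os.environ):
    """Report whether HF_TOKEN is set, without printing the secret itself."""
    token = env.get("HF_TOKEN", "")
    if not token:
        return "HF_TOKEN is not set; requests to the Hub will be unauthenticated"
    # Show only the first few characters so the token never lands in notebook output
    return f"HF_TOKEN is set ({token[:3]}..., {len(token)} characters)"


print(token_status())
```

If you use a .env file, run the check after `load_dotenv()` so that the variables from the file are already in the environment.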

Prep the data#

First, we encode each of the terms in the ontology that we want to search against, e.g.

The GOLD Biome Ontology#

We will read it in as a pandas DataFrame.

import pandas as pd

gold = pd.read_csv("https://github.com/cmungall/gold-ontology/raw/refs/heads/main/gold_definitions.csv")

gold.head()

	id	label	level	parent	mixs_extension	env_broad	env_local	env_medium	host_taxon	anatomical_site	other	interpretation
0	ID	LABEL	A oboInOwl:inSubset	C %	C RO:0001000 some %	C RO:0001025 some %	C RO:0002507 some %	C RO:0002219 some %	C RO:0002162 some %	C BFO:0000050 some %	C RO:0002321 some %	CLASS_TYPE
1	GOLDTERMS:3901	Host-associated > Arthropoda	2	GOLDTERMS:4086	NaN	NaN	NaN	NaN	NCBITaxon:6656	NaN	NaN	equivalent
2	GOLDTERMS:3902	Host-associated > Fish > Circulatory system > ...	4	GOLDTERMS:4788	NaN	NaN	NaN	UBERON:0000178	NaN	NaN	NaN	equivalent
3	GOLDTERMS:3903	Host-associated > Mammals > Excretory system	3	GOLDTERMS:4118	NaN	NaN	NaN	NaN	NaN	UBERON:8450002	NaN	equivalent
4	GOLDTERMS:3905	Host-associated > Mammals > Gastrointestinal t...	4	GOLDTERMS:Host-associated-Mammals-Gastrointest...	NaN	NaN	NaN	NaN	NaN	UBERON:0000160	NaN	equivalent

We will generate embeddings for the text in the ‘label’ column using NodeMapper. Creating a NodeMapper automatically starts embedding the text.

You can try out different model_name values from Hugging Face.

from hugger.mapper import NodeMapper

# model name from the Hugging Face Hub
model_name = "sentence-transformers/all-MiniLM-L6-v2"

mapper = NodeMapper(
    # skip row 0, which holds template strings (ID, LABEL, ...) rather than a term
    df=gold.iloc[1:],
    text_col="label",
    id_col="id",
    model_name=model_name,
)

# you can access embeddings via the mapping_embeddings attribute,
# which is a dictionary mapping from node ID to embedding tensor
# mapper.mapping_embeddings

# or as a df
mapper.embeddings_df.head()

Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
WARNING:huggingface_hub.utils._http:Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.

Generating embeddings for 1114 nodes ...

	0	1	2	3	4	5	6	7	8	9	...	374	375	376	377	378	379	380	381	382	383
GOLDTERMS:3901	0.009514	0.086892	0.023316	0.026116	-0.023097	-0.050559	0.033452	-0.037260	-0.039984	0.064949	...	-0.110169	-0.024626	-0.007807	0.007168	0.035057	0.027889	0.046115	0.066802	0.056781	0.056555
GOLDTERMS:3902	-0.008052	0.038370	-0.063252	-0.005456	-0.049487	-0.031551	0.019530	0.052514	-0.038635	0.009836	...	-0.100109	0.033774	0.000423	0.024461	-0.016237	0.023719	-0.009798	0.146847	0.091117	0.000485
GOLDTERMS:3903	-0.026248	0.038573	-0.061714	0.008909	0.041057	-0.040874	0.054731	0.021568	0.014194	0.073426	...	-0.025986	-0.034963	-0.019086	0.015453	0.083072	0.061473	0.023279	0.144654	0.121220	-0.076814
GOLDTERMS:3905	0.055233	-0.005709	-0.046615	0.011348	-0.006670	-0.066701	0.026809	0.014501	-0.018533	0.012538	...	-0.032097	-0.035911	-0.055973	-0.017228	0.079648	-0.000671	0.003997	0.148112	0.104614	-0.048857
GOLDTERMS:3906	-0.026600	-0.016657	-0.051515	0.030630	-0.037302	-0.031447	0.033980	-0.079638	0.029881	0.015700	...	-0.117798	0.065548	-0.001040	0.062581	0.079624	-0.015832	0.096679	0.082751	0.074077	-0.055275

5 rows × 384 columns

The embeddings are stored in the mapping_embeddings attribute, which is a dictionary mapping from node ID to embedding tensor.

A table of 384 columns is hard to read, so a better way to inspect the embeddings is to visualize them in 2D. plot_tsne() gives a quick plot:

mapper.plot_tsne(title="t-SNE of GOLD embeddings")

The searching texts#

In this example we will use the text in sample metadata, such as names, descriptions and project titles.
For each sample we will find the most semantically similar GOLD biome term(s) by comparing the encoded GOLD biome term vectors against the encoded sample text vector (i.e., the concatenated project name, sample title, sample description, etc.).
“Most similar” is defined by the highest cosine similarity between vectors.
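
To make the ranking criterion concrete, here is a minimal pure-Python sketch of cosine similarity and a top-k lookup. It illustrates the idea only and is not how NodeMapper is implemented internally; the function names and the toy 3-dimensional vectors are assumptions made for this example (the real embeddings above have 384 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, term_vecs, k=3):
    """Rank term IDs by cosine similarity to the query, best first."""
    scored = [
        (term_id, cosine_similarity(query_vec, vec))
        for term_id, vec in term_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (hypothetical values)
term_vecs = {
    "term_a": [1.0, 0.0, 0.0],
    "term_b": [0.0, 1.0, 0.0],
    "term_c": [1.0, 1.0, 0.0],
}
query = [2.0, 0.1, 0.0]

for term_id, score in top_k_similar(query, term_vecs, k=2):
    print(term_id, round(score, 3))
```

Because cosine similarity depends only on the angle between vectors, the length of the query vector does not affect the ranking: `term_a` ranks first here because it points in nearly the same direction as the query.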

# read in sample metadata
# TODO replace with repo link
df = pd.read_csv("https://raw.githubusercontent.com/angelphanth/hugging-mapper/refs/heads/main/docs/tutorial/assets/biosamples-marine-sample.tsv", sep="\t")
# rename the second column to "text"
df = df.rename(columns={df.columns[1]: "text"})
# replace semicolons, underscores and hyphens with spaces
for char in (";", "_", "-"):
    df["text"] = df["text"].str.replace(char, " ", regex=False)
# sanity check
df.head()

	sample_accession	text	tag
0	SAMEA112526651	Traversing European Coastlines (TREC) expediti...	env_tax:marine;env_geo;env_tax;env_geo:marine
1	SAMEA8514111	DSMP Metabarcodes from the global deepsea with...	env_tax:marine;env_geo;env_tax;env_geo:marine
2	SAMEA11192074	Biosynthetic potential of the global ocean mic...	env_tax:marine;env_geo;env_tax;env_geo:marine
3	SAMN08327646	CTD47.S.2 CTD47.S.2 marine metagenome	env_tax:marine;env_geo;env_tax;env_geo:marine
4	SAMN50655180	Metagenome or environmental sample from estua...	env_tax:marine;env_geo;env_geo:coastal;env_tax...
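
The separator clean-up applied to the text column can also be expressed as a single regular expression. The sketch below shows that on a plain string; `normalize_separators` is a hypothetical helper name, and the input string is invented for illustration.

```python
import re


def normalize_separators(text):
    """Replace each semicolon, underscore or hyphen with a space."""
    # One character class covers all three separators in a single pass
    return re.sub(r"[;_\-]", " ", text)


print(normalize_separators("marine_metagenome;deep-sea sample"))
# marine metagenome deep sea sample
```

On a pandas column the same pattern can be passed to `str.replace` with `regex=True`.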

Search#

For this demo we will take an even smaller subset of the df, as the original was 700K rows long.

# get subset
subset = df.sample(50, random_state=42, ignore_index=True)
print("New subset shape:", subset.shape)

New subset shape: (50, 3)

# init
trial = {}
counter = 0
for i in range(len(subset)):
    # init sample dict
    trial[i] = {}
    # sample accession and text
    trial[i]['sample_accession'] = subset.loc[i, "sample_accession"]
    trial[i]['text'] = subset.loc[i, "text"]
    # get top 3 predictions
    top_ks = mapper.get_similar(trial[i]['text'], top_k=3)
    top_k_ids = list(top_ks.keys())
    for j, k in enumerate(top_k_ids, start=1):
        trial[i][f'predicted_{j}'] = top_ks[k]['text']
        trial[i][f'score_{j}'] = top_ks[k]['score']
    # also get actual tag for comparison
    trial[i]['actual'] = subset.loc[i, "tag"]
    # counter for progress tracking
    counter += 1
    if counter % 10 == 0:
        # verbose
        print(f"Processed {counter} examples")
        # optionally write to json (uncomment; needs `import json`)
        # with open("assets/trial_results.json", "w") as f:
        #     json.dump(trial, f, indent=4)

Processed 10 examples
Processed 20 examples
Processed 30 examples
Processed 40 examples
Processed 50 examples
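
The loop above relies on get_similar returning a dictionary keyed by term ID, whose values carry 'text' and 'score' entries in ranked order. That shape is inferred from how the loop uses the result, not from the library's documentation. Under that assumption, the inner loop can be factored into a small helper; `flatten_top_k` and the example dictionary are hypothetical.

```python
def flatten_top_k(top_ks):
    """Turn {term_id: {'text': ..., 'score': ...}} into one flat row dict."""
    row = {}
    for rank, hit in enumerate(top_ks.values(), start=1):
        row[f"predicted_{rank}"] = hit["text"]
        row[f"score_{rank}"] = hit["score"]
    return row


# Hypothetical result, shaped like the dictionaries used in the loop above
example = {
    "TERM:1": {"text": "Environmental > Aquatic > Marine", "score": 0.57},
    "TERM:2": {"text": "Host-associated > Fish", "score": 0.41},
}
print(flatten_top_k(example))
```

Each flattened row can then be merged with the sample accession, text and tag to produce the same columns that the loop builds.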

# check it out
result_df = pd.DataFrame.from_dict(trial, orient="index")
result_df.head()

# save to tsv
#result_df.to_csv(f"assets/gold-trial-{model_name.split('/')[-1]}.tsv", sep="\t")

	sample_accession	text	predicted_1	score_1	predicted_2	score_2	predicted_3	score_3	actual
0	SAMN43483370	Keywords: GSC:MIxS MIMS:6.0 102.100.100/40223...	Engineered > Lab enrichment > Defined media > ...	0.479996	Engineered > Lab enrichment > Defined media > ...	0.473488	Engineered > Lab enrichment > Defined media > ...	0.463391	env_tax:marine;env_geo;env_geo:coastal;env_tax...
1	SAMN36679984	Metagenome or environmental sample from marin...	Environmental > Aquatic > Marine > Oceanic > P...	0.565467	Environmental > Aquatic > Marine > River plume...	0.544373	Host-associated > Microbial > Dinoflagellates	0.533412	env_tax:marine;env_geo;env_tax;env_geo:marine
2	SAMN19289410	Model organism or animal sample from Rhinogob...	Host-associated > Mammals > Respiratory system...	0.432447	Engineered > Modeled > Simulated communities (...	0.422206	Host-associated > Mammals	0.415564	env_tax:marine;env_geo;env_geo:coastal;env_tax...
3	SAMN40082646	Larvae sample of Montipora capitata offspring...	Host-associated > Mollusca > Larvae	0.503204	Host-associated > Invertebrates > Cnidaria > C...	0.488396	Host-associated > Invertebrates > Cnidaria > C...	0.476329	env_tax:marine;env_geo;env_tax;env_geo:marine
4	SAMN04156407	Metagenome or environmental sample from marin...	Environmental > Aquatic > Marine > Oceanic > M...	0.494691	Environmental > Aquatic > Marine > Intertidal ...	0.493328	Host-associated > Microbial > Dinoflagellates	0.478134	env_tax:marine;env_geo;env_tax;env_geo:marine
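
The 'actual' column is kept for comparison with the predictions. One crude way to do that comparison is to check whether any value from the tag string (for example `marine`) appears in a predicted label. The sketch below is only a rough heuristic under that assumption, not an evaluation method from the library; both helper names are hypothetical, and the first label is a shortened, illustrative one.

```python
def tag_values(tag):
    """Extract the values from a tag string such as 'env_geo:coastal;env_tax'."""
    values = set()
    for part in tag.split(";"):
        key, sep, value = part.partition(":")
        # Parts without a ':' (e.g. 'env_geo') carry no value and are skipped
        if sep and value:
            values.add(value.strip().lower())
    return values


def mentions_any_tag(predicted_label, tag):
    """True if any tag value occurs as a substring of the label (case-insensitive)."""
    label = predicted_label.lower()
    return any(value in label for value in tag_values(tag))


tag = "env_tax:marine;env_geo;env_tax;env_geo:marine"
print(mentions_any_tag("Environmental > Aquatic > Marine > Oceanic", tag))  # True
print(mentions_any_tag("Host-associated > Mammals", tag))  # False
```

Substring matching will miss synonyms and related terms, so a mismatch here does not necessarily mean the prediction is wrong.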
