Measuring biodiversity is crucial for understanding ecosystem health. While prior work has developed machine learning models for taxonomic classification from photographic images and DNA separately, we introduce a multimodal approach that combines both, using CLIP-style contrastive learning to align images, DNA barcodes, and textual data in a unified embedding space. This enables accurate classification of both known and unknown insect species without task-specific fine-tuning, leveraging contrastive learning for the first time to fuse DNA and image data. Our method surpasses previous single-modality approaches by over 11% accuracy on zero-shot learning tasks, demonstrating its effectiveness for biodiversity studies.
Our CLIBD model consists of three encoders for processing images, DNA barcodes, and text. We start with pretrained encoders and fine-tune them using contrastive loss to align the image, DNA, and text embeddings. At inference time, we embed a query image and match it to a database of existing image and DNA embeddings (keys). We use cosine similarity to find the closest key embedding and use its taxonomic label to classify the query image.
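To make the training and inference steps above concrete, the PyTorch sketch below shows a symmetric contrastive (InfoNCE-style) loss between paired embeddings of two modalities (e.g. image and DNA) and a cosine-similarity nearest-key classifier. This is a minimal illustration, not the actual CLIBD implementation: the function names, temperature value, and encoder interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings,
    e.g. image and DNA embeddings of the same specimens (shape: batch x dim)."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matching pairs lie on the diagonal; score retrieval in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def classify_queries(query_embs, key_embs, key_labels):
    """Assign each query the taxonomic label of its most similar key
    under cosine similarity."""
    q = F.normalize(query_embs, dim=-1)
    k = F.normalize(key_embs, dim=-1)
    best = (q @ k.t()).argmax(dim=-1)                # nearest key per query
    return [key_labels[i] for i in best.tolist()]
```

In use, the loss would be applied to the outputs of the pretrained encoders during fine-tuning, e.g. `contrastive_loss(image_encoder(images), dna_encoder(barcodes))`, with analogous terms for the other modality pairs.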
We can use our aligned image-DNA embedding space to perform taxonomic classification and cross-modal retrieval from image to DNA. We use data from BIOSCAN-1M, a large dataset of insect images paired with DNA barcodes, and establish train/validation/test splits such that a set of species is held out as unseen during training and reserved for evaluation. For the validation and test splits, we also separate the samples we use as queries from the labeled database of keys that we match against.
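The following is a simplified sketch of this split logic, not the exact BIOSCAN-1M protocol: a fraction of species is held out as unseen, and their records are divided into evaluation queries and labeled keys. The function name, fractions, and record format are illustrative assumptions.

```python
import random
from collections import defaultdict

def split_unseen_species(records, unseen_fraction=0.1, query_fraction=0.5, seed=0):
    """Hold out a fraction of species as 'unseen', then divide their records
    into evaluation queries and a labeled key database.

    records: list of dicts, each with a 'species' field.
    """
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for rec in records:
        by_species[rec["species"]].append(rec)

    species = sorted(by_species)
    rng.shuffle(species)
    unseen = set(species[: int(len(species) * unseen_fraction)])

    train, queries, keys = [], [], []
    for sp, recs in by_species.items():
        if sp in unseen:
            # Unseen species never appear in training; their records are split
            # between the query set and the key database used at evaluation.
            rng.shuffle(recs)
            cut = max(1, int(len(recs) * query_fraction))
            queries.extend(recs[:cut])
            keys.extend(recs[cut:])
        else:
            # For brevity, all seen-species records go to training here; the
            # actual splits also reserve seen-species queries and keys.
            train.extend(recs)
    return train, queries, keys
```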
We use image embeddings as queries against DNA embeddings and show that we can perform taxonomic classification at the order, family, genus, and species levels. As we go down the taxonomy, the classification problem becomes increasingly challenging, growing from just 16 classes at the order level to over 8000 at the species level. In addition, the classes are highly imbalanced, with many species having fewer than 10 records.
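One way to score the nearest-key predictions at each taxonomic rank is sketched below; the label format (rank-to-name dictionaries) and the inclusion of a macro-averaged species accuracy, which is less dominated by frequent classes under heavy imbalance, are assumptions for illustration rather than the paper's exact evaluation code.

```python
from collections import defaultdict

def accuracy_per_level(preds, truths, levels=("order", "family", "genus", "species")):
    """Micro accuracy at each taxonomic rank.

    preds / truths: parallel lists of dicts mapping rank -> name,
    e.g. {"order": "Diptera", "family": "Cecidomyiidae", ...}.
    """
    return {
        level: sum(p[level] == t[level] for p, t in zip(preds, truths)) / len(truths)
        for level in levels
    }

def macro_species_accuracy(preds, truths):
    """Per-species accuracy averaged over species, informative when a few
    common species dominate the record counts."""
    correct, total = defaultdict(int), defaultdict(int)
    for p, t in zip(preds, truths):
        total[t["species"]] += 1
        correct[t["species"]] += int(p["species"] == t["species"])
    return sum(correct[s] / total[s] for s in total) / len(total)
```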
Examples of retrieving with an image query and matching against DNA keys show that even with cross-modal retrieval we can retrieve samples that are visually similar. A dark green border indicates that the retrieved sample has the correct species, while a yellow-green border indicates that the genus matches but not the species.
We compare our aligned embedding space with that of BioCLIP, a recent model that aligns images with text derived from taxonomic labels. We show that by also aligning with DNA, we achieve more accurate taxonomic classification. Note that we use the pretrained BioCLIP model, which is trained on a broader dataset than ours and does not use our careful training split, so it is likely that BioCLIP was already exposed to our unseen species during training. We believe these two differences account for our CLIBD (with image-text embeddings) performing better on the seen subset but worse on the unseen subset.
@article{gong2024clibd,
  author={Gong, ZeMing and Wang, Austin T. and Huo, Xiaoliang and Haurum, Joakim Bruslund and Lowe, Scott C. and Taylor, Graham W. and Chang, Angel X.},
  title={{CLIBD}: Bridging Vision and Genomics for Biodiversity Monitoring at Scale},
  journal={arXiv preprint},
  year={2024},
  eprint={2405.17537},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  doi={10.48550/arxiv.2405.17537},
}

@inproceedings{bioscan1m,
  title={A Step Towards Worldwide Biodiversity Assessment: The {BIOSCAN-1M} Insect Dataset},
  booktitle={Advances in Neural Information Processing Systems},
  author={Gharaee, Zahra and Gong, ZeMing and Pellegrino, Nicholas and Zarubiieva, Iuliia and Haurum, Joakim Bruslund and Lowe, Scott C. and McKeown, Jaclyn T. A. and Ho, Chris C. Y. and McLeod, Joschka and Wei, Yi-Yun C. and Agda, Jireh and Ratnasingham, Sujeevan and Steinke, Dirk and Chang, Angel X. and Taylor, Graham W. and Fieguth, Paul},
  pages={43593--43619},
  publisher={Curran Associates, Inc.},
  year={2023},
  volume={36},
  url={https://proceedings.neurips.cc/paper_files/paper/2023/file/87dbbdc3a685a97ad28489a1d57c45c1-Paper-Datasets_and_Benchmarks.pdf},
}