BIOSCAN-ML integrates machine learning with biological expertise to advance biodiversity monitoring through DNA barcodes. It is part of BIOSCAN, a larger international collaboration led by the International Barcode of Life (iBOL) Consortium.
To foster the use of machine learning for biodiversity monitoring with DNA barcodes, we have released two datasets of insect specimen images paired with DNA barcodes and partial taxonomic labels.
To work with the BIOSCAN datasets, please use our dataset package. To browse the BIOSCAN-5M dataset, we provide the BIOSCAN-Browser.
With these datasets, we are investigating different machine learning techniques to classify specimens based on image and/or DNA barcode, and to develop algorithms to support biologists in taxonomic classification of potentially novel species. Below are recent works investigating different pretraining strategies for DNA barcodes and images for use in taxonomic classification:
- BarcodeBERT uses BERT-like masked language modeling to learn representations for DNA barcodes.
- BarcodeMamba compares Mamba with transformer architectures for DNA barcodes.
- BarcodeMAE uses MAE-LM to learn representations for DNA barcodes.
- CLIBD uses CLIP-style contrastive learning to align image and DNA representations.
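
To make the masked-modeling idea behind the barcode pretraining work above more concrete, here is a minimal sketch of how masked inputs could be prepared for a DNA barcode. It is not taken from BarcodeBERT or BarcodeMAE: the non-overlapping k-mer tokenization, the value of `k`, the mask ratio, the function names, and the example sequence are all assumptions chosen for illustration.

```python
import random

MASK = "[MASK]"  # placeholder token; the name is an assumption for this sketch


def kmer_tokenize(seq, k=4):
    """Split a DNA sequence into non-overlapping k-mers, dropping any trailing remainder."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_tokens(tokens, mask_ratio=0.5, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and a dict mapping each masked
    position to the original token a model would be trained to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = tokens[p]
        masked[p] = MASK
    return masked, targets


# Made-up nucleotide fragment for illustration only, not real barcode data.
barcode = "AACATTATATTTTATTTTTGGAATTTGAGCAGGA"
tokens = kmer_tokenize(barcode, k=4)
masked, targets = mask_tokens(tokens, mask_ratio=0.5)
print(masked)
print(targets)
```

During pretraining, a model would receive `masked` as input and be scored on how well it recovers the entries of `targets`. The encoder architecture and the exact masking scheme are what distinguish the individual methods.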