DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification | BMC Bioinformatics

Gene sequence classification

The rapid advancement of high-throughput sequencing technologies has revolutionized the study of microorganisms, shifting it away from sole reliance on cultured cells or virus strains toward direct sampling from unknown environmental sources [4]. In medical research, the significance of microorganisms in numerous diseases is evident [32]. However, processing genetic data from microorganisms collected within the human body is challenging, because direct environmental sampling introduces unknown components. The first step is therefore to determine the source of the samples [15]. Consequently, the classification of short gene sequences becomes a basic task [29]. Furthermore, in infectious disease virus research, swift identification of pathogen types is of paramount importance for subsequent treatment [33]. The classification of microbial gene sequences is therefore a pivotal field of study.

Traditionally, microbial gene sequence classification relied on a homology-based approach: searching for similar DNA/RNA sequences within databases. Methods such as BLAST [2], BLAT [20], BLASTX [2], Diamond [7], BWA [23], BOWTIE [22], and others have demonstrated high accuracy. However, they have considerable limitations, as numerous gene sequences cannot be classified because they match poorly with every gene type in the database. This often stems from missing data in genomic databases; that is, the genetic sequences of many species are absent. Consequently, homology-based approaches are often ineffective when dealing with new species. Additionally, the slow data processing speed of homology-based methods severely restricts their utility [27].

Recently, diverse machine learning-based approaches, including deep learning, have emerged to address these challenges. 
Unlike traditional methods that rely on searching existing databases, machine learning techniques learn mathematical functions by training on available databases to accomplish predictive tasks. Deep learning also holds significant research value for the representation learning of microbial gene sequence data [14], and its use in handling such data is gaining momentum [12]. Antonino Fiannaca et al. (2018) [3] proposed a 16S short-read sequence classification technique based on a k-mer representation and a deep learning architecture, which generates a model for each taxonomic unit. It was validated as an effective method for bacterial sequence classification and could be integrated into commonly used metagenomic analysis tools to classify shotgun (SG) and amplicon (AMP) data. Mateo Rojas-Carulla et al. (2019) [25] proposed GeNet, a method for shotgun metagenomic classification from raw DNA sequences that exploits the hierarchical structure among taxonomic labels during training. It shows competitive accuracy and good recall while requiring fewer memory resources. The representations learned by GeNet are also practical for biological tasks, enabling pathogen detection accuracy of more than 90%. Qiaoxing Liang et al. (2020) [24] proposed DeepMicrobes, a deep learning-based framework that overcomes limitations in the taxonomic classification of new species in metagenome studies, has superior species and genus identification accuracy, and is competitive in abundance estimation, helping to explore the role of unknown metagenome species. Meryem Altin Karagoz et al. (2021) [1] proposed a deep learning method based on a k-mer representation combined with the relative abundance index (RAI) to classify metagenomic fragments, and showed competitive performance on metagenomic data generated by different sequencing platforms. 
That work was the first to use the RAI score as a spectral representation in a deep learning algorithm, showing improved performance for data sets with multiple parameter ranges. Among natural language processing models, Florian Mock et al. (2022) [13] proposed BERTax, a neural network based on natural language processing that precisely classifies the superkingdom and phylum of DNA sequences without relying on representative relatives in databases. It matches or exceeds existing methods of species classification, especially when dealing with new species. Combining BERTax with databases further improves prediction quality, expanding accurate classification across diverse genomic sequences and enhancing overall information acquisition.

In addition to metagenomic applications, deep learning models have also been applied to virus sequences. Tampuu et al. (2019) [30] introduced ViraMiner, a deep learning method that identifies diverse viruses in human biospecimens and addresses the challenge of detecting unknown or highly divergent viruses. Using convolutional neural networks on raw metagenomic contigs from 19 experiments, ViraMiner significantly outperforms other machine learning methods, achieving an area under the ROC curve of 0.923 with 300 bp contigs. It is the first model capable of detecting viral sequences within raw metagenomic data, providing insights into “unknown” sequences and enhancing our understanding of infectious diseases. Jie Ren et al. (2020) [28] introduced DeepVirFinder, a reference-free machine learning method that excels in identifying viral sequences in metagenomic data, surpassing traditional methods. Trained on extensive pre-2015 data and enriched with additional viral sequences, it outperforms VirFinder. In colorectal carcinoma (CRC) patient samples, it detected 51,138 viral sequences within 175 bins, showing potential for non-invasive CRC diagnosis. Jakub M. Bartoszewicz et al. 
(2021) [5] used deep neural networks to reliably predict whether a virus can directly infect humans, and developed interpretative tools and novel nucleotide-resolution correlates graph methods that can be used to detect regions of interest in novel pathogens such as the SARS-CoV-2 coronavirus. In addition, in the field of proteins, Wang Liu-Wei et al. (2021) [31] proposed DeepViral, a deep learning method for predicting protein–protein interactions (PPI) between humans and viruses. However, these methods typically rely on labeled data for model training, which is challenging because labels for microbial data are scarce, and this scarcity complicates feature extraction. Additionally, achieving a model with broad applicability proves to be difficult.

Contrastive learning

Contrastive learning currently stands out as a promising direction in machine learning, particularly for unsupervised feature extraction. Its fundamental concept is to train the network’s feature extraction capability by contrasting similar and dissimilar data points in the feature space (Fig. 1): the encoder is trained so that the vector representations of similar data are as close as possible, while those of dissimilar data are as far apart as possible. This approach has proven effective in various domains, including computer vision, signal processing, and natural language processing [21]. Several noteworthy studies have emerged, such as SimCLR v1/v2 [8, 9], MoCo v1/v2/v3 [10, 11, 18], and BYOL [16], achieving state-of-the-art performance across multiple domains. SimCLR, MoCo, and BYOL represent three significant methods for unsupervised feature extraction. SimCLR emphasizes data augmentation and a contrastive loss to learn more useful feature representations. 
MoCo employs momentum contrast to learn from unlabelled data, using momentum updates to construct the set of contrast samples. BYOL is a self-supervised approach that trains a network to predict the representation of a differently augmented view of the same input in order to learn visual feature representations. These methods train models to distinguish between different data points drawn from a large pool of unlabelled data to derive the final feature extraction model, which significantly enriches the training methods available for unsupervised learning and enables complex neural network models to be applied to large-scale unlabelled data. Because it contrasts different data, this approach can learn rich and distinct representations, which suggests broad prospects for applying contrastive learning to various types of data [6].

Fig. 1 A concise illustration of contrastive learning [8]. Two independent data augmentation operations (t ∼ T and t′ ∼ T) are applied to the same input data, resulting in two associated data representations. A gated embedding-vector encoder network f(·) and a feedforward neural network g(·) are trained to maximize agreement using a contrastive loss. After pre-training, the feedforward neural network g(·) is discarded and the encoder network f(·) is used for the follow-up work.

In summary, current research in contrastive learning demonstrates the effectiveness of training feature extraction networks by contrasting different data. Many contrastive learning models have achieved excellent results in their respective domains [26], proving their ability to efficiently derive a powerful feature extraction model from unlabelled data. However, despite its success in other fields of machine learning, including computer vision and natural language processing [17], contrastive learning has not been widely applied in microbiome bioinformatics research. 
While it holds considerable potential, as demonstrated in various domains, its adoption remains limited in microbial genomics and metagenomics analysis. Most studies in microbiome bioinformatics focus on traditional supervised and unsupervised learning techniques, leaving the potential of contrastive learning in this area largely untapped.

Our research contributions

To address the aforementioned challenges, this paper introduces the DNASimCLR framework, a deep learning method based on contrastive learning for the feature extraction of microbial sequence data. Unlike other approaches, we leverage unlabelled data for pre-training to enhance feature extraction. Our methodology involves two key steps: pre-training on unlabelled gene sequence data, followed by fine-tuning the resulting network for classification.

In terms of data processing, we employ one-hot coding to represent DNA sequences. The method is based on the SimCLR framework, with a convolutional neural network serving as the feature extraction module. To assess the performance of our classification method, we conducted tests on microbial gene databases from various sources. Applying our method, we performed taxonomic classification and short-sequence virus host prediction on read sequences of varying lengths (250 bp, 500 bp, 1000 bp, 1300 bp, and 10,000 bp), achieving a classification accuracy of 99%. Our contributions include:

(1) Pioneering the application of contrastive learning to the feature extraction of microbial gene sequences, along with the development of a data processing method that extends contrastive learning to genetic data, overcoming limitations observed in the original SimCLR approach designed for image data.

(2) Establishing a high-performance gene sequence classifier, substantially enhancing the effectiveness of existing deep learning methods.

(3) The division of our method into pre-training and classification phases facilitates easy adaptation to other genomics problems, such as gene function and metagenomic clustering. This adaptability underscores the versatility and broad applicability of the proposed DNASimCLR framework in advancing genomics research.
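
The one-hot coding of DNA sequences and the SimCLR-style contrastive objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `one_hot_encode` and `nt_xent_loss`, the all-zero encoding of ambiguous bases, and the temperature of 0.5 are assumptions made for illustration, and the loss follows the NT-Xent form introduced with SimCLR [8].

```python
import numpy as np

# Hypothetical helper names; the paper does not specify these signatures.
BASES = "ACGT"


def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA string as a (len(seq), 4) one-hot matrix.

    Assumption for this sketch: symbols outside A/C/G/T (for example N)
    are encoded as all-zero rows.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        col = index.get(base)
        if col is not None:
            encoded[pos, col] = 1.0
    return encoded


def nt_xent_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent contrastive loss for two batches of projected views.

    z1[i] and z2[i] are the projections of two augmented views of sample i;
    every other row in the combined batch acts as a negative.
    Rows are assumed to be non-zero vectors.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0).astype(np.float64)
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # unit norm: dot product = cosine
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # exclude each view's similarity with itself
    # Row i's positive partner sits n rows away in the concatenated batch.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable row-wise log-softmax.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - row_max - log_norm
    return float(-log_prob[np.arange(2 * n), positives].mean())
```

In a pre-training loop of this kind, `z1` and `z2` would be the outputs of g(f(·)) for two augmented views of the same batch of encoded sequences, and only the encoder f(·) would be kept for fine-tuning. The loss is lower when paired views agree than when the pairing is scrambled, which is the property the pre-training phase relies on.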
