A multi-view multi-label fast model for Auricularia cornea phenotype identification and classification

Data acquisition

Data acquisition equipment

The images of dried Auricularia cornea fruiting bodies used in this study were captured with the FScan2000 edible mushroom phenotyping device. The device, depicted in Fig. 1, comprises two main components: the edible mushroom phenotyping box and the image processing software. Inside the box, three cameras of identical configuration are positioned at the top, the left side, and the bottom. The phenotyping box measures 570 mm \(\times\) 430 mm \(\times\) 280 mm. Each camera has a resolution of 16 megapixels (4608 \(\times\) 3456) on a 1/2.3-inch CMOS sensor, and the lens focal length is 8 mm. The internal light source of the box combines 360\(^\circ\) surround lighting with LED white light. The maximum shooting size of the phenotyping box is 400 mm \(\times\) 300 mm, with a minimum accuracy of 0.12 mm.

Fig. 1 Edible mushroom phenotype collection device: FScan2000.

Data collection methods

The experimental materials used in this study were obtained from Haotian Village, Najin Town, Taonan City, Jilin Province. They comprise a white variant strain of Auricularia cornea, a new edible mushroom variety bred by the team led by academician Yu Li at Jilin Agricultural University15. This strain exhibits higher protein and reducing sugar content than common Auricularia cornea varieties. As depicted in Fig. 2, the images collected from the side where the ear stalk of the fruiting body is located are treated as the top view, while the opposite side constitutes the bottom view. The fruiting bodies are arranged on the platform as shown in Fig. 2a, and the images collected from the right side are treated as the side view.

Fig. 2 Three views of Auricularia cornea. (a) Top view, (b) side view, (c) bottom view.

After drying, Auricularia cornea tends to curl, so its features differ markedly when observed from different angles, and the observable angles become random once it is placed on a flat surface. As shown in Fig. 3, each perspective provides only partial phenotypic characteristics. Information such as the quantity, pigmentation, and damage of the fruiting bodies is distributed across different views, making it difficult to analyze the phenotypes accurately and comprehensively from a single perspective.

Fig. 3 Grading feature distribution in different views of Auricularia cornea images.

In this study, Auricularia cornea samples were placed on a transparent tray in the center of the FScan2000 so that the camera distances and angles were fixed, and images were captured from the top, bottom, and side views to ensure the completeness of the phenotypic information. To further ensure the accuracy and reliability of the data, multi-label annotations were applied to the images of each sample. In accordance with the dried Auricularia cornea standard DB22/T 2605-2016, six phenotypic indicators of the fruiting bodies were recorded in detail: size, quantity, shape, color, presence of pigmentation, and presence of damage. Based on these indicators, the fruiting bodies were classified into four grades: first grade, second grade, third grade, and off-grade. The classification criteria are provided in Table 1.
Table 1 Phenotypic classification of Auricularia cornea.

Standard ruler measurements were used to record the size of the Auricularia cornea fruiting bodies, which were categorized into five classes (grades 1-5) by maximum diameter: 20 mm or less (20 mm-), 20-30 mm (20 mm+), 30-40 mm (30 mm+), 40-50 mm (40 mm+), and 50 mm or above (50 mm+). The quantity of fruiting bodies was classified into two categories (grades 1, 2), representing single or multiple. The shape of the fruiting bodies was categorized into three classes based on the degree of curling after drying: fully expanded, naturally curled, and curled into a ball. The color of the fruiting body was classified into three grades (grades 1, 2, 3). Pigmentation status was divided into two categories (grades 1, 2), indicating the absence or presence of spots. Damage condition was classified into two categories (grades 1, 2), representing the absence or presence of damage. The distribution of the label data for the Auricularia cornea samples is presented in Table 2.
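Before turning to Table 2, the following is a minimal sketch of how this six-label annotation scheme could be represented in code. The field and file names are illustrative assumptions, not the paper's actual annotation format.

```python
# Illustrative schema for the six phenotype grading tasks described above.
# Category names are assumptions for this sketch; only the class counts
# (5, 2, 3, 3, 2, 2) follow the text and Table 1.
LABEL_SCHEMA = {
    "size":         ["20mm-", "20mm+", "30mm+", "40mm+", "50mm+"],
    "quantity":     ["single", "multiple"],
    "shape":        ["fully_expanded", "naturally_curled", "curled_into_ball"],
    "color":        ["grade_1", "grade_2", "grade_3"],
    "pigmentation": ["absent", "present"],
    "damage":       ["absent", "present"],
}

# One annotated sample: three view images plus one class index per task.
sample = {
    "views": {"top": "top_0001.jpg", "side": "side_0001.jpg", "bottom": "bottom_0001.jpg"},
    "labels": {"size": 2, "quantity": 0, "shape": 1, "color": 0, "pigmentation": 0, "damage": 0},
}
```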
Table 2 Distribution of data for each label.

The dataset comprises 691 sets of tri-view Auricularia cornea data, with each set including the top view, bottom view, and side view images of the same sample.

Data preprocessing

The model employed in this study takes images of size 224\(\times\)224 as input. Directly using the collected images as input would compress the Auricularia cornea regions significantly. As depicted in Fig. 4, this study therefore crops the Auricularia cornea out of the original images, centered at its position. Specifically, the positions of the Auricularia cornea in the collected images were first annotated with the LabelMe software. A YOLOv8 model was then fine-tuned on a small portion of the annotated data to detect and segment the Auricularia cornea.

Fig. 4 Cropping of Auricularia cornea from the original images.

The orientation of Auricularia cornea on a flat surface is random, so the same sample exhibits different characteristics when captured from different angles. As illustrated in Fig. 5, this study enhances data diversity by augmenting the cropped Auricularia cornea images through angle flipping. By augmenting the dataset to four times its original size, a total of 2764 sets of tri-view Auricularia cornea data were obtained.

Fig. 5 Data augmentation methods.

Dataset partitioning

From the 2764 sets of tri-view Auricularia cornea data, this study constructed datasets for different view combinations, including top view, side view, bottom view, top and side views, bottom and side views, and top and bottom views. Each of these datasets was divided into training and testing sets in an 8:2 ratio.

In practical production applications, when randomly placed Auricularia cornea is photographed from above, the resulting images contain both top views and bottom views. Experiments showed that the combination of top and side views, as well as the combination of bottom and side views, both cover the complete phenotypic features of Auricularia cornea and can therefore support the phenotype identification and classification tasks. This study thus selected either the top view or the bottom view, combined with the side view, to split each set of tri-view data into two sets of dual-view data. As a result, 5528 sets of dual-view Auricularia cornea data were obtained and divided into training and testing sets in an 8:2 ratio: 4423 sets were allocated to the training set and 1105 sets to the testing set.

Model structure and improvement

A multi-view, multi-task rapid grading model was developed in this study. The top-level parameters of the backbone network are shared, whereas the bottom-level parameters of the backbone network and the multi-task classifier parameters remain independent. First, a lightweight and fast CNN was designed as the feature extraction module for the different views. To differentiate the feature information required by the different classification tasks, a multi-task classifier based on class-specific attention was constructed. In addition, homoscedastic uncertainty was employed to weight the losses of the different tasks, and distinct loss function weights were designed for each classification task based on the data distribution of each phenotype. Figure 6 illustrates the overall structure of the model.
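As a rough illustration of this wiring (not the authors' exact implementation), the sketch below shows two view-specific early branches, shared later stages that fuse the two views into a 7\(\times\)7\(\times\)640 feature map, and six task-specific heads. All layer sizes, strides, and names are assumptions; the actual stages use the PConv Blocks and channel attention described in the following subsections.

```python
import torch
import torch.nn as nn

class MultiViewMultiTaskNet(nn.Module):
    """Illustrative skeleton: two view-specific branches, shared fusion stages,
    and one classifier head per grading task (5, 2, 3, 3, 2, 2 classes)."""
    def __init__(self, num_classes=(5, 2, 3, 3, 2, 2), feat_dim=640):
        super().__init__()
        # View-specific early stages (the "dual-channel" part of the backbone).
        self.branch_a = nn.Sequential(nn.Conv2d(3, 64, 4, 4), nn.BatchNorm2d(64), nn.GELU())
        self.branch_b = nn.Sequential(nn.Conv2d(3, 64, 4, 4), nn.BatchNorm2d(64), nn.GELU())
        # Shared later stages that merge the two views.
        self.shared = nn.Sequential(
            nn.Conv2d(128, feat_dim, 3, stride=8, padding=1),
            nn.BatchNorm2d(feat_dim), nn.GELU(),
        )
        # Six task-specific heads.
        self.heads = nn.ModuleList([nn.Linear(feat_dim, c) for c in num_classes])

    def forward(self, view_a, view_b):          # two 3x224x224 view images
        fa, fb = self.branch_a(view_a), self.branch_b(view_b)
        fused = self.shared(torch.cat([fa, fb], dim=1))   # (B, 640, 7, 7)
        pooled = fused.mean(dim=(2, 3))                   # global average pool
        return [head(pooled) for head in self.heads]      # six sets of task logits
```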
The dual-view data of Auricularia cornea are input into the lightweight fast CNN to extract comprehensive phenotypic features. The class-specific attention classifier then computes and extracts the features needed for each specific phenotypic classification task, achieving precise identification and classification of Auricularia cornea phenotypes.

Fig. 6 Multi-view and multi-task fast model structure.

The lightweight and fast CNN employed for multi-view feature extraction

The PConv (partial convolution) method efficiently extracts spatial features while reducing both redundant computation and memory access16. Figure 7 depicts the operating principle of PConv: a conventional convolution is applied to only part of the input channels for spatial feature extraction, while the remaining channels are left unchanged. Thus, when Cp is 1/4 of the total number of channels, the FLOPs (floating point operations) of PConv amount to only 1/16 of those of a regular convolution. Each partial convolution divides the input feature map into two parts and applies the conventional convolution to the first Cp channels; this channel partitioning remains fixed. However, different channels contribute unequally to the identification and classification task, which can cause convergence difficulties and degrade the final grading performance of the model. This study therefore introduces a channel attention mechanism to model the relationships among the channels of the input data. This mechanism allows the model to dynamically adjust the weight of each channel, amplifying significant features and suppressing irrelevant ones, thereby enhancing the feature extraction ability of the model17.

Fig. 7 Partial convolution structure diagram.

Figure 8 illustrates the operating principle of the channel attention module. First, features are extracted from the input data through operations such as convolutional layers, producing multiple feature maps, each corresponding to one channel. Then, global average pooling, fully connected layers, and a sigmoid function are applied to calculate the importance weight of each channel; these weights signify the contribution of each channel to the final prediction. The channel weights are applied to the corresponding feature maps, and the features of the different channels are merged through weighted fusion to generate the final feature representation. Finally, the fused features are passed to the subsequent layers of the network.

Fig. 8 Channel attention module structure.

Building upon SPConv, this study developed a lightweight and fast CNN for multi-view feature extraction. Table 3 shows the architecture of the multi-view feature extraction network, which is partitioned into four stages, each comprising a down-sampling layer and feature extraction layers. The down-sampling layer consists of a 2D convolutional layer and a batch normalization operation. The feature extraction layer employs stacked Blocks, where each Block consists of a PConv, a 1\(\times\)1 convolution for dimensionality expansion, and another 1\(\times\)1 convolution for dimensionality reduction. Batch normalization and GELU activation are applied after the first 1\(\times\)1 convolution, and a skip connection is used to facilitate residual learning.
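A minimal PyTorch sketch of such a PConv operation and the Block built around it is given below. The 3\(\times\)3 kernel and expansion ratio are assumptions of this sketch rather than values confirmed by the paper.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution (sketch): convolve only the first dim // n_div channels
    and pass the remaining channels through unchanged, as in Fig. 7."""
    def __init__(self, dim, n_div=4):
        super().__init__()
        self.dim_conv = dim // n_div              # Cp = 1/4 of channels -> ~1/16 FLOPs
        self.dim_keep = dim - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_keep], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

class Block(nn.Module):
    """One feature-extraction Block: PConv, 1x1 expansion, 1x1 reduction,
    with BN + GELU after the first 1x1 and a residual skip connection."""
    def __init__(self, dim, expand=2):
        super().__init__()
        self.pconv = PConv(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * expand, 1, bias=False),
            nn.BatchNorm2d(dim * expand),
            nn.GELU(),
            nn.Conv2d(dim * expand, dim, 1, bias=False),
        )

    def forward(self, x):
        return x + self.mlp(self.pconv(x))        # residual learning via skip connection
```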
In Stage 1 and Stage 2, the feature extraction layers consist of two parallel sets of Blocks (dual-channel processing), with each set dedicated to extracting features from one of the two views. In Stage 3 and Stage 4, the feature extraction layers are single-channel and are responsible for integrating the features from both views. Additionally, a channel attention module is introduced after the down-sampling layer in Stage 3.
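The channel attention module described above follows the squeeze-and-excitation pattern (global average pooling, fully connected layers, sigmoid, per-channel reweighting). A minimal sketch, with the reduction ratio as an assumption, might look like this:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention as in Fig. 8: global average pooling -> FC layers ->
    sigmoid -> per-channel reweighting of the feature maps."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))           # channel importance weights
        return x * w.view(b, c, 1, 1)             # amplify informative channels
```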
Table 3 Backbone network architecture.

During drying, Auricularia cornea undergoes significant deformation, so its phenotypic information is distributed differently across the perspectives from which it is observed. In this study, images from two different perspectives are input simultaneously into the corresponding branches of the multi-view feature extraction network. After feature extraction, feature maps of size 7\(\times\)7 with 640 channels are produced. By sharing the top-level parameters of the multi-view feature extraction network, the model tends to extract information relevant to multiple phenotypic features of Auricularia cornea. This reduces the risk of overfitting and ensures comprehensive training across all tasks. Moreover, parameter sharing significantly reduces the number of network parameters compared with the total parameters of N individual task-specific networks, which implies higher efficiency of multi-task learning models in real-time multi-task prediction scenarios18.

Class-specific attention classifier for multi-task learning

There are strong correlations among the phenotypes of Auricularia cornea fruiting bodies, such as size, quantity, and shape. The proposed model achieves multi-task learning of these phenotypes by sharing part of the parameters. A challenge remains, however: enabling the different classifiers to identify the features relevant to their respective classification tasks more effectively and efficiently. To address this, this study introduces a class-specific residual attention (CSRA) mechanism to capture the distinct feature regions attended to by the different tasks, and constructs a class-specific attention classifier for multi-task classification.

As shown in Fig. 9, the class-specific attention multi-task classifier developed in this study operates in two stages. In Stage One, eight parallel CSRA modules are used to extract class-specific attention for the various classification tasks. The Residual Attention module in Fig. 9 depicts the structure of CSRA. Each CSRA module consists of parallel average pooling and spatial pooling branches. The average pooling branch computes the average feature over the entire input, yielding class-agnostic average-pooled features, as shown in Eq. (1). The spatial pooling branch computes spatial attention scores for each category, as shown in Eq. (2), and uses them to produce class-specific spatially pooled features, as shown in Eq. (3). By weighting and adding these two features, the importance of class-specific features within the global feature is enhanced, yielding the class-specific residual attention features, as depicted in Eq. (4).
$$\begin{aligned} \textrm{g}=\frac{1}{49}\sum _{k=1}^{49}\textrm{x}_{k} \end{aligned}$$
(1)
$$\begin{aligned} s_{j}^{i}=\frac{\exp \left( T\textrm{x}_{j}^{T}\textrm{m}_{i}\right) }{\Sigma _{k=1}^{49}\exp \left( T\textrm{x}_{k}^{T}\textrm{m}_{i}\right) } \end{aligned}$$
(2)
$$\begin{aligned} \textrm{a}^{i}=\sum _{k=1}^{49}s_{k}^{i} \textrm{x}_{k} \end{aligned}$$
(3)
$$\begin{aligned} \textrm{f}^i=\textrm{g}+\lambda \textrm{a}^i \end{aligned}$$
(4)
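Putting Eqs. (1)-(4) together, one CSRA module can be sketched as follows. The temperature T and weight \(\lambda\) are hyperparameters whose values here are assumptions, and the shapes assume the 7\(\times\)7\(\times\)640 feature maps produced by the backbone; in the paper, eight such modules run in parallel and their outputs feed the task classifiers described next.

```python
import torch
import torch.nn as nn

class CSRA(nn.Module):
    """Class-specific residual attention, following Eqs. (1)-(4)."""
    def __init__(self, feat_dim, num_classes, T=1.0, lam=0.1):
        super().__init__()
        self.m = nn.Conv2d(feat_dim, num_classes, 1, bias=False)  # class vectors m_i as a 1x1 conv
        self.T, self.lam = T, lam

    def forward(self, x):                                  # x: (B, 640, 7, 7)
        b, d, h, w = x.shape
        logits = self.m(x).flatten(2)                      # (B, C, 49): x_k^T m_i per position
        s = torch.softmax(self.T * logits, dim=-1)         # Eq. (2): spatial attention scores
        feats = x.flatten(2)                               # (B, 640, 49)
        g = feats.mean(dim=-1, keepdim=True).transpose(1, 2)   # Eq. (1): (B, 1, 640)
        a = torch.einsum("bck,bdk->bcd", s, feats)         # Eq. (3): class-specific pooling
        return g + self.lam * a                            # Eq. (4): (B, C, 640) attended features
```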
In Stage Two, the spatial features attended to by the spatial pooling branches of the different CSRA modules differ. The class-specific attention features obtained from the eight CSRA modules are aggregated, and six parallel fully connected layers serve as the classifiers for the six tasks. Finally, classification predictions are made from the feature vectors produced by the class-specific attention processing, enabling each task classifier to identify the features it focuses on.

Fig. 9 Class-specific attention classifier structure.

Uncertainty weight-based optimization of multiple objective loss functions

On the one hand, some phenotypic classification tasks exhibit an imbalance between positive and negative samples. When one category contains significantly more samples, the value of the loss function is dominated by that category, biasing the model towards the larger category during classification. In this study, different weight values are assigned to the categories of each phenotypic classification task: the weights of the smaller categories are increased, so that the loss value is regulated according to the sample quantity of each category. This category-based weighting alleviates the training difficulties caused by sample imbalance.

On the other hand, multi-task learning of Auricularia cornea phenotypes requires optimizing the model across six tasks, a common challenge in multi-task learning. A straightforward method for combining multi-task losses is to sum the individual task losses with linear weights. However, the model's performance is highly sensitive to the choice of these weights. An effective and convenient alternative is to learn the relative weights through homoscedastic (same-variance) uncertainty. Homoscedastic uncertainty is a form of uncertainty that is independent of the incidental noise in the input data19. It is not an output of the model but a value that remains constant for all input data and varies across tasks; it can therefore be described as task-specific uncertainty. The optimal weight for each task depends on the measurement scale and, ultimately, on the magnitude of the task noise. Equation (5) shows the two-task case, where W denotes the shared network weights, \(\mathscr {L}_{1}(W)\) and \(\mathscr {L}_{2}(W)\) are the task losses, and \(\sigma _{1}\) and \(\sigma _{2}\) denote the magnitudes of the task noise; in this study the formulation is extended to the six sub-tasks.
$$\begin{aligned} \mathscr {L}(W,\sigma _{1},\sigma _{2})=\frac{1}{2\sigma _{1}^{2}}\mathscr {L}_{1}(W)+\frac{1}{2\sigma _{2}^{2}}\mathscr {L}_{2}(W)+\log \sigma _{1}\sigma _{2} \end{aligned}$$
(5)
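As a hedged illustration, Eq. (5) can be extended to the six grading tasks by learning one noise parameter per task. The sketch below learns \(\log \sigma _i^{2}\) directly for numerical stability, which is a common reparameterization and an assumption here rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Homoscedastic-uncertainty weighting (Eq. 5) over several task losses:
    each loss is scaled by 1/(2*sigma_i^2) and log(sigma_i) acts as a regularizer."""
    def __init__(self, num_tasks=6):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))   # learnable log(sigma_i^2)

    def forward(self, task_losses):
        total = 0.0
        for loss, log_var in zip(task_losses, self.log_vars):
            precision = torch.exp(-log_var)                    # 1 / sigma_i^2
            total = total + 0.5 * precision * loss + 0.5 * log_var
        return total
```

The per-category weights discussed above could be supplied to each task's cross-entropy loss (for example via the weight argument of nn.CrossEntropyLoss) before the six task losses are passed to such a module.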
Experimental setup

Experimental platform

All model training and testing were conducted on the same computer and server with the following specifications: Linux version 5.15.0-60-generic; the server was equipped with an NVIDIA A800 80G GPU and a 24 vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz; the computer's graphics card was an MX150; the programming platform was Anaconda 3.5 with CUDA 10.2; the development environment was PyTorch, with programming in Python 3.8.10.

Evaluation indices

To evaluate the performance of the multi-view and multi-task fast model, this paper applies the following evaluation metrics: precision, recall, multi-label F1 score (macro-F1), exact match ratio, and average inference time. Given the multi-task nature of the model, the macro-F1 is taken as the average over the six grading tasks of Auricularia cornea, and a sample counts towards the exact match ratio only when all grading tasks are predicted correctly. Each evaluation metric is calculated as follows:
$$\begin{aligned} & P=\frac{TP}{TP+FP} \end{aligned}$$
(6)
$$\begin{aligned} & R=\frac{TP}{TP+FN} \end{aligned}$$
(7)
$$\begin{aligned} & \text {Accuracy}=\frac{\text {TP+TN}}{\text {TP+FN+FP+TN}} \end{aligned}$$
(8)
$$\begin{aligned} & \text {F1-score}=\frac{2\times \text {Precision}\times \text {Recall}}{\text {Precision+Recall}} \end{aligned}$$
(9)
$$\begin{aligned} & \text {macro-F1}=\frac{\text {F1-score}_{1}+\text {F1-score}_{2}+\cdots +\text {F1-score}_{n}}{n} \end{aligned}$$
(10)
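A minimal sketch of how the macro-F1 over the six tasks (Eq. 10) and the exact match ratio could be computed, assuming scikit-learn is available and that predictions and labels are stored as class indices per task, is shown below.

```python
import numpy as np
from sklearn.metrics import f1_score

def evaluate(preds, labels):
    """Compute macro-F1 averaged over the grading tasks and the exact match ratio.
    `preds` and `labels` have shape (num_samples, num_tasks) with class indices."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    per_task_f1 = [f1_score(labels[:, t], preds[:, t], average="macro")
                   for t in range(labels.shape[1])]
    macro_f1 = float(np.mean(per_task_f1))                     # average over the n tasks
    exact_match = float(np.mean((preds == labels).all(axis=1)))  # all tasks correct
    return macro_f1, exact_match
```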
