Enhancing agriculture through real-time grape leaf disease classification via an edge device with a lightweight CNN architecture and Grad-CAM

Overall architecture of the proposed system

For this research, a preplanned approach was developed for efficient and rapid development and testing of the model. The complete process, from data collection and preprocessing to customizing the proposed model, training it, and evaluating it on the test data, is explained in detail and shown in the block diagram in Fig. 1.

Figure 1. Block diagram of the proposed research framework.

Image acquisition

The dataset for this research was obtained from the "Grapevine Disease Dataset (Original)"26, which contains four classes with a total of 7222 training images and 1805 test images (Fig. 2). Each class contains approximately 1600-1900 training images and 400-450 test images (Table 1). The images were originally unbalanced RGB images of 256 × 256 pixels each.

Figure 2. Four classes of grape leaves from the dataset. (a) Black rot, (b) ESCA, (c) healthy, (d) leaf blight.

Table 1. Original downloaded dataset with distribution of grape leaf classes for training and testing.

The grape disease dataset26 obtained from Kaggle contained a small amount of data, which is not sufficient for proper model building. When a model with a large number of parameters is given only a limited amount of data, its ability to learn the underlying patterns effectively is compromised, making it vulnerable to overfitting27. In addition, the imbalance within each class can lead to the development of a biased model.

Preprocessing of image data

Data preprocessing is one of the most important steps in getting data ready for CNN model training. Upon inspecting the dataset, duplicate/redundant images were found in every class, which increased training time and memory consumption. Additionally, a few images had been assigned to the wrong class by the provider, which, during initial training, resulted in poor accuracy and incorrect predictions. Therefore, duplicate and misclassified images were removed manually.

To remove duplicate images, an image hashing algorithm (average hash) was used. This procedure reduces every image in the dataset to a consistent, compact representation, usually an 8 × 8 pixel grid. Each image was converted to grayscale and resized to this fixed size, the mean pixel intensity was calculated, and a binary hash was created in which every bit encodes whether the corresponding pixel's intensity is greater or less than the mean value28. Figure 3 presents an original image and its corresponding hashed image. When two or more hashes were found to be identical, the duplicate images were removed permanently.

Figure 3. Image hash technique steps (average hash) from left to right.

Because the work targets edge devices and lightweight models, the original image size also needed to be reduced; most lightweight models expect 224 × 224 pixel inputs. The original dataset downloaded from Kaggle had an image dimension of 256 × 256, which was converted to 224 × 224 using the Pillow library. The images were converted to RGB color mode to guarantee a complete representation of colors, and resizing was performed with the default antialiasing resampling filter, which preserves image quality while reducing aliasing artifacts. A comparison between an original image and a resized image is shown in Fig. 4.

Figure 4. Resizing images from 256 × 256 to 224 × 224 pixels.
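For illustration, the deduplication and resizing steps can be sketched as follows. This is a minimal example, assuming the training images are organized in one folder per class; it uses the imagehash and Pillow packages, and the directory paths and the LANCZOS resampling choice are illustrative rather than taken from the paper.

```python
# Minimal sketch: remove duplicate images with an average hash, then resize the
# survivors to 224 x 224 RGB. The folder layout and filter choice are illustrative.
from pathlib import Path
from PIL import Image
import imagehash

SRC = Path("dataset/train")      # hypothetical source directory (one folder per class)
DST = Path("dataset/train_224")  # hypothetical output directory

for class_dir in SRC.iterdir():
    if not class_dir.is_dir():
        continue
    seen_hashes = set()
    out_dir = DST / class_dir.name
    out_dir.mkdir(parents=True, exist_ok=True)
    for img_path in class_dir.glob("*.jpg"):
        with Image.open(img_path) as img:
            h = str(imagehash.average_hash(img, hash_size=8))  # 8x8 grid -> 64-bit hash
            if h in seen_hashes:                               # identical hash -> duplicate
                continue
            seen_hashes.add(h)
            img.convert("RGB") \
               .resize((224, 224), Image.Resampling.LANCZOS) \
               .save(out_dir / img_path.name)
```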
An unbalanced dataset can cause a number of problems when training machine learning models, including biased learning, poor generalization, imbalanced loss contributions, and misleading evaluation metrics. The dataset was therefore balanced using image augmentation, in which five new augmented images were created from each image using width shift, height shift, rotation, flip, and zoom transformations. This approach improved the model's performance in three primary ways: it increased the size of the dataset without requiring the collection of additional data, improved generalization by exposing the model to a diverse range of variations, and enhanced robustness to real-world variations such as lighting and orientation. After removing duplicate images and applying augmentation, 6000 images were obtained for each class of the training dataset. A similar method was applied to increase the size of the test dataset, and it was ensured that the training and test sets never contained similar images. Figure 5 shows examples of image augmentation applied to create additional images, and Table 2 presents the final distribution of images in the dataset after augmentation.

Figure 5. Data augmentation to generate additional images from a single image.

Table 2. Final dataset after augmentation.
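A minimal augmentation sketch consistent with the transformations listed above is shown below. The specific shift, rotation, and zoom ranges are assumptions, as the paper does not report them, and the file paths are illustrative.

```python
# Minimal sketch: create five augmented variants of each training image with
# width/height shifts, rotation, horizontal flip and zoom. The ranges below are
# illustrative placeholders; the actual values are not reported in the text.
import os
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array

augmenter = ImageDataGenerator(
    width_shift_range=0.1,
    height_shift_range=0.1,
    rotation_range=20,
    horizontal_flip=True,
    zoom_range=0.15,
)

def augment_image(path, out_dir, n_copies=5):
    """Write n_copies augmented versions of the image at `path` into `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)
    img = img_to_array(load_img(path, target_size=(224, 224)))
    batch = np.expand_dims(img, axis=0)            # shape (1, 224, 224, 3)
    flow = augmenter.flow(batch, batch_size=1,
                          save_to_dir=out_dir,
                          save_prefix="aug",
                          save_format="jpeg")
    for _ in range(n_copies):
        next(flow)                                 # each call saves one augmented image
```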
Customizing pretrained CNN model

CNNs have considerably advanced the fields of agriculture and image processing with their remarkable ability to process visual input accurately. Thanks to deep learning frameworks and vast labeled image datasets, they have become a go-to choice for a variety of image-related tasks. In image classification, networks such as VGG16, ResNet, and Inception have shown exceptional, state-of-the-art (SOTA) performance when used with transfer learning. However, portability constraints encourage the development of lightweight CNN models that can still capture rich image features and patterns across multiple layers.

Scalability and resource constraints are crucial concerns for mobile devices, edge computing platforms, and real-time applications. Energy is another major concern: battery life must be conserved in portable devices, and reducing energy consumption is essential in edge computing scenarios. Lightweight models largely address these concerns by minimizing the computational burden. We chose MobileNetV3Large, the third version of the MobileNet family, as the model of interest for transfer learning. To keep its design lightweight, MobileNetV3Large employs multiple techniques, including inverted residual blocks, hard-swish activation functions, network architecture search, squeeze-and-excitation blocks, and a light classification head29. These characteristics make it a very suitable option for integration into embedded systems, including Internet of Things (IoT) and mobile devices, which often have constrained computational capabilities. Figure 6 shows the MobileNetV3 model architecture with its individual blocks.

Figure 6. MobileNetV3Large architecture.

The bottleneck component of MobileNetV3Large first expands the input feature space with a 1 × 1 convolution. Depthwise convolutions with different kernel sizes are then applied, optionally followed by a squeeze-and-excitation (SE) layer. Finally, a pointwise (1 × 1) convolution projects the feature space back down to a compact number of channels. A ResNet-style skip connection is added whenever the input and output tensors have the same shape, which increases the robustness of the architecture. The expansion operation, performed by the initial 1 × 1 convolution, increases the depth of the feature map and thus the capacity of the network to describe the data. The expansion ratio, commonly represented as the 'expand ratio', is a hyperparameter that controls the magnitude of this expansion, and the 1 × 1 convolution layer provides flexible control over the number of output channels. The expansion process plays a crucial role in the MobileNetV3Large architecture, as it prepares the feature map for the subsequent operations.

The SE layer (Fig. 7) dynamically recalibrates the feature maps by exploiting the dependencies between channels. The feature maps from the previous convolutional layer are the input to the SE block; consider, for example, a feature tensor of size 1 × 16 × 56 × 56, where 16 is the number of channels and 56 × 56 is the spatial dimension of each feature map. The first operation in the SE block, adaptive average pooling, compresses the spatial dimensions (56 × 56) of each channel into a single number, yielding a tensor of size 1 × 16 × 1 × 1 that condenses the overall spatial information of each channel into one value. This stage aggregates the feature responses across the spatial domain, providing a global representation of the input characteristics. A reshape operation then keeps the content of the data unchanged while adapting its shape for the subsequent fully connected (FC) layers: the tensor becomes a one-dimensional vector with 16 elements. This vector is fed into an FC layer that reduces the dimensionality to 8; this bottleneck layer captures the most salient aspects of the channels. A rectified linear unit (ReLU) activation follows, introducing nonlinearity and allowing complex functions to be learned. Another FC layer then restores the dimensionality to the original number of channels, for instance from 8 back to 16, followed by a ReLU6 activation, which is similar to ReLU but caps the output at 6 to prevent over-activation. The output of the FC layers is reshaped back into a tensor with singleton spatial dimensions (1 × 16 × 1 × 1); its values are the learned weights used to recalibrate the original feature maps. These weights are multiplied with the original feature maps channel-wise, so each channel of the input is scaled by the corresponding value of the recalibrated tensor. This rescaling enhances significant features and suppresses less important ones. The result of the scaling operation is the SE output, a tensor with the same dimensions as the SE input (1 × 16 × 56 × 56). Because the feature maps are weighted according to the importance assigned by the SE block, the recalibrated maps form a more refined set of features. These operations allow the network to learn which features to boost and which to suppress, optimizing performance across tasks that rely on feature discrimination.

Figure 7. Detailed block diagram of the squeeze and excitation block of MobileNetV3Large.
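To make this data flow concrete, the following is a minimal Keras sketch of an SE block matching the 16-channel example above. It uses the channels-last layout (rather than the channels-first 1 × 16 × 56 × 56 notation used in the text) and follows the description given here, with ReLU6 as the second activation; the official MobileNetV3 implementation differs in details such as using a hard-sigmoid gate.

```python
# Minimal sketch of the squeeze-and-excitation block described above:
# global average pooling, a 16 -> 8 bottleneck FC layer, restoration to 16
# channels, and channel-wise rescaling of the input feature maps.
import tensorflow as tf
from tensorflow.keras import layers

def se_block(inputs, reduction=2):
    """inputs: feature maps of shape (batch, H, W, C), e.g. (batch, 56, 56, 16)."""
    channels = inputs.shape[-1]                                       # e.g. 16
    x = layers.GlobalAveragePooling2D()(inputs)                       # squeeze: (batch, C)
    x = layers.Dense(channels // reduction, activation="relu")(x)     # 16 -> 8 bottleneck
    x = layers.Dense(channels, activation=tf.nn.relu6)(x)             # 8 -> 16 (ReLU6 as described)
    x = layers.Reshape((1, 1, channels))(x)                           # back to (batch, 1, 1, C)
    return layers.Multiply()([inputs, x])                             # excitation: rescale each channel

# Example: apply the block to a 56 x 56 x 16 feature map.
feature_maps = layers.Input(shape=(56, 56, 16))
recalibrated = se_block(feature_maps)   # same shape as the input, channels rescaled
```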
MobileNetV3Large was specifically designed for CPUs by using hardware-aware network architecture search (NAS) to optimize both the structure and the choice of nonlinear functions. Neural architecture search automates the process of determining optimal layer configurations and connection topologies by evaluating their performance on a validation dataset30. Additionally, the model replaces the ReLU nonlinearity used in earlier versions with the Hard Swish activation function, which approximates the Swish function using piecewise linear segments.

In this study, six models with different lightweight CNN architectures were trained and deployed on an edge device to identify the best-performing one based on accuracy, visualization, inference speed (CPU), power consumption, and confidence. The selected models were NASNetMobile, MobileNetV3Large, MobileNetV3Small, DenseNet121, EfficientNetV2B1, and EfficientNetV2B2, each in its customized form. Figure 8 shows the diagram of the customized model built upon the pretrained weights of MobileNetV3Large. After loading the image data, all images were shuffled to reduce overfitting during training and then preprocessed using the corresponding model's built-in preprocessing functions. The pretrained models were loaded with max pooling, and their top layers were removed because our own fully connected layers were used instead. The weights were initialized from values pretrained on the ImageNet dataset; on average, each synset in ImageNet is represented by approximately 1000 images, providing a substantial collection of meticulously labeled and organized images (10 million) for the majority of concepts within the WordNet hierarchy30. All pretrained layers of each model were frozen and used as-is, allowing the models to run more efficiently during training. Once the input layer of the pretrained model was replaced, the custom layers handled resizing and normalization. The fine-tuning process thus consisted of stacking a number of layers on top of the output of the pretrained model. The first added layer was a fully connected (dense) layer with 128 neurons and ReLU activation. To combat overfitting, a dropout layer with a rate of 0.45 was added; during training, this randomly zeroes out some of the input units. The process was repeated with another dense layer of 256 neurons, again followed by a dropout layer with the same rate. The final layer was a dense layer with 4 neurons and a softmax activation function representing the four classes; the softmax turns the output values into a normalized probability distribution over the classes. The loss function, "categorical_crossentropy", was optimized using Adam with a learning rate of 0.0001; Adam is well known to be effective and robust for optimizing complex, high-dimensional models31. The loss quantified the model's performance during training and served as the objective that the optimizer minimized, so the model learns to make high-confidence predictions for inputs of the correct class by penalizing incorrect predictions more heavily32.

Figure 8. Loading of pre-trained model (MobileNetV3Large) and addition of layers to construct the proposed network.
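A minimal sketch of this customized network is given below, under the assumptions that the 256-unit layer also uses ReLU and that accuracy is the tracked training metric (the text does not state either explicitly).

```python
# Minimal sketch of the customized classifier described above: a frozen
# MobileNetV3Large backbone (ImageNet weights, global max pooling, no top)
# followed by Dense(128)-Dropout(0.45)-Dense(256)-Dropout(0.45)-Dense(4, softmax),
# compiled with Adam (lr = 0.0001) and categorical cross-entropy.
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.MobileNetV3Large(
    input_shape=(224, 224, 3),
    include_top=False,          # drop the original classification head
    weights="imagenet",         # ImageNet-pretrained weights
    pooling="max",              # global max pooling on the backbone output
)
base.trainable = False          # freeze all pretrained layers

x = layers.Dense(128, activation="relu")(base.output)
x = layers.Dropout(0.45)(x)
x = layers.Dense(256, activation="relu")(x)   # ReLU here is an assumption
x = layers.Dropout(0.45)(x)
outputs = layers.Dense(4, activation="softmax")(x)   # four grape leaf classes

model = tf.keras.Model(base.input, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```

Building the head functionally on the backbone's output keeps the backbone layers addressable by name in the final model, which is convenient later for Grad-CAM.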
The end result is a customized model that combines the feature extraction capabilities of the pretrained model with supplementary layers designed for the specific classification task, a prevalent technique in transfer learning. This process facilitates the development of deep learning models that are both efficient and tailored to specific tasks33.

The hyperparameters listed in Table 3 were kept the same while training all six models. The learning rate was kept low to avoid overfitting and poor generalization. Since obtaining high model accuracy was the first goal, the monitoring metric was set to the training accuracy only, and the model was saved whenever the latest epoch reached a higher accuracy than the previous best.

Table 3. Hyperparameter values for training the model.

Model training, particularly for deep learning techniques, involves computationally intensive tasks such as matrix multiplication and gradient calculation. For quicker training, the models were trained on the Kaggle platform, where an Nvidia Tesla P100 GPU with 16 GB of memory was the key hardware. This GPU achieves a peak performance of 9.3 teraflops in single-precision calculations, a vital characteristic for model training. A comparison of the training times for the different models is shown in Fig. 9.

Figure 9. Training time for each of the models with the same parameters.
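The training and checkpointing strategy described above can be sketched as follows, continuing from the model built in the previous sketch. The epoch count is a placeholder (the actual value is in Table 3), and train_ds is an assumed, already prepared training dataset.

```python
# Minimal sketch: save the model only when the training accuracy improves on the
# previous best epoch, as described above. `model` is the compiled network from
# the previous sketch; `train_ds` is an assumed tf.data.Dataset yielding batches
# of preprocessed 224x224 images and one-hot labels.
import tensorflow as tf

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "mobilenetv3large_grape.h5",   # hypothetical output path
    monitor="accuracy",            # track training accuracy only
    mode="max",
    save_best_only=True,           # overwrite only when accuracy exceeds the best so far
    verbose=1,
)

history = model.fit(
    train_ds,
    epochs=50,                     # placeholder; the actual value is listed in Table 3
    callbacks=[checkpoint],
)
```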
Grad-CAM visualization

The Gradient-weighted Class Activation Mapping (Grad-CAM) technique provides insight into why a DL model made a particular prediction by visualizing the regions of the input image that most influenced the decision. The process begins with a forward pass through the network to produce raw prediction scores for every class for the incoming image34; these initial scores set the stage for the subsequent steps. After the forward pass, backpropagation is performed: the gradients of the predicted class score are calculated with respect to the feature maps of the final convolutional layer. These gradients indicate how sensitive the model's output is to changes in the feature maps. The gradients are then spatially averaged into importance weights, and the weighted aggregation of the feature maps highlights the essential regions responsible for the network's classification35.

For this research, Grad-CAM was implemented on bulk images from the test dataset as well as on single real-time leaf images. The goal was to differentiate diseased leaves from healthy leaves by highlighting them. The general Grad-CAM visualization technique (Fig. 10) was slightly modified in this work so that only diseased classes are marked while healthy classes are skipped, as they are not important for real-time visualization.

Figure 10. Grad-CAM architecture with transparency thresholding.

The procedure starts by importing the essential libraries for numerical computation, image manipulation, and deep learning, and then loads the customized MobileNetV3Large model from a specified file directory. Essential parameters, such as the image dimensions and the last convolutional layer of the model, are then selected, as required by the Grad-CAM approach. Finally, an image is imported, scaled, and preprocessed to conform to the input specifications of the model. The method weights the feature maps of the final convolutional layer using the gradients of a selected target class from the network's output. This entails building a secondary model that outputs both the activations of the last convolutional layer (Conv_1 for MobileNetV3Large) and the final predictions. TensorFlow's gradient computation is used to identify the regions of the image that have the most impact on the model's prediction: the gradients are computed and spatially averaged to obtain per-channel weights, the feature maps are combined using these weights, and a rectified linear unit (ReLU) retains only the features that have a positive influence on the classification, producing a heatmap. The last step overlays this heatmap on the original image. The heatmap is colorized and resized to match the dimensions of the source image and is then superimposed with a designated degree of transparency, yielding a composite image that depicts the regions on which the model based its decision. In this work, the transparency level of the heatmap was increased to a certain threshold before superimposing so that only diseased pixels are highlighted and healthy pixels are neglected; several tests were performed to determine the appropriate transparency for the proposed model. The resulting overlay offers a clear and easy-to-understand visual representation of the model's predictions for diseased areas only. This implementation is a tangible demonstration of improving the comprehensibility of deep learning models, facilitating better understanding of, and reliance on, their judgments.
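A minimal Grad-CAM sketch along these lines is shown below. It assumes the flat functional model from the earlier sketch (or the same model reloaded from the saved file), uses "Conv_1" as the last convolutional layer name as stated above, and overlays with OpenCV; the jet colormap, the 0.4 overlay weight, and the 0.5 transparency threshold are illustrative, since the tuned threshold is not reported here.

```python
# Minimal Grad-CAM sketch: gradients of the predicted class score w.r.t. the
# last convolutional feature maps are spatially averaged into channel weights,
# the weighted feature maps are combined and passed through ReLU, and the
# resulting heatmap is thresholded and overlaid on the input image.
import cv2
import numpy as np
import tensorflow as tf

def grad_cam_overlay(model, image_bgr, last_conv_name="Conv_1",
                     threshold=0.5, alpha=0.4):
    # Secondary model returning both the last conv feature maps and the predictions.
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(last_conv_name).output, model.output])

    img = cv2.resize(image_bgr, (224, 224))
    batch = np.expand_dims(img.astype("float32"), axis=0)

    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(batch)
        class_idx = tf.argmax(preds[0])                  # predicted class
        class_score = preds[:, class_idx]

    grads = tape.gradient(class_score, conv_out)         # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))      # spatial averaging -> channel weights
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    cam = (cam / (tf.reduce_max(cam) + 1e-8)).numpy()    # normalize to [0, 1]

    cam = cv2.resize(cam, (img.shape[1], img.shape[0]))
    heat = cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)
    mask = (cam >= threshold)[..., None]                 # keep only strongly activated pixels
    overlay = np.where(mask, cv2.addWeighted(img, 1 - alpha, heat, alpha, 0), img)
    return overlay, int(class_idx)
```

Images predicted as healthy can simply be returned without an overlay, matching the modification described above.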
Computational specification and evaluation metrics for model performance

The models were trained on an NVIDIA Tesla P100 GPU with approximately 16 GB of memory under the complimentary GPU service provided by Kaggle. The Tesla P100 contains 3584 CUDA cores engineered for parallel computation, and its customary configuration includes 16 GB of High Bandwidth Memory (HBM2), which offers the high memory bandwidth needed to handle large datasets and complex neural networks efficiently.

To determine the best-performing trained model, the evaluation metrics were computed on the test dataset. The accuracy, precision, recall, F1 score, and area under the curve (AUC) were calculated; furthermore, each model's inference time for both the .h5 and .tflite formats, its computational complexity in giga floating point operations (GFLOPs), and its power consumption were measured to identify the optimal model for classification on edge devices.

Accuracy is the proportion of correctly recognized leaves, both infected and healthy, out of the total number of leaves examined. Precision quantifies the model's ability to correctly identify diseased leaves among all leaves classified as diseased: it is the ratio of correctly predicted diseased leaves (true positives) to the total number of leaves identified as diseased (the sum of true positives and false positives). A high precision implies that when the model predicts that a leaf is unhealthy, there is a high probability that the leaf is truly infected. Recall quantifies the model's capacity to detect all truly diseased leaves: it is the ratio of correctly predicted diseased leaves (true positives) to the total number of leaves that are truly diseased (the sum of true positives and false negatives). A high recall score indicates that the model detects the majority of diseased leaves. The power consumed by the CPU of the edge device during prediction was measured using the jtop library. As the goal was to choose a suitable model for lightweight applications, the computational complexity of each trained model was measured in GFLOPs.
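As a sketch, the classification metrics above can be computed from the model's test-set predictions with scikit-learn; test_images and one-hot test_labels are assumed to be the prepared test arrays, and macro averaging over the four classes is one possible choice.

```python
# Minimal sketch: compute accuracy, precision, recall, F1 and AUC on the test
# set with scikit-learn. `model`, `test_images` and one-hot `test_labels` are
# assumed to be available; averaging strategy is an illustrative choice.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

probs = model.predict(test_images)            # shape (N, 4), softmax probabilities
y_pred = np.argmax(probs, axis=1)
y_true = np.argmax(test_labels, axis=1)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1 score :", f1_score(y_true, y_pred, average="macro"))
print("AUC      :", roc_auc_score(y_true, probs, multi_class="ovr", average="macro"))
```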
