Real-time visual intelligence for defect detection in pharmaceutical packaging

The proposed CBS-YOLOv8 detection model integrates a coordinate attention (CA) mechanism, BiFPN, and SimSPPF. Incorporating the coordinate attention mechanism within the YOLOv8 backbone enhances object localization by prioritizing relevant spatial coordinates, thereby yielding refined feature maps. The inclusion of BiFPN in the YOLOv8 neck facilitates efficient feature aggregation across multiple scales, leading to improved multi-scale prediction capabilities. SimSPPF reduces the computational cost, offsetting the additional complexity introduced by adding the CA mechanism to the YOLOv8 backbone.

Coordinate attention mechanism

The idea of attention draws inspiration from the human visual attention system, which extracts the most relevant information from large amounts of data. Effective features are obtained by highlighting the most important information and suppressing less relevant data. Incorporating attention at an appropriate location in the backbone can reduce the impact of distracting background data, increase the accuracy of targeted feature retrieval, and ultimately raise the precision of the algorithm's detection capabilities. Various attention mechanisms are available, such as SE attention25, ECA attention26, and CBAM attention27. SE (squeeze and excitation) attention, which computes channel attention using 2D global pooling, yields significant performance improvements with minimal computational requirements. However, SE attention focuses only on encoding inter-channel data and neglects positional information, which is critical for capturing the structure of objects in computer vision tasks. ECA (Efficient Channel Attention) enhances SE attention by using one-dimensional convolutional layers to gather cross-channel data, resulting in more precise attention details; however, ECA also neglects the positional information of image features, which limits its effectiveness. CBAM (Convolutional Block Attention Module) integrates both channel and spatial aspects, strengthening the correlation between channel features and spatial dimensions; nevertheless, it struggles to capture the contextual information surrounding the target. In summary, the compact attention models that focus solely on the channel domain address only individual channel data and disregard positional information within the image, and although CBAM incorporates both channel and positional details, it cannot capture long-range relationships28.

Coordinate attention (CA) is a fast, simple, lightweight mechanism that can easily be integrated into the core structure of any algorithm29. It balances long-range positional relationships and channel information and engages with broader contexts without significant computational overhead. This leads to improved target detection and recognition, surpassing other attention mechanisms such as SE, ECA, and CBAM. The Coordinate Attention module is inserted after the convolutional layers within the C2f block, enabling the network to concentrate on appropriate features at various stages of feature extraction. The mechanism separates attention into two distinct one-dimensional feature encoding operations, each dedicated to collecting features along a different spatial dimension.
This approach creates a coordinate-aware attention map by identifying operational coordinates, which helps capture spatial dependencies over long distances. The resulting feature maps are converted into attention maps that are aware of both direction and location, thus enhancing the representation of target objects without adding computational load. By dividing channel attention into two parallel one-dimensional features, this coordinate attention method mitigates the loss of positional information that occurs with global pooling. The overall structure of the CA mechanism is shown in Fig. 3. The CA mechanism comprises two steps: coordinate information embedding and coordinate attention generation.

Figure 3. Structure of coordinate attention mechanism.

Coordinate information embedding

In channel attention mechanisms, global pooling is a common technique used to integrate spatial information across the entire image. However, it compresses this comprehensive spatial data into a single channel descriptor, making it difficult to preserve the precise positional information that is crucial for capturing spatial structure. To allow the attention block to grasp long-range spatial relations while preserving accurate positional details, the CA module factorizes the global pooling used in SE attention into two separate one-dimensional feature encoding processes. Specifically, given the input X, two spatial pooling kernels of size (H, 1) and (1, W) are used to encode the vertical and horizontal coordinates, respectively. The output of the cth channel at height h and width w can then be formulated as

$${z}_{c}^{h}\left(h\right)= \frac{1}{W}\sum_{0\le i<W}{x}_{c}\left(h,i\right),$$
(2)
$${z}_{c}^{w}\left(w\right)= \frac{1}{H}\sum_{0\le j<H}{x}_{c}(j,w).$$
(3)
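As a concrete illustration of Eqs. (2) and (3), the two encoding operations are simply per-channel averages taken along the width and along the height. The snippet below is a minimal PyTorch illustration; the tensor shape and variable names are assumptions used for demonstration only.

```python
# Minimal PyTorch illustration of Eqs. (2)-(3): per-channel average pooling
# along the width (giving z^h of shape C x H x 1) and along the height
# (giving z^w of shape C x 1 x W). Shapes and names are illustrative.
import torch

x = torch.randn(1, 16, 32, 48)      # (N, C, H, W) example input
z_h = x.mean(dim=3, keepdim=True)   # Eq. (2): average over W -> (1, 16, 32, 1)
z_w = x.mean(dim=2, keepdim=True)   # Eq. (3): average over H -> (1, 16, 1, 48)
print(z_h.shape, z_w.shape)
```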
Equations (2) and (3) aggregate features along the two spatial directions individually, yielding a pair of direction-aware feature maps. They allow the attention block to capture positional information along one spatial direction and long-range dependencies along the other, which helps the model accurately locate objects of interest in an image or video.

Coordinate attention generation

Using the aggregated features from Eqs. (2) and (3), the two feature maps are first concatenated and passed through a shared 1 × 1 convolution F1, which gives

$$f= \delta \left({F}_{1}\left(\left[{z}^{h}, {z}^{w}\right]\right)\right), \quad f\in {\mathbb{R}}^{C/r\times (H+W)},$$
(4)
where \(\delta (\cdot)\) is a nonlinear activation function, \([\cdot ,\cdot ]\) denotes the concatenation operation, and r is the reduction ratio used to control the block size. The tensor f is then split along the spatial dimension into two separate tensors for the horizontal and vertical coordinates, \({f}^{h}\in {\mathbb{R}}^{C/r\times H}\) and \({f}^{w}\in {\mathbb{R}}^{C/r\times W}\), respectively. Two additional 1 × 1 convolution operations, \({F}_{h}\) and \({F}_{w}\), are then used to transform \({f}^{h}\) and \({f}^{w}\) separately back to the same number of channels as the input X, giving

$${g}^{h}= \sigma \left({F}_{h}\left({f}^{h}\right)\right),$$

$${g}^{w}= \sigma \left({F}_{w}\left({f}^{w}\right)\right),$$
(5)
where σ is the sigmoid function. The model complexity is kept low by choosing a suitable reduction ratio r (e.g., r = 32). The outputs \({g}^{h}\) and \({g}^{w}\) are then expanded and employed as attention weights. Finally, the output Y of the CA module is computed from \({g}^{h}\) and \({g}^{w}\) as

$${y}_{c}\left(i,j\right)= {x}_{c}\left(i,j\right)\times {g}_{c}^{h}\left(i\right)\times {g}_{c}^{w}\left(j\right).$$
(6)
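Putting Eqs. (2) through (6) together, a coordinate attention block can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' exact implementation: the choice of ReLU for the nonlinearity δ and all layer and variable names are assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal sketch of coordinate attention following Eqs. (2)-(6)."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)                     # C/r channels
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))           # Eq. (2): average over W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))           # Eq. (3): average over H
        self.f1 = nn.Conv2d(channels, mid, kernel_size=1)       # shared 1x1 conv F1
        self.delta = nn.ReLU(inplace=True)                      # nonlinearity delta (assumed ReLU)
        self.f_h = nn.Conv2d(mid, channels, kernel_size=1)      # F_h in Eq. (5)
        self.f_w = nn.Conv2d(mid, channels, kernel_size=1)      # F_w in Eq. (5)

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = self.pool_h(x)                                    # (n, c, h, 1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)                # (n, c, w, 1)
        f = self.delta(self.f1(torch.cat([z_h, z_w], dim=2)))   # Eq. (4)
        f_h, f_w = torch.split(f, [h, w], dim=2)                # split back into two directions
        g_h = torch.sigmoid(self.f_h(f_h))                      # Eq. (5), (n, c, h, 1)
        g_w = torch.sigmoid(self.f_w(f_w.permute(0, 1, 3, 2)))  # Eq. (5), (n, c, 1, w)
        return x * g_h * g_w                                    # Eq. (6)

# Example: a CA block refining a feature map, e.g. as placed after the
# convolutional layers of a C2f block in the backbone (illustrative values).
ca = CoordinateAttention(channels=256, reduction=32)
refined = ca(torch.randn(1, 256, 40, 40))                       # same shape as the input
```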
Bi-directional feature pyramid network

The feature pyramid network (FPN) is mainly used for fusing features at different scales. NAS-FPN and PANet were also developed to achieve cross-scale feature fusion. Previous approaches typically combined the various input features by simply adding them together, without distinguishing between them. However, because these features are captured at different resolutions, they often contribute unequally to the final fused feature. To tackle this challenge, a straightforward yet remarkably efficient solution is the weighted bi-directional feature pyramid network (BiFPN). This approach uses trainable weights to learn the importance of individual input features while iteratively merging features across multiple scales in both bottom-up and top-down directions. BiFPN is the feature fusion method introduced in EfficientDet-D7, an object detector that achieves 55.1 AP on the COCO dataset30. Its structure is shown in Fig. 4. BiFPN involves two steps: an efficient bidirectional cross-scale connection and weighted feature fusion.

Figure 4. Structure of BiFPN.

Efficient bidirectional cross-scale connection

FPN uses only a top-down pathway for feature fusion. Lower-level features have higher resolution but less semantic information (good for localization), whereas higher-level features carry rich semantic information (good for recognition) but lower resolution. To address this, PANet adds a bottom-up pathway after the top-down pathway used in FPN. Cross-scale connections give better feature fusion, so NAS-FPN searches for a new neural architecture to achieve cross-scale feature fusion; however, its architecture is complex, difficult to modify, and requires a large number of GPU hours for the search. In essence, PANet gives better results than NAS-FPN and FPN, but its drawback is that it requires more computation and parameters. To solve this issue, BiFPN was developed with three major changes. First, if a node has only a single input edge without feature fusion, that node is removed. Second, if an original input node and an output node are at the same level, an extra edge is added between them, which enhances feature fusion. Third, the BiFPN layer is repeated multiple times to achieve richer feature fusion, with the number of repetitions determined by resource constraints through a compound scaling technique.

Feature fusion with weighted value

Input features with different resolutions contribute unequally to the output features. BiFPN therefore assigns a weight to each input feature, strengthening the model by letting it learn the importance of every input. In BiFPN, three weighted fusion approaches are considered (a combined sketch follows their descriptions below):

Unbounded fusion is represented as \(O= \sum_{i}{w}_{i}\cdot {I}_{i}\), where wi is a learnable weight that can be a scalar (per feature), a vector (per channel), or a multi-dimensional tensor (per pixel). A drawback is that the scalar weights are unbounded, which can cause instability during training; weight normalization offers a viable strategy to bound the weight values and overcome this issue.

Softmax-based fusion is represented as \(O= \sum_{i}\frac{{e}^{{w}_{i}}}{\sum_{j}{e}^{{w}_{j}}}\cdot {I}_{i}\), where the softmax function is applied to all weights, normalizing them to the range 0 to 1.

Fast normalized fusion is represented as \(O= \sum_{i}\frac{{w}_{i}}{\epsilon + \sum_{j}{w}_{j}}\cdot {I}_{i}\), where ReLU is applied to each weight wi to ensure wi ≥ 0, and ϵ = 0.0001 is a small value that avoids numerical instability.
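A minimal PyTorch sketch comparing the three fusion rules is given below; the number of inputs, names, and module structure are illustrative assumptions. The fast normalized variant mirrors the node-level fusion written out in Eqs. (7) and (8) that follow.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Minimal sketch of BiFPN-style weighted fusion of same-shape feature maps."""
    def __init__(self, num_inputs, mode="fast", eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))    # learnable weights w_i
        self.mode = mode
        self.eps = eps

    def forward(self, inputs):
        if self.mode == "unbounded":
            w = self.w                                    # no constraint: may be unstable
        elif self.mode == "softmax":
            w = torch.softmax(self.w, dim=0)              # normalizes weights to [0, 1]
        else:                                             # fast normalized fusion
            w = torch.relu(self.w)                        # ReLU keeps w_i >= 0
            w = w / (w.sum() + self.eps)                  # eps = 1e-4 avoids instability
        return sum(wi * x for wi, x in zip(w, inputs))

# Example: fusing two same-resolution feature maps at one BiFPN node
fuse = WeightedFusion(num_inputs=2, mode="fast")
p6_in = torch.randn(1, 64, 20, 20)
p7_up = torch.randn(1, 64, 20, 20)   # e.g. an upsampled higher-level feature
p6_td = fuse([p6_in, p7_up])         # weighted combination, same shape as inputs
```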

Therefore, the final BiFPN combines the two steps above. For example, the feature fusion at layer 6 of BiFPN is represented as

$${P}_{6}^{td}=Conv\left(\frac{{w}_{1}\cdot {P}_{6}^{in}+ {w}_{2}\cdot Resize\left({P}_{7}^{in}\right)}{{w}_{1}+ {w}_{2}+ \epsilon }\right),$$
(7)
$${P}_{6}^{out}=Conv\left(\frac{{w{\prime}}_{1}\cdot {P}_{6}^{in}+ {w{\prime}}_{2}\cdot {P}_{6}^{td}+ {w{\prime}}_{3}\cdot Resize\left({P}_{5}^{out}\right)}{{w{\prime}}_{1}+ {w{\prime}}_{2}+ {w{\prime}}_{3}+ \epsilon }\right).$$
(8)
\({P}_{6}^{td}\) is the intermediate feature of layer 6 on the top-down pathway, and \({P}_{6}^{out}\) is the output feature of layer 6 on the bottom-up pathway.

SimSPPF

To ensure the real-time performance of tablet defect detection, the SPPF in the YOLOv8 backbone is replaced with SimSPPF. SimSPPF, first introduced in YOLOv6 and shown in Fig. 5, is an enhanced version of SPPF that reduces computational complexity and inference time31. It processes each input by applying three 5 × 5 max-pooling layers sequentially and aggregating their outputs, giving a fixed-size feature map. These feature maps enhance the feature representation and the receptive field of the model. The conv module in YOLOv8's SPPF consists of a convolution layer, batch normalization, and the Sigmoid Linear Unit (SiLU) activation function (CBS module), whereas the conv module in SimSPPF consists of a convolution layer, batch normalization, and the Rectified Linear Unit (ReLU) activation function (CBR module).

Figure 5. Structure of SimSPPF.

The SiLU and ReLU activation functions are defined as

$$SiLU,\ f\left(x\right)= \frac{x}{1+{e}^{-x}},$$
(9)
$$ReLU,\ f\left(x\right)= \left\{\begin{array}{ll}x, & \text{if } x>0\\ 0, & \text{if } x\le 0\end{array}\right. =\text{max}(0,x).$$
(10)
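To make the structural change concrete, the sketch below shows a SimSPPF-style block in PyTorch: CBR (Conv-BN-ReLU) modules around three sequential 5 × 5 max-pooling layers whose outputs are concatenated. It is a minimal sketch only; the channel sizes, names, and hidden-channel split are assumptions rather than the exact YOLOv6/YOLOv8 code.

```python
import torch
import torch.nn as nn

class CBR(nn.Module):
    """Convolution + BatchNorm + ReLU (the CBR module replacing CBS)."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)   # ReLU avoids SiLU's exponential term

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SimSPPF(nn.Module):
    """Minimal sketch of a SimSPPF-style block with sequential 5x5 max pooling."""
    def __init__(self, c_in, c_out, pool_k=5):
        super().__init__()
        c_mid = c_in // 2                                  # hidden channels (assumed split)
        self.cv1 = CBR(c_in, c_mid, 1)
        self.pool = nn.MaxPool2d(kernel_size=pool_k, stride=1, padding=pool_k // 2)
        self.cv2 = CBR(c_mid * 4, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)                                  # first 5x5 max pooling
        p2 = self.pool(p1)                                 # second, applied to the previous output
        p3 = self.pool(p2)                                 # third
        return self.cv2(torch.cat([x, p1, p2, p3], dim=1))
```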
Computing the exponential term in SiLU incurs high computational cost. To mitigate this, SiLU is replaced with the ReLU function in the conv module, which increases the frames per second (FPS).

CBS-YOLOv8

Overall, the enhancements described above are integrated into the CBS-YOLOv8 network architecture, resulting in notable improvements in the performance and accuracy of the detection model. Incorporating the CA mechanism in the shallow layers and before the bottleneck layer of the backbone enables richer feature extraction. The BiFPN module expands the receptive field of the model by exploiting high-resolution features, which improves the detection of tiny cracks in tablets within blister packages. SimSPPF reduces the computational complexity, ensuring real-time object detection in the pharmaceutical industry and leading to a more accurate and robust detection system. Together, these changes to YOLOv8 yield a more effective and efficient defect detection model that works in a wide range of real-time scenarios. The final model produces output tensors of 20 × 20 × 27, 40 × 40 × 27, and 80 × 80 × 27, as illustrated in Fig. 6.

Figure 6. Architecture of CBS-YOLOv8.

Process of blister package defect detection in real-time

The overall process of CBS-YOLOv8 defect detection in a real-time environment is shown in Fig. 7. First, video data is acquired through the hardware component and forwarded to the next stage for preprocessing. Second, the video is converted into frames, after which frame-level annotation is performed, followed by augmentation, before the data is passed to the CBS-YOLOv8 detection model. Finally, the proposed model detects defects both in uploaded videos and in real-time video, and the results are presented by indicating the number of defects observed within the video.

Figure 7. Process involved in the CBS-YOLOv8 defect detection model.
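As a rough sketch of this pipeline, the loop below reads frames from a camera or an uploaded video, runs the detector on each frame, and tallies the detected defects. The weights filename and the Ultralytics-style inference interface are assumptions for illustration, not the authors' deployment code.

```python
# Minimal sketch of the real-time loop in Fig. 7: grab frames from a camera
# or uploaded video, run the trained detector on each frame, and tally the
# detected defects. Weights path and interface are illustrative assumptions.
import cv2
from ultralytics import YOLO

model = YOLO("cbs_yolov8_blister.pt")      # hypothetical trained weights
cap = cv2.VideoCapture(0)                  # 0 = live camera; or a video file path

defect_count = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame, verbose=False)  # per-frame inference
    defect_count += len(results[0].boxes)  # one box per detected defect
cap.release()
print(f"Defects observed in the video: {defect_count}")
```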
