Combination of Cross Stage Partial Network and GhostNet with Spatial Pyramid Pooling on YOLOv4 for Detection of Acute Lymphoblastic Leukemia Subtypes in Multi-Cell Blood Microscopic Image

Purpose: Acute Lymphoblastic Leukemia (ALL) detection in microscopic blood images can use a deep learning-based object detection model to localize and classify ALL cell subtypes. Previous studies performed only single-cell detection or binary classification into leukemia and normal classes. Detecting ALL subtypes is crucial to support early diagnosis and treatment. Therefore, an object detection model is needed to detect ALL subtypes in multi-cell blood microscopic images. Methods: This study focuses on detecting ALL subtypes using YOLOv4 with a neck modified using Cross Stage Partial Network (CSPNet) and GhostNet. CSPNet is combined with Spatial Pyramid Pooling (SPP) into an SPPCSP block to obtain more varied feature maps before the final YOLOv4 layer. GhostNet is used to reduce the computation time of the modified YOLOv4 neck. Result: Experimental results show that YOLOv4 SPPCSP outperformed the original YOLOv4 by 14.6% in recall and 0.8% in mAP@.5 and reduced the computation time by 4.7 ms. Novelty: Combining CSPNet and GhostNet in the YOLOv4 neck increases the variety of feature maps and reduces computation time compared to the original YOLOv4.


INTRODUCTION
Automatic analysis of microscopic images for disease detection and diagnosis is becoming very important and challenging in the medical application of computer vision. Microscopic images of blood contain essential information for early diagnosis and treatment of diseases, one of which is leukemia. Leukemia is a hematological disease of the bone marrow characterized by abnormal white blood cell production [1]. Leukemia is divided by the infected cell type into two: myeloid and lymphocytic. Based on the speed of spread and aggressiveness of the infection, leukemia is divided into two types: acute and chronic. Acute Lymphoblastic Leukemia (ALL) is characterized by abnormal lymphocyte development and overproduction [2]. ALL can weaken the immune system and cause complications in crucial organs of the body.
The location of ALL subtypes in multi-cell microscopic blood images can be obtained using object detection methods. Object detection methods from the previous literature have been used for cancer analysis and can both classify and localize cancer cells [10], [11]. Detection of leukemia cells using the original YOLOv4 by [12] only performed localization and classification on multi-cell images with normal and leukemia classes. However, detecting ALL subtypes is crucial to support early diagnosis and treatment. Therefore, this study uses object detection based on You Only Look Once (YOLO) to classify and localize ALL subtypes in multi-cell blood microscopic images.
The development of the YOLO object detection model has reached YOLOv5. Previous literature reported that the YOLOv4 model detected apple images with higher mAP values than YOLOv5 [13]. Another study that detected and localized melanoma lesions using YOLOv4 achieved more robust and optimal bounding boxes than YOLOv5 [14]. Although YOLOv4 outperforms YOLOv5 in these studies, YOLOv4 is still ineffective at detecting small and touching objects [15]. YOLOv4 is also prone to misclassification when object classes have similar characteristics [16].
Following the development of object detection models, CSPNet, proposed by Wang and colleagues in 2020, improves feature map analysis by integrating the initial feature map from the base layer through a cross-stage hierarchy into the resulting CNN feature map [17]. CSPNet has been used in the YOLOv4 backbone but not in the YOLOv4 neck. Although CSPNet makes the YOLOv4 model more robust, it still incurs higher computation time and memory usage [18].
Several popular light detection models for extracting feature maps in images have been proposed, one of which is GhostNet. GhostNet has the advantage of extracting feature maps by reducing computation time while maintaining the quality of the analysis results [19]. Previous studies used GhostNet for colon disease classification, where GhostNet outperformed Residual Network (ResNet) and Mobile Neural Architecture Search (MnasNet) [20].
In this study, we propose a neck-modified YOLOv4 model using CSPNet and GhostNet for ALL subtype detection. CSPNet is combined with Spatial Pyramid Pooling (SPP) into an SPPCSP block to obtain more varied feature maps before the final YOLOv4 layer. GhostNet is used to reduce the computation time of the modified YOLOv4 neck.
To summarize, this paper has two main contributions: (1) enhancing the variety of feature maps with CSPNet, and (2) reducing the computation time of the YOLOv4 SPPCSP model using GhostNet. This paper is arranged in the following order: Section 2 describes the methods and models used, Section 3 presents and compares the results obtained, and Section 4 concludes this study.

YOLOv4
YOLOv4 improves on earlier object detection models in both accuracy and efficiency [21]. Weighted Residual Connections combine residual branches effectively and efficiently [22]. Cross Stage Partial Connections (CSPNet) integrate the feature map from the initial to the final stage of the network; implementing CSPNet reduces computation by up to 20% and outperforms other backbone architectures [22]. Cross mini-Batch Normalization collects statistics across mini-batches so that the normalization can be estimated more accurately [23]. Self-Adversarial Training is a data augmentation technique that operates in two phases, forward and backward. Mish is a self-regularizing, non-monotonic activation function [24]. Mosaic data augmentation adds data by combining attributes of several input images into one training image. DropBlock is used as a regularization method for CNNs [24]. Lastly, CIoU loss is a loss function that achieves faster convergence and greater accuracy for bounding box regression problems [25].
The YOLOv4 architecture uses the CSPDarknet53 backbone network, SPP (Spatial Pyramid Pooling) and PANet for the neck, and 3 YOLO heads, each generating a feature map of size S x S x (3 x (5 + C)). S represents the grid size, 3 represents the number of bounding boxes generated per grid cell, (5 + C) represents the bounding box attributes, namely the center coordinates, width, height, and objectness score, and C represents the per-class confidence scores of the bounding box. Details of the YOLOv4 architecture are shown in Figure 1. The backbone, neck, and head components of YOLOv4 perform different tasks. The CSPDarknet53 backbone extracts feature maps from the image. In the neck, the SPP performs max pooling with four different pool sizes (5 x 5, 9 x 9, 13 x 13, 1 x 1), and the outputs of these four processes are combined and processed by PANet. PANet improves the quality of the feature maps by upsampling and downsampling using low-level and high-level feature maps. Object detection is carried out in the head section at three different scales: the first YOLOv4 head detects small objects, the second detects medium-sized objects, and the third detects large objects.
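As a minimal sketch (not from the paper's code), the size of each head's output tensor follows directly from the input size, the head's stride, and the number of classes; the strides 8, 16, and 32 assumed here are the standard YOLOv4 values:

```python
def yolo_head_shape(input_size, stride, num_classes, boxes_per_cell=3):
    """Output grid is S x S cells, where S = input_size / stride; each
    cell predicts `boxes_per_cell` boxes with 5 attributes
    (x, y, w, h, objectness) plus C per-class confidence scores."""
    s = input_size // stride
    return (s, s, boxes_per_cell * (5 + num_classes))

# For the three ALL subtypes (C = 3) and a 416 x 416 input:
shapes = [yolo_head_shape(416, st, 3) for st in (8, 16, 32)]
# → [(52, 52, 24), (26, 26, 24), (13, 13, 24)]
```

The 52 x 52 grid corresponds to the small-object head, and the 13 x 13 grid to the large-object head.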

Cross Stage Partial Network (CSPNet)
Cross Stage Partial Network (CSPNet) attempts to solve the vanishing gradient problem and the heavy computation of CNN-based architectures so that they can run on mobile CPUs and GPUs with limited computing power [17]. Deep learning architectures acquire knowledge through backpropagation of gradients in the network. CSPNet aims to achieve more diverse gradient combinations while minimizing computation, and it accomplishes this by splitting the gradient flow into separate network paths.
The base-layer feature map is divided into two parts along the channel dimension, x0 = [x0', x0'']. Concatenating the two parts x0' and x0'' reproduces the original feature map x0. The former is connected directly to the end of the stage, while the latter is processed by a CNN block; in this study, the SPP block is used. Details of the CSPNet architecture used in this study are shown in Figure 2.
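A minimal sketch of this cross-stage split, with the feature map represented abstractly as a list of channels (hypothetical helper names, not the paper's implementation):

```python
def csp_forward(channels, block):
    """Cross-stage partial forward pass: split the base-layer channels
    x0 into [x0', x0''], route x0' directly to the end of the stage,
    process x0'' with the inner block (the SPP block in this paper),
    and concatenate the two paths."""
    half = len(channels) // 2
    part1, part2 = channels[:half], channels[half:]   # x0', x0''
    return part1 + block(part2)                       # concatenation

# With an identity inner block, the concatenation reproduces x0:
assert csp_forward([1, 2, 3, 4], lambda xs: xs) == [1, 2, 3, 4]
```

Only the second path is processed, so the inner block sees half the channels, which is where the computational saving comes from.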

GhostNet
GhostNet aims to provide a more computationally efficient alternative to the convolution layers used in convolutional neural networks, reducing computation time [26]. GhostNet generates a portion of the output feature maps using standard convolution, while the rest are produced by a computationally cheaper linear operation, depthwise convolution.
The output tensor of each convolution layer is produced by two operations in series. First, a portion of the output channels is computed by a sequential stack of three layers: standard convolution, batch normalization, and a non-linear activation function, by default the Rectified Linear Unit (ReLU). This output is passed to a second block, a sequential stack of depthwise convolution, batch normalization, and ReLU. The final output tensor is obtained by concatenating the output of the first sequential block with that of the second. The detailed architecture of GhostNet is shown in Figure 3.

Evaluation Metrics
The experimental results were evaluated to determine the model's performance in localizing and classifying ALL subtypes in multi-cell blood microscopic images. The evaluation metrics used in this study are precision, recall, F1, mean Average Precision (mAP), and computation time. Precision compares the True Positive predictions to all positive predictions of the model [27]. Recall compares the True Positive predictions to all data that are actually positive [28]. F1 is the harmonic mean of precision and recall [29]. Precision is given by Equation (1), recall by Equation (2), and F1 by Equation (3).
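Equations (1)-(3) correspond to the standard definitions, sketched here in Python (an illustrative sketch, not the paper's code):

```python
def precision(tp, fp):
    """Equation (1): True Positives over all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Equation (2): True Positives over all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p, r):
    """Equation (3): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Example: 8 correct detections, 2 false alarms, 2 missed cells
p, r = precision(8, 2), recall(8, 2)   # both 0.8
```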
Average Precision is a measurement that combines recall and precision [30]. In other words, it compares the prediction accuracy over all output data against the ground truth. A True Positive indicates a detected object with the correct classification and an IoU greater than the specified threshold, for example IoU >= 0.5 (50%). A False Positive is a detection that does not match the classification or has an IoU below the specified threshold. A False Negative is an object in the ground truth that is not detected.
Equation (4) shows the Average Precision formula. The precision and recall values are used to calculate the Average Precision (AP) score, the average precision over all recall levels, where r denotes the recall value and p the precision value. The mAP is calculated by computing the AP for each class and then averaging across the classes, as shown in Equation (5). The IoU thresholds used in this study were 0.5 and 0.95.
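The idea behind Equations (4) and (5) can be sketched as follows (an illustrative all-point approximation of the area under the precision-recall curve, not necessarily the exact interpolation the study used):

```python
def average_precision(recalls, precisions):
    """Equation (4), sketched: sum precision * recall increment over
    the precision-recall curve (recalls assumed sorted ascending)."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """Equation (5): average of the per-class AP values
    (the classes here being L1, L2, and L3)."""
    return sum(ap_per_class) / len(ap_per_class)
```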
The computation time evaluation measures the time to produce bounding boxes across 4 stages: preprocessing, inference, Non-Maximum Suppression (NMS), and the total. Preprocessing and inference time measure the time from image input through the preprocessing stage and the detection of ALL subtypes. Inference time is longer than the other stages because the detection process evaluates many possible object combinations. NMS keeps the bounding box with the highest confidence and removes overlapping bounding boxes.
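The NMS step described above can be sketched minimally as follows, assuming boxes in (x1, y1, x2, y2) form (an illustration, not the paper's implementation):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-confidence box, discard boxes
    overlapping it above the IoU threshold, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

For example, two heavily overlapping detections of the same cell collapse to the one with the higher confidence, while a distant detection is kept.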

Proposed Method
The analysis process aims to extract varied feature maps to detect touching and small objects in multi-cell ALL images. At this stage, the original YOLOv4 neck is modified using CSPNet as an additional block around the SPP. The SPPCSP block works along two paths: the first passes through the SPP processing, while the second goes straight to the final block through a standard convolution. The CSPNet block is composed of 3 Ghost convolution blocks before the SPP and 2 convolution blocks after it. The outputs of the two paths are concatenated to unify the resulting feature maps.
Ghost convolution is used in this study to make the SPPCSP neck modification faster in computation time while maintaining high accuracy. Ghost convolution comprises 2 convolution processes: standard convolution and depthwise convolution. Depthwise convolution convolves each channel with its own filter, so the computation is more efficient. The kernel sizes used in the SPP block are 5 x 5, 9 x 9, and 13 x 13, with the padding size obtained from the kernel size divided by 2. The design of the YOLOv4 neck modification is shown in Figure 4.
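The efficiency argument can be illustrated by counting multiply-accumulate operations: in a Ghost module, roughly half the output channels come from a standard convolution and the rest from a cheap depthwise convolution. The sketch below (hypothetical 50/50 split and kernel sizes, not the paper's exact configuration) shows why the computation roughly halves:

```python
def standard_conv_cost(h, w, c_in, c_out, k=3):
    """Multiply-accumulates for a standard k x k convolution."""
    return h * w * c_out * c_in * k * k

def ghost_conv_cost(h, w, c_in, c_out, k=3, d=3):
    """Ghost module cost: half the output channels via standard
    convolution, the other half via depthwise convolution (one
    d x d filter per channel) applied to that primary output."""
    primary = h * w * (c_out // 2) * c_in * k * k
    cheap = h * w * (c_out // 2) * d * d
    return primary + cheap

# SPP padding rule from the text: pad = kernel // 2 keeps spatial size
pads = [k // 2 for k in (5, 9, 13)]   # → [2, 4, 6]
```

On a 13 x 13 feature map with 256 input and output channels, the Ghost module needs just over half the multiply-accumulates of a standard convolution.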

RESULT AND DISCUSSION
The dataset used is the multi-cell microscopic blood image dataset of ALL subtypes from the paper by Fatonah et al. [6]. The dataset consists of 3 classes: L1, L2, and L3. The literature states that the dataset was obtained from a hospital, with images taken at 1000x magnification and the same staining process. The total is 301 multi-cell ALL-subtype images, consisting of 128 L1, 63 L2, and 110 L3 images.
Preprocessing in this study began with manual labelling of the dataset using the Roboflow annotator, accessible online, and resizing the images from 591 x 443 pixels to 416 x 416 pixels. The size 416 x 416 pixels was chosen because it is the standard input size commonly used for the YOLOv4 object detection model. The labelling process assigns a bounding box to each white blood cell visible in the image according to its ALL subtype: L1, L2, or L3. An example of the labelling process is shown in Figure 5.
The preprocessed data were divided into 3 subsets: 75% training, 15% validation, and 10% testing data, giving 200 training images, 39 validation images, and 27 testing images. The division was performed evenly across classes so that the model encounters representations of every class during training, validation, and testing. The experiments were divided into 2 scenarios: training the YOLOv4 model without modification, and training YOLOv4 with the SPPCSP neck modification. Details of the implementation and experiment stages are shown in Figure 6.
The first scenario extracted the location and classification of ALL subtypes in multi-cell blood microscopic images using the unmodified YOLOv4 model. The testing process aims to determine performance on the overall results and on each of the subtypes L1, L2, and L3. The second scenario adds the CSPNet model to the SPP so that the model obtains a wider variety of feature maps. CSPNet combines the base-layer feature maps with the feature maps resulting from the SPP convolution process. This merging produces more varied feature maps to improve the analysis of small and touching ALL cells in multi-cell blood microscopic images. The results of the 2 scenarios were evaluated using mAP, recall, precision, and computation time. The thresholds used for the mAP metric are 0.5 (mAP@.5) and 0.95 (mAP@.95) to evaluate the confidence of the bounding boxes produced. The evaluation was carried out on the whole dataset and on each ALL subtype, using 27 testing images to determine the model's ability to generalize to new data not included in the training process.
Details of the L1 experiment results are shown in Table 1. The modified YOLOv4 SPPCSP model outperforms the original YOLOv4 in all evaluation aspects. The smallest difference is in mAP@.5, at 0.1%; the most significant difference is in recall, at 4.7%. A high recall value indicates that the YOLOv4 SPPCSP model detects True Positive data accurately. The L1 object detection results are compared in Figure 7, where the YOLOv4 SPPCSP model detects more L1 cells than the original YOLOv4 model.
Details of the L2 experiment results are shown in Table 2. The original YOLOv4 model leads in precision by 16.6% but is lower in the other metrics. The YOLOv4 SPPCSP model shows the same pattern of high evaluation values as in L1 detection, with a difference of 33.3%. Figure 8 shows that the original YOLOv4 model cannot detect L2 according to ground truth, in contrast to the YOLOv4 SPPCSP model, which detects L2 according to ground truth with a high IoU confidence value of 0.9.
Apart from evaluating each ALL subtype, an overall evaluation was carried out to determine the model's generalization across all subtypes. From Table 4, the original YOLOv4 model has an advantage of 6% in precision and the same value at mAP@.95. The YOLOv4 SPPCSP model retains the advantage in recall seen in the per-subtype evaluations, with a difference of 14.6%. This result shows that the YOLOv4 SPPCSP model detects True Positive data more accurately than the original YOLOv4 model.
The computational time evaluation is shown in Table 5. The evaluation was carried out with 4 variables: preprocessing, inference, Non-Maximum Suppression (NMS), and the total of all computation time. Preprocessing and inference time measure the time from image input through the preprocessing stage and the detection of ALL subtypes.
The inference time is longer than the other variables because the detection process evaluates many possible object combinations. The NMS process keeps the bounding box with the highest confidence and removes overlapping bounding boxes. From Table 5, the YOLOv4 SPPCSP model outperforms the original in all computation time variables, with a total difference of 4.7 ms. The experiment in Table 5 indicates that YOLOv4 SPPCSP achieves better computing time than the original YOLOv4 [21], particularly on this dataset [6].

CONCLUSION
In this study, location extraction and classification of ALL subtypes were carried out on multi-cell blood microscopic images using YOLOv4 with a neck modified by CSPNet and GhostNet. CSPNet is combined with Spatial Pyramid Pooling (SPP) into an SPPCSP block to obtain more varied feature maps before the final YOLOv4 layer. GhostNet is used to reduce the computation time of the modified YOLOv4 neck. The experimental results show that YOLOv4 SPPCSP outperformed the original YOLOv4 by 14.6% in recall and 0.8% in mAP@.5 and reduced the computation time by 4.7 ms. The results show that the YOLOv4 SPPCSP model performs location extraction and classification on multi-cell blood microscopic images better than the original YOLOv4 model.