Benchmarking YOLOv3 and SSD: A Performance Comparison for Multi-Object Detection

Authors

  • Septian Eko Prasetyo, Universitas Negeri Semarang
  • Chandra Atmaja, Universitas Gadjah Mada
  • Muhammad Ardian, Universitas Gadjah Mada
  • Alfian Ardhiansyah, Universitas Negeri Semarang
  • Ajeng Rahma Sudarni, Universitas Negeri Semarang
  • Mulil Khaira, Universitas Negeri Semarang

DOI:

https://doi.org/10.15294/edukom.v11i2.28005

Keywords:

Object Annotation, Object Detection, PASCAL VOC, SSD, YOLO

Abstract

Detecting multiple objects in a single image remains a significant challenge in computer vision. Detection performance depends heavily on the feature extraction process, especially when objects are small or positioned close together. This study compares the effectiveness of two popular object detection models, YOLO (You Only Look Once; YOLOv3 in this benchmark) and the Single Shot MultiBox Detector (SSD), in detecting multiple objects within images. Both models were selected for their reported high accuracy and real-time processing capability relative to traditional methods such as the Hough Transform, Deformable Part-based Models (DPM), and conventional CNN architectures. The models were evaluated on a subset of the PASCAL VOC dataset covering object categories such as aircraft, faces, and cars, with a total of 1,447 annotated images used for training and testing. Detection accuracy was measured by mean Average Precision (mAP). YOLO achieved an mAP of 82.01%, while SSD achieved 70.47%, indicating that YOLO detects multiple objects more accurately under the same conditions. These results support the use of YOLO in scenarios requiring fast and accurate multi-object detection, such as autonomous vehicles, surveillance systems, and robotics. The main contribution of this study is a comparative performance benchmark of YOLO and SSD on a standard multi-object dataset, intended to guide practical model selection in real-time computer vision tasks.
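The abstract reports accuracy as mean Average Precision (mAP) but does not spell out the computation. The following is a minimal Python sketch of one common way to compute it, not the authors' evaluation code. It assumes an IoU matching threshold of 0.5 and all-point interpolation of the precision-recall curve; the abstract specifies neither. The function names, the box format (x_min, y_min, x_max, y_max), and the input layout are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    detections: List[Tuple[str, float, Box]],  # (image_id, confidence, box)
    ground_truth: Dict[str, List[Box]],        # image_id -> boxes of one class
    iou_threshold: float = 0.5,                # assumed; not stated in the abstract
) -> float:
    """AP for a single class: area under the interpolated precision-recall curve."""
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}

    # Visit detections from most to least confident; each ground-truth box
    # can be claimed by at most one detection.
    tp_flags = []
    for image_id, _, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truth.get(image_id, [])):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        is_tp = (
            best_idx >= 0
            and best_iou >= iou_threshold
            and not matched[image_id][best_idx]
        )
        if is_tp:
            matched[image_id][best_idx] = True
        tp_flags.append(is_tp)

    # Cumulative precision and recall at each rank.
    precisions, recalls, tp = [], [], 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        tp += int(is_tp)
        precisions.append(tp / rank)
        recalls.append(tp / n_gt)

    # Interpolate: precision at a rank becomes the best precision at any later rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangles under the interpolated curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap: Dict[str, float]) -> float:
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap.values()) / len(per_class_ap)
```

In this sketch, each class is scored separately with average_precision, and the per-class values are then averaged by mean_average_precision. Other conventions, such as the 11-point interpolation used in early PASCAL VOC evaluations, yield slightly different numbers, so figures are comparable only when the same variant is applied to both detectors.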

Published

2024-12-31

Article ID

28005

How to Cite

Prasetyo, S. E., Atmaja, C., Ardian, M., Ardhiansyah, A., Sudarni, A. R., & Khaira, M. (2024). Benchmarking YOLOv3 and SSD: A Performance Comparison for Multi-Object Detection. Edu Komputika Journal, 11(2), 136-146. https://doi.org/10.15294/edukom.v11i2.28005