IJEM Vol. 16, No. 3, 8 Jun. 2026
Pages: 345-364
User object of interest, Video clip extraction, Pretrained YOLOv7, Temporal-window voting, Object detection in video, Video shortening
Video clip extraction is the process of generating shorter, focused video segments by identifying and retaining frames that contain a user-specified object of interest (UOoI). Such targeted extraction allows users to access relevant portions of a video without watching the entire recording, with practical use in surveillance review, content management, and educational settings. In this work, we present an object-conditioned video clip extraction framework that uses the pretrained YOLOv7 detector to perform frame-level analysis of an input video. For each frame, the detector produces a set of class labels, which are matched against the user-selected UOoI to produce a binary per-frame detection signal. A one-dimensional temporal-window voting filter is then applied to this signal to suppress isolated false positives and recover isolated false negatives, addressing the single-frame detection noise that produces visible discontinuities in naive frame-by-frame approaches. The voted-positive frame indices are mapped back to source timestamps, and the corresponding audio segments are extracted directly from the source video using ffmpeg, preserving the original audio track in the output clip. The framework uses a dictionary of 80 object categories drawn from the MS COCO label set, and a graphical user interface allows users to select an input video, choose a target object, preview the input, and obtain the extracted clip with audio. We evaluate the framework on the SumMe benchmark, which we re-annotated at the frame level for object presence, and on a newly annotated set of 39 videos collected from public sources. Both datasets were independently labelled by two annotators, with Cohen’s kappa of 0.85 and 0.83, respectively, and disagreements resolved by a third adjudicator. At the default voting configuration of W=5, K=3, the framework attains an F1-score of 70.88% with 90.12% accuracy on SumMe and an F1-score of 69.89% with 85.13% accuracy on the custom dataset. 
An ablation over voting parameters shows monotonic gains on SumMe across the full sweep, and a smaller, dataset-dependent gain on the custom set. We discuss the remaining limitations of the pipeline, including single-UOoI conditioning, dependence on the MS COCO label vocabulary, and abrupt transitions between non-adjacent extracted segments, and outline directions for addressing them.
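The temporal-window voting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives only W=5 and K=3, so the centered window, the truncation of the window at the video boundaries, and the names `temporal_vote` and `frames_to_segments` are all assumptions made for illustration. The second function shows one plausible way to map voted-positive frame indices back to source timestamps, which could then be handed to a tool such as ffmpeg for cutting.

```python
from typing import List, Tuple


def temporal_vote(signal: List[int], window: int = 5, k: int = 3) -> List[int]:
    """Smooth a binary per-frame detection signal by majority-style voting.

    Assumption: the window is centered on each frame and truncated at the
    start and end of the video. Frame i is kept as positive when at least
    `k` frames inside its window are positive.
    """
    half = window // 2
    n = len(signal)
    voted = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        voted.append(1 if sum(signal[lo:hi]) >= k else 0)
    return voted


def frames_to_segments(voted: List[int], fps: float) -> List[Tuple[float, float]]:
    """Group runs of positive frames into (start_sec, end_sec) segments.

    Hypothetical helper: the end time is the timestamp of the first frame
    after the run, so a segment covers every frame it contains.
    """
    segments = []
    start = None
    for i, flag in enumerate(voted):
        if flag and start is None:
            start = i  # a new run of positive frames begins
        elif not flag and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:  # the run extends to the last frame
        segments.append((start / fps, len(voted) / fps))
    return segments


if __name__ == "__main__":
    # Raw signal with an isolated false positive (frame 2) and an
    # isolated false negative (frame 8) inside a true detection run.
    raw = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0]
    voted = temporal_vote(raw, window=5, k=3)
    print(voted)                            # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
    print(frames_to_segments(voted, 10.0))  # [(0.7, 1.0)]
```

In this toy signal the lone detection at frame 2 is suppressed and the gap at frame 8 is filled, which is the behaviour the abstract attributes to the filter. With K=3 the voted run is also trimmed at its edges, so a real pipeline might pad segments or choose K differently; the abstract does not say how the authors handle this.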
Mahmudul Hasan, Titas Ahmmed, Farhan Sadique, "User Object of Interest Based Video Clip Extraction using Pretrained YOLOv7", International Journal of Engineering and Manufacturing (IJEM), Vol.16, No.3, pp.345-364, 2026. DOI:10.5815/ijem.2026.03.21