Farhan Sadique

Work place: Computer Science and Engineering Discipline, Khulna University, Khulna, 9208, Bangladesh

E-mail: farhan@cse.ku.ac.bd

Website: https://orcid.org/0000-0002-7454-0638

Research Interests:

Biography

Md. Farhan Sadique is a Lecturer in the Computer Science and Engineering Discipline at Khulna University, Khulna, Bangladesh. He earned both his Bachelor and Master of Science degrees in Computer Science and Engineering from Khulna University.

Author Articles
User Object of Interest Based Video Clip Extraction using Pretrained YOLOv7

By Mahmudul Hasan, Titas Ahmmed, Farhan Sadique

DOI: https://doi.org/10.5815/ijem.2026.03.21, Pub. Date: 8 Jun. 2026

Video clip extraction is the process of generating shorter, focused video segments by identifying and retaining frames that contain a user-specified object of interest (UOoI). Such targeted extraction allows users to access relevant portions of a video without watching the entire recording, with practical use in surveillance review, content management, and educational settings. In this work, we present an object-conditioned video clip extraction framework that uses the pretrained YOLOv7 detector to perform frame-level analysis of an input video. For each frame, the detector produces a set of class labels, which are matched against the user-selected UOoI to produce a binary per-frame detection signal. A one-dimensional temporal-window voting filter is then applied to this signal to suppress isolated false positives and recover isolated false negatives, addressing the single-frame detection noise that produces visible discontinuities in naive frame-by-frame approaches. The voted-positive frame indices are mapped back to source timestamps, and the corresponding audio segments are extracted directly from the source video using ffmpeg, preserving the original audio track in the output clip. The framework uses a dictionary of 80 object categories drawn from the MS COCO label set, and a graphical user interface allows users to select an input video, choose a target object, preview the input, and obtain the extracted clip with audio. We evaluate the framework on the SumMe benchmark, which we re-annotated at the frame level for object presence, and on a newly annotated set of 39 videos collected from public sources. Both datasets were independently labelled by two annotators, with Cohen’s kappa of 0.85 and 0.83, respectively, and disagreements resolved by a third adjudicator. At the default voting configuration of W=5, K=3, the framework attains an F1-score of 70.88% with 90.12% accuracy on SumMe and an F1-score of 69.89% with 85.13% accuracy on the custom dataset. 
An ablation over voting parameters shows monotonic gains on SumMe across the full sweep, and a smaller, dataset-dependent gain on the custom set. We discuss the remaining limitations of the pipeline, including single-UOoI conditioning, dependence on the MS COCO label vocabulary, and abrupt transitions between non-adjacent extracted segments, and outline directions for addressing them.
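The temporal-window voting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that W is the width of a window centred on each frame, that K is the minimum number of positive detections in that window needed to keep the frame, and that the window is truncated at the start and end of the video. The function names `vote_filter` and `frames_to_segments` and the `fps` parameter are hypothetical, introduced only for this example.

```python
from typing import List, Tuple


def vote_filter(signal: List[int], w: int = 5, k: int = 3) -> List[int]:
    """Assumed K-of-W vote: frame i is kept when at least k of the
    frames in a window of width w centred on i are positive.
    The window is truncated at the video boundaries (an assumption)."""
    half = w // 2
    n = len(signal)
    voted = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        voted.append(1 if sum(signal[lo:hi]) >= k else 0)
    return voted


def frames_to_segments(voted: List[int], fps: float) -> List[Tuple[float, float]]:
    """Group consecutive positive frames into (start, end) times in seconds.
    The end time is the start of the first frame after the run."""
    segments = []
    start = None
    for i, v in enumerate(voted):
        if v and start is None:
            start = i                      # a positive run begins
        elif not v and start is not None:
            segments.append((start / fps, i / fps))
            start = None                   # the run has ended
    if start is not None:                  # run extends to the last frame
        segments.append((start / fps, len(voted) / fps))
    return segments


# Binary per-frame detections: an isolated positive at frame 2
# and an isolated miss at frame 8 inside a run of positives.
raw = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0]
voted = vote_filter(raw, w=5, k=3)
segments = frames_to_segments(voted, fps=10.0)
```

In this toy signal the isolated positive at frame 2 is suppressed and the missed frame 8 is recovered, which is the behaviour the abstract attributes to the voting filter. The resulting time ranges are the kind of values that could then be handed to a tool such as ffmpeg to cut the matching audio and video.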

Other Articles