An Efficient String Matching Technique for Desktop Search to Detect Duplicate Files




S. Vijayarani 1,*, M. Muthulakshmi 1

1. Department of Computer Science, Bharathiar University, Coimbatore, Tamilnadu, India

* Corresponding author.


Received: 20 Sep. 2016 / Revised: 7 Jan. 2017 / Accepted: 15 Mar. 2017 / Published: 8 Jul. 2017

Index Terms

Content Analysis, File similarity, String matching, Boyer Moore Horspool, Knuth Morris Pratt, W2COM


Information retrieval identifies the documents in a collection that are relevant to a user's query. It also refers to the automatic retrieval of documents from a large document corpus. The most prominent application of information retrieval is the search engine, such as Google, which identifies documents on the World Wide Web that are relevant to user queries. Users often download files that are already stored on their computer. As a result, multiple copies of the same file can accumulate in different drives and folders on the system, which reduces system performance and occupies a large amount of storage space. Analyzing file contents and finding their similarity is one of the major problems in text mining and information retrieval. The main objective of this research work is to analyze file contents and delete the duplicate files in the system. To perform this task, the work proposes a new tool named Duplicate File Detector Tool (DFDT). DFDT helps the user search for and delete duplicate files in the system in minimal time. It handles duplicates not only within the same file category but also across different file categories. Two existing string searching algorithms, Boyer Moore Horspool and Knuth Morris Pratt, are used to compare file contents and find their similarity. This work also proposes a new string matching algorithm named W2COM (Word to Word COMparison). The experimental results show that the proposed W2COM algorithm performs better than the Boyer Moore Horspool and Knuth Morris Pratt algorithms.
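To make the comparison step concrete, the following is a minimal Python sketch of the two kinds of matching the abstract mentions. `horspool_search` is a textbook Boyer Moore Horspool search using a bad-character shift table. `words_equal` is a hypothetical word-by-word comparison written only for illustration; the abstract does not describe the internals of W2COM, so this function should not be read as the authors' algorithm. Both function names are assumptions introduced here.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: for each character in the pattern (except the
    # last position), the distance from its last occurrence to the pattern end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift by the table entry for the text character aligned with the
        # last pattern position; characters not in the table shift by m.
        pos += shift.get(text[pos + m - 1], m)
    return -1


def words_equal(a: str, b: str) -> bool:
    """Hypothetical word-by-word comparison (NOT the paper's W2COM).

    Two texts are treated as duplicates when they contain the same words
    in the same order, ignoring differences in whitespace.
    """
    return a.split() == b.split()
```

A duplicate check across file categories could, under these assumptions, extract plain text from each file and pass the two strings to `words_equal`, while `horspool_search` locates a single pattern inside a larger text.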

Cite This Paper

S. Vijayarani, M. Muthulakshmi, "An Efficient String Matching Technique for Desktop Search to Detect Duplicate Files", International Journal of Information Technology and Computer Science (IJITCS), Vol. 9, No. 7, pp. 69-76, 2017. DOI: 10.5815/ijitcs.2017.07.08

