MOTIFSM: Cloudera Motif DNA Finding Algorithm




Tahani M. Allam 1,*

1. Computer and Control Department, Faculty of Engineering, Tanta University, Tanta, Egypt

* Corresponding author.


Received: 11 Mar. 2023 / Revised: 3 May 2023 / Accepted: 17 Jun. 2023 / Published: 8 Aug. 2023

Index Terms

Cloud Computing, Motif, DNA, Impala, Cloudera


Many systems for studying gene function depend on DNA motifs. DNA motif finding generates a large number of trials, which makes it complex. Regulation of gene expression is identified according to Transcription Factor Binding Sites (TFBSs). Over the past decades, different algorithms have been proposed to obtain an accurate motif-finding tool. The major problems of these algorithms are execution time and memory size, both of which depend on the probabilistic approaches used. Our previous algorithm, called EIMF, was recently proposed to overcome these problems by rearranging data. Because cloud computing involves many resources, mapping jobs to infinite computing resources is an NP-hard optimization problem. In this paper, we propose an Impala framework for running motif-finding algorithms in single-user and multi-user settings based on cloud computing. We also compare the Cloud motif algorithm with the previous EIMF algorithm on three different motif groups. The results show that the Cloudera motif algorithm performed considerably better in the experimental groups, decreasing both execution time and memory size compared with the previous EIMF algorithm. The proposed cloud-based MOTIFSM algorithm decreases execution time by approximately 70% and memory size by about 75% relative to EIMF.

Cite This Paper

Tahani M. Allam, "MOTIFSM: Cloudera Motif DNA Finding Algorithm", International Journal of Information Technology and Computer Science (IJITCS), Vol.15, No.4, pp.10-18, 2023. DOI: 10.5815/ijitcs.2023.04.02

