Kernel Techniques in Support Vector Machines for Classification of Biological Data

Full Text (PDF, 471KB), PP.1-8


Hao Jiang 1,*, Wai-Ki Ching 1, Zeyu Zheng 2

1. Advanced Modeling and Applied Computing Laboratory, Department of Mathematics, The University of Hong Kong, Hong Kong, China

2. School of Mathematical Sciences, Peking University, China

* Corresponding author.


Received: 23 Jul. 2010 / Revised: 15 Oct. 2010 / Accepted: 12 Jan. 2011 / Published: 8 Mar. 2011

Index Terms

AAindex2, Eigen-matrix Translation Techniques, Motif, Protein Classification, Support Vector Machine, Spectrum Kernel Method


In this paper, we consider the problem of protein classification, an important and active topic in bioinformatics. We propose a novel kernel, based on the k-spectrum kernel, that incorporates physico-chemical and biological properties of amino acids as well as motif information. A similarity matrix is constructed from an AAindex2 substitution matrix, which measures the distance between amino acid pairs. Together with the motif content, which assigns importance to the protein sequences, this yields a new kernel. We adopt eigen-matrix translation techniques to improve the classification accuracy. Experimental results indicate that the string-based kernel, in conjunction with an SVM classifier, performs significantly better than the traditional spectrum kernel method. Furthermore, numerical examples confirm that the eigen-matrix translation techniques can be used as a general strategy.
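For orientation, the baseline that the proposed kernel builds on can be sketched in a few lines. The following is a minimal illustration of the plain k-spectrum kernel only, in which two sequences are compared by counting the length-k substrings they share. It is not the authors' weighted kernel: the AAindex2 similarity matrix, the motif weighting, and the eigen-matrix translation step are all omitted. The function names `kmer_counts` and `spectrum_kernel` and the default `k=3` are assumptions made for this example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3):
    """Plain k-spectrum kernel: inner product of the k-mer count vectors.

    Illustrative baseline only; no substitution-matrix or motif weighting.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Iterate over the smaller table; k-mers absent from the other count as 0.
    if len(cx) > len(cy):
        cx, cy = cy, cx
    return sum(n * cy[kmer] for kmer, n in cx.items())
```

For example, with k = 2 the sequence "ABAB" contains "AB" twice and "BA" once, so `spectrum_kernel("ABAB", "ABAB", k=2)` evaluates to 2·2 + 1·1 = 5. A matrix of such pairwise values could in principle be passed to an SVM implementation that accepts precomputed kernels.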

Cite This Paper

Hao Jiang, Wai-Ki Ching, Zeyu Zheng, "Kernel Techniques in Support Vector Machines for Classification of Biological Data", International Journal of Information Technology and Computer Science (IJITCS), vol. 3, no. 2, pp. 1-8, 2011. DOI: 10.5815/ijitcs.2011.02.01

