TY - JOUR
T1 - Gene ontology-based protein function prediction by using sequence composition information.
AU - Dong, Qiwen
AU - Zhou, Shuigeng
AU - Deng, Lei
AU - Guan, Jihong
PY - 2010
Y1 - 2010
N2 - The prediction of protein function is a difficult and important problem in computational biology. In this study, an efficient method is presented to predict protein function from sequence composition information. Four kinds of basic building blocks of protein sequences are investigated: N-grams, binary profiles, PFAM domains and InterPro domains. The protein sequences are mapped into high-dimensional vectors by using the occurrence frequencies of each kind of building block. The resulting vectors are then taken as input to a support vector machine to predict protein function based on gene ontology. Experiments are conducted over a subset of the GOA database. The experimental results show that protein function can be predicted from primary sequence information. The method based on InterPro domains outperforms those based on the other building blocks, achieving an overall accuracy of 0.87 and a ROC score of 0.93. We also demonstrate that feature extraction algorithms such as latent semantic analysis and nonnegative matrix factorization can efficiently remove noise and improve prediction efficiency without significantly degrading performance. The results obtained here are helpful for the prediction of protein function by using only sequence information.
AB - The prediction of protein function is a difficult and important problem in computational biology. In this study, an efficient method is presented to predict protein function from sequence composition information. Four kinds of basic building blocks of protein sequences are investigated: N-grams, binary profiles, PFAM domains and InterPro domains. The protein sequences are mapped into high-dimensional vectors by using the occurrence frequencies of each kind of building block. The resulting vectors are then taken as input to a support vector machine to predict protein function based on gene ontology. Experiments are conducted over a subset of the GOA database. The experimental results show that protein function can be predicted from primary sequence information. The method based on InterPro domains outperforms those based on the other building blocks, achieving an overall accuracy of 0.87 and a ROC score of 0.93. We also demonstrate that feature extraction algorithms such as latent semantic analysis and nonnegative matrix factorization can efficiently remove noise and improve prediction efficiency without significantly degrading performance. The results obtained here are helpful for the prediction of protein function by using only sequence information.
UR - https://www.scopus.com/pages/publications/77955964483
U2 - 10.2174/092986610791190336
DO - 10.2174/092986610791190336
M3 - Article
C2 - 19995340
AN - SCOPUS:77955964483
SN - 0929-8665
VL - 17
SP - 789
EP - 795
JO - Protein and Peptide Letters
JF - Protein and Peptide Letters
IS - 6
ER -