The integration of multiple feature representations for protein protein interaction classification task

Man Lan, Chew Lim Tan

Research output: Contribution to journal › Conference article › peer-review

Abstract

Background: Automatically detecting articles relevant to protein-protein interaction (PPI) for database curation is a crucial step in extracting and retrieving PPI information from text. The vast majority of prior research has used the "bag-of-words" representation, in which each feature corresponds to a single word. To capture information that this simple representation leaves out, we examined alternative ways to represent text based on advanced natural language techniques (protein named entities) and biological domain knowledge (trigger keywords).

Results: These feature representations were evaluated using an SVM classifier on the BioCreAtIvE II benchmark corpus. On their own, the new representations did not produce a significant performance improvement according to statistical significance tests. However, integrating 70 trigger keywords with 4 protein named entity features achieved performance comparable to that of bag-of-words alone. In addition, using only the 4 protein named entity features (4PNE) gave the best recall (98.13%).

Conclusions: In general, our work supports the view that more sophisticated natural language processing (NLP) techniques, and more advanced use of them, need to be developed before better text representations can be produced. Feature representations built with simple NLP techniques would benefit real-life detection systems and exhaustive curation systems, which could be implemented with great efficiency and speed without losing classification performance.
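As a rough illustration of what integrating several feature representations can look like, the following minimal Python sketch builds three namespaced feature groups for one document and merges them into a single feature map. Everything specific in it is an assumption made for illustration: the trigger stems, the small protein lexicon (standing in for a real named-entity recognizer), and the two protein features are hypothetical and are not the 70 trigger keywords or the 4 protein named entity features used in the paper, which the abstract does not list.

```python
import re
from collections import Counter

# Hypothetical trigger stems; the paper's 70 trigger keywords are not given in the abstract.
TRIGGER_STEMS = ("interact", "bind", "phosphorylat", "associat")

# Hypothetical lexicon standing in for a protein named-entity recognizer.
PROTEIN_LEXICON = {"p53", "mdm2", "brca1"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(tokens):
    """One count feature per distinct word."""
    return Counter("bow:" + t for t in tokens)


def trigger_features(tokens):
    """Count tokens that begin with one of the trigger stems."""
    counts = Counter()
    for token in tokens:
        for stem in TRIGGER_STEMS:
            if token.startswith(stem):
                counts["trig:" + stem] += 1
    return counts


def protein_features(tokens):
    """Two illustrative protein-mention features (not the paper's 4PNE set)."""
    mentions = [t for t in tokens if t in PROTEIN_LEXICON]
    return {"pne:mentions": len(mentions), "pne:distinct": len(set(mentions))}


def integrate(*feature_maps):
    """Merge feature groups; the key prefixes keep the groups from colliding."""
    combined = {}
    for feature_map in feature_maps:
        combined.update(feature_map)
    return combined


def document_features(text):
    tokens = tokenize(text)
    return integrate(
        bag_of_words(tokens),
        trigger_features(tokens),
        protein_features(tokens),
    )


example = document_features("p53 binds MDM2, and the two proteins interact.")
```

A feature map like `example` could then be turned into a numeric vector and passed to a classifier such as a linear SVM. Dropping `bag_of_words` from the `integrate` call would give a much smaller representation in the spirit of the trigger-keyword and protein-entity features described above.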

Original language: English
Pages (from-to): 3.1-3.17
Journal: CEUR Workshop Proceedings
Volume: 319
State: Published - 2007
Event: 2nd International Symposium on Languages in Biology and Medicine, LBM 2007 - Singapore, Singapore
Duration: 6 Dec 2007 - 7 Dec 2007
