Abstract
Background: Automatically detecting articles relevant to protein-protein interaction (PPI) is a crucial step in database curation, and a prerequisite for extracting and retrieving PPI information from text. The vast majority of research on this task has used the "bag-of-words" representation, in which each feature corresponds to a single word. To capture information that this simple representation leaves out, we examined alternative ways to represent text based on advanced natural language techniques, i.e. protein named entities, and on biological domain knowledge, i.e. trigger keywords. Results: These feature representations were evaluated with an SVM classifier on the BioCreAtIvE II benchmark corpus. On their own, the new representations did not produce a significant performance improvement according to statistical significance tests. However, the performance achieved by combining 70 trigger keywords with 4 protein named entity features is comparable to that achieved by bag-of-words alone. In addition, using only the 4 protein named entity features (4PNE) gave the best recall (98.13%). Conclusions: In general, our work suggests that more sophisticated natural language processing (NLP) techniques, and more advanced ways of using them, need to be developed before better text representations can be produced. Feature representations built with simple NLP techniques would benefit real-life detection systems and exhaustive curation systems, which could then be implemented with great efficiency and speed without losing classification performance.
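
To make the contrast between the representations concrete, the following is a minimal sketch only, not the authors' implementation. The abstract does not list the 70 trigger keywords or define the 4 protein named entity features, so the keyword list, the protein lexicon (a stand-in for a real named entity recogniser), and the two entity features below are all illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the abstract does not list the actual 70 trigger
# keywords, nor the protein names recognised by the authors' NER step.
TRIGGER_KEYWORDS = ["interact", "bind", "phosphorylate", "complex"]
PROTEIN_LEXICON = {"rad51", "brca2", "p53", "mdm2"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Baseline representation: one count per distinct word."""
    return Counter(tokenize(text))


def trigger_features(text):
    """One count per trigger keyword, matching tokens by prefix so that
    'binds' and 'binding' both count towards 'bind'."""
    tokens = tokenize(text)
    return [sum(tok.startswith(kw) for tok in tokens) for kw in TRIGGER_KEYWORDS]


def protein_features(text):
    """Two illustrative entity features (assumed, not the paper's 4PNE):
    total protein mentions and number of distinct proteins."""
    mentions = [tok for tok in tokenize(text) if tok in PROTEIN_LEXICON]
    return [len(mentions), len(set(mentions))]


def combined_features(text):
    """Concatenate trigger-keyword and protein-entity features."""
    return trigger_features(text) + protein_features(text)


abstract = "BRCA2 binds RAD51, and the BRCA2-RAD51 complex interacts with p53."
print(len(bag_of_words(abstract)), "bag-of-words features")
print(combined_features(abstract))
```

The combined vector has a fixed, small length regardless of vocabulary size, whereas the bag-of-words representation grows with every new word in the corpus. Either kind of vector could then be passed to a classifier such as an SVM; that step is omitted here.
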
| Original language | English |
|---|---|
| Pages (from-to) | 3.1-3.17 |
| Journal | CEUR Workshop Proceedings |
| Volume | 319 |
| State | Published - 2007 |
| Event | 2nd International Symposium on Languages in Biology and Medicine, LBM 2007 - Singapore, Singapore |
| Duration | 6 Dec 2007 → 7 Dec 2007 |