A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining


Abstract:

Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs), from the original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is that EDFs with extreme sparsity in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data, so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled data can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB of unlabeled PubMed abstracts. The experimental results on BioCreative 2, the AIMED corpus, and the TREC 2005 Genomics Track show that 1) FCG can make good use of the sparse features ignored by supervised learning; 2) it improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the three tasks; and 3) our methods achieve 89.1 F-score, 64.5 F-score, and 60.1 normalized utility on the three benchmark data sets.
Page(s): 294 - 307
Date of Publication: 30 September 2010

PubMed ID: 20876938

1 Introduction

With the exponential growth of biomedical literature, such as MEDLINE, automatic text mining tools have become essential for finding information accurately and efficiently. Biomedical text mining (BioTM) [12] has become an active area of data mining. Its fundamental tasks, such as named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC), have attracted considerable research interest in various domains, including bioinformatics, natural language processing (NLP), and machine learning (ML). Although these tasks focus on extracting information of different formats, e.g., entities, relations, or documents, classical methods usually treat them as the classification of text snippets, and the methodologies have a lot in common. In traditional methods, each example is represented by a feature vector in which each element is generated by a Boolean function indicating whether a word, n-gram, or lexical pattern appears in the current example, and these features are then integrated in a supervised learning framework.
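The Boolean feature representation described above can be illustrated with a minimal sketch. The vocabulary, the whitespace tokenizer, and the function names below are assumptions made for illustration only; they are not taken from the paper, and real systems would use task-specific tokenization and much larger feature sets.

```python
from typing import List, Set


def ngrams(tokens: List[str], n: int) -> Set[str]:
    """Return the set of word n-grams (joined by spaces) in a token list."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def boolean_features(text: str, vocabulary: List[str], max_n: int = 2) -> List[int]:
    """Map a text snippet to a 0/1 vector.

    Element k is 1 if vocabulary[k] (a word or n-gram) occurs in the
    snippet, and 0 otherwise. Tokenization is a naive lowercase
    whitespace split, chosen only to keep the sketch short.
    """
    tokens = text.lower().split()
    present: Set[str] = set()
    for n in range(1, max_n + 1):
        present |= ngrams(tokens, n)
    return [1 if entry in present else 0 for entry in vocabulary]


# Hypothetical vocabulary of words and bigrams (not from the paper).
VOCABULARY = ["protein", "interacts", "interacts with", "gene", "binds to"]

vector = boolean_features("Protein A interacts with protein B", VOCABULARY)
print(vector)
```

Vectors of this kind would then be passed, together with labels, to a supervised classifier. Entries that rarely fire in the labeled data are the sparse features that such a representation handles poorly.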

Yanpeng Li
College of Information Science and Technology, Drexel University, Philadelphia, PA, USA
College of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, China
Yanpeng Li received the MS degree from the Dalian University of Technology, China, in 2007. He is a PhD candidate in the College of Computer Science and Technology at the Dalian University of Technology, and currently a visiting student in the College of Information Science and Technology at Drexel University, Philadelphia. His research interests include machine learning, feature generation, semisupervised learning, biomedical text mining, and information retrieval.
Xiaohua Hu
College of Information Science and Technology, Drexel University, Philadelphia, PA, USA
Xiaohua Hu received the BSc degree in software from Wuhan University in 1985, the MEng degree in computer engineering from the Institute of Computing Technology, Chinese Academy of Sciences, in 1988, the MSc degree in computer science from Simon Fraser University, Canada, in 1992, and the PhD degree in computer science from the University of Regina, Canada, in 1995. He is currently an associate professor and the founding director of the Data Mining and Bioinformatics Lab at the College of Information Science and Technology, Drexel University. He is also serving as the IEEE Computer Society Bioinformatics and Biomedicine Steering Committee chair, and was the IEEE Computational Intelligence Society Granular Computing Technical Committee chair from 2007 to 2009. He is a scientist, educator, and entrepreneur. He joined Drexel University in 2002, and founded the International Journal of Data Mining and Bioinformatics in 2006 and the International Journal of Granular Computing, Rough Sets and Intelligent Systems in 2008. Earlier, he worked as a research scientist in world-leading R&D centers such as Nortel Research Center, GTE Labs, and HP Labs. In 2001, he founded DMW Software in Silicon Valley, California. His research ideas have been integrated into many commercial products and applications. His current research interests are in biomedical literature data mining, bioinformatics, text mining, semantic web mining and reasoning, rough set theory and application, information extraction, and information retrieval. He has published more than 180 peer-reviewed research papers in various journals, conferences, and books, such as various IEEE/ACM Transactions (IEEE/ACM TCBB, IEEE TFS, IEEE TDKE, IEEE TITB, IEEE Computer), JIS, KAIS, CI, DKE, IJBRA, SIG KDD, IEEE ICDM, IEEE ICDE, SIGIR, ACM CIKM, IEEE BIBM, IEEE CICBC, etc., and has co-edited 14 books/proceedings.
He has received several prestigious awards, including the 2005 National Science Foundation (NSF) Career award (the most prestigious award from NSF to young faculty in the USA), the best paper award at the 2007 International Conference on Artificial Intelligence, the best paper award at the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, the 2007 IEEE Bioinformatics and Bioengineering Outstanding Contribution Award, the 2006 IEEE Granular Computing Outstanding Service Award, and the 2001 IEEE Data Mining Outstanding Service Award. His research projects are funded by the National Science Foundation (NSF), the US Department of Education, and the PA Department of Health. He has obtained more than US $4.8 million in research grants in the past five years as PI or Co-PI. He is a senior member of the IEEE and the IEEE Computer Society.
Hongfei Lin
College of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, China
Zhihao Yang
College of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, China
Zhihao Yang received the PhD degree in computer science from the Dalian University of Technology, Dalian, China, in 2008. He is currently an associate professor in the Computer Science and Technology College at the Dalian University of Technology. He has published more than 20 research papers on topics in biomedical literature data mining. His current research interests include biomedical literature data mining and information retrieval.
