On feature selection for genomic signal processing and data mining

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The effectiveness of a data mining system depends on how pattern vectors are represented. The most vital information to represent is the set of characteristics embedded in the raw data that are most essential for the intended application. A useful high-level representation should provide concise, invariant, and/or intelligible information about the input patterns. The curse of dimensionality has traditionally been a serious concern in many genomic applications. For example, the feature dimension of gene expression data is often on the order of thousands. This motivates exploration into feature selection and representation, both of which aim to reduce the feature dimensionality to facilitate training and prediction on genomic data. The challenge is to reduce the feature dimension with minimal loss of accuracy. For feature selection, both individual and group information are important, and each has its own pros and cons in measuring the truly relevant information. Individual quantification is simple, as each of the M features can be represented by a single value. However, it cannot deal with inter-feature redundancy, which abounds especially in genomic data. In contrast, group information can fully address the mutual redundancy, but it is often too complicated to process. (Note that there are 2^M possible groups.) Fortunately, between the two extremes there is a convenient compromise: the pairwise kernel, which has low complexity (M^2 pairs) and yet reveals the critical information regarding inter-feature redundancy. Indeed, it has already been found very useful in many genomic applications. In particular, we describe how pairwise-based feature selection may be successfully applied to genomic subcellular localization. A special method (VIA-SVM), designed exclusively for pairwise scoring kernels, is introduced. This is the first method that fully utilizes the reflexive property of the so-called self-supervised training data, which arises uniquely in multiple sequence alignment. Based on several subcellular localization experiments, VIA-SVM, when combined with some filter-type metrics, appears to deliver a substantial dimension reduction (one order of magnitude) with only a small degradation in accuracy.
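
The trade-off between individual, pairwise, and group information described in the abstract can be illustrated with a small sketch. This is not VIA-SVM and not the authors' method: it is a generic greedy filter, assumed here for illustration only, that ranks features by an individual relevance score and uses pairwise scores to skip redundant ones. The function names (`pearson`, `select_features`), the use of absolute Pearson correlation as both relevance and redundancy measure, the `max_redundancy` threshold, and the toy data are all hypothetical choices.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns, labels, max_redundancy=0.9):
    """Greedy filter (illustrative assumption, not VIA-SVM).

    Individual information: one relevance value per feature (M values).
    Pairwise information: one redundancy value per unordered pair of features.
    """
    m = len(columns)
    relevance = [abs(pearson(col, labels)) for col in columns]
    redundancy = {
        (i, j): abs(pearson(columns[i], columns[j]))
        for i, j in combinations(range(m), 2)
    }
    kept = []
    # Visit features from most to least relevant; keep a feature only if it
    # is not too strongly correlated with any feature already kept.
    for i in sorted(range(m), key=lambda k: -relevance[k]):
        if all(redundancy[(min(i, j), max(i, j))] <= max_redundancy for j in kept):
            kept.append(i)
    return kept


# Toy data (hypothetical): feature 1 is an exact multiple of feature 0.
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # feature 0: tracks the label
    [2.0, 2.4, 1.8, 6.0, 6.2, 5.8],  # feature 1: redundant with feature 0
    [0.5, 0.1, 0.9, 0.4, 0.2, 0.8],  # feature 2: weakly related to the label
]

m = len(columns)
n_individual = m                   # M individual scores
n_pairs = m * (m - 1) // 2         # distinct pairs, on the order of M^2
n_groups = 2 ** m                  # 2^M possible feature subsets
selected = select_features(columns, labels)
print(n_individual, n_pairs, n_groups, selected)
```

On this toy data, the individual scores alone cannot tell that features 0 and 1 carry the same information, while the pairwise scores let the filter drop one of them without enumerating all 2^M subsets.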

Original language: English (US)
Title of host publication: Machine Learning for Signal Processing 17 - Proceedings of the 2007 IEEE Signal Processing Society Workshop, MLSP
Pages: 1-20
Number of pages: 20
State: Published - 2007
Event: 17th IEEE International Workshop on Machine Learning for Signal Processing, MLSP-2007 - Thessaloniki, Greece
Duration: Aug 27 2007 to Aug 29 2007

Publication series

Name: Machine Learning for Signal Processing 17 - Proceedings of the 2007 IEEE Signal Processing Society Workshop, MLSP

Other

Other: 17th IEEE International Workshop on Machine Learning for Signal Processing, MLSP-2007
Country/Territory: Greece
City: Thessaloniki
Period: 8/27/07 to 8/29/07

All Science Journal Classification (ASJC) codes

  • General Computer Science
  • Signal Processing
