TY - GEN
T1 - On feature selection for genomic signal processing and data mining
AU - Kung, S. Y.
PY - 2007
Y1 - 2007
N2 - The effectiveness of a data mining system lies in its representation of pattern vectors. The most vital information to be represented is the characteristics embedded in the raw data that are most essential for the intended applications. To extract a useful high-level representation, it is desirable that the representation provide concise, invariant, and/or intelligible information on input patterns. The curse of dimensionality has traditionally been a serious concern in many genomic applications. For example, the feature dimension of gene expression data is often on the order of thousands. This motivates exploration into feature selection and representation, both aiming at reducing the feature dimensionality to facilitate the training and prediction of genomic data. The challenge lies in how to reduce the feature dimension with minimum sacrifice in accuracy. For feature selection, both individual and group information are important, and each has its own pros and cons in measuring the truly relevant information. Individual quantification is simple, as each of the M features can be represented by a single value. However, it cannot deal with inter-feature redundancy, which abounds especially in genomic data. In contrast, group information can fully address the mutual redundancy, but it is often too complicated to process. (Note that there are 2^M possible groups.) Fortunately, between the two extremes there is a convenient compromise: the pairwise kernel, which has a low complexity (M^2 pairs) and yet reveals the critical information regarding inter-feature redundancy. Indeed, it has already been found very useful for many genomic applications. In particular, we shall describe how pairwise-based feature selection may be successfully applied to genomic subcellular localization. A special method (VIA-SVM) designed exclusively for pairwise scoring kernels is introduced. 
This is the first method that fully utilizes the reflexive property of the so-called self-supervised training data, which is uniquely available in multiple sequence alignment. Based on several subcellular localization experiments, VIA-SVM, when combined with some filter-type metrics, appears to deliver a substantial dimension reduction (one order of magnitude) with only a small degradation in accuracy.
UR - http://www.scopus.com/inward/record.url?scp=48149092317&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=48149092317&partnerID=8YFLogxK
U2 - 10.1109/MLSP.2007.4414275
DO - 10.1109/MLSP.2007.4414275
M3 - Conference contribution
AN - SCOPUS:48149092317
SN - 1424415667
SN - 9781424415663
T3 - Machine Learning for Signal Processing 17 - Proceedings of the 2007 IEEE Signal Processing Society Workshop, MLSP
SP - 1
EP - 20
BT - Machine Learning for Signal Processing 17 - Proceedings of the 2007 IEEE Signal Processing Society Workshop, MLSP
T2 - 17th IEEE International Workshop on Machine Learning for Signal Processing, MLSP-2007
Y2 - 27 August 2007 through 29 August 2007
ER -