Feature selection has become a de facto tool for analyzing high-dimensional data, especially in bioinformatics. By removing noise and redundancy, it improves the scalability of learning algorithms and enhances generalization and interpretability. We focus on the paradigm of supervised feature selection, which aims to find an optimal feature subset that best predicts the target. We propose a nonlinear approach that seeks the feature subset achieving the highest inter-class separability in terms of the kernel Discriminant Information (KDI) measure. Theoretically, we prove the existence of good prediction hypotheses for feature subsets with high KDI values. We also establish the equivalence between maximizing the KDI statistic and minimizing a functional dependency measure of the label variable on the data. Moreover, we prove an asymptotic concentration property of the optimal feature subset found by maximizing the KDI measure. Practically, we provide an efficient gradient optimization algorithm for solving the KDI feature selection problem. We evaluate the proposed method on 19 benchmark datasets across various domains and demonstrate a noticeable improvement over state-of-the-art baselines on the majority of classification and regression tasks. Notably, our method is robust to the choice of hyper-parameters, works well with various downstream classifiers, has competitive computational complexity among the kernel-based methods considered, and scales well to the large-scale object recognition setting, with improved generalization on CIFAR.