Discriminant component analysis for privacy protection and visualization of big data

Research output: Contribution to journal › Article › peer-review

21 Scopus citations


Big data has many divergent types of sources, from physical (sensor/IoT) to social and cyber (web) types, rendering it messy, imprecise, and incomplete. Due to its quantitative (volume and velocity) and qualitative (variety) challenges, big data appears to its users like "the elephant to the blind men". A major paradigm shift in data mining and learning tools is imperative, so that information from diversified sources can be integrated to unravel the knowledge hidden in massive and messy big data, letting the blind men, metaphorically speaking, "see" the elephant. This talk will address yet another vital "V"-paradigm: "Visualization". Visualization tools are meant to supplement (instead of replace) the domain expertise (e.g. of a cardiologist) and provide a big picture to help users formulate critical questions and subsequently postulate heuristic and insightful answers. For big data, the curse of high feature dimensionality raises grave concerns about computational complexity and over-training. In this talk, we shall explore various projection methods for dimension reduction, a prelude to visualization of vectorial and non-vectorial data. A popular visualization tool for unsupervised learning is Principal Component Analysis (PCA). PCA aims at the best recoverability of the original data in the Euclidean Vector Space (EVS). However, PCA is not effective for supervised and collaborative learning environments. Discriminant Component Analysis (DCA), basically a supervised PCA, can be derived via a notion of Canonical Vector Space (CVS). The signal subspace components of DCA are associated with the discriminant distance/power (related to the classification effectiveness), while the noise subspace components of DCA are tightly coupled with the recoverability and/or privacy protection.
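The contrast between PCA (best recoverability) and a supervised projection in the spirit of DCA can be illustrated with a small sketch. The DCA formulation in the paper is derived via the CVS; the sketch below uses the classical discriminant-scatter formulation with a ridge term as a stand-in, so function names and the `rho` parameter are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def pca_components(X, k):
    """Top-k principal components: directions of maximum variance,
    i.e. best mean-square recoverability of the centered data."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(X)
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]            # top-k directions, descending

def dca_like_components(X, y, k, rho=1e-3):
    """Supervised projection in the spirit of DCA (a sketch, not the
    paper's exact derivation): eigenvectors of the ridge-regularized
    (S_W + rho*I)^{-1} S_B. The signal subspace has rank at most
    (#classes - 1), echoing DCA's low signal-subspace rank."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_W = np.zeros((d, d))                 # within-class scatter
    S_B = np.zeros((d, d))                 # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)
        S_B += len(Xc) * np.outer(mc - mu, mc - mu)
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W + rho * np.eye(d), S_B))
    order = np.argsort(-vals.real)
    return vecs.real[:, order[:k]]

# Toy 2-class data: variance is largest along axis 0, but the classes
# are separated along axis 1. PCA picks the high-variance axis; the
# supervised projection picks the discriminative axis.
rng = np.random.default_rng(0)
X0 = rng.normal([0.0, 0.0], [5.0, 0.5], size=(200, 2))
X1 = rng.normal([0.0, 2.0], [5.0, 0.5], size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)
print(np.abs(pca_components(X, 1).ravel()))        # weight on axis 0
print(np.abs(dca_like_components(X, y, 1).ravel()))  # weight on axis 1
```

With only two classes the discriminant signal subspace is one-dimensional, which is why a single component suffices for classification here, i.e. the "high compression" property noted below.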
DCA enjoys two major merits. First, because the rank of the signal subspace is limited by the number of classes, DCA can effectively support classification with a relatively small dimensionality (i.e. high compression). Second, in DCA the eigenvalues of the noise subspace are ordered according to their corresponding reconstruction errors and can thus be used to control recoverability or anti-recoverability by applying, respectively, a negative or positive ridge. Via DCA, individual data can be highly compressed before being uploaded to the cloud, thus better enabling privacy protection. In many practical scenarios, additional privacy protection can be incorporated by allowing individual participants to selectively hide some personal features. The classification of such masked data calls for a Kernel Approach to Incomplete Data Analysis (KAIDA). More specifically, we extend PCA/DCA to their kernel variants. The success of kernel machines hinges upon the kernel function adopted to characterize the similarity of pairs of partially-specified vectors. Simulations on the HAR dataset confirm that DCA far outperforms PCA, in both their conventional and kernelized variants. For the latter, the visualization/classification results suggest favorable performance of the proposed partial correlation kernels over the imputed RBF kernel. In addition, the visualization results further point to a potentially promising approach via multiple kernels, such as combining an imputed Gaussian RBF kernel and a non-imputed partial correlation kernel.
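The two kernel options compared above can be sketched as follows. The exact partial correlation kernel proposed in the paper may differ in detail; this sketch only illustrates the core contrast, namely a non-imputed kernel computed over mutually revealed features versus an RBF kernel on mean-imputed vectors. Function names and the `gamma` parameter are illustrative assumptions.

```python
import numpy as np

def partial_correlation_kernel(x, y):
    """Similarity of two partially-specified vectors, computed only on
    the features both participants chose to reveal (NaN = hidden).
    A sketch of the non-imputed partial-correlation idea."""
    mask = ~(np.isnan(x) | np.isnan(y))
    if mask.sum() < 2:
        return 0.0                       # too few mutually revealed features
    xs, ys = x[mask], y[mask]
    xs = xs - xs.mean()
    ys = ys - ys.mean()
    denom = np.linalg.norm(xs) * np.linalg.norm(ys)
    return float(xs @ ys / denom) if denom > 0 else 0.0

def imputed_rbf_kernel(x, y, means, gamma=0.1):
    """Baseline: fill hidden features with population means, then apply
    a Gaussian RBF kernel on the fully imputed vectors."""
    xi = np.where(np.isnan(x), means, x)
    yi = np.where(np.isnan(y), means, y)
    return float(np.exp(-gamma * np.sum((xi - yi) ** 2)))

# Two participants hide different features before uploading.
a = np.array([1.0, 2.0, np.nan, 4.0])
b = np.array([1.1, np.nan, 3.0, 4.2])
means = np.array([1.0, 2.0, 3.0, 4.0])   # population means for imputation
print(partial_correlation_kernel(a, b))  # uses only features 0 and 3
print(imputed_rbf_kernel(a, b, means))
```

A multiple-kernel combination of the kind mentioned above would simply be a weighted sum of these two kernel values; any such weighting here would be an assumption, as the paper explores the combination empirically.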

Original language: English (US)
Pages (from-to): 3999-4034
Number of pages: 36
Journal: Multimedia Tools and Applications
Issue number: 3
State: Published - Feb 1 2017

All Science Journal Classification (ASJC) codes

  • Software
  • Media Technology
  • Hardware and Architecture
  • Computer Networks and Communications


Keywords
  • Anti-recoverability
  • Big data
  • CVS (canonical vector space)
  • Collaborative learning
  • Component analysis
  • DCA (discriminant component analysis)
  • Data matrix
  • Dimension reduction
  • Discriminant distance (DD)
  • Discriminant power (DP)
  • EVS (Euclidean vector space)
  • KDCA
  • Kernel approach to Incomplete Data Analysis (KAIDA)
  • Kernel machine
  • Learning subspace property (LSP)
  • MDA (multiple discriminant analysis)
  • Net entropy
  • Noise subspace
  • PCA (principal component analysis)
  • Privacy protection
  • Projection matrix
  • Recoverability
  • Signal subspace
  • Subspace analysis
  • Supervised learning
  • Unsupervised learning
  • Vectorial and Non-vectorial data analysis
  • Visualization
