TY - GEN
T1 - Efficiently matching sets of features with random histograms
AU - Dong, Wei
AU - Wang, Zhe
AU - Charikar, Moses
AU - Li, Kai
PY - 2008
Y1 - 2008
N2 - As the commonly used representation of a feature-rich data object has evolved from a single feature vector to a set of feature vectors, a key challenge in building a content-based search engine for feature-rich data is to match feature-sets efficiently. Although substantial progress has been made during the past few years, existing approaches are still inef- ficient and inflexible for building a search engine for massive datasets. This paper presents a randomized algorithm to embed a set of features into a single high-dimensional vec- tor to simplify the feature-set matching problem. The main idea is to project feature vectors into an auxiliary space using locality sensitive hashing and to represent a set of features as a histogram in the auxiliary space. A histogram is simply a high dimensional vector, and efficient similarity measures like L1 and L2 distances can be employed to approximate feature-set distance measures. We evaluated the proposed approach under three differ- ent task settings, i.e. content-based image search, image object recognition and near-duplicate video clip detection. The experimental results show that the proposed approach is indeed effective and flexible. It can achieve accuracy com- parable to the feature-set matching methods, while requir- ing significantly less space and time. For object recognition with Caltech 101 dataset, our method runs 25 times faster to achieve the same precision as Pyramid Matching Kernel, the state-of-the-art feature-set matching method.
AB - As the commonly used representation of a feature-rich data object has evolved from a single feature vector to a set of feature vectors, a key challenge in building a content-based search engine for feature-rich data is to match feature-sets efficiently. Although substantial progress has been made during the past few years, existing approaches are still inef- ficient and inflexible for building a search engine for massive datasets. This paper presents a randomized algorithm to embed a set of features into a single high-dimensional vec- tor to simplify the feature-set matching problem. The main idea is to project feature vectors into an auxiliary space using locality sensitive hashing and to represent a set of features as a histogram in the auxiliary space. A histogram is simply a high dimensional vector, and efficient similarity measures like L1 and L2 distances can be employed to approximate feature-set distance measures. We evaluated the proposed approach under three differ- ent task settings, i.e. content-based image search, image object recognition and near-duplicate video clip detection. The experimental results show that the proposed approach is indeed effective and flexible. It can achieve accuracy com- parable to the feature-set matching methods, while requir- ing significantly less space and time. For object recognition with Caltech 101 dataset, our method runs 25 times faster to achieve the same precision as Pyramid Matching Kernel, the state-of-the-art feature-set matching method.
KW - Locality sensitive hashing
KW - Random histogram
KW - Set of features
UR - http://www.scopus.com/inward/record.url?scp=70350637634&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70350637634&partnerID=8YFLogxK
U2 - 10.1145/1459359.1459384
DO - 10.1145/1459359.1459384
M3 - Conference contribution
AN - SCOPUS:70350637634
SN - 9781605583037
T3 - MM'08 - Proceedings of the 2008 ACM International Conference on Multimedia, with co-located Symposium and Workshops
SP - 179
EP - 188
BT - MM'08 - Proceedings of the 2008 ACM International Conference on Multimedia, with co-located Symposium and Workshops
T2 - 16th ACM International Conference on Multimedia, MM '08
Y2 - 26 October 2008 through 31 October 2008
ER -