Thumbnail

Index Structures for Similarity Search in Multimedia Databases

B. Bustos

2006
Dissertation

An important research issue in the field of multimedia databases is the retrieval of similar objects. For most applications in multimedia databases, an exact search is not meaningful. Thus, much effort has been devoted to developing efficient and effective similarity search techniques. A widely used approach for implementing similarity search engines is the feature-based approach. In this approach, all multimedia objects stored in a database are transformed into high-dimensional feature vectors, which are then inserted into an index structure to efficiently perform similarity queries. The contribution of this thesis is to explore and propose novel solutions to improve the efficiency of similarity queries in multimedia databases. The thesis begins with a study on how to improve the effectiveness (i.e., the quality of the answer) of a similarity retrieval engine. We first show that by using combinations of feature vectors the effectiveness of the similarity search may be significantly enhanced. Then, we describe methods for computing query-dependent weights to perform linear combinations of feature vectors, which can further improve the effectiveness of the similarity search. As almost all index structures for similarity search developed so far can only deal with single feature vectors, the design and analysis of new index structures is necessary to efficiently perform similarity queries that use combinations of feature vectors. This gives extra motivation for the techniques studied in the rest of the thesis. In the next part of the thesis, we propose several algorithms and index structures to improve the efficiency of similarity queries. Firstly, we study pivot selection techniques for pivot-based indices. We provide an efficiency criterion based on distance histograms for selecting a good set of pivots and present empirical evidence showing that the technique is effective. Secondly, we describe an improved k nearest neighbor (k-NN) algorithm, which is based on the best-first traversal algorithm proposed by Hjaltason and Samet. Although the original algorithm is already optimal in the number of distance computations, its space requirements are significant. The improved algorithm aims to lower the space requirements by using distance estimators. Thirdly, we present a metric access method for dynamic combinations of feature vectors. The index is pivot-based, and it can take advantage of the previously studied pivot selection techniques. Finally, we introduce an approach that aims to minimize the expected search cost of a similarity query. The idea is to index only the most frequently used combinations of feature vectors. If there are restrictions on the available space for constructing indices, then the resulting optimization problem can be modeled as a binary linear program. As binary linear programs are NP-hard in the general case, we also propose algorithms that quickly find good sets of indices. The last part of the thesis explores the use of graphics processor units (GPUs) for accelerating database operations. We present GPU implementations of a high-dimensional nearest neighbor search and a clustering algorithm. An experimental evaluation shows that the proposed GPU algorithms are an order of magnitude faster than their CPU versions.

Materials
Related Publication
thumbnail
Proceedings of the First International Workshop on Digital Libraries Foundations (DLF1), Vancouver, British Columbia, Canada, June 23, 2007. DELOS Network of Excellence on Digital Libraries, 2007
thumbnail
Proceedings of the International Conference on Computational Science (ICCS'06) Part IV, LNCS 3994, 2006
thumbnail
Proc. 20th Annual ACM Symposium on Applied Computing, Multimedia and Visualization Track (SAC-MV'05), 2005
Title