Project type: SCITECH Project

Benefit and Bias of Approximate Nearest Neighbor Search for Machine Learning and Data Mining

The search for nearest neighbors is a crucial ingredient in many applications such as density estimation, clustering, classification, and outlier detection. Often, neighborhood search is also the bottleneck in terms of efficiency in these applications. In the age of big data, companies and organizations can usually store billions of individual data points and embed these data points into a high-dimensional vector space. For example, the Danish company Pufin ID uses nearest neighbor search to link chemical labels placed on physical objects to a digital hash code. They require answers in milliseconds for such neighbor searches among nearly a billion high-dimensional vectors. Due to the curse of dimensionality, using traditional, exact nearest neighbor search algorithms are the bottleneck of such applications, which can take minutes or hours to answer a single query.

To solve such scalability challenges, more and more approximate nearest neighbor (ANN) search methods are employed. Depending on the data structure, the word “approximate” can both mean a strong theoretical guarantee or more loosely that results are expected to be inexact. Many applications of ANN based methods have profound societal influence on algorithmic decision-making processes. If a user sees a stream of personalized, recommended articles or a “curated” version of the timeline, the need for efficient processing makes it often necessary that these results are based on the selection of approximate nearest neighbors in an intermediate step. Thus, the bias, benefits, or dangers of such a selection process must be studied.

According to standard benchmarks, approximate methods can process queries several orders of magnitude faster than exact approaches, if results do not need to be close to exact. A downstream application of nearest neighbor search must take
the inexact nature of the results into account. Different paradigms might come with different biases, and some paradigms might be more suitable for a certain use case. For example, recent work suggests that some ANN methods exhibit an “all or nothing” behavior, which causes the found neighbors to be completely unrelated. This can evaporate the trust of a user in the application. On the other hand, there exists work that suggests that ANN can improve the results of a downstream application, for example in the context of ensemble learning for outlier detection.

This project aims to use approximate nearest neighbor search to design highly scalable and robust algorithms for diverse tasks such as clustering, classification, and outlier detection.

Hypothesis
Many applications in machine learning and data mining can be sped up using approximate nearest neighbor search with no or only a negligible loss in result quality for the application, compared to an exact search. Different methods for ANN search come with different biases that can be positive or negative with varying degrees for the downstream application. In this project, the bias of different ANN methods and its impact on different applications will be studied from a fundamental and from an empirical level. We strive to address the following problems:

Theme 1
ANN for Discrimination Discovery and Diversity Maximization. Discrimination discovery and diversity maximization are central elements in the area of algorithmic fairness. Traditional classifiers include k-NN classifiers that scale poorly to high-dimensional data. On the other hand, diversity maximization usually involves a diversification of nearest neighbor search results.

Goals: Study the effect of ANN results on the quality of the k-NN classifier for discrimination discovery. Theoretically develop LSH-based diversity maximization methods that build the diversification into the LSH, and empirically evaluate it against other known approaches.

Theme 2
ANN for Outlier Detection. How do different ANN paradigms influence the quality of an outlier detection (OD) algorithm? Can outlier classification be “built into” an ANN algorithm to further scale up the performance?

Goals: Develop a theoretically sound LSH based outlier detection algorithm with provable guarantees; empirically compare the performance of different ANN-based OD classifiers; design and evaluate the performance of using different classifiers in an ensemble.

Theme 3
ANN for Clustering. Density-based and traditional clustering approaches rely on a nearest neighbor search or a range search to cluster data points. What is the effect of finding approximate neighbors? How well can we adopt different ANN paradigms to support range search operations? Related work uses LSH as a black-box: Can we use the LSH bucket structure to directly implement DBSCAN?

Goals: Extend graph-based ANN algorithms to support range-search primitives. Implement DBSCAN-based variants and evaluate their performance and quality.

Theme 4
ANN to Speed-Up Machine Learning Training. Many training tasks in machine learning are costly. However, steps such as backpropagation boil down to a maximum inner product search (MIPS), for which we know that ANN provide efficient approximate solutions. In this task, we will study whether we can achieve comparable or better performance using ANN in the backpropagation step. Will the bias hurt the classification results or improve robustness?

Goals: Develop and evaluate neural network training using different ANN-based approaches to MIPS.

Risk Management
The research themes mentioned above can mostly be carried out independently from each other. The actual downstream application is relatively flexible, which lowers the risk of the project failing. The hiring process will make sure that the prospective PhD student has both a theoretical understanding of algorithms and data mining and practical experience with programming. As a fallback if theoretical results turn out to not be in reach, the empirical results will improve the state-of-the-art and imply strong results for venues with an empirical focus, yield demonstrations at such venues, and result in open-source software to make the methods available to a broad audience.

Scientific value
The scientific value of the project is a fundamental understanding of the influence of approximate nearest neighbor search on applications in machine learning and data mining, such as outlier detection, clustering and algorithmic decision making. Through this project, we will propose new algorithmic methods and provide efficient implementations to solve important machine learning and data mining tasks. In the spirit of open science and to maximize impact of the scientific results, all software resulting from the project will be made available open source. As a long-term goal, our results will show that when handled with care in the design and rigor in the analysis, approximate methods allow the design of scalable algorithms that do not necessarily lose in quality. Of course, this might not only be true areas covered in this project, but many others where exact solutions are computationally out of reach.

Capacity building
In terms of capacity building the value of the project is to educate a PhD student. Such a student will be able to work both on a theoretical and an applied level. She will also be trained in critical thinking on algorithmic decision making, which is a highly valuable skill for society. In addition, the project will offer several affiliated student projects on a Bachelor’s and Master’s level, and the availability of the research results will make it easy for others to build upon the work. The long-term goal of this project is to attract the interest of companies to use these methods and develop them further, aiming for follow-up projects with industry partners on a larger scale.

Societal value
The rise of vector embedding methods for text, images, and video had a deep impact on society. Many of its applications such as personalized recommendations or curated news feeds are taken for granted, but are only made possible through efficient search methods. Thus, ANN-based methods allowed us to design algorithmic decision-making processes with profound influence on our everyday life. If a user sees a stream of personalized, recommended articles or a “curated” version of their social media feed, it is very likely that these results are based on the selection of approximate nearest neighbors in an intermediate step. The bias, benefits, and dangers of such a selection process must be studied carefully. Moreover, a successful application of approximate techniques has the potential for liberating the use of methods such as deep learning by lowering the entry cost in terms of hardware. This is for example show-cased by the recently founded start-up ThirdAI.

August 2022 – December 31, 2025 – 3,5 years.

Total budget DKK 3,5 / DIREC investment DKK 1,77

Participants

Project Manager

Martin Aumüller

Associate Professor

IT University of Copenhagen
Department of Computer Science

E: maau@itu.dk

Project Manager

Arthur Zimek

Professor

University of Southern Denmark
Department of Mathematics and Computer Science

E: zimek@imada.sdu.dk

Partners