Search
Close this search box.
Categories
SciTech project

Online Algorithms with Predictions

Project type: SCITECH Project

Online Algorithms with Predictions

All industrial sectors face optimization problems, and usually many of them, i.e., situations where one must optimize wrt. some resource. This could be minimizing material usage or it could be optimizing towards time or space consumption. Examples include cutting shapes from expensive material, packing containers best possibly to minimize transportation costs, or scheduling routes or dependent tasks to finish earliest possible.

In some cases, all information is available when the processing of tasks commences, but in many situations, tasks arrive during the process, and decisions regarding their treatment must be made shortly after their arrival before further tasks appear. Such problems are referred to as “online”. Obviously, online problems lead to poorer solutions than one can obtain with their offline counterparts, unless fairly precise, additional information about the future tasks is available. In designing and analyzing algorithms, in general, the goal is to determine the quality of an algorithmic solution, preferably with guarantees on performance for all inputs, so that it is possible to promise delivery times or bounds on expenses, etc. Such an analysis also allows the designer to determine if it would be beneficial to search for other algorithmic solutions. Assessing the quality of the algorithms experimentally suffers from the difficulty of determining which inputs to test on and providing trustworthy worst-case bounds.

The area of online algorithms has existed for many years and provides analyses giving worst-case guarantees. However, since these guarantees hold for all inputs, even the most extreme and, sometimes, unrealistic, these guarantees are very pessimistic and often not suited for choosing good algorithms for the typical cases. Thus, in practice, companies often use techniques based on heuristic methods, machine learning, etc. Machine learning, especially, has proven very successful in many applications in providing solutions that are good in practice, when presented with typical inputs. However, on input not captured by training data, the algorithm may fail dramatically.

We need combinations of the desirable properties of guarantees from the online algorithms world and of the experienced good behavior on typical input from, for instance, the machine learning world. That is, we need algorithms that follow predictions given from a machine learning component, for instance, since that often gives good results, but it should not do so blindly or the worst-case behavior will generally be even worse than the guarantees provided by standard online algorithms and their analyses. Thus, a controlling algorithmic unit should monitor the predictions that are given so that safety decisions can overrule the predictions when things are progressing in a worrisome direction.

We also need ways of quantifying the guaranteed quality of our solutions as a function of how closely an input resembles the predicted (by a machine learning component, for instance) input. This is a crucial part of risk management. We want reassurance that we do not “fall off the cliff” just because predictions are slightly off. This includes limiting the ”damage” possible from machine learning adversarial attacks. As an integral part of a successful approach to this problem, we need measures developed to quantify an input’s distance from the prediction (the prediction error) that are defined in such a manner that quality can be expressed as a function of the prediction error. For online algorithm applications, this often needs to be different from standard loss functions for machine learning.

Our main aim is to further the development of generally-applicable techniques for utilizing usually good, but untrusted predictions, while at the same time providing worst-case guarantees, in the realm of online optimization problems. We want to further establish this research topic at Danish universities and subsequently disseminate knowledge of this to industry via joint collaboration. Developments of this nature are of course considered internationally. Progress is to a large extent made by considering carefully chosen concrete problems, their modeling and properties, and extract general techniques from those studies, and further test their applicability on new problems.

We are planning to initiate work on online call control and scheduling with precedence constraints. The rationale is that these problems are important in their own right and at the same type represent different types of challenges. Call control focuses on admitting as many requests as possible with limited bandwidth, whereas scheduling focuses on time, handling all requests as effectively as possible.

Call control can be seen as point-to-point requests in a network with limited capacity. The goal is to accept as profitable a collection of requests as possible. Scheduling deals with jobs of different duration that must be executed on some “machine” (not necessarily a computer), respecting some contraints that some jobs cannot be executed before certain other jobs are completed. In this problem, all jobs must be scheduled on some machine, and the target is to complete all jobs as fast as possible. To fully define these problems more details are required about the structure of the resources and the precise optimization goals.

Some generic insight we would like to gain and which is sorely lacking in the community currently is formalizable conditions for good predictions. We want performance of algorithms to degrade gracefully with prediction errors. This is important for the explainability and trustworthiness of algorithms. Related to this, whereas some predictions may be easy to work with theoretically, it is important to focus on classes of predictions that are learnable in practice. To be useful, this also requires robustness, in the sense that minor, inconsequential changes in the input sequence compared with the prediction should not affect the result dramtically.

We are also interested in giving minor consideration to impossibility results, i.e., proving limits on how good solutions can be obtained. Whereas this is not directly constructive, it can tell us if we are done or how close we are to an optimal algorithm, so we do not waste time trying to improve algorithms that cannot be improved or only improved marginally.

The project leads to value creation in a number of different directions.

Research-wise, with the developments in machine learning and related data science disciplines over the last years, the integration and utilization of these techniques into other areas of computer science is of great interest, and Danish research should be at the forefront of these endeavors. We facilitate this by bringing people with expertise in different topics together and consolidating knowledge of the primary techniques across institutions. Educating students in these topics is usually a nice side-effect of running such a project. The primary focus, of course, is to educate the PhD student and train the research assistants, but this is accompanied by having MS students working on their theses during the project period solve related, well-defined subproblems.

We are advocating the combined techniques that strive towards excellent typical-case performance while providing worst-case guarantees, and believe that they should be adopted by industry to a larger extent. The project will lead to results on concrete problems, but our experience tells us that companies generally need variations of these or new solutions to somewhat different problems. Thus, the most important aspect in this regards is capacity building, so that we can assist with concrete developments for particular companyspecific problems. Other than the fact that problems appear in many variations in different companies, a main reason why problem adaption would often be necessary is that the added value of the combined algorithmic approaches is based on predictions. And it varies greatly what type of data is obtainable and which subset of the data can give useful predictions.

We have prior experience with industry consulting, the industrial PhD program, and co-advised MS students, and maintain close relationships with local industry. After, and in principle also during, this project, we are open to subsequent joint projects with industry that take their challenges as the starting point, whereafter we utilize the know-how and experience gained from the current project. Work such as that could be on a consultancy basis, through joint student project, or, at a larger scale, with, for instance, the Innovation Foundation as a partner.

Finally, we see it as an advantage in our project that we include researchers that are relatively new to Denmark such that they get to interact with more people at different institutions and expand their Danish network.

September 1, 2022 – August 31, 2025 – 3 years.

Total budget DKK 3,5 / DIREC investment DKK 1,5

Participants

Project Manager

Kim Skak Larsen

Professor

University of Southern Denmark
Department of Mathematics and Computer Science

E: kslarsen@imada.sdu.dk

Nutan Limaye

Associate Professor

IT University of Copenhagen
Department of Computer Science

Joan Boyar

Professor

University of Southern Denmark
Department of Mathematics and Computer Science

Melih Kandemir

Associate Professor

University of Southern Denmark
Department of Mathematics and Computer Science

Lene Monrad Favholdt

Associate Professor

University of Southern Denmark
Department of Mathematics and Computer Science

Magnus Berg Pedersen

PhD Student

University of Southern Denmark
Department of Mathematics and Computer Science

Tim Poulsen

Student Programmer

IT University of Copenhagen

Partners

Categories
SciTech project

Benefit and Bias of Approximate Nearest Neighbor Search for Machine Learning and Data Mining

Project type: SCITECH Project

Benefit and Bias of Approximate Nearest Neighbor Search for Machine Learning and Data Mining

The search for nearest neighbors is a crucial ingredient in many applications such as density estimation, clustering, classification, and outlier detection. Often, neighborhood search is also the bottleneck in terms of efficiency in these applications. In the age of big data, companies and organizations can usually store billions of individual data points and embed these data points into a high-dimensional vector space. For example, the Danish company Pufin ID uses nearest neighbor search to link chemical labels placed on physical objects to a digital hash code. They require answers in milliseconds for such neighbor searches among nearly a billion high-dimensional vectors. Due to the curse of dimensionality, using traditional, exact nearest neighbor search algorithms are the bottleneck of such applications, which can take minutes or hours to answer a single query.

To solve such scalability challenges, more and more approximate nearest neighbor (ANN) search methods are employed. Depending on the data structure, the word “approximate” can both mean a strong theoretical guarantee or more loosely that results are expected to be inexact. Many applications of ANN based methods have profound societal influence on algorithmic decision-making processes. If a user sees a stream of personalized, recommended articles or a “curated” version of the timeline, the need for efficient processing makes it often necessary that these results are based on the selection of approximate nearest neighbors in an intermediate step. Thus, the bias, benefits, or dangers of such a selection process must be studied.

According to standard benchmarks, approximate methods can process queries several orders of magnitude faster than exact approaches, if results do not need to be close to exact. A downstream application of nearest neighbor search must take
the inexact nature of the results into account. Different paradigms might come with different biases, and some paradigms might be more suitable for a certain use case. For example, recent work suggests that some ANN methods exhibit an “all or nothing” behavior, which causes the found neighbors to be completely unrelated. This can evaporate the trust of a user in the application. On the other hand, there exists work that suggests that ANN can improve the results of a downstream application, for example in the context of ensemble learning for outlier detection.

This project aims to use approximate nearest neighbor search to design highly scalable and robust algorithms for diverse tasks such as clustering, classification, and outlier detection.

Hypothesis
Many applications in machine learning and data mining can be sped up using approximate nearest neighbor search with no or only a negligible loss in result quality for the application, compared to an exact search. Different methods for ANN search come with different biases that can be positive or negative with varying degrees for the downstream application. In this project, the bias of different ANN methods and its impact on different applications will be studied from a fundamental and from an empirical level. We strive to address the following problems:

Theme 1
ANN for Discrimination Discovery and Diversity Maximization. Discrimination discovery and diversity maximization are central elements in the area of algorithmic fairness. Traditional classifiers include k-NN classifiers that scale poorly to high-dimensional data. On the other hand, diversity maximization usually involves a diversification of nearest neighbor search results.

Goals: Study the effect of ANN results on the quality of the k-NN classifier for discrimination discovery. Theoretically develop LSH-based diversity maximization methods that build the diversification into the LSH, and empirically evaluate it against other known approaches.

Theme 2
ANN for Outlier Detection. How do different ANN paradigms influence the quality of an outlier detection (OD) algorithm? Can outlier classification be “built into” an ANN algorithm to further scale up the performance?

Goals: Develop a theoretically sound LSH based outlier detection algorithm with provable guarantees; empirically compare the performance of different ANN-based OD classifiers; design and evaluate the performance of using different classifiers in an ensemble.

Theme 3
ANN for Clustering. Density-based and traditional clustering approaches rely on a nearest neighbor search or a range search to cluster data points. What is the effect of finding approximate neighbors? How well can we adopt different ANN paradigms to support range search operations? Related work uses LSH as a black-box: Can we use the LSH bucket structure to directly implement DBSCAN?

Goals: Extend graph-based ANN algorithms to support range-search primitives. Implement DBSCAN-based variants and evaluate their performance and quality.

Theme 4
ANN to Speed-Up Machine Learning Training. Many training tasks in machine learning are costly. However, steps such as backpropagation boil down to a maximum inner product search (MIPS), for which we know that ANN provide efficient approximate solutions. In this task, we will study whether we can achieve comparable or better performance using ANN in the backpropagation step. Will the bias hurt the classification results or improve robustness?

Goals: Develop and evaluate neural network training using different ANN-based approaches to MIPS.

Risk Management
The research themes mentioned above can mostly be carried out independently from each other. The actual downstream application is relatively flexible, which lowers the risk of the project failing. The hiring process will make sure that the prospective PhD student has both a theoretical understanding of algorithms and data mining and practical experience with programming. As a fallback if theoretical results turn out to not be in reach, the empirical results will improve the state-of-the-art and imply strong results for venues with an empirical focus, yield demonstrations at such venues, and result in open-source software to make the methods available to a broad audience.

Scientific value
The scientific value of the project is a fundamental understanding of the influence of approximate nearest neighbor search on applications in machine learning and data mining, such as outlier detection, clustering and algorithmic decision making. Through this project, we will propose new algorithmic methods and provide efficient implementations to solve important machine learning and data mining tasks. In the spirit of open science and to maximize impact of the scientific results, all software resulting from the project will be made available open source. As a long-term goal, our results will show that when handled with care in the design and rigor in the analysis, approximate methods allow the design of scalable algorithms that do not necessarily lose in quality. Of course, this might not only be true areas covered in this project, but many others where exact solutions are computationally out of reach.

Capacity building
In terms of capacity building the value of the project is to educate a PhD student. Such a student will be able to work both on a theoretical and an applied level. She will also be trained in critical thinking on algorithmic decision making, which is a highly valuable skill for society. In addition, the project will offer several affiliated student projects on a Bachelor’s and Master’s level, and the availability of the research results will make it easy for others to build upon the work. The long-term goal of this project is to attract the interest of companies to use these methods and develop them further, aiming for follow-up projects with industry partners on a larger scale.

Societal value
The rise of vector embedding methods for text, images, and video had a deep impact on society. Many of its applications such as personalized recommendations or curated news feeds are taken for granted, but are only made possible through efficient search methods. Thus, ANN-based methods allowed us to design algorithmic decision-making processes with profound influence on our everyday life. If a user sees a stream of personalized, recommended articles or a “curated” version of their social media feed, it is very likely that these results are based on the selection of approximate nearest neighbors in an intermediate step. The bias, benefits, and dangers of such a selection process must be studied carefully. Moreover, a successful application of approximate techniques has the potential for liberating the use of methods such as deep learning by lowering the entry cost in terms of hardware. This is for example show-cased by the recently founded start-up ThirdAI.

August 2022 – December 31, 2025 – 3,5 years.

Total budget DKK 3,5 / DIREC investment DKK 1,77

Participants

Project Manager

Martin Aumüller

Associate Professor

IT University of Copenhagen
Department of Computer Science

E: maau@itu.dk

Project Manager

Arthur Zimek

Professor

University of Southern Denmark
Department of Mathematics and Computer Science

E: zimek@imada.sdu.dk

Camilla Okkels

PhD Student

IT University of Copenhagen
Department of Computer Science

Victor Bello Thomsen

Research Assistant

IT University of Copenhagen
Department of Computer Science

Partners

Categories
SciTech project

Machine Learning Algorithms Generalisation

Project type: SCITECH Project

Machine Learning Algorithms Generalisation

AI is radically changing society and the main driver behind new AI methods and systems is machine learning. Machine learning focuses on finding solutions for, or patterns in, new data by learning from relevant existing data. Thus, machine learning algorithms are often applied to large datasets and then they more or less autonomously find good solutions by finding relevant information or patterns hidden in the data. However, it is often not well understood why machine learning algorithms work so well in practice on completely new data – often their performance surpass what current theory would suggest by a wide margin.

Being able to understand and predict when, why and how well machine learning algorithms work on a given problem is critical for knowing when they may be applied and trusted, in particular in more critical systems. Understanding why the algorithms work is also important in order to be able drive the machine learning field forward in the right direction, improving upon existing algorithms and designing new ones.

The goal of this project is to research and develop a better understanding of the generalisation capability of the most used machine learning algorithms, including boosting algorithms, support vector machines and deep learning algorithms. The result will be new generalisation bounds, both showing positive what can be achieved and negative what cannot.

This will allow us to more fully understand the current possibilities and limits, and thus drive the development of new and better methods. Ultimately, this will provide better guarantees for the quality of the output of machine learning algorithms in a variety of domains.

Researching the theoretical foundation for machine learning (and thus essentially all AI based systems) will benefit society at large, since a solid theory will allow us to formally argue and understand when and under which conditions machine learning algorithms can deliver the required quality.

As an added value, the project will bring together leading experts in Denmark in the theory of algorithms to (further) develop the fundamental theoretical basis of machine learning. Thus, it may serve as a starting point for additional national and international collaboration and projects, and it will build up competences highly relevant for Danish industry.

October 1, 2020 – September 31, 2024 – 3,5 years.

Total budget DKK 2,41 / DIREC investment DKK 1,55

Participants

Project Manager

Kasper Green Larsen

Professor

Aarhus University
Department of Computer Science

E: larsen@cs.au.dk

Allan Grønlund

Postdoc

Aarhus University
Department of Computer Science

Mikkel Thorup

Professor

University of Copenhagen
Department of Computer Science

Martin Ritzert

Postdoc

Aarhus University
Department of Computer Science

Chenglin Fan

Postdoc

Aarhus University
Department of Computer Science

Partners