Categories
SciTech project

Online Algorithms with Predictions

Project type: SCITECH Project

Online Algorithms with Predictions

All industrial sectors face optimization problems, and usually many of them, i.e., situations where one must optimize with respect to some resource. This could be minimizing material usage, or it could be optimizing time or space consumption. Examples include cutting shapes from expensive material, packing containers as tightly as possible to minimize transportation costs, or scheduling routes or dependent tasks to finish as early as possible.

In some cases, all information is available when the processing of tasks commences, but in many situations, tasks arrive during the process, and decisions regarding their treatment must be made shortly after their arrival, before further tasks appear. Such problems are referred to as “online”. Obviously, online problems lead to poorer solutions than their offline counterparts, unless fairly precise additional information about future tasks is available. In designing and analyzing algorithms, the general goal is to determine the quality of an algorithmic solution, preferably with performance guarantees for all inputs, so that it is possible to promise delivery times, bounds on expenses, etc. Such an analysis also allows the designer to determine whether it would be beneficial to search for other algorithmic solutions. Assessing the quality of algorithms experimentally suffers from the difficulty of determining which inputs to test on and of providing trustworthy worst-case bounds.

The area of online algorithms has existed for many years and provides analyses giving worst-case guarantees. However, since these guarantees hold for all inputs, even the most extreme and sometimes unrealistic ones, they are very pessimistic and often not suited for choosing good algorithms for typical cases. Thus, in practice, companies often use techniques based on heuristic methods, machine learning, etc. Machine learning, especially, has proven very successful in many applications at providing solutions that are good in practice when presented with typical inputs. However, on inputs not captured by the training data, the algorithm may fail dramatically.

We need to combine the desirable properties of both worlds: the worst-case guarantees from online algorithms and the good behavior on typical inputs observed with, for instance, machine learning. That is, we need algorithms that follow predictions given by a machine learning component, for instance, since that often gives good results, but they should not do so blindly, or the worst-case behavior will generally be even worse than the guarantees provided by standard online algorithms and their analyses. Thus, a controlling algorithmic unit should monitor the given predictions so that safety decisions can overrule them when things are progressing in a worrisome direction.
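To make the idea concrete, here is a minimal, hedged sketch of a classic textbook example from the literature on online algorithms with ML predictions: ski rental with a predicted number of skiing days. It is only an illustration of the consistency/robustness principle, not one of the problems studied in this project; the cost model and the trust parameter lam are assumptions of the example.

```python
import math

def ski_rental_with_prediction(buy_cost, predicted_days, lam, actual_days):
    """Ski rental with an untrusted prediction.

    Renting costs 1 per day, buying costs buy_cost. predicted_days is an ML
    prediction of the number of skiing days; lam in (0, 1] controls how much
    the prediction is trusted (small lam = high trust). Returns the cost paid.
    """
    if predicted_days >= buy_cost:
        buy_day = math.ceil(lam * buy_cost)    # prediction says "buy": buy early
    else:
        buy_day = math.ceil(buy_cost / lam)    # prediction says "rent": buy late
    if actual_days >= buy_day:
        return (buy_day - 1) + buy_cost        # rented until buying on day buy_day
    return actual_days                         # never bought; rented every day
```

With lam close to 0, the algorithm trusts the prediction almost fully and is near-optimal when the prediction is correct (roughly (1 + lam) times optimal), while its cost never exceeds roughly (1 + 1/lam) times optimal even when the prediction is arbitrarily wrong; lam = 1 recovers the classical 2-competitive break-even strategy.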

We also need ways of quantifying the guaranteed quality of our solutions as a function of how closely an input resembles the predicted input (predicted by a machine learning component, for instance). This is a crucial part of risk management. We want reassurance that we do not “fall off the cliff” just because predictions are slightly off. This includes limiting the “damage” possible from adversarial attacks on the machine learning component. As an integral part of a successful approach to this problem, we need measures of an input’s distance from the prediction (the prediction error), defined in such a manner that solution quality can be expressed as a function of the prediction error. For online algorithm applications, such measures often need to be different from standard loss functions in machine learning.

Our main aim is to further the development of generally applicable techniques for utilizing usually good, but untrusted, predictions, while at the same time providing worst-case guarantees, in the realm of online optimization problems. We want to further establish this research topic at Danish universities and subsequently disseminate knowledge of it to industry via joint collaboration. Developments of this nature are, of course, also pursued internationally. Progress is to a large extent made by considering carefully chosen concrete problems, their modeling and properties, extracting general techniques from those studies, and further testing their applicability on new problems.

We are planning to initiate work on online call control and scheduling with precedence constraints. The rationale is that these problems are important in their own right and at the same time represent different types of challenges. Call control focuses on admitting as many requests as possible with limited bandwidth, whereas scheduling focuses on time, handling all requests as effectively as possible.

Call control can be seen as point-to-point requests in a network with limited capacity. The goal is to accept as profitable a collection of requests as possible. Scheduling deals with jobs of different durations that must be executed on some “machine” (not necessarily a computer), respecting constraints stating that some jobs cannot be executed before certain other jobs are completed. In this problem, all jobs must be scheduled on some machine, and the target is to complete all jobs as fast as possible. To fully define these problems, more details are required about the structure of the resources and the precise optimization goals.
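As a hedged baseline illustration of the scheduling problem just described (not the project's method), the following Python sketch implements greedy list scheduling with precedence constraints on m identical machines: whenever a machine becomes idle and some job has all its predecessors completed, a job is started. The job representation, durations plus predecessor lists, is an assumption of the example.

```python
import heapq

def list_schedule(jobs, m):
    """Greedy list scheduling on m identical machines.

    jobs: dict mapping job name -> (duration, [predecessor names]).
    Returns (makespan, start_times).
    """
    finish, start = {}, {}
    unscheduled = set(jobs)
    machines = [0.0] * m                 # next free time of each machine
    heapq.heapify(machines)
    while unscheduled:
        t = heapq.heappop(machines)      # earliest moment some machine is free
        ready = [j for j in unscheduled
                 if all(p in finish and finish[p] <= t for p in jobs[j][1])]
        if not ready:                    # machine must wait for a job to become ready
            t = min(max(finish[p] for p in jobs[j][1])
                    for j in unscheduled
                    if all(p in finish for p in jobs[j][1]))
            ready = [j for j in unscheduled
                     if all(p in finish and finish[p] <= t for p in jobs[j][1])]
        job = sorted(ready)[0]           # any fixed priority rule works here
        start[job] = t
        finish[job] = t + jobs[job][0]
        unscheduled.remove(job)
        heapq.heappush(machines, finish[job])
    return max(finish.values()), start

# Example: four jobs, two machines; "c" may only start after "a" and "b".
jobs = {"a": (3, []), "b": (2, []), "c": (4, ["a", "b"]), "d": (1, ["a"])}
print(list_schedule(jobs, m=2))
```

Graham's classical analysis shows that this simple rule already guarantees a makespan within a factor 2 − 1/m of optimal, which is exactly the kind of worst-case guarantee online analysis provides and that predictions are meant to improve on for typical inputs.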

Some generic insight we would like to gain, and which is sorely lacking in the community currently, is formalizable conditions for good predictions. We want the performance of algorithms to degrade gracefully with prediction errors. This is important for the explainability and trustworthiness of algorithms. Related to this, whereas some predictions may be easy to work with theoretically, it is important to focus on classes of predictions that are learnable in practice. To be useful, this also requires robustness, in the sense that minor, inconsequential changes in the input sequence compared with the prediction should not affect the result dramatically.

We are also interested in giving minor consideration to impossibility results, i.e., proving limits on how good solutions can be obtained. Whereas this is not directly constructive, it can tell us whether we are done or how close we are to an optimal algorithm, so that we do not waste time trying to improve algorithms that cannot be improved, or can only be improved marginally.

The project leads to value creation in a number of different directions.

Research-wise, with the developments in machine learning and related data science disciplines over recent years, the integration and utilization of these techniques in other areas of computer science is of great interest, and Danish research should be at the forefront of these endeavors. We facilitate this by bringing people with expertise in different topics together and consolidating knowledge of the primary techniques across institutions. Educating students in these topics is usually a nice side effect of running such a project. The primary focus, of course, is to educate the PhD student and train the research assistants, but this is accompanied by MS students solving related, well-defined subproblems in their theses during the project period.

We advocate combined techniques that strive towards excellent typical-case performance while providing worst-case guarantees, and believe that they should be adopted by industry to a larger extent. The project will lead to results on concrete problems, but our experience tells us that companies generally need variations of these or new solutions to somewhat different problems. Thus, the most important aspect in this regard is capacity building, so that we can assist with concrete developments for particular company-specific problems. Besides the fact that problems appear in many variations in different companies, a main reason why problem adaptation will often be necessary is that the added value of the combined algorithmic approaches rests on predictions, and it varies greatly what type of data is obtainable and which subset of the data can give useful predictions.

We have prior experience with industry consulting, the industrial PhD program, and co-advising MS students, and we maintain close relationships with local industry. After, and in principle also during, this project, we are open to subsequent joint projects with industry that take their challenges as the starting point, where we then utilize the know-how and experience gained from the current project. Such work could be on a consultancy basis, through joint student projects, or, at a larger scale, with, for instance, the Innovation Foundation as a partner.

Finally, we see it as an advantage of our project that we include researchers who are relatively new to Denmark, so that they get to interact with more people at different institutions and expand their Danish network.

September 1, 2022 – August 31, 2025 – 3 years.

Total budget DKK 3,5 million / DIREC investment DKK 1,5 million

Participants

Project Manager

Kim Skak Larsen

Professor

University of Southern Denmark
Department of Mathematics and Computer Science

E: kslarsen@imada.sdu.dk

Nutan Limaye

Associate Professor

IT University of Copenhagen
Department of Computer Science

Joan Boyar

Professor

University of Southern Denmark
Department of Mathematics and Computer Science

Melih Kandemir

Associate Professor

University of Southern Denmark
Department of Mathematics and Computer Science

Lene Monrad Favholdt

Associate Professor

University of Southern Denmark
Department of Mathematics and Computer Science

Magnus Berg Pedersen

PhD Student

University of Southern Denmark
Department of Mathematics and Computer Science

Tim Poulsen

Student Programmer

IT University of Copenhagen

Partners

Categories
SciTech project

Benefit and Bias of Approximate Nearest Neighbor Search for Machine Learning and Data Mining

Project type: SCITECH Project

Benefit and Bias of Approximate Nearest Neighbor Search for Machine Learning and Data Mining

The search for nearest neighbors is a crucial ingredient in many applications such as density estimation, clustering, classification, and outlier detection. Often, neighborhood search is also the bottleneck in terms of efficiency in these applications. In the age of big data, companies and organizations can usually store billions of individual data points and embed these data points into a high-dimensional vector space. For example, the Danish company Pufin ID uses nearest neighbor search to link chemical labels placed on physical objects to a digital hash code. They require answers in milliseconds for such neighbor searches among nearly a billion high-dimensional vectors. Due to the curse of dimensionality, traditional, exact nearest neighbor search algorithms become the bottleneck of such applications and can take minutes or hours to answer a single query.

To solve such scalability challenges, more and more approximate nearest neighbor (ANN) search methods are employed. Depending on the data structure, the word “approximate” can mean either a strong theoretical guarantee or, more loosely, that results are expected to be inexact. Many applications of ANN-based methods have a profound societal influence through algorithmic decision-making processes. If a user sees a stream of personalized, recommended articles or a “curated” version of their timeline, the need for efficient processing often makes it necessary that these results are based on the selection of approximate nearest neighbors in an intermediate step. Thus, the bias, benefits, or dangers of such a selection process must be studied.

According to standard benchmarks, approximate methods can process queries several orders of magnitude faster than exact approaches if results do not need to be close to exact. A downstream application of nearest neighbor search must take the inexact nature of the results into account. Different paradigms might come with different biases, and some paradigms might be more suitable for a certain use case. For example, recent work suggests that some ANN methods exhibit an “all or nothing” behavior, which can cause the found neighbors to be completely unrelated. This can erode the user’s trust in the application. On the other hand, there is work suggesting that ANN search can improve the results of a downstream application, for example in the context of ensemble learning for outlier detection.
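As a hedged, minimal illustration of what “approximate” means in practice, the sketch below builds a single random-hyperplane LSH table for cosine similarity: a query only inspects the candidates that share its bucket, so it is fast but may miss true neighbors. Real indexes (multi-table LSH, graph-based methods) are considerably more involved; all names and parameters here are illustrative.

```python
import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    """Minimal single-table random-hyperplane LSH index for cosine similarity."""

    def __init__(self, dim, n_planes=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_planes, dim))
        self.buckets = defaultdict(list)   # hash key -> list of vector ids
        self.data = []

    def _key(self, v):
        # the sign pattern of v against the random hyperplanes is the bucket key
        return tuple((self.planes @ v > 0).astype(np.int8))

    def add(self, v):
        v = np.asarray(v, dtype=float)
        self.data.append(v)
        self.buckets[self._key(v)].append(len(self.data) - 1)

    def query(self, q, k=5):
        q = np.asarray(q, dtype=float)
        candidates = self.buckets.get(self._key(q), [])
        # rank only the candidates in the query's bucket by exact cosine similarity
        sims = [(float(self.data[i] @ q) /
                 (np.linalg.norm(self.data[i]) * np.linalg.norm(q) + 1e-12), i)
                for i in candidates]
        return [i for _, i in sorted(sims, reverse=True)[:k]]
```

Increasing n_planes makes buckets smaller and queries faster, but also raises the chance of missing a true near neighbor; this is precisely the kind of speed/bias trade-off whose downstream effects the project studies.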

This project aims to use approximate nearest neighbor search to design highly scalable and robust algorithms for diverse tasks such as clustering, classification, and outlier detection.

Hypothesis
Many applications in machine learning and data mining can be sped up using approximate nearest neighbor search with no or only a negligible loss in result quality for the application, compared to an exact search. Different methods for ANN search come with different biases that can be positive or negative to varying degrees for the downstream application. In this project, the bias of different ANN methods and its impact on different applications will be studied on both a fundamental and an empirical level. We strive to address the following problems:

Theme 1
ANN for Discrimination Discovery and Diversity Maximization. Discrimination discovery and diversity maximization are central elements in the area of algorithmic fairness. Traditional approaches include k-NN classifiers, which scale poorly to high-dimensional data. On the other hand, diversity maximization usually involves a diversification of nearest neighbor search results.

Goals: Study the effect of ANN results on the quality of the k-NN classifier for discrimination discovery. Theoretically develop LSH-based diversity maximization methods that build the diversification into the LSH, and empirically evaluate them against other known approaches.

Theme 2
ANN for Outlier Detection. How do different ANN paradigms influence the quality of an outlier detection (OD) algorithm? Can outlier classification be “built into” an ANN algorithm to further scale up the performance?

Goals: Develop a theoretically sound LSH-based outlier detection algorithm with provable guarantees; empirically compare the performance of different ANN-based OD classifiers; design and evaluate the performance of using different classifiers in an ensemble.

Theme 3
ANN for Clustering. Density-based and traditional clustering approaches rely on a nearest neighbor search or a range search to cluster data points. What is the effect of finding approximate neighbors? How well can we adapt different ANN paradigms to support range search operations? Related work uses LSH as a black box: can we use the LSH bucket structure to directly implement DBSCAN?

Goals: Extend graph-based ANN algorithms to support range-search primitives. Implement DBSCAN-based variants and evaluate their performance and quality.

Theme 4
ANN to Speed-Up Machine Learning Training. Many training tasks in machine learning are costly. However, steps such as backpropagation boil down to a maximum inner product search (MIPS), for which we know that ANN methods provide efficient approximate solutions. In this task, we will study whether we can achieve comparable or better performance using ANN in the backpropagation step. Will the bias hurt the classification results or improve robustness?

Goals: Develop and evaluate neural network training using different ANN-based approaches to MIPS.
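For context on Theme 4: one standard way to let ANN indexes answer maximum inner product queries is an asymmetric transform that appends a single extra coordinate, after which inner products become cosine similarities. The sketch below is a hedged illustration of that reduction, not necessarily the approach the project will take.

```python
import numpy as np

def mips_to_cosine(X, q):
    """Reduce maximum inner product search to cosine/Euclidean ANN.

    X: (n, d) database matrix, q: (d,) query. Returns (X_aug, q_aug) such that
    argmax_i <q, X[i]> equals argmax_i cos(q_aug, X_aug[i]).
    """
    X, q = np.asarray(X, dtype=float), np.asarray(q, dtype=float)
    M = np.linalg.norm(X, axis=1).max()        # scale rows into the unit ball
    Xs = X / M
    extra = np.sqrt(np.maximum(0.0, 1.0 - (Xs * Xs).sum(axis=1)))
    X_aug = np.hstack([Xs, extra[:, None]])    # every augmented row has unit norm
    q_aug = np.concatenate([q / np.linalg.norm(q), [0.0]])
    return X_aug, q_aug
```

After the transform, any cosine or Euclidean ANN index (such as the LSH sketch above) can retrieve approximate maximum-inner-product candidates.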

Risk Management
The research themes mentioned above can mostly be carried out independently of each other. The actual downstream application is relatively flexible, which lowers the risk of the project failing. The hiring process will make sure that the prospective PhD student has both a theoretical understanding of algorithms and data mining and practical experience with programming. As a fallback, if theoretical results turn out not to be within reach, the empirical results will improve the state of the art and imply strong results for venues with an empirical focus, yield demonstrations at such venues, and result in open-source software to make the methods available to a broad audience.

Scientific value
The scientific value of the project is a fundamental understanding of the influence of approximate nearest neighbor search on applications in machine learning and data mining, such as outlier detection, clustering and algorithmic decision making. Through this project, we will propose new algorithmic methods and provide efficient implementations to solve important machine learning and data mining tasks. In the spirit of open science and to maximize the impact of the scientific results, all software resulting from the project will be made available open source. As a long-term goal, our results will show that, when handled with care in the design and rigor in the analysis, approximate methods allow the design of scalable algorithms that do not necessarily lose in quality. Of course, this might be true not only for the areas covered in this project, but also for many others where exact solutions are computationally out of reach.

Capacity building
In terms of capacity building, the value of the project is to educate a PhD student. Such a student will be able to work on both a theoretical and an applied level. She will also be trained in critical thinking on algorithmic decision making, which is a highly valuable skill for society. In addition, the project will offer several affiliated student projects at the Bachelor’s and Master’s level, and the availability of the research results will make it easy for others to build upon the work. The long-term goal of this project is to attract the interest of companies to use these methods and develop them further, aiming for follow-up projects with industry partners on a larger scale.

Societal value
The rise of vector embedding methods for text, images, and video has had a deep impact on society. Many of its applications, such as personalized recommendations or curated news feeds, are taken for granted, but are only made possible through efficient search methods. Thus, ANN-based methods have allowed us to design algorithmic decision-making processes with profound influence on our everyday life. If a user sees a stream of personalized, recommended articles or a “curated” version of their social media feed, it is very likely that these results are based on the selection of approximate nearest neighbors in an intermediate step. The bias, benefits, and dangers of such a selection process must be studied carefully. Moreover, a successful application of approximate techniques has the potential to liberate the use of methods such as deep learning by lowering the entry cost in terms of hardware. This is, for example, showcased by the recently founded start-up ThirdAI.

August 2022 – December 31, 2025 – 3,5 years.

Total budget DKK 3,5 million / DIREC investment DKK 1,77 million

Participants

Project Manager

Martin Aumüller

Associate Professor

IT University of Copenhagen
Department of Computer Science

E: maau@itu.dk

Project Manager

Arthur Zimek

Professor

University of Southern Denmark
Department of Mathematics and Computer Science

E: zimek@imada.sdu.dk

Camilla Okkels

PhD Student

IT University of Copenhagen
Department of Computer Science

Victor Bello Thomsen

Research Assistant

IT University of Copenhagen
Department of Computer Science

Partners

Categories
Bridge project

Low-Code Programming of Spatial Contexts for Logistic Tasks in Mobile Robotics

DIREC project

Low-code programming of spatial contexts for logistic tasks in mobile robotics

Summary

Low-volume production represents a large share of the Danish manufacturing industry. An unmet need in this industry is flexibility and adaptability of manufacturing processes. Existing solutions for automating industrial logistics tasks include combinations of automated storage, conveyor belts, and mobile robots with special loading and unloading docks. 

However, these solutions require major investments and are not cost efficient for low-volume production, and today, low-volume production is often labor intensive.

Together with industrial partners, this project will investigate production scenarios where a machine can be operated by untrained personnel by using low-code development for adaptive and re-configurable robot programming of logistic tasks.

Project period: 2022-2025
Budget: DKK 7,15 million

An unmet need in industry is flexibility and adaptability of manufacturing processes in low-volume production. Low-volume production represents a large share of the Danish manufacturing industry. Existing solutions for automating industrial logistics tasks include combinations of automated storage, conveyor belts, and mobile robots with special loading and unloading docks. However, these solutions require major investments and are not cost efficient for low-volume production.

Therefore, low-volume production is today labor intensive, as automation technology and software are not yet cost-effective for such production scenarios, where machines should be operable by untrained personnel. The need for flexibility, ease of programming, and fast adaptability of manufacturing processes is recognized in both Europe and the USA. EuRobotics highlights the need for systems that can be easily re-programmed without the use of skilled system configuration personnel. Furthermore, the American roadmap for robotics highlights adaptable and reconfigurable assembly and manipulation as an important capability for manufacturing.

The company Enabled Robotics (ER) aims to provide easy programming as an integral part of their products. Their mobile manipulator ER-FLEX consists of a robot arm and a mobile platform. The ER-FLEX mobile collaborative robot provides an opportunity to automate logistic tasks in low-volume production. This includes manipulation of objects in production in a less invasive and more cost-efficient way, reusing existing machinery and traditional storage racks. However, this setting also challenges the robots due to the variability in rack locations, shelf locations, box types, object types, and drop off points.

Today, the ER-FLEX can be programmed by means of block-based features, which can be configured into high-level robot behaviors. While this approach offers an easier programming experience, the operator must still have a good knowledge of robotics and programming to define the desired behavior. To make the product accessible to a wider audience of users in low-volume production companies, robot behavior has to be definable in a simpler and more intuitive manner. In addition, a solution is needed that addresses the variability of the 3D spatial context in a time-efficient and adaptive way.
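Purely as a hypothetical illustration of what such simpler, declarative task definitions could look like, the sketch below describes a pick-and-place logistics task as data and interprets it against an assumed robot API. The task vocabulary, locations, and robot methods are invented for this example and do not reflect Enabled Robotics' actual interface.

```python
class SimulatedRobot:
    """Stand-in for a mobile manipulator API (hypothetical, for illustration only)."""
    def __init__(self):
        self.placed = 0
    def condition(self, name):            # e.g. "cnc_01.tray_full"
        return self.placed >= 3
    def navigate(self, location):
        print("navigate to", location)
    def pick(self, obj):
        print("pick", obj)
    def place(self):
        self.placed += 1
        print("place")

# A purely hypothetical, declarative "low-code" task description.
feed_machine = {
    "pick": {"location": "rack_A/shelf_2", "object": "box_small"},
    "place": {"location": "cnc_01/loading_tray"},
    "repeat_until": "cnc_01.tray_full",
    "on_error": "charging_dock",
}

def run(task, robot):
    """Interpret the declarative task on the (assumed) robot API."""
    while not robot.condition(task["repeat_until"]):
        try:
            robot.navigate(task["pick"]["location"])
            robot.pick(task["pick"]["object"])
            robot.navigate(task["place"]["location"])
            robot.place()
        except RuntimeError:              # e.g. a failed grasp
            robot.navigate(task["on_error"])
            break

run(feed_machine, SimulatedRobot())
```

The point of such a representation is that the spatial context (racks, shelves, drop-off points) is referenced by name rather than programmed explicitly, which is where the project's work on mapping high-level instructions to the physical world comes in.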

Low-code software development is an emerging research topic in software engineering. Research in this area has investigated the development of software platforms that allow non-technical people to develop fully functional application software without having to use a general-purpose programming language. The scope of most low-code development platforms, however, has been limited to creating software-only solutions for business-process automation of low-to-moderate complexity.

Programming of robot tasks still relies on dedicated personnel with special training. In recent years, the emergence of digital twins, block-based programming languages, and collaborative robots that can be programmed by demonstration has led to breakthroughs in this field. However, existing solutions still lack the ability to address variability when programming logistics and manipulation tasks in an ever-changing environment.

Current low-code development platforms do not support robotic systems. The extensive use of hardware components and sensor data in robotics makes it challenging to translate low-level manipulations into a high-level language that is understandable to non-programmers. In this project, we will tackle this by constraining the problem to the spatial dimension and by using machine learning for adaptability. Therefore, the first research question we want to investigate in this project is whether and how the low-code development paradigm can support robot programming of spatial logistic tasks in indoor environments. The second research question addresses how to apply ML-based methods for mapping between high-level instructions and the physical world to derive and execute new task-specific robot manipulation and logistic actions.

Therefore, the overall aim of this project is to investigate the use of low-code development for adaptive and re-configurable robot programming of logistic tasks. Through a case study proposed by ER, the project builds on SDU’s previous work on domain-specific languages (DSLs) to propose a solution for high-level programming of the 3D spatial context in natural language and work on using machine learning for adaptable programming of robotic skills. RUC will participate in the project with interaction competences to optimize the usability of the approach.

Our research methodology is oriented towards design science, which provides a concrete framework for dynamic validation in an industrial setting. For the problem investigation, we are planning a systematic literature review of existing solutions to the issues of 3D space mapping and variability of logistic tasks. For the design and implementation, we will first address the requirement of building a spatial representation of the task conditions and the environment using external sensors, which will give us a map for deploying the ER platform. Furthermore, to minimize the input that users need to provide to link the programming parameters to the physical world, we will investigate and apply sensor-based user interface technologies and machine learning. The designed solutions will be combined into a low-code development platform that allows for high-level robot programming.

Finally, for validation, the resulting low-code development platform will be tested on logistics and manipulation tasks with the industry partner Enabled Robotics, both in a mockup test setup established in the SDU I4.0 lab and at a customer site, with increasing difficulty in terms of variability.

Value creation

Making it easier to program robotic solutions enables both new users of the technology and new use cases. This contributes to DIREC’s long-term goal of building up research capacity, as this project focuses on building the competences necessary to address challenges within software engineering, cyber-physical systems (robotics), interaction design, and machine learning.

Scientific value
The project’s scientific value is to develop new methods and techniques for low-code programming of robotic systems with novel user interface technologies and machine learning approaches to address variability. This addresses the lack of approaches for low-code development of robotic skills for logistic tasks. We expect to publish at least four high-quality research articles and to demonstrate the potential of the developed technologies in concrete real-world applications.

Capacity building
The project will build and strengthen the research capacity in Denmark directly through the education of one PhD candidate, and through the collaboration between researchers, domain experts, and end-users that will lead to R&D growth in the industrial sector. In particular, it will build research competences at the intersection of software engineering and robotics to support the digital foundation for this sector.

Societal and business value
The project will create societal and business value by providing new solutions for programming robotic systems. A 2020 market report predicts that the market for autonomous mobile robots will grow from 310M DKK in 2021 to 3,327M DKK in 2024, with inquiries from segments such as semiconductor manufacturers, automotive, automotive suppliers, pharma, and manufacturing in general. ER wants to tap into these market opportunities by providing an efficient and flexible solution for internal logistics. ER would like to position its solution with benefits such as making logistics smoother and programmable by a wide customer base while alleviating problems with labor shortages. This project enables ER to improve their product with regard to key parameters. The project will provide significant societal value and directly contribute to SDG 9 (build resilient infrastructure, promote inclusive and sustainable industrialization, and foster innovation).

Impact

The project will provide a strong contribution to the digital foundation for robotics based on software competences and support Denmark being a digital frontrunner in this area.

Participants

Project Manager

Thiago Rocha Silva

Associate Professor

University of Southern Denmark
Maersk Mc-Kinney Moller Institute

E: trsi@mmmi.sdu.dk

Aljaz Kramberger

Associate Professor

University of Southern Denmark
Maersk Mc-Kinney Moller Institute

Mikkel Baun Kjærgaard

Professor

University of Southern Denmark
Maersk Mc-Kinney Moller Institute

Mads Hobye

Associate Professor

Roskilde University
Department of People and Technology

Lars Peter Ellekilde

Chief Executive Officer

Enabled Robotics ApS

Anahide Silahli

PhD

University of Southern Denmark
Maersk Mc-Kinney Moller Institute

Partners

Categories
Bridge project

Trust through Software Independence and Program Verification

DIREC project

Trust through software independence and program verification

Summary

There is constant interest in Internet Voting among election commissions around the world. This is well illustrated by Greenland: its election law was changed in 2020 and now permits the use of Internet Voting. However, building an Internet Voting system is not easy: the design of new cryptographic protocols is error-prone, and public trust in the elected body is easily threatened.

A software-independent voting protocol is one where an undetected change or error in the software cannot cause an undetectable change or error in an election outcome. Program verification techniques have come a long way and promise to improve the reliability and cybersecurity of election technologies, but it is by no means clear whether formally verified, software-independent voting systems also increase public confidence in elections.

Together with the authorities in Greenland, this project will investigate the effects of program verification on public trust in election technologies. The project aims to contribute to making internet elections more credible, which can strengthen developing and post-conflict democracies around the world.

Project period: 2023-2026
Budget: DKK 4,6 million

Four considerations explain the unmet needs addressed by this project.

  1. Voting protocols have become increasingly popular and will be more widely deployed in the future as a result of an ongoing digitalization effort of democratic processes.
  2. Elections are based on trust, which means that election systems ideally should be based on algorithms and data structures that are trusted.
  3. Program verification techniques are believed to strengthen this trust.
  4. Greenland’s laws were recently changed to allow for Internet Voting.

The integrity of an election result is best captured through software-independence in the sense of Rivest and Wack’s definition: “A voting system is software-independent if an undetected change or error in its software cannot cause an undetectable change or error in an election outcome.” Software independence is widely considered a precondition for trust. The assumption that program verification increases trust arises from the fact that those doing the verification become convinced that the system implements its specification. However, the question is whether these arguments also convince others, not involved in the verification process, that the verified system can be trusted, and if not, under which additional assumptions they will trust it.

Thus, the topic of this project is to study the effects of program verification on public trust in the context of election technologies. The project is therefore structured into two parts: first, can we formally verify software-independence using modern program verification techniques, and second, is software-independence sufficient to generate trust?

The research project aims to shed more light on the overall research question of whether formal verification of software-independence can strengthen public confidence. Answering this question in the affirmative would lead to a novel understanding of what it means for voting protocols to be trustworthy and of how to increase public confidence in Internet Voting, which may be useful for countries that lack trust in the security of paper records.

(RO1) Explore the requirement of software-independence in the context of formal verification of existing Internet voting protocols.

(RO2) Study the public confidence in Greenland with respect to software-independence and formally verified Internet Voting protocols and systems.

Software Independence

In order to achieve (RO1), we will consider two theories of what constitutes software-independence. The game-theoretic view, similar to proofs by reduction and simulation in cryptography, reduces the software-independence of one protocol to that of another. The statistical view gives precise bounds on the likelihood that the election technology produces an incorrect result. We plan to understand how to formally capture the requirement of software-independence by selecting existing or newly developed voting protocols and generating formally verified implementations. For all voting protocols that we design within this project, we will use proof assistants to derive mechanized proofs of software independence.
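The statistical view is closely related to post-election auditing. As a hedged illustration (a standard technique from that literature, not necessarily one this project will use), the sketch below implements the core of the BRAVO ballot-polling risk-limiting audit for a two-candidate contest; it confirms a reported outcome only if the chance of doing so for a wrong outcome is below the chosen risk limit.

```python
def bravo_audit(sample, reported_winner_share, risk_limit=0.05):
    """Core of the BRAVO ballot-polling risk-limiting audit.

    sample: iterable of booleans, True if a sampled ballot shows the reported
    winner. reported_winner_share: the winner's reported share (> 0.5).
    Returns True if the sample confirms the outcome at the risk limit,
    False if the audit is inconclusive for this sample (e.g. escalate).
    """
    assert reported_winner_share > 0.5
    t = 1.0                                   # Wald sequential likelihood ratio
    for ballot_for_winner in sample:
        if ballot_for_winner:
            t *= 2.0 * reported_winner_share
        else:
            t *= 2.0 * (1.0 - reported_winner_share)
        if t >= 1.0 / risk_limit:
            return True                       # reported outcome confirmed
    return False                              # not confirmed by this sample
```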

User Studies

To achieve (RO2), we will, together with the Domestic Affairs Division, Government of Greenland, study the effects of formal verification of software independence on public confidence. The core hypothesis of these studies is that strategic communication of concepts, such as software independence, can be applied in such a way that it strengthens public confidence. We will invite Greenland voters to participate in pilot demonstrations and user studies and will evaluate answers qualitatively and quantitatively.

Scientific value
Internet voting provides a unique collection of challenges, such as election integrity, vote privacy, receipt-freeness, coercion resistance, and dispute resolution. Here we focus on election integrity and aim to show that formally verifying the property of software-independence of a voting system would increase the public’s confidence in the accuracy of the election result.

Capacity building
The proposed project pursues two kinds of capacity building. First, by training the PhD student and university students affiliated with the project, making Denmark a leading place for secure Internet voting. Second, if successful, the results of the project will contribute to the Greenland voting project and to international capacity building in the sense that they will strengthen democratic institutions.

Societal value
Some nations are rethinking their electoral processes and the ways they hold elections. Since the start of the Covid-19 pandemic, approximately a third of all nations scheduled to hold a national election have postponed it. It is therefore not surprising that countries are exploring Internet Voting as an additional voting channel. The results of this project would contribute to making Internet elections more credible, and therefore strengthen developing and post-conflict democracies around the world.


Participants

Project Manager

Carsten Schürmann

Professor

IT University of Copenhagen
Department of Computer Science

E: carsten@itu.dk

Klaus Georg Hansen

Founder

KGH Productions

Markus Krabbe Larsen

PhD Student

IT University of Copenhagen
Department of Computer Science

Bas Spitters

Associate Professor

Aarhus University
Department of Computer Science

Oksana Kulyk

Associate Professor

IT University of Copenhagen

Philip Stark

Professor

University of California, Berkeley

Peter Ryan

Professor, Dr.

University of Luxembourg

Partners

Categories
Bridge project

Multimodal Data Processing of Earth Observation Data

DIREC project

Multimodal data processing of Earth Observation Data

Summary

Based on Earth observations, a number of Danish public organizations build and maintain important data foundations that are used for decision-making, e.g., for executing environmental law or making planning decisions in both private and public organizations in Denmark.  

Together with some of these public organizations, this project aims to support the digital acceleration of the green transition by strengthening the data foundation for environmental data. There is a need for public organizations to utilize new data sources and create a scalable data warehouse for Earth observation data. This will involve building processing pipelines for multimodal data processing and designing user-oriented data hubs and analytics. 

 

Project period: 2022-2025
Budget: DKK 12,27 million

The Danish partnership for digitalization has concluded that there is a need to support the digital acceleration of the green transition. This includes strengthening efforts to establish a stronger data foundation for environmental data. Based on observations of the Earth, a range of Danish public organizations build and maintain important data foundations. Such foundations are used for decision-making, e.g., for executing environmental law or making planning decisions in both private and public organizations in Denmark.

The increasing possibilities of automated data collection and processing can decrease the cost of creating and maintaining such data foundations and provide service improvements in the form of more accurate and richer information. To realize such benefits, public organizations need to be able to utilize the new data sources that become available, e.g., to automate manual data curation tasks and increase the accuracy and richness of data. However, the organizations are challenged by the limited ability of available methods to efficiently combine the different sources of data for their use cases. This is particularly the case when user-facing tools must be constructed on top of the data foundation. The availability of better data will, among other things, help end-users decrease the cost of executing environmental law and making planning decisions. In addition, the ability of public data sources to provide more value to end-users improves the societal return on investment for publishing these data, which is in the interest of the public data providers as well as their end-users and society at large.

The Danish Environmental Protection Agency (EPA) has the option to receive data from many data sources but does not utilize this today, because the lack of infrastructure makes it cost-prohibitive to take advantage of the data. The EPA therefore expresses a need for methods to enable a data hub that provides data products combining satellite, orthophoto, and IoT data. The Danish Geodata Agency (GDA) collects very large quantities of Automatic Identification System (AIS) data from ships sailing in Denmark. However, they use this data only to a very limited degree today. The GDA needs methods to enable a data hub that combines multiple sources of ship-based data, including AIS data, ocean observation data (sea level and sea temperature), and meteorological data. There is a need for analytics on top that can provide services for estimating travel time at sea or finding the most fuel-efficient routes. This includes estimating the potential for lowering CO2 emissions at sea by following efficient routes.

Geo supports professional users in performing analyses of subsurface conditions based on their own extensive data, gathered from tens of thousands of geotechnical and environmental drilling operations, and on public sources. They deliver a professional software tool that presents this multimodal data in novel ways and are actively working on creating an educational platform giving high school students access to the same data. Geo has an interest in and need for methods for adding live, multimodal data to their platform, to support both professional decision-makers and students. Furthermore, they need novel ways of querying and representing such data, to make it accessible to professionals and students alike. Creating a testbed that combines Geo’s data with satellite feeds, together with automated processing to interpret this data, will create new synergies and has the potential to greatly improve visualizations of the subsurface by building detailed regional and national 3D voxel models.

Therefore, the key challenges that this project will address are how to construct scalable data warehouses for Earth observation data, how to design systems for combining and enriching multimodal data at scale, and how to design user-oriented data interfaces and analytics to support domain experts, thereby helping the organizations produce better data for the benefit of the green transition of Danish society.

The aim of the project is to do use-inspired basic research on methods for multimodal processing of Earth observation data. The research will cover the areas of advanced and efficient big data management, software engineering, the Internet of Things, and machine learning. The project will conduct research in these areas in the context of three domain cases: with GDA on sea data and with EPA/GEO on environmental data.

Scalable data warehousing is the key challenge that the work within advanced and efficient big data management will address. The primary research question is how to build a data warehouse with billions of rows of all relevant domain data. AIS data from GDA will be studied, and in addition to storage, data cleaning will also be addressed. On top of the data warehouse, machine learning algorithms must be enabled to compute the fastest and most fuel-efficient route between two arbitrary destinations.
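As a hedged baseline illustration of such a routing service (an assumption of this example, not the project's method), travel times can be estimated by turning historical AIS speeds into a grid graph and running a shortest-path search; learned models could then refine the edge weights, for instance by conditioning on weather or vessel type.

```python
import heapq

def fastest_route(grid_speed, start, goal):
    """Minimum travel time on a grid of historical average speeds (Dijkstra).

    grid_speed[i][j]: average speed observed in AIS data for cell (i, j),
    or 0 for land. Edge weight is cell size / speed (unit cell size assumed).
    start, goal: (row, column) tuples. Returns the travel time, or inf.
    """
    n, m = len(grid_speed), len(grid_speed[0])
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        t, (i, j) = heapq.heappop(queue)
        if (i, j) == goal:
            return t
        if t > best.get((i, j), float("inf")):
            continue                            # stale queue entry
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < m and grid_speed[ni][nj] > 0:
                nt = t + 1.0 / grid_speed[ni][nj]
                if nt < best.get((ni, nj), float("inf")):
                    best[(ni, nj)] = nt
                    heapq.heappush(queue, (nt, (ni, nj)))
    return float("inf")
```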

Processing pipelines for multimodal data processing are the key topic for the work within software engineering, the Internet of Things, and machine learning. The primary research question is how to engineer data processing pipelines that allow for enriching data through processes of transformation and combination. In the EPA case, there is a need for enriching data by combining data sources, both across sources (e.g., satellite and drone) and across modalities (e.g., the NDVI index for quantifying vegetation greenness is a function of a red and a near-infrared band). Furthermore, we will research methods for easing the process of bringing disparate data into a form that can be inspected both by a human and by an AI user. For example, data sources are automatically cropped to a polygon representing a given area of interest (such as a city, municipality, or country), normalized for comparability, and subjected to data augmentation in order to improve machine learning performance. We will leverage existing knowledge on graph databases. We aim to facilitate the combination of satellite data with other sources such as sensor recordings at specific geo-locations. This allows for advanced data analysis of a wide variety of phenomena, such as detection and quantification of objects and changes over time, which in turn allows for prediction of future occurrences.
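For concreteness, NDVI is computed per pixel from the near-infrared and red reflectance bands; a minimal sketch of this kind of enrichment step (the array names are assumptions of the example):

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index, NDVI = (NIR - Red) / (NIR + Red).

    nir, red: arrays of reflectance values for the same raster window.
    Returns values in [-1, 1]; higher values indicate greener vegetation.
    """
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + 1e-12)   # epsilon avoids division by zero
```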

User-oriented data hubs and analytics is a cross-cutting topic with the aim of designing interfaces and user-oriented analytics on top of the data warehouses and processing pipelines. In the EPA case, the focus is on developing a Danish data hub with Earth observation data. The solution must provide a uniform interface for working with the data, offering a user-centric view of the data representation. This will then enable decision-support systems, to be worked on in the GEO case, that may be augmented by artificial intelligence and made understandable to human users through explorative, graph-based user interfaces and data visualizations. For the GDA case, the focus is on a web frontend for querying AIS data as trajectories and heat maps and for estimating the travel time between two points in Danish waters. As part of the validation, the data warehouse and related services will be deployed at GDA and serve as the foundation for future GDA services.

Advancing the means to process, store, and use Earth observation data has many potential domain applications. To build world-class computer science research and innovation centres, as per the long-term goal of DIREC, this project focuses on building the competencies necessary to address challenges with Earth observation data, building on advances in advanced and efficient big data management, software engineering, the Internet of Things, and machine learning.

Scientific value
The project’s scientific value is the development of new methods and techniques for scalable data warehousing, processing pipelines for multimodal data and user-oriented data hubs and analytics. We expect to publish at least seven rank A research articles and to demonstrate the potential of the developed technologies in concrete real-world applications.

Capacity building
The project will build and strengthen the research capacity in Denmark directly through the education of two PhDs, and through the collaboration between researchers, domain experts, and end-users that will lead to R&D growth in the public and industrial sectors. Research competences addressing a stronger digital foundation for the green transformation are important for Danish society and the associated industrial sectors.

Societal and business value
The project will create societal and business value by providing the foundation for the Blue Denmark to reduce environmental and climate impact in Danish and Greenlandic waters and thereby help support the green transformation. With ever-increasing human activity at sea, growing transportation of goods (90% of which is transported by shipping), and the goal of a carbon-neutral European economy, there is a need for activating marine data to support this transformation. For the environmental protection sector, the project will provide the foundation for efforts to increase biodiversity in Denmark through better protection of fauna types and data-supported execution of environmental law. The project will provide significant societal value and directly contribute to SDGs 13 (climate action), 14 (life below water), and 15 (life on land).

In conclusion, the project will provide a strong contribution to the digital foundation for the green transition and support Denmark being a digital frontrunner in this area.

Impact

The project will provide the foundation for the Blue Denmark to reduce environmental and climate impact in Danish and Greenlandic waters to help support the green transformation.  


Participants

Project Manager

Kristian Torp

Professor

Aalborg University
Department of Computer Science
E: torp@cs.aau.dk

Christian S. Jensen

Professor

Aalborg University
Department of Computer Science

Thiago Rocha Silva

Associate Professor

University of Southern Denmark
Maersk Mc-Kinney Moller Institute

Mads Darø Kristensen

Principal Application Architect

The Alexandra Institute

Jakob Winge

Senior Software Developer

The Alexandra Institute

David Anthony Parham

Visual Computing Engineer

The Alexandra Institute

Søren Krogh Sørensen

Software Developer

The Alexandra Institute

Oliver Hjermitslev

Visual Computing Specialist

The Alexandra Institute

Mads Robenhagen Mølgaard

Department Director

GEO
Geodata & Subsurface Models

Ove Andersen

Special Consultant

Danish Geodata Agency

Mikael Vind Mikkelsen

Research Assistant

Aalborg University
Department of Computer Science

Tianyi Li

Assistant Professor

Aalborg University
Department of Computer Science

Partners

Categories
PhD school, Previous events

Summer School on Missing Data, Augmentation and Generative Models

phd summer school

Missing Data, Augmentation and Generative Models

This summer school will introduce the state-of-the-art for handling too little or missing data in image processing tasks. The topics include data augmentation, density estimation, and generative models.

Missing data is a common problem in image processing and in AI-based methods in general. The source can be, for example, occlusions in 3D computer vision problems, poorly dyed tissue in biological applications, or missing data points in long-term observations; or perhaps there is just too little annotated data for a deep-learning model to properly converge.

At this PhD summer school, you will learn some of the modern approaches to handling the above-mentioned problems in a manner compatible with modern machine learning methodology.

The course will include project work, where the participants carry out a small programming project relating their research to the summer school’s topics.

The summer school is the fifteenth summer school jointly organized by DIKU, DTU, and AAU. DIREC is a co-sponsor of the PhD school.

Photo from the summer school in 2022

Categories
News

Digitalisation can definitely boost the green transition

13 JULY 2022

Digitalisation can definitely boost the green transition​

Artificial intelligence and algorithms can help calculate how we can best heat our homes, produce efficiently, transport with the least possible energy consumption, and make optimal use of IT infrastructure as part of the green transition. But this requires that we dare to delegate more tasks to algorithms and invest more in research and development.