- Query Optimization
- Data Vocalization
- Fact Checking
- Quantum Computing
- Text Mining and Machine Learning
Declarative query languages such as SQL allow users to simply describe the data they need instead of specifying how to generate it. Such languages rely on query optimizer components that translate declarative queries into executable query plans. The goal of query optimization is generally to find an optimal query plan for a given query (e.g., a plan with minimal execution time). The query optimization problem was introduced in the 1970s and is perhaps the optimization problem in the database community on which the most work has been published. However, the context of query optimization is continuously changing, which leads to novel challenges. For instance, modern execution engines have many more tuning parameters than before and motivate us to compare alternative query plans according to multiple cost metrics instead of only one. Both make optimization more challenging. On the other hand, novel optimization platforms are nowadays available that can be exploited for query optimization.
In my dissertation, I address several extended variants of query optimization for which no practical algorithms were previously available. I also show how to optimize queries in the traditional query optimization setting that are far too large for prior methods. I propose a variety of approaches for making query optimization more efficient; they can be categorized into three high-level ideas, as illustrated in the overview figure above. First, if queries correspond to query templates that are known in advance, we can make optimization a pre-processing step. Even though optimization may take a long time, it does not add any run-time overhead. Second, we can speed up optimization by relaxing optimality guarantees. We have published approximation schemes, incremental algorithms, and randomized algorithms for novel query optimization variants. Finally, we can decrease optimization time by leveraging novel software and hardware platforms for optimization. We have recently shown how to leverage integer programming solvers, massively parallel computation clusters, and quantum annealers for query optimization variants. In many scenarios, our approaches reduce optimization times from hours or even years to minutes or even seconds.
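To give a flavor of the second idea (trading optimality guarantees for speed), the following is a minimal, self-contained sketch of randomized query optimization: simulated annealing over left-deep join orders. The cost model (fixed cardinalities, a uniform join selectivity, cost as the sum of intermediate result sizes) is a textbook simplification invented for illustration, not the model used in the dissertation.

```python
import math
import random

# Hypothetical inputs: relation cardinalities and a uniform join selectivity.
CARDS = {"R": 1000, "S": 200, "T": 5000, "U": 50}
SELECTIVITY = 0.01

def plan_cost(order):
    """Cost of a left-deep join plan = sum of intermediate result sizes."""
    size = CARDS[order[0]]
    cost = 0.0
    for rel in order[1:]:
        size = size * CARDS[rel] * SELECTIVITY
        cost += size
    return cost

def anneal(relations, steps=2000, temp=1e6, cooling=0.995, seed=0):
    """Randomized optimization: simulated annealing over join orders."""
    rng = random.Random(seed)
    order = list(relations)
    best = cur = plan_cost(order)
    best_order = list(order)
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]       # propose a swap
        new = plan_cost(order)
        if new < cur or rng.random() < math.exp((cur - new) / temp):
            cur = new                                  # accept the move
            if new < best:
                best, best_order = new, list(order)
        else:
            order[i], order[j] = order[j], order[i]    # undo the swap
        temp *= cooling
    return best_order, best

order, cost = anneal(["R", "S", "T", "U"])
```

The accept-worse-moves step is what distinguishes this from greedy hill climbing: it lets the search escape local minima, at the price of giving up any optimality guarantee.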
In this project, we study the question of how to vocalize data, i.e., how to translate data into optimal voice output. Voice output has to be extremely efficient. While users can quickly identify relevant parts in a plot or written text themselves (via skimming), they have to trust the computer to select only the most relevant information for voice output. Voice output transmits information rather slowly (compared to reading text); it therefore has to be short (so as not to exceed the user's attention span) and at the same time simply structured (to limit the cognitive load on the listener). We treat voice output generation as a global optimization problem in which various metrics (e.g., speaking time, precision of transmitted information, cognitive load on the listener) come into play.
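As a toy illustration of such a multi-metric objective, the sketch below scores candidate voice outputs by a weighted combination of speaking time, information precision, and cognitive load. All metrics, proxies, and weights here are hypothetical placeholders, not the actual cost model of the project.

```python
# Hypothetical proxies: words-per-second for speaking time, clause count
# for cognitive load; precision is assumed given per candidate.

def speaking_time(text, words_per_second=2.5):
    return len(text.split()) / words_per_second

def cognitive_load(text):
    # Crude proxy: more clauses -> higher load on the listener.
    return text.count(",") + text.count(";") + 1

def score(candidate, w_time=1.0, w_precision=5.0, w_load=0.5):
    text, precision = candidate
    # Lower is better: penalize long, complex output; reward precision.
    return (w_time * speaking_time(text)
            + w_load * cognitive_load(text)
            - w_precision * precision)

def best_output(candidates, **weights):
    return min(candidates, key=lambda c: score(c, **weights))

candidates = [
    ("Average price is 23 dollars.", 0.7),
    ("Prices range from 5 to 80 dollars, averaging 23.40 dollars, "
     "with most items between 15 and 30 dollars.", 0.95),
]
choice = best_output(candidates)
```

With these weights, the short summary wins despite its lower precision; shifting weight toward precision would flip the choice, which is exactly the trade-off the global optimization problem has to resolve.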
We are currently studying data vocalization in different scenarios, characterized by data type, size, and user context, among other criteria. We are also studying how to optimally support voice output via data processing, i.e., how to avoid generating results at a level of detail that cannot be transmitted via speech. The results of this research project are integrated into CiceroDB, a prototypical database system that is designed from the ground up for efficient and effective voice output.
We are developing approaches for automated and semi-automated fact checking of data summaries to answer that question. A text document, together with an associated data set, forms the input for fact checking. Our goal is to identify erroneous claims about the data in the input text. More precisely, we focus on text passages that can be translated into a pair of an SQL query and a claimed query result. A claim is erroneous if evaluating the query yields a result that cannot be rounded to the one claimed in the text.
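The rounding criterion above can be sketched in a few lines: a claim passes if the actual query result, rounded at the precision implied by how the claim was written, equals the claimed number. The choice of half-up rounding (rather than, say, banker's rounding) is an assumption made here for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

def consistent(actual, claimed_text):
    """A claim is not erroneous if the actual query result rounds to the
    claimed number at the precision implied by how it was written."""
    claimed = Decimal(claimed_text)
    # Precision = number of decimal places in the claim as written,
    # e.g. "2.5" has exponent -1, so we round to one decimal place.
    quantum = Decimal(1).scaleb(claimed.as_tuple().exponent)
    rounded = Decimal(str(actual)).quantize(quantum, rounding=ROUND_HALF_UP)
    return rounded == claimed
```

For example, an actual result of 2.54 is consistent with the claim "2.5", while 2.44 is not; an actual result of 123.4 is consistent with the coarser claim "123".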
In our first project in this space, we have developed a "fact checker tool" that supports authors in producing accurate data summaries. The tool is similar in spirit to a spell checker: where a spell checker supports users in avoiding spelling and grammar mistakes, the fact checker supports users in avoiding erroneous claims. We focus on a restricted class of claims that are at the same time common and error-prone. The fact checker translates text passages into equivalent SQL queries, evaluates them on a database, and marks up potentially erroneous claims. Users obtain a natural language explanation, summarizing the system's interpretation of specific text passages, and can easily take corrective actions if necessary. We have recently used this tool to identify erroneous claims in articles from several major newspapers, some of which had gone unnoticed for years.
I recently presented the first paper on quantum computing at the VLDB conference (see talk video below). In this paper, we focus on a classical optimization problem from the database domain: the problem of multiple query optimization. We show how to map instances of that problem to the restrictive input format supported by the quantum annealer. Doing so is challenging, mainly due to the limited number of qubits and the sparse connections between them. We describe problem representations with low asymptotic qubit growth that map well to the sparse connection structure between qubits. We also introduce an algorithm that calculates such mappings efficiently at run time. Finally, we present experimental results obtained on the quantum annealer, evaluating the practicality and performance of the proposed method.
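To make the mapping idea concrete, here is a toy encoding of multiple query optimization as a QUBO (quadratic unconstrained binary optimization), the input format of a quantum annealer: one binary variable per (query, plan) pair, an energy term for plan costs, a negative term for shared intermediate results, and a penalty term enforcing one plan per query. The instance and all numbers are made up for illustration; the paper's actual encodings for the annealer's qubit topology are considerably more involved, and we brute-force the minimum here instead of annealing.

```python
import itertools

# Binary variable x[q][p] = 1 iff plan p is chosen for query q.
plan_costs = [[3.0, 2.0], [4.0, 1.5]]          # cost[query][plan]
savings = {((0, 0), (1, 0)): 2.5}              # shared intermediate result
PENALTY = 10.0                                 # enforces one plan per query

def energy(x):
    """QUBO energy: plan costs - sharing savings + constraint penalty."""
    e = sum(plan_costs[q][p] for q, row in enumerate(x)
            for p, bit in enumerate(row) if bit)
    for ((q1, p1), (q2, p2)), s in savings.items():
        if x[q1][p1] and x[q2][p2]:
            e -= s                             # reward sharing work
    for row in x:
        e += PENALTY * (sum(row) - 1) ** 2     # exactly one plan per query
    return e

def brute_force():
    best = None
    for bits in itertools.product([0, 1], repeat=4):
        x = [bits[:2], bits[2:]]
        e = energy(x)
        if best is None or e < best[1]:
            best = (x, e)
    return best

assignment, e = brute_force()
```

In this toy instance the sharing bonus (3 + 4 - 2.5 = 4.5) does not beat the two individually cheap plans (2 + 1.5 = 3.5), so the minimum-energy assignment picks plan 1 for both queries.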
We use natural language analysis to identify statements in Web text that express an opinion about whether or not a specific property applies to a specific entity. As we consider subjective properties, user opinions diverge, and we need to resolve conflicting opinions in order to correctly infer the majority opinion. This is surprisingly difficult: simple resolution strategies (e.g., taking the majority vote among conflicting opinions) lead to poor results (i.e., a poor match with opinions collected from test users). The reason is various types of skew that influence the probability that users express a certain opinion on the Web. For instance, users who think that a certain city is big are more likely to write about it on the Web; hence, they are overrepresented in a sample of opinions collected from the Web. We overcome those challenges by learning property- and entity-type-specific user behavior models via unsupervised machine learning, using an expectation-maximization approach. We show in a large user study that those models allow us to quite reliably infer the opinions of the average user. The system is described in more detail in the corresponding SIGMOD 2015 talk below.
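The flavor of such an expectation-maximization approach can be sketched with a simplified "two-coin" voter model (not the actual model from the paper): we jointly estimate, per entity, the probability that the property truly holds, and the rates at which users report "yes" given each ground truth, which is where reporting skew enters.

```python
def em(votes, iters=50):
    """votes: per entity, a list of 0/1 opinions collected from the Web."""
    # Parameters: prior p of the property holding, and reporting rates
    # a = P(vote=1 | holds), b = P(vote=1 | does not hold).
    p, a, b = 0.5, 0.8, 0.2
    for _ in range(iters):
        # E-step: posterior that the property holds for each entity.
        post = []
        for v in votes:
            ones = sum(v)
            zeros = len(v) - ones
            lik1 = p * a**ones * (1 - a)**zeros
            lik0 = (1 - p) * b**ones * (1 - b)**zeros
            post.append(lik1 / (lik1 + lik0))
        # M-step: re-estimate parameters from the posteriors.
        p = sum(post) / len(post)
        a = (sum(g * sum(v) for g, v in zip(post, votes)) /
             max(sum(g * len(v) for g, v in zip(post, votes)), 1e-9))
        b = (sum((1 - g) * sum(v) for g, v in zip(post, votes)) /
             max(sum((1 - g) * len(v) for g, v in zip(post, votes)), 1e-9))
    return post, (p, a, b)

# Four entities with diverging Web opinions (made-up data).
votes = [[1, 1, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0], [0, 0, 0, 0]]
post, params = em(votes)
```

The learned rates need not be symmetric: if users who hold one opinion post more often, the fitted a and b absorb that skew, so the posterior can deviate from the raw majority vote.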