# Research Projects

## Amortized Bayesian Inference for Multilevel Models

Probabilistic multilevel models (MLMs) are a central building block in Bayesian data analysis. Despite their widely acknowledged advantages, MLMs remain challenging to estimate and evaluate, especially when the involved likelihoods or priors are analytically intractable. Recent developments in generative deep learning and simulation-based inference have shown promising results in scaling up Bayesian inference through *amortization*. However, the utility of deep generative models for learning Bayesian MLMs remains largely unexplored.

In this project, we propose to develop a general and efficient neural inference framework for estimating and evaluating complex Bayesian MLMs. Our framework will substantially extend previous work on simulation-based Bayesian inference for single-level models. Moreover, it aims to encompass not only the inference phase of a Bayesian workflow but also the model evaluation steps, which usually comprise a computational bottleneck with standard (non-amortized) Bayesian methods. Thus, the proposed project has the potential to greatly enhance model-based inference and understanding of complex processes across the quantitative sciences.

Overarching Topics: Simulation-Based Inference, Uncertainty Quantification

Project Members: Daniel Habermann

Funders: German Research Foundation (DFG)

Funding Period: 2023 – 2026

Publications:

- Schmitt M., Habermann D., Bürkner P. C., Köthe U., & Radev S. T. (2023). Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference.
*NeurIPS UniReps Workshop*.

## Bayesian Distributional Latent Variable Models

In psychology and related sciences, a lot of research is concerned with studying latent variables, that is, constructs which are not directly observable. Statistical methods for modeling latent variables based on manifest (observable) indicators are thus crucial to the scientific progress in those fields. Two major interconnected statistical areas dealing with latent variables exist, namely, Item Response Theory (IRT) and Structural Equation Modeling (SEM). Although the two fields are closely connected, the frontiers of IRT and SEM have developed in somewhat different directions.

A combination of these two major frontiers would enable researchers to tackle a lot of advanced psychological research questions at the intersection of psychometrics, personnel psychology, cognitive psychology, and applied psychology. In order for us to gain better insights into behavioral and cognitive processes, their mathematical approximations should match the processes’ complexity in both overall distributional form and its components that are expressed as complex functions of predicting variables.

This project aims to develop a framework for Bayesian distributional latent variable models that combines the principles of IRT and SEM with the flexibility of distributional regression powered by modern Bayesian estimation methods.

Overarching Topics: Uncertainty Quantification, Machine-Assisted Bayesian Workflow

Project Members: Luna Fazio

Funders: German Research Foundation (DFG), TU Dortmund University

Funding Period: 2022 – 2025

## Simulation-Based Prior Distributions for Bayesian Models

Data-driven statistical modeling plays a crucial role in almost all quantitative sciences. Despite continuous increases in the amount of available data, the addition of further information sources, such as expert knowledge, often remains an irreplaceable part of setting up high-fidelity models. Grounded in probability theory, Bayesian statistics provides a principled approach to including expert knowledge in the form of prior distributions, a process called prior elicitation. However, prior elicitation for high-dimensional Bayesian models is infeasible with existing methods due to practical and computational challenges. With the goal of solving these challenges, we propose to develop simulation-based priors for high-dimensional Bayesian models that allow to incorporate prior information elicited on any model-implied quantities. We expect the developed methods to have a major impact on all fields applying probabilistic modeling by making the use of expert knowledge practical, robust, and computationally feasible.

Overarching Topics: Prior Specification, Simulation-Based Inference

Project Members: Florence Bockting

Funders: TU Dortmund University, Cluster of Excellence SimTech

Funding Period: 2022 – 2025

Publications:

- Bockting F., Radev, S. T., & Bürkner P. C. (2023). Simulation-Based Prior Knowledge Elicitation for Parametric Bayesian Models.
*ArXiv preprint*.

## Probabilistic Models for Single-Cell RNA Sequencing Data

Trajectory and pseudo-time inference methods in single-cell RNA sequencing face challenges from the ambiguity of the static single-cell transcriptome snapshot data. In this project, we aim to tackle this challenge by means of advanced probabilistic methods. Concretely, we aim to reconstruct unobserved cell ordering as latent pseudo-time by analyzing RNA spliced counts and corresponding derivative RNA velocity. Further, we aim to obtain uncertainty estimates of the latent cell ordering using Bayesian inference. To achieve these goals, we will develop advanced latent Gaussian process models with the ability of utilizing derivative information to increase precision in estimating unobserved latent inputs. This model deploys derivative covariance kernel functions and modifications in the hyperparameter specifications, thus increasing capabilities for utilizing derivative information in a multi-output setting. Although the primary motivation lies in applications in single-cell biology, this model has the potential to solve similar research problems dealing with multi-output data and its derivatives from diverse fields of study.

Overarching Topics: Uncertainty Quantification

Project Members: Soham Mukherjee

Co-Supervisors: Manfred Claassen

Funders: TU Dortmund University, University of Tübingen

Funding Period: 2022 – 2025

## Data-Integrated Training of Surrogate Models for Uncertainty Quantification of Complex Biological Systems Models

Uncertainty estimation is crucial to assess the predictive power and limitations of biological systems models. However, in the case of high-dimensional parameter spaces and/or complex functional relationships, physics-based simulation models are often computationally too demanding for rigorous Bayesian uncertainty quantification. Surrogate models allow for such analyses with much lower effort. They are typically trained such that they fit the simulation reference best.

What is left unexplored is the possibility of surrogate models to actually fit observed data better than the reference model. This phenomenon occurs when structural misspecification of the physics-constrained reference model limits its performance, but at the same time, the more flexible data-driven surrogate model can better capture the relation of output and input data. Such situations offer huge potential for diagnostic evaluation of the modelling approach toward deeper system understanding and model improvement.

We aim at developing (1) a weighted data-integrated surrogate training approach for improved prediction skill, (2) a diagnostic approach for structural error detection in the reference model, and (3) an uncertainty propagation analysis that accounts for the approximation error introduced by this novel as well as any type of surrogate.

Overarching Topics: Uncertainty Quantification

Project Members: Philipp Reiser

Co-Supervisors: Anneli Guthke

Funders: Cluster of Excellence SimTech

Funding Period: 2022 – 2025

## Meta-Uncertainty in Bayesian Model Comparison

What we can learn from a single data set in experiments and observational studies is always limited, and we are inevitably left with some remaining uncertainty. It is of utmost importance to take this uncertainty into account when drawing conclusions if we want to make real scientific progress. Formalizing and quantifying uncertainty is thus at the heart of statistical methods aiming to obtain insights from data.

To compare scientific theories, scientists translate them into statistical models and then investigate how well the models’ predictions match the gathered real-world data. One widely applied approach to compare statistical models is Bayesian model comparison (BMC). Relying on BMC, researchers obtain the probability that each of the competing models is true (or is closest to the truth) given the data. These probabilities are measures of uncertainty and, yet, are also uncertain themselves. This is what we call meta-uncertainty (uncertainty over uncertainties). Meta-uncertainty affects the conclusions we can draw from model comparisons and, consequently, the conclusions we can draw about the underlying scientific theories.

This project contributes to this endeavor by developing and evaluating methods for quantifying meta-uncertainty in BMC. Building upon mathematical theory of meta-uncertainty, we will utilize extensive model simulations as an additional source of information, which enable us to quantify so-far implicit yet important assumptions of BMC. What is more, we will be able to differentiate between a closed world, where the true model is assumed to be within the set of considered models, and an open world, where the true model may not be within that set – a critical distinction in the context of model comparison procedures.

Overarching Topics: Model Comparison, Uncertainty Quantification, Simulation-Based Inference

Project Members: Marvin Schmitt

Funders: Cyber Valley Research Fund

Funding Period: 2021 – 2024

Publications:

- Schmitt, M., Radev, S. T., & Bürkner P. C. (2023). Meta-Uncertainty in Bayesian Model Comparison.
*Artificial Intelligence and Statistics (AISTATS) Conference Proceedings*. - Schmitt, M., Bürkner P. C., Köthe U., & Radev S. T. (2023). Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks.
*Proceedings of the German Conference on Pattern Recognition (GCPR)*. - Schmitt M., Habermann D., Bürkner P. C., Köthe U., & Radev S. T. (2023). Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference.
*NeurIPS UniReps Workshop*. - Radev S. T., Schmitt M., Pratz V., Picchini U., Köthe U., & Bürkner P. C. (2023). JANA: Jointly Amortized Neural Approximation of Complex Bayesian Models.
*Uncertainty in Artificial Intelligence (UAI) Conference Proceedings*. - Radev S. T., Schmitt M., Schumacher L., Elsemüller L., Pratz V., Schälte Y., Köthe U., & Bürkner P. C. (2023). BayesFlow: Amortized Bayesian Workflows With Neural Networks.
*Journal of Open Source Software*. doi:10.21105/joss.05702

Software:

- ggsimplex: Simplex visualizations with ggplot2 [GitHub]

## Intuitive Joint Priors for Bayesian Multilevel Models

Regression models are ubiquitous in the quantitative sciences making up a big part of all statistical analysis performed on data. In the quantitative sciences, data often contains multilevel structure, for example, because of natural groupings of individuals or repeated measurement of the same individuals. Multilevel models (MLMs) are designed specifically to account for the nested structure in multilevel data and are a widely applied class of regression models. From a Bayesian perspective, the widespread success of MLMs can be explained by the fact that they impose joint priors over a set of parameters with shared hyper-parameters, rather than separate independent priors for each parameter. However, in almost all state-of-the-art approaches, different additive regression terms in MLMs, corresponding to different parameter sets, still receive mutually independent priors. As more and more terms are being added to the model while the number of observations remains constant, such models will overfit the data. This is highly problematic as it leads to unreliable or uninterpretable estimates, bad out-of-sample predictions, and inflated Type I error rates.

To solve these challenges, this project aims to develop, evaluate, implement, and apply intuitive joint priors for Bayesian MLMs. We hypothesize that our developed priors will enable the reliable and interpretable estimation of much more complex Bayesian MLMs than was previously possible.

Overarching Topics: Prior Specification

Project Members: Javier Aguilar

Funders: German Research Foundation (DFG), TU Dortmund University, University of Stuttgart

Funding Period: 2021 – 2025

Publications:

- Aguilar J. E. & Bürkner P. C. (2023). Intuitive Joint Priors for Bayesian Linear Multilevel Models: The R2D2M2 prior.
*Electronic Journal of Statistics*. Available for free also on ArXiv. doi:10.1214/23-EJS2136

## Machine Learning for Bayesian Model Building

The Bayesian approach to data analysis provides a consistent and flexible way to handle uncertainty in all observations, model parameters, and model structure using probability theory. However, building Bayesian models in a principled way remains a highly complex task requiring a lot of expertise and cognitive resources. In this project, we will develop a machine assisted workflow for building interpretable, robust, and well-predicting Bayesian models. Based on statistical theory, we will develop a framework for simulating realistic data with known modeling challenges. Subsequently, using neural network architectures tuned to the structure of the fitted Bayesian models, machines will be trained on the simulated data to provide automatic model evaluation and modeling recommendations that guide the user through the model building process using interactive visualizations. While leaving the modeling choices up to the user, the machine learns from the user’s decisions to improve its recommendations on the fly.

Overarching Topics: Machine-Assisted Bayesian Workflow, Model Comparison

Project Members: Maximilian Scholz

Funders: Cluster of Excellence SimTech

Funding Period: 2021 – 2024

Publications:

- Scholz M. & Bürkner P. C. (2023). Prediction can be safely used as a proxy for explanation in causally consistent Bayesian generalized linear models.
*ArXiv preprint*. - Scholz M. & Bürkner P. C. (2023). Posterior accuracy and calibration under misspecification in Bayesian generalized linear models.
*ArXiv preprint*. - Bürkner P. C., Scholz M., & Radev S. T. (2023). Some models are useful, but how do we know which ones? Towards a unified Bayesian model taxonomy.
*Statistics Surveys*.

Software: