# Two-day meeting, spring 2024

Early summer, a packed auditorium, and unstoppable talks welcomed more than 70 participants at Aarhus University.

The first meeting of 2024 was held at the Department of Mathematics at Aarhus University on the 21st and 22nd of May and was organized by the department's Stochastics group. The participants were met with an ideally proportioned auditorium, which offered rich conditions for presentations and discussions to spark ideas and provoke thought.

With the conference dinner on the first day taking place on site in the Mathematics canteen, it was straight out of the frying pan of statistical talks and into the fire of the social programme, before the next set of talks stirred things up again the following day. The programme covered a wide range of themes, starting with a view into the life of a statistician at Novo Nordisk, moving on to a controversial and very interesting story about a recently developed prediction algorithm, and ending in abstract and mind-boggling spaces structured with C*-algebras. The conference brought together students, senior statisticians, and everyone in between, setting an ideal stage for reconnecting with colleagues from near and far (on a Danish distance scale) and for getting to know new faces.

Claus Dethlefsen from Novo Nordisk started off the programme with his talk “Tales from the everyday life of a pharmaceutical statistician”, dutifully remarking that the views presented in the talk were his personal viewpoints and not necessarily those of Novo Nordisk. Claus discussed the current and future prevalence of biostatistics within Novo Nordisk (while not so subtly advertising that there will likely be open positions in the near future), the statistical challenges associated with clinical trials, and the development of an R-based programming structure that can be used when submitting drug candidates to relevant authorities such as the FDA.

Staying within a structural scope, Therese Graversen followed with her talk “Designing a data science education”, which revolved around her work on restructuring the data science bachelor’s degree at the IT University of Copenhagen. Among other things, Therese discussed how to deal with a highly international student population and how to balance the amount of mathematical and statistical theory, so that it fits the level of the students entering the programme while also equipping them with the skills needed to apply for more than a single master’s programme. She also discussed how to weigh the time spent on hot topics against the time spent on foundational data science, and briefly touched upon the issue of gender balance within data science education. An overarching theme of the talk was a discussion of what data science is and what it should be, now and in the future.

In the next talk, Martin Bøgsted discussed a way of complying with the General Data Protection Regulation (GDPR) while still being able to share and show something that represents actual sensitive data. His talk “How to generate realistic, non-personal synthetic health data” discussed how one can obtain a non-sensitive data sample from a sensitive one by adding noise to the actual data. For this to be useful and safe, the synthetic data need to have high utility, meaning that they are almost as good as the original data for answering a specific scientific question. They also need to have high privacy, meaning that given the synthetic data, the synthesization model, and potentially some of the original data, one should not be able to match individuals in the original sample to individuals in the synthesized sample with high probability. Martin gave examples where Bayesian computations could lead to measures of privacy, and he provided some general theoretical results on bounding this type of privacy while also pointing to some gaps in the current theoretical privacy framework. Interestingly, Martin and his group have brought in a GDPR law specialist to help translate between legal and statistical formulations, so that their endeavors can actually bear fruit in practice.

The talk of the day was given by Simon Tilma Vistisen and was called “The Emperor’s New Machine Learning Clothes – the fairy tale of a marketed clinical prediction model”. As the title suggests, the talk was as captivating as it was provocative. A flagship prediction algorithm intended to enable proactive treatment of low blood pressure during surgery, a condition which would otherwise lead to an increased risk of poor outcomes such as organ damage or even death, is apparently worthless. Why? In the development of the algorithm, a severe data leakage issue was introduced, which led to an extreme overestimation of prediction performance. This algorithm, developed by one of the top medical technology companies internationally, is being sold to hospitals around the world, yet it essentially tells the surgeon no more than the currently observed blood pressure values already do. As of now, this is a scientific-industrial scandal which is only starting to play out. With the developer of the algorithm being highly influential and a lot of money at stake, the issue needs to be described clearly and convincingly in the scientific literature for the use of the algorithm to be revised.

Before moving on to the social part of the programme, the Danish Statistical Society, the Committee for Dissemination and Press, and Young Statisticians Denmark gave updates on their ongoing and upcoming activities.

Clara Brimnes Gardner opened the second day with her talk “Phase-type representations for exponential distributions”, focusing on the question of when a phase-type distribution is in fact an overparameterized exponential distribution. Describing her work on this question, she laid out necessary and sufficient conditions for a phase-type distribution to simplify into an exponential distribution. In particular, these conditions are connected to the algebraic degree of a phase-type distribution as well as to PH-simplicity, both of which relate to the transition matrix of the phase-type distribution.
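
As a small worked illustration of what such overparameterization can look like (an example constructed for this write-up, not a result quoted from the talk), consider a phase-type distribution with initial distribution $\alpha$ (a row vector summing to one) and sub-intensity matrix $T$. Its survival function is

$$
\bar F(x) = \alpha\, e^{Tx}\, \mathbf{1}, \qquad x \ge 0 .
$$

If every row of $T$ sums to $-\lambda$, that is $T\mathbf{1} = -\lambda \mathbf{1}$, then $T^k \mathbf{1} = (-\lambda)^k \mathbf{1}$ for all $k$, so

$$
e^{Tx}\mathbf{1} = e^{-\lambda x}\mathbf{1}
\quad\Longrightarrow\quad
\bar F(x) = e^{-\lambda x}\,\alpha \mathbf{1} = e^{-\lambda x}.
$$

For instance, with

$$
\alpha = \begin{pmatrix} \tfrac12 & \tfrac12 \end{pmatrix}, \qquad
T = \begin{pmatrix} -3 & 1 \\ 2 & -4 \end{pmatrix},
$$

both rows of $T$ sum to $-2$, so this two-phase representation is simply an exponential distribution with rate $2$. This is only one simple sufficient condition; the necessary and sufficient conditions presented in the talk are not reproduced here.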

The next talk, given by Merle Behr, considered the issue of interaction recovery in a random forest framework and was titled “Provable Boolean interaction recovery from tree ensemble obtained via random forests”. She motivated the issue with the particular case of discovering genetic interactions among a large set of genetic factors, which might lead to advancements in fields such as precision medicine. Roughly speaking, the method considers how many times a set of signed features appears together in a random forest ensemble, and Merle and her group have proven a theorem stating that, with high probability, such a procedure will consistently discover interactions. This provides a much-needed theoretical foundation for tree-based discovery of Boolean feature interactions.

Morten Overgaard followed with a talk on “Pseudo-observations in a multistate setting”. Here he considered pseudo-observation regression for estimating state occupation probabilities. The pseudo-observation method involves calculating jackknife pseudo-observations based on some estimator of the state occupation probabilities, which are then used as outcomes in a regression analysis. In the particular case that Morten considered, the estimator was the Aalen-Johansen-derived estimator of the state occupation probabilities. Morten delved into some theory and discussion of when and why the approach can be expected to work, and in particular he discussed if and when the Markov assumption and the assumptions of state-independent censoring and covariate-independent censoring are necessary. He validated the theoretical discussion with a simulation study.
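
The jackknife step can be illustrated with a minimal Python sketch. It is not the method from the talk: as a simplifying assumption it uses the Kaplan-Meier estimator of a survival probability, which is the two-state special case of the Aalen-Johansen estimator, and the function names `kaplan_meier` and `pseudo_observations` are made up for this illustration. For an estimate $\hat\theta$ based on all $n$ subjects and a leave-one-out estimate $\hat\theta^{(-i)}$, the pseudo-observation for subject $i$ is $n\hat\theta - (n-1)\hat\theta^{(-i)}$.

```python
import numpy as np


def kaplan_meier(times, events, t):
    """Kaplan-Meier estimate of P(T > t) from right-censored data.

    times  : observed times (event or censoring time)
    events : True if the observed time is an event, False if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    # Multiply the conditional survival factors over distinct event times <= t.
    for s in np.unique(times[events & (times <= t)]):
        at_risk = np.sum(times >= s)
        n_events = np.sum((times == s) & events)
        surv *= 1.0 - n_events / at_risk
    return surv


def pseudo_observations(times, events, t):
    """Jackknife pseudo-observations for P(T > t), one per subject."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    n = len(times)
    full = kaplan_meier(times, events, t)
    keep = ~np.eye(n, dtype=bool)  # row i leaves out subject i
    leave_one_out = np.array(
        [kaplan_meier(times[keep[i]], events[keep[i]], t) for i in range(n)]
    )
    return n * full - (n - 1) * leave_one_out
```

Without censoring, the pseudo-observations reduce to the indicators of still being event-free at time `t`. With censoring, they remain defined for every subject, which is what allows them to serve as outcomes in a subsequent regression.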

At the tail end of the conference, Jacob Hjelmborg propelled us all into space with his talk “Reviewing data embeddings to structured algebras”. Here he discussed the advantages of embedding data into a higher-dimensional space for solving problems such as separating features of a data set. As a specific case, he considered the embedding of data into Hilbert C*-modules that have the reproducing kernel property. He lightened the abstract part of the talk with an analysis of data from the Nordic twin study of cancer, where it was concluded that the likelihood of getting a tattoo appears to be associated only with social factors and not with genetic factors. Jacob ended his talk by teasing the potential of embedding the twin study data into a higher-dimensional space in order to quantify salient features, such as the covariance between twin-specific trajectories, in a natural way. We will have to wait for the results of this labor.

The conference ended with sandwiches in the hall, where the participants could bid each other farewell and good luck until next time. We look forward to seeing new and familiar faces at the next two-day meeting at The Cancer Society in Copenhagen!

This blogpost was written by Christoffer Sejling, member of DSTS UFP (Udvalg for Formidling og Presse / Committee for Dissemination and Press).