Philipp Benner

Researcher at BAM. [Github], [Twitter], [ORCID], [Email:s/ä/a/]

I’m a researcher in statistics and machine learning for materials science and bioinformatics. See our group website for more information!

Selected Research Projects

Fragmentation site prediction for non-targeted metabolomics using graph neural networks (Y. Nowatzky, T. Muth)

The potential of non-targeted metabolomics to uncover new biological insights, identify biomarkers or monitor clinical disease progression cannot be emphasized enough. However, spectral reference data is incomplete, and most compound mass spectra in non-targeted metabolomics experiments cannot be annotated with spectral search alone. At the same time, the identification and classification of unknown compounds are far from trivial. One reason is the current lack of understanding about how new molecules will fragment when subjected to tandem mass spectrometry (MS/MS). Existing in silico fragmentation methods, such as MetFrag [1] and CFM-ID [2], imitate the fragmentation process but their accuracy is limited due to the way they integrate and engineer molecular features. We investigate the ability of graph neural networks (GNNs) to learn and recognize relevant structural groups associated with bond cleavage during MS/MS.

References
  1. Ruttkies, Christoph, et al. “MetFrag relaunched: incorporating strategies beyond in silico fragmentation.” Journal of cheminformatics 8.1 (2016): 1-16.
  2. Wang, Fei, et al. “CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification.” Analytical chemistry 93.34 (2021): 11692-11700.
  3. SE Stein. ‘Mass Spectral Database’. In: National Institute of Standards and Technology (NIST) (2017)
  4. Mingxun Wang, et al. “Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking.” Nature biotechnology 34, no. 8 (2016): 828.

Crystal Synthesizability (S. Amariamir, J. George)

High-throughput material simulations are an integral part of modern materials science. However, there is no straightforward way to recognize synthesizable materials before feeding them to simulation pipelines. Common heuristics for distinguishing stable crystals, such as the Pauling rules, have been shown to have limited predictive power [1]. Besides stability, reaction kinetics and technological limitations also affect synthesizability. In this work, we built a machine learning model that predicts the synthesizability of a given crystal. This can be formulated as a classification problem with positive (experimental) data and unlabeled (theoretical) data. We take an iterative Positive and Unlabeled (PU) learning approach to build and train our model. Two deep learning classifiers are used, SchNetPack [2] and ALIGNN [3]. We combine them via co-training [4] to increase prediction reliability. Applications include filtering the structural predictions of high-throughput simulations for synthesizability.
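The PU setting described above can be illustrated with a small, self-contained sketch. It is not the model from this project: a plain logistic regression stands in for the deep classifiers, the data are synthetic 2D points, the co-training step is omitted, and all function names (`fit_logreg`, `predict_proba`, `pu_scores`) are hypothetical. In each round, a random subset of the unlabeled data is treated as provisional negatives, and the remaining unlabeled samples are scored by the resulting classifier.

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, steps=500):
    """Gradient-descent logistic regression (stand-in for a deep classifier)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def pu_scores(X_pos, X_unl, n_rounds=20, seed=0):
    """Average out-of-bag positive-class score for every unlabeled sample."""
    rng = np.random.default_rng(seed)
    n_pos, n_unl = len(X_pos), len(X_unl)
    score_sum = np.zeros(n_unl)
    score_cnt = np.zeros(n_unl)
    for _ in range(n_rounds):
        # Treat a random subset of the unlabeled data as provisional negatives.
        neg_idx = rng.choice(n_unl, size=min(n_pos, n_unl // 2), replace=False)
        oob = np.setdiff1d(np.arange(n_unl), neg_idx)
        X = np.vstack([X_pos, X_unl[neg_idx]])
        y = np.concatenate([np.ones(n_pos), np.zeros(len(neg_idx))])
        w = fit_logreg(X, y)
        # Score only the unlabeled samples that were not used for training.
        score_sum[oob] += predict_proba(w, X_unl[oob])
        score_cnt[oob] += 1
    return score_sum / np.maximum(score_cnt, 1)

# Synthetic toy data (an assumption for illustration only):
# the unlabeled set hides 50 positives among 150 true negatives.
rng = np.random.default_rng(1)
X_pos = rng.normal(2.0, 1.0, size=(100, 2))
hidden_pos = rng.normal(2.0, 1.0, size=(50, 2))
hidden_neg = rng.normal(-2.0, 1.0, size=(150, 2))
X_unl = np.vstack([hidden_pos, hidden_neg])

scores = pu_scores(X_pos, X_unl)
print("mean score, hidden positives:", scores[:50].mean())
print("mean score, hidden negatives:", scores[50:].mean())
```

In this toy setting the hidden positives should receive clearly higher scores than the hidden negatives, which is the property a synthesizability filter would rely on.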

References
  1. J. George, D. Waroquiers, D. Di Stefano, G. Petretto, G. Rignanese, and G. Hautier, “The Limited Predictive Power of the Pauling Rules,” Angew. Chem., vol. 132, no. 19, pp. 7639–7645, May 2020.
  2. K. T. Schütt et al. “SchNetPack: A Deep Learning Toolbox For Atomistic Systems,” J. Chem. Theory Comput., vol. 15, no. 1, pp. 448–455, Jan. 2019.
  3. Choudhary, K., DeCost, B. Atomistic Line Graph Neural Network for improved materials property predictions. npj Comput Mater 7 (2021).
  4. Katz, G.; Caragea, C.; Shabtai, A. Vertical Ensemble Co-Training for Text Classification. ACM Trans. Intell. Syst. Technol. 2018, 9, 21:1–21:23.

Invertible Neural Networks for Small Angle Scattering (S. Laskina, B. Pauw)

Continuing progress in the field of X-ray scattering gives scientists new possibilities to capture the 3D electron density of materials and molecules. Although the first methods appeared almost a century ago, recovering the density structure of a sample is still challenging. Scattering techniques measure the intensity of scattered waves, for which a very accurate mathematical description exists based on the Fourier transform of the electron density. However, the resulting measurements do not capture all information about the sample. First, only a 2D image is measured instead of a 3D volume. Second, the phase information of the scattered waves is lost. The latter is known as the “phase problem” and poses a serious obstacle when trying to recover the 3D electron density. We tackle this problem using invertible neural networks trained on theoretical electron densities and simulated Small Angle X-ray Scattering (SAXS) measurements. To this end, we created a library of theoretical electron densities composed of different shapes, including spheres and cylinders, together with their corresponding SAXS measurements. We used this library to develop an invertible neural network that is able to recover certain parameters of the electron densities. Despite the large loss of information when applying the forward model, our method is able to reliably identify important material parameters. We hope that this work is a first step towards a fully automated measurement and analysis workflow.
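The information loss in the forward model can be made concrete with a short sketch. It is an illustration under simplifying assumptions, not code from the project: `sphere_intensity` is a hypothetical helper implementing the textbook normalized form factor of a homogeneous sphere, and the phase problem is shown on a 1D toy density with a discrete Fourier transform.

```python
import numpy as np

def sphere_intensity(q, radius):
    """Normalized scattering intensity of a homogeneous sphere.

    Uses the standard form factor amplitude
    3 * (sin(qR) - qR * cos(qR)) / (qR)^3, squared. Requires q > 0.
    """
    x = q * radius
    amplitude = 3.0 * (np.sin(x) - x * np.cos(x)) / x**3
    return amplitude**2

# Forward model for one simple shape: intensity as a function of q.
q = np.linspace(0.01, 1.0, 5)
print(sphere_intensity(q, radius=10.0))

# Phase problem on a 1D toy density: two different densities,
# related by a shift, produce exactly the same measured intensity.
n = 256
density = np.zeros(n)
density[100:120] = 1.0
shifted = np.roll(density, 37)

intensity = np.abs(np.fft.fft(density)) ** 2
intensity_shifted = np.abs(np.fft.fft(shifted)) ** 2

print(np.allclose(intensity, intensity_shifted))  # True
```

The shift only changes the phase of the Fourier coefficients, so it is invisible in the intensity. This is why only certain parameters, such as sizes and shapes, can be recovered from the measurement.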

References
  1. Preprint
  2. Github

Tissue-Specific Regulatory Information within Enhancer DNA Sequences (M. Vingron)

Recent efforts to measure epigenetic marks across a wide variety of cell types and tissues provide insights into the cell type-specific regulatory landscape. We use this data to study whether there exists a correlate of epigenetic signals in the DNA sequence of enhancers, and we explore with computational methods to what degree such sequence patterns can be used to predict cell type-specific regulatory activity. By constructing classifiers that predict in which tissues enhancers are active, we are able to identify sequence features that might be recognized by the cell in order to regulate gene expression. While classification performance varies greatly between tissues, we show examples where our classifiers correctly predict tissue-specific regulation from sequence alone. We also show that many of the informative patterns indeed harbor transcription factor footprints.

References
  1. P. Benner and M. Vingron. Quantifying the Tissue-Specific Regulatory Information within Enhancer and Promoter DNA Sequences. NAR Genomics and Bioinformatics 3.4 (2021).
  2. P. Benner and M. Vingron. ModHMM: A modular supra-Bayesian genome segmentation method. Journal of Computational Biology 27.4 (2020): 442-457.

Algorithms for Computing Regularization Paths

High-dimensional statistics deals with statistical inference when the number of parameters or features $p$ far exceeds the number of observations $n$ (i.e. $p \gg n$). In this case, the parameter space must be constrained either by regularization or by selecting a small subset of $m \le n$ features. Feature selection through $\ell_1$-regularization combines the benefits of both approaches and has proven to yield good results in practice. However, the functional relation between the regularization strength $\lambda$ and the number of selected features $m$ is difficult to determine. Hence, parameters are typically estimated for all possible regularization strengths $\lambda$. These so-called regularization paths can be expensive to compute, and most solutions may not even be of interest to the problem at hand. As an alternative, an algorithm is proposed that determines the $\ell_1$-regularization strength $\lambda$ iteratively for a fixed $m$. The algorithm can be used to compute leapfrog regularization paths by successively increasing $m$.
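The idea of searching for $\lambda$ given a target $m$ can be sketched as follows. This is not the published algorithm: a simple bisection over $\lambda$ is swapped in as a stand-in, the loss is least squares rather than logistic regression, and the names `lasso_cd` and `lambda_for_m` are hypothetical. The sketch assumes the objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, solved by cyclic coordinate descent. Bisection treats the number of selected features as decreasing in $\lambda$, which holds only approximately, so the function returns the closest solution it finds.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200, tol=1e-8):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()
    col_sq = (X**2).sum(axis=0) / n
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            # Correlation of feature j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding yields exact zeros.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:
            break
    return beta

def lambda_for_m(X, y, m, max_iter=40):
    """Bisection on lambda until m features are selected (stand-in method)."""
    n = len(y)
    lo, hi = 0.0, np.max(np.abs(X.T @ y)) / n  # at hi, all coefficients are zero
    best = None
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        beta = lasso_cd(X, y, lam)
        k = int(np.count_nonzero(beta))
        if best is None or abs(k - m) < abs(best[2] - m):
            best = (lam, beta, k)
        if k == m:
            break
        if k > m:
            lo = lam  # too many features: regularize more strongly
        else:
            hi = lam  # too few features: regularize less
    return best

# Synthetic example with p > n and three informative features (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))
beta_true = np.zeros(300)
beta_true[:3] = [6.0, -5.0, 4.0]
y = X @ beta_true + 0.1 * rng.normal(size=200)

# Evaluate only the requested sparsity levels instead of a full path.
for m in (1, 2, 3):
    lam, beta, k = lambda_for_m(X, y, m)
    print(m, k, round(lam, 3), np.flatnonzero(beta).tolist())
```

Only the requested values of $m$ are evaluated, which is the point of a leapfrog path. A practical implementation would also reuse the previous solution as a warm start when moving from one $m$ to the next.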

References
  1. P. Benner. Computing leapfrog regularization paths with applications to large-scale k-mer logistic regression. Journal of Computational Biology 28.6 (2021): 560-569.

Lectures

Machine Learning in Bioinformatics - FU Berlin

Lecture at the Mathematics and Computer Science Department of Free University of Berlin together with H. Richard

Machine Learning Models:

Feature and Model Selection, Bias-Variance Tradeoff, Regularization, Model Complexity, Double Descent:

Model Evaluation and Explainability:

Statistics background:

Machine Learning in Materials Science - FU Berlin

Seminar at FU Berlin together with A. Kister

Software projects

Numerics, Statistics, ML

Bioinformatics

Other

Publications

Statistics Resources