Johannes Schmidt-Hieber

Professor of statistics
at the Department of Applied Mathematics,
University of Twente.

Visiting address:
Zilverling Building
Room ZI 4058
Drienerlolaan 5
7522 NB Enschede


Open positions:
We will soon have more job openings for PhD and postdoc positions in our group. If you have a background in statistics, probability, or mathematics, feel free to send an email to .

  1. The Kolmogorov-Arnold representation theorem revisited preprint
  2. On lower bounds for the bias-variance trade-off preprint
    With Alexis Derumigny
  3. On frequentist coverage of Bayesian credible sets for estimation of the mean under constraints preprint
    With Kevin Duisters
  4. Deep ReLU network approximation of functions on a manifold. preprint
  5. Posterior consistency for n in the binomial (n,p) problem with both parameters unknown - with applications to quantitative nanoscopy. preprint
    With Laura Schneider, Thomas Staudt, Andrea Krajina, Timo Aspelmeier and Axel Munk.
  6. Posterior contraction rates for support boundary recovery preprint
    With Markus Reiss.
    To appear in Stochastic Processes and their Applications
  7. Nonparametric regression using deep neural networks with ReLU activation function article pdf preprint
    Annals of Statistics, Volume 48, Number 4, 1875-1897, 2020.
    This article has been discussed by
  8. Rejoinder: "Nonparametric regression using deep neural networks with ReLU activation function" article pdf
    Annals of Statistics, Volume 48, Number 4, 1916-1921, 2020.
  9. Nonparametric Bayesian analysis of the compound Poisson prior for support boundary recovery article pdf preprint
    Annals of Statistics, Volume 48, Number 3, 1432-1451, 2020. With Markus Reiss.
  10. Bayesian variance estimation in the Gaussian sequence model with partial information on the means. article
    Electronic Journal of Statistics, Volume 14, Number 1, 239-271, 2020. With Gianluca Finocchio.
  11. Asymptotic nonequivalence of density estimation and Gaussian white noise for small densities preprint article
    Annales de l'Institut Henri Poincaré B, Volume 55, Number 4, 2195-2208, 2019. With Kolyan Ray.
  12. Tests for qualitative features in the random coefficients model pdf
    Electronic Journal of Statistics, Volume 13, 2257-2306, 2019. With Fabian Dunker, Konstantin Eckle, and Katharina Proksch.
  13. A comparison of deep networks with ReLU activation function and linear spline-type methods pdf
    Neural Networks, Volume 110, 232-242, 2019. With Konstantin Eckle.
  14. The Le Cam distance between density estimation, Poisson processes and Gaussian white noise article preprint
    Mathematical Statistics and Learning. Volume 1, Issue 2, 101-170, 2018. With Kolyan Ray.
  15. A regularity class for the roots of non-negative functions. pdf arXiv
    Annali di Matematica Pura ed Applicata. Volume 196, Number 6, 2091-2103, 2017. With Kolyan Ray.
  16. Minimax theory for a class of non-linear statistical inverse problems. article revised preprint
    Inverse Problems. Volume 32, Number 6, 065003, 2016. With Kolyan Ray.
  17. Conditions for posterior contraction in the sparse normal means problem. pdf
    Electronic Journal of Statistics. Volume 10, Number 1, 976-1000, 2016. With Stéphanie van der Pas and JB Salomond.
  18. Bayesian linear regression with sparse priors. pdf arXiv
    Annals of Statistics. Volume 43, Number 5, 1986-2018, 2015. With Ismael Castillo and Aad van der Vaart.
  19. On adaptive posterior concentration rates. pdf
    Annals of Statistics. Volume 43, Number 5, 2259-2295, 2015. With Marc Hoffmann and Judith Rousseau.
  20. Spot volatility estimation for high-frequency data: adaptive estimation in practice. pdf arXiv
    Springer Lecture Notes in Statistics: Modeling and Stochastic Learning for Forecasting in High Dimension. 213-241, 2015. With Till Sabel and Axel Munk.
  21. Asymptotic equivalence for regression under fractional noise. pdf arXiv
    Annals of Statistics, Volume 42, Number 6, 2557-2585, 2014.
  22. Asymptotically efficient estimation of a scale parameter in Gaussian time series and closed-form expressions for the Fisher information. pdf arXiv supplement
    Bernoulli, Volume 20, Number 2, 747-774, 2014. With Till Sabel.
  23. On an estimator achieving the adaptive rate in nonparametric regression under $L^p$-loss for all $1 \leq p \leq \infty$. preprint
    This is an update of the working paper pdf. In the first version, we only consider simultaneous adaptation with respect to $L^2$- and $L^\infty$-loss. This article might be easier to read and includes a small numerical study.
  24. Multiscale methods for shape constraints in deconvolution: Confidence statements for qualitative features. pdf supplement
    Annals of Statistics, Volume 41, Number 3, 1299-1328, 2013. With Axel Munk and Lutz Dümbgen.
    A first draft of this paper appeared under the title: "Multiscale methods for shape constraints in deconvolution" in 2011. pdf. It contains essentially the same results, but under a very strong assumption on the decay of the Fourier transform of the error density. The first version is much easier to read and does not require the theory of pseudo-differential operators.
  25. Adaptive wavelet estimation of the diffusion coefficient under additive error measurements. pdf software
    Annales de l'Institut Henri Poincaré, 48, 1186-1216. With Marc Hoffmann and Axel Munk.
    An earlier version of this paper was published as a working paper under the title "Nonparametric estimation of the volatility under microstructure noise: wavelet adaptation." pdf.
  26. Nonparametric methods in spot volatility estimation. pdf
    Dissertation. Universität Göttingen and Universität Bern, 2010.
  27. Lower bounds for volatility estimation in microstructure noise models. pdf
    Borrowing Strength: Theory Powering Applications - A Festschrift for Lawrence D. Brown, IMS Collections, 6, 43-55, 2010. With Axel Munk.
  28. Nonparametric estimation of the volatility function in a high-frequency model corrupted by noise. pdf
    Electronic Journal of Statistics, 4, 781-821, 2010. With Axel Munk.
  29. Sharp minimax estimation of the variance of Brownian motion corrupted with Gaussian noise. pdf (including supplementary material).
    Statistica Sinica, 20, 1011-1024, 2010. With T. Tony Cai and Axel Munk.
  1. Statistical theory for deep neural networks with ReLU activation function. pdf
    Oberwolfach Reports, 2018.
  2. Nonparametric Bayes for an irregular model. pdf
    Oberwolfach Reports, 2017.
  3. Asymptotic equivalence for regression under dependent noise. pdf
    Oberwolfach Reports, 2015.
  4. Reconstruction of risk measures from financial data. pdf
    Nieuw Archief voor Wiskunde, 2014.
  5. Simultaneously adaptive estimation for $L^2$- and $L^\infty$-loss. pdf
    Oberwolfach Reports, 2014.
  6. Detection of qualitative features in statistical inverse problems. pdf
    Oberwolfach Reports, 2012.
  7. Obtaining qualitative statements in deconvolution models. pdf
    Oberwolfach Reports, 2012.
  8. The estimation of different scales in microstructure noise models from a nonparametric regression perspective. pdf
    Oberwolfach Reports, 2009. With Axel Munk.

Curriculum Vitae:
Born 1984 in Freiburg im Breisgau, Germany. Studies in mathematics at Universität Freiburg (2003-2004) and Universität Göttingen (2004-2007). PhD studies at Universität Göttingen and Universität Bern 2007-2011 (supervisors: Axel Munk and Lutz Dümbgen). Postdoc at Vrije Universiteit Amsterdam and ENSAE, Paris. Assistant professor at Leiden University (2014-2018). Since 2018, full professor at University of Twente.

Abitur 2004; Diploma in mathematics with minor theoretical physics, Universität Göttingen, 2007. Dissertation, Universität Göttingen and Universität Bern 2011 (double degree program, summa cum laude).

Research Experience:
Visiting Scholar at University of California, Davis (September 2006-March 2007); Research stays at Wharton Business School, Philadelphia (February 2008), RICAM, Linz (October-November 2008), ENSAE, Paris (August and December 2009, June 2012-May 2013, January 2019), Vrije Universiteit Amsterdam (June 2011-May 2012), Universität Heidelberg (December 2010), Humboldt University (August 2014, August 2016, January-March 2018), Paris Dauphine (February 2014), SAMSI (June 2015), Göttingen (October-December 2015), Bochum (June-July 2016), Fudan University (June, August 2017), Isaac Newton Institute (January-June 2018), Simons Institute, Berkeley (July 2019). Guest of Collaborative Research Center 649 at Humboldt University Berlin (2010-2011).

Associate editor:
Awards and grants:

Conference organization:

Other activities:

Upcoming Talks:

Research topics

  1. Statistical theory for deep neural networks
  2. Nonparametric Bayes
  3. Confidence statements for qualitative constraints
  4. Asymptotic equivalence
  5. Spot volatility estimation

Statistical theory for deep neural networks

Mathematically speaking, a neural network is a function mapping an input vector $\mathbf{x}\in\mathbb{R}^d$ to an output variable $y.$ Network functions are built by alternating matrix-vector multiplications with the action of a non-linear activation function $\sigma.$ Fitting a multilayer neural network means finding network parameters such that the network explains the input-output relation on the training data as well as possible.

A neural network can be represented as a directed graph, cf. Figure 1. The nodes in the graph (also called units) are arranged in layers. The input layer is the first layer and the output layer the last. The layers in between are called hidden layers. Each node/unit in the graph stands for a scalar product of the incoming signal with a weight vector, which is then shifted and passed through the activation function. The number of hidden layers is called the depth of the network. A multilayer network (also called a deep network) is a network with more than one hidden layer.
Fig.1 - Representation as a directed graph of a network with two hidden layers.
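The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration only; the layer widths and random parameters are arbitrary choices, not the architecture of Figure 1 or of any paper.

```python
import numpy as np

def relu(z):
    # componentwise ReLU activation: sigma(z) = max(z, 0)
    return np.maximum(z, 0.0)

def network(x, weights, biases):
    # alternate affine maps (matrix-vector multiplication plus shift)
    # with the activation; the final layer is affine without activation
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

# a network with two hidden layers of width 5 and random parameters
rng = np.random.default_rng(0)
shapes = [(5, 4), (5, 5), (1, 5)]          # input dimension d = 4
weights = [rng.standard_normal(s) for s in shapes]
biases = [rng.standard_normal(s[0]) for s in shapes]
y = network(np.ones(4), weights, biases)   # one-dimensional output
```

Each entry of `weights` and `biases` corresponds to one layer; the depth of this network is two, since it has two hidden layers.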
Large databases and increasing computational power have recently resulted in astonishing performance of multilayer neural networks, or deep nets, for a broad range of learning tasks, including image and text classification, speech recognition, and game playing. Figure 2 shows the performance of a deep network for object recognition. For this task, a so-called convolutional neural network (CNN) is used. The input of the CNN consists of the pixels of the image in (A). Some of the learned features in the first layer are displayed in (B). The CNN outputs the probabilities for the classes (C). In this case it classifies the image correctly.
Fig.2 - Object recognition with deep CNN.
Although deep networks are a central topic in machine learning, they have so far received little attention from mathematicians. While the optimal estimation rates in high dimensions are slow due to the unavoidable curse of dimensionality, multilayer neural networks still perform well in high dimensions. It is thus natural to conjecture that multilayer neural networks form a flexible class of estimators that can avoid the curse of dimensionality by adapting to various low-dimensional structural constraints on the regression function and the design.

Recently, I finished a first preprint studying large multilayer neural networks. It is shown that estimators based on sparsely connected multilayer neural networks with ReLU activation function and properly chosen network architecture achieve the minimax estimation rates (up to log n factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. The mathematical analysis extends recent progress in approximation theory and combines it with explicit control of the network architecture and the network parameters. Interestingly, the number of layers of the neural network plays an important role, and the theory suggests scaling the network depth with the logarithm of the sample size.

Nonparametric Bayes

Statistics has long been divided into frequentist and Bayesian statistics, and this division lasts until today. Bayesian statistics is the historically older principle, dating back to the work of Thomas Bayes in the 18th century. Following the Bayesian paradigm, one specifies a prior distribution $\pi$ on the parameter space that models the prior belief about the underlying unknown parameter of interest, say $\theta.$ The posterior distribution $$ \pi(\theta | x) = \frac{p(x|\theta) \pi(\theta)}{\int p(x|\theta) \pi(\theta) d\theta}$$ can be used for point estimation and uncertainty quantification.
Fig.3 - No portrait of Thomas Bayes is known. This drawing has been wrongly associated with Bayes.
For parametric models, the Bernstein-von Mises theorem states that under weak assumptions on the prior and the model, frequentist and Bayesian inference match in the sense that confidence sets and credible sets are close for large sample sizes. For most nonparametric and high-dimensional models, the posterior distribution is much more difficult to analyze and depends on the prior even asymptotically. For a subjective Bayesian the prior is given and models the prior belief. For models with a complex parameter structure, precise knowledge of the full high-dimensional prior distribution can typically not be assumed. Instead, practitioners pick priors from a collection of standard priors. It is then natural to interpret Bayes as a frequentist method and to study properties of the posterior assuming that there exists a true parameter generating the data.
Fig.4 - Posterior distribution for coin flip experiment with $n=10$ (red) and $n=100$ (blue) trials and uniform prior. The true parameter value is 3/4.
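A posterior like the one in Figure 4 can be computed directly from the formula above. The sketch below normalizes the binomial likelihood times the uniform prior on a grid; the grid size and the numbers of heads are illustrative choices.

```python
import numpy as np

def posterior_density(k, n, grid):
    # binomial likelihood (up to a constant) times the uniform prior,
    # normalized numerically so that the density integrates to one
    unnorm = grid**k * (1.0 - grid)**(n - k)
    dx = grid[1] - grid[0]
    return unnorm / (unnorm.sum() * dx)

grid = np.linspace(0.0, 1.0, 2001)
post_small = posterior_density(8, 10, grid)    # e.g. 8 heads in 10 tosses
post_large = posterior_density(75, 100, grid)  # 75 heads in 100 tosses
# the posterior concentrates around the true parameter as n grows
mode_large = grid[np.argmax(post_large)]
```

With a uniform prior the posterior is in fact a Beta distribution, so `post_large` peaks at 75/100 = 3/4, matching the concentration visible in Figure 4.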
In previous work, we wrote an article on posterior contraction in the high-dimensional linear regression model. It can be shown that for priors that put a lot of mass on sparse models, the posterior concentrates around the true regression model. The Laplace prior that leads to the famous LASSO estimator does not, however, lead to contraction of the full posterior around the truth. In another project we studied so-called global-local shrinkage priors and derived sharp conditions on this class of priors under which the posterior adapts to the underlying sparsity.
Posterior contraction is typically derived with respect to an intrinsic norm that is induced by the underlying statistical model. We investigated the problem of deriving posterior concentration rates under different loss functions in nonparametric Bayes.

Confidence statements for qualitative constraints

If we reconstruct a function from data, there is always the question whether shape features such as maxima are artifacts of the reconstruction method or whether they are also present in the true underlying function. To answer such questions, one wants to assign confidence statements to qualitative constraints. As we do not know in advance where interesting shape features occur, we have to search over the whole domain.

One possibility is to derive a so-called multiscale statistic that combines local tests in a sophisticated way to account for the dependence among the individual tests. The mathematical challenge is then to prove convergence to a distribution-free limit from which quantiles can be obtained.
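The idea of combining local tests across scales can be sketched as follows. The code standardizes local means over dyadic intervals and penalizes each by a scale-dependent term in the spirit of the Dümbgen-Spokoiny calibration; it is an illustrative toy version, not a statistic from the papers above.

```python
import numpy as np

def multiscale_statistic(y, sigma=1.0):
    # maximum of penalized local test statistics over (overlapping)
    # dyadic intervals; a toy sketch of the multiscale idea
    n = len(y)
    stat = -np.inf
    length = 1
    while length <= n:
        for start in range(0, n - length + 1, max(1, length // 2)):
            block = y[start:start + length]
            # standardized local mean: a z-test for "no signal" here
            local = abs(block.sum()) / (sigma * np.sqrt(length))
            # penalty makes tests on different scales comparable
            penalty = np.sqrt(2.0 * np.log(n / length))
            stat = max(stat, local - penalty)
        length *= 2
    return stat
```

On flat data the statistic is small, while a localized bump on some interval pushes the corresponding penalized local test, and hence the maximum, well above zero.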

In previous work, we studied multiscale statistics for deconvolution. In more recent work, we investigated the random coefficients model.
Fig.5 - The multiscale statistic returns boxes (in black) from which information about the shape of the true function (purple) can be deduced.

Asymptotic equivalence

Nonparametric statistics deals with a large zoo of different statistical models. Although the way the data are recorded might be quite different across models, there is typically a lot of similarity once it comes to reconstruction/estimation of the hidden quantities. In many cases, for instance, we obtain the same convergence rates. The notion of asymptotic equivalence makes the similarity of statistical models precise: two statistical models are said to be asymptotically equivalent if they lead to the same asymptotic estimation theory (in a certain sense).
Asymptotic equivalence can be quite useful for statistical theory. In several cases, difficult statistical models can be proven to be asymptotically equivalent to a simpler model, and this allows us to work directly in the simpler model, avoiding for instance nasty discretization effects.
Establishing asymptotic equivalence is, however, quite hard, and each result needs a new proof strategy. Therefore, only a few results have been established so far. In previous work, we derived conditions under which nonparametric regression with dependent errors can be approximated by a continuous model. In a second project, we worked on asymptotic equivalence between the Gaussian white noise model and nonparametric density estimation.

Spot volatility estimation

The spot volatility describes the local variability of a financial asset or a portfolio. It is an important quantity for risk management and for analyzing historical data. Unfortunately, the spot volatility cannot be observed directly and has to be inferred from the price process. If the price is recorded at high frequencies, such as milliseconds, various market frictions perturb the price process. If this so-called microstructure noise is ignored, the reconstructed spot volatility will be far too large. Models with microstructure noise are hard to analyze as the microstructure noise dominates the signal at most frequencies.
Fig.6 - Asset price over one trading day and reconstruction of spot volatility.
I studied spot volatility estimation with additive microstructure noise in my dissertation. We proved that microstructure noise halves the exponent of the optimal convergence rates. We also constructed reconstruction methods that achieve these convergence rates and implemented them in a software package for Matlab.
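The dominance of microstructure noise can be seen in a toy simulation: observing the efficient price plus additive noise, the naive sum of squared price increments massively overestimates the variability. The parameter values and the crude subsampling below are illustrative only, not the wavelet methods from the papers above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 23400                     # one observation per second over a trading day
sigma, eta = 0.2, 0.005       # volatility of the efficient price, noise level

# efficient log-price: Brownian motion with constant volatility sigma on [0, 1]
dt = 1.0 / n
X = np.cumsum(sigma * np.sqrt(dt) * rng.standard_normal(n))
# observed price = efficient price + additive microstructure noise
Y = X + eta * rng.standard_normal(n)

def realized_variance(prices, step=1):
    # sum of squared increments of the (sub)sampled price
    increments = np.diff(prices[::step])
    return np.sum(increments**2)

true_iv = sigma**2                          # integrated variance on [0, 1]
rv_full = realized_variance(Y)              # noise-dominated: roughly 2*n*eta**2
rv_sparse = realized_variance(Y, step=300)  # subsampling tames the noise bias
```

At the full sampling frequency the noise contribution of order 2nη² swamps the integrated variance σ², so the naive estimate is far too large, while sparse subsampling already comes much closer to the truth.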