Picking peaks from N-dimensional NMR spectra

The problem

We can model a spectrum, $S(\mathbf{x})$, as the convolution of a discrete set of peaks with a lineshape function $f_{\boldsymbol{\gamma}}(\mathbf{x})$, plus a noise term $\eta_\alpha(\mathbf{x})$:

S(\mathbf{x}) = \sum_{\mathbf{p} \in X} f_{\boldsymbol{\gamma}}(\mathbf{p} - \mathbf{x}) + \eta_\alpha(\mathbf{x}),

where $X$ is the set of peak positions and $\boldsymbol{\gamma}$ is a vector of peak widths along each axis. The noise $\eta_\alpha$ follows a Gaussian distribution with variance $\alpha$ (Keeler, 2005).

The goal of peak picking is to take a spectrum $S$ and determine the set of peaks, $X$, that gave rise to it.

This article gives an overview of the algorithms that can be used to solve this problem.
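As an illustration of the model above, the following minimal sketch generates a synthetic spectrum on an integer grid. It assumes a separable Lorentzian lineshape (a product of one-dimensional Lorentzians with unit height) and takes the noise standard deviation, $\sqrt{\alpha}$, as input. The function names and parameters are hypothetical and chosen only for this example.

```python
import numpy as np


def lorentzian(x, center, width):
    """1D Lorentzian with unit height and half-width at half-maximum `width`."""
    return width**2 / ((x - center) ** 2 + width**2)


def synthetic_spectrum(shape, peaks, widths, noise_sd, seed=0):
    """Sum of separable Lorentzian peaks plus Gaussian noise (assumed model)."""
    rng = np.random.default_rng(seed)
    axes = [np.arange(n, dtype=float) for n in shape]
    spectrum = np.zeros(shape)
    for position in peaks:
        # Build the N-dimensional peak as an outer product of 1D lineshapes.
        peak = np.ones(())
        for axis, (center, width) in enumerate(zip(position, widths)):
            peak = np.multiply.outer(peak, lorentzian(axes[axis], center, width))
        spectrum += peak
    # noise_sd is the standard deviation, i.e. sqrt(alpha) in the model above.
    return spectrum + rng.normal(0.0, noise_sd, size=shape)


# Example: a 2D spectrum with two peaks and noise variance alpha = 0.01.
S = synthetic_spectrum((64, 64), peaks=[(20, 30), (45, 10)],
                       widths=(2.0, 3.0), noise_sd=0.1)
```

A spectrum generated this way has a known peak set $X$, which makes it convenient for checking the picking algorithms described below.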

Local maxima

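A naïve algorithm identifies peaks at the local maxima of $S$, that is, at stationary points where the Hessian $\mathbf{H}_S$ is negative definite:

X = \left\{\mathbf{x} \in \mathbb{R}^n \mid \nabla S(\mathbf{x}) = \mathbf{0},\ \mathbf{H}_S(\mathbf{x}) \prec 0\right\}

Because the gradient contains a noise contribution as well as the signal,

\nabla S(\mathbf{x}) = \sum_{\mathbf{p} \in X} \nabla f_{\boldsymbol{\gamma}}(\mathbf{p} - \mathbf{x}) + \nabla \eta_\alpha(\mathbf{x}),

this also results in the identification of peaks in the noise.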

This algorithm can be improved with two simple modifications. First, most noise peaks have a significantly lower intensity than the true peaks, so many of them can be eliminated by selecting only peaks above an intensity threshold.

Furthermore, applying a Gaussian blur to the spectrum, $S_\text{blurred} = S * G_{\sigma}$, reduces the detection of noise peaks by smoothing out high-frequency noise. Here $G_\sigma$ is the $n$-dimensional Gaussian kernel with standard deviation $\sigma$:

G_\sigma(\mathbf{x}) = \left(\sqrt{2\pi}\sigma\right)^{-n} \exp\left(-\frac{|\mathbf{x}|^2}{2\sigma^2}\right)
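The blur-then-threshold procedure can be sketched for a sampled spectrum as follows. On a grid, the zero-gradient condition is replaced by a comparison with all neighbouring points. This is a minimal NumPy-only sketch, not a reference implementation: the helper names are hypothetical, the kernel is truncated at four standard deviations, and the spectrum edges are padded by repeating the border values. Library routines such as `scipy.ndimage.gaussian_filter` could replace the hand-written blur.

```python
from itertools import product

import numpy as np


def gaussian_blur(data, sigma, truncate=4.0):
    """Separable Gaussian blur: one 1D convolution per axis (edge-padded)."""
    radius = int(truncate * sigma + 0.5)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t**2 / (2.0 * sigma**2))
    kernel /= kernel.sum()
    out = np.asarray(data, dtype=float)
    for axis in range(out.ndim):
        pad = [(radius, radius) if d == axis else (0, 0) for d in range(out.ndim)]
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(np.convolve, axis, padded, kernel, mode="valid")
    return out


def local_maxima(data, threshold):
    """Grid indices that exceed `threshold` and all of their 3^n - 1 neighbours."""
    data = np.asarray(data, dtype=float)
    padded = np.pad(data, 1, mode="constant", constant_values=-np.inf)
    is_peak = data > threshold
    for offset in product((-1, 0, 1), repeat=data.ndim):
        if not any(offset):
            continue  # skip the point itself
        window = tuple(slice(1 + o, 1 + o + n) for o, n in zip(offset, data.shape))
        is_peak &= data > padded[window]
    return np.argwhere(is_peak)


def pick_peaks(spectrum, sigma, threshold):
    """Blur the spectrum, then keep local maxima above the threshold."""
    return local_maxima(gaussian_blur(spectrum, sigma), threshold)


# Example: two sharp signals on a noisy background (all values are made up).
rng = np.random.default_rng(0)
demo = rng.normal(0.0, 0.1, size=(32, 32))
demo[10, 12] += 5.0
demo[20, 5] += 3.0
picks = pick_peaks(demo, sigma=1.5, threshold=0.1)
```

Note that in this sketch the threshold is applied to the blurred intensities, which are lower than the raw peak heights, so it has to be chosen with the blur width in mind.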

Laplacian of Gaussian

The Laplacian of a spectrum,

\nabla^2 S(\mathbf{x}) = \frac{\partial^2 S}{\partial x_1^2} + \cdots + \frac{\partial^2 S}{\partial x_n^2},

is large in magnitude where the spectrum intensity changes rapidly, and is close to zero in uniform areas. At the top of a peak the curvature is strongly negative, so the local minima of the Laplacian (equivalently, the local maxima of $-\nabla^2 S$) correspond to areas of high curvature that are likely to be spectral peaks. However, taking the second derivative amplifies high-frequency noise. As before, this noise can be attenuated using a Gaussian blur. To compensate for the loss in contrast introduced by the blur, the result is multiplied by a normalization factor of $\sigma^2$.

Since $\sigma^2\nabla^2 (G_\sigma * S) = (\sigma^2\nabla^2 G_\sigma) * S$, this can be represented as the convolution of the spectrum with a single Laplacian-of-Gaussian kernel, with the sign chosen so that peaks give a positive response:

\text{LoG}_\sigma = -\sigma^2\nabla^2 G_\sigma
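A minimal sketch of this filter on a sampled spectrum is given below. Rather than building the kernel explicitly, it blurs the spectrum and then applies a finite-difference Laplacian, which is equivalent up to discretization error. The function names, the truncation at four standard deviations, and the edge handling are assumptions made for this example.

```python
import numpy as np


def gaussian_blur(data, sigma, truncate=4.0):
    """Separable Gaussian blur: one 1D convolution per axis (edge-padded)."""
    radius = int(truncate * sigma + 0.5)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t**2 / (2.0 * sigma**2))
    kernel /= kernel.sum()
    out = np.asarray(data, dtype=float)
    for axis in range(out.ndim):
        pad = [(radius, radius) if d == axis else (0, 0) for d in range(out.ndim)]
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(np.convolve, axis, padded, kernel, mode="valid")
    return out


def log_response(spectrum, sigma):
    """Scale-normalized, sign-flipped Laplacian of the blurred spectrum."""
    blurred = gaussian_blur(spectrum, sigma)
    laplacian = np.zeros_like(blurred)
    for axis in range(blurred.ndim):
        pad = [(1, 1) if d == axis else (0, 0) for d in range(blurred.ndim)]
        padded = np.pad(blurred, pad, mode="edge")
        # Second difference along this axis: b[i+1] - 2 b[i] + b[i-1].
        laplacian += np.diff(padded, n=2, axis=axis)
    # The minus sign makes peaks appear as positive maxima of the response.
    return -sigma**2 * laplacian
```

Peaks can then be taken as the local maxima of `log_response(S, sigma)` above a threshold. A flat baseline gives a response of zero, so constant offsets do not produce picks.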

The parameter $\sigma$ determines the size of the features that are detected: a larger $\sigma$ blurs away progressively larger features, while a smaller $\sigma$ retains more high-frequency information. Since the peaks in an NMR spectrum are all roughly the same scale, a single value of $\sigma$, matched to the expected peak width, can be used for the whole spectrum.

$\text{LoG}_\sigma$ can be well approximated by the difference of two Gaussian blurs with widths $\sigma$ and $k\sigma$, where $k$ is close to 1:

\text{LoG}_\sigma \approx \frac{2}{k^2 - 1} \left(G_\sigma - G_{k\sigma}\right)
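The scale factor in this approximation can be motivated as follows (a sketch of the standard argument, written in terms of the variance $t = \sigma^2$). The Gaussian kernel satisfies the diffusion equation

\frac{\partial G}{\partial t} = \frac{1}{2}\nabla^2 G, \qquad t = \sigma^2.

Approximating the derivative by a finite difference between $t = \sigma^2$ and $t = k^2\sigma^2$ gives

G_{k\sigma} - G_\sigma \approx \left(k^2 - 1\right)\sigma^2 \cdot \frac{1}{2}\nabla^2 G_\sigma,

and rearranging,

-\sigma^2\nabla^2 G_\sigma \approx \frac{2}{k^2 - 1}\left(G_\sigma - G_{k\sigma}\right),

which becomes exact in the limit $k \to 1$.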

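Because the Gaussian kernel is separable, a Gaussian blur can be computed as a sequence of one-dimensional convolutions, which is much faster than convolution with an arbitrary kernel. The difference-of-Gaussians method is therefore preferred, especially for large or high-dimensional spectra.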
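The difference-of-Gaussians filter can be sketched with two separable blurs. As before, this is an illustrative NumPy-only sketch with hypothetical function names. The default ratio `k = 1.6` is a common choice in the image-processing literature and is an assumption here, not a value taken from this article.

```python
import numpy as np


def gaussian_blur(data, sigma, truncate=4.0):
    """Separable Gaussian blur: one 1D convolution per axis (edge-padded)."""
    radius = int(truncate * sigma + 0.5)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t**2 / (2.0 * sigma**2))
    kernel /= kernel.sum()
    out = np.asarray(data, dtype=float)
    for axis in range(out.ndim):
        pad = [(radius, radius) if d == axis else (0, 0) for d in range(out.ndim)]
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(np.convolve, axis, padded, kernel, mode="valid")
    return out


def dog_response(spectrum, sigma, k=1.6):
    """Difference of two Gaussian blurs, scaled to approximate LoG_sigma."""
    scale = 2.0 / (k**2 - 1.0)
    narrow = gaussian_blur(spectrum, sigma)
    wide = gaussian_blur(spectrum, k * sigma)
    return scale * (narrow - wide)
```

Each blur costs one pass over the data per axis, so the total work grows linearly with the number of dimensions rather than with the volume of an $n$-dimensional kernel.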

Wavelet transform

Local quadratic fitting

Curve fitting / GSD

NMR lineshape
Lorentzian distribution
Drawbacks

qGSD

qGSD is intended for quantitative applications, where knowledge of the exact lineshape or peak integral is required.
It is not really necessary for most protein structural applications.

Machine learning

DEEP picker

Validation

Peaks picked from different spectra of the same sample should be consistent.
For example, an N-HSQC peak should correspond to a set of peaks in an HNCACB spectrum with the same H/N shifts.
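Such a consistency check can be sketched as a tolerance-based match between two peak lists. The column order, the function name, and the tolerances (0.03 ppm for H, 0.3 ppm for N) are assumptions made for this example and would need to be adapted to the spectra at hand.

```python
import numpy as np


def match_hsqc_to_3d(hsqc_peaks, triple_peaks, tol_h=0.03, tol_n=0.3):
    """For each HSQC peak, list the indices of 3D peaks with matching H/N shifts.

    hsqc_peaks:   rows of (H, N) shifts in ppm
    triple_peaks: rows of (H, N, C) shifts in ppm
    """
    hsqc = np.asarray(hsqc_peaks, dtype=float)
    triple = np.asarray(triple_peaks, dtype=float)
    # Pairwise absolute shift differences, shape (n_hsqc, n_triple).
    dh = np.abs(hsqc[:, None, 0] - triple[None, :, 0])
    dn = np.abs(hsqc[:, None, 1] - triple[None, :, 1])
    close = (dh <= tol_h) & (dn <= tol_n)
    return [np.flatnonzero(row).tolist() for row in close]


# Example with made-up shifts: the first HSQC peak has two 3D partners,
# the second has none and would be flagged for inspection.
hsqc = [(8.20, 120.5), (7.95, 115.0)]
hncacb = [(8.21, 120.4, 56.1), (8.19, 120.6, 30.2), (9.10, 110.0, 45.0)]
matches = match_hsqc_to_3d(hsqc, hncacb)
```

HSQC peaks with no partners, or 3D peaks that match no HSQC peak, are candidates for noise picks or missed peaks.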

References

Keeler J (2005) Understanding NMR Spectroscopy 2nd ed. Wiley