A fast region of interest algorithm for efficient data compression and improved peak detection in high-resolution mass spectrometry

Optimisation scheme for determining the m/z deviation (δm/z)

The key parameters requiring optimisation in the two ROI algorithms are ρmin, Ithresh and δm/z. Ithresh must be set below the baseline of the least abundant peak of interest to ensure its inclusion (Fig. 1, step 3). The optimal δm/z was determined by testing values between 0.01 and 0.10 Da on seven injections of the wastewater effluent extracts on the LC-HRMS. The number of m/z traces detected in the samples by the OMG algorithm increased as a function of δm/z up to 0.02 Da, after which the number of m/z traces with a unique m/z decreased (Fig. 2a, ρgap allowed equal to zero scans).
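The trace-detection step being optimised here can be illustrated with a minimal sketch: centroids are sorted by m/z and a new trace is started wherever the gap to the previous value exceeds δm/z. This is an illustrative re-implementation in Python under our own naming, not the published OMG code:

```python
import numpy as np

def group_mz_traces(mz, delta_mz=0.02):
    """Group centroided m/z values into traces: sort them and start a
    new trace wherever the gap to the previous sorted value exceeds
    delta_mz. Illustrative sketch, not the published OMG implementation."""
    mz = np.asarray(mz)
    order = np.argsort(mz)
    sorted_mz = mz[order]
    # True where a new trace begins (gap in the m/z dimension > delta_mz)
    new_trace = np.diff(sorted_mz) > delta_mz
    trace_ids = np.concatenate(([0], np.cumsum(new_trace)))
    # Map trace ids back to the original centroid ordering
    ids = np.empty_like(trace_ids)
    ids[order] = trace_ids
    return ids

print(group_mz_traces([200.051, 200.052, 200.150, 200.151, 350.802]))
# → [0 0 1 1 2]  (three traces)
```

With δm/z = 0.02 Da the two pairs near m/z 200.05 and 200.15 fall into separate traces; widening δm/z beyond their 0.098 Da separation would merge them, which is the trace-merging effect discussed below for δm/z > 0.03 Da.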

Fig. 2

The number of m/z traces extracted from the raw data plotted against increasing m/z deviation (δm/z) in Da and Ithresh using (a) the OMG and (b) the TGJ algorithm, respectively. In panels (c) and (d), ρgap allowed values of 1 and 2 were used for the OMG algorithm, respectively. The colours blue to brown represent an increasing number of m/z traces. A value of 8 scans was used for ρmin, and ρgap allowed was set to zero for the OMG algorithm in panel (a). See Fig. 1, step 3, for a visualisation of ρmin and ρgap allowed

The mass spectral resolution was compromised at values of δm/z > 0.03 Da for this dataset (Fig. S1a-Fig. S2a). At these values, the mean and median ROI lengths exceeded the expected peak width, since adjacent ROIs in the m/z dimension, potentially originating from different chemical events, were merged into the same m/z trace. This led to higher variation within each m/z trace or ROI and increased m/z deviation due to the more heterogeneous collection of m/z measurements included in each trace (Fig. S4).

For δm/z values < 0.02 Da, potentially relevant ROIs were removed from the final data matrix. This occurred because splitting the m/z traces into multiple adjacent ROIs resulted in fewer consecutive scans than ρmin, causing their removal by the chromatographic filter. This effect was reflected in shorter mean and median ROI lengths and a reduced number of detected m/z traces (Fig. S12, Fig. 2). To address this issue, the parameter ρgap allowed was implemented. This parameter enables the retention of two ROIs separated by a user-defined number of scans, provided their combined length is at least ρmin. Generally, a ρgap allowed value of 1 or 2 yields a higher number of detected compounds and longer ROIs, especially for higher values of Ithresh and lower values of δm/z (Fig. S13). We also observed that more m/z traces were retained in the data matrix (Fig. 2a, c, d).
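The interplay between ρmin and ρgap allowed can be sketched as follows: walking through the sorted scan indices of one m/z trace, the current run of points continues as long as no more than ρgap allowed scans are missing between neighbours, and the trace passes if any run reaches ρmin points. This is a hypothetical re-implementation for illustration, not the authors' code:

```python
def passes_chrom_filter(scan_idx, rho_min=8, rho_gap_allowed=0):
    """Chromatographic-filter sketch: keep an m/z trace only if it
    contains a run of rho_min points in which at most rho_gap_allowed
    scans are missing between consecutive points. Illustrative only."""
    run = longest = 1
    for prev, cur in zip(scan_idx, scan_idx[1:]):
        gap = cur - prev - 1          # missing scans between the two points
        run = run + 1 if gap <= rho_gap_allowed else 1
        longest = max(longest, run)
    return longest >= rho_min

trace = [1, 2, 3, 4, 6, 7, 8, 9]       # 8 points with scan 5 missing
print(passes_chrom_filter(trace, 8, 0))  # False: split into two 4-scan ROIs
print(passes_chrom_filter(trace, 8, 1))  # True: the one-scan gap is tolerated
```

The example reproduces the effect described above: with ρgap allowed = 0 a single missing scan splits a genuine peak into two sub-ρmin fragments that the filter discards, whereas ρgap allowed = 1 retains it.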

In the TGJ algorithm, gaps are addressed by linear interpolation, plus a random noise contribution, between the two points on either side of the gap. All scan points in all m/z traces are consequently given an intensity value, which contrasts with the sparse nature of HRMS data. No optimum for δm/z was observed for the TGJ algorithm, based on either the number of m/z traces or the length of the ROIs, since the TGJ chromatographic filter did not require the mass spectral measurements to be in consecutive scans, as was the case for the OMG algorithm; rather, a minimum number of occurrences within an m/z trace was required. The TGJ chromatographic filter was therefore less efficient at excluding ROIs that did not correspond to a real chromatographic peak (Fig. 2, Fig. S1 and Fig. S2b). As a result, more m/z traces were retained by the TGJ algorithm (Fig. 2), leading to a lower data compression rate.
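The TGJ gap-filling step can be sketched as below. The Gaussian noise model and the `noise_sd` parameter are assumptions for illustration; the source only states that a random noise contribution is added to the interpolated values:

```python
import numpy as np

def fill_gaps(scan_idx, intensities, noise_sd=1.0, rng=None):
    """Fill missing scans in an m/z trace by linear interpolation between
    the flanking points, plus a random noise contribution, as described
    for the TGJ algorithm. The Gaussian noise model is an assumption."""
    if rng is None:
        rng = np.random.default_rng(0)
    scan_idx = np.asarray(scan_idx)
    full = np.arange(scan_idx[0], scan_idx[-1] + 1)   # every scan in range
    filled = np.interp(full, scan_idx, intensities)   # linear interpolation
    missing = ~np.isin(full, scan_idx)                # scans that had no data
    filled[missing] += rng.normal(0.0, noise_sd, missing.sum())
    return full, filled

scans, y = fill_gaps([10, 11, 14], [100.0, 200.0, 500.0], noise_sd=0.5)
```

After this step every scan between 10 and 14 carries an intensity, which is why, as noted above, the TGJ output is dense rather than sparse and its filter cannot rely on consecutive-scan runs.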

In Fig. S2, the median ROI length was shown to range from 12 to 23 scans for the OMG algorithm, compared to a single scan for the TGJ algorithm, regardless of Ithresh and δm/z. This indicates that the OMG algorithm effectively excluded ROIs shorter than a chromatographic peak. When ROIs with lengths below eight scans were excluded, the median ROI length in the TGJ algorithm increased to 11–14 scans, comparable to the OMG algorithm. The more efficient chromatographic filtering in the OMG algorithm suggests it could reduce the false positive rate in feature detection and compound identification by filtering out small spikes (step 3, Fig. 1). Reducing false positives in NTS has been the focus of several previous publications, since it increases the reliability of NTS results and reduces the time needed for manual exclusion of false positive ROIs [15, 17].

The maximum number of detected compounds for the OMG algorithm was 53, observed at δm/z > 0.06 Da and Ithresh ≤ 750 counts. The number of detected compounds decreased with decreasing δm/z and increasing Ithresh for both algorithms. However, the TGJ algorithm was less affected by these parameters with respect to the number of detected compounds. This suggests that the OMG algorithm is more sensitive to Ithresh and δm/z than the TGJ algorithm: the risk of a ROI being excluded by the OMG chromatographic filter increases when data points fall below Ithresh, or when the collection of m/z values in a trace is too heterogeneous relative to δm/z.

To mitigate this, the parameter ρgap allowed can be adjusted. Unlike the TGJ algorithm, the OMG algorithm’s chromatographic filter aligns more closely with chromatographic peak width, making it easier for data analysts to select suitable values. The m/z deviation decreased with lower δm/z and higher Ithresh values (Fig. S4), since fewer data points were included in each m/z trace, reducing the m/z variance. The m/z deviations were similar between the two algorithms, with ρgap allowed having minimal impact (Fig. S4).

The proposed optimisation scheme is also believed to be applicable to the centWave algorithm, where consecutive mass spectral measurements are required [6]. Myers et al. [18] investigated key differences in feature detection between MZmine2 and XCMS but did not address the optimisation of δm/z. As an alternative to ROI detection, Reuschenbach et al. [19, 20] have put forward a collection of algorithms named qAlgorithms, which use a probabilistic approach applicable to continuum HRMS data, reducing subjectivity. If continuum data are not available, a user-defined m/z threshold must still be optimised, in which case the approach loses its advantage of being user-parameter-free.

In this study, an optimal δm/z value of 0.02 Da was identified. At this value, together with an Ithresh of 200 counts, a ten-fold reduction in data size was achieved when storing the pre-processed files on disk. The same optimisation was applied to the OMG algorithm with δm/z specified in ppm. We achieved a mean RMS-mDev of 9 ppm using a δm/z of 36 ppm (~0.02 Da at m/z 500) and an Ithresh of 200 counts, compared to a mean RMS-mDev of 20 ppm when using a δm/z of 0.02 Da. The improvement with the ppm specification was expected, since TOF mass spectrometry inherently maintains an approximately constant ppm deviation across the m/z range [21]. This trend was evident in the data, where m/z deviations were higher for compounds with lower m/z values when δm/z was specified in Da.
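The advantage of a ppm-based tolerance follows directly from the conversion between the two specifications, sketched below (the function names are our own):

```python
def ppm_to_da(delta_ppm, mz):
    """Absolute tolerance (Da) corresponding to a relative one (ppm) at a given m/z."""
    return delta_ppm * mz / 1e6

def da_to_ppm(delta_da, mz):
    """Relative tolerance (ppm) corresponding to an absolute one (Da) at a given m/z."""
    return delta_da / mz * 1e6

# A 36 ppm tolerance is ~0.02 Da at m/z 500, but only 0.0036 Da at m/z 100.
# Conversely, a fixed 0.02 Da window corresponds to 200 ppm at m/z 100,
# i.e. it is far wider than necessary at the low-m/z end.
print(ppm_to_da(36, 500))    # 0.018
print(da_to_ppm(0.02, 100))  # 200.0
```

A fixed δm/z in Da therefore over-merges traces at low m/z, consistent with the higher m/z deviations observed for low-m/z compounds when δm/z was specified in Da.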

Due to the chromatographic filter implemented in the TGJ algorithm, the presented δm/z optimisation could not be used; instead, δm/z must be selected manually. Previous studies suggest setting δm/z as a multiple of the instrument's mass resolution, emphasising the need for analysts to evaluate suitability for each dataset [8, 9]. However, this manual approach introduces variability in results, based on the data analyst's choices with respect to the observed ROI length (Fig. S12), m/z deviation (Fig. S4), and the number of detected ROIs (Fig. 2).

Scalability of the ROI algorithms and its implications for compound detection

The maximum processing time for one sample was 15 s vs. 2.0 × 10³ s for the OMG and the TGJ algorithms, respectively (Fig. 3a and b). The improvement in processing time ranged from a factor of 3 to a factor of 166 across the tested values of δm/z and Ithresh. Figure 3a and b show that processing time for both algorithms increased as Ithresh decreased, with the TGJ algorithm being more sensitive to δm/z. Specifically, for the TGJ algorithm, reducing Ithresh from 2000 to 200 caused processing time to increase by factors of 68 and 240 for δm/z values of 0.1 and 0.01, respectively. In comparison, the OMG algorithm showed smaller increases, with processing time rising by factors of 14 and 11 for the same δm/z values.

Fig. 3

Mean processing time (in seconds) for seven replicate injections of the pooled wastewater effluent extract, plotted as a function of the m/z deviation (δm/z) and noise threshold (Ithresh, secondary x-axis) for the OMG (a) and TGJ (b) algorithms. The tested δm/z values and noise thresholds are shown in the figure. The primary x-axis indicates the mean number of data points remaining in the data file after excluding those with intensity < Ithresh (Fig. 1, steps 2 and 3). A value of ρgap allowed = 0 was used for the OMG algorithm

Since Ithresh was used as a proxy for file size, Fig. 3 shows that the TGJ algorithm's processing time increased more than linearly with the number of data points, which rose from 3.0 × 10⁵ to 6.3 × 10⁶, equivalent to a 21-fold increase in data size. In contrast, the OMG algorithm showed a sub-linear increase, with the slope of the relationship between processing time and the number of data points ranging from ~0.5 to 0.7 (i.e. < 1). For the TGJ algorithm, the slopes were significantly steeper, ranging from 3 to 11 for δm/z values of 0.1 and 0.01, respectively. Consequently, the difference in processing time between the OMG and TGJ algorithms increased with decreasing δm/z and Ithresh.

The improved speed of the OMG algorithm can be attributed to several factors: primarily a more efficient m/z trace detection procedure (Fig. 1, step 2), and, to a lesser extent, memory pre-allocation and a smaller memory footprint, lower computational complexity and vectorised operations. These improvements allow data analysts to optimise δm/z and Ithresh in multiple steps without major concerns about processing time. It is noteworthy that all m/z values and intensities must be loaded into memory for the sorting step of the OMG algorithm, whereas the TGJ algorithm could be modified to work scan-wise, reducing the memory footprint at the price of higher computation time. However, this was not done in the tested implementation of the TGJ algorithm.

This flexibility is critical, as previous studies have recommended setting Ithresh at 0.25% of the most intense data point, which in this study corresponds to an Ithresh of ~7500 counts [9]. Applying this threshold would have resulted in only 36 of the 57 investigated compounds being detected, due to insufficient data points per peak. The most intense signal in a chromatogram may not always be a peak relevant to the problem at hand; it could arise from blank contamination, compounds eluting in the washing phase or irrelevant sample components. Such signals can be orders of magnitude more intense than the peaks of interest. Whilst a high Ithresh can reduce data compression time [22], its impact on the OMG algorithm is minimal: even with the most conservative choices of δm/z (0.01 Da) and Ithresh (200 counts), the maximum processing time was approx. 15 s (Fig. 3). In contrast, other studies, such as Dalmau, Bedia and Tauler [9], have used higher Ithresh values, potentially excluding trace-level compounds below the applied threshold. Similarly, Schöneich et al. [23] demonstrated that noise threshold selection critically affects compound detection rates: at a spike level of 50 ppb, only 6–7 out of 18 spiked pesticides were detected using a noise threshold of 0.1%, whilst none were detected at 1 ppb. Our observations align with these findings: reducing Ithresh for the OMG algorithm increased the number of detected compounds (Fig. S3). Lower Ithresh values preserved more data points at the edges of chromatographic peaks, ensuring that the number of data points per peak met or exceeded ρmin. This allowed the detection of more compounds, especially those at trace levels.

Augmenting data matrices across samples

The data matrices for the 21 LC-HRMS chromatograms of the wastewater effluent sample were augmented (stacked row-wise) using δm/z = 0.02 Da for both algorithms. The Ithresh values were set at 200 counts for OMG and 3500 counts for TGJ. This resulted in a mean RMS-mDev of 17 ppm for OMG and 11 ppm for TGJ across the 21 injections. The lower RMS-mDev for TGJ was due to its higher noise threshold (Fig. S4). Both algorithms detected 53 compounds under these conditions (Fig. 4a, b).

Fig. 4

a, b Compounds detected in the same m/z trace in one sample as in the majority of the pooled wastewater extracts, i.e. quality control (QC) injection, are shown in blue for both OMG and TGJ algorithms. If a compound was detected in a different m/z trace compared to the majority, it is displayed in white. Compounds that were not detected are shown in black. c The elution profiles of theobromine (compound 4) across the 21 injections, highlighting a misclassification in the OMG algorithm where the compound was grouped into two separate m/z traces. These m/z traces are represented by cyan and black lines, with the Δm/z values indicated on the plot. d A similar scenario for amisulpride (compound 9) using the TGJ algorithm, showing misclassification into two m/z traces. For the OMG algorithm, a ρgap allowed value of 0 was applied

In Fig. 4a, b, the OMG algorithm demonstrated a higher success rate in augmenting compounds, as a larger fraction of the 53 compounds had identical m/z values in the augmented data matrix. Using the OMG algorithm, only one compound in six samples (out of 53 compounds across 21 chromatograms) was wrongly augmented into different m/z traces. For TGJ, this occurred for 12 compounds across 19 samples.

Examples of elution profiles of misclassified compounds are shown in Fig. 4c, d, clearly indicating that the profiles in each panel originate from the same compound. Ideally, each compound should be grouped into the same m/z trace throughout the augmented matrix, resulting in identical m/z deviations across all samples. Deviations from this uniformity in RMS-mDev indicate misclassification, where one or more compounds were incorrectly grouped into separate m/z traces across replicates.

Misclassified compounds can be identified automatically by comparing the m/z values of each compound in each sample to the most prevalent m/z value for that compound across all 21 replicates (Eq. 3). Compounds with m/z values that deviate from this most prevalent value indicate unsuccessful grouping. This approach provides a systematic method for identifying and resolving misclassified m/z traces in augmented data matrices.
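This check can be sketched in a few lines: take the modal trace m/z for a compound across the replicates and flag any sample whose trace m/z differs from it. This is an illustrative paraphrase of the check described around Eq. 3, not the published code, and the equality tolerance `tol` is our own assumption:

```python
from collections import Counter

def flag_misclassified(trace_mz_per_sample, tol=1e-6):
    """Flag replicates in which a compound was grouped into a different
    m/z trace: compare each sample's trace m/z with the most prevalent
    (modal) value across all replicates. Sketch of the Eq. 3 check;
    the tolerance parameter is an assumption."""
    mode_mz, _ = Counter(trace_mz_per_sample).most_common(1)[0]
    return [abs(mz - mode_mz) > tol for mz in trace_mz_per_sample]

# 21 replicate injections; one replicate landed in a neighbouring trace
mzs = [250.0912] * 20 + [250.1105]
print(sum(flag_misclassified(mzs)))  # 1 misclassified replicate
```

Applied per compound over the augmented matrix, this flags exactly the white cells of Fig. 4a, b, providing the systematic resolution step described above.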

The high success rate of data augmentation in the OMG algorithm is pivotal for its effectiveness as a pre-processing step for curve-resolution methods with tri-linearity constraints. Misclassified m/z traces for a single compound would introduce non-rank-one contributions, undermining the trilinear structure. Similarly, in feature detection workflows, incorrect grouping of m/z traces would falsely increase the perceived complexity of the sample, leading to inaccurate results. The high success rate of grouping m/z traces serves as a good foundation for further grouping of adducts, fragments, and in-source fragments.
