PFAS are a group of synthetic chemicals that have gained significant attention because of their widespread presence and their potential adverse effects on human and environmental health. They are considered harmful because they are persistent, bioaccumulative, toxic, and widely distributed as contaminants. Identifying PFAS and understanding their presence and behaviour in the environment is a crucial step in protecting the environment and human health. However, PFAS are complex and diverse, which makes it difficult to develop a standardised method for their detection and identification. In addition, new PFAS that are not well characterised or understood are being synthesised and introduced into the environment. Once released, they can transform into other compounds, which makes them even more challenging to identify.
To address this challenge, machine learning models are being developed to predict whether a compound and its fragments are PFAS related. The data used for this come from the MassBank database, a record that contains some of the known PFAS. This data presents a few challenges: the number of PFAS spectra is limited, the diversity of PFAS compounds is limited, and, although the data come from actual measurements, the compound mass itself is not measured. To address these issues, a synthetic measured mass is derived for each compound, and several Kendrick mass defects (KMD) are calculated from that mass.
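As an illustration of the Kendrick mass defect calculation mentioned above, the following is a minimal Python sketch. The choice of CF2 and CH2 as Kendrick bases, the function names, and the sign convention (nominal Kendrick mass minus Kendrick mass) are assumptions made for this example; the text does not state which bases or convention the project uses.

```python
# Monoisotopic masses of the repeating units used as Kendrick bases.
# Assumption: CF2 and CH2 are chosen for illustration only.
KENDRICK_BASES = {
    "CF2": 49.996806,  # 12.000000 + 2 * 18.998403
    "CH2": 14.015650,  # 12.000000 + 2 * 1.007825
}


def kendrick_mass(mass, base="CF2"):
    """Rescale a mass so that the chosen repeating unit has an integer mass."""
    exact = KENDRICK_BASES[base]
    nominal = round(exact)  # 50 for CF2, 14 for CH2
    return mass * nominal / exact


def kendrick_mass_defect(mass, base="CF2"):
    """Kendrick mass defect, here defined as nominal Kendrick mass minus Kendrick mass.

    Assumption: some authors use the opposite sign.
    """
    km = kendrick_mass(mass, base)
    return round(km) - km


# Monoisotopic masses of two homologues that differ by one CF2 unit.
pfoa = 413.9737  # C8HF15O2
pfna = 463.9705  # C9HF17O2

for name, mass in [("PFOA", pfoa), ("PFNA", pfna)]:
    defects = {base: kendrick_mass_defect(mass, base) for base in KENDRICK_BASES}
    print(name, defects)
```

On the CF2 base, compounds that differ only in the number of CF2 units share almost the same mass defect, which is why KMD values can be useful features for recognising PFAS homologous series.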
MystMatch - An Algorithm for Matching Unidentified Compounds from Spectral Data
Non-targeted analysis (NTA) is essential for identifying chemicals in complex matrices, yet existing tools often focus on known compounds, which limits their ability to address novel or unexpected substances. To bridge this gap, we present MystMatch, a novel algorithm for matching unidentified compounds in high-resolution mass spectrometry (HRMS) data across chromatograms from diverse instruments and methodologies. MystMatch calculates match scores based on precursor ion mass-to-charge ratios, retention times, and fragmentation patterns, and groups spectra that are likely to originate from the same chemical species. Designed for scalability, the algorithm efficiently processes millions of spectra and adapts to varied datasets and laboratory workflows. This approach enables post-experimental feature alignment without requiring reference standards, facilitating the identification of recurring patterns and emerging contaminants. By leveraging previously untapped data, MystMatch enhances NTA workflows, supporting environmental monitoring and the prioritisation of unidentified compounds of concern.
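To make the idea of a match score concrete, the following Python sketch combines the three kinds of evidence named above: precursor m/z, retention time, and fragmentation pattern. It is not the MystMatch scoring function, which is not described here. The tolerances, the equal weighting, the binned cosine similarity, and all names (`match_score`, `cosine_similarity`, the dictionary keys) are assumptions chosen for illustration.

```python
from math import sqrt


def cosine_similarity(frags_a, frags_b, bin_width=0.01):
    """Cosine similarity of two fragment lists [(mz, intensity), ...] after m/z binning."""
    def binned(frags):
        bins = {}
        for mz, intensity in frags:
            key = round(mz / bin_width)
            bins[key] = bins.get(key, 0.0) + intensity
        return bins

    a, b = binned(frags_a), binned(frags_b)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def match_score(spec_a, spec_b, ppm_tol=5.0, rt_tol=0.5):
    """Hypothetical score in [0, 1] for two spectra.

    Each spectrum is a dict with keys "precursor_mz", "rt" (minutes) and
    "fragments" (list of (mz, intensity) pairs). All tolerances are assumed.
    """
    ppm_error = (abs(spec_a["precursor_mz"] - spec_b["precursor_mz"])
                 / spec_a["precursor_mz"] * 1e6)
    if ppm_error > ppm_tol:
        return 0.0  # precursor masses disagree, treat as different species

    mz_term = 1.0 - ppm_error / ppm_tol
    # Retention time is a soft term, since it can shift between methods.
    rt_term = max(0.0, 1.0 - abs(spec_a["rt"] - spec_b["rt"]) / rt_tol)
    frag_term = cosine_similarity(spec_a["fragments"], spec_b["fragments"])
    return (mz_term + rt_term + frag_term) / 3.0


# Example spectra with made-up values.
a = {"precursor_mz": 412.9664, "rt": 6.20,
     "fragments": [(368.976, 100.0), (168.989, 35.0)]}
b = {"precursor_mz": 412.9668, "rt": 6.35,
     "fragments": [(368.977, 90.0), (168.988, 40.0)]}
print(match_score(a, b))
```

Spectra whose pairwise scores exceed a chosen threshold could then be grouped as candidates for the same chemical species. Comparing every pair directly would not scale to millions of spectra, so a practical implementation would first need to restrict comparisons, for example by indexing on precursor m/z.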
Research Outputs