Per- and polyfluoroalkyl substances (PFAS) are a group of synthetic chemicals that have gained significant attention because of their widespread presence and their potential adverse effects on human and environmental health. They are considered harmful because they are persistent, bioaccumulative, and toxic, and because contamination is widespread. Identifying PFAS and understanding their presence and behaviour in the environment is therefore a crucial step in protecting both the environment and human health. However, PFAS are complex and diverse, which makes it difficult to develop a standardized method for their detection and identification. In addition, new PFAS that are not yet well characterised or understood continue to be synthesized and introduced into the environment. Once released, they have the potential to transform, which makes identifying them even more challenging.
To address this challenge, machine learning models are being developed to predict whether a compound and its fragments are PFAS-related. The data used for this come from the MassBank database, a repository of mass spectra that includes some of the known PFAS. The data have three limitations: the number of PFAS spectra is small, the diversity of the PFAS compounds covered is limited, and, although the spectra come from actual measurements, the compound mass itself is not measured. To address these issues, a synthetic measured mass is derived for each compound, and several Kendrick mass defects (KMD) are calculated from it. Furthermore, the loss values are used as features instead of the fragments themselves. Several steps are taken to make the data more balanced and better distributed for the model: a weight term is added to individual data points, the data are stratified into specific cross-validation strata, and distinct compounds with low representation are bootstrapped, so that the model can make the most of the limited data.
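The two feature types mentioned above can be illustrated with a minimal Python sketch. It is not the implementation used in this work. The function names, the choice of base units (CF2 and CH2), and the definition of a loss as precursor m/z minus fragment m/z are assumptions made for illustration, and the derivation of the synthetic measured mass is not reproduced here, so the mass is simply passed in as an argument.

```python
# Exact (monoisotopic) and nominal masses of candidate repeating units.
# Which base units are actually used is an assumption made for this sketch.
BASE_UNITS = {
    "CF2": (49.9968063, 50),
    "CH2": (14.0156501, 14),
}


def kendrick_mass_defect(mass, base="CF2"):
    """Return the Kendrick mass defect of `mass` for the given base unit.

    The mass is rescaled so that the base unit weighs exactly its nominal
    mass. The defect is taken here as nominal Kendrick mass minus Kendrick
    mass; some authors use the opposite sign.
    """
    exact, nominal = BASE_UNITS[base]
    kendrick_mass = mass * nominal / exact
    return round(kendrick_mass) - kendrick_mass


def neutral_losses(precursor_mz, fragment_mzs):
    """Return precursor-minus-fragment differences (hypothetical loss features)."""
    return [precursor_mz - mz for mz in fragment_mzs if mz <= precursor_mz]


# Illustrative values: approximately the [M-H]- ion of PFOA and two fragments.
precursor = 412.9664
fragments = [368.9766, 168.9894]

features = {
    "kmd_cf2": kendrick_mass_defect(precursor, "CF2"),
    "kmd_ch2": kendrick_mass_defect(precursor, "CH2"),
    "losses": neutral_losses(precursor, fragments),
}
print(features)
```

A useful property of the CF2-based defect is that compounds differing only by a number of CF2 units share nearly the same value, which is why it can help group homologous PFAS.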
Two distinct models have been developed: model 1 aims to classify compounds into specific classes, while model 2 focuses only on identifying compounds of interest.
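Both models depend on the balancing steps described earlier: weighting, stratification, and bootstrapping. The sketch below shows one common way to implement each step using only the Python standard library. The inverse-frequency weighting formula, the round-robin fold assignment, the `min_count` threshold, and the class labels are all hypothetical choices for illustration, not the procedure used in this work.

```python
import random
from collections import Counter, defaultdict


def balanced_weights(labels):
    """Weight each sample inversely to the frequency of its class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]


def _indices_by_class(labels):
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    return by_class


def stratified_folds(labels, n_folds, seed=0):
    """Split sample indices into folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    by_class = _indices_by_class(labels)
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        # Deal the shuffled indices of each class out round-robin.
        for j, i in enumerate(idx):
            folds[j % n_folds].append(i)
    return folds


def bootstrap_minority(labels, min_count, seed=0):
    """Resample under-represented classes with replacement up to min_count."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    by_class = _indices_by_class(labels)
    for y in sorted(by_class):
        idx = by_class[y]
        if len(idx) < min_count:
            indices.extend(rng.choices(idx, k=min_count - len(idx)))
    return indices


# Hypothetical labels for a small, imbalanced data set.
labels = ["PFAS"] * 2 + ["non-PFAS"] * 6
print(balanced_weights(labels))
print(stratified_folds(labels, n_folds=2))
print(bootstrap_minority(labels, min_count=4))
```

If bootstrapping is combined with cross-validation, it is usually safer to resample only within the training folds, so that copies of the same spectrum do not appear in both the training and validation data.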