Chemometrics and machine learning

We develop and apply chemometric and machine learning methods to tackle real problems in chemistry, toxicology, pharmacology and environmental sciences. Specific research interests in multivariate modelling include neural networks, variable selection, data fusion, ranking methods, supervised classification, correlation and information measures, and multicriteria decision making.
In this framework, new methods have been proposed in the scientific literature for supervised pattern recognition and classification (CAIMAN, N3, BNN), neural networks (K-CM), variable reduction and selection (W-VSP, Reshaped Sequential Replacement), unsupervised data analysis (MADS, MOLMAP approach for 3D analytical data), similarity and distance measures (new similarity coefficients for binary data, Locally-centred Mahalanobis distance, higher-order similarity measures), correlation measures (Canonical Measure of Correlation) and multicriteria decision making (Weighted Power-Weakness Ratio).
On the application side, machine learning and chemometrics are applied to problems ranging from analytical profiles and signals (mainly related to environmental and food matrices) to molecular modelling (QSAR and virtual screening).
Visit this list for an overview of scientific publications related to new theoretical proposals and applications in chemometrics and machine learning.

Molecular descriptors

Molecular descriptors capture different aspects of the structural information of molecules, and they underpin many contemporary computer-assisted toxicological and chemical applications. Since the beginning, Milano Chemometrics has studied and developed new theoretically based molecular descriptors, such as WHIM (Weighted Holistic Invariant Molecular descriptors), G-WHIM (Grid-Weighted Holistic Invariant Molecular descriptors) and GETAWAY (GEometry, Topology and Atoms-Weighted AssemblY) descriptors, and has evaluated their ability to model different physico-chemical, biological and environmental responses. Originally, the DRAGON software was developed to calculate molecular descriptors.
Moreover, the second edition of the Handbook of Molecular Descriptors (Molecular Descriptors for Chemoinformatics by Roberto Todeschini and Viviana Consonni) has been published by Wiley-VCH. It is an encyclopedic collection of molecular descriptors, covering them from their beginnings. About 3300 definitions, presented in alphabetical order, allow not only rapid consultation but also an organized study of the algorithms, meanings and tables of molecular descriptors, QSAR strategies, and other related topics.
In this framework, the MOLE db – Molecular Descriptors Data Base has been released. This is a free online database comprising 1124 molecular descriptors calculated on 234773 molecules of the NCI database.

QSAR, QSPR and chemical modelling

QSAR models are currently regarded as a scientifically reliable tool for predicting and classifying properties of untested chemicals. QSARs are based on the assumption that the structure of a molecule (for example, its geometric, steric and electronic properties) must contain the features responsible for its physical, chemical, and biological properties, and on the ability to capture these features in one or more numerical descriptors. Milano Chemometrics has been involved in several projects related to the proposal and use of QSAR for the REACH registration of chemicals, such as the study of the relationships between the molecular structures of dyes and their toxicological properties, and the use of in-silico models to develop a new, safe, multifunctional accelerator curative molecule that can replace thiourea-based accelerators in the vulcanisation process.
Milano Chemometrics has been involved in the development of new QSAR models addressing the prediction of several properties (such as bioaccumulation, biodegradation and acute toxicity), which have been proposed in the literature. Research is also devoted to evaluating new and existing strategies to define the Applicability Domain of QSAR models, that is, the chemical domain where QSAR predictions can be assumed to be reliable. Finally, consensus modelling and data fusion of QSAR predictions are among the research topics considered.
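
The idea of an Applicability Domain can be illustrated with a minimal sketch based on leverages, one widely used strategy among many and not necessarily the one adopted in the group's toolboxes. It assumes that molecules are rows and descriptors are columns of a numeric matrix; the function names (leverages, inside_domain) and the 3p/n warning threshold (a common rule of thumb; some authors use 3(p+1)/n) are assumptions made here for illustration.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage of each query row with respect to the training descriptors."""
    # Centre both matrices on the training means, so that the domain is
    # defined relative to the centroid of the training set.
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    Qc = X_query - mean
    # The pseudo-inverse keeps the sketch usable with collinear descriptors.
    core = np.linalg.pinv(Xc.T @ Xc)
    # h_i = q_i (X'X)^-1 q_i' for each centred query row q_i.
    return np.einsum("ij,jk,ik->i", Qc, core, Qc)


def inside_domain(X_train, X_query):
    """Flag query rows whose leverage does not exceed the 3p/n threshold."""
    n, p = X_train.shape
    threshold = 3.0 * p / n  # assumed rule of thumb, see the text above
    return leverages(X_train, X_query) <= threshold


if __name__ == "__main__":
    # Synthetic descriptor values, for illustration only (not real molecules).
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(50, 3))
    X_query = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
    print(inside_domain(X_train, X_query))
```

A query molecule close to the centroid of the training descriptors gets a low leverage and is flagged as inside the domain, while one far from it is flagged as outside, so its prediction would be treated as an extrapolation.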


We develop and distribute (for free) software and toolboxes to calculate multivariate models (such as the PCA toolbox, the classification toolbox, the Kohonen and CPANN toolbox), to assess the Applicability Domain of QSAR models, and for virtual screening, as well as KNIME workflows. Moreover, benchmark QSAR datasets are available for download. See the download page for further details.

Development, optimization and validation of new analytical methods

We develop, optimize and validate new analytical methods (in particular HPLC-MS/MS, UHPLC-MS/MS, online SPE HPLC-MS/MS) for the identification and determination of target and non-target species in the environment (chloroanilines, aromatic sulfonates, pesticides, perfluorocompounds, etc.), in food (dyes, biogenic amines, PAHs, aldehydes, etc.), and in biological samples (drugs of abuse, benzodiazepines, neurotransmitters, etc.).

Food and environmental application of analytical methods

Degradation studies and identification of new emerging pollutants in the environment: advanced oxidation processes are generally employed for the destruction of persistent pollutants in the environment. Nevertheless, these kinds of processes do not always lead to complete mineralization of the pollutant, and may instead lead to the formation of new products of comparable toxicity. The studies also deal with the natural solar photodegradation of the pollutant in water, the evaluation of the kinetics, and the identification of the new species formed, by HPLC-MS/MS or UHPLC-MS/MS using target and non-target approaches.
Identification and determination of unknown compounds in food: identification and determination of unknown species formed in food and beverages through the effect of sunlight or through unexpected interactions with other ingredients. These interactions are often unpredictable, but they can give rise to different kinds of contamination, with the formation of new species potentially harmful to the consumer's health. For this purpose, methods based on HPLC and UHPLC coupled with low- and high-resolution tandem mass spectrometry are developed and validated. In the latter case, the use of specific software for data contextualization and the application of multivariate chemometric techniques (PCA, PCA-DA, …) for data interpretation are of paramount importance.
Chemical characterization of food: full fingerprinting of food (cheese, wine, salami, tomato sauce, olive oil, etc.) by HPLC-DAD, HPLC-MS/MS, UHPLC-MS/MS, IC, GC-MS and ICP OES, in order to perform traceability and authentication studies.
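
As a minimal sketch of the PCA step mentioned above for the interpretation of analytical data, the following computes scores, loadings and explained variance through a singular value decomposition. It assumes samples in rows and measured variables (for example, peak areas) in columns, with no constant variable. Autoscaling is used as the pretreatment, which is a common choice in chemometrics but is an assumption here, and the function name pca is hypothetical.

```python
import numpy as np


def pca(X, n_components=2):
    """Return scores, loadings and explained-variance ratios of autoscaled X."""
    # Autoscale: centre each variable and scale it to unit variance.
    # Assumes no variable is constant (its standard deviation would be zero).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the scaled matrix; the rows of Vt are the principal directions.
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    # Fraction of the total variance carried by each retained component.
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]
```

The score plot (first versus second column of the scores) is then typically inspected for groupings of samples, and the loadings show which measured variables drive those groupings.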