Classification and interpretation in quantitative structure-activity relationships

A good QSAR model comprises several components. Predictive accuracy is paramount, but it is not the only important aspect. In addition, one should apply robust and appropriate statistical tests to the models to assess their significance or the significance of any apparent improvements. The real impa...

Bibliographic Details
Main Author: Bruce, Craig L.
Format: Thesis (University of Nottingham only)
Language: English
Published: 2010
Subjects: qsar; data mining; cheminformatics; chemoinformatics; random forest; machine learning
Online Access: https://eprints.nottingham.ac.uk/11666/
_version_ 1848791329914486784
author Bruce, Craig L.
building Nottingham Research Data Repository
collection Online Access
description A good QSAR model comprises several components. Predictive accuracy is paramount, but it is not the only important aspect. In addition, one should apply robust and appropriate statistical tests to the models to assess their significance or the significance of any apparent improvements. The real impact of a QSAR, however, perhaps lies in its chemical insight and interpretation, an aspect which is often overlooked. This thesis covers three main topics: a comparison of contemporary classifiers, interpretability of random forests and usage of interpretable descriptors. The selection of data mining technique and descriptors entirely determines the available interpretation. Using interpretable approaches, we have demonstrated their success on a variety of data sets. By using robust multiple comparison statistics with eight data sets, we demonstrate that a random forest has predictive accuracy comparable to that of the de facto standard, the support vector machine. A random forest is inherently more interpretable than a support vector machine, due to the underlying tree construction. We can extract some chemical insight from the random forest. However, with additional tools further insight would be available. A decision tree is easier to interpret than a random forest. Therefore, to obtain useful interpretation from a random forest we have employed a selection of tools. These include alternative representations of the trees using SMILES and SMARTS. Using existing methods we can compare and cluster the trees in this representation. Descriptor analysis and importance can be measured at the tree and forest level. Pathways in the trees can be compared and frequently occurring subgraphs identified. These tools have been built around the Weka machine learning workbench and are designed to allow further additions of new functionality. The interpretability of a model is dependent on the model and the descriptors. They must describe something meaningful. To this end we have used the TMACC descriptors in the Solubility Challenge and literature data sets. We report how our retrospective analysis confirms existing knowledge and how we identify novel C-domain inhibition of ACE. In order to test our hypotheses we extended and developed existing software, forming two applications. The Nottingham Cheminformatics Workbench (NCW) generates TMACC descriptors and allows the user to build and analyse models, including visualising the chemical interpretation. Forest Based Interpretation (FBI) provides various tools for interpreting a random forest model. Both applications are written in Java with full documentation, and simple installation wizards are available for Windows, Linux and Mac.
first_indexed 2025-11-14T18:26:47Z
format Thesis (University of Nottingham only)
id nottingham-11666
institution University of Nottingham Malaysia Campus
institution_category Local University
language English
last_indexed 2025-11-14T18:26:47Z
publishDate 2010
recordtype eprints
repository_type Digital Repository
spelling nottingham-11666 2025-02-28T11:14:53Z https://eprints.nottingham.ac.uk/11666/ 2010-12-09 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en arr https://eprints.nottingham.ac.uk/11666/1/thesis-final.pdf Bruce, Craig L. (2010) Classification and interpretation in quantitative structure-activity relationships. PhD thesis, University of Nottingham.
title Classification and interpretation in quantitative structure-activity relationships
topic qsar data mining cheminformatics chemoinformatics random forest machine learning
url https://eprints.nottingham.ac.uk/11666/