Classification and interpretation in quantitative structure-activity relationships

A good QSAR model comprises several components. Predictive accuracy is paramount, but it is not the only important aspect. In addition, one should apply robust and appropriate statistical tests to the models to assess their significance or the significance of any apparent improvements. The real impa...

Bibliographic Details
Main Author:	Bruce, Craig L.
Format:	Thesis (University of Nottingham only)
Language:	English
Published:	2010
Subjects:	QSAR; data mining; cheminformatics; chemoinformatics; random forest; machine learning
Online Access:	https://eprints.nottingham.ac.uk/11666/

author	Bruce, Craig L.
building	Nottingham Research Data Repository
collection	Online Access
description	A good QSAR model comprises several components. Predictive accuracy is paramount, but it is not the only important aspect. In addition, one should apply robust and appropriate statistical tests to the models to assess their significance or the significance of any apparent improvements. The real impact of a QSAR, however, perhaps lies in its chemical insight and interpretation, an aspect which is often overlooked. This thesis covers three main topics: a comparison of contemporary classifiers, interpretability of random forests and usage of interpretable descriptors. The selection of data mining technique and descriptors entirely determines the available interpretation. We have demonstrated the success of interpretable approaches on a variety of data sets. By using robust multiple comparison statistics with eight data sets we demonstrate that a random forest has predictive accuracy comparable to that of the de facto standard, the support vector machine. A random forest is inherently more interpretable than a support vector machine, due to the underlying tree construction. We can extract some chemical insight from the random forest. However, with additional tools further insight would be available. A decision tree is easier to interpret than a random forest. Therefore, to obtain useful interpretation from a random forest we have employed a selection of tools. These include alternative representations of the trees using SMILES and SMARTS. Using existing methods we can compare and cluster the trees in this representation. Descriptor analysis and importance can be measured at the tree and forest level. Pathways in the trees can be compared and frequently occurring subgraphs identified. These tools have been built around the Weka machine learning workbench and are designed to allow further additions of new functionality. The interpretability of a model is dependent on the model and the descriptors. They must describe something meaningful. 
To this end we have used the TMACC descriptors in the Solubility Challenge and literature data sets. We report how our retrospective analysis confirms existing knowledge and how we identify novel C-domain inhibition of ACE. To test our hypotheses we extended and developed existing software, forming two applications. The Nottingham Cheminformatics Workbench (NCW) generates TMACC descriptors and allows the user to build and analyse models, including visualising the chemical interpretation. Forest Based Interpretation (FBI) provides various tools for interpreting a random forest model. Both applications are written in Java with full documentation, and simple installation wizards are available for Windows, Linux and Mac.
first_indexed	2025-11-14T18:26:47Z
format	Thesis (University of Nottingham only)
id	nottingham-11666
institution	University of Nottingham Malaysia Campus
institution_category	Local University
language	English
last_indexed	2025-11-14T18:26:47Z
publishDate	2010
recordtype	eprints
repository_type	Digital Repository
spelling	nottingham-11666 2025-02-28T11:14:53Z https://eprints.nottingham.ac.uk/11666/ Classification and interpretation in quantitative structure-activity relationships Bruce, Craig L. 2010-12-09 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en arr https://eprints.nottingham.ac.uk/11666/1/thesis-final.pdf Bruce, Craig L. (2010) Classification and interpretation in quantitative structure-activity relationships. PhD thesis, University of Nottingham.
title	Classification and interpretation in quantitative structure-activity relationships
topic	QSAR; data mining; cheminformatics; chemoinformatics; random forest; machine learning
url	https://eprints.nottingham.ac.uk/11666/

Similar Items