Explaining anomalies in coal proximity and coal processing data with Shapley and tree-based models

Modelling the characteristics and composition of coal is important, as proximity data and other measurements to do so are typically expensive or hard to acquire in real time. Understanding anomalies in these relatively small data sets is important, as removal may result in an unnecessary loss of da...

Bibliographic Details
Main Authors:	Liu, Xiu; Aldrich, Chris
Format:	Journal Article
Language:	English
Published:	Elsevier 2022
Subjects:	Science & Technology; Technology; Energy & Fuels; Engineering, Chemical; Engineering; Anomaly detection; Isolation forest; Shapley value regression; Coal; Variable importance measures; Random forests; PRINCIPAL COMPONENT ANALYSIS BITUMINOUS COAL COMBUSTION SYSTEM PREDICTION REGRESSION BOILER FOREST CARBON ASH
Online Access:	http://hdl.handle.net/20.500.11937/97646

_version_	1848766297573163008
author	Liu, Xiu; Aldrich, Chris
building	Curtin Institutional Repository
collection	Online Access
description	Modelling the characteristics and composition of coal is important, as proximity data and other measurements to do so are typically expensive or hard to acquire in real time. Understanding anomalies in these relatively small data sets is important, as their removal may result in an unnecessary loss of data or bias in the data used in the model. Although anomaly detection has been considered in depth in the literature, very little work has been devoted to the explanation of anomalies. In this paper, a general anomaly detection and identification methodology is considered, based on three models, viz. an isolation forest (IF), a random forest (RF) and a tree SHAP explanatory model. Three case studies related to the composition of coal and coal processing are considered. In these case studies, the IF-RF-SHAP approach identified outliers or data anomalies not identifiable with principal component analysis. The model is a new variant of some of the integrated approaches that have recently been considered. A further contribution of the study lies in the empirical comparison of IF anomaly scores with distance-based and reconstruction-based anomaly scores generated with principal component models. In the case studies considered, the IF anomaly scores were better able to identify anomalies in the data than the scores derived from the principal component models. As a result, the methodology can complement distance-based approaches, such as principal component analysis, to explain anomalies or outliers detected in data. Apart from the proposed IF-RF-SHAP approach, four approaches to compare the contributions of variables in random forest models are considered as well. These were simple correlation of individual predictors with anomaly scores of samples, random forest prediction based on an impurity criterion, random forest prediction based on a permutation criterion, as well as the tree SHAP approach. If the latter is considered as a benchmark, then the impurity criterion gave the most reliable results, while simple predictor correlations gave the least reliable results.
first_indexed	2025-11-14T11:48:54Z
format	Journal Article
id	curtin-20.500.11937-97646
institution	Curtin University Malaysia
institution_category	Local University
language	English
last_indexed	2025-11-14T11:48:54Z
publishDate	2022
publisher	Elsevier
recordtype	eprints
repository_type	Digital Repository
spelling	curtin-20.500.11937-97646 2025-04-30T00:41:46Z 2022 Journal Article http://hdl.handle.net/20.500.11937/97646 10.1016/j.fuel.2022.126891 English Elsevier fulltext
title	Explaining anomalies in coal proximity and coal processing data with Shapley and tree-based models
topic	Science & Technology; Technology; Energy & Fuels; Engineering, Chemical; Engineering; Anomaly detection; Isolation forest; Shapley value regression; Coal; Variable importance measures; Random forests; PRINCIPAL COMPONENT ANALYSIS BITUMINOUS COAL COMBUSTION SYSTEM PREDICTION REGRESSION BOILER FOREST CARBON ASH
url	http://hdl.handle.net/20.500.11937/97646
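
The description above attributes anomaly scores to individual variables with Shapley values. The sketch below illustrates only that attribution step and is not the authors' implementation. It makes several assumptions: a hypothetical toy anomaly score stands in for a random forest fitted to isolation forest scores, brute-force enumeration of coalitions stands in for tree SHAP, and the variable names, baseline values and scales are invented for illustration.

```python
from itertools import combinations
from math import factorial

# Hypothetical variables and reference values (assumptions, not from the paper).
FEATURES = ["ash", "moisture", "volatile_matter"]
BASELINE = [10.0, 5.0, 30.0]   # assumed "typical" sample
SCALE = [2.0, 1.0, 5.0]        # assumed spread of each variable


def anomaly_score(sample):
    """Toy score: squared standardized deviations plus one interaction term."""
    z = [(v - b) / s for v, b, s in zip(sample, BASELINE, SCALE)]
    return sum(zi ** 2 for zi in z) + abs(z[0] * z[1])


def shapley_values(sample, baseline, value_fn):
    """Exact Shapley values by enumerating every coalition of features.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n, so this only suits a handful of variables.
    """
    n = len(sample)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                without_i = [sample[j] if j in coalition else baseline[j]
                             for j in range(n)]
                with_i = list(without_i)
                with_i[i] = sample[i]
                phi[i] += weight * (value_fn(with_i) - value_fn(without_i))
    return phi


# A hypothetical sample whose ash content is far from the baseline.
x = [16.0, 5.5, 31.0]
phi = shapley_values(x, BASELINE, anomaly_score)

for name, contribution in zip(FEATURES, phi):
    print(f"{name}: {contribution:.3f}")
```

The contributions sum to the difference between the score of the sample and the score of the baseline, which is the property that lets each variable's share of an anomaly score be read off directly. In practice a library implementation of tree SHAP applied to the fitted random forest would replace the enumeration.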

Similar Items