Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data

Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values cover about 10%–20% of all data and can originate from various backgrounds, including analytical, computational, as well as biological. Currently, the...

Bibliographic Details
Main Authors: Gromski, Piotr S., Xu, Yun, Kotze, Helen L., Correa, Elon, Ellis, David I., Armitage, Emily Grace, Turner, Michael L., Goodacre, Royston
Format: Online
Language: English
Published: MDPI 2014
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4101515/
id pubmed-4101515
recordtype oai_dc
spelling pubmed-4101515 2014-07-17 Article MDPI 2014-06-16 /pmc/articles/PMC4101515/ /pubmed/24957035 http://dx.doi.org/10.3390/metabo4020433 Text en © 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
repository_type Open Access Journal
institution_category Foreign Institution
institution US National Center for Biotechnology Information
building NCBI PubMed
collection Online Access
author Gromski, Piotr S.
Xu, Yun
Kotze, Helen L.
Correa, Elon
Ellis, David I.
Armitage, Emily Grace
Turner, Michael L.
Goodacre, Royston
title Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data
description Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values account for about 10%–20% of all data and can arise from analytical, computational, or biological causes. Currently, the best-known substitute for missing values is mean imputation. In fact, some researchers consider this step of their metabolomics pipeline so routine that they do not even mention using it. However, the choice of substitute may have a significant influence on the data analysis output(s) and might be highly sensitive to the distribution of samples between different classes. Therefore, in this study we have analysed different substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, in terms of their influence on unsupervised and supervised learning and, thus, their impact on the final output(s) in terms of biological interpretation. These comparisons have been demonstrated both visually and computationally (classification rate) to support our findings. The results show that the choice of replacement method may have a considerable effect on classification accuracy; if performed incorrectly, this may negatively influence the biomarkers selected for early disease diagnosis or the identification of cancer-related metabolites. For the GC-MS metabolomics data studied here, our findings recommend that RF should be favored over the other tested methods for imputing missing values. This approach displayed excellent results in terms of classification rate for both supervised methods, namely principal components-linear discriminant analysis (PC-LDA) (98.02%) and partial least squares-discriminant analysis (PLS-DA) (97.96%), outperforming the other imputation methods.
_version_ 1613115019984109568
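
The substitutes named in the description (zero, mean, median and kNN) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `impute_columns` and `impute_knn`, the toy `samples` matrix, the use of `None` for a missing value, and the distance averaged over shared features are all assumptions made for this example. RF imputation is omitted because it needs a trained model.

```python
from statistics import mean, median


def impute_columns(rows, strategy="mean"):
    """Fill None entries feature by feature with zero, mean, or median."""
    fillers = {"zero": lambda observed: 0.0, "mean": mean, "median": median}
    fill = fillers[strategy]
    imputed_columns = []
    for column in zip(*rows):  # iterate over features (metabolites)
        observed = [v for v in column if v is not None]
        # Assumption: a feature with no observed value falls back to 0.0.
        replacement = fill(observed) if observed else 0.0
        imputed_columns.append([replacement if v is None else v for v in column])
    return [list(row) for row in zip(*imputed_columns)]


def impute_knn(rows, k=2):
    """Fill each None with the mean of that feature over the k nearest samples."""

    def distance(a, b):
        # Root mean squared difference over features observed in both samples.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return (sum((x - y) ** 2 for x, y in shared) / len(shared)) ** 0.5

    result = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other samples that have feature j observed.
            donors = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            )[:k]
            result[i][j] = mean(v for _, v in donors) if donors else 0.0
    return result


# Hypothetical toy data: rows are samples, columns are metabolite intensities.
# The first three rows mimic one class, the last two another.
samples = [
    [1.0, 10.0, None],
    [1.2, None, 5.0],
    [0.9, 11.0, 5.2],
    [8.0, 30.0, 20.0],
    [8.2, None, 21.0],
]

for name in ("zero", "mean", "median"):
    print(name, impute_columns(samples, name))
print("kNN", impute_knn(samples, k=2))
```

In this toy matrix, mean imputation fills both gaps in the second feature with 17.0, a value typical of neither group, while the kNN variant borrows from samples that look similar. That is one way to picture why a global substitute can be sensitive to how samples are distributed between classes.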