Bayesian hidden Markov models in DNA sequence segmentation using R: the case of Simian Vacuolating virus (SV40)

Segmentation models aim to partition compositionally heterogeneous domains into homogeneous segments that may reflect biological function. Because the segments are latent, a natural approach to segmentation that has recently gained favour uses Bayesian hidden Markov models (HMMs)....

Bibliographic Details
Main Authors: Totterdell, J.A., Nur, Darfiana, Mengersen, K.L.
Format: Journal Article
Language: English
Published: TAYLOR & FRANCIS LTD 2017
Subjects:
Online Access: http://hdl.handle.net/20.500.11937/79609
author Totterdell, J.A.
Nur, Darfiana
Mengersen, K.L.
building Curtin Institutional Repository
collection Online Access
description Segmentation models aim to partition compositionally heterogeneous domains into homogeneous segments that may reflect biological function. Because the segments are latent, a natural approach to segmentation that has recently gained favour uses Bayesian hidden Markov models (HMMs). Concomitantly, over the last few decades the free R programming language has become a dominant tool for computational statistics, visualization and data science. This paper therefore aims to fully exploit R to fit a Bayesian HMM for DNA segmentation. The joint posterior distribution of the model parameters is derived, followed by the algorithms that can be used for estimation. Functions following these algorithms (Gibbs Sampling, Data Augmentation and Label Switching) are then fully implemented in R. The methodology is assessed through extensive simulation studies and then applied to analyse Simian Vacuolating virus (SV40). It is concluded that: (1) the algorithms and functions in R can correctly estimate sequence segmentation if the HMM structure is assumed; (2) the performance of the model improves with sequence length; (3) R is reasonably fast for short to medium sequence lengths and numbers of segments; and (4) the segmentation of SV40 appears to correspond with the two major transcripts, early and late, that regulate the expression of SV40 genes.
first_indexed 2025-11-14T11:13:41Z
format Journal Article
id curtin-20.500.11937-79609
institution Curtin University Malaysia
institution_category Local University
language English
last_indexed 2025-11-14T11:13:41Z
publishDate 2017
publisher TAYLOR & FRANCIS LTD
recordtype eprints
repository_type Digital Repository
spelling curtin-20.500.11937-79609 2020-06-15T00:20:58Z 2017 Journal Article http://hdl.handle.net/20.500.11937/79609 10.1080/00949655.2017.1344666 English TAYLOR & FRANCIS LTD restricted
title Bayesian hidden Markov models in DNA sequence segmentation using R: the case of Simian Vacuolating virus (SV40)
topic Science & Technology
Technology
Physical Sciences
Computer Science, Interdisciplinary Applications
Statistics & Probability
Computer Science
Mathematics
Bayesian modelling
DNA sequence
data augmentation
Gibbs sampler algorithm
hidden Markov models
label switching algorithm
R statistical software
segmentation modelling
Simian Vacuolating virus (SV40)
PROBABILISTIC FUNCTIONS
STATISTICAL-ANALYSIS
GENOME
CHAINS
url http://hdl.handle.net/20.500.11937/79609