Bayesian hidden Markov models in DNA sequence segmentation using R: the case of Simian Vacuolating virus (SV40)
Segmentation models aim to partition compositionally heterogeneous domains into homogeneous segments which may be reflective of biological function. Due to the latent nature of the segments, a natural approach to segmentation that has gained favour recently uses Bayesian hidden Markov models (HMMs)....
| Main Authors: | Totterdell, J.A.; Nur, Darfiana; Mengersen, K.L. |
|---|---|
| Format: | Journal Article |
| Language: | English |
| Published: | TAYLOR & FRANCIS LTD 2017 |
| Subjects: | |
| Online Access: | http://hdl.handle.net/20.500.11937/79609 |
| _version_ | 1848764081059659776 |
|---|---|
| author | Totterdell, J.A.; Nur, Darfiana; Mengersen, K.L. |
| author_sort | Totterdell, J.A. |
| building | Curtin Institutional Repository |
| collection | Online Access |
| description | Segmentation models aim to partition compositionally heterogeneous domains into homogeneous segments which may be reflective of biological function. Due to the latent nature of the segments, a natural approach to segmentation that has gained favour recently uses Bayesian hidden Markov models (HMMs). Concomitantly, in the last few decades, the free R programming language has become a dominant tool for computational statistics, visualization and data science. Therefore, this paper aims to fully exploit R to fit a Bayesian HMM for DNA segmentation. The joint posterior distribution of the model parameters is derived, followed by the algorithms that can be used for estimation. Functions following these algorithms (Gibbs Sampling, Data Augmentation and Label Switching) are then fully implemented in R. The methodology is assessed through extensive simulation studies and then applied to analyse Simian Vacuolating virus (SV40). It is concluded that: (1) the algorithms and functions in R can correctly estimate sequence segmentation if the HMM structure is assumed; (2) the performance of the model improves with sequence length; (3) R is reasonably fast for short to medium sequence lengths and numbers of segments; and (4) the segmentation of SV40 appears to correspond with the two major transcripts, early and late, that regulate the expression of SV40 genes. |
| first_indexed | 2025-11-14T11:13:41Z |
| format | Journal Article |
| id | curtin-20.500.11937-79609 |
| institution | Curtin University Malaysia |
| institution_category | Local University |
| language | English |
| last_indexed | 2025-11-14T11:13:41Z |
| publishDate | 2017 |
| publisher | TAYLOR & FRANCIS LTD |
| recordtype | eprints |
| repository_type | Digital Repository |
| spelling | curtin-20.500.11937-79609 2020-06-15T00:20:58Z 2017 Journal Article http://hdl.handle.net/20.500.11937/79609 10.1080/00949655.2017.1344666 English TAYLOR & FRANCIS LTD restricted |
| title | Bayesian hidden Markov models in DNA sequence segmentation using R: the case of Simian Vacuolating virus (SV40) |
| topic | Science & Technology; Technology; Physical Sciences; Computer Science, Interdisciplinary Applications; Statistics & Probability; Computer Science; Mathematics; Bayesian modelling; DNA sequence; data augmentation; Gibbs sampler algorithm; hidden Markov models; label switching algorithm; R statistical software; segmentation modelling; Simian Vacuolating virus (SV40); PROBABILISTIC FUNCTIONS; STATISTICAL-ANALYSIS; GENOME; CHAINS |
| url | http://hdl.handle.net/20.500.11937/79609 |