Models and methods to integrate epidemiological and whole genome sequence data for effectively analysing infectious disease outbreak data

Bibliographic Details
Main Author: Marsh, J. S.
Format: Thesis (University of Nottingham only)
Language: English
Published: 2024
Online Access: https://eprints.nottingham.ac.uk/77270/
author Marsh, J. S.
building Nottingham Research Data Repository
collection Online Access
description Advances in sequencing technology and the reduction in associated costs have enabled scientists to obtain highly detailed genomic data on disease-causing pathogens on a scale never seen before. Combining genomic data with traditional epidemiological data (e.g. incidence data) provides a unique opportunity to determine the actual transmission pathway of the pathogen through a population. Despite recent advances, existing approaches have limitations, such as simplifications to the underlying biological processes, arbitrary phenomenological models and approximations to the likelihood function. We present a novel modelling framework for integrating epidemiological and whole genome sequence data that overcomes these limitations: (i) we use the matrix of pairwise genetic distances between sequences as a summary statistic for the genetic data and (ii) we explicitly derive the joint probability distribution of pairwise genetic distances under the assumption of microevolution mutation models. We develop bespoke and computationally efficient data-augmentation MCMC algorithms to infer the transmission network, infection times and unobserved genetic distances from pathogen sequences at the time of transmission. The framework is general and applicable to a variety of outbreak scenarios. For example, we explicitly consider a discrete-time transmission model for healthcare-associated infections, demonstrate the performance of our framework on simulated data, and analyse an outbreak of S. aureus in an intensive care unit in Brighton during 2011-2012. Our approach integrates healthcare worker data at an individual level and considers the possibility of multiple distinct genetic subtypes. Finally, we consider integrating genetic data with a continuous-time SEIR model and analyse an outbreak of foot-and-mouth disease in Darlington, a town in the north east of England, in 2001. We validate our inferred transmission network against previous modelling studies and demonstrate that pairwise genetic distance is an informative summary of the raw sequence data.
first_indexed 2025-11-14T21:00:11Z
format Thesis (University of Nottingham only)
id nottingham-77270
institution University of Nottingham Malaysia Campus
institution_category Local University
language English
last_indexed 2025-11-14T21:00:11Z
publishDate 2024
recordtype eprints
repository_type Digital Repository
spelling nottingham-77270 2024-07-24T04:40:51Z https://eprints.nottingham.ac.uk/77270/ 2024-07-24 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en cc_by https://eprints.nottingham.ac.uk/77270/1/JoeMarshPhDThesis.pdf Marsh, J. S. (2024) Models and methods to integrate epidemiological and whole genome sequence data for effectively analysing infectious disease outbreak data. PhD thesis, University of Nottingham.
title Models and methods to integrate epidemiological and whole genome sequence data for effectively analysing infectious disease outbreak data
topic stochastic epidemic models; bayesian inference; genetic data; transmission trees; mrsa; foot and mouth disease; pairwise genetic distance data; who infected whom
url https://eprints.nottingham.ac.uk/77270/