Statistical analysis of genomic data : a new model for class prediction and inference

Genomics is a major scientific revolution in this century. High-throughput genomic data provides an opportunity for identifying genes and SNPs (single-nucleotide polymorphisms) that are related to various clinical phenotypes. To deal with the sheer volume of genetic data being produced, it requires ad...

Bibliographic Details
Main Author: Jiang, Zhenyu
Format: Thesis
Language: English
Published: Curtin University 2011
Subjects: genomic data; class prediction; class inference; Statistical analysis
Online Access: http://hdl.handle.net/20.500.11937/1017
_version_ 1848743545234522112
author Jiang, Zhenyu
author_facet Jiang, Zhenyu
author_sort Jiang, Zhenyu
building Curtin Institutional Repository
collection Online Access
description Genomics is a major scientific revolution of this century. High-throughput genomic data provide an opportunity to identify genes and SNPs (single-nucleotide polymorphisms) that are related to various clinical phenotypes. Dealing with the sheer volume of genetic data being produced requires advanced methodological development in biostatistics, which is lagging behind the technical capability to generate genomic data. SNPs are of great importance in biomedical research for comparing regions of the genome between cohorts (such as in case-control studies). Within a population, SNPs can be assigned a minor allele frequency, the lowest allele frequency at a locus that is observed in a particular population, and be recoded into binary datasets. Therefore, it is important to develop suitable statistical methods for SNP analysis of genome alteration, with the goal of contributing to the understanding of complex human diseases or traits such as mental health.

In this thesis, we develop new statistical methodologies for the analysis of schizophrenia genomic data from the WA Genetic Epidemiology Resource (WAGER). The motivation is schizophrenia class prediction (i.e. the prediction of individuals’ disease status from their genotype and quantitative traits). In general, an individual’s disease status is a nominal variable, while genotypes can be converted into ordinal variables but are of high dimension. Note that the usual nonparametric regression developed for continuous variables cannot be applied here. Some methodologies are available in the literature, such as the tree-based logistic Non-parametric Pathway-based Regression (NPR) model proposed by Wei and Li (2007). However, this model is found not to adapt well to the data set analyzed here; it performs even worse than the (generalized) linear logistic regression model.

Using the logistic discrimination rule together with added quantitative traits, some important results have been obtained. However, some shortcomings remain. Firstly, the generalized linear logistic model has a high type I error rate for schizophrenia classification. Secondly, the quantitative traits required for schizophrenia class prediction are performance assessments that demand several hours of on-site participation by both assessor and assessee. These traits are generally quite difficult to obtain, even for a medium-sized sample. Meanwhile, although the cost of laboratory analysis is high, a person’s genotype can be obtained by merely collecting a drop of blood.

Thus, two kinds of nonlinear models are proposed to capture the nonlinear effects in SNP datasets, which are categorical. The main contributions of this thesis are summarized as follows:

• Two kinds of nonlinear threshold index logistic regression models are proposed to capture the nonlinear effects by applying the idea of threshold models (Tong (1983, 1990)), which are parametric and therefore applicable to categorical data. One of the proposed models, called the partially linear threshold index logistic regression (PL-TILoR) model, is given by

$$\log\left(\frac{P(Y_i = 1 \mid X_i)}{1 - P(Y_i = 1 \mid X_i)}\right) = \alpha^{T} X_i + g(\beta^{T} X_i), \tag{0.1}$$

where $Y_i$ is the disease status of the $i$th person under the case-control study, taking the value 1 (case) or 0 (control), $X_i$ is the $p$-dimensional vector of genotype variables, and the superscript $T$ stands for the transpose of a vector or matrix. Here, $\alpha$ and $\beta$ are $p$-dimensional unknown parameters, with $\beta$ being an index vector used for dimension reduction, satisfying $\|\beta\| = 1$ and $\alpha^{T}\beta = 0$ for model identifiability, and $g$ is therefore a one-dimensional nonlinear function, which is modelled as a stepwise linear function through a threshold effect (Tong, 1990):

$$g(z) = (b_1 z + b_2)\, I_{\{z \le c\}} + (b_3 z + b_4)\, I_{\{z > c\}}, \tag{0.2}$$

where the $b_i$’s and $c$ are unknown parameters to be estimated and $I_A$ is the indicator function of the set $A$. In practice, the first component in model (0.1) could also be nonlinear. In this case, model (0.1) becomes

$$\log\left(\frac{P(Y_i = 1 \mid X_i)}{1 - P(Y_i = 1 \mid X_i)}\right) = g_1(\alpha^{T} X_i) + g_2(\beta^{T} X_i), \tag{0.3}$$

where $\|\alpha\| = 1$, $\|\beta\| = 1$ and $\alpha^{T}\beta = 0$ for model identifiability, and $g_1$ and $g_2$ are two one-dimensional nonlinear functions, which are modelled by stepwise linear functions through threshold effects as follows:

$$g_k(z) = (b_{k1} z + b_{k2})\, I_{\{z \le c_k\}} + (b_{k3} z + b_{k4})\, I_{\{z > c_k\}}, \quad k = 1, 2, \tag{0.4}$$

where the $b_{ki}$’s and $c_k$’s are unknown parameters to be estimated. Thus, (0.3) and (0.4) form an additive threshold index logistic regression (A-TILoR) model.

• A maximum likelihood methodology is developed to estimate the unknown parameters in the PL-TILoR and A-TILoR models. Simulation studies have found that the proposed methodology works well for finite-size samples.

• Empirical studies of the proposed models applied to the analysis of schizophrenia genomic data from the WA Genetic Epidemiology Resource (WAGER) have shown that the A-TILoR model is very successful in reducing the type I error rate in schizophrenia classification, even without using quantitative traits. It outperforms the generalized linear logistic model that is widely used in the literature.
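The models in (0.1) to (0.4) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`threshold_g`, `pl_tilor_logit`, `a_tilor_logit`, `log_likelihood`) and all toy parameter values are assumptions made for illustration. It only evaluates the log-odds and the standard Bernoulli log-likelihood that a maximum likelihood procedure would maximize; it does not estimate anything.

```python
import numpy as np


def threshold_g(z, b, c):
    """Threshold function of (0.2): one line for z <= c, another for z > c."""
    z = np.asarray(z, dtype=float)
    b1, b2, b3, b4 = b
    return np.where(z <= c, b1 * z + b2, b3 * z + b4)


def pl_tilor_logit(X, alpha, beta, b, c):
    """Log-odds of model (0.1): linear part plus threshold function of the index."""
    return X @ alpha + threshold_g(X @ beta, b, c)


def a_tilor_logit(X, alpha, beta, b_1, c_1, b_2, c_2):
    """Log-odds of model (0.3): sum of two threshold functions, as in (0.4)."""
    return threshold_g(X @ alpha, b_1, c_1) + threshold_g(X @ beta, b_2, c_2)


def log_likelihood(y, logit):
    """Bernoulli log-likelihood: sum of y * eta - log(1 + exp(eta))."""
    return float(np.sum(y * logit - np.logaddexp(0.0, logit)))


# Toy data (hypothetical): four individuals, two binary-coded genotype variables.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1, 0, 1, 0])

# Hypothetical parameters chosen to satisfy ||beta|| = 1 and alpha^T beta = 0.
alpha = np.array([1.0, 0.0])
beta = np.array([0.0, 1.0])
b = (1.0, 0.0, -1.0, 2.0)
c = 0.5

eta = pl_tilor_logit(X, alpha, beta, b, c)  # log-odds under (0.1)
prob = 1.0 / (1.0 + np.exp(-eta))           # P(Y_i = 1 | X_i)
ll = log_likelihood(y, eta)
```

In this sketch the identifiability constraints are satisfied by construction. An actual fit would need to enforce them during optimization and to search over the threshold `c`, at which the likelihood is not smooth.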
first_indexed 2025-11-14T05:47:16Z
format Thesis
id curtin-20.500.11937-1017
institution Curtin University Malaysia
institution_category Local University
language English
last_indexed 2025-11-14T05:47:16Z
publishDate 2011
publisher Curtin University
recordtype eprints
repository_type Digital Repository
title Statistical analysis of genomic data : a new model for class prediction and inference
title_full Statistical analysis of genomic data : a new model for class prediction and inference
title_fullStr Statistical analysis of genomic data : a new model for class prediction and inference
title_full_unstemmed Statistical analysis of genomic data : a new model for class prediction and inference
title_short Statistical analysis of genomic data : a new model for class prediction and inference
title_sort statistical analysis of genomic data : a new model for class prediction and inference
topic genomic data
class prediction
class inference
Statistical analysis
url http://hdl.handle.net/20.500.11937/1017