Extending generalized linear models with random effects and components of dispersion = [Gegeneraliseerde lineaire modellen met extra stochastische termen en bijbehorende variantiecomponenten]

This dissertation was born out of a need for general and numerically feasible procedures for inference in variance components models for non-normal data. The methodology should be widely applicable within the institutes of the Agricultural Research Department (DLO) of the Dutch Ministry of Agriculture, Nature Management and Fisheries. Available methodology employing maximum likelihood estimation was, due to numerical limitations, too restricted with respect to the choice of random structures. Modification of the iterative reweighted least squares (IRLS) algorithm, which is widely used for estimation in generalized linear models (GLMs), seemed a promising alternative to maximum likelihood.

The class of generalized linear mixed models (GLMMs) studied in this dissertation is a straightforward extension of GLMs. The proposed estimation procedure for GLMMs, obtained by replacing least squares by linear mixed model (LMM) methodology, is a straightforward extension of the IRLS procedure for GLMs. The new procedure involves iterative use of restricted maximum likelihood (REML) and is referred to as iterative reweighted restricted maximum likelihood (IRREML). REML is an estimation procedure for ordinary normal data LMMs. Software for REML is widely available. In this thesis, facilities for REML in the statistical programming language Genstat 5 are employed. In each iteration step of IRREML, REML is applied to an approximate LMM for an artificial dependent variate. This variate and the corresponding residual weights, referred to as the "adjusted dependent variate" and the "iterative weights" (adhering to GLM terminology), are updated after each iteration. Numerical restrictions for IRREML are the same as for REML for ordinary normal data mixed models and pertain to the size of the matrices to be inverted. These can be dealt with to a large extent by eliminating (absorbing) factors with a large number of levels.
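The roles of the adjusted dependent variate and the iterative weights can be illustrated with ordinary IRLS for a GLM. The following is a minimal sketch in Python (not the Genstat 5 procedure from the thesis), assuming a binary response, a logit link, and a single covariate; the function name `irls_logistic` and the small data set are hypothetical. In IRREML, the weighted least squares step marked below would be replaced by a REML fit of a linear mixed model to the same adjusted variate and weights; that step is not implemented here.

```python
import math

def irls_logistic(x, y, n_iter=25, tol=1e-10):
    """Fit logit P(y = 1) = b0 + b1 * x by iterative reweighted least squares.

    Illustrative only: assumes the fitted probabilities stay away from 0 and 1
    (no separation in the data), so the iterative weights remain positive.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Linear predictor and fitted mean under the current estimates.
        eta = [b0 + b1 * xi for xi in x]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        # Iterative weights: (dmu/deta)^2 / var(y) = mu * (1 - mu) for the logit link.
        w = [m * (1.0 - m) for m in mu]
        # Adjusted dependent variate: z = eta + (y - mu) * deta/dmu.
        z = [e + (yi - m) / wi for e, yi, m, wi in zip(eta, y, mu, w)]
        # Weighted least squares of z on (1, x) via the 2x2 normal equations.
        # (IRREML would apply REML for a linear mixed model at this point.)
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Hypothetical toy data: the 0/1 outcomes overlap across x, so the fit is finite.
x = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = irls_logistic(x, y)
```

At convergence the estimates satisfy the usual likelihood equations of the logistic model, which is why IRLS is described as an alternative route to the same fit rather than a different estimator.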
The estimation procedure, programmed in Genstat 5, is available through the Genstat Procedure Library of the Agricultural Mathematics Group (GLW-DLO). By now it has been widely used both within and outside the institutes of DLO.

After the introduction in Chapter 1, inference for LMMs, with emphasis on REML, and for over-dispersed GLMs, illustrating maximum quasi-likelihood estimation, is discussed in Chapters 2 and 3.

IRREML is introduced in Chapter 4. As can be seen from the discussion in that chapter, and from later chapters, a number of statisticians have independently approached the estimation problem from different starting points, ending up with the same estimating equations. A Bayesian approach for prediction of random (genetic) effects for binary, binomial and ordinal data was presented as early as 1983 by Gianola and Foulley.

In Chapter 5, a first attempt is made to assess the quality of IRREML by simulation. Simulated data were based on a practical problem involving carcass classification of cattle. For this problem, the observations analysed were proportions of agreement between classifiers. Although the data set was large and highly unbalanced, a GLMM with four components of variance and an over-dispersion parameter could be fitted without problems. The simulation study included various procedures for the construction of confidence intervals and significance tests. These procedures, which were originally derived for LMMs under normality, were applied to the adjusted dependent variate in the last iteration step of IRREML. IRREML and the modified LMM procedures performed satisfactorily.

In Chapter 6, the analysis of threshold models for binary and binomial data is considered. These threshold models are part of the class of GLMMs. A simulation study, mimicking an animal breeding experiment for binary data, indicated that IRREML may perform poorly when the number of observations per random effect is small.
In terms of the animal breeding experiment: IRREML estimates of heritability may be considerably biased when the data set consists of a large number of small families. In contrast to other results in the literature, it was found that both under- and overestimation may occur, depending on the relative number of fixed effects in the model. In an animal breeding experiment, fixed effects usually represent a very large number of herds, years and seasons, which are all nuisance parameters, since interest centers on variance components and predicted random effects for animals (representing their genetic merit).

In Chapter 7, IRREML is extended towards threshold models for ordinal data. Estimation includes additional shape parameters for a wide class of underlying distributions. For instance, heterogeneity of residual variances of an underlying normal distribution may be modelled in terms of factors and covariates employing a logarithmic link function.

In Chapter 8, the simulation study for binary data from Chapter 6 is extended and two methods for bias correction of variance component estimators are studied. Minimal dimensions of the data set are identified, such that useful inference about components of variance is feasible.

In Chapter 9, prediction of random effects in a model for normal data with heterogeneous variances is considered. In this model, both means and variances are expressed in terms of fixed and random effects, involving both additive and multiplicative effects. The estimation procedure was developed as a basis for a new national breeding evaluation method for Dutch dairy cattle. It was implemented by the Dutch Cattle Syndicate in Arnhem in 1995. Data sets in the dairy industry are extremely large, and therefore computational aspects were very important. A data set comprising 12,629,403 milk records was analysed. Ideas behind IRREML were used to motivate the estimation procedure.
The performance of the procedure was assessed by simulation.

In Chapter 10, the relationship between estimation by IRREML and maximum likelihood (ML) estimation is discussed in some detail. Employing Laplace integration, IRREML may be shown to be an approximate ML procedure. The poor asymptotic properties of IRREML when the number of binary observations per random effect is limited and the number of random effects is large are illustrated by a simple over-dispersion model for binomial data. Since ML was seen to perform well, the Gibbs sampler, as a powerful numerical integrator to derive approximate ML estimates, seems a promising technique for data sets of this kind.
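For orientation, the generic Laplace approximation to an integral of the kind that defines a GLMM likelihood can be written as below. This is the standard textbook form, not necessarily the derivation used in Chapter 10; the symbols (random effects u of dimension q, fixed effects β, variance parameters θ, and the function h) are notation assumed here for illustration.

```latex
% Marginal likelihood of a GLMM with q random effects u (notation assumed here):
L(\beta, \theta) = \int \exp\{h(u)\}\, du,
\qquad h(u) = \log f(y \mid u; \beta) + \log f(u; \theta).

% A second-order expansion of h around its maximiser \hat{u} gives
L(\beta, \theta) \approx (2\pi)^{q/2}\,
  \left| -\frac{\partial^2 h(\hat{u})}{\partial u\, \partial u^{\top}} \right|^{-1/2}
  \exp\{h(\hat{u})\}.
```

The quadratic expansion is generally most accurate when each random effect is informed by many observations, which is in line with the summary's remark that performance suffers when there are few binary observations per random effect.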

Bibliographic Details
Main Author: Engel, B.
Other Authors: Rasch, D.A.M.K.
Format: Doctoral thesis
Language: English
Published: Landbouwuniversiteit Wageningen
Subjects: cum laude, mathematical models, statistics
Online Access: https://research.wur.nl/en/publications/extending-generalized-linear-models-with-random-effects-and-compo