A bioinformatics approach to marker development
The thesis focuses on two bioinformatics research topics: the development of tools for efficient and reliable identification of single nucleotide polymorphisms (SNPs) and polymorphic simple sequence repeats (SSRs) from expressed sequence tags (ESTs) (Chapters 2, 3 and 4), and the subsequent implementation of these tools in a pipeline for narrowing down QTL intervals to facilitate the identification of candidate genes for the QTL (Chapter 5). Chapter 1 provides an introduction to molecular markers, in particular SNP and SSR markers, illustrated by a number of applications of molecular markers, and reviews existing programs for their detection. After an analysis of existing problems and programs for the detection of SNPs, a new algorithm is described to reliably identify SNPs and indels in EST data from diploid and polyploid species (Chapter 2). The algorithm is implemented in a program called QualitySNP. This program uses three filters to identify reliable SNPs: filter one screens for potential SNPs by requiring at least two sequences per represented allele; filter two uses a haplotype-based strategy to filter out clusters with paralogs and false SNPs caused by sequencing errors; finally, filter three calculates a confidence score for every putative SNP according to the number of occurrences of each allele in high- and low-quality regions. A separate program was developed for the detection of non-synonymous SNPs (nsSNPs), synonymous SNPs and SNPs in UTRs. Furthermore, these programs were implemented in a pipeline that includes the identification of SNPs and nsSNPs, as well as a storage and retrieval system. Using this pipeline, large numbers of SNPs could be identified in potato ESTs. QualitySNP runs on Linux and UNIX systems; the program, user manual and examples are available at http://www.bioinformatics.nl/tools/snpweb/ .
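The first QualitySNP filter described above (at least two sequences per represented allele) can be illustrated with a minimal Python sketch. This is not the QualitySNP implementation: it assumes the reads of one cluster are already aligned to equal length, it treats a gap character as an indel allele, and the function name `candidate_snp_columns` and its parameters are hypothetical. The haplotype-based filter and the confidence score are not modelled here.

```python
from collections import Counter


def candidate_snp_columns(aligned_reads, min_reads_per_allele=2):
    """Return {column: {allele: count}} for alignment columns in which at
    least two alleles are each supported by `min_reads_per_allele` reads.

    Assumption for this sketch: all reads are pre-aligned, equal-length strings.
    """
    if not aligned_reads:
        return {}
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all reads must be aligned to the same length")

    candidates = {}
    for col in range(length):
        counts = Counter(read[col].upper() for read in aligned_reads)
        counts.pop("N", None)  # ignore ambiguous base calls
        # Keep only alleles seen in enough reads; singletons are likely errors.
        supported = {allele: n for allele, n in counts.items()
                     if n >= min_reads_per_allele}
        if len(supported) >= 2:
            candidates[col] = supported
    return candidates


# Toy cluster: column 2 has two well-supported alleles (G and C);
# column 6 has a T seen only once, so it is not reported.
reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACCTACGT",
    "ACCTACGT",
    "ACGTACTT",
]
print(candidate_snp_columns(reads))
```

Requiring two reads per allele discards variants seen only once, which in EST data are more likely to be sequencing errors than true polymorphisms.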
Chapter 3 describes a web-based implementation of the QualitySNP algorithm, called HaploSNPer, which is a tool for the detection of alleles and SNPs in user-specified input sequences from both diploid and polyploid species. HaploSNPer searches for homologous sequences in user-specified sequence databases using a user-supplied seed sequence, or in a collection of input sequences. All alleles and associated SNPs are identified in clusters of these homologous sequences using QualitySNP. HaploSNPer provides a user-friendly interface for visualization of SNPs and alleles, which allows the selection of informative SNPs and allele-specific markers. HaploSNPer currently supports nine animal and thirteen plant species and is available from http://www.bioinformatics.nl/tools/haplosnper/ . Chapter 4 presents a new tool, called PolySSR, which identifies polymorphic SSRs, rather than just SSRs, in public EST sequence data derived from heterozygotes and/or different genotypes. Based on PolySSR, a pipeline was developed to automatically design primers for putatively polymorphic SSR markers, taking into account SNPs in the SSR flanking regions and thus improving the success rate of the potential markers. The pipeline also determines whether SSRs lie in coding or UTR regions of genes, and it includes a searchable database for these SSRs. The value of PolySSR was demonstrated by the fact that nearly all tested SSRs predicted to be polymorphic were indeed validated as polymorphic, and most of the designed primers produced clear amplicons. Using the pipeline, large numbers of polymorphic SSRs were identified from publicly available ESTs in potato, tomato, rice, Arabidopsis, Brassica and chicken. They are stored in a database, which is available at http://www.bioinformatics.nl/tools/polyssr/ .
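The idea of flagging an SSR as polymorphic, rather than merely present, can be sketched as follows. This is a simplified illustration and not PolySSR itself: it assumes the ESTs passed in already belong to one cluster, it recognises the same locus in different ESTs by an identical motif plus identical left flank (a stand-in for a real alignment), and it only looks for di- and trinucleotide motifs. The names `find_ssrs` and `polymorphic_ssrs` and all parameters are hypothetical.

```python
import re
from collections import defaultdict


def find_ssrs(seq, min_repeats=4):
    """Yield (motif, repeat_count, start, end) for di-/trinucleotide SSRs."""
    # A motif of 2-3 bases followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(r"([ACGT]{2,3}?)\1{%d,}" % (min_repeats - 1))
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        yield motif, len(match.group(0)) // len(motif), match.start(), match.end()


def polymorphic_ssrs(ests, flank=5, min_repeats=4):
    """Return {(motif, left_flank): sorted repeat counts} for SSRs whose
    repeat count differs between ESTs of the same cluster."""
    observed = defaultdict(set)
    for seq in ests:
        for motif, repeats, start, _end in find_ssrs(seq, min_repeats):
            if start < flank:
                continue  # not enough flanking sequence to identify the locus
            left_flank = seq[start - flank:start].upper()
            observed[(motif, left_flank)].add(repeats)
    # An SSR is called polymorphic only if more than one length was observed.
    return {key: sorted(counts) for key, counts in observed.items()
            if len(counts) > 1}


# Toy cluster: the same (CA)n locus with 5, 7 and 5 repeats.
ests = [
    "TTGACCGTAG" + "CA" * 5 + "GGTACCTTAG",
    "TTGACCGTAG" + "CA" * 7 + "GGTACCTTAG",
    "CCGTAG" + "CA" * 5 + "GGTACC",
]
print(polymorphic_ssrs(ests))
```

Comparing repeat counts across redundant ESTs is what distinguishes this approach from plain SSR mining, which would report the locus without any indication of whether it varies.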
PolySSR not only decreases the cost of designing and testing primers, it also introduces a previously overlooked approach that uses the redundancy and heterozygosity of ESTs for developing SSRs. Analysis of the data obtained with PolySSR showed that, in the species used in the study, a larger percentage of short SSRs than of long SSRs was polymorphic. This indicates that a whole class of putatively informative markers has been ‘forgotten’ in the past. QualitySNP and PolySSR have been integrated into a pipeline called GeneTagger, which finds candidate genes underlying a QTL by narrowing the QTL interval (Chapter 5). The pipeline first identifies, in a model species, the region syntenic to the QTL interval of the species under study, based on marker sequences linked to the QTL. Next, within the syntenic regions identified in the model species, it identifies genes that might have a function related to the QTL, or genes that are not part of a large gene family. Based on their map position in the model species, a number of genes are selected for marker development. To facilitate marker development, ESTs derived from the target species are analyzed using QualitySNP, PolySSR and other tools. Based on the genetic variation identified in the selected genes, molecular markers can be developed for accurate fine-mapping of the QTL and, ultimately, identification of the gene underlying the QTL effect. The pipeline has been used to narrow the clubroot resistance QTL. The tool is available from the website http://www.bioinformatics.nl/tools/genetagger/ . Finally, the merits and shortcomings of the tools that have been developed, as well as related bioinformatics questions that arose during these studies, are discussed in Chapter 6.
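The gene selection step of the GeneTagger strategy (genes inside the syntenic region that either have a QTL-related function or do not belong to a large gene family) might be sketched as below. Everything here is an assumption made for illustration: the `Gene` record, the keyword match on annotations as a proxy for "function related to the QTL", the `max_family_size` threshold, and the toy gene names and coordinates. It does not reflect how GeneTagger is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record from a model-species annotation."""
    name: str
    chromosome: str
    start: int
    end: int
    annotation: str
    family_size: int


def select_candidates(genes, chromosome, interval_start, interval_end,
                      keywords, max_family_size=5):
    """Select genes inside the syntenic interval that either match a
    QTL-related keyword or are not part of a large gene family."""
    selected = []
    for gene in genes:
        inside = (gene.chromosome == chromosome
                  and gene.start >= interval_start
                  and gene.end <= interval_end)
        if not inside:
            continue
        related = any(k.lower() in gene.annotation.lower() for k in keywords)
        if related or gene.family_size <= max_family_size:
            selected.append(gene)
    # Order by map position so markers can be spread across the interval.
    return sorted(selected, key=lambda gene: gene.start)


# Toy annotation (all values invented for the example).
genes = [
    Gene("geneA", "chr1", 1000, 2000, "disease resistance protein", 40),
    Gene("geneB", "chr1", 2500, 3200, "unknown protein", 2),
    Gene("geneC", "chr1", 3500, 4200, "protein kinase", 120),
    Gene("geneD", "chr1", 9000, 9900, "disease resistance protein", 3),
    Gene("geneE", "chr2", 1500, 1800, "unknown protein", 1),
]
candidates = select_candidates(genes, "chr1", 500, 5000, keywords=["resistance"])
print([gene.name for gene in candidates])
```

Low-copy genes are attractive for marker development because ESTs from large gene families are harder to assign unambiguously to a single locus.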
Main Author: | |
---|---|
Other Authors: | |
Format: | Doctoral thesis |
Language: | English |
Subjects: | algorithms, bioinformatics, genetic markers, marker genes, microsatellites, molecular genetics, nucleotide sequences, quantitative trait loci, single nucleotide polymorphism |
Online Access: | https://research.wur.nl/en/publications/a-bioinformatics-approach-to-marker-development |