Data from polishCLR: Example input genome assemblies

<p>[NOTE: the following data files were added 2022-11-01.]</p> <ul> <li><strong>Test long reads</strong> - test.1.filtered.bam_.gz</li> <li><strong>Test short reads R1</strong> - testpolish_R1.fastq</li> <li><strong>Test short reads R2</strong> - testpolish_R2.fastq</li> <li><strong>Chromosome 30 of H. zea</strong> - GCF_022581195.2_ilHelZeax1.1_chr30.fasta</li> </ul>
<p>To produce the best possible <em>de novo</em>, chromosome-scale genome assembly from error-prone Pacific Biosciences (PacBio) continuous long reads (CLR), we developed polishCLR, a publicly available, flexible, and reproducible workflow that is containerized so it can run on any conventional HPC system. This dataset provides example input contig assemblies for testing the workflow and reproducing its demonstrated utility.</p>
<p>The polishCLR workflow can be initiated from any of three input cases:</p> <ul> <li>Case 1: An unresolved primary assembly with associated contigs, the output of FALCON 2-asm: p_ctg.fasta and a_ctg.fasta</li> <li>Case 2: A haplotype-resolved but unpolished set, the output of FALCON-Unzip 3-unzip: all_p_ctg.fasta and all_h_ctg.fasta</li> <li>Case 3: A haplotype-resolved, CLR long-read, Arrow-polished set of primary and alternate contigs, the output of FALCON-Unzip 4-polish: cns_p_ctg.fasta and cns_h_ctg.fasta</li> </ul>
<p>These example data are the input contig assemblies for the pest <em>Helicoverpa zea</em>. The contigs were built from 49.89 Gb of raw PacBio CLR data generated from a single male of the <em>H. zea</em> strain HzStark_Cry1AcR.</p>
<p>Adult <em>H. zea</em> were collected near the USDA-ARS Genetics and Sustainability Agricultural Research Unit, Starkville, MS, USA in 2011, then transported to and maintained in a colony at the USDA Southern Insect Management Unit (SIMRU), Stoneville, MS, USA as described previously. Larvae were selected on a diagnostic dose of 2.0 μg ml<sup>-1</sup> purified Cry1Ac, and survivors were used to create the strain HzStark_Cry1AcR. HzStark_Cry1AcR was back-crossed every 5 generations to a susceptible line maintained at USDA-ARS SIMRU.</p>
<p>A single male pupa (homogametic, ZZ sex chromosomes) from HzStark_Cry1AcR was dissected laterally into eight ~20 μg sections, and high molecular weight DNA was extracted. PacBio libraries were generated from unsheared DNA using a SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA), and 20-hour run-time movies were generated on a single SMRT Cell 1M v3 using the Sequel I system (Pacific Biosciences).</p>
<p>The raw CLR subread BAM files were converted to FASTQ format using bamtools v. 2.5.1 (Barnett et al. 2011), then used as input for the FALCON assembler (Chin et al. 2016) using the pb-assembly conda environment v. 0.0.8.1 (Pacific Biosciences; default parameters). FALCON-Unzip created primary and alternate contigs with one round of haplotype-aware polishing by Arrow (Pacific Biosciences).</p>
<div><br>Resources in this dataset:</div><br>
<ul><li><p>Resource Title: Associated assembly contigs output from FALCON/2-asm-falcon.</p> <p>File Name: a_ctg_all.fasta</p></li><li><p>Resource Title: Primary assembly contigs output from FALCON/2-asm-falcon.</p> <p>File Name: p_ctg.fasta</p></li><li><p>Resource Title: Alternate haplotype assembly contigs output from FALCON Unzip 3-unzip.</p> <p>File Name: all_h_ctg.fasta</p></li><li><p>Resource Title: Primary assembly contigs output from FALCON Unzip 3-unzip.</p> <p>File Name: all_p_ctg.fasta</p></li><li><p>Resource Title: Alternate assembly contigs output from FALCON Unzip 4-polish.</p> <p>File Name: cns_h_ctg.fasta</p></li><li><p>Resource Title: Primary assembly contigs output from FALCON Unzip 4-polish.</p> <p>File Name: cns_p_ctg.fasta</p></li><li><p>Resource Title: Test long reads.</p> <p>File Name: test.1.filtered.bam_.gz</p><p>Resource Description: For testing the pipeline, long reads that map to H. zea chromosome 30</p></li><li><p>Resource Title: Test short reads R1.</p> <p>File Name: testpolish_R1.fastq</p><p>Resource Description: Short reads aligned to Chromosome 30 of H. zea</p></li><li><p>Resource Title: Test short reads R2.</p> <p>File Name: testpolish_R2.fastq</p><p>Resource Description: Reverse pair (R2) short reads aligned to Chromosome 30 of H. zea</p></li><li><p>Resource Title: Chromosome 30 of H. zea.</p> <p>File Name: GCF_022581195.2_ilHelZeax1.1_chr30.fasta</p></li></ul>
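The three input cases in the description can be checked against a download directory with a short script. The following is a minimal sketch, not part of polishCLR itself: the function name `detect_cases` and the `CASE_FILES` table are hypothetical, and the file names are copied from the description. Because the case list names `a_ctg.fasta` while the resource list names `a_ctg_all.fasta`, the sketch assumes either name is acceptable for Case 1.

```python
from pathlib import Path

# Hypothetical lookup table: case number -> (primary file, accepted alternate files).
# File names are taken from the dataset description; accepting both
# a_ctg.fasta and a_ctg_all.fasta for Case 1 is an assumption.
CASE_FILES = {
    1: ("p_ctg.fasta", ("a_ctg.fasta", "a_ctg_all.fasta")),
    2: ("all_p_ctg.fasta", ("all_h_ctg.fasta",)),
    3: ("cns_p_ctg.fasta", ("cns_h_ctg.fasta",)),
}


def detect_cases(directory):
    """Return the sorted case numbers whose primary and alternate contig files are both present."""
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    return [
        case
        for case, (primary, alternates) in sorted(CASE_FILES.items())
        if primary in present and any(name in present for name in alternates)
    ]
```

Calling `detect_cases("downloads")` on a folder holding all six assembly files from this dataset would be expected to report all three cases. The script only inspects file names; it does not validate FASTA content or invoke the workflow.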

Bibliographic Details
Main Authors: Amanda R. Stahlke (10809001), Brad S. Coates (11385256)
Format: Dataset biblioteca
Published: 2022
Subjects: Genomics and transcriptomics, Genetics, PacBio, genome assembly, NP304, NP305, data.gov, ARS
Online Access: https://figshare.com/articles/dataset/Data_from_polishCLR_Example_input_genome_assemblies/24667776
id dat-usda-us-article24667776
record_format figshare
institution USDA US
collection Figshare
country United States
countrycode US
component Research data
access Online
databasecode dat-usda-us
tag biblioteca
region North America
libraryname National Agricultural Library of USDA
topic Genomics and transcriptomics
Genetics
PacBio
genome assembly
NP304
NP305
data.gov
ARS
format Dataset
author Amanda R. Stahlke (10809001)
Brad S. Coates (11385256)
title Data from polishCLR: Example input genome assemblies
publishDate 2022
url https://figshare.com/articles/dataset/Data_from_polishCLR_Example_input_genome_assemblies/24667776
work_keys_str_mv AT amandarstahlke10809001 datafrompolishclrexampleinputgenomeassemblies
AT bradscoates11385256 datafrompolishclrexampleinputgenomeassemblies
_version_ 1802722037235449856
spelling dat-usda-us-article24667776 2022-02-10T00:00:00Z Data from polishCLR: Example input genome assemblies Amanda R. Stahlke (10809001) Brad S. Coates (11385256) Genomics and transcriptomics Genetics PacBio genome assembly NP304 NP305 data.gov ARS 2022-02-10T00:00:00Z Dataset Dataset 10.15482/usda.adc/1524676 https://figshare.com/articles/dataset/Data_from_polishCLR_Example_input_genome_assemblies/24667776 CC0