Experimental results of "Managing variant calling datasets the big data way"

Tomatula was demonstrated by retrieving allele frequencies for a given region in the data from Aflitos et al. (2014). We developed scripts to retrieve allele frequencies either from VCF file storage or from Apache Parquet. We executed a series of experiments querying a region of 2,000 bases in the chromosome 6 file, which corresponds to the approximate length of a gene. In order to examine four main factors that can affect the performance of a Big Data cluster, namely (a) the storage format, (b) the size of the input files, (c) the number of computing nodes in the cluster, and (d) the HDFS replication factor, we compared the two storage formats (VCF files and Parquet), two input sizes (104 and 1,144 individuals), cluster sizes varying between 2 and 150 executor nodes, and HDFS replication factors of 3, 5, 7, and 9. The HDFS block size was kept at its default value of 128 MB. All experiments were executed five times, and the detailed results are provided here, along with a script that produces the corresponding figures.
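
The experiments ran on a Spark/HDFS cluster (executor nodes, HDFS replication factor), so a region query over the Parquet storage can be illustrated with a minimal PySpark sketch. This is not the published Tomatula code: the Parquet path, the chromosome label, and the column names (chrom, pos, ref, alt, alt_count, an) are assumptions made for illustration only.

```python
# Minimal PySpark sketch of an allele-frequency query over a 2,000-base window.
# Illustrative only: the Parquet path, chromosome label, and column names
# (chrom, pos, ref, alt, alt_count, an) are assumed, not the Tomatula schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("allele-frequency-query").getOrCreate()

# Variants previously converted from VCF to a Parquet table on HDFS (assumed path).
variants = spark.read.parquet("hdfs:///variants/chrom06.parquet")

# Select a 2,000-base region on chromosome 6 (coordinates are arbitrary examples).
region = variants.filter(
    (F.col("chrom") == "SL2.40ch06")
    & F.col("pos").between(1_000_000, 1_002_000)
)

# Allele frequency per site = alternate allele count / total allele number,
# assuming these counts are stored as columns in the Parquet table.
freqs = region.select(
    "chrom", "pos", "ref", "alt",
    (F.col("alt_count") / F.col("an")).alias("allele_frequency"),
)

freqs.show(truncate=False)
spark.stop()
```

Answering the same query from the VCF text storage would instead require parsing each record's fields before aggregating; the difference between these two paths, across input sizes, cluster sizes, and replication factors, is what the experiments in this dataset measure.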

Bibliographic Details
Main Authors: Boufea, Katerina; Athanasiadis, Ioannis
Format: Dataset
Published: Zenodo, 2017
DOI: 10.5281/zenodo.582145
Subjects: Life Science
Online Access: https://research.wur.nl/en/datasets/experimental-results-of-managing-variant-calling-datasets-the-big
Also available at: https://edepot.wur.nl/432334
Institution: Wageningen University & Research