Categorization of species based on their microRNAs employing sequence motifs, information-theoretic sequence feature extraction, and k-mers

Background: Diseases like cancer can manifest themselves through changes in protein abundance, and microRNAs (miRNAs) play a key role in the modulation of protein quantity. MicroRNAs are used throughout all kingdoms and have been shown to be exploited by viruses to modulate their host environment. Since the experimental detection of miRNAs is difficult, computational methods have been developed. Many such tools employ machine learning for pre-miRNA detection, and many features for miRNA parameterization have been proposed. Negative data are important for training machine learning models yet hard to come by; therefore, we recently started to employ pre-miRNAs from one species as positive data versus another species’ pre-miRNAs as negative examples, based on sequence motifs and k-mers. Here, we introduce the additional usage of information-theoretic (IT) features.

Results: Pre-miRNAs from one species were used as positive and another species’ pre-miRNAs as negative training data for machine learning. The categorization capability of IT and k-mer features was investigated. Both feature sets and their combinations yielded a very high accuracy, which is as good as that of the previously suggested sequence motif and k-mer based method. However, high performance requires a sufficiently large phylogenetic distance between the species and a sufficiently large number of pre-miRNAs in the training set. To examine the contribution of the IT and k-mer features, an information gain-based feature ranking was performed. Although the top three features are IT features, 80% of the top 100 features are k-mers. The comparison of all three individual approaches (motifs, IT, and k-mers) shows that k-mers alone are sufficient to distinguish species based on their pre-miRNAs.

Conclusions: IT sequence feature extraction enables the distinction among species and is less computationally expensive than motif calculations. However, since IT features need larger amounts of data to have enough statistics for producing highly accurate results, future categorization into species can be done effectively using k-mers only. The biological reasoning for this is the existence of a codon bias between species which can, at least, be observed in exonic miRNAs. Future work in this direction will be the ab initio detection of pre-miRNAs. In addition, pre-miRNAs could be predicted from RNA-seq data.
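As a rough illustration of the two feature families named in the abstract, the following minimal Python sketch computes normalized k-mer frequencies and a single-symbol Shannon entropy for one sequence. It is not the authors' implementation: the abstract does not specify which IT features were used, so the entropy here is only an assumed stand-in, and the function names `kmer_frequencies` and `shannon_entropy` as well as the example sequence are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGU"  # RNA alphabet; DNA-style input is mapped T -> U below


def kmer_frequencies(seq, k):
    """Return the relative frequency of every possible k-mer over ACGU.

    Hypothetical helper for illustration; k-mers containing symbols
    outside ACGU are ignored.
    """
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Count all overlapping windows of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}


def shannon_entropy(seq):
    """Shannon entropy (bits per symbol) of the nucleotide distribution.

    Assumed example of an information-theoretic feature; the source does
    not state that this exact quantity was used.
    """
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())


if __name__ == "__main__":
    example = "UGAGGUAGUAGGUUGUAUAGUU"  # made-up sequence for demonstration
    features = kmer_frequencies(example, 2)
    features["entropy"] = shannon_entropy(example)
    print(len(features))
```

A feature vector of this kind (one per pre-miRNA, labeled by species) could then be passed to any standard classifier; which classifier and which values of k to use are left open here.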

Bibliographic Details
Main Authors: Yousef, Malik; Nigatu, Dawit; Levy, Dalit; Allmer, Jens; Henkel, Werner
Format: Article/Letter to editor
Language: English
Subjects: Differentiate miRNAs among species, Information theory, Machine learning, MicroRNA, Pre-microRNA, Sequence motifs, k-mer, miRNA categorization
Online Access: https://research.wur.nl/en/publications/categorization-of-species-based-on-their-micrornas-employing-sequ
id dig-wur-nl-wurpubs-529404
record_format koha
spelling dig-wur-nl-wurpubs-529404 2024-08-14; Article/Letter to editor; Eurasip Journal on Advances in Signal Processing 2017 (2017); ISSN: 1687-6172; 2017; en; application/pdf; DOI: 10.1186/s13634-017-0506-8; https://edepot.wur.nl/426989; https://creativecommons.org/licenses/by/4.0/; Wageningen University & Research
institution WUR NL
collection DSpace
country Netherlands
countrycode NL
component Bibliographic
access Online
databasecode dig-wur-nl
tag biblioteca
region Western Europe
libraryname WUR Library Netherlands
language English
topic Differentiate miRNAs among species
Information theory
Machine learning
MicroRNA
Pre-microRNA
Sequence motifs
k-mer
miRNA categorization