Dataset for Evaluating location strategies in Padiweb

This dataset was built as part of the optimization of the MUlti-Source surveillance Tool for the detection of Avian Influenza outbreaks in mammalian species (MUST-AI). The MUST-AI tool collects health events reported by three sources: two official sources, WAHIS from the World Organisation for Animal Health (WOAH) and emails from the Program for Monitoring Emerging Diseases (ProMED); and one unofficial source, PADI-web, which collects online media articles. PADI-web uses five different strategies to locate the health events mentioned in article text. The aim of our study was to assess these strategies.

The dataset consists of 7 case studies (outbreak events from the official source WAHIS or from the scientific literature) associated with 222 validated media articles collected by PADI-web through the five strategies. A case study is matched to a PADI-web article based on the country of the outbreak and the time period.

The five evaluated strategies are:
(A) SpaCy locations in Outbreak articles: extraction with SpaCy of locations in articles classified as an epidemiological outbreak.
(B) SpaCy locations in Outbreak articles and Current event sentences: extraction with SpaCy of locations in articles classified as an epidemiological outbreak, restricted to sentences classified as relating to a current event.
(C) SpaCy locations in beginning of Outbreak articles: extraction with SpaCy of locations found in the first 300 characters of the text of an article classified as an epidemiological outbreak.
(D) PADI-web-specific locations: extraction of locations by the location extraction model trained on PADI-web data.
(E) SpaCy locations in beginning of articles: extraction with SpaCy of locations found in the first 300 characters of the text of any article.

Each case study has an identification number. For each case study, the associated PADI-web articles are listed, each with a unique identification number. The dataset contains the following fields:
- Source: source of the case study, either WAHIS or a published article in the scientific literature.
- Id_gold_standard: case study identification number, either the outbreak ID reported in WAHIS or the ranked case study number for cases from the literature.
- Id_article: identification number of the media article as generated in PADI-web.
- URL: URL of the source article.
- Strategy X: binary value indicating whether the article was returned by strategy X.
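
The field layout above lends itself to a simple per-strategy tally. The following is a minimal sketch only: it assumes the data is available as CSV, that the strategy columns are named "Strategy A" to "Strategy E", and that the binary values are written as 1 and 0. None of this is confirmed by the record, and the sample rows (identifiers and URLs) are invented for illustration.

```python
import csv
import io

# Hypothetical sample rows, NOT taken from the real dataset.
# Column names for the strategies are an assumption based on "Strategy X".
SAMPLE = """Source,Id_gold_standard,Id_article,URL,Strategy A,Strategy B,Strategy C,Strategy D,Strategy E
WAHIS,1001,A1,https://example.org/a1,1,0,1,1,0
WAHIS,1001,A2,https://example.org/a2,1,1,0,0,1
Literature,2,A3,https://example.org/a3,0,0,0,1,1
"""

STRATEGIES = ["A", "B", "C", "D", "E"]


def count_articles_per_strategy(rows):
    """Count how many articles each strategy returned (flag equal to 1)."""
    counts = {s: 0 for s in STRATEGIES}
    for row in rows:
        for s in STRATEGIES:
            if row[f"Strategy {s}"].strip() == "1":
                counts[s] += 1
    return counts


def case_studies_covered(rows):
    """For each strategy, collect the case studies with at least one returned article."""
    covered = {s: set() for s in STRATEGIES}
    for row in rows:
        for s in STRATEGIES:
            if row[f"Strategy {s}"].strip() == "1":
                covered[s].add(row["Id_gold_standard"])
    return covered


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = count_articles_per_strategy(rows)
covered = case_studies_covered(rows)

for s in STRATEGIES:
    print(f"Strategy {s}: {counts[s]} articles, {len(covered[s])} case studies")
```

To run this against the real files, replace `io.StringIO(SAMPLE)` with an opened file handle and adjust the column names and delimiter to match the downloaded data.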

Bibliographic Details
Main Authors: Trevennec, Carlene, Bououda, Samira, Roche, Mathieu
Format: Dataset biblioteca
Language: English
Published: CIRAD Dataverse 2024
Subjects: Agricultural Sciences, Computer and Information Science, avian influenza, highly pathogenic avian influenza, HPAIM, epidemics, natural language processing, NLP, artificial intelligence, MUST-AI, PADIWEB, event-based surveillance, disease surveillance
Online Access: https://doi.org/10.18167/DVN1/Y1J9XK