Dataset for Evaluating location strategies in Padiweb

Trevennec, Carlene; Bououda, Samira; Roche, Mathieu


This dataset was built as part of the optimization of the MUlti-Source surveillance Tool for the detection of Avian Influenza outbreaks in mammalian species (MUST-AI). The MUST-AI tool collects health events reported by three sources: two official sources, WAHIS from the World Organisation for Animal Health (WOAH) and emails from the Program for Monitoring Emerging Diseases (ProMED); and one unofficial source, PADI-web, which collects online media articles. PADI-web uses five different strategies to locate health events mentioned in the text of articles. The aim of our study was to assess these strategies.

The dataset consists of 7 case studies (outbreak events taken from the official source WAHIS or from the scientific literature) associated with 222 validated media articles collected by PADI-web through the five strategies. A case study is matched to a PADI-web article based on the country of the outbreak and the time period.

The five evaluated strategies are:
(A) SpaCy locations in Outbreak articles: extraction with SpaCy of locations in articles classified as an epidemiological outbreak.
(B) SpaCy locations in Outbreak articles and Current event sentences: extraction with SpaCy of locations in articles classified as an epidemiological outbreak and in a sentence that has been classified as relating to a current event.
(C) SpaCy locations in beginning of Outbreak articles: extraction with SpaCy of locations found in the first 300 characters of the text of an article classified as an epidemiological outbreak.
(D) PADI-web-specific locations: extraction of locations by the location extraction model trained on PADI-web data.
(E) SpaCy locations in beginning of articles: extraction with SpaCy of locations found in the first 300 characters of the text of an article.

Each case study is associated with an identification number. For each case study, the associated PADI-web articles are listed, each with a unique identification number.

The dataset contains the following values:
- Source: source of the case study, either WAHIS or a published article in the scientific literature.
- Id_gold_standard: case study identification number, either the outbreak id reported in WAHIS or the ranked literature case study number.
- Id_article: identification number of the media article as generated in PADI-web.
- URL: URL of the source article.
- Strategy X: binary value indicating whether the article was returned by strategy X.
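As an illustration of how the fields described above could be used, the following minimal sketch counts, for each strategy, how many articles it returned and how many distinct case studies it covers. It rests on several assumptions that are not confirmed by the record: that the data is available as CSV, that the strategy columns are named "Strategy A" to "Strategy E", and that the binary values are encoded as 0 and 1. The sample rows, identifiers, and URLs are invented for the example and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the assumed CSV layout; all values are invented.
SAMPLE_CSV = """Source,Id_gold_standard,Id_article,URL,Strategy A,Strategy B,Strategy C,Strategy D,Strategy E
WAHIS,1001,A1,https://example.org/a1,1,1,0,1,1
WAHIS,1001,A2,https://example.org/a2,1,0,0,0,1
Literature,1,A3,https://example.org/a3,0,0,0,1,1
"""

# Assumed column names for the five binary strategy flags.
STRATEGIES = [f"Strategy {letter}" for letter in "ABCDE"]


def articles_per_strategy(rows):
    """Count the articles flagged as returned (value 1) by each strategy."""
    counts = {strategy: 0 for strategy in STRATEGIES}
    for row in rows:
        for strategy in STRATEGIES:
            if row[strategy].strip() == "1":
                counts[strategy] += 1
    return counts


def case_studies_per_strategy(rows):
    """Count the distinct case studies with at least one article per strategy."""
    covered = defaultdict(set)
    for row in rows:
        # A case study is identified here by its source plus its id, since
        # WAHIS ids and literature ranks may come from separate numbering schemes.
        case_key = (row["Source"], row["Id_gold_standard"])
        for strategy in STRATEGIES:
            if row[strategy].strip() == "1":
                covered[strategy].add(case_key)
    return {strategy: len(covered[strategy]) for strategy in STRATEGIES}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
article_counts = articles_per_strategy(rows)
case_counts = case_studies_per_strategy(rows)

for strategy in STRATEGIES:
    print(strategy, article_counts[strategy], case_counts[strategy])
```

To run this on the real file, replace the in-memory sample with the downloaded file and adjust the column names and value encoding to match what the file actually contains.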


Bibliographic Details
Main Authors:	Trevennec, Carlene; Bououda, Samira; Roche, Mathieu
Format:	Dataset
Language:	English
Published:	CIRAD Dataverse 2024
Subjects:	Agricultural Sciences, Computer and Information Science, avian influenza, highly pathogenic avian influenza, HPAIM, epidemics, natural language processing, NLP, artificial intelligence, MUST-AI, PADIWEB, event-based surveillance, disease surveillance
Online Access:	https://doi.org/10.18167/DVN1/Y1J9XK


