Dataset for Evaluating location strategies in Padiweb
This dataset was built as part of the optimization of the MUlti-Source surveillance Tool for the detection of Avian Influenza outbreaks in mammalian species (MUST-AI). The MUST-AI tool collects health events reported from three sources: two official sources, WAHIS from the World Organisation for Animal Health (WOAH) and emails from the Program for Monitoring Emerging Diseases (ProMED); and one unofficial source, PADI-web, which collects online media articles. PADI-web uses five different strategies to locate health events mentioned in article text. The aim of our study was to assess these strategies.

The dataset consists of 7 case studies (outbreak events taken from the official source WAHIS or from the scientific literature) associated with 222 validated media articles collected by PADI-web through the five strategies. A case study is matched to a PADI-web article on the basis of the country of the outbreak and the time period.

The five evaluated strategies are:
(A) SpaCy locations in Outbreak articles: extraction with SpaCy of locations in articles classified as an epidemiological outbreak.
(B) SpaCy locations in Outbreak articles and Current event sentences: extraction with SpaCy of locations in articles classified as an epidemiological outbreak and in a sentence that has been classified as relating to a current event.
(C) SpaCy locations in beginning of Outbreak articles: extraction with SpaCy of locations found in the first 300 characters of the text of an article classified as an epidemiological outbreak.
(D) PADI-web-specific locations: extraction of locations by the location extraction model trained on PADI-web data.
(E) SpaCy locations in beginning of articles: extraction with SpaCy of locations found in the first 300 characters of the text of an article.

Each case study is associated with an identification number. For each case study, the set of PADI-web articles is given, each with a unique identification number. The dataset contains the following fields:
- Source: source of the case study, either WAHIS or a published article in the scientific literature.
- Id_gold_standard: case study identification number, either the outbreak ID reported in WAHIS or the ranked number of the literature case study.
- Id_article: identification number of the media article as generated in PADI-web.
- URL: URL of the source article.
- Strategy X: binary value indicating whether the article was returned by strategy X.
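The fields listed above lend themselves to a simple per-case-study tally of how many articles each strategy returned. The following is a minimal sketch only: it assumes the data is available as CSV, that the strategy columns are literally named `Strategy A` to `Strategy E`, and that `1` means "returned". The sample rows, the `example.org` URLs, and the helper name `articles_per_strategy` are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; they are NOT values from the dataset.
# Column names and the 0/1 encoding are assumptions based on the description.
SAMPLE = """Source,Id_gold_standard,Id_article,URL,Strategy A,Strategy B,Strategy C,Strategy D,Strategy E
WAHIS,1001,a1,https://example.org/a1,1,0,1,1,1
WAHIS,1001,a2,https://example.org/a2,0,0,0,1,1
Literature,1,a3,https://example.org/a3,1,1,0,0,1
"""

STRATEGIES = [f"Strategy {letter}" for letter in "ABCDE"]


def articles_per_strategy(rows):
    """Count, for each case study, how many articles each strategy returned."""
    counts = defaultdict(lambda: dict.fromkeys(STRATEGIES, 0))
    for row in rows:
        case_id = row["Id_gold_standard"]
        for strategy in STRATEGIES:
            # Each strategy column holds a binary flag (assumed "0" or "1").
            counts[case_id][strategy] += int(row[strategy])
    return dict(counts)


counts = articles_per_strategy(csv.DictReader(io.StringIO(SAMPLE)))
for case_id, per_strategy in counts.items():
    print(case_id, per_strategy)
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with an opened file handle and adjust the column names to match the published header.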
Main Authors: | , , |
---|---|
Format: | Dataset biblioteca |
Language: | English |
Published: | CIRAD Dataverse, 2024 |
Subjects: | Agricultural Sciences, Computer and Information Science, avian influenza, highly pathogenic avian influenza, HPAIM, epidemics, natural language processing, NLP, artificial intelligence, MUST-AI, PADIWEB, event-based surveillance, disease surveillance |
Online Access: | https://doi.org/10.18167/DVN1/Y1J9XK |