Classification des items inconnus de 88milSMS : aide à l'identification automatique de la créativité scripturale

The sud4science LR project aimed to study a fairly recent form of written communication: SMS (Short Message Service). The first step of the project was to collect a large number of text messages from the general public. We initially gathered 93'085 SMS, and our final corpus, entitled 88milSMS, contains over 88'000 SMS. In this article, we propose a novel approach (also applicable to other textual data) for classifying unknown items in 88milSMS, based on two steps: 1) classification of SMS in relation to 5 European languages (French, Spanish, English, German, Italian); 2) classification of unknown items according to predefined classes (schedules, items containing special character(s), number(s), words without accents or with repeated characters, etc.). We are then able to distinguish the truly "original" items that are widely used from those that are rarely used in the corpus. Based on examples mined from the different classes, we present a preliminary analysis of the resulting resource.

Bibliographic Details
Main Authors: Lopez, Cédric, Roche, Mathieu, Panckhurst, Rachel
Format: article biblioteca
Language: fre
Subjects: C30 - Documentation and information, 000 - Other themes, communication, linguistics, data analysis, software, statistical methods, classification (information), information, http://aims.fao.org/aos/agrovoc/c_37866, http://aims.fao.org/aos/agrovoc/c_1335455465014, http://aims.fao.org/aos/agrovoc/c_15962, http://aims.fao.org/aos/agrovoc/c_24008, http://aims.fao.org/aos/agrovoc/c_7377, http://aims.fao.org/aos/agrovoc/c_11767, http://aims.fao.org/aos/agrovoc/c_330966
Online Access: http://agritrop.cirad.fr/579647/
http://agritrop.cirad.fr/579647/1/revue_tranel15.pdf