Search CORE

174 research outputs found

Garbage collection auto-tuning for Java MapReduce on Multi-Cores

Author: Allen E.
Chu C.
Dean J.
DeWitt D.
Gavin Brown
George Kovoor
Jeremy Singer
Kovoor G.
M
Mikel Luján
Printezis T.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2011
Field of study

MapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational scenarios. In this paper we present MRJ, a MapReduce Java framework for multi-core architectures. We evaluate its scalability on a four-core, hyperthreaded Intel Core i7 processor, using a set of standard MapReduce benchmarks. We investigate the significant impact that Java runtime garbage collection has on the performance and scalability of MRJ. We propose the use of memory management auto-tuning techniques based on machine learning. With our auto-tuning approach, we are able to achieve MRJ performance within 10% of optimal on 75% of our benchmark tests

Crossref

Enlighten

EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

Author
Publication venue: 'OpenEdition'
Publication date: 10/06/2022
Field of study

Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and it is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it)

Directory of Open Access Books (DOAB)

Authorship attribution in portuguese using character N-grams

Author: Baptista Jorge
Markov Ilia
Pichardo-Lagunas Obdulia
Publication venue: 'Obuda University'
Publication date: 01/01/2017
Field of study

For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches, using the same corpus.Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN)

Crossref

Sapientia

The text classification pipeline: Starting shallow, going deeper

Author: SIINO Marco
Publication venue: place:Palermo
Publication date: 20/11/2023
Field of study

An increasingly relevant and crucial subfield of Natural Language Processing (NLP), tackled in this PhD thesis from a computer science and engineering perspective, is the Text Classification (TC). Also in this field, the exceptional success of deep learning has sparked a boom over the past ten years. Text retrieval and categorization, information extraction and summarization all rely heavily on TC. The literature has presented numerous datasets, models, and evaluation criteria. Even if languages as Arabic, Chinese, Hindi and others are employed in several works, from a computer science perspective the most used and referred language in the literature concerning TC is English. This is also the language mainly referenced in the rest of this PhD thesis. Even if numerous machine learning techniques have shown outstanding results, the classifier effectiveness depends on the capability to comprehend intricate relations and non-linear correlations in texts. In order to achieve this level of understanding, it is necessary to pay attention not only to the architecture of a model but also to other stages of the TC pipeline. In an NLP framework, a range of text representation techniques and model designs have emerged, including the large language models. These models are capable of turning massive amounts of text into useful vector representations that effectively capture semantically significant information. The fact that this field has been investigated by numerous communities, including data mining, linguistics, and information retrieval, is an aspect of crucial interest. These communities frequently have some overlap, but are mostly separate and do their research on their own. Bringing researchers from other groups together to improve the multidisciplinary comprehension of this field is one of the objectives of this dissertation. Additionally, this dissertation makes an effort to examine text mining from both a traditional and modern perspective. This thesis covers the whole TC pipeline in detail. However, the main contribution is to investigate the impact of every element in the TC pipeline to evaluate the impact on the final performance of a TC model. It is discussed the TC pipeline, including the traditional and the most recent deep learning-based models. This pipeline consists of State-Of-The-Art (SOTA) datasets used in the literature as benchmark, text preprocessing, text representation, machine learning models for TC, evaluation metrics and current SOTA results. In each chapter of this dissertation, I go over each of these steps, covering both the technical advancements and my most significant and recent findings while performing experiments and introducing novel models. The advantages and disadvantages of various options are also listed, along with a thorough comparison of the various approaches. At the end of each chapter, there are my contributions with experimental evaluations and discussions on the results that I have obtained during my three years PhD course. The experiments and the analysis related to each chapter (i.e., each element of the TC pipeline) are the main contributions that I provide, extending the basic knowledge of a regular survey on the matter of TC.An increasingly relevant and crucial subfield of Natural Language Processing (NLP), tackled in this PhD thesis from a computer science and engineering perspective, is the Text Classification (TC). Also in this field, the exceptional success of deep learning has sparked a boom over the past ten years. Text retrieval and categorization, information extraction and summarization all rely heavily on TC. The literature has presented numerous datasets, models, and evaluation criteria. Even if languages as Arabic, Chinese, Hindi and others are employed in several works, from a computer science perspective the most used and referred language in the literature concerning TC is English. This is also the language mainly referenced in the rest of this PhD thesis. Even if numerous machine learning techniques have shown outstanding results, the classifier effectiveness depends on the capability to comprehend intricate relations and non-linear correlations in texts. In order to achieve this level of understanding, it is necessary to pay attention not only to the architecture of a model but also to other stages of the TC pipeline. In an NLP framework, a range of text representation techniques and model designs have emerged, including the large language models. These models are capable of turning massive amounts of text into useful vector representations that effectively capture semantically significant information. The fact that this field has been investigated by numerous communities, including data mining, linguistics, and information retrieval, is an aspect of crucial interest. These communities frequently have some overlap, but are mostly separate and do their research on their own. Bringing researchers from other groups together to improve the multidisciplinary comprehension of this field is one of the objectives of this dissertation. Additionally, this dissertation makes an effort to examine text mining from both a traditional and modern perspective. This thesis covers the whole TC pipeline in detail. However, the main contribution is to investigate the impact of every element in the TC pipeline to evaluate the impact on the final performance of a TC model. It is discussed the TC pipeline, including the traditional and the most recent deep learning-based models. This pipeline consists of State-Of-The-Art (SOTA) datasets used in the literature as benchmark, text preprocessing, text representation, machine learning models for TC, evaluation metrics and current SOTA results. In each chapter of this dissertation, I go over each of these steps, covering both the technical advancements and my most significant and recent findings while performing experiments and introducing novel models. The advantages and disadvantages of various options are also listed, along with a thorough comparison of the various approaches. At the end of each chapter, there are my contributions with experimental evaluations and discussions on the results that I have obtained during my three years PhD course. The experiments and the analysis related to each chapter (i.e., each element of the TC pipeline) are the main contributions that I provide, extending the basic knowledge of a regular survey on the matter of TC

Archivio istituzionale della ricerca - Università di Palermo

Arabic tweeps dialect prediction based on machine learning approach

Author: Alrifai Khaled
Ghneim Nada
Rebdawi Ghaida
Publication venue: 'Institute of Advanced Engineering and Science'
Publication date: 01/04/2021
Field of study

In this paper, we present our approach for profiling Arabic authors on twitter, based on their tweets. We consider here the dialect of an Arabic author as an important trait to be predicted. For this purpose, many indicators, feature vectors and machine learning-based classifiers were implemented. The results of these classifiers were compared to find out the best dialect prediction model. The best dialect prediction model was obtained using random forest classifier with full forms and their stems as feature vector

ZENODO

Institute of Advanced Engineering and Science

EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

Author: Agerri Rodrigo
Aliprandi Carlo
Alkhalifa Rabab
Alzetta Chiara
Angel Jason
Anselmi Guido
Appiah Balaji Nitin Nikamanth
Aroyehun Segun Taofeek
Artigas Herold Maria Fernanda
Attanasio Giuseppe
Attardi Giuseppe
Badryzlova Yulia
Bai Yang
Baldissin Gioia
Ballarè Silvia
Barrón-Cedeño Alberto
Bartle Anna-Sophie
Basile Pierpaolo
Basile Valerio
Basili Roberto
Belotti Federico
Bennici Mauro
Bharathi B.
Bhuvana J.
Bianchi Federico
Bisconti Elia
Bolanos Luis
Bondielli Alessandro
Bosco Cristina
Breazzano Claudia
Brivio Matteo
Brunato Dominique
Cafagna Michele
Caputo Annalina
Caselli Tommaso
Cassotti Pierluigi
Castañeda Enrique
Castro Castro Daniel
Centeno Roberto
Cercel Dumitru-Clementin
Cerruti Massimo
Chandrabose Aravindan
Chesi Cristiano
Chiarello Filippo
Cignarella Alessandra Teresa
Cimino Andrea
Comandini Gloria
Croce Danilo
Dai Hongbing
Dascalu Mihai
Dell’Orletta Felice
Delmonte Rodolfo
Deng Tao
De Francesco Nazareno
De Martino Graziella
De Mattei Lorenzo
Di Buccio Emanuele
Di Maro Maria
di Nuovo Elisa
Di Rosa Emanuele
dos S.R. da Silva Adriano
Durante Alberto
El Abassi Samer
Espinosa María S.
Fabrizi Samuel
Fantoni Gualtiero
Ferilli Stefano
Ferraccioli Federico
Fersini Elisabetta
Finos Livio
Fiorucci Stefano
Fontana Michele
Frenda Simona
Gambino Giuseppe
Gatt Albert
Gelbukh Alexander
Giorgi Giulia
Giorgioni Simone
Girardi Paolo
Goria Eugenio
Gregori Lorenzo
Hoffmann Julia
Iacono Maria
Iovine Andrea
Izzi Giovanni Luca
Jimenez Sergio
Kaiser Jens
Kayalvizhi S.
Kivlichan Ian
Klaus Svea
Koceva Frosina
Kovács György
Kruschwitz Udo
Labadie Tamayo Roberto
Lai Mirko
Laicher Severin
Lapesa Gabriella
Lavergne Eric
Lebani Gianluca E.
Lebani Gianluca E.
Lees Alyssa
Lenci Alessandro
Leonardelli Elisa
Li Hongling
Liakata Maria
Lovetere Marco
Madonna Domenico
Massidda Riccardo
Mattei Lorenzo De
Mauri Caterina
Mele Francesco
Melucci Massimo
Menini Stefano
Miaschi Alessio
Miliani Martina
Moggio Alessio
Montagnani Matteo
Montefinese Maria
Montemagni Simonetta
Monti Johanna
Moraca Maurizio
Moretti Giovanni
Morra Simone
Murphy Killian
Muti Arianna
Nakov Preslav
Nisioi Sergiu
Nissim Malvina
Nozza Debora
Occhipinti Daniela
Ortega Bueno Reynier
Ou Xiaozhi
Palmonari Matteo
Parizzi Andrea
Pascucci Antonio
Passaro Lucia C.
Pastor Eliana
Patti Viviana
Pirrone Roberto
Polignano Marco
Politi Marcello
Pont Mattia Da
Pražák Ondřej
Proisl Thomas
Puccetti Giovanni
Přibáň Pavel
Radicioni Daniele P.
Rama Ilir
Rambelli Giulia
Ravelli Andrea Amelio
Rodrigo Alvaro
Rodriguez-Diaz Carlos A.
Rodriguez Cisnero Mariano Jason
Roman Norton T.
Roman Norton Trevisan
Rossmann Daniela
Rosso Paolo
Rotaru Armand Stefan
Rubino Edoardo
Russo Irene
Sabella Gianluca
Saini Rajkumar
Salman Samir
Sangati Federico
Sanguinetti Manuela
Sarti Gabriele
Schlechtweg Dominik
Schulte im Walde Sabine
Sciandra Andrea
Setpal Jinen
Siciliani Lucia
Solari Dario
Sorensen Jeffrey
Sorgente Antonio
Sprugnoli Rachele
Stranisci Marco
Tamburini Fabio
Taylor Stephen
Tesei Andrea
Thenmozhi D.
Tonelli Sara
Torre Ilaria
Tsakalidis Adam
Varvara Rossella
Venturi Giulia
Vettigli Giuseppe
Vlad George-Alexandru
Wang Benyou
Zaharia George-Eduard
Zamparelli Roberto
Zubiaga Arkaitz
Publication venue: 'OpenEdition'
Publication date: 11/05/2021
Field of study

OpenEdition

On the Detection of False Information: From Rumors to Fake News

Author: Ghanem Bilal Hisham Hasan
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 10/01/2021
Field of study

Tesis por compendio[ES] En tiempos recientes, el desarrollo de las redes sociales y de las agencias de noticias han traído nuevos retos y amenazas a la web. Estas amenazas han llamado la atención de la comunidad investigadora en Procesamiento del Lenguaje Natural (PLN) ya que están contaminando las plataformas de redes sociales. Un ejemplo de amenaza serían las noticias falsas, en las que los usuarios difunden y comparten información falsa, inexacta o engañosa. La información falsa no se limita a la información verificable, sino que también incluye información que se utiliza con fines nocivos. Además, uno de los desafíos a los que se enfrentan los investigadores es la gran cantidad de usuarios en las plataformas de redes sociales, donde detectar a los difusores de información falsa no es tarea fácil. Los trabajos previos que se han propuesto para limitar o estudiar el tema de la detección de información falsa se han centrado en comprender el lenguaje de la información falsa desde una perspectiva lingüística. En el caso de información verificable, estos enfoques se han propuesto en un entorno monolingüe. Además, apenas se ha investigado la detección de las fuentes o los difusores de información falsa en las redes sociales. En esta tesis estudiamos la información falsa desde varias perspectivas. En primer lugar, dado que los trabajos anteriores se centraron en el estudio de la información falsa en un entorno monolingüe, en esta tesis estudiamos la información falsa en un entorno multilingüe. Proponemos diferentes enfoques multilingües y los comparamos con un conjunto de baselines monolingües. Además, proporcionamos estudios sistemáticos para los resultados de la evaluación de nuestros enfoques para una mejor comprensión. En segundo lugar, hemos notado que el papel de la información afectiva no se ha investigado en profundidad. Por lo tanto, la segunda parte de nuestro trabajo de investigación estudia el papel de la información afectiva en la información falsa y muestra cómo los autores de contenido falso la emplean para manipular al lector. Aquí, investigamos varios tipos de información falsa para comprender la correlación entre la información afectiva y cada tipo (Propaganda, Trucos / Engaños, Clickbait y Sátira). Por último, aunque no menos importante, en un intento de limitar su propagación, también abordamos el problema de los difusores de información falsa en las redes sociales. En esta dirección de la investigación, nos enfocamos en explotar varias características basadas en texto extraídas de los mensajes de perfiles en línea de tales difusores. Estudiamos diferentes conjuntos de características que pueden tener el potencial de ayudar a discriminar entre difusores de información falsa y verificadores de hechos.[CA] En temps recents, el desenvolupament de les xarxes socials i de les agències de notícies han portat nous reptes i amenaces a la web. Aquestes amenaces han cridat l'atenció de la comunitat investigadora en Processament de Llenguatge Natural (PLN) ja que estan contaminant les plataformes de xarxes socials. Un exemple d'amenaça serien les notícies falses, en què els usuaris difonen i comparteixen informació falsa, inexacta o enganyosa. La informació falsa no es limita a la informació verificable, sinó que també inclou informació que s'utilitza amb fins nocius. A més, un dels desafiaments als quals s'enfronten els investigadors és la gran quantitat d'usuaris en les plataformes de xarxes socials, on detectar els difusors d'informació falsa no és tasca fàcil. Els treballs previs que s'han proposat per limitar o estudiar el tema de la detecció d'informació falsa s'han centrat en comprendre el llenguatge de la informació falsa des d'una perspectiva lingüística. En el cas d'informació verificable, aquests enfocaments s'han proposat en un entorn monolingüe. A més, gairebé no s'ha investigat la detecció de les fonts o els difusors d'informació falsa a les xarxes socials. En aquesta tesi estudiem la informació falsa des de diverses perspectives. En primer lloc, atès que els treballs anteriors es van centrar en l'estudi de la informació falsa en un entorn monolingüe, en aquesta tesi estudiem la informació falsa en un entorn multilingüe. Proposem diferents enfocaments multilingües i els comparem amb un conjunt de baselines monolingües. A més, proporcionem estudis sistemàtics per als resultats de l'avaluació dels nostres enfocaments per a una millor comprensió. En segon lloc, hem notat que el paper de la informació afectiva no s'ha investigat en profunditat. Per tant, la segona part del nostre treball de recerca estudia el paper de la informació afectiva en la informació falsa i mostra com els autors de contingut fals l'empren per manipular el lector. Aquí, investiguem diversos tipus d'informació falsa per comprendre la correlació entre la informació afectiva i cada tipus (Propaganda, Trucs / Enganys, Clickbait i Sàtira). Finalment, però no menys important, en un intent de limitar la seva propagació, també abordem el problema dels difusors d'informació falsa a les xarxes socials. En aquesta direcció de la investigació, ens enfoquem en explotar diverses característiques basades en text extretes dels missatges de perfils en línia de tals difusors. Estudiem diferents conjunts de característiques que poden tenir el potencial d'ajudar a discriminar entre difusors d'informació falsa i verificadors de fets.[EN] In the recent years, the development of social media and online news agencies has brought several challenges and threats to the Web. These threats have taken the attention of the Natural Language Processing (NLP) research community as they are polluting the online social media platforms. One of the examples of these threats is false information, in which false, inaccurate, or deceptive information is spread and shared by online users. False information is not limited to verifiable information, but it also involves information that is used for harmful purposes. Also, one of the challenges that researchers have to face is the massive number of users in social media platforms, where detecting false information spreaders is not an easy job. Previous work that has been proposed for limiting or studying the issue of detecting false information has focused on understanding the language of false information from a linguistic perspective. In the case of verifiable information, approaches have been proposed in a monolingual setting. Moreover, detecting the sources or the spreaders of false information in social media has not been investigated much. In this thesis we study false information from several aspects. First, since previous work focused on studying false information in a monolingual setting, in this thesis we study false information in a cross-lingual one. We propose different cross-lingual approaches and we compare them to a set of monolingual baselines. Also, we provide systematic studies for the evaluation results of our approaches for better understanding. Second, we noticed that the role of affective information was not investigated in depth. Therefore, the second part of our research work studies the role of the affective information in false information and shows how the authors of false content use it to manipulate the reader. Here, we investigate several types of false information to understand the correlation between affective information and each type (Propaganda, Hoax, Clickbait, Rumor, and Satire). Last but not least, in an attempt to limit its spread, we also address the problem of detecting false information spreaders in social media. In this research direction, we focus on exploiting several text-based features extracted from the online profile messages of those spreaders. We study different feature sets that can have the potential to help to identify false information spreaders from fact checkers.Ghanem, BHH. (2020). On the Detection of False Information: From Rumors to Fake News [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/158570TESISCompendi

RiuNet