Experiment on Methods for Clustering and Categorization of Polish Text

Author: Dabrowska-Boruch Agnieszka
Fraczek Rafał
Jamro Ernest
Pietroń Marcin
Russek Paweł
Wiatr Kazimierz
Wielgosz Maciej
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 09/05/2017

The main goal of this work was to experimentally verify methods for the challenging task of categorizing and clustering Polish text. Supervised learning was employed for categorization and unsupervised learning for clustering. The employed methods were examined in depth on a custom-built corpus of Polish texts, which the authors assembled from Internet resources. The corpus data was acquired from a news portal and had therefore been sorted by type by journalists according to their specialization. The presented algorithms employ the Vector Space Model (VSM) and the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme. A series of experiments was conducted that revealed certain properties of the algorithms and their accuracy. The accuracy of the algorithms was evaluated by their ability to match the human arrangement of the documents by topic. For both the categorization and clustering, the authors used F-measure to assess the quality of allocation

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)
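
The abstract above names TF-IDF weighting in a Vector Space Model and the F-measure as the evaluation metric. The following is a minimal sketch of both, not the authors' implementation: the exact TF-IDF variant and the F-measure averaging used in the paper are not stated in the abstract, so this assumes relative term frequency, an unsmoothed logarithmic IDF, and a single-category F1. The function names `tf_idf` and `f_measure` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def f_measure(predicted, actual):
    """F1 score of a predicted set of document ids against a reference set."""
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)
```

With this variant, a term that occurs in every document receives weight zero, which is why terms shared across all topics contribute nothing to the document vectors.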

Design of a Controlled Language for Critical Infrastructures Protection

Author: CANTARELLA SIMONA
FERIGATO Carlo
OWUSU EVANS BOATENG
Publication venue: European Language Resources Association
Publication date: 28/03/2012

We describe a project for the construction of a controlled language for critical infrastructures protection (CIP). This project originates from the need to coordinate and categorize the communications on CIP at the European level. These communications can take the form of official documents, reports on incidents, informal communications and plain e-mail. We explore the application of traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an analogous work done during the sixties in the field of nuclear science known as the Euratom Thesaurus. JRC.G.6-Security technology assessmen

JRC Publications Repository

SPAM detection: Naïve bayesian classification and RPN expression-based LGP approaches compared

Author: A Guven
A Khorsi
AH Gandomi
AW Burks
C Sangeetha
Carlton Downey
CL Hamblin
E Stamatatos
GV Cormack
I Kononenko
J Pearl
L Hirsch
Lorrie Faith Cranor
M Basavaraju
M Brameier
M Matsumoto
M Zhang
PE Bennett
S Mukkamala
VA Yatsko
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 26/07/2016

An investigation is performed of a machine learning algorithm and the Bayesian classifier in the spam-filtering context. The paper shows the advantage of the use of Reverse Polish Notation (RPN) expressions with feature extraction compared to the traditional Naïve Bayesian classifier used for spam detection assuming the same features. The performance of the two is investigated using a public corpus and a recent private spam collection, concluding that the system based on RPN LGP (Linear Genetic Programming) gave better results compared to two popularly used open source Bayesian spam filters. © Springer International Publishing Switzerland 2016

Crossref

Institutional repository of Tomas Bata University Library
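
The spam-detection abstract above relies on Reverse Polish Notation (RPN) expressions over extracted features. As a rough illustration only, the sketch below evaluates one RPN expression against a feature dictionary using a stack. It does not reproduce the paper's LGP system: the function name `eval_rpn`, the feature names, and the restriction to three arithmetic operators are all assumptions made for this example.

```python
import operator

# Binary operators supported by this sketch (an assumed, minimal set).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_rpn(tokens, features):
    """Evaluate an RPN token list; names are looked up in `features`."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            # Operands come off the stack in reverse order.
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        elif tok in features:
            stack.append(features[tok])
        else:
            # Anything else is treated as a numeric constant.
            stack.append(float(tok))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]


# Hypothetical features of one message: link count and capital-letter ratio.
score = eval_rpn(["links", "2", "*", "caps", "+"], {"links": 3.0, "caps": 1.5})
```

Because RPN needs no parentheses or precedence rules, any token sequence that leaves exactly one value on the stack is a valid expression, which is convenient when expressions are generated or mutated automatically.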

Natural language processing and cognitive science : proceedings 2018

Author: Lubaszewski Wiesław
Sedes Florence
Sharp Bernadette
Publication venue: Jagiellonian Library
Publication date: 01/01/2018

Jagiellonian University Repository

Geometrical Product Specification and Verification as toolbox to meet up-to-date technical requirements

Author: Bills Paul J.
Dragomir Mihai
Hausotte Tino
Humienny Zbigniew
Jakubiec Wladyslaw
Marxer Michael
Mathieu Luc
Plowucha Wojciech
Savio Enrico
Wisla Norbert
Publication date: 01/01/2014

The ISO standards for Geometrical Product Specification and Verification (GPS) define an internationally uniform description language that allows all requirements for the geometry of a product, together with the corresponding requirements for the inspection process, to be expressed unambiguously and completely in technical drawings, taking into account the current possibilities of measurement and testing technology. Practice shows that the curricula of mechanical engineering faculties often include only limited classes on GPS, mostly as part of subjects such as Metrology or Fundamentals of Machine Design. This does not allow students to gain enough knowledge on the subject. Currently there is no coherent EU-wide provision for vocational education and training (VET) in this area. A consortium, whose members are the authors of this paper, is preparing a proposal for an EU project aiming to develop appropriate course

University of Huddersfield Repository

Archivio istituzionale della ricerca - Università di Padova

Huddersfield Research Portal

Constructing Datasets for Multi-hop Reading Comprehension Across Documents

Author: Riedel Sebastian
Stenetorp Pontus
Welbl Johannes
Publication date: 28/05/2018

Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently there exist no resources to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence - effectively performing multi-hop (alias multi-step) inference. We devise a methodology to produce datasets for this task, given a collection of query-answer pairs and thematically linked documents. Two datasets from different domains are induced, and we identify potential pitfalls and devise circumvention strategies. We evaluate two previously proposed competitive models and find that one can integrate information across documents. However, both models struggle to select relevant information, as providing documents guaranteed to be relevant greatly improves their performance. While the models outperform several strong baselines, their best accuracy reaches 42.9% compared to human performance at 74.0% - leaving ample room for improvement. Comment: This paper directly corresponds to the TACL version (https://transacl.org/ojs/index.php/tacl/article/view/1325) apart from minor changes in wording, additional footnotes, and appendice

arXiv.org e-Print Archive

UCL Discovery