15 research outputs found

    Cross-lingual Similarity Calculation for Plagiarism Detection and More - Tools and Resources

    A system that recognises cross-lingual plagiarism needs to establish, among other things, whether two pieces of text written in different languages are equivalent to each other. Potthast et al. (2010) give a thorough overview of this challenging task. While the Joint Research Centre (JRC) is not specifically concerned with plagiarism, it has been working for many years on developing other cross-lingual functionalities that may well be useful for the plagiarism detection task, namely (a) cross-lingual document similarity calculation, (b) subject domain profiling of documents in many different languages according to the same multilingual subject domain categorisation scheme, and (c) the recognition of name spelling variants for the same entity, both within the same language and across different languages and scripts. The speaker will explain the algorithms behind these software tools and present a number of freely available language resources that can be used to develop software with cross-lingual functionality. (JRC.G.2 - Global security and crisis management)
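    As a minimal sketch of functionality (a), two documents can be compared across languages once both are mapped into a shared, language-independent descriptor space (for example, EuroVoc subject domains as produced by the JRC's JEX tool described below); the descriptor IDs and weights here are invented for illustration:

```python
# Minimal sketch: cross-lingual similarity via a shared, language-independent
# descriptor space. Descriptor IDs and weights below are illustrative only.
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse descriptor->weight vectors."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical descriptor vectors for an English and a German document,
# e.g. subject-domain weights produced by a tool such as JEX.
doc_en = {"1309": 0.8, "2771": 0.5, "4470": 0.2}
doc_de = {"1309": 0.7, "2771": 0.6, "5541": 0.1}

print(f"cross-lingual similarity: {cosine(doc_en, doc_de):.3f}")
```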

    Hybrid Approach Combining Statistical and Rule-Based Models for the Automated Indexing of Bibliographic Metadata in the Area of Planning and Building Construction

    ICONDA® Bibliographic (International Construction Database) is a bibliographic database containing English-language documents in the area of planning and building construction. The documents are indexed with descriptors from controlled vocabularies (the FINDEX thesauri and an authority list). The manual assignment of descriptors is time-consuming and expensive. To solve this problem, an automated indexing system was developed. The indexing system combines a statistical classifier based on the vector space model with a rule-based classifier. In the statistical classifier, descriptor profiles are automatically trained from already indexed documents. The results provided by the statistical classifier are then improved by the rule-based classifier, which filters out incorrect descriptors and adds missing ones. The rules can be created manually or derived automatically from already indexed documents. The hybrid approach is particularly useful when a descriptor cannot be successfully trained by the statistical classifier; in this case, the system can easily be fine-tuned by adding specific rules for that descriptor.
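    The following sketch illustrates the hybrid idea in miniature; the descriptors, profiles and rules are invented, and simple term overlap stands in for the cosine similarity of a real vector space model:

```python
# Hybrid indexing sketch: a vector-space-style classifier ranks descriptors
# against trained profiles, then hand-written rules remove or add descriptors.
from collections import Counter

def rank_descriptors(doc_terms, profiles):
    """Score each descriptor profile by term overlap (stand-in for cosine
    similarity over TF-IDF vectors) and return descriptors, best first."""
    doc = Counter(doc_terms)
    scores = {d: sum(doc[t] for t in terms) for d, terms in profiles.items()}
    return sorted(scores, key=scores.get, reverse=True)

def apply_rules(doc_terms, descriptors):
    """Rule layer: filter incorrect descriptors and add missing ones.
    Both rules are invented for illustration."""
    result = [d for d in descriptors
              if not (d == "timber" and "steel" in doc_terms)]  # filter rule
    if "insulation" in doc_terms and "energy efficiency" not in result:
        result.append("energy efficiency")                      # add rule
    return result

profiles = {"timber": {"wood", "timber"},
            "steel": {"steel", "girder"},
            "insulation": {"insulation", "thermal"}}
terms = ["steel", "girder", "insulation", "thermal"]
candidates = rank_descriptors(terms, profiles)[:2]
print(apply_rules(terms, candidates))
# -> ['steel', 'insulation', 'energy efficiency']
```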

    The MARCELL Legislative Corpus

    ODINet - Online Data Integration Network

    With the expansion of Open Data, and in line with the latest EU directives on open access, the attention of public administrations, research bodies and businesses has turned to web publishing of data in open formats. However, a specialised search engine for datasets, playing a role similar to that of Google for web pages, is not yet widespread. This article presents the Online Data Integration Network (ODINet) project, which aims to define a new technological framework for access to and online dissemination of structured and heterogeneous data through innovative methods of cataloguing, searching and displaying data on the web. In this article, we focus on the semantic component of the platform, emphasising how we built and used ontologies. We further describe the Social Network Analysis (SNA) techniques we exploited to analyse the resulting ontology graph and to retrieve the required information. The testing phase of the project, which is still in progress, has already demonstrated the validity of the ODINet approach.
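    A minimal sketch of the SNA idea, assuming the ontology can be treated as an undirected concept graph; the concepts, edges and the choice of degree centrality are illustrative, not ODINet's actual implementation:

```python
# SNA sketch: treat the ontology as a graph and use a centrality measure to
# rank concepts when answering a query. Concepts and edges are invented.
import networkx as nx

g = nx.Graph()
g.add_edges_from([("health", "hospital"), ("health", "spending"),
                  ("spending", "budget"), ("budget", "municipality"),
                  ("hospital", "municipality"), ("health", "population")])

# Degree centrality as a simple stand-in for the SNA measures a platform
# like this might use to decide which concepts (and the datasets linked to
# them) are most relevant.
centrality = nx.degree_centrality(g)
for concept, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{concept:12s} {score:.2f}")
```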

    ODINet: An Innovative Framework for Online Access to and Dissemination of Structured and Heterogeneous Data

    ODINet is a research and development project approved as part of the Regional Operational Programme through the European Regional Development Fund 2007-2013. The project involves the construction of a semantic search engine prototype able to catalogue data in an ontological graph, extract the most relevant information depending on user requests, and return it in a highly usable way. The application domain covers the social, economic and health sectors, so as to cover most of the data held by public bodies in the national context. The focus of this report is primarily the description of the semantic components of the platform, emphasising how the ontologies have been used to build an index in the form of a graph. We also present a description of our semantic searches and, finally, an analysis of the results obtained in the final stage of testing the ODINet prototype.
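    To make the graph-shaped index concrete, the following hypothetical sketch links concepts to neighbouring concepts and to dataset identifiers, and answers a query by expanding one hop through the graph; all names are invented:

```python
# Graph-index sketch: concepts link to neighbouring concepts and to dataset
# identifiers; a query expands one hop before collecting datasets.
concept_links = {"health": ["spending", "hospital"],
                 "spending": ["budget"],
                 "hospital": [], "budget": []}
concept_datasets = {"health": ["regional_health_2013"],
                    "spending": ["public_spending_2012"],
                    "hospital": ["hospital_beds_2013"],
                    "budget": ["municipal_budget_2013"]}

def search(query_concept: str) -> list[str]:
    """Collect datasets for the query concept and its direct neighbours."""
    hits = list(concept_datasets.get(query_concept, []))
    for neighbour in concept_links.get(query_concept, []):
        hits.extend(concept_datasets.get(neighbour, []))
    return hits

print(search("health"))
# -> ['regional_health_2013', 'public_spending_2012', 'hospital_beds_2013']
```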

    JRC EuroVoc Indexer JEX – A freely available multi-label categorisation tool

    EuroVoc (2012) is a highly multilingual thesaurus consisting of over 6,700 hierarchically organised subject domains used by European institutions and many authorities in Member States of the European Union (EU) for the classification and retrieval of official documents. JEX is JRC-developed multi-label classification software that learns from manually labelled data to automatically assign EuroVoc descriptors to new documents in a profile-based category-ranking task. The JEX release consists of trained classifiers for 22 official EU languages, parallel training data in the same languages, an interface that allows viewing and amending the assignment results, and a module that allows users to re-train the tool on their own document collections. JEX allows advanced users to change the document representation so as to possibly improve the categorisation result through linguistic pre-processing. JEX can be used as a tool for interactive EuroVoc descriptor assignment, to increase the speed and consistency of the human categorisation process, or it can be used fully automatically. The output of JEX is a language-independent EuroVoc feature vector, which also lends itself as input to various other language technology tasks, including cross-lingual clustering and classification, cross-lingual plagiarism detection, sentence selection and ranking, and more.
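    A minimal sketch of profile-based category ranking in this style, assuming each descriptor has a trained profile of weighted terms; the profiles, weights and document are invented, not JEX's actual data:

```python
# Profile-based ranking sketch: score a document against every descriptor
# profile; the score vector doubles as a ranked descriptor list and as a
# language-independent document representation.
from collections import Counter

profiles = {
    "fisheries":   {"fish": 2.0, "quota": 1.5, "vessel": 1.0},
    "agriculture": {"farm": 2.0, "crop": 1.5, "subsidy": 1.0},
    "environment": {"pollution": 2.0, "emission": 1.5, "quota": 0.5},
}

def descriptor_vector(tokens: list[str]) -> dict[str, float]:
    """Score the document against every descriptor profile."""
    counts = Counter(tokens)
    return {d: sum(w * counts[t] for t, w in p.items())
            for d, p in profiles.items()}

doc = ["fish", "quota", "vessel", "quota", "emission"]
vec = descriptor_vector(doc)
top = sorted(vec, key=vec.get, reverse=True)[:2]   # multi-label assignment
print(vec, "->", top)
# -> {'fisheries': 6.0, 'agriculture': 0.0, 'environment': 2.5}
#    -> ['fisheries', 'environment']
```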