15 research outputs found

Hungarian-Somali-English Online Dictionary and Taxonomy

Author: Endrédy István
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2014

Repository of the Academy's Library

Improving chunker performance using a web-based semi-automatic training data analysis tool

Author: Endrédy István
Publication venue: Uniwersytet im. Adama Mickiewicza w Poznaniu
Publication date: 01/01/2015

Repository of the Academy's Library

Corpus based evaluation of stemmers

Author: Endrédy István
Publication venue: Uniwersytet im. Adama Mickiewicza w Poznaniu
Publication date: 01/01/2015

Repository of the Academy's Library

Szótövesítők összehasonlítása és alkalmazásaik [Comparison of stemmers and their applications]

Authors: Endrédy István, Novák Attila
Publication venue: Magyar Tudományos Akadémia
Publication date: 01/01/2015

Repository of the Academy's Library

HunTag3, a general-purpose, modular sequential tagger - chunking phrases in English and maximal NPs and NER for Hungarian

Authors: Endrédy István, Indig Balázs
Publication venue: Uniwersytet im. Adama Mickiewicza w Poznaniu
Publication date: 01/01/2015

Repository of the Academy's Library

More effective boilerplate removal – the GoldMiner algorithm

Authors: Endrédy István, Novák Attila
Publication venue: Centro de Innovación y Desarrollo Tecnológico en Cómputo, Instituto Politécnico Nacional
Publication date: 01/01/2013

Abstract: The ever-increasing web is an important source for building large-scale corpora. However, dynamically generated web pages often contain much irrelevant and duplicated text, which impairs the quality of the corpus. To ensure the high quality of web-based corpora, a good boilerplate removal algorithm is needed to extract only the relevant content from web pages. In this article, we present an automatic text extraction procedure, GoldMiner, which by enhancing a previously published boilerplate removal algorithm, minimizes the occurrence of irrelevant duplicated content in corpora, and keeps the text more coherent than previous tools. The algorithm exploits similarities in the HTML structure of pages coming from the same domain. A new evaluation document set (CleanPortalEval) is also presented, which can demonstrate the power of boilerplate removal algorithms for web portal pages.

Index Terms: corpus building, boilerplate removal, the web as corpus

I. THE TASK

When constructing corpora from web content, the extraction of relevant text from dynamically generated HTML pages is not a trivial task due to the great amount of irrelevant repeated text that needs to be identified and removed so that it does not compromise the quality of the corpus. This task, called boilerplate removal in the literature, consists of categorizing HTML content as valuable vs. irrelevant, filtering out menus, headers and footers, advertisements, and structure repeated on many pages. In this paper, we present a boilerplate removal algorithm that removes irrelevant content from crawled content more effectively than previous tools.

The structure of our paper is as follows. First, we present some tools that we used as baselines when evaluating the performance of our system. The algorithm implemented in one of these tools, jusText, is also used as part of our enhanced boilerplate removal algorithm. This is followed by the presentation of the enhanced system, called GoldMiner, and the evaluation of the results
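
The abstract above says the approach exploits similarities between pages coming from the same domain. The following is a minimal sketch of that general idea only, not the GoldMiner algorithm itself: it works on plain text lines rather than HTML structure, and it does not use jusText. The function names, the `min_share` threshold, and the sample pages are all hypothetical and chosen for illustration.

```python
from collections import Counter


def split_blocks(page_text):
    """Split a page's text into non-empty, whitespace-stripped lines."""
    return [line.strip() for line in page_text.splitlines() if line.strip()]


def find_repeated_blocks(pages, min_share=0.5):
    """Return the blocks that occur on at least min_share of the pages.

    Assumption for this sketch: a block repeated across many pages of
    one domain (menu, footer) is treated as boilerplate.
    """
    counts = Counter()
    for page in pages:
        # Count each block at most once per page.
        counts.update(set(split_blocks(page)))
    threshold = min_share * len(pages)
    return {block for block, n in counts.items() if n >= threshold}


def remove_boilerplate(page_text, repeated):
    """Keep only the blocks that are not in the repeated set."""
    return [b for b in split_blocks(page_text) if b not in repeated]


# Hypothetical pages from a single domain.
pages = [
    "Home | News | Contact\nFlood warnings issued for the river valley.\n(c) Example Portal",
    "Home | News | Contact\nLocal team wins the regional final.\n(c) Example Portal",
    "Home | News | Contact\nNew library opens downtown.\n(c) Example Portal",
]

repeated = find_repeated_blocks(pages)
cleaned = [remove_boilerplate(p, repeated) for p in pages]
```

In this toy input the menu and footer lines appear on every page and are dropped, while each page keeps its single article sentence. A real system would need to handle near-duplicates and short pages, which this sketch ignores.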

CiteSeerX

Crossref

Repository of the Academy's Library

Gut, Besser, Chunker – Selecting the Best Models for Text Chunking with Voting

Authors: Endrédy István, Indig Balázs
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2018

Crossref

Repository of the Academy's Library

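Nyelvtechnológiai algoritmusok korpuszok automatikus építéséhez és pontosabb feldolgozásukhoz [Language technology algorithms for the automatic building of corpora and their more accurate processing]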

Author: Endrédy István
Publication date: 01/01/2016

REAL-PhD

Gut, Besser, Chunker – Selecting the Best Models for Text Chunking with Voting

Authors: Endrédy István, Indig Balázs
Publication venue: Springer International Publishing
Publication date: 01/01/2018

Repository of the Academy's Library

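Egy hatékonyabb webes sablonszűrő algoritmus: avagy miként lehet a cumisüveg potenciális veszélyforrás Obamára nézve [A more effective web boilerplate removal algorithm: or how a baby bottle can be a potential source of danger to Obama]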

Authors: Endrédy István, Novák Attila
Publication date: 01/01/2013

Large corpora can be built efficiently from the content of the continuously growing web. However, dynamically generated web pages often contain a lot of irrelevant and repeated text, which degrades corpus quality by over-representing certain boilerplate passages and expressions. In this article, we present an automatic text extraction procedure that minimizes the occurrence of irrelevant repeated parts in corpora downloaded from the web more effectively than earlier methods

University of Szeged