70 research outputs found

Feature Engineering for Domain Independent Named Entity Recognition and Biomedical Text Mining Applications

Author: Szarvas György
Publication date: 24/11/2008

SZTE Doktori Értekezések Repozitórium (SZTE Repository of Dissertations)

Named entity recognition for Hungarian using various machine learning algorithms

Author: Farkas Richárd
Kocsor András
Szarvas György
Publication date: 01/01/2006

In this paper we introduce a statistical Named Entity recognizer (NER) system for the Hungarian language. We examined three methods for identifying and disambiguating proper nouns (Artificial Neural Network, Support Vector Machine, C4.5 Decision Tree), their combinations, and the effects of dimensionality reduction. We used a segment of the Szeged Corpus [5] for training and validation, which consists of short business news articles collected from MTI (Hungarian News Agency, www.mti.hu). Our results were presented at the Second Conference on Hungarian Computational Linguistics [7]. Our system makes use of both language-dependent features (describing the orthography of proper nouns in Hungarian) and other, language-independent information such as capitalization. Since we avoided the inclusion of large gazetteers of pre-classified entities, the system remains portable across languages without requiring any major modification, as long as the few specialized orthographical and syntactic characteristics are collected for a new target language. The best-performing model achieved an F measure of 91.95%.

University of Szeged

Linguistic scope-based and biological event-based speculation and negation annotations in the BioScope and Genia Event corpora

Author: Farkas Richárd
Móra György
Ohta Tomoko
Szarvas György
Vincze Veronika
Publication venue: BioMed Central
Publication date: 01/01/2011

Crossref

PubMed Central

Nyelvfüggetlen tulajdonnév-felismerő rendszer, és alkalmazása különböző domainekre [A language-independent named entity recognition system and its application to different domains]

Author: Farkas Richárd
Szarvas György
Publication date: 01/01/2006

University of Szeged

Statistical named entity recognition for Hungarian

Author: Farkas Richárd
Szarvas György
Publication date: 01/01/2004

In this paper, we present a decision-tree-based statistical Named Entity recognizer system for Hungarian. The model was trained and tested on a segment of the Szeged Corpus containing short business news articles collected from MTI (Hungarian News Agency, www.mti.hu). We applied C4.5 for classification and examined the accuracy of the system using training sets of different sizes. For this task we used only numerically encodable information (we excluded the word form itself), which included some orthographical rules specific to Hungarian, but we also trained the system to recognize foreign-language proper nouns that appear frequently in business news. In the experiments, the best model achieved an F measure of 89.6%.

University of Szeged

Eljárás radiológiai leletek automatikus BNO kódolására [A method for the automatic ICD coding of radiology reports]

Author: Farkas Richárd
Szarvas György
Publication date: 01/01/2007

In this article we report on the results of an open challenge organized in spring 2007 by American hospitals and research institutes. The goal of the challenge was to label radiology reports automatically with ICD-9-CM codes (a coding system used for billing, equivalent to the International Classification of Diseases, abbreviated BNO in Hungarian). Compared with earlier text-processing challenges, the task was interesting because of the large number of codes that could be assigned to a text and the internal dependencies among the labels of the coding system (in total, 96 different combinations of 45 codes occurred in the corpus). Developing computational methods that enable the automatic classification of reports is vital, since roughly 25 billion dollars a year is spent on coding medical text documents and on correcting the errors made in the process, for example in the United States. The lesson of the systems submitted to the challenge is that processing clinical documents effectively, with accuracy approaching that of humans, is not an impossible goal with the tools available today.

University of Szeged

An apple-to-apple comparison of Learning-to-rank algorithms in terms of Normalized Discounted Cumulative Gain

Author: Busa-Fekete Róbert
Kégl B.
Szarvas György
Éltető Tamás
Publication venue: IOS Press
Publication date: 28/08/2012

The Normalized Discounted Cumulative Gain (NDCG) is a widely used evaluation metric for learning-to-rank (LTR) systems. NDCG is designed for ranking tasks with more than one relevance level. There are many freely available, open source tools for computing the NDCG score for a ranked result list. Even though the definition of NDCG is unambiguous, the various tools can produce different scores for ranked lists with certain properties, which undermines the empirical tests in many published papers and makes results published in different studies difficult to compare. In this study, first, we identify the major differences between the various publicly available NDCG evaluation tools. Second, based on a set of comparative experiments using a common benchmark dataset in LTR research and 6 different LTR algorithms, we demonstrate how these differences affect the overall performance of different algorithms and the final scores that are used to compare different systems.
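
As a rough illustration of how a seemingly unambiguous metric can still yield different scores, the sketch below computes NDCG under one common convention (gain 2^rel - 1, log2 position discount). The `empty_value` parameter, which sets the score of a query with no relevant documents, is an assumption introduced here for illustration only; the abstract does not state which properties of ranked lists cause the tools to disagree.

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2 position discount."""
    if k is not None:
        relevances = relevances[:k]
    # Position 0 is the top of the list, so the discount is log2(position + 2).
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg(relevances, k=None, empty_value=0.0):
    """NDCG of a ranked list of graded relevance labels.

    empty_value is the score returned when the ideal DCG is zero, that is,
    when no document in the list is relevant (hypothetical convention knob).
    """
    ideal = dcg(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return empty_value
    return dcg(relevances, k) / ideal

# Two queries; the second one has no relevant documents at all.
queries = [[2, 0, 1], [0, 0, 0]]
mean_zero = sum(ndcg(q, empty_value=0.0) for q in queries) / len(queries)
mean_one = sum(ndcg(q, empty_value=1.0) for q in queries) / len(queries)
print(mean_zero, mean_one)
```

With these two toy queries, the choice of `empty_value` alone shifts the mean NDCG by 0.5, which shows how a convention of this kind could make scores from different tools incomparable.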

HAL-CentraleSupelec

HAL-IN2P3

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

HAL-Rennes 1

Statisztikai alapú tulajdonnév-felismerő magyar nyelvre [A statistical named entity recognizer for Hungarian]

Author: Farkas Richárd
Szarvas György
Publication date: 01/01/2004

In this article we present a decision-tree-based statistical named entity recognizer system for Hungarian. We trained and tested the model on the segment of the Szeged Corpus that contains short business news items from the MTI website, and examined its accuracy with training sets of different sizes and compositions. For the task we used only numerically encodable information (we did not use the word form), which included rules specific to the orthography of Hungarian proper nouns, but we also aimed to identify the large number of proper nouns of foreign origin that occur in business news. The most accurate model in the experiments achieved an F measure of 89.6%.

University of Szeged

Tune and mix: learning to rank using ensembles of calibrated multi-class classifiers

Author: Busa-Fekete Róbert
Kégl Balázs
Éltető Tamás
Szarvas György
Publication venue: Springer
Publication date: 01/01/2013

In subset ranking, the goal is to learn a ranking function that approximates a gold standard partial ordering of a set of objects (in our case, a set of documents retrieved for the same query). The partial ordering is given by relevance labels representing the relevance of documents with respect to the query on an absolute scale. Our approach consists of three simple steps. First, we train standard multi-class classifiers (AdaBoost.MH and multi-class SVM) to discriminate between the relevance labels. Second, the posteriors of multi-class classifiers are calibrated using probabilistic and regression losses in order to estimate the Bayes-scoring function which optimizes the Normalized Discounted Cumulative Gain (NDCG). In the third step, instead of selecting the best multi-class hyperparameters and the best calibration, we mix all the learned models in a simple ensemble scheme. Our extensive experimental study is itself a substantial contribution. We compare most of the existing learning-to-rank techniques on all of the available large-scale benchmark data sets using a standardized implementation of the NDCG score. We show that our approach is competitive with conceptually more complex listwise and pairwise methods, and clearly outperforms them as the data size grows. As a technical contribution, we clarify some of the confusing results related to the ambiguities of the evaluation tools, and propose guidelines for future studies.
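
The second and third steps described above can be sketched loosely as follows. This is not the authors' implementation: it assumes the class posteriors are already calibrated, scores each document by its expected NDCG gain (2^k - 1 weighted by the posterior of label k), and mixes models by a plain weighted average. The function names, the uniform default weights, and the toy posteriors are all hypothetical.

```python
def expected_gain(posterior):
    """Score one document from its class posterior.

    posterior[k] is the estimated probability that the document has
    relevance label k; the score is the expected gain 2^k - 1.
    """
    return sum(p * (2 ** k - 1) for k, p in enumerate(posterior))

def ensemble_scores(model_posteriors, weights=None):
    """Mix several models by averaging their per-document scores.

    model_posteriors[m][d] is the posterior of model m for document d.
    Uniform weights are an assumption made for this sketch.
    """
    n_models = len(model_posteriors)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_docs = len(model_posteriors[0])
    return [
        sum(w * expected_gain(model[d])
            for w, model in zip(weights, model_posteriors))
        for d in range(n_docs)
    ]

def rank(scores):
    """Return document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy posteriors over labels {0, 1, 2} for three documents and two models.
model_a = [[0.1, 0.2, 0.7], [0.8, 0.1, 0.1], [0.3, 0.5, 0.2]]
model_b = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
scores = ensemble_scores([model_a, model_b])
print(rank(scores))
```

Scoring by expected gain turns a multi-class classifier into a ranker without any pairwise or listwise training, which is the sense in which the approach is conceptually simple.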

HAL-IN2P3

Crossref

Publikationer från Linköpings universitet

SZTE Publicatio Repozitórium - SZTE - Repository of Publications

Digitala Vetenskapliga Arkivet - Academic Archive On-line