699 research outputs found

    Entitate izendunen desanbiguazioa ezagutza-base erraldoien arabera

    Get PDF
    130 p.Gaur egun, interneten nabigatzeko orduan, ia-ia ezinbestekoak dira bilatza-ileak, eta guztietatik ezagunena Google da. Bilatzaileek egungo arrakastarenzati handi bat ezagutza-baseen ustiaketatik eskuratu dute. Izan ere, bilaketasemantikoekin kontsulta soilak ezagutza-baseetako informazioaz aberastekogai dira. Esate baterako, musika talde bati buruzko informazioa bilatzean,bere diskografia edo partaideetara esteka gehigarriak eskaintzen dituzte. Her-rialde bateko lehendakariari buruzko informazioa bilatzean, lehendakari izan-dakoen estekak edo lurralde horretako informazio gehigarria eskaintzen dute.Hala ere, gaur egun pil-pilean dauden bilaketa semantikoen arrakasta kolokanjarriko duen arazoa existitzen da. Termino anbiguoek ezagutza-baseetatikeskuratuko den informazioaren egokitasuna baldintzatuko dute. Batez ere,arazo handienak izen berezien edo entitate izendunen aipamenek sortuko di-tuzte.Tesi-lan honen helburu nagusia entitate izendunen desanbiguazioa (EID)aztertu, eta hau burutzeko teknika berriak proposatzea da. EID sistemektestuetako izen-aipamenak desanbiguatu, eta ezagutza-baseetako entitateekinlotuko dituzte. Izen-aipamenen izaera anbiguoa dela eta, hainbat entitateizendatu ditzakete. Gainera, entitate berdina hainbat izen ezberdinekinizendatu daiteke, beraz, aipamen hauek egoki desanbiguatzea tesiaren gakoaizango da.Horretarako, lehenik, arloaren egoeraren oinarri diren bi desanbiguazioeredu aztertuko dira. Batetik, ezagutza-baseen egituraz baliatzen den ereduvglobala, eta bestetik, aipamenaren testuinguruko hitzen informazioa usti-atzen duen eredu lokala. Ondoren, bi informazio iturriak modu osagarriankonbinatuko dira. Konbinazioak arloaren egoerako emaitzak hainbat datu-multzo ezberdinetan gaindituko ditu, eta gainontzekoetan pareko emaitzaklortuko ditu.Bigarrenik, edozein desanbiguazio-sistema hobetzeko helburuarekin ideiaberritzaileak proposatu, aztertu eta ebaluatu dira. Batetik, diskurtso, bil-duma eta agerkidetza mailan entitateen portaera aztertu da, entitateek pa-troi jakin bat betetzen dutela baieztatuz. Ondoren, patroi horretan oinar-rituz eredu globalaren, lokalaren eta beste EID sistema baten emaitzak moduadierazgarrian hobetu dira. Bestetik, eredu lokala kanpotiko corpusetatik es-kuratutako ezagutzarekin elikatu da. Ekarpen honekin kanpo-ezagutza honenkalitatea ebaluatu da sistemari egiten dion ekarpena justifikatuz. Gainera,eredu lokalaren emaitzak hobetzea lortu da, berriz ere arloaren egoerakobalioak eskuratuz.Tesia artikuluen bilduma gisa aurkeztuko da. Sarrera eta arloaren ego-era azaldu ondoren, tesiaren oinarri diren ingelesezko lau artikulu erantsikodira. Azkenik, lau artikuluetan jorratu diren gaiak biltzeko ondorio orokorrakplanteatuko dira

    Neural Cross-Lingual Entity Linking

    Full text link
    A major challenge in Entity Linking (EL) is making effective use of contextual information to disambiguate mentions to Wikipedia that might refer to different entities in different contexts. The problem exacerbates with cross-lingual EL which involves linking mentions written in non-English documents to entries in the English Wikipedia: to compare textual clues across languages we need to compute similarity between textual fragments across languages. In this paper, we propose a neural EL model that trains fine-grained similarities and dissimilarities between the query and candidate document from multiple perspectives, combined with convolution and tensor networks. Further, we show that this English-trained system can be applied, in zero-shot learning, to other languages by making surprisingly effective use of multi-lingual embeddings. The proposed system has strong empirical evidence yielding state-of-the-art results in English as well as cross-lingual: Spanish and Chinese TAC 2015 datasets.Comment: Association for the Advancement of Artificial Intelligence (AAAI), 201

    Entitate izendunen desanbiguazioa ezagutza-base erraldoien arabera

    Get PDF
    130 p.Gaur egun, interneten nabigatzeko orduan, ia-ia ezinbestekoak dira bilatza-ileak, eta guztietatik ezagunena Google da. Bilatzaileek egungo arrakastarenzati handi bat ezagutza-baseen ustiaketatik eskuratu dute. Izan ere, bilaketasemantikoekin kontsulta soilak ezagutza-baseetako informazioaz aberastekogai dira. Esate baterako, musika talde bati buruzko informazioa bilatzean,bere diskografia edo partaideetara esteka gehigarriak eskaintzen dituzte. Her-rialde bateko lehendakariari buruzko informazioa bilatzean, lehendakari izan-dakoen estekak edo lurralde horretako informazio gehigarria eskaintzen dute.Hala ere, gaur egun pil-pilean dauden bilaketa semantikoen arrakasta kolokanjarriko duen arazoa existitzen da. Termino anbiguoek ezagutza-baseetatikeskuratuko den informazioaren egokitasuna baldintzatuko dute. Batez ere,arazo handienak izen berezien edo entitate izendunen aipamenek sortuko di-tuzte.Tesi-lan honen helburu nagusia entitate izendunen desanbiguazioa (EID)aztertu, eta hau burutzeko teknika berriak proposatzea da. EID sistemektestuetako izen-aipamenak desanbiguatu, eta ezagutza-baseetako entitateekinlotuko dituzte. Izen-aipamenen izaera anbiguoa dela eta, hainbat entitateizendatu ditzakete. Gainera, entitate berdina hainbat izen ezberdinekinizendatu daiteke, beraz, aipamen hauek egoki desanbiguatzea tesiaren gakoaizango da.Horretarako, lehenik, arloaren egoeraren oinarri diren bi desanbiguazioeredu aztertuko dira. Batetik, ezagutza-baseen egituraz baliatzen den ereduvglobala, eta bestetik, aipamenaren testuinguruko hitzen informazioa usti-atzen duen eredu lokala. Ondoren, bi informazio iturriak modu osagarriankonbinatuko dira. Konbinazioak arloaren egoerako emaitzak hainbat datu-multzo ezberdinetan gaindituko ditu, eta gainontzekoetan pareko emaitzaklortuko ditu.Bigarrenik, edozein desanbiguazio-sistema hobetzeko helburuarekin ideiaberritzaileak proposatu, aztertu eta ebaluatu dira. Batetik, diskurtso, bil-duma eta agerkidetza mailan entitateen portaera aztertu da, entitateek pa-troi jakin bat betetzen dutela baieztatuz. Ondoren, patroi horretan oinar-rituz eredu globalaren, lokalaren eta beste EID sistema baten emaitzak moduadierazgarrian hobetu dira. Bestetik, eredu lokala kanpotiko corpusetatik es-kuratutako ezagutzarekin elikatu da. Ekarpen honekin kanpo-ezagutza honenkalitatea ebaluatu da sistemari egiten dion ekarpena justifikatuz. Gainera,eredu lokalaren emaitzak hobetzea lortu da, berriz ere arloaren egoerakobalioak eskuratuz.Tesia artikuluen bilduma gisa aurkeztuko da. Sarrera eta arloaren ego-era azaldu ondoren, tesiaren oinarri diren ingelesezko lau artikulu erantsikodira. Azkenik, lau artikuluetan jorratu diren gaiak biltzeko ondorio orokorrakplanteatuko dira

    Domain-specific named entity disambiguation in historical memoirs

    Get PDF
    This paper presents the results of the extraction of named entities from a collection of historical memoirs about the italian Resistance during the World War II. The methodology followed for the extraction and disambiguation task will be discussed, as well as its evaluation. For the semantic annotations of the dataset, we have developed a pipeline based on established practices for extracting and disambiguating Named Entities. This has been necessary, considering the poor performances of out-of-the-box Named Entity Recognition and Disambiguation (NERD) tools tested in the initial phase of this work.Questo articolo presenta l’attività di estrazione di entità nominate realizzata su una collezione di memorie relative al periodo della Resistenza italiana nella Seconda Guerra Mondiale. Verrà discussa la metodologia sviluppata per il processo di estrazione e disambiguazione delle entità nominate, nonché la sua valutazione. L’implementazione di una metodologia di estrazione e disambiguazione basata su lookup si è resa necessaria in considerazione delle scarse prestazioni dei sistemi di Named Entity Recognition and Disambiguation (NERD), come si evince dalla discussione nella prima parte di questo lavoro

    Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation

    Get PDF
    BACKGROUND: The ability to distinguish between genes and proteins is essential for understanding biological text. Support Vector Machines (SVMs) have been proven to be very efficient in general data mining tasks. We explore their capability for the gene versus protein name disambiguation task. RESULTS: We incorporated into the conventional SVM a weighting scheme based on distances of context words from the word to be disambiguated. This weighting scheme increased the performance of SVMs by five percentage points giving performance better than 85% as measured by the area under ROC curve and outperformed the Weighted Additive Classifier, which also incorporates the weighting, and the Naive Bayes classifier. CONCLUSION: We show that the performance of SVMs can be improved by the proposed weighting scheme. Furthermore, our results suggest that in this study the increase of the classification performance due to the weighting is greater than that obtained by selecting the underlying classifier or the kernel part of the SVM

    Exploratory Search on Mobile Devices

    Get PDF
    The goal of this thesis is to provide a general framework (MobEx) for exploratory search especially on mobile devices. The central part is the design, implementation, and evaluation of several core modules for on-demand unsupervised information extraction well suited for exploratory search on mobile devices and creating the MobEx framework. These core processing elements, combined with a multitouch - able user interface specially designed for two families of mobile devices, i.e. smartphones and tablets, have been finally implemented in a research prototype. The initial information request, in form of a query topic description, is issued online by a user to the system. The system then retrieves web snippets by using standard search engines. These snippets are passed through a chain of NLP components which perform an ondemand or ad-hoc interactive Query Disambiguation, Named Entity Recognition, and Relation Extraction task. By on-demand or ad-hoc we mean the components are capable to perform their operations on an unrestricted open domain within special time constraints. The result of the whole process is a topic graph containing the detected associated topics as nodes and the extracted relation ships as labelled edges between the nodes. The Topic Graph is presented to the user in different ways depending on the size of the device she is using. Various evaluations have been conducted that help us to understand the potentials and limitations of the framework and the prototype

    The Epistemological Foundations of Knowledge Representations

    Get PDF
    This paper looks at the epistemological foundations of knowledge representations embodied in retrieval languages. It considers questions such as the validity of knowledge representations and their effectiveness for the purposes of retrieval and automation. The knowledge representations it considers are derived from three theories of meaning that have dominated twentieth-century philosophy.published or submitted for publicatio

    A Named Entity Recognition System Applied to Arabic Text in the Medical Domain

    Get PDF
    Currently, 30-35% of the global population uses the Internet. Furthermore, there is a rapidly increasing number of non-English language internet users, accompanied by an also increasing amount of unstructured text online. One area replete with underexploited online text is the Arabic medical domain, and one method that can be used to extract valuable data from Arabic medical texts is Named Entity Recognition (NER). NER is the process by which a system can automatically detect and categorise Named Entities (NE). NER has numerous applications in many domains, and medical texts are no exception. NER applied to the medical domain could assist in detection of patterns in medical records, allowing doctors to make better diagnoses and treatment decisions, enabling medical staff to quickly assess a patient's records and ensuring that patients are informed about their data, as just a few examples. However, all these applications would require a very high level of accuracy. To improve the accuracy of NER in this domain, new approaches need to be developed that are tailored to the types of named entities to be extracted and categorised. In an effort to solve this problem, this research applied Bayesian Belief Networks (BBN) to the process. BBN, a probabilistic model for prediction of random variables and their dependencies, can be used to detect and predict entities. The aim of this research is to apply BBN to the NER task to extract relevant medical entities such as disease names, symptoms, treatment methods, and diagnosis methods from modern Arabic texts in the medical domain. To achieve this aim, a new corpus related to the medical domain has been built and annotated. Our BBN approach achieved a 96.60% precision, 90.79% recall, and 93.60% F-measure for the disease entity, while for the treatment method entity, it achieved 69.33%, 70.99%, and 70.15% for precision, recall, and F-measure, respectively. For the diagnosis method and symptom categories, our system achieved 84.91% and 71.34%, respectively, for precision, 53.36% and 49.34%, respectively, for recall, and 65.53% and 58.33%, for F-measure, respectively. Our BBN strategy achieved good accuracy for NEs in the categories of disease and treatment method. However, the average word length of the other two NE categories observed, diagnosis method and symptom, may have had a negative effect on their accuracy. Overall, the application of BBN to Arabic medical NER is successful, but more development is needed to improve accuracy to a standard at which the results can be applied to real medical systems
    corecore