98,693 research outputs found

Evaluation of language identification methods using 285 languages

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: 'Linkoping University Electronic Press'
Publication date: 01/01/2017
Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Language identification in texts

Author: Jauhiainen Tommi
Publication venue: 'University of Helsinki Libraries'
Publication date: 28/05/2019
This work investigates the task of identifying the language of digitally encoded text. Automatic methods for language identification have been developed since the 1960s. Over the years, the significance of language identification as a preprocessing step has grown as other natural language processing systems have become mainstream in day-to-day applications. The methods used for language identification are mostly shared with other text classification tasks, as almost any modern machine learning method can be trained to distinguish between different languages. We begin the work by taking a detailed look at the research conducted in the field so far. As part of this work, we provide the largest survey on language identification available so far. Comparing the performance of different language identification methods presented in the literature has been difficult in the past. Before the introduction of a series of language identification shared tasks at the VarDial workshops, there were no widely accepted standard datasets that could be used to compare different methods. The shared tasks mostly concentrated on the issue of distinguishing between similar languages, but other open issues relating to language identification were addressed as well. In this work, we present the methods for language identification we have developed while participating in the shared tasks from 2015 to 2017. Most of the research for this work was accomplished within the Finno-Ugric Languages and the Internet project. In the project, our goal was to find and collect texts written in rare Uralic languages on the Internet. In addition to the open issues addressed at the shared tasks, we dealt with issues concerning domain compatibility and the number of languages. We created an evaluation set-up for addressing short out-of-domain texts in a large number of languages. 
Using the set-up, we evaluated our own method as well as other promising methods from the literature. The last issue we address in this work is the handling of multilingual documents. We developed a method for language set identification and used a previously published dataset to evaluate its performance.
This dissertation studies the automatic identification of the language of text in digital form. Automatic methods for identifying the language of text have been developed since the 1960s. Over the past decades, the importance of language identification as part of larger information systems has gradually grown. The language of a text needs to be identified so that suitable language technology methods can be applied in its further processing. Language identification of text means determining the language or languages of a text whose language or languages are unknown. For the most part, the methods used for language identification are, or can be, used to classify text by its other properties as well, such as topic. In the survey article included in this article-based dissertation, we present the research on language identification to date extensively and go through the methods used for language identification so far. The next three articles of the dissertation present the language identification methods we used in the international language identification competitions organized in connection with the VarDial workshops from 2015 to 2017. Most of the research in this dissertation was carried out as part of the Finno-Ugric Languages and the Internet project, funded by the Kone Foundation. The goal of the project was to find texts on the Internet written in the rarer Uralic languages, and the fifth article of the dissertation focuses on describing the early stages of the project. The sixth article describes how the language identifier attached to the project's web crawler was evaluated in a demanding test environment containing texts written in 285 different languages. The seventh and final article deals with determining the set of languages in multilingual texts.

Helsingin yliopiston digitaalinen arkisto

Research Findings on Empirical Evaluation of Requirements Specifications Approaches

Author: Condori-Fernandez Nelly
Daneva Maya
Dieste Oscar
Pastor Oscar
Sikkel Klaas
Wieringa Roel
Publication venue: Valparaiso University Press
Publication date: 01/01/2009
Numerous software requirements specification (SRS) approaches have been proposed in software engineering. However, there has been little empirical evaluation of the use of these approaches in specific contexts. This paper describes the results of a mapping study, a key instrument of the evidence-based paradigm, in an effort to understand what aspects of SRS are evaluated, in which context, and using which research method. On the basis of 46 identified and categorized primary studies, we found that understandability is the most commonly evaluated aspect of SRS, experiments are the most commonly used research method, and the academic environment is where most empirical evaluation takes place.

University of Twente Research Information

Knowledge-based support in Non-Destructive Testing for health monitoring of aircraft structures

Author: Kamsu-Foguem Bernard
Publication venue: 'Elsevier BV'
Publication date: 01/01/2012
Maintenance manuals include general methods and procedures for industrial maintenance, and they contain information about the principles of maintenance methods. In particular, Non-Destructive Testing (NDT) methods are important for the detection of aeronautical defects, and they can be used for various kinds of material and in different environments. Conventional non-destructive evaluation inspections are done at periodic maintenance checks. Usually, the list of tools used in a maintenance program is simply placed in the introduction of manuals, without any detail regarding their characteristics, except for a short description of the manufacturer and the tasks in which they are employed. Improving the identification concepts of the maintenance tools is needed to manage the set of equipment and establish a system of equivalence: it is necessary to have a consistent maintenance conceptualization, flexible enough to fit all current equipment, but also all equipment likely to be added or used in the future. Our contribution is related to the formal specification of the system of functional equivalences, which can facilitate maintenance activities by providing a means to determine whether one tool can be substituted for another by observing their key parameters in the identified characteristics. Reasoning mechanisms of conceptual graphs constitute the baseline elements to measure the fit between an equipment model and a maintenance activity model. Graph operations are used for processing answers to a query, and this graph-based approach to the search method is in line with the logical view of information retrieval. The methodology described supports knowledge formalization and capitalization of the experience of NDT practitioners. As a result, it enables the selection of an NDT technique and outlines its capabilities with acceptable alternatives.

CiteSeerX

Open Archive Toulouse Archive Ouverte