    Experiences in evaluating multilingual and text-image information retrieval

    23 pages, 8 figures. One important step in developing information retrieval (IR) processes is evaluating the output against the user's information needs. The quality of the output depends on how the different methods applied in the IR process are integrated and on the information contained in the retrieved documents, but how can "quality" be measured? Although some of these methods can be tested in isolation, it is not always clear what will happen when several of them are integrated. For this reason, much effort has been put into finding a good combination of methods or into correctly tuning some of the algorithms involved. The current approach is to measure the precision and recall obtained when different combinations of methods are included in an IR process. This article gives a short description of the techniques and methods currently included in an IR system, paying special attention to the multilingual aspect of the problem. It also discusses their influence on the final performance of the IR process, drawing on the evaluation experience gained in two projects related to multilingual information retrieval (MIRACLE and OmniPaper). This work has been partially supported by the projects OmniPaper (European Union, 5th Framework Programme for Research and Technological Development, IST-2001-32174), NEDINE (E-Content project Ref.: 22225), and GPS Project - Software Process Management Platform: modeling, reuse, and measurement (National Research Plan, TIN2004-07083).
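
    The precision and recall figures mentioned in this abstract can be illustrated with a minimal sketch. It assumes set-based (unranked) evaluation of a single query; the function name and the document identifiers are hypothetical and are not taken from the MIRACLE or OmniPaper evaluations.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the IR system (hypothetical ids).
    relevant:  document ids judged relevant for the query.
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant documents that were retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Example: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

    Comparing these two numbers across runs that enable different combinations of methods is one simple way to see how each combination affects the final output.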

    The Eurovision St Andrews collection of photographs

    This report describes the Eurovision image collection compiled for the ImageCLEF (Cross Language Evaluation Forum) evaluation exercise. The image collection consists of around 30,000 photographs from the collection provided by the University of St Andrews Library. The construction and composition of this unique image collection are described, together with the information needed to obtain and use it.

    Re-ranking of Yahoo snippets with the JIRS passage retrieval system

    Paper presented at: Workshop on Cross Lingual Information Access, CLIA-2007, 20th International Joint Conference on Artificial Intelligence, IJCAI-07, Hyderabad, India, January 6-12, 2007. Passage Retrieval (PR) systems are used as the first step of current Question Answering (QA) systems. Usually, PR systems are traditional information retrieval systems that are not oriented to the specific problem of QA; in fact, these systems only search for the question keywords. The JIRS Distance Density n-gram system is a QA-oriented PR system that has given good results in QA tasks when applied to static document collections. JIRS searches for the question structure in the document collection in order to find the passages most likely to contain the answer. JIRS is a language-independent PR system that has already been adapted to a few non-agglutinative European languages (such as Spanish, Italian, English and French) as well as to Arabic. A first attempt to adapt it to Urdu was also made. In this paper, we investigate the possibility of basing JIRS passage retrieval on the web. The experiments we carried out show that JIRS improves the coverage of correct answers by re-ranking the snippets obtained with the Yahoo search engine. ICT EU-India; TEXT-MESS CICY
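
    The idea of re-ranking snippets by how much of the question structure they contain can be sketched as follows. This is not the JIRS Distance Density model, whose exact weighting is not given in the abstract; it is a simplified n-gram overlap score, and the function names, the `max_n` parameter, and the weighting by n-gram length are all assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap_score(question, passage, max_n=3):
    """Score a passage by the share of question n-grams it contains.

    Longer n-grams get a larger weight (n), so passages that preserve
    the word order of the question rank above plain keyword matches.
    """
    q = question.lower().split()
    p = passage.lower().split()
    score = 0.0
    for n in range(1, min(max_n, len(q)) + 1):
        q_ngrams = ngrams(q, n)
        if q_ngrams:
            score += n * len(q_ngrams & ngrams(p, n)) / len(q_ngrams)
    return score


def rerank(question, snippets, max_n=3):
    """Return the snippets sorted by descending n-gram overlap score."""
    return sorted(
        snippets,
        key=lambda s: ngram_overlap_score(question, s, max_n),
        reverse=True,
    )


snippets = ["france has many cities", "paris is the capital of france"]
for s in rerank("capital of france", snippets):
    print(s)
```

    A keyword-only ranker would give both snippets credit for "france"; the n-gram terms are what move the snippet that repeats the question phrasing to the top.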

    ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development

    We introduce "ivrit.ai", a comprehensive Hebrew speech dataset, addressing the distinct lack of extensive, high-quality resources for advancing Automated Speech Recognition (ASR) technology in Hebrew. With over 3,300 speech hours and over a thousand diverse speakers, ivrit.ai offers a substantial compilation of Hebrew speech across various contexts. It is delivered in three forms to cater to varying research needs: raw unprocessed audio, data post-Voice Activity Detection, and partially transcribed data. The dataset stands out for its legal accessibility, permitting use at no cost, thereby serving as a crucial resource for researchers, developers, and commercial entities. ivrit.ai opens up numerous applications, offering vast potential to enhance AI capabilities in Hebrew. Future efforts aim to expand ivrit.ai further, thereby advancing Hebrew's standing in AI research and technology. Comment: 9 pages, 1 table and 3 figures

    On the voice-activated question answering

    Question answering (QA) is probably one of the most challenging tasks in the field of natural language processing. It requires search engines that are capable of extracting concise, precise fragments of text that contain an answer to a question posed by the user. Incorporating voice interfaces into QA systems adds a more natural and very appealing perspective to these systems. This paper provides a comprehensive description of current state-of-the-art voice-activated QA systems. Finally, the scenarios that will emerge from the introduction of speech recognition in QA are discussed. © 2006 IEEE. This work was supported in part by Research Projects TIN2009-13391-C04-03 and TIN2008-06856-C05-02. This paper was recommended by Associate Editor V. Marik. Rosso, P.; Hurtado Oliver, LF.; Segarra Soriano, E.; Sanchís Arnal, E. (2012). On the voice-activated question answering. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews. 42(1):75-85. https://doi.org/10.1109/TSMCC.2010.2089620

    A preliminary evaluation of metadata records machine translation

    Article discussing a preliminary evaluation study of metadata records machine translation. This study evaluates the performance of freely available machine translation (MT) services in translating metadata records.


    This work presents the problem of passage retrieval, the domain of legal texts and patents, and the linguistic diversity that characterizes them. Techniques for solving information retrieval problems are presented, and two participations in competitions with proposals of novel approaches are analyzed. Correa García, S. (2010). RECUPERACIÓN DE PASAJES EN TEXTOS LEGALES Y PATENTES MULTILINGÜES. http://hdl.handle.net/10251/14084

    JACY - a grammar for annotating syntax, semantics and pragmatics of written and spoken Japanese for NLP application purposes

    In this text, we describe the development of a broad coverage grammar for Japanese that has been built for and used in different application contexts. The grammar is based on work done in the Verbmobil project (Siegel 2000) on machine translation of spoken dialogues in the domain of travel planning. The second application for JACY was the automatic email response task. Grammar development was described in Oepen et al. (2002a). Third, it was applied to the task of understanding material on mobile phones available on the internet, while embedded in the project DeepThought (Callmeier et al. 2004, Uszkoreit et al. 2004). Currently, it is being used for treebanking and ontology extraction from dictionary definition sentences by the Japanese company NTT (Bond et al. 2004).

    Methods for Answer Extraction in Textual Question Answering

    In this thesis we present and evaluate two pattern matching based methods for answer extraction in textual question answering systems. A textual question answering system is a system that seeks answers to natural language questions from unstructured text. Textual question answering systems are an important research problem because, as the amount of natural language text in digital format keeps growing, the need for novel methods for pinpointing important knowledge in vast textual databases becomes more and more urgent. We concentrate on developing methods for the automatic creation of answer extraction patterns. A new type of extraction pattern is also developed. The pattern matching based approach chosen is interesting because of its language and application independence. The answer extraction methods are developed in the framework of our own question answering system. Publicly available datasets in English are used as training and evaluation data for the methods. The techniques developed are based on the well known methods of sequence alignment and hierarchical clustering. The similarity metric used is based on edit distance. The main conclusion of the research is that answer extraction patterns consisting of the most important words of the question and of the following information extracted from the answer context (plain words, part-of-speech tags, punctuation marks and capitalization patterns) can be used in the answer extraction module of a question answering system. This type of pattern and the two new methods for generating answer extraction patterns provide average results when compared to those produced by other systems using the same dataset. However, most answer extraction methods in the question answering systems tested with the same dataset are both hand crafted and based on a system-specific and fine-grained question classification. The new methods developed in this thesis require no manual creation of answer extraction patterns. As a source of knowledge, they require a dataset of sample questions and answers, as well as a set of text documents that contain answers to most of the questions. The question classification used in the training data is a standard one and is already provided in the publicly available data.

    A textual question answering system is a computer program that answers questions posed by the user with answers it extracts from text documents. Textual question answering systems are an important research problem because the number of text documents in digital form is growing continuously. At the same time, the need grows for information retrieval methods that let the user find the essential information in text documents quickly and easily. Question answering systems have been studied since the 1960s. The first systems could answer a narrow set of fixed-form questions about a strictly restricted domain, such as baseball results. Today, research on question answering systems focuses on systems in which questions can be phrased fairly freely and can concern any domain. In current systems, retrieval often targets large collections of text documents such as the WWW and newspaper news archives. On the other hand, restricted-domain systems also remain an important research topic. Practical examples of restricted-domain systems are systems that support companies' customer service. These systems automatically handle some of the questions that customers address to the company, or serve as a tool for a customer service agent looking for information to answer a customer's question. The methods developed in this thesis are applicable to both open-domain and restricted-domain question answering systems. The thesis develops two new methods for extracting answers from text and a textual question answering system that uses both methods. The methods have been evaluated on publicly available test data. The answer extraction methods developed in the thesis are learning-based: the patterns used for answer extraction do not need to be programmed but are produced automatically from example data. Learning makes the development of new question answering systems more efficient. Efficient system development is especially important when several language versions of a system are needed. Adding new question and text types to the system also becomes easier thanks to the learning-based method.
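
    The abstract states that the similarity metric is based on edit distance but does not give the exact variant. As a minimal sketch, the following assumes the standard Levenshtein distance with unit costs, applied to token sequences, and a length-normalized similarity; the function names, the normalization, and the `<ANSWER>` placeholder token are assumptions for illustration, not the thesis's definitions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists).

    Unit cost for insertion, deletion and substitution (an assumption).
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Two hypothetical answer contexts, tokenized, with the answer slot marked.
ctx1 = ["was", "born", "in", "<ANSWER>"]
ctx2 = ["was", "born", "on", "<ANSWER>"]
print(edit_distance(ctx1, ctx2), similarity(ctx1, ctx2))
```

    A pairwise similarity of this kind is the sort of input that hierarchical clustering needs in order to group similar answer contexts before patterns are generalized from them.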