
    Corpus-based automatic detection of example sentences for dictionaries for Estonian learners

    [The electronic version of the dissertation does not include the publications.]
    The function of an example sentence in a dictionary is to help the reader understand the meaning of the headword and to illustrate its contexts of use. Nowadays, the main source of example sentences is a large text corpus, in which suitable sentences are hard to find by hand. Fortunately, e-lexicography has produced automatic tools that help detect various kinds of information for dictionaries, including example sentences. The dissertation examines the parameters that characterise the example sentences of the Dictionary of Estonian (2019), the Basic Estonian Dictionary (2014), and the Estonian Collocations Dictionary (2019), as well as the sentences of the Estonian Coursebook Corpus (2018); all four were compiled at the Institute of the Estonian Language. The aim of the study is to develop a method that uses these parameters to automatically identify corpus sentences suitable for learners of Estonian. The work centres on a rule-based approach, applied with Good Dictionary Examples (GDEX), a tool integrated into the Sketch Engine corpus query system; elements of machine learning were also used to fine-tune the parameters.
    The analysis of the dictionary example sentences and the coursebook sentences showed that a good Estonian example sentence must be a full sentence and meet, among others, the following parameters: it is 4–20 tokens long; it contains no tokens longer than 20 characters; it does not begin with certain parts of speech (e.g. a conjunction), with an anaphoric word (e.g. sellepĂ€rast 'this is why'), or with an anaphoric word pair (e.g. sellisel puhul 'in such a case'); and it excludes vulgar or disparaging words, low-frequency words, and the like.
    The study resulted in the Estonian Corpus for Learners 2018 (etSkELL), which contains only sentences that meet the developed parameters. This corpus, in turn, serves as the basis for the web-based learning environment Sketch Engine for Estonian Language Learning (etSkELL) and for the web sentences in SĂ”naveeb, the language portal of the Institute of the Estonian Language.
    https://www.ester.ee/record=b530293
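    The parameters above amount to a sentence filter. The sketch below shows, under stated assumptions, how such a GDEX-style filter might look in Python: the numeric thresholds (4-20 tokens, 20-character token limit) come from the abstract, while the stop lists, the blocked-word set, and the frequency lookup are illustrative placeholders rather than the dissertation's actual resources.

        # Minimal GDEX-style sentence filter based on the parameters listed in
        # the abstract. Word lists and frequencies are illustrative samples.
        BAD_FIRST_WORDS = {"ja", "vĂ”i", "aga", "sellepĂ€rast"}  # conjunctions, anaphora (sample)
        BAD_FIRST_PAIRS = {("sellisel", "puhul")}              # anaphoric word pairs (sample)
        BLOCKED_WORDS: set[str] = set()                        # vulgar/disparaging words would go here

        def is_good_example(sentence: str, freq: dict[str, int], min_freq: int = 5) -> bool:
            """Return True if the sentence satisfies the example-sentence parameters."""
            tokens = sentence.split()
            # Full sentence: starts capitalized and ends with terminal punctuation.
            if not tokens or not tokens[0][:1].isupper() or tokens[-1][-1] not in ".!?":
                return False
            if not 4 <= len(tokens) <= 20:                     # 4-20 tokens long
                return False
            words = [t.strip(".,;:!?\"'()").lower() for t in tokens]
            if any(len(w) > 20 for w in words):                # no token longer than 20 characters
                return False
            if words[0] in BAD_FIRST_WORDS:                    # no conjunction/anaphoric start
                return False
            if tuple(words[:2]) in BAD_FIRST_PAIRS:            # no anaphoric word-pair start
                return False
            if any(w in BLOCKED_WORDS for w in words):         # no vulgar or disparaging words
                return False
            if any(freq.get(w, 0) < min_freq for w in words if w):  # no low-frequency words
                return False
            return True

        # Toy frequency list; 'Ta laulab vĂ€ga ilusasti.' = 'He sings very beautifully.'
        freq = {"ta": 100, "laulab": 12, "vĂ€ga": 90, "ilusasti": 8}
        print(is_good_example("Ta laulab vĂ€ga ilusasti.", freq))  # -> True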

    Annotation en rÎles sémantiques du français en domaine spécifique

    In this Natural Language Processing PhD thesis, we aim to perform semantic role labeling on French domain-specific texts. This task first disambiguates the sense of each predicate in a given text and then annotates the predicate's child chunks with semantic roles such as Agent, Patient, or Destination. It supports many applications in domains where annotated corpora exist, but is difficult to use otherwise.
    We first evaluate on the FrameNet corpus an existing annotation method based solely on VerbNet, which is what makes the method domain-independent. We show that substantial improvements can be obtained, both syntactically, by handling the passive voice, and semantically, by taking advantage of the selectional restrictions present in VerbNet.
    To apply this method to French, we translate two English lexical resources. We begin with the WordNet lexical database, and then translate the VerbNet lexicon, in which verbs are grouped semantically on the basis of their syntactic behaviour. Its translation, VerbeNet, was obtained by reusing two French verb lexicons (the Lexique-Grammaire and Les Verbes Français) and then manually modifying and reorganizing the resulting lexicon.
    Finally, once these building blocks are in place, we evaluate the feasibility of semantic role labeling of French and English in three specific domains, weighing the pros and cons of using VerbNet and VerbeNet to annotate those domains, before outlining directions for future work.
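    To make the task concrete, here is a toy sketch of the input and output of semantic role labeling. The predicate senses, the two-frame mini-lexicon, and the positional role assignment are all invented for illustration; actual systems derive frames from VerbNet or VerbeNet classes and rely on syntactic analysis rather than argument order.

        # Toy illustration of semantic role labeling: given a disambiguated
        # predicate and its syntactic chunks, assign thematic roles such as
        # Agent, Patient, or Destination. The frames below are invented for
        # the example; real systems use VerbNet/VerbeNet classes.
        from dataclasses import dataclass

        @dataclass
        class RoleAnnotation:
            predicate: str  # disambiguated predicate sense
            chunk: str      # syntactic chunk governed by the predicate
            role: str       # thematic role label

        # Hypothetical frames: predicate sense -> roles of its arguments, in order.
        FRAMES = {
            "send.01": ["Agent", "Patient", "Destination"],
            "break.01": ["Agent", "Patient"],
        }

        def label(predicate_sense: str, chunks: list[str]) -> list[RoleAnnotation]:
            """Assign roles to chunks positionally according to the frame."""
            roles = FRAMES[predicate_sense]
            return [RoleAnnotation(predicate_sense, c, r) for c, r in zip(chunks, roles)]

        # "Marie sent the package to Lyon."
        for ann in label("send.01", ["Marie", "the package", "to Lyon"]):
            print(f"{ann.chunk!r} -> {ann.role}")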

    Normalisation of imprecise temporal expressions extracted from text

    Advisor: Prof. Dr. Marcos Didonet Del Fabro. Co-advisor: Prof. Dr. Angus Roberts. Doctoral thesis, Universidade Federal do ParanĂĄ, Setor de CiĂȘncias Exatas, Programa de PĂłs-Graduação em InformĂĄtica. Defended in Curitiba, 05/04/2016. Includes references: f. 95-105.
    Information extraction systems and techniques are able to deal with the increasing amount of unstructured data available nowadays. Time is among the different kinds of information that may be extracted from such unstructured data sources, including text documents. Temporal information describes the changes that happen through the occurrence of events, and provides a way to record, order, and measure the duration of such occurrences. The inability to identify and extract temporal information from text makes it difficult to understand how events are organized in chronological order. Moreover, in many situations the meaning of temporal expressions is imprecise and cannot be described accurately, which leads to interpretation errors. Existing solutions provide alternative ways of representing imprecise temporal expressions, but they are specific and hard to generalise.
    Furthermore, the analysis of temporal data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods to search for valid words within a text, but they are not rich enough to process misspellings efficiently. In this thesis, we present a methodology to analyse and normalise imprecise temporal expressions: after collecting and pre-processing data on how people interpret vague descriptions of time in text, we compare different techniques in order to create and select the most appropriate normalisation model for each kind of imprecise expression. We also compare how a rule-based system and a machine learning approach perform at identifying temporal expressions in text, and we analyse the process of producing gold standards, identifying possible sources of problems and giving recommendations for future manual annotation efforts. Finally, we propose a phonetic map and evaluate how encoding phonetic information could assist similarity search methods and improve the quality of the extracted information.
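    The final point, using a phonetic encoding to assist similarity search over misspelled words, can be sketched as follows. A simplified Soundex code stands in for the thesis's phonetic map, and the vocabulary of temporal words is a small invented sample.

        # Sketch: use a phonetic code to narrow the candidate set before a
        # string-similarity comparison, so misspelled temporal words can be
        # matched efficiently. A simplified Soundex stands in for the
        # thesis's phonetic map; the vocabulary is an invented sample.
        from difflib import get_close_matches

        # Consonants map to Soundex digit classes; unlisted letters (vowels,
        # h, w, y) are not coded and act as separators between digit runs.
        SOUNDEX = str.maketrans("bfpvcgjkqsxzdtlmnr", "111122222222334556")

        def soundex(word: str) -> str:
            """Simplified 4-character Soundex code (initial letter + three digits)."""
            w = "".join(ch for ch in word.lower() if ch.isalpha())
            if not w:
                return ""
            digits = w.translate(SOUNDEX)
            code, prev = w[0].upper(), digits[0]
            for d in digits[1:]:
                if d.isdigit() and d != prev:
                    code += d
                prev = d
            return (code + "000")[:4]

        # Phonetic index: code -> vocabulary words sharing that code.
        VOCAB = ["yesterday", "tomorrow", "afternoon", "evening", "fortnight"]
        INDEX: dict[str, list[str]] = {}
        for v in VOCAB:
            INDEX.setdefault(soundex(v), []).append(v)

        def correct(word: str) -> str | None:
            """Rank only phonetically similar candidates by string similarity."""
            candidates = INDEX.get(soundex(word), [])
            matches = get_close_matches(word, candidates, n=1, cutoff=0.6)
            return matches[0] if matches else None

        print(correct("tommorow"))  # -> 'tomorrow'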