27 research outputs found

    LexMeta model za leksičke resurse: teorija i primjena (The LexMeta Model for Lexical Resources: Theory and Application)

    This paper presents LexMeta, a metadata model for describing lexical resources such as dictionaries, word lists and glossaries. It is intended for language data catalogues that mainly target the lexicographic and broader humanities communities, but also users who exploit such resources in their research and applications. A comparative review of similar models shows their differences from and commonalities with LexMeta. To enhance semantic interoperability and support the exchange of (meta)data across disciplinary and general catalogues, the most influential models for our purposes, namely FRBR (used in library catalogues) and META-SHARE (used for language resources), are selected as the basis for the design of LexMeta. We discuss how these models are aligned and extended with the new properties required for the description of lexical resources. The formal representation of the model, which follows the Linked Data paradigm, aims to further enhance semantic interoperability. The choice to implement it in two formats (as an RDF/OWL ontology and as a Wikibase ontology) facilitates its adoption and hence its enrichment, yet makes it challenging to keep the two synchronised; this is addressed through automatic workflows. We conclude with ongoing and planned activities for the improvement of the model.

    MatureBayes: A Probabilistic Algorithm for Identifying the Mature miRNA within Novel Precursors

    BACKGROUND: MicroRNAs (miRNAs) are small, single-stranded RNAs with a key role in the post-transcriptional regulation of thousands of genes across numerous species. While several computational methods are currently available for identifying miRNA genes, accurate prediction of the mature miRNA remains a challenge. Existing approaches fall short not only in predicting the location of mature miRNAs but also in finding the functional strand(s) of miRNA precursors. METHODOLOGY/PRINCIPAL FINDINGS: Here, we present a computational tool that incorporates a Naive Bayes classifier to identify mature miRNA candidates based on sequence and secondary structure information of their miRNA precursors. We take into account both positive (true mature miRNAs) and negative (same-size non-mature miRNA sequences) examples to optimize sensitivity as well as specificity. Our method can accurately predict the start position of experimentally verified mature miRNAs for both human and mouse, achieving significantly higher (often double) accuracy compared with two existing methods. Moreover, the method generalizes very well to miRNAs from two other organisms. More importantly, our method provides direct evidence about the features of miRNA precursors which may determine the location of the mature miRNA. We find that the triplet of positions 7, 8 and 9 from the mature miRNA end towards the closest hairpin has the largest discriminatory power, is relatively conserved in terms of sequence composition (mostly containing a Uracil) and is located within or in very close proximity to the hairpin loop, suggesting the existence of a possible recognition site for Dicer and associated proteins. CONCLUSIONS: This work describes a novel algorithm for identifying the start position of mature miRNA(s) produced by miRNA precursors. Our tool performs significantly better (often twice as well) than two existing approaches and provides new insights about the potential use of specific sequence/structural information as recognition signals for Dicer processing. Web Tool available at: http://mirna.imbb.forth.gr/MatureBayes.html
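
    To make the classification idea in this abstract concrete, the following is a minimal sketch of a position-wise Naive Bayes scorer, written under stated assumptions and not taken from the MatureBayes tool itself. It uses only the nucleotide at each window position as a feature (the abstract also mentions secondary structure, which is omitted here), Laplace smoothing, and a sliding window that picks the candidate start with the highest log-odds. All names (`train`, `log_odds`, `predict_start`) and the toy sequences are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGU"  # sequences are assumed to contain only these letters


def train(windows, labels, alpha=1.0):
    """Fit a categorical Naive Bayes model with Laplace smoothing.

    windows: equal-length RNA strings; labels: 1 (mature) or 0 (non-mature).
    """
    width = len(windows[0])
    counts = {c: [defaultdict(int) for _ in range(width)] for c in (0, 1)}
    totals = {0: 0, 1: 0}
    for window, label in zip(windows, labels):
        totals[label] += 1
        for i, nt in enumerate(window):
            counts[label][i][nt] += 1
    model = {"width": width, "prior": {}, "loglik": {}}
    for c in (0, 1):
        model["prior"][c] = math.log(totals[c] / len(windows))
        denom = totals[c] + alpha * len(ALPHABET)
        model["loglik"][c] = [
            {nt: math.log((counts[c][i][nt] + alpha) / denom) for nt in ALPHABET}
            for i in range(width)
        ]
    return model


def log_odds(model, window):
    """Log of P(mature | window) / P(non-mature | window)."""
    score = model["prior"][1] - model["prior"][0]
    for i, nt in enumerate(window):
        score += model["loglik"][1][i][nt] - model["loglik"][0][i][nt]
    return score


def predict_start(model, precursor):
    """Return the 0-based start of the highest-scoring window in the precursor."""
    width = model["width"]
    starts = range(len(precursor) - width + 1)
    return max(starts, key=lambda s: log_odds(model, precursor[s:s + width]))


# Toy, made-up training data: three positive and three negative 4-nt windows.
positives = ["UGAG", "UGAC", "UGAG"]
negatives = ["CCGA", "ACGC", "GCCA"]
model = train(positives + negatives, [1, 1, 1, 0, 0, 0])
best_start = predict_start(model, "CCGAUGAGCC")
```

    Training on negative windows as well as positive ones is what lets the score be read as a log-odds ratio, so a window is ranked by how much more it resembles a mature sequence than a same-size non-mature one.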

    Computational morphology with OntoLex-Morph

    This paper describes the current status of the emerging OntoLex module for linguistic morphology. It serves as an update to the previous version of the vocabulary (Klimek et al. 2019). Whereas the earlier model focused exclusively on descriptive morphology and on applications in lexicography, we now present a novel part of the vocabulary and a novel application in language technology, i.e., the rule-based generation of lexicons, which introduces a dynamic component into OntoLex.
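
    As an illustration of what rule-based generation of lexicon entries can look like in general, here is a small sketch that derives inflected forms from lemmas with ordered regular-expression replacement rules. It is an assumption-laden toy, not the OntoLex-Morph vocabulary or its reference implementation: the `Rule` class, the `generate` and `build_lexicon` functions, and the English plural rules are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical inflection rule: a regex on the lemma plus its replacement."""
    feature: str      # grammatical label for the generated form
    pattern: str      # regular expression matched against the lemma
    replacement: str  # replacement string in re.sub syntax


# Toy English plural rules; the first rule whose pattern matches is applied.
PLURAL_RULES = [
    Rule("plural", r"([^aeiou])y$", r"\1ies"),
    Rule("plural", r"(s|x|z|ch|sh)$", r"\1es"),
    Rule("plural", r"$", "s"),
]


def generate(lemma, rules):
    """Apply the first matching rule and return (feature, form), or None."""
    for rule in rules:
        if re.search(rule.pattern, lemma):
            form = re.sub(rule.pattern, rule.replacement, lemma, count=1)
            return rule.feature, form
    return None


def build_lexicon(lemmas, rules):
    """Map each lemma to a dict of generated forms keyed by feature."""
    lexicon = {}
    for lemma in lemmas:
        result = generate(lemma, rules)
        lexicon[lemma] = {} if result is None else {result[0]: result[1]}
    return lexicon


lexicon = build_lexicon(["entry", "gloss", "word"], PLURAL_RULES)
```

    Keeping the rules as data rather than code is the point of the sketch: the same list could in principle be serialised alongside the lexical entries and replayed to regenerate the forms.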

    The LexMeta Metadata Model for Lexical Resources: Theoretical and Implementation Issues

    The paper presents LexMeta, a metadata model for describing human-readable and computational lexical resources included in library catalogues and repositories of language resources. We present the main concepts of the model and its implementation, and discuss current findings and future plans.

    Modelling collocations in OntoLex-FrAC

    Following presentations of frequency and attestations, and of embeddings and distributional similarity, this paper introduces the third cornerstone of the emerging OntoLex module for Frequency, Attestation and Corpus-based Information, OntoLex-FrAC. We provide an RDF vocabulary for collocations, established as a consensus over contributions from five different institutions and numerous data sets. Our goal is to elicit feedback from reviewers, the workshop audience and the scientific community in preparation for the final consolidation of the OntoLex-FrAC module, whose publication as a W3C community report is foreseen for the end of this year. The novel collocation component of OntoLex-FrAC is described in application to a lexicographic resource and to corpus-based collocation scores available from the web. Finally, we demonstrate the capability and genericity of the model by showing how to retrieve and aggregate collocation information by means of SPARQL and how to export it to a tabular format, so that it can be easily processed in downstream applications.

    OpenMinTeD: A Platform Facilitating Text Mining of Scholarly Content

    The OpenMinTeD platform aims to bring full-text Open Access scholarly content from a wide range of providers together with Text and Data Mining (TDM) tools from various Natural Language Processing frameworks and TDM developers in an integrated environment. In this way, it gives users who want to mine scientific literature easy access to relevant content and allows them to run scalable TDM workflows in the cloud.

    OntoLex-Morph: Morphology for the Web of Data

    Purpose: OntoLex-Lemon is a widely used community standard for publishing lexical resources in machine-readable form, and is in fact the predominant RDF vocabulary for this purpose. With the growing popularity and increasing adoption of this model for applications in both language technology and lexicography, a number of new modules have been developed in the past year to complement the OntoLex core vocabulary and its lexicographic follow-up, lexicog. In this paper, we describe the current status of the development of the OntoLex-Morph vocabulary.

    Cross-Lingual Link Discovery for Under-Resourced Languages

    CC BY-NC 4.0. In this paper, we provide an overview of current technologies for cross-lingual link discovery, and we discuss challenges, experiences and prospects of their application to under-resourced languages. We first introduce the goals of cross-lingual linking and associated technologies, and in particular, the role that the Linked Data paradigm (Bizer et al., 2011) applied to language data can play in this context. We define under-resourced languages with a specific focus on languages actively used on the internet, i.e., languages with a digitally versatile speaker community, but limited support in terms of language technology. We argue that, for languages for which considerable amounts of textual data and (at least) a bilingual word list are available, techniques for cross-lingual linking can be readily applied, and that these enable the implementation of downstream applications for under-resourced languages via the localisation and adaptation of existing technologies and resources.