37 research outputs found

    Exploring and enriching a language resource archive via the web

    The “download first, then process” paradigm is still the predominant working method in the research community. The web-based paradigm, however, offers many advantages from a tool development and data management perspective, as it allows quick adaptation to changing research environments. Moreover, new ways of combining tools and data are increasingly becoming available and will eventually enable a true web-based workflow approach, thus challenging the “download first, then process” paradigm. The necessary infrastructure for managing, exploring and enriching language resources via the Web will need to be delivered by projects like CLARIN and DARIA

    Ensuring semantic interoperability on lexical resources

    In this paper, we describe a unifying approach to tackle data heterogeneity issues for lexica and related resources. We present LEXUS, our software that implements the Lexical Markup Framework (LMF) to uniformly describe and manage lexica of different structures. LEXUS also makes use of a central Data Category Registry (DCR) to address terminological issues with regard to linguistic concepts as well as the handling of working and object languages. Finally, we report on ViCoS, a LEXUS extension, providing support for the definition of arbitrary semantic relations between lexical entries or parts thereof

    Zapotec Language Activism And Talking Dictionaries

    Online dictionaries have become a key tool for some indigenous communities to promote and preserve their languages, often in collaboration with linguists. They can provide a pathway for crossing the digital divide and for establishing a first-ever presence on the internet. Many questions around digital lexicography have been explored, although primarily in relation to large and well-resourced languages. Lexical projects on small and under-resourced languages can provide an opportunity to examine these questions from a different perspective and to raise new questions (Mosel, 2011). In this paper, linguists, technical experts, and Zapotec language activists, who have worked together in Mexico and the United States to create a multimedia platform to showcase and preserve lexical, cultural, and environmental knowledge, share their experience and insight in creating trilingual online Talking Dictionaries in several Zapotec languages. These dictionaries sit opposite from big data mining and illustrate the value of dictionary projects based on small corpora, including having the flexibility to make design decisions to maximize community impact and elevate the status of marginalized languages

    Approaches towards a Lexical Web: the role of Interoperability

    After highlighting some of the major dimensions that are relevant for Language Resources (LR) and contribute to their infrastructural role, I underline some priority areas of concern today with respect to implementing an open Language Infrastructure, and specifically what we could call a “Lexical Web”. My objective is to show that it is imperative to define an underlying global strategy behind the set of initiatives which are or can be launched in Europe and world-wide, and that an all-embracing vision and cooperation among different communities are necessary to achieve more coherent and useful results. I end by mentioning two new European initiatives that work in this direction and promise to be influential in shaping the future of the LR area

    Language-sites: Accessing and presenting language resources via geographic information systems

    The emerging area of Geographic Information Systems (GIS) has proven to add an interesting dimension to many research projects. Within the language-sites initiative we have brought together a broad range of links to digital language corpora and resources. Via Google Earth's visually appealing 3D interface, users can spin the globe, zoom into an area they are interested in and directly access the relevant language resources. This paper focuses on several ways of relating the map and the online data (lexica, annotations, multimedia recordings, etc.). Furthermore, we discuss some of the implementation choices that have been made, as well as future challenges. In addition, we show, by means of practical examples, how scholars (both linguists and anthropologists) are using GIS tools to fulfill their specific research needs. This illustrates how both scientists and the general public can benefit from geography-based access to digital language data

    D6.1: Technologies and Tools for Lexical Acquisition

    This report describes the technologies and tools to be used for Lexical Acquisition in PANACEA. It includes descriptions of existing technologies and tools which can be built on and improved within PANACEA, as well as of new technologies and tools to be developed and integrated in the PANACEA platform. The report also specifies the Lexical Resources to be produced. Four main areas of lexical acquisition are included: Subcategorization Frames (SCFs), Selectional Preferences (SPs), Lexical-semantic Classes (LCs), for both nouns and verbs, and Multi-Word Expressions (MWEs)

    The Tesouro do léxico patrimonial galego e portugués: a Galician and Portuguese word bank

    The international project Tesouro do léxico patrimonial galego e portugués (Thesaurus of the Galician and Portuguese heritage lexicon) aims to be a cross-dialectal lexical portal bringing together lexicographical material from Brazil, Galicia and Portugal in a single computer tool. This dialect portal will give direct access via the Internet, free of charge, to a large body of lexicographical data, much of which has until now remained unpublished and hard for researchers to obtain. The lexical information in the Tesouro is fully lemmatized, semantically classified and geographically referenced, making it possible to obtain usefully grouped search results and to generate a map representation corresponding to each data set. Besides its obvious value to dialect researchers and lexicographers, the Tesouro will also provide useful material for onomasiological research, historical linguistics, etymology, morphology and so on. This tool might also be exploited in ethnographical and historical research, since it makes available to the scientific community a large amount of information about the traditional heritage, both material and immaterial, of all three countries, much of which is endangered on account of changes in ways of life over recent decades

    ISOcat: Remodeling metadata for language resources

    The Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, is creating a state-of-the-art web environment for the ISO TC 37 (terminology and other language and content resources) metadata registry. This Data Category Registry (DCR) is called ISOcat and encompasses data categories for a broad range of language resources. Under the governance of the DCR Board, ISOcat provides an open work space for creating data category specifications, defining Data Category Selections (DCSs) (domain-specific groups of data categories), and standardising selected data categories and DCSs. Designers visualise future interactivity among the DCR, reference registries and ontological knowledge space

    Sharing Semantic Resources

    The Semantic Web is an extension of the current Web in which information, so far created for human consumption, becomes machine readable, “enabling computers and people to work in cooperation”. To turn this vision into reality, several challenges remain open, the most important of which is sharing meaning that is formally represented with ontologies or, more generally, with semantic resources. This long-term goal of the Semantic Web converges in many ways with activities in the field of Human Language Technology, in particular the development of Natural Language Processing applications, where there is a great need for multilingual lexical resources. For instance, one of the most important lexical resources, WordNet, is also commonly regarded and used as an ontology. Another important phenomenon today is the explosion of social collaboration: Wikipedia, the largest encyclopedia in the world, is an object of research as an up-to-date, comprehensive semantic resource. The main topic of this thesis is the management and exploitation of semantic resources in a collaborative way, making use of already available resources such as Wikipedia and WordNet. This work presents a general environment able to realize the vision of shared and distributed semantic resources, and describes a distributed three-layer architecture that enables rapid prototyping of cooperative applications for developing semantic resources

    The Use Of Kullback-Leibler Divergence In Opinion Retrieval

    With the huge amount of subjective content in on-line documents, there is a clear need for an information retrieval system that supports retrieval of documents containing opinions about the topic expressed in a user’s query. In recent years, blogs, a new publishing medium, have attracted a large number of people to express personal opinions covering all kinds of topics in response to real-world events. The opinionated nature of blogs makes them an interesting new research area for opinion retrieval. Identification and extraction of subjective content from blogs has become the subject of several research projects. In this thesis, four novel methods are proposed to retrieve blog posts that express opinions about the given topics. The first method utilizes the Kullback-Leibler divergence (KLD) to weight the lexicon of subjective adjectives around query terms. The second method takes into account the distances between the query terms and subjective adjectives, using distance-based KLD scores of the adjectives for document re-ranking. The third method calculates KLD scores of subjective adjectives for predefined query categories. In the fourth method, collocates, that is, words co-occurring with query terms in the corpus, are used to construct the subjective lexicon automatically. The KLD scores of collocates are then calculated and used for document ranking. Four groups of experiments are conducted to evaluate the proposed methods on the TREC test collections. The results of the experiments are compared with the baseline systems to determine the effectiveness of using KLD in opinion retrieval. Further studies are recommended to explore more sophisticated approaches to identifying subjectivity and promising techniques for extracting opinions
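
The abstract above does not give the scoring formula, so the following is only a minimal sketch of one common way to weight lexicon terms with KLD: each term receives its pointwise contribution to the divergence between a subjective-text distribution and a background collection distribution, p_sub(t) * log(p_sub(t) / p_coll(t)). The function name `kld_term_scores`, the `smoothing` floor, and the toy token lists are assumptions made for illustration, not the thesis's method or data.

```python
import math
from collections import Counter


def kld_term_scores(subjective_tokens, collection_tokens, smoothing=1e-9):
    """Score each term by its pointwise KL-divergence contribution.

    Hypothetical helper: the score is p_sub(t) * log(p_sub(t) / p_coll(t)),
    where p_sub is estimated from subjective text and p_coll from the
    whole collection. Terms unseen in the collection get a small floor
    probability (an assumed smoothing choice) to avoid division by zero.
    """
    sub_counts = Counter(subjective_tokens)
    coll_counts = Counter(collection_tokens)
    sub_total = sum(sub_counts.values())
    coll_total = sum(coll_counts.values())

    scores = {}
    for term, count in sub_counts.items():
        p_sub = count / sub_total
        p_coll = coll_counts.get(term, 0) / coll_total if coll_total else 0.0
        p_coll = max(p_coll, smoothing)
        scores[term] = p_sub * math.log(p_sub / p_coll)
    return scores


# Toy data, invented for illustration only.
subjective = ["great", "great", "the", "awful"]
collection = ["the"] * 6 + ["great", "awful", "a", "a"]
scores = kld_term_scores(subjective, collection)

# Rank terms from most to least characteristic of the subjective text.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Terms that are relatively more frequent in the subjective text than in the collection receive positive scores, while common function words receive negative ones, so the ranking can serve as a weighted subjective lexicon under these assumptions.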