
    From GLÀFF to PsychoGLÀFF: a large psycholinguistics-oriented French lexical resource

    In this paper, we present two French lexical resources, GLÀFF and PsychoGLÀFF. The former, automatically extracted from the collaborative online dictionary Wiktionary, is a large-scale, versatile lexicon exploitable in Natural Language Processing applications and linguistic studies. The latter, built on GLÀFF, is a lexicon specifically designed for psycholinguistic research. GLÀFF, with more than 1.4 million entries, is of unprecedented size. It reports lemmas, main syntactic categories, inflectional features and phonemic transcriptions. PsychoGLÀFF contains additional information related to formal aspects of the lexicon and its distribution. It contains about 340,000 corpus-attested entries (120,000 lemmas). We explain how the resources were created and compare them to other known resources in terms of coverage and quality. Regarding PsychoGLÀFF, the comparison shows that it has an exceptionally large repertoire while maintaining comparable quality.
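
    To make the kind of record such a lexicon holds concrete, here is a minimal sketch that parses one hypothetical GLÀFF-style tab-separated entry; the five-column layout and the sample record are assumptions for illustration, not the resource's documented format.

        from dataclasses import dataclass

        @dataclass
        class LexicalEntry:
            """One lexicon record: a wordform plus its annotations."""
            form: str      # inflected surface form
            lemma: str     # citation form
            pos: str       # main syntactic category
            features: str  # inflectional features (mood, person, number, ...)
            phonemic: str  # phonemic transcription

        def parse_entry(line: str) -> LexicalEntry:
            """Split one tab-separated record into a LexicalEntry."""
            form, lemma, pos, features, phonemic = line.rstrip("\n").split("\t")
            return LexicalEntry(form, lemma, pos, features, phonemic)

        # Hypothetical record for "chantions" (imperfect, 1st person plural)
        entry = parse_entry("chantions\tchanter\tVER\timp-1-pl\tʃɑ̃tjɔ̃")
        print(entry.lemma, entry.phonemic)  # -> chanter ʃɑ̃tjɔ̃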

    Semi-automatic enrichment of crowdsourced synonymy networks: the WISIGOTH system applied to Wiktionary

    Semantic lexical resources are a mainstay of various Natural Language Processing applications. However, comprehensive and reliable resources are rare and seldom freely available. Handcrafted resources are too costly to be a general solution, while automatically built resources need to be validated by experts or at least thoroughly evaluated. In this paper, we give a picture of the current situation regarding lexical resources, their construction and their evaluation. We give an in-depth description of Wiktionary, a freely available and collaboratively built multilingual dictionary, presented here as a promising raw resource for NLP. We propose a semi-automatic approach based on random walks for enriching Wiktionary's synonymy network, using both endogenous and exogenous data. We take advantage of the wiki infrastructure to propose validation "by crowds". Finally, we present an implementation called WISIGOTH, which supports our approach.
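
    The abstract does not spell out the random-walk procedure; the sketch below illustrates the general idea it names: ranking candidate synonyms by how often short random walks from a source word reach them. The toy network, walk count, and step limit are invented for illustration.

        import random
        from collections import defaultdict

        # Toy undirected synonymy network (edges invented for illustration)
        graph = {
            "happy":   ["glad", "joyful"],
            "glad":    ["happy", "pleased"],
            "joyful":  ["happy", "merry"],
            "pleased": ["glad", "content"],
            "merry":   ["joyful"],
            "content": ["pleased"],
        }

        def random_walk_scores(start, walks=10_000, max_steps=4, seed=0):
            """Count how often short random walks from `start` visit each node.

            Nodes reached often but not directly linked to `start` are
            candidate synonyms that could be proposed for crowd validation.
            """
            rng = random.Random(seed)
            visits = defaultdict(int)
            for _ in range(walks):
                node = start
                for _ in range(rng.randint(1, max_steps)):
                    node = rng.choice(graph[node])
                    if node != start:
                        visits[node] += 1
            return sorted(visits.items(), key=lambda kv: kv[1], reverse=True)

        # Candidate synonyms of "happy" that are not yet its neighbours
        neighbours = set(graph["happy"])
        for word, score in random_walk_scores("happy"):
            if word not in neighbours:
                print(word, score)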

    Scalable Named Entity Identification in Classical Studies

    The Perseus Project and the Collections and Archives of Tufts University propose to develop infrastructure for finding references to particular people and places from classical antiquity, in several ancient and modern languages, in primary and secondary source collections. We will offer and publish open-source, stand-alone services and Fedora repository disseminators for searching, browsing, and visualizing entities within the Tufts Digital Library. Under a Creative Commons license, we will publish knowledge sources such as: linguistic data to identify forms of the 60,000 most common classical proper names in seven languages; a knowledge base of the 30,000 people and places most prominent in texts; indices associating c. 200,000 passages with particular entities and an association network of 500,000 tagged names for named entity identification systems; and an automatically generated index of classical people and places identified in a one-billion-word testbed of both scholarly and general cultural documents.
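
    One plausible building block behind such services is matching attested name forms against a multilingual name table and resolving them to entity identifiers; the following sketch illustrates this with an invented gazetteer and invented identifiers, not the project's actual knowledge sources.

        # Toy gazetteer mapping surface forms (across languages and
        # inflections) to entity identifiers; the real knowledge sources
        # would hold tens of thousands of names. Everything here is invented.
        gazetteer = {
            "Caesar": "person:julius_caesar",
            "Caesaris": "person:julius_caesar",   # Latin genitive
            "Cäsar": "person:julius_caesar",      # German form
            "Roma": "place:rome",
            "Rome": "place:rome",
        }

        def identify_entities(tokens):
            """Return (position, token, entity id) for every gazetteer hit."""
            return [(i, tok, gazetteer[tok])
                    for i, tok in enumerate(tokens)
                    if tok in gazetteer]

        text = "Legions loyal to Caesar marched on Rome".split()
        for pos, token, entity in identify_entities(text):
            print(f"{pos:2d} {token:10s} -> {entity}")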

    Building a Machine-Readable Dictionary Based on the Russian Wiktionary

    This paper elaborates practical questions of data extraction from Wiktionary, a free-content multilingual dictionary and thesaurus (the Russian Wiktionary alone covers more than 300 languages). To store the lexicographic data extracted from the Russian Wiktionary, (1) a database structure (tables and relations) was designed, and (2) an application programming interface to this database was developed. A graphical user interface was also implemented, which presents word cards to the user. The paper thus describes the creation of a machine-readable dictionary based on data from the Russian Wiktionary.
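
    The abstract does not reproduce the database structure itself; the sketch below shows what a minimal machine-readable dictionary store of this kind could look like, using SQLite and an invented two-table layout (entries plus per-language senses) that stands in for the paper's actual schema.

        import sqlite3

        # In-memory database; a hypothetical simplification of the kind of
        # schema the paper describes (tables and relations invented here).
        con = sqlite3.connect(":memory:")
        con.executescript("""
            CREATE TABLE entry (
                id   INTEGER PRIMARY KEY,
                word TEXT NOT NULL
            );
            CREATE TABLE sense (
                id         INTEGER PRIMARY KEY,
                entry_id   INTEGER NOT NULL REFERENCES entry(id),
                language   TEXT NOT NULL,
                definition TEXT NOT NULL
            );
        """)

        # Store one word card extracted from a Wiktionary article
        con.execute("INSERT INTO entry (id, word) VALUES (1, 'словарь')")
        con.execute(
            "INSERT INTO sense (entry_id, language, definition) VALUES (?, ?, ?)",
            (1, "ru", "a reference book listing words and their meanings"),
        )

        # API-style lookup that a GUI could use to render the word card
        for word, lang, definition in con.execute(
            "SELECT e.word, s.language, s.definition "
            "FROM entry e JOIN sense s ON s.entry_id = e.id "
            "WHERE e.word = ?", ("словарь",)
        ):
            print(word, lang, definition)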

    Evaluating Copyright Protection in the Data-Driven Era: Centering on Motion Picture's Past and Future

    Since the 1910s, Hollywood has measured audience preferences with rough industry-created methods. In the 1940s, scientific audience research led by George Gallup began conducting film audience surveys with traditional statistical and psychological methods, but the quantity, quality, and speed of such research were limited. Things changed dramatically in the internet age. The prevalence of digital data increases the immediacy, convenience, breadth, and depth of collecting audience and content data. Advanced data and AI technologies have also allowed machines to provide filmmakers with ideas or even produce human-like expressions. This brings new copyright challenges in the data-driven era. Massive amounts of text and data are the premise of text and data mining (TDM), as well as the admission ticket to machine learning technologies. Given the high and uncertain risks of copyright violation in the data-driven creation process, whoever controls the copyrighted film materials can monopolize the data and AI technologies used to create motion pictures in the data-driven era. Considering that copyright should not gatekeep new technological uses that do not impair the original uses of copyrighted works in existing markets, this study proposes creating TDM and model-training limitations or exceptions to copyright and recommends the Singapore legislative model. Motion pictures, as public entertainment media, have inherently limited creative choices, and identifying the human original expression components of data-driven works is challenging. This study proposes establishing a voluntarily negotiated license institution, backed by a compulsory license, to enable other filmmakers to reuse film materials in new motion pictures. The degree of human original authorship in film material, certified by film artists' guilds, should be a crucial factor in setting the compulsory license's royalty rate and terms, so as to encourage retaining human artists. This study argues that international and domestic policymakers should enjoy broad discretion in qualifying copyright protection for data-driven works, because data-driven work is a new category of work. It would be too late to wait until ubiquitous data-driven works block human creative freedom and floods of copyright litigation over data-driven works overwhelm the judicial systems.

    Building a Knowledge Graph for Food, Energy, and Water Systems

    Title from PDF of title page, viewed January 30, 2018. Thesis advisor: Praveen R. Rao. Vita. Includes bibliographical references (pages 41-44). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2017.
    A knowledge graph represents millions of facts and reliable pieces of information about people, places, and things. Several companies, such as Microsoft, Amazon, and Google, have developed knowledge graphs to improve customer experience. These knowledge graphs have proven reliable and useful for providing better search results; answering ambiguous questions regarding entities; and training semantic parsers to enhance semantic relationships over the Semantic Web. Motivated by these reasons, in this thesis we develop an approach to build a knowledge graph for Food, Energy, and Water (FEW) systems, given the vast amount of data available from federal agencies like the United States Department of Agriculture (USDA), the National Oceanic and Atmospheric Administration (NOAA), the U.S. Geological Survey (USGS), and the National Drought Mitigation Center (NDMC). Our goal is to facilitate better analytics for FEW and enable domain experts to conduct data-driven research. To construct the knowledge graph, we employ Semantic Web technologies, namely the Resource Description Framework (RDF), the Web Ontology Language (OWL), and SPARQL. Starting with raw data (e.g., CSV files), we construct entities and relationships and extend them semantically using a tool called Karma. We enhance this initial knowledge graph by adding new relationships across entities, extracting information from ConceptNet via an efficient similarity-searching algorithm. We show initial performance results and discuss the quality of the knowledge graph on several datasets from the USDA.
    Contents: Introduction -- Challenges -- Background and related work -- Approach -- Evaluation -- Conclusion and future work
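
    As a small illustration of the RDF and SPARQL workflow the thesis describes (lifting tabular records into triples and querying them), here is a sketch using rdflib; the namespace, predicates, and sample data are invented stand-ins, not the thesis's actual FEW ontology.

        from rdflib import Graph, Literal, Namespace
        from rdflib.namespace import RDF

        # Invented namespace standing in for a FEW-systems ontology
        FEW = Namespace("http://example.org/few/")

        g = Graph()
        g.bind("few", FEW)

        # One row of hypothetical USDA-style tabular data, lifted into triples
        farm = FEW["farm/42"]
        g.add((farm, RDF.type, FEW.Farm))
        g.add((farm, FEW.locatedIn, Literal("Kansas")))
        g.add((farm, FEW.waterUseMgalPerDay, Literal(3.2)))

        # SPARQL query: water use of every farm located in Kansas
        results = g.query("""
            PREFIX few: <http://example.org/few/>
            SELECT ?farm ?use WHERE {
                ?farm a few:Farm ;
                      few:locatedIn "Kansas" ;
                      few:waterUseMgalPerDay ?use .
            }
        """)
        for farm_uri, use in results:
            print(farm_uri, use)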