4,424 research outputs found

    A knowledge-based approach to information extraction for semantic interoperability in the archaeology domain

    The paper presents a method for automatic semantic indexing of archaeological grey-literature reports using rule-based Information Extraction techniques in combination with domain-specific knowledge organization systems. Performance is evaluated against a Gold Standard. The semantic annotation system (OPTIMA) performs Named Entity Recognition, Relation Extraction, Negation Detection and Word Sense Disambiguation using hand-crafted rules and terminological resources to associate contextual abstractions with classes of the standard ontology (ISO 21127:2006), the CIDOC Conceptual Reference Model (CRM) for cultural heritage, and its archaeological extension, CRM-EH, together with concepts from English Heritage thesauri and glossaries.
    Relation Extraction performance benefits from a syntax-based definition of relation extraction patterns derived from domain-oriented corpus analysis. The evaluation also shows a clear benefit from assistive NLP modules for word sense disambiguation, negation detection and noun phrase validation, together with controlled thesaurus expansion.
    The semantic indexing results demonstrate the capacity of rule-based Information Extraction techniques to deliver interoperable semantic abstractions (semantic annotations) with respect to the CIDOC CRM and archaeological thesauri. Major contributions include recognition of relevant entities using shallow-parsing NLP techniques driven by a complementary use of ontological and terminological domain resources, and empirical derivation of context-driven relation extraction rules for the recognition of semantic relationships in phrases of unstructured text. The semantic annotations have proven capable of supporting semantic query, document study and cross-searching via the ontology framework.
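    As a toy illustration of the rule-based pipeline the abstract describes, the sketch below spots thesaurus-style terms and flags them as negated when a cue word appears in a small left context window. The term list, cue list, and three-token window are invented for the example and are not part of OPTIMA.

```python
# Minimal sketch of rule-based entity spotting with negation detection,
# in the spirit of (but not taken from) the OPTIMA pipeline.
TERMS = {"ditch", "pit", "hearth"}       # stand-ins for thesaurus concepts
NEGATION_CUES = {"no", "not", "without"}  # invented cue list

def annotate(sentence):
    """Return (term, negated?) pairs found in a sentence."""
    tokens = sentence.lower().replace(",", " ").split()
    annotations = []
    for i, tok in enumerate(tokens):
        if tok in TERMS:
            # a term counts as negated if a cue appears in a 3-token left window
            window = tokens[max(0, i - 3):i]
            negated = any(cue in window for cue in NEGATION_CUES)
            annotations.append((tok, negated))
    return annotations

print(annotate("The trench contained a ditch but no hearth"))
# [('ditch', False), ('hearth', True)]
```

    A real system would of course use full thesaurus lookup and hand-crafted grammars rather than a fixed token window.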

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, by using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
    JRC.G.2-Global security and crisis management
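    The core idea, generating new phrase-table entries from morphological variants filtered by a string similarity score, can be sketched as below. The toy phrase table, the variant list, and the 0.75 threshold are invented stand-ins, and `difflib`'s ratio substitutes for the paper's morphosyntactic similarity score.

```python
from difflib import SequenceMatcher

# Toy phrase table and morphological variants (from an external resource).
phrase_table = {"red car": "voiture rouge"}
variants = {"red car": ["red cars"]}

def similarity(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def expand(table, variant_map, threshold=0.75):
    """Add one phrase-table entry per sufficiently similar variant."""
    new_entries = {}
    for src, tgt in table.items():
        for var in variant_map.get(src, []):
            # keep a variant only if it is string-similar enough to the
            # original source phrase (proxy for morphosyntactic closeness)
            if similarity(src, var) >= threshold:
                new_entries[var] = tgt  # real systems would inflect the target too
    return new_entries

print(expand(phrase_table, variants))  # {'red cars': 'voiture rouge'}
```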

    A Data-driven, High-performance and Intelligent CyberInfrastructure to Advance Spatial Sciences

    In the field of Geographic Information Science (GIScience), we have witnessed the unprecedented data deluge brought about by the rapid advancement of high-resolution data observing technologies. For example, with the advancement of Earth Observation (EO) technologies, a massive amount of EO data, including remote sensing data and other sensor observations about earthquakes, climate, oceans, hydrology, volcanoes, glaciers, etc., is being collected on a daily basis by a wide range of organizations. In addition to observation data, human-generated data including microblogs, photos, consumption records, evaluations, unstructured webpages and other Volunteered Geographic Information (VGI) are incessantly generated and shared on the Internet. Meanwhile, the emerging cyberinfrastructure rapidly increases our capacity for handling such massive data with regard to data collection and management, data integration and interoperability, data transmission and visualization, high-performance computing, etc. Cyberinfrastructure (CI) consists of computing systems, data storage systems, advanced instruments and data repositories, visualization environments, and people, all linked together by software and high-performance networks to improve research productivity and enable breakthroughs that are not otherwise possible. The Geospatial CI (GCI, or CyberGIS), as the synthesis of CI and GIScience, has inherent advantages in enabling computationally intensive spatial analysis and modeling (SAM) and collaborative geospatial problem solving and decision making. This dissertation is dedicated to addressing several critical issues and improving the performance of existing methodologies and systems in the field of CyberGIS.
    My dissertation includes three parts. The first part focuses on developing methodologies to help public researchers efficiently and effectively find appropriate open geospatial datasets from millions of records provided by thousands of organizations scattered around the world; machine learning and semantic search methods are utilized in this research. The second part develops an interoperable and replicable geoprocessing service by synthesizing the high-performance computing (HPC) environment, the core spatial statistics/analysis algorithms from the widely adopted open-source Python package PySAL (Python Spatial Analysis Library), and rich datasets acquired from the first part. The third part is dedicated to studying optimization strategies for feature data transmission and visualization, and is intended to solve the performance issues of transmitting large feature data through the Internet and visualizing them on the client (browser) side. Taken together, the three parts constitute an endeavor towards the methodological improvement and implementation practice of the data-driven, high-performance and intelligent CI to advance spatial sciences.
    Doctoral Dissertation, Geography, 201
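    The dataset-discovery part relies on machine learning and semantic search; one classic baseline for such retrieval is TF-IDF ranking over metadata records, sketched below. The toy records, the query, and the scoring are invented for illustration and are not the dissertation's actual system.

```python
import math
from collections import Counter

# Toy geospatial-metadata catalogue (invented records).
records = {
    "r1": "global precipitation remote sensing dataset",
    "r2": "city road network shapefile",
    "r3": "satellite precipitation climate observation data",
}

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector per document."""
    tokenized = {k: v.split() for k, v in docs.items()}
    n = len(tokenized)
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = {}
    for k, toks in tokenized.items():
        tf = Counter(toks)
        vecs[k] = {t: tf[t] * idf[t] for t in tf}
    return vecs, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, docs):
    """Return the id of the best-matching metadata record."""
    vecs, idf = tfidf_vectors(docs)
    q = Counter(query.split())
    qvec = {t: c * idf.get(t, 0.0) for t, c in q.items()}
    return max(docs, key=lambda k: cosine(qvec, vecs[k]))

print(search("precipitation data", records))  # 'r3'
```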

    The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe

    Proceedings of the 1st FLaReNet Forum on the European Language Resources and Technologies, held in Vienna at the Austrian Academy of Sciences, on 12-13 February 2009.

    Post-editing machine translated text in a commercial setting: Observation and statistical analysis

    Machine translation systems, when used in a commercial context for publishing purposes, are usually combined with human post-editing. Understanding human post-editing behaviour is therefore crucial to maximising the benefit of machine translation systems. Though a number of studies on human post-editing have been carried out to date, there is a lack of large-scale studies of post-editing in industrial contexts that focus on the activity in real-life settings. This study observes professional Japanese post-editors' work and examines the effect of the amount of editing made during post-editing, source text characteristics, and post-editing behaviour on the amount of post-editing effort. A mixed-method approach was employed to analyse the data both quantitatively and qualitatively and to gain detailed insights into the post-editing activity from various viewpoints. The results indicate that a number of factors, such as sentence structure, document component types, use of product-specific terms, and post-editing patterns and behaviour, affect the amount of post-editing effort in an intertwined manner. The findings will contribute to better utilisation of machine translation systems in industry as well as to the development of the skills and strategies of post-editors.
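    One common proxy for the "amount of editing" such studies measure is the edit distance between the raw MT output and the post-edited text (the idea underlying metrics like HTER). The Levenshtein sketch below is a generic illustration, not this study's own measurement instrument, and the example sentences are invented.

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions and substitutions
    needed to turn string a into string b (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

mt_output = "the system translate the document"
post_edited = "the system translates the document"
print(levenshtein(mt_output, post_edited))  # 1
```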

    An Investigation into Automatic Translation of Prepositions in IT Technical Documentation from English to Chinese

    Machine Translation (MT) technology has been widely used in the localisation industry to boost the productivity of professional translators. However, given the high quality of translation expected, the performance of an MT system in isolation is less than satisfactory due to the various errors it generates. This study focuses on the translation of prepositions from English into Chinese in technical documents in an industrial localisation context. The aim of the study is to reveal the salient errors in the translation of prepositions and to explore possible methods to remedy these errors. The study proposes three new approaches to improve the translation of prepositions, all of which attempt to exploit the strengths of the two currently most popular MT architectures, Rule-Based MT (RBMT) and Statistical MT (SMT): firstly, building an automatic preposition dictionary for the RBMT system; secondly, exploring and modifying the process of Statistical Post-Editing (SPE); and thirdly, pre-processing the source texts to better suit the RBMT system. Overall evaluation results (both human and automatic) show the potential of the new approaches to improve the translation of prepositions. In addition, the study reveals a new function of automatic metrics in assisting researchers to obtain more valid or purpose-specific human evaluation results.
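    The third approach, pre-processing the English source so the RBMT system handles prepositions better, might look like the sketch below. The two rewrite rules are invented examples and are not the rules from this study.

```python
import re

# Invented source-side rewrite rules: each pattern targets a prepositional
# construction assumed (for this example) to be mistranslated by the RBMT system.
REWRITE_RULES = [
    (re.compile(r"\bclick on\b"), "click"),      # drop a problematic preposition
    (re.compile(r"\blog into\b"), "log in to"),  # normalise a preposition variant
]

def preprocess(sentence):
    """Apply each source-side rewrite rule in order."""
    for pattern, replacement in REWRITE_RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence

print(preprocess("click on the icon and log into the server"))
# click the icon and log in to the server
```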

    English/Russian lexical cognates detection using NLP Machine Learning with Python

    Language learning is a remarkable endeavor that expands our horizons and allows us to connect with diverse cultures and people around the world. Traditionally, language education has relied on conventional methods such as textbooks, vocabulary drills, and language exchanges. With the advent of machine learning, however, a new era has dawned upon language instruction, offering innovative and efficient ways to accelerate language acquisition. One intriguing application of machine learning in language learning is the utilization of cognates: words that share similar meanings and spellings across different languages. To address this subject, this research paper proposes to facilitate the process of acquiring a second language with the help of artificial intelligence, particularly neural networks, which can identify and use words that are similar or identical in both the learner's first language and the target language. These words, known as lexical cognates, can facilitate language learning by providing a familiar point of reference and enabling learners to associate new vocabulary with words they already know. By leveraging the power of neural networks to detect and utilize these cognates, learners will be able to accelerate their progress in acquiring a second language.
    Although the study of semantic similarity across different languages is not a new topic, our objective is to adopt a different approach for identifying Russian-English lexical cognates and to present the results as a language learning tool, using a sample of lexical and semantic similarity data across the two languages to build a lexical cognate detection and word association model. Subsequently, depending on our analysis and results, we will present a word association application that can be used by end users. Given that Russian and English are among the most widely spoken languages globally, and that Russia is a popular destination for international students from around the world, this served as a significant motivation to develop an AI tool that assists English speakers learning Russian and Russian speakers learning English.
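    A baseline form of cognate detection can be sketched as transliterating a Russian word into Latin script and comparing it to an English candidate with a string similarity ratio. The abbreviated transliteration table and the 0.7 threshold below are invented for illustration; the paper itself uses neural networks rather than this surface heuristic.

```python
from difflib import SequenceMatcher

# Abbreviated Cyrillic-to-Latin transliteration table (illustrative subset).
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "з": "z",
    "и": "i", "к": "k", "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f", "х": "h", "ь": "",
}

def transliterate(word):
    """Map each Cyrillic character to a Latin approximation."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word.lower())

def is_cognate(russian, english, threshold=0.7):
    """Flag a pair as cognates when transliterated forms are string-similar."""
    ratio = SequenceMatcher(None, transliterate(russian), english.lower()).ratio()
    return ratio >= threshold

print(is_cognate("студент", "student"))  # True
print(is_cognate("студент", "window"))   # False
```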

    Proceedings of the 17th Annual Conference of the European Association for Machine Translation

    Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT).