28,330 research outputs found
From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files
International audienceLog files generated by computational systems contain relevant and essential information. In some application areas like the design of integrated circuits, log files generated by design tools contain information which can be used in management information systems to evaluate the final products. However, the complexity of such textual data raises some challenges concerning the extraction of information from log files. Log files are usually multi-source, multi-format, and have a heterogeneous and evolving structure. Moreover, they usually do not respect natural language grammar and structures even though they are written in English. Classical methods of information extraction such as terminology extraction methods are particularly irrelevant to this context. In this paper, we introduce our approach Exterlog to extract terminology from log files. We detail how it deals with the specific features of such textual data. The performance is emphasized by favoring the most relevant terms of the domain based on a scoring function which uses a Web and context based measure. The experiments show that Exterlog is a well-adapted approach for terminology extraction from log files
From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files
Abstract: Log files generated by computational systems contain relevant and essential information. In some application areas like the design of integrated circuits, log files generated by design tools contain information which can be used in management information systems to evaluate the final products. However, the complexity of such textual data raises some challenges concerning the extraction of information from log files. Log files are usually multi-source, multi-format, and have a heterogeneous and evolving structure. Moreover, they usually do not respect natural language grammar and structures even though they are written in English. Classical methods of information extraction such as terminology extraction methods are particularly irrelevant to this context. In this paper, we introduce our approach Exterlog to extract terminology from log files. We detail how it deals with the specific features of such textual data. The performance is emphasized by favoring the most relevant terms of the domain based on a scoring function which uses a Web and context based measure. The experiments show that Exterlog is a well-adapted approach for terminology extraction from log files
Terminology Extraction for and from Communications in Multi-disciplinary Domains
Terminology extraction generally refers to methods and systems for identifying term candidates in a uni-disciplinary and uni-lingual
environment such as engineering, medical, physical and geological sciences, or administration, business and leisure. However, as
human enterprises get more and more complex, it has become increasingly important for teams in one discipline to collaborate with
others from not only a non-cognate discipline but also speaking a different language. Disaster mitigation and recovery, and conflict
resolution are amongst the areas where there is a requirement to use standardised multilingual terminology for communication. This
paper presents a feasibility study conducted to build terminology (and ontology) in the domain of disaster management and is part of the
broader work conducted for the EU project Sland \ub4 ail (FP7 607691). We have evaluated CiCui (for Chinese name \ub4 \u8bcd\u8403, which translates to
words gathered), a corpus-based text analytic system that combine frequency, collocation and linguistic analyses to extract candidates
terminologies from corpora comprised of domain texts from diverse sources. CiCui was assessed against four terminology extraction
systems and the initial results show that it has an above average precision in extracting terms
BlogForever D5.2: Implementation of Case Studies
This document presents the internal and external testing results for the BlogForever case studies. The evaluation of the BlogForever implementation process is tabulated under the most relevant themes and aspects obtained within the testing processes. The case studies provide relevant feedback for the sustainability of the platform in terms of potential users’ needs and relevant information on the possible long term impact
Towards structured sharing of raw and derived neuroimaging data across existing resources
Data sharing efforts increasingly contribute to the acceleration of
scientific discovery. Neuroimaging data is accumulating in distributed
domain-specific databases and there is currently no integrated access mechanism
nor an accepted format for the critically important meta-data that is necessary
for making use of the combined, available neuroimaging data. In this
manuscript, we present work from the Derived Data Working Group, an open-access
group sponsored by the Biomedical Informatics Research Network (BIRN) and the
International Neuroimaging Coordinating Facility (INCF) focused on practical
tools for distributed access to neuroimaging data. The working group develops
models and tools facilitating the structured interchange of neuroimaging
meta-data and is making progress towards a unified set of tools for such data
and meta-data exchange. We report on the key components required for integrated
access to raw and derived neuroimaging data as well as associated meta-data and
provenance across neuroimaging resources. The components include (1) a
structured terminology that provides semantic context to data, (2) a formal
data model for neuroimaging with robust tracking of data provenance, (3) a web
service-based application programming interface (API) that provides a
consistent mechanism to access and query the data model, and (4) a provenance
library that can be used for the extraction of provenance data by image
analysts and imaging software developers. We believe that the framework and set
of tools outlined in this manuscript have great potential for solving many of
the issues the neuroimaging community faces when sharing raw and derived
neuroimaging data across the various existing database systems for the purpose
of accelerating scientific discovery
Dutch compound splitting for bilingual terminology extraction
Compounds pose a problem for applications that rely on precise word alignments such as bilingual terminology extraction. We therefore developed a state-of-the-art hybrid compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists. We perform an extensive intrinsic evaluation on a Gold Standard set of 50,000 Dutch compounds and a set of 5,000 Dutch compounds belonging to the automotive domain. We also propose a novel methodology for word alignment that makes use of the compound splitter. As compounds are not always translated compositionally, we train the word alignment models twice: a first time on the original data set and a second time on the data set in which the compounds are split into their component parts. The obtained word alignment points are then combined
Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and Its Automatic Evaluation
To acquire noun phrases from running texts is useful for many applications,
such as word grouping,terminology indexing, etc. The reported literatures adopt
pure probabilistic approach, or pure rule-based noun phrases grammar to tackle
this problem. In this paper, we apply a probabilistic chunker to deciding the
implicit boundaries of constituents and utilize the linguistic knowledge to
extract the noun phrases by a finite state mechanism. The test texts are
SUSANNE Corpus and the results are evaluated by comparing the parse field of
SUSANNE Corpus automatically. The results of this preliminary experiment are
encouraging.Comment: 8 pages, Postscript file, Unix compressed, uuencode
Predicting Landslides Using Locally Aligned Convolutional Neural Networks
Landslides, movement of soil and rock under the influence of gravity, are
common phenomena that cause significant human and economic losses every year.
Experts use heterogeneous features such as slope, elevation, land cover,
lithology, rock age, and rock family to predict landslides. To work with such
features, we adapted convolutional neural networks to consider relative spatial
information for the prediction task. Traditional filters in these networks
either have a fixed orientation or are rotationally invariant. Intuitively, the
filters should orient uphill, but there is not enough data to learn the concept
of uphill; instead, it can be provided as prior knowledge. We propose a model
called Locally Aligned Convolutional Neural Network, LACNN, that follows the
ground surface at multiple scales to predict possible landslide occurrence for
a single point. To validate our method, we created a standardized dataset of
georeferenced images consisting of the heterogeneous features as inputs, and
compared our method to several baselines, including linear regression, a neural
network, and a convolutional network, using log-likelihood error and Receiver
Operating Characteristic curves on the test set. Our model achieves 2-7%
improvement in terms of accuracy and 2-15% boost in terms of log likelihood
compared to the other proposed baselines.Comment: Published in IJCAI 202
- …