ATLAS: A flexible and extensible architecture for linguistic annotation
We describe a formal model for annotating linguistic artifacts, from which we
derive an application programming interface (API) to a suite of tools for
manipulating these annotations. The abstract logical model provides for a range
of storage formats and promotes the reuse of tools that interact through this
API. We focus first on ``Annotation Graphs,'' a graph model for annotations on
linear signals (such as text and speech) indexed by intervals, for which
efficient database storage and querying techniques are applicable. We note how
a wide range of existing annotated corpora can be mapped to this annotation
graph model. This model is then generalized to encompass a wider variety of
linguistic ``signals,'' including both naturally occurring phenomena (as
recorded in images, video, multi-modal interactions, etc.), as well as the
derived resources that are increasingly important to the engineering of natural
language processing systems (such as word lists, dictionaries, aligned
bilingual corpora, etc.). We conclude with a review of the current efforts
towards implementing key pieces of this architecture. Comment: 8 pages, 9 figures
Ubiquitous Place Names Standardization and Study in Indonesia
Place names play a vital role in human society. Names exist in all languages, and place names are an indispensable part of international communication. This has been acknowledged by the establishment of the United Nations Group of Experts on Geographical Names (UNGEGN). One of UNGEGN's tasks is to coordinate international efforts on the proper use of place names. Indonesia supports this effort through its National Geospatial Agency (BIG). Place names are also of interest as an object of study in themselves. Academic studies of place names are found in linguistics, onomastics, philosophy and a number of other academic disciplines. This article looks at these two dimensions of place names: standardization efforts under the auspices of international and national bodies, and academic studies of names, with particular reference to the situation in Indonesia.
Neural Vector Spaces for Unsupervised Information Retrieval
We propose the Neural Vector Space Model (NVSM), a method that learns
representations of documents in an unsupervised manner for news article
retrieval. In the NVSM paradigm, we learn low-dimensional representations of
words and documents from scratch using gradient descent and rank documents
according to their similarity with query representations that are composed from
word representations. We show that NVSM performs better at document ranking
than existing latent semantic vector space methods. The addition of NVSM to a
mixture of lexical language models and a state-of-the-art baseline vector space
model yields a statistically significant increase in retrieval effectiveness.
Consequently, NVSM adds a complementary relevance signal. In addition to
semantic matching, we find that NVSM performs well in cases where lexical
matching is needed.
NVSM learns a notion of term specificity directly from the document
collection without feature engineering. We also show that NVSM learns
regularities related to Luhn significance. Finally, we give advice on how to
deploy NVSM in situations where model selection (e.g., cross-validation) is
infeasible. We find that an unsupervised ensemble of multiple models trained
with different hyperparameter values performs better than a single
cross-validated model. Therefore, NVSM can safely be used for ranking documents
without supervised relevance judgments. Comment: TOIS 201
Stabilizing knowledge through standards - A perspective for the humanities
Standards tend to generate mixed feelings among scientists. They are often
seen as failing to reflect the state of the art in a given domain and as a
hindrance to scientific creativity. Still, scientists should in theory be best
placed to bring their expertise into standards development, being more neutral
on issues that are typically tied to competing industrial interests. Although
developing standards in the humanities may seem even more complex, we show how
this can be made feasible through the experience gained both within the Text
Encoding Initiative consortium and the International Organisation for
Standardisation. Taking the specific case of lexical resources, we show how
this brings about new ideas for designing future research infrastructures in
the human and social sciences.
Kun-dangwok: "clan lects" and Ausbau in western Arnhem Land
The sociolinguistic concept of an Ausbau language is widely thought of as exclusively associated with the standardization of languages for the political and social purposes of nation states. Language policy initiated by state institutions, the development of literacy and new specialist registers of language are typical elements of the Ausbau process. However, the linguistic ideologies of small language groups, such as those of the minority languages of Aboriginal Australia, can also drive certain forms of deliberate language elaboration. An important aspect of Aboriginal linguistic ideology is language diversity, reflected in the development of elemental sociolinguistic varieties such as patriclan lects. In the Bininj Kun-wok dialect chain of western Arnhem Land, a regional system of lectal differentiation known as Kun-dangwok has developed, reflecting an Aboriginal linguistic ideology whereby being different, especially in ways of speaking, is seen as a central aspect of identity. The functions of the Kun-dangwok clan lect system are described using examples of naturally occurring conversation, which provide evidence that clan lects are the result of an Ausbau process that produces the opposite of language standardization and an increase in Abstand between varieties.
A Formal Framework for Linguistic Annotation
`Linguistic annotation' covers any descriptive or analytic notations applied
to raw language data. The basic data may be in the form of time functions --
audio, video and/or physiological recordings -- or it may be textual. The added
notations may include transcriptions of all sorts (from phonetic features to
discourse structures), part-of-speech and sense tagging, syntactic analysis,
`named entity' identification, co-reference annotation, and so on. While there
are several ongoing efforts to provide formats and tools for such annotations
and to publish annotated linguistic databases, the lack of widely accepted
standards is becoming a critical problem. Proposed standards, to the extent
they exist, have focussed on file formats. This paper focuses instead on the
logical structure of linguistic annotations. We survey a wide variety of
existing annotation formats and demonstrate a common conceptual core, the
annotation graph. This provides a formal framework for constructing,
maintaining and searching linguistic annotations, while remaining consistent
with many alternative data structures and file formats. Comment: 49 pages
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS Think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.
Adversarial Network Bottleneck Features for Noise Robust Speaker Verification
In this paper, we propose a noise robust bottleneck feature representation
which is generated by an adversarial network (AN). The AN includes two
cascade-connected networks, an encoding network (EN) and a discriminative network (DN).
Mel-frequency cepstral coefficients (MFCCs) of clean and noisy speech are used
as input to the EN and the output of the EN is used as the noise robust
feature. The EN and DN are trained in turn: when training the DN, noise
types are used as the training labels, and when training the EN, all labels
are set to the same value, i.e., the clean speech label, which aims to make the
AN features invariant to noise and thus achieve noise robustness. We evaluate
the performance of the proposed feature on a Gaussian Mixture Model-Universal
Background Model based speaker verification system and compare it with MFCC
features of speech enhanced by short-time spectral amplitude minimum mean
square error (STSA-MMSE) and deep neural network-based speech enhancement
(DNN-SE) methods. Experimental results on the RSR2015 database show that the
proposed AN bottleneck feature (AN-BN) dramatically outperforms the STSA-MMSE
and DNN-SE based MFCCs for different noise types and signal-to-noise ratios.
Furthermore, the AN-BN feature is able to improve the speaker verification
performance under the clean condition.