FlashProfile: A Framework for Synthesizing Data Profiles
We address the problem of learning a syntactic profile for a collection of
strings, i.e. a set of regex-like patterns that succinctly describe the
syntactic variations in the strings. Real-world datasets, typically curated
from multiple sources, often contain data in various syntactic formats. Thus,
any data processing task is preceded by the critical step of data format
identification. However, manual inspection of data to identify the different
formats is infeasible in standard big-data scenarios.
Prior techniques are restricted to a small set of pre-defined patterns (e.g.
digits, letters, words, etc.), and provide no control over granularity of
profiles. We define syntactic profiling as a problem of clustering strings
based on syntactic similarity, followed by identifying patterns that succinctly
describe each cluster. We present a technique for synthesizing such profiles
over a given language of patterns, that also allows for interactive refinement
by requesting a desired number of clusters.
Using a state-of-the-art inductive synthesis framework, PROSE, we have
implemented our technique as FlashProfile. Across tasks over large
real datasets, we observe a median profiling time of only s.
Furthermore, we show that access to syntactic profiles may allow for more
accurate synthesis of programs, i.e. using fewer examples, in
programming-by-example (PBE) workflows such as FlashFill.
Comment: 28 pages, SPLASH (OOPSLA) 201
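To make the idea of a syntactic profile concrete, the following is a minimal sketch that groups strings by a coarse token-class pattern. It is not the FlashProfile algorithm and does not use PROSE: the token classes in `TOKEN_CLASSES` and the function names `coarse_pattern` and `profile` are hypothetical, and the fixed set of classes is exactly the kind of pre-defined vocabulary the abstract describes as a limitation of prior techniques. It only illustrates what "a set of regex-like patterns that describe the strings" can look like.

```python
import re
from collections import defaultdict

# Hypothetical coarse token classes (an assumption for illustration only;
# FlashProfile learns patterns over a richer, user-extensible language).
TOKEN_CLASSES = [
    (re.compile(r"[0-9]+"), r"\d+"),
    (re.compile(r"[A-Za-z]+"), r"[A-Za-z]+"),
    (re.compile(r"\s+"), r"\s+"),
]


def coarse_pattern(s):
    """Abstract a string into a regex by replacing runs of digits, letters,
    and whitespace with a class token; other characters are kept literally."""
    parts = []
    i = 0
    while i < len(s):
        for regex, label in TOKEN_CLASSES:
            m = regex.match(s, i)
            if m:
                parts.append(label)
                i = m.end()
                break
        else:
            # No class matched at position i: keep the character as a literal.
            parts.append(re.escape(s[i]))
            i += 1
    return "".join(parts)


def profile(strings):
    """Group strings that share the same coarse pattern."""
    clusters = defaultdict(list)
    for s in strings:
        clusters[coarse_pattern(s)].append(s)
    return dict(clusters)


if __name__ == "__main__":
    data = ["2018-01-05", "1999-12-31", "Jan 5", "Dec 31", "N/A"]
    for pattern, members in profile(data).items():
        print(pattern, members)
```

Each key of the returned dictionary is a regex that fully matches every string in its cluster. Unlike the technique in the abstract, this sketch offers no control over granularity and cannot be asked for a desired number of clusters.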
Patterns in syntactic dependency networks
Many languages are spoken on Earth. Despite their diversity, many robust language universals are known to exist. All languages share syntax, i.e., the ability to combine words to form sentences. The origin of such traits is a matter of open debate. Using recent developments from the statistical physics of complex networks, we show that different syntactic dependency networks (from Czech, German, and Romanian) share many nontrivial statistical patterns, such as the small-world phenomenon, scaling in the distribution of degrees, and disassortative mixing. Such previously unreported features of syntax organization are not a trivial consequence of the structure of sentences, but an emergent trait at the global scale.
Peer Reviewed. Postprint (published version
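Disassortative mixing means that highly connected nodes tend to link to weakly connected ones. One common way to quantify it is the Pearson correlation between the degrees at the two ends of each edge, where a negative value indicates disassortativity. The sketch below computes that coefficient for an undirected edge list. The function name `degree_assortativity` and the toy hub-and-spoke graph are assumptions for illustration; the abstract does not say which measure the authors used, and the example words are not taken from their corpora.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each
    undirected edge. Negative values indicate disassortative mixing."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Count every edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean = sum(xs) / n  # xs and ys have the same mean by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        raise ValueError("all edge ends have the same degree; coefficient is undefined")
    return cov / var


if __name__ == "__main__":
    # Hypothetical toy graph: one hub word linked to four other words.
    star = [("the", "cat"), ("the", "dog"), ("the", "house"), ("the", "tree")]
    print(degree_assortativity(star))
```

A pure hub-and-spoke graph like the one above is the extreme disassortative case and yields a coefficient of -1.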
Cross-Lingual Induction and Transfer of Verb Classes Based on Word Vector Space Specialisation
Existing approaches to automatic VerbNet-style verb classification are
heavily dependent on feature engineering and therefore limited to languages
with mature NLP pipelines. In this work, we propose a novel cross-lingual
transfer method for inducing VerbNets for multiple languages. To the best of
our knowledge, this is the first study which demonstrates how the architectures
for learning word embeddings can be applied to this challenging
syntactic-semantic task. Our method uses cross-lingual translation pairs to tie
each of the six target languages into a bilingual vector space with English,
jointly specialising the representations to encode the relational information
from English VerbNet. A standard clustering algorithm is then run on top of the
VerbNet-specialised representations, using vector dimensions as features for
learning verb classes. Our results show that the proposed cross-lingual
transfer approach sets new state-of-the-art verb classification performance
across all six target languages explored in this work.
Comment: EMNLP 2017 (long paper
Coalescent Assimilation Across Wordboundaries in American English and in Polish English
Coalescent assimilation (CA), where alveolar obstruents /t, d, s, z/ in word-final position merge with word-initial /j/ to produce postalveolar /tʃ, dʒ, ʃ, ʒ/, is one of the most well-known connected speech processes in English. Because it is so common, CA has been discussed in numerous textbook descriptions of English pronunciation, and yet, upon comparing them, it is difficult to get a clear picture of what factors make its application likely. This paper investigates the application of CA in American English to determine a) what factors increase the likelihood of its application for each of the four alveolar obstruents, and b) what the allophonic realization of the plosives /t, d/ is when CA does not apply. To do so, the Buckeye Corpus (Pitt et al. 2007) of spoken American English is analyzed quantitatively. As a second step, these results are compared with Polish English; statistics analogous to the ones listed above for American English are gathered for Polish English based on the PLEC corpus (Pęzik 2012). The last section focuses on the consequences of the findings for teaching based on a native speaker model. It is argued that a description of the phenomenon that reflects the behavior of speakers of American English more accurately than extant textbook accounts could be beneficial to the acquisition of these patterns.
- …