Correlation-based Intrinsic Evaluation of Word Vector Representations
We introduce QVEC-CCA, an intrinsic evaluation metric for word vector
representations based on correlations of learned vectors with features
extracted from linguistic resources. We show that QVEC-CCA scores are an
effective proxy for a range of extrinsic semantic and syntactic tasks. We also
show that the proposed evaluation obtains higher and more consistent
correlations with downstream tasks, compared to existing approaches to
intrinsic evaluation of word vectors that are based on word similarity.
Comment: RepEval 2016, 5 pages
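
To make the metric concrete, the sketch below shows how a CCA-based
correlation between an embedding matrix and a matrix of features drawn from a
linguistic resource might be computed. The matrix shapes, the toy data, and
the use of scikit-learn's CCA are illustrative assumptions, not the authors'
implementation.

# Minimal sketch of a CCA-based correlation between word vectors and
# linguistic feature vectors, in the spirit of QVEC-CCA (illustrative only).
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_correlation(word_vectors, linguistic_features, n_components=1):
    """Project both matrices onto their top canonical components and
    return the Pearson correlation of the projections."""
    cca = CCA(n_components=n_components)
    X_c, Y_c = cca.fit_transform(word_vectors, linguistic_features)
    return np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1]

# Toy example: 1000 words, 300-d embeddings vs. 50-d linguistic features.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 300))
feats = rng.normal(size=(1000, 50))
print(cca_correlation(emb, feats))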
On the Effect of Semantically Enriched Context Models on Software Modularization
Many of the existing approaches for program comprehension rely on the
linguistic information found in source code, such as identifier names and
comments. Semantic clustering is one such technique for modularization of the
system that relies on the informal semantics of the program, encoded in the
vocabulary used in the source code. Treating the source code as a collection of
tokens loses the semantic information embedded within the identifiers. We try
to overcome this problem by introducing context models for source code
identifiers to obtain a semantic kernel, which can be used for both deriving
the topics that run through the system as well as their clustering. In the
first model, we abstract an identifier to its type representation and build on
this notion of context to construct contextual vector representation of the
source code. The second notion of context is defined based on the flow of data
between identifiers to represent a module as a dependency graph where the nodes
correspond to identifiers and the edges represent the data dependencies between
pairs of identifiers. We have applied our approach to 10 medium-sized open
source Java projects, and show that by introducing contexts for identifiers,
the quality of the modularization of the software systems is improved. Both of
the context models give results that are superior to the plain vector
representation of documents. In some cases, the authoritativeness of
decompositions is improved by 67%. Furthermore, a more detailed evaluation of
our approach on JEdit, an open source editor, demonstrates that inferred topics
through performing topic analysis on the contextual representations are more
meaningful compared to the plain representation of the documents. The proposed
approach of introducing context models for source code identifiers paves the
way for building tools that support developers in program comprehension tasks
such as application and domain concept location, software modularization, and
topic analysis.
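
As a rough illustration of the second context model, the sketch below builds
a small graph whose nodes are identifiers and whose edges are data
dependencies, then groups densely connected identifiers into candidate
modules. The toy edge list and the use of networkx community detection are
assumptions for illustration; the paper's own pipeline may differ.

# Illustrative sketch: identifiers as nodes, data dependencies as edges,
# clustered into candidate modules (not the paper's implementation).
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy data-dependency edges between identifiers (assignment/use relations).
edges = [
    ("userName", "userLabel"), ("userLabel", "loginForm"),
    ("dbUrl", "connection"), ("connection", "queryResult"),
    ("queryResult", "userName"),
]
graph = nx.Graph(edges)

# Group identifiers that share dense data dependencies into candidate modules.
for i, community in enumerate(greedy_modularity_communities(graph)):
    print(f"candidate module {i}: {sorted(community)}")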
Construction Grammar and Language Models
Recent progress in deep learning and natural language processing has given
rise to powerful models that are primarily trained on a cloze-like task and
show some evidence of having access to substantial linguistic information,
including some constructional knowledge. This groundbreaking discovery presents
an exciting opportunity for a synergistic relationship between computational
methods and Construction Grammar research. In this chapter, we explore three
distinct approaches to the interplay between computational methods and
Construction Grammar: (i) computational methods for text analysis, (ii)
computational Construction Grammar, and (iii) deep learning models, with a
particular focus on language models. We touch upon the first two approaches as
a contextual foundation for the use of computational methods before providing
an accessible, yet comprehensive overview of deep learning models, which also
addresses reservations construction grammarians may have. Additionally, we
delve into experiments that explore the emergence of constructionally relevant
information within these models while also examining the aspects of
Construction Grammar that may pose challenges for these models. This chapter
aims to foster collaboration between researchers in the fields of natural
language processing and Construction Grammar. By doing so, we hope to pave the
way for new insights and advancements in both these fields.
Comment: Accepted for publication in The Cambridge Handbook of Construction
Grammar, edited by Mirjam Fried and Kiki Nikiforidou. To appear in 202
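
As a minimal illustration of the cloze-like training objective mentioned
above, the snippet below queries a masked language model for the word hidden
behind a mask token. The model choice and the example sentence are
assumptions for illustration only.

# A cloze-style query with a masked language model via the Hugging Face
# fill-mask pipeline (illustrative example).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model must recover the masked token from its sentential context.
for prediction in fill_mask("She sneezed the napkin off the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))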
Using the Outlier Detection Task to Evaluate Distributional Semantic Models
In this article, we define the outlier detection task and use it to compare neural-based word embeddings with transparent count-based distributional representations. Using the English Wikipedia as a text source to train the models, we observed that embeddings outperform count-based representations when their contexts are defined as bags of words. However, there are no sharp differences between the two models if the word contexts are defined as syntactic dependencies. In general, syntax-based models tend to perform better than those based on bags of words for this specific task. Similar experiments were carried out for Portuguese, with similar results. The test datasets we have created for the outlier detection task in English and Portuguese are freely available.
This work was supported by a 2016 BBVA Foundation Grant for Researchers and Cultural Creators and by Project TELEPARES, Ministry of Economy and Competitiveness (FFI2014-51978-C2-1-R). It has received financial support from the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016–2019, ED431G/08) and the European Regional Development Fund (ERDF).
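
For readers unfamiliar with the task, the sketch below shows one common way
to score the outlier detection setting: each word in a set is ranked by its
average similarity to the other words, and the least similar word is
predicted to be the outlier. The toy vectors are placeholders; a real
evaluation would use trained embeddings and the released datasets.

# Minimal sketch of the outlier detection task with toy word vectors.
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def predict_outlier(words, vectors):
    """Return the word with the lowest average similarity to the rest."""
    scores = {}
    for w in words:
        others = [cosine(vectors[w], vectors[o]) for o in words if o != w]
        scores[w] = np.mean(others)
    return min(scores, key=scores.get)

# Toy example: three nearby vectors plus one distant vector.
rng = np.random.default_rng(1)
base = rng.normal(size=50)
vecs = {
    "apple": base + 0.1 * rng.normal(size=50),
    "pear": base + 0.1 * rng.normal(size=50),
    "plum": base + 0.1 * rng.normal(size=50),
    "carburetor": rng.normal(size=50),
}
print(predict_outlier(list(vecs), vecs))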