Handling non-compositionality in multilingual CNLs
In this paper, we describe methods for handling multilingual
non-compositional constructions in the framework of GF. We specifically look at
methods to detect and extract non-compositional phrases from parallel texts and
propose methods to handle such constructions in GF grammars. We expect that the
methods to handle non-compositional constructions will enrich CNLs by providing
more flexibility in the design of controlled languages. We look at two specific
use cases of non-compositional constructions: a general-purpose method to
detect and extract multilingual multiword expressions and a procedure to
identify nominal compounds in German. We evaluate our procedure for multiword
expressions by performing a qualitative analysis of the results. For the
experiments on nominal compounds, we incorporate the detected compounds in a
full SMT pipeline and evaluate the impact of our method on the machine
translation process. Comment: CNL workshop in COLING 201
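The candidate-extraction step for multiword expressions can be sketched with a pointwise-mutual-information filter over adjacent word pairs. This is a minimal monolingual stand-in for the multilingual procedure described above; the toy corpus and the count threshold are invented for illustration:

```python
import math
from collections import Counter

def pmi_bigrams(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    High-PMI pairs co-occur more often than chance predicts, a common
    first filter for multiword-expression candidates.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = sent.lower().split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # discard rare pairs with unreliable statistics
        p_pair = c / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

corpus = [
    "red tape slows the project",
    "cutting red tape helps startups",
    "the red tape around permits",
    "a measure of the tape length",
]
scores = pmi_bigrams(corpus)
print(max(scores, key=scores.get))  # ('red', 'tape')
```

A real pipeline would run such a filter on each side of a parallel corpus and align the surviving candidates across languages before adding them to the GF grammar.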
Non-Compositional Term Dependence for Information Retrieval
Modelling term dependence in IR aims to identify co-occurring terms that are
too heavily dependent on each other to be treated as a bag of words, and to
adapt the indexing and ranking accordingly. Dependent terms are predominantly
identified using lexical frequency statistics, assuming that (a) if terms
co-occur often enough in some corpus, they are semantically dependent; (b) the
more often they co-occur, the more semantically dependent they are. This
assumption is not always correct: the frequency of co-occurring terms can be
unrelated to the strength of their semantic dependence. For example, "red tape"
might be less frequent overall than "tape measure" in some corpus, but this
does not mean that "red"+"tape" are less dependent than "tape"+"measure". This is
especially the case for non-compositional phrases, i.e. phrases whose meaning
cannot be composed from the individual meanings of their terms (such as the
phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction
between the frequency and strength of term dependence in IR, we present a
principled approach for handling term dependence in queries, using both lexical
frequency and semantic evidence. We focus on non-compositional phrases,
extending a recent unsupervised model for their detection [21] to IR. Our
approach, integrated into ranking using Markov Random Fields [31], yields
effectiveness gains over competitive TREC baselines, showing that there is
still room for improvement in the very well-studied area of term dependence in
IR.
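A toy sequential-dependence scorer in the spirit of Markov Random Field ranking [31] illustrates how term dependence enters the ranking function. The weights, window size, and raw-count features below are invented simplifications; the actual model uses smoothed language-model potentials over term, ordered-window, and unordered-window cliques:

```python
from collections import Counter

def sdm_score(query, doc, w_t=0.85, w_o=0.10, w_u=0.05, window=8):
    """Toy sequential-dependence score: a weighted mix of single-term
    matches, exact ordered bigram matches, and unordered co-occurrence
    of query bigram terms within a window."""
    q = query.lower().split()
    d = doc.lower().split()
    tf = Counter(d)
    # feature 1: independent term matches (bag-of-words part)
    term = sum(tf[t] for t in q)
    # feature 2: query bigrams occurring adjacently and in order
    ordered = sum(
        1 for i in range(len(d) - 1)
        for a, b in zip(q, q[1:])
        if d[i] == a and d[i + 1] == b
    )
    # feature 3: query bigram terms co-occurring within the window
    unordered = sum(
        1 for a, b in zip(q, q[1:])
        for i, t in enumerate(d)
        if t == a and b in d[max(0, i - window):i + window]
    )
    return w_t * term + w_o * ordered + w_u * unordered

doc1 = "the new law cut red tape for small firms"
doc2 = "red paint and tape were sold separately"
print(sdm_score("red tape", doc1) > sdm_score("red tape", doc2))  # True
```

The paper's contribution can be read as replacing the purely frequency-driven dependence features with ones informed by semantic evidence of non-compositionality.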
A probabilistic framework for analysing the compositionality of conceptual combinations
Conceptual combination performs a fundamental role in creating the broad
range of compound phrases utilised in everyday language. This article provides
a novel probabilistic framework for assessing whether the semantics of conceptual
combinations are compositional, and so can be considered as a function of
the semantics of the constituent concepts, or not. While the systematicity and
productivity of language provide a strong argument in favor of assuming compositionality,
this very assumption is still regularly questioned in both cognitive
science and philosophy. Additionally, the principle of semantic compositionality
is underspecified, which means that notions of both "strong" and "weak"
compositionality appear in the literature. Rather than adjudicating between
different grades of compositionality, the framework presented here contributes
formal methods for determining a clear dividing line between compositional and
non-compositional semantics. In addition, we suggest that the distinction between
these is contextually sensitive. Compositionality is equated with a joint probability distribution modeling how the constituent concepts in the combination
are interpreted. Marginal selectivity is introduced as a pivotal probabilistic
constraint for the application of the Bell/CH and CHSH systems of inequalities.
Non-compositionality is equated with a failure of marginal selectivity, or violation
of either system of inequalities in the presence of marginal selectivity. This
means that the conceptual combination cannot be modeled in a joint probability
distribution, the variables of which correspond to how the constituent concepts
are being interpreted. The formal analysis methods are demonstrated by applying
them to an empirical illustration of twenty-four non-lexicalised conceptual
combinations.
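For concreteness, the CHSH system of inequalities referenced above can be written in a standard textbook form (the notation here is a common presentation, not necessarily the article's own):

```latex
% CHSH inequality for \pm 1-valued variables A, A' (interpretations of
% the first constituent concept) and B, B' (of the second):
\[
  \bigl| E(A,B) + E(A,B') + E(A',B) - E(A',B') \bigr| \le 2
\]
% Marginal selectivity: the marginal distribution of each concept's
% interpretation must not depend on how the other concept is probed,
% e.g. for the first concept,
\[
  p(A = a \mid \text{context } B) = p(A = a \mid \text{context } B')
  \quad \text{for all } a .
\]
```

If marginal selectivity holds and the inequality is satisfied, a joint probability distribution over all four variables exists and the combination is classed as compositional; a failure of either condition is the article's criterion for non-compositionality.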
Human Associations Help to Detect Conventionalized Multiword Expressions
In this paper we show that human evidence about the conventionalization of a
phrase can be obtained by asking native speakers about the associations they
have with the phrase and with its component words. We show that if the
component words of a phrase have each other as frequent associations, the
phrase can be considered conventionalized. Another type of conventionalized
phrase can be revealed using two factors: low entropy of the phrase's
associations and low intersection between the associations of the component
words and those of the phrase. The association experiments were performed for
the Russian language.
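The two signals described above can be sketched as follows. The thresholds, the top-3 cutoff, and the toy association data are invented for illustration (the paper's experiments used Russian association norms):

```python
import math
from collections import Counter

def entropy(assocs):
    """Shannon entropy of an association response distribution
    (low entropy = responses concentrate on a few associates)."""
    counts = Counter(assocs)
    n = len(assocs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_conventionalized(word_assocs, phrase_assocs,
                           entropy_cutoff=1.5, overlap_cutoff=0.2):
    """Heuristic combining the paper's two signals:
    (a) component words list each other as frequent associates, or
    (b) low entropy of phrase associations together with low overlap
        between word-level and phrase-level associations."""
    w1, w2 = word_assocs.keys()
    mutual = w2 in word_assocs[w1][:3] and w1 in word_assocs[w2][:3]
    h = entropy(phrase_assocs)
    word_set = set(word_assocs[w1]) | set(word_assocs[w2])
    overlap = len(word_set & set(phrase_assocs)) / max(len(set(phrase_assocs)), 1)
    return mutual or (h < entropy_cutoff and overlap < overlap_cutoff)

word_assocs = {"red": ["tape", "colour", "blood"],
               "tape": ["red", "measure", "sticky"]}
phrase_assocs = ["bureaucracy", "bureaucracy", "forms", "bureaucracy"]
print(looks_conventionalized(word_assocs, phrase_assocs))  # True
```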
A Study of Metrics of Distance and Correlation Between Ranked Lists for Compositionality Detection
Compositionality in language refers to how much the meaning of some phrase
can be decomposed into the meaning of its constituents and the way these
constituents are combined. Based on the premise that substitution by synonyms
is meaning-preserving, compositionality can be approximated as the semantic
similarity between a phrase and a version of that phrase where words have been
replaced by their synonyms. Different ways of representing such phrases exist
(e.g., vectors [1] or language models [2]), and the choice of representation
affects the measurement of semantic similarity.
We propose a new compositionality detection method that represents phrases as
ranked lists of term weights. Our method approximates the semantic similarity
between two ranked list representations using a range of well-known distance
and correlation metrics. In contrast to most state-of-the-art approaches in
compositionality detection, our method is completely unsupervised. Experiments
with a publicly available dataset of 1048 human-annotated phrases show that,
compared to strong supervised baselines, our approach provides superior
measurement of compositionality using any of the distance and correlation
metrics considered.
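Comparing ranked-list representations can be sketched with Spearman's rank correlation, one family of metrics the paper considers. The vocabulary and term weights below are invented: a strongly non-compositional phrase should produce a weight profile that correlates poorly with the profile of its synonym-substituted variant:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation between two equally long score lists
    (no tie handling, for brevity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

vocab = ["bureaucracy", "rules", "office", "ribbon", "adhesive"]
weights_phrase = [0.9, 0.7, 0.5, 0.1, 0.05]  # context profile of "red tape"
weights_subst  = [0.1, 0.2, 0.3, 0.9, 0.8]   # profile of "crimson tape"
rho = spearman_rho(weights_phrase, weights_subst)
print(round(rho, 2))  # -0.9: substitution flips the ranking
```

A low or negative correlation means synonym substitution did not preserve the phrase's contextual profile, which under the paper's premise indicates non-compositionality.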
Non-Compositionality in Sentiment: New Data and Analyses
When natural language phrases are combined, their meaning is often more than
the sum of their parts. In the context of NLP tasks such as sentiment analysis,
where the meaning of a phrase is its sentiment, this still applies. Many NLP
studies on sentiment analysis, however, focus on the fact that sentiment
computations are largely compositional. We, instead, set out to obtain
non-compositionality ratings for phrases with respect to their sentiment. Our
contributions are as follows: a) a methodology for obtaining those
non-compositionality ratings, b) a resource of ratings for 259 phrases --
NonCompSST -- along with an analysis of that resource, and c) an evaluation of
computational models for sentiment analysis using this new resource. Comment:
Published in EMNLP Findings 2023; 13 pages total (5 in the main paper, 3 pages
with limitations, acknowledgments and references, 5 pages with appendices).
Unified Representation for Non-compositional and Compositional Expressions
Accurate processing of non-compositional language relies on generating good
representations for such expressions. In this work, we study the representation
of language non-compositionality by proposing a language model, PIER, that
builds on BART and can create semantically meaningful and contextually
appropriate representations for English potentially idiomatic expressions
(PIEs). PIEs are characterized by their non-compositionality and contextual
ambiguity in their literal and idiomatic interpretations. Via intrinsic
evaluation on embedding quality and extrinsic evaluation on PIE processing and
NLU tasks, we show that representations generated by PIER yield a 33% higher
homogeneity score for embedding clustering than BART, as well as gains of
3.12% in accuracy and 3.29% in sequence accuracy for PIE sense classification
and span detection compared to the state-of-the-art IE representation model,
GIEA. These gains are achieved without sacrificing PIER's performance on NLU
tasks (+/- 1% accuracy) compared to BART. Comment: This work is accepted to
EMNLP 2023 Findings.
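The homogeneity score used in the intrinsic evaluation is a standard clustering metric, 1 - H(C|K)/H(C), where C are the true classes and K the predicted clusters. A from-scratch version, with invented toy sense labels (idiomatic vs. literal PIE uses):

```python
import math
from collections import Counter, defaultdict

def homogeneity(labels_true, labels_pred):
    """Clustering homogeneity: 1 - H(C|K)/H(C). Equals 1 when every
    predicted cluster contains members of a single true class."""
    n = len(labels_true)
    def H(labels):
        return -sum((c / n) * math.log(c / n)
                    for c in Counter(labels).values())
    h_c = H(labels_true)
    if h_c == 0:
        return 1.0  # only one class: trivially homogeneous
    # conditional entropy H(C|K): class entropy within each cluster
    by_cluster = defaultdict(list)
    for c, k in zip(labels_true, labels_pred):
        by_cluster[k].append(c)
    h_c_k = 0.0
    for members in by_cluster.values():
        nk = len(members)
        for c_count in Counter(members).values():
            h_c_k -= (c_count / n) * math.log(c_count / nk)
    return 1 - h_c_k / h_c

# idiomatic ("i") vs literal ("l") senses, two predicted clusters
true = ["i", "i", "l", "l"]
print(homogeneity(true, [0, 0, 1, 1]))        # 1.0: pure clusters
print(homogeneity(true, [0, 1, 0, 1]) < 1.0)  # True: mixed clusters score lower
```

Under this metric, the reported 33% gain means PIER's embeddings cluster PIE senses into markedly purer groups than BART's.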