26 research outputs found

    A French Corpus Annotated for Multiword Nouns

    This paper presents a French corpus annotated for multiword nouns. The corpus is designed for research in information retrieval and extraction, as well as in deep and shallow syntactic parsing. We delimit which kinds of multiword units we targeted for this annotation task, describe the resources and methods we used for the annotation, and briefly comment on the results. The annotated corpus is available at http://infolingu.univ-mlv.fr/ under the LGPLLR license.

    An Empirical Study of English-Chinese Translation of Novel Context-Free Compound Nouns and Phrases

    The current study designs a compound translation test and finds that, unlike English speakers, Chinese translators tend to bypass syntactic paraphrase and directly conduct semantic processing on the surface structure of compounds and phrases. Syntactic operations, semantic categories, and world knowledge are important factors in compound interpretation and translation. Syntactic analysis and semantic processing are important factors in the interpretation and translation of novel context-free compounds and phrases. The present study also reveals psychological differences between English and Chinese speakers. Knowledge of syntactic transformations is also helpful in disambiguating compounds and phrases that share the same surface structure. Statistical results demonstrate that the abstractness of compounds affects translators' processing effort as well as their accuracy. Other possible factors in compound comprehension and translation include world knowledge, contextual information, and pragmatic awareness.

    Opinion Holder and Target Extraction on Opinion Compounds – A Linguistic Approach

    We present an approach to the new task of opinion holder and target extraction on opinion compounds. Opinion compounds (e.g. user rating or victim support) are noun compounds whose head is an opinion noun. We not only examine features known to be effective for noun compound analysis, such as paraphrases and semantic classes of heads and modifiers, but also propose novel features tailored to this new task. Among them, we examine paraphrases that jointly consider holders and targets, a verb detour in which noun heads are replaced by related verbs, a global head constraint allowing inference between different compounds, and the categorization of the sentiment view that the head conveys.

    Las Relaciones Semánticas Predicen la Desambiguación Estructural de las Unidades Terminológicas Poliléxicas con Tres Formantes

    For English multiword terms (MWTs) of three or more constituents (e.g., sea level rise), a semantic analysis, based on linguistic and domain knowledge, is necessary to resolve the dependency between components. This structural disambiguation, often known as bracketing, involves the grouping of the dependent components so that the MWT is reduced to its basic form of modifier+head, as in [sea level] [rise]. Knowledge of these dependencies facilitates the comprehension of an MWT and its accurate translation into other languages. Moreover, the resolution of MWT bracketing provides a higher overall accuracy in machine translation systems and sentence parsers. This paper thus presents a pilot study that explored whether the bracketing of a ternary compound, when used as an argument in a sentence, can be predicted from the semantic information encoded in that sentence. It is shown that, with a random forest model, the semantic relation of the MWT to another argument in the same sentence, the lexical domain of the predicate, and the semantic role of the MWT were able to predict the bracketing of the 190 ternary compounds used as arguments in a sample of 188 semantically annotated sentences from a Coastal Engineering corpus (100% F1-score). Furthermore, the semantic relation of an MWT to another argument in the same sentence proved, on its own, highly capable of predicting ternary compound bracketing with a binary decision-tree model (94.12% F1-score). This research was carried out as part of projects PID2020-118369GB-I00, "Transversal Integration of Culture in a Terminological Knowledge Base on Environment" (TRANSCULTURE), funded by the Spanish Ministry of Science and Innovation; and A-HUM-600-UGR20, "Culture as Transversal Module in a Terminological Knowledge Base on the Environment" (CULTURAMA), funded by the Andalusian Ministry of Economy, Knowledge, Business, and University.
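    As an illustration of what bracketing a ternary compound means, here is a minimal sketch. It does not reproduce the study's random forest or decision-tree models; the function names `bracket` and `render` are hypothetical, and the choice of left or right branching is passed in by hand rather than predicted.

```python
def bracket(term, left_branching):
    """Group a three-word term into a binary modifier+head tree.

    Assumption for illustration: the term has exactly three
    whitespace-separated constituents.
    """
    first, second, third = term.split()
    if left_branching:
        # The first two words form the modifier of the third.
        return ((first, second), third)
    # The last two words form the head, modified by the first.
    return (first, (second, third))


def render(tree):
    """Print a tree with square brackets around each grouped pair."""
    if isinstance(tree, str):
        return tree
    return "[" + " ".join(render(part) for part in tree) + "]"


print(render(bracket("sea level rise", left_branching=True)))
print(render(bracket("sea level rise", left_branching=False)))
```

    The left-branching reading corresponds to the abstract's example, where "sea level" modifies "rise".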

    Head to head: Semantic similarity of multi-word terms

    Terms are linguistic signifiers of domain-specific concepts. Semantic similarity between terms refers to the corresponding distance in the conceptual space. In this study, we use lexico-syntactic information to define a vector space representation in which cosine similarity closely approximates semantic similarity between the corresponding terms. Given a multi-word term, each word is weighted in terms of its defining properties. In this context, the head noun is given the highest weight. Other words are weighted depending on their relations to the head noun. We formalized the problem as that of determining a topological ordering of a directed acyclic graph, which is based on constituency and dependency relations within a noun phrase. To counteract the errors associated with automatically inferred constituency and dependency relations, we implemented a heuristic approach to approximating the topological ordering. Different weights are assigned to different words based on their positions. Clustering experiments performed on such a vector space representation showed considerable improvement over the conventional bag-of-words representation. Specifically, it more consistently reflected semantic similarity between the terms. This was established by analyzing the differences between automatically generated dendrograms and manually constructed taxonomies. In conclusion, our method can be used to semi-automate taxonomy construction.
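    The idea of weighting the head noun most heavily can be sketched as follows. This is a toy illustration, not the paper's method: it assumes the head is the rightmost word of an English noun phrase and that weights decay geometrically with distance from the head. The function names and the `decay` parameter are hypothetical.

```python
import math
from collections import Counter


def weighted_vector(term, decay=0.5):
    """Sparse vector for a term; the rightmost word is assumed to be the head."""
    words = term.lower().split()
    vector = Counter()
    # Walk leftwards from the head, shrinking the weight at each step.
    for distance, word in enumerate(reversed(words)):
        vector[word] += decay ** distance
    return vector


def bag_of_words(term):
    """Baseline: every word counts equally."""
    return Counter(term.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[word] * v[word] for word in u if word in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


pairs = [("knee injury", "shoulder injury"), ("knee injury", "knee brace")]
for a, b in pairs:
    weighted = cosine(weighted_vector(a), weighted_vector(b))
    baseline = cosine(bag_of_words(a), bag_of_words(b))
    print(f"{a} / {b}: weighted={weighted:.2f}, bag-of-words={baseline:.2f}")
```

    Under these assumptions, two terms that share a head score higher than two terms that share only a modifier, whereas the bag-of-words baseline cannot tell the two cases apart.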

    Sammensætninger i et børnesprogsmateriale

    This article discusses compounds in a corpus collected among native and multilingual children in schools in Aarhus, Denmark. The corpus consists of interviews with individual children and recordings of groups of children playing; the compounds were collected from the interviews only. The focus is on different semantic types of compounds. A large part of the corpus consists of simple lexicalized compounds, where the compound is the only relevant alternative for expressing oneself idiomatically. Occasionally we found interesting unconventional compounds that the children used to fill lexical gaps, among the native speakers as well as among the multilingual children. In some cases, very special types of compounds came up, especially in the children's attempts to cope with specific challenges in some of the interview tasks.