7 research outputs found
Detection of Product Comparisons – How Far Does an Out-of-the-box Semantic Role Labeling System Take You?
This short paper presents a pilot study investigating the training of a standard Semantic Role Labeling (SRL) system on product reviews for the new task of detecting comparisons. An (opinionated) comparison consists of a comparative “predicate” and up to three “arguments”: the entity evaluated positively, the entity evaluated negatively, and the aspect under which the comparison is made. In user-generated product reviews, the “predicate” and “arguments” are expressed in highly heterogeneous ways; but since the elements are textually annotated in existing datasets, SRL is technically applicable. We address the question of how well training an out-of-the-box SRL model works for English data. We observe that even without any feature engineering or other major adaptations to our task, the system outperforms a reasonable heuristic baseline in all steps (predicate identification, argument identification and argument classification) and on three different datasets.
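The comparison structure described in this abstract (one predicate plus up to three optional arguments) can be pictured with a small record type. This is only an illustrative sketch: the class name `Comparison`, its field names, and the example values are assumptions made here, not the paper's annotation schema or code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comparison:
    """Hypothetical container for one annotated comparison."""

    predicate: str                       # comparative word, e.g. "better"
    preferred: Optional[str] = None      # entity evaluated positively
    dispreferred: Optional[str] = None   # entity evaluated negatively
    aspect: Optional[str] = None         # aspect under which entities are compared

    def argument_count(self) -> int:
        # Count how many of the (up to three) arguments are present.
        arguments = (self.preferred, self.dispreferred, self.aspect)
        return sum(a is not None for a in arguments)


# "Camera A has a better zoom than Camera B."
full = Comparison("better", preferred="Camera A",
                  dispreferred="Camera B", aspect="zoom")

# "Camera A is sharper." (only one argument is expressed)
partial = Comparison("sharper", preferred="Camera A")
```

All three arguments are optional because, as the abstract notes, a comparison carries "up to three" of them.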
On Interpretation and Measurement of Soft Attributes for Recommendation
We address how to robustly interpret natural language refinements (or critiques) in recommender systems. In particular, in human-human recommendation settings people frequently use soft attributes to express preferences about items, including concepts like the originality of a movie plot, the noisiness of a venue, or the complexity of a recipe. While binary tagging is extensively studied in the context of recommender systems, soft attributes often involve subjective and contextual aspects, which cannot be captured reliably in this way, nor be represented as objective binary truth in a knowledge base. This also adds important considerations when measuring soft attribute ranking. We propose a more natural representation as personalized relative statements, rather than as absolute item properties. We present novel data collection techniques and evaluation approaches, and a new public dataset. We also propose a set of scoring approaches, from unsupervised to weakly supervised to fully supervised, as a step towards interpreting and acting upon soft-attribute-based critiques.
Annotation en rôles sémantiques du français en domaine spécifique
In this Natural Language Processing Ph.D. thesis, we aim to perform semantic role labeling on French domain-specific texts. This task first disambiguates the sense of the predicates in a given text and then annotates their dependent chunks with semantic roles such as Agent, Patient or Destination. It helps many applications in domains where annotated corpora exist, but is difficult to use otherwise. We first evaluate, on the FrameNet corpus, an existing method that relies solely on VerbNet and is therefore domain-independent. We show that substantial improvements can be obtained, both syntactically, by handling the passive voice, and semantically, by taking advantage of the selectional restrictions present in VerbNet. To apply this method to French, we translate two English lexical resources. We start with the WordNet lexical database. Next, we translate the VerbNet lexicon, in which verbs are grouped semantically on the basis of their syntactic features. We obtain its translation, VerbeNet, by reusing two French verb lexicons (the Lexique-Grammaire and Les Verbes Français) and by manually modifying and reorganizing the resulting lexicon. Finally, once those building blocks are in place, we evaluate the feasibility of semantic role labeling of French and English in three specific domains. We study the pros and cons of using VerbNet and VerbeNet to annotate those domains before outlining our future work.
Structurally informed methods for improved sentiment analysis
Sentiment analysis deals with methods to automatically analyze opinions in natural language texts, e.g., product reviews. Such reviews contain a large number of fine-grained opinions, but to automatically extract detailed information it is necessary to handle a wide variety of verbalizations of opinions. The goal of this thesis is to develop robust, structurally informed models for sentiment analysis which address challenges that arise from structurally complex verbalizations of opinions. In this thesis, we look at two examples of such verbalizations that benefit from incorporating structural information into the analysis: negation and comparisons.
Negation directly influences the polarity of sentiment expressions, e.g., while "good" is positive, "not good" expresses a negative opinion. We propose a machine learning approach that uses information from dependency parse trees to determine whether a sentiment word is in the scope of a negation expression.
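The "good" versus "not good" effect can be illustrated with a toy rule over a hand-written dependency structure. This sketch deliberately swaps the thesis's machine learning approach for a single hard-coded rule (a negation cue flips polarity when it attaches directly to the sentiment word); the cue list, the polarity lexicon, and the head indices below are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical toy lexicons, not taken from the thesis.
NEGATION_CUES = {"not", "n't", "never", "no"}
PRIOR_POLARITY = {"good": 1, "great": 1, "bad": -1, "poor": -1}

# A token is (word, index of its head in the sentence); the root has head -1.
Token = Tuple[str, int]


def polarity(tokens: List[Token], index: int) -> int:
    """Return the polarity of the sentiment word at `index`.

    The prior polarity is flipped if any negation cue has the
    sentiment word as its syntactic head.
    """
    word, _ = tokens[index]
    prior = PRIOR_POLARITY.get(word.lower(), 0)
    negated = any(
        w.lower() in NEGATION_CUES and head == index
        for w, head in tokens
    )
    return -prior if negated else prior


# "The zoom is not good": "good" is the root, "not" attaches to it.
negated_sentence = [("The", 1), ("zoom", 4), ("is", 4), ("not", 4), ("good", -1)]

# "The zoom is good": no negation cue attaches to "good".
plain_sentence = [("The", 1), ("zoom", 3), ("is", 3), ("good", -1)]
```

A learned scope classifier, as proposed in the thesis, would replace the single attachment rule with features drawn from the parse tree.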
Comparisons like "X is better than Y" are the main topic of this thesis. We present a machine learning system for the task of detecting the individual components of comparisons: the anchor or predicate of the comparison, the entities that are compared, which aspect they are compared in, and which entity is preferred. Again, we use structural context from a dependency parse tree to improve the performance of our system. We discuss two ways of addressing the issue of limited availability of training data for our system. First, we create a manually annotated corpus of comparisons in product reviews, the largest such resource available to date. Second, we use the semi-supervised method of structural alignment to expand a small seed set of labeled sentences with similar sentences from a large set of unlabeled sentences.
Finally, we work on the task of producing a ranked list of products that complements the isolated prediction of ratings and supports the user in the decision-making process. We demonstrate how we can use the information from comparisons to rank products and evaluate the result against two conceptually different external gold standard rankings.
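One simple way to turn extracted comparisons into a product ranking is to score each product by wins minus losses. This is a minimal sketch under that assumption; the abstract does not specify the thesis's actual ranking method, and the function name `rank_products` and the example product names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_products(comparisons: Iterable[Tuple[str, str]]) -> List[str]:
    """Rank products from (preferred, dispreferred) pairs.

    Each product gains one point per comparison it wins and loses one
    point per comparison it loses. Ties are broken alphabetically.
    """
    score: Counter = Counter()
    for preferred, dispreferred in comparisons:
        score[preferred] += 1
        score[dispreferred] -= 1
    return sorted(score, key=lambda product: (-score[product], product))


# Three comparisons extracted from hypothetical reviews.
extracted = [
    ("Camera A", "Camera B"),
    ("Camera A", "Camera C"),
    ("Camera B", "Camera C"),
]
ranking = rank_products(extracted)
```

Here `ranking` places Camera A first (two wins), then Camera B (one win, one loss), then Camera C (two losses).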
Generación de un corpus para detección de competidores en el idioma español mediante minería de opiniones comparativas. Caso de estudio: sector textil en la provincia del Azuay
With the advance of technology, and even more so with the arrival of the pandemic, the use of digital platforms has increased. A 2020 study by the Ecuadorian Chamber of Electronic Commerce shows that, with the arrival of the pandemic, electronic commerce increased the use of online digital platforms at least 15-fold compared with 2019. As a result, companies conducting market research must look for new sources of information, and the internet has become an intangible input for any business strategy. A fundamental part of a business strategy is analyzing the competition. According to the literature, this analysis used to be carried out mainly through surveys, but with the arrival of digital platforms the method has changed: today, data can be extracted from the web and fed into a Competitive Intelligence (CI) process, which enables a complete analysis aimed at gaining a competitive advantage. CI comprises several steps. This research addresses all of them but focuses mainly on the initial step, data collection and analysis, which is fundamental to CI and currently faces problems such as the lack of a Spanish-language corpus specialized for CI. Without such a corpus, researchers cannot easily implement machine learning models that would help them gain a competitive advantage. This research presents a methodology for creating a Spanish-language corpus that can be used to train algorithms for competitor detection in the context of the textile sector. It has produced two main results: 1) a methodology that uses text mining techniques (comparative opinion mining and named entity recognition) to build a corpus focused on Competitive Intelligence; 2) a corpus in Spanish, within the domain of social network comments, which serves as a basis for future research on competitive intelligence, specifically competitor detection in Spanish, where CI had been severely restricted by the lack of a corpus. Finally, the usefulness of the corpus was evaluated through a dashboard built on a case study carried out in the context of the textile sector on social networks. The corpus proved useful for the textile sector; however, a further validation with companies directly involved in the textile sector is recommended in order to obtain a more direct validation, as is an evaluation in other sectors.