
    Measuring associational thinking through word embeddings

    The development of models to quantify semantic similarity and relatedness between words has been a major focus of many studies in various fields, e.g. psychology, linguistics, and natural language processing. Unlike the measures proposed by most previous research, this article aims at automatically estimating the strength of association between words that may or may not be semantically related. We demonstrate that the performance of the model depends not only on the combination of independently constructed word embeddings (namely, corpus- and network-based embeddings) but also on the way these word vectors interact. The research concludes that the weighted average of the cosine-similarity coefficients derived from independent word embeddings in a double vector space tends to yield high correlations with human judgements. Moreover, we demonstrate that evaluating word associations through a measure that relies not only on the rank ordering of word pairs but also on the strength of associations can reveal findings that go unnoticed by traditional measures such as Spearman's and Pearson's correlation coefficients.

    Financial support for this research has been provided by the Spanish Ministry of Science, Innovation and Universities [grant number RTC 2017-6389-5], the Spanish "Agencia Estatal de Investigación" [grant number PID2020-112827GB-I00 / AEI / 10.13039/501100011033], and the European Union's Horizon 2020 research and innovation programme [grant number 101017861: project SMARTLAGOON]. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

    Periñán-Pascual, C. (2022). Measuring associational thinking through word embeddings. Artificial Intelligence Review, 55(3), 2065-2102. https://doi.org/10.1007/s10462-021-10056-6
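    A minimal sketch of the scoring idea described in the abstract: a weighted average of cosine similarities computed in two independently built embedding spaces. The weight alpha and the random toy vectors are illustrative assumptions, not values or data from the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association_strength(word_a, word_b, corpus_emb, network_emb, alpha=0.5):
    """Weighted average of cosine similarities taken in two independent
    vector spaces (corpus-based and network-based embeddings)."""
    sim_corpus = cosine(corpus_emb[word_a], corpus_emb[word_b])
    sim_network = cosine(network_emb[word_a], network_emb[word_b])
    return alpha * sim_corpus + (1 - alpha) * sim_network

# Toy example with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
corpus_emb = {w: rng.normal(size=300) for w in ("coffee", "cup")}
network_emb = {w: rng.normal(size=100) for w in ("coffee", "cup")}
print(association_strength("coffee", "cup", corpus_emb, network_emb, alpha=0.6))
```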

    Productivity Measurement of Call Centre Agents using a Multimodal Classification Approach

    Call centre channels play a cornerstone role in business communications and transactions, especially in challenging business situations. Operational efficiency, service quality, and resource productivity are core aspects of call centres' competitive advantage in rapid market competition. Performance evaluation in call centres is challenging because of subjective human judgement, the manual sorting of massive numbers of calls, and inconsistency across different raters. These challenges reduce operational efficiency and lead to frustrated customers. This study aims to automate performance evaluation in call centres using various deep learning approaches. Calls recorded in a call centre are modelled and classified into high- or low-performance evaluations, categorised as productive or nonproductive calls. The proposed conceptual model applies a deep learning network approach to model the recorded calls as text and speech. It is based on the following: 1) a focus on the technical part of agent performance, 2) objective evaluation of the corpus, 3) extended features for both text and speech, and 4) combination of the best-performing text and speech models in a multimodal structure. Accordingly, a diarisation algorithm separates the parts of the call where the agent is speaking from those where the customer is. Manual annotation is also necessary to divide the modelling corpus into productive and nonproductive calls (supervised training); Krippendorff's alpha was applied to control for subjectivity in the manual annotation. Arabic speech recognition is then developed to transcribe the speech into text. The text features are word vectors produced by an embedding layer. For the speech features, several configurations of Mel-Frequency Cepstral Coefficients (MFCC) augmented with Low-Level Descriptors (LLD) were tried to improve classification accuracy. The data modelling architectures for speech and text are based on CNNs, BiLSTMs, and an attention layer. The multimodal approach then concatenates the text and speech models using a joint representation to improve performance accuracy. The main contributions of this thesis are:
    • Developing an Arabic speech recognition method for automatic transcription of speech into text.
    • Designing several DNN architectures to improve performance evaluation using speech features based on MFCC and LLD.
    • Developing a Max Weight Similarity (MWS) function that outperforms the softmax function used in the attention layer.
    • Proposing a multimodal approach that combines the text and speech models for the best performance evaluation.
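    A minimal sketch of the joint-representation idea described above: a text branch and a speech branch concatenated into one classifier of productive vs. nonproductive calls. All layer sizes, sequence lengths, and the use of plain CNN+BiLSTM branches (without the thesis's attention layer or MWS function) are illustrative assumptions, not the thesis's actual architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Text branch: embedded word IDs -> CNN -> BiLSTM.
text_in = layers.Input(shape=(200,), name="word_ids")          # 200 tokens (assumed)
x = layers.Embedding(input_dim=30000, output_dim=128)(text_in)
x = layers.Conv1D(64, 5, activation="relu")(x)
x = layers.Bidirectional(layers.LSTM(64))(x)

# Speech branch: MFCC(+LLD) frame features -> CNN -> BiLSTM.
speech_in = layers.Input(shape=(500, 39), name="mfcc_frames")  # 500 frames x 39 coeffs (assumed)
y = layers.Conv1D(64, 5, activation="relu")(speech_in)
y = layers.Bidirectional(layers.LSTM(64))(y)

# Joint representation: concatenate the two modality vectors,
# then classify the call as productive (1) or nonproductive (0).
z = layers.concatenate([x, y])
z = layers.Dense(64, activation="relu")(z)
out = layers.Dense(1, activation="sigmoid", name="productive")(z)

model = models.Model([text_in, speech_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```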

    Semantic Analysis of English Loanwords in Japanese and Korean Using Word Embeddings

    Doctoral dissertation, Department of Linguistics, College of Humanities, Graduate School of Seoul National University, February 2021. Advisor: Shin Hyopil (신효필).

    Through cultural exchanges with foreign countries, many foreign words have entered other languages along with foreign culture. These borrowed words, loanwords, are widespread in languages all over the world. Historical linguistics has actively studied loanwords because they can trigger linguistic change within the recipient language. Loanwords affect existing words and grammar: native words become obsolete, foreign suffixes and words coin new words and phrases by combining with native words in the recipient language, and foreign prepositions come into use in the recipient language. Loanwords themselves also undergo language changes (morphological, phonological, and semantic) because of the linguistic constraints of recipient languages during integration and adaptation. Several fields of linguistics (morphology, phonology, and semantics) have studied these changes caused by the influx of loanwords. Loanwords mainly introduce to the recipient language a completely new foreign product or concept that cannot be expressed by existing words. However, people also use loanwords to convey prestigious, luxurious, and academic images; these sociolinguistic roles of loanwords have recently received particular attention in sociolinguistics and pragmatics. Most previous work on loanwords has gathered many examples and summarized the patterns of linguistic change.
    Recently, corpus-based quantitative studies have begun to reveal statistically the linguistic factors, such as word length, that influence the successful integration and adaptation of loanwords in the recipient language. However, such frequency-based research has difficulty quantifying complex semantic information, so quantitative analysis of loanword semantic phenomena has remained undeveloped. This research sheds light on the quantitative analysis of the semantic phenomena of loanwords using word embeddings, which can effectively convert the semantic and contextual information of words into vector values using deep learning methods and large language data. This dissertation proposes several quantitative methods for analyzing semantic phenomena related to loanwords, focusing on three topics: lexical competition, semantic adaptation, and the social semantic function of loanwords and cultural trend change. The first study focuses on lexical competition between a loanword and its native synonym. Frequency alone cannot distinguish the types of lexical competition: word replacement or semantic differentiation. Judging the type of lexical competition requires knowing the context-sharing condition between the loanword and the native synonym. We apply a geometrical concept to model this context-sharing condition; the resulting geometrical word-embedding-based model quantitatively judges which kind of lexical competition occurs between loanwords and native synonyms. The second study focuses on the semantic adaptation of English loanwords in Japanese and Korean. English loanwords undergo semantic change (semantic adaptation) through integration and adaptation in the recipient language. This study applies the transformation matrix method to compare the semantic difference between loanwords and the original English words, extends the method to a contrastive study of the semantic adaptation of English loanwords in Japanese and Korean, and statistically analyzes the influence of the polysemy of English words on semantic adaptation. The third study focuses on the social semantic role of loanwords in reflecting current cultural trends in Japanese and Korean. Japanese and Korean media frequently use loanwords when new trends or issues emerge, so loanwords appear to work as signals of cultural trends. We therefore propose the hypothesis that loanwords act as indicators of cultural trend change, and verify it with a method that tracks the contextual change of loanwords over time using a pre-trained contextual embedding model (BERT). This word-embedding-based method can detect cultural trend change through the contextual change of loanwords. Throughout these studies we used Japanese and Korean data, demonstrating the possibility of computational multilingual contrastive linguistic study.
    These word embedding-based semantic analysis methods will contribute to the development of computational semantics and computational sociolinguistics in various languages.
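    A minimal sketch of the transformation-matrix idea from the second study, under a common least-squares (orthogonal Procrustes) formulation: learn a linear map from the English embedding space to the recipient-language space using seed translation pairs, then measure the cosine similarity between a mapped English word and its loanword. The alignment recipe and the random toy data are assumptions for illustration; the dissertation's exact procedure may differ.

```python
import numpy as np

def learn_mapping(src_vecs, tgt_vecs):
    """Orthogonal Procrustes: find orthogonal W minimizing ||src @ W - tgt||_F."""
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
# Toy stand-ins: 1000 seed translation pairs in 300-d embedding spaces.
en_seed = rng.normal(size=(1000, 300))   # English vectors
ko_seed = rng.normal(size=(1000, 300))   # Korean vectors for the same concepts

W = learn_mapping(en_seed, ko_seed)

# Semantic adaptation score: similarity between the mapped English source
# word and its loanword in the recipient-language space. Low similarity
# suggests stronger semantic adaptation.
en_word = rng.normal(size=300)           # e.g. vector for English "service"
ko_loanword = rng.normal(size=300)       # e.g. vector for the Korean loanword "서비스"
print(cosine(en_word @ W, ko_loanword))
```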

    Automated labeling of PDF mathematical exercises with word N-grams VSM classification

    In recent years, smart learning environments have become central to modern education, supporting students and instructors through tools based on prediction and recommendation models. These methods often rely on learning-material metadata, such as the knowledge contained in an exercise, which is usually labeled by domain experts, a process that is costly and difficult to scale. Automated labeling eases the workload on experts, as shown in previous studies that applied automatic classification algorithms to research papers and Japanese mathematical exercises. However, those studies did not address fine-grained labeling. Moreover, as materials become more widely used in such systems, paper materials are converted into PDF format, from which text extraction can be incomplete, and previous research has placed little emphasis on labeling such incomplete mathematical sentences. This study aims to achieve precise automated classification even from incomplete text inputs. To tackle these challenges, we propose a mathematical exercise labeling algorithm based on word n-grams that can handle detailed labels even for incomplete sentences, and compare it with state-of-the-art word embedding methods. Experimental results show that unigram features with Random Forest models achieved the best performance, with macro F-measures of 92.50% and 61.28% on the 24-class and 297-class labeling tasks, respectively. The contribution of this research is to show that the proposed method, based on traditional simple n-grams, can find context-independent similarities in incomplete sentences and outperforms state-of-the-art word embedding methods on specific tasks such as classifying short and incomplete texts.
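    A minimal sketch of an n-gram VSM classifier of the kind described above: unigram counts feeding a Random Forest. The tiny toy corpus, labels, and hyperparameters are illustrative assumptions, not the paper's dataset or tuned settings.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy exercises with coarse topic labels (stand-ins for the real corpus).
exercises = [
    "solve the quadratic equation x^2 - 5x + 6 = 0",
    "find the derivative of f(x) = 3x^2 + 2x",
    "compute the integral of sin(x) from 0 to pi",
    "factor the polynomial x^2 - 9",
]
labels = ["algebra", "calculus", "calculus", "algebra"]

# Unigram vector space model + Random Forest, matching the
# best-performing configuration reported above.
clf = Pipeline([
    ("vsm", CountVectorizer(ngram_range=(1, 1))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
clf.fit(exercises, labels)

# An incomplete sentence, simulating imperfect PDF extraction.
print(clf.predict(["derivative of 3x^2"]))
```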

    Natural Language Processing Methods for Symbolic Music Generation and Information Retrieval: a Survey

    Several adaptations of Transformer models have been developed in various domains since their breakthrough in Natural Language Processing (NLP). This trend has spread into the field of Music Information Retrieval (MIR), including studies processing music data. However, the practice of leveraging NLP tools for symbolic music data is not novel in MIR. Music has frequently been compared to language, as they share several similarities, including the sequential representation of text and music. These analogies are also reflected in similar tasks in MIR and NLP. This survey reviews NLP methods applied to symbolic music generation and information retrieval along two axes. We first propose an overview of representations of symbolic music adapted from sequential representations of natural language; such representations are designed with the specificities of symbolic music in mind. These representations are then processed by models, possibly originally developed for text and adapted for symbolic music, that are trained on various tasks. We describe these models, in particular deep learning models, through different prisms, highlighting music-specialized mechanisms. We finally present a discussion on the effective use of NLP tools for symbolic music data, covering technical issues with NLP methods and fundamental differences between text and music, which may open several doors for further research into adapting NLP tools to symbolic MIR more effectively.
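    To make the idea of adapting sequential text representations to symbolic music concrete, here is a minimal sketch of a MIDI-like event tokenization in the spirit of the representations such surveys cover (e.g. REMI-style tokens); the exact vocabulary and note values are illustrative assumptions, not a representation defined in the survey.

```python
# Encode a two-note melody as a flat token sequence, the way words form a
# sentence: each note becomes Pitch / Velocity / Duration events
# interleaved with metric Position markers.
notes = [
    {"pitch": 60, "velocity": 90, "duration": 4, "position": 0},   # C4
    {"pitch": 64, "velocity": 80, "duration": 2, "position": 4},   # E4
]

tokens = ["Bar_Start"]
for n in notes:
    tokens += [
        f"Position_{n['position']}",
        f"Pitch_{n['pitch']}",
        f"Velocity_{n['velocity']}",
        f"Duration_{n['duration']}",
    ]

print(tokens)
# Such token sequences can then be fed to sequence models (e.g. Transformers)
# exactly like word or subword tokens in NLP.
```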

    Statistical distribution of common audio features: encounters in a heavy-tailed universe

    In the last few years, some Music Information Retrieval (MIR) researchers have spotted important drawbacks in applying algorithms that are successful on monophonic music to polyphonic music classification and similarity assessment. Notably, these so-called “Bag-of-Frames” (BoF) algorithms share a common set of assumptions: that the numerical descriptions extracted from short-time audio excerpts (or frames) suffice to capture the information relevant to the task at hand, that these frame-based audio descriptors are time independent, and that descriptor frames are well described by Gaussian statistics. Thus, to improve current BoF algorithms we could: i) improve current audio descriptors, ii) include temporal information within algorithms working with polyphonic music, or iii) study and characterize the real statistical properties of these frame-based audio descriptors. A literature review shows that many works focus on the first two improvements, but, surprisingly, there is a lack of research on the third. Therefore, in this thesis we analyze and characterize the statistical distribution of common audio descriptors of timbre, tonal, and loudness information. Contrary to what is usually assumed, our work shows that the studied descriptors are heavy-tailed and thus do not belong to a Gaussian universe. This new knowledge led us to propose new algorithms that improve over the BoF approach in current MIR tasks such as genre classification, instrument detection, and automatic tagging of music. Furthermore, we also address new MIR tasks such as measuring the temporal evolution of Western popular music. Finally, we highlight some promising paths for future audio-content MIR research that will inhabit a heavy-tailed universe.
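    A minimal sketch of the kind of check this thesis motivates: testing whether frame-based MFCC descriptors look Gaussian or heavy-tailed. The synthetic audio is an assumption standing in for a real recording (with real data one would load a file, e.g. via librosa.load). Excess kurtosis well above 0 together with a rejected normality test is consistent with a heavy-tailed distribution.

```python
import numpy as np
import librosa
from scipy import stats

# Synthetic stand-in for a real recording: a noisy 440 Hz tone.
sr = 22050
t = np.linspace(0, 5, 5 * sr, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t) + 0.5 * np.random.default_rng(0).normal(size=t.size)

# Frame-based descriptors: 13 MFCCs per short-time frame.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)

for i, coeff in enumerate(mfcc):
    excess_kurtosis = stats.kurtosis(coeff)   # 0 for a Gaussian
    _, p_value = stats.normaltest(coeff)      # H0: samples come from a Gaussian
    print(f"MFCC {i:2d}: excess kurtosis = {excess_kurtosis:6.2f}, "
          f"normality p = {p_value:.3g}")
```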

    Text-based Sentiment Analysis and Music Emotion Recognition

    Nowadays, with the expansion of social media, large amounts of user-generated texts such as tweets, blog posts, and product reviews are shared online. Sentiment polarity analysis of such texts has become highly attractive and is used in recommender systems, market prediction, business intelligence, and more. We also witness deep learning techniques becoming the top performers on these tasks. However, several problems need to be solved for the efficient use of deep neural networks in text mining and text polarity analysis. First, deep neural networks are data hungry: they need to be fed datasets that are large, cleaned, preprocessed, and properly labeled. Second, the modern natural language processing concept of word embeddings, a dense and distributed text feature representation, solves the sparsity and dimensionality problems of the traditional bag-of-words model. Still, there are various uncertainties regarding the use of word vectors: should they be generated from the same dataset that is used to train the model, or is it better to source them from large and popular collections that serve as generic text feature representations? Third, it is not easy for practitioners to find a simple and highly effective deep learning setup for various document lengths and types. Recurrent neural networks are weak on longer texts, and optimal convolution-pooling combinations are not easily conceived. It is thus convenient to have generic neural network architectures that are effective, can adapt to various texts, and encapsulate much of the design complexity. This thesis addresses the above problems to provide methodological and practical insights for applying neural networks to sentiment analysis of texts and achieving state-of-the-art results. Regarding the first problem, the effectiveness of various crowdsourcing alternatives is explored, and two medium-sized, emotion-labeled song datasets are created using social tags. One of the research interests of Telecom Italia was the exploration of relations between musical emotional stimulation and driving style; consequently, a context-aware music recommender system that aims to enhance driving comfort and safety was also designed. To address the second problem, a series of experiments with large text collections of various contents and domains was conducted. Word embeddings of different parameters were exercised, and the results revealed that their quality is influenced (mostly, but not only) by the size of the texts they were created from. When working with small text datasets, it is thus important to source word features from popular and generic word embedding collections. Regarding the third problem, a series of experiments involving convolutional and max-pooling neural layers was conducted. Various patterns relating text properties and network parameters to optimal classification accuracy were observed. Combining convolutions of words, bigrams, and trigrams with regional max-pooling layers in a couple of stacks produced the best results. The derived architecture achieves competitive performance on sentiment polarity analysis of movie, business, and product reviews. Given that labeled data are becoming the bottleneck of current deep learning systems, a future research direction could be the exploration of various data programming possibilities for constructing even bigger labeled datasets.
    Investigation of feature-level or decision-level ensemble techniques in the context of deep neural networks could also be fruitful. Different feature types usually represent complementary characteristics of the data; combining word embeddings with traditional text features, or applying recurrent networks to document splits and then aggregating the predictions, could further increase the prediction accuracy of such models.
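    A minimal sketch of the convolution-pooling pattern described above: parallel convolutions over words, bigrams, and trigrams (kernel widths 1, 2, 3) followed by max pooling, merged for polarity classification. Layer sizes are illustrative assumptions, and global max pooling is used here as a single-stage simplification of the thesis's stacked regional max-pooling layers.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB, MAXLEN, EMB = 20000, 400, 100   # assumed sizes

inp = layers.Input(shape=(MAXLEN,))
emb = layers.Embedding(VOCAB, EMB)(inp)

# Parallel convolutions over unigrams, bigrams, and trigrams,
# each followed by max pooling over the document.
branches = []
for width in (1, 2, 3):
    c = layers.Conv1D(100, width, activation="relu")(emb)
    branches.append(layers.GlobalMaxPooling1D()(c))

merged = layers.concatenate(branches)
out = layers.Dense(1, activation="sigmoid")(merged)   # positive vs. negative

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```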