1,802 research outputs found
Natural Language Processing: Emerging Neural Approaches and Applications
This Special Issue highlights the most recent research carried out in the NLP field and discusses related open issues, with a particular focus both on emerging approaches for language learning, understanding, production, and grounding, interactively or autonomously from data, in cognitive and neural systems, and on their potential or real applications in different domains.
Combining textual features with sentence embeddings
Doctoral dissertation (Ph.D.), Department of Linguistics, College of Humanities, Graduate School of Seoul National University, August 2021. 박수지. The goal of this dissertation is to develop a language model for predicting the quality of Korean news articles. The need for the news quality prediction task has recently grown with the flood of fake news and similar content, yet the latest techniques in natural language processing have not yet been applied to it. To overcome this limitation, this dissertation develops an SBERT model that represents sentence meaning and examines whether the linguistic features of news articles can be exploited to improve quality classification performance. Both a machine learning model using textual features of articles, such as readability and cohesion, and a transfer learning model using contextual features automatically extracted by SBERT outperformed the deep learning results of previous studies. Specifically, performance improved further when the training data for SBERT was expanded and refined, and when textual and contextual features were used together. From this it can be concluded that linguistic features play an important role in news article quality, and that SBERT, a state-of-the-art NLP technique, can make a practical contribution to extracting and exploiting linguistic features.
1 Introduction 1
2 Literature Review 5
2.1 Background 5
2.1.1 Text Classification 5
2.1.1.1 Initial Studies 5
2.1.1.2 News Classification 6
2.1.2 Text Quality Assessment 8
2.2 News Quality Prediction Task 9
2.2.1 News Data 9
2.2.1.1 Online vs. Offline 9
2.2.1.2 Expert-rated vs. User-rated 9
2.2.2 Prediction Methods 11
2.2.2.1 Manually Engineered Features vs. Automatically Extracted Features 11
2.2.2.2 Machine Learning vs. Deep Learning 12
2.3 Instruments and Techniques 14
2.3.1 Sentence and Document Embeddings 14
2.3.1.1 Static Embeddings 14
2.3.1.2 Contextual Embeddings 16
2.3.2 Fusion Models 18
2.4 Summary 27
3 Methods 29
3.1 Data from Choi, Shin, and Kang (2021) 29
3.1.1 News Corpus 29
3.1.2 Quality Levels 29
3.1.3 Journalism Values 30
3.2 Linguistic Features 31
3.2.1 Justification of Using Linguistic Features Only 31
3.2.2 Two Types of Linguistic Features 32
3.2.2.1 Textual Features 32
3.2.2.2 Contextual Features 33
3.3 Summary 33
4 Ordinal Logistic Regression Models with Textual Features 35
4.1 Textual Features 35
4.1.1 Coh-Metrix 35
4.1.2 KOSAC Lexicon 36
4.1.3 K-LIWC 38
4.1.4 Others 38
4.2 Ordinal Logistic Regression 38
4.3 Results 39
4.3.1 Feature Selection 39
4.3.2 Impacts on Quality Evaluation 40
4.4 Discussion 40
4.4.1 Effect of Cosine Similarity by Issue 41
4.4.2 Effect of Quantitative Evidence 47
4.4.3 Effect of Sentiment 48
4.5 Summary 51
5 Deep Transfer Learning Models with Contextual Features 53
5.1 Contextual Features from SentenceBERT 53
5.1.1 Necessity of Sentence Embeddings 54
5.1.2 KR-SBERT 55
5.2 Deep Transfer Learning 56
5.3 Results 59
5.3.1 Measures of Multiclass Classification 59
5.3.2 Performances of News Quality Prediction Models 60
5.4 Discussion 62
5.4.1 Effect of Data Size 62
5.4.2 Effect of Data Augmentation 62
5.4.3 Effect of Data Refinement 63
5.5 Summary 63
6 Fusion Models Combining Textual Features with Contextual Sentence Embeddings 65
6.1 Model Fusion 65
6.1.1 Feature-level Fusion: Concatenation 65
6.1.2 Logit-level Fusion: Interpolation 65
6.2 Results 68
6.2.1 Optimization of the Presentational Attribute Model 68
6.2.2 Performances of News Quality Prediction Models 68
6.3 Discussion 68
6.3.1 Effects of Fusion 70
6.3.2 Comparison with Choi et al. (2021) 71
6.4 Summary 71
7 Conclusion 73
References 75
A List of Words Used for Textual Feature Extraction 93
A.1 Coh-Metrix Features 93
A.2 Predicate Type Features 94
B Codes Used in Chapter 4 97
B.1 Python Code for Textual Feature Extraction 97
C Results of VIF test and Brant test 101
C.1 VIF Test in R 101
C.2 Brant Test in R 103
D Codes Used in Chapter 6 107
D.1 Python Code for Feature-Level Fusion 107
D.2 Python Code for Logit-Level Fusion 108
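The two fusion strategies listed in the table of contents, feature-level fusion by concatenation and logit-level fusion by interpolation, can be sketched as follows. This is a minimal illustration, not the dissertation's actual code; all numbers, dimensions, and feature names are hypothetical.

```python
# Minimal sketch of the two fusion strategies: feature-level fusion
# (concatenation) and logit-level fusion (interpolation).

def concat_fusion(textual, contextual):
    """Feature-level fusion: join hand-crafted textual features
    (readability, cohesion, ...) with a contextual sentence embedding
    into one input vector for a downstream quality classifier."""
    return list(textual) + list(contextual)

def logit_fusion(logits_a, logits_b, lam=0.5):
    """Logit-level fusion: interpolate the per-class scores of the
    textual-feature model and the contextual-feature model."""
    return [lam * a + (1 - lam) * b for a, b in zip(logits_a, logits_b)]

textual = [0.72, 0.31, 0.55]           # e.g. readability, cohesion, sentiment
embedding = [0.10, -0.20, 0.40, 0.05]  # e.g. mean-pooled SBERT output
fused = concat_fusion(textual, embedding)

# Interpolate two hypothetical 3-class score vectors.
scores = logit_fusion([2.0, 0.0, -1.0], [1.0, 1.0, 0.0], lam=0.6)
```

Concatenation lets a single classifier weigh both feature families jointly, while interpolation keeps the two models separate and only blends their outputs, so the mixing weight can be tuned on held-out data.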
Artificial Intelligence for Multimedia Signal Processing
Artificial intelligence technologies are also actively applied to broadcasting and multimedia processing. A great deal of research has been conducted in a wide variety of fields, such as content creation, transmission, and security, and over the past two to three years attempts have been made to improve the compression efficiency of image, video, speech, and other data in areas related to MPEG media processing technology. Additionally, technologies such as media creation, processing, editing, and scenario generation are very important areas of research in multimedia processing and engineering. This book collects topics spanning advanced computational intelligence algorithms and technologies for emerging multimedia signal processing: computer vision, speech/sound/text processing, and content analysis/information mining.
Problem-solving recognition in scientific text
As far back as Aristotle, problems and solutions have been recognised as a core pattern of thought, and in particular of the scientific method. Therefore, they play a significant role in the understanding of academic texts from the scientific domain. Capturing knowledge of such problem-solving utterances would provide a deep insight into text understanding. In this dissertation, I present the task of problem-solving recognition in scientific text.
To date, work on problem-solving recognition has received both theoretical and computational treatment. However, theories of problem-solving put forward by applied linguists lack practical adaptation to the domain of scientific text, and computational analyses have been narrow in scope.
This dissertation provides a new model of problem-solving. It is an adaptation of Hoey's (2001) model, tailored to the scientific domain. As far as modelling problems is concerned, I divided the text string expressing the statement of a problem into sub-components; this is one of my main contributions. I have mapped these sub-components to functional roles, and thus operationalised the model in such a way that it can be annotated by humans reliably. As far as the problem-solving relationship between problems and solutions is concerned, my model takes into account the local network of relationships existing between problems.
In order to validate this new model, a large-scale annotation study was conducted. The annotation study shows significant agreement amongst the annotators. The model is automated in two stages using a blend of classical machine learning and state-of-the-art deep learning methods. The first stage involves the implementation of problem and solution recognisers which operate at the sentence level. The second stage is more complex in that it recognises problems and solutions jointly at the token-level, and also establishes whether there is a problem-solving relationship between each of them. One of the best performers at this stage was a Neural Relational Topic Model. The results from automation show that the model is able to recognise problem-solving utterances in text to a high degree of accuracy.
My work has already shown a positive impact in both industry and academia. One start-up is currently using the model for representing academic articles, and a Japanese collaborator has received a grant to adapt my model to Japanese text.
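The first automation stage described above recognises problems and solutions at the sentence level. A toy keyword-cue baseline conveys the idea; this is purely illustrative and not the dissertation's classifiers, which combine classical machine learning with deep learning, and the cue lexicons here are invented.

```python
# Illustrative keyword-cue baseline for sentence-level problem/solution
# recognition. Cue lexicons are hypothetical and deliberately tiny.

PROBLEM_CUES = {"problem", "difficulty", "limitation", "lack", "drawback"}
SOLUTION_CUES = {"solution", "approach", "method", "propose", "address"}

def label_sentence(sentence):
    """Label a sentence 'problem', 'solution', or 'other' by cue words."""
    tokens = {t.strip(".,;:").lower() for t in sentence.split()}
    if tokens & PROBLEM_CUES:
        return "problem"
    if tokens & SOLUTION_CUES:
        return "solution"
    return "other"

labels = [label_sentence(s) for s in [
    "However, these analyses lack practical adaptation.",
    "We propose a joint token-level recogniser.",
]]
```

A real system would replace the cue sets with learned features, but the interface, sentence in, functional label out, is the same as the first-stage recognisers described in the abstract.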
Data Mining as a Method Applicable in the Field of Japanese Studies
In this thesis we address the possible utilization of text mining methods in the field of Japanese studies. The first part reviews the fundamental text mining approaches and their practical applications. We then elaborate on text preprocessing, with special focus on techniques used for Japanese and English texts. In the main part of the thesis we apply text mining methods to three concrete research questions in Japanese studies. The first topic illustrates how clustering, applied to works by two Japanese proletarian authors, can reveal interesting topic patterns in their writings.
The second topic makes use of sentiment analysis to study the extent of negative sentiment expressed in both foreign and Japanese newspaper articles referring to Japanese officials' visits to the Yasukuni shrine. Finally, we address methods of automatic summarization and their application to Japanese and English sample texts. The results are discussed in detail, with special focus on assessing the viability of the presented methods for Japanese studies. Institute of East Asian Studies, Faculty of Arts.
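Clustering of literary works, as in the first research topic above, ultimately rests on a document-similarity measure over word counts. A minimal sketch of that underlying computation, assuming a plain bag-of-words representation (Japanese texts would additionally need tokenization, which is elided here):

```python
# Cosine similarity between bag-of-words term-frequency vectors: the
# basic similarity measure that document clustering builds on.
# Example documents are invented for illustration.
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two Counter term-frequency vectors."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc1 = Counter("factory workers organise a strike".split())
doc2 = Counter("workers at the factory strike".split())
doc3 = Counter("summarization of english sample texts".split())

# doc1 and doc2 share vocabulary, so they would fall in the same cluster.
similar = cosine(doc1, doc2) > cosine(doc1, doc3)
```

A clustering algorithm such as k-means or hierarchical agglomeration then groups documents whose pairwise similarities are high, which is how thematically related works end up in the same cluster.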
A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4
Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus, and computation. Because of their large size and pretraining on large volumes of text data, LLMs exhibit special abilities that allow them to achieve remarkable performance without any task-specific training on many natural language processing tasks. The era of LLMs started with OpenAI's GPT-3 model, and the popularity of LLMs has risen rapidly since the introduction of models like ChatGPT and GPT-4. We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT-4, as GPT-3 family large language models (GLLMs). With the ever-rising popularity of GLLMs, especially in the research community, there is a strong need for a comprehensive survey that summarizes recent research progress in multiple dimensions and can guide the research community with insightful future research directions. We start the survey with foundation concepts like transformers, transfer learning, self-supervised learning, pretrained language models, and large language models. We then present a brief overview of GLLMs and discuss their performance in various downstream tasks, specific domains, and multiple languages. We also discuss the data labelling and data augmentation abilities of GLLMs, their robustness, their effectiveness as evaluators, and finally conclude with multiple insightful future research directions. In summary, this comprehensive survey will serve as a good resource for both academic and industry readers to stay updated with the latest research on GPT-3 family large language models. Comment: Preprint under review, 58 pages
Text Classification: A Review, Empirical, and Experimental Evaluation
The explosive and widespread growth of data necessitates the use of text classification to extract crucial information from vast amounts of data. Consequently, there has been a surge of research in both classical and deep learning text classification methods. Despite the numerous methods proposed in the literature, there is still a pressing need for a comprehensive and up-to-date survey. Existing survey papers categorize text classification algorithms into broad classes, which can lead to the misclassification of unrelated algorithms and to incorrect assessments of their qualities and behaviors using the same metrics. To address these limitations, our paper introduces a novel methodological taxonomy that classifies algorithms hierarchically into fine-grained classes and specific techniques. The taxonomy comprises methodology categories, methodology techniques, and methodology sub-techniques. Our study is the first survey to use this methodological taxonomy for classifying text classification algorithms. Furthermore, our study conducts empirical evaluations and experimental comparisons and rankings of different algorithms that employ the same specific sub-technique, different sub-techniques within the same technique, different techniques within the same category, and different categories.
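The hierarchical taxonomy described above can be pictured as a nested mapping from category to technique to sub-technique, with algorithms compared directly only when they share a sub-technique. The sketch below is an invented illustration of that structure; the names are hypothetical and do not reproduce the survey's actual taxonomy.

```python
# Illustrative category -> technique -> sub-technique taxonomy. Only
# algorithms filed under the same sub-technique are compared directly.

TAXONOMY = {
    "classical": {
        "probabilistic": {"naive_bayes": ["multinomial NB", "bernoulli NB"]},
    },
    "deep_learning": {
        "recurrent": {"lstm": ["BiLSTM", "LSTM-attention"]},
    },
}

def comparable(taxonomy, algo_a, algo_b):
    """True iff the two algorithms share a sub-technique."""
    for techniques in taxonomy.values():
        for subs in techniques.values():
            for algos in subs.values():
                if algo_a in algos and algo_b in algos:
                    return True
    return False
```

Filing algorithms this way prevents, say, a probabilistic classifier from being ranked against a recurrent network on metrics that only make sense within one sub-technique, which is the misclassification problem the survey's taxonomy is meant to avoid.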
Investigating and extending the methods in automated opinion analysis through improvements in phrase based analysis
Opinion analysis is an area of research that deals with the computational treatment of opinion statements and subjectivity in textual data. It has emerged over the past couple of decades as an active area of research because it provides solutions to the issues raised by information overload, a problem that arose with advances in communication technologies and the resulting exponential growth in user-generated subjective data available online. Opinion analysis has a rich set of applications, such as tracking user opinions about products, tracking social issues in communities, and supporting engagement in political participation. The opinion analysis area has been hyperactive in recent years, and research has been undertaken at different levels of granularity. However, the state of the art has limitations, especially since dealing with each level of granularity on its own does not solve current research issues. Therefore a novel sentence-level opinion analysis approach utilising clause- and phrase-level analysis is proposed. This approach uses linguistic and syntactic analysis of sentences to understand the interdependence of words within sentences, and further uses rule-based analysis at the phrase level to calculate the opinion at each hierarchical level of a sentence. The proposed approach requires lexical and contextual resources for implementation. In the context of this thesis, the approach is further presented as part of an extended unifying framework for opinion analysis, resulting in the design and construction of a novel corpus. The above contributions to the field (approach, framework, and corpus) are evaluated within the thesis and are found to improve on existing limitations in the field, particularly with regard to opinion analysis automation.
Further work is required to integrate a mechanism for greater word sense disambiguation and to develop additional lexical resources.
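The rule-based phrase-level analysis described in the abstract can be caricatured in a few lines: word polarities from a lexicon are combined within each phrase, a negation rule flips polarity, and phrase scores are summed at the sentence level. The lexicon and rules below are invented for illustration and are far simpler than the thesis's approach.

```python
# Toy rule-based phrase-level opinion scoring: lexicon polarities are
# combined per phrase with a negation rule, then summed per sentence.
# Lexicon and rules are illustrative only.

LEXICON = {"good": 1, "great": 1, "bad": -1, "poor": -1}
NEGATORS = {"not", "never", "no"}

def phrase_score(phrase):
    """Score one phrase; a negator flips the next opinion word's polarity."""
    score, negate = 0, False
    for word in phrase.lower().split():
        if word in NEGATORS:
            negate = True
        elif word in LEXICON:
            score += -LEXICON[word] if negate else LEXICON[word]
            negate = False
    return score

def sentence_score(phrases):
    """Sentence-level opinion as the sum of its phrase scores."""
    return sum(phrase_score(p) for p in phrases)
```

Computing opinion per phrase before aggregating is what lets the approach capture sentences whose clauses disagree, such as praise for one aspect and criticism of another.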