2,640 research outputs found
Recommended from our members
Topical subcategory structure in text classification
Data sets with rich topical structure are common in many real-world text classification tasks. A single data set often contains a wide variety of topics and, in a typical task, documents belonging to each class are dispersed across many of the topics. Often, a complex relationship exists between the topic a document discusses and the class label: positive or negative sentiment is expressed in documents from many different topics, but knowing the topic does not necessarily help in determining the sentiment label. We know from tasks such as domain adaptation that sentiment is expressed in different ways under different topics. Topical context can in some cases even reverse the sentiment polarity of words: to be sharp is a good quality for knives but bad for singers. This property can be found in many different document classification tasks.
Standard document classification algorithms do not account for or take advantage of topical diversity; instead, classifiers are usually trained with the tacit assumption that topical diversity does not play a role. This thesis is focused on the interplay between the topical structure of corpora, how the target labels in a classification task distribute over the topics and how the topical structure can be utilised in building ensemble models for text classification. We show empirically that a dataset with rich topical structure can be problematic for single classifiers, and we develop two novel ensemble models to address the issues. We focus on two document classification tasks: document level sentiment analysis of product reviews and hierarchical categorisation of news text. For each task we develop a novel ensemble method that utilises topic models to address the shortcomings of traditional text classification algorithms.
Our contribution is in showing empirically that the class association of document features is topic dependent. We show that using the topical context of documents for building ensembles is beneficial for some tasks, and present two new ensemble models for document classification. We also provide a fresh viewpoint for reasoning about the relationship between class labels, topical categories and document features.
A Large-Scale Community Questions Classification Accounting for Category Similarity: An Exploratory Study
The paper reports on a large-scale topical categorization of questions from a Russian community question answering (CQA) service, Otvety@Mail.Ru. We used a data set containing all the questions (more than 11 million) asked by Otvety@Mail.Ru users in 2012. This is the first study on question categorization dealing with non-English data of this size. The study focuses on adjusting category structure in order to get more robust classification results. We investigate several approaches to measure similarity between categories: the share of identical questions, language models, and user activity. The results show that the proposed approach is promising. Funding: RFBR (Russian Foundation for Basic Research), 14-07-00589.
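One of the similarity measures named above, the share of identical questions, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the normalisation step and the choice of denominator (the size of the smaller category) are assumptions made here for illustration, and the function names are hypothetical.

```python
def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(question.lower().split())


def identical_share(cat_a, cat_b):
    """Share of identical questions between two categories.

    Assumption: the share is the number of distinct questions that occur in
    both categories, divided by the size of the smaller category.
    """
    a = {normalize(q) for q in cat_a}
    b = {normalize(q) for q in cat_b}
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    cooking = ["How to cook rice?", "What is Python?"]
    programming = ["what is  python?", "Best laptop?"]
    print(identical_share(cooking, programming))
```

A high value for a pair of categories would suggest that users treat them as interchangeable, which is the kind of signal the study uses when adjusting the category structure.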
Analyzing Cognitive Presence in Online Courses Using an Artificial Neural Network
This work outlines the theoretical underpinnings, method, results, and implications for constructing a discussion list analysis tool that categorizes online, educational discussion list messages into levels of cognitive effort. Purpose: The purpose of such a tool is to provide evaluative feedback to instructors who facilitate online learning, to researchers studying computer-supported collaborative learning, and to administrators interested in correlating objective measures of students’ cognitive effort with other measures of student success. This work connects computer-supported collaborative learning, content analysis, and artificial intelligence. Method: Broadly, the method employed is a content analysis in which the data from the analysis is modeled using artificial neural network (ANN) software. A group of human coders categorized online discussion list messages, and inter-rater reliability was calculated among them. That reliability figure serves as a yardstick for determining how well the ANN categorizes the same messages that the group of human coders categorized. Reliability between the ANN model and the group of human coders is compared to the reliability among the group of human coders to determine how well the ANN performs compared to humans. Findings: Two experiments were conducted in which ANN models were constructed to model the decisions of human coders, and the experiments revealed that the ANN, under noisy, real-life circumstances, codes messages with near-human accuracy. From experiment one, the reliability between the ANN model and the group of human coders, using Cohen’s kappa, is 0.519, while the human reliability values range from 0.494 to 0.742 (M=0.6). Improvements were made to the human content analysis with the goal of improving the reliability among coders. After these improvements were made, the humans coded messages with a kappa agreement ranging from 0.816 to 0.879 (M=0.848), and the kappa agreement between the ANN model and the group of human coders is 0.70
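The reliability figures above are Cohen's kappa values, which correct raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch of the computation for two coders follows. The label values in the example are invented for illustration and are not the coding scheme used in the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each coder's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    human = ["high", "high", "low", "low"]   # hypothetical labels
    model = ["high", "low", "low", "low"]
    print(cohens_kappa(human, model))
```

In the setting described above, one coder would be the ANN model and the other the consensus of the human group, so the resulting kappa can be compared directly with the kappa values measured among the humans.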
Case studies of academic writing in the sciences: a focus on the development of writing skills
The aim of the present thesis is to make a longitudinal study of changes affecting sentence-initial elements in articles published over time by a sample of researchers in international journals of physics. The linguistic framework adopted for such a study is a systemic-functional one. The general research methodology is established around two main axes, one linguistic, and the other statistical. To conduct a longitudinal survey focusing on thematic changes, it was necessary on the one hand to set up clear and unambiguous linguistic categories to capture these changes and, on the other, to present and interpret the findings in manageable and reliable ways with the assistance of statistics.
A pilot study was initially set up to explore possible changes in two articles published within a two-year interval by the American Physical Society. The articles were the first and the last of a series of five articles written by the same researcher on the same problem in physics. The method of analysis of the texts used a formulation of Theme that included Subject as an obligatory component, and Contextual Frame - i.e. pre-Subject elements - as an optional one. The analysis, using taxonomies proposed by Davies (1988, 1997) and Gosden (1993, 1996), suggested differences in thematic elements, especially regarding a certain type of complex Subject.
On the basis of coding difficulties and the findings of the pilot study, taxonomies were modified to include in particular new Conventional and Instantial classes for Subject and Contextual Frame. Conventional wordings, both in Subject and in Contextual Frame position, are identified as being expressions which are readily available to novice writers of articles, because they are commonly used terms in the fields of research concerned. In contrast, Instantial wordings are identified as being expressions which have been especially contrived by the writer to fit a given stretch of discourse. As writers develop and make their own the matter with which they are working, they become increasingly capable of crafting these more complex wordings, which involve multiple strands of meaning. In the case of this latter class, particular reference is made to post-modification and clause-type elements which allow meanings to be combined in specific ways
Summarizing and Searching Hidden-Web Databases Hierarchically Using Focused Probes
Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from "uncooperative" databases by using "focused query probes," which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. The content summaries that result from this algorithm are efficient to derive and more accurate than those from previously proposed probing techniques for content-summary extraction. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to produce accurate results even for imperfect content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases
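To make the role of content summaries in database selection concrete, here is a minimal sketch. It is not the algorithm proposed in the paper: it assumes a summary is simply a map from term to document frequency plus a database size, and it scores each database by the product of the fractions of documents containing each query term. The names `score_database` and `rank_databases` and the example data are hypothetical.

```python
def score_database(doc_freq, num_docs, query_terms):
    """Score one database from its content summary.

    Assumption: doc_freq maps a term to the number of documents containing it,
    and terms are treated as independent, so the score estimates the fraction
    of documents that contain every query term.
    """
    if num_docs <= 0:
        return 0.0
    score = 1.0
    for term in query_terms:
        score *= doc_freq.get(term, 0) / num_docs
    return score


def rank_databases(summaries, query_terms):
    """Return (name, score) pairs, most promising database first."""
    scored = [
        (name, score_database(doc_freq, num_docs, query_terms))
        for name, (doc_freq, num_docs) in summaries.items()
    ]
    # Sort by descending score; break ties by name for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    summaries = {
        "health": ({"cancer": 80, "therapy": 50}, 100),
        "sports": ({"cancer": 1, "football": 90}, 100),
    }
    print(rank_databases(summaries, ["cancer", "therapy"]))
```

A summary built from a small probe sample will miss many terms, which is why a scoring rule like this one is sensitive to imperfect summaries and why the paper additionally exploits the hierarchical classification of the databases.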
Automatic maintenance of category hierarchy
Category hierarchy is an abstraction mechanism for efficiently managing large-scale resources. In an open environment, a category hierarchy will inevitably become inappropriate for managing resources that constantly change in unpredictable patterns. An inappropriate category hierarchy will mislead the management of resources. The increasing dynamicity and scale of online resources increase the need for automatic maintenance of category hierarchies. Previous studies of category hierarchy mainly focus on either the generation of a category hierarchy or the classification of resources under a pre-defined category hierarchy. The automatic maintenance of category hierarchy has been neglected. Making abstractions among categories and measuring the similarity between categories are the two basic behaviours needed to generate a category hierarchy. Humans are good at making abstractions but limited in their ability to calculate the similarities between large-scale resources. Computing models are good at calculating the similarities between large-scale resources but limited in their ability to make abstractions. To take advantage of both the human view and computing ability, this paper proposes a two-phase approach to automatically maintaining a category hierarchy at two scales by detecting changes in the internal patterns of categories. The global phase clusters resources to generate a reference category hierarchy and computes the similarity between categories to detect inappropriate categories in the initial category hierarchy. The accuracy of the clustering approaches used to generate the category hierarchy determines the rationality of the global maintenance. The local phase detects topical changes and then adjusts inappropriate categories with three local operations. The global phase can quickly target inappropriate categories top-down and carry out cross-branch adjustment, which can also accelerate the local-phase adjustments. The local phase detects and adjusts the local-range inappropriate categories that are not adjusted in the global phase. By incorporating the two complementary phases of adjustment, the approach can significantly improve the topical cohesion and accuracy of the category hierarchy. A new measure is proposed for evaluating a category hierarchy, considering not only the balance of the hierarchical structure but also the accuracy of classification. Experiments show that the proposed approach is feasible and effective for adjusting an inappropriate category hierarchy. The proposed approach can be used to maintain the category hierarchy for managing various resources in dynamic application environments. It also provides a way to specialize a current online category hierarchy to organize resources with more specific categories
Structuring Wikipedia Articles with Section Recommendations
Sections are the building blocks of Wikipedia articles. They enhance readability and can be used as a structured entry point for creating and expanding articles. Structuring a new or already existing Wikipedia article with sections is a hard task for humans, especially for newcomers or less experienced editors, as it requires significant knowledge about how a well-written article looks for each possible topic. Inspired by this need, the present paper defines the problem of section recommendation for Wikipedia articles and proposes several approaches for tackling it. Our systems can help editors by recommending what sections to add to already existing or newly created Wikipedia articles. Our basic paradigm is to generate recommendations by sourcing sections from articles that are similar to the input article. We explore several ways of defining similarity for this purpose (based on topic modeling, collaborative filtering, and Wikipedia's category system). We use both automatic and human evaluation approaches for assessing the performance of our recommendation system, concluding that the category-based approach works best, achieving precision@10 of about 80% in the human evaluation.
Comment: SIGIR '18 camera-read
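The paradigm described above (source sections from similar articles, then evaluate with precision@10) can be sketched as follows. This is a simplified illustration, not the paper's system: it assumes "similar" means sharing a Wikipedia category, ranks sections by how many of those articles contain them, and uses hypothetical function names and toy data.

```python
from collections import Counter


def recommend_sections(category_articles, k=10):
    """Rank section titles by how many same-category articles contain them.

    Assumption: category_articles is a list of section-title lists, one per
    article that shares a category with the input article.
    """
    counts = Counter()
    for sections in category_articles:
        counts.update(set(sections))  # count each title once per article
    # Most frequent first; ties broken alphabetically for a deterministic order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:k]]


def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations judged relevant.

    The denominator is always k, so a list shorter than k is penalised.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for title in recommended[:k] if title in relevant)
    return hits / k


if __name__ == "__main__":
    articles = [
        ["History", "Geography", "Economy"],
        ["History", "Geography"],
        ["History", "Culture"],
    ]
    recs = recommend_sections(articles, k=3)
    print(recs, precision_at_k(recs, {"History", "Culture"}, k=3))
```

Under this definition, a precision@10 of about 80% means that roughly eight of the ten highest-ranked sections were judged relevant by the human evaluators.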