Classifying for Diversity

Author: Szostak, Rick
Publication venue: University of Washington Information School
Publication date: 31/10/2013

This paper argues that a new approach to classification best supports and celebrates social diversity. It maintains that we should want a classification that facilitates both within-group and cross-group communication. This is best accomplished through a truly universal classification that classifies works in terms of authorial perspective. Strategies for classifying perspective are discussed. The paper then addresses issues of classification structure. It follows a feminist approach to classification, and shows how a web-of-relations approach can be instantiated in a classification. Finally, the paper turns to classificatory process. The key argument here is that much (perhaps all) of the concern that classes can be subdivided into subclasses in multiple ways (each favored by different groups or individuals) simply vanishes within a web-of-relations approach. The reason is that most of these supposed ways of subdividing a class are in fact ways of subdividing different relationships among classes.

University of Washington: ResearchWorks Journal Hosting

Higher-order semantic smoothing for text classification

Author: Poyraz, Mitat
Publication venue: Doğuş Üniversitesi Fen Bilimleri Enstitüsü
Publication date: 01/01/2013

Poyraz, Mitat (Dogus Author)

Text classification is the task of automatically sorting a set of documents into classes (or categories) from a predefined set. This task is of great practical importance given the massive volume of online text available through the World Wide Web, Internet news feeds, electronic mail and corporate databases. Existing statistical text classification algorithms can be trained to accurately classify documents, given a sufficient set of labeled training examples. However, in real-world applications, only a small amount of labeled data is available because expert labeling of large amounts of data is expensive. In this case, making an adequate estimation of the model parameters of a classifier is challenging. Underlying this issue is the traditional assumption in machine learning algorithms that instances are independent and identically distributed (IID). Semi-supervised learning (SSL) is the machine learning concept concerned with leveraging explicit as well as implicit link information within data to provide a richer data representation for model parameter estimation. It has been shown that Latent Semantic Indexing (LSI) takes advantage of implicit higher-order (or latent) structure in the association of terms and documents. Higher-order relations in LSI capture "latent semantics". Inspired by this, a novel Bayesian framework for classification named Higher Order Naive Bayes (HONB), which can explicitly make use of these higher-order relations, has been introduced previously. In this thesis, a novel semantic smoothing method named Higher Order Smoothing (HOS) for the Naive Bayes algorithm is presented. HOS is built on a graph-based data representation similar to that of HONB, which allows semantics in higher-order paths to be exploited. Additionally, we take the concept one step further in HOS and exploit the relationships between instances of different classes in order to improve the parameter estimation when dealing with insufficient labeled data. As a result, we have been able to move beyond not only instance boundaries but also class boundaries to exploit the latent information in higher-order paths. The results of experiments demonstrate the value of HOS on several benchmark datasets.

Abstract (translated from the Turkish original):

Text classification is the process of automatically assigning a set of documents to previously defined classes or categories. This process is becoming increasingly important because of the very large amount of electronic text available on Web pages, Internet news sources, e-mail messages and corporate databases. Current text classification algorithms can be trained to classify documents correctly when a sufficiently large labeled training set is provided. In practice, however, very little labeled data is available, because having large amounts of data labeled by experts is expensive. In this case, it is difficult to make an adequate estimate of the classifier's model parameters. Underlying this is the assumption made by machine learning algorithms that the instances in the data are independent and identically distributed. Semi-supervised learning is concerned with exploiting both explicit and hidden relations within the data for model parameter estimation, so that the data is represented more richly. Latent Semantic Indexing (LSI) has been shown to be a technique that uses higher-order relations between the terms contained in documents. The higher-order relations used in LSI refer to the latent semantic closeness between terms. Inspired by this technique, a method called Higher Order Naive Bayes (HONB), which uses the higher-order semantic relations within the text, has been proposed in the literature.

In this thesis, a new semantic smoothing method for the Naive Bayes algorithm, called Higher Order Smoothing (HOS), is presented. The HOS method is based on the graph-based data representation of the HONB framework, which makes it possible to use higher-order semantic relations within the text. In addition, the HOS method goes one step beyond exploiting relations between instances of the same class and also exploits relations between instances of different classes. In this way, parameter estimation is improved in cases where the labeled data set is insufficient. As a result, in order to exploit higher-order semantic information, we can move beyond not only instance boundaries but also class boundaries. The results of experiments on different data sets demonstrate the value of the HOS method.

Table of contents: PREFACE -- ABSTRACT -- ÖZET -- ACKNOWLEDGMENT -- LIST OF FIGURES -- LIST OF TABLES -- LIST OF SYMBOLS -- ABBREVIATIONS -- 1. INTRODUCTION -- 1.1. Scope and objectives of the Thesis -- 1.2. Methodology of the Thesis -- 2. LITERATURE REVIEW -- 3. METHODOLOGY -- 3.1. Theoretical Background -- 3.2. Naive Bayes Event Models -- 3.2.1. Jelinek-Mercer Smoothing -- 3.2.2. Higher Order Data Representation -- 3.2.3. Higher Order Naive Bayes -- 3.3. Higher Order Smoothing -- 4. CONCLUSION -- 4.1. Experiment Results -- 4.2. Discussion -- 4.3. Future Work -- REFERENCES -- CV
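
The abstract describes smoothing as a way to estimate Naive Bayes parameters when labeled data is scarce, and the contents list Jelinek-Mercer smoothing (section 3.2.1) as background. The sketch below illustrates only that standard baseline: a multinomial Naive Bayes classifier whose per-class term probabilities are linearly interpolated with a corpus-wide term distribution. It is not the thesis's HOS method and uses no higher-order paths. The function names `train`, `term_prob` and `classify`, the interpolation weight `lam`, and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def train(docs, labels):
    """Collect the counts needed for a multinomial Naive Bayes model.

    docs is a list of token lists; labels holds one class label per document.
    """
    class_doc_counts = Counter(labels)
    term_counts = {c: Counter() for c in class_doc_counts}
    corpus_counts = Counter()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        corpus_counts.update(tokens)

    corpus_total = sum(corpus_counts.values())
    return {
        "log_prior": {c: math.log(n / len(labels)) for c, n in class_doc_counts.items()},
        "term_counts": term_counts,
        "class_totals": {c: sum(tc.values()) for c, tc in term_counts.items()},
        # Corpus-wide (background) term distribution used for interpolation.
        "background": {w: n / corpus_total for w, n in corpus_counts.items()},
    }


def term_prob(model, word, label, lam=0.5):
    """Jelinek-Mercer estimate: (1 - lam) * P_ml(word | label) + lam * P(word | corpus).

    Assumes 0 < lam <= 1 so that every word seen in training has non-zero probability.
    """
    total = model["class_totals"][label]
    p_ml = model["term_counts"][label][word] / total if total else 0.0
    return (1.0 - lam) * p_ml + lam * model["background"].get(word, 0.0)


def classify(model, tokens, lam=0.5):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for word in tokens:
            if word in model["background"]:  # skip words never seen in training
                score += math.log(term_prob(model, word, label, lam))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data (hypothetical), two documents per class.
docs = [
    ["cheap", "pills", "offer"],
    ["meeting", "agenda", "notes"],
    ["cheap", "offer", "now"],
    ["project", "meeting", "today"],
]
labels = ["spam", "ham", "spam", "ham"]
model = train(docs, labels)
print(classify(model, ["cheap", "meeting", "offer"]))
```

Without interpolation, the word "meeting" would have zero probability under the "spam" class and rule that class out on its own; mixing in the background distribution keeps the estimate non-zero, which is the problem that smoothing addresses when training data is small.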

Dogus University Institutional Repository