Search CORE

31 research outputs found

On the Combination of Textual and Semantic Descriptions for Automated Semantic Web Service Classification

Author: Georgios Meditskos
Grigorios Tsoumakas
Ioannis Katakis
Ioannis Vlahavas
Nick Bassiliades
Publication venue: Springer US
Publication date: 01/01/2009

Abstract. Semantic Web services have emerged as the solution to the need for automating several aspects of service-oriented architectures, such as service discovery and composition. They are realized by combining Semantic Web technologies and Web service standards. In the present paper, we tackle the problem of automated classification of Web services according to their application domain, taking into account both the textual description and the semantic annotations of OWL-S advertisements. We present results obtained by applying machine learning algorithms to textual and semantic descriptions separately, and we propose methods for increasing the overall classification accuracy through an extended feature vector and an ensemble of classifiers.

CiteSeerX

Machine learning methods for automated text classification

Author: Katakis Ioannis
Κατάκης Ιωάννης
Publication venue: 'National Documentation Centre (EKT)'
Publication date: 01/01/2009

Applications of machine learning methods to text data are of great commercial and research interest due to the high availability of information in unstructured text format. Machine learning enables the analysis and automated management of large amounts of text. This thesis contributes to three challenging text classification problems: a) text stream classification, b) multilabel text classification and c) text classification in the world wide web. Concerning text stream classification, the problem of the appearance of new predictive features (words) over time is discussed. A computationally efficient approach is presented that combines an incremental feature selection method with a learning algorithm that can operate in a dynamic feature space. The proposed method is incorporated into a personalized news reader. Additionally, the problem of recurring contexts is addressed by exploiting stream clustering in order to dynamically build and update an ensemble of incremental classifiers. To achieve this, a transformation function that maps batches of examples into a new conceptual representation model is proposed. The clustering algorithm is then applied in order to group batches of examples into concepts and identify recurring contexts. The ensemble is produced by creating and maintaining an incremental classifier for every concept discovered in the data stream. Furthermore, two methods are proposed for multilabel text classification that focus on the problem of a large number of labels. The first one constructs a hierarchy of multilabel classifiers, each one dealing with a much smaller set of labels and a more balanced example distribution. The second one proposes breaking the initial set of labels into a number of small random subsets, and employing a multilabel classifier for each one. The subsets can be either disjoint or overlapping, depending on which of two strategies is used to construct them.
Empirical evidence indicates that both approaches improve substantially over the base multilabel classifier, especially in domains with large numbers of labels. Additionally, the overlapping approach outperforms the disjoint one and exhibits competitive performance against other high-performing multilabel learning methods. Finally, two applications of text classification for the world wide web were studied. In the first one, a multilabel classification algorithm is used to build an automated tag recommender for web bookmarks and bibliographic references. The second one tackles the problem of automated classification of semantic web services according to their application domain. The method represents each web service as a feature vector based on the text and the semantic annotations of the web service description. A number of different representations are proposed. The classification is achieved by applying machine learning algorithms to these representations. An increase in predictive accuracy is obtained by exploiting classifier combination.

Applications of machine learning methods to text data are of particular research and commercial interest because of the wide availability of information in text form. Machine learning makes it feasible to analyze large numbers of texts and to manage them automatically. Text classification, the subject of this dissertation, attracts considerable interest. Specifically, three important text classification problems are addressed: a) text stream classification, b) multilabel text classification and c) text classification on the world wide web. First, the dissertation focuses on one problem of text stream classification, concept drift, and in particular on the appearance of new features over time. To address this problem, a learning framework is presented that combines an incremental feature selection method with a classifier that can operate in dynamic feature spaces. The proposed framework is applied in an adaptive news reading system. In addition, a classifier ensemble method is proposed that uses a new representation model suited to data stream classification problems containing recurring concepts. Specifically, the stream is split into batches of data, which are transformed into vectors describing the concepts they contain. A stream clustering algorithm is applied to the resulting stream of vectors in order to organize them into groups in which the same or similar concepts prevail. The ultimate goal is to maintain one classifier for each concept in the stream. Furthermore, two methods are proposed for the multilabel classification problem, with particular emphasis on problems with a large number of labels. The first addresses the problem by organizing the labels into a hierarchy, its main advantages being short classification times and prediction quality. A new balanced clustering algorithm was proposed for organizing the labels into the hierarchy. In the second method, the initial label set is randomly split into subsets, and a separate multilabel classifier is applied to each of them. Finally, two methods for text classification on the world wide web are presented. The first uses a multilabel classifier to recommend tags in a system for sharing bibliographic references and web bookmarks. The second concerns the automated classification of semantic web services. Methods are proposed for representing service descriptions as feature vectors, to which machine learning algorithms are applied. Two methods for combining these representations are also presented.
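
The thesis abstract above mentions breaking the initial label set into small random subsets that are either disjoint or overlapping. The following is a minimal sketch of how such subsets could be drawn, not the author's implementation. The function names, the subset size `k`, the subset count `m` and the `seed` parameter are all assumptions made for illustration; training a multilabel classifier on each subset is left out.

```python
import math
import random


def disjoint_labelsets(labels, k, seed=0):
    """Randomly partition labels into disjoint subsets of size k.

    The last subset is smaller when len(labels) is not a multiple of k.
    """
    rng = random.Random(seed)
    shuffled = sorted(labels)
    rng.shuffle(shuffled)
    return [set(shuffled[i:i + k]) for i in range(0, len(shuffled), k)]


def overlapping_labelsets(labels, k, m, seed=0):
    """Draw m distinct random subsets of size k; subsets may share labels."""
    pool = sorted(labels)
    if m > math.comb(len(pool), k):
        raise ValueError("not enough distinct subsets of size k")
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < m:
        candidate = set(rng.sample(pool, k))
        if candidate not in chosen:  # keep subsets distinct
            chosen.append(candidate)
    return chosen
```

In the disjoint strategy every label is covered by exactly one subset, so one prediction per label is available. In the overlapping strategy a label can appear in several subsets, so the per-subset predictions would need to be combined, for example by voting (an assumption here, since the abstract does not specify the combination rule).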

Hellenic National Archive of Doctoral Dissertations

Incremental Clustering for the Classification of Concept-Drifting Data Streams

Author: Grigorios Tsoumakas
Ioannis Katakis
Ioannis Vlahavas

Abstract. Concept drift is a common phenomenon in streaming data environments and constitutes an interesting challenge for researchers in the machine learning and data mining community. This paper proposes a probabilistic representation model for data stream classification and investigates the use of incremental clustering algorithms in order to identify and adapt to concept drift. An experimental study is performed using three real-world datasets from the text domain, a basic implementation of the proposed framework and three baseline methods for dealing with drifting concepts. Results are promising and encourage further investigation.
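
The idea of mapping batches of stream examples to a probabilistic representation and clustering them incrementally can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the vector of smoothed P(word | class) estimates, the leader-follower clustering rule, the Euclidean distance and the `threshold` parameter are all choices made here for the example.

```python
import math


def conceptual_vector(batch, vocabulary, classes):
    """Map a batch of (tokens, label) pairs to estimates of P(word | class)."""
    vector = []
    for c in classes:
        docs = [set(tokens) for tokens, label in batch if label == c]
        for w in vocabulary:
            # Laplace-smoothed fraction of class-c documents that contain w
            vector.append((sum(w in d for d in docs) + 1) / (len(docs) + 2))
    return vector


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class LeaderFollower:
    """Incremental clustering: join the nearest centroid or open a new cluster."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.centroids = []
        self.counts = []

    def assign(self, v):
        """Return the index of the cluster that absorbs v."""
        if self.centroids:
            dists = [euclidean(v, c) for c in self.centroids]
            i = min(range(len(dists)), key=dists.__getitem__)
            if dists[i] <= self.threshold:
                self.counts[i] += 1
                n = self.counts[i]
                # running mean update of the matched centroid
                self.centroids[i] = [
                    c + (x - c) / n for c, x in zip(self.centroids[i], v)
                ]
                return i
        self.centroids.append(list(v))
        self.counts.append(1)
        return len(self.centroids) - 1
```

Each cluster index can then be treated as one concept, with a separate incremental classifier kept per index. A batch that lands in an existing cluster signals a recurring concept, while a new cluster signals drift to a previously unseen one.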

CiteSeerX

Dynamic Feature Space and Incremental Feature Selection for the Classification of Textual Data Streams

Author: Grigorios Tsoumakas
Ioannis Katakis
Ioannis Vlahavas
Publication venue: Springer Verlag

Abstract. Real world text classification applications are of special interest for the machine learning and data mining community, mainly because they introduce and combine a number of special difficulties. They deal with high dimensional, streaming, unstructured, and, on many occasions, concept drifting data. Another important peculiarity of streaming text, not adequately discussed in the relevant literature, is the fact that the feature space is initially unavailable. In this paper, we discuss this aspect of textual data streams. We underline the necessity for a dynamic feature space and the utility of incremental feature selection in streaming text classification tasks. In addition, we describe a computationally undemanding incremental learning framework that could serve as a baseline in the field. Finally, we introduce a new concept drifting dataset which could assist other researchers in the evaluation of new methodologies.
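
A dynamic feature space combined with incremental feature selection can be illustrated with a small sketch. It is not the framework from the paper: the class name `DynamicNaiveBayes`, the Bernoulli-style Naive Bayes model, the `top_n` parameter and the scoring rule (spread of the smoothed class-conditional probabilities, used here as a simple stand-in for a proper feature evaluation measure) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class DynamicNaiveBayes:
    """Naive Bayes over word presence whose vocabulary grows with the stream."""

    def __init__(self, top_n=100):
        self.top_n = top_n
        self.class_docs = defaultdict(int)  # class -> number of documents
        # word -> class -> number of documents containing the word
        self.word_docs = defaultdict(lambda: defaultdict(int))

    def update(self, tokens, label):
        """Add one labeled document; unseen words enter the feature space."""
        self.class_docs[label] += 1
        for w in set(tokens):
            self.word_docs[w][label] += 1

    def _p(self, w, c):
        # Laplace-smoothed P(word present | class); does not create entries
        return (self.word_docs.get(w, {}).get(c, 0) + 1) / (self.class_docs[c] + 2)

    def _score(self, w):
        probs = [self._p(w, c) for c in self.class_docs]
        return max(probs) - min(probs)

    def selected_features(self):
        """Re-rank all known words and keep the top_n at call time."""
        ranked = sorted(self.word_docs, key=lambda w: (-self._score(w), w))
        return set(ranked[: self.top_n])

    def predict(self, tokens):
        if not self.class_docs:
            return None
        selected = self.selected_features()
        present = set(tokens) & selected
        total = sum(self.class_docs.values())
        best, best_lp = None, -math.inf
        for c, n in self.class_docs.items():
            lp = math.log(n / total)
            for w in selected:
                p = self._p(w, c)
                lp += math.log(p if w in present else 1 - p)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

Because only counts are stored, both the update and the re-selection of features are cheap, and the set of features used for prediction can change after every document without retraining.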

CiteSeerX

Email Mining: Emerging Techniques for Email Management

Author: Grigorios Tsoumakas
Ioannis Katakis
Ioannis Vlahavas

Email has gained tremendous popularity over the past few years. People send and receive many messages per day, communicating with partners and friends, or exchanging files and information. Unfortunately, the phenomenon of email overload has grown over the past years, becoming a personal headache for users and a financial issue for companies. In this chapter, we will discuss how disciplines like Machine Learning and Data Mining can contribute to the solution of the problem by constructing intelligent techniques which automate email management tasks, and what advantages they hold over other conventional solutions. We will also discuss the particularity of email data and what special treatment it requires. Some interesting email mining applications like mail categorization, summarization, automatic answering and spam filtering will also be presented.

CiteSeerX