
    Social Network Opinion and Posts Mining for Community Preference Discovery

    The popularity of posts, topics, and opinions on social media websites, and the influence of users, can be discovered by analysing user responses (e.g., likes/dislikes, comments, ratings). Existing web opinion mining systems such as OpinionMiner are based on opinion-text similarity scoring of users' review texts and product ratings; they generate a database table of features, functions and opinions, mined through classification, to label arriving opinions as positive or negative on user-service (interest) networks such as Amazon.com. These systems are not directly applicable to user-user (friendship) networks such as Facebook.com, since they do not consider multiple posts on multiple products, relationships between users (such as influence), or diverse posts and comments. In this thesis, we propose a new influence network (IN) generation algorithm, Opinion Based IN (OBIN), based on opinion mining of friendship networks such as Facebook.com. OBIN mines opinions using an extended OpinionMiner that considers multiple posts and the relationships (influences) between users. The approach uses a frequent pattern mining algorithm to determine community (positive or negative) preferences for a given product, which serve as input to standard influence maximization algorithms such as CELF for target marketing. Experiments and evaluations show the effectiveness of OBIN over CELF in large-scale friendship networks. Keywords: Influence Analysis, Recommendation, Ranking, Sentiment Classification, Large Scale Network, Social Network, Opinion Mining, Text Mining
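The abstract names CELF as the downstream influence-maximization step. As a rough illustration of how lazy-greedy (CELF-style) seed selection works, here is a minimal Python sketch; the `spread` function, the node set, and the budget `k` are placeholders for illustration, not details taken from the thesis.

```python
import heapq

def celf_select_seeds(nodes, spread, k):
    """Lazy-greedy (CELF-style) seed selection for influence maximization.

    `spread(seed_set)` is assumed to return an estimate of the expected
    number of influenced users; in practice this would be a Monte-Carlo
    cascade simulation, here it is a caller-supplied stand-in.
    """
    # Initial pass: marginal gain of each node with respect to the empty set.
    heap = [(-spread({v}), v, 0) for v in nodes]
    heapq.heapify(heap)
    seeds, current_spread, round_no = set(), 0.0, 0

    while len(seeds) < k and heap:
        neg_gain, v, last_round = heapq.heappop(heap)
        if last_round == round_no:
            # Gain is up to date for the current seed set: accept the node.
            seeds.add(v)
            current_spread += -neg_gain
            round_no += 1
        else:
            # Lazily re-evaluate the marginal gain and push the node back.
            gain = spread(seeds | {v}) - current_spread
            heapq.heappush(heap, (-gain, v, round_no))
    return seeds
```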

    Learning lost temporal fuzzy association rules

    Fuzzy association rule mining discovers patterns in transactions, such as shopping baskets in a supermarket or page accesses by a visitor to a Web site. Temporal patterns can be present in fuzzy association rules because the underlying process generating the data can be dynamic. However, existing solutions may not discover all interesting patterns because of a previously unrecognised problem that is revealed in this thesis: the contextual meaning of fuzzy association rules changes with the dynamic nature of the data, so a static fuzzy representation and a traditional search method are inadequate. The Genetic Iterative Temporal Fuzzy Association Rule Mining (GITFARM) framework solves the problem by utilising flexible fuzzy representations from a fuzzy rule-based system (FRBS). The combined temporal, fuzzy and itemset space is searched simultaneously with a genetic algorithm (GA) to overcome the problem. The framework transforms the dataset into a graph for efficient searching. A choice of model for the fuzzy representation provides a trade-off between an approximate and a descriptive model. A method for verifying the solution to the hypothesised problem was presented, and the proposed GA-based solution was compared with a traditional approach that uses an exhaustive search method. It was shown that the GA-based solution discovered rules that the traditional approach did not, which shows that simultaneously searching for rules and membership functions with a GA is a suitable solution for mining temporal fuzzy association rules. So, in practice, more knowledge can be discovered for making well-informed decisions that would otherwise be lost with a traditional approach. Funding: EPSRC DT
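To make the idea of searching temporal windows and membership functions together more concrete, the sketch below computes a temporal fuzzy support using triangular membership functions and shows how a GA individual might encode both; the encoding, parameter ranges, and minimum t-norm are assumptions for illustration, not the GITFARM specification.

```python
import random

def tri_membership(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def temporal_fuzzy_support(transactions, items, window, mfs):
    """Fuzzy support of `items` restricted to transactions inside `window`.

    `transactions` is assumed to be a list of (timestamp, {item: quantity})
    pairs and `mfs` maps each item to (a, b, c) membership parameters --
    a simplified stand-in for the thesis's fuzzy representation.
    """
    lo, hi = window
    total = 0.0
    for ts, quantities in transactions:
        if lo <= ts <= hi:
            degrees = [tri_membership(quantities.get(i, 0), *mfs[i]) for i in items]
            total += min(degrees)  # t-norm: minimum
    return total

def random_individual(t_min, t_max, items):
    """A GA individual encoding the temporal endpoints plus membership parameters."""
    lo = random.uniform(t_min, t_max)
    hi = random.uniform(lo, t_max)
    mfs = {i: sorted(random.uniform(0, 10) for _ in range(3)) for i in items}
    return (lo, hi), mfs
```

The fitness of such an individual could then be the temporal fuzzy support of a candidate itemset, which is what the simultaneous search over rules and membership functions optimises.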

    SemAware: An Ontology-Based Web Recommendation System

    Web Recommendation Systems (WRSs) are used to recommend items and future page views to World Wide Web users. Web usage mining lays the platform for WRSs, as the results of mining user browsing patterns are used for recommendation and prediction. Existing WRSs are still limited by several problems, among them recommending items to a new user whose browsing history is not available (cold start), sparse data structures (sparsity), and a lack of diversity in the set of recommended items (content overspecialization). Existing WRSs also fail to make full use of the semantic information about items and the relations (e.g., is-a, has-a, part-of) among them. A domain ontology, advocated by the Semantic Web, provides a formal representation of domain knowledge with relations, concepts and axioms. This thesis proposes the SemAware system, which integrates a domain ontology into web usage mining and web recommendation, and increases the effectiveness and efficiency of the system by addressing the problems of cold start, sparsity, content overspecialization and complexity-accuracy trade-offs. The SemAware technique enriches the web log with semantic information through a proposed semantic distance measure based on the Jaccard coefficient. A matrix of semantic distances is then used in semantics-aware Sequential Pattern Mining (SPM) of the web log, and is also integrated with the transition probability matrix of Markov models built from the web log. In the recommendation phase, the proposed SPM and Markov models are used to add interpretability. The proposed recommendation engine uses a vector-space model to build an item-concept correlation matrix, combined with user-provided tags, to generate top-n recommendations. Experimental studies show that SemAware outperforms popular recommendation algorithms and that its proposed components are effective and efficient in addressing the contradicting-predictions problem, the scalability and sparsity of SPM and top-n recommendation, and content overspecialization.
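As a small illustration of a Jaccard-based semantic distance and its combination with a Markov transition matrix, the sketch below compares the ontology concepts annotating two pages and blends the result with a transition probability; the function names and the linear mixing weight `alpha` are hypothetical and not taken from SemAware.

```python
def jaccard_semantic_similarity(concepts_a, concepts_b):
    """Jaccard coefficient over the ontology concepts annotating two pages."""
    a, b = set(concepts_a), set(concepts_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def blended_transition(trans_prob, sem_sim, alpha=0.5):
    """Blend a Markov transition probability with a semantic similarity score.

    `alpha` is a hypothetical mixing weight; the exact combination used in
    the thesis is not reproduced here.
    """
    return alpha * trans_prob + (1 - alpha) * sem_sim

# Example: two pages sharing one of three concepts, with a 0.2 transition probability.
sim = jaccard_semantic_similarity({"Laptop", "Hardware"}, {"Laptop", "Accessory"})
print(blended_transition(0.2, sim))
```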

    Mining interesting events on large and dynamic data

    Nowadays, almost every human interaction produces some form of data. These data are available either to every user, e.g., images uploaded on Flickr, or only to users with specific privileges, e.g., transactions in a bank. The huge amount of data produced can easily overwhelm humans trying to make sense of it, so methods are needed that analyse the content of the produced data, identify emerging topics in it, and present those topics to users. In this work, we focus on identifying emerging topics over large and dynamic data. More specifically, we analyse two types of data: data published in social networks like Twitter and Flickr, and structured data stored in relational databases that are updated through continuous insertion queries. In social networks, users post text, images or videos and annotate each of them with a set of tags describing its content. We define sets of co-occurring tags to represent topics and track the correlations of co-occurring tags over time. We split the tags across multiple nodes and make each node responsible for computing the correlations of its assigned tags. We implemented our approach in Storm, a distributed processing engine, and conducted a user study to estimate the quality of our results. In structured data stored in relational databases, top-k group-by queries are defined, and an emerging topic is considered to be a change in the top-k results. We maintain the top-k result sets in the presence of updates while minimising the interaction with the underlying database. We implemented and experimentally tested our approach.
    Nowadays, data arise from almost every human action and interaction. Photos are uploaded on Flickr, news is spread via Twitter, and contacts are managed in LinkedIn and Facebook, alongside traditional processes such as bank transactions or flight bookings that produce changes in databases. Such a huge amount of data can easily be overwhelming when one tries to extract its essence. New methods are needed to analyse the content of the data, identify newly emerging topics, and present the resulting insights to the user in a clear manner. This work deals with methods for identifying new topics in large and dynamic data sets. It considers, on the one hand, data published in social networks such as Twitter and Flickr and, on the other hand, structured data from relational databases that are continuously updated. In social networks, users post texts, images or videos and describe them for other users with keywords, so-called tags. We interpret groups of co-occurring tags as a kind of topic and follow the relationship, or correlation, of these tags over a certain period of time. Abrupt increases in the correlation are taken as an indication of trends. The core task, counting co-occurring tags to compute correlation measures, is distributed over a large number of compute nodes. The developed algorithms were implemented in Storm, a novel distributed data stream management system, and carefully evaluated with respect to load balancing and the resulting network load. A user study further shows that the quality of the discovered trends is higher than the quality of the results of existing systems. In structured data from relational database systems, top-k result lists are defined by aggregation queries in SQL. Of interest are changes occurring in these lists, which are regarded as events (trends). This work presents methods for maintaining these result lists as efficiently as possible in order to minimise interactions with the underlying database.
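A minimal sketch of the per-node bookkeeping described above (tracking co-occurring tag pairs over a sliding window of posts) might look as follows; the window size, the single-process class, and the `top_pairs` helper are illustrative assumptions, not the Storm topology used in the thesis, where tags are partitioned across worker nodes.

```python
from collections import Counter, deque
from itertools import combinations

class TagPairTracker:
    """Tracks co-occurrence counts of tag pairs over a sliding window of posts.

    In a distributed setting, each worker would be responsible for the tags
    assigned to it; this single-process class only illustrates the per-node
    bookkeeping.
    """
    def __init__(self, window_size=1000):
        self.window = deque(maxlen=window_size)
        self.counts = Counter()

    def add_post(self, tags):
        pairs = [tuple(sorted(p)) for p in combinations(set(tags), 2)]
        if len(self.window) == self.window.maxlen:
            # The oldest post is about to fall out of the window.
            for old in self.window[0]:
                self.counts[old] -= 1
        self.window.append(pairs)
        for p in pairs:
            self.counts[p] += 1

    def top_pairs(self, n=10):
        """Most strongly co-occurring tag pairs in the current window."""
        return self.counts.most_common(n)
```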

    Large-Scale Pattern-Based Information Extraction from the World Wide Web

    Extracting information from text is the task of obtaining structured, machine-processable facts from information that is mentioned in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. This thesis explores the potential of using textual patterns for Information Extraction from the World Wide Web
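As a generic example of what a textual extraction pattern looks like, the snippet below applies a Hearst-style "X such as Y" pattern with a regular expression; the pattern and the sample sentence are illustrative only and are not taken from the thesis.

```python
import re

# A classic lexico-syntactic pattern: "<hypernym> such as <hyponym>".
PATTERN = re.compile(r"(\w+)\s+such as\s+(\w+)")

def extract_pairs(text):
    """Return (hypernym, hyponym) candidates matched by the pattern."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

print(extract_pairs("He plays instruments such as guitar and piano."))
# [('instruments', 'guitar')]
```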


    Temporal Mining for Distributed Systems

    Many systems and applications continuously produce events. These events are used to record the status of a system and to trace its behaviour. By examining these events, system administrators can check for potential problems in their systems. If the temporal dynamics of the systems are further investigated, the underlying patterns can be discovered, and the uncovered knowledge can be leveraged to predict future system behaviour or to mitigate potential risks. Moreover, system administrators can utilize the temporal patterns to set up event management rules that make the system more intelligent. With the popularity of data mining techniques in recent years, these events have gradually become more and more useful. Despite recent advances in data mining techniques, their application to system event mining is still at a rudimentary stage. Most works still focus on episode mining or frequent pattern discovery. These methods are unable to provide a brief yet comprehensible summary that reveals the valuable information from a high-level perspective, and they provide little actionable knowledge to help system administrators better manage their systems. To make better use of the recorded events, more practical techniques are required. From the perspective of data mining, three correlated directions are considered helpful for system management: (1) provide concise yet comprehensive summaries of the running status of the systems; (2) make the systems more intelligent and autonomous; (3) effectively detect abnormal system behaviour. Due to the richness of the event logs, all of these directions can be pursued in a data-driven manner; in this way, the robustness of the systems can be enhanced and the goal of autonomous management can be approached. This dissertation mainly focuses on the foregoing directions, leveraging temporal mining techniques to facilitate system management. More specifically, three concrete topics are discussed, including event, resource demand prediction, and streaming anomaly detection. Besides the theoretical contributions, experimental evaluations are presented to demonstrate the effectiveness and efficacy of the corresponding solutions.
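As a generic illustration of streaming anomaly detection on event rates (one of the topics listed above), the following sketch flags spikes in per-interval event counts using an exponentially weighted mean and variance; the parameters and the z-score rule are assumptions for illustration, not the dissertation's method.

```python
class StreamingAnomalyDetector:
    """Flags event-count spikes with an exponentially weighted z-score."""

    def __init__(self, alpha=0.1, threshold=3.0):
        self.alpha = alpha          # smoothing factor for mean/variance
        self.threshold = threshold  # z-score above which a count is anomalous
        self.mean = None
        self.var = 0.0

    def update(self, count):
        """Feed one per-interval event count; return True if it is anomalous."""
        if self.mean is None:
            self.mean = float(count)
            return False
        std = self.var ** 0.5
        is_anomaly = std > 0 and abs(count - self.mean) / std > self.threshold
        # Update the exponentially weighted statistics after scoring.
        diff = count - self.mean
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return is_anomaly
```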

    Behaviour modelling with data obtained from the Internet and contributions to cluster validation

    This PhD thesis makes contributions to modelling behaviours found in different types of data acquired from the Internet and to the field of clustering evaluation. Two types of Internet data were processed: on the one hand, Internet traffic, with the objective of attack detection, and on the other hand, web surfing activity, with the objective of web personalization, both being of a sequential nature. To this aim, machine learning techniques, mostly unsupervised ones, were applied. Moreover, contributions were made in cluster evaluation in order to ease the selection of the best partition in clustering problems. With regard to network attack detection, the gureKDDCup database was first generated, adding payload data (the content of network packets) to the KDDCup99 connection attributes, since the payload is essential for detecting non-flood attacks (attacks that use only a few packets). By modelling these data, a network Intrusion Detection System (nIDS) was then proposed in which context-independent payload processing was performed, obtaining satisfactory detection rates. In the web mining context, web surfing activity was modelled for web personalization. Generic and non-invasive systems for extracting knowledge were proposed, using only the information stored in web server log files. Contributions were made in two directions: problem detection and link suggestion. For the first application, a meaningful list of navigation attributes was proposed for each user session in order to group the sessions and detect different navigation profiles. For the latter, a general and non-invasive link suggestion system was proposed and evaluated with satisfactory results in a link prediction context. With regard to the analysis of Cluster Validity Indices (CVIs), the most extensive CVI comparison found to date was carried out, using an evaluation methodology based on a partition similarity measure. Moreover, we analysed the behaviour of CVIs in a real web mining application with a high number of clusters, in which they tend to be unstable, and we proposed a procedure that automatically selects the best partition by analysing the slope of the different CVI values.
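As a rough illustration of slope-based selection of the best partition, the sketch below picks the number of clusters at which a CVI curve bends most sharply; the "higher is better" assumption and the second-difference rule are simplifications for illustration, not the exact procedure proposed in the thesis.

```python
def select_k_by_slope(cvi_values):
    """Select the number of clusters where the CVI curve bends most sharply.

    `cvi_values` maps a candidate number of clusters k to a CVI score
    (assumed "higher is better"). The discrete second difference is used as
    a proxy for a change in slope; at least three candidate k values are needed.
    """
    ks = sorted(cvi_values)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Bend = how much the slope flattens after k.
        bend = (cvi_values[k] - cvi_values[prev_k]) - (cvi_values[next_k] - cvi_values[k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Example: a CVI that improves quickly up to k=4 and then plateaus.
print(select_k_by_slope({2: 0.40, 3: 0.62, 4: 0.80, 5: 0.82, 6: 0.83}))  # -> 4
```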
    Funding: Grant of the Basque Government (ref.: BFI08.226); Grant of Ministry of Economy and Competitiveness of the Spanish Government (ref.: BES-2011-045989); Research stay grant of Spanish Ministry of Economy and Competitiveness (ref.: EEBB-I-14-08862); University of the Basque Country UPV/EHU (BAILab, grant UFI11/45); Department of Education, Universities and Research of the Basque Government (grant IT-395-10); Ministry of Economy and Competitiveness of the Spanish Government and by the European Regional Development Fund - ERDF (eGovernAbility, grant TIN2014-52665-C2-1-R)