10 research outputs found

Data Mining Techniques for Complex User-Generated Data

Author: XIAO XIN
Publication venue: country:Italy
Publication date: 01/01/2016

Nowadays, the amount of collected information is continuously growing in a variety of domains. Data mining techniques are powerful instruments for analyzing these large data collections effectively and extracting hidden, useful knowledge. A vast amount of User-Generated Data (UGD) is created every day, including user behavior, user-generated content, user exploitation of available services, and user mobility in different domains. Some common critical issues arise in the UGD analysis process, such as large dataset cardinality and dimensionality, variable data distribution and inherent sparseness, and the heterogeneous data needed to model the different facets of the targeted domain. Consequently, extracting useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterized by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets from three example domains characterized by the above critical issues are considered as reference cases: health care, social networks, and urban environments. Experimental results show the effectiveness of the proposed approaches in discovering useful knowledge from different domains.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Predictive statistical user models under the collaborative approach

Author: Castaño Zabaleta Leonardo
Publication date: 01/01/2016

International Mention in the doctoral degree. User models and recommender systems are so similar that they can be considered the same thing, differing only in the use that we make of them. Both have their roots in multiple disciplines, such as information retrieval and machine learning, among others. Their impact has grown rapidly with the importance of data in systems and applications. Most big companies employ one or the other for reasons such as gaining more customers, boosting sales, or increasing revenue. Well-known companies like Amazon, EBay, and Google use such models to improve their businesses. In fact, as data becomes more and more important for companies, universities, and people, user models are crucial for making decisions over large amounts of data. Although user models can provide accurate predictions on large populations, their use is not restricted to prediction: it can be extended to the selection of dialogue strategies or the detection of communities within complex domains. A deep review of the existing literature found a lack of statistical user models based on experience. In addition, the existing models in the area are content-based models that suffer from major problems such as scalability, cold start, and the new-user problem. Furthermore, researchers in the area of user modelling usually develop their own models and then perform ad hoc evaluations that are not replicable and therefore not comparable. The lack of a complete evaluation framework makes it very difficult to compare results across models and domains. There are two main approaches to building a user model or recommender system: the content-based approach, where predictions are based on the same user's past behaviours, and the collaborative approach, where predictions rely on like-minded people. Both approaches have advantages but also downsides that have to be considered before building a model.
The main goal of this thesis is to develop a hybrid user model that takes the strengths of both approaches and mitigates their downsides by combining the two methods. The proposed hybrid model is based on an R-Tree structure. This structure was selected because the rectangle tree is specifically designed to store and manipulate multidimensional data effectively. The data structure, introduced by Guttman in 1984, is a height-balanced tree in which a search only requires visiting a few nodes. As a result, it can manage large populations of data efficiently, as only a few nodes are visited during inference. The R-Tree has two types of nodes: leaf nodes and non-leaf nodes. Leaf nodes hold the whole universe of users, while non-leaf nodes are in a sense redundant, containing summaries of their child nodes. In this thesis, two statistical user models based on experience are proposed. The first, a knowledge-base user model (KLUM), is a classical approach that summarizes and removes data in order to keep performance within reasonable margins. The second, an R-Tree user model (RTUM), is an innovative model based on an R-Tree structure. This new model solves not only the problem of removing data but also the scalability problem, which turns out to be one of the major problems in the area of user modelling. Both models have been developed and tested with equivalent formulations to make comparisons meaningful. Both models can create their own knowledge base from scratch, but they can also be fed with expert knowledge, which alleviates another major problem in the area of user modelling: the start-up problem. In summary, this thesis proposes two statistical user models, KLUM and RTUM.
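The node layout described above can be illustrated with a minimal sketch. It is not the thesis's RTUM: the tree is assembled by hand, without Guttman's insertion and node-splitting procedures, and all names (Rect, Node, make_leaf, make_parent, search) and the sample users are hypothetical. It only shows how non-leaf nodes that summarize their children as bounding rectangles let a search skip whole subtrees.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Rect:
    """Axis-aligned bounding rectangle given by its lower and upper corners."""
    lo: Point
    hi: Point

    def intersects(self, other: "Rect") -> bool:
        # Two rectangles overlap only if they overlap in every dimension.
        return all(a_lo <= b_hi and b_lo <= a_hi
                   for a_lo, a_hi, b_lo, b_hi
                   in zip(self.lo, self.hi, other.lo, other.hi))

    def contains(self, p: Point) -> bool:
        return all(l <= x <= h for l, x, h in zip(self.lo, p, self.hi))


def bounding_rect(points: List[Point]) -> Rect:
    dims = list(zip(*points))
    return Rect(tuple(min(d) for d in dims), tuple(max(d) for d in dims))


@dataclass
class Node:
    rect: Rect
    children: List["Node"] = field(default_factory=list)         # non-leaf only
    users: List[Tuple[str, Point]] = field(default_factory=list)  # leaf only


def make_leaf(users: List[Tuple[str, Point]]) -> Node:
    """Leaf node: stores the actual users (name, feature vector)."""
    return Node(rect=bounding_rect([p for _, p in users]), users=list(users))


def make_parent(children: List[Node]) -> Node:
    """Non-leaf node: stores only a summary (bounding rectangle) of its children."""
    corners = [c.rect.lo for c in children] + [c.rect.hi for c in children]
    return Node(rect=bounding_rect(corners), children=list(children))


def search(node: Node, query: Rect, visited: Optional[List[Node]] = None) -> List[str]:
    """Return the names of users inside `query`, descending only into overlapping nodes."""
    if visited is not None:
        visited.append(node)
    if not node.children:
        return [name for name, p in node.users if query.contains(p)]
    hits: List[str] = []
    for child in node.children:
        if child.rect.intersects(query):  # prune subtrees that cannot match
            hits.extend(search(child, query, visited))
    return hits


# Hypothetical two-dimensional user features, grouped into two leaves.
leaf_a = make_leaf([("ana", (1.0, 1.0)), ("bo", (2.0, 3.0))])
leaf_b = make_leaf([("cy", (8.0, 9.0)), ("di", (9.0, 7.0))])
root = make_parent([leaf_a, leaf_b])
```

Querying root with Rect((0.0, 0.0), (3.0, 3.0)) visits only the root and leaf_a, because the bounding rectangle of leaf_b does not overlap the query. This is the sense in which only a few nodes are visited during a search.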
In addition, a refinement of the RTUM user model is proposed. While RTUM partitions a node based on the centroids of the users in that node, the refinement implements a new partition based on privileged features. The new approach therefore takes advantage of the most discriminatory features of the domain to perform the partition. It provides not only accurate inferences but also an excellent clustering that can be useful in many different scenarios. For instance, this clustering can be employed in the area of social networks to detect communities within the network, a tough task that has been a goal of many researchers during the last few years. This thesis also provides a complete evaluation of the models with a great diversity of parameterizations and domains. The models are tested in four different domains, and the evaluation shows that the RTUM user model provides a massive gain over classical user models such as KLUM. During the evaluation, RTUM reached success rates of 85% while the analogous KLUM could only reach 65%, a gain of 20 percentage points for the proposed model. The evaluation not only compares models and success rates but also provides a broad analysis of how every parameter of the models affects performance, plus a complete study of database sizes and inference times for the models. The main conclusion is that, across a wide diversity of parameters and domains, RTUM outperforms KLUM in every scenario tested. As previously mentioned, the literature review also found a lack of evaluation frameworks for user modelling. This thesis therefore provides a complete evaluation framework for user modelling, which fills a gap in the literature and makes evaluations replicable and therefore comparable.
Over the years, researchers and developers have found it difficult to compare evaluations and measure the quality of their models in different domains due to the lack of an evaluation standard. The evaluation framework presented in this thesis covers data samples, including training and test sets, and different sets of experiments, together with a statistical analysis of the domain, confidence intervals, and confidence levels to guarantee that each experiment is statistically significant. The evaluation framework can be downloaded and then used to complete evaluations and cross-validate results across different models. This thesis would not have been possible without the financial support of the research projects Cadooh (TSI-020302-2011-21) and Thuban (TIN2008-02711), which funded part of this research. Official Doctoral Programme in Computer Science and Technology. Chair: Antonio de Amescua Seco. Secretary: Ruth Cobos Pérez. Member: Dominikus Heckman.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo

Large-Scale Pattern-Based Information Extraction from the World Wide Web

Author: Blohm Sebastian
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2010

Extracting information from text is the task of obtaining structured, machine-processable facts from information that is mentioned in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. This thesis explores the potential of using textual patterns for Information Extraction from the World Wide Web.
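As a rough illustration of what a textual pattern for information extraction can look like, the sketch below uses a single hand-written regular expression with two slots. It is not the method developed in the thesis; the pattern, the relation ("born in"), the function name extract_born_in, and the sample sentence are all assumptions made for the example.

```python
import re

# Hypothetical pattern "<person> was born in <place>", where each slot is a
# run of capitalized words. Real systems would use many such patterns.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
PATTERN = re.compile(rf"(?P<person>{NAME}) was born in (?P<place>{NAME})")


def extract_born_in(text: str) -> list:
    """Return (person, place) facts for every match of the pattern in `text`."""
    return [(m.group("person"), m.group("place")) for m in PATTERN.finditer(text)]


sample = ("Ada Lovelace was born in London. "
          "Alan Turing was born in Maida Vale, and studied at Cambridge.")
facts = extract_born_in(sample)
```

Here facts holds the structured pairs ("Ada Lovelace", "London") and ("Alan Turing", "Maida Vale"), turning unstructured sentences into machine-processable tuples. A single fixed pattern like this is brittle, which is one reason large-scale approaches work with many patterns at once.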

KITopen

Large-Scale Pattern-Based Information Extraction from the World Wide Web

Author: Blohm Sebastian
Publication venue: KIT Scientific Publishing, Karlsruhe
Publication date: 01/01/2010

Extracting information from text is the task of obtaining structured, machine-processable facts from information that is mentioned in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. This work explores the potential of using textual patterns for Information Extraction from the World Wide Web.

KITopen

Directory of Open Access Books (DOAB)

Formal Linguistic Models and Knowledge Processing. A Structuralist Approach to Rule-Based Ontology Learning and Population

Author: Di Buono Maria Pia
Publication venue: Universita degli studi di Salerno
Publication date: 02/03/2016

2013 - 2014. The main aim of this research is to propose a structuralist approach for knowledge processing by means of ontology learning and population, starting from unstructured and structured texts. The suggested method includes distributional semantic approaches and NL formalization theories, in order to develop a framework which relies upon deep linguistic analysis... [edited by author] XIII n.s.

EleA@UniSA - Università degli Studi di Salerno

A Polyhedral Study of Mixed 0-1 Set

Author: Agra Agostinho
Doostmohammadi Mahdi
Publication venue: ALIO-EURO 2011
Publication date: 01/01/2011

We consider a variant of the well-known single node fixed charge network flow set with constant capacities. This set arises from the relaxation of more general mixed integer sets such as lot-sizing problems with multiple suppliers. We provide a complete polyhedral characterization of the convex hull of the given set.

University of Strathclyde Institutional Repository

Proceedings of the VI National Conference (JNIC2021 LIVE)

Author: Alcaraz Cristina
Calvo Guillermo
Castro Noemí de
Fernández-Medina Eduardo
Serrano Manuel A.
Publication venue: 'Universidad Castilla la Mancha'
Publication date: 01/06/2021

This conference has become a meeting forum for the most relevant actors in the field of cybersecurity in Spain. It not only presents some of the leading scientific work in the various areas of cybersecurity, but also pays special attention to training and educational innovation in cybersecurity, and to the connection with industry through technology transfer proposals. Indeed, this year the Transfer Programme presents some changes to its operation and development, designed to improve it and make it more valuable for the whole cybersecurity research community.

Universidad de Castilla-La Mancha: Repositorio Universitario Institucional de Recursos Abiertos (RUIdeRA)