The Ideal Candidate. Analysis of Professional Competences through Text Mining of Job Offers
The aim of this paper is to propose analytical tools for identifying peculiar aspects of the job market for graduates. We propose a strategy for dealing with data that differ in source and nature
A database with enterprise application for mining astronomical data obtained by MOA : a thesis submitted in partial fulfilment of the requirements for the degree of the Master of Information Science in Computer Science, Massey University at Albany, Auckland, New Zealand
The MOA (Microlensing Observations in Astrophysics) Project is one of a new generation of modern astronomy endeavours that generate huge volumes of data. These have enormous scientific data mining potential. However, it is common for astronomers to deal with millions and even billions of records. The challenge of how to manage these large data sets is an important one for researchers. A good database management system is vital for the research. With the modern observation equipment in use, MOA suffers from a growing volume of data, and a database management solution is needed. This study analysed modern database and enterprise application technology. After analysing the data mining requirements of MOA, a prototype data management system based on the MVC pattern was developed. Furthermore, the application supports sharing MOA findings and scientific data on the Internet. It was tested on a 7 GB subset of the archived MOA data set. After testing, it was found that the application could query data efficiently and support data mining
Distributed multinomial regression
This article introduces a model-based approach to distributed computing for
multinomial logistic (softmax) regression. We treat counts for each response
category as independent Poisson regressions via plug-in estimates for fixed
effects shared across categories. The work is driven by the
high-dimensional-response multinomial models that are used in analysis of a
large number of random counts. Our motivating applications are in text
analysis, where documents are tokenized and the token counts are modeled as
arising from a multinomial dependent upon document attributes. We estimate such
models for a publicly available data set of reviews from Yelp, with text
regressed onto a large set of explanatory variables (user, business, and rating
information). The fitted models serve as a basis for exploring the connection
between words and variables of interest, for reducing dimension into supervised
factor scores, and for prediction. We argue that the approach herein provides
an attractive option for social scientists and other text analysts who wish to
bring familiar regression tools to bear on text data.
Comment: Published at http://dx.doi.org/10.1214/15-AOAS831 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
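As an illustration of the idea in the abstract above, the following minimal sketch fits one Poisson regression per token category, each independently of the others, using a plug-in offset per document. It is not the authors' implementation: taking the log of each document's total count as the plug-in fixed effect is an assumption here, and the toy counts, the single binary covariate `x`, and the helper names `fit_poisson` and `predict` are hypothetical.

```python
import math


def fit_poisson(counts, x, offsets, iters=25):
    """Fit log E[c_i] = offset_i + a + b * x_i by Newton's method.

    Assumes at least one nonzero count and a covariate x that is not constant.
    """
    # Start from the intercept-only solution with no covariate effect.
    a = math.log(sum(counts) / sum(math.exp(o) for o in offsets))
    b = 0.0
    for _ in range(iters):
        mu = [math.exp(o + a + b * xi) for o, xi in zip(offsets, x)]
        # Gradient of the Poisson log-likelihood with respect to (a, b).
        g_a = sum(c - m for c, m in zip(counts, mu))
        g_b = sum((c - m) * xi for c, m, xi in zip(counts, mu, x))
        # Entries of the 2x2 information matrix (negative Hessian).
        h_aa = sum(mu)
        h_ab = sum(m * xi for m, xi in zip(mu, x))
        h_bb = sum(m * xi * xi for m, xi in zip(mu, x))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system in closed form.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Hypothetical toy data: six documents, three token categories.
docs = [
    [5, 3, 2], [4, 4, 2], [6, 2, 2],   # documents with x = 0
    [2, 5, 3], [1, 6, 3], [3, 4, 3],   # documents with x = 1
]
x = [0, 0, 0, 1, 1, 1]

# Assumed plug-in fixed effect: log of each document's total token count.
offsets = [math.log(sum(d)) for d in docs]

# One independent fit per category; no fit needs another category's counts.
fits = [fit_poisson([d[j] for d in docs], x, offsets) for j in range(3)]


def predict(xi):
    """Normalize the fitted Poisson rates into category probabilities."""
    rates = [math.exp(a + b * xi) for a, b in fits]
    total = sum(rates)
    return [r / total for r in rates]
```

Because each call to `fit_poisson` touches only one category's counts, the calls could in principle run on separate machines. On this toy data, `predict(1)` returns approximately the empirical token shares of the `x = 1` documents, (0.2, 0.5, 0.3).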
Image mining: issues, frameworks and techniques
Advances in image acquisition and storage technology have led to tremendous growth in very large and detailed image databases. These images, if analyzed, can reveal useful information to human users. Image mining deals with the extraction of implicit knowledge, image data relationships, or other patterns not explicitly stored in the images. Image mining is more than just an extension of data mining to the image domain. It is an interdisciplinary endeavor that draws upon expertise in computer vision, image processing, image retrieval, data mining, machine learning, databases, and artificial intelligence. Despite the development of many applications and algorithms in the individual research fields cited above, research in image mining is still in its infancy. In this paper, we examine the research issues in image mining and current developments in the field, particularly image mining frameworks and state-of-the-art techniques and systems. We also identify some future research directions for image mining at the end of this paper
Towards a semantic and statistical selection of association rules
The increasing growth of databases raises an urgent need for more accurate
methods to better understand the stored data. In this scope, association rules
were extensively used for the analysis and the comprehension of huge amounts of
data. However, the number of generated rules is too large to be efficiently
analyzed and explored in any further process. Association rules selection is a
classical topic to address this issue, yet new, innovative approaches are
required in order to provide help to decision makers. Hence, many
interestingness measures have been defined to statistically evaluate and filter
the association rules. However, these measures present two major problems. On
the one hand, they do not allow eliminating irrelevant rules; on the other
hand, their abundance leads to heterogeneous evaluation results, which leads
to confusion in decision making. In this paper, we propose a two-winged
approach to select statistically interesting and semantically incomparable
rules. Our statistical selection helps discover interesting association
rules without favoring or excluding any measure. The semantic comparability
helps to decide whether the considered association rules are semantically
related, i.e., comparable. The outcomes of our experiments on real datasets show promising
results in terms of reduction in the number of rules
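To make the notion of an interestingness measure concrete, here is a minimal sketch that computes three widely used ones (support, confidence, and lift) for a single rule. These are standard textbook measures, not the selection method proposed in the paper; the toy transactions and the function name `rule_measures` are hypothetical.

```python
def rule_measures(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent.

    Assumes the antecedent and the consequent each occur in at least one
    transaction, so the divisions below are well defined.
    """
    n = len(transactions)
    # Count transactions containing each side of the rule, and both sides.
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_b = sum(1 for t in transactions if consequent <= t)
    n_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ab / n               # share of transactions with both sides
    confidence = n_ab / n_a          # share of antecedent cases with consequent
    lift = confidence / (n_b / n)    # confidence relative to consequent's base rate
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

m = rule_measures(transactions, {"bread"}, {"milk"})
```

On this toy data the rule has a confidence of 0.75 but a lift below 1, so one measure suggests a strong rule while the other suggests a slightly negative association. This is a small instance of the disagreement between measures that the abstract describes.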