87 research outputs found
A New Estimator of Intrinsic Dimension Based on the Multipoint Morisita Index
The size of datasets has been increasing rapidly both in terms of number of
variables and number of events. As a result, the empty space phenomenon and the
curse of dimensionality complicate the extraction of useful information. But,
in general, data lie on non-linear manifolds of much lower dimension than that
of the spaces in which they are embedded. In many pattern recognition tasks,
learning these manifolds is a key issue and it requires the knowledge of their
true intrinsic dimension. This paper introduces a new estimator of intrinsic
dimension based on the multipoint Morisita index. It is applied to both
synthetic and real datasets of varying complexities and comparisons with other
existing estimators are carried out. The proposed estimator turns out to be
fairly robust to sample size and noise, unaffected by edge effects, able to
handle large datasets, and computationally efficient.
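To make the idea concrete, here is a minimal sketch of a grid-based Morisita-style dimension estimate. It uses only the two-point (m = 2) index, not the paper's full multipoint estimator, and the function names and grid scales are illustrative assumptions: the index is computed at several cell sizes, and the slope of its log-log plot against cell size recovers the intrinsic dimension of a 1-D manifold embedded in 2-D space.

```python
import numpy as np

def morisita_I2(X, cells_per_axis):
    """Two-point Morisita index of point set X on a grid with the given
    number of cells per axis (cell edge length = 1 / cells_per_axis)."""
    N, E = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    Xn = (X - X.min(axis=0)) / np.where(span > 0, span, 1.0)  # unit hypercube
    cell = np.minimum((Xn * cells_per_axis).astype(int), cells_per_axis - 1)
    _, counts = np.unique(cell, axis=0, return_counts=True)   # occupied cells
    Q = cells_per_axis ** E                                   # total grid cells
    return Q * np.sum(counts * (counts - 1)) / (N * (N - 1))

# Points on a 1-D manifold (the diagonal) embedded in E = 2 dimensions.
rng = np.random.default_rng(0)
t = rng.random(5000)
X = np.column_stack([t, t])

scales = np.array([2, 4, 8, 16, 32])                  # cells per axis
delta = 1.0 / scales                                  # cell edge lengths
I2 = np.array([morisita_I2(X, q) for q in scales])
slope = np.polyfit(np.log(delta), np.log(I2), 1)[0]   # log-log slope ~ D - E
D_hat = X.shape[1] + slope                            # estimated intrinsic dim
```

For points concentrated on a D-dimensional manifold, the index scales as delta^(D - E), so the fitted slope plus the embedding dimension E gives the estimate; for the diagonal line above, D_hat comes out close to 1.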
Addressing Digital Divide through Digital Literacy Training Programs: A Systematic Literature Review
Digital literacy training programs (DLTPs) are influential in developing digital skills to help build a more inclusive and participatory ecosystem. This study presents a review of 86 studies related to DLTPs for marginalised populations in developed and developing countries. It aims to understand (a) the profile of DLTPs, (b) the digital competences incorporated in the training curriculum and (c) tangible outcomes of Internet use post-training. The review indicated that developed countries focus more on developing digital literacy in elderly populations. In contrast, in developing countries the focus still lies on developing digital literacy among people with low skills and education levels. The training curricula focus mainly on developing information-seeking and communication competencies, besides the basic operations of digital devices. Most of the studies reported an increase in personal-level outcomes around health, leisure and self-actualisation achieved post-training. This study can help policymakers, practitioners, and educational researchers improve the scope and quality of educational programs and contribute to people's digital empowerment and well-being.
A Brief History of Web Crawlers
Web crawlers visit internet applications, collect data, and learn about new
web pages from visited pages. Web crawlers have a long and interesting history.
Early web crawlers collected statistics about the web. In addition to
collecting statistics about the web and indexing the applications for search
engines, modern crawlers can be used to perform accessibility and vulnerability
checks on the application. Quick expansion of the web, and the complexity added
to web applications have made the process of crawling a very challenging one.
Throughout the history of web crawling many researchers and industrial groups
addressed different issues and challenges that web crawlers face. Different
solutions have been proposed to reduce the time and cost of crawling.
Performing an exhaustive crawl is a challenging task. Additionally,
capturing the model of a modern web application and extracting data from it
automatically is another open question. What follows is a brief history of
different techniques and algorithms used from the early days of crawling up to
the present. We introduce criteria to evaluate the relative performance of
web crawlers. Based on these criteria we plot the evolution of web crawlers and
compare their performance.
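The core process the abstract describes, visiting pages, collecting links, and learning about new pages from visited ones, can be sketched as a breadth-first frontier traversal. The sketch below is a simplified assumption of how such a crawler is structured: a toy in-memory link graph stands in for real HTTP fetching and HTML parsing, and all names are illustrative.

```python
from collections import deque

def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl: visit pages, collect their links,
    and enqueue links that have not been seen before."""
    frontier = deque([seed])   # pages discovered but not yet visited
    seen = {seed}              # guards against revisiting / cycles
    visited = []               # visit order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):   # in practice: fetch + parse HTML
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy link graph standing in for real HTTP fetches (note the cycle /c -> /).
graph = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": ["/"]}
order = crawl("/", lambda u: graph.get(u, []))
```

Real crawlers replace the queue discipline with politeness and priority policies, which is where much of the work surveyed in the paper lies, but the frontier-plus-seen-set skeleton is common to most of them.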
Knowledge extraction from minutes of Portuguese municipalities meetings
A very relevant problem in e-government is that a great amount
of knowledge resides in unstructured natural-language documents. If
that knowledge were stored using a computer-processable representation,
it would be more easily accessed. In this paper we
present the architecture, modules and initial results of a prototype
under development for extracting information from government
documents. The prototype stores the information using
a formal representation of the set of concepts and the relationships
between those concepts - an ontology. The system was
tested using minutes of Portuguese Municipal Boards meetings.
Initial results are presented for an important and frequent topic
of the minutes: the subsidies granted by municipalities.
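A common way to store "a formal representation of concepts and the relationships between those concepts" is as subject-relation-object triples. The sketch below is a hypothetical illustration of that idea, not the prototype's actual storage layer, and the entity names are invented, not taken from real minutes.

```python
# Hypothetical ontology-style triple store for extracted facts.
triples = set()

def add_fact(subject, relation, obj):
    """Store one extracted (concept, relationship, concept) triple."""
    triples.add((subject, relation, obj))

def facts_about(relation):
    """Return all (subject, object) pairs linked by the given relationship."""
    return sorted((s, o) for (s, r, o) in triples if r == relation)

# Invented example facts of the kind the subsidies topic would produce.
add_fact("MunicipalityA", "granted_subsidy_to", "LocalSportsClub")
add_fact("MunicipalityA", "granted_subsidy_to", "CulturalAssociation")
subsidies = facts_about("granted_subsidy_to")
```

In practice such triples would live in an RDF store and be queried with a language like SPARQL; the point here is only that a computer-processable representation makes questions such as "which subsidies did a municipality grant?" directly answerable.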
Privacy Tradeoffs in Predictive Analytics
Online services routinely mine user data to predict user preferences, make
recommendations, and place targeted ads. Recent research has demonstrated that
several private user attributes (such as political affiliation, sexual
orientation, and gender) can be inferred from such data. Can a
privacy-conscious user benefit from personalization while simultaneously
protecting her private attributes? We study this question in the context of a
rating prediction service based on matrix factorization. We construct a
protocol of interactions between the service and users that has remarkable
optimality properties: it is privacy-preserving, in that no inference algorithm
can succeed in inferring a user's private attribute with a probability better
than random guessing; it has maximal accuracy, in that no other
privacy-preserving protocol improves rating prediction; and, finally, it
involves a minimal disclosure, as the prediction accuracy strictly decreases
when the service reveals less information. We extensively evaluate our protocol
using several rating datasets, demonstrating that it successfully blocks the
inference of gender, age and political affiliation, while incurring less than
5% decrease in the accuracy of rating prediction.
Comment: Extended version of the paper appearing in SIGMETRICS 201
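The paper's contribution is the privacy-preserving interaction protocol; the sketch below shows only the underlying matrix-factorization rating predictor that such a service builds on, fitted by plain SGD. The function name, ratings, and hyperparameters are illustrative assumptions, not the authors' setup.

```python
import numpy as np

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02,
              epochs=500, seed=0):
    """Fit user/item latent factors to observed ratings by SGD so that
    the dot product U[u] . V[i] approximates rating r."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]          # prediction error on this rating
            u_row = U[u].copy()            # use pre-update value for both steps
            U[u] += lr * (err * V[i] - reg * u_row)
            V[i] += lr * (err * u_row - reg * V[i])
    return U, V

# Tiny set of observed (user, item, rating) triples; values are made up.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 5.0), (1, 1, 1.0), (2, 0, 1.0)]
U, V = factorize(ratings, n_users=3, n_items=2)
mae = np.mean([abs(r - U[u] @ V[i]) for u, i, r in ratings])
pred = U[2] @ V[1]   # predicted rating for an unobserved user-item pair
```

The protocol in the paper controls what users disclose to a service running this kind of model, so that latent factors remain useful for prediction while private attributes cannot be inferred from them.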
A Comparison Study of Second-Order Screening Designs and Their Extension
Recent literature has proposed employing a single experimental design capable of performing both factor screening and response surface estimation when conducting sequential experiments is unrealistic due to time, budget, or other constraints. Military systems, particularly aerodynamic systems, are complex. It is not unusual for these systems to exhibit nonlinear response behavior. Developmental testing may be tasked to characterize the nonlinear behavior of such systems while being restricted in how much testing can be accomplished. Second-order screening designs provide a means, in a single designed experiment, to effectively focus test resources onto those factors driving system performance. Sponsored by the Office of the Secretary of Defense (OSD) in support of the Science of Test initiative, this research characterizes and adds to the area of second-order screening designs, particularly as applied to defense testing. Existing design methods are empirically tested and examined for robustness. The leading design method, a method that is very run efficient, is extended to overcome limitations when screening for nonlinear effects. A case study and screening design guidance for defense testers are also provided.
Grid Data Management: Open Problems and New Issues
Initially developed for the scientific community, Grid computing is now gaining much interest in important areas such as enterprise information systems. This makes data management critical since the techniques must scale up while addressing the autonomy, dynamicity and heterogeneity of the data sources. In this paper, we discuss the main open problems and new issues related to Grid data management. We first recall the main principles behind data management in distributed systems and the basic techniques. Then we make precise the requirements for Grid data management. Finally, we introduce the main techniques needed to address these requirements. This implies revisiting distributed database techniques in major ways, in particular, using P2P techniques.