Finding Support Documents with a Logistic Regression Approach
Entity retrieval returns results for a user's information need at a finer-grained unit, the "entity". To retrieve such entities, systems usually first locate a small set of support documents that contain the answer entities, and then detect the answer entities within this set. In the literature, support documents are treated as relevant documents, and finding them as a conventional document retrieval problem. In this paper, we argue that finding support documents and finding relevant documents, although they sound similar, differ in important ways. We then propose a logistic regression approach to finding support documents. Our experimental results show that the logistic regression method performs significantly better than a baseline system that treats support document finding as a conventional document retrieval problem.
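A minimal sketch of the core idea, assuming scikit-learn: train a logistic regression classifier over per-document features and rank unseen documents by their predicted probability of being support documents. The features used here (retrieval score, candidate-entity count, log document length) are illustrative assumptions, not the paper's feature set.

```python
# Sketch: support-document finding as binary classification with
# logistic regression. Feature choices are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [retrieval score vs. query, #candidate entities, log doc length]
X_train = np.array([
    [12.3, 4, 6.2],
    [ 3.1, 0, 5.0],
    [ 9.8, 2, 7.1],
    [ 1.4, 1, 4.3],
])
y_train = np.array([1, 0, 1, 0])  # 1 = support document, 0 = not

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Rank new documents by predicted probability of being a support document.
X_new = np.array([[8.5, 3, 6.8], [2.0, 0, 5.5]])
print(clf.predict_proba(X_new)[:, 1])
```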
Relation Discovery from Web Data for Competency Management
This paper describes a technique for automatically discovering associations between people and expertise from an analysis of very large data sources (including web pages, blogs and emails), using a family of algorithms that perform accurate named-entity recognition, assign different weights to terms according to an analysis of document structure, and assess distances between terms in a document. My contribution is to add a social networking approach called BuddyFinder which relies on associations within a large enterprise-wide "buddy list" to help delimit the search space, and also to provide a form of 'social triangulation' whereby the system can discover documents from your colleagues that contain pertinent information about you. This work has been influential in the information retrieval community generally, as it is the basis of a landmark system that achieved overall first place in every category in the Enterprise Search Track of TREC 2006.
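One ingredient described above, assessing distances between terms in a document, can be sketched as follows. The inverse-distance scoring function is an illustrative assumption, not the paper's formula, and entity recognition is reduced to simple token matching.

```python
# Sketch: score a person-expertise association by the distance between
# the two mentions in a document (closer mentions -> stronger link).
def association_score(tokens, person, term):
    """Return a proximity-based score; 0.0 if either mention is absent."""
    p_pos = [i for i, t in enumerate(tokens) if t == person]
    t_pos = [i for i, t in enumerate(tokens) if t == term]
    if not p_pos or not t_pos:
        return 0.0
    min_dist = min(abs(p - t) for p in p_pos for t in t_pos)
    return 1.0 / (1.0 + min_dist)  # inverse-distance weighting (assumed)

doc = "alice presented new results on ontology alignment with bob".split()
print(association_score(doc, "alice", "ontology"))  # higher score
print(association_score(doc, "bob", "ontology"))    # lower score
```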
DeltaPhish: Detecting Phishing Webpages in Compromised Websites
The large-scale deployment of modern phishing attacks relies on the automatic exploitation of vulnerable websites in the wild, to maximize profit while hindering attack traceability, detection, and blacklisting. To the best of our knowledge, this is the first work that specifically leverages this adversarial behavior for detection purposes. We show that phishing webpages can be accurately detected by highlighting HTML code and visual differences with respect to other (legitimate) pages hosted within a compromised website. Our system, named DeltaPhish, can be installed as part of a web application firewall to detect the presence of anomalous content on a website after compromise, and potentially prevent access to it. DeltaPhish is also robust against adversarial attempts in which the HTML code of the phishing page is carefully manipulated to evade detection. We empirically evaluate it on more than 5,500 webpages collected in the wild from compromised websites, showing that it is capable of detecting more than 99% of phishing webpages while misclassifying less than 1% of legitimate pages. We further show that the detection rate remains higher than 70% even under very sophisticated attacks carefully designed to evade our system.
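As a hedged sketch of the difference-based idea (not DeltaPhish's actual feature set, which also includes visual similarity and a fuller pipeline): represent each page by its HTML tag distribution and measure its divergence from a known-legitimate page on the same site; a large divergence is evidence of injected phishing content.

```python
# Sketch: HTML tag-histogram divergence between a legitimate page and a
# candidate page on the same site. Feature design is an assumption here.
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = Counter()
    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def tag_histogram(html):
    p = TagCounter()
    p.feed(html)
    total = sum(p.tags.values()) or 1
    return {t: c / total for t, c in p.tags.items()}

def l1_divergence(h1, h2):
    keys = set(h1) | set(h2)
    return sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in keys)

home = "<html><body><h1>Shop</h1><p>Welcome</p><p>News</p></body></html>"
page = "<html><body><form><input><input></form></body></html>"
# A large divergence from the homepage suggests injected content.
print(l1_divergence(tag_histogram(home), tag_histogram(page)))
```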
Identifying e-Commerce in Enterprises by means of Text Mining and Classification Algorithms
Monitoring specific features of enterprises, for example the adoption of e-commerce, is an important and basic task for several economic activities. This type of information is usually obtained by means of surveys, which are costly due to the amount of personnel involved. Automatic detection of this information would therefore allow considerable savings, and is in fact feasible, since the information is generally publicly available online through corporate websites. This work describes how to convert the detection of e-commerce into a supervised classification problem, where each record is obtained from the automatic analysis of one corporate website, and the class is the presence or absence of e-commerce facilities. The automatic generation of such data records requires several text mining phases; in particular, we compare six strategies based on the selection of best words and best n-grams. We then classify the resulting dataset by means of four classification algorithms: Support Vector Machines; Random Forest; Statistical and Logical Analysis of Data; Logistic Classifier. This turns out to be a difficult classification problem; however, after a careful design and set-up of the whole procedure, the results on a practical case of Italian enterprises are encouraging.
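A hedged sketch of the pipeline described above, assuming scikit-learn: build word/n-gram features from website text and classify e-commerce presence. The toy data, TF-IDF weighting, and the choice of a linear SVM (one of the paper's four classifiers) are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: n-gram text features + SVM for e-commerce detection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

site_texts = [
    "add to cart checkout secure payment shipping",
    "company history mission contact our offices",
    "buy now basket credit card order tracking",
    "press releases annual report careers",
]
labels = [1, 0, 1, 0]  # 1 = e-commerce facilities present

# Best-words vs. best-n-grams strategies can be emulated via ngram_range.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(site_texts, labels)
print(model.predict(["secure checkout and fast shipping"]))  # -> [1]
```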
Mining web data for competency management
We present CORDER (COmmunity Relation Discovery by named Entity Recognition), an unsupervised machine learning algorithm that exploits named entity recognition and co-occurrence data to associate individuals in an organization with their expertise and associates. We discuss the problems associated with evaluating unsupervised learners and report our initial evaluation experiments.
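The co-occurrence side of this idea can be sketched as follows: count how often a recognized person entity appears in the same document as an expertise term, and rank terms per person. Entity recognition is stubbed out with precomputed entity sets, and the raw-count ranking is an illustrative assumption rather than CORDER's actual relation-strength measure.

```python
# Sketch: aggregate person/term co-occurrence counts across documents.
from collections import defaultdict

docs = [
    {"people": {"J. Smith"}, "terms": {"semantic web", "ontologies"}},
    {"people": {"J. Smith", "A. Jones"}, "terms": {"semantic web"}},
    {"people": {"A. Jones"}, "terms": {"machine learning"}},
]

cooccur = defaultdict(lambda: defaultdict(int))
for d in docs:
    for person in d["people"]:
        for term in d["terms"]:
            cooccur[person][term] += 1

# Rank expertise terms per person by co-occurrence count.
for person, terms in cooccur.items():
    print(person, "->", sorted(terms.items(), key=lambda kv: -kv[1]))
```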
Knowledge Cartography: Software tools and mapping techniques
Knowledge Cartography is the discipline of mapping intellectual landscapes. The focus of this book is on the process by which manually crafting interactive, hypertextual maps clarifies one's own understanding, as well as communicating it. The authors see mapping software as a set of visual tools for reading and writing in a networked age. In an information ocean, the primary challenge is to find meaningful patterns around which we can weave plausible narratives. Maps of concepts, discussions and arguments make the connections between ideas tangible and disputable.
With 17 chapters from leading researchers and practitioners, the reader will find the current state-of-the-art in the field. Part 1 focuses on educational applications in schools and universities, before Part 2 turns to applications in professional communities.
Automatically assembling a full census of an academic field
The composition of the scientific workforce shapes the direction of scientific research, directly through the selection of questions to investigate, and indirectly through its influence on the training of future scientists. In most fields, however, complete census information is difficult to obtain, complicating efforts to study workforce dynamics and the effects of policy. This is particularly true in computer science, which lacks a single, all-encompassing directory or professional organization. A full census of computer science would serve many purposes, not the least of which is a better understanding of the trends and causes of unequal representation in computing. Previous academic census efforts have relied on narrow or biased samples, or on professional society membership rolls. A full census can be constructed directly from online departmental faculty directories, but doing so by hand is prohibitively expensive and time-consuming. Here, we introduce a topical web crawler for automating the collection of faculty information from web-based department rosters, and demonstrate the resulting system on the 205 PhD-granting computer science departments in the U.S. and Canada. This method constructs a complete census of the field within a few minutes, and achieves over 99% precision and recall. We conclude by comparing the resulting 2017 census to a hand-curated 2011 census to quantify turnover and retention in computer science, in general and for female faculty in particular, demonstrating the types of analysis made possible by automated census construction.
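A minimal sketch of a topical crawler in the spirit described above, assuming the requests and BeautifulSoup libraries: starting from a department's faculty page, follow only same-host links whose anchor text looks roster-relevant, and collect candidate faculty entries. The seed URL, keyword list, and name heuristic are all illustrative assumptions; the actual system is more careful on each count.

```python
# Sketch: keyword-guided topical crawler for faculty rosters.
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

SEED = "https://cs.example.edu/people/faculty"  # hypothetical roster page
KEYWORDS = ("faculty", "people", "directory")

def crawl(seed, max_pages=20):
    host = urlparse(seed).netloc
    seen, frontier, names = set(), [seed], []
    while frontier and len(seen) < max_pages:
        url = frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        # Crude heuristic: list items mentioning an academic title.
        names += [li.get_text(strip=True) for li in soup.find_all("li")
                  if "Professor" in li.get_text()]
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            # Stay on the department's host; follow topical links only.
            if (urlparse(link).netloc == host
                    and any(k in a.get_text().lower() for k in KEYWORDS)):
                frontier.append(link)
    return names

print(crawl(SEED))
```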