i2MapReduce: Incremental MapReduce for Mining Evolving Big Data
As new data and updates are constantly arriving, the results of data mining
applications become stale and obsolete over time. Incremental processing is a
promising approach to refreshing mining results. It utilizes previously saved
states to avoid the expense of re-computation from scratch.
In this paper, we propose i2MapReduce, a novel incremental processing
extension to MapReduce, the most widely used framework for mining big data.
Compared with the state-of-the-art work on Incoop, i2MapReduce (i) performs
key-value pair level incremental processing rather than task level
re-computation, (ii) supports not only one-step computation but also more
sophisticated iterative computation, which is widely used in data mining
applications, and (iii) incorporates a set of novel techniques to reduce I/O
overhead for accessing preserved fine-grain computation states. We evaluate
i2MapReduce using a one-step algorithm and three iterative algorithms with
diverse computation characteristics. Experimental results on Amazon EC2 show
significant performance improvements of i2MapReduce compared to both plain and
iterative MapReduce performing re-computation.
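The idea of key-value pair level incremental processing can be illustrated with a toy word count. This is a minimal sketch under stated assumptions, not the i2MapReduce implementation: the class `IncrementalWordCount`, its `apply_delta` method, and the in-memory state layout are hypothetical names chosen for illustration. The job preserves each document's map output and the reduce results, so a changed document only adjusts the keys whose counts actually differ.

```python
from collections import Counter


def map_doc(text):
    """Map step: emit word -> count pairs for one document."""
    return Counter(text.lower().split())


class IncrementalWordCount:
    """Toy one-step job that keeps fine-grain state between runs (hypothetical design)."""

    def __init__(self):
        self.map_outputs = {}    # preserved map state: doc_id -> Counter
        self.totals = Counter()  # preserved reduce state: word -> total count

    def apply_delta(self, doc_id, new_text):
        """Insert or update a document (new_text is a str) or delete it (new_text is None).

        Returns the set of keys whose reduce result changed.
        """
        old = self.map_outputs.pop(doc_id, Counter())
        new = map_doc(new_text) if new_text is not None else Counter()
        if new_text is not None:
            self.map_outputs[doc_id] = new

        # Only keys whose per-document count differs need their reduce value adjusted.
        changed = {w for w in set(old) | set(new) if old[w] != new[w]}
        for word in changed:
            self.totals[word] += new[word] - old[word]
            if self.totals[word] == 0:
                del self.totals[word]
        return changed


if __name__ == "__main__":
    job = IncrementalWordCount()
    job.apply_delta("d1", "big data big")
    job.apply_delta("d2", "data mining")
    print(job.apply_delta("d1", "big data"), dict(job.totals))
```

A task-level scheme would instead rerun the whole map task that contains the changed document; here the unchanged keys are never touched.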
Open-world Learning and Application to Product Classification
Classic supervised learning makes the closed-world assumption, meaning that
classes seen in testing must have been seen in training. However, in the
dynamic world, new or unseen class examples may appear constantly. A model
working in such an environment must be able to reject unseen classes (not seen
or used in training). If enough data is collected for the unseen classes, the
system should incrementally learn to accept/classify them. This learning
paradigm is called open-world learning (OWL). Existing OWL methods all need
some form of re-training to accept or include the new classes in the overall
model. In this paper, we propose a meta-learning approach to the problem. Its
key novelty is that it only needs to train a meta-classifier, which can then
continually accept new classes when they have enough labeled data for the
meta-classifier to use, and also detect/reject future unseen classes. No
re-training of the meta-classifier or a new overall classifier covering all old
and new classes is needed. In testing, the method only uses the examples of the
seen classes (including the newly added classes) on-the-fly for classification
and rejection. Experimental results demonstrate the effectiveness of the new
approach.
Comment: accepted by The Web Conference (WWW 2019). Previous title: Learning to
Accept New Classes without Trainin
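The accept/reject behaviour described in this abstract can be sketched roughly as follows. This is an illustrative assumption, not the paper's method: a fixed cosine similarity stands in for the learned meta-classifier, and the class `OpenWorldClassifier`, its `threshold` and `top_k` parameters, and the toy vectors are all hypothetical. The point it shows is that adding a class only means storing its examples, while a test example that is not similar enough to any seen class is rejected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OpenWorldClassifier:
    """Toy classifier that compares a test example with stored examples of seen classes."""

    def __init__(self, threshold=0.8, top_k=2):
        self.threshold = threshold  # assumed rejection cut-off
        self.top_k = top_k          # number of most similar examples scored per class
        self.examples = {}          # label -> list of example vectors

    def add_class(self, label, vectors):
        """Accept a new class by storing its labeled examples; nothing is re-trained."""
        self.examples.setdefault(label, []).extend(vectors)

    def predict(self, vector):
        """Return the best-matching seen class, or None to reject as unseen."""
        best_label, best_score = None, -1.0
        for label, vectors in self.examples.items():
            if not vectors:
                continue
            sims = sorted((cosine(vector, v) for v in vectors), reverse=True)
            top = sims[: self.top_k]
            score = sum(top) / len(top)
            if score > best_score:
                best_label, best_score = label, score
        return best_label if best_score >= self.threshold else None


if __name__ == "__main__":
    clf = OpenWorldClassifier()
    clf.add_class("laptop", [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
    clf.add_class("phone", [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]])
    print(clf.predict([1.0, 0.05, 0.0]), clf.predict([0.0, 0.0, 1.0]))
```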
Rank-aware, Approximate Query Processing on the Semantic Web
Search over the Semantic Web corpus frequently leads to queries with large result sets. To discover relevant data elements, users must therefore rely on ranking techniques that sort results by relevance. At the same time, applications often deal with information needs that do not require complete and exact results. In this thesis, we address the problem of how to process queries over Web data in an approximate and rank-aware fashion.
Bidding for Complex Projects: Evidence From the Acquisitions of IT Services
Competitive bidding (such as auctions) is commonly used to procure goods and services. Public buyers are often mandated by law to adopt competitive procedures to ensure transparency and promote full competition. Recent theoretical literature, however, suggests that open competition can perform poorly in allocating complex projects. In exploring the determinants of suppliers’ bidding behavior in procurement auctions for complex IT services, we find results that are consistent with theory. We find that price and quality do not exhibit the classical tradeoff one would expect: quite surprisingly, high quality is associated with low prices. Furthermore, while quality is mainly driven by suppliers’ experience, price is affected more by the scoring rule and by the level of expected competition. These results might suggest that (scoring) auctions fail to appropriately incorporate buyers’ complex price/quality preferences in the tender design.
Keywords: Procurement Auctions, Scoring Rules, IT Contracts, Price/Quality Ratio
Mining Web Dynamics for Search
Billions of web users collectively contribute to a dynamic web that preserves how information sources and descriptions change over time. This dynamic process sheds light on the quality of web content, and even indicates the temporal properties of information needs expressed via queries. However, existing commercial search engines typically utilize one crawl of web content (the latest) without considering the complementary information concealed in web dynamics. As a result, the generated rankings may be biased due to the deficiency of knowledge on page or hyperlink evolution, and the time-sensitive facets of search quality, e.g., freshness, have to be neglected. While previous research efforts have focused on exploring the temporal dimension in the retrieval process, few of them showed consistent improvements on a large-scale, real-world archival web corpus with a broad time span. We investigate how to utilize the changes of web pages and hyperlinks to improve search quality, in terms of freshness and relevance of search results. Three applications that I have focused on are: (1) document representation, in which the importance of anchortext (short descriptive text associated with hyperlinks) is estimated by considering its historical status; (2) web authority estimation, in which web freshness is quantified and utilized for controlling the authority propagation; and (3) learning to rank, in which freshness and relevance are optimized simultaneously in an adaptive way depending on query type. The contributions of this thesis are: (1) incorporating web dynamics information into critical components within search infrastructure in a principled way; and (2) empirically verifying the proposed methods by conducting experiments on a large-scale, real-world archival web corpus, and demonstrating their superiority over the existing state of the art.