169 research outputs found
Contexts and Contributions: Building the Distributed Library
This report updates and expands on A Survey of Digital Library Aggregation Services, originally commissioned by the DLF as an internal report in summer 2003, and released to the public later that year. It highlights major developments affecting the ecosystem of scholarly communications and digital libraries since the last survey and provides an analysis of OAI implementation demographics, based on a comparative review of repository registries and cross-archive search services. Secondly, it reviews the state-of-practice for a cohort of digital library aggregation services, grouping them in the context of the problem space to which they most closely adhere. Based in part on responses collected in fall 2005 from an online survey distributed to the original core services, the report investigates the purpose, function and challenges of next-generation aggregation services. On a case-by-case basis, the advances in each service are of interest in isolation from each other, but the report also attempts to situate these services in a larger context and to understand how they fit into a multi-dimensional and interdependent ecosystem supporting the worldwide community of scholars. Finally, the report summarizes the contributions of these services thus far and identifies obstacles requiring further attention to realize the goal of an open, distributed digital library system
International conference on software engineering and knowledge engineering: Session chair
The Thirtieth International Conference on Software Engineering and Knowledge Engineering (SEKE 2018) will be held at the Hotel Pullman, San Francisco Bay, USA, from July 1 to July 3, 2018. SEKE2018 will also be dedicated in memory of Professor Lofti Zadeh, a great scholar, pioneer and leader in fuzzy sets theory and soft computing.
The conference aims at bringing together experts in software engineering and knowledge engineering to discuss on relevant results in either software engineering or knowledge engineering or both. Special emphasis will be put on the transference of methods between both domains. The theme this year is soft computing in software engineering & knowledge engineering. Submission of papers and demos are both welcome
Department of Computer Science Activity 1998-2004
This report summarizes much of the research and teaching activity of the Department of Computer Science at Dartmouth College between late 1998 and late 2004. The material for this report was collected as part of the final report for NSF Institutional Infrastructure award EIA-9802068, which funded equipment and technical staff during that six-year period. This equipment and staff supported essentially all of the department\u27s research activity during that period
An Active Learning Approach to Efficiently Ranking Retrieval Engines
Evaluating retrieval systems, such as those submitted to the annual TREC competition, usually requires a large number of documents to be read and judged for relevance to query topics. Test collections are far too big to be exhaustively judged, so only a subset of documents is selected to form the judgment ``pool.\u27\u27 The selection method that TREC uses produces pools that are still quite large. Research has indicated that it is possible to rank the retrieval systems correctly using substantially smaller pools. This paper introduces an active learning algorithm whose goal is to reach the correct rankings using the smallest possible number of relevance judgments. It adds one document to the pool at a time, always trying to select the document with the highest information gain. Several variants of this algorithm are described, each with improvements on the one before. Results from experiments are included for comparison with the traditional TREC pooling method. The best version of the algorithm reliably outperforms the traditional method, although its degree of improvement varies
Organisation of Web pages Using Document Clustering Techniques: Some Algorithms
With the increase in information on the World Wide Web it has
become difficult to find the desired information on search
engines. Clustering Techniques are now being used to give a
meaningful search result on web. This paper gives an idea
about Web Page document clustering, different algorithms
including similarity measures and clustering engines
Search beyond traditional probabilistic information retrieval
"This thesis focuses on search beyond probabilistic information retrieval. Three ap- proached are proposed beyond the traditional probabilistic modelling. First, term associ- ation is deeply examined. Term association considers the term dependency using a factor analysis based model, instead of treating each term independently. Latent factors, con- sidered the same as the hidden variables of ""eliteness"" introduced by Robertson et al. to gain understanding of the relation among term occurrences and relevance, are measured by the dependencies and occurrences of term sequences and subsequences. Second, an entity-based ranking approach is proposed in an entity system named ""EntityCube"" which has been released by Microsoft for public use. A summarization page is given to summarize the entity information over multiple documents such that the truly relevant entities can be highly possibly searched from multiple documents through integrating the local relevance contributed by proximity and the global enhancer by topic model. Third, multi-source fusion sets up a meta-search engine to combine the ""knowledge"" from different sources. Meta-features, distilled as high-level categories, are deployed to diversify the baselines. Three modified fusion methods are employed, which are re- ciprocal, CombMNZ and CombSUM with three expanded versions. Through extensive experiments on the standard large-scale TREC Genomics data sets, the TREC HARD data sets and the Microsoft EntityCube Web collections, the proposed extended models beyond probabilistic information retrieval show their effectiveness and superiority.
Untangling the Web: A Guide To Internet Research
[Excerpt] Untangling the Web for 2007 is the twelfth edition of a book that started as a small handout. After more than a decade of researching, reading about, using, and trying to understand the Internet, I have come to accept that it is indeed a Sisyphean task. Sometimes I feel that all I can do is to push the rock up to the top of that virtual hill, then stand back and watch as it rolls down again. The Internet—in all its glory of information and misinformation—is for all practical purposes limitless, which of course means we can never know it all, see it all, understand it all, or even imagine all it is and will be. The more we know about the Internet, the more acute is our awareness of what we do not know. The Internet emphasizes the depth of our ignorance because our knowledge can only be finite, while our ignorance must necessarily be infinite. My hope is that Untangling the Web will add to our knowledge of the Internet and the world while recognizing that the rock will always roll back down the hill at the end of the day
Online Data Cleaning
Data-centric applications have never been more ubiquitous in our lives, e.g., search engines, route navigation and social media. This has brought along a new age where digital data is at the core of many decisions we make as individuals, e.g., looking for the most scenic route to plan a road trip, or as professionals, e.g., analysing customers’ transactions to predict the best time to restock different products. However, the surge in data generation has also led to creating massive amounts of dirty data, i.e., inaccurate or redundant data. Using dirty data to inform business decisions comes with dire consequences, for instance, an IBM report estimates that dirty data costs the U.S. $3.1 trillion a year.
Dirty data is the product of many factors which include data entry errors and integration of several data sources. Data integration of multiple sources is especially prone to producing dirty data. For instance, while individual sources may not have redundant data, they often carry redundant data across each other. Furthermore, different data sources may obey different business rules (sometimes not even known) which makes it challenging to reconcile the integrated data. Even if the data is clean at the time of the integration, data updates would compromise its quality over time.
There is a wide spectrum of errors that can be found in the data, e,g, duplicate records, missing values, obsolete data, etc. To address these problems, several data cleaning efforts have been proposed, e.g., record linkage to identify duplicate records, data fusion to fuse duplicate data items into a single representation and enforcing integrity constraints on the data. However, most existing efforts make two key assumptions: (1) Data cleaning is done in one shot; and (2) The data is available in its entirety. Those two assumptions do not hold in our age where data is highly volatile and integrated from several sources. This calls for a paradigm shift in approaching data cleaning: it has to be made iterative where data comes in chunks and not all at once. Consequently, cleaning the data should not be repeated from scratch whenever the data changes, but instead, should be done only for data items affected by the updates. Moreover, the repair should be computed effciently to support applications where cleaning is performed online (e.g. query time data cleaning). In this dissertation, we present several proposals to realize this paradigm for two major types of data errors: duplicates and integrity constraint violations.
We frst present a framework that supports online record linkage and fusion over Web databases. Our system processes queries posted to Web databases. Query results are deduplicated, fused and then stored in a cache for future reference. The cache is updated iteratively with new query results. This effort makes it possible to perform record linkage and fusion effciently, but also effectively, i.e., the cache contains data items seen in previous queries which are jointly cleaned with incoming query results.
To address integrity constraints violations, we propose a novel way to approach Functional Dependency repairs, develop a new class of repairs and then demonstrate it is superior to existing efforts, in runtime and accuracy. We then show how our framework can be easily tuned to work iteratively to support online applications. We implement a proof-ofconcept query answering system to demonstrate the iterative capability of our system
- …