305 research outputs found
Comparing fbeta-optimal with distance based merge functions
Merge functions informally combine information from a certain universe into a solution over that same universe, typically yielding a, preferably optimal, summarization. In previous research, merge functions over sets have been studied extensively. A specific case concerns sets that allow elements to appear more than once: multisets. In this paper we compare two types of merge functions over multisets against each other. We examine both their general properties and their practical usability in a real-world application.
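The abstract does not specify the paper's fbeta-optimal or distance-based merge functions, but the idea of a distance-based merge over multisets can be sketched under one common assumption: measuring the distance between two multisets as the sum of absolute differences of element multiplicities (an L1 distance). Under that distance, the element-wise median of multiplicities minimizes the total distance to the inputs. The function below is an illustrative sketch, not the paper's construction:

```python
from collections import Counter
from statistics import median_low

def distance_merge(multisets):
    """Merge multisets by taking, for each element, the low median of its
    multiplicities across the inputs. Under the assumed L1 multiset
    distance, the per-element median minimizes the summed distance."""
    elements = set().union(*multisets)
    merged = Counter()
    for e in elements:
        m = median_low(ms[e] for ms in multisets)  # Counter returns 0 if absent
        if m > 0:
            merged[e] = m
    return merged

a = Counter("aab")  # {a: 2, b: 1}
b = Counter("abb")  # {a: 1, b: 2}
c = Counter("ab")   # {a: 1, b: 1}
print(distance_merge([a, b, c]))  # Counter({'a': 1, 'b': 1})
```

The merged multiset keeps one copy each of `a` and `b`, since the median multiplicity of both across the three inputs is 1.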
Enhanced Web Search Engines with Query-Concept Bipartite Graphs
With the rapid growth of information on the Web, Web search engines have gained great momentum for exploiting valuable Web resources. Although keyword-based Web search engines provide relevant search results in response to users’ queries, further enhancement is still needed. Three important issues are that (1) search results can be diverse, because ambiguous keywords in queries can be interpreted with different meanings; (2) identifying the key terms in long queries is difficult for search engines; and (3) generating query-specific Web page summaries is desirable for previewing Web search results. Based on clickthrough data, this thesis proposes a query-concept bipartite graph for representing relations between queries, and applies these relations to (1) personalized query suggestion, (2) Web search with long queries, and (3) query-specific Web page summarization. Experimental results show that query-concept bipartite graphs improve performance on all three applications.
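The thesis's exact graph construction is not given in the abstract, but the core structure can be sketched: queries and concepts form the two sides of a bipartite graph, with edges derived from clickthrough data, and related queries are found by a two-hop walk (query → shared concept → other queries). The concept-extraction step and the toy log below are assumptions for illustration:

```python
from collections import defaultdict

# Toy clickthrough log as (query, clicked concept) pairs. How concepts
# are extracted from clicked pages is assumed away here.
clicks = [
    ("jaguar speed", "animal"),
    ("jaguar price", "car"),
    ("fast cats", "animal"),
    ("cheetah", "animal"),
]

# The bipartite graph as two adjacency maps, one per side.
query_to_concepts = defaultdict(set)
concept_to_queries = defaultdict(set)
for q, c in clicks:
    query_to_concepts[q].add(c)
    concept_to_queries[c].add(q)

def related_queries(query):
    """Queries sharing at least one concept with `query`: a two-hop walk
    in the bipartite graph, usable as a basis for query suggestions."""
    related = set()
    for c in query_to_concepts[query]:
        related |= concept_to_queries[c]
    related.discard(query)
    return related

print(related_queries("jaguar speed"))  # {'fast cats', 'cheetah'}
```

Note how the concept node disambiguates: "jaguar speed" relates to the other animal queries but not to "jaguar price", which attaches to a different concept.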
Text Summarization
Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks) [2]. A text summarization system that condenses the bulk of information and presents only the most important points inevitably makes reading and understanding a text easier and faster. With a large volume of text documents, a summary of each document greatly facilitates the task of finding the desired documents, and the desired data within them. To address this, the project's objective is to simplify the texts produced by a previous text summarization system: further reducing the number of words in a sentence, shortening sentences, eliminating sentences with similar meanings, and producing grammar rules that generate human-like sentences. The waterfall model was chosen as the project development life cycle. Detailed research was conducted during the requirements definition phase, and the system prototype was designed in the system and software design phase. During the development phase, the code is implemented and unit testing is carried out throughout. After every unit has been tested, the units are integrated and the system is tested as a whole. The complete program is put through thorough testing and evaluation to ensure its functionality and efficiency. In conclusion, this project should produce a summarized text as its output and meet the project requirements and objectives.
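One of the steps named above, eliminating sentences with similar meanings, can be sketched with a simple word-overlap (Jaccard) filter. The abstract does not say which similarity measure the project uses; the threshold and measure below are assumptions for illustration:

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def drop_similar(sentences, threshold=0.6):
    """Keep each sentence only if it is not too similar to one already kept,
    a greedy pass that removes near-duplicate meanings from a summary."""
    kept = []
    for s in sentences:
        if all(jaccard(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

text = [
    "The cat sat on the mat.",
    "The cat sat upon the mat.",
    "Dogs bark at strangers.",
]
print(drop_similar(text))  # drops the near-duplicate second sentence
```

A real system would likely use a semantic measure rather than raw word overlap, but the greedy keep-or-drop structure is the same.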
Similarity search and data mining techniques for advanced database systems.
Modern automated methods for the measurement, collection, and analysis of data in industry and science are providing more and more data with drastically increasing structural complexity. On the one hand, this growing complexity is justified by the need for a richer and more precise description of real-world objects; on the other hand, it is justified by the rapid progress in measurement and analysis techniques that allow the user a versatile exploration of objects. In order to manage the huge volume of such complex data, advanced database systems are employed. In contrast to conventional database systems, which support exact-match queries, users of these advanced database systems focus on similarity search and data mining techniques.
Based on an analysis of typical advanced database systems — such as biometrical, biological, multimedia, moving, and CAD-object database systems — the following three challenging characteristics of complexity are detected: uncertainty (probabilistic feature vectors), multiple instances (a set of homogeneous feature vectors), and multiple representations (a set of heterogeneous feature vectors). Therefore, the goal of this thesis is to develop similarity search and data mining techniques that are capable of handling uncertain, multi-instance, and multi-represented objects.
The first part of this thesis deals with similarity search techniques. Object identification is a similarity search technique typically used for the recognition of objects from image, video, or audio data. We develop a novel probabilistic model for object identification and, based on this model, define two novel types of identification queries. To process these query types efficiently, we introduce an index structure called the Gauss-tree. In addition, we specify further probabilistic models and query types for uncertain multi-instance objects and uncertain spatial objects, and develop algorithms on top of the index structure for efficient processing of these query types. The practical benefits of probabilistic feature vectors are demonstrated in a real-world application for video similarity search. Furthermore, a similarity search technique based on aggregated multi-instance objects, also suitable for video similarity search, is presented; it takes multiple representations into account in order to achieve better effectiveness.
The second part of this thesis deals with two major data mining techniques: clustering and classification. Since privacy preservation is a very important demand of distributed advanced applications, we propose using uncertainty for data obfuscation in order to provide privacy preservation during clustering. Furthermore, a model-based and a density-based clustering method for multi-instance objects are developed. Afterwards, original extensions and enhancements of the density-based clustering algorithms DBSCAN and OPTICS for handling multi-represented objects are introduced. Since several advanced database systems like biological or multimedia database systems handle predefined, very large class systems, two novel classification techniques for large class sets that benefit from using multiple representations are defined. The first classification method is based on the idea of a k-nearest-neighbor classifier. It employs a novel density-based technique to reduce training instances and exploits the entropy impurity of the local neighborhood in order to weight a given representation. The second technique addresses hierarchically-organized class systems. It uses a novel hierarchical, supervised method for the reduction of large multi-instance objects, e.g. audio or video, and applies support vector machines for efficient hierarchical classification of multi-represented objects. User benefits of this technique are demonstrated by a prototype that performs a classification of large music collections.
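Multi-instance objects, as described above, are sets of homogeneous feature vectors, so clustering them requires a distance between sets rather than between single vectors. The thesis's own distance measures are not given in the abstract; one cheap, commonly used choice for multi-instance data, the minimal (modified) Hausdorff distance, is sketched here as an assumption:

```python
import math

def min_hausdorff(A, B):
    """Minimal Hausdorff-style distance between two instance sets:
    the smallest pairwise Euclidean distance. A common, inexpensive
    set distance for multi-instance data; a density-based algorithm
    such as DBSCAN can then cluster the sets using this distance."""
    return min(math.dist(a, b) for a in A for b in B)

# Two multi-instance objects, each a small set of 2-D feature vectors.
obj1 = [(0.0, 0.0), (1.0, 0.0)]
obj2 = [(0.0, 3.0), (1.0, 1.0)]
print(min_hausdorff(obj1, obj2))  # 1.0, between (1, 0) and (1, 1)
```

Plugging such a set distance into DBSCAN or OPTICS is one straightforward way to lift a density-based clustering algorithm from vectors to multi-instance objects.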
The effectiveness and efficiency of all proposed techniques are discussed and verified by comparison with conventional approaches in versatile experimental evaluations on real-world datasets.
The Influence of Visual Provenance Representations on Strategies in a Collaborative Hand-off Data Analysis Scenario
Data analysis rarely occurs in isolation. Especially in intelligence analysis scenarios, where different experts contribute knowledge to a shared understanding, members must communicate how insights develop in order to establish common ground among collaborators. The use of provenance to communicate analytic sensemaking is promising: it describes the interactions and summarizes the steps taken to reach insights. Yet no universal guidelines exist for communicating provenance in different settings. Our work focuses on the presentation of provenance information and the resulting conclusions reached and strategies used by new analysts. In an open-ended, 30-minute, textual exploration scenario, we qualitatively compare how adding different types of provenance information (specifically, data coverage and interaction history) affects analysts' confidence in the conclusions they develop, their propensity to repeat work, their filtering of data, their identification of relevant information, and their typical investigation strategies. We find that data coverage (i.e., what was interacted with) provides provenance information without limiting individual investigation freedom. On the other hand, while interaction history (i.e., when something was interacted with) does not significantly encourage more mimicry, it does take more time to understand comfortably, as reflected in less confident conclusions and less relevant information-gathering behaviors. Our results contribute empirical data towards understanding how provenance summarizations can influence analysis behaviors.
Comment: to be published in IEEE Vis 202