7 research outputs found
Clustering Information Retrieval Search Outputs
Users are known to have difficulties in dealing with information retrieval search outputs, especially if the outputs are above a certain size. It has been argued by several researchers that search output clustering can help users in their interaction with IR systems. Clustering may provide users with an overview of the output by exploiting the topicality information that resides in the output but has not been used in the retrieval stage. It can enable them to find the relevant documents more easily and also help them to form an understanding of the different facets of the query that have been provided for their inspection. This project aimed to investigate the viability of using clustering as a way of mediating users’ interaction with search outputs and attempted to identify its possible benefits.
Can and Ozkarahan’s (1990) C3M algorithm was used to test the effectiveness of clustering as a way of search output presentation. C3M is a relatively simple, non-hierarchical method that has been shown to give results comparable or superior to those of the best-known hierarchical methods.
The method was implemented in TCL and linked to the department’s experimental IR system Okapi. Implementation included a procedure of term selection for document representation which preceded the clustering process and a procedure involving cluster representation for users’ viewing following the clustering process. After some tuning of the implementation parameters for the databases used, several experiments were designed and conducted to assess whether clusters could group documents in useful ways.
One group of experiments aimed to assess the ability of the implementation to bring together topically related documents. It was quite difficult to gather data for such an assessment, but the existence of a set of data generated for the TREC Interactive track (1996) enabled us to design experiments that at least approximately satisfied our objective. TREC provided a set of queries, and groups of relevant documents with facet assignments made by expert users. It was thus possible to make an inference by measuring the correlation between the clusters that relevant documents were assigned to and the facet assignments made for the documents by TREC experts.
The utility of this data set was limited for various reasons discussed in the related chapters. However, it can be concluded that clusters cannot be relied on to bring together relevant documents assigned to a certain facet. While there was some correlation between the cluster and facet assignments of the documents when the clustering was done only on relevant documents, no correlation could be found when the clustering was based on results of queries defined by City participants to the Interactive track.
Another group of experiments was conducted to compare output clustering with relevance ranking as a search output representation method. This comparison was necessary as an immediate consequence of clustering search output would be the loss of relevance ranking. It had to be assessed whether clustering could help users to find the relevant documents more easily than by relevance ranking, before any clustering solution could be proposed as an alternative to relevance ranked output.
For this purpose, two sets of user experiments (n=20 and n=57) were conducted based on the users’ own information needs. While changes were made to the implementation between the first and the second set of experiments, the experimental design was almost the same in both runs. Users were first asked to rank clusters formed from the search output (top 50 documents) and then to make relevance judgements for the individual documents in the same output. The precision of the cluster(s) marked best by the users was then compared to the precision values that would be attained by relevance ranking at comparable thresholds.
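The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy document ids, and the reading of "comparable threshold" as "the same number of documents as the chosen cluster" are not taken from the thesis.

```python
def precision(doc_ids, relevant):
    """Fraction of doc_ids judged relevant (0.0 for an empty list)."""
    if not doc_ids:
        return 0.0
    return sum(1 for d in doc_ids if d in relevant) / len(doc_ids)


def compare_top_cluster(ranked, clusters, best_cluster, relevant):
    """Precision of the user's best cluster vs. the ranked list cut at the same size."""
    cluster_docs = clusters[best_cluster]
    # Assumption: a "comparable threshold" means the same number of documents.
    threshold = len(cluster_docs)
    return precision(cluster_docs, relevant), precision(ranked[:threshold], relevant)


# Hypothetical toy data: a ranked output, two clusters, and relevance judgements.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
clusters = {"A": ["d2", "d5", "d6"], "B": ["d1", "d3", "d4"]}
relevant = {"d1", "d2", "d5", "d6"}

cluster_p, ranked_p = compare_top_cluster(ranked, clusters, "A", relevant)
print(cluster_p, ranked_p)
```

In this toy case the cluster marked best happens to beat the ranked list at the same cut-off; the experiments reported here found no significant difference overall.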
The results from the first group of user experiments were not conclusive (in part due to the small size of the data set), but they drew our attention to the importance of the representation of clusters and documents for users’ viewing. After some changes to the implementation, mainly related to representation issues, and an intermediate set of 10 experiments to assess two new representation formats, a set of 57 user experiments was conducted to measure and compare precision values attainable by clustering versus relevance ranking.
These experiments revealed no significant precision difference between clustered outputs and ranked lists. The number of cases where one method performed better than the other was slightly higher for the ranked lists at the top cluster level and slightly higher for the clustered representation at the top two clusters level. However, the overall average precision values were higher for the ranked list at both levels.
As such, clustering did not appear to be preferable to ranked lists, especially as it also represented overheads in both the computing time and resources involved in creating the clusters, and the time and effort taken by the users to inspect them.
An interesting outcome of the user experiments was the ability of the users to identify clusters that do not include relevant information. There were fewer relevant documents among the clusters marked last by the users than among the documents ranked last at similar threshold levels. This brought out the possibility of using clusters as an exclusion tool to improve the precision of ranked lists. After exclusion of documents from the last cluster, ranked lists performed significantly better than the clusters at the top cluster level.
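A minimal sketch of the exclusion idea, assuming it amounts to dropping every document of the cluster the user marked last and then reading the ranked list as usual. The helper names and toy data are hypothetical.

```python
def exclude_last_cluster(ranked, clusters, last_cluster):
    """Remove documents of the cluster marked last, keeping the original rank order."""
    excluded = set(clusters[last_cluster])
    return [d for d in ranked if d not in excluded]


def precision_at(ranked, relevant, k):
    """Precision over the first k documents of a ranked list."""
    top = ranked[:k]
    return sum(1 for d in top if d in relevant) / len(top) if top else 0.0


# Hypothetical toy data.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
clusters = {"A": ["d2", "d5"], "B": ["d1", "d4"], "C": ["d3", "d6"]}
relevant = {"d1", "d2", "d4"}

before = precision_at(ranked, relevant, 3)
filtered = exclude_last_cluster(ranked, clusters, "C")
after = precision_at(filtered, relevant, 3)
print(before, after)
```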
There was also some evidence (consisting of observation of users during the experiments and a few user comments) that clusters could be used to provide the users with a glimpse of the search results, in order to decide whether to inspect the search results or initiate a new query straight away.
In summary, the cumulative experiment results imply that clustering cannot outperform relevance ranking, and seems to deserve only a secondary role in users’ interaction with IR systems. However, it should also be noted that the experiment results are not representative of the whole set of possible user types and search situations, and it may be possible to identify search situations where clustering can be more beneficial than relevance ranking.
Improvement of Information Retrieval Systems by Using Hidden Vertical Search
The exponential growth of the number of documents in digital libraries and on the Web calls for very intensive development of retrieval systems. One possible architectural approach to IRS, an architecture with hidden verticals, is proposed in this paper. In IRS with hidden verticals, documents from the searched corpus are stored in a predefined set of classes. The user's query is classified before the search, and searching is done only within the corresponding class. The performance of the proposed system is compared to the performance of a standard IRS (which contains a single inverted index) and an IRS with cluster pruning (in which the search corpus is clustered, the query is first compared to the clusters' centroids, and the search is then done only in the most similar cluster). Search time in the proposed system is 7.9 times shorter than in the standard IRS and 1.7 times shorter than in the system with cluster pruning. The precision of the proposed system is 2.59 times higher than the precision of the standard IRS, and 1.68 times higher than that of the IRS with cluster pruning. The recall of the proposed system is 1.09 times lower than the recall of the standard IRS, but it is 1.28 times higher than the recall of the IRS with cluster pruning. Based on the above results, we can say that the proposed approach reduces search time and increases search precision with a minimal reduction in recall.
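The classify-then-search flow can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the two classes, the toy documents, the vocabulary-overlap query classifier, and the Boolean AND search stand in for whatever classifier and ranking the paper actually uses.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def classify_query(query, class_vocab):
    """Pick the class whose vocabulary shares the most terms with the query."""
    terms = set(query.lower().split())
    return max(class_vocab, key=lambda c: len(terms & class_vocab[c]))


def search(query, index):
    """Return ids of documents containing every query term."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical corpus already split into two classes (the "hidden verticals").
verticals = {
    "sport": {"s1": "football match result", "s2": "tennis match schedule"},
    "finance": {"f1": "stock market result", "f2": "bank interest rate"},
}
indexes = {name: build_index(docs) for name, docs in verticals.items()}
class_vocab = {name: set(idx) for name, idx in indexes.items()}

query = "tennis match"
chosen = classify_query(query, class_vocab)  # only this class's index is searched
hits = search(query, indexes[chosen])
print(chosen, hits)
```

Because only one per-class index is consulted, the posting lists touched are shorter than in a single global index, which is the intuition behind the reported search-time reduction; the recall loss corresponds to relevant documents stored in a class the query was not routed to.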
Efficient Communication in Agent-based Autonomous Logistic Processes
Transportation of goods plays a vital role for the success of a logistics network. The ability to transport goods quickly and cost effectively is one of the major requirements of the customers. Dynamics involved in the logistics process like change or cancellation of orders or uncertain information about the orders add to the complexity of the logistic network and can even reduce the efficiency of the entire logistics process. This brings about a need of integrating technology and making the system more autonomous to handle these dynamics and to reduce the complexity. Therefore, the distributed logistics routing protocol (DLRP) was developed at the University of Bremen. In this thesis, DLRP is extended with the concept of clustering of transport goods, two novel routing decision schemes and a negotiation process between the cluster of goods and the vehicle. DLRP provides the individual logistic entities the ability to perform routing tasks autonomously e.g., discovering the best route to the destination at the given time. Even though DLRP seems to solve the routing problem in real-time, the amount of message flooding involved in the route discovery process is enormous. This motivated the author to introduce a cluster-based routing approach using software agents. The DLRP along with the clustering algorithm is termed as the cluster-based DLRP. In the latter, the goods are first clustered into groups based on criteria such as the common destination. The routing is now handled by the cluster head rather than the individual transport goods which results in a reduced communication volume in the route discovery. The latter is proven by evaluating the performance of the cluster-based DLRP approach compared to the legacy DLRP. After the routing process is completed by the cluster heads, the next step is to improve the transport performance in the logistics network by identifying the best means to transport the clustered goods. 
For example, to have better utilization of the transport capacity, clusters can be transported together on a stretch of overlapping route. In order to make optimal transport decisions, the vehicle calculates the correlation metric of the routes selected by the various clusters. The correlation metric aids in identifying the clusters which can be transported together and thereby can result in better utilization of the transport resources. In turn, the transportation cost that has to be paid to the vehicle can be shared between the different clusters. The transportation cost for a stretch of route is calculated by the vehicle and offered to the cluster. The latter can decide based upon the transportation cost or the selected route whether to accept the transport offer from the vehicle or not. In this regard, different strategies are developed and investigated. Thereby a performance evaluation of the capacity utilization of the vehicle and the transportation cost incurred by the cluster is presented. Finally, the thesis introduces the concept of negotiation in the cluster-based routing methods. The negotiation process enhances the transport decisions by giving the clusters and the vehicles the flexibility to negotiate the transportation cost. Thus, the focus of this part of the thesis is to analyse the negotiation strategies used by the logistics entities and their role in saving negotiation time while achieving a favorable transportation cost. In this regard, a performance evaluation of the different proposed strategies is presented, which in turn gives the logistics practitioners an overview of the best strategy to be deployed in various scenarios. Clustering of goods aids the negotiation process because, on the one hand, a group of transport goods has a stronger basis for negotiation to achieve a favorable transportation price from the vehicle.
On the other hand, it makes it easier for the vehicle to select the packages for transport and helps the vehicle to operate close to its capacity. In addition, clustering enables the negotiation process to be less complex and voluminous. From the analytical considerations and obtained results in the three parts of this thesis, it can be concluded that efficient transport decisions, though very complex in a logistics network, can be simplified to a certain extent by utilizing the available information about the goods and vehicles in the network.
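The reduction in communication volume from cluster-based routing can be sketched as follows. This is not the DLRP itself: the goods, the destinations, the choice of the first member as cluster head, and the simplification that each route discovery costs one request are all assumptions made for illustration.

```python
from collections import defaultdict


def cluster_by_destination(goods):
    """Group goods that share a destination into one cluster."""
    clusters = defaultdict(list)
    for good_id, destination in goods:
        clusters[destination].append(good_id)
    return dict(clusters)


def route_requests(goods, clustered):
    """Count route discoveries: one per good, or one per cluster head."""
    if clustered:
        return len(cluster_by_destination(goods))
    return len(goods)


# Hypothetical goods with their destinations.
goods = [
    ("g1", "Bremen"),
    ("g2", "Hamburg"),
    ("g3", "Bremen"),
    ("g4", "Bremen"),
    ("g5", "Hamburg"),
]

clusters = cluster_by_destination(goods)
# Assumption: the first good in each cluster acts as cluster head.
heads = {dest: members[0] for dest, members in clusters.items()}

print(route_requests(goods, clustered=False), route_requests(goods, clustered=True))
```

Since every route discovery floods the network with messages, fewer discoveries translate into less message traffic, which is the effect the thesis evaluates against the legacy DLRP.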
Effective retrieval to support learning
To use digital resources to support learning, we need to be able to retrieve them. This thesis introduces a new area of research within information retrieval, the retrieval of educational resources from the Web. Successful retrieval of educational resources requires an understanding of how the resources being searched are managed, how searchers interact with those resources and the systems that manage them, and the needs of the people searching. As such, we began by investigating how resources are managed and reused in a higher education setting. This investigation involved running four focus groups with 23 participants, 26 interviews and a survey. The second part of this work is motivated by one of our initial findings; when people look for educational resources, they prefer to search the World Wide Web using a public search engine. This finding suggests users searching for educational resources may be more satisfied with search engine results if only those resources likely to support learning are presented. To provide satisfactory result sets, resources that are unlikely to support learning should not be present. A filter to detect material that is likely to support learning would therefore be useful. Information retrieval systems are often evaluated using the Cranfield method, which compares system performance with a ground truth provided by human judgments. We propose a method of evaluating systems that filter educational resources based on this method. By demonstrating that judges can agree on which resources are educational, we establish that a single human judge for each resource provides a sufficient ground truth. Machine learning techniques are commonly used to classify resources. We investigate how machine learning can be used to classify resources retrieved from the Web as likely or unlikely to support learning. 
We found that reasonable classification performance can be achieved using text extracted from resources in conjunction with Naïve Bayes, AdaBoost, and Random Forest classifiers. We also found that attributes developed from the structural elements (hyperlinks and headings found in a resource) did not substantially improve classification over simply using the text.
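As a rough illustration of text-only classification, the sketch below implements a tiny multinomial Naïve Bayes classifier with add-one smoothing. It is a toy under stated assumptions: the training texts, the two labels, and whitespace tokenisation are invented here and do not reflect the thesis's data, features, or tooling.

```python
import math
from collections import Counter


def train_naive_bayes(examples):
    """examples: list of (text, label). Returns label counts, word counts, vocabulary."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior for the given text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothed word likelihood.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: text extracted from resources, with a judged label.
examples = [
    ("lecture notes exercises syllabus", "educational"),
    ("tutorial exercises worked solutions", "educational"),
    ("buy cheap tickets online", "other"),
    ("celebrity news gossip online", "other"),
]
model = train_naive_bayes(examples)
predicted = classify("exercises and lecture notes", model)
print(predicted)
```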
The Effectiveness of Query-Based Hierarchic Clustering of Documents for Information Retrieval
Hierarchic document clustering has been applied to Information Retrieval (IR) for over three decades. Its introduction to IR was based on the grounds of its potential to improve the effectiveness of IR systems. Central to the issue of improved effectiveness is the Cluster Hypothesis. The hypothesis states that relevant documents tend to be highly similar to each other, and therefore tend to appear in the same clusters. However, previous research has been inconclusive as to whether document clustering does bring improvements. The main motivation for this work has been to investigate methods for the improvement of the effectiveness of document clustering, by challenging some assumptions that implicitly characterise its application. Such assumptions relate to the static manner in which document clustering is typically performed, and include the static application of document clustering prior to querying, and the static calculation of interdocument associations. The type of clustering that is investigated in this thesis is query-based, that is, it incorporates information from the query into the process of generating clusters of documents. Two approaches for incorporating query information into the clustering process are examined: clustering documents which are returned from an IR system in response to a user query (post-retrieval clustering), and clustering documents by using query-sensitive similarity measures. For the first approach, post-retrieval clustering, an analytical investigation into a number of issues that relate to its retrieval effectiveness is presented in this thesis. This is in contrast to most of the research which has employed post-retrieval clustering in the past, where it is mainly viewed as a convenient and efficient means of presenting documents to users. In this thesis, post-retrieval clustering is employed based on its potential to introduce effectiveness improvements compared both to static clustering and best-match IR systems. 
The motivation for the second approach, the use of query-sensitive measures, stems from the role of interdocument similarities for the validity of the cluster hypothesis. In this thesis, an axiomatic view of the hypothesis is proposed, by suggesting that documents relevant to the same query (co-relevant documents) display an inherent similarity to each other which is dictated by the query itself. Because of this inherent similarity, the cluster hypothesis should be valid for any document collection. Past research has attributed failure to validate the hypothesis for a document collection to characteristics of the collection. Contrary to this, the view proposed in this thesis suggests that failure of a document set to adhere to the hypothesis is attributed to the assumptions made about interdocument similarity. This thesis argues that the query determines the context and the purpose for which the similarity between documents is judged, and it should therefore be incorporated in the similarity calculations. By taking the query into account when calculating interdocument similarities, co-relevant documents can be "forced" to be more similar to each other. This view challenges the typically static nature of interdocument relationships in IR. Specific formulas for the calculation of query-sensitive similarity are proposed in this thesis. Four hierarchic clustering methods and six document collections are used in the experiments. Three main issues are investigated: the effectiveness of hierarchic post-retrieval clustering which uses static similarity measures, the effectiveness of query-sensitive measures at increasing the similarity of pairs of co-relevant documents, and the effectiveness of hierarchic clustering which uses query-sensitive similarity measures. 
The results demonstrate the effectiveness improvements that are introduced by the use of both approaches of query-based clustering, compared both to the effectiveness of static clustering and to the effectiveness of best-match IR systems. Query-sensitive similarity measures, in particular, introduce significant improvements over the use of static similarity measures for document clustering, and they also significantly improve the structure of the document space in terms of the similarity of pairs of co-relevant documents. The results provide evidence for the effectiveness of hierarchic query-based clustering of documents, and also challenge findings of previous research which had dismissed the potential of hierarchic document clustering as an effective method for information retrieval.
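To make the idea of a query-sensitive similarity measure concrete, here is one plausible form, offered as an assumption rather than as the thesis's actual formula: the static cosine similarity of two documents is multiplied by the cosine similarity between their shared terms and the query. The term weights and example vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def query_sensitive_similarity(d_i, d_j, query):
    """Static cosine similarity scaled by how well the shared terms match the query."""
    # Assumption: shared terms are weighted by the average of the two document weights.
    common = {t: (d_i[t] + d_j[t]) / 2 for t in d_i if t in d_j}
    return cosine(d_i, d_j) * cosine(common, query)


# Hypothetical documents that share only the term "cluster".
d1 = {"cluster": 1.0, "retrieval": 1.0}
d2 = {"cluster": 1.0, "ranking": 1.0}

on_topic = query_sensitive_similarity(d1, d2, {"cluster": 1.0})
off_topic = query_sensitive_similarity(d1, d2, {"ranking": 1.0})
print(on_topic, off_topic)
```

Under this form, the same pair of documents is treated as similar when the query concerns what they have in common and as dissimilar otherwise, so the interdocument relationship is no longer static.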
Clustering information retrieval search outputs
Users are known to experience difficulties in dealing with information retrieval search outputs, especially if those outputs are above a certain size. It has been argued by several researchers that search output clustering can help users in their interaction with IR systems in some retrieval situations, providing them with an overview of their results by exploiting the topicality information that resides in the output but has not been used at the retrieval stage. This overview might enable them to find relevant documents more easily by focusing on the most promising clusters, or to use the clusters as a starting-point for query refinement or expansion. In this paper, the results of experiments carried out to assess the viability of clustering as a search output presentation method are reported and discussed.
Clustering information retrieval search outputs
SIGLE. Available from British Library Document Supply Centre-DSC:DXN034248 / BLDSC - British Library Document Supply Centre. GB. United Kingdom