
    Computing text semantic relatedness using the contents and links of a hypertext encyclopedia

    We propose a method for computing semantic relatedness between words or texts by using knowledge from hypertext encyclopedias such as Wikipedia. A network of concepts is built by filtering the encyclopedia's articles, each concept corresponding to an article. Two types of weighted links between concepts are considered: one based on hyperlinks between the texts of the articles, and another based on the lexical similarity between them. We propose and implement an efficient random walk algorithm that computes the distance between nodes, and then between sets of nodes, using the visiting probability from one (set of) node(s) to another. Moreover, to make the algorithm tractable, we propose and validate empirically two truncation methods, and then use an embedding space to learn an approximation of visiting probability. To evaluate the proposed distance, we apply our method to four important tasks in natural language processing: word similarity, document similarity, document clustering and classification, and ranking in information retrieval. The performance of the method is state-of-the-art or close to it for each task, thus demonstrating the generality of the knowledge resource. Moreover, using both hyperlinks and lexical similarity links improves the scores with respect to a method using only one of them, because hyperlinks bring additional real-world knowledge not captured by lexical similarity.
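The visiting probability described in this abstract can be illustrated with a small fixed-point computation: the probability that a walk starting at one node ever reaches another, made well-defined by a per-step continuation probability. This is a minimal sketch under assumptions; the parameter `alpha` and the linear-system formulation are illustrative choices, not the paper's exact truncation methods.

```python
import numpy as np

def visiting_probability(adj, source, target, alpha=0.85, n_iter=100):
    """Probability that a decaying random walk starting at `source`
    ever visits `target`.  `adj` is a weighted adjacency matrix;
    `alpha` is an assumed per-step continuation probability (the paper
    instead uses dedicated truncation methods to stay tractable)."""
    n = adj.shape[0]
    # Row-normalise the adjacency matrix into transition probabilities.
    row_sums = adj.sum(axis=1, keepdims=True)
    P = np.divide(adj, row_sums, out=np.zeros_like(adj, dtype=float),
                  where=row_sums > 0)
    p = np.zeros(n)          # p[u] = prob. of reaching target from u
    p[target] = 1.0          # the target is absorbing
    for _ in range(n_iter):
        new = alpha * P @ p  # continue the walk with probability alpha
        new[target] = 1.0
        p = new
    return p[source]
```

Because all paths between the two nodes contribute to the fixed point, a single unreliable link has limited influence on the final score, which is the property the abstract highlights.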

    Similarity Learning Over Large Collaborative Networks

    In this thesis, we propose novel solutions to similarity learning problems on collaborative networks. Similarity learning is essential for modeling and predicting the evolution of collaborative networks. In addition, similarity learning is used to perform ranking, which is the main component of recommender systems. Due to the low cost of developing such collaborative networks, they grow very quickly, and therefore, our objective is to develop models that scale well to large networks. The similarity measures proposed in this thesis make use of the global link structure of the network and of the attributes of the nodes in a complementary way. We first define a random walk model, named Visiting Probability (VP), to measure proximity between two nodes in a graph. VP considers all the paths between two nodes collectively and thus reduces the effect of potentially unreliable individual links. Moreover, using VP and the structural characteristics of small-world networks (a frequent type of networks), we design scalable algorithms based on VP similarity. We then model the link structure of a graph within a similarity learning framework, in which the transformation of nodes to a latent space is trained using a discriminative model. When trained over VP scores, the model is able to better predict the relations in a graph in comparison to models learned directly from the network’s links. Using the VP approach, we explain how to transfer knowledge from a hypertext encyclopedia to text analysis tasks. We consider the graph of Wikipedia articles with two types of links between them: hyperlinks and content similarity ones. To transfer the knowledge learned from the Wikipedia network to text analysis tasks, we propose and test two shared representation methods. In the first one, a given text is mapped to the corresponding concepts in the network. Then, to compute similarity between two texts, VP similarity is applied to compute the distance between the two sets of nodes.
The second method uses the latent space model for representation, by training a transformation from words to the latent space over VP scores. We test our proposals on several benchmark tasks: word similarity, document similarity / clustering / classification, information retrieval, and learning to rank. The results are most often competitive compared to state-of-the-art task-specific methods, thus demonstrating the generality of our proposal. These results also support the hypothesis that both types of links over Wikipedia are useful, as the improvement is higher when both are used. In many collaborative networks, different link types can be used in a complementary way. Therefore, we propose two joint similarity learning models over the nodes’ attributes, to be used for link prediction in networks with multiple link types. The first model learns a similarity metric that consists of two parts: the general part, which is shared between all link types, and the specific part, which is trained specifically for each type of link. The second model consists of two layers: the first layer, which is shared between all link types, embeds the objects of the network into a new space, and then a similarity is learned specifically for each link type in this new space. Our experiments show that the proposed joint modeling and training frameworks improve link prediction performance significantly for each link type in comparison to multiple baselines. The two-layer similarity model outperforms the first one, as expected, due to its capability of modeling negative correlations among different link types. Finally, we propose a learning to rank algorithm on network data, which uses both the attributes of the nodes and the structure of the links for learning and inference. Link structure is used in training through a neighbor-aware ranker which considers both node attributes and scores of neighbor nodes. 
The global link structure of the network is used in inference through an original propagation method called the Iterative Ranking Algorithm. This propagates the predicted scores in the graph on condition that they are above a given threshold. Thresholding improves performance and makes a time-efficient implementation possible for application to large-scale graphs. The observed improvements are explained by the structural properties of small-world networks.
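The thresholded propagation step described above can be sketched as follows. This is a loose illustration, not the thesis's exact Iterative Ranking Algorithm: the blend rule, the `threshold` and `weight` parameters, and all names are assumptions introduced for the example.

```python
import numpy as np

def propagate_scores(adj, base, threshold=0.5, weight=0.3, n_iter=10):
    """Iteratively propagate predicted scores through the graph, but
    only from nodes whose current score exceeds `threshold` (a sketch
    of thresholded propagation; the exact update rule is assumed)."""
    scores = base.astype(float)
    for _ in range(n_iter):
        mask = scores >= threshold            # only confident scores spread
        contrib = adj @ (scores * mask)       # sum over confident neighbours
        deg = adj @ mask.astype(float)        # number of confident neighbours
        neighbor_avg = np.divide(contrib, deg,
                                 out=np.zeros_like(contrib), where=deg > 0)
        # Blend each node's own score with its confident neighbours' average.
        updated = (1 - weight) * scores + weight * neighbor_avg
        scores = np.where(deg > 0, updated, scores)
    return scores
```

The threshold serves the two purposes the abstract mentions: low-confidence scores do not pollute their neighbours, and most nodes are skipped in each pass, which keeps the iteration cheap on large graphs.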

    Technologies to enhance self-directed learning from hypertext

    With the growing popularity of the World Wide Web, materials presented to learners in the form of hypertext have become a major instructional resource. Despite the potential of hypertext to facilitate access to learning materials, self-directed learning from hypertext is often associated with many concerns. Self-directed learners, due to their different viewpoints, may follow different navigation paths, and thus they will have different interactions with knowledge. Therefore, learners can end up being disoriented or cognitively-overloaded due to the potential gap between what they need and what actually exists on the Web. In addition, while a lot of research has gone into supporting the task of finding web resources, less attention has been paid to the task of supporting the interpretation of Web pages. The inability to interpret the content of pages leads learners to interrupt their current browsing activities to seek help from other human resources or explanatory learning materials. Such activity can weaken learner engagement and lower their motivation to learn. This thesis aims to promote self-directed learning from hypertext resources by proposing solutions to the above problems. It first presents Knowledge Puzzle, a tool that proposes a constructivist approach to learn from the Web. Its main contribution to Web-based learning is that self-directed learners will be able to adapt the path of instruction and the structure of hypertext to their way of thinking, regardless of how the Web content is delivered. This can effectively reduce the gap between what they need and what exists on the Web. SWLinker is another system proposed in this thesis with the aim of supporting the interpretation of Web pages using ontology based semantic annotation. It is an extension to the Internet Explorer Web browser that automatically creates a semantic layer of explanatory information and instructional guidance over Web pages. 
It also aims to break the conventional view of Web browsing as an individual activity by leveraging the notion of ontology-based collaborative browsing. Both of the tools presented in this thesis were evaluated by students within the context of particular learning tasks. The results show that they effectively fulfilled the intended goals by facilitating learning from hypertext without introducing high overheads in terms of usability or browsing effort.

    Towards a typology of classificatory change

    Classifications of all types invariably change in response to shifting conditions in the information environment. Revising the contents of subject-based schemes is an important type of change, but the phenomenon of classificatory change has multiple interrelated aspects that go beyond content. Conceptually isolating these aspects offers a starting point for describing and comparing different types of classificatory change. The typology proposed here attempts to situate classification schemes within a context of use, interacting with other elements of the information environment. As the digital information landscape continues to evolve, there are increased opportunities for classificatory innovation. While hyperlinks have become a pervasive element in the repertory of knowledge organization, the hypertext technique of transclusion has received considerably less attention. Transclusion offers an alternative way to envision the relationship between digital resources and classification schemes. Examples from the English-language Wikipedia demonstrate how transclusion is used in the digital encyclopedia to embed modular subject-based schemes that supplement knowledge navigation and discovery.

    Analysing entity context in multilingual Wikipedia to support entity-centric retrieval applications

    Representation of influential entities, such as famous people and multinational corporations, on the Web can vary across languages, reflecting language-specific entity aspects as well as divergent views on these entities in different communities. A systematic analysis of language-specific entity contexts can provide a better overview of the existing aspects and support entity-centric retrieval applications over multilingual Web data. An important source of cross-lingual information about influential entities is Wikipedia, an online community-created encyclopaedia containing more than 280 language editions. In this paper we focus on the extraction and analysis of the language-specific entity contexts from different Wikipedia language editions over multilingual data. We discuss alternative ways such contexts can be built, including graph-based and article-based contexts. Furthermore, we analyse the similarities and the differences in these contexts in a case study including 80 entities and five Wikipedia language editions.

    The effects of the number of links and navigation support on cognitive load and learning with hypertext: The mediating role of reading order

    In an experiment, we tested DeStefano and LeFevre's predictions as well as the usefulness of link suggestions. Participants used different versions of a hypertext, with either 3 or 8 links per page, and with or without link suggestions. We tested their cognitive load and learning outcomes. Results showed a benefit of using link suggestions for learning, but no effect of the number of links on learning was found. Moreover, the effects of our manipulations on cognitive load were mediated by the reading order that participants selected. Implications for research and the design of navigation support systems are discussed.

    Analysis of category co-occurrence in Wikipedia networks

    Wikipedia has seen a huge expansion of content since its inception. Pages within this online encyclopedia are organised by assigning them to one or more categories, where Wikipedia maintains a manually constructed taxonomy graph that encodes the semantic relationship between these categories. An alternative, called the category co-occurrence graph, can be produced automatically by linking together categories that have pages in common. Properties of the latter graph and its relationship to the former are the concern of this thesis. The analytic framework, called t-component, is introduced to formalise the graphs and discover category clusters connecting relevant categories together. The m-core, a cohesive-subgroup concept used as a clustering model, constructs a subgraph in which the number of shared pages between linked categories exceeds a given threshold t. The significance of the m-core clustering result is validated using a permutation test, and the m-core is compared to the k-core, another clustering model. The Wikipedia category co-occurrence graphs are scale-free, with a few category hubs, and the majority of clusters are of size 2. All observed properties for the distribution of the largest clusters of the category graphs obey power laws, with decay exponents averaging around 1. As the threshold t on the number of shared pages is increased, a critical threshold is eventually reached at which the largest cluster shrinks significantly in size. This phenomenon is exhibited only by the m-core, not the k-core. Lastly, the clustering in the category graph is shown to be consistent with the distance between categories in the taxonomy graph.
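The construction described above, linking categories that share at least t pages and reading off clusters, can be sketched as follows. This is a minimal illustration of the thresholding idea only; the function name and the use of plain connected components are assumptions, and the thesis's actual m-core analysis adds cohesive-subgroup pruning and significance testing on top.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_clusters(page_categories, t=1):
    """Build a category co-occurrence graph (categories linked when
    they share at least `t` pages) and return its connected components.
    A sketch of the thresholded graph; not the full m-core model."""
    # Count shared pages for every pair of co-assigned categories.
    shared = defaultdict(int)
    for cats in page_categories:          # one category set per page
        for a, b in combinations(sorted(cats), 2):
            shared[(a, b)] += 1
    # Keep only edges whose page overlap meets the threshold t.
    graph = defaultdict(set)
    for (a, b), count in shared.items():
        if count >= t:
            graph[a].add(b)
            graph[b].add(a)
    # Read clusters off as connected components (depth-first search).
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(graph[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters
```

Raising t prunes weak edges, so the largest component can collapse abruptly at some critical threshold, which is the transition the abstract reports for the m-core.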

    Dynamics of conflicts in Wikipedia

    In this work we study the dynamical features of editorial wars in Wikipedia (WP). Based on our previously established algorithm, we build up samples of controversial and peaceful articles and analyze the temporal characteristics of the activity in these samples. On short time scales, we show that there is a clear correspondence between conflict and burstiness of activity patterns, and that memory effects play an important role in controversies. On long time scales, we identify three distinct developmental patterns for the overall behavior of the articles. We are able to distinguish cases eventually leading to consensus from those cases where a compromise is far from achievable. Finally, we analyze discussion networks and conclude that edit wars are mainly fought by only a few editors.