71 research outputs found

    Entity-centric knowledge discovery for idiosyncratic domains

    Technical and scientific knowledge is produced at an ever-accelerating pace, leading to increasing issues when trying to automatically organize or process it, e.g., when searching for relevant prior work. Knowledge can today be produced both in unstructured (plain text) and structured (metadata or linked data) forms. However, unstructured content is still the most dominant form used to represent scientific knowledge. In order to facilitate the extraction and discovery of relevant content, new automated and scalable methods for processing, structuring and organizing scientific knowledge are called for. In this context, a number of applications are emerging, ranging from Named Entity Recognition (NER) and Entity Linking tools for scientific papers to specific platforms leveraging information extraction techniques to organize scientific knowledge. In this thesis, we tackle the tasks of Entity Recognition, Disambiguation and Linking in idiosyncratic domains, with an emphasis on scientific literature. Furthermore, we study the related task of co-reference resolution with a specific focus on named entities. We start by exploring Named Entity Recognition, a task that aims to identify the boundaries of named entities in textual content. We propose a new method to generate candidate named entities based on n-gram collocation statistics and design several entity recognition features to further classify them. In addition, we show how external knowledge bases (either domain-specific, like DBLP, or generic, like DBpedia) can be leveraged to improve the effectiveness of NER in idiosyncratic domains. Subsequently, we move to Entity Disambiguation, which is typically performed after entity recognition in order to link an entity to a knowledge base. We propose novel semi-supervised methods for word disambiguation that leverage the structure of a community-based ontology of scientific concepts. Our approach exploits the graph structure connecting different terms and their definitions to automatically identify the sense originally intended by the authors of a scientific publication. We then turn to co-reference resolution, the task of identifying entities that appear in various forms throughout a text. We propose an approach that types entities using an inverted index built on top of a knowledge base, and subsequently re-assigns entities based on the semantic relatedness of the introduced types. Finally, we describe an application whose goal is to help researchers discover and manage scientific publications, focusing on the problem of selecting relevant tags to organize collections of research papers in that context. We experimentally demonstrate that using a community-authored ontology together with information about the position of the concepts in the documents significantly increases the precision of tag selection over standard methods.
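    A minimal sketch of the collocation-statistics idea above, assuming pointwise mutual information (PMI) over adjacent token pairs as the collocation score; the function name, thresholds, and toy corpus are illustrative assumptions, not the thesis's actual implementation:

```python
import math
from collections import Counter

def candidate_entities(tokens, min_count=3, pmi_threshold=3.0):
    """Rank adjacent token pairs (bigrams) by pointwise mutual
    information; high-PMI bigrams are collocations and thus
    candidate multi-word named entities."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    candidates = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # too rare to trust the statistic
        # PMI = log P(w1, w2) / (P(w1) * P(w2))
        pmi = math.log((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if pmi >= pmi_threshold:
            candidates.append(((w1, w2), pmi))
    return sorted(candidates, key=lambda x: -x[1])

# Toy run: the recurring pair "support vector" surfaces as a candidate.
tokens = ("the support vector machine model beats "
          "the baseline support vector machine").split()
print(candidate_entities(tokens, min_count=2, pmi_threshold=1.0))
```

    In the thesis, such candidates are then further classified with dedicated entity recognition features and checked against external knowledge bases such as DBLP or DBpedia.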

    Tag Recommendation for Large-Scale Ontology-Based Information Systems

    We tackle the problem of improving the relevance of automatically selected tags in large-scale ontology-based information systems. Contrary to traditional settings where tags can be chosen arbitrarily, we focus on the problem of recommending tags (e.g., concepts) directly from a collaborative, user-driven ontology. We compare the effectiveness of a series of approaches to select the best tags, ranging from traditional IR techniques such as TF/IDF weighting to novel techniques based on ontological distances and latent Dirichlet allocation. All our experiments are run against a real corpus of tags and documents extracted from the ScienceWise portal, which is connected to ArXiv.org and is currently used by a growing number of researchers. The datasets for the experiments are made available online for reproducibility purposes.
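    A minimal sketch of the simplest baseline mentioned above, TF/IDF weighting restricted to an ontology vocabulary; the helper name, smoothing, and toy data are assumptions, not the paper's setup or ScienceWise data:

```python
import math
from collections import Counter

def recommend_tags(doc_tokens, corpus, ontology_terms, k=5):
    """Score ontology terms occurring in the document by TF/IDF
    over the corpus and return the top-k as recommended tags."""
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    scores = {}
    for term in ontology_terms:
        if tf[term] == 0:
            continue  # only recommend concepts the document actually mentions
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        scores[term] = tf[term] * idf
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy run: "entity" is rarer in the corpus, so it outranks "graph".
corpus = [["graph", "entity", "linking"], ["graph", "neural", "network"]]
print(recommend_tags(["graph", "entity", "entity"], corpus, {"graph", "entity"}))
```

    The ontology-distance and LDA-based techniques the abstract compares against would replace this scoring step while keeping the same recommend-from-vocabulary shape.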

    Analysis and implementation of methods for the text categorization

    Text Categorization (TC) is the automatic classification of text documents under pre-defined categories, or classes. Popular TC approaches map categories into symbolic labels and use a training set of documents, previously labeled by human experts, to build a classifier which enables the automatic TC of unlabeled documents. Suitable TC methods come from the fields of data mining and information retrieval; however, the following issues remain unsolved. First, the classifier's performance depends heavily on hand-labeled documents, which are the only source of knowledge for learning the classifier. Being a labor-intensive and time-consuming activity, the manual attribution of documents to categories is extremely costly. This creates a serious limitation when a set of manually labeled data is not available, as happens in most cases. Second, even a moderately sized text collection often has tens of thousands of terms, making the classification cost prohibitive for learning algorithms that do not scale well to large problem sizes. Most importantly, TC should be based on the text content rather than on a set of hand-labeled documents whose categorization depends on the subjective judgment of a human classifier. This thesis aims to address the above issues by proposing innovative approaches which leverage techniques from data mining and information retrieval. To cope with both the high dimensionality of the text collection and the large number of terms in a single text, the thesis proposes a hybrid model for term selection which combines and takes advantage of both filter and wrapper approaches. In detail, the proposed model uses a filter to rank the list of terms present in the documents, ensuring that useful terms are unlikely to be screened out (a sketch of this filter stage follows the abstract). Next, to limit classification problems due to correlation among terms, this ranked list is refined by a wrapper that uses a Genetic Algorithm (GA) to retain the most informative and discriminative terms. Experimental results compare well with some of the top-performing learning algorithms for TC and seem to confirm the effectiveness of the proposed model. To address the lack and the subjectivity of manually labeled datasets, the basic idea is to use an ontology-based approach which does not depend on the existence of a training set and relies solely on a set of concepts within a given domain and the relationships between concepts. In this regard, the thesis proposes a text categorization approach that applies WordNet to select the correct sense of words in a document, and utilizes domain names in WordNet Domains for classification purposes. Experiments show that the proposed approach performs well in classifying a large corpus of documents. This thesis contributes to the areas of data mining and information retrieval. Specifically, it introduces and evaluates novel techniques in the field of text categorization. The primary objective of this thesis is to test the hypothesis that (i) text categorization requires and benefits from techniques designed to exploit document content; (ii) hybrid methods from data mining and information retrieval can better address the high dimensionality that is the main characteristic of large document collections; and (iii) in the absence of manually annotated documents, a WordNet domain abstraction can be used that is both useful and general enough to categorize any document collection.
As a final remark, it is important to acknowledge that much of the inspiration and motivation for this work derived from a vision of the future of text categorization processes in specific application domains, such as business and industry, to cite just a few. In the end, it is this vision that provided the guiding framework. However, it is equally important to understand that many of the results and techniques developed in this thesis are not limited to text categorization. For example, the evaluation of disambiguation methods is interesting in its own right and is likely to be relevant to other application fields.
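    The abstract does not specify which filter statistic the hybrid model uses, so the following sketch of the filter stage assumes chi-square term ranking, a common choice; the names and the top_n cutoff are likewise assumptions. The GA wrapper would then search subsets of this ranked list for the most discriminative combination.

```python
from collections import Counter

def chi_square_rank(docs, labels, top_n=1000):
    """Filter stage of a filter+wrapper model: rank terms by their
    chi-square association with the class labels, so the wrapper
    only has to search the top-ranked subset."""
    n = len(docs)
    classes = set(labels)
    term_df = Counter()        # number of docs containing each term
    term_class_df = Counter()  # per class, number of docs containing the term
    class_count = Counter(labels)
    for doc, label in zip(docs, labels):
        for term in set(doc):
            term_df[term] += 1
            term_class_df[(term, label)] += 1
    scores = {}
    for term, df in term_df.items():
        best = 0.0
        for c in classes:
            a = term_class_df[(term, c)]  # term present, class c
            b = df - a                    # term present, other classes
            cc = class_count[c] - a       # term absent, class c
            d = n - df - cc               # term absent, other classes
            denom = (a + cc) * (b + d) * (a + b) * (cc + d)
            if denom:
                # chi-square for the 2x2 contingency table of term vs. class
                best = max(best, n * (a * d - b * cc) ** 2 / denom)
        scores[term] = best
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy run: "refund" separates the two classes, so it ranks first.
docs = [["refund", "order"], ["refund", "late"], ["goal", "match"], ["match", "score"]]
print(chi_square_rank(docs, ["billing", "billing", "sports", "sports"], top_n=3))
```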

    Linked Data Supported Information Retrieval

    Search engines have become indispensable for locating content on the World Wide Web. Semantic Web and Linked Data technologies enable content to be structured in a more detailed and unambiguous way, and allow entirely new approaches to solving Information Retrieval problems. This thesis examines how Information Retrieval applications can profit from the inclusion of Linked Data. New methods for computer-assisted semantic text analysis, semantic search, information prioritization and information visualization are presented and comprehensively evaluated. Linked Data resources and their relationships are integrated into these methods in order to increase their effectiveness or their usability. First, an introduction to the foundations of Information Retrieval and Linked Data is given. Then, new manual and automated methods for semantically annotating documents by linking them to Linked Data resources (Entity Linking) are presented. The methods are comprehensively evaluated, and the underlying evaluation system is substantially improved. Building on the annotation methods, two new retrieval models for semantic search are presented and evaluated. These models are based on the generalized vector space model and incorporate semantic similarity, derived from taxonomy-based relationships among the Linked Data resources in documents and queries, into the ranking of search results. With the aim of further refining the computation of semantic similarity, a method for prioritizing Linked Data resources is presented and evaluated. On this basis, visualization techniques are presented that aim to improve the explorability and navigability of a semantically annotated document corpus. Two applications are presented for this purpose: a Linked Data based exploratory extension complementing a traditional keyword-based search engine, and a Linked Data based recommender system.
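    A minimal sketch of the generalized vector space model idea mentioned above, under simplifying assumptions: the concept vectors and the hand-filled concept-concept similarity matrix are toy placeholders, and the thesis's actual derivation of similarity from taxonomy relations is not reproduced here.

```python
import numpy as np

def gvsm_score(doc_vec, query_vec, concept_sim):
    """Generalized vector space model: instead of assuming
    orthogonal index terms (plain cosine), weigh co-occurring
    concepts by their pairwise semantic similarity S, scoring
    d^T S q normalized by the induced norms."""
    num = doc_vec @ concept_sim @ query_vec
    d_norm = np.sqrt(doc_vec @ concept_sim @ doc_vec)
    q_norm = np.sqrt(query_vec @ concept_sim @ query_vec)
    den = d_norm * q_norm
    return num / den if den else 0.0

# Toy example: 3 concepts, where concepts 0 and 1 are close in the
# taxonomy (similarity 0.8) and concept 2 is unrelated to both.
S = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
doc = np.array([1.0, 0.0, 0.0])    # document annotated with concept 0
query = np.array([0.0, 1.0, 0.0])  # query annotated with concept 1
print(gvsm_score(doc, query, S))   # > 0 despite no shared concept
```

    The design point is that a document and a query can match through taxonomically related concepts even when they share no annotation, which plain cosine similarity cannot do.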

    Helping users learn about social processes while learning from users : developing a positive feedback in social computing

    Advisors: Philippe J. Giabbanelli. Social computing is concerned with the interaction of social behavior and computational systems. From its early days, social computing has had two foci. One was the development of technology and interfaces to support online communities. The other was to use computational techniques to study society and assess the expected impact of policies. This thesis seeks to develop systems for social computing, both in the context of online communities and the study of societal processes, that allow users to learn while in turn learning from users. Communities are approached through the problem of Massive Open Online Courses (MOOCs), via a complementary use of network analysis and text mining. In particular, we show that an efficient system can be designed such that instructors do not need to categorize the interactions of all students to assess their learning experience. The thesis explores the study of societal processes by showing how text analytics, visual analytics, and fuzzy cognitive maps (FCMs) can collectively help an analyst understand complex scenarios such as obesity. Overall, this work had two key limitations: the dataset we used was small and did not exhibit all possible interactions, and the scalability of our systems remains a concern. Future work can include the use of non-n-gram features to improve our MOOC system and the use of graph layouts for our visualization system. M.S. (Master of Science)
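    A fuzzy cognitive map, as mentioned above, is a signed weighted digraph of concepts whose activations are iterated to a fixed point. The following is a generic FCM update sketch; the concepts, weights, and sigmoid squashing function are illustrative assumptions, not taken from the thesis's obesity scenario.

```python
import numpy as np

def fcm_simulate(W, x0, steps=50, tol=1e-6):
    """Iterate an FCM: each concept's next activation is a
    sigmoid-squashed weighted sum of its causes,
    x_{t+1} = sigmoid(W @ x_t), until a fixed point (or the
    step budget) is reached."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x_next = sigmoid(W @ x)
        if np.max(np.abs(x_next - x)) < tol:
            return x_next  # converged to a fixed point
        x = x_next
    return x

# Toy 3-concept map: concept 0 promotes concept 1 (+0.9),
# and concept 1 inhibits concept 2 (-0.7).
W = np.array([[0.0, 0.0, 0.0],
              [0.9, 0.0, 0.0],
              [0.0, -0.7, 0.0]])
print(fcm_simulate(W, [1.0, 0.0, 0.0]))
```

    An analyst reads the converged activations as the scenario's steady state, and re-runs the simulation with edited weights to explore what-if policy questions.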

    Intelligent RSS Tool

    Project carried out in collaboration with Aalto University. Easy access to a wide range of information available online enables people to explore this information with an ambition to discover even more interesting content. This opportunity often leads to the problem of finding interesting and relevant information in a sea of knowledge. This problem is often referred to as the information overload problem, which is getting harder and harder to deal with as the amount of information available online grows. In this thesis, one source of information is exploited and organized in such a way that the task of discovering new content is made easier. We use Really Simple Syndication (RSS) as our source of information and two methods to categorize it: document clustering with K-Means and Latent Dirichlet Allocation (LDA). We use the textual information that the RSS feeds contain; each feed usually covers a specific set of topics. Our first goal is to perform document clustering on the data in order to generate meaningful clusters, using natural language processing (NLP) techniques to preprocess the data (a clustering sketch follows this abstract). Our second goal is to analyze the clustered RSS feeds and exploit the similarities between the documents to generate meaningful user models based on users' feed subscriptions. The third goal is to provide relevant recommendations based on the user models we have learned. We combine current state-of-the-art methods and present novel methods to compare feeds. In our novel method, we exploit WordNet shallow ontologies to create generalized representations of the feeds. The final goal is to develop a functional application that can leverage the methods we developed, with the help of machine learning libraries. The method we propose is a combination of document clustering techniques, text similarity, feed modeling and a recommendation system. The results of our experiments show that K-Means-clustered documents combined with recommendations based on the feed contents yield the best results. Using WordNet to measure the similarity of words also provides promising results. Further exploring the advantages of semantic similarity in document similarity measures would be an interesting research topic.
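    A minimal sketch of the kind of clustering pipeline the abstract describes, here using scikit-learn's TF/IDF vectorizer and K-Means; the feed items, the number of clusters, and the library choice are assumptions, and the thesis's preprocessing and parameters are not reproduced.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder feed items; in practice these would be titles and
# summaries parsed out of the subscribed RSS feeds.
items = [
    "new smartphone released with improved camera",
    "stock markets rally after rate decision",
    "camera sensor breakthrough announced",
    "central bank signals further rate cuts",
]

# TF/IDF turns each item into a sparse term-weight vector.
vectors = TfidfVectorizer(stop_words="english").fit_transform(items)

# K-Means groups the items into k topical clusters; user models can
# then be built from the clusters a user's subscriptions fall into.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0, 1, 0, 1]: tech items vs. finance items
```

    The LDA variant mentioned in the abstract would replace the K-Means step with topic-distribution inference over the same documents.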

    Report on shape analysis and matching and on semantic matching

    In GRAVITATE, two disparate specialities will come together in one working platform for the archaeologist: the fields of shape analysis and of metadata search. These fields are relatively disjoint at the moment, and the research and development challenge of GRAVITATE is precisely to merge them for our chosen tasks. As shown in chapter 7, the small amount of literature that already attempts to join 3D geometry and semantics is not related to the cultural heritage domain. Therefore, after the project is done, there should be a clear ‘before-GRAVITATE’ and ‘after-GRAVITATE’ split in how these two aspects of a cultural heritage artefact are treated. This state of the art report (SOTA) is ‘before-GRAVITATE’. Shape analysis and metadata description are described separately, as they currently are in the literature, and we end the report with common recommendations in chapter 8 on possible or plausible cross-connections that suggest themselves. These considerations will be refined for the Roadmap for Research deliverable. Within the project, a jargon is developing in which ‘geometry’ stands for the physical properties of an artefact (not only its shape, but also its colour and material) and ‘metadata’ is used as a general shorthand for the semantic description of the provenance, location, ownership, classification, use, etc. of the artefact. As we proceed in the project, we will find a need to refine those broad divisions and find intermediate classes (such as a semantic description of certain colour patterns), but for now the terminology is convenient, not least because it highlights the interesting area where both aspects meet. On the ‘geometry’ side, the GRAVITATE partners are UVA, Technion and CNR/IMATI; on the metadata side, IT Innovation, the British Museum and the Cyprus Institute, the latter two of course also playing the role of internal users and representatives of the Cultural Heritage (CH) data and target users' group. CNR/IMATI's experience in shape analysis and similarity will be an important bridge between the two worlds of geometry and metadata. The authorship and styles of this SOTA reflect these specialisms: the first part (chapters 3 and 4) is purely by the geometry partners (mostly IMATI and UVA), the second part (chapters 5 and 6) by the metadata partners, especially IT Innovation, while the joint overview on 3D geometry and semantics is mainly by IT Innovation and IMATI. The common section on Perspectives was written with contributions from all partners.

    User Profiling for Personalized Search and Partnership Selection

    Doctoral dissertation, Seoul National University Graduate School, Department of Dental Science, February 2014. Advisor: Hong-Gi Kim. "The secret of change is to focus all of your energy not on fighting the old, but on building the new." - Socrates. The automatic identification of user intention is an important but highly challenging research problem whose solution can greatly benefit information systems. In this thesis, I look at the problem of identifying sources of user interests, extracting latent semantics from them, and modelling them as a user profile. I present algorithms that automatically infer user interests and extract hidden semantics from them, specifically aimed at improving personalized search. I also present a methodology to model a user profile as a buyer profile or a seller profile, where the attributes of the profile are populated from a controlled vocabulary. The buyer profiles and seller profiles are used in partnership match. In the domain of personalized search, first, a novel method to construct a profile of user interests is proposed which is based on mining anchor text. Second, two methods are proposed to build a user profile that gathers terms from a folksonomy system, where a matrix factorization technique is explored to discover hidden relationships between them (a sketch of this idea follows the outline below). The objective of the methods is to discover latent relationships between terms such that contextually, semantically, and syntactically related terms can be grouped together, thus disambiguating the context of term usage. The profile of user interests is also analysed to judge its clustering tendency and clustering accuracy. Extensive evaluation indicates that a profile of user interests that can correctly or precisely disambiguate the context of a user query has a significant impact on personalized search quality. In the domain of partnership match, an ontology termed the partnership ontology is proposed. The attributes, or concepts, in the partnership ontology are features representing the context of work. It is used by users to lay down their requirements as buyer profiles or seller profiles. A semantic similarity measure is defined to compute a ranked list of matching seller profiles for a given buyer profile.
    Table of contents:
    1 Introduction
      1.1 User Profiling for Personalized Search
        1.1.1 Motivation
        1.1.2 Research Problems
      1.2 User Profiling for Partnership Match
        1.2.1 Motivation
        1.2.2 Research Problems
      1.3 Contributions
      1.4 System Architecture - Personalized Search
      1.5 System Architecture - Partnership Match
      1.6 Organization of this Dissertation
    2 Background
      2.1 Introduction to Social Web
      2.2 Matrix Decomposition Methods
      2.3 User Interest Profile for Personalized Web Search: Non-Folksonomy-based
      2.4 User Interest Profile for Personalized Web Search: Folksonomy-based
      2.5 Personalized Search
      2.6 Partnership Match
    3 Mining anchor text for building User Interest Profile: A non-folksonomy based personalized search
      3.1 Exclusively Yours'
        3.1.1 Infer User Interests
        3.1.2 Weight Computation
        3.1.3 Query Expansion
      3.2 Exclusively Yours' Algorithm
      3.3 Experiments
        3.3.1 DataSet
        3.3.2 Evaluation Metrics
        3.3.3 User Profile Efficacy
        3.3.4 Personalized vs. Non-Personalized Results
      3.4 Conclusions
    4 Matrix factorization for building Clustered User Interest Profile: A folksonomy based personalized search
      4.1 Aggregating tags from user search history
      4.2 Latent Semantics in UIP
        4.2.1 Computing the tag-tag Similarity matrix
        4.2.2 Tag Clustering to generate svdCUIP and modSvdCUIP
      4.3 Personalized Search
      4.4 Experimental Evaluation
        4.4.1 Data Set and Experiment Methodology
          4.4.1.1 Custom Data Set and Evaluation Metrics
          4.4.1.2 AOL Query Data Set and Evaluation Metrics
          4.4.1.3 Experiment set-up to estimate the values of k and d
          4.4.1.4 Experiment set-up to compare the proposed approaches with other approaches
        4.4.2 Experiment Results
          4.4.2.1 Clustering Tendency
          4.4.2.2 Determining the value of the dimension parameter, k, for the Custom Data Set
          4.4.2.3 Determining the value of the distinctness parameter, d, for the Custom Data Set
          4.4.2.4 CUIP visualization
          4.4.2.5 Determining the value of the dimension reduction parameter, k, for the AOL data set
          4.4.2.6 Determining the value of the distinctness parameter, d, for the AOL data set
          4.4.2.7 Time to generate svdCUIP and modSvdCUIP
          4.4.2.8 Comparison of svdCUIP, modSvdCUIP, and tfIdfCUIP for different classes of queries
          4.4.2.9 Comparing all five methods: Improvement
        4.4.3 Discussion
    5 User Profiling for Partnership Match
      5.1 Supplier Selection
      5.2 Criteria for Partnership Establishment
      5.3 Partnership Ontology
      5.4 Case Study
        5.4.1 Buyer Profile and Seller Profile
        5.4.2 Semantic Similarity Measure
      5.5 Discussion
      5.6 Conclusions
    6 Conclusion
      6.1 Future Work
        6.1.1 Degree of Personalization
        6.1.2 Filter Bubble
        6.1.3 IPR issues in Partnership Match
    Bibliography
    Appendices
      .1 Pairs of Query and target URL
      .2 Examples of Expanded Queries
      .3 An example of svdCUIP, modSvdCUIP, tfIdfCUIP
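    The clustered user interest profile above is built by factorizing a tag matrix; the following is a minimal sketch of that general idea using a truncated SVD. The toy matrix, the rank k, and the cosine comparison are illustrative assumptions, not the thesis's actual svdCUIP construction.

```python
import numpy as np

def tag_similarity(tag_doc, k=2):
    """Low-rank view of a tag-document matrix: truncate the SVD to
    rank k and compare tags in the latent space, so tags used in
    similar contexts become similar even if they never co-occur."""
    U, s, Vt = np.linalg.svd(tag_doc, full_matrices=False)
    latent = U[:, :k] * s[:k]          # one k-dimensional vector per tag
    norms = np.linalg.norm(latent, axis=1, keepdims=True)
    unit = latent / np.where(norms == 0, 1, norms)
    return unit @ unit.T               # cosine similarity, tag x tag

# Toy matrix: rows are tags, columns are documents they annotate.
tags = ["python", "programming", "snake", "reptile"]
M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 0]], dtype=float)
print(np.round(tag_similarity(M, k=2), 2))
```

    Clustering the rows of the latent matrix (e.g., with K-Means) would then yield the tag clusters that form the clustered user interest profile, disambiguating which sense of a tag like "python" a user's history reflects.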