3 research outputs found

    A Symmetric Loss Perspective of Reliable Machine Learning

    When minimizing the empirical risk in binary classification, it is common practice to replace the zero-one loss with a surrogate loss to make the learning objective feasible to optimize. Well-known surrogate losses for binary classification include the logistic loss, hinge loss, and sigmoid loss. The choice of surrogate loss is known to strongly influence the performance of the trained classifier, so it should be made carefully. Recently, surrogate losses that satisfy a certain symmetric condition (a.k.a. symmetric losses) have demonstrated their usefulness in learning from corrupted labels. In this article, we provide an overview of symmetric losses and their applications. First, we review how a symmetric loss can yield robust classification from corrupted labels in balanced error rate (BER) minimization and area under the receiver operating characteristic curve (AUC) maximization. Then, we demonstrate how the robust AUC maximization method can benefit natural language processing in the problem where we want to learn only from relevant keywords and unlabeled documents. Finally, we conclude this article by discussing future directions, including potential applications of symmetric losses for reliable machine learning and the design of non-symmetric losses that can benefit from the symmetric condition. Comment: Preprint of an Invited Review Article
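The abstract does not spell out the symmetric condition. In the literature it is usually stated as: a margin loss ℓ is symmetric if ℓ(z) + ℓ(−z) equals the same constant for every margin z. The sketch below assumes that definition and checks it numerically for the three losses named above; the function names and the grid of margins are illustrative choices, not taken from the article.

```python
import math

def sigmoid_loss(z):
    # Sigmoid loss on the margin z = y * f(x).
    return 1.0 / (1.0 + math.exp(z))

def logistic_loss(z):
    # Logistic loss log(1 + exp(-z)).
    return math.log1p(math.exp(-z))

def hinge_loss(z):
    # Hinge loss max(0, 1 - z).
    return max(0.0, 1.0 - z)

def is_symmetric(loss, margins, tol=1e-9):
    # Assumed definition: loss(z) + loss(-z) is the same constant for all z.
    # This is a numerical check on a finite grid, not a proof.
    sums = [loss(z) + loss(-z) for z in margins]
    return max(sums) - min(sums) <= tol

# Hypothetical grid of margins used only for the check.
MARGINS = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]

if __name__ == "__main__":
    for name, loss in [("sigmoid", sigmoid_loss),
                       ("logistic", logistic_loss),
                       ("hinge", hinge_loss)]:
        print(name, is_symmetric(loss, MARGINS))
```

Under this definition the sigmoid loss passes the check on the grid, while the logistic and hinge losses do not.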

    Automatic keyword extraction for a partial search engine index

    Full-text search engines play a critical role in many enterprise applications, where the quantity and complexity of the information are overwhelming. Promptly finding documents that contain relevant information for pressing questions is a necessity for efficient operation. This is especially the case for financial and legal teams executing Mergers and Acquisitions deals. The goal of the thesis is to provide search services for such teams without storing the sensitive documents involved, minimising the risk of potential data leaks. A literature review of related methods and concepts is presented. As search engine technologies that use encrypted indices for commercial applications are still in their early stages, the solution proposed in the thesis is the use of partial indexing by keyword extraction. A cosine similarity-based evaluation was used to measure the performance difference between the keyword-based partial index and the complete index. The partial indices were constructed using unsupervised keyword extraction methods based on term frequency, document graphs, and topic modelling. The frequency-based methods were term frequency, TF-IDF, and YAKE!. The graph-based method was TextRank. The topic modelling-based methods were NMF, LDA, and LSI. The methods were evaluated by running 51 reference queries on the LEDGAR data set, which contains 60,540 contracts. The results show that using only five keywords per document from the TF-IDF or YAKE! methods, the best matching documents in the result lists have a cosine similarity of 0.7 on average. This value is reasonably high, particularly considering the small number of keywords. The topic modelling-based methods were found to perform poorly due to being too general. The term frequency and TextRank methods were mediocre
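The idea of a keyword-based partial index can be sketched in a few lines. The example below is a minimal illustration, not the thesis's implementation: it assumes a smoothed TF-IDF weighting, a tiny hypothetical stopword list, and three invented toy documents, keeps only the top-k TF-IDF terms per document as the partial index, and ranks documents for a query by cosine similarity against either index.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "of", "shall", "be", "by", "this", "to", "at",
             "all", "from", "against", "a", "an", "and", "in"}

# Invented toy documents standing in for contract text.
DOCS = [
    "The seller shall indemnify the buyer against all losses arising from breach of warranty.",
    "This agreement shall be governed by the laws of the State of New York.",
    "The buyer shall pay the purchase price to the seller at closing.",
]

def tokenize(text):
    # Lowercase, keep alphabetic tokens, drop stopwords.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed weighting choice).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def top_keywords(vector, k=5):
    # Partial index entry: the k highest-weighted terms, ties broken alphabetically.
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, index, idf):
    # Return document ids ordered by decreasing cosine similarity to the query.
    q = {t: idf[t] for t in tokenize(query) if t in idf}
    scores = [cosine(q, vec) for vec in index]
    return sorted(range(len(index)), key=lambda i: -scores[i])

if __name__ == "__main__":
    full_index, idf = tfidf_vectors(DOCS)
    partial_index = [top_keywords(v, k=5) for v in full_index]
    print(search("governed laws state", full_index, idf))
    print(search("governed laws state", partial_index, idf))
```

Only the keyword dictionaries in `partial_index` would need to be stored, which is the point of the approach: the full document text never enters the index.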

    Modeling Scholar Profile in Expert Recommendation based on Multi-Layered Bibliographic Graph

    A recommendation system requires researcher profiles, called Scholar Profiles here, to make suggestions based on expertise. This dissertation contributes a model of an unbiased scholar profile that provides more objective evidence of expertise by considering interest changes and relying less on citations. Interest changes lead to diverse topics, so a researcher's level of expertise differs across topics. The scholar profile is expected to capture expertise in terms of productivity, which is often indicated by the volume of publications and citations. We include researchers' publishing behavior to avoid misleading citations. The expertise level of a researcher on a topic is therefore influenced by interest evolution, productivity, dynamicity, and behavior, all extracted from the bibliographic data of published scholarly articles. As the output of this dissertation, the scholar profile model is employed within a recommendation system that recommends productive researchers who can provide academic guidance. The scholar profile is generated from multiple layers of bibliographic data, such as an author layer, a topic layer, and the relations between those layers, to represent an academic social network. No predefined topic information is available in a cold-start situation, so topic mapping procedures are necessary. Features of productivity, dynamicity, and behavior of researchers within those layers are then taken from a number of observed years to accommodate the behavior aspect. We experimented with the AMiner dataset, which is often used in bibliographic data studies, to empirically investigate: (a) topic mapping strategies to obtain the interests of researchers, (b) a feature extraction model for the productivity, dynamicity, and behavior aspects based on the mapped topics, and (c) an expertise rank, derived from the scholar profile, that considers interest changes and relies less on citations. To ensure the validity of the results, our experiments used the standard expert list of AMiner researchers. We selected the Natural Language Processing and Information Extraction (NLP-IE) domains because of their familiarity and interrelated context, which makes it easier to introduce cases of interest changes. Using the mapped topics, we also made minor contributions on transformation procedures for visualizing researchers on maps of Scopus subjects and on investigating possible conflicts of interest