
    SoK: Training Machine Learning Models over Multiple Sources with Privacy Preservation

    Gathering high-quality training data from multiple data controllers while preserving privacy is a key challenge in training high-quality machine learning models. Potential solutions could dramatically break down the barriers among isolated data corpora and consequently enlarge the range of data available for processing. To this end, both academic researchers and industrial vendors have recently been strongly motivated to propose two mainstream families of solutions: 1) Secure Multi-party Learning (MPL for short); and 2) Federated Learning (FL for short). These two solutions have their own advantages and limitations when evaluated in terms of privacy preservation, mode of communication, communication overhead, data format, accuracy of trained models, and application scenarios. Motivated to demonstrate the research progress and discuss insights on future directions, we thoroughly investigate the protocols and frameworks of both MPL and FL. First, we define the problem of training machine learning models over multiple data sources with privacy preservation (TMMPP for short). Then, we compare recent studies of TMMPP in terms of technical routes, number of parties supported, data partitioning, threat model, and supported machine learning models, to show their advantages and limitations. Next, we introduce the state-of-the-art platforms that support online training over multiple data sources. Finally, we discuss potential directions for resolving the problem of TMMPP. Comment: 17 pages, 4 figures
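The FL branch surveyed above is typically built around server-side aggregation of locally trained models. As an illustrative sketch (not taken from the survey itself), a minimal federated-averaging step over three data sources might look like:

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg-style).

    client_weights: list of 1-D parameter vectors, one per data source.
    client_sizes:   number of local training samples per source.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()        # weight by local data volume
    stacked = np.stack(client_weights)  # shape: (n_clients, n_params)
    return coeffs @ stacked             # server-side aggregate

# Three hypothetical sources holding different amounts of data:
w = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
n = [10, 10, 20]
global_w = fed_avg(w, n)  # -> array([3.5, 4.5])
```

Note that plain averaging exposes individual updates to the server; the MPL line of work replaces this step with cryptographic secure aggregation at the cost of extra communication.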

    Adversarial Adaptation of Scene Graph Models for Understanding Civic Issues

    Citizen engagement and technology usage are two emerging trends driven by smart city initiatives. Governments around the world are adopting technology for faster resolution of civic issues. Typically, citizens report issues such as broken roads, garbage dumps, etc. through web portals and mobile apps so that government authorities can take appropriate action. Several media -- text, image, audio, video -- are used to report these issues. Through a user study with 13 citizens and 3 authorities, we found that image is the most preferred medium for reporting civic issues. However, analyzing civic-issue-related images is challenging for the authorities, as it requires manual effort. Moreover, previous work has been limited to identifying a specific set of issues from images. In this work, given an image, we propose to generate a Civic Issue Graph consisting of a set of objects and the semantic relations between them, which are representative of the underlying civic issue. We also release two multi-modal (text and image) datasets that can help in further analysis of civic issues from images. We present a novel approach for adversarial training of existing scene graph models that enables the use of scene graphs for new applications in the absence of any labelled training data. We conduct several experiments to analyze the efficacy of our approach and, using human evaluation, establish the appropriateness of our model in representing different civic issues. Comment: Accepted at WWW'1

    Dynamically-Driven Inactivation of the Catalytic Machinery of the SARS 3C-Like Protease by the N214A Mutation on the Extra Domain

    Despite utilizing the same chymotrypsin fold to host the catalytic machinery, coronavirus 3C-like proteases (3CLpro) differ noticeably from picornavirus 3C proteases in having acquired an extra helical domain in evolution. Previously, the extra domain was demonstrated to regulate the catalysis of the SARS-CoV 3CLpro by controlling its dimerization. Here, we studied N214A, another mutant with only a doubled dissociation constant but significantly abolished activity. Unexpectedly, N214A still adopts a dimeric structure almost identical to that of the wild-type (WT) enzyme. We therefore conducted 30-ns molecular dynamics (MD) simulations of N214A, WT, and R298A, which we previously characterized as a monomer with collapsed catalytic machinery. Remarkably, the three proteases display distinctive dynamical behaviors: in WT, the catalytic machinery stably remains in the activated state, whereas in R298A it remains largely collapsed in the inactivated state, implying that the two states are not only structurally very distinguishable but also dynamically well separated. Surprisingly, in N214A the catalytic dyad becomes dynamically unstable, and many residues constituting the catalytic machinery jump to sample conformations highly resembling those of R298A. The N214A mutation thus appears to trigger a dramatic change in the enzyme dynamics in the context of the dimeric form, which ultimately inactivates the catalytic machinery. The present MD simulations represent the longest reported so far for the SARS-CoV 3CLpro, unveiling that its catalysis is critically dependent on the dynamics, which can be remarkably modulated by the extra domain. Consequently, mediating the dynamics may offer a potential avenue to inhibit the SARS-CoV 3CLpro.

    Schema Free Querying of Semantic Data

    Developing interfaces to enable casual, non-expert users to query complex structured data has been the subject of much research over the past forty years. We refer to them as schema-free query interfaces, since they allow users to freely query data without understanding its schema, knowing how to refer to objects, or mastering the appropriate formal query language. Schema-free query interfaces address fundamental problems in natural language processing, databases and AI to connect users' conceptual models and machine representations. However, schema-free query interface systems face three hard problems. First, we still lack a practical interface. Natural Language Interfaces (NLIs) are easy for users but hard for machines: current NLP techniques are still unreliable in extracting the relational structure from natural language questions. Keyword query interfaces, on the other hand, have limited expressiveness and inherit ambiguity from the natural language terms used as keywords. Second, people express or model the same meaning in many different ways, which can result in vocabulary and structure mismatches between users' queries and the machines' representation. We still rely on ad hoc and labor-intensive approaches to deal with this "semantic heterogeneity problem". Third, the Web has seen increasing amounts of open domain semantic data with heterogeneous or unknown schemas, which challenges traditional NLI systems that require a well-defined schema. Some modern systems gave up the approach of translating the user query into a formal query at the schema level and instead search the entity network (ABox) directly for matches of the user query. This approach, however, is computationally expensive and ad hoc in nature. In this thesis, we develop a novel approach to address the three hard problems.
We introduce a new schema-free query interface, the SFQ interface, in which users explicitly specify the relational structure of the query as a graphical skeleton and annotate it with freely chosen words, phrases and entity names. This circumvents the unreliable step of extracting complete relations from natural language queries. We describe a framework for interpreting these SFQ queries over open domain semantic data that automatically translates them into formal queries. First, we learn a schema statistically from the entity network and represent it as a graph, which we call the schema network. Our mapping algorithms run on the schema network rather than the entity network, enhancing scalability. We define the probability of observing a path on the schema network and, based on it, create two statistical association models that are used to carry out disambiguation. Novel mapping algorithms are developed that exploit semantic similarity measures and association measures to address the structure and vocabulary mismatch problems. Our approach is fully computational and requires no special lexicons, mapping rules, domain-specific syntactic or semantic grammars, thesauri or hard-coded semantics. We evaluate our approach on two large datasets, DBLP+ and DBpedia. We developed DBLP+ by augmenting the DBLP dataset with additional data from CiteSeerX and ArnetMiner. We created 220 SFQ queries on the DBLP+ dataset. For DBpedia, we had three human subjects (who were unfamiliar with DBpedia) translate 33 natural language questions from the 2011 QALD workshop into SFQ queries. We carried out cross-validation on the 220 DBLP+ queries and cross-domain validation on the 99 DBpedia queries, in which the parameters tuned for the DBLP+ queries were applied to the DBpedia queries. The evaluation results on the two datasets show that our system has very good efficacy and efficiency.
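The schema-network idea described above can be illustrated by lifting instance-level edges to type-level edges and counting them. The triples and type names below are invented for illustration, not taken from the thesis:

```python
from collections import Counter

# Hypothetical instance-level triples (subject, relation, object)
# and a map from each entity to its type.
triples = [
    ("alice",  "authorOf", "paper1"),
    ("bob",    "authorOf", "paper2"),
    ("paper1", "citedBy",  "paper2"),
]
entity_type = {"alice": "Person", "bob": "Person",
               "paper1": "Paper", "paper2": "Paper"}

# Lift each instance edge to a schema-level edge (type, relation, type)
# and count how often it is observed in the entity network.
schema_edges = Counter(
    (entity_type[s], r, entity_type[o]) for s, r, o in triples
)

# Empirical probability of observing each schema edge, the kind of
# statistic a path-probability model can be built on.
total = sum(schema_edges.values())
schema_prob = {edge: c / total for edge, c in schema_edges.items()}
```

Because mapping algorithms then operate on the much smaller type-level graph rather than the full entity network, scalability improves as the abstract claims.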

    Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy

    Pointwise mutual information (PMI) is a widely used word similarity measure, but it lacks a clear explanation of how it works. We explore how PMI differs from distributional similarity, and we introduce a novel metric, PMImax, that augments PMI with information about a word's number of senses. The coefficients of PMImax are determined empirically by maximizing a utility function based on the performance of automatic thesaurus generation. We show that it outperforms traditional PMI in the application of automatic thesaurus generation and in two word similarity benchmark tasks: human similarity ratings and TOEFL synonym questions. PMImax achieves a correlation coefficient comparable to those of the best knowledge-based approaches on the Miller-Charles similarity rating data set.
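As a reminder of the baseline measure the abstract builds on, PMI(x, y) = log2 p(x, y) / (p(x) p(y)). A toy computation from co-occurrence counts (counts invented for illustration; this is standard PMI, not the PMImax variant):

```python
import math
from collections import Counter

def pmi(cooc, x, y):
    """Pointwise mutual information from a pair co-occurrence table.

    cooc: Counter mapping (word, word) pairs to joint counts.
    """
    total = sum(cooc.values())
    px = sum(c for (a, _), c in cooc.items() if a == x) / total
    py = sum(c for (_, b), c in cooc.items() if b == y) / total
    pxy = cooc[(x, y)] / total
    return math.log2(pxy / (px * py))

# Invented toy counts: "car" co-occurs with "auto" far more often
# than chance, and with "banana" less often than chance.
cooc = Counter({("car", "auto"): 8, ("car", "banana"): 1,
                ("dog", "auto"): 1, ("dog", "banana"): 2})
```

Here pmi(cooc, "car", "auto") is positive while pmi(cooc, "car", "banana") is negative, matching the intuition that PMI rewards above-chance co-occurrence.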

    Determining the prognosis of Lung cancer from mutated genes using a deep learning survival model: a large multi-center study

    Abstract
    Background: Gene status has become the focus of prognosis prediction. Furthermore, deep learning has frequently been implemented in medical imaging to diagnose, prognosticate, and evaluate treatment responses in patients with cancer. However, few deep learning survival (DLS) models based on mutational genes that are directly associated with patient prognosis in terms of progression-free survival (PFS) or overall survival (OS) have been reported. Additionally, DLS models have not been applied to determine immunotherapy (IO)-related prognosis based on mutational genes. Herein, we developed a deep learning method to predict the prognosis of patients with lung cancer treated with or without IO.
    Methods: Samples from 6542 patients from different centers were subjected to genome sequencing. A DLS model based on multi-panels of somatic mutations was trained and validated to predict OS in patients treated without IO and PFS in patients treated with IO.
    Results: In patients treated without IO, the DLS model (low vs. high DLS) was trained using the training MSK-MET cohort (HR = 0.241 [0.213-0.273], P < 0.001) and tested in the inter-validation MSK-MET cohort (HR = 0.175 [0.148-0.206], P < 0.001). The DLS model was then validated with the OncoSG, MSK-CSC, and TCGA-LUAD cohorts (HR = 0.420 [0.272-0.649], P < 0.001; HR = 0.550 [0.424-0.714], P < 0.001; HR = 0.215 [0.159-0.291], P < 0.001, respectively). Subsequently, it was fine-tuned and retrained in patients treated with IO. The DLS model (low vs. high DLS) could predict PFS and OS in the MIND, MSKCC, and POPLAR/OAK cohorts (P < 0.001, respectively). Compared with tumor-node-metastasis staging, the Cox model, tumor mutational burden, and programmed death-ligand 1 expression, the DLS model had the highest C-index in patients treated with or without IO.
    Conclusions: The DLS model based on mutational genes can robustly predict the prognosis of patients with lung cancer treated with or without IO.
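The C-index used above to compare the DLS model against Cox and other baselines is Harrell's concordance index: the fraction of usable patient pairs in which the patient with the higher predicted risk actually fails earlier. A minimal sketch (variable names and toy values are illustrative, not from the study):

```python
def concordance_index(times, events, risk):
    """Harrell's C-index over usable patient pairs.

    times:  observed follow-up times
    events: 1 if the event (death/progression) was observed, 0 if censored
    risk:   model-predicted risk scores (higher = worse prognosis)
    """
    num, den = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable if patient i failed before patient j's
            # follow-up ended; censored patients cannot anchor a pair.
            if events[i] == 1 and times[i] < times[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1       # concordant pair
                elif risk[i] == risk[j]:
                    num += 0.5     # tied risk counts half
    return num / den

# Perfectly ordered toy cohort: earlier failures get higher risk.
c = concordance_index([2, 4, 6], [1, 1, 1], [0.9, 0.5, 0.1])  # -> 1.0
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why it serves as the common yardstick across the staging, Cox, TMB, and PD-L1 baselines.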