SoK: Training Machine Learning Models over Multiple Sources with Privacy Preservation
Nowadays, gathering high-quality training data from multiple data controllers
while preserving privacy is a key challenge in training high-quality machine
learning models. Potential solutions could dramatically break down the barriers
between isolated data silos and consequently enlarge the range of data
available for processing. To this end, both academic researchers and industry
vendors have recently been strongly motivated to propose two mainstream
families of solutions: 1) Secure Multi-party Learning (MPL for short); and
2) Federated Learning (FL for short). These two solutions have their own
advantages and limitations when evaluated in terms of privacy preservation,
modes of communication, communication overhead, data format, the accuracy of
trained models, and application scenarios.
Motivated to demonstrate the research progress and discuss insights on future
directions, we thoroughly investigate the protocols and frameworks of both MPL
and FL. First, we define the problem of training machine learning models over
multiple data sources with privacy preservation (TMMPP for short). Then, we
compare recent studies of TMMPP in terms of technical routes, the number of
parties supported, data partitioning, threat model, and supported machine
learning models, to show their advantages and limitations. Next, we introduce
state-of-the-art platforms that support online training over multiple data
sources. Finally, we discuss potential directions for resolving the problem of
TMMPP.
Comment: 17 pages, 4 figure
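The second family above, Federated Learning, trains a shared model while each controller's data stays local. A minimal sketch of one well-known FL protocol, Federated Averaging (FedAvg); the linear model, toy data, and hyperparameters are illustrative assumptions, not details from the survey:

```python
import numpy as np

def local_step(w, X, y, lr=0.1):
    """One gradient step of least-squares regression on a client's private data."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def fedavg_round(w_global, client_data):
    """Server broadcasts w_global; clients train locally; server takes a
    sample-count-weighted average of the returned models."""
    updates, sizes = [], []
    for X, y in client_data:
        updates.append(local_step(w_global.copy(), X, y))
        sizes.append(len(y))
    return np.average(updates, axis=0, weights=np.array(sizes, dtype=float))

# Three clients, each holding a private shard generated from the same model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(20, 2))
    clients.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(200):
    w = fedavg_round(w, clients)
print(np.round(w, 2))  # converges toward [ 2. -1.]
```

Only model parameters cross the network; raw data never leaves a client, which is the communication pattern that distinguishes FL from MPL's cryptographic joint computation.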
Adversarial Adaptation of Scene Graph Models for Understanding Civic Issues
Citizen engagement and technology usage are two emerging trends driven by
smart city initiatives. Governments around the world are adopting technology
for faster resolution of civic issues. Typically, citizens report issues, such
as broken roads, garbage dumps, etc. through web portals and mobile apps, in
order for the government authorities to take appropriate actions. Several
media -- text, image, audio, video -- are used to report these issues.
Through a user study with 13 citizens and 3 authorities, we found that images
are the most preferred medium for reporting civic issues. However, analyzing
civic-issue-related images is challenging for the authorities, as it requires
manual effort. Moreover, previous works have been limited to identifying a specific
set of issues from images. In this work, given an image, we propose to generate
a Civic Issue Graph consisting of a set of objects and the semantic relations
between them, which are representative of the underlying civic issue. We also
release two multi-modal (text and images) datasets, that can help in further
analysis of civic issues from images. We present a novel approach for
adversarial training of existing scene graph models that enables the use of
scene graphs for new applications in the absence of any labelled training data.
We conduct several experiments to analyze the efficacy of our approach, and
through human evaluation we establish the appropriateness of our model for
representing different civic issues.
Comment: Accepted at WWW'1
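The adversarial training described above generally pits a feature extractor against a domain discriminator. As a hedged illustration of the mechanism only (not the paper's actual architecture), the sketch below shows a single update with gradient reversal: the discriminator descends its domain-classification loss while the extractor ascends it, pushing features toward domain invariance. All models, shapes, and data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                  # a mixed batch of input features
domain = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)  # 1 = source, 0 = target

W = rng.normal(size=(4, 3))   # feature extractor (a single linear layer here)
v = rng.normal(size=3)        # domain discriminator weights

F = X @ W                     # extracted features
p = sigmoid(F @ v)            # discriminator's P(source)
g_logit = (p - domain) / len(X)   # dLoss/dlogit for mean binary cross-entropy

grad_v = F.T @ g_logit                 # gradient w.r.t. discriminator
grad_W = X.T @ np.outer(g_logit, v)    # gradient flowing back into extractor

lr = 0.01
v_new = v - lr * grad_v   # discriminator DESCENDS the domain loss
W_new = W + lr * grad_W   # gradient reversal: extractor ASCENDS it
```

Iterating these opposing updates is what lets a scene graph model trained on a labelled source domain adapt to an unlabelled target domain.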
Dynamically-Driven Inactivation of the Catalytic Machinery of the SARS 3C-Like Protease by the N214A Mutation on the Extra Domain
Despite utilizing the same chymotrypsin fold to host the catalytic machinery, coronavirus 3C-like proteases (3CLpro) noticeably differ from picornavirus 3C proteases in having acquired an extra helical domain in evolution. Previously, the extra domain was demonstrated to regulate the catalysis of the SARS-CoV 3CLpro by controlling its dimerization. Here, we studied N214A, another mutant with only a doubled dissociation constant but significantly abolished activity. Unexpectedly, N214A still adopts a dimeric structure almost identical to that of the wild-type (WT) enzyme. We therefore conducted 30-ns molecular dynamics (MD) simulations of N214A, WT, and R298A, which we previously characterized as a monomer with a collapsed catalytic machinery. Remarkably, the three proteases display distinctive dynamical behaviors. While in WT the catalytic machinery stably remains in the activated state, in R298A it remains largely collapsed in the inactivated state, implying that the two states are not only structurally very distinguishable but also dynamically well separated. Surprisingly, in N214A the catalytic dyad becomes dynamically unstable, and many residues constituting the catalytic machinery jump to sample conformations highly resembling those of R298A. The N214A mutation thus appears to trigger a dramatic change in the enzyme dynamics in the context of the dimeric form, which ultimately inactivates the catalytic machinery. The present MD simulations are the longest reported so far for the SARS-CoV 3CLpro, revealing that its catalysis critically depends on the dynamics, which can be remarkably modulated by the extra domain. Consequently, mediating the dynamics may offer a potential avenue to inhibit the SARS-CoV 3CLpro.
Schema Free Querying of Semantic Data
Developing interfaces that enable casual, non-expert users to query complex structured data has been the subject of much research over the past forty years. We refer to them as schema-free query interfaces, since they allow users to freely query data without understanding its schema, knowing how to refer to objects, or mastering the appropriate formal query language. Schema-free query interfaces address fundamental problems in natural language processing, databases, and AI in connecting users' conceptual models to machine representations. However, schema-free query interface systems face three hard problems. First, we still lack a practical interface. Natural Language Interfaces (NLIs) are easy for users but hard for machines: current NLP techniques are still unreliable in extracting the relational structure from natural language questions. Keyword query interfaces, on the other hand, have limited expressiveness and inherit ambiguity from the natural language terms used as keywords. Second, people express or model the same meaning in many different ways, which can result in vocabulary and structure mismatches between users' queries and the machine's representation. We still rely on ad hoc and labor-intensive approaches to deal with this "semantic heterogeneity problem". Third, the Web has seen increasing amounts of open-domain semantic data with heterogeneous or unknown schemas, which challenges traditional NLI systems that require a well-defined schema. Some modern systems gave up on translating the user query into a formal query at the schema level and chose to search the entity network (ABox) directly for matches of the user query. This approach, however, is computationally expensive and ad hoc in nature. In this thesis, we develop a novel approach to address these three hard problems.
We introduce a new schema-free query interface, the SFQ interface, in which users explicitly specify the relational structure of the query as a graphical skeleton and annotate it with freely chosen words, phrases, and entity names. This circumvents the unreliable step of extracting complete relations from natural language queries. We describe a framework for interpreting these SFQ queries over open-domain semantic data that automatically translates them into formal queries. First, we learn a schema statistically from the entity network and represent it as a graph, which we call the schema network. Our mapping algorithms run on the schema network rather than the entity network, enhancing scalability. We define the probability of observing a path on the schema network and, based on it, create two statistical association models that are used to carry out disambiguation. Novel mapping algorithms exploit semantic similarity measures and association measures to address the structure and vocabulary mismatch problems. Our approach is fully computational and requires no special lexicons, mapping rules, domain-specific syntactic or semantic grammars, thesauri, or hard-coded semantics. We evaluate our approach on two large datasets, DBLP+ and DBpedia. We developed DBLP+ by augmenting the DBLP dataset with additional data from CiteSeerX and ArnetMiner. We created 220 SFQ queries on the DBLP+ dataset. For DBpedia, we had three human subjects (who were unfamiliar with DBpedia) translate 33 natural language questions from the 2011 QALD workshop into SFQ queries. We carried out cross-validation on the 220 DBLP+ queries and cross-domain validation on the 99 DBpedia queries, in which the parameters tuned for the DBLP+ queries were applied to the DBpedia queries. The evaluation results on the two datasets show that our system achieves very good efficacy and efficiency.
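The schema network described above can be pictured as instance triples lifted to class-level edges with counts, over which path probabilities are defined. A toy sketch under assumed definitions; the data and the first-order probability model are illustrative, not the thesis's exact formulation:

```python
from collections import Counter, defaultdict

# Entity network (ABox) as typed instance triples:
# (subject, subject_class, relation, object, object_class)
triples = [
    ("alice", "Person", "authorOf", "p1", "Paper"),
    ("alice", "Person", "authorOf", "p2", "Paper"),
    ("bob",   "Person", "authorOf", "p2", "Paper"),
    ("p1",    "Paper",  "citedBy",  "p2", "Paper"),
    ("alice", "Person", "worksAt",  "umbc", "University"),
]

# Schema network: class-level edges weighted by supporting instance triples.
schema = Counter((sc, r, oc) for _, sc, r, _, oc in triples)

# Total outgoing mass per source class, to normalise counts into
# P(relation, object class | subject class).
out = defaultdict(int)
for (sc, r, oc), c in schema.items():
    out[sc] += c

def edge_prob(sc, r, oc):
    return schema[(sc, r, oc)] / out[sc]

def path_prob(path):
    """Probability of observing a path = product of its edge probabilities."""
    p = 1.0
    for sc, r, oc in path:
        p *= edge_prob(sc, r, oc)
    return p

print(edge_prob("Person", "authorOf", "Paper"))  # 3 of 4 Person edges -> 0.75
```

Mapping algorithms that walk this compact class-level graph avoid touching the much larger entity network, which is the scalability argument above.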
Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy
Pointwise mutual information (PMI) is a widely used word similarity measure, but it lacks a clear account of why it works. We explore how PMI differs from distributional similarity, and we introduce a novel metric, PMI_max, that augments PMI with information about a word's number of senses. The coefficients of PMI_max are determined empirically by maximizing a utility function based on the performance of automatic thesaurus generation. We show that it outperforms traditional PMI in automatic thesaurus generation and in two word similarity benchmark tasks: human similarity ratings and TOEFL synonym questions. PMI_max achieves a correlation coefficient comparable to the best knowledge-based approaches on the Miller-Charles similarity rating dataset.
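For reference, the baseline measure is PMI(x, y) = log [ P(x, y) / (P(x) P(y)) ]. A toy computation on an illustrative corpus; PMI_max's empirically fitted, polysemy-aware coefficients are not reproduced here:

```python
import math
from collections import Counter
from itertools import combinations

# Tiny illustrative corpus: each "document" is a co-occurrence window.
docs = [
    ["strong", "tea"],
    ["strong", "tea"],
    ["strong", "coffee"],
    ["powerful", "computer"],
]

word = Counter()
pair = Counter()
for d in docs:
    word.update(d)
    pair.update(frozenset(p) for p in combinations(d, 2))

n_docs = len(docs)

def pmi(x, y):
    """PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), probabilities estimated
    as document frequencies."""
    p_xy = pair[frozenset((x, y))] / n_docs
    p_x, p_y = word[x] / n_docs, word[y] / n_docs
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi("strong", "tea"), 3))  # log2(4/3) -> 0.415
```

Note PMI's well-known bias visible even here: the rare pair ("powerful", "computer") scores higher than the frequent pair ("strong", "tea"), which is the kind of behavior sense-count-aware corrections aim to address.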
Determining the prognosis of lung cancer from mutated genes using a deep learning survival model: a large multi-center study
Background: Gene status has become the focus of prognosis prediction. Furthermore, deep learning has frequently been applied to medical imaging to diagnose, prognosticate, and evaluate treatment responses in patients with cancer. However, few deep learning survival (DLS) models based on mutational genes that are directly associated with patient prognosis in terms of progression-free survival (PFS) or overall survival (OS) have been reported. Additionally, DLS models have not been applied to determine immunotherapy (IO)-related prognosis based on mutational genes. Herein, we developed a deep learning method to predict the prognosis of patients with lung cancer treated with or without IO.
Methods: Samples from 6542 patients from different centers were subjected to genome sequencing. A DLS model based on multi-panels of somatic mutations was trained and validated to predict OS in patients treated without IO and PFS in patients treated with IO.
Results: In patients treated without IO, the DLS model (low vs. high DLS) was trained on the MSK-MET training cohort (HR = 0.241 [0.213–0.273], P < 0.001) and tested on the MSK-MET internal validation cohort (HR = 0.175 [0.148–0.206], P < 0.001). The DLS model was then validated on the OncoSG, MSK-CSC, and TCGA-LUAD cohorts (HR = 0.420 [0.272–0.649], P < 0.001; HR = 0.550 [0.424–0.714], P < 0.001; HR = 0.215 [0.159–0.291], P < 0.001, respectively). Subsequently, it was fine-tuned and retrained on patients treated with IO. The DLS model (low vs. high DLS) could predict PFS and OS in the MIND, MSKCC, and POPLAR/OAK cohorts (P < 0.001 for each). Compared with tumor-node-metastasis staging, the Cox model, tumor mutational burden, and programmed death-ligand 1 expression, the DLS model had the highest C-index in patients treated with or without IO.
Conclusions: The DLS model based on mutational genes can robustly predict the prognosis of patients with lung cancer treated with or without IO.
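The C-index used above for model comparison measures how often a model's risk ranking agrees with observed survival: 0.5 is chance, 1.0 is perfect concordance. A minimal sketch of Harrell's C-index for right-censored data, with toy values rather than data from the study:

```python
def c_index(times, events, risk):
    """Fraction of comparable patient pairs in which the higher-risk patient
    fails first. A pair (i, j) is comparable when times[i] < times[j] and
    patient i's failure was observed (events[i] == 1). Ties in risk count
    as half-concordant; ties in time are ignored for simplicity."""
    concordant = comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy cohort: survival time (months), event (1 = death, 0 = censored),
# and model-predicted risk score.
times  = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risk   = [0.9, 0.7, 0.4, 0.2]
print(c_index(times, events, risk))  # 1.0 (risk ordering matches outcomes)
```

Note how the censored patient (12 months, event 0) contributes only as the later member of pairs, which is why censoring-aware metrics like this, rather than plain accuracy, are standard for survival models.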