Cross-lingual Entity Alignment via Joint Attribute-Preserving Embedding
Entity alignment is the task of finding entities in two knowledge bases (KBs)
that represent the same real-world object. When facing KBs in different natural
languages, conventional cross-lingual entity alignment methods rely on machine
translation to eliminate the language barriers. These approaches often suffer
from the uneven quality of translations between languages. While recent
embedding-based techniques encode entities and relationships in KBs and do not
need machine translation for cross-lingual entity alignment, a significant
number of attributes remain largely unexplored. In this paper, we propose a
joint attribute-preserving embedding model for cross-lingual entity alignment.
It jointly embeds the structures of two KBs into a unified vector space and
further refines it by leveraging attribute correlations in the KBs. Our
experimental results on real-world datasets show that this approach
significantly outperforms the state-of-the-art embedding approaches for
cross-lingual entity alignment and could be complemented with methods based on
machine translation.
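
A minimal sketch of the general idea, assuming a TransE-style translational structure loss over the merged KBs plus a term that pulls attribute-correlated entities together; the model class, dimensions, and loss weights here are illustrative assumptions, not the paper's exact formulation:

    # Hypothetical sketch: translational structure embeddings for two merged KBs,
    # refined by an attribute-correlation term. Seed alignments would share the
    # same embedding rows so that both KBs land in one vector space.
    import torch
    import torch.nn as nn

    class JointAlignmentEmbedding(nn.Module):
        def __init__(self, n_entities, n_relations, dim=75):
            super().__init__()
            self.ent = nn.Embedding(n_entities, dim)
            self.rel = nn.Embedding(n_relations, dim)

        def structure_loss(self, h, r, t, h_neg, t_neg, margin=1.0):
            # Translational scoring: ||h + r - t|| should be small for true triples.
            pos = (self.ent(h) + self.rel(r) - self.ent(t)).norm(dim=1)
            neg = (self.ent(h_neg) + self.rel(r) - self.ent(t_neg)).norm(dim=1)
            return torch.relu(margin + pos - neg).mean()

        def attribute_loss(self, pairs, sim):
            # Pull entity pairs with high attribute correlation (sim in [0, 1]) closer.
            e1, e2 = self.ent(pairs[:, 0]), self.ent(pairs[:, 1])
            return (sim * (e1 - e2).norm(dim=1)).mean()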
Recurrent Latent Variable Networks for Session-Based Recommendation
In this work, we attempt to ameliorate the impact of data sparsity in the
context of session-based recommendation. Specifically, we seek to devise a
machine learning mechanism capable of extracting subtle and complex underlying
temporal dynamics in the observed session data, so as to inform the
recommendation algorithm. To this end, we improve upon systems that utilize
deep learning techniques with recurrently connected units; we do so by adopting
concepts from the field of Bayesian statistics, namely variational inference.
Our proposed approach consists in treating the network recurrent units as
stochastic latent variables with a prior distribution imposed over them. On
this basis, we proceed to infer corresponding posteriors; these can be used for
prediction and recommendation generation, in a way that accounts for the
uncertainty in the available sparse training data. To allow our approach to
easily scale to large real-world datasets, we perform inference under an
approximate amortized variational inference (AVI) setup, whereby the learned
posteriors are parameterized via (conventional) neural networks. We perform an
extensive experimental evaluation of our approach using challenging benchmark
datasets, and illustrate its superiority over existing state-of-the-art
techniques.
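
A minimal sketch of the kind of architecture described above, assuming a GRU encoder whose last state parameterizes a Gaussian posterior over a latent variable (amortized variational inference with the reparameterization trick); the layer sizes and exact network are assumptions, not the paper's specification:

    import torch
    import torch.nn as nn

    class VariationalSessionRec(nn.Module):
        def __init__(self, n_items, emb=64, hidden=100, latent=32):
            super().__init__()
            self.embed = nn.Embedding(n_items, emb)
            self.gru = nn.GRU(emb, hidden, batch_first=True)
            self.to_mu = nn.Linear(hidden, latent)
            self.to_logvar = nn.Linear(hidden, latent)
            self.decode = nn.Linear(latent, n_items)

        def forward(self, session):              # session: (batch, seq_len) item ids
            h, _ = self.gru(self.embed(session))
            h_last = h[:, -1]                    # summary of the observed session prefix
            mu, logvar = self.to_mu(h_last), self.to_logvar(h_last)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
            logits = self.decode(z)              # scores over candidate next items
            kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
            return logits, kl                    # minimize cross-entropy(logits) + kl

Training would minimize next-item cross-entropy plus the KL term (the negative ELBO), which is what lets the inferred posterior reflect uncertainty under sparse session data.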
Differentially Private Model Selection with Penalized and Constrained Likelihood
In statistical disclosure control, the goal of data analysis is twofold: The
released information must provide accurate and useful statistics about the
underlying population of interest, while minimizing the potential for an
individual record to be identified. In recent years, the notion of differential
privacy has received much attention in theoretical computer science, machine
learning, and statistics. It provides a rigorous and strong notion of
protection for individuals' sensitive information. A fundamental question is
how to incorporate differential privacy into traditional statistical inference
procedures. In this paper we study model selection in multivariate linear
regression under the constraint of differential privacy. We show that model
selection procedures based on penalized least squares or likelihood can be made
differentially private by a combination of regularization and randomization,
and propose two algorithms to do so. We show that our private procedures are
consistent under essentially the same conditions as the corresponding
non-private procedures. We also find that under differential privacy, the
procedure becomes more sensitive to the tuning parameters. We illustrate and
evaluate our method using simulation studies and two real data examples.
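
A minimal sketch of the generic "regularization + randomization" recipe (objective perturbation for penalized least squares); the noise scale below is a placeholder and is not calibrated as in the paper's algorithms:

    # Hypothetical sketch: lasso-penalized least squares with a random linear
    # perturbation of the objective. A real differentially private procedure must
    # calibrate the noise scale b to the privacy budget and the data's sensitivity.
    import numpy as np
    from scipy.optimize import minimize

    def perturbed_lasso(X, y, lam=0.1, b=1.0, seed=None):
        rng = np.random.default_rng(seed)
        n, p = X.shape
        noise = rng.laplace(scale=b, size=p)     # randomization

        def objective(beta):
            resid = y - X @ beta
            return resid @ resid / (2 * n) + lam * np.abs(beta).sum() + noise @ beta / n

        res = minimize(objective, np.zeros(p), method="Powell")
        return res.x                             # selected model: (near-)nonzero coordinates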
Building Legal Case Retrieval Systems with Lexical Matching and Summarization using A Pre-Trained Phrase Scoring Model
We present our method for tackling the legal case retrieval task of the
Competition on Legal Information Extraction/Entailment 2019. Our approach is
based on the idea that summarization is important for retrieval. On the one hand,
we adopt a summarization-based model called encoded summarization, which encodes
a given document into a continuous vector space that captures its summary
properties. We utilize the resource of COLIEE 2018, on which we
train the document representation model. On the other hand, we extract lexical
features on different parts of a given query and its candidates. We observe
that by comparing different parts of the query and its candidates, we can
achieve better performance. Furthermore, the combination of the lexical
features with latent features from the summarization-based method achieves even
better performance. We have achieved the state-of-the-art result for the task
on the benchmark of the competition.
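
A minimal sketch of combining part-wise lexical matching with a latent summary-embedding similarity; the part names, weighting, and TF-IDF features are illustrative assumptions, not the system's actual scorer:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def lexical_score(query_part, cand_part, vectorizer):
        q = vectorizer.transform([query_part])
        c = vectorizer.transform([cand_part])
        return cosine_similarity(q, c)[0, 0]

    def combined_score(query_parts, cand_parts, q_summary_vec, c_summary_vec,
                       vectorizer, w_latent=0.5):
        # Compare corresponding parts of query and candidate (e.g., facts vs. facts),
        # then blend the averaged lexical score with the latent (summary) similarity.
        lex = np.mean([lexical_score(q, c, vectorizer)
                       for q, c in zip(query_parts, cand_parts)])
        latent = cosine_similarity(q_summary_vec.reshape(1, -1),
                                   c_summary_vec.reshape(1, -1))[0, 0]
        return (1 - w_latent) * lex + w_latent * latent

Here the vectorizer would be fit on the case corpus (e.g., TfidfVectorizer().fit(corpus)), and the summary vectors would come from the encoded-summarization model.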
MRI-based Surgical Planning for Lumbar Spinal Stenosis
The most common reason for spinal surgery in elderly patients is lumbar
spinal stenosis (LSS). For LSS, treatment decisions based on clinical and
radiological information, as well as the personal experience of the surgeon, show
large variance. Thus a standardized support system is of high value for a more
objective and reproducible decision. In this work, we develop an automated
algorithm to localize the stenosis causing the symptoms of the patient in
magnetic resonance imaging (MRI). With 22 MRI features of each of five spinal
levels of 321 patients, we show it is possible to predict the location of
the lesion triggering the symptoms. To support this hypothesis, we conduct an
automated analysis of labeled and unlabeled MRI scans extracted from 788
patients. We confirm quantitatively the importance of radiological information
and provide an algorithmic pipeline for working with raw MRI scans.
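
A minimal sketch of the prediction setup (one row of 22 radiological features per spinal level, a binary label marking the symptomatic level); the placeholder data, model choice, and evaluation are assumptions, not the study's pipeline:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(321 * 5, 22))       # placeholder: 321 patients x 5 levels x 22 features
    y = rng.integers(0, 2, size=321 * 5)     # placeholder: 1 if this level causes the symptoms

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print("cross-validated AUC:", scores.mean())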
Sharing Social Network Data: Differentially Private Estimation of Exponential-Family Random Graph Models
Motivated by a real-life problem of sharing social network data that contain
sensitive personal information, we propose a novel approach to release and
analyze synthetic graphs in order to protect privacy of individual
relationships captured by the social network while maintaining the validity of
statistical results. A case study using a version of the Enron e-mail corpus
dataset demonstrates the application and usefulness of the proposed techniques
in solving the challenging problem of maintaining privacy and supporting
open access to network data to ensure reproducibility of existing studies and
discovering new scientific insights that can be obtained by analyzing such
data. We use a simple yet effective randomized response mechanism to generate
synthetic networks under ε-edge differential privacy, and then use
likelihood-based inference for missing data and Markov chain Monte Carlo
techniques to fit exponential-family random graph models to the generated
synthetic networks.
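
A minimal sketch of randomized response on an adjacency matrix: flipping each potential edge independently with probability 1/(1 + e^ε) satisfies ε-edge differential privacy; the subsequent ERGM fit via missing-data likelihood and MCMC is not shown:

    import numpy as np

    def randomized_response_graph(adj, eps, seed=None):
        # adj: symmetric 0/1 adjacency matrix of an undirected graph
        rng = np.random.default_rng(seed)
        n = adj.shape[0]
        p_flip = 1.0 / (1.0 + np.exp(eps))   # keep an edge value w.p. e^eps / (1 + e^eps)
        flips = rng.random((n, n)) < p_flip
        flips = np.triu(flips, k=1)          # decide each dyad once
        flips = flips | flips.T
        noisy = np.where(flips, 1 - adj, adj)
        np.fill_diagonal(noisy, 0)
        return noisy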
Post-operative outcomes and predictors of mortality after colorectal cancer surgery in the very elderly patients
Background: The frailty of very elderly patients who undergo surgery for colorectal cancer negatively influences postoperative mortality. This study aimed to identify risk factors for postoperative mortality in octogenarian and nonagenarian patients who underwent surgical treatment for colorectal cancer. Methods: This is a single-institution retrospective study. The primary outcomes were risk factors for postoperative mortality. The variables of the octogenarians and nonagenarians were compared using the t-test, chi-square test, and Fisher exact test. A multivariate logistic regression analysis was carried out on the combined cohorts. Results: We identified 319 octogenarians and 43 nonagenarians (N = 362) who underwent surgery for colorectal cancer at the Sant'Orsola-Malpighi University Hospital in Bologna between 2011 and 2015. The 30-day post-operative mortality was 6% (N = 18) among octogenarians and 21% (N = 9) among nonagenarians. The groups significantly differed in the type of surgery (elective vs. urgent surgery, p < 0.0001), ASA score (p = 0.0003), and rates of 30-day postoperative mortality (6% vs. 21%, p = 0.0003). In the multivariate analysis, ASA > III (OR 2.37, 95% CI [1.43–3.93], p < 0.001) and urgent surgery (OR 2.17, 95% CI [1.17–4.04], p = 0.014) were associated with post-operative mortality. In contrast, pre-operative albumin ≥ 3.4 g/dL (OR 0.14, 95% CI [0.05–0.52], p = 0.001) was associated with a protective effect on postoperative mortality. Conclusions: In the very elderly affected by colorectal cancer, preoperative nutritional status and pre-existing comorbidities, rather than age itself, should be considered as selection criteria for surgery. Preoperative improvement of nutritional status and ASA risk assessment may be beneficial for stratification of patients and ultimately for optimizing outcomes.
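
A minimal sketch of the kind of multivariate logistic regression reported above, with placeholder data and variable names (not the study's dataset), showing how odds ratios and 95% confidence intervals are obtained:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "died_30d":   rng.binomial(1, 0.08, 362),   # 30-day postoperative mortality
        "asa_gt_3":   rng.binomial(1, 0.40, 362),   # ASA score > III
        "urgent":     rng.binomial(1, 0.30, 362),   # urgent (vs. elective) surgery
        "albumin_ok": rng.binomial(1, 0.60, 362),   # pre-operative albumin >= 3.4 g/dL
    })

    model = smf.logit("died_30d ~ asa_gt_3 + urgent + albumin_ok", data=df).fit(disp=0)
    summary = pd.concat([np.exp(model.params).rename("OR"),
                         np.exp(model.conf_int()).rename(columns={0: "2.5%", 1: "97.5%"})],
                        axis=1)
    print(summary)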
BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
The rising volume of datasets has made training machine learning (ML) models
a major computational cost in the enterprise. Given the iterative nature of
model and parameter tuning, many analysts use a small sample of their entire
data during their initial stage of analysis to make quick decisions (e.g., what
features or hyperparameters to use) and use the entire dataset only in later
stages (i.e., when they have converged to a specific model). This sampling,
however, is performed in an ad-hoc fashion. Most practitioners cannot precisely
capture the effect of sampling on the quality of their model, and eventually on
their decision-making process during the tuning phase. Moreover, without
systematic support for sampling operators, many optimizations and reuse
opportunities are lost.
In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML
training. BlinkML allows users to make error-computation tradeoffs: instead of
training a model on their full data (i.e., full model), BlinkML can quickly
train an approximate model with quality guarantees using a sample. The quality
guarantees ensure that, with high probability, the approximate model makes the
same predictions as the full model. BlinkML currently supports any ML model
that relies on maximum likelihood estimation (MLE), which includes Generalized
Linear Models (e.g., linear regression, logistic regression, max entropy
classifier, Poisson regression) as well as PPCA (Probabilistic Principal
Component Analysis). Our experiments show that BlinkML can speed up the
training of large-scale ML tasks by 6.26x-629x while guaranteeing the same
predictions, with 95% probability, as the full model.
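
A minimal sketch of the sample-versus-full trade-off BlinkML addresses; for illustration this trains the full model to measure prediction agreement directly, whereas BlinkML's contribution is estimating that agreement without ever fitting the full model:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, d = 200_000, 20
    X = rng.normal(size=(n, d))
    y = (X @ rng.normal(size=d) + rng.normal(size=n) > 0).astype(int)

    sample = rng.choice(n, size=5_000, replace=False)           # small uniform sample
    approx = LogisticRegression(max_iter=1000).fit(X[sample], y[sample])
    full = LogisticRegression(max_iter=1000).fit(X, y)          # the cost BlinkML avoids

    agreement = (approx.predict(X) == full.predict(X)).mean()
    print(f"approximate model matches the full model on {agreement:.1%} of predictions")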
An Exploratory Analysis of the Latent Structure of Process Data via Action Sequence Autoencoder
Computer simulations have become a popular tool for assessing complex skills
such as problem-solving skills. Log files of computer-based items record the
entire human-computer interactive processes for each respondent. The response
processes are very diverse, noisy, and of nonstandard formats. Few generic
methods have been developed for exploiting the information contained in process
data. In this article, we propose a method to extract latent variables from
process data. The method utilizes a sequence-to-sequence autoencoder to
compress response processes into standard numerical vectors. It does not
require prior knowledge of the specific items and human-computer interaction
patterns. The proposed method is applied to both simulated and real process
data to demonstrate that the resulting latent variables extract useful
information from the response processes.
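
A minimal sketch of a sequence-to-sequence autoencoder over action sequences, assuming a GRU encoder/decoder with teacher forcing; the vocabulary size, dimensions, and training details are illustrative assumptions:

    import torch
    import torch.nn as nn

    class ActionSeqAutoencoder(nn.Module):
        def __init__(self, n_actions, emb=32, latent=64):
            super().__init__()
            self.embed = nn.Embedding(n_actions, emb)
            self.encoder = nn.GRU(emb, latent, batch_first=True)
            self.decoder = nn.GRU(emb, latent, batch_first=True)
            self.out = nn.Linear(latent, n_actions)

        def forward(self, seq):                   # seq: (batch, seq_len) action ids
            _, h = self.encoder(self.embed(seq))  # h: fixed-length summary of the process
            dec_in = self.embed(seq[:, :-1])      # teacher forcing: predict actions 2..T
            dec_out, _ = self.decoder(dec_in, h)
            logits = self.out(dec_out)
            return logits, h.squeeze(0)           # latent vector used as extracted features

The reconstruction loss would be cross-entropy between the logits and seq[:, 1:], and the latent vectors become the standard numerical features for downstream analysis.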
In vitro biosafety profile evaluation of multipotent mesenchymal stem cells derived from the bone marrow of sarcoma patients.
BACKGROUND: In osteosarcoma (OS) and most Ewing sarcoma (EWS) patients, the primary tumor originates in the bone. Although tumor resection surgery is commonly used to treat these diseases, it frequently leaves massive bone defects that are particularly difficult to treat. Due to the therapeutic potential of mesenchymal stem cells (MSCs), OS and EWS patients could benefit from an autologous MSC-based bone reconstruction. However, safety concerns regarding the in vitro expansion of bone marrow-derived MSCs have been raised. To investigate the possible oncogenic potential of MSCs from OS or EWS patients (MSC-SAR) after expansion, this study focused on a biosafety assessment of MSC-SAR obtained after short- and long-term cultivation compared with MSCs from healthy donors (MSC-CTRL). METHODS: We initially characterized the morphology, immunophenotype, and differentiation multipotency of isolated MSC-SAR. MSC-SAR and MSC-CTRL were subsequently expanded under identical culture conditions. Cells at the early (P3/P4) and late (P10) passages were collected for the in vitro analyses, including: sequencing of genes frequently mutated in OS and EWS, evaluation of telomerase activity, assessment of the gene expression profile and activity of major cancer pathways, cytogenetic analysis on synchronous MSC, and molecular karyotyping using a comparative genomic hybridization (CGH) array. RESULTS: MSC-SAR displayed morphology, immunophenotype, proliferation rate, differentiation potential, and telomerase activity comparable to MSC-CTRL. Both cell types displayed signs of senescence in the late stages of culture with no relevant changes in cancer gene expression. However, cytogenetic analysis detected chromosomal anomalies in the early and late stages of both MSC-SAR and MSC-CTRL after culture. CONCLUSIONS: Our results demonstrated that the in vitro expansion of MSCs does not influence or favor malignant transformation, since MSC-SAR were not more prone than MSC-CTRL to deleterious changes during culture. However, the presence of chromosomal aberrations supports rigorous phenotypic, functional, and genetic evaluation of the biosafety of MSCs, which is important for clinical applications.