18 research outputs found
Efficient inference of large prokaryotic pangenomes with PanTA
Pangenome inference is an indispensable step in bacterial genomics, yet its scalability poses a challenge due to the rapid growth of genomic collections. This paper presents PanTA, a software package designed for constructing pangenomes of large bacterial datasets, with efficiency reported to be multiple times higher than that of existing tools. PanTA introduces a novel mechanism to construct the pangenome progressively without rebuilding the accumulated collection from scratch. The progressive mode is shown to consume orders of magnitude fewer computational resources than existing solutions in managing growing datasets. The software is open source and is publicly available at https://github.com/amromics/panta and at DOI 10.6084/m9.figshare.23724705.
Cross-sectional study of coeliac autoimmunity in a population of Vietnamese children
Objective: The prevalence of coeliac disease (CD) in Vietnam is unknown. To fill this void, we assessed the prevalence of serological markers of CD autoimmunity in a population of children in Hanoi.
Setting: The outpatient blood drawing laboratory of the largest paediatric hospital in North Vietnam was used for the study, which was part of an international project of collaboration between Italy and Vietnam.
Participants: Children having blood drawn for any reason were included. Exclusion criteria were age younger than 2 years, acquired or congenital immune deficiency and inadequate sample. A total of 1961
children (96%) were enrolled (838 females, 1123 males, median age 5.3 years).
Outcomes: Primary outcome was the prevalence of positive autoimmunity to both IgA antitransglutaminase antibodies (anti-tTG) assessed with an ELISA test and antiendomysial antibodies (EMA). Secondary outcome
was the prevalence of CD predisposing human leucocyte antigens (HLA) (HLA DQ2/8) in the positive children and in a random group of samples negative for IgA anti-tTG.
Results: The IgA anti-tTG test was positive in 21/1961 (1%; 95% CI 0.61% to 1.53%); however, EMA antibodies were negative in all. HLA DQ2/8 was present in 7/21 (33%; 95% CI 14.5% to 56.9%) of the
anti-tTG-positive children and in 72/275 (26%; 95% CI 21% to 32%) of those who were negative.
Conclusions: Coeliac autoimmunity is rare in Vietnam, although prevalence of HLA DQ2/8 is similar to that of other countries. We hypothesise that the scarce exposure to gluten could be responsible for
these findings.
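The percentages and 95% confidence intervals in the Results can be approximated from the raw counts. The abstract does not say which interval method was used, so the sketch below assumes the normal-approximation (Wald) interval; the function name `wald_interval` is illustrative, and its output need not reproduce the published bounds exactly, particularly for the smaller subgroups.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, total, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion.

    Assumption: the abstract does not name its interval method; Wald is
    used here only as a simple illustration.
    """
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p_hat = successes / total
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / total)
    # Clip to [0, 1] because the approximation can overshoot for extreme proportions.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Counts quoted in the abstract: 21 anti-tTG-positive children out of 1961.
low, high = wald_interval(21, 1961)
print(f"{21 / 1961:.2%} (95% CI {low:.2%} to {high:.2%})")
```

For small counts such as 7/21, an exact (Clopper-Pearson) or Wilson interval is usually preferred over the Wald interval, which may explain differences from the reported figures.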
Rheological properties of emulsion of crude oil and water
This paper presents the rheological properties of crude oil from the White Tiger oil field (Vietnam) and of its emulsion with seawater, including measurement results and analytical approximation formulae covering a wide range of pressure, temperature and water concentration. The crude oil of the White Tiger oil field is known to be highly paraffinic and highly viscous. At low temperature (T ≤ 40°C) it behaves as a non-Newtonian fluid of the Bingham-Shvedov type. Therefore, in addition to the effective viscosity, the effective dynamic shear stress is also measured and approximated. The rheological properties of the crude oil and of its emulsion with water are also measured and approximated for the case in which the mixture contains 0.1% of the chemical reagent ES-3363.
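To illustrate what Bingham-type behaviour implies, here is a minimal sketch of the textbook Bingham model, in which shear stress equals a yield stress plus plastic viscosity times shear rate. It reads the abstract's "effective dynamic shear stress" as that yield stress, which is an interpretation rather than something the abstract states. The function names and parameter values are hypothetical and are not the paper's measurements or approximation formulae.

```python
def bingham_shear_stress(shear_rate, yield_stress, plastic_viscosity):
    """Shear stress (Pa) of a flowing Bingham fluid: tau = tau_0 + mu_p * gamma_dot."""
    if shear_rate <= 0:
        raise ValueError("the flowing regime requires a positive shear rate")
    return yield_stress + plastic_viscosity * shear_rate


def apparent_viscosity(shear_rate, yield_stress, plastic_viscosity):
    """Apparent viscosity (Pa*s): shear stress divided by shear rate."""
    return bingham_shear_stress(shear_rate, yield_stress, plastic_viscosity) / shear_rate


# Hypothetical values, for illustration only (not taken from the paper).
tau_0 = 2.0   # yield stress, Pa
mu_p = 0.05   # plastic viscosity, Pa*s
for gamma_dot in (1.0, 10.0, 100.0):  # shear rates, 1/s
    print(gamma_dot, apparent_viscosity(gamma_dot, tau_0, mu_p))
```

The apparent viscosity falls toward the plastic viscosity as the shear rate grows, which is why a single viscosity value is not enough to characterise such a fluid.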
Clustering patient medical records via sparse subspace representation
The health industry is facing an increasing challenge with “big data” as traditional methods fail to manage its scale and complexity. This paper examines clustering of patient records for chronic diseases to facilitate better construction of care plans. We solve this problem under the framework of subspace clustering. Our novel contribution lies in the exploitation of sparse representation to discover subspaces automatically and a domain-specific construction of weighting matrices for patient records. We show that the new formulation is readily solved by extending existing ℓ1-regularized optimization algorithms. Using a cohort of both diabetes and stroke data, we show that our method outperforms existing benchmark clustering techniques in the literature.
Improved subspace clustering via exploitation of spatial constraints
We present a novel approach to improving subspace clustering by exploiting spatial constraints. The new method encourages the sparse solution to be consistent with the spatial geometry of the tracked points by embedding weights into the sparse formulation. By doing so, we are able to correct sparse representations in a principled manner without introducing much additional computational cost. We discuss alternative ways to treat missing and corrupted data using the latest theory in robust lasso regression and suggest numerical algorithms to solve the proposed formulation. Experiments on the benchmark Johns Hopkins 155 dataset demonstrate that exploiting spatial constraints significantly improves motion segmentation.
Sparse subspace clustering via group sparse coding
We propose in this paper a novel sparse subspace clustering method that regularizes sparse subspace representation by exploiting the structural sharing between tasks and data points via group sparse coding. We derive simple, provably convergent, and computationally efficient algorithms for solving the proposed group formulations. We demonstrate the advantage of the framework on three challenging benchmark datasets ranging from medical record data to image and text clustering and show that it consistently outperforms rival methods.
Sparse subspace representation for spectral document clustering
We present a novel method for document clustering using sparse representation of documents in conjunction with spectral clustering. An ℓ1-norm optimization formulation is posed to learn the sparse representation of each document, allowing us to characterize the affinity between documents by considering the overall information instead of traditional pairwise similarities. This document affinity is encoded through a graph on which spectral clustering is performed. The decomposition into multiple subspaces allows documents to be part of a sub-group that shares a smaller set of similar vocabulary, thus allowing for cleaner clusters. Extensive experimental evaluations on two real-world datasets from the Reuters-21578 and 20Newsgroup corpora show that our proposed method consistently outperforms state-of-the-art algorithms. Significantly, the performance improvement over other methods is prominent for these datasets.
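The sparse-representation step described above can be sketched generically: each point is written as an ℓ1-penalised combination of the other points, and the absolute coefficients define a symmetric affinity graph. This is a minimal illustration of that general idea, not the authors' formulation or solver. The function names, the penalty `lam`, the plain cyclic coordinate-descent solver, and the toy 2-D points (standing in for document vectors) are all assumptions made for the example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the proximal step for an l1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_self_representation(points, lam=0.01, sweeps=200):
    """For each point x_i, approximately minimise
        0.5 * ||x_i - sum_j c_ij * x_j||^2 + lam * sum_j |c_ij|,  with c_ii = 0,
    by cyclic coordinate descent. Returns the coefficient matrix C."""
    n, dim = len(points), len(points[0])
    sq_norms = [sum(v * v for v in p) for p in points]
    coeffs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        residual = list(points[i])  # equals x_i while all coefficients are zero
        for _ in range(sweeps):
            for j in range(n):
                if j == i or sq_norms[j] == 0.0:
                    continue
                old = coeffs[i][j]
                # Correlation of x_j with the residual, with x_j's own term added back.
                rho = sum(points[j][d] * residual[d] for d in range(dim)) + old * sq_norms[j]
                new = soft_threshold(rho, lam) / sq_norms[j]
                if new != old:
                    for d in range(dim):
                        residual[d] -= (new - old) * points[j][d]
                    coeffs[i][j] = new
    return coeffs


def affinity(coeffs):
    """Symmetric affinity W_ij = |c_ij| + |c_ji| built from the sparse coefficients."""
    n = len(coeffs)
    return [[abs(coeffs[i][j]) + abs(coeffs[j][i]) for j in range(n)] for i in range(n)]


# Toy data: three points on the x-axis and three on the y-axis (two 1-D subspaces).
points = [(1.0, 0.0), (2.0, 0.0), (-1.5, 0.0), (0.0, 1.0), (0.0, 3.0), (0.0, -2.0)]
W = affinity(sparse_self_representation(points))
for row in W:
    print([round(w, 3) for w in row])
```

On this toy input the affinity links only points lying on the same axis, so the graph separates into two blocks. In a full pipeline the matrix `W` would be handed to a spectral clustering routine that accepts a precomputed affinity.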
Detection of cross-channel anomalies from multiple data channels
We identify and formulate a novel problem: cross-channel anomaly detection from multiple data channels. Cross-channel anomalies are anomalies that are common among the individual channel anomalies, and they are often portents of significant events. Using spectral approaches, we propose a two-stage detection method: anomaly detection at a single-channel level, followed by the detection of cross-channel anomalies from the amalgamation of single-channel anomalies. Our mathematical analysis shows that our method is likely to reduce the false alarm rate. We demonstrate our method in two applications: document understanding with multiple text corpora, and detection of repeated anomalies in video surveillance. The experimental results consistently demonstrate the superior performance of our method compared with related state-of-the-art methods, including the one-class SVM and principal component pursuit. In addition, our framework can be deployed in a decentralized manner, lending itself to large-scale data stream analysis.