2,070 research outputs found
No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling
Extracting knowledge from unlabeled texts using machine learning algorithms
can be complex. Document categorization and information retrieval are two
applications that may benefit from unsupervised learning (e.g., text clustering
and topic modeling), as can exploratory data analysis. However, the
unsupervised learning paradigm poses reproducibility issues: depending on the
machine learning algorithm, the initialization can lead to variability in the
results. Furthermore, distortions of the cluster geometry can be misleading;
among their causes, the presence of outliers and anomalies can be a determining
factor. Despite the relevance of initialization and outlier issues for text
clustering and topic modeling, the authors did not find an in-depth analysis of
them. This survey provides a systematic literature review (2011-2022) of these
subareas and proposes a common terminology, since similar procedures are given
different names. The authors describe research opportunities, trends, and open
issues. The appendices summarize the theoretical background of the text
vectorization, factorization, and clustering algorithms that are directly or
indirectly related to the reviewed works.
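The initialization issue this abstract raises can be seen in a minimal sketch. The toy 1-D Lloyd's k-means below (the function name `kmeans_1d` and the sample data are this editor's illustrative assumptions, not taken from the surveyed works) converges to two different local optima from two different initial centroid sets on the same data:

```python
def kmeans_1d(points, centers, iters=50):
    """Minimal Lloyd's k-means in 1-D. The converged centroids depend on
    the initial centers, illustrating the reproducibility issue."""
    centers = list(centers)
    for _ in range(iters):
        # Assign each point to its nearest current center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Three well-separated groups but only k = 2 centroids,
# so two distinct local optima exist.
data = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 10.0, 10.2, 10.4]
print(kmeans_1d(data, [0.0, 5.0]))   # one local optimum
print(kmeans_1d(data, [5.4, 10.0]))  # a different local optimum
```

Because the two runs partition the same data differently, results are not reproducible unless the initialization is fixed (e.g., by a seed) or reported over multiple restarts, which is exactly the kind of procedure the surveyed works handle under varying terminology.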
Pervasive gaps in Amazonian ecological research
Biodiversity loss is one of the main challenges of our time,1,2 and attempts to address it require a clear understanding of how ecological communities respond to environmental change across time and space.3,4 While the increasing availability of global databases on ecological communities has advanced our knowledge of biodiversity sensitivity to environmental changes,5–7 vast areas of the tropics remain understudied.8–11 In the American tropics, Amazonia stands out as the world's most diverse rainforest and the primary source of Neotropical biodiversity,12 but it remains among the least known forests in America and is often underrepresented in biodiversity databases.13–15 To worsen this situation, human-induced modifications16,17 may eliminate pieces of the Amazon's biodiversity puzzle before we can use them to understand how ecological communities are responding. To increase generalization and applicability of biodiversity knowledge,18,19 it is thus crucial to reduce biases in ecological research, particularly in regions projected to face the most pronounced environmental changes. We integrate ecological community metadata of 7,694 sampling sites for multiple organism groups in a machine learning model framework to map the research probability across the Brazilian Amazonia, while identifying the region's vulnerability to environmental change. 15%–18% of the most neglected areas in ecological research are expected to experience severe climate or land use changes by 2050. This means that unless we take immediate action, we will not be able to establish their current status, much less monitor how it is changing and what is being lost.
SARS-CoV-2 introductions and early dynamics of the epidemic in Portugal
Background Genomic surveillance of SARS-CoV-2 in Portugal was rapidly implemented by
the National Institute of Health in the early stages of the COVID-19 epidemic, in collaboration
with more than 50 laboratories distributed nationwide.
Methods By applying recent phylodynamic models that allow integration of individual-based
travel history, we reconstructed and characterized the spatio-temporal dynamics of SARS-CoV-2 introductions and early dissemination in Portugal.
Results We detected at least 277 independent SARS-CoV-2 introductions, mostly from
European countries (namely the United Kingdom, Spain, France, Italy, and Switzerland),
which were consistent with the countries with the highest connectivity with Portugal.
Although most introductions were estimated to have occurred during early March 2020, it is
likely that SARS-CoV-2 was silently circulating in Portugal throughout February, before the
first cases were confirmed.
Conclusions The earlier implementation of measures could have
minimized the number of introductions and the subsequent virus expansion in Portugal. This
study lays the foundation for the genomic epidemiology of SARS-CoV-2 in Portugal and highlights the need for systematic and geographically representative genomic surveillance.
We gratefully acknowledge Sara Hill and Nuno Faria (University of Oxford) and
Joshua Quick and Nick Loman (University of Birmingham) for kindly providing us with
the initial sets of Artic Network primers for NGS; Rafael Mamede (MRamirez team,
IMM, Lisbon) for developing and sharing a bioinformatics script for sequence curation
(https://github.com/rfm-targa/BioinfUtils); Philippe Lemey (KU Leuven) for providing
guidance on the implementation of the phylodynamic models; Joshua L. Cherry
(National Center for Biotechnology Information, National Library of Medicine, National
Institutes of Health) for providing guidance with the subsampling strategies; and all
authors, originating and submitting laboratories who have contributed genome data on
GISAID (https://www.gisaid.org/) on which part of this research is based. The opinions
expressed in this article are those of the authors and do not reflect the view of the
National Institutes of Health, the Department of Health and Human Services, or the
United States government. This study is co-funded by Fundação para a Ciência e Tecnologia
and Agência de Investigação Clínica e Inovação Biomédica (234_596874175) on
behalf of the Research 4 COVID-19 call. Some infrastructural resources used in this study
come from the GenomePT project (POCI-01-0145-FEDER-022184), supported by
COMPETE 2020 - Operational Programme for Competitiveness and Internationalisation
(POCI), Lisboa Portugal Regional Operational Programme (Lisboa2020), Algarve Portugal
Regional Operational Programme (CRESC Algarve2020), under the PORTUGAL
2020 Partnership Agreement, through the European Regional Development Fund
(ERDF), and by Fundação para a Ciência e a Tecnologia (FCT).
Particle-flow reconstruction and global event description with the CMS detector
The CMS apparatus was identified, a few years before the start of the LHC operation at CERN, to feature properties well suited to particle-flow (PF) reconstruction: a highly-segmented tracker, a fine-grained electromagnetic calorimeter, a hermetic hadron calorimeter, a strong magnetic field, and an excellent muon spectrometer. A fully-fledged PF reconstruction algorithm tuned to the CMS detector was therefore developed and has been consistently used in physics analyses for the first time at a hadron collider. For each collision, the comprehensive list of final-state particles identified and reconstructed by the algorithm provides a global event description that leads to unprecedented CMS performance for jet and hadronic tau decay reconstruction, missing transverse momentum determination, and electron and muon identification. This approach also allows particles from pileup interactions to be identified and enables efficient pileup mitigation methods. The data collected by CMS at a centre-of-mass energy of 8 TeV show excellent agreement with the simulation and confirm the superior PF performance at least up to an average of 20 pileup interactions
Measurement of differential cross sections for top quark pair production using the lepton plus jets final state in proton-proton collisions at 13 TeV
Identification of heavy-flavour jets with the CMS detector in pp collisions at 13 TeV
Many measurements and searches for physics beyond the standard model at the LHC rely on the efficient identification of heavy-flavour jets, i.e. jets originating from bottom or charm quarks. In this paper, the discriminating variables and the algorithms used for heavy-flavour jet identification during the first years of operation of the CMS experiment in proton-proton collisions at a centre-of-mass energy of 13 TeV are presented. Heavy-flavour jet identification algorithms have been improved compared to those used previously at centre-of-mass energies of 7 and 8 TeV. For jets with transverse momenta in the range expected in simulated events, these new developments result in an efficiency of 68% for the correct identification of a b jet for a probability of 1% of misidentifying a light-flavour jet. The improvement in relative efficiency at this misidentification probability is about 15%, compared to previous CMS algorithms. In addition, for the first time algorithms have been developed to identify jets containing two b hadrons in Lorentz-boosted event topologies, as well as to tag c jets. The large data sample recorded in 2016 at a centre-of-mass energy of 13 TeV has also allowed the development of new methods to measure the efficiency and misidentification probability of heavy-flavour jet identification algorithms. The heavy-flavour jet identification efficiency is measured with a precision of a few per cent at moderate jet transverse momenta (between 30 and 300 GeV) and about 5% at the highest jet transverse momenta (between 500 and 1000 GeV).
Search for heavy resonances decaying to a top quark and a bottom quark in the lepton+jets final state in proton–proton collisions at 13 TeV
Evidence for the Higgs boson decay to a bottom quark–antiquark pair