Towards Memory-Efficient Training for Extremely Large Output Spaces -- Learning with 500k Labels on a Single Commodity GPU
In classification problems with large output spaces (up to millions of
labels), the last layer can require an enormous amount of memory. Using sparse
connectivity would drastically reduce the memory requirements, but as we show
below, it can result in much diminished predictive performance of the model.
Fortunately, we found that this can be mitigated by introducing a penultimate
layer of intermediate size. We further demonstrate that one can constrain the
connectivity of the sparse layer to be uniform, in the sense that each output
neuron will have the exact same number of incoming connections. This allows for
efficient implementations of sparse matrix multiplication and connection
redistribution on GPU hardware. Via a custom CUDA implementation, we show that
the proposed approach can scale to datasets with 670,000 labels on a single
commodity GPU with only 4 GB of memory.
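The uniform-connectivity constraint described in this abstract can be illustrated with a minimal sketch. It is not the authors' CUDA implementation; the function names, the random choice of connections, and the layer sizes are all assumptions made for illustration. Because every output neuron has the same number of incoming connections, the layer can be stored as two rectangular arrays (indices and weights) rather than a ragged sparse format.

```python
import random


def make_uniform_sparse_layer(n_in, n_out, fan_in, seed=0):
    """Give every output neuron exactly `fan_in` distinct incoming connections."""
    rng = random.Random(seed)
    # indices[o] lists the input units feeding output o; every row has length fan_in.
    indices = [rng.sample(range(n_in), fan_in) for _ in range(n_out)]
    weights = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(n_out)]
    return indices, weights


def sparse_forward(x, indices, weights):
    """Compute y[o] = sum_f weights[o][f] * x[indices[o][f]] for one input vector."""
    return [
        sum(w * x[i] for i, w in zip(idx_row, w_row))
        for idx_row, w_row in zip(indices, weights)
    ]


def dense_forward(x, indices, weights):
    """Reference result: scatter the same weights into dense rows and multiply."""
    out = []
    for idx_row, w_row in zip(indices, weights):
        dense_row = [0.0] * len(x)
        for i, w in zip(idx_row, w_row):
            dense_row[i] = w
        out.append(sum(d * v for d, v in zip(dense_row, x)))
    return out


# Hypothetical sizes, chosen only to keep the example fast.
n_in, n_out, fan_in = 64, 1000, 8
indices, weights = make_uniform_sparse_layer(n_in, n_out, fan_in)
rng_x = random.Random(1)
x = [rng_x.gauss(0.0, 1.0) for _ in range(n_in)]

y_sparse = sparse_forward(x, indices, weights)
y_dense = dense_forward(x, indices, weights)

stored_sparse = n_out * fan_in  # weights kept by the uniform sparse layer
stored_dense = n_out * n_in     # weights a dense last layer would keep
```

In this toy setting the sparse layer stores one weight and one index per connection, so the weight count drops by the factor `n_in / fan_in` relative to a dense layer.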
Physics-guided adversarial networks for artificial digital image correlation data generation
Digital image correlation (DIC) has become a valuable tool in the evaluation
of mechanical experiments, particularly fatigue crack growth experiments. The
evaluation requires accurate information of the crack path and crack tip
position, which is difficult to obtain due to inherent noise and artefacts.
Machine learning models have been extremely successful in recognizing this
relevant information given labelled DIC displacement data. For the training of
robust models, which generalize well, big data is needed. However, data is
typically scarce in the field of material science and engineering because
experiments are expensive and time-consuming. We present a method to generate
synthetic DIC displacement data using generative adversarial networks with a
physics-guided discriminator. To decide whether data samples are real or fake,
this discriminator additionally receives the derived von Mises equivalent
strain. We show that this physics-guided approach leads to improved results in
terms of visual quality of samples, sliced Wasserstein distance, and geometry
score
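As a rough illustration of the extra discriminator input mentioned in this abstract, the sketch below computes a von Mises equivalent strain from the in-plane displacement gradient at one point. It uses one common definition, the square root of 2/3 times the double contraction of the deviatoric strain with itself, and assumes small strains and zero out-of-plane strain components. The paper may use a different convention, and the function name and arguments are hypothetical.

```python
import math


def von_mises_equivalent_strain(du_dx, du_dy, dv_dx, dv_dy):
    """Equivalent strain from the gradient of a 2D displacement field (u, v)."""
    # Small-strain tensor components.
    eps_xx = du_dx
    eps_yy = dv_dy
    eps_xy = 0.5 * (du_dy + dv_dx)
    # Assumption: all out-of-plane strain components are zero (plane strain).
    mean = (eps_xx + eps_yy) / 3.0
    dev_xx, dev_yy, dev_zz = eps_xx - mean, eps_yy - mean, -mean
    # Double contraction dev:dev; the shear term appears twice by symmetry.
    dev_sq = dev_xx ** 2 + dev_yy ** 2 + dev_zz ** 2 + 2.0 * eps_xy ** 2
    return math.sqrt(2.0 / 3.0 * dev_sq)


# Uniaxial stretch of 3 % along x, pure shear, and a small rigid rotation.
uniaxial = von_mises_equivalent_strain(0.03, 0.0, 0.0, 0.0)
shear = von_mises_equivalent_strain(0.0, 0.01, 0.01, 0.0)
rotation = von_mises_equivalent_strain(0.0, -0.01, 0.01, 0.0)
```

The rigid rotation yields zero equivalent strain because only the symmetric part of the displacement gradient enters the calculation.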
CascadeXML: Rethinking Transformers for End-to-end Multi-resolution Training in Extreme Multi-label Classification
Extreme Multi-label Text Classification (XMC) involves learning a classifier
that can assign an input with a subset of most relevant labels from millions of
label choices. Recent approaches, such as XR-Transformer and LightXML, leverage
a transformer instance to achieve state-of-the-art performance. However, in
this process, these approaches need to make various trade-offs between
performance and computational requirements. A major shortcoming, as compared to
the Bi-LSTM based AttentionXML, is that they fail to keep separate feature
representations for each resolution in a label tree. We thus propose
CascadeXML, an end-to-end multi-resolution learning pipeline, which can harness
the multi-layered architecture of a transformer model for attending to
different label resolutions with separate feature representations. CascadeXML
significantly outperforms all existing approaches with non-trivial gains
obtained on benchmark datasets consisting of up to three million labels. Code
for CascadeXML will be made publicly available at
https://github.com/xmc-aalto/cascadexml
Materials Physics in the Quantum Realm
Overview of the QuantiCoM project and current challenges in atomistic simulations with quantum computers
Generalized test utilities for long-tail performance in extreme multi-label classification
Extreme multi-label classification (XMLC) is the task of selecting a small subset of relevant labels from a very large set of possible labels. As such, it is characterized by long-tail labels, i.e., most labels have very few positive instances. With standard performance measures such as precision@k, a classifier can ignore tail labels and still report good performance. However, it is often argued that correct predictions in the tail are more "interesting" or "rewarding," but the community has not yet settled on a metric capturing this intuitive concept. The existing propensity-scored metrics fall short on this goal by confounding the problems of long-tail and missing labels. In this paper, we analyze generalized metrics budgeted "at k" as an alternative solution. To tackle the challenging problem of optimizing these metrics, we formulate it in the expected test utility (ETU) framework, which aims to optimize the expected performance on a fixed test set. We derive optimal prediction rules and construct computationally efficient approximations with provable regret guarantees and robustness against model misspecification. Our algorithm, based on block coordinate ascent, scales effortlessly to XMLC problems and obtains promising results in terms of long-tail performance
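The claim that a classifier can ignore tail labels and still score well on precision@k can be checked on a toy example. The data, the scores, and the macro-averaged recall@k used as a tail-sensitive contrast are illustrative assumptions, not the metrics or algorithm from the paper.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring labels."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def precision_at_k(scores, relevant, k):
    """Fraction of the top-k predicted labels that are relevant."""
    return sum(1 for j in top_k(scores, k) if j in relevant) / k


def macro_recall_at_k(all_scores, all_relevant, n_labels, k):
    """Average over labels of the share of their positives retrieved in the top k."""
    recalls = []
    for label in range(n_labels):
        positives = [i for i, rel in enumerate(all_relevant) if label in rel]
        if not positives:
            continue  # labels without positives are skipped
        hits = sum(1 for i in positives if label in top_k(all_scores[i], k))
        recalls.append(hits / len(positives))
    return sum(recalls) / len(recalls)


# Labels 0 and 1 are frequent (head); labels 2, 3 and 4 each occur once (tail).
all_relevant = [{0, 1}, {0, 1}, {0, 1, 2}, {0, 3}, {1, 4}]
# A hypothetical classifier that always ranks the two head labels first.
head_only_scores = [[0.9, 0.8, 0.1, 0.1, 0.1] for _ in all_relevant]

k = 2
mean_p_at_k = sum(
    precision_at_k(s, rel, k) for s, rel in zip(head_only_scores, all_relevant)
) / len(all_relevant)
macro_recall = macro_recall_at_k(head_only_scores, all_relevant, n_labels=5, k=k)
```

Here the head-only classifier reaches a mean precision@2 of 0.8 while never retrieving a single tail label, which a label-averaged measure exposes with a macro recall@2 of 0.4.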
Sloan Digital Sky Survey IV: Mapping the Milky Way, Nearby Galaxies, and the Distant Universe
We describe the Sloan Digital Sky Survey IV (SDSS-IV), a project encompassing three major spectroscopic programs. The Apache Point Observatory Galactic Evolution Experiment 2 (APOGEE-2) is observing hundreds of thousands of Milky Way stars at high resolution and high signal-to-noise ratios in the near-infrared. The Mapping Nearby Galaxies at Apache Point Observatory (MaNGA) survey is obtaining spatially resolved spectroscopy for thousands of nearby galaxies (median z ~ 0.03). The extended Baryon Oscillation Spectroscopic Survey (eBOSS) is mapping the galaxy, quasar, and neutral gas distributions between z ~ 0.6 and 3.5 to constrain cosmology using baryon acoustic oscillations, redshift space distortions, and the shape of the power spectrum. Within eBOSS, we are conducting two major subprograms: the SPectroscopic IDentification of eROSITA Sources (SPIDERS), investigating X-ray AGNs and galaxies in X-ray clusters, and the Time Domain Spectroscopic Survey (TDSS), obtaining spectra of variable sources. All programs use the 2.5 m Sloan Foundation Telescope at the Apache Point Observatory; observations there began in Summer 2014. APOGEE-2 also operates a second near-infrared spectrograph at the 2.5 m du Pont Telescope at Las Campanas Observatory, with observations beginning in early 2017. Observations at both facilities are scheduled to continue through 2020. In keeping with previous SDSS policy, SDSS-IV provides regularly scheduled public data releases; the first one, Data Release 13, was made available in 2016 July.
The Fifteenth Data Release of the Sloan Digital Sky Surveys: First Release of MaNGA-derived Quantities, Data Visualization Tools, and Stellar Library
Twenty years have passed since first light for the Sloan Digital Sky Survey (SDSS). Here, we release data taken by the fourth phase of SDSS (SDSS-IV) across its first three years of operation (2014 July–2017 July). This is the third data release for SDSS-IV, and the 15th from SDSS (Data Release Fifteen; DR15). New data come from MaNGA: we release 4824 data cubes, as well as the first stellar spectra in the MaNGA Stellar Library (MaStar), the first set of survey-supported analysis products (e.g., stellar and gas kinematics, emission-line and other maps) from the MaNGA Data Analysis Pipeline, and a new data visualization and access tool we call "Marvin." The next data release, DR16, will include new data from both APOGEE-2 and eBOSS; those surveys release no new data here, but we document updates and corrections to their data processing pipelines. The release is cumulative; it also includes the most recent reductions and calibrations of all data taken by SDSS since first light. In this paper, we describe the location and format of the data and tools and cite technical references describing how it was obtained and processed. The SDSS website (www.sdss.org) has also been updated, providing links to data downloads, tutorials, and examples of data use. Although SDSS-IV will continue to collect astronomical data until 2020, and will be followed by SDSS-V (2020–2025), we end this paper by describing plans to ensure the sustainability of the SDSS data archive for many years beyond the collection of data.
The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar and APOGEE-2 Data
This paper documents the seventeenth data release (DR17) from the Sloan Digital Sky Surveys; the fifth and final release from the fourth phase (SDSS-IV). DR17 contains the complete release of the Mapping Nearby Galaxies at Apache Point Observatory (MaNGA) survey, which reached its goal of surveying over 10,000 nearby galaxies. The complete release of the MaNGA Stellar Library (MaStar) accompanies this data, providing observations of almost 30,000 stars through the MaNGA instrument during bright time. DR17 also contains the complete release of the Apache Point Observatory Galactic Evolution Experiment 2 (APOGEE-2) survey which publicly releases infra-red spectra of over 650,000 stars. The main sample from the Extended Baryon Oscillation Spectroscopic Survey (eBOSS), as well as the sub-survey Time Domain Spectroscopic Survey (TDSS) data were fully released in DR16. New single-fiber optical spectroscopy released in DR17 is from the SPectroscopic IDentification of ERosita Survey (SPIDERS) sub-survey and the eBOSS-RM program. Along with the primary data sets, DR17 includes 25 new or updated Value Added Catalogs (VACs). This paper concludes the release of SDSS-IV survey data. SDSS continues into its fifth phase with observations already underway for the Milky Way Mapper (MWM), Local Volume Mapper (LVM) and Black Hole Mapper (BHM) surveys.