Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data
Code datasets, often collected from diverse and uncontrolled sources such as
GitHub, potentially suffer from quality issues, thereby affecting the
performance and training efficiency of Large Language Models (LLMs) optimized
for code generation. Previous studies demonstrated the benefit of using
embedding spaces for data pruning, but they mainly focused on duplicate removal
or increasing variety, and in other modalities, such as images. Our work
focuses on using embeddings to identify and remove "low-quality" code data.
First, we explore features of "low-quality" code in embedding space, through
the use of synthetic corruptions. Armed with this knowledge, we devise novel
pruning metrics that operate in embedding space to identify and remove
low-quality entries in the Stack dataset. We demonstrate the benefits of this
synthetic corruption informed pruning (SCIP) approach on the well-established
HumanEval and MBPP benchmarks, outperforming existing embedding-based methods.
Importantly, we achieve up to a 3% performance improvement over no pruning,
thereby showing the promise of insights from synthetic corruptions for data
pruning.
Comment: 12 pages, 4 figures, Oral Presentation at 3rd Workshop on Efficient
Natural Language and Speech Processing (ENLSP-III), NeurIPS 202
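As a rough illustration of what pruning "in embedding space" can look like, the sketch below drops the entries that lie farthest from their own cluster centroid. This is a hypothetical metric chosen for illustration only; the abstract does not specify the pruning metrics used in SCIP, and the function name, the `labels` input (assumed to come from any prior clustering step), and the `drop_fraction` parameter are all assumptions.

```python
import numpy as np


def prune_by_centroid_distance(embeddings, labels, drop_fraction=0.1):
    """Return sorted indices of the rows to keep.

    Hypothetical metric (not taken from the paper): rank every row by its
    Euclidean distance to the centroid of its own cluster and drop the
    farthest `drop_fraction` of rows.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Distance of each row to the mean embedding of its assigned cluster.
    dist = np.empty(len(embeddings))
    for label in np.unique(labels):
        mask = labels == label
        centroid = embeddings[mask].mean(axis=0)
        dist[mask] = np.linalg.norm(embeddings[mask] - centroid, axis=1)

    # Keep the rows closest to their centroids.
    n_drop = int(len(embeddings) * drop_fraction)
    keep = np.argsort(dist)[: len(embeddings) - n_drop]
    return np.sort(keep)


# Toy data: two tight clusters, with one outlier (index 4) assigned to cluster 0.
toy_embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [5.0, 5.0],
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1], [9.9, 10.0], [10.0, 9.9],
]
toy_labels = [0] * 5 + [1] * 5
kept = prune_by_centroid_distance(toy_embeddings, toy_labels, drop_fraction=0.1)
```

In a real pipeline the embeddings would come from a code embedding model and the threshold would be tuned against downstream benchmarks; neither step is modeled here.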
Cactus pear (Opuntia ficus-indica) productivity, proximal composition and soil parameters as affected by planting time and agronomic management in a semi-arid region of India
Study of appropriate planting time and response to agronomic management practices is imperative for the newly introduced cactus pear (Opuntia ficus-indica (L.) Mill.) into a semi-arid region of India. Responses of cactus pear to agronomic practices (planting time and irrigation and fertilizer application) were evaluated to determine the potential for fodder production and livestock feed in a semi-arid environment of India. We assessed four planting times (February, March, July and October) and two agronomic management regimes (with and without irrigation and fertilizer application) during 2016–2020 at Jhansi, India. Cactus pear establishment and growth improved with planting time in July and October due to favorable soil moisture and congenial temperature. However, plant height (19 cm) and cladode weight (118 g) were greater in July than in October planting. Nutrient uptake and crude protein contents, however, were higher for the earlier plantings of February and April compared to June and October. Irrigation and nutrient application had little effect on cactus pear plant growth, except on plant width and cladode length and width. Cactus pear can be planted during July in moderately fertile soils without any agronomic intervention in semi-arid situations of India and has potential as an effective alternative source of forage for livestock during the summer months.
Machine learning for the Zwicky Transient Facility
The Zwicky Transient Facility is a large optical survey in multiple filters producing hundreds of thousands of transient alerts per night. We describe here various machine learning (ML) implementations and plans to make the maximal use of the large data set by taking advantage of the temporal nature of the data, and further combining it with other data sets. We start with the initial steps of separating bogus candidates from real ones, separating stars and galaxies, and go on to the classification of real objects into various classes. Besides the usual methods (e.g., based on features extracted from light curves) we also describe early plans for alternate methods including the use of domain adaptation and deep learning. In a similar fashion we describe efforts to detect fast-moving asteroids. We also describe the use of the Zooniverse platform for helping with classifications through the creation of training samples, and active learning. Finally, we mention the synergistic aspects of ZTF and LSST from the ML perspective.
Genome-wide association study reveals novel genomic regions governing agronomic and grain quality traits and superior allelic combinations for Basmati rice improvement
Background: Basmati is a speciality segment in the rice genepool characterised by explicit grain quality. For the want of suitable populations, genome-wide association study (GWAS) in Basmati rice has not been attempted. Materials: To address this gap, we have performed a GWAS on a panel of 172 elite Basmati multiparent population comprising potential restorers and maintainers. Phenotypic data was generated for various agronomic and grain quality traits across seven different environments during two consecutive crop seasons. Based on the observed phenotypic variation, three agronomic traits namely, days to fifty per cent flowering, plant height and panicle length, and three grain quality traits namely, kernel length before cooking, length breadth ratio and kernel length after cooking were subjected to GWAS. Genotyped with 80K SNP array, the population was subjected to principal component analysis to stratify the underlying substructure and subjected to the association analysis using Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK) model. Results: We identified 32 unique MTAs including 11 robust MTAs for the agronomic traits and 25 unique MTAs including two robust MTAs for the grain quality traits. Six out of 13 robust MTAs were novel. By genome annotation, six candidate genes associated with the robust MTAs were identified. Further analysis of the allelic combinations of the robust MTAs enabled the identification of superior allelic combinations in the population. This information was utilized in selecting 77 elite Basmati rice genotypes from the panel. Conclusion: This is the first ever GWAS study in Basmati rice which could generate valuable information usable for further breeding through marker assisted selection, including enhancing of heterosis.
A922 Sequential measurement of 1 hour creatinine clearance (1-CRCL) in critically ill patients at risk of acute kidney injury (AKI)
Meeting abstract