116 research outputs found

    Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data

    Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety, and in other modalities, such as images. Our work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods. Importantly, we achieve up to a 3% performance improvement over no pruning, thereby showing the promise of insights from synthetic corruptions for data pruning. Comment: 12 pages, 4 figures, Oral Presentation at 3rd Workshop on Efficient Natural Language and Speech Processing (ENLSP-III), NeurIPS 202

    Cactus pear (Opuntia ficus-indica) productivity, proximal composition and soil parameters as affected by planting time and agronomic management in a semi-arid region of India

    Study of appropriate planting time and response to agronomic management practices is imperative for the newly introduced cactus pear (Opuntia ficus-indica (L.) Mill.) in a semi-arid region of India. Responses of cactus pear to agronomic practices (planting time, and irrigation and fertilizer application) were evaluated to determine the potential for fodder production and livestock feed in a semi-arid environment of India. We assessed four planting times (February, March, July and October) and two agronomic managements (with and without irrigation and fertilizer application) during 2016–2020 at Jhansi, India. Cactus pear establishment and growth improved with planting in July and October due to favorable soil moisture and congenial temperature. However, plant height (19 cm) and cladode weight (118 g) were greater in July than in October planting. Nutrient uptake and crude protein contents, however, were higher for the earlier plantings of February and April compared to June and October. Irrigation and nutrient application had little effect on cactus pear plant growth, except on plant width and cladode length and width. Cactus pear can be planted during July in moderately fertile soils without any agronomic intervention in semi-arid situations of India and has potential as an effective alternative source of forage for livestock during the summer months.

    Machine learning for the Zwicky Transient Facility

    The Zwicky Transient Facility (ZTF) is a large optical survey in multiple filters producing hundreds of thousands of transient alerts per night. We describe here various machine learning (ML) implementations and plans to make maximal use of the large data set by taking advantage of the temporal nature of the data, and further combining it with other data sets. We start with the initial steps of separating bogus candidates from real ones and separating stars from galaxies, and go on to the classification of real objects into various classes. Besides the usual methods (e.g., based on features extracted from light curves), we also describe early plans for alternate methods, including the use of domain adaptation and deep learning. In a similar fashion, we describe efforts to detect fast-moving asteroids. We also describe the use of the Zooniverse platform for helping with classifications through the creation of training samples and active learning. Finally, we mention the synergistic aspects of ZTF and LSST from the ML perspective.

    Genome-wide association study reveals novel genomic regions governing agronomic and grain quality traits and superior allelic combinations for Basmati rice improvement

    Background: Basmati is a speciality segment in the rice genepool characterised by explicit grain quality. For want of suitable populations, a genome-wide association study (GWAS) in Basmati rice has not been attempted. Materials: To address this gap, we have performed a GWAS on a panel of 172 elite Basmati multiparent population comprising potential restorers and maintainers. Phenotypic data was generated for various agronomic and grain quality traits across seven different environments during two consecutive crop seasons. Based on the observed phenotypic variation, three agronomic traits, namely days to fifty per cent flowering, plant height and panicle length, and three grain quality traits, namely kernel length before cooking, length-breadth ratio and kernel length after cooking, were subjected to GWAS. Genotyped with an 80K SNP array, the population was subjected to principal component analysis to stratify the underlying substructure and to association analysis using the Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK) model. Results: We identified 32 unique marker-trait associations (MTAs), including 11 robust MTAs, for the agronomic traits and 25 unique MTAs, including two robust MTAs, for the grain quality traits. Six out of 13 robust MTAs were novel. By genome annotation, six candidate genes associated with the robust MTAs were identified. Further analysis of the allelic combinations of the robust MTAs enabled the identification of superior allelic combinations in the population. This information was utilized in selecting 77 elite Basmati rice genotypes from the panel. Conclusion: This is the first-ever GWAS in Basmati rice, which could generate valuable information usable for further breeding through marker-assisted selection, including enhancing of heterosis.

    A922 Sequential measurement of 1 hour creatinine clearance (1-CRCL) in critically ill patients at risk of acute kidney injury (AKI)

    Meeting abstract