116 research outputs found

Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data

Author: Ardalani Newsha
Elhoushi Mostafa
Gloeckle Fabian
Mahmoud Anas
Morcos Ari S.
Rozière Baptiste
Singh Aaditya K.
Tirumala Kushal
Wu Carole-Jean
Yang Yu
Publication date: 04/12/2023

Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety, and in other modalities, such as images. Our work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods. Importantly, we achieve up to a 3% performance improvement over no pruning, thereby showing the promise of insights from synthetic corruptions for data pruning.

Comment: 12 pages, 4 figures, Oral Presentation at 3rd Workshop on Efficient Natural Language and Speech Processing (ENLSP-III), NeurIPS 202

arXiv.org e-Print Archive

Cactus pear (Opuntia ficus-indica) productivity, proximal composition and soil parameters as affected by planting time and agronomic management in a semi-arid region of India

Author: Ahmad S.
Appaswamygowda B. H.
Dana Ram P.
Govindasamy P.
Hassan S.
Kumar S.
Liguori G.
Louhaichi M.
Mahawer S. K.
Prasad M.
Probir Kumar G.
Rai A. K.
Sarker A.
Tirumala K. K.
Publication venue: MDPI AG
Publication date: 18/08/2021

Study of appropriate planting time and response to agronomic management practices is imperative for the newly introduced cactus pear (Opuntia ficus-indica (L.) Mill.) into a semi-arid region of India. Responses of cactus pear to agronomic practices (planting time and irrigation and fertilizer application) were evaluated to determine the potential for fodder production and livestock feed in a semi-arid environment of India. We assessed four planting times (February, March, July and October) and two agronomic managements (with and without irrigation and fertilizer application) during 2016–2020 at Jhansi, India. Cactus pear establishment and growth improved with planting time in July and October due to favorable soil moisture and congenial temperature. However, plant height (19 cm) and cladode weight (118 g) were greater in July than in October planting. Nutrient uptake and crude protein contents, however, were higher for the earlier plantings of February and April compared to June and October. Irrigation and nutrient application had little effect on the cactus pear plant growth, except on plant width and cladode length and width. Cactus pear can be planted during July in moderately fertile soils without any agronomic intervention in semi-arid situations of India and has potential as an effective alternative source of forage for livestock during the summer months.

Archivio istituzionale della ricerca - Università di Palermo

Implementation and evaluation of a new TCP loss recovery architecture

Author: A Bakre
A Capone
A Tirumala
B Hari
B Kim
B Saad
EH-K Wu
H Balakrishnam
Hosung Park
Jeonghoon Mo
K Fall
K Xu
M Kang
M Kang
M Mathis
M Mathis
M Zhang
Moonsoo Kang
P Karn
Q Ni
S Bhandarka
S Bohacek
S Cen
S Mascolo
S Mascolo
V Jacobson
Publication venue: Springer Science and Business Media LLC

Crossref

Machine learning for the Zwicky Transient Facility

Author: Adams S
Bellm EC
Biswas R
Blagorodnova N
Branton D
Bue B
Burdge K
Cannella C
Chang CK
Connolly A
Dekany R
Duev DA
Feindt U
Fortson L
Frederick S
Fremling C
Gezari S
Graham M
Groom S
Hung T
Kasliwal MM
Kulkarni S
Kupfer T
Lin HW
Lintott C
Lunnan R
Mahabal A
Masci FJ
Miller AA
Nordin J
Parejko J
Prince TA
Rebbapragada U
Riddle R
Rusholme B
Saunders N
Sedaghat N
Shupe DL
Singer LP
Soumagnac MT
Szkody P
Tachibana Y
Tirumala K
van Roestel J
van Velzen S
Walters R
Ward C
Wright D
Ye QZ
Zach Golkhou V
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

The Zwicky Transient Facility is a large optical survey in multiple filters producing hundreds of thousands of transient alerts per night. We describe here various machine learning (ML) implementations and plans to make the maximal use of the large data set by taking advantage of the temporal nature of the data, and further combining it with other data sets. We start with the initial steps of separating bogus candidates from real ones, separating stars and galaxies, and go on to the classification of real objects into various classes. Besides the usual methods (e.g., based on features extracted from light curves) we also describe early plans for alternate methods including the use of domain adaptation, and deep learning. In a similar fashion we describe efforts to detect fast moving asteroids. We also describe the use of the Zooniverse platform for helping with classifications through the creation of training samples, and active learning. Finally, we mention the synergistic aspects of ZTF and LSST from the ML perspective.

eScholarship - University of California

Oxford University Research Archive

Machine learning for the Zwicky Transient Facility

Author: Adams S
Bellm EC
Biswas R
Blagorodnova N
Branton D
Bue B
Burdge K
Cannella C
Chang C-K
Connolly A
Dekany R
Duev DA
Feindt U
Fortson L
Frederick S
Fremling C
Gezari S
Golkhou VZ
Graham M
Groom S
Hung T
Kasliwal MM
Kulkarni S
Kupfer T
Lin HW
Lintott Christopher
Lunnan R
Mahabal A
Masci FJ
Miller AA
Nordin J
Parejko J
Prince TA
Rebbapragada U
Riddle R
Rusholme B
Saunders N
Sedaghat N
Shupe DL
Singer LP
Soumagnac MT
Szkody P
Tachibana Y
Tirumala K
Van Roestel J
Van Velzen S
Walters R
Ward C
Wright D
Ye Q-Z
Publication venue: IOP Publishing
Publication date: 31/01/2019

eScholarship - University of California

Oxford University Research Archive

Alkali metal-cationized serine clusters studied by sonic spray ionization tandem mass spectrometry

Crossref

Genome-wide association study reveals novel genomic regions governing agronomic and grain quality traits and superior allelic combinations for Basmati rice improvement

Author: Ashok Kumar Singh
Gaurav Dhawan
Haritha Bollinedi
Krishnan P. Abhijith
Kunnummal Kurungara Vinod
Kuram Tirumala Ravikiran
Mariappan Nagarajan
Pankaj Kumar
Prolay Kumar Bhowmick
Rakesh Seth
Ranjith Kumar Ellur
Ritesh Sharma
S. Gopala Krishnan
Sourav Kumar Badhran
Publication venue: Frontiers Media SA
Publication date: 01/12/2022

Background: Basmati is a speciality segment in the rice genepool characterised by explicit grain quality. For the want of suitable populations, genome-wide association study (GWAS) in Basmati rice has not been attempted. Materials: To address this gap, we have performed a GWAS on a panel of 172 elite Basmati multiparent population comprising of potential restorers and maintainers. Phenotypic data was generated for various agronomic and grain quality traits across seven different environments during two consecutive crop seasons. Based on the observed phenotypic variation, three agronomic traits namely, days to fifty per cent flowering, plant height and panicle length, and three grain quality traits namely, kernel length before cooking, length breadth ratio and kernel length after cooking were subjected to GWAS. Genotyped with 80K SNP array, the population was subjected to principal component analysis to stratify the underlying substructure and subjected to the association analysis using Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK) model. Results: We identified 32 unique MTAs including 11 robust MTAs for the agronomic traits and 25 unique MTAs including two robust MTAs for the grain quality traits. Six out of 13 robust MTAs were novel. By genome annotation, six candidate genes associated with the robust MTAs were identified. Further analysis of the allelic combinations of the robust MTAs enabled the identification of superior allelic combinations in the population. This information was utilized in selecting 77 elite Basmati rice genotypes from the panel. Conclusion: This is the first ever GWAS study in Basmati rice which could generate valuable information usable for further breeding through marker assisted selection, including enhancing of heterosis.

Directory of Open Access Journals