Better Context Makes Better Code Language Models: A Case Study on Function Call Argument Completion
Pretrained code language models have enabled great progress towards program
synthesis. However, common approaches only consider in-file local context and
thus miss information and constraints imposed by other parts of the codebase
and its external dependencies. Existing code completion benchmarks also lack
such context. To resolve these restrictions, we curate a new dataset of
permissively licensed Python packages that includes full projects and their
dependencies and provide tools to extract non-local information with the help
of program analyzers. We then focus on the task of function call argument
completion which requires predicting the arguments to function calls. We show
that existing code completion models do not yield good results on our
completion task. To better solve this task, we query a program analyzer for
information relevant to a given function call, and consider ways to provide the
analyzer results to different code completion models during inference and
training. Our experiments show that providing access to the function
implementation and function usages greatly improves the argument completion
performance. Our ablation study provides further insights on how different
types of information available from the program analyzer and different ways of
incorporating the information affect the model performance.
Comment: 12 pages. Accepted to AAAI 202
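The pipeline described above, querying a program analyzer for information about the callee and prepending it to the completion prompt, can be sketched as follows. This is a minimal illustration using Python's `inspect` module as a stand-in for the paper's program analyzer; the helper names (`analyzer_context`, `make_prompt`) and the example function `clamp` are illustrative, not the paper's actual tooling.

```python
import inspect

def analyzer_context(func):
    """Collect non-local information about a callee, as a program
    analyzer might: its signature and its implementation source.
    (Sketch only; the paper uses dedicated program analyzers.)"""
    signature = str(inspect.signature(func))
    try:
        implementation = inspect.getsource(func)
    except OSError:
        implementation = ""  # source unavailable in this environment
    return f"# signature: {func.__name__}{signature}\n{implementation}"

def make_prompt(context, call_prefix):
    """Prepend the analyzer output to the in-file prefix that ends at
    the open parenthesis, so a model can predict the arguments."""
    return context + "\n" + call_prefix

def clamp(value, low, high):
    return max(low, min(value, high))

# Prompt for completing the arguments of a `clamp(...)` call.
prompt = make_prompt(analyzer_context(clamp), "result = clamp(")
```

A real system would also retrieve example usages of the callee from the codebase and truncate the combined context to the model's window.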
Parameter and Data Efficient Continual Pre-training for Robustness to Dialectal Variance in Arabic
The use of multilingual language models for tasks in low and high-resource
languages has been a success story in deep learning. In recent times, Arabic
has been receiving widespread attention on account of its dialectal variance.
While prior research studies have tried to adapt these multilingual models for
dialectal variants of Arabic, it still remains a challenging problem owing to
the lack of sufficient monolingual dialectal data and parallel translation data
of such dialectal variants. Whether this limited dialectal data can be used
to improve models trained on Arabic for its dialectal variants remains an
open problem. First, we show that multilingual-BERT (mBERT) incrementally
pretrained on Arabic monolingual data takes less training time and yields
comparable accuracy compared to our custom monolingual Arabic model, and
beats existing models (by an avg metric of +). We then explore two
continual pre-training methods -- (1) using small amounts of dialectal data
for continual finetuning and (2) using parallel Arabic-to-English data with a
Translation Language Modeling loss function. We show that both approaches
help improve performance on dialectal classification tasks ( avg. gain) when
used on monolingual models.
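The second method, Translation Language Modeling, concatenates a parallel sentence pair and masks tokens on both sides, so the model can use the other language to recover a masked word. The sketch below shows how one such training example might be constructed; it is a simplification over string tokens (real TLM, as in Lample and Conneau's XLM, operates on subword ids with position and language embeddings), and the token sequences are made up.

```python
import random

def tlm_example(src_tokens, tgt_tokens, mask_rate=0.3, seed=0):
    """Build one Translation Language Modeling training example:
    concatenate a parallel pair with a separator, then mask tokens on
    both sides. Masked positions carry the original token as the
    label; unmasked positions get an 'ignore' label ('-')."""
    rng = random.Random(seed)
    tokens = src_tokens + ["</s>"] + tgt_tokens
    inputs, labels = [], []
    for tok in tokens:
        if tok != "</s>" and rng.random() < mask_rate:
            inputs.append("[MASK]")
            labels.append(tok)   # model must predict the original token
        else:
            inputs.append(tok)
            labels.append("-")   # position ignored by the loss
    return inputs, labels

# Hypothetical Arabic-to-English pair (romanised for readability).
inputs, labels = tlm_example(["kitab", "jadid"], ["a", "new", "book"])
```

Because the source and target are in one sequence, a masked English word can be recovered from the aligned Arabic words and vice versa, which is what makes the objective cross-lingual.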
Large Language Models of Code Fail at Completing Code with Potential Bugs
Large language models of code (Code-LLMs) have recently brought tremendous
advances to code completion, a fundamental feature of programming assistance
and code intelligence. However, most existing works ignore the possible
presence of bugs in the code context for generation, which are inevitable in
software development. Therefore, we introduce and study the buggy-code
completion problem, inspired by the realistic scenario of real-time code
suggestion where the code context contains potential bugs -- anti-patterns that
can become bugs in the completed program. To systematically study the task, we
introduce two datasets: one with synthetic bugs derived from semantics-altering
operator changes (buggy-HumanEval) and one with realistic bugs derived from
user submissions to coding problems (buggy-FixEval). We find that the presence
of potential bugs significantly degrades the generation performance of the
high-performing Code-LLMs. For instance, the passing rates of CodeGen-2B-mono
on test cases of buggy-HumanEval drop more than 50% given a single potential
bug in the context. Finally, we investigate several post-hoc methods for
mitigating the adverse effect of potential bugs and find that there remains a
large gap in post-mitigation performance.
Comment: 25 page
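A semantics-altering operator change, the mechanism behind the synthetic bugs in buggy-HumanEval, leaves the code syntactically valid while changing what it computes. The sketch below introduces one such potential bug by flipping a `+` to a `-` in the AST; it is a simplified illustration, not the dataset's actual generation pipeline.

```python
import ast

class OperatorFlip(ast.NodeTransformer):
    """Swap the first '+' found into a '-', producing a
    semantics-altering but syntactically valid change."""
    def __init__(self):
        self.done = False

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not self.done and isinstance(node.op, ast.Add):
            node.op = ast.Sub()  # same syntax shape, different semantics
            self.done = True
        return node

source = "def total(a, b):\n    return a + b\n"
tree = OperatorFlip().visit(ast.parse(source))
buggy = ast.unparse(tree)  # 'def total(a, b):\n    return a - b'
```

The flipped program still parses and runs, which is exactly why such potential bugs in the context are hard for a completion model to notice.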
HYTREL: Hypergraph-enhanced Tabular Data Representation Learning
Language models pretrained on large collections of tabular data have
demonstrated their effectiveness in several downstream tasks. However, many of
these models do not take into account the row/column permutation invariances,
hierarchical structure, etc. that exist in tabular data. To alleviate these
limitations, we propose HYTREL, a tabular language model that captures the
permutation invariances and three more structural properties of tabular data
by using hypergraphs: the table cells make up the nodes, and the cells that
occur together in each row, each column, and the entire table form three
different types of hyperedges. We show that HYTREL is maximally
invariant under certain conditions for tabular data, i.e., two tables obtain
the same representations via HYTREL iff the two tables are identical up to
permutations. Our empirical results demonstrate that HYTREL consistently
outperforms other competitive baselines on four downstream tasks with minimal
pretraining, illustrating the advantages of incorporating the inductive biases
associated with tabular data into the representations. Finally, our qualitative
analyses showcase that HYTREL can assimilate the table structures to generate
robust representations for the cells, rows, columns, and the entire table.
Comment: NeurIPS 2023 (spotlight)
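The three hyperedge types described in the abstract, one per row, one per column, and one for the whole table, can be written down directly as an incidence structure over cell indices. This is a minimal sketch of that structure only, with cells identified by `(row, col)` pairs; the model itself (cell embeddings, hypergraph message passing) is not shown.

```python
def table_hyperedges(table):
    """Build HYTREL-style hyperedges over the cells of a table:
    one hyperedge per row, one per column, and one covering the
    entire table. Returns the three groups of hyperedges, each
    hyperedge being a list of (row, col) cell identifiers."""
    n_rows, n_cols = len(table), len(table[0])
    row_edges = [[(r, c) for c in range(n_cols)] for r in range(n_rows)]
    col_edges = [[(r, c) for r in range(n_rows)] for c in range(n_cols)]
    table_edge = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    return row_edges, col_edges, table_edge

# A 3x2 table: header row plus two data rows.
rows, cols, whole = table_hyperedges([["x", "y"], [1, 2], [3, 4]])
```

Because each hyperedge is a set of cells, permuting rows or columns permutes the hyperedges but leaves the incidence structure unchanged, which is the source of the permutation invariance the paper exploits.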
Non-Standard Errors
In statistics, samples are drawn from a population in a data-generating process (DGP). Standard errors measure the uncertainty in estimates of population parameters. In science, evidence is generated to test hypotheses in an evidence-generating process (EGP). We claim that EGP variation across researchers adds uncertainty: Non-standard errors (NSEs). We study NSEs by letting 164 teams test the same hypotheses on the same data. NSEs turn out to be sizable, but smaller for more reproducible or higher-rated research. Adding peer-review stages reduces NSEs. We further find that this type of uncertainty is underestimated by participants.
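One simple way to operationalise a non-standard error is as the spread of the point estimates that different teams report for the same hypothesis on the same data. The sketch below uses the sample standard deviation across teams; this is an illustration of the idea, not necessarily the paper's exact measure, and the estimates are hypothetical.

```python
import statistics

def non_standard_error(team_estimates):
    """Quantify evidence-generating-process variation as the sample
    standard deviation of the point estimates reported by different
    teams for the same hypothesis on the same data."""
    return statistics.stdev(team_estimates)

# Hypothetical estimates of the same effect from five teams.
nse = non_standard_error([0.8, 1.1, 0.9, 1.4, 0.8])
```

If every team's EGP led to the same estimate, the NSE would be zero; dispersion across teams is uncertainty that a single team's standard error cannot capture.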
CrowdRisk: exploring crowdsourcing of risk information
This paper describes the outcomes of a preliminary study into the design of a mobile app to crowdsource information related to “risk”. For the purposes of this study, the notion of risk is defined broadly; however, we predominantly focus on the personal, subjective perception of risk. The study involved building a prototypical mobile app to crowdsource risk information and exploring the use of the app as part of an expert workshop. Outcomes show challenges and opportunities with regard to the categorisation of results, the motivation of users, and the interaction design of the prototype. The study provides value by giving an initial insight into this design space.
Targets of the Tal1 transcription factor in erythrocytes: E2 ubiquitin conjugase regulation by Tal1
The Tal1 transcription factor is essential for the development of the hematopoietic system and plays a role during definitive erythropoiesis in the adult. Despite the importance of Tal1 in erythropoiesis, only a small number of its erythroid differentiation target genes are known. A chromatin precipitation and cloning approach was established to uncover novel Tal1 target genes in erythropoiesis. The BirA-tag/BirA-ligase biotinylation system in combination with streptavidin chromatin precipitation (Strep-CP) was used to co-precipitate genomic DNA bound to Tal1. Tal1 was found to bind in the vicinity of 31 genes, including the E2 ubiquitin conjugase gene UBE2H. Binding of Tal1 to UBE2H was confirmed by chromatin immunoprecipitation. UBE2H expression is increased during erythroid differentiation of hCD34+ cells. Tal1 expression activated UBE2H expression, whereas Tal1 knock-down reduced UBE2H expression and ubiquitin transfer activity. This study identifies parts of the ubiquitination machinery as a cellular target downstream of the transcription factor Tal1 and provides novel insights into Tal1-regulated erythropoiesis.