15 research outputs found

    Improved neural machine translation systems for low resource correction tasks

    Recent advances in Neural Machine Translation (NMT) systems have achieved impressive results on language translation tasks. However, the success of these systems has been limited when applied to similar low-resource tasks, such as language correction. In these cases, datasets are often small whilst still containing long sequences, leading to significant overfitting and poor generalization. In this thesis we study issues preventing widespread adoption of NMT systems in low-resource tasks, with a special focus on sequence correction for both code and language. We propose two novel techniques for handling these low-resource tasks. The first uses Generative Adversarial Networks to handle datasets without paired data. This technique allows the use of available unpaired datasets, which are typically much larger than paired datasets since they do not require manual annotation. We first develop a methodology for generating discrete sequences using a Wasserstein Generative Adversarial Network, and then use this methodology to train an NMT system on unpaired data. Our second technique converts sequences into a tree-structured representation and performs translation from tree to tree. This improves the handling of very long sequences, since it reduces the distance between nodes in the network, and allows the network to take advantage of information contained in the tree structure to reduce overfitting.
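    Below is a minimal sketch of the first technique, written in PyTorch (an assumption; the thesis does not specify a framework here). A generator emits soft one-hot token distributions so the discrete output stays differentiable, and a Wasserstein critic scores real versus generated sequences; the GRU layers, dimensions, and weight-clipping Lipschitz constraint are all illustrative choices, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, HID = 100, 20, 64

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(VOCAB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, z):
        # z: (batch, SEQ_LEN, VOCAB) noise; emit soft one-hot rows so the
        # discrete sequence stays differentiable for the critic.
        h, _ = self.rnn(z)
        return torch.softmax(self.out(h), dim=-1)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(VOCAB, HID, batch_first=True)
        self.score = nn.Linear(HID, 1)

    def forward(self, x):
        # x: one-hot (real) or soft one-hot (generated) sequences.
        h, _ = self.rnn(x)
        return self.score(h[:, -1])  # scalar Wasserstein score per sequence

G, D = Generator(), Critic()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

real = torch.eye(VOCAB)[torch.randint(VOCAB, (8, SEQ_LEN))]  # stand-in data
z = torch.randn(8, SEQ_LEN, VOCAB)

# Critic step: maximize score(real) - score(fake), i.e. minimize the negation.
loss_d = D(G(z).detach()).mean() - D(real).mean()
opt_d.zero_grad(); loss_d.backward(); opt_d.step()
for p in D.parameters():           # weight clipping enforces the Lipschitz
    p.data.clamp_(-0.01, 0.01)     # constraint, as in the original WGAN

# Generator step: maximize the critic's score of generated sequences.
loss_g = -D(G(z)).mean()
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

    Weight clipping follows the original WGAN recipe; a gradient penalty is a common alternative way to enforce the Lipschitz constraint.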

    NodeDrop: a method for finding sufficient network architecture size

    Determining an appropriate number of features for each layer in a neural network is an important and difficult task. This task is especially important in applications on systems with limited memory or processing power. Many current approaches to reducing network size either utilize iterative procedures, which can extend training time significantly, or require very careful tuning of algorithm parameters to achieve reasonable results. In this paper we propose NodeDrop, a new method for eliminating features in a network. With NodeDrop, we define a condition that identifies nodes guaranteed to carry no information, and then use regularization to encourage nodes to meet this condition. We find that NodeDrop drastically reduces the number of features in a network while maintaining high performance. NodeDrop reduces the number of parameters by a factor of 114x for a VGG-like network on CIFAR10 without a drop in accuracy.
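    As a rough illustration of the idea (not the paper's actual condition or implementation), assume ReLU units whose inputs lie in [0, 1], e.g. after a sigmoid: if a unit's bias plus the sum of its positive weights is non-positive, its pre-activation can never exceed zero, so the unit always outputs zero and can be pruned without changing the network's predictions.

```python
import torch
import torch.nn as nn

layer = nn.Linear(32, 64)

def max_preactivation(layer: nn.Linear) -> torch.Tensor:
    # Upper bound on each unit's pre-activation over inputs in [0, 1]:
    # bias plus the sum of that unit's positive weights.
    return layer.bias + layer.weight.clamp(min=0).sum(dim=1)

def nodedrop_penalty(layer: nn.Linear, lam: float = 1e-3) -> torch.Tensor:
    # Added to the task loss; nudges units the task does not need below
    # the provably-dead threshold, while useful units resist the pull.
    return lam * torch.relu(max_preactivation(layer)).sum()

# Units meeting the condition always output zero through the ReLU and can
# be removed after training without changing the network's outputs.
dead = max_preactivation(layer) <= 0
print(f"{int(dead.sum())} of {layer.out_features} units can be dropped")
```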

    Guidelines for Genome-Scale Analysis of Biological Rhythms

    Genome biology approaches have made enormous contributions to our understanding of biological rhythms, particularly in identifying outputs of the clock, including RNAs, proteins, and metabolites, whose abundance oscillates throughout the day. These methods hold significant promise for future discovery, particularly when combined with computational modeling. However, genome-scale experiments are costly and laborious, yielding “big data” that are conceptually and statistically difficult to analyze. There is no obvious consensus regarding design or analysis. Here we discuss the relevant technical considerations to generate reproducible, statistically sound, and broadly useful genome-scale data. Rather than suggest a set of rigid rules, we aim to codify principles by which investigators, reviewers, and readers of the primary literature can evaluate the suitability of different experimental designs for measuring different aspects of biological rhythms. We introduce CircaInSilico, a web-based application for generating synthetic genome biology data to benchmark statistical methods for studying biological rhythms. Finally, we discuss several unmet analytical needs, including applications to clinical medicine, and suggest productive avenues to address them.
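    CircaInSilico itself is a web application, but the benchmarking workflow it supports can be sketched in a few lines: simulate a rhythmic transcript as a 24-hour cosine with noise, then fit a cosinor model by least squares to recover mesor, amplitude, and peak time. All parameters below are illustrative assumptions, not the application's defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(0, 48, 2.0)                 # sample every 2 h for 48 h
period = 24.0
truth = 10 + 3 * np.cos(2 * np.pi * (t - 6) / period)  # peaks at t = 6 h
y = truth + rng.normal(0, 1, t.size)      # add measurement noise

# Cosinor model: y ~ m + a*cos(wt) + b*sin(wt), fit by linear least squares.
w = 2 * np.pi / period
X = np.column_stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])
m, a, b = np.linalg.lstsq(X, y, rcond=None)[0]
amplitude = np.hypot(a, b)                          # rhythm strength
peak_time = (np.arctan2(b, a) % (2 * np.pi)) / w    # acrophase, in hours
print(f"mesor={m:.2f}  amplitude={amplitude:.2f}  peak={peak_time:.1f} h")
```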

    TypeWriter: Neural Type Prediction with Search-Based Validation

    Maintaining large code bases written in dynamically typed languages, such as JavaScript or Python, can be challenging due to the absence of type annotations: simple data compatibility errors proliferate, IDE support is limited, and APIs are hard to comprehend. Recent work attempts to address these issues through either static type inference or probabilistic type prediction. Unfortunately, static type inference for dynamic languages is inherently limited, while probabilistic approaches suffer from imprecision. This paper presents TypeWriter, the first combination of probabilistic type prediction with search-based refinement of predicted types. TypeWriter’s predictor learns to infer the return and argument types of functions from partially annotated code bases by combining the natural language properties of code with programming language-level information. To validate predicted types, TypeWriter invokes a gradual type checker with different combinations of the predicted types, navigating the space of possible type combinations in a feedback-directed manner. We implement the TypeWriter approach for Python and evaluate it on two code corpora: a multi-million-line code base at Facebook and a collection of 1,137 popular open-source projects. We show that TypeWriter’s type predictor achieves an F1 score of 0.64 (0.79) in the top-1 (top-5) predictions for return types, and 0.57 (0.80) for argument types, clearly outperforming prior type prediction models. By combining predictions with search-based validation, TypeWriter can fully annotate between 14% and 44% of the files in a randomly selected corpus while ensuring type correctness. A comparison with a static type inference tool shows that TypeWriter adds many more non-trivial types. TypeWriter currently suggests types to developers at Facebook, and several thousand types have already been accepted with minimal changes.
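    A minimal sketch of the search-based validation loop follows (illustrative only; TypeWriter's predictor, ranking, and checker integration are far richer). It assumes mypy is installed as a stand-in gradual type checker and uses a hypothetical function with top-3 predicted types per slot.

```python
import itertools
import os
import subprocess
import tempfile

def type_checks(source: str) -> bool:
    # Stand-in oracle: run mypy (assumed installed) on the candidate source.
    fd, path = tempfile.mkstemp(suffix=".py")
    os.close(fd)
    try:
        with open(path, "w") as f:
            f.write(source)
        return subprocess.run(["mypy", path], capture_output=True).returncode == 0
    finally:
        os.unlink(path)

# Hypothetical function body whose annotation slots we want to fill.
TEMPLATE = "def parse(x: {arg}) -> {ret}:\n    return int(x)\n"

# Hypothetical top-3 predictions from a neural model, ranked best first.
candidates = {"arg": ["str", "int", "bytes"], "ret": ["int", "str", "bool"]}

# Feedback-directed search: try combinations in rank order and accept the
# first fully annotated version the gradual type checker validates.
for arg, ret in itertools.product(candidates["arg"], candidates["ret"]):
    if type_checks(TEMPLATE.format(arg=arg, ret=ret)):
        print(f"accepted: parse(x: {arg}) -> {ret}")
        break
```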