Array operators using multiple dispatch: a design methodology for array implementations in dynamic languages
Arrays are such a rich and fundamental data type that they tend to be built
into a language, either in the compiler or in a large low-level library.
Defining this functionality at the user level instead provides greater
flexibility for application domains not envisioned by the language designer.
Only a few languages, such as C++ and Haskell, provide the necessary power to
define n-dimensional arrays, but these systems rely on compile-time
abstraction, sacrificing some flexibility. In contrast, dynamic languages make
it straightforward for the user to define any behavior they might want, but at
the possible expense of performance.
As part of the Julia language project, we have developed an approach that
yields a novel trade-off between flexibility and compile-time analysis. The
core abstraction we use is multiple dispatch. We have come to believe that
while multiple dispatch has not been especially popular in most kinds of
programming, technical computing is its killer application. By expressing key
functions such as array indexing using multi-method signatures, a surprising
range of behaviors can be obtained, in a way that is both relatively easy to
write and amenable to compiler analysis. The compact factoring of concerns
provided by these methods makes it easier for user-defined types to behave
consistently with types in the standard library.
Comment: 6 pages, 2 figures, workshop paper for the ARRAY '14 workshop, June 11, 2014, Edinburgh, United Kingdom.
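The core idea of defining behaviors such as array indexing via multi-method signatures can be sketched outside Julia as well. The following Python toy (a hand-rolled dispatch table; the names `registry`, `defmethod`, and `dispatch` are illustrative inventions, and Julia's parametric, compiler-analyzed dispatch is far richer) shows how selecting an implementation by the runtime types of *all* arguments lets user-defined behaviors coexist cleanly:

```python
# Minimal multiple-dispatch sketch. Implementations are registered per
# (function name, argument types) signature and selected at call time.
registry = {}

def defmethod(name, *types):
    """Register an implementation of `name` for an argument-type signature."""
    def wrap(fn):
        registry[(name,) + types] = fn
        return fn
    return wrap

def dispatch(name, *args):
    """Look up the implementation matching the runtime argument types."""
    fn = registry.get((name,) + tuple(type(a) for a in args))
    if fn is None:
        raise TypeError(f"no method {name} for {[type(a).__name__ for a in args]}")
    return fn(*args)

@defmethod("getindex", list, int)
def _(a, i):          # integer indexing into a list-backed array
    return a[i]

@defmethod("getindex", dict, str)
def _(a, k):          # key-based indexing into an associative "array"
    return a[k]

print(dispatch("getindex", [10, 20, 30], 1))   # → 20
print(dispatch("getindex", {"x": 7}, "x"))     # → 7
```

A new container type joins the system simply by registering its own `getindex` method, which is the factoring of concerns the abstract describes.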
Enhancing clinical concept extraction with distributional semantics
Extracting concepts (such as drugs, symptoms, and diagnoses) from clinical narratives constitutes a basic enabling technology to unlock the knowledge within and to support more advanced reasoning applications such as diagnosis explanation, disease progression modeling, and intelligent analysis of the effectiveness of treatment. The recent release of annotated training sets of de-identified clinical narratives has contributed to the development and refinement of concept extraction methods. However, as the annotation process is labor-intensive, training data are necessarily limited in the concepts and concept patterns covered, which impacts the performance of supervised machine learning applications trained with these data. This paper proposes an approach to minimize this limitation by combining supervised machine learning with empirical learning of semantic relatedness from the distribution of the relevant words in additional unannotated text.
The approach uses a sequential discriminative classifier (Conditional Random Fields) to extract the mentions of medical problems, treatments, and tests from clinical narratives. It takes advantage of all Medline abstracts indexed as being of the publication type “clinical trials” to estimate the relatedness between words in the i2b2/VA training and testing corpora. In addition to traditional features such as dictionary matching, pattern matching, and part-of-speech tags, we also used as features words that appear in contexts similar to that of the word in question (that is, words that have a similar vector representation, measured with the commonly used cosine metric, where vector representations are derived using methods of distributional semantics). To the best of our knowledge, this is the first effort exploring the use of distributional semantics (the semantics derived empirically from unannotated text, often using vector space models) for a sequence classification task such as concept extraction.
Therefore, we first experimented with different sliding-window models and selected the parameters that led to the best performance in a preliminary sequence labeling task.
The evaluation of this approach, performed against the i2b2/VA concept extraction corpus, showed that incorporating features based on the distribution of words across a large unannotated corpus significantly aids concept extraction. Compared to a supervised-only baseline, the micro-averaged F-score for exact match increased from 80.3% to 82.3%, and the micro-averaged F-score for inexact match increased from 89.7% to 91.3%. These improvements are highly significant according to the bootstrap resampling method, and also when considering the performance of other systems. Thus, distributional semantic features significantly improve the performance of concept extraction from clinical narratives by taking advantage of word distribution information obtained from unannotated data.
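The distributional-semantics feature described above rests on comparing sliding-window co-occurrence vectors with the cosine metric. A minimal sketch, assuming a toy corpus and a fixed window (the function names `context_vector` and `cosine` are illustrative, not the paper's implementation):

```python
import math
from collections import Counter

def context_vector(word, corpus, window=2):
    """Count the words co-occurring with `word` within a +/-window span."""
    vec = Counter()
    for sentence in corpus:
        toks = sentence.lower().split()
        for i, t in enumerate(toks):
            if t == word:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        vec[toks[j]] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [
    "the patient was given aspirin for pain",
    "the patient was given ibuprofen for pain",
    "the report was filed by the nurse",
]
u = context_vector("aspirin", corpus)
v = context_vector("ibuprofen", corpus)
print(cosine(u, v))  # → 1.0 (identical contexts in this toy corpus)
```

Words with high cosine similarity to a token can then be supplied as extra features to the CRF, which is how unannotated text informs the supervised extractor.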
αRby—An Embedding of Alloy in Ruby
We present αRby—an embedding of the Alloy language in Ruby—and demonstrate the benefits of having a declarative modeling language (backed by an automated solver) embedded in a traditional object-oriented imperative programming language. This approach aims to bring these two distinct paradigms (imperative and declarative) together in a novel way. We argue that having the other paradigm available within the same language is beneficial to both the modeling community of Alloy users and the object-oriented community of Ruby programmers. In this paper, we primarily focus on the benefits for the Alloy community, namely, how αRby provides elegant solutions to several well-known, outstanding problems: (1) mixed execution, (2) specifying partial instances, and (3) staged model finding.
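The declarative style the abstract refers to, stating a constraint and asking a solver for a satisfying model, can be illustrated in miniature. The sketch below brute-forces tiny scopes in Python purely for illustration (Alloy delegates this search to a SAT solver via Kodkod; `find_model` and its encoding are assumptions of this sketch, not αRby's API):

```python
from itertools import product

def find_model(atoms, arity, constraint):
    """Exhaustively search for a relation over `atoms` satisfying `constraint`.
    A relation is modeled as a frozenset of `arity`-tuples; every subset of
    the full cross product is tried until one satisfies the predicate."""
    tuples = list(product(atoms, repeat=arity))
    for bits in product([0, 1], repeat=len(tuples)):
        rel = frozenset(t for t, b in zip(tuples, bits) if b)
        if constraint(rel):
            return rel
    return None  # unsatisfiable within this scope

# Find a nonempty, irreflexive binary relation over two atoms.
atoms = ["a", "b"]
model = find_model(atoms, 2, lambda r: r and all(x != y for x, y in r))
print(sorted(model))  # → [('b', 'a')]
```

The imperative host language supplies the constraint as an ordinary closure, which is the kind of paradigm mixing αRby exploits at a much larger scale.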
Support for adaptivity in ARMCI using migratable objects
Many new paradigms of parallel programming have emerged that compete with and complement the standard and well-established MPI model. Most notable, and successful, among these are models that support some form of global address space. At the same time, approaches based on migratable objects (also called virtualized processes) have shown that resource management concerns can be separated effectively from the overall parallel programming effort. For example, Charm++ supports dynamic load balancing via an intelligent adaptive run-time system. It is also becoming clear that a multi-paradigm approach that allows modules written in one or more paradigms to coexist and cooperate will be necessary to tame the parallel programming challenge. ARMCI is a remote memory copy library that serves as a foundation of many global address space languages and libraries. This paper presents our preliminary work on integrating and supporting ARMCI with the adaptive run-time system of Charm++ as a part of our overall effort in the multi-paradigm approach.
Formal techniques for the procedural control of industrial processes
PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development
This paper describes PlinyCompute, a system for development of
high-performance, data-intensive, distributed computing tools and libraries. In
the large, PlinyCompute presents the programmer with a very high-level,
declarative interface, relying on automatic, relational-database style
optimization to figure out how to stage distributed computations. However, in
the small, PlinyCompute presents the capable systems programmer with a
persistent object data model and API (the "PC object model") and associated
memory management system that has been designed from the ground-up for high
performance, distributed, data-intensive computing. This contrasts with most
other Big Data systems, which are constructed on top of the Java Virtual
Machine (JVM), and hence must at least partially cede performance-critical
concerns such as memory management (including layout and de/allocation) and
virtual method/function dispatch to the JVM. This hybrid approach (declarative
in the large, trusting the programmer's ability to utilize the PC object model
efficiently in the small) results in a system that is ideal for the
development of reusable, data-intensive tools and libraries. Through extensive
benchmarking, we show that implementing complex object manipulation and
non-trivial, library-style computations on top of PlinyCompute can result in a
speedup of 2x to more than 50x compared to equivalent implementations on Spark.
Comment: 48 pages, including references and Appendix.
Comment analysis for program comprehension
Master's dissertation in Informatics Engineering.
The constant demand, mostly from software maintenance professionals, for new and more
efficient methods of understanding programs has posed several challenges
to Program Comprehension researchers. Nowadays, programmers are not satisfied with the
simple extraction of a program's structure, which mainly consists of identifying control
and data flow. They want to know the meaning of the program, so that they can identify
the real-world concepts that are materialized in its source code, which would provide
a better and more efficient understanding of the program.
To address this problem, Program Comprehension researchers have mainly developed
approaches that use techniques based on Information Retrieval systems, such as search
engines. The strategy involves retrieving unstructured information, properly ranked,
in answer to a question the system can interpret. In Program Comprehension, this
strategy enables programmers to search for real-world concepts (the Problem Domain)
as implemented and mapped into programming concepts (the Program Domain).
Although their use is not consensual, source code comments have the main objective of
helping readers understand the source code, and several studies have already
demonstrated their value in the comprehension process. Even though the reasons for
this have not yet been established, some authors argue that source code comments are
an important vehicle for including Problem Domain information, and that exploiting
them improves the comprehension process.
Therefore, this Master's dissertation presents the development of a solution based on
Information Retrieval algorithms, in order to check whether the information included
in comments can contribute decisively to the efficiency of the process of
comprehending a given program or software system.
The results and conclusions drawn from this work showed that comments, properly
analyzed and classified by the developed system, helped to better understand the
Problem Domain concepts and their materialization in the source code. The developed
solution proved able to meet the challenges posed, and showed itself to be a useful
and efficient tool for the comprehension tasks that may emerge in the process of
maintaining a software system.
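The Information Retrieval strategy described in the abstract, ranking comment text against a natural-language query, commonly uses TF-IDF weighted cosine similarity. A self-contained sketch under that assumption (the helper names and the toy comment corpus are illustrative, not the dissertation's system):

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_rank(query, documents):
    """Rank documents (e.g. comments extracted per source file) against a
    query using TF-IDF weighted cosine similarity."""
    docs = {name: Counter(tokenize(text)) for name, text in documents.items()}
    n = len(docs)
    df = Counter(t for d in docs.values() for t in d)  # document frequency

    def weight(tf):
        return {t: c * math.log(n / df[t]) for t, c in tf.items() if t in df}

    q = weight(Counter(tokenize(query)))
    scores = {}
    for name, tf in docs.items():
        d = weight(tf)
        dot = sum(q.get(t, 0) * w for t, w in d.items())
        norm = math.sqrt(sum(w * w for w in d.values())) * \
               math.sqrt(sum(w * w for w in q.values()))
        scores[name] = dot / norm if norm else 0.0
    return sorted(scores, key=scores.get, reverse=True)

comments = {
    "invoice.c": "compute the total amount due on a customer invoice",
    "render.c":  "draw the main window and refresh the screen buffer",
}
print(tfidf_rank("customer invoice total", comments))  # invoice.c ranked first
```

Querying with Problem Domain vocabulary ("customer invoice") surfaces the file whose comments carry that vocabulary, which is the effect the dissertation evaluates.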
Metarel, an ontology facilitating advanced querying of biomedical knowledge
Knowledge management has become indispensable in the Life Sciences for integrating and querying the enormous amounts of detailed knowledge about genes, organisms, diseases, drugs, cells, etc. Such detailed knowledge is continuously generated in bioinformatics via both hardware (e.g. raw data dumps from micro-arrays) and software (e.g. computational analysis of data). Well-known frameworks for managing knowledge are relational databases and spreadsheets. The doctoral dissertation describes knowledge management in two more recently investigated frameworks: ontologies and the Semantic Web. Knowledge statements like ‘lions live in Africa’ and ‘genes are located in a cell nucleus’ are managed with the use of URIs, logics, and the ontological distinction between instances and classes. Both theory and practice are described. Metarel, the core subject of the dissertation, is an ontology describing relations that can bridge the mismatch between network-based relations, which appeal to internet browsing, and logic-based relations, which are formally expressed in Description Logic. Another important subject of the dissertation is BioGateway, a knowledge base that has integrated biomedical knowledge in the form of hundreds of millions of network-based relations in the RDF format. Metarel was used to upgrade the logical meaning of these relations towards Description Logic. This made it possible to build a computer reasoner that runs over the knowledge base and derives new knowledge statements.
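Deriving new statements from network-based relations can be illustrated with a tiny fixpoint computation over subject-predicate-object triples. This sketch only shows transitivity for a single predicate and is in no way the dissertation's reasoner (Description Logic reasoners over RDF handle far richer semantics); the triples and the `infer_transitive` helper are illustrative assumptions:

```python
def infer_transitive(triples, predicate):
    """Naively saturate a set of (subject, predicate, object) triples under
    transitivity for one predicate, adding derived statements until no new
    fact appears (a fixpoint)."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for s1, p1, o1 in list(facts):
            for s2, p2, o2 in list(facts):
                if p1 == p2 == predicate and o1 == s2:
                    new = (s1, predicate, o2)
                    if new not in facts:
                        facts.add(new)
                        changed = True
    return facts

kb = {
    ("nucleolus", "part_of", "nucleus"),
    ("nucleus", "part_of", "cell"),
}
derived = infer_transitive(kb, "part_of")
print(("nucleolus", "part_of", "cell") in derived)  # → True
```

Encoding relation properties such as transitivity is exactly the kind of meta-information an ontology like Metarel attaches to network-based relations so that a reasoner can exploit them.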
Improving Information Retrieval Bug Localisation Using Contextual Heuristics
Software developers working on unfamiliar systems are challenged to identify where and how high-level concepts are implemented in the source code prior to performing maintenance tasks. Bug localisation is a core program comprehension activity in software maintenance: given the observation of a bug, e.g. via a bug report, where is it located in the source code?
Information retrieval (IR) approaches see the bug report as the query, and the source files as the documents to be retrieved, ranked by relevance. Current approaches rely on project history, in particular previously fixed bugs and versions of the source code. Existing IR techniques fall short of providing adequate solutions in finding all the source code files relevant for a bug. Without additional help, bug localisation can become a tedious, time-consuming and error-prone task.
My research contributes a novel algorithm that, given a bug report and the application’s source files, uses a combination of lexical and structural information to suggest, in a ranked order, files that may have to be changed to resolve the reported bug without requiring past code and similar reports.
I study eight applications for which I had access to the user guide, the source code, and some bug reports. I compare the relative importance and the occurrence of the domain concepts in the project artefacts and measure the effectiveness of using only concept key words to locate files relevant for a bug compared to using all the words of a bug report.
Measuring my approach against six others, using their five metrics and eight projects, I position an affected file in the top-1, top-5 and top-10 ranks on average for 44%, 69% and 76% of the bug reports respectively. This is an improvement of 23%, 16% and 11% respectively over the best-performing current state-of-the-art tool.
Finally, I evaluate my algorithm with a range of industrial applications in user studies, and find that it is superior to simple string search, as often performed by developers. These results show the applicability of my approach to software projects without history and offer a simpler, lightweight solution.
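The combination of lexical and structural information described above can be sketched as a simple scoring heuristic: lexical term overlap between the bug report and each file, plus a structural boost when a file's base name is mentioned in the report. This is an illustrative toy, not the dissertation's algorithm; `rank_files` and the sample data are assumptions:

```python
import re

def rank_files(bug_report, files):
    """Score source files against a bug report: lexical term overlap,
    boosted when the file's base name appears in the report text."""
    report_terms = set(re.findall(r"[a-z]+", bug_report.lower()))
    scores = {}
    for path, source in files.items():
        terms = set(re.findall(r"[a-z]+", source.lower()))
        overlap = len(report_terms & terms) / (len(terms) or 1)
        base = path.rsplit("/", 1)[-1].split(".")[0].lower()
        boost = 0.5 if base in report_terms else 0.0  # structural heuristic
        scores[path] = overlap + boost
    return sorted(scores, key=scores.get, reverse=True)

files = {
    "src/parser.py": "def parse(tokens): raise SyntaxError on bad input",
    "src/render.py": "def draw(canvas): paint widgets to screen",
}
hits = rank_files("crash in parser when input has bad tokens", files)
print(hits[0])  # → src/parser.py
```

Note that nothing here consults past fixes or version history, mirroring the dissertation's goal of localising bugs from the report and current source alone.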