18 research outputs found

    Predicting the Critical Number of Layers for Hierarchical Support Vector Regression

    Hierarchical support vector regression (HSVR) models a function from data as a linear combination of SVR models at a range of scales, starting at a coarse scale and moving to finer scales as the hierarchy continues. In the original formulation of HSVR, there were no rules for choosing the depth of the model. In this paper, we observe in a number of models a phase transition in the training error: the error remains relatively constant as layers are added until a critical scale is passed, at which point the training error drops close to zero and remains nearly constant for added layers. We introduce a method to predict this critical scale a priori, with the prediction based on the support of either a Fourier transform of the data or the Dynamic Mode Decomposition (DMD) spectrum. This allows us to determine the required number of layers prior to training any models. (Comment: 18 pages, 9 figures)
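    The layer-count prediction described above could be sketched roughly as follows, assuming (purely for illustration) a scale schedule that halves at every layer and a simple amplitude threshold on the FFT to estimate the spectral support; the paper's exact criterion may differ.

```python
# Sketch: estimate the critical number of HSVR layers from the support of the
# data's Fourier transform. The halving scale schedule and the 1% spectral
# threshold are illustrative assumptions, not the paper's exact procedure.
import numpy as np

def critical_layers(y, dx, scale0, threshold=0.01):
    """Return the number of layers needed for the coarsest scale `scale0`
    (halved at each layer) to resolve the highest significant frequency."""
    spectrum = np.abs(np.fft.rfft(y - y.mean()))
    freqs = np.fft.rfftfreq(len(y), d=dx)
    significant = freqs[spectrum > threshold * spectrum.max()]
    f_max = significant.max() if significant.size else freqs[1]
    finest_scale = 1.0 / (2.0 * f_max)   # roughly the finest detail to capture
    layers = 1
    while scale0 / 2 ** (layers - 1) > finest_scale:
        layers += 1
    return layers

# Example: a two-tone signal sampled on [0, 1]
x = np.linspace(0.0, 1.0, 512, endpoint=False)
y = np.sin(2 * np.pi * 3 * x) + 0.3 * np.sin(2 * np.pi * 40 * x)
print(critical_layers(y, dx=x[1] - x[0], scale0=1.0))
```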

    Cooperative Navigation for Mixed Human–Robot Teams Using Haptic Feedback

    In this paper, we present a novel cooperative navigation control for human–robot teams. Assuming that a human wants to reach a final location in a large environment with the help of a mobile robot, the robot must steer the human from the initial to the target position. The challenges posed by cooperative human–robot navigation are typically addressed by using haptic feedback via physical interaction. In contrast, in this paper the human–robot interaction is achieved via wearable vibrotactile armbands, and the subject is free to decide her/his own pace. A warning vibrational signal is generated by the haptic armbands when the robot detects a large deviation with respect to the desired pose. The proposed method has been evaluated in a large indoor environment, where 15 blindfolded human subjects were asked to follow the haptic cues provided by the robot. The participants had to reach a target area while avoiding static and dynamic obstacles. Experimental results revealed that the blindfolded subjects were able to avoid the obstacles and safely reach the target in all of the performed trials. A comparison is also provided between the results obtained with blindfolded users and experiments performed with sighted people.
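    A rough sketch of the warning logic described in the abstract, under the assumption that the robot compares the human's position against a desired pose and cues the armband on the side of the deviation; the 0.5 m threshold, the Pose fields, and the side-to-cue mapping are hypothetical, not taken from the paper.

```python
# Sketch of the deviation-triggered warning: compute the signed lateral error of
# the human's current position with respect to the desired pose and, if it exceeds
# a threshold, report which side deviates. How the side maps to an actual
# vibrotactile cue is an assumption here.
import math
from dataclasses import dataclass

@dataclass
class Pose:
    x: float
    y: float
    heading: float  # radians

def deviation_cue(current: Pose, desired: Pose, threshold: float = 0.5):
    """Return 'left', 'right', or None depending on where the deviation lies."""
    dx, dy = current.x - desired.x, current.y - desired.y
    # Signed lateral error: project the offset onto the desired pose's left normal.
    lateral = -dx * math.sin(desired.heading) + dy * math.cos(desired.heading)
    if abs(lateral) < threshold:
        return None                               # on track, no vibration
    return "left" if lateral > 0 else "right"     # side of the deviation

print(deviation_cue(Pose(1.0, 0.8, 0.0), Pose(1.0, 0.0, 0.0)))  # -> 'left'
```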

    Data mining applied to neurorehabilitation data

    Integrated master's thesis, Biomedical and Biophysics Engineering (Clinical Engineering and Medical Instrumentation), Universidade de Lisboa, Faculdade de Ciências, 2017.

    Although they are not the leading cause of death in the world, brain injuries are perhaps the main reason why so many people see their daily lives affected. This is due to severe cognitive difficulties that may result from a car accident, a fall, a tumour, a stroke, exposure to toxic substances, or any other situation involving an injury to the brain. Among such injuries are those caused by external forces, the so-called traumatic brain injuries. This study focuses precisely on people who suffered an injury of this kind and who subsequently underwent neurorehabilitation treatment. This treatment, based on tasks specially designed to stimulate the reorganization of neural connections, gives patients the possibility of performing everyday tasks again with as little difficulty as possible. The aim of these tasks is to stimulate brain plasticity, the capacity responsible for the development of synaptic connections from birth onwards, which allows the brain to re-establish its normal functioning after an injury. Naturally, how severely a person is affected depends on the type of injury and strongly influences not only the time needed for physical and mental recovery but also the final outcome. The study documented in this internship report is a means towards a goal shared with other research in this area: that neurorehabilitation treatments can be personalized for each patient so that recovery is optimized. The idea is that, knowing some of a patient's personal data, information about their initial state, and the results of the tests performed, it becomes possible to assign the patient to a specific dysfunctional profile with well-defined characteristics, so that the therapist can adapt the treatment. The Institut Guttmann, in Barcelona, was the first Spanish hospital to care for patients with spinal cord injuries. Today, one of its many projects, GNPT (Guttmann NeuroPersonalTrainer), brings into patients' homes a platform that allows them to carry out the tasks defined by their therapists as part of their neurorehabilitation treatment. Data on these patients, including demographic information and the results of tests performed before and after treatment, were provided by the Institut Guttmann to the Biomedical and Telemedicine Group (GBT) in the form of databases. By analysing them with Data Mining tools, it was possible to obtain general profiles of cognitive dysfunction and to describe the evolution of those profiles, the main objective of this dissertation. Finding patterns in large volumes of data is, broadly speaking, the main purpose of a Data Mining process; this is the concept invoked whenever knowledge is to be extracted from large amounts of data.

    Several techniques make this possible, using algorithms based on statistical functions and on neural networks, and they have been continuously improved ever since the need to handle large collections of elements first arose. The purpose is always the same: the analysis performed with these techniques should convert the information hidden in the data into information that can then be used to characterize populations, support decisions, or validate results. In this case, Clustering algorithms were used, a Data Mining method that groups elements that are similar to one another into clusters, considering the characteristics of each element. Data from 698 patients who had suffered a traumatic brain injury and whose records in the databases provided by the Institut Guttmann satisfied all the conditions required for the study were integrated into a Data Warehouse (a data repository) and then structured. Using functions written in SQL, the main language for querying and organizing relational databases, the scores of the tests performed by the patients before the start and after the end of treatment were obtained. These tests evaluated three functions closely related to cognition (attention, memory, and some executive functions) using five score levels corresponding to the degree of impairment: 0 for no impairment, 1 for mild, 2 for moderate, 3 for severe, and 4 for acute impairment. The score obtained for each function is a weighted average of the scores of its subfunctions (divided attention, selective attention, working memory, among others), each computed from at least one of the 24 assessment items that every person completed. Next, the initial and final groups were determined with SPSS, a tool well suited to finding correlations in large data sets: the initial clusters were obtained with the K-means algorithm and the final clusters with the TwoStep algorithm. The main characteristic of this descriptive Data Mining technique is the use of distance to measure how close two elements of a cluster are; the algorithms differ in the type of data they handle and in how they compute the groupings. For each cluster, and for each function, the distribution of the scores was examined using bar charts, and the two sets of clusters were compared in order to interpret the relationship between them. The clusters, which in this context correspond to profiles of cognitive impairment, were validated and found to describe the study population well. On the one hand, the six initial clusters faithfully represent, in a clinically meaningful way, groups of people with characteristics distinct enough to set them apart. On the other hand, the three final clusters, used to portray the population at the end of treatment and to analyse the patients' evolution, depict rather opposite profiles, which made it easier to interpret for which patients the effect of neurorehabilitation was more or less positive.

    Some studies cited in the state of the art showed that certain variables are likely to influence a patient's final state. Taking advantage of the data available, it was examined whether, given the final clusters, any inference could be made about the effect of some of these variables (including age, education level, the interval between the injury and the start of treatment, and the duration of the treatment) on each of those clusters. Finally, considering only the test scores for each function before and after treatment, the individual developments and the overall evolution of each patient were analysed and interpreted with the help of charts. The possible developments considered were improvement, worsening, and maintenance of the patient's state. Using the information on how the patients evolved, it was checked whether, from the test scores alone, it could be confirmed that other variables might influence a patient's final state. The charts showed only very subtle differences for some of the variables, mainly between patients who improved and patients whose condition worsened. It was concluded that, because the clusters group people with different types of evolution, the effect of the other variables appeared very dispersed. The research suggested for future work includes: (i) studying the alternative profile solutions proposed by the software used (SPSS); (ii) considering the different aspects of the evaluated functions at a more detailed level; (iii) taking into account other variables with possible effects on a patient's final state.

    Although they are not the leading cause of death in the world, brain injuries are perhaps the main reason why there are so many cases of people who see their daily lives affected. This is due to the major cognitive difficulties that appear after a brain lesion. Brain injuries include those derived from traumas caused by external forces, the traumatic brain injuries. This study focuses on people who, after such injuries, underwent a neurorehabilitation treatment. The treatment, based on tasks specially designed to stimulate the reorganization of neural connections, allows patients to regain the ability to perform their everyday tasks with the least possible difficulty. These tasks aim to stimulate brain plasticity, responsible for the development of the synaptic connections that allow the brain to re-establish its normal functioning after an injury. The study documented in this internship report constitutes another step towards a major goal, common to other studies in this area: that neurorehabilitation treatments can be personalized for each patient, so that their recovery is optimized. Knowing some of a patient's personal data, considering information about their initial state, and using the results of the tests performed, it is possible to assign a person to a certain dysfunctional profile with specific characteristics, so that the therapist can adapt the treatment. One of the many projects of the Institut Guttmann (IG) is called GNPT (Guttmann NeuroPersonalTrainer) and brings into its patients' homes a platform that allows them to perform the tasks set by the therapists in the context of their neurorehabilitation treatments.
    Data from these patients, including clinical information and the results of tests performed before and after treatment, were provided by the IG to the Biomedical and Telemedicine Group (GBT) as databases. Through their analysis, using Data Mining techniques, it was possible to obtain general profiles of cognitive dysfunction and to characterize the evolution of these profiles, the objective of this work. Finding patterns and extracting knowledge from large volumes of data are the main functions of a Data Mining process; an analysis performed with these techniques converts information hidden in the data into information that can later be used to make decisions or to validate results. In this case, Clustering algorithms, which build groups of elements with similar characteristics called clusters, were used. Data from 698 patients who had suffered brain trauma and whose information in the databases provided by the IG satisfied all the necessary conditions were integrated into a Data Warehouse and then structured. The scores corresponding to the tests performed before and after treatment were calculated for each patient. These tests evaluated, using five score levels corresponding to the degree of affectation, three functions strictly related to the cognitive level: attention, memory, and some executive functions (the cognitive processes necessary for the cognitive control of behaviour). The initial and final clusters, representing patient profiles, were determined using the SPSS software. The distribution of the scores over the clusters was examined through bar charts, and the two groups of clusters were compared in order to interpret the relationship between them. The clusters, which in this context correspond to profiles of cognitive affectation, were validated, and it was concluded that they represent the state of the patients under study well. As some variables, such as age and education level, are likely to influence a patient's final state, it was examined whether, given the final clusters, any inference could be made about the effect of those variables; no definitive conclusions were drawn from this part. Considering the test scores, each patient's evolution was classified as improvement, aggravation, or maintenance of their condition. Using that information, conclusions were drawn about the population and the effect of the variables. The plots obtained made it possible to describe the patients' evolution and to assess whether the variables considered were good descriptors of that evolution; a simple interpretation of the results suggests that the calculated profiles are good general, though not perfect, descriptors of the population. The research suggested for future developments includes: (i) the study of the alternative profile solutions proposed by the Data Mining software; (ii) considering the different aspects of the evaluated functions at a more detailed level; (iii) taking into account other variables with possible effects on describing a patient's final state.
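    As a rough illustration of the clustering step, the following sketch groups hypothetical pre-treatment scores with K-means via scikit-learn; the six-cluster choice mirrors the thesis, but the random data and the exact feature encoding are assumptions.

```python
# Minimal sketch of the clustering step: patients are represented by their
# pre-treatment scores (0-4) for attention, memory and executive functions,
# and grouped with K-means into six clusters, as in the thesis.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical pre-treatment scores for 698 patients: attention, memory, executive.
scores = rng.integers(0, 5, size=(698, 3)).astype(float)

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(scores)
for label in range(6):
    members = scores[kmeans.labels_ == label]
    print(f"cluster {label}: n={len(members)}, mean profile={members.mean(axis=0).round(2)}")
```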

    Hardware-conscious query processing for the many-core era

    Making optimal use of modern hardware to accelerate database queries is not a trivial task. Many DBMS and DSMS of the past decades are built on assumptions that hardly hold any more. One example is today's server systems, whose main memory can reach several terabytes and which have thus paved the way for main-memory databases. One of the more recent hardware trends is processors with a very high number of cores, so-called many-core CPUs. They allow programs a high degree of parallelism through multithreading and vectorization (SIMD), but this significantly raises the demands on memory bandwidth. High-Bandwidth Memory (HBM) attempts to close this gap but, like many-core CPUs, can negate any performance advantage if used carelessly. This work presents the many-core CPU architecture together with HBM for accelerating database and data-stream queries. It shows that a hardware-conscious cost model combined with a calibration approach can reliably predict the performance of different query operators. This enables both an adaptive partitioning and merging strategy for parallelizing stream queries and an ideal configuration of join operations in a DBMS. Nevertheless, not every operation and application is suited to a many-core CPU and HBM. Stream queries are often bound to low latency and fast response times, which can hardly benefit from higher memory bandwidth; in addition, the high core count usually comes with lower clock rates, and shared data structures suffer from the cost of establishing cache coherence and of synchronizing parallel thread accesses. From the results of this work it can be derived which parallel data structures are particularly suited to the use of HBM. Furthermore, various techniques for parallelizing and synchronizing data structures are presented, whose efficiency is demonstrated on a multi-way data-stream join.

    Exploiting the opportunities given by modern hardware for accelerating query processing is no trivial task. Many DBMS and also DSMS from past decades are based on fundamentals that have changed over time; for example, today's servers with terabytes of main memory capacity allow spilling data to disk to be avoided entirely, which prepared the ground for main-memory databases some time ago. One of the recent trends in hardware is many-core processors with hundreds of logical cores on a single CPU, providing an intense degree of parallelism through multithreading as well as vectorized instructions (SIMD). Their demand for memory bandwidth has led to the further development of high-bandwidth memory (HBM) to overcome the memory wall. However, many-core CPUs as well as HBM have many pitfalls that can nullify any performance gain with ease. In this work, we explore the many-core architecture along with HBM for database and data stream query processing. We demonstrate that a hardware-conscious cost model with a calibration approach allows reliable performance prediction of various query operations.
    Based on that information, we can therefore arrive at an adaptive partitioning and merging strategy for stream query parallelization, as well as find an ideal configuration of parameters for one of the most common tasks in the history of DBMS, join processing. However, not all operations and applications can exploit a many-core processor or HBM. Stream queries optimized for low latency and quick individual responses usually benefit little from more bandwidth and also suffer from penalties such as the low clock frequencies of many-core CPUs. Shared data structures between cores also lead to problems with cache coherence as well as high contention. Based on our insights, we give a rule of thumb for which data structures are suitable for parallelization with a focus on HBM usage. In addition, different parallelization schemes and synchronization techniques are evaluated, using the example of a multi-way stream join operation.
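    A minimal sketch of the calibration idea, assuming a toy hash-build operator and a linear cost model fitted from a few measured input sizes; the real work targets many-core and HBM specifics that this illustration does not capture.

```python
# Sketch of calibration-based cost prediction: measure an operator's runtime on a
# few sample sizes on the target machine, fit runtime ~ a*n + b, and use the model
# to predict runtimes for larger inputs (and hence to pick configurations).
# The hash-build micro-benchmark and the linear model are illustrative assumptions.
import time
import numpy as np

def build_hash_table(keys):
    table = {}
    for k in keys:
        table.setdefault(k, []).append(k)
    return table

def calibrate(sizes=(10_000, 50_000, 100_000)):
    """Fit runtime ~ a * n + b from a few measured input sizes."""
    times = []
    for n in sizes:
        keys = np.random.randint(0, n, size=n)
        start = time.perf_counter()
        build_hash_table(keys)
        times.append(time.perf_counter() - start)
    a, b = np.polyfit(sizes, times, deg=1)
    return a, b

a, b = calibrate()
predicted = a * 1_000_000 + b
print(f"predicted build time for 1M tuples: {predicted:.3f}s")
```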

    A criteria based function for reconstructing low-sampling trajectories as a tool for analytics

    Abstract: Mobile applications equipped with Global Positioning Systems have generated a huge quantity of location data, with sampling uncertainty, that must be handled and analyzed. These location data can be ordered in time to represent trajectories of moving objects, and a data warehouse approach based on spatio-temporal data can help with this task. For this reason, we address the problem of personalized reconstruction of low-sampling trajectories based on criteria over a graph, so that movement criteria can be included as a dimension in a trajectory data warehouse solution and used to carry out analytical tasks over moving objects and the environment in which they move.
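    The criteria-based reconstruction could be sketched as follows, assuming the sampled GPS points are already map-matched to nodes of a road graph and that each criterion is stored as an edge weight; the tiny graph and the use of networkx are illustrative assumptions.

```python
# Sketch of criteria-based trajectory reconstruction: between two consecutive
# low-sampling GPS points mapped to graph nodes, infer the route as the path that
# is cheapest under a chosen movement criterion (distance, time, ...).
import networkx as nx

G = nx.DiGraph()
# edge attributes: length in metres, travel time in seconds
G.add_edge("A", "B", length=300, time=40)
G.add_edge("B", "C", length=200, time=30)
G.add_edge("A", "D", length=250, time=90)   # shorter but slower alternative
G.add_edge("D", "C", length=150, time=80)

def reconstruct(graph, origin, destination, criterion="time"):
    """Return the node sequence joining two observed samples under one criterion."""
    return nx.shortest_path(graph, origin, destination, weight=criterion)

print(reconstruct(G, "A", "C", criterion="time"))    # route minimizing travel time
print(reconstruct(G, "A", "C", criterion="length"))  # route minimizing distance
```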

    Goal-oriented detection and removal of quality deficits in software systems, using evolvability as an example

    The evolvability of software systems is one of the key issues when considering their long-term quality. Continuous changes and extensions of these systems are necessary to adjust them to new or changing requirements, but the changes often cause quality deficiencies, which lead to an increase in complexity or to architectural decay. Quality deficiencies within the specification or the architecture in particular can heavily impair a software system. To counteract this, a method is developed in this work to support the analysis of a quality goal in order to identify the quality deficiencies that hinder its achievement. Both the detection and the removal of quality deficiencies are accomplished systematically: the method integrates rule-based detection of these quality deficiencies with their removal through reengineering activities. The detection of quality deficiencies is performed by means of measurable quality attributes derived from a quality goal such as evolvability. To demonstrate the practicability of the method, the quality goal of evolvability is taken as an example; this work shows how a software system can be evaluated with regard to evolvability based on structural dependencies and which reengineering activities will move the system towards this quality goal. To evaluate the method, it was applied in an industrial case study. By analyzing the given software system, a large number of different quality deficiencies were detected, and the system's evolvability was subsequently improved substantially by the reengineering activities proposed by the method.

    For business-critical software systems that are meant to be long-lived and extensible, the quality goal of evolvability is essential. Continuous changes and extensions are indispensable to adapt such software systems to new or changed requirements, but these measures often also introduce quality deficiencies that can lead to increasing complexity or to architectural decay. Quality deficiencies in the specification or architecture in particular can severely impair software systems. To counteract this, this work develops a method that can assess compliance with quality goals, enabling both the detection and the removal of quality deficiencies during software development. Quality deficiencies are detected through a rule-based analysis oriented towards the quality goal and removed by the reengineering activities assigned to them. The evolvability of software systems is considered as an example of a quality goal, and it is shown how this goal can be assessed on the basis of structural dependencies in software systems and improved through targeted reengineering activities. To validate the method, an industrial case study was carried out: a large number of quality deficiencies were detected and removed, and the evolvability of the examined software system was decisively improved by the proposed reengineering activities.
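    A schematic illustration of rule-based deficiency detection over structural dependencies, assuming a hypothetical two-rule catalogue (dependency cycles and high fan-in); the thesis's actual quality attributes, rules and reengineering mappings are richer than this sketch.

```python
# Illustrative sketch: represent structural dependencies between modules as a
# directed graph and flag rule violations that hinder evolvability.
# The rules, thresholds and module names are assumptions, not the thesis's catalogue.
import networkx as nx

deps = nx.DiGraph()
deps.add_edges_from([
    ("ui", "service"), ("service", "persistence"),
    ("persistence", "service"),          # cycle: hinders isolated changes
    ("service", "util"), ("ui", "util"), ("reporting", "util"),
])

def detect_deficiencies(graph, max_fan_in=2):
    findings = []
    for cycle in nx.simple_cycles(graph):
        findings.append(("dependency cycle", cycle, "break the cycle, e.g. via an interface"))
    for node in graph:
        fan_in = graph.in_degree(node)
        if fan_in > max_fan_in:
            findings.append(("high fan-in", node, f"{fan_in} dependents; consider splitting"))
    return findings

for rule, subject, suggestion in detect_deficiencies(deps):
    print(rule, subject, "->", suggestion)
```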

    A methodological proposal for calculating turn penalties in accessibility models

    This master's thesis seeks to develop a methodology for calculating the turn penalties used in accessibility models and, in general, in transport models, given that shortest-path algorithms that include turn penalties and restrictions are used to compute travel times on the road network; among these models is global mean accessibility, applied to topics such as urban and transport planning in Manizales (Colombia) and in cities around the world. In Manizales, the turn penalties and restrictions used so far have been set subjectively, so no value calculated by a scientific method is available. Turn penalties and restrictions for the city will therefore be calculated by quantifying vehicle turning times at several road intersections, selected through a prioritization analysis, with a video recorded at each one. From these data the average left-turn and right-turn times can be obtained, that is, the turn penalties for Manizales to be used in the accessibility models computed for the city or in transport models in general. The penalties calculated with this methodology will be compared with those used in previous research through the savings gradient, which allows the differences produced by this parameter and its importance in transport models, accessibility among them, to be quantified.

    Abstract: This master's thesis seeks to develop a methodology for calculating the turn penalties to be used in accessibility models and, in general, in transport models, given the use of shortest-path algorithms that include turn penalties and restrictions for calculating travel times in the road network; among these models is global mean accessibility, used in areas such as urban and transport planning in Manizales (Colombia) and in different cities around the world. In Manizales, the turn penalties and restrictions used in accessibility models have been determined subjectively, so they are not calculated from a scientific method. Therefore, turn penalties and restrictions for Manizales will be calculated by quantifying vehicle turning times at different road intersections, chosen from a prioritization analysis, recording a video at each one. With these data we can obtain the average left-turn and right-turn times, that is, the turn penalties for Manizales to be used in the accessibility models calculated for the city or in transport models in general. The penalties calculated using this methodology will be compared with the penalties used in previous research through the savings gradient, which allows us to quantify the differences generated by this parameter and its importance in transport models, including accessibility.
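    The core computation can be sketched as follows, assuming hypothetical turning times extracted from the intersection videos; the averages become the turn penalties, which are then added to link travel times when evaluating a route.

```python
# Minimal sketch of the proposed calculation: average the measured turning times
# per movement type to obtain the turn penalties, then add them to link travel
# times in a shortest-path model. The sample measurements (seconds, from
# hypothetical video observations) are illustrative assumptions.
from statistics import mean

measurements = {
    "left":  [8.2, 9.5, 7.8, 10.1, 8.9],   # seconds per observed left turn
    "right": [4.1, 3.8, 4.6, 5.0, 4.3],    # seconds per observed right turn
}
turn_penalty = {movement: mean(times) for movement, times in measurements.items()}
print(turn_penalty)  # e.g. {'left': 8.9, 'right': 4.36}

# Applying the penalty to a route: link travel times plus one penalty per turn.
route_links = [35.0, 42.0, 28.0]             # seconds per link
route_turns = ["right", "left"]              # movements between consecutive links
total = sum(route_links) + sum(turn_penalty[t] for t in route_turns)
print(f"route travel time with turn penalties: {total:.1f}s")
```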

    A dependency-aware, context-independent code search infrastructure

    Over the last decade many code search engines and recommendation systems have been developed, both in academia and industry, to improve the component discovery step in the software reuse process. Key examples include Krugle, Koders, Portfolio, Merobase, Sourcerer, Strathcona and SENTRE. However, the recall and precision of this current generation of code search tools are limited by their inability to cope effectively with the structural dependencies between code units. This lack of “dependency awareness” manifests itself in three main ways. First, it limits the kinds of search queries that users can define and thus the precision and local recall of dependency-aware searches (giving rise to large numbers of false positives and false negatives). Second, it reduces the global recall of the component harvesting process by limiting the range of dependency-containing software components that can be used to populate the search repository. Third, it significantly reduces the performance of the retrieval process for dependency-aware searches. This thesis lays the foundation for a new generation of dependency-aware code search engines that addresses these problems by designing and prototyping a new kind of software search platform. Inspired by the Merobase code search engine, this platform contains three main innovations: an enhanced, dependency-aware query language which allows traditional Merobase interface-based searches to be extended with dependency requirements; a new “context-independent” crawling infrastructure which can recognize dependencies between code units even when their context (e.g. project) is unknown; and a new graph-based database, integrated with a full-text search engine and optimized to store code modules and their dependencies efficiently. After describing the background to, and state of the art in, the field of code search engines and information retrieval, the thesis motivates the aforementioned innovations and explains how they are realized in the DAISI (Dependency-Aware, context-Independent code Search Infrastructure) prototype using Lucene and Neo4J. DAISI is then used to demonstrate the advantages of the developed technology in a range of examples.
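    To illustrate what a dependency-aware query adds over a plain interface-based search, here is a small in-memory sketch; the module corpus, signature format and search function are assumptions, whereas the DAISI prototype itself stores modules and dependencies in Neo4J with Lucene-based full-text indexing.

```python
# Sketch of a dependency-aware search over an in-memory module graph: a query asks
# for components that offer a given method signature AND depend (directly) on a
# component offering another signature. The data model and toy corpus are assumptions.
modules = {
    "QuickSorter": {"signatures": {"sort(int[]):int[]"},    "deps": {"Comparator"}},
    "MergeSorter": {"signatures": {"sort(int[]):int[]"},    "deps": set()},
    "Comparator":  {"signatures": {"compare(int,int):int"}, "deps": set()},
}

def search(offered, required_dep_signature=None):
    """Return modules offering `offered`, optionally requiring a dependency that offers another signature."""
    hits = []
    for name, info in modules.items():
        if offered not in info["signatures"]:
            continue
        if required_dep_signature is None:
            hits.append(name)
            continue
        if any(required_dep_signature in modules[d]["signatures"] for d in info["deps"]):
            hits.append(name)
    return hits

print(search("sort(int[]):int[]"))                             # plain search: both sorters
print(search("sort(int[]):int[]", "compare(int,int):int"))     # dependency-aware: only QuickSorter
```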