11 research outputs found

    Analysis of distance distributions between genomic words

    The investigation of DNA has been one of the most developed areas of research in this and the last century. However, there is still a long way to go to fully understand the DNA code. With the increase in sequenced DNA data, mathematical methods play an important role in addressing the need for efficient quantitative techniques for the detection of regions of interest and overall characteristics in these sequences. A feature of interest in the study of genomic words is their spatial distribution along a DNA sequence, which can be characterized by the distances between words. Counting such distances provides discrete distributions that may be analyzed from a statistical point of view. In this work we explore the distances between genomic words as a mathematical descriptor of DNA sequences. The main goal is to design, develop and apply statistical methods specially conceived for their distributions, in order to capture information about the primary and secondary structure of DNA. The characterization of empirical inter-word distance distributions involves the problem of the exponential increase in the number of distributions as the word length increases, leading to the need for data reduction. Moreover, if the data can be validly clustered, the class labels may provide a meaningful description of similarities and differences between sets of distributions. Therefore, we explore the potential of inter-word distance distributions to obtain a word clustering, able to highlight similar patterns of word distributions as well as summarized characteristics of each set of distributions. With the aim of performing comparative studies between genomic sequences and defining species signatures, we deduce exact distributions of inter-word distances under random scenarios. Based on these theoretical distributions, we define genomic signatures of species able to discriminate between species and to capture their evolutionary relations. 
We presume that the study of distribution similarities and the clustering procedure allow identifying words whose distance distribution strongly differs from a reference distribution or from the global behaviour of the majority of the words. One of the key topics of our research focuses on the establishment of procedures that capture distance distributions with atypical behaviours, herein referred to as atypical distributions. In the genomic context, words with an atypical distance distribution may be related to some biological function (motifs). We expect that our results may be used to provide some sort of classification of sequences, identifying evolutionary patterns and allowing for the prediction of functional properties, thereby contributing to the advancement of knowledge about DNA sequences.
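The core descriptor above, the empirical distribution of distances between successive occurrences of a word, can be sketched in a few lines. This is a minimal illustration (not the thesis's actual code); the sequence and word are invented:

```python
from collections import Counter

def distance_distribution(sequence, word):
    """Empirical distribution of distances between starting positions
    of consecutive (possibly overlapping) occurrences of `word`."""
    k = len(word)
    positions = [i for i in range(len(sequence) - k + 1)
                 if sequence[i:i + k] == word]
    distances = [b - a for a, b in zip(positions, positions[1:])]
    if not distances:          # word occurs at most once
        return {}
    counts = Counter(distances)
    return {d: c / len(distances) for d, c in sorted(counts.items())}

# Toy example: "ACG" occurs at positions 0, 4 and 9.
dist = distance_distribution("ACGTACGTAACGT", "ACG")
```

Collecting one such distribution per word makes the exponential growth with word length concrete: for the four-letter DNA alphabet there are 4^k distributions at word length k, which motivates the data-reduction and clustering steps described above.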

    Copyright Policies of Scientific Publications in Institutional Repositories: The Case of INESC TEC

    The progressive transformation of scientific practices, driven by the development of new Information and Communication Technologies (ICT), has made it possible to increase access to information, gradually moving towards an opening of the research cycle. In the long term, this opening makes it possible to remove an adversity that researchers have faced: the existence of barriers, whether geographical or financial, that limit the conditions of access. Although scientific production is largely dominated by large commercial publishers and subject to the rules they impose, the Open Access movement, whose first public declaration, the Budapest Open Access Initiative (BOAI), dates from 2002, proposes significant changes that benefit both authors and readers. This movement has gained importance in Portugal since 2003, with the creation of the first institutional repository at the national level. Institutional repositories emerged as a tool for disseminating the scientific production of an institution, opening up research results both before publication and peer review (preprint) and after (postprint), and consequently increasing the visibility of the work carried out by a researcher and the respective institution. The present study, based on an analysis of the copyright policies of the most relevant scientific publications of INESC TEC, showed not only that publishers increasingly adopt policies that allow the self-archiving of publications in institutional repositories, but also that much awareness-raising work remains to be done, not only with researchers but also with the institution and society as a whole. 
The production of a set of recommendations, including the implementation of an institutional policy that encourages the self-archiving in the repository of publications produced in the institutional context, serves as a starting point for a greater appreciation of the scientific production of INESC TEC.

    Smart Sensors for Healthcare and Medical Applications

    This book focuses on new sensing technologies, measurement techniques, and their applications in medicine and healthcare. Specifically, it outlines the potential of smart sensors in these applications, collecting 24 articles selected and published in the Special Issue “Smart Sensors for Healthcare and Medical Applications”. We proposed this topic, aware of the pivotal role that smart sensors can play in improving healthcare services in both acute and chronic conditions, as well as in prevention for a healthy life and active aging. The articles selected for this book cover a variety of topics related to the design, validation, and application of smart sensors to healthcare.

    An Automatic Representation Optimization and Model Selection Framework for Machine Learning

    The classification problem is an important part of machine learning and occurs in many application fields, such as image-based object recognition or industrial quality inspection. In the ideal case, only a training dataset consisting of feature data and true class labels has to be obtained to learn the connection between features and class labels. This connection is represented by a so-called classifier model. However, even today the development of a well-performing classifier for a given task is difficult and requires a lot of expertise. Numerous challenges occur in real-world classification problems that can degrade the generalization performance. Typical challenges include too few training samples, noisy feature data, and suboptimal choices of algorithms or hyperparameters. Many solutions exist to tackle these challenges, such as automatic feature and model selection algorithms, hyperparameter tuning, or data preprocessing methods. Furthermore, representation learning, which is connected to the recently evolving field of deep learning, is a promising approach that aims at automatically learning more useful features from low-level data. Due to the lack of a holistic framework that considers all of these aspects, this work proposes the Automatic Representation Optimization and Model Selection Framework, abbreviated as AROMS-Framework. The central classification pipeline contains feature selection and portfolios of preprocessing, representation learning and classification methods. An optimization algorithm based on Evolutionary Algorithms is developed to automatically adapt the pipeline configuration to a given learning task. Additionally, two kinds of extended analyses are proposed that exploit the optimization trajectory. The first one aims at a better understanding of the complex interplay of the pipeline components, using a suitable visualization technique. 
The second one is a multi-pipeline classifier that aims to improve the generalization performance by fusing the decisions of several classification pipelines. Finally, suitable experiments are conducted to evaluate all aspects of the proposed framework regarding its generalization performance, optimization runtime and classification speed. The goal is to show the benefits and limitations of the framework when a large variety of datasets from different real-world applications is considered.
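As a rough illustration of the optimization idea (not the actual AROMS implementation), the sketch below evolves a pipeline configuration drawn from hypothetical component portfolios, using elitist truncation selection and single-slot mutation; the fitness function is a stub standing in for cross-validated accuracy:

```python
import random

# Hypothetical component portfolios; the real framework's portfolios
# contain concrete preprocessing, representation-learning and
# classification algorithms.
PORTFOLIOS = {
    "preprocessing": ["none", "standardize", "pca"],
    "representation": ["raw", "autoencoder", "kernel_map"],
    "classifier": ["svm", "random_forest", "knn"],
}

def random_config():
    return {slot: random.choice(opts) for slot, opts in PORTFOLIOS.items()}

def mutate(config):
    child = dict(config)
    slot = random.choice(list(PORTFOLIOS))     # re-draw one component slot
    child[slot] = random.choice(PORTFOLIOS[slot])
    return child

def evolve(fitness, generations=50, pop_size=8, seed=0):
    random.seed(seed)
    population = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # elitist truncation selection
        children = [mutate(random.choice(parents))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

# Stub fitness standing in for cross-validated accuracy of the pipeline.
def toy_fitness(cfg):
    return (cfg["preprocessing"] == "standardize") + (cfg["classifier"] == "svm")

best = evolve(toy_fitness)
```

In the framework itself, evaluating a configuration means training and validating the full pipeline, so the optimization trajectory of evaluated configurations is also available for the two extended analyses described above.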

    Machine learning in portfolio management

    Financial markets are difficult learning environments. The data generation process is time-varying, returns exhibit heavy tails, and the signal-to-noise ratio tends to be low. These properties make it challenging to apply sophisticated, high-capacity learning models in financial markets. Driven by recent advances of deep learning in other fields, we focus on applying deep learning in a portfolio management context. This thesis contains three distinct but related contributions to the literature. First, we consider the problem of neural network training in a time-varying context. This results in a neural network that can adapt to a data generation process that changes over time. Second, we consider the problem of learning in noisy environments. We propose to regularise the neural network using a supervised autoencoder and show that this improves its generalisation performance. Third, we consider the problem of quantifying forecast uncertainty in time series with volatility clustering. We propose a unified framework for the quantification of forecast uncertainty that yields uncertainty estimates closely matching actual realised forecast errors in cryptocurrencies and U.S. stocks.
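The objective behind the second contribution, a supervised loss regularised by a reconstruction term, can be illustrated with a linear toy model. All names, shapes and data below are invented for illustration; the thesis uses deep networks and real market data, and gradient-based training is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for noisy market data: 32 samples, 8 features,
# one return-like target.
X = rng.normal(size=(32, 8))
y = X @ rng.normal(size=8) * 0.1 + rng.normal(scale=0.5, size=32)

# Linear encoder/decoder/forecast head; in practice these are deep networks.
W_enc = rng.normal(scale=0.1, size=(8, 4))   # features -> latent code
W_dec = rng.normal(scale=0.1, size=(4, 8))   # latent code -> features
w_out = rng.normal(scale=0.1, size=4)        # latent code -> forecast

def supervised_autoencoder_loss(lam=0.5):
    """Supervised forecasting loss plus lam-weighted reconstruction penalty."""
    Z = X @ W_enc
    pred_loss = np.mean((Z @ w_out - y) ** 2)    # forecasting term
    recon_loss = np.mean((Z @ W_dec - X) ** 2)   # autoencoder term
    return pred_loss + lam * recon_loss

loss = supervised_autoencoder_loss()
```

Forcing the latent code to also reconstruct the inputs acts as a regulariser: the representation cannot overfit the noisy supervised target alone, which is the mechanism behind the reported generalisation improvement.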

    Harnessing Human Potential for Security Analytics

    Humans are often considered the weakest link in cybersecurity. As a result, their potential has been continuously neglected. In recent years, however, a contrasting view has developed, recognizing that humans can benefit the area of security analytics, especially in the case of security incidents that leave no technical traces. The demand therefore becomes apparent to see humans not only as a problem but also as part of the solution. In line with this shift in the perception of humans, the present dissertation pursues the research vision of evolving from a human-as-a-problem to a human-as-a-solution view in cybersecurity. A step in this direction is taken by exploring the research question of how humans can be integrated into security analytics to contribute to the improvement of the overall security posture. In addition to laying foundations in the field of security analytics, this question is approached from two directions. On the one hand, an approach in the context of the human-as-a-security-sensor paradigm is developed which harnesses the potential of security novices to detect security incidents while maintaining high data quality of human-provided information. On the other hand, contributions are made to better leverage the potential of security experts within a Security Operations Center (SOC). Besides elaborating the current state of research, a tool for determining the target state of a SOC is developed in the form of a maturity model. Building on this, the integration of security experts is improved through the innovative application of digital twins within SOCs: a framework is created that improves manual security analyses by simulating attacks within a digital twin. Furthermore, a cyber range is created, which offers a realistic training environment for security experts based on this digital twin.

    Data and Text Mining Techniques for In-Domain and Cross-Domain Applications

    In the big data era, a vast amount of data has been generated in different domains, from social media to news feeds, from health care to genomic functionalities. When addressing a problem, we usually need to harness multiple disparate datasets. Data from different domains may follow different modalities, each of which has a different representation, distribution, scale and density. For example, text is usually represented as discrete sparse word count vectors, whereas an image is represented by pixel intensities, and so on. Nowadays, plenty of Data Mining and Machine Learning techniques are proposed in the literature, and they have already achieved significant success in many knowledge engineering areas, including classification, regression and clustering. Nevertheless, some challenging issues remain when tackling a new problem: how should the problem be represented? Which approach is best among the huge number of possibilities? What information should be used in the Machine Learning task, and how should it be represented? Are there different domains from which knowledge can be borrowed? This dissertation proposes possible representation approaches for problems in different domains, from text mining to genomic analysis. In particular, one of the major contributions is a different way to represent a classical classification problem: instead of using an instance related to each object (a document, a gene, a social post, etc.) to be classified, it is proposed to use a pair of objects or an object-class pair, using the relationship between them as the label. This approach is tested on both flat and hierarchical text categorization datasets, where it potentially allows the efficient addition of new categories during classification. Furthermore, the same idea is used to extract conversational threads from an unregulated pool of messages and to classify the biomedical literature based on the genomic features treated.
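The pair-based representation described above can be sketched as a simple dataset transformation: each multi-class instance becomes several binary object-class pairs. The function and data names below are hypothetical:

```python
def to_pair_dataset(documents, labels, classes):
    """Recast multi-class data as binary object-class pairs: each
    (document, candidate class) pair is labeled 1 if the document
    belongs to that class, else 0."""
    pairs = []
    for doc, true_cls in zip(documents, labels):
        for cls in classes:
            pairs.append(((doc, cls), int(cls == true_cls)))
    return pairs

# 2 documents x 3 candidate classes -> 6 binary pair instances.
pairs = to_pair_dataset(["doc_a", "doc_b"], ["sports", "politics"],
                        ["sports", "politics", "economy"])
```

Because the classifier learns the belongs-to relationship rather than one model per category, a new category can in principle be added at prediction time simply by forming pairs with it, which is the efficiency gain mentioned above.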

    Signal Processing for Communication Systems of Full-Face Respirator Masks

    During respiratory protection operations, communication between firefighters is severely hampered by the strong acoustic attenuation of the full-face respirator mask and by loud ambient noise. To improve communication, there are communication systems that can be integrated into the full-face mask. These record the speech signal with a microphone and output it to the surroundings via loudspeakers, to team members via the team radio, and to the incident command via the tactical radio. With this microphone and these loudspeakers alone, only a limited improvement is possible, since loud breathing and ambient noise, together with the feedback arising through the local loudspeakers, severely restrict speech intelligibility. To achieve a possible increase in speech intelligibility, this work investigates various signal processing methods. The resources available in the communication unit are a microphone and a loudspeaker in front of the mask, a loudspeaker at the ears, an electrical signal output to the tactical radio, and a digital signal processor. The disturbing breathing noises are detected by voice activity detection based on pattern recognition and filtered out. The ambient noise is determined by a noise estimator and then suppressed. Sufficient amplification to increase speech intelligibility is made possible by feedback compensation. In post-processing, the signals are equalized by an equalizer, and dynamic range adjustment is performed by an automatic gain control. All algorithms described in this work are implemented on a 16-bit fixed-point signal processor and optimized with respect to runtime.
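Of the processing chain described above, the noise-suppression step can be illustrated with a classic magnitude spectral-subtraction sketch. This is written in floating-point Python for clarity, whereas the thesis targets a 16-bit fixed-point DSP; all signals below are synthetic:

```python
import numpy as np

def spectral_subtraction(frame, noise_mag, floor=0.05):
    """Enhance one frame: subtract an estimated noise magnitude
    spectrum and resynthesize using the noisy phase."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # spectral floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))

# Synthetic example: a tone in white noise. The noise magnitude is
# estimated from a speech-free segment, as a voice activity detector
# would provide in the system described above.
rng = np.random.default_rng(1)
n = 256
t = np.arange(n)
noise_only = rng.normal(scale=0.2, size=n)
noisy = np.sin(2 * np.pi * 8 * t / n) + rng.normal(scale=0.2, size=n)
noise_mag = np.abs(np.fft.rfft(noise_only * np.hanning(n)))
enhanced = spectral_subtraction(noisy, noise_mag)
```

The spectral floor keeps the attenuated bins from going to zero, which limits musical-noise artifacts; feedback compensation, equalization and gain control would follow as separate stages in the chain.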