    Concept Learning By Example Decomposition

    For efficient understanding and prediction in natural systems, even in artificially closed ones, we usually need to consider a number of factors that may combine in simple or complex ways. Additionally, many modern scientific disciplines, such as genomics, face increasingly large datasets from which to extract knowledge. Thus, to learn all but the most trivial regularities in the natural world, we rely on different ways of simplifying the learning problem. One simplifying technique that is pervasive in nature is to break a large learning problem into smaller ones, to learn the smaller, more manageable problems, and then to recombine them to obtain the larger picture. It is widely accepted in machine learning that it is easier to learn several smaller decomposed concepts than a single large one. Although many machine learning methods exploit decomposition, the process of decomposing a learning problem has not been studied adequately from a theoretical perspective. Typically, such decomposition of concepts is achieved in highly constrained environments or with the aid of human experts. In this work, we investigate concept learning by example decomposition in a general probably approximately correct (PAC) setting for Boolean learning. We develop sample complexity bounds for the different steps involved in the process. We formally show that if the cost of example partitioning is kept low, then it is highly advantageous to learn by example decomposition. To demonstrate the efficacy of this framework, we interpret the theory in the context of feature extraction. We find that many vague concepts in feature extraction, starting with what exactly a feature is, can be formalized unambiguously by this new theory of feature extraction. We analyze some existing feature learning algorithms in light of this theory, and finally demonstrate its constructive nature by generating a new learning algorithm from the theoretical results.
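
For orientation, the textbook PAC bound for a consistent learner over a finite hypothesis class H is m >= (ln|H| + ln(1/delta)) / epsilon. The sketch below evaluates that standard bound only; it is not one of the bounds derived in this work, and the function name and the example class (Boolean conjunctions over n variables, taken as |H| = 3^n) are illustrative assumptions.

```python
import math


def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Textbook bound for a consistent learner over a finite class:
    m >= (ln|H| + ln(1/delta)) / epsilon."""
    if hypothesis_count < 1:
        raise ValueError("hypothesis_count must be positive")
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)


# Assumed example class: conjunctions in which each variable appears
# positively, negated, or not at all, giving 3**n hypotheses.
whole_problem = pac_sample_bound(3 ** 20, epsilon=0.1, delta=0.05)
sub_problem = pac_sample_bound(3 ** 10, epsilon=0.1, delta=0.05)
print(whole_problem, sub_problem)
```

The bound grows with ln|H|, which gives one informal reading of why each smaller sub-concept needs fewer examples. The cost of partitioning the examples, which the abstract identifies as the deciding factor, is not modeled here.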

    Building Information Filtering Networks with Topological Constraints: Algorithms and Applications

    We propose a new methodology for learning the structure of sparse networks from data; in doing so we adopt a dual perspective in which we consider networks both as weighted graphs and as simplicial complexes. The proposed learning methodology belongs to the family of preferential attachment algorithms, where a network is extended by iteratively adding new vertices. In the conventional preferential attachment algorithm a new vertex is added to the network by adding a single edge to another existing vertex; in our approach a new vertex is added to a set of vertices by adding one or more new simplices to the simplicial complex. We propose the use of a score function to quantify the strength of the association between the new vertex and the attachment points. The methodology performs a greedy optimisation of the total score by selecting, at each step, the new vertex and the attachment points that maximise the gain in the score. Sparsity is enforced by restricting the space of feasible configurations through the imposition of topological constraints on the candidate networks; the constraint is fulfilled by allowing only topological operations that are invariant with respect to the required property. For instance, if the topological constraint requires the constructed network to be planar, then only planarity-invariant operations are allowed; if the constraint is that the network must be a clique forest, then only simplicial vertices can be added. At each step of the algorithm, the vertex to be added and the attachment points are those that provide the maximum increase in score while maintaining the topological constraints. As a concrete but general realisation we propose the clique forest as a possible topological structure for the representation of sparse networks, and we allow further constraints to be specified, such as the allowed range of clique sizes and the saturation of the attachment points.
In this thesis we introduce the Maximally Filtered Clique Forest (MFCF) algorithm: the MFCF builds a clique forest by repeated application of a suitably invariant operation that we call the Clique Expansion operator, and adds vertices according to a strategy that greedily maximises the gain in a local score function. The gains produced by the Clique Expansion operator can be validated in a number of ways, including statistical testing, cross-validation or value thresholding. The algorithm does not prescribe a specific form for the gain function, but allows the use of any number of gain functions as long as they are consistent with the Clique Expansion operator. We describe several examples of gain functions suited to different problems. As a specific practical realisation we study the extraction of planar networks with the Triangulated Maximally Filtered Graph (TMFG). The TMFG, in its simplest form, is a specialised version of the MFCF, but it can be made more powerful by allowing the use of specialised planarity-invariant operators that are not based on the Clique Expansion operator. We provide applications to two well-known applied problems: the Maximum Weight Planar Subgraph Problem (MWPSP) and the Covariance Selection problem. With regard to the Covariance Selection problem, we compare our results to the state-of-the-art solution (the Graphical Lasso) and we highlight the benefits of our methodology. Finally, we study the geometry of clique trees as simplicial complexes and note how statistics based on cliques and separators provide information equivalent to that obtained by homological methods, such as the analysis of Betti numbers, while our approach is computationally more efficient and intuitively simpler.
As an example, we show that the geometric tools developed provide a possible methodology for inferring the size of a dataset generated by a factor model.
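
The greedy, planarity-preserving construction described above can be sketched in its simplest form as follows: start from a tetrahedron, then repeatedly insert the vertex and triangular face with the largest gain, where the gain is the total weight between the vertex and the three corners of the face. This is a simplified illustration, not the thesis implementation; the function name `tmfg_sketch`, the seeding heuristic and the list-of-lists weight matrix are assumptions.

```python
from itertools import combinations


def tmfg_sketch(weights):
    """Greedy planar filtering of a symmetric weight matrix (list of lists).

    Returns the kept edges as a set of (i, j) pairs with i < j.
    """
    n = len(weights)
    if n < 4:
        raise ValueError("need at least four vertices")

    # Assumed seeding heuristic: the four vertices with the largest total weight.
    strength = [sum(row) - row[i] for i, row in enumerate(weights)]
    seed = sorted(range(n), key=lambda v: strength[v], reverse=True)[:4]

    edges = {tuple(sorted(pair)) for pair in combinations(seed, 2)}
    faces = [tuple(face) for face in combinations(seed, 3)]
    remaining = sorted(set(range(n)) - set(seed))

    while remaining:
        # Gain of inserting vertex v into a face: total weight to its corners.
        _, v, face = max(
            ((sum(weights[v][u] for u in face), v, face)
             for v in remaining for face in faces),
            key=lambda item: item[0],
        )
        remaining.remove(v)
        faces.remove(face)
        a, b, c = face
        # Linking v to the three corners keeps the graph planar and
        # replaces the chosen face by three new triangular faces.
        edges.update(tuple(sorted((v, u))) for u in face)
        faces.extend([(a, b, v), (a, c, v), (b, c, v)])
    return edges
```

Each insertion adds exactly three edges, so a graph built this way on n vertices has 3n - 6 edges, the maximum for a planar graph. The exhaustive search over all vertex and face pairs is kept for readability, not for speed.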

    EDA Solutions for Double Patterning Lithography

    Extending optical lithography to the 32-nm node and beyond is impossible with existing single-exposure systems. As such, double patterning lithography (DPL), in which the target layout is printed with two separate imaging processes, is the most promising option for achieving the required lithography resolution. Among the different DPL techniques, litho-etch-litho-etch (LELE) and self-aligned double patterning (SADP) are the most popular; they apply two complete exposure lithography steps, and one exposure lithography step followed by a chemical imaging process, respectively. To realize double patterning lithography, patterns located within a sub-resolution distance of each other must be assigned to different imaging sub-processes, a step known as layout decomposition. To achieve the optimal design yield, the layout decomposition problem should be solved with respect to the characteristics and limitations of the applied DPL method. For example, although patterns can be split between the two sub-masks in the LELE method to generate conflict-free masks, this pattern split is not favorable because of its sensitivity to lithography imperfections such as overlay error. On the other hand, pattern splitting is forbidden in the SADP method because it results in non-resolvable gap failures in the final image. In addition to the functional yield, layout decomposition affects the parametric yield of designs printed by double patterning. To deal with both the functional and the parametric challenges of DPL in dense and large layouts, EDA solutions for DPL are addressed in this thesis. To this end, we proposed a statistical method to determine the interconnect width and space for the LELE method under the effect of random overlay error. In addition to maximizing yield and achieving a near-optimal trade-off between different parametric requirements, the proposed method provides valuable insight into the trend of parametric and functional yields in future technology nodes.
Next, we focused on self-aligned double patterning and proposed layout design and decomposition methods to provide SADP-compatible layouts and litho-friendly decomposed layouts. Specifically, a grid-based ILP formulation of SADP decomposition was proposed to avoid decomposition conflicts and improve the overall printability of layout patterns. To overcome the restriction of this ILP-based method to fully decomposable layouts, a partitioning-based method was also proposed, which is also faster than the grid-based ILP decomposition method. Moreover, an A*-based SADP-aware detailed routing method was proposed, which performs detailed routing and layout decomposition simultaneously to avoid litho-limited layout configurations. The proposed router preserves the uniformity of pattern density between the two sub-masks of the SADP process. We finally extended our decomposition method for double patterning to triple patterning and formulated self-aligned triple patterning (SATP) decomposition as an integer linear program. In addition to satisfying conventional minimum width and spacing constraints, the proposed decomposition method minimizes the mandrel-trim co-defined edges and maximizes the layout features printed by structural spacers to achieve the minimum pattern distortion. This thesis is one of the earliest studies to investigate the concept of litho-friendliness in SADP-aware layout design and decomposition. As the experimental results show, the proposed methods advance prior state-of-the-art algorithms in various aspects. Specifically, the suggested SADP decomposition methods improve the total length of sensitive trim edges, the total edge placement error (EPE) and the overall printability of the attempted designs. Additionally, our SADP-aware detailed routing method provides SADP-decomposable layouts in which trim patterns are highly robust to lithography imperfections.
The experimental results for SATP decomposition show that the total length of overlay-sensitive layout patterns, the total EPE and the overall printability of the attempted designs are also improved considerably by the proposed decomposition method. Additionally, the methods in this PhD thesis reveal several insights for upcoming technology nodes that can be used to improve their manufacturability.
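
The basic layout decomposition step described above, where patterns within a sub-resolution distance must go to different masks, is commonly viewed as two-coloring a conflict graph. The sketch below shows only that generic view; it is not the ILP, partitioning or routing methods of the thesis, and the function name and input format (pattern indices plus a list of conflicting pairs) are assumptions. SADP adds further constraints that are not modeled here.

```python
from collections import deque


def decompose_two_masks(num_patterns, conflicts):
    """Assign each pattern to mask 0 or 1 by breadth-first two-coloring.

    Returns (mask, unresolved), where unresolved lists the conflict pairs
    whose endpoints ended up on the same mask (these arise from odd cycles).
    """
    adjacency = [[] for _ in range(num_patterns)]
    for a, b in conflicts:
        adjacency[a].append(b)
        adjacency[b].append(a)

    mask = [None] * num_patterns
    for start in range(num_patterns):
        if mask[start] is not None:
            continue
        mask[start] = 0
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if mask[neighbour] is None:
                    # Put each uncolored neighbour on the opposite mask.
                    mask[neighbour] = 1 - mask[current]
                    queue.append(neighbour)

    unresolved = [(a, b) for a, b in conflicts if mask[a] == mask[b]]
    return mask, unresolved


# Three patterns in a row: the middle one conflicts with both neighbours.
print(decompose_two_masks(3, [(0, 1), (1, 2)]))
```

An empty `unresolved` list means the conflict graph is bipartite and the layout is decomposable without splitting any pattern. A non-empty list marks the places where a pattern split (LELE) or a layout change (SADP) would be needed.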

    Fractals in the Nervous System: Conceptual Implications for Theoretical Neuroscience

    This essay is presented with two principal objectives in mind: first, to document the prevalence of fractals at all levels of the nervous system, giving credence to the notion of their functional relevance; and second, to draw attention to the still unresolved issues of the detailed relationships among power law scaling, self-similarity, and self-organized criticality. As regards criticality, I will document that it has become a pivotal reference point in neurodynamics. Furthermore, I will emphasize the not yet fully appreciated significance of allometric control processes. For dynamic fractals, I will assemble reasons for attributing to them the capacity to adapt task execution to contextual changes across a range of scales. The final section consists of general reflections on the implications of the reviewed data, and identifies what appear to be issues of fundamental importance for future research in the rapidly evolving topic of this review.

    Data Mining of Biomedical Databases

    Data mining can be defined as the nontrivial extraction of implicit, previously unknown and potentially useful information from data. This thesis focuses on data mining in biomedicine, one of its most interesting fields of application. Different kinds of biomedical data sets require different data mining approaches. Two approaches are treated in this thesis, which is divided into two separate and independent parts. The first part deals with Bayesian networks, one of the most successful tools for medical diagnosis and therapy follow-up. Formally, a Bayesian network (BN) is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph. An algorithm for Bayesian network structure learning has been developed as a variation of the standard search-and-score approach. The proposed approach avoids the creation of redundant network structures that may include non-significant connections between variables. In particular, the algorithm determines which relationships between variables must be prevented by binarizing a square matrix containing the mutual information (MI) among all pairs of variables. Four different binarization methods are implemented. The binary MI matrix is used as a pre-conditioning step for the subsequent greedy search procedure that optimizes the network score, reducing the number of possible search paths. This approach has been tested on four different datasets and compared against the standard search-and-score algorithm as implemented in the DEAL package, with successful results. Moreover, a comparison among different network scores has been performed. The second part of this thesis focuses on data mining of microarray databases. An algorithm able to analyze Illumina microRNA microarray data in a systematic and easy way has been developed.
The algorithm includes two parts. The first part is the pre-processing, which consists of two steps: variance stabilization and normalization. Variance stabilization is performed to remove or at least reduce heteroskedasticity, while normalization is performed to minimize systematic effects that are not constant among the different samples of an experiment and that are not due to the factors under investigation. Three alternative variance stabilization strategies and three alternative normalization approaches are included, so that all possible combinations give 9 different ways to pre-process the data. The second part of the algorithm deals with the statistical analysis for the detection of differential expression. Linear models and empirical Bayes methods are used. The final result is the list of microRNAs that are significantly differentially expressed between two conditions. The algorithm has been tested on three different real datasets and partially validated with an independent approach (quantitative real-time PCR). Moreover, the influence of different pre-processing methods on the discovery of differentially expressed microRNAs has been studied, and a comparison among the different normalization methods has been performed. This is the first study comparing normalization techniques for Illumina microRNA microarray data.
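
The mutual-information pre-conditioning step of the first part can be illustrated with a minimal sketch: estimate MI for every pair of discrete variables, then binarize the matrix so that only pairs above a cutoff remain candidates for an edge in the greedy search. The four binarization methods of the thesis are not specified in this abstract, so the fixed threshold used here, like the function names and data layout, is an assumption.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    # sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


def binarized_mi_matrix(columns, threshold):
    """allowed[i][j] is True when an edge between variables i and j may be searched."""
    k = len(columns)
    allowed = [[False] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            if mutual_information(columns[i], columns[j]) >= threshold:
                allowed[i][j] = allowed[j][i] = True
    return allowed


# Toy data: the first two variables are identical, the third is independent of both.
columns = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
print(binarized_mi_matrix(columns, threshold=0.1))
```

A search procedure would then skip every pair marked `False`, which is how such a mask reduces the number of search paths.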

    Differential Models, Numerical Simulations and Applications

    This Special Issue includes 12 high-quality articles containing original research findings in the fields of differential and integro-differential models, numerical methods and efficient algorithms for parameter estimation in inverse problems, with applications to biology, biomedicine, land degradation, traffic flow problems, and manufacturing systems.

    Graphical Models for Multivariate Time-Series

    Gaussian graphical models have received much attention in recent years, owing to their flexibility and expressive power. In particular, much interest has been devoted to graphical models for temporal data, or dynamical graphical models, to understand the relations among variables evolving in time. While powerful in modelling complex systems, such models suffer from computational issues, in terms of both convergence rates and memory requirements, and may fail to detect temporal patterns when the information on the system is partial. This thesis comprises two main contributions in the context of dynamical graphical models, tackling these two aspects: the need for reliable and fast optimisation methods, and the need for greater modelling power, so that the model can be retrieved in practical applications. The first contribution is a forward-backward splitting (FBS) procedure for Gaussian graphical modelling of multivariate time-series, which relies on recent theoretical studies ensuring global convergence under mild assumptions. Indeed, such an FBS-based implementation achieves, with fast convergence rates, optimal results with respect to ground truth and standard methods for dynamical network inference. The second main contribution focuses on the problem of latent factors, which influence the system while being hidden or unobservable. This thesis proposes the novel latent variable time-varying graphical lasso method, which is able to take into account both temporal dynamics in the data and latent factors influencing the system. This is fundamental for the practical use of graphical models, where the information on the data is partial. Indeed, extensive validation of the method on both synthetic and real applications shows the effectiveness of considering latent factors to deal with incomplete information.
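
Forward-backward splitting alternates a gradient step on the smooth part of an objective with a proximal step on the non-smooth part. The sketch below shows this iteration on a toy problem, 0.5*||x - b||^2 + lam*||x||_1, rather than on the graphical-model objective of the thesis; the function names, the step size and the toy data are assumptions made for illustration.

```python
def soft_threshold(values, amount):
    """Proximal operator of amount * ||x||_1, applied elementwise."""
    return [max(abs(v) - amount, 0.0) * (1.0 if v > 0 else -1.0) for v in values]


def forward_backward(gradient, prox, start, step, iterations):
    """Generic forward-backward iteration with a fixed step size."""
    x = list(start)
    for _ in range(iterations):
        g = gradient(x)
        # Forward (explicit) step on the smooth term.
        forward = [xi - step * gi for xi, gi in zip(x, g)]
        # Backward (proximal) step on the non-smooth term.
        x = prox(forward, step)
    return x


b = [3.0, 0.5, -2.0]
lam = 1.0
solution = forward_backward(
    gradient=lambda x: [xi - bi for xi, bi in zip(x, b)],
    prox=lambda z, step: soft_threshold(z, step * lam),
    start=[0.0, 0.0, 0.0],
    step=0.5,
    iterations=60,
)
print(solution)
```

For this toy objective the minimizer is known in closed form, the soft-thresholding of `b` by `lam`, which makes the iteration easy to check. In a graphical-model setting the same two-step pattern would act on matrices instead of vectors.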

    Modeling and Analyzing Propagation Risks in Complex Projects: Application to the Development of New Vehicles

    The management of complex projects requires orchestrating the cooperation of hundreds of individuals from various companies, professions and backgrounds, working on thousands of activities, deliverables, and risks. Moreover, these numerous project elements are increasingly interconnected, and no decision or action is independent. This growing complexity is one of the greatest challenges of project management and one of the causes of project failure in terms of cost overruns and time delays. For instance, in the automotive industry, increasing market orientation and the growing complexity of automotive products have changed the management structure of vehicle development projects from a hierarchical to a networked structure, including the manufacturer but also numerous suppliers. Dependencies between project elements increase risks, since problems in one element may propagate to other directly or indirectly dependent elements. Complexity generates a number of phenomena, positive or negative, isolated or in chains, local or global, that interfere to a greater or lesser extent with the convergence of the project towards its goals. The aim of the thesis is thus to reduce the risks associated with the complexity of vehicle development projects by increasing the understanding of this complexity and the coordination of project actors. To do so, a first research question is how to prioritize actions to mitigate complexity-related risks. A second research question is how to organize and coordinate actors in order to cope efficiently with the previously identified complexity-related phenomena. The first question is addressed by modeling project complexity and by analyzing complexity-related phenomena within the project, at two levels. First, a high-level factor-based descriptive model is proposed. It makes it possible to measure and prioritize project areas where complexity may have the most impact.
Second, a low-level graph-based model is proposed, based on finer modeling of project elements and their interdependencies. Contributions have been made to the complete modeling process, including the automation of some data-gathering steps, in order to increase performance and decrease effort and the risk of error. These two models can be used in sequence: a first high-level measure makes it possible to focus on some areas of the project, where the low-level modeling is then applied, with a gain in overall efficiency and impact. Based on these models, some contributions are made to anticipate the potential behavior of the project. Topological and propagation analyses are proposed to detect and prioritize critical elements and critical interdependencies, while broadening the sense of the polysemous word "critical". The second research question is addressed by introducing a clustering methodology to propose groups of actors in new product development projects, especially for the actors involved in many deliverable-related interdependencies in different phases of the project life cycle. This makes it possible to increase coordination between interdependent actors who are not always formally connected via the hierarchical structure of the project organization, and it brings the project organization closer to what a networked structure should be. The automotive-based industrial application has shown promising results for the contributions to both research questions. Finally, the proposed methodology is discussed in terms of genericity and seems to be applicable to a wide set of complex projects for decision support.
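
One simple form of propagation analysis on a graph-based project model is to compute, for each element, the set of elements that a problem could reach directly or indirectly, and to rank elements by the size of that set. The sketch below illustrates this idea only; it is not the analysis method of the thesis, and the function name, the dictionary input format and the toy element names are assumptions.

```python
from collections import deque


def propagation_reach(dependencies):
    """Map each element to the set of elements it can affect, directly or indirectly.

    dependencies maps an element to the elements a problem can propagate to directly.
    """
    nodes = set(dependencies)
    for targets in dependencies.values():
        nodes.update(targets)

    reach = {}
    for source in nodes:
        affected = set()
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for target in dependencies.get(current, ()):
                # Skip the source itself so that cycles do not count it as affected.
                if target != source and target not in affected:
                    affected.add(target)
                    queue.append(target)
        reach[source] = affected
    return reach


# Hypothetical toy project: a design change affects tooling, which affects the launch.
toy = {"design": ["tooling"], "tooling": ["launch"], "launch": []}
reach = propagation_reach(toy)
ranking = sorted(reach, key=lambda element: len(reach[element]), reverse=True)
print(ranking)
```

Elements at the top of such a ranking are candidates for the "critical" label in the propagation sense, which need not match criticality in the scheduling sense.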