
    Machine Learning in Automated Text Categorization

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. Comment: accepted for publication in ACM Computing Surveys.
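    The inductive approach described above is easy to sketch with off-the-shelf tools: learn a classifier from preclassified documents instead of hand-crafting rules. A minimal illustration follows; the tiny corpus, the category labels, and the choice of TF-IDF features with logistic regression are illustrative assumptions of ours, not the survey's prescription.

```python
# Minimal sketch: build a text classifier inductively from labelled examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical preclassified documents (the "training set").
train_docs = [
    "stock markets fell sharply on interest rate fears",
    "the team won the championship final in overtime",
    "central bank raises rates to curb inflation",
    "striker scores twice as the club tops the league",
]
train_labels = ["finance", "sport", "finance", "sport"]

# Document representation (TF-IDF) and classifier construction in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# Categorize an unseen document; expected: ['finance'].
print(model.predict(["bond yields rose after the rate decision"]))
```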

    A review of the state of the art in Machine Learning on the Semantic Web: Technical Report CSTR-05-003


    Data Management and Mining in Astrophysical Databases

    We analyse the issues involved in the management and mining of astrophysical data. The traditional approach to data management in the astrophysical field cannot keep up with the increasing size of the data gathered by modern detectors. Automatic tools for extracting information from large datasets, i.e. data mining techniques such as clustering and classification algorithms, will play an essential role in astrophysical research. This calls for an approach to data management based on data warehousing, emphasizing the efficiency and simplicity of data access; efficiency is obtained using multidimensional access methods, and simplicity is achieved by properly handling metadata. Applying clustering and classification techniques to large datasets poses additional requirements: computational and memory scalability with respect to the data size, and interpretability and objectivity of the clustering or classification results. In this study we address some possible solutions. Comment: 10 pages, LaTeX.
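    As a concrete illustration of the scalability requirement, a hedged sketch follows: scikit-learn's MiniBatchKMeans clusters data in small batches, so memory and compute per step do not grow with catalogue size. The synthetic four-feature "catalogue" is a stand-in for real survey data; the feature and cluster counts are arbitrary assumptions.

```python
# Minimal sketch: memory-scalable clustering of a large synthetic catalogue.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
catalogue = rng.normal(size=(1_000_000, 4))  # one million objects, 4 features

# Mini-batch k-means touches only `batch_size` rows per update, giving
# computational and memory scalability with respect to the data size.
km = MiniBatchKMeans(n_clusters=8, batch_size=10_000, random_state=0)
labels = km.fit_predict(catalogue)

print(np.bincount(labels))  # number of objects assigned to each cluster
```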

    The best of both worlds: highlighting the synergies of combining manual and automatic knowledge organization methods to improve information search and discovery.

    Research suggests organizations across all sectors waste a significant amount of time looking for information and often fail to leverage the information they have. In response, many organizations have deployed some form of enterprise search to improve the 'findability' of information. Debates persist as to whether thesauri and manual indexing or automated machine learning techniques should be used to enhance the discovery of information. In addition, the extent to which a knowledge organization system (KOS) enhances discoveries or indeed blinds us to new ones remains a moot point. The oil and gas industry was used as a case study, with a representative organization. Drawing on prior research, a theoretical model is presented which aims to overcome the shortcomings of each approach. This synergistic model could help to re-conceptualize the 'manual' versus 'automatic' debate in many enterprises, accommodating a broader range of information needs. This may enable enterprises to develop more effective information and knowledge management strategies and ease the tension between what are often perceived as mutually exclusive competing approaches. Certain aspects of the theoretical model may be transferable to other industries, which is an area for further research.
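    One way to picture the synergy is to index each document through both channels at once: controlled terms from a manually maintained thesaurus plus automatically extracted keywords. The sketch below is a hypothetical illustration of that combination, not the paper's model; the thesaurus entries, sample document, and keyword heuristic are invented.

```python
# Minimal sketch: combine manual (thesaurus-based) and automatic indexing.
import re
from collections import Counter

THESAURUS = {  # preferred term -> synonyms a domain expert maintains
    "wellbore": {"borehole", "well bore"},
    "seismic survey": {"seismic acquisition"},
}

def manual_tags(text: str) -> set:
    """Map occurrences of thesaurus terms or synonyms to preferred terms."""
    text_lower = text.lower()
    tags = set()
    for preferred, synonyms in THESAURUS.items():
        if preferred in text_lower or any(s in text_lower for s in synonyms):
            tags.add(preferred)
    return tags

def automatic_tags(text: str, k: int = 3) -> set:
    """Naive automatic indexing: the k most frequent non-trivial words."""
    words = re.findall(r"[a-z]{5,}", text.lower())
    return {w for w, _ in Counter(words).most_common(k)}

doc = "Borehole pressure data from the seismic acquisition campaign"
index_terms = manual_tags(doc) | automatic_tags(doc)  # union of both channels
print(index_terms)
```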

    Satellite image analysis using neural networks

    The tremendous backlog of unanalyzed satellite data necessitates the development of improved methods for data cataloging and analysis. Ford Aerospace has developed an image analysis system, SIANN (Satellite Image Analysis using Neural Networks), that integrates the technologies necessary to satisfy NASA's science data analysis requirements for the next generation of satellites. SIANN will enable scientists to train a neural network to recognize image data containing scenes of interest and then rapidly search data archives for all such images. The approach combines conventional image processing technology with recent advances in neural networks to provide improved classification capabilities. SIANN takes users through a four-step process of image classification, as sketched below: filtering and enhancement, creation of neural network training data via feature extraction algorithms, configuring and training a neural network model, and classification of images by applying the trained network. A prototype experimentation testbed was completed and applied to climatological data.
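    A hedged sketch of such a four-step workflow, reading the steps from the abstract: the smoothing filter, histogram features, network size, and synthetic images below are simplifications of ours, not SIANN's actual algorithms or data.

```python
# Minimal sketch of a four-step enhance/extract/train/classify workflow.
import numpy as np
from scipy.ndimage import uniform_filter
from sklearn.neural_network import MLPClassifier

def enhance(img: np.ndarray) -> np.ndarray:
    """Step 1: simple smoothing filter as a stand-in for enhancement."""
    return uniform_filter(img, size=3)

def features(img: np.ndarray) -> np.ndarray:
    """Step 2: a coarse intensity histogram as the feature vector."""
    hist, _ = np.histogram(img, bins=16, range=(0.0, 1.0), density=True)
    return hist

rng = np.random.default_rng(0)
# Synthetic stand-ins for labelled scenes of interest (1) vs. others (0).
imgs = rng.random((200, 32, 32))
labels = (imgs.mean(axis=(1, 2)) > 0.5).astype(int)

X = np.array([features(enhance(im)) for im in imgs])
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X, labels)  # Step 3: configure and train the network

new_img = rng.random((32, 32))
print(net.predict([features(enhance(new_img))]))  # Step 4: classify
```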

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, offering unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed for one domain in other domains. Comment: Knowledge-Based Systems.
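    As a minimal, hypothetical example of wrapper-style Web Data Extraction, the sketch below pulls structured records out of semi-structured HTML. The page layout and field names are invented; real systems induce or maintain such wrappers per site.

```python
# Minimal sketch: extract structured records from semi-structured HTML.
from bs4 import BeautifulSoup

html = """
<ul id="products">
  <li><span class="name">Widget</span><span class="price">9.99</span></li>
  <li><span class="name">Gadget</span><span class="price">14.50</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# The CSS selectors below constitute the "wrapper" for this page layout.
records = [
    {
        "name": li.select_one(".name").get_text(),
        "price": float(li.select_one(".price").get_text()),
    }
    for li in soup.select("#products li")
]
print(records)  # [{'name': 'Widget', 'price': 9.99}, ...]
```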

    Performance comparison of deep learning models applied for satellite image classification

    Satellite image classification is important for applications that involve mapping the distribution of human activities. Such a distribution helps governments determine the best places to expand cities while avoiding problems related to natural disasters or legal constraints. Currently, few agencies are in charge of image classification, and the area to cover is enormous, so automating the process is necessary; performing the task manually would take far too long. Moreover, the detection and classification algorithms used before Machine Learning (ML) have not shown good results on this specific sort of imagery, whereas recent approaches to image classification using Convolutional Neural Networks (CNN) have shown quite accurate results. In this research, we analyze the performance of four different CNN architectures used for satellite image classification. We use a dataset provided in 2017 by IARPA, named IARPA fMoW, which contains more than two thousand images belonging to 62 classes, already separated into training and validation sets. The solution was implemented in Python using the Keras and TensorFlow libraries. The research was divided into two parts: hyperparameter optimization and evaluation of the architectures' results. For the first part we used only seven classes from a sample of the dataset (the sample is three hundred times smaller than the complete dataset); the architectures are trained on these seven classes of the small dataset to determine the best hyperparameters. After the hyperparameters have been selected, the architectures are trained with the complete sample. The evaluation is based on visual examination with the help of the TensorBoard tool and scikit-learn metrics. All the architectures showed accuracies near 90% on the training dataset sample. The architecture with the best result was ResNet-152, with an accuracy of 99% on the training dataset sample. The accuracy on the validation dataset will become important after training the architectures with the complete dataset, which will be performed in future work. (ITESO, A.C.)
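    A hedged sketch of the kind of setup the abstract describes: fine-tuning a pretrained ResNet-152 on a 62-class image dataset with Keras and TensorFlow. The directory layout, image size, and training hyperparameters below are our own assumptions, not the paper's exact configuration.

```python
# Minimal sketch: fine-tune a pretrained ResNet-152 for 62-class
# satellite image classification with Keras/TensorFlow.
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 62  # as in the fMoW setup described above

base = tf.keras.applications.ResNet152(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))
base.trainable = False  # freeze pretrained features for a first training stage

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.applications.resnet.preprocess_input(inputs)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(base(x))
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hypothetical directory layout: one subfolder per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "fmow_sample/train", image_size=(224, 224), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "fmow_sample/val", image_size=(224, 224), batch_size=32)

model.fit(train_ds, validation_data=val_ds, epochs=5)
```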

    Methods for identifying biomedical translation: a systematic review

    Translational medicine is an important area of biomedicine and has significantly facilitated the development of biomedical research. Despite its relevance, there is no consensus on how to evaluate its progress and impact. A systematic review was carried out to identify all the methods used to evaluate translational research. Seven methods meeting the established criteria were found, and their characteristics, advantages, and limitations were analyzed. They allow this type of evaluation to be performed in different ways. No method showed clear advantages over the others; each presented specific limitations that need to be considered. Nevertheless, the Triangle of Biomedicine could be considered the most relevant method, considering the time since its publication and its usefulness. In conclusion, there is still no gold-standard method for evaluating biomedical translational research. This work has been supported by the Spanish State Research Agency through project PID2019-105381GA-I00/AEI/10.13039/501100011033 (iScience), grant CTS-115 (Tissue Engineering Research Group, University of Granada) from the Junta de Andalucía, Spain, a postdoctoral grant (RH-0145-2020) from the Andalusian Health System, and the EU FEDER ITI Grant for Cadiz Province PI-0032-2017. The present work is part of the Ph.D. dissertation of Javier Padilla-Cabello.