1,477 research outputs found
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.Comment: Accepted for publication on ACM Computing Survey
Automatic categorization of diverse experimental information in the bioscience literature
Background:
Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify from all published literature the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest and thus, is usually time consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in use in production for automatic categorization of 10 different experimental datatypes in the biocuration process at WormBase for the past two years and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community and thereby greatly reducing time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature such as C. elegans and D. melanogaster to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
Results:
We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genomics Informatics (MGI). It is being used in the curation work flow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
Conclusions:
Our methods are applicable to a variety of data types with training set containing several hundreds to a few thousand documents. It is completely automatic and, thus can be readily incorporated to different workflow at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort
Nomenclature and Benchmarking Models of Text Classification Models: Contemporary Affirmation of the Recent Literature
In this paper we present automated text classification in text mining that is gaining greater relevance in various fields every day Text mining primarily focuses on developing text classification systems able to automatically classify huge volume of documents comprising of unstructured and semi structured data The process of retrieval classification and summarization simplifies extract of information by the user The finding of the ideal text classifier feature generator and distinct dominant technique of feature selection leading all other previous research has received attention from researchers of diverse areas as information retrieval machine learning and the theory of algorithms To automatically classify and discover patterns from the different types of the documents 1 techniques like Machine Learning Natural Language Processing NLP and Data Mining are applied together In this paper we review some effective feature selection researches and show the results in a table for
Learning to detect spam messages
The problem of unwanted e-mails (or spam messages) has been increasing for years. Different methods have been proposed in order to deal with this problem wich includes blacklists of known spammers, handcrafted rules and machine learning techniques.
In this paper we investigate the performance of the k Nearest Neighbours (k-NN) method in spam detection tasks. At this end, a number of different document codifications were tested.
Moreover, we study how the vocabulary size reduction affects this task. In the experimental design, different k values were considered and results were analyzed with respect to a public mailing list and personal e-mail collections. The experiments showed that results with public mailing lists tend to be very optimistic and they should not be considered representative of those expected with personal user accounts.VI Workshop de Agentes y Sistemas Inteligentes (WASI)Red de Universidades con Carreras en Informática (RedUNCI
Recommended from our members
The impact of metadata on the accuracy of automated patent classification
During the last decade, the advance of machine-learning tools and algorithms has resulted in tremendous progress in the automated classification of documents. However, many classifiers base their classification decisions solely on document text and ignore metadata (such as authors, publication date, and author affiliation). In this project, automated classifiers using the k-Nearest Neighbour algorithm were developed for the classification of patents into two different classification systems. Those using metadata (in this case inventor names, applicant names and International Patent Classification codes) were compared with those ignoring it. The use of metadata could significantly improve the classification of patents with one classification system, improving classification accuracy from 70.8% up to 75.4%, which was highly statistically significant. However, the results for the other classification system were inconclusive: while metadata could improve the quality of the classifier for some experiments (recall increased from 66.0% to 68.9%, which was a small but nonetheless significant improvement), experiments with different parameters showed that it could also lead to a deterioration of quality (recall dropping as low as 61.0%). The study shows that metadata can play an extremely useful role in the classification of patents. Nonetheless, it must not be used indiscriminately but only after careful evaluation of its usefulness
Deep Neural Networks for Document Processing of Music Score Images
[EN] There is an increasing interest in the automatic digitization of medieval music documents. Despite efforts in this field, the detection of the different layers of information on these documents still poses difficulties. The use of Deep Neural Networks techniques has reported outstanding results in many areas related to computer vision. Consequently, in this paper, we study the so-called Convolutional Neural Networks (CNN) for performing the automatic document processing of music score images. This process is focused on layering the image into its constituent parts (namely, background, staff lines, music notes, and text) by training a classifier with examples of these parts. A comprehensive experimentation in terms of the configuration of the networks was carried out, which illustrates interesting results as regards to both the efficiency and effectiveness of these models. In addition, a cross-manuscript adaptation experiment was presented in which the networks are evaluated on a different manuscript from the one they were trained. The results suggest that the CNN is capable of adapting its knowledge, and so starting from a pre-trained CNN reduces (or eliminates) the need for new labeled data.This work was supported by the Social Sciences and Humanities Research Council of Canada, and Universidad de Alicante through grant GRE-16-04.Calvo-Zaragoza, J.; Castellanos, F.; Vigliensoni, G.; Fujinaga, I. (2018). Deep Neural Networks for Document Processing of Music Score Images. Applied Sciences. 8(5). https://doi.org/10.3390/app8050654S85Bainbridge, D., & Bell, T. (2001). Computers and the Humanities, 35(2), 95-121. doi:10.1023/a:1002485918032Byrd, D., & Simonsen, J. G. (2015). Towards a Standard Testbed for Optical Music Recognition: Definitions, Metrics, and Page Images. Journal of New Music Research, 44(3), 169-195. doi:10.1080/09298215.2015.1045424LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444. doi:10.1038/nature14539Rebelo, A., Fujinaga, I., Paszkiewicz, F., Marcal, A. R. S., Guedes, C., & Cardoso, J. S. (2012). Optical music recognition: state-of-the-art and open issues. International Journal of Multimedia Information Retrieval, 1(3), 173-190. doi:10.1007/s13735-012-0004-6Louloudis, G., Gatos, B., Pratikakis, I., & Halatsis, C. (2008). Text line detection in handwritten documents. Pattern Recognition, 41(12), 3758-3772. doi:10.1016/j.patcog.2008.05.011Montagner, I. S., Hirata, N. S. T., & Hirata, R. (2017). Staff removal using image operator learning. Pattern Recognition, 63, 310-320. doi:10.1016/j.patcog.2016.10.002Calvo-Zaragoza, J., Micó, L., & Oncina, J. (2016). Music staff removal with supervised pixel classification. International Journal on Document Analysis and Recognition (IJDAR), 19(3), 211-219. doi:10.1007/s10032-016-0266-2Calvo-Zaragoza, J., Pertusa, A., & Oncina, J. (2017). Staff-line detection and removal using a convolutional neural network. Machine Vision and Applications, 28(5-6), 665-674. doi:10.1007/s00138-017-0844-4Shelhamer, E., Long, J., & Darrell, T. (2017). Fully Convolutional Networks for Semantic Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640-651. doi:10.1109/tpami.2016.2572683Kato, Z. (2011). Markov Random Fields in Image Segmentation. Foundations and Trends® in Signal Processing, 5(1-2), 1-155. doi:10.1561/2000000035Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324. doi:10.1109/5.72679
- …