6 research outputs found

    Improving Pipelining Tools for Pre-processing Data

    The last several years have seen the emergence of data mining and its transformation into a powerful tool that adds value to business and research. Data mining makes it possible to explore and find unseen connections between variables and facts observed in different domains, helping us to better understand reality. The programming methods and frameworks used to analyse data have evolved over time. Currently, pipelining schemes are the most reliable way of analysing data, and for this reason several important companies offer this kind of service. Moreover, several frameworks compatible with different programming languages are available for the development of computational pipelines, and many research studies have addressed the optimization of data processing speed. However, as this study shows, early error detection techniques and developer support mechanisms are very limited in these frameworks. In this context, this study introduces several improvements: the design of different types of constraints for the early detection of errors, functions that facilitate the debugging of concrete tasks included in a pipeline, the invalidation of erroneous instances, and the introduction of a burst-processing scheme. Adding these functionalities, we developed Big Data Pipelining for Java (BDP4J, https://github.com/sing-group/bdp4j), a fully functional new pipelining framework that shows the potential of these features.
    Funding: Agencia Estatal de Investigación | Ref. TIN2017-84658-C2-1-R; Xunta de Galicia | Ref. ED481D-2021/024; Xunta de Galicia | Ref. ED431C 2018/55-GR
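
    The improvements listed above (constraints checked before a task runs, invalidation of erroneous instances) can be illustrated with a short sketch. This is a minimal, hypothetical illustration of the concept only; the Instance, PipeTask, and Pipeline types below are not BDP4J's actual API (see the repository linked above for the real one).

        import java.util.ArrayList;
        import java.util.List;
        import java.util.function.Predicate;

        // Hypothetical sketch: constraint-based early error detection in a pipeline.
        final class Instance {
            final Object data;
            boolean valid = true; // erroneous instances are invalidated, not crashed on
            Instance(Object data) { this.data = data; }
        }

        abstract class PipeTask {
            // Constraint checked before the task runs; a failure invalidates the
            // instance immediately instead of letting the error surface downstream.
            abstract Predicate<Instance> inputConstraint();
            abstract Instance process(Instance instance);
        }

        final class Pipeline {
            private final List<PipeTask> tasks = new ArrayList<>();

            Pipeline add(PipeTask task) { tasks.add(task); return this; }

            Instance run(Instance instance) {
                for (PipeTask task : tasks) {
                    if (!task.inputConstraint().test(instance)) {
                        instance.valid = false; // early error detection
                    }
                    if (!instance.valid) break; // skip invalidated instances
                    instance = task.process(instance);
                }
                return instance;
            }
        }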

    Alnus airborne pollen trends during the last 26 years for improving machine learning-based forecasting methods

    Black alder (Alnus glutinosa (L.) Gaertn.) is a tree species widespread across Europe that belongs to mixed hardwood forests. In urban environments, the tree is usually located along watercourses, as is the case in the city of Ourense. This taxon belongs to the Betulaceae family, so it has a high allergenic potential for sensitive people. Given the high allergenic capacity of this pollen type and the increase in global temperature produced by climate change, which induces greater allergenicity, the present study proposes the implementation of a Machine Learning (ML) model capable of accurately predicting high-risk periods for allergies among sensitive people. The study was carried out in the city of Ourense over 28 years, and pollen data were collected by means of a Hirst-type volumetric trap (Lanzoni VPPS 2000). During the same period, meteorological data were obtained from the METEOGALICIA meteorological station in Ourense. We observed that Alnus airborne pollen was present in the study area during the winter months, mainly in January and February. We found statistically significant trends: a delay of the end of the main pollen season of 0.68 days per year, an increase in the annual pollen integral of 112 pollen grains per year, and an increase of approximately 12 pollen grains/m3 per year in the pollen peak. A Spearman correlation test was carried out in order to select the variables for the ML model. The best ML model was Random Forest, which was able to detect days labelled as medium or high risk.
    Funding: Xunta de Galicia | Ref. ED431C 2022/03-GRC; Xunta de Galicia | Ref. CO-0034-2021 00V
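
    The modelling step described above (a Random Forest classifier over daily meteorological variables, evaluated per risk label) can be sketched with the Weka API, which other outputs in this list also use. The file name, attribute layout, and label set are assumptions made for illustration; the study's actual pipeline is not reproduced here.

        import weka.classifiers.Evaluation;
        import weka.classifiers.trees.RandomForest;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        import java.util.Random;

        public class PollenRiskModel {
            public static void main(String[] args) throws Exception {
                // Assumed dataset: one row per day with meteorological variables and
                // a nominal risk label (e.g., low/medium/high) as the last attribute.
                Instances data = new DataSource("alnus_daily.arff").getDataSet();
                data.setClassIndex(data.numAttributes() - 1);

                RandomForest model = new RandomForest();
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(model, data, 10, new Random(1)); // 10-fold CV
                System.out.println(eval.toSummaryString());
                System.out.println(eval.toClassDetailsString()); // per-label precision/recall
            }
        }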

    Detección automática de momentos de risco alérxico da poboación ourensá (Automatic detection of allergic-risk periods for the population of Ourense)

    At present, the number of people who have allergic reactions to pollen has increased considerably, so it is useful to have mechanisms that can determine, as precisely as possible, the amount of pollen that will be present in the atmosphere and thereby reduce its impact on the population. To predict pollen concentration, early studies used linear regression models, which later evolved towards machine learning and deep learning models. Despite the suitability of these models for predicting pollen concentration, the results obtained depend to a large extent on the existence of previous concentration measurements and are influenced by the quality of the available data. This joint research between the disciplines of botany and computer science aims to estimate the risk of pollen allergies in a way that allows antihistamines to be administered before exposure, since this has been shown to be much more effective than administering them once the first symptoms have appeared. Specifically, this estimation was made for Alnus, Betula, Platanus, Poaceae and Urticaceae, the five pollen types considered most aggressive in the province of Ourense. The botany research group was in charge of collecting pollen concentration data and normalising and representing the collected values; it also calculated the main pollen season for each pollen type and proposed a pollen calendar for the city of Ourense. The computer science research group focused on analysing the data provided and on comparing different machine learning techniques to classify pollen concentrations in the atmosphere of the province of Ourense and to support decision-making. This work presents the experimentation only for the Alnus pollen type; the approach is expected to be equally suitable for each of the other pollen types, adapting the most appropriate model in each case.
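
    The main pollen season mentioned above is commonly delimited with the 95% rule: the season runs from the day the cumulative pollen sum reaches 2.5% of the annual pollen integral to the day it reaches 97.5%. Whether this particular variant was used in the study is an assumption; the sketch below only illustrates the general idea.

        // Sketch of the 95% method for delimiting the main pollen season (MPS).
        // Assumed input: one year of daily pollen concentrations.
        public final class MainPollenSeason {
            /** Returns {startDay, endDay} indices over one year of daily values. */
            public static int[] compute(double[] dailyConcentration) {
                double total = 0;
                for (double value : dailyConcentration) total += value;

                double lower = 0.025 * total, upper = 0.975 * total;
                int start = -1, end = -1;
                double cumulative = 0;
                for (int day = 0; day < dailyConcentration.length; day++) {
                    cumulative += dailyConcentration[day];
                    if (start < 0 && cumulative >= lower) start = day;
                    if (cumulative >= upper) { end = day; break; }
                }
                return new int[] { start, end };
            }
        }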

    Smart Preprocessing for Improving Spam Filtering Accuracy

    Spamming is the use of messaging systems to send unsolicited messages (spam), especially advertising, as well as sending messages repeatedly on the same site. While the most widely recognized form of spam is email spam, the term is applied to similar abuses in other media. So far, spam detection and filtering have focused primarily on detecting advertisements for illegal or fraudulent products, but not on real user interests. However, there are messages whose contents are irrelevant to the user in the same way as the advertising described above. The mechanisms currently used for spam detection and filtering are based on combinations of techniques implemented in products such as SpamAssassin or similar frameworks. Building on these mechanisms, filtering could be improved by improving the result of any of the independent techniques combined. This research aims to obtain improvements in the scope of content-based techniques. The motivation for this decision is that, at present, despite the large number of existing approaches, the use of content-based approaches with machine learning mechanisms has become an object of study because of the effectiveness these techniques could achieve through the generalization and integration of existing knowledge. So far, content-based approaches have relied on classification techniques applied to information about the presence (or absence) of tokens in the content. However, this input information for classifiers presents important drawbacks, such as the dependency between features, that prevent truly precise results from being obtained. In fact, this token-based classification model has been tested and optimized in recent years to the point where it is now impossible to obtain substantial improvements and move towards the eradication of classification errors. The present research focuses on the incorporation of semantic information from an ontological dictionary (WordNet or BabelNet, for example). This way, instead of using token information, it would be possible to use synsets (concepts). Thus, a substantial improvement in the effectiveness of the classifiers could be obtained, as well as the identification of user interests (the construction of a user profile), so that the classifiers can eliminate misleading advertising and messages that are irrelevant to the user. The initial hypothesis of this work is the following: "It is possible to efficiently preprocess contents exchanged through the different protocols and Internet services for their representation in the form of synsets and to obtain, through these data, significant improvements in the effectiveness of spam content filtering". Therefore, this work includes two general objectives: (i) the elaboration of an efficient preprocessing mechanism and (ii) the improvement of filtering with the data obtained from content preprocessing.
    Given the formulated hypothesis, a series of sub-objectives must be achieved: (i) the construction of a generic framework to execute a text preprocessing pipeline that concludes with the processed dataset, (ii) the implementation of each of the preprocessing tasks, and (iii) classification tests on texts represented with tokens and with synsets.
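
    The representational shift described above, from tokens to synsets, can be illustrated with a short sketch. The SynsetDictionary interface below is a hypothetical stand-in for an ontological dictionary such as WordNet or BabelNet; it is not an API from this thesis or from those projects.

        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.Optional;

        // Hypothetical stand-in for an ontological dictionary (e.g., WordNet, BabelNet).
        interface SynsetDictionary {
            Optional<String> lookup(String token); // synset id for a token, if any
        }

        final class SynsetFeaturizer {
            private final SynsetDictionary dictionary;

            SynsetFeaturizer(SynsetDictionary dictionary) { this.dictionary = dictionary; }

            /** Maps a tokenized message to synset-frequency features. */
            Map<String, Integer> featurize(List<String> tokens) {
                Map<String, Integer> features = new HashMap<>();
                for (String token : tokens) {
                    // Represent a word by its concept when the dictionary knows it;
                    // keep the raw token otherwise, so nothing is silently dropped.
                    String key = dictionary.lookup(token.toLowerCase()).orElse("tok:" + token);
                    features.merge(key, 1, Integer::sum);
                }
                return features;
            }
        }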

    Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources

    In recent years, big data analysis has become a popular means of taking advantage of multiple (initially valueless) sources to find relevant knowledge about real domains. However, a large number of big data sources provide unstructured textual data, and a proper analysis requires tools able to adequately combine big data and text-analysing techniques. Keeping this in mind, we combined a pipelining framework (BDP4J, Big Data Pipelining For Java) with the implementation of a set of text preprocessing techniques in order to create NLPA (Natural Language Preprocessing Architecture), an extendable open-source plugin implementing preprocessing steps that can be easily combined to create a pipeline. Additionally, NLPA can generate datasets using either a classical token-based representation of the data or newer synset-based datasets that can be further processed using semantic information (i.e., using ontologies). This work presents a case study of NLPA operation covering the transformation of raw heterogeneous big data into different dataset representations (synsets and tokens) and the use of the Weka application programming interface (API) to launch two well-known classifiers.
    Funding: Xunta de Galicia | Ref. ED481B 2017/018; Agencia Estatal de Investigación | Ref. TIN2017-84658-C2-1-R
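
    The final step the abstract describes, launching two well-known classifiers over the generated dataset through the Weka API, looks roughly like the sketch below. The dataset path is illustrative, and NaiveBayes and SMO are stand-ins; the case study's actual classifier choices and options are not reproduced here.

        import weka.classifiers.Classifier;
        import weka.classifiers.Evaluation;
        import weka.classifiers.bayes.NaiveBayes;
        import weka.classifiers.functions.SMO;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        import java.util.Random;

        public class LaunchClassifiers {
            public static void main(String[] args) throws Exception {
                // Load a dataset produced by the preprocessing pipeline (path illustrative).
                Instances data = new DataSource("synset_dataset.arff").getDataSet();
                data.setClassIndex(data.numAttributes() - 1);

                for (Classifier classifier : new Classifier[] { new NaiveBayes(), new SMO() }) {
                    Evaluation eval = new Evaluation(data);
                    eval.crossValidateModel(classifier, data, 10, new Random(1)); // 10-fold CV
                    System.out.println(classifier.getClass().getSimpleName());
                    System.out.println(eval.toSummaryString());
                }
            }
        }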

    Enhancing representation in the context of multiple-channel spam filtering

    This study addresses the usage of different features to complement synset-based and bag-of-words representations of texts in the context of classical ML approaches for spam filtering (Ferrara, 2019). Despite the existence of a large number of complementary features, in order to improve the applicability of this study we selected only those that can be computed regardless of the communication channel used to distribute the content. Feature evaluation was performed using content distributed through different channels (social networks and email) and different classifiers (AdaBoost, Flexible Bayes, Naïve Bayes, Random Forests, and SVMs). The results revealed the usefulness of detecting certain non-textual entities, such as Uniform Resource Locators (URLs), in the addressed distribution channels. Moreover, we found that compression properties and/or the probability of correctly guessing the language of target texts can be successfully used to improve classification in a wide range of situations. Finally, we also detected features that are influenced by the specific fashions and habits of users of certain Internet services (e.g. the existence of words written in capital letters) and that are not useful for spam filtering.
    Funded for open access publication: Universidade de Vigo/CISUG
    Funding: Xunta de Galicia | Ref. ED481D-2021/024; Agencia Estatal de Investigación | Ref. TIN2017-84658-C2-1-R
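
    One of the channel-independent features mentioned above, the compression properties of a text, can be computed with the JDK's Deflater. This is a minimal sketch of the idea; the exact feature definition used in the study is not reproduced here.

        import java.io.ByteArrayOutputStream;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.Deflater;

        public final class CompressionFeature {
            /**
             * Ratio of compressed size to original size; highly repetitive texts
             * (a common trait of some spam) compress to noticeably smaller ratios.
             */
            public static double compressionRatio(String text) {
                byte[] input = text.getBytes(StandardCharsets.UTF_8);
                if (input.length == 0) return 1.0; // define the ratio as 1 for empty text

                Deflater deflater = new Deflater();
                deflater.setInput(input);
                deflater.finish();

                ByteArrayOutputStream compressed = new ByteArrayOutputStream();
                byte[] buffer = new byte[1024];
                while (!deflater.finished()) {
                    compressed.write(buffer, 0, deflater.deflate(buffer));
                }
                deflater.end();
                return compressed.size() / (double) input.length;
            }

            public static void main(String[] args) {
                System.out.println(compressionRatio("BUY NOW BUY NOW BUY NOW BUY NOW"));
                System.out.println(compressionRatio("A short, ordinary sentence."));
            }
        }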