
    Text Data Mining: Theory and Methods

    This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining. The intent of this article is to introduce the reader to some of the current methodologies employed within this discipline, while at the same time making the reader aware of some of the interesting challenges that remain to be solved in the area. Finally, the article serves as a very rudimentary tutorial on some of the techniques, and provides the reader with a list of references for additional study.
    Comment: Published at http://dx.doi.org/10.1214/07-SS016 in Statistics Surveys (http://www.i-journals.org/ss/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    Measuring academic influence: Not all citations are equal

    The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation. By asking authors to identify the key references in their own work, we created a data set in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this data set using only four features. The best features, among those we evaluated, were those based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.
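    The abstract describes the hip-index only at a high level; the sketch below is one plausible reading, assuming each citation contributes its raw mention count to a publication's weighted citation total before the usual h-index rule is applied. The function name and input format are illustrative, not from the paper.

```python
# A minimal sketch of the influence-primed h-index (hip-index), assuming
# each citation is weighted by the number of times the reference is
# mentioned in the citing paper; the paper's exact weighting may differ.

def hip_index(mention_counts_per_paper):
    """mention_counts_per_paper: one list per publication, holding the
    mention count contributed by each paper that cites it."""
    # Weighted citation count of a publication = sum of mention counts.
    weighted = sorted((sum(m) for m in mention_counts_per_paper), reverse=True)
    # h-index rule: largest h such that h publications have weight >= h.
    h = 0
    for rank, count in enumerate(weighted, start=1):
        if count < rank:
            break
        h = rank
    return h

# Three publications: the first is cited twice (3 and 1 mentions), etc.
print(hip_index([[3, 1], [2], [1, 1, 1]]))  # -> 2
```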

    Exploring a Modelling Method with Semantic Link Network and Resource Space Model

    To model complex reality, it is necessary to develop a powerful semantic model. A rational approach is to integrate a relational view and a multi-dimensional view of reality. The Semantic Link Network (SLN) is a semantic model based on a relational view, and the Resource Space Model (RSM) is a multi-dimensional model for managing, sharing and specifying versatile resources with a universal resource observation. The motivation of this research consists of four aspects: (1) verify the roles of the SLN and the RSM in effectively managing various types of resources; (2) demonstrate the advantages of the RSM and the SLN; (3) uncover the underlying rules through applications; and (4) generalize a methodology for modelling complex reality and managing various resources. The main contributions of this work are the following:
    1. A new text summarization method is proposed that segments a document into clauses based on semantic discourse relations, then ranks and extracts the informative clauses according to their relations and roles (see the sketch after this list). The method benefits from the semantic link network, ranking techniques and language characteristics. Compared with other summarization approaches, the proposed approach based on semantic relations achieves a higher recall score. Three implications are obtained from this research.
    2. An SLN-based model for recommending research collaboration is proposed that extracts a semantic link network of different types of semantic nodes and semantic links from scientific publications. Experiments on three data sets of scientific publications show that the model performs well in predicting future collaborators. This research further unveils that different semantic links play different roles in representing texts.
    3. A multi-dimensional method for managing software engineering processes is developed. Software engineering processes are mapped into multiple dimensions to support the analysis, development and maintenance of software systems. The method can uniformly classify and manage software methods and models through multiple dimensions so that software systems can be developed with appropriate methods. Interfaces for visualizing the RSM are developed to support the proposed method by keeping consistency among the interface, the structure of the model and faceted navigation.
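    As a rough illustration of contribution 1, the following sketch ranks pre-segmented clauses by centrality in a graph of semantic links and extracts the top-ranked ones. Clause segmentation and relation extraction, the substantive parts of the method, are assumed as given input; the weighted PageRank is a stand-in for the thesis's relation- and role-based ranking.

```python
# A highly simplified sketch of relation-based extractive summarization.
# Clauses and their semantic links are assumed to be already identified.
import networkx as nx

def summarize(clauses, relations, k=2):
    """clauses: clause strings; relations: (i, j, weight) semantic links
    between clause indices; returns the k top-ranked clauses in order."""
    g = nx.Graph()
    g.add_nodes_from(range(len(clauses)))
    g.add_weighted_edges_from(relations)
    scores = nx.pagerank(g, weight="weight")   # rank clauses by centrality
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [clauses[i] for i in sorted(top)]   # keep document order

clauses = ["The system failed", "because the pump overheated",
           "operators restarted it", "and logged the incident"]
relations = [(0, 1, 2.0), (1, 2, 1.0), (2, 3, 1.0)]  # invented links
print(summarize(clauses, relations))
```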

    Semantic multimedia analysis using knowledge and context

    The difficulty of semantic multimedia analysis can be attributed to the extended diversity in form and appearance exhibited by the majority of semantic concepts and the difficulty of expressing them using a finite number of patterns. In meeting this challenge there has been a scientific debate on whether the problem should be addressed from the perspective of using overwhelming amounts of training data to capture all possible instantiations of a concept, or from the perspective of using explicit knowledge about the concepts’ relations to infer their presence. In this thesis we address three problems of pattern recognition and propose solutions that combine the knowledge extracted implicitly from training data with the knowledge provided explicitly in structured form. First, we propose a Bayesian network (BN) modelling approach that defines a conceptual space where both domain-related evidence and evidence derived from content analysis can be jointly considered to support or disprove a hypothesis. The use of this space leads to significant gains in performance compared to analysis methods that cannot handle combined knowledge. Then, we present an unsupervised method that exploits the collective nature of social media to automatically obtain large amounts of annotated image regions. By proving that the quality of the obtained samples can be almost as good as that of manually annotated images when working with large datasets, we contribute significantly towards scalable object detection. Finally, we introduce a method that treats images, visual features and tags as the three observable variables of an aspect model and extracts a set of latent topics that incorporates the semantics of both the visual and the tag information space. By showing that the cross-modal dependencies of tagged images can be exploited to increase the semantic capacity of the resulting space, we advocate the use of all existing information facets in the semantic analysis of social media.
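    To make the evidence-combination idea concrete, here is a toy numerical illustration of the first contribution's core mechanism: two conditionally independent pieces of evidence (a content-analysis detector and a domain cue) jointly updating belief in a concept hypothesis. All probabilities are invented placeholders, not values from the thesis.

```python
# A toy illustration of the conceptual space: two conditionally independent
# pieces of evidence jointly support or disprove a concept hypothesis H.

p_h = 0.2                                  # prior: concept present
p_e1 = {True: 0.8, False: 0.3}             # P(detector fires | H)
p_e2 = {True: 0.7, False: 0.1}             # P(domain cue present | H)

def posterior(e1, e2):
    """P(H | e1, e2) under a naive-Bayes network structure."""
    def likelihood(h):
        l1 = p_e1[h] if e1 else 1 - p_e1[h]
        l2 = p_e2[h] if e2 else 1 - p_e2[h]
        return l1 * l2
    num = likelihood(True) * p_h
    return num / (num + likelihood(False) * (1 - p_h))

print(posterior(True, True))    # both sources agree: belief rises to ~0.82
print(posterior(True, False))   # domain evidence disagrees: belief drops
```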

    Information search and similarity based on Web 2.0 and semantic technologies

    The World Wide Web places a huge amount of information, described in natural language, at society's disposal. Web search engines were born from the need to find particular pieces of that information. Their ease of use and their utility have turned these engines into some of the most heavily used web tools on a daily basis. To make a query, users just have to enter a set of words (keywords) in natural language, and the engine answers with an ordered list of resources that contain those words. The order is given by ranking algorithms, which basically use two types of factors: dynamic and static. The dynamic factor takes the query into account; that is, documents containing the keywords used to describe the query are more relevant for that query. The hyperlink structure among documents is an example of a static factor in most current algorithms: if many documents link to a particular document, that document may be more relevant than others because it is more popular. Even though there is currently wide consensus on the good results that most web search engines provide, these tools still suffer from some limitations, basically (1) the loneliness of the searching activity itself, and (2) the simple retrieval process, based mainly on returning the documents that contain the exact terms used to describe the query. Considering the first problem, there is no doubt that searching for relevant information on the World Wide Web is a lonely and time-consuming process. Thousands of users repeat previously executed queries, spending time deciding which documents are relevant or not; decisions that may have been made before and that could do the job for similar or identical queries issued by other users. Considering the second problem, the textual nature of the current Web makes the reasoning capability of web search engines quite restricted; queries and web resources are described in natural language that, in some cases, leads to ambiguity or other semantic difficulties. Computers do not understand text; however, if semantics is incorporated into the text, meaning is incorporated too. This way, queries and web resources are no longer mere sets of terms, but lists of well-defined concepts. This thesis proposes a semantic layer, known as Itaca, which combines simplicity and effectiveness in order to endow both the resources stored on the World Wide Web and the queries users issue to find those resources with semantics. This is achieved through collaborative annotations and relevance feedback made by the users themselves, who describe both queries and web resources by means of Wikipedia concepts. Itaca extends the functional capabilities of current web search engines, providing a new ranking algorithm without dispensing with traditional ranking models. Experiments show that this new architecture offers more precision in the final results while keeping the simplicity and usability of existing web search engines. Its design as a layer makes its inclusion in current engines simple and feasible.
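    A minimal sketch of the two-factor ranking idea described above: a dynamic, query-dependent match score blended with a static, query-independent popularity score. The linear blend, the weight alpha and the scoring functions are illustrative assumptions, not Itaca's actual model.

```python
# A minimal sketch of two-factor ranking: a dynamic (query match) score
# combined with a static (popularity) score via an assumed linear blend.

def rank(documents, query_terms, alpha=0.7):
    """documents: (text, popularity) pairs, popularity normalized to [0, 1]
    (e.g. a link-based score); returns texts ordered best-first."""
    def dynamic(text):
        words = set(text.lower().split())
        # Fraction of query terms that appear in the document.
        return sum(t in words for t in query_terms) / len(query_terms)
    scored = [(alpha * dynamic(text) + (1 - alpha) * pop, text)
              for text, pop in documents]
    return [text for _, text in sorted(scored, reverse=True)]

docs = [("semantic search with wikipedia concepts", 0.4),
        ("popular page about cats", 0.9)]
print(rank(docs, ["semantic", "search"]))  # query match outweighs popularity
```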

    Seventh Biennial Report: June 2003 - March 2005


    Behavioral Profiling of SCADA Network Traffic using Machine Learning Algorithms

    Mixed traffic networks containing both traditional ICT network traffic and SCADA network traffic are more commonplace now due to the desire for remote control and monitoring of industrial processes. The ability to identify SCADA devices on a mixed traffic network with zero prior knowledge, such as port, protocol or IP address, is desirable, since SCADA devices communicate over corporate networks but typically use non-standard ports and proprietary protocols. Four supervised machine learning (ML) algorithms are tested on a mixed traffic dataset containing 116,527 dataflows from both SCADA and traditional ICT networks: Naive Bayes, NBTree, BayesNet, and J4.8. Using packet timing, packet size and data throughput as traffic behavior categories, this research calculates 24 attributes from each device dataflow. All four algorithms are tested with three attribute subsets: the full set and two reduced subsets. The attributes and ML algorithms chosen for experimentation demonstrate that a true positive rate (TPR) of 0.9935 for SCADA network traffic is feasible on a given network. The experiments also identify an optimal attribute subset that maintains a TPR of at least 0.99. The optimal attribute subset captures the SCADA network traffic behaviors that most effectively differentiate it from traditional ICT network traffic.
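    The study's classifiers (NBTree, J4.8) are Weka implementations; the sketch below reproduces only the shape of the experiment, using scikit-learn's Gaussian Naive Bayes on synthetic stand-in flows. The three features are hypothetical representatives of the 24 timing, size and throughput attributes, and the data is fabricated for illustration.

```python
# A hedged sketch: train Naive Bayes on per-flow features derived from
# packet timing, size and throughput, then report the SCADA-class TPR.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
# Columns: mean inter-arrival time (s), mean packet size (B), throughput (B/s).
scada = np.column_stack([rng.normal(1.0, 0.05, 500),    # periodic polling
                         rng.normal(100, 10, 500),      # small fixed frames
                         rng.normal(800, 50, 500)])
ict = np.column_stack([rng.normal(0.2, 0.3, 500).clip(1e-3),  # bursty
                       rng.normal(700, 400, 500).clip(40),
                       rng.normal(5e4, 3e4, 500).clip(100)])
X = np.vstack([scada, ict])
y = np.array([1] * 500 + [0] * 500)          # 1 = SCADA, 0 = traditional ICT

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GaussianNB().fit(X_tr, y_tr)
print(f"SCADA TPR: {recall_score(y_te, clf.predict(X_te)):.4f}")
```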

    Applying Wikipedia to Interactive Information Retrieval

    There are many opportunities to improve the interactivity of information retrieval systems beyond the ubiquitous search box. One idea is to use knowledge bases (e.g. controlled vocabularies, classification schemes, thesauri and ontologies) to organize, describe and navigate the information space. These resources are popular in libraries and specialist collections, but have proven too expensive and narrow to be applied to everyday web-scale search. Wikipedia has the potential to bring structured knowledge into more widespread use. This online, collaboratively generated encyclopaedia is one of the largest and most consulted reference works in existence. It is broader, deeper and more agile than the knowledge bases put forward to assist retrieval in the past. Rendering this resource machine-readable is a challenging task that has captured the interest of many researchers. Many see it as a key step required to break the knowledge acquisition bottleneck that crippled previous efforts. This thesis claims that the roadblock can be sidestepped: Wikipedia can be applied effectively to open-domain information retrieval with minimal natural language processing or information extraction. The key is to focus on gathering and applying human-readable rather than machine-readable knowledge. To demonstrate this claim, the thesis tackles three separate problems: extracting knowledge from Wikipedia; connecting it to textual documents; and applying it to the retrieval process. First, we demonstrate that a large thesaurus-like structure can be obtained directly from Wikipedia, and that accurate measures of semantic relatedness can be efficiently mined from it. Second, we show that Wikipedia provides the necessary features and training data for existing data mining techniques to accurately detect and disambiguate topics when they are mentioned in plain text. Third, we provide two systems and user studies that demonstrate the utility of the Wikipedia-derived knowledge base for interactive information retrieval.
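    As an illustration of the kind of relatedness measure that can be mined from Wikipedia's link graph, the sketch below computes a normalized overlap of two articles' in-link sets, in the spirit of the link-based measure associated with this line of work. The link sets and article count are made up for the example.

```python
# A sketch of link-based semantic relatedness between two Wikipedia
# articles, computed from the overlap of their in-link sets.
from math import log

def relatedness(in_links_a, in_links_b, n_articles):
    """in_links_a/b: sets of article IDs linking to each article;
    n_articles: total number of articles in Wikipedia."""
    common = in_links_a & in_links_b
    if not common:
        return 0.0
    big = max(len(in_links_a), len(in_links_b))
    small = min(len(in_links_a), len(in_links_b))
    # Normalized distance: 0 when the link sets coincide, ~1 when unrelated.
    distance = (log(big) - log(len(common))) / (log(n_articles) - log(small))
    return max(0.0, 1.0 - distance)

a = set(range(0, 600))          # hypothetical in-links of article A
b = set(range(400, 1000))       # hypothetical in-links of article B
print(relatedness(a, b, n_articles=6_000_000))  # ~0.88: strongly related
```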