49 research outputs found

    Classifying Attitude by Topic Aspect for English and Chinese Document Collections

    The goal of this dissertation is to explore the design of tools to help users make sense of subjective information in English and Chinese by comparing attitudes on aspects of a topic in English and Chinese document collections. This involves two coupled challenges: topic aspect focus and attitude characterization. The topic aspect focus is specified by using information retrieval techniques to obtain documents on a topic that are of interest to a user and then allowing the user to designate a few segments of those documents to serve as examples for aspects that she wishes to see characterized. A novel feature of this work is that the examples can be drawn from documents in two languages (English and Chinese). A bilingual aspect classifier which applies monolingual and cross-language classification techniques is used to assemble automatically a large set of document segments on those same aspects. A test collection was designed for aspect classification by annotating consecutive sentences in documents from the Topic Detection and Tracking collections as aspect instances. Experiments show that classification effectiveness can often be increased by using training examples from both languages. Attitude characterization is achieved by classifiers which determine the subjectivity and polarity of document segments. Sentence attitude classification is the focus of the experiments in the dissertation because the best presently available test collection for Chinese attitude classification (the NTCIR-6 Chinese Opinion Analysis Pilot Task) is focused on sentence-level classification. A large Chinese sentiment lexicon was constructed by leveraging existing Chinese and English lexical resources, and an existing character-based approach for estimating the semantic orientation of other Chinese words was extended. A shallow linguistic analysis approach was adopted to classify the subjectivity and polarity of a sentence. 
Using the large sentiment lexicon with appropriate handling of negation, and leveraging sentence subjectivity density, sentence positivity and negativity, the resulting sentence attitude classifier was more effective than the best previously reported systems.
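
The lexicon-based approach described in the abstract above (a sentiment lexicon, negation handling, and subjectivity density) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tiny English `LEXICON`, the `NEGATORS` set, the two-token negation window, and the function names are hypothetical and are not taken from the dissertation, which works with a large Chinese lexicon and a more elaborate classifier.

```python
# Hypothetical toy lexicon and negator list, for illustration only.
LEXICON = {"good": 1, "excellent": 1, "bad": -1, "terrible": -1}
NEGATORS = {"not", "never", "no"}


def subjectivity_density(tokens):
    """Fraction of tokens that carry sentiment according to the lexicon."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in LEXICON)
    return hits / len(tokens)


def sentence_polarity(tokens, window=2):
    """Sum lexicon scores, flipping a word's sign if a negator precedes it.

    `window` is the assumed number of preceding tokens checked for a negator.
    """
    score = 0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok, 0)
        if value == 0:
            continue
        preceding = tokens[max(0, i - window):i]
        if any(p in NEGATORS for p in preceding):
            value = -value  # simple negation handling: invert polarity
        score += value
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    sentence = "this is not good".split()
    print(sentence_polarity(sentence), subjectivity_density(sentence))
```

A real system would need word segmentation for Chinese text and a far larger lexicon; the sketch only shows how negation can invert a lexicon score before scores are aggregated.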

    On the use of language models and topic models in the web : new algorithms for filtering, classification, ranking, and recommendation

    [no abstract]

    Semantic discovery and reuse of business process patterns

    Patterns currently play an important role in modern information systems (IS) development, but their use has mainly been restricted to the design and implementation phases of the development lifecycle. Given the increasing significance of business modelling in IS development, patterns have the potential to provide a viable solution for promoting the reuse of recurrent generalized models in the very early stages of development. As a statement of research in progress, this paper focuses on business process patterns and proposes an initial methodological framework for the discovery and reuse of business process patterns within the IS development lifecycle. The framework borrows ideas from the domain engineering literature and proposes the use of semantics to drive both the discovery of patterns and their reuse.

    Cybernationalism and cyberactivism in China

    Nationalism in the Internet age is increasingly becoming an essential factor influencing agenda setting within Chinese society, as well as China's relations with foreign countries, especially the West. For China, a better understanding of the universal theoretical structure and behavioral patterns of nationalism would facilitate the overall social articulation of this trend and enhance its positive role in social agenda setting. On the other hand, a study of Chinese cybernationalism based on a Chinese perspective in Western academia is an attempt at transculturation. Given the urgency of current international relations and geopolitics, such an attempt would help to enhance China's compatibility with the current Western-dominated world order, reduce misinformation between China and other countries, and lay the cultural and ideological groundwork for other international collaborations. Considering the current state of Chinese nationalism research and the mass participatory nature of cybernationalism, this dissertation examines cybernationalism in three parts. The first is a study of the historical origins of Chinese cybernationalism. This section includes both an exploration of the social consensus in ancient China and a survey of the influence of nationalism in modern Chinese history. The study of historical origins not only shows the chronological development and evolution of both proto-nationalism and nationalism in China, but also reveals a decisive impetus for the current claims and behaviors of cybernationalism. The second part deals with the formation and rise of cybernationalism since the start of the 21st century. The important background for the move from nationalism to cybernationalism is the informatization of Chinese society. Having studied the basic situation of Chinese Internet society, especially social media as a public space, we can link the Internet with nationalism and examine the new development of nationalism in the era of mass participation. The ultimate goal is to connect proto-nationalism, nationalism, and cybernationalism, and to further construct an understanding of cybernationalism that is consistent with both the universal principles of nationalism and the Chinese context. Finally, we validate the results of the preceding study against social reality, i.e., by studying the cyberactivism practices of cybernationalism to judge their general sufficiency and validity. We will conduct several natural language processing case studies based on big data to reproduce the behavioral logic and actual impact of cyberactivism as closely as possible to the reality of the Internet, while avoiding the one-sided argumentation and under-representation flaws of traditional case studies.

    Time series motif discovery

    MAP-i Doctoral Programme in Computer Science. Time series data are produced daily in massive quantities in virtually every field. Most of the data are stored in time series databases. Finding patterns in these databases is an important problem. These patterns, also known as motifs, provide useful insight to the domain expert and summarize the database. They have been widely used in areas as diverse as finance and medicine. Although there are many algorithms for the task, they typically do not scale and require several parameters to be set. We propose a novel algorithm that runs in linear time, is space efficient, and requires only one parameter. It fully exploits the state-of-the-art time series representation technique (SAX, Symbolic Aggregate Approximation) to extract motifs at several resolutions. This property allows the algorithm to skip the expensive distance calculations that other algorithms typically employ. We also propose an approach to calculate the statistical significance of time series motifs. Although there are many approaches in the literature for finding time series motifs efficiently, surprisingly none calculates a motif's statistical significance. Our proposal leverages work from the bioinformatics community by using a symbolic definition of time series motifs to derive each motif's p-value. We estimate the expected frequency of a motif using Markov chain models. The p-value is then assessed by comparing the actual frequency with the estimated one using statistical hypothesis tests. Our contribution makes it possible to apply a powerful technique, statistical tests, in a time series setting. This provides researchers and practitioners with an important tool for automatically evaluating the relevance of each extracted motif. Finally, we propose an approach to automatically derive the parameters of the Symbolic Aggregate Approximation (iSAX) time series representation. This technique is widely used in time series data mining. Its popularity arises from the fact that it is symbolic, reduces the dimensionality of the series, allows lower bounding, and is space efficient. However, the need to set the symbolic length and alphabet size parameters limits the applicability of the representation, since the best parameter setting is highly application dependent. Typically, these are either set to a fixed value (e.g. 8) or experimentally probed for the best configuration. The technique, referred to as AutoiSAX, not only discovers the best parameter setting for each time series in the database but also finds the alphabet size for each iSAX symbol within the same word. It is based on the simple and intuitive ideas of time series complexity and standard deviation. The technique can be smoothly embedded in existing data mining tasks as an efficient subroutine. We analyse the impact of using AutoiSAX on visualization interpretability, classification accuracy, and motif mining results. Our contribution aims to make iSAX a more general approach as it evolves towards a parameter-free method. Fundação para a Ciência e Tecnologia (FCT) - SFRH / BD / 33303 / 200
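
The abstract above relies on the SAX representation, whose two parameters are the word length and the alphabet size. As a rough illustration of what those parameters control, here is a minimal sketch of plain SAX (not iSAX or AutoiSAX, and not the thesis's motif algorithm). The function name `sax`, the requirement that the series length be divisible by the word length, and the use of lowercase letters as symbols are assumptions made for this example.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, word_length, alphabet_size):
    """Convert a numeric series to a SAX word (simplified sketch).

    Assumes len(series) is divisible by word_length.
    """
    n = len(series)
    if n == 0 or n % word_length != 0:
        raise ValueError("series length must be a positive multiple of word_length")

    # 1. Z-normalize the series (a constant series maps to all zeros).
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        normalized = [0.0] * n
    else:
        normalized = [(x - mu) / sigma for x in series]

    # 2. Piecewise Aggregate Approximation: mean of each equal-sized segment.
    seg = n // word_length
    paa = [mean(normalized[i * seg:(i + 1) * seg]) for i in range(word_length)]

    # 3. Breakpoints that split the standard normal into equiprobable regions.
    gaussian = NormalDist()
    breakpoints = [gaussian.inv_cdf(k / alphabet_size)
                   for k in range(1, alphabet_size)]

    # 4. Map each segment mean to the letter of the region it falls in.
    return "".join(chr(ord("a") + bisect_right(breakpoints, v)) for v in paa)


if __name__ == "__main__":
    print(sax([1, 2, 3, 4, 5, 6, 7, 8], word_length=4, alphabet_size=4))
```

Changing `word_length` or `alphabet_size` changes the resolution of the resulting word, which is why a fixed setting such as 8 may suit one application and not another.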