
    Text Summarization Techniques: A Brief Survey

    In recent years, there has been an explosion in the amount of text data from a variety of sources. This volume of text is an invaluable source of information and knowledge that needs to be effectively summarized to be useful. In this review, the main approaches to automatic text summarization are described. We review the different processes for summarization and describe the effectiveness and shortcomings of the different methods.

    Personal Text Summarization in Mobile Device

    This paper presents a hybrid text summarization system for mobile devices that summarizes a selected text. The system can proceed by statistical or heuristic methods: the summary is selected based on combined statistical and heuristic features such as word frequency, sentence position, sentence length, and similarity with the document title. The results show that retrieving text with selected keywords takes less time with the proposed system than without it.
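
    As a rough illustration of the feature combination described above, the sketch below scores sentences by word frequency, position, length, and similarity with the title, then keeps the top-ranked sentences in document order. The weights and tokenization are illustrative assumptions, not the paper's implementation.

    ```python
    # Hedged sketch of statistic/heuristic sentence scoring: word frequency,
    # position, length, and title similarity, equally weighted (an assumption).
    import re
    from collections import Counter

    def summarize(text, title, n=3):
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        freq = Counter(re.findall(r'\w+', text.lower()))
        max_f = max(freq.values())  # assumes non-empty text
        title_words = set(re.findall(r'\w+', title.lower()))
        scored = []
        for i, sent in enumerate(sentences):
            toks = re.findall(r'\w+', sent.lower())
            if not toks:
                continue
            f = sum(freq[t] for t in toks) / (len(toks) * max_f)  # avg word frequency
            p = 1.0 - i / len(sentences)                          # earlier ranks higher
            l = min(len(toks) / 20.0, 1.0)                        # favor fuller sentences
            t = len(title_words & set(toks)) / max(len(title_words), 1)
            scored.append((f + p + l + t, i, sent))
        top = sorted(scored, reverse=True)[:n]
        return ' '.join(s for _, _, s in sorted(top, key=lambda x: x[1]))
    ```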

    Do peers see more in a paper than its authors?

    Recent years have shown a gradual shift in the content of biomedical publications that is freely accessible, from titles and abstracts to full text. This has enabled new forms of automatic text analysis and has given rise to some interesting questions: How informative is the abstract compared to the full text? What important information in the full text is not present in the abstract? What should a good summary contain that is not already in the abstract? Do authors and peers see an article differently? We answer these questions by comparing the information content of the abstract to that in citances (sentences containing citations to that article). We contrast the important points of an article as judged by its authors versus as seen by peers. Focusing on the area of molecular interactions, we perform manual and automatic analysis, and we find that the set of all citances to a target article not only covers most information (entities, functions, experimental methods, and other biological concepts) found in its abstract, but also contains 20% more concepts. We further present a detailed summary of the differences across information types, and we examine the effects that other citations and time have on the content of citances.
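
    As a rough worked example of the comparison performed here, the sketch below measures how much of an abstract's concept set is covered by the pooled citances and how many extra concepts the citances add. The concept lists are hypothetical; the study relied on manual and automatic annotation of entities, functions, and experimental methods.

    ```python
    # Hedged sketch: coverage and surplus of citance concepts relative to
    # an abstract's concepts. The example concept sets are made up.
    def concept_coverage(abstract_concepts, citance_concepts):
        abstract_set, citance_set = set(abstract_concepts), set(citance_concepts)
        covered = abstract_set & citance_set
        extra = citance_set - abstract_set
        return {'coverage': len(covered) / len(abstract_set),
                'extra_ratio': len(extra) / len(abstract_set)}

    stats = concept_coverage(
        ['RAD51', 'BRCA2', 'yeast two-hybrid', 'DNA repair'],
        ['RAD51', 'BRCA2', 'DNA repair', 'homologous recombination', 'co-IP'])
    print(stats)  # {'coverage': 0.75, 'extra_ratio': 0.5}
    ```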

    Tweet-biased summarization

    We examined whether the microblog comments given by people after reading a web document could be exploited to improve the accuracy of a web document summarization system. We examined the effect of social information (i.e., tweets) on the accuracy of the generated summaries by comparing the user preference for TBS (tweet-biased summaries) with that for GS (generic summaries). The results of a crowdsourcing-based evaluation show that the user preference for TBS was significantly higher than for GS. We also took random samples of the documents to assess the summaries in a traditional evaluation using ROUGE, in which TBS was, in general, also shown to be better than GS. We further analyzed the influence of the number of tweets pointing to a web document on summarization accuracy, finding a moderate positive correlation between the number of tweets pointing to a web document and the performance of the generated TBS as measured by user preference. The results show that incorporating social information into the summary generation process can improve the accuracy of the summary. The reasons people chose one summary over another in the crowdsourcing-based evaluation are also presented in this article.
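
    A minimal sketch of what biasing a summary toward tweets can look like: each sentence's generic relevance score is interpolated with its term overlap against the tweets pointing at the document. The interpolation weight and the overlap measure are illustrative assumptions, not the system evaluated in the article.

    ```python
    # Hedged sketch of tweet-biased sentence scoring; alpha balances the
    # generic score against tweet overlap (value chosen arbitrarily here).
    import re

    def tokens(s):
        return set(re.findall(r'\w+', s.lower()))

    def tweet_biased_scores(sentences, tweets, generic_scores, alpha=0.5):
        tweet_vocab = set().union(*(tokens(t) for t in tweets)) if tweets else set()
        biased = []
        for sent, g in zip(sentences, generic_scores):
            toks = tokens(sent)
            overlap = len(toks & tweet_vocab) / max(len(toks), 1)
            biased.append((1 - alpha) * g + alpha * overlap)
        return biased
    ```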

    Unsupervised Graph-Based Similarity Learning Using Heterogeneous Features

    Relational data refers to data that contains explicit relations among objects. Nowadays, relational data are ubiquitous and appear in many different application domains. The problem of estimating similarity between objects is a core requirement for many standard Machine Learning (ML), Natural Language Processing (NLP), and Information Retrieval (IR) problems such as clustering, classification, and word sense disambiguation. Traditional machine learning approaches represent the data using simple, concise representations such as feature vectors. While this works very well for homogeneous data, i.e., data with a single feature type such as text, it does not fully exploit the availability of different feature types. For example, scientific publications have text, citations, authorship information, and venue information, and each of these features can be used for estimating similarity. Representing such objects has been a key issue in efficient mining (Getoor and Taskar, 2007). In this thesis, we propose natural representations for relational data using multiple, connected layers of graphs, one for each feature type, along with novel algorithms for estimating similarity using multiple heterogeneous features. We also present novel algorithms for tasks like topic detection and music recommendation using the estimated similarity measure. We demonstrate superior performance of the proposed algorithms (root mean squared error of 24.81 on the Yahoo! KDD music recommendation data set and classification accuracy of 88% on the ACL Anthology Network data set) over many state-of-the-art algorithms, such as Latent Semantic Analysis (LSA), Multiple Kernel Learning (MKL), and spectral clustering, as well as over baselines, on large, standard data sets. Ph.D. thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/89824/1/mpradeep_1.pd
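
    A minimal sketch of the layered representation described above: one edge-weighted graph per feature type over the same objects, with per-layer similarities combined into a single score. The node names, edge weights, and layer weights are hypothetical; the thesis proposes algorithms that estimate such similarities rather than fixing weights by hand.

    ```python
    # Hedged sketch: one graph layer per feature type (text, citations),
    # combined with fixed weights for illustration only.
    layers = {
        'text':     {('p1', 'p2'): 0.8, ('p2', 'p3'): 0.3},
        'citation': {('p1', 'p2'): 1.0, ('p1', 'p3'): 1.0},
    }
    layer_weights = {'text': 0.6, 'citation': 0.4}

    def combined_similarity(u, v):
        edge = tuple(sorted((u, v)))  # undirected: canonical edge order
        return sum(w * layers[name].get(edge, 0.0)
                   for name, w in layer_weights.items())

    print(combined_similarity('p1', 'p2'))  # 0.6*0.8 + 0.4*1.0 ≈ 0.88
    ```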

    Enhanced Web Document Summarization Using Hyperlinks

    This paper addresses the issue of Web document summarization. As the textual content of Web documents is often scarce or irrelevant, and existing summarization techniques are based on it, many Web pages and websites cannot be suitably summarized. We build the context of a Web document from the textual content of all the documents linking to it. To summarize a target Web document, a context-based summarizer has to perform a preprocessing task, during which it decides which pieces of information in the source documents are relevant to the content of the target. A context-based summarizer then faces two issues: first, the selected elements may only partially deal with the topic of the target; second, they may be related to the target and yet not contain any clues about its content. In thi
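
    A minimal sketch of the preprocessing decision described above, assuming plain-text source documents: pool sentences from the documents linking to the target, discard those with too little overlap with the target's own (possibly scarce) text, and rank the rest. The term-overlap threshold is a simple stand-in for the paper's relevance criteria.

    ```python
    # Hedged sketch: select context sentences for a target page from the
    # pages that link to it; min_overlap is an arbitrary relevance cutoff.
    import re

    def tokens(s):
        return set(re.findall(r'\w+', s.lower()))

    def context_summary(target_text, linking_docs, n=3, min_overlap=2):
        target_vocab = tokens(target_text)
        candidates = []
        for doc in linking_docs:
            for sent in re.split(r'(?<=[.!?])\s+', doc):
                overlap = len(tokens(sent) & target_vocab)
                if overlap >= min_overlap:  # drop off-topic context
                    candidates.append((overlap, sent))
        return [s for _, s in sorted(candidates, reverse=True)[:n]]
    ```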

    Automatic Extraction of Document Topics

    Dissertation presented at the Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa for the Master's degree in Informatics Engineering. The need to associate keywords or topics with documents is widely recognized. A keyword or topic of a document is any word or multiword (a sequence of two or more words) that, having a more or less precise meaning, captures part of the content of that document. In this work I develop a new methodology for keyword extraction, working with words, multiwords, and word prefixes of a predefined length (5 characters). The use of prefixes makes it possible to handle highly inflected languages, with topic prefixes acting as markers for a whole family of words and multiwords that can then be promoted to topics; the extraction of these prefixes is novel with respect to the state of the art. The extraction is statistics-based, which makes it possible to work with texts in several languages, namely Portuguese, English, and Czech, the languages used in this work. I also sought to improve topic extraction times by using suffix arrays. The results obtained were evaluated by external judges. A fairly exhaustive comparison is also made between 24 extraction methods, some of them new and proposed in this work, others proposed by other authors. With this dissertation I aim to provide a new tool for subsequent work on document summarization, document clustering or indexing, and ontology construction.
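
    A minimal sketch of the prefix idea: counting fixed-length (5-character) prefixes lets inflected variants of one stem pool their frequency, and frequent prefixes then nominate full words as topic candidates. Scoring by raw frequency is an illustrative simplification; the dissertation evaluates 24 statistical extraction methods and uses suffix arrays for speed.

    ```python
    # Hedged sketch: 5-character prefixes as topic signals for inflected
    # languages; the most common word under each frequent prefix is promoted.
    import re
    from collections import Counter, defaultdict

    PREFIX_LEN = 5

    def topic_candidates(text, top_n=5):
        words = [w for w in re.findall(r'\w+', text.lower()) if len(w) >= PREFIX_LEN]
        by_prefix = defaultdict(Counter)
        for w in words:
            by_prefix[w[:PREFIX_LEN]][w] += 1
        prefix_freq = Counter({p: sum(c.values()) for p, c in by_prefix.items()})
        return [by_prefix[p].most_common(1)[0][0]
                for p, _ in prefix_freq.most_common(top_n)]
    ```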

    Scientific Publications Accessible to Blind and Visually Impaired People

    This work, the author's doctoral thesis, defended at the Universidad de Barcelona in 2009, analyzes the current state of accessible publishing, attending to the specific needs of visually impaired users, and assesses the characteristics of digital documents against those needs. In studying the structure of the various types of digital documents, the author identifies the publishing of scientific articles as the most advanced sector, which makes this type of document a particularly suitable model for validating accessible publishing.

    Techniques for the Analysis of Modern Web Page Traffic using Anonymized TCP/IP Headers

    Analysis of traces of network traffic is a methodology that has been widely adopted for studying the Web for several decades. However, due to recent privacy legislation and the increasing adoption of traffic encryption, often only anonymized TCP/IP headers are accessible in traffic traces. For traffic traces to remain useful for analysis, techniques must be developed to glean insight from this limited header information. This dissertation evaluates approaches for classifying individual web page downloads (web page classification) when only anonymized TCP/IP headers are available. The context in which web page classification is defined and evaluated in this dissertation differs from prior traffic classification methods in three ways. First, the impact of diversity in client platforms (browsers, operating systems, device type, and vantage point) on network traffic is explicitly considered. Second, the challenge of overlapping traffic from multiple web pages is explicitly considered and demultiplexing approaches are evaluated (web page segmentation). Lastly, unlike prior work on traffic classification, four orthogonal labeling schemes are considered (genre-based, device-based, navigation-based, and video streaming-based); these are of value in several web-related applications, including privacy analysis, user behavior modeling, traffic forecasting, and potentially behavioral ad targeting. We conduct evaluations using large collections of both synthetically generated data and browsing data from real users. Our analysis shows that the choice of client platform has a statistically significant impact on web traffic. It also shows that change point detection methods, a new class of segmentation approach, outperform existing idle-time-based methods. Overall, this work establishes that web page classification performance can be improved by: (i) incorporating client platform differences in the feature selection and training methodology, and (ii) utilizing better-performing web page segmentation approaches. This research raises awareness of the challenges associated with the analysis of modern web traffic. It shows and advocates for considering real-world factors, such as client platform diversity and overlapping traffic from multiple streams, when developing and evaluating traffic analysis techniques. Doctor of Philosophy.
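
    A minimal sketch contrasting the two segmentation families mentioned above: the idle-time baseline cuts a packet stream wherever the inter-packet gap is long, while a deliberately naive change-point detector cuts where the mean packet rate shifts. Both thresholds are illustrative assumptions, not the dissertation's models.

    ```python
    # Hedged sketch of web page segmentation over anonymized header data:
    # idle-gap cuts vs. a naive mean-shift change-point test.
    def idle_time_cuts(timestamps, idle_gap=2.0):
        """Cut wherever the gap between consecutive packets exceeds idle_gap (s)."""
        return [i for i in range(1, len(timestamps))
                if timestamps[i] - timestamps[i - 1] > idle_gap]

    def mean_shift_cuts(rates, window=5, threshold=3.0):
        """Cut where the mean packet rate before and after index i diverges."""
        cuts = []
        for i in range(window, len(rates) - window):
            before = sum(rates[i - window:i]) / window
            after = sum(rates[i:i + window]) / window
            if abs(after - before) > threshold:
                cuts.append(i)
        return cuts
    ```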