Unsupervised Keyword Extraction from Polish Legal Texts
In this work, we present an application of the recently proposed unsupervised keyword extraction algorithm RAKE to a corpus of Polish legal texts from the field of public procurement. RAKE is essentially a language- and domain-independent method. Its only language-specific input is a stoplist containing a set of non-content words. The performance of the method depends heavily on the choice of this stoplist, which should be adapted to the domain. Therefore, we complement the RAKE algorithm with an automatic approach to selecting non-content words, based on the statistical properties of term distributions.
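The core RAKE idea can be sketched in a few lines: split the word stream into candidate phrases at stopword boundaries, then score each phrase by the sum of its members' degree-to-frequency ratios. The stoplist below is a hypothetical hand-picked one; the paper's contribution is deriving such a list automatically from term-distribution statistics.

```python
import re
from collections import defaultdict

# Hypothetical minimal stoplist; the paper derives one automatically
# from the statistical properties of term distributions.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on"}

def rake_scores(text):
    """Score candidate keywords RAKE-style: sum of deg(w)/freq(w) per phrase."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    # Split the word sequence into candidate phrases at stopwords.
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    # Word co-occurrence statistics within candidate phrases.
    freq = defaultdict(int)
    degree = defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)  # degree counts co-occurring words (incl. itself)
    # Phrase score = sum of member-word degree/frequency ratios.
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}

scores = rake_scores("public procurement law regulates the award of public contracts")
```

Because degree rewards words that co-occur in long candidates, multi-word phrases tend to outscore isolated words, which suits legal terminology such as multi-word statutory terms.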
Transforming legal documents for visualization and analysis
Regulations, laws, norms, and other documents of a legal nature are a relevant part of any governmental organisation. During the digitisation and transformation stages towards a digital-government model, information and communication technologies are explored to improve the internal processes and working practices of government infrastructures. This paper introduces preliminary results of a research line devoted to developing visualisation techniques for enhancing the readability and comprehension of legal texts. The content of documents is conveyed to a well-defined model, which is enriched with automatically extracted semantic information. Then, a set of digital views is created for document exploration from both a structural and a semantic point of view. Effective and easier-to-use digital interfaces can enable and promote citizen engagement in decision-making processes, provide information for the public, and also enhance the study and analysis of legal texts by lawmakers, legal practitioners, and assorted scholars.

“SmartEGOV: Harnessing EGOV for Smart Governance (Foundations, Methods, Tools) / NORTE-01-0145-FEDER-000037”, supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF).
Classification of protein interaction sentences via gaussian processes
The increase in the availability of protein interaction studies in textual format, coupled with the demand for easier access to the key results, has led to a need for text mining solutions. In the text processing pipeline, classification is a key step for extraction of small sections of relevant text. Consequently, for the task of locating protein-protein interaction sentences, we examine the use of a classifier which has rarely been applied to text: Gaussian processes (GPs). GPs are a non-parametric probabilistic analogue to the more popular support vector machines (SVMs). We find that GPs outperform the SVM and naïve Bayes classifiers on binary sentence data, whilst showing equivalent performance on abstract and multiclass sentence corpora. In addition, the lack of a margin parameter, which requires costly tuning, along with the principled multiclass extensions enabled by the probabilistic framework, make GPs an appealing alternative worthy of further adoption.
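The pipeline the abstract describes can be sketched with scikit-learn's GP classifier over TF-IDF sentence vectors. This is a toy illustration, not the paper's setup: the sentences and labels below are invented stand-ins for the protein-interaction corpora, and the RBF kernel choice is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Hypothetical sentences standing in for the protein-interaction corpora.
sentences = [
    "protein A binds protein B in vitro",
    "kinase X phosphorylates substrate Y",
    "the cell culture was incubated overnight",
    "samples were stored at minus eighty degrees",
]
labels = [1, 1, 0, 0]  # 1 = interaction sentence, 0 = not

X = TfidfVectorizer().fit_transform(sentences).toarray()
# A GP classifier with an RBF kernel: unlike an SVM there is no margin
# parameter to tune, and it returns calibrated class probabilities.
gpc = GaussianProcessClassifier(kernel=RBF(1.0)).fit(X, labels)
probs = gpc.predict_proba(X)  # posterior P(class | sentence) per row
```

The probabilistic output is what makes the multiclass extension principled: the per-class posteriors come directly from the model rather than from a one-vs-one voting heuristic bolted onto margins.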
Semi-automated dialogue act classification for situated social agents in games
As a step toward simulating dynamic dialogue between agents and humans in virtual environments, we describe learning a model of social behavior composed of interleaved utterances and physical actions. In our model, utterances are abstracted as {speech act, propositional content, referent} triples. After training a classifier on 100 gameplay logs from The Restaurant Game annotated with dialogue act triples, we have automatically classified utterances in an additional 5,000 logs. A quantitative evaluation of statistical models learned from the gameplay logs demonstrates that semi-automatically classified dialogue acts yield significantly more predictive power than automatically clustered utterances, and serve as a better common currency for modeling interleaved actions and utterances.
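The train-then-auto-label loop can be sketched with a bag-of-words naive Bayes classifier. Everything below is invented for illustration: the utterances, the speech-act labels, and the classifier choice are assumptions, and the real model predicts full {speech act, propositional content, referent} triples rather than a single label.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical annotated utterances: (utterance, speech-act label).
train = [
    ("can i get a menu please", "REQUEST"),
    ("here is your food", "GIVE"),
    ("thanks so much", "THANK"),
    ("i would like the steak please", "REQUEST"),
]
texts, acts = zip(*train)
vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), acts)
# Auto-label an unseen utterance, as done for the additional 5,000 logs.
pred = clf.predict(vec.transform(["could i have some water please"]))[0]
```

The point of the semi-automated setup is that a small hand-annotated seed (100 logs) buys labels for a corpus fifty times larger, at the cost of some label noise that the downstream statistical models must tolerate.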
Multimedia Retrieval by Means of Merge of Results from Textual and Content Based Retrieval Subsystems
The main goal of this paper is to present our experiments in the ImageCLEF 2009 campaign (photo retrieval task). In 2008 we showed empirically that Text-Based Image Retrieval (TBIR) methods beat Content-Based Image Retrieval (CBIR) in the quality of results, so this time we developed several experiments in which the CBIR subsystem helps the TBIR subsystem. The main improvement of the TBIR system [6] is the named-entity sub-module. In the CBIR system [3], the number of low-level features has been increased from the 68 components used at ImageCLEF 2008 to 114 components, and only the Mahalanobis distance has been used. We propose an ad-hoc management of the delivered topics, and the generation of XML structures for the 0.5 million photograph captions (corpus) delivered. Two different merging algorithms were developed, and a third one tries to improve our previous cluster-level results by promoting diversity. Out of 84 submitted experiments, our best run ranked 16th for precision metrics, 19th for MAP score, and 11th for the diversity value. Our best text-only experiment ranked 6th out of 41.
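The distance measure the CBIR subsystem relies on can be sketched directly: the Mahalanobis distance scales each feature direction by the inverse covariance of the corpus, so correlated or high-variance components do not dominate the comparison. The feature matrix below is random toy data, not the paper's 114-component descriptors.

```python
import numpy as np

def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance between feature vectors x and y."""
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

# Hypothetical low-level image features (the paper uses 114 components).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 4))          # 50 images, 4-dim toy features
cov_inv = np.linalg.inv(np.cov(features.T))  # inverse covariance of the corpus
query, candidate = features[0], features[1]
dist = mahalanobis(query, candidate, cov_inv)
```

With the identity matrix as `cov_inv` this reduces to the plain Euclidean distance, which is why it is often described as a covariance-corrected Euclidean metric.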
Document Word Clouds: Visualising Web Documents as Tag Clouds to Aid Users in Relevance Decisions
Information Retrieval systems spend great effort on determining the significant terms in a document. A user looking at a document, by contrast, cannot benefit from such information: he has to read the text to understand which words are important. In this paper we take a look at the idea of enhancing the perception of web documents with visualisation techniques borrowed from the tag clouds of Web 2.0. Highlighting the important words in a document with a larger font size gives a quick impression of the relevant concepts in a text. As this process does not depend on a user query, it can also be used for exploratory search. A user study showed that even simple TF-IDF values, used as the notion of word importance, helped users decide more quickly whether or not a document is relevant to a topic.
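The TF-IDF-to-font-size mapping the study relies on can be sketched as follows. The function name, the linear scaling, and the pixel range are illustrative assumptions; only the use of TF-IDF as the importance signal comes from the abstract.

```python
import math
from collections import Counter

def tfidf_font_sizes(doc, corpus, min_px=10, max_px=36):
    """Map each word in `doc` to a font size proportional to its TF-IDF weight."""
    words = doc.lower().split()
    tf = Counter(words)
    n_docs = len(corpus)

    def idf(w):
        # Smoothed inverse document frequency over the reference corpus.
        df = sum(1 for d in corpus if w in d.lower().split())
        return math.log((1 + n_docs) / (1 + df)) + 1

    weights = {w: tf[w] * idf(w) for w in tf}
    lo, hi = min(weights.values()), max(weights.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all weights tie
    # Linearly rescale weights into the [min_px, max_px] font-size range.
    return {w: round(min_px + (weights[w] - lo) / span * (max_px - min_px))
            for w in weights}

doc = "tag clouds visualise tag importance"
corpus = [doc, "search engines rank documents", "users read documents"]
sizes = tfidf_font_sizes(doc, corpus)
```

In the example, the repeated and corpus-distinctive word "tag" receives the largest font, which is exactly the at-a-glance signal the word-cloud rendering is meant to convey.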
Extracting collective trends from Twitter using social-based data mining
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-40495-5_62. Proceedings of the 5th International Conference, ICCCI 2013, Craiova, Romania, September 11-13, 2013.

Social networks have become an important environment for collective-trend extraction. The interactions amongst users provide information about their preferences and relationships. This information can be used to measure the influence of ideas or opinions, and how they spread within the network. Currently, one of the most relevant and popular social networks is Twitter. This social network was created to share comments and opinions. The information provided by users is especially useful in different fields and research areas, such as marketing. The data are presented as short text strings containing different ideas expressed by real people. With this representation, different data mining and text mining techniques (such as classification and clustering) can be used for knowledge extraction, trying to distinguish the meaning of the opinions. This work focuses on analysing how these techniques can interpret these opinions within the social network, using information related to the IKEA® company.

The preparation of this manuscript has been supported by the Spanish Ministry of Science and Innovation under the following projects: TIN2010-19872, ECO2011-30105 (National Plan for Research, Development and Innovation) and the Multidisciplinary Project of Universidad Autónoma de Madrid (CEMU-2012-034).
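One of the techniques the abstract mentions, clustering short texts into collective trends, can be sketched with TF-IDF vectors and k-means. The tweets below are invented stand-ins for the brand-related data, and the two-cluster choice is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical brand-related tweets (stand-ins for the real dataset).
tweets = [
    "love my new ikea bookshelf",
    "ikea furniture is great value",
    "assembly instructions are confusing",
    "missing screws in the flat pack again",
]
X = TfidfVectorizer().fit_transform(tweets)
# Group the opinions into two collective trends (e.g. praise vs. complaints).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # cluster assignment per tweet
```

In practice, the cluster centroids' highest-weighted terms serve as a readable label for each trend, which is how unsupervised output is turned into interpretable collective opinions.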
From text summarisation to style-specific summarisation for broadcast news
In this paper we report on a series of experiments investigating the path from text summarisation to style-specific summarisation of spoken news stories. We show that the portability of traditional text-summarisation features to broadcast news depends on the diffusiveness of the information in the broadcast news story. An analysis of two categories of news stories (containing only read speech or including some spontaneous speech) demonstrates the importance of the style and the quality of the transcript when extracting the summary-worthy information content. Further experiments indicate the advantages of style-specific summarisation of broadcast news.
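A minimal example of the kind of traditional extractive feature whose portability the paper studies is sentence scoring by average term frequency. The function below is a generic sketch, not the paper's feature set; sentence splitting by punctuation is itself an assumption that breaks down on speech transcripts, which is part of the paper's point.

```python
import re
from collections import Counter

def extract_summary(text, n_sentences=1):
    """Rank sentences by the mean corpus frequency of their words
    (a classic extractive-summarisation feature) and keep the top n."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tf = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(s):
        toks = re.findall(r"[a-z]+", s.lower())
        return sum(tf[t] for t in toks) / (len(toks) or 1)

    return sorted(sentences, key=score, reverse=True)[:n_sentences]

summary = extract_summary(
    "The market fell. The market fell sharply today. Weather was mild.")
```

On read speech with clean transcripts such features transfer reasonably well; on spontaneous speech, disfluencies and transcript errors degrade both the sentence boundaries and the frequency statistics the score depends on.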