
    On the Effect of Semantically Enriched Context Models on Software Modularization

    Full text link
    Many of the existing approaches for program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique for software modularization; it relies on the informal semantics of the program, encoded in the vocabulary used in the source code. Treating the source code as a mere collection of tokens, however, loses the semantic information embedded within the identifiers. We try to overcome this problem by introducing context models for source code identifiers to obtain a semantic kernel, which can be used both for deriving the topics that run through the system and for clustering it. In the first model, we abstract an identifier to its type representation and build on this notion of context to construct a contextual vector representation of the source code. The second notion of context is defined based on the flow of data between identifiers: a module is represented as a dependency graph whose nodes correspond to identifiers and whose edges represent the data dependencies between pairs of identifiers. We have applied our approach to 10 medium-sized open-source Java projects and show that introducing contexts for identifiers improves the quality of the modularization of the software systems. Both context models give results superior to the plain vector representation of documents; in some cases, the authoritativeness of decompositions improves by 67%. Furthermore, a more detailed evaluation of our approach on jEdit, an open-source editor, demonstrates that topics inferred from the contextual representations are more meaningful than those from the plain representation of the documents. The proposed context model for source code identifiers paves the way for building tools that support developers in program comprehension tasks such as application and domain concept location, software modularization, and topic analysis.
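
    A minimal sketch of the second context model described above (the data-dependency graph), assuming a toy module in which identifiers are nodes and data flows between them form directed edges; the identifier names, the edge list, and the use of networkx are illustrative assumptions, not the paper's implementation.

        import networkx as nx  # assumed helper library for graph handling

        # Hypothetical identifiers from one module; each data flow
        # (assignment, parameter passing) becomes a directed edge.
        data_flows = [
            ("customerName", "displayName"),
            ("orderTotal", "invoiceAmount"),
            ("invoiceAmount", "reportLine"),
        ]

        graph = nx.DiGraph()
        graph.add_edges_from(data_flows)

        # Identifiers that exchange data sit close together in the graph,
        # so a graph distance can replace plain token overlap as the
        # similarity underlying modularization.
        for a, b in [("orderTotal", "reportLine"), ("customerName", "reportLine")]:
            try:
                dist = nx.shortest_path_length(graph, a, b)
            except nx.NetworkXNoPath:
                dist = float("inf")
            print(f"{a} -> {b}: distance {dist}")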

    Concept-based Text Clustering

    Get PDF
    Thematic organization of text is a natural practice of humans and a crucial task for today's vast repositories. Clustering automates this by assessing the similarity between texts and organizing them accordingly, grouping like ones together and separating those with different topics. Clusters provide a comprehensive logical structure that facilitates exploration, search and interpretation of current texts, as well as organization of future ones. Automatic clustering is usually based on words. Text is represented by the words it mentions, and thematic similarity is based on the proportion of words that texts have in common. The resulting bag-of-words model is semantically ambiguous and undesirably orthogonal: it ignores the connections between words. This thesis claims that using concepts as the basis of clustering can significantly improve effectiveness. Concepts are defined as units of knowledge. When organized according to the relations among them, they form a concept system. Two concept systems are used here: WordNet, which focuses on word knowledge, and Wikipedia, which encompasses world knowledge. We investigate a clustering procedure with three components: using concepts to represent text; taking the semantic relations among them into account during clustering; and learning a text similarity measure from concepts and their relations. First, we demonstrate that concepts provide a succinct and informative representation of the themes in text, exemplifying this with the two concept systems. Second, we define methods for utilizing concept relations to enhance clustering by making the representation models more discriminative and extending thematic similarity beyond surface overlap. Third, we present a similarity measure based on concepts and their relations that is learned from a small number of examples, and show that it both predicts similarity consistently with human judgement and improves clustering. The thesis provides strong support for the use of concept-based representations instead of the classic bag-of-words model.
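
    One hedged illustration of the concept-based representation argued for above: the sketch maps words to WordNet synsets with NLTK so that synonymous words collapse onto one concept. The thesis's full procedure (relation-aware clustering, learned similarity) is richer than this; the function below is a hypothetical, minimal stand-in.

        from collections import Counter

        from nltk.corpus import wordnet as wn  # requires nltk plus its wordnet data

        def concept_bag(tokens):
            """Map each token to its most frequent WordNet synset (a concept).

            Tokens with no synset fall back to the surface word, so nothing
            is silently dropped from the representation.
            """
            concepts = []
            for tok in tokens:
                synsets = wn.synsets(tok)
                concepts.append(synsets[0].name() if synsets else tok)
            return Counter(concepts)

        # "car" and "automobile" share the synset car.n.01, so these texts
        # overlap at the concept level despite having zero words in common.
        print(concept_bag(["car", "engine"]))
        print(concept_bag(["automobile", "motor"]))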

    Russian word sense induction by clustering averaged word embeddings

    Full text link
    The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE-2018). Our team was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th for the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive: we represented the contexts of ambiguous words as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. These vector representations were then clustered with mainstream clustering techniques, producing groups corresponding to the ambiguous word senses. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data, not only in intrinsic evaluation but also in downstream tasks like word sense induction. Comment: Proceedings of the 24th International Conference on Computational Linguistics and Intellectual Technologies (Dialogue-2018).
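
    The method is concrete enough to sketch directly: average pre-trained word embeddings over each context of the ambiguous word, then cluster the resulting vectors with an off-the-shelf algorithm. The model path, the example contexts, and the fixed cluster count below are placeholder assumptions rather than the shared-task configuration.

        import numpy as np
        from gensim.models import KeyedVectors  # assumed word2vec-format model on disk
        from sklearn.cluster import KMeans

        wv = KeyedVectors.load_word2vec_format("model.bin", binary=True)  # placeholder path

        def context_vector(tokens):
            """Average the embeddings of all in-vocabulary context words."""
            vecs = [wv[t] for t in tokens if t in wv]
            return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

        contexts = [
            "the bank of the river was steep".split(),
            "the bank approved the loan".split(),
            "interest rates at the bank rose".split(),
            "the river bank eroded after the flood".split(),
        ]

        X = np.stack([context_vector(c) for c in contexts])
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
        print(labels)  # one cluster id per context = one induced sense each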

    Does Enrichment of Clinical Texts by Ontology Concepts Increase Classification Accuracy?

    Get PDF
    In the medical domain, multiple ontologies and terminology systems are available. However, existing classification and prediction algorithms in the clinical domain often ignore or insufficiently utilize the semantic information provided in those ontologies. To address this issue, we introduce a concept for augmenting embeddings, the input to deep neural networks, with semantic information retrieved from ontologies. To do this, words and phrases of sentences are mapped to concepts of a medical ontology, aggregating synonyms under the same concept. A semantically enriched vector is generated and used for sentence classification. We study our approach on a sentence classification task using a real-world dataset comprising 640 sentences belonging to 22 categories. A deep neural network model is defined with an embedding layer followed by two LSTM layers and two dense layers. Our experiments show that classification accuracy with concept-enriched embeddings is higher for some categories than without enrichment. We conclude that semantic information from ontologies has the potential to provide a useful enrichment of text. Future research will assess to what extent semantic relationships from the ontology can be used for enrichment.
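
    A hedged Keras sketch of the network as described (an embedding layer, two LSTM layers, two dense layers); the vocabulary size, dimensions, and the assumption that ontology-concept ids share one input vocabulary with word ids are illustrative choices, not the authors' exact configuration.

        from tensorflow.keras import layers, models

        VOCAB = 10_000    # assumed joint vocabulary of word ids and concept ids
        SEQ_LEN = 50      # assumed maximum sentence length
        NUM_CLASSES = 22  # sentence categories, as in the dataset above

        model = models.Sequential([
            # Inputs are word ids, optionally augmented with ids of ontology
            # concepts mapped from the sentence (the "enrichment" step).
            layers.Embedding(input_dim=VOCAB, output_dim=128),
            layers.LSTM(64, return_sequences=True),  # first LSTM layer
            layers.LSTM(64),                         # second LSTM layer
            layers.Dense(64, activation="relu"),     # first dense layer
            layers.Dense(NUM_CLASSES, activation="softmax"),  # second dense layer
        ])
        model.build(input_shape=(None, SEQ_LEN))
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.summary()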