327,232 research outputs found

    Text Mining at Feature Level: A Review

    Text mining is a technique that helps users find useful information in large collections of text documents on the web or in databases. Most popular text mining and classification methods have adopted term-based approaches. Both term-based approaches and pattern-based methods are used to describe user preferences. This review paper analyses how text mining works at three levels, i.e. sentence level, document level and feature level. We review previously published related work and describe the problems that arise when text mining is done at the feature level. The paper also presents a technique for text mining of compound sentences

    Graph-based Representation for Sentence Similarity Measure : A Comparative Analysis

    Textual data are a rich source of knowledge; hence, sentence comparison has become one of the important tasks in text mining. Most previous work on text comparison has been performed at the document level, and research suggests that comparing text at the sentence level is a non-trivial problem. One of the reasons is that two sentences can convey the same meaning with totally dissimilar words. This paper presents the results of a comparative analysis of three representation schemes, i.e. term frequency-inverse document frequency, Latent Semantic Analysis and graph-based representation, using three similarity measures, i.e. Cosine, Dice coefficient and Jaccard similarity, to compare the similarity of sentences. Results reveal that the graph-based representation and the Jaccard similarity measure outperform the others in terms of precision, recall and F-measure
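
    The three similarity measures named in this abstract can be illustrated with a minimal sketch. It assumes a plain bag-of-words representation with whitespace tokenisation, which is not the graph-based representation the paper evaluates; the function names and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase whitespace tokenisation, no stemming or stop-word removal.
    return sentence.lower().split()


def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over the sets of distinct tokens.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dice(a, b):
    # 2 |A ∩ B| / (|A| + |B|) over the sets of distinct tokens.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    # Cosine of the angle between raw term-frequency vectors.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


s1 = "the cat sat on the mat"
s2 = "the cat lay on the rug"
print(jaccard(s1, s2), dice(s1, s2), cosine(s1, s2))
```

    Jaccard and Dice ignore how often a token repeats, while cosine over term frequencies does not, so the three measures can rank the same sentence pairs differently.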

    Relationship Analysis of Keyword and Chapter in Malay-Translated Tafseer of Al-Quran

    Studies of the unseen knowledge categories and the relationships among subject matters discussed in the Al-Quran or the Tafseer have gained popularity. This research investigates the relationships between verses and chapters at the keyword level in a Malay-translated Tafseer. A technique combining text mining and network analysis is developed to discover non-trivial patterns and relationships of verses and chapters in the Tafseer. This is achieved through keyword extraction, keyword-chapter relationship discovery and keyword-chapter network analysis. A total of 130 keywords were extracted from six chapters in the Tafseer. The keywords and their relative importance to a chapter are computed using term weighting. A network analysis map was generated to visualize and analyze the relationship between keyword and chapter in the Tafseer. The relationships between the verses and chapters at the keyword level are successfully portrayed through the combination of text mining and network analysis. The novelty of this approach lies in the discovery of relationships between verses and chapters that are useful for grouping related chapters together
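
    The abstract does not say which term-weighting scheme was used. As a rough sketch only, the following assumes TF-IDF as one common choice and turns the highest-weighted terms of each chapter into weighted keyword-chapter edges for a network; the function names, the `top_k` parameter and the toy tokens are all hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(chapters):
    # chapters: dict mapping chapter name -> list of tokens.
    # Returns dict mapping chapter name -> {term: TF-IDF weight}.
    n = len(chapters)
    df = Counter()
    for tokens in chapters.values():
        df.update(set(tokens))  # document frequency: chapters containing the term
    weights = {}
    for name, tokens in chapters.items():
        tf = Counter(tokens)
        total = len(tokens)
        weights[name] = {
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return weights


def keyword_chapter_edges(weights, top_k=2):
    # Keep the top_k positively weighted terms per chapter as (term, chapter, weight) edges.
    edges = []
    for name, term_weights in weights.items():
        ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
        edges.extend((term, name, w) for term, w in ranked[:top_k] if w > 0)
    return edges


# Toy data (hypothetical placeholder tokens, not taken from the Tafseer).
chapters = {
    "A": ["mercy", "prayer", "mercy"],
    "B": ["prayer", "charity"],
}
edges = keyword_chapter_edges(tfidf_weights(chapters), top_k=1)
print(edges)
```

    Terms that occur in every chapter receive a weight of zero under this scheme, so they never become edges; chapters that share a keyword node in the resulting edge list are candidates for grouping.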

    Textual Data Mining For Knowledge Discovery and Data Classification: A Comparative Study

    Business Intelligence solutions are key to enable industrial organisations (either manufacturing or construction) to remain competitive in the market. These solutions are achieved through analysis of data which is collected, retrieved and re-used for prediction and classification purposes. However, many sources of industrial data are not being fully utilised to improve the business processes of the associated industry. It is generally left to the decision makers or managers within a company to take effective decisions based on the information available throughout product design and manufacture or from the operation of business or production processes. Substantial efforts and energy are required in terms of time and money to identify and exploit the appropriate information that is available from the data. Data Mining techniques have long been applied mainly to numerical forms of data available from various data sources but their applications to analyse semi-structured or unstructured databases are still limited to a few specific domains. The applications of these techniques in combination with Text Mining methods based on statistical, natural language processing and visualisation techniques could give beneficial results. Text Mining methods mainly deal with document clustering, text summarisation and classification and mainly rely on methods and techniques available in the area of Information Retrieval (IR). These help to uncover the hidden information in text documents at an initial level. This paper investigates applications of Text Mining in terms of Textual Data Mining (TDM) methods which share techniques from IR and data mining. These techniques may be implemented to analyse textual databases in general but they are demonstrated here using examples of Post Project Reviews (PPR) from the construction industry as a case study. The research is focused on finding key single or multiple term phrases for classifying the documents into two classes, i.e. good information and bad information documents, to help decision makers or project managers to identify key issues discussed in PPRs which can be used as a guide for future project management process

    BC4GO: a full-text corpus for the BioCreative IV GO task

    Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered an accuracy that is comparable with humans. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ∼10% of relevant evidence sentences and 30% distinct GO terms, while the Results/Experiment section has nearly 60% relevant sentences and >70% GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need of using full-text articles for text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community
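
    The abstract reports inter-annotator agreement as F1-measures but does not spell out the matching rules behind "strict", "relaxed" or "hierarchical". As a minimal sketch under the assumption of exact matching between two annotators' selected items, with one annotator treated as the reference, pairwise F1 can be computed as follows; the function name and the sentence identifiers are hypothetical.

```python
def pairwise_f1(annotator_a, annotator_b):
    # Exact-match agreement between two annotators' selections.
    # annotator_b is treated as the reference; F1 is symmetric, so the choice
    # does not change the result.
    a, b = set(annotator_a), set(annotator_b)
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence identifiers chosen by two curators for the same article.
curator_1 = {"s1", "s2", "s3", "s4"}
curator_2 = {"s3", "s4", "s5"}
print(pairwise_f1(curator_1, curator_2))
```

    A relaxed variant would replace the exact set intersection with a looser matching rule, for example counting partially overlapping passages as matches.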

    Ambiguity Diagnosis for Terms in Digital Humanities

    Among all the research dedicated to terminology and word sense disambiguation, little attention has been devoted to the ambiguity of term occurrences. Even if a lexical unit is indeed a term of the domain, it is not true, even in a specialised corpus, that all its occurrences are terminological. Some occurrences are terminological and others are not. Thus, a global decision at the corpus level about the terminological status of all occurrences of a lexical unit would be erroneous. In this paper, we propose three original methods to characterise the ambiguity of term occurrences in the domain of social sciences for French. These methods model the context of the term occurrences differently: the first relies on text mining, the second is based on textometry, and the last focuses on text genre properties. The experimental results show the potential of the proposed approaches and give an opportunity to discuss their hybridisation

    All metrics are equal, but some metrics are more equal than others: A systematic search and review on the use of the term ‘metric’

    Objective: To examine the use of the term 'metric' in health and social sciences' literature, focusing on the interval scale implication of the term in Modern Test Theory (MTT). Materials and methods: A systematic search and review on MTT studies including 'metric' or 'interval scale' was performed in the health and social sciences literature. The search was restricted to 2001-2005 and 2011-2015. A Text Mining algorithm was employed to operationalize the eligibility criteria and to explore the uses of 'metric'. The paradigm of each included article (Rasch Measurement Theory (RMT), Item Response Theory (IRT) or both), as well as its type (Theoretical, Methodological, Teaching, Application, Miscellaneous) were determined. An inductive thematic analysis on the first three types was performed. Results: 70.6% of the 1337 included articles were allocated to RMT, and 68.4% were application papers. Among the number of uses of 'metric', it was predominantly a synonym of 'scale'; as adjective, it referred to measurement or quantification. Three incompatible themes 'only RMT/all MTT/no MTT models can provide interval measures' were identified, but 'interval scale' was considerably more mentioned in RMT than in IRT. Conclusion: 'Metric' is used in many different ways, and there is no consensus on which MTT metric has interval scale properties. Nevertheless, when using the term 'metric', the authors should specify the level of the metric being used (ordinal, ordered, interval, ratio), and justify why according to them the metric is at that level