DERIVING TECHNOLOGY ROADMAPS WITH TECH MINING TECHNIQUES
Technology monitoring has been a knowledge-intensive and time-consuming task for IT managers or domain experts. Tech mining techniques can reduce this effort. This paper proposes a technology monitoring framework based on tech mining techniques to facilitate the derivation of information and communication technology (ICT) roadmaps. With this framework, a tech mining engine is able to locate the most relevant documents that describe a category of technologies. Domain experts participated in a scan meeting to verify the roadmaps generated from the selected cluster of documents. The draft roadmaps can be further articulated with domain experts' judgment for technology forecasting and assessment.
Incorporating semantic and syntactic information into document representation for document clustering
Document clustering is a widely used strategy for information retrieval and text data mining. In traditional document clustering systems, documents are represented as a bag of independent words. In this project, we propose to enrich the representation of a document by incorporating semantic information and syntactic information. Semantic analysis and syntactic analysis are performed on the raw text to identify this information. A detailed survey of current research in natural language processing, syntactic analysis, and semantic analysis is provided. Our experimental results demonstrate that incorporating semantic information and syntactic information can improve the performance of our document clustering system for most of our data sets. A statistically significant improvement can be achieved when we combine both syntactic and semantic information. Our experimental results using compound words show that using only compound words does not improve the clustering performance for our data sets. When the compound words are combined with the original single words, the combined feature set achieves slightly better performance for most data sets, but this improvement is not statistically significant. In order to select the best clustering algorithm for our document clustering system, a comparison of several widely used clustering algorithms is performed. Although the bisecting K-means method has advantages when working with large datasets, a traditional hierarchical clustering algorithm still achieves the best performance for our small datasets.
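As a rough illustration of the feature sets compared in this abstract, the sketch below builds a bag-of-words vector, a compound-word vector, and their combination. The abstract does not say how compound words are identified, so treating adjacent word pairs as compounds is purely an assumption, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(tokens: list) -> Counter:
    """Traditional representation: counts of independent single words."""
    return Counter(tokens)


def compound_terms(tokens: list) -> Counter:
    """ASSUMPTION: adjacent word pairs stand in for compound words."""
    return Counter(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))


def combined_features(text: str) -> Counter:
    """Single words plus (assumed) compound words in one feature set."""
    tokens = tokenize(text)
    return bag_of_words(tokens) + compound_terms(tokens)


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


doc_a = "data mining tools"
doc_b = "mining data tools"

# Word order is invisible to the bag-of-words vectors...
words_only = cosine(bag_of_words(tokenize(doc_a)), bag_of_words(tokenize(doc_b)))
# ...but the combined feature set separates the two documents.
with_compounds = cosine(combined_features(doc_a), combined_features(doc_b))
```

Under these assumptions the two documents are indistinguishable with single words alone, while the combined feature set tells them apart. Whether that extra signal helps clustering is an empirical question, and the abstract reports only a slight, statistically insignificant gain.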
Text mining techniques for patent analysis.
Patent documents contain important research results. However, they are lengthy and rich in technical terminology, so analyzing them takes considerable human effort. Automatic tools for assisting patent engineers or decision makers in patent analysis are in great demand. This paper describes a series of text mining techniques that conform to the analytical process used by patent analysts. These techniques include text segmentation, summary extraction, feature selection, term association, cluster generation, topic identification, and information mapping. The issues of efficiency and effectiveness are considered in the design of these techniques. Some important features of the proposed methodology include a rigorous approach to verify the usefulness of segment extracts as the document surrogates, a corpus- and dictionary-free algorithm for keyphrase extraction, an efficient co-word analysis method that can be applied to large volumes of patents, and an automatic procedure to create generic cluster titles for ease of result interpretation. Evaluation of these techniques was conducted. The results confirm that the machine-generated summaries do preserve more important content words than some other sections for classification. To demonstrate the feasibility, the proposed methodology was applied to a real-world patent set for domain analysis and mapping, which shows that our approach is more effective than existing classification systems. The attempt in this paper to automate the whole process not only helps create final patent maps for topic analyses, but also facilitates or improves other patent analysis tasks such as patent classification, organization, knowledge sharing, and prior art searches.
No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling
Extracting knowledge from unlabeled texts using machine learning algorithms
can be complex. Document categorization and information retrieval are two
applications that may benefit from unsupervised learning (e.g., text clustering
and topic modeling), including exploratory data analysis. However, the
unsupervised learning paradigm poses reproducibility issues. The initialization
can lead to variability depending on the machine learning algorithm.
Furthermore, the distortions can be misleading when regarding cluster geometry.
Amongst the causes, the presence of outliers and anomalies can be a determining
factor. Despite the relevance of initialization and outlier issues for text
clustering and topic modeling, the authors did not find an in-depth analysis of
them. This survey provides a systematic literature review (2011-2022) of these
subareas and proposes a common terminology since similar procedures have
different terms. The authors describe research opportunities, trends, and open
issues. The appendices summarize the theoretical background of the text
vectorization, the factorization, and the clustering algorithms that are
directly or indirectly related to the reviewed works.
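The initialization issue mentioned in this abstract can be sketched with a small k-means (Lloyd's algorithm) example. The four points and both sets of starting centroids below are invented for illustration and do not come from the survey. The same data converge to different partitions depending only on where the centroids start.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(
    points: Sequence[Point], init_centroids: Sequence[Point], max_iter: int = 100
) -> Tuple[List[int], float]:
    """Plain Lloyd's algorithm; returns (labels, inertia)."""
    centroids = list(init_centroids)
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    (
                        sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members),
                    )
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    inertia = sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))
    return labels, inertia


# Hypothetical data: the corners of a wide rectangle.
POINTS: List[Point] = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start left/right: converges to the compact left and right pairs.
labels_lr, inertia_lr = kmeans(POINTS, [(0.0, 0.5), (4.0, 0.5)])
# Start bottom/top: stuck in a worse local optimum (bottom row vs. top row).
labels_bt, inertia_bt = kmeans(POINTS, [(2.0, 0.0), (2.0, 1.0)])
```

Both runs are valid fixed points of the algorithm, yet the second has a much higher within-cluster sum of squares. That is why results should report the initialization scheme and seed, or aggregate over several restarts.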
On semantic differences: a multivariate corpus-based study of the semantic field of inchoativity in translated and non-translated Dutch
This dissertation places the study of semantic differences in translation compared to non-translation at the centre of its concerns. To date, much research in Corpus-based Translation Studies has focused on lexical and grammatical phenomena in an attempt to reveal presumed general tendencies of translation. On the semantic level, these general tendencies have rarely been investigated. Therefore, the goal of this study is to explore whether universal tendencies of translation also exist on the semantic level, thereby connecting the framework of translation universals to semantics.