Pronunciation Ambiguities in Japanese Kanji
Japanese writing is a complex system, and much of that complexity resides in the use of kanji. A single kanji character in modern Japanese may have multiple pronunciations, reflecting either native vocabulary or words borrowed from Chinese. This poses a problem for text-to-speech (TTS) synthesis, because the system must predict which pronunciation of each kanji character is appropriate in context, a task known as homograph disambiguation. To address this problem, this research provides a new annotated data set of Japanese single-kanji pronunciations and describes an experiment using a logistic regression (LR) classifier. A baseline is computed for comparison with the LR classifier's accuracy; the LR classifier improves on it by 16%. This is the first experimental study of Japanese single-kanji homograph disambiguation. The annotated Japanese data is freely released to the public to support further work.
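The abstract does not show the paper's features or data, but the basic idea of homograph disambiguation with logistic regression can be sketched as follows: a minimal, self-contained toy in which the reading of 生 ("sei" vs. "nama") is predicted from bag-of-words context features. All training examples and token choices below are fabricated for illustration.

```python
import math

# Toy task (fabricated data, not the paper's dataset): predict the reading
# of the kanji 生 from surrounding characters, using logistic regression
# trained by plain gradient descent on the log-loss.
train = [
    (["学", "活"], 1),   # contexts like 学生 / 生活 -> "sei" (label 1)
    (["学", "徒"], 1),
    (["物", "学"], 1),
    (["卵", "食"], 0),   # contexts like 生卵 (raw egg) -> "nama" (label 0)
    (["肉", "食"], 0),
    (["放", "送"], 0),   # 生放送 (live broadcast) -> "nama"
]

vocab = sorted({tok for ctx, _ in train for tok in ctx})
idx = {tok: i for i, tok in enumerate(vocab)}

def featurize(ctx):
    # One-hot bag of context tokens.
    x = [0.0] * len(vocab)
    for tok in ctx:
        if tok in idx:
            x[idx[tok]] = 1.0
    return x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.0] * len(vocab)
b = 0.0
lr = 0.5
for _ in range(200):                      # gradient descent on log-loss
    for ctx, y in train:
        x = featurize(ctx)
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = p - y
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

def predict(ctx):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, featurize(ctx))) + b)
    return "sei" if p >= 0.5 else "nama"

print(predict(["学", "活"]))   # prints "sei"
print(predict(["卵", "食"]))   # prints "nama"
```

A real system would of course use a much richer feature set (neighbouring words, part-of-speech tags) and an established library, but the decision rule is the same: a linear score over context features passed through a sigmoid.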
Towards a Peaceful Development of Cyberspace - Challenges and Technical Measures for the De-escalation of State-led Cyberconflicts and Arms Control of Cyberweapons
Cyberspace, already a few decades old, has become a matter of course for most of us and part of our everyday life. At the same time, this space and the global infrastructure behind it are essential for our civilizations, economies and administrations, and thus an essential expression and lifeline of a globalized world. However, these developments also create vulnerabilities, and cyberspace is increasingly developing into an intelligence and military operational area – for the defense and security of states, but also as a component of offensive military planning, visible in the creation of military cyber-departments and the integration of cyberspace into states' security and defense strategies. In order to contain and regulate the conflict and escalation potential of technology used by military forces, a complex tool set of transparency, de-escalation and arms control measures has been developed and proven in practice over recent decades. Unfortunately, many of these established measures do not work for cyberspace due to its specific technical characteristics. Moreover, the very concept of what constitutes a weapon – an essential requirement for regulation – starts to blur in this domain. Against this background, this thesis aims to answer how measures for the de-escalation of state-led conflicts in cyberspace and arms control of cyberweapons can be developed. In order to answer this question, the dissertation takes a specifically technical perspective on these problems and the underlying political challenges of state behavior and international humanitarian law in cyberspace, to identify starting points for technical measures of transparency, arms control and verification.
Based on this approach of adapting existing technical measures from other fields of computer science, the thesis provides proof-of-concept approaches for some of these challenges, including a classification system for cyberweapons based on technically measurable features, an approach for the mutual reduction of vulnerability stockpiles, and an approach for plausibly assuring non-involvement in a cyberconflict as a de-escalation measure. All these initial approaches, and the broader questions of how and by which measures arms control and conflict reduction can work for cyberspace, are still quite new and have so far been the subject of little debate. Indeed, the approach of deliberately restricting the capabilities of technology in order to serve a larger goal, such as reducing its destructive use, is not yet common in the engineering mindset of computer science. Therefore, this dissertation also aims to provide some impulses regarding the responsibility and creative options of computer science with a view to the peaceful development and use of cyberspace.
Design and Evaluation of Machine Translation Systems for Practical Applications (実応用を志向した機械翻訳システムの設計と評価)
Tohoku University, doctoral thesis, 博士(情報科学) (Ph.D. in Information Sciences)
Hidden Citations Obscure True Impact in Science
References, the mechanism scientists rely on to signal previous knowledge, have lately turned into widely used and misused measures of scientific impact. Yet, when a discovery becomes common knowledge, its citations suffer from obliteration by incorporation. This leads to the concept of the hidden citation: a clear textual credit to a discovery without a reference to the publication embodying it. Here, we rely on unsupervised interpretable machine learning applied to the full text of each paper to systematically identify hidden citations. We find that for influential discoveries hidden citations outnumber citation counts, emerging regardless of publishing venue and discipline. We show that the prevalence of hidden citations is driven not by citation counts, but by the degree of discourse on the topic within the text of the manuscripts, indicating that the more a discovery is discussed, the less visible it is to standard bibliometric analysis. Hidden citations thus show that bibliometric measures offer only a limited perspective on the true impact of a discovery, raising the need to extract knowledge from the full text of the scientific corpus.
Integrating Big Data Analytics with U.S. SEC Financial Statement Datasets and the Critical Examination of the Altman Z’-Score Model
The main aim of this thesis is to document the process of developing Big Data analytical applications and their integration with financial statement datasets. These datasets are publicly available on the website of the U.S. SEC (Securities and Exchange Commission), which hosts the annual and quarterly reports of approximately 8,000 companies. Through its Electronic Data Gathering, Analysis and Retrieval (EDGAR) system, the SEC receives several terabytes of data in the mandatory filings from its registrants. This vast amount of data can potentially provide a valuable resource for parties (such as investors, analysts, regulators and researchers) interested in assessing the financial performance and position of companies. Traditionally, quarterly and annual reports were submitted as standard PDF, HTML and plain-text files. The data in these files could be manually extracted and analysed, but this process (still used by some analysts and researchers) is costly and time-consuming.
In 2009, the SEC mandated that all listed companies use a digital reporting format known as XBRL (eXtensible Business Reporting Language). The intention was to improve financial reporting in terms of transparency and efficiency. In order to take advantage of the structured data contained in the XBRL format, a variety of methods such as novel extraction algorithms and data mining techniques have been developed. However, several limitations and issues have emerged. These include a lack of automated connectivity between the EDGAR web interface and the terms used in structured taxonomies, and the inability to access multiple files in a single query.
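To make the integration task concrete: the SEC's financial statement data sets are distributed as quarterly archives of tab-separated files, including a submissions file (one row per filing, keyed by an accession number, `adsh`) and a numeric-facts file (one row per reported value). The sketch below joins fabricated rows in that shape; a real pipeline would read the downloaded `sub.txt` and `num.txt` files, and the figures shown are illustrative only.

```python
import csv
import io

# Fabricated in-memory samples mimicking the tab-separated layout of the
# SEC financial statement data sets (sub.txt = filings, num.txt = facts).
sub_txt = ("adsh\tcik\tname\tform\tperiod\n"
           "0001-23-000001\t320193\tAPPLE INC\t10-K\t20230930\n")
num_txt = ("adsh\ttag\tddate\tvalue\n"
           "0001-23-000001\tAssets\t20230930\t352583000000\n"
           "0001-23-000001\tLiabilities\t20230930\t290437000000\n")

# Index filings by accession number for the join.
subs = {row["adsh"]: row
        for row in csv.DictReader(io.StringIO(sub_txt), delimiter="\t")}

# Join each numeric fact with its filing metadata.
facts = []
for row in csv.DictReader(io.StringIO(num_txt), delimiter="\t"):
    filing = subs[row["adsh"]]
    facts.append({"company": filing["name"], "form": filing["form"],
                  "tag": row["tag"], "value": int(row["value"])})

for f in facts:
    print(f["company"], f["form"], f["tag"], f["value"])
```

The thesis's cloud-based applications perform this kind of integration at scale; the join key and file shapes above follow the published data-set layout, while the specific pipeline remains the author's.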
Given the challenging and complex nature of these issues, this research project used the financial statement datasets available on the SEC website to extract relevant financial information from companies' annual reports. The novel aspect of this research is the provision of big data analytical applications, built on cloud technologies, that can efficiently integrate the datasets and transform them into a format suitable for further analysis. As a result, the extracted financial data can be analysed to assess the performance of companies, which facilitates the critical examination of widely used credit assessment models such as the Altman Z'-Score.
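For reference, the revised Altman Z'-Score (the variant published for privately held firms) combines five financial ratios with fixed coefficients. The sketch below computes it from fabricated example figures; the thesis's actual data pipeline is not reproduced here.

```python
def altman_z_prime(working_capital, retained_earnings, ebit,
                   book_equity, sales, total_assets, total_liabilities):
    """Revised Altman Z'-Score for privately held firms."""
    x1 = working_capital / total_assets      # liquidity
    x2 = retained_earnings / total_assets    # cumulative profitability
    x3 = ebit / total_assets                 # operating efficiency
    x4 = book_equity / total_liabilities     # leverage (book value of equity)
    x5 = sales / total_assets                # asset turnover
    return 0.717 * x1 + 0.847 * x2 + 3.107 * x3 + 0.420 * x4 + 0.998 * x5

# Fabricated example figures (in millions):
z = altman_z_prime(working_capital=120, retained_earnings=300, ebit=80,
                   book_equity=400, sales=900, total_assets=1000,
                   total_liabilities=600)
# Commonly cited zones for Z': above 2.90 "safe", below 1.23 "distress".
zone = "safe" if z > 2.90 else ("distress" if z < 1.23 else "grey")
print(round(z, 3), zone)   # prints 1.767 grey
```

With SEC filing data integrated as above, these seven inputs can be read directly from the balance sheet and income statement of each filing.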
Examining Political Discourse on Online 8Kun and Reddit Forums
A recent example of political violence in the United States was the January 6, 2021, Capitol attack in connection with the certification of Joseph R. Biden's victory over Donald J. Trump in the 2020 US presidential election. This thesis analyzes the events of January 6, 2021, through the lens of social media discourse. It presents a workflow that acquired over 5 million 8kun and Reddit posts from various apolitical and political forums in the three months preceding and following the attack. Techniques from text analysis are then used to group forums according to the similarities of their posting patterns, and five main groups of forums are identified. Finally, this thesis analyzes these forums for feelings of isolation and displacement from society in connection with the events of January 6, 2021; such feelings were not clearly identified. This thesis demonstrates the challenges and opportunities of scraping and analyzing social media data.
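The abstract does not specify which text-analysis techniques were used, but grouping forums by the similarity of their posts can be illustrated with a minimal sketch: each forum becomes a term-frequency vector, and forums are greedily merged when their cosine similarity exceeds a threshold. Forum names, texts, and the threshold below are all fabricated for illustration.

```python
import math
from collections import Counter

# Fabricated mini-corpus: each forum is summarized by the words of its posts.
forums = {
    "pol_a": "election fraud certification ballots election",
    "pol_b": "ballots certification fraud recount election",
    "games": "console controller speedrun patch update",
    "tech":  "console patch update driver kernel",
}

# Term-frequency vector per forum.
vecs = {name: Counter(text.split()) for name, text in forums.items()}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Greedy single-link grouping with an illustrative similarity threshold.
groups = []
for name in forums:
    for g in groups:
        if any(cosine(vecs[name], vecs[m]) > 0.5 for m in g):
            g.append(name)
            break
    else:
        groups.append([name])

print(groups)   # prints [['pol_a', 'pol_b'], ['games', 'tech']]
```

At the thesis's scale (5 million posts), one would use sparse matrices and a proper clustering algorithm, but the underlying similarity computation is the same.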
Integration of heterogeneous data sources and automated reasoning in healthcare and domotic IoT systems
In recent years, IoT technology has radically transformed many crucial industrial and service sectors such as healthcare. The multi-faceted heterogeneity of the devices and the collected information provides important opportunities to develop innovative systems and services. However, the ubiquitous presence of data silos and the poor semantic interoperability in the IoT landscape constitute a significant obstacle in the pursuit of this goal. Moreover, deriving actionable knowledge from the collected data requires IoT information sources to be analysed using appropriate artificial intelligence techniques such as automated reasoning. In this thesis, Semantic Web technologies have been investigated as an approach to address both the data integration and the reasoning aspects of modern IoT systems. In particular, the contributions presented in this thesis are the following: (1) the IoT Fitness Ontology, an OWL ontology developed to overcome the issue of data silos and enable semantic interoperability in the IoT fitness domain; (2) a Linked Open Data web portal for collecting and sharing IoT health datasets with the research community; (3) a novel methodology for embedding knowledge in rule-defined IoT smart home scenarios; and (4) a knowledge-based IoT home automation system that supports seamless integration of heterogeneous devices and data sources.
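The combination of semantic integration and rule-based reasoning described in contributions (3) and (4) can be sketched in miniature: device data from heterogeneous sources is expressed as subject-predicate-object triples, and a forward-chaining rule derives an actuation from them. This is a hedged illustration only; the thesis's actual OWL ontology and rule engine are not shown in the abstract, and all facts and predicate names below are invented.

```python
# Fabricated smart-home knowledge base: (subject, predicate, object) triples
# integrating readings and metadata from heterogeneous devices.
facts = {
    ("sensor1", "locatedIn", "livingRoom"),
    ("sensor1", "measuresTemperature", 29),
    ("fan1", "locatedIn", "livingRoom"),
    ("fan1", "isA", "Fan"),
}

def query(pred):
    # All (subject, object) pairs for a given predicate.
    return [(s, o) for (s, p, o) in facts if p == pred]

# Rule: if a room has a temperature reading above 26 and contains a fan,
# derive that the fan should be switched on (forward chaining).
derived = set()
for sensor, temp in query("measuresTemperature"):
    if temp > 26:
        rooms = [o for s, o in query("locatedIn") if s == sensor]
        for room in rooms:
            for device, loc in query("locatedIn"):
                if loc == room and (device, "isA", "Fan") in facts:
                    derived.add((device, "switchOn", True))

print(sorted(derived))   # prints [('fan1', 'switchOn', True)]
```

In a real Semantic Web stack the triples would live in an RDF store and the rule would be expressed in SWRL or SPARQL, but the derivation pattern is the same: integrated triples in, inferred triples out.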
Fuzzy spectral clustering methods for textual data
Nowadays, the development of advanced information technologies has driven an increase in the production of textual data. This inevitable growth accentuates the need for new methods and tools able to analyse such data efficiently. Against this background, unsupervised classification techniques can play a key role in this process, since most of this data is unclassified. Document clustering, which identifies a partition of clusters in a corpus of documents, has proven to perform efficiently in the analysis of textual documents and has been extensively applied in different fields, from topic modelling to information retrieval tasks. Recently, spectral clustering methods have gained popularity in the field of text classification, owing to their solid theoretical foundations, which do not require any specific assumption on the global structure of the data. However, even though they perform well in text classification problems, little has been done in the field of document clustering. Moreover, depending on the type of documents analysed, textual documents often do not contain information related to a single topic only: there may be an overlap of contents characterizing different knowledge domains. Consequently, documents may contain information that is relevant to different areas of interest to some degree.
The first part of this work critically analyses the main clustering algorithms used for text data, covering also the mathematical representation of documents and the pre-processing phase. Then, three novel fuzzy versions of spectral clustering algorithms for text data are introduced. The first exploits fuzzy K-medoids instead of K-means. The second derives directly from the first but is used in combination with the Kernel and Set Similarity (KS2M), which takes the Jaccard index into account. Finally, in the third, a new similarity measure S∗ is proposed to enhance clustering performance; it exploits the inherent sequential nature of text data through a weighted combination of the Spectrum string kernel function and a measure of set similarity.
The second part of the thesis focuses on spectral bi-clustering algorithms for text mining tasks, which represent an interesting and partially unexplored field of research. In particular, two novel versions of fuzzy spectral bi-clustering algorithms are introduced. The two algorithms differ in the approach followed to identify the document and word partitions: the first follows a simultaneous approach, while the second follows a sequential one. This difference also leads to a diversification in the choice of the number of clusters. The adequacy of all the proposed fuzzy (bi-)clustering methods is evaluated through experiments performed on both real and benchmark data sets.
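The general recipe behind these methods can be sketched independently of the thesis's specific algorithms: embed documents spectrally (via eigenvectors of a normalized affinity matrix built from cosine similarity) and then assign soft memberships with fuzzy c-means updates in the embedded space. This is a generic illustration, not the proposed KS2M or S∗ variants, and the four toy documents are fabricated.

```python
import numpy as np

# Fabricated two-topic toy corpus.
docs = [
    "stock market price trading",
    "market stock shares price",
    "football match goal team",
    "team goal football league",
]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], float)
X /= np.linalg.norm(X, axis=1, keepdims=True)   # L2-normalized TF vectors

W = X @ X.T                                     # cosine affinity matrix
d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
L_sym = d_inv_sqrt @ W @ d_inv_sqrt             # normalized affinity
vals, vecs = np.linalg.eigh(L_sym)
E = vecs[:, -2:]                                # top-2 spectral embedding

k, m = 2, 2.0                                   # clusters, fuzzifier
U = np.array([[0.6, 0.4], [0.55, 0.45], [0.4, 0.6], [0.45, 0.55]])
for _ in range(20):                             # fuzzy c-means in E-space
    C = (U.T ** m @ E) / (U.T ** m).sum(axis=1, keepdims=True)
    dist = np.linalg.norm(E[:, None, :] - C[None, :, :], axis=2) + 1e-9
    inv = dist ** (-2.0 / (m - 1))
    U = inv / inv.sum(axis=1, keepdims=True)    # standard FCM membership

labels = U.argmax(axis=1)
print(labels.tolist())
```

The soft membership matrix U is what distinguishes the fuzzy setting from hard spectral clustering: each document belongs to every cluster to some degree, which is exactly the overlapping-topic situation the thesis motivates.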