14 research outputs found
Cross Language Text Classification via Subspace Co-Regularized Multi-View Learning
In many multilingual text classification problems, the documents in different
languages often share the same set of categories. To reduce the labeling cost
of training a classification model for each individual language, it is
important to transfer the label knowledge gained from one language to another
language by conducting cross language classification. In this paper we develop
a novel subspace co-regularized multi-view learning method for cross language
text classification. This method is built on parallel corpora produced by
machine translation. It jointly minimizes the training error of each classifier
in each language while penalizing the distance between the subspace
representations of parallel documents. Our empirical study on a large set of
cross language text classification tasks shows the proposed method consistently
outperforms a number of inductive methods, domain adaptation methods, and
multi-view learning methods.
Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012).
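The joint objective described in this abstract — per-language training error plus a penalty on the distance between the subspace representations of parallel documents — can be sketched as a toy loss function. This is an illustrative reconstruction, not the paper's released code; the function names and the use of a simple squared-error loss are assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def squared_error(w, X, y):
    """Mean squared training error of a linear scorer w on documents X."""
    return sum((dot(w, x) - t) ** 2 for x, t in zip(X, y)) / len(y)

def coregularized_loss(w_src, w_tgt, X_src, X_tgt, y, lam):
    """Sketch of the co-regularized objective: training error in each
    language plus lam times the mean squared distance between the two
    views' representations of each parallel document pair."""
    err = squared_error(w_src, X_src, y) + squared_error(w_tgt, X_tgt, y)
    gap = sum((dot(w_src, xs) - dot(w_tgt, xt)) ** 2
              for xs, xt in zip(X_src, X_tgt)) / len(y)
    return err + lam * gap
```

Minimizing this loss over `w_src` and `w_tgt` jointly is what couples the two language-specific classifiers: the penalty term pushes parallel documents toward the same point in the shared subspace.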
Cross-Lingual Adaptation using Structural Correspondence Learning
Cross-lingual adaptation, a special case of domain adaptation, refers to the
transfer of classification knowledge between two languages. In this article we
describe an extension of Structural Correspondence Learning (SCL), a recently
proposed algorithm for domain adaptation, for cross-lingual adaptation. The
proposed method uses unlabeled documents from both languages, along with a word
translation oracle, to induce cross-lingual feature correspondences. From these
correspondences a cross-lingual representation is created that enables the
transfer of classification knowledge from the source to the target language.
The main advantages of this approach over other approaches are its resource
efficiency and task specificity.
We conduct experiments in the area of cross-language topic and sentiment
classification involving English as source language and German, French, and
Japanese as target languages. The results show a significant improvement of the
proposed method over a machine translation baseline, reducing the relative
error due to cross-lingual adaptation by an average of 30% (topic
classification) and 59% (sentiment classification). We further report on
empirical analyses that reveal insights into the use of unlabeled data, the
sensitivity with respect to important hyperparameters, and the nature of the
induced cross-lingual correspondences.
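The core ingredient described above — using unlabeled documents from both languages plus a word-translation oracle to induce cross-lingual feature correspondences — can be illustrated with a crude co-occurrence version of the pivot-predictor idea. This is a simplified stand-in, not SCL itself: real SCL trains linear predictors for masked pivot features and extracts a low-dimensional projection via SVD, whereas the sketch below merely scores each non-pivot word by how often it co-occurs with a pivot in unlabeled documents.

```python
def pivot_correspondences(unlabeled_docs, pivots):
    """For each pivot word (one half of an oracle translation pair), count
    how often every other word co-occurs with it in unlabeled documents.
    Words that correlate with the same pivot in both languages end up
    'corresponding' through that pivot. Documents are sets of words."""
    scores = {}
    for pivot in pivots:
        counts = {}
        for doc in unlabeled_docs:
            if pivot in doc:
                for word in doc:
                    if word != pivot:
                        counts[word] = counts.get(word, 0) + 1
        scores[pivot] = counts
    return scores
```

In full SCL these correspondence weights would be stacked into a matrix and reduced with SVD to form the shared cross-lingual representation used for transfer.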
Cross-Lingual Text Classification with Model Translation and Document Translation
Most enterprise search engines employ data mining classifiers to categorize documents. With economic globalization, many companies are opening overseas branches or divisions, and those branches use local languages in their documents and emails. A classifier trained on monolingual data will not work when it tries to categorize documents in another language. The most direct solution is to translate documents in other languages into one language with a machine translator, but this approach suffers from machine translation inaccuracy, and the overhead is economically inefficient. Another approach is to translate the features extracted from one language into another language and use them to classify documents in that language. This approach is efficient but faces translation inaccuracy and a language-culture gap. In this project, the author proposes a new method that adopts both model translation and document translation, taking advantage of the best functionality of each.
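One natural way to combine the two approaches the abstract contrasts — classifying translated documents versus applying a translated model — is a convex combination of their class scores. The abstract does not specify the combination rule, so the weighted average below is an assumption for illustration.

```python
def combine_scores(p_doc_translation, p_model_translation, alpha=0.5):
    """Hypothetical fusion of the two approaches: a weighted average of
    per-class scores from the document-translation classifier and the
    model-translation classifier. alpha weights document translation."""
    classes = set(p_doc_translation) | set(p_model_translation)
    return {c: alpha * p_doc_translation.get(c, 0.0)
               + (1 - alpha) * p_model_translation.get(c, 0.0)
            for c in classes}
```

Tuning `alpha` on held-out data would let the combined system lean toward whichever translation direction is more reliable for a given language pair.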
Application of transfer learning for the prediction of blast impulse
Transfer learning offers the potential to increase the utility of existing data and improve predictive model performance in a new domain; this is particularly useful where data are expensive to obtain, as in a blast engineering context. A successful application in this respect will improve existing surrogate modelling approaches, allowing holistic and efficient strategies to protect people and structures subjected to the effects of an explosion. This paper presents a novel application of transfer learning for the prediction of peak specific impulse, demonstrating that knowledge learned when modelling spherical charges can be transferred to provide a performance benefit when modelling cylindrical charges. To evaluate the influence of transfer learning, two artificial neural network architectures were stress-tested at three levels of random data removal: the first model (NN) did not implement transfer learning, while the second (TNN) did, by adding a bolt-on network to a previously published NN model trained on the spherical dataset. The TNN consistently outperforms the NN, with this out-performance increasing as the proportion of removed data increases; the results are statistically significant at the low and high thresholds, with less variability in all cases. This paper indicates that transfer learning can be applied successfully, and with considerable benefit, to surrogate modelling in a blast engineering context.
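The bolt-on idea described above — keeping a previously trained network fixed and training a small additional network on its output for the new domain — can be sketched with a toy one-dimensional example. Everything here is illustrative: the frozen base model, the linear form of the bolt-on correction, and the SGD hyperparameters are all assumptions, not details from the paper.

```python
def base_model(x):
    """Stand-in for the frozen, previously published network trained on
    the spherical-charge dataset. Its weights are not updated."""
    return 2.0 * x

def train_bolton(data, lr=0.05, epochs=500):
    """Fit a small bolt-on correction y ~ a*z + b on top of the frozen
    base output z, using plain SGD on squared error. Only the bolt-on
    parameters (a, b) are trained, mimicking the TNN setup."""
    a, b = 1.0, 0.0  # identity initialization: start from the base model
    for _ in range(epochs):
        for x, y in data:
            z = base_model(x)          # frozen forward pass
            err = a * z + b - y        # residual on the new (cylindrical) data
            a -= lr * err * z          # gradient step on the bolt-on only
            b -= lr * err
    return a, b
```

Because only two parameters are fitted, the bolt-on can be trained from far fewer target-domain samples than a network learned from scratch, which matches the abstract's finding that the benefit grows as more data are removed.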
Can Chinese web pages be classified with English data source?
As the World Wide Web in China grows rapidly, mining knowledge from Chinese Web pages becomes more and more important. Mining Web information usually relies on machine learning techniques, which require a large amount of labeled data to train credible models. Although the number of Chinese Web pages is increasing quickly, Chinese labeled data remain scarce. However, there are relatively sufficient labeled English Web pages. These labeled data, though in a different linguistic representation, share a substantial amount of semantic information with Chinese pages and can be utilized to help classify Chinese Web pages. In this paper, we propose an information bottleneck based approach to this cross-language classification problem. Our algorithm first translates all the Chinese Web pages into English. Then all the Web pages, Chinese and English alike, are encoded through an information bottleneck that allows only limited information to pass. To retain as much useful information as possible, the common part between Chinese and English Web pages tends to be encoded to the same code (i.e. class label), which makes the cross-language classification accurate. We evaluated our approach using Web pages collected from the Open Directory Project (ODP). The experimental results show that our method significantly improves on several existing supervised and semi-supervised classifiers.
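The pipeline this abstract describes — translate the Chinese pages, push all pages through a limited-capacity encoding, and let shared content fall into the same code — can be illustrated with a deliberately crude stand-in. The encoder below is not a real information bottleneck (which would optimize a mutual-information trade-off); it simply compresses each page to one of a small, fixed set of codes and propagates labels through shared codes. The codebook, the word-overlap encoder, and the majority vote are all illustrative assumptions.

```python
from collections import Counter

def encode(doc, codebook):
    """Compress a document (a set of words) to a single code: the codebook
    entry it shares the most key words with. A toy stand-in for the
    limited-capacity bottleneck encoding."""
    return max(codebook, key=lambda code: len(doc & codebook[code]))

def classify_via_codes(labeled_en, translated_zh, codebook):
    """Label each machine-translated Chinese page with the majority label
    of the labeled English pages that were encoded to the same code."""
    votes = {}
    for doc, label in labeled_en:
        votes.setdefault(encode(doc, codebook), Counter())[label] += 1
    return [votes[encode(doc, codebook)].most_common(1)[0][0]
            for doc in translated_zh]
```

The point the sketch preserves is the abstract's mechanism: because the bottleneck can keep only limited information, translated Chinese pages and English pages that share semantic content are forced onto the same code, and the English labels then transfer through that code.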