Deep Learning for Period Classification of Historical Texts
In this study, we address the task of classifying historical texts by their assumed period of writing. This task is useful in the digital humanities, where many texts have unidentified publication dates. For years, the typical approach to temporal text classification was supervised machine learning. These algorithms require careful feature engineering and considerable domain expertise to design a feature extractor that transforms the raw text into a feature vector from which the classifier can learn to classify unseen input. Recently, deep learning has produced extremely promising results for various tasks in natural language processing (NLP). The primary advantage of deep learning is that the feature layers are not designed by human engineers; instead, the features are learned from data with a general-purpose learning procedure. We investigated deep learning models for period classification of historical texts. We compared three common models: paragraph vectors, convolutional neural networks (CNN), and recurrent neural networks (RNN). We demonstrate that the CNN and RNN models outperformed the paragraph vector model and supervised machine-learning algorithms. In addition, we constructed word embeddings for each time period and analyzed semantic changes of word meanings over time.
Social media and Russian territorial irredentism: some facts and a conjecture
Religion, division of labor and conflict : anti-semitism in Germany over 600 years
We study the role of economic incentives in shaping the co-existence of Jews, Catholics and Protestants, using novel data from Germany for 1,000+ cities. The Catholic usury ban and higher literacy rates gave Jews a specific advantage in the moneylending sector. Following the Protestant Reformation (1517), the Jews lost these advantages in regions that became Protestant. We show 1) a change in the geography of anti-Semitism with persecutions of Jews and anti-Jewish publications becoming more common in Protestant areas relative to Catholic areas; 2) a more pronounced change in cities where Jews had already established themselves as moneylenders. These findings are consistent with the interpretation that, following the Protestant Reformation, Jews living in Protestant regions were exposed to competition with the Christian majority, especially in moneylending, leading to an increase in anti-Semitism
Traditional forces of exclusion: A review of the quantitative literature on the economic situation of indigenous peoples, Afro-descendants, and people with disabilities
(Available in English) The unequal distribution of wealth in Latin America and the Caribbean is linked to the unequal distribution of assets (human and physical) and to differential access to markets and services. These circumstances, and the accompanying social tensions, must be understood in terms of traditional forces of exclusion; the segments of the population that experience unfavorable outcomes can also be identified by characteristics such as ethnicity, race, gender, and physical disability. In addition to reviewing the literature on social exclusion, this paper reviews several topics: (i) relative deprivation (in land and housing, physical infrastructure, health, and income); (ii) labor market issues, including access to markets in general, as well as informality, segregation, and discrimination; (iii) the transaction points of political representation, social protection, and violence; and (iv) areas in which the analysis is still weak and avenues for further research in the region.
The Near-Synonymous Classifiers in Mandarin Chinese: Etymology, Modern Usage, And Possible Problems in L2 Classroom
Many Chinese classifiers are nearly synonymous: they can be used with the same head nouns without changing the meaning of the sentence, so they are interchangeable or almost interchangeable. This poses a challenge for learners of Chinese, especially those whose native language lacks this grammatical category. Another complication arises from the ambiguous English translations of many classifiers.
In this paper we investigate the collocation behavior of near-synonymous Chinese classifiers, focusing on their semantic nuances and interchangeability. We analyze 6 pairs of classifiers drawn from the HSK exam glossary: 栋 and 幢, 匹 and 头, 批 and 派, 颗 and 粒, 辆 and 台, and 根 and 支. The dataset for this study encompasses 1200 samples (100 per classifier) and 416 distinct head nouns.
Through a corpus-based approach we analyze the collocation behavior of each classifier on its own and as part of its pair. The results show that not all pairs exhibit complete interchangeability. The collocation behavior of 批 and 派 differs significantly: 批 primarily quantifies batches with a 'first' connotation, while 派 is used more in artistic expressions. The interchangeability of 栋 and 幢 varies with context. 幢 emerges as the least frequent morpheme in the corpus, emphasizing its specific contextual usage. While both are used in address lines, 栋 predominantly quantifies standalone buildings, whereas 幢 is more aligned with larger architectural complexes. The analysis of 匹 and 头 highlights their distinctiveness, with 匹 counting horses and wolves and 头 being more versatile with various animals. 颗 and 粒 appear partially interchangeable, particularly with 珠-related head nouns and items associated with plants, fruits, and trees. The research also underscores that 辆 is primarily linked to car-related nouns, while 台 is used more broadly as a classifier for machines and electronic devices, including computers, printers, phones, and cameras. 根 and 支 overlap only in the head noun 笔, and their roles diverge, with 根 being a versatile classifier and 支 also appearing as part of medical terms.
Sentiment analysis for hate speech detection on social media: TF-IDF weighted N-Grams based approach
Thesis submitted in partial fulfillment of the requirements for the Degree of Master of Science in Information Technology (MSIT) at Strathmore University. Hate speech on social media has unfortunately become a common occurrence in the Kenyan online community, largely due to advances in mobile computing and the internet. Incidents of hate speech on social media can spread quickly among online users and escalate into acts of violence and hate crimes through incitement, as was the case during the 2007-2008 Post Election Violence. With the upcoming, highly contested 2017 general elections, monitoring social media platforms is of critical importance so that hate speech can be detected as soon as possible and further escalation into violence prevented. Current efforts by the National Cohesion and Integration Commission to monitor hate speech on social media involve the use of web crawlers to collect possible instances of hate speech based on specific keywords. Human monitors then have to analyze the collected data to determine which instances are actually hate speech. This human analysis is not only time consuming and overwhelming but also introduces subjective notions of what constitutes hate speech. This research proposed the application of machine learning techniques to build a binary text classifier to detect hate speech on Twitter. Hate speech data was collected and labelled to build the corpora. A Support Vector Machine model was trained and validated on the labelled text data using unigram features and term frequency-inverse document frequency weighting. The research employed an experimental approach to determine which combination of features, weighting schemes and classifiers gives the best performance on the collected hate speech data.
Bigram features weighted using term frequency-inverse document frequency fed into a Support Vector Machine classifier gave the best classification performance at an accuracy of 76.22 percent, with an area under the curve of 0.76 for a Receiver Operating Characteristic curve
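As a rough illustration of the feature step described in this abstract, the following sketch computes TF-IDF weighted bigram features in plain Python. It is not the thesis's implementation: the whitespace tokenizer, the unsmoothed `tf * log(N / df)` weighting, the function names, and the three sample sentences are all assumptions made for the example, and the Support Vector Machine that would consume these vectors is left out.

```python
import math
from collections import Counter


def bigrams(text):
    """Split on whitespace (an assumed tokenizer) and return adjacent word pairs."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf_bigram_vectors(documents):
    """Return one sparse {bigram: weight} dict per document.

    Weight = (bigram count / bigrams in the document) * log(N / document frequency).
    This is one common TF-IDF variant, chosen here as an assumption.
    """
    doc_bigrams = [bigrams(doc) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents each bigram appears at least once.
    df = Counter()
    for grams in doc_bigrams:
        df.update(set(grams))

    vectors = []
    for grams in doc_bigrams:
        counts = Counter(grams)
        total = len(grams)
        vectors.append(
            {g: (c / total) * math.log(n_docs / df[g]) for g, c in counts.items()}
        )
    return vectors


# Hypothetical sample texts, not data from the thesis.
docs = ["we will vote tomorrow", "we will win", "vote early tomorrow"]
vectors = tfidf_bigram_vectors(docs)
```

Bigrams that occur in several documents (here "we will") receive a lower weight than bigrams unique to one document, which is the property a downstream classifier relies on.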