LFTK: Handcrafted Features in Computational Linguistics
Past research has identified a rich set of handcrafted linguistic features
that can potentially assist various tasks. However, their extensive number
makes it difficult to effectively select and utilize existing handcrafted
features. Implementations are also inconsistent across research works, and there
is no categorization scheme or generally accepted set of feature names, which
creates unwanted confusion. Moreover, most existing handcrafted feature
extraction libraries are not open-source or not actively maintained. As
a result, a researcher often has to build such an extraction system from the
ground up.
We collect and categorize more than 220 popular handcrafted features grounded
in past literature. Then, we conduct a correlation analysis study on several
task-specific datasets and report the potential use cases of each feature.
Lastly, we devise a multilingual handcrafted linguistic feature extraction
system in a systematically expandable manner. We open-source our system for
public access to a rich set of pre-implemented handcrafted features. Our system,
named LFTK, is the largest of its kind. Find it at
github.com/brucewlee/lftk.
Comment: BEA @ ACL 202
Guttate leukoderma and acrokeratosis verruciformis of Hopf: a rare combination in Darier disease
A distinct Darier phenotype presenting with confetti-like hypopigmented macules was first described in 1965. Designated as "guttate leukoderma," this skin finding is a rarely reported presentation of Darier disease. It has been theorized that the mutation in ATP2A2 causes defective E-cadherin, which in turn disrupts the adhesion of melanocytes to keratinocytes, leading to impaired dendrite formation, hindered melanin transfer, and ultimately melanocyte apoptosis. Herein, we contribute a case of a 56-year-old woman who presented with the rarely described guttate leukoderma of Darier disease and acrokeratosis verruciformis of Hopf.
A Side-by-side Comparison of Transformers for English Implicit Discourse Relation Classification
Though discourse parsing can help multiple NLP fields, no broad search over
language models has been done for implicit discourse relation classification.
This hinders researchers from fully utilizing publicly available models in
discourse analysis. This work is a straightforward comparison of the discourse
performance of seven fine-tuned pre-trained language models. We use PDTB-3, a
popular discourse relation annotated dataset. Through our model search, we raise
SOTA to 0.671 ACC and obtain novel observations. Some run contrary to what has
been reported before (Shi and Demberg, 2019b): sentence-level pre-training
objectives (NSP, SBO, SOP) generally fail to produce the best-performing model
for implicit discourse relation classification. Counterintuitively,
similar-sized PLMs with MLM and full attention led to better performance.
Comment: TrustNLP @ ACL 202
Data augmentation and semi-supervised learning for deep neural networks-based text classifier
User feedback is essential for understanding user needs. In this paper, we use free text obtained from a survey on sleep-related issues to build a deep neural network-based text classifier. However, training a deep neural network model requires a lot of labelled data. To reduce manual data labelling, we propose a method that combines data augmentation and pseudo-labelling: data augmentation is applied to labelled data to increase the size of the initial training set, and the trained model is then used to annotate unlabelled data with pseudo-labels. The results show that the model with data augmentation achieves a macro-averaged F1 score of 65.2% using 4,300 training examples, whereas the model without data augmentation achieves a macro-averaged F1 score of 68.2% with around 14,000 training examples. Furthermore, when combined with pseudo-labelling, the model achieves a macro-averaged F1 score of 62.7% using only 1,400 labelled training examples. In other words, the proposed method reduces the amount of labelled data needed for training while achieving relatively good performance.
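The two-stage idea in the abstract above (augment the labelled data, train, then pseudo-label unlabelled text) can be sketched roughly as follows. This is only an illustration and not the authors' implementation: the word-dropping `augment` function, the toy word-count classifier, the confidence `threshold`, and all function names are assumptions introduced here, standing in for the deep neural network and the augmentation technique the paper actually uses.

```python
import random
from collections import Counter


def augment(text, rng, p_drop=0.2):
    """Hypothetical augmentation: randomly drop words from the text."""
    kept = [w for w in text.split() if rng.random() > p_drop]
    return " ".join(kept) if kept else text


def train(examples):
    """Toy stand-in for the classifier: per-label word counts."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def predict(model, text):
    """Return (label, confidence); confidence is the winner's share of the total score."""
    words = text.lower().split()
    scores = {label: 1 + sum(c[w] for w in words) for label, c in model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


def pseudo_label_round(labelled, unlabelled, n_aug=2, threshold=0.6, seed=0):
    """One round of augmentation followed by pseudo-labelling.

    The confidence threshold is an assumption of this sketch; the abstract
    does not say how pseudo-labels are filtered.
    """
    rng = random.Random(seed)
    # Step 1: enlarge the labelled set with augmented copies.
    train_set = list(labelled)
    for text, label in labelled:
        train_set += [(augment(text, rng), label) for _ in range(n_aug)]
    # Step 2: train on the enlarged set.
    model = train(train_set)
    # Step 3: annotate unlabelled texts, keeping only confident predictions.
    pseudo = []
    for text in unlabelled:
        label, conf = predict(model, text)
        if conf >= threshold:
            pseudo.append((text, label))
    # Step 4: retrain on labelled, augmented, and pseudo-labelled data.
    return train(train_set + pseudo), pseudo


labelled = [
    ("cannot fall asleep at night", "insomnia"),
    ("loud snoring wakes me", "snoring"),
]
unlabelled = ["snoring is loud", "hard to fall asleep"]
model, pseudo = pseudo_label_round(labelled, unlabelled)
print(pseudo)
```

In practice the toy classifier would be replaced by the actual network, and the filtering rule would need tuning on held-out labelled data.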
Sporting Faith: Exploring Displays of Faith as Part of Christian Higher Education Athletic Program Identity
Contemporary higher education is a marketplace in which institutions aggressively market themselves to student consumers who “shop” for school options (Tolbert, 2014). This study examines the marketing of faith-based higher education institutions’ athletic programs to determine how faith-related missions are revealed on institutional websites. The institutions analyzed were the 112 of the 141 member institutions of the Council for Christian Colleges & Universities (CCCU) that compete in sanctioned intercollegiate athletics programs (e.g., NCAA, NAIA, NCCAA). The study attempted to quantify how strongly a university’s athletic program portrays the faith dimension of the school’s identity through the visual marketing tool of the athletic department’s website, to determine whether that measure is indicative of external perception. Institutional websites were examined to measure the strength of faith identity presented on the sites using a content analysis of the university tagline, university mission statement, and athletic department mission statement. Faith expression was lacking in 53% of taglines and 33% of athletic department mission statements. The results suggest that CCCU member institutions should carry the faith expression of the university mission statement into the message conveyed in the tagline and the athletic department mission statement.
Associations among Human-Associated Fecal Contamination, Microcystis aeruginosa, and Microcystin at Lake Erie Beaches
Lake Erie beaches exhibit impaired water quality due to fecal contamination and cyanobacterial blooms, though few studies address potential relationships between these two public health hazards. Using quantitative polymerase chain reaction (qPCR), Microcystis aeruginosa was monitored in conjunction with a human-associated fecal marker (Bacteroides fragilis group; g-Bfra), microcystin, and water quality parameters at two beaches to evaluate their potential associations. During the summer of 2010, water samples were collected 32 times from both Euclid and Villa Angela beaches. The phycocyanin intergenic spacer (PC-IGS) and the microcystin-producing (mcyA) gene in M. aeruginosa were quantified with qPCR. PC-IGS and mcyA were detected in 50.0% and 39.1% of samples, respectively, and showed increased occurrences after mid-August. Correlation and regression analyses showed that water temperature was negatively correlated with M. aeruginosa markers and microcystin. The densities of mcyA and the g-Bfra were predicted by nitrate, implicating fecal contamination as contributing to the growth of M. aeruginosa by nitrate loading. Microcystin was correlated with mcyA (r = 0.413, p < 0.01), suggesting toxin-producing M. aeruginosa populations may significantly contribute to microcystin production. Additionally, microcystin was correlated with total phosphorus (r = 0.628, p < 0.001), which was higher at Euclid (p < 0.05), possibly contributing to higher microcystin concentrations at Euclid.
K2-231 b: A sub-Neptune exoplanet transiting a solar twin in Ruprecht 147
We identify a sub-Neptune exoplanet ( R)
transiting a solar twin in the Ruprecht 147 star cluster (3 Gyr, 300 pc, [Fe/H]
= +0.1 dex). The ~81 day light curve for EPIC 219800881 (V = 12.71) from K2
Campaign 7 shows six transits with a period of 13.84 days, a depth of ~0.06%,
and a duration of ~4 hours. Based on our analysis of high-resolution MIKE
spectra, broadband optical and NIR photometry, the cluster parallax and
interstellar reddening, and isochrone models from PARSEC, Dartmouth, and MIST,
we estimate the following properties for the host star: M, R, and K. This star appears to be single, based on our modeling of the
photometry, the low radial velocity variability measured over nearly ten years,
and Keck/NIRC2 adaptive optics imaging and aperture-masking interferometry.
Applying a probabilistic mass-radius relation, we estimate that the mass of
this planet is M, which would cause a RV
semi-amplitude of m s that may be measurable with existing
precise RV facilities. After statistically validating this planet with BLENDER,
we now designate it K2-231 b, making it the second sub-stellar object to be
discovered in Ruprecht 147 and the first planet; it joins the small but growing
ranks of 23 other planets found in open clusters.
Comment: 24 pages, 7 figures, light curve included as separate fil
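As a rough plausibility check on the transit figures quoted in the abstract above, the sketch below counts how many transits can fit in an ~81 day light curve at a 13.84 day period and converts a ~0.06% depth into an approximate planet radius. It assumes a simple box-shaped transit (depth ≈ (Rp/R*)^2, ignoring limb darkening and impact parameter) and a stellar radius of exactly 1 solar radius as a stand-in for "solar twin". Neither assumption comes from the paper, and the function names are hypothetical.

```python
import math

R_SUN_KM = 695_700.0   # nominal solar radius
R_EARTH_KM = 6_378.1   # nominal equatorial Earth radius


def max_transits(baseline_days, period_days):
    """Largest number of transits that can fall inside an observing window."""
    return math.floor(baseline_days / period_days) + 1


def planet_radius_earth(depth_fraction, star_radius_sun=1.0):
    """Planet radius in Earth radii from depth ~ (Rp/R*)^2 (box-shaped transit)."""
    ratio = math.sqrt(depth_fraction)
    return ratio * star_radius_sun * R_SUN_KM / R_EARTH_KM


n = max_transits(81.0, 13.84)       # values quoted in the abstract
rp = planet_radius_earth(0.0006)    # ~0.06% depth, assumed R* = 1 R_sun
print(f"up to {n} transits, Rp ~ {rp:.2f} Earth radii")
```

Under these assumptions the estimate falls between the radii of Earth and Neptune, in line with the sub-Neptune classification. The paper's own value rests on its fitted stellar radius and a full transit model, so it need not match this back-of-envelope number.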