    Generate Your Counterfactuals: Towards Controlled Counterfactual Generation for Text

    Machine learning has seen tremendous growth recently, leading to wider adoption of ML systems in educational assessment, credit risk, healthcare, employment, and criminal justice, to name a few. The trustworthiness of these ML and NLP systems is crucial and requires assurance that the decisions they make are fair and robust. In line with this, we propose GYC, a framework for generating counterfactual text samples, which are essential for testing such ML systems. Our main contributions are: a) we introduce GYC, a framework that generates counterfactual samples that are plausible, diverse, goal-oriented, and effective; b) the generated counterfactuals can direct the generation toward a corresponding condition, such as a named-entity tag, semantic role label, or sentiment. Experimental results across several domains show that GYC produces counterfactual text samples exhibiting the above four properties. These counterfactuals can act as test cases to evaluate a model and any text debiasing algorithm.
    Comment: Accepted at the AAAI Conference on Artificial Intelligence (AAAI 2021).
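
    For illustration only, here is a minimal sketch of the attribute-directed counterfactual idea described in the abstract: propose small edits to a sentence and keep those whose predicted attribute (here, sentiment) matches a target label. This is not the GYC method itself, which steers a generative model rather than masking tokens; it is a simpler mask-and-fill stand-in built from off-the-shelf Hugging Face pipelines, and the model choices and top_k value are assumptions.

```python
# Hedged illustration: label-flipping counterfactuals via mask-and-fill.
# NOT the GYC method (which perturbs a decoder's generation process); this only
# demonstrates "direct the edit toward a target condition" with generic pipelines.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")   # assumed model choice
clf = pipeline("sentiment-analysis")                  # default sentiment classifier

def counterfactuals(sentence, target_label="POSITIVE", top_k=5):
    """Return edited sentences whose predicted sentiment matches target_label."""
    words = sentence.split()
    results = []
    for i in range(len(words)):
        # Mask one word at a time and let the fill-mask model propose replacements.
        masked = " ".join(words[:i] + [fill.tokenizer.mask_token] + words[i + 1:])
        for cand in fill(masked, top_k=top_k):
            edited = cand["sequence"]
            pred = clf(edited)[0]
            if pred["label"] == target_label:
                results.append((edited, pred["score"]))
    return sorted(results, key=lambda x: -x[1])

print(counterfactuals("The service at this restaurant was terrible.")[:3])
```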

    Do Large Scale Molecular Language Representations Capture Important Structural Information?

    Predicting the chemical properties of a molecule is of great importance in many applications, including drug discovery and materials design. Machine-learning-based molecular property prediction holds the promise of accurate predictions at a much lower computational cost than, for example, Density Functional Theory (DFT) calculations. Various supervised representation learning methods, including features extracted with graph neural nets, have emerged for such tasks. However, the vast chemical space and the limited availability of labels make supervised learning challenging, calling for a general-purpose molecular representation. Recently, transformer-based language models pre-trained on large unlabeled corpora have produced state-of-the-art results on many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer. The model employs a linear attention mechanism coupled with highly parallelized training on SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation outperforms supervised and unsupervised graph neural net baselines on several regression and classification tasks across 10 benchmark datasets, while performing competitively on the others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer indeed learns a molecule's local and global structural aspects. These results provide encouraging evidence that large-scale molecular language models can capture sufficient structural information to predict diverse molecular properties, including quantum-chemical properties.
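
    As a rough sketch of the linear attention mechanism mentioned in the abstract, the snippet below implements kernelized attention in the style of Katharopoulos et al. ("Transformers are RNNs"): attention is rewritten as phi(Q)(phi(K)^T V) with an elu+1 feature map, so the cost grows linearly rather than quadratically in sequence length. MoLFormer pairs an attention of this family with rotary position embeddings; the feature map, tensor shapes, and toy input below are illustrative assumptions, not the released model code.

```python
# Hedged sketch of non-causal (encoder-style) kernelized linear attention.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, heads, seq_len, dim); returns same shape as v."""
    q = F.elu(q) + 1                                   # positive feature map phi(.)
    k = F.elu(k) + 1
    kv = torch.einsum("bhnd,bhne->bhde", k, v)          # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)  # normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

# Toy check on random "SMILES token" embeddings (shapes are assumptions).
q = k = v = torch.randn(2, 8, 512, 64)
out = linear_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 512, 64])
```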