In this paper we present a clean, yet effective, model for word sense
disambiguation. Our approach leverage a bidirectional long short-term memory
network which is shared between all words. This enables the model to share
statistical strength and to scale well with vocabulary size. The model is
trained end-to-end, directly from the raw text to sense labels, and makes
effective use of word order. We evaluate our approach on two standard datasets,
using identical hyperparameter settings, which are in turn tuned on a third set
of held out data. We employ no external resources (e.g. knowledge graphs,
part-of-speech tagging, etc), language specific features, or hand crafted
rules, but still achieve statistically equivalent results to the best
state-of-the-art systems, that employ no such limitations