This week a new paper in Nature (see here for ungated coverage) shows that natural language processing (NLP) techniques can read an entire corpus of journals and make discoveries years ahead of published research. Looking at materials science, the authors trained a model to “discover” compounds that have interesting properties.
People have applied machine learning to materials science before, but doing so required a huge amount of labour to create (small) labelled datasets, which capture only a tiny fraction of our knowledge of materials. The NLP method avoids this: it’s completely unsupervised, so it can read every published article in the field without human input. The technique used is based on
Word2Vec - a kind of model that encodes word meanings as vectors, so that we can measure things like the “closeness” of related concepts. This lets us perform what you might crudely call word arithmetic - e.g. we can ask what “King - Man + Woman” is and get the answer “Queen”! (See here for some fun examples.)
It’s exciting to think that so much opportunity for discovery is already embedded in existing knowledge - not just in materials science, but in medicine, engineering and more. The future is already here; we’ve just not (machine) learned it yet.