1. Introduction
Unstructured data, mainly text, remains an untapped resource with tremendous potential for insight. Recent technological advances have transformed the Natural Language Technology (NLT) landscape, specifically the use of deep learning techniques to build transformer language models such as SciBERT [1] and GPT-3 [2]. These new NLTs are being adopted in private industry to improve operations and services [3], with most uses focused on improving customer service. Operational industry applications include speech-to-text and text-to-speech translation, automated text classification, sentiment analysis of feedback, comments, and financial reports, topic modeling, text summarization, personalization, and cognitive assistants (contextual chatbots). These technologies also serve as the basis for insight engines, which combine search capabilities with artificial intelligence to deliver actionable insights from the full spectrum of content and data sourced within and outside an enterprise [3].

Earth science has no shortage of unstructured data: almost all knowledge within Earth science is published as journal or conference papers. However, limited effort has focused on harnessing this potential resource for knowledge extraction and for supporting the scientific process.

This paper examines four aspects of language models in Earth science. First, it surveys the use of language models across different science areas to provide context. Second, it describes BERT-E, an Earth science-specific language model. Third, it presents the challenges of developing robust benchmarks for evaluating language models such as BERT-E. Finally, it explores the use of BERT-E in ongoing prototyping efforts and future applications to support the scientific process.