-
Longformer: The Long-Document Transformer, Iz Beltagy, Matthew E. Peters, Arman Cohan, 10-04-2020
Computation and Language
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
-
Transformer-based models are unable to process long sequences due to their self-attention, which scales quadratically with the sequence length
-
Longformer introduces an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer
-
The attention mechanism combines local windowed attention with a task-motivated global attention (see the sketch at the end of this entry)
-
Following prior work on long-sequence transformers, Longformer is evaluated on character-level language modeling and achieves state-of-the-art results on text8 and enwik8; unlike most prior work, it is also pretrained and finetuned on a variety of downstream tasks
-
Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA
-
Finally, we introduce the Longformer-Encoder-Decoder (LED), a variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
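-
A minimal sketch of the combined local-window plus global attention pattern described above, assuming a PyTorch setting. It materializes the full mask only to make the sparsity visible; the actual implementation never builds an n x n matrix, which is what keeps the cost linear in sequence length:

```python
import torch

def longformer_style_mask(seq_len, window, global_token_ids):
    """Boolean mask: True where attention is allowed (illustrative only)."""
    idx = torch.arange(seq_len)
    # Local band: each token attends to neighbours within +/- window // 2.
    mask = (idx[None, :] - idx[:, None]).abs() <= window // 2
    # Global tokens attend to every position and every position attends to them.
    for g in global_token_ids:
        mask[g, :] = True
        mask[:, g] = True
    return mask  # shape: (seq_len, seq_len)

# Example: 4096 tokens, a 512-token window, and a [CLS]-style global token at 0.
attn_mask = longformer_style_mask(4096, 512, global_token_ids=[0])
```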
-
-
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, Omar Khattab, Matei Zaharia, 27-04-2020
Information Retrieval, Computation and Language
Recent progress in Natural Language Understanding (NLU) is driving fast-paced advances in Information Retrieval (IR), largely owed to fine-tuning deep language models (LMs) for document ranking. While remarkably effective, the ranking models based on these LMs increase computational cost by orders of magnitude over prior approaches, particularly as they must feed each query-document pair through a massive neural network to compute a single relevance score. To tackle this, we present ColBERT, a novel ranking model that adapts deep LMs (in particular, BERT) for efficient retrieval. ColBERT introduces a late interaction architecture that independently encodes the query and the document using BERT and then employs a cheap yet powerful interaction step that models their fine-grained similarity. By delaying and yet retaining this fine-granular interaction, ColBERT can leverage the expressiveness of deep LMs while simultaneously gaining the ability to pre-compute document representations offline, considerably speeding up query processing. Beyond reducing the cost of re-ranking the documents retrieved by a traditional model, ColBERT's pruning-friendly interaction mechanism enables leveraging vector-similarity indexes for end-to-end retrieval directly from a large document collection. We extensively evaluate ColBERT using two recent passage search datasets. Results show that ColBERT's effectiveness is competitive with existing BERT-based models (and outperforms every non-BERT baseline), while executing two orders-of-magnitude faster and requiring four orders-of-magnitude fewer FLOPs per query.
-
ColBERT is a novel ranking model that adapts deep language models (LMs) for efficient retrieval, leveraging the expressiveness of deep LMs while reducing computational cost by orders of magnitude
-
It introduces a late interaction architecture that independently encodes the query and the document using BERT and then applies a cheap yet powerful interaction step that models their fine-grained similarity (see the sketch below)
-
ColBERT's effectiveness is competitive with existing BERT-based models and outperforms every non-BERT baseline, while executing two orders of magnitude faster and requiring four orders of magnitude fewer FLOPs per query.
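-
A minimal sketch of the late interaction scoring rule (maximum similarity per query token, summed over query tokens), assuming token embeddings have already been produced; the dimensions and random tensors below are placeholders, not the paper's configuration:

```python
import torch

def colbert_maxsim_score(query_embs, doc_embs):
    """For each query token embedding, take its maximum cosine similarity over
    all document token embeddings, then sum over query tokens."""
    q = torch.nn.functional.normalize(query_embs, dim=-1)  # (Lq, d)
    d = torch.nn.functional.normalize(doc_embs, dim=-1)    # (Ld, d)
    sim = q @ d.T                        # (Lq, Ld) token-level similarities
    return sim.max(dim=1).values.sum()   # MaxSim per query token, then sum

# Document embeddings can be precomputed offline; only the cheap matrix
# product above has to run at query time.
score = colbert_maxsim_score(torch.randn(32, 128), torch.randn(180, 128))
```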
-
-
Language Models are Few-Shot Learners, Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, 28-05-2020
Computation and Language
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
-
Recent work has shown significant gains on many NLP tasks and benchmarks by pre-training on a large corpus of text and fine-tuning on specific tasks
-
Scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches
-
GPT-3 is an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and it achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. It is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text (see the sketch below)
-
However, GPT-3's few-shot learning still struggles on some datasets, and GPT-3 faces methodological issues related to training on large web corpora
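-
A minimal sketch of how a task and its few-shot demonstrations can be specified purely via text. The "Q:/A:" layout is an assumption for illustration, not the paper's exact prompt template:

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Assemble a plain-text prompt: task description, worked examples, new input.
    No gradient updates are involved; the model only conditions on this text."""
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines += [f"Q: {source}", f"A: {target}", ""]
    lines += [f"Q: {query}", "A:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
# The completion generated after the final "A:" is taken as the answer.
```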
-
-
It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners, Timo Schick, Hinrich Schütze, 15-09-2020
Computation and Language, Artificial Intelligence, Machine Learning
When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much "greener" in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models.
-
Pretrained language models such as GPT-3 achieve remarkable few-shot performance when scaled to hundreds of billions of parameters
-
However, large amounts of compute are required for training and applying such models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them
-
Performance similar to GPT-3 can be obtained with much smaller, "greener" language models by converting textual inputs into cloze questions that contain a task description, combining this with gradient-based optimization, and exploiting unlabeled data for further improvements (see the sketch below)
-
The authors identify key factors required for successful natural language understanding with small language models.
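-
A minimal sketch of reformulating an input as a cloze question, assuming a sentiment task; the pattern and label words are illustrative, not the paper's exact ones:

```python
def to_cloze(review_text, mask_token="[MASK]"):
    """Render an example as a cloze question with a short task description baked
    into the pattern, plus a verbalizer mapping label words to classes."""
    pattern = f"{review_text} All in all, it was {mask_token}."
    verbalizer = {"great": "positive", "terrible": "negative"}
    return pattern, verbalizer

pattern, verbalizer = to_cloze("The plot was gripping and the acting superb.")
# A pretrained masked language model scores the candidate label words at the
# mask position; the class of the highest-scoring word is the prediction, and
# gradient-based optimization fine-tunes the model on the few labeled examples.
```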
-
-
FLIN: A Flexible Natural Language Interface for Web Navigation, Sahisnu Mazumder, Oriana Riva, 24-10-2020
Computation and Language, Artificial Intelligence
AI assistants can now carry out tasks for users by directly interacting with website UIs. Current semantic parsing and slot-filling techniques cannot flexibly adapt to many different websites without being constantly re-trained. We propose FLIN, a natural language interface for web navigation that maps user commands to concept-level actions (rather than low-level UI actions), thus being able to flexibly adapt to different websites and handle their transient nature. We frame this as a ranking problem: given a user command and a webpage, FLIN learns to score the most relevant navigation instruction (involving action and parameter values). To train and evaluate FLIN, we collect a dataset using nine popular websites from three domains. Our results show that FLIN was able to adapt to new websites in a given domain.
-
FLIN is a natural language interface for web navigation that maps user commands to concept-level actions rather than low-level UI actions
-
Framed as a ranking problem, FLIN learns to score the most relevant navigation instruction (involving an action and parameter values) given a user command and a webpage (see the sketch below)
-
To train and evaluate FLIN, we collected a dataset using nine popular websites from three domains and found that it was able to adapt to new websites in a given domain.
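-
A minimal sketch of the ranking formulation only, with hypothetical types and a placeholder scoring function standing in for the learned matching model:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class NavigationInstruction:
    action: str                  # concept-level action exposed by the page
    parameters: Dict[str, str]   # parameter name -> value drawn from the command

def rank_instructions(command: str,
                      candidates: List[NavigationInstruction],
                      score_fn: Callable[[str, NavigationInstruction], float]):
    """Return candidate instructions ordered by relevance to the user command."""
    return sorted(candidates, key=lambda c: score_fn(command, c), reverse=True)

ranked = rank_instructions(
    "find sushi places near downtown",
    [NavigationInstruction("search for a restaurant",
                           {"cuisine": "sushi", "location": "downtown"}),
     NavigationInstruction("make a reservation", {})],
    score_fn=lambda cmd, inst: 0.0,  # placeholder for the learned scorer
)
```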
-
-
Automatically Identifying Words That Can Serve as Labels for Few-Shot Text Classification, Timo Schick, Helmut Schmid, Hinrich Schütze, 26-10-2020
Computation and Language, Artificial Intelligence, Machine Learning
A recent approach for few-shot text classification is to convert textual inputs to cloze questions that contain some form of task description, process them with a pretrained language model and map the predicted words to labels. Manually defining this mapping between words and labels requires both domain expertise and an understanding of the language model's abilities. To mitigate this issue, we devise an approach that automatically finds such a mapping given small amounts of training data. For a number of tasks, the mapping found by our approach performs almost as well as hand-crafted label-to-word mappings.
-
A recent approach for few-shot text classification converts textual inputs to cloze questions that contain some form of task description, processes them with a pretrained language model, and maps the predicted words to labels
-
Manually defining this mapping requires domain expertise and an understanding of the language model's abilities
-
To mitigate this issue, we developed an approach that automatically finds such a mapping given small amounts of training data; it performs almost as well as hand-crafted label-to-word mappings (see the sketch below).
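-
A crude sketch of the kind of mapping being searched for, assuming the MLM's top predictions at the mask position are already available for each training example; the frequency-based selection rule is a simplified stand-in for the paper's procedure:

```python
from collections import defaultdict

def find_label_words(examples, k=1):
    """examples: list of (gold_label, top_predicted_words_at_mask) pairs.
    Pick, per label, the word(s) the MLM most often predicts for that label."""
    counts = defaultdict(lambda: defaultdict(int))
    for gold_label, predicted_words in examples:
        for word in predicted_words:
            counts[gold_label][word] += 1
    return {label: sorted(words, key=words.get, reverse=True)[:k]
            for label, words in counts.items()}

mapping = find_label_words([("positive", ["great", "good"]),
                            ("negative", ["bad", "awful"])])
# e.g. {"positive": ["great"], "negative": ["bad"]}
```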
-
-
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts, Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, Sameer Singh, 29-10-2020
Computation and Language, Machine Learning
The remarkable success of pretrained language models has motivated the study of what kinds of knowledge these models learn during pretraining. Reformulating tasks as fill-in-the-blanks problems (e.g., cloze tests) is a natural approach for gauging such knowledge, however, its usage is limited by the manual effort and guesswork required to write suitable prompts. To address this, we develop AutoPrompt, an automated method to create prompts for a diverse set of tasks, based on a gradient-guided search. Using AutoPrompt, we show that masked language models (MLMs) have an inherent capability to perform sentiment analysis and natural language inference without additional parameters or finetuning, sometimes achieving performance on par with recent state-of-the-art supervised models. We also show that our prompts elicit more accurate factual knowledge from MLMs than the manually created prompts on the LAMA benchmark, and that MLMs can be used as relation extractors more effectively than supervised relation extraction models. These results demonstrate that automatically generated prompts are a viable parameter-free alternative to existing probing methods, and as pretrained LMs become more sophisticated and capable, potentially a replacement for finetuning.
-
AutoPrompt is an automated method to create prompts for a diverse set of tasks based on a gradient-guided search (see the sketch below)
-
It demonstrates that masked language models (MLMs) have an inherent capability to perform sentiment analysis and natural language inference without additional parameters or finetuning, sometimes achieving performance on par with recent state-of-the-art supervised models
-
AutoPrompt's prompts elicit more accurate factual knowledge from MLMs than manually created prompts on the LAMA benchmark, and MLMs can be used as relation extractors more effectively than supervised relation extraction models
-
These results demonstrate that automatically generated prompts are a viable parameter-free alternative to existing probing methods and, as pretrained LMs become more sophisticated and capable, potentially a replacement for finetuning.
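-
A minimal sketch of one step of a gradient-guided token search in the spirit of AutoPrompt, using a first-order (HotFlip-style) estimate of how each vocabulary token would change the loss if it replaced the current trigger token; the shapes and the surrounding training loop are assumptions:

```python
import torch

def candidate_trigger_tokens(embedding_matrix, trigger_grad, k=10):
    """embedding_matrix: (vocab_size, dim) input embeddings.
    trigger_grad: gradient of the task loss w.r.t. the current trigger token's
    input embedding, shape (dim,). Returns the k token ids whose substitution
    is estimated to lower the loss the most; these are then re-scored exactly."""
    estimated_change = embedding_matrix @ trigger_grad  # (vocab_size,)
    return torch.topk(-estimated_change, k).indices

candidates = candidate_trigger_tokens(torch.randn(30522, 768), torch.randn(768))
```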
-
-
Making Pre-trained Language Models Better Few-shot Learners, Tianyu Gao, Adam Fisch, Danqi Chen, 31-12-2020
Computation and Language, Machine Learning
The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF--better few-shot fine-tuning of language models--a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples. Our approach includes (1) prompt-based fine-tuning together with a novel pipeline for automating prompt generation; and (2) a refined strategy for dynamically and selectively incorporating demonstrations into each context. Finally, we present a systematic evaluation for analyzing few-shot performance on a range of NLP tasks, including classification and regression. Our experiments demonstrate that our methods combine to dramatically outperform standard fine-tuning procedures in this low resource setting, achieving up to 30% absolute improvement, and 11% on average across all tasks. Our approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning.
-
The GPT-3 model achieves remarkable few-shot performance by leveraging natural-language prompts and task demonstrations as input context
-
For the more practical scenario of smaller language models, where fine-tuning is computationally efficient, LM-BFF (better few-shot fine-tuning of language models) is a suite of simple and complementary techniques for fine-tuning on a small number of annotated examples
-
Our approach includes prompt-based fine-tuning with a novel pipeline for automating prompt generation, and a refined strategy for dynamically and selectively incorporating demonstrations into each context (see the sketch below)
-
Our systematic evaluation shows that these methods combine to dramatically outperform standard fine-tuning in this low-resource setting, achieving up to 30% absolute improvement and 11% on average across all tasks
-
This approach makes minimal assumptions on task resources and domain expertise, and constitutes a strong task-agnostic method for few-shot learning.
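-
A minimal sketch of assembling a prompt-based fine-tuning input with in-context demonstrations; the template and label words here are illustrative assumptions, not the ones LM-BFF generates automatically:

```python
def lmbff_input(example, demonstrations, mask_token="[MASK]"):
    """Render the example with a template whose label slot is masked, then append
    one demonstration per class rendered with its true label word.
    demonstrations: list of (text, label_word) pairs, one per class."""
    template = "{text} It was {label}."
    parts = [template.format(text=example, label=mask_token)]
    for demo_text, label_word in demonstrations:
        parts.append(template.format(text=demo_text, label=label_word))
    return " ".join(parts)

inp = lmbff_input(
    "A tedious, overlong mess.",
    [("A gorgeous, moving film.", "great"),
     ("Flat characters and a dull plot.", "terrible")],
)
# The model is fine-tuned to predict the correct label word at [MASK].
```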
-
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy, 31-12-2020
Computation and Language
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
-
Increased training dataset diversity improves cross-domain knowledge and downstream generalization capability for large-scale language models
-
The Pile is an 825 GiB English text corpus constructed from 22 diverse high-quality subsets
-
Untuned GPT-2 and GPT-3 struggle on many of the Pile's components, such as academic writing, while models trained on the Pile improve significantly over both Raw CC and CC-100 on all components and improve performance on downstream evaluations
-
Potentially concerning aspects of the data are documented for prospective users, and the code used in the Pile's construction is made publicly available.
-