-
Generating Long Sequences with Sparse Transformers, Rewon Child,Scott Gray,Alec Radford,Ilya Sutskever, 23-04-2019
Categories
Machine Learning, Machine Learning
Abstract
Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to
$O(n \sqrt{n})$ . We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.Bullet Points
-
The paper introduces sparse factorizations of the attention matrix, a variation on architecture and initialization to train deeper networks, recomputation of attention matrices to save memory, and fast attention kernels for training
-
We call networks with these changes Sparse Transformers and show they can model sequences tens of thousands of timesteps long using hundreds of layers
-
We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64
-
We generate unconditional samples that demonstrate global coherence and great diversity, and it is possible in principle to use self-attention to model sequence of length one million or more.
-
-
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, Mohammad Shoeybi,Mostofa Patwary,Raul Puri,Patrick LeGresley,Jared Casper,Bryan Catanzaro, 17-09-2019
Categories
Computation and Language
Abstract
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).
Bullet Points
-
The work presents techniques for training large transformer models and implements an intra-layer model parallel approach that enables training transformer models with billions of parameters
-
The approach is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch
-
We converge transformer based models up to 8.3 billion parameters using 512 GPUs, sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency
-
To demonstrate that large language models can advance the state of the art (SOTA), we train a 3.9 billion parameter transformer language model similar to GPT-2 and BERT
-
Careful attention to layer normalization in BERT-like models is critical to achieving increased performance as the model size grows
-
We achieve SOTA results on WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66
-
-
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, Mike Lewis,Yinhan Liu,Naman Goyal,Marjan Ghazvininejad,Abdelrahman Mohamed,Omer Levy,Ves Stoyanov,Luke Zettlemoyer, 29-10-2019
Categories
Computation and Language, Machine Learning, Machine Learning
Abstract
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
Bullet Points
-
BART is a denoising autoencoder for pretraining sequence-to-sequence models
-
It is trained by corrupting text with an arbitrary noising function and learning a model to reconstruct the original text
-
It uses a standard Tranformer-based neural machine translation architecture that generalizes BERT, GPT, and other recent pretraining schemes
-
The best performance is achieved by randomly shuffling the order of the original sentences and using a novel in-filling scheme where spans of text are replaced with a single mask token
-
BART matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE
-
It also provides a 1.1 BLEU increase over a back-translation system
-
-
How Can We Know What Language Models Know?, Zhengbao Jiang,Frank F. Xu,Jun Araki,Graham Neubig, 28-11-2019
Categories
Computation and Language, Machine Learning
Abstract
.
Bullet Points
-
I'm sorry, but you haven't provided me with any context or information to summarize
-
Please provide me with more details so that I can assist you better in summarizing the information you're looking for
-
Thank you for your help.
-
-
Zero-shot Text Classification With Generative Language Models, Raul Puri,Bryan Catanzaro, 10-12-2019
Categories
Computation and Language
Abstract
This work investigates the use of natural language to enable zero-shot model adaptation to new tasks. We use text and metadata from social commenting platforms as a source for a simple pretraining task. We then provide the language model with natural language descriptions of classification tasks as input and train it to generate the correct answer in natural language via a language modeling objective. This allows the model to generalize to new classification tasks without the need for multiple multitask classification heads. We show the zero-shot performance of these generative language models, trained with weak supervision, on six benchmark text classification datasets from the torchtext library. Despite no access to training data, we achieve up to a 45% absolute improvement in classification accuracy over random or majority class baselines. These results show that natural language can serve as simple and powerful descriptors for task adaptation. We believe this points the way to new metalearning strategies for text problems.
Bullet Points
-
The work investigates the use of natural language to enable zero-shot model adaptation to new tasks by using text and metadata from social commenting platforms as a pretraining task
-
The language model is trained to generate the correct answer in natural language via a language modeling objective, allowing it to generalize to new classification tasks without the need for multiple multitask classification heads
-
The results demonstrate that natural language can serve as simple and powerful descriptors for task adaptation, and this can lead to new metalearning strategies for text problems.
-