Efficient Inference on CPU
This guide focuses on running inference with large models efficiently on CPU.
PyTorch JIT-mode (TorchScript)
TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency. Compared to the default eager mode, JIT mode in PyTorch normally yields better performance for model inference, thanks to optimization techniques such as operator fusion.

For a gentle introduction to TorchScript, see the Introduction to PyTorch TorchScript tutorial.
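As a minimal sketch of what tracing looks like outside of Trainer, the snippet below traces a question-answering model with `torch.jit.trace`; the checkpoint matches the commands later in this guide, and the question/context pair is made up for illustration:

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# torchscript=True makes the model return tuples instead of dicts,
# which is required for tracing.
model_name = "csarron/bert-base-uncased-squad-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name, torchscript=True)
model.eval()

# Example inputs whose shapes define the traced graph.
inputs = tokenizer(
    "Who maintains PyTorch?",
    "PyTorch is maintained by Meta AI and the open-source community.",
    return_tensors="pt",
)

with torch.no_grad():
    traced_model = torch.jit.trace(
        model, (inputs["input_ids"], inputs["attention_mask"], inputs["token_type_ids"])
    )
    # The traced graph now runs without Python interpreter overhead.
    start_logits, end_logits = traced_model(
        inputs["input_ids"], inputs["attention_mask"], inputs["token_type_ids"]
    )
```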
IPEX Graph Optimization with JIT-mode
Intel® Extension for PyTorch (IPEX) provides further JIT-mode optimizations for Transformers models, and users are highly recommended to take advantage of it. Frequently used operator patterns from Transformers models are already supported as JIT-mode fusions in IPEX, including Multi-head-attention fusion, Concat Linear, Linear + Add, Linear + Gelu, and Add + LayerNorm. These fusions are enabled transparently, so their benefit is delivered to users without code changes. According to the analysis, ~70% of the most popular NLP tasks in question-answering, text-classification, and token-classification can get performance benefits from these fusion patterns, for both Float32 precision and BFloat16 mixed precision.

Check more detailed information in IPEX Graph Optimization.
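Outside of Trainer, IPEX graph optimization follows the same trace-then-run pattern; a minimal sketch, assuming IPEX is installed (the checkpoint and input text are illustrative):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, torchscript=True)
model.eval()

# Apply IPEX operator-level optimizations before tracing.
model = ipex.optimize(model)

inputs = tokenizer("This guide is surprisingly helpful.", return_tensors="pt")
with torch.no_grad():
    # Tracing and freezing let IPEX's fusion passes rewrite the graph.
    traced_model = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))
    traced_model = torch.jit.freeze(traced_model)
    logits = traced_model(inputs["input_ids"], inputs["attention_mask"])[0]
```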
IPEX installation:
IPEX releases follow PyTorch releases. Check the approaches for IPEX installation that match your PyTorch version.
Usage of JIT-mode
To enable JIT mode in Trainer, users should add `jit_mode_eval` to the Trainer command arguments. Take the Transformers question-answering task as an example; a programmatic equivalent is sketched after the two commands below.
Inference using JIT mode on CPU:

```bash
python run_qa.py \
--model_name_or_path csarron/bert-base-uncased-squad-v1 \
--dataset_name squad \
--do_eval \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/ \
--no_cuda \
--jit_mode_eval
```
Inference with IPEX using JIT mode on CPU:

```bash
python run_qa.py \
--model_name_or_path csarron/bert-base-uncased-squad-v1 \
--dataset_name squad \
--do_eval \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/ \
--no_cuda \
--use_ipex \
--jit_mode_eval
```
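For users who build a Trainer directly in Python instead of going through the example scripts, the same switches are exposed on TrainingArguments; a minimal sketch (the model, datasets, and the Trainer itself are omitted):

```python
from transformers import TrainingArguments

# jit_mode_eval traces the model with TorchScript for evaluation;
# use_ipex additionally applies Intel® Extension for PyTorch
# optimizations and requires IPEX to be installed.
training_args = TrainingArguments(
    output_dir="/tmp/",
    do_eval=True,
    no_cuda=True,
    jit_mode_eval=True,
    use_ipex=True,
)
```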