# Hub

## Docs

- [Using Sentence Transformers at Hugging Face](https://huggingface.co/docs/hub/sentence-transformers.md)
- [Giskard on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-giskard.md)
- [Embed the Dataset Viewer in a webpage](https://huggingface.co/docs/hub/datasets-viewer-embed.md)
- [Using OpenCLIP at Hugging Face](https://huggingface.co/docs/hub/open_clip.md)
- [Tabby on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-tabby.md)
- [ChatUI on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-chatui.md)
- [Pricing and Billing](https://huggingface.co/docs/hub/jobs-pricing.md)
- [Displaying carbon emissions for your model](https://huggingface.co/docs/hub/model-cards-co2.md)
- [Spaces ZeroGPU: Dynamic GPU Allocation for Spaces](https://huggingface.co/docs/hub/spaces-zerogpu.md)
- [Webhooks Automation](https://huggingface.co/docs/hub/jobs-webhooks.md)
- [Using mlx-image at Hugging Face](https://huggingface.co/docs/hub/mlx-image.md)
- [Panel on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-panel.md)
- [Access Patterns](https://huggingface.co/docs/hub/storage-buckets-access.md)
- [Dash on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-dash.md)
- [Audit Logs](https://huggingface.co/docs/hub/audit-logs.md)
- [Downloading models](https://huggingface.co/docs/hub/models-downloading.md)
- [The Model Hub](https://huggingface.co/docs/hub/models-the-hub.md)
- [Spaces Dev Mode: Seamless development in Spaces](https://huggingface.co/docs/hub/spaces-dev-mode.md)
- [SQL Console: Query Hugging Face datasets in your browser](https://huggingface.co/docs/hub/datasets-viewer-sql-console.md)
- [Single Sign-On (SSO)](https://huggingface.co/docs/hub/enterprise-sso.md)
- [Managing Spaces with Github Actions](https://huggingface.co/docs/hub/spaces-github-actions.md)
- [Billing](https://huggingface.co/docs/hub/billing.md)
- [More ways to create Spaces](https://huggingface.co/docs/hub/spaces-more-ways-to-create.md)
- [Hugging Face Hub documentation](https://huggingface.co/docs/hub/index.md)
- [Advanced Security](https://huggingface.co/docs/hub/enterprise-advanced-security.md)
- [Using Unity Sentis Models from Hugging Face](https://huggingface.co/docs/hub/unity-sentis.md)
- [FiftyOne](https://huggingface.co/docs/hub/datasets-fiftyone.md)
- [Using timm at Hugging Face](https://huggingface.co/docs/hub/timm.md)
- [Git over SSH](https://huggingface.co/docs/hub/security-git-ssh.md)
- [Using `Transformers.js` at Hugging Face](https://huggingface.co/docs/hub/transformers-js.md)
- [Spaces](https://huggingface.co/docs/hub/spaces.md)
- [Using OpenCV in Spaces](https://huggingface.co/docs/hub/spaces-using-opencv.md)
- [Storage Buckets: Security & Compliance](https://huggingface.co/docs/hub/storage-buckets-security.md)
- [Ingesting Datasets](https://huggingface.co/docs/hub/datasets-ingesting.md)
- [How to Add a Space to ArXiv](https://huggingface.co/docs/hub/spaces-add-to-arxiv.md)
- [Dask](https://huggingface.co/docs/hub/datasets-dask.md)
- [Hub API Endpoints](https://huggingface.co/docs/hub/api.md)
- [Argilla on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla.md)
- [Using 🤗 Datasets](https://huggingface.co/docs/hub/datasets-usage.md)
- [Model Cards](https://huggingface.co/docs/hub/model-cards.md)
- [Evidence on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-evidence.md)
- [Datasets Download Stats](https://huggingface.co/docs/hub/datasets-download-stats.md)
- [User Provisioning (SCIM)](https://huggingface.co/docs/hub/enterprise-scim.md)
- [Using PaddleNLP at Hugging Face](https://huggingface.co/docs/hub/paddlenlp.md)
- [How to configure SCIM with Okta](https://huggingface.co/docs/hub/security-sso-okta-scim.md)
- [Distilabel](https://huggingface.co/docs/hub/datasets-distilabel.md)
- [Gating Group Collections](https://huggingface.co/docs/hub/enterprise-gating-group-collections.md)
- [THE LANDSCAPE OF ML DOCUMENTATION TOOLS](https://huggingface.co/docs/hub/model-card-landscape-analysis.md)
- [Livebook on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-livebook.md)
- [Spaces Custom Domain](https://huggingface.co/docs/hub/spaces-custom-domain.md)
- [Storage Regions on the Hub](https://huggingface.co/docs/hub/storage-regions.md)
- [Basic SSO](https://huggingface.co/docs/hub/security-sso-basic.md)
- [SSO Configuration Guides](https://huggingface.co/docs/hub/security-sso-configuration-guides.md)
- [Storage limits](https://huggingface.co/docs/hub/storage-limits.md)
- [Model Card Guidebook](https://huggingface.co/docs/hub/model-card-guidebook.md)
- [Pandas](https://huggingface.co/docs/hub/datasets-pandas.md)
- [GGUF usage with LM Studio](https://huggingface.co/docs/hub/lmstudio.md)
- [Hugging Face Dataset Upload Decision Guide](https://huggingface.co/docs/hub/datasets-upload-guide-llm.md)
- [How to configure SAML SSO with Google Workspace](https://huggingface.co/docs/hub/security-sso-google-saml.md)
- [Third-party scanner: JFrog](https://huggingface.co/docs/hub/security-jfrog.md)
- [Managing Spaces with CircleCI Workflows](https://huggingface.co/docs/hub/spaces-circleci.md)
- [Perform SQL operations](https://huggingface.co/docs/hub/datasets-duckdb-sql.md)
- [Static HTML Spaces](https://huggingface.co/docs/hub/spaces-sdks-static.md)
- [Using spaCy at Hugging Face](https://huggingface.co/docs/hub/spacy.md)
- [Webhook guide: build a Discussion bot based on BLOOM](https://huggingface.co/docs/hub/webhooks-guide-discussion-bot.md)
- [Academia Hub](https://huggingface.co/docs/hub/academia-hub.md)
- [GGUF usage with llama.cpp](https://huggingface.co/docs/hub/gguf-llamacpp.md)
- [Evaluation Results](https://huggingface.co/docs/hub/eval-results.md)
- [Network Security](https://huggingface.co/docs/hub/enterprise-network-security.md)
- [Advanced Topics](https://huggingface.co/docs/hub/spaces-advanced.md)
- [GitHub Actions](https://huggingface.co/docs/hub/repositories-github-actions.md)
- [How to configure OIDC SSO with Okta](https://huggingface.co/docs/hub/security-sso-okta-oidc.md)
- [Digital Object Identifier (DOI)](https://huggingface.co/docs/hub/doi.md)
- [How to get a user's plan and status in Spaces](https://huggingface.co/docs/hub/spaces-get-user-plan.md)
- [Programmatic User Access Control Management](https://huggingface.co/docs/hub/programmatic-user-access-control.md)
- [Datasets Overview](https://huggingface.co/docs/hub/datasets-overview.md)
- [Widgets](https://huggingface.co/docs/hub/models-widgets.md)
- [Search](https://huggingface.co/docs/hub/search.md)
- [Data Studio](https://huggingface.co/docs/hub/data-studio.md)
- [Data files Configuration](https://huggingface.co/docs/hub/datasets-data-files-configuration.md)
- [Transforming your dataset](https://huggingface.co/docs/hub/datasets-polars-operations.md)
- [Shiny on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-shiny.md)
- [Using SetFit with Hugging Face](https://huggingface.co/docs/hub/setfit.md)
- [Using 🤗 `transformers` at Hugging Face](https://huggingface.co/docs/hub/transformers.md)
- [Team & Enterprise plans](https://huggingface.co/docs/hub/enterprise.md)
- [Langfuse on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-langfuse.md)
- [Using SpeechBrain at Hugging Face](https://huggingface.co/docs/hub/speechbrain.md)
- [ZenML on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-zenml.md)
- [Using SpanMarker at Hugging Face](https://huggingface.co/docs/hub/span_marker.md)
- [Downloading datasets](https://huggingface.co/docs/hub/datasets-downloading.md)
- [Using MLX at Hugging Face](https://huggingface.co/docs/hub/mlx.md)
- [Third-party scanner: Protect AI](https://huggingface.co/docs/hub/security-protectai.md)
- [Getting Started with Repositories](https://huggingface.co/docs/hub/repositories-getting-started.md)
- [Embedding Atlas](https://huggingface.co/docs/hub/datasets-embedding-atlas.md)
- [Using PEFT at Hugging Face](https://huggingface.co/docs/hub/peft.md)
- [Using Keras at Hugging Face](https://huggingface.co/docs/hub/keras.md)
- [Models](https://huggingface.co/docs/hub/models.md)
- [Gated models](https://huggingface.co/docs/hub/models-gated.md)
- [Run with Docker](https://huggingface.co/docs/hub/spaces-run-with-docker.md)
- [Integrate your library with the Hub](https://huggingface.co/docs/hub/models-adding-libraries.md)
- [The HF PRO subscription 🔥](https://huggingface.co/docs/hub/pro.md)
- [Using BERTopic at Hugging Face](https://huggingface.co/docs/hub/bertopic.md)
- [Models Download Stats](https://huggingface.co/docs/hub/models-download-stats.md)
- [Docker Spaces Examples](https://huggingface.co/docs/hub/spaces-sdks-docker-examples.md)
- [Advanced Compute Options](https://huggingface.co/docs/hub/advanced-compute-options.md)
- [Managed SSO](https://huggingface.co/docs/hub/enterprise-advanced-sso.md)
- [Webhooks](https://huggingface.co/docs/hub/webhooks.md)
- [marimo on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-marimo.md)
- [How to configure SAML SSO with Okta](https://huggingface.co/docs/hub/security-sso-okta-saml.md)
- [DuckDB](https://huggingface.co/docs/hub/datasets-duckdb.md)
- [Using Stanza at Hugging Face](https://huggingface.co/docs/hub/stanza.md)
- [Notifications](https://huggingface.co/docs/hub/notifications.md)
- [Gradio Spaces](https://huggingface.co/docs/hub/spaces-sdks-gradio.md)
- [Spaces as API endpoints](https://huggingface.co/docs/hub/spaces-api-endpoints.md)
- [Licenses](https://huggingface.co/docs/hub/repositories-licenses.md)
- [Hugging Face CLI for AI Agents](https://huggingface.co/docs/hub/agents-cli.md)
- [Webhook guide: Setup an automatic metadata quality review for models and datasets](https://huggingface.co/docs/hub/webhooks-guide-metadata-review.md)
- [Use AI Models Locally](https://huggingface.co/docs/hub/local-apps.md)
- [Adding a Sign-In with HF button to your Space](https://huggingface.co/docs/hub/spaces-oauth.md)
- [Local Agents with llama.cpp](https://huggingface.co/docs/hub/agents-local.md)
- [Using 🧨 `diffusers` at Hugging Face](https://huggingface.co/docs/hub/diffusers.md)
- [Resource groups](https://huggingface.co/docs/hub/enterprise-resource-groups.md)
- [User access tokens](https://huggingface.co/docs/hub/security-tokens.md)
- [Bucket Integrations](https://huggingface.co/docs/hub/storage-buckets-integrations.md)
- [Tasks](https://huggingface.co/docs/hub/models-tasks.md)
- [Access control in organizations](https://huggingface.co/docs/hub/organizations-security.md)
- [Optimizations](https://huggingface.co/docs/hub/datasets-polars-optimizations.md)
- [Featured Spaces](https://huggingface.co/docs/hub/spaces-featured.md)
- [PyArrow](https://huggingface.co/docs/hub/datasets-pyarrow.md)
- [Building with the SDK](https://huggingface.co/docs/hub/agents-sdk.md)
- [Video Dataset](https://huggingface.co/docs/hub/datasets-video.md)
- [Libraries](https://huggingface.co/docs/hub/models-libraries.md)
- [🟧 Label Studio on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-label-studio.md)
- [Repository Settings](https://huggingface.co/docs/hub/repositories-settings.md)
- [Agents](https://huggingface.co/docs/hub/agents.md)
- [Using GPU Spaces](https://huggingface.co/docs/hub/spaces-gpus.md)
- [Managing organizations](https://huggingface.co/docs/hub/organizations-managing.md)
- [Using Stable-Baselines3 at Hugging Face](https://huggingface.co/docs/hub/stable-baselines3.md)
- [Audio Dataset](https://huggingface.co/docs/hub/datasets-audio.md)
- [How to configure OIDC SSO with Microsoft Entra ID (Azure AD)](https://huggingface.co/docs/hub/security-sso-azure-oidc.md)
- [File formats](https://huggingface.co/docs/hub/datasets-polars-file-formats.md)
- [Gated datasets](https://huggingface.co/docs/hub/datasets-gated.md)
- [Accessing Benchmark Leaderboard Data](https://huggingface.co/docs/hub/leaderboard-data-guide.md)
- [Data Designer](https://huggingface.co/docs/hub/datasets-data-designer.md)
- [TF-Keras (legacy)](https://huggingface.co/docs/hub/tf-keras.md)
- [Pickle Scanning](https://huggingface.co/docs/hub/security-pickle.md)
- [Hugging Face MCP Server](https://huggingface.co/docs/hub/agents-mcp.md)
- [Configure the Dataset Viewer](https://huggingface.co/docs/hub/datasets-viewer-configure.md)
- [Hub Local Cache](https://huggingface.co/docs/hub/local-cache.md)
- [Custom Python Spaces](https://huggingface.co/docs/hub/spaces-sdks-python.md)
- [Model(s) Release Checklist](https://huggingface.co/docs/hub/model-release-checklist.md)
- [Using _Adapters_ at Hugging Face](https://huggingface.co/docs/hub/adapters.md)
- [Examples & Tutorials](https://huggingface.co/docs/hub/jobs-examples.md)
- [File names and splits](https://huggingface.co/docs/hub/datasets-file-names-and-splits.md)
- [Your First Docker Space: Text Generation with T5](https://huggingface.co/docs/hub/spaces-sdks-docker-first-demo.md)
- [Malware Scanning](https://huggingface.co/docs/hub/security-malware.md)
- [Hub Rate limits](https://huggingface.co/docs/hub/rate-limits.md)
- [Using RL-Baselines3-Zoo at Hugging Face](https://huggingface.co/docs/hub/rl-baselines3-zoo.md)
- [Handling Spaces Dependencies in Gradio Spaces](https://huggingface.co/docs/hub/spaces-dependencies.md)
- [Using ML-Agents at Hugging Face](https://huggingface.co/docs/hub/ml-agents.md)
- [Streaming datasets](https://huggingface.co/docs/hub/datasets-streaming.md)
- [Organization cards](https://huggingface.co/docs/hub/organizations-cards.md)
- [fenic](https://huggingface.co/docs/hub/datasets-fenic.md)
- [Argilla](https://huggingface.co/docs/hub/datasets-argilla.md)
- [Widget Examples](https://huggingface.co/docs/hub/models-widgets-examples.md)
- [Spaces Settings](https://huggingface.co/docs/hub/spaces-settings.md)
- [Spaces Configuration Reference](https://huggingface.co/docs/hub/spaces-config-reference.md)
- [WebDataset](https://huggingface.co/docs/hub/datasets-webdataset.md)
- [Authentication for private and gated datasets](https://huggingface.co/docs/hub/datasets-duckdb-auth.md)
- [Datasets](https://huggingface.co/docs/hub/datasets.md)
- [Storage Buckets](https://huggingface.co/docs/hub/storage-buckets.md)
- [Polars](https://huggingface.co/docs/hub/datasets-polars.md)
- [Schedule Jobs](https://huggingface.co/docs/hub/jobs-schedule.md)
- [Authentication](https://huggingface.co/docs/hub/datasets-polars-auth.md)
- [Datasets](https://huggingface.co/docs/hub/enterprise-datasets.md)
- [Combine datasets and export](https://huggingface.co/docs/hub/datasets-duckdb-combine-and-export.md)
- [Publisher Analytics](https://huggingface.co/docs/hub/publisher-analytics.md)
- [Using fastai at Hugging Face](https://huggingface.co/docs/hub/fastai.md)
- [Models Frequently Asked Questions](https://huggingface.co/docs/hub/models-faq.md)
- [Organizations, Security, and the Hub API](https://huggingface.co/docs/hub/other.md)
- [Using Flair at Hugging Face](https://huggingface.co/docs/hub/flair.md)
- [How to configure OIDC SSO with Google Workspace](https://huggingface.co/docs/hub/security-sso-google-oidc.md)
- [Editing Datasets in Data Studio](https://huggingface.co/docs/hub/datasets-cell-editing.md)
- [Disk usage on Spaces](https://huggingface.co/docs/hub/spaces-storage.md)
- [Reference](https://huggingface.co/docs/hub/jobs-reference.md)
- [Jobs](https://huggingface.co/docs/hub/jobs.md)
- [Using AllenNLP at Hugging Face](https://huggingface.co/docs/hub/allennlp.md)
- [Libraries](https://huggingface.co/docs/hub/datasets-libraries.md)
- [Model Card components](https://huggingface.co/docs/hub/model-cards-components.md)
- [Next Steps](https://huggingface.co/docs/hub/repositories-next-steps.md)
- [Agents](https://huggingface.co/docs/hub/agents-overview.md)
- [Moderation](https://huggingface.co/docs/hub/moderation.md)
- [Inference Providers](https://huggingface.co/docs/hub/models-inference.md)
- [Spaces as MCP servers](https://huggingface.co/docs/hub/spaces-mcp-servers.md)
- [Tokens Management](https://huggingface.co/docs/hub/enterprise-tokens-management.md)
- [Agent Libraries](https://huggingface.co/docs/hub/agents-libraries.md)
- [Paper Pages](https://huggingface.co/docs/hub/paper-pages.md)
- [Use Ollama with any GGUF Model on Hugging Face Hub](https://huggingface.co/docs/hub/ollama.md)
- [Two-Factor Authentication (2FA)](https://huggingface.co/docs/hub/security-2fa.md)
- [Spaces as Agent Tools](https://huggingface.co/docs/hub/spaces-agents.md)
- [How to configure SCIM with Microsoft Entra ID (Azure AD)](https://huggingface.co/docs/hub/security-sso-entra-id-scim.md)
- [Single Sign-On (SSO)](https://huggingface.co/docs/hub/security-sso.md)
- [How to configure SAML SSO with Microsoft Entra ID (Azure AD)](https://huggingface.co/docs/hub/security-sso-azure-saml.md)
- [Advanced Access Control in Organizations with Resource Groups](https://huggingface.co/docs/hub/security-resource-groups.md)
- [Organizations](https://huggingface.co/docs/hub/organizations.md)
- [Aim on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-aim.md)
- [Lance](https://huggingface.co/docs/hub/datasets-lance.md)
- [Using TensorBoard](https://huggingface.co/docs/hub/tensorboard.md)
- [User Management](https://huggingface.co/docs/hub/security-sso-user-management.md)
- [Collections](https://huggingface.co/docs/hub/collections.md)
- [Spaces Overview](https://huggingface.co/docs/hub/spaces-overview.md)
- [Configuration](https://huggingface.co/docs/hub/jobs-configuration.md)
- [Repositories](https://huggingface.co/docs/hub/repositories.md)
- [Secrets Scanning](https://huggingface.co/docs/hub/security-secrets.md)
- [Advanced Topics](https://huggingface.co/docs/hub/models-advanced.md)
- [Embed your Space in another website](https://huggingface.co/docs/hub/spaces-embed.md)
- [Spark](https://huggingface.co/docs/hub/datasets-spark.md)
- [Using ESPnet at Hugging Face](https://huggingface.co/docs/hub/espnet.md)
- [Daft](https://huggingface.co/docs/hub/datasets-daft.md)
- [Dataset Cards](https://huggingface.co/docs/hub/datasets-cards.md)
- [Docker Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker.md)
- [Streamlit Spaces](https://huggingface.co/docs/hub/spaces-sdks-streamlit.md)
- [Annotated Model Card Template](https://huggingface.co/docs/hub/model-card-annotated.md)
- [How to handle URL parameters in Spaces](https://huggingface.co/docs/hub/spaces-handle-url-parameters.md)
- [Signing commits with GPG](https://huggingface.co/docs/hub/security-gpg.md)
- [User Studies](https://huggingface.co/docs/hub/model-cards-user-studies.md)
- [Skills](https://huggingface.co/docs/hub/agents-skills.md)
- [GGUF usage with GPT4All](https://huggingface.co/docs/hub/gguf-gpt4all.md)
- [GGUF](https://huggingface.co/docs/hub/gguf.md)
- [Jupyter Notebooks on the Hugging Face Hub](https://huggingface.co/docs/hub/notebooks.md)
- [Pull requests and Discussions](https://huggingface.co/docs/hub/repositories-pull-requests-discussions.md)
- [JupyterLab on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-jupyter.md)
- [Perform vector similarity search](https://huggingface.co/docs/hub/datasets-duckdb-vector-similarity-search.md)
- [Blog Articles for Organizations](https://huggingface.co/docs/hub/enterprise-blog-articles.md)
- [Uploading models](https://huggingface.co/docs/hub/models-uploading.md)
- [Appendix](https://huggingface.co/docs/hub/model-card-appendix.md)
- [Quickstart](https://huggingface.co/docs/hub/jobs-quickstart.md)
- [Image Dataset](https://huggingface.co/docs/hub/datasets-image.md)
- [Uploading datasets](https://huggingface.co/docs/hub/datasets-adding.md)
- [DDUF](https://huggingface.co/docs/hub/dduf.md)
- [Using sample-factory at Hugging Face](https://huggingface.co/docs/hub/sample-factory.md)
- [Using Asteroid at Hugging Face](https://huggingface.co/docs/hub/asteroid.md)
- [Security](https://huggingface.co/docs/hub/security.md)
- [Popular Images](https://huggingface.co/docs/hub/jobs-popular-images.md)
- [Webhook guide: Setup an automatic system to re-train a model when a dataset changes](https://huggingface.co/docs/hub/webhooks-guide-auto-retrain.md)
- [Sign in with Hugging Face](https://huggingface.co/docs/hub/oauth.md)
- [Using Spaces for Organization Cards](https://huggingface.co/docs/hub/spaces-organization-cards.md)
- [Query datasets](https://huggingface.co/docs/hub/datasets-duckdb-select.md)
- [Manual Configuration](https://huggingface.co/docs/hub/datasets-manual-configuration.md)
- [Spaces Changelog](https://huggingface.co/docs/hub/spaces-changelog.md)
- [Jobs Overview](https://huggingface.co/docs/hub/jobs-overview.md)
- [Cookie limitations in Spaces](https://huggingface.co/docs/hub/spaces-cookie-limitations.md)
- [Editing datasets](https://huggingface.co/docs/hub/datasets-editing.md)
- [Manage Jobs](https://huggingface.co/docs/hub/jobs-manage.md)
- [Using Xet Storage](https://huggingface.co/docs/hub/xet/using-xet-storage.md)
- [Xet: our Storage Backend](https://huggingface.co/docs/hub/xet/index.md)
- [Xet History & Overview](https://huggingface.co/docs/hub/xet/overview.md)
- [Deduplication](https://huggingface.co/docs/hub/xet/deduplication.md)
- [Backward Compatibility with LFS](https://huggingface.co/docs/hub/xet/legacy-git-lfs.md)
- [Security Model](https://huggingface.co/docs/hub/xet/security.md)

### Using Sentence Transformers at Hugging Face

https://huggingface.co/docs/hub/sentence-transformers.md

# Using Sentence Transformers at Hugging Face

`sentence-transformers` is a library that provides easy methods to compute embeddings (dense vector representations) for sentences, paragraphs and images. Texts are embedded in a vector space such that similar text is close, which enables applications such as semantic search, clustering, and retrieval.
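As a minimal illustration of the idea, independent of the library itself: "similar text is close" means the embedding vectors of related texts point in similar directions, which is usually measured with cosine similarity. The vectors below are made-up stand-ins for real sentence embeddings (actual models produce e.g. 384- or 768-dimensional vectors):

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot product of the L2-normalized vectors
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Hypothetical low-dimensional "embeddings" for three texts
emb_cat = np.array([0.9, 0.1, 0.0, 0.2])
emb_kitten = np.array([0.8, 0.2, 0.1, 0.2])
emb_car = np.array([0.0, 0.9, 0.4, 0.1])

print(cos_sim(emb_cat, emb_kitten))  # close to 1.0: similar meaning
print(cos_sim(emb_cat, emb_car))     # much smaller: dissimilar
```

Semantic search, clustering, and retrieval all reduce to comparing such scores across a collection of embedded texts.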
## Exploring sentence-transformers in the Hub

You can find over 500 `sentence-transformers` models by filtering at the left of the [models page](https://huggingface.co/models?library=sentence-transformers&sort=downloads). Most of these models support different tasks, such as [`feature-extraction`](https://huggingface.co/models?library=sentence-transformers&pipeline_tag=feature-extraction&sort=downloads) to generate the embedding, and [`sentence-similarity`](https://huggingface.co/models?library=sentence-transformers&pipeline_tag=sentence-similarity&sort=downloads) to determine how similar a given sentence is to others. You can also find an overview of the official pre-trained models in [the official docs](https://www.sbert.net/docs/pretrained_models.html).

All models on the Hub come with the following features:

1. An automatically generated model card with a description, example code snippets, architecture overview, and more.
2. Metadata tags that help with discoverability and contain information such as the license.
3. An interactive widget you can use to play with the model directly in the browser.
4. An Inference Providers widget that lets you make inference requests.

## Using existing models

The pre-trained models on the Hub can be loaded with a single line of code:

```py
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('model_name')
```

Here is an example that encodes sentences and then computes the distance between them for doing semantic search.
```py
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

query_embedding = model.encode('How big is London')
passage_embedding = model.encode(['London has 9,787,426 inhabitants at the 2011 census',
                                  'London is known for its financial district'])

print("Similarity:", util.dot_score(query_embedding, passage_embedding))
```

If you want to see how to load a specific model, you can click `Use in sentence-transformers` and you will be given a working snippet you can use to load it!

## Sharing your models

You can share your Sentence Transformers models by using the `save_to_hub` method on a trained model.

```py
from sentence_transformers import SentenceTransformer

# Load or train a model, for example:
model = SentenceTransformer('model_name')

model.save_to_hub("my_new_model")
```

This command creates a repository with an automatically generated model card, an inference widget, example code snippets, and more! [Here](https://huggingface.co/osanseviero/my_new_model) is an example.

## Additional resources

* Sentence Transformers [library](https://github.com/UKPLab/sentence-transformers).
* Sentence Transformers [docs](https://www.sbert.net/).
* Integration with Hub [announcement](https://huggingface.co/blog/sentence-transformers-in-the-hub).

### Giskard on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-giskard.md

# Giskard on Spaces

**Giskard** is an AI model quality testing toolkit for LLMs, tabular, and NLP models. It consists of an open-source Python library for scanning and testing AI models and an AI Model Quality Testing app, which can now be deployed using Hugging Face's Docker Spaces.
Extending the features of the open-source library, the AI Model Quality Testing app enables you to:

- Debug tests to diagnose your issues
- Create domain-specific tests thanks to automatic model insights
- Compare models to decide which model to promote
- Collect business feedback on your model results
- Share your results with your colleagues for alignment
- Store all your QA objects (tests, data slices, evaluation criteria, etc.) in one place to work more efficiently

Visit [Giskard's documentation](https://docs.giskard.ai/) and [Quickstart Guides](https://docs.giskard.ai/en/latest/getting_started/quickstart/index.html) to learn how to use the full range of tools provided by Giskard.

In the next sections, you'll learn to deploy your own Giskard AI Model Quality Testing app and use it right from Hugging Face Spaces. This Giskard app is a **self-contained application completely hosted on Spaces using Docker**.

## Deploy Giskard on Spaces

You can deploy Giskard on Spaces with just a few clicks:

> [!WARNING]
> IMPORTANT NOTE ABOUT DATA PERSISTENCE:
> You can use the Giskard Space as is for initial exploration and experimentation. For **longer use in
> small-scale projects, attach a [Storage Bucket](https://huggingface.co/docs/hub/storage-buckets)**. This prevents data loss during Space restarts, which
> occur every 24 hours.

You need to define the **Owner** (your personal account or an organization), a **Space name**, and the **Visibility**. If you don't want to publicly share your models and quality tests, set your Space to **Private**.

Once you have created the Space, you'll see the `Building` status. Once it becomes `Running`, your Space is ready to go. If you don't see a change on the screen, refresh the page.

## Request a free license

Once your Giskard Space is up and running, you'll need to request a free license to start using the app. You will then automatically receive an email with the license file.
## Create a new Giskard project

Once inside the app, start by creating a new project from the welcome screen.

## Generate a Hugging Face Giskard Space Token and Giskard API key

The Giskard API key is used to establish communication between the environment where your AI models are running and the Giskard app on Hugging Face Spaces. If you've set the **Visibility** of your Space to **Private**, you will need to provide a Hugging Face user access token to generate the Hugging Face Giskard Space Token and establish communication for access to your private Space. To do so, follow the instructions displayed in the settings page of the Giskard app.

## Start the ML worker

Giskard executes your model using a worker that runs the model directly in your Python environment, with all the dependencies required by your model. You can execute the ML worker:

- From your local notebook, within the kernel that contains all the dependencies of your model
- From Google Colab, within the kernel that contains all the dependencies of your model
- Or from your terminal, within the Python environment that contains all the dependencies of your model

Simply run the following command within the Python environment that contains all the dependencies of your model:

```bash
giskard worker start -d -k GISKARD-API-KEY -u https://XXX.hf.space --hf-token GISKARD-SPACE-TOKEN
```

## Upload your test suite, models and datasets

In order to start building quality tests for a project, you will need to upload model and dataset objects, and either create or upload a test suite from the Giskard Python library.

> [!TIP]
> For more information on how to create test suites from Giskard's Python library's automated model scanning tool, head
> over to Giskard's [Quickstart Guides](https://docs.giskard.ai/en/latest/getting_started/quickstart/index.html).

These actions will all require a connection between your Python environment and the Giskard Space.
Achieve this by initializing a Giskard Client: simply copy the "Create a Giskard Client" snippet from the settings page of the Giskard app and run it within your Python environment. It will look something like this:

```python
from giskard import GiskardClient

url = "https://user_name-space_name.hf.space"
api_key = "gsk-xxx"
hf_token = "xxx"

# Create a giskard client to communicate with Giskard
client = GiskardClient(url, api_key, hf_token)
```

If you run into issues, head over to Giskard's [upload object documentation page](https://docs.giskard.ai/en/latest/giskard_hub/upload/index.html).

## Feedback and support

If you have suggestions or need specific support, please join [Giskard's Discord community](https://discord.com/invite/ABvfpbu69R) or reach out on [Giskard's GitHub repository](https://github.com/Giskard-AI/giskard).

### Embed the Dataset Viewer in a webpage

https://huggingface.co/docs/hub/datasets-viewer-embed.md

# Embed the Dataset Viewer in a webpage

You can embed the Dataset Viewer in your own webpage using an iframe. The URL to use is `https://huggingface.co/datasets/<namespace>/<dataset-name>/embed/viewer`, where `<namespace>` is the owner of the dataset (user or organization) and `<dataset-name>` is the name of the dataset. You can also pass other parameters like the subset, split, filter, search or selected row.

For example, the following iframe embeds the Dataset Viewer for the `glue` dataset from the `nyu-mll` organization:

```html
<iframe
  src="https://huggingface.co/datasets/nyu-mll/glue/embed/viewer"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>
```

You can also get the embed code directly from the Dataset Viewer interface. Click on the `Embed` button in the top right corner of the Dataset Viewer. It will open a modal with the iframe code that you can copy and paste into your webpage.

## Parameters

All the parameters of the dataset viewer page can also be passed to the embedded viewer (filter, search, specific split, etc.) by adding them to the iframe URL.
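As a small sketch of how such a URL can be assembled before dropping it into an iframe `src` (the `embed_url` helper and the assumption that subset and split are appended as path segments while other parameters go in the query string are illustrative, not an official API):

```python
from urllib.parse import urlencode

def embed_url(namespace, dataset, subset=None, split=None, **params):
    # Base embed URL: https://huggingface.co/datasets/<namespace>/<dataset-name>/embed/viewer
    url = f"https://huggingface.co/datasets/{namespace}/{dataset}/embed/viewer"
    # Assumed layout: subset and split as path segments, the rest as query parameters
    if subset:
        url += f"/{subset}"
        if split:
            url += f"/{split}"
    if params:
        url += "?" + urlencode(params)
    return url

print(embed_url("nyu-mll", "glue"))
print(embed_url("nyu-mll", "glue", subset="rte", split="test", search="mangrove"))
```

When in doubt, prefer copying the exact URL from the `Embed` button in the viewer interface rather than constructing it by hand.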
For example, to show the results of the search on `mangrove` in the `test` split of the `rte` subset of the `nyu-mll/glue` dataset, you can use the following URL:

```html
<iframe
  src="https://huggingface.co/datasets/nyu-mll/glue/embed/viewer/rte/test?search=mangrove"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>
```

You can get this code directly from the Dataset Viewer interface by performing the search, clicking on the `⋮` button, then `Embed`. It will open a modal with the iframe code that you can copy and paste into your webpage.

## Examples

The embedded dataset viewer is used in multiple Machine Learning tools and platforms to display datasets. Here are a few examples. Open a [pull request](https://github.com/huggingface/hub-docs/blob/main/docs/hub/datasets-viewer-embed.md) if you want to appear in this section!

### Tool: ZenML

[`htahir1`](https://huggingface.co/htahir1) shares a [blog post](https://www.zenml.io/blog/embedding-huggingface-datasets-visualizations-with-zenml) showing how you can use the [ZenML](https://huggingface.co/zenml) integration with the Datasets Viewer to visualize a Hugging Face dataset within a ZenML pipeline.

### Tool: Metaflow + Outerbounds

[`eddie-OB`](https://huggingface.co/eddie-OB) shows in a [demo video](https://www.linkedin.com/posts/eddie-mattia_the-team-at-hugging-facerecently-released-activity-7219416449084272641-swIu) how to include the dataset viewer in Metaflow cards on [Outerbounds](https://huggingface.co/outerbounds).

### Tool: AutoTrain

[`abhishek`](https://huggingface.co/abhishek) showcases how the dataset viewer is integrated into [AutoTrain](https://huggingface.co/autotrain) in a [demo video](https://x.com/abhi1thakur/status/1813892464144798171).

### Datasets: Alpaca-style datasets gallery

[`davanstrien`](https://huggingface.co/davanstrien) showcases the [collection of Alpaca-style datasets](https://huggingface.co/collections/librarian-bots/alpaca-style-datasets-66964d3e490f463859002588) in a [space](https://huggingface.co/spaces/davanstrien/collection_dataset_viewer).
### Datasets: Docmatix

[`andito`](https://huggingface.co/andito) uses the embedded viewer in the [blog post](https://huggingface.co/blog/docmatix) announcing the release of [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix), a huge dataset for Document Visual Question Answering (DocVQA).

### App: Masader - Arabic NLP data catalogue

[`Zaid`](https://huggingface.co/Zaid) [showcases](https://x.com/zaidalyafeai/status/1815365207775932576) the dataset viewer in [Masader - the Arabic NLP data catalogue](https://arbml.github.io/masader//).

### Using OpenCLIP at Hugging Face

https://huggingface.co/docs/hub/open_clip.md

# Using OpenCLIP at Hugging Face

[OpenCLIP](https://github.com/mlfoundations/open_clip) is an open-source implementation of OpenAI's CLIP.

## Exploring OpenCLIP on the Hub

You can find OpenCLIP models by filtering at the left of the [models page](https://huggingface.co/models?library=open_clip&sort=trending). OpenCLIP models hosted on the Hub have a model card with useful information about the models. Thanks to the OpenCLIP Hugging Face Hub integration, you can load OpenCLIP models with a few lines of code. You can also deploy these models using [Inference Endpoints](https://huggingface.co/inference-endpoints).

## Installation

To get started, you can follow the [OpenCLIP installation guide](https://github.com/mlfoundations/open_clip#usage).
You can also use the following one-line install through pip:

```
$ pip install open_clip_torch
```

## Using existing models

All OpenCLIP models can easily be loaded from the Hub:

```py
import open_clip

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K')
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K')
```

Once loaded, you can encode the image and text to do [zero-shot image classification](https://huggingface.co/tasks/zero-shot-image-classification):

```py
import torch
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

It outputs the probability of each possible class:

```text
Label probs: tensor([[0.0020, 0.0034, 0.9946]])
```

If you want to load a specific OpenCLIP model, you can click `Use in OpenCLIP` in the model card and you will be given a working snippet!

## Additional resources

* OpenCLIP [repository](https://github.com/mlfoundations/open_clip)
* OpenCLIP [docs](https://github.com/mlfoundations/open_clip/tree/main/docs)
* OpenCLIP [models in the Hub](https://huggingface.co/models?library=open_clip&sort=trending)

### Tabby on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-tabby.md

# Tabby on Spaces

[Tabby](https://tabby.tabbyml.com) is an open-source, self-hosted AI coding assistant. With Tabby, every team can set up its own LLM-powered code completion server with ease.
In this guide, you will learn how to deploy your own Tabby instance and use it for development directly from the Hugging Face website.

## Your first Tabby Space

In this section, you will learn how to deploy a Tabby Space and use it for yourself or your organization.

### Deploy Tabby on Spaces

You can deploy Tabby on Spaces with just a few clicks:

[![Deploy on HF Spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/spaces/TabbyML/tabby-template-space?duplicate=true)

You need to define the Owner (your personal account or an organization), a Space name, and the Visibility. To secure the API endpoint, set the Visibility to Private.

![Duplicate Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/duplicate-space.png)

You'll see the *Building* status. Once it becomes *Running*, your Space is ready to go. If you don't see the Tabby Swagger UI, try refreshing the page.

![Swagger UI](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/swagger-ui.png)

> [!TIP]
> If you want to customize the title, emojis, and colors of your Space, go to "Files and Versions" and edit the metadata of your README.md file.

### Your Tabby Space URL

Once Tabby is up and running, for a Space link such as https://huggingface.co/spaces/TabbyML/tabby, the direct URL will be https://tabbyml-tabby.hf.space. This URL provides access to a stable Tabby instance in full-screen mode and serves as the API endpoint for IDE/Editor extensions to talk to.

### Connect VSCode Extension to Space backend

1. Install the [VSCode Extension](https://marketplace.visualstudio.com/items?itemName=TabbyML.vscode-tabby).
2. Open the file located at `~/.tabby-client/agent/config.toml`. Uncomment both the `[server]` section and the `[server.requestHeaders]` section.
   * Set the endpoint to the Direct URL you found in the previous step, which should look something like `https://UserName-SpaceName.hf.space`.
   * As the Space is set to **Private**, it is essential to configure the authorization header for accessing the endpoint. You can obtain a token from the [Access Tokens](https://huggingface.co/settings/tokens) page.

   ![Agent Config](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/agent-config.png)

3. You'll notice a ✓ icon indicating a successful connection.

   ![Tabby Connected](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/tabby-connected.png)

4. You've completed the setup; now enjoy tabbing!

   ![Code Completion](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/code-completion.png)

You can also use Tabby extensions in other IDEs, such as [JetBrains](https://plugins.jetbrains.com/plugin/22379-tabby).

## Feedback and support

If you have improvement suggestions or need specific support, please join the [Tabby Slack community](https://join.slack.com/t/tabbycommunity/shared_invite/zt-1xeiddizp-bciR2RtFTaJ37RBxr8VxpA) or reach out on [Tabby's GitHub repository](https://github.com/TabbyML/tabby).

### ChatUI on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-chatui.md

# ChatUI on Spaces

**HuggingChat** is an open-source interface enabling everyone to try open-source large language models such as Falcon, StarCoder, and BLOOM. Thanks to an official Docker template called ChatUI, you can deploy your own HuggingChat based on a model of your choice with a few clicks using Hugging Face's infrastructure.

## Deploy your own Chat UI

To get started, simply head [here](https://huggingface.co/new-space?template=huggingchat/chat-ui-template). In the backend of this application, [text-generation-inference](https://github.com/huggingface/text-generation-inference) is used for optimized model inference.
Since these models can't run on CPUs, you can select the GPU depending on your choice of model.

You should provide a MongoDB endpoint where your chats will be written. If you leave this section blank, your logs will be persisted to a database inside the Space. Note that Hugging Face does not have access to your chats.

You can configure the name and the theme of the Space by providing the application name and application color parameters. Below this, you can select the Hugging Face Hub ID of the model you wish to serve. You can also change the generation hyperparameters in the dictionary below in JSON format.

_Note_: If you'd like to deploy a model with gated access or a model in a private repository, you can simply provide `HF_TOKEN` in repository secrets. You need to set its value to an access token you can get from [here](https://huggingface.co/settings/tokens).

Once the creation is complete, you will see `Building` on your Space. Once built, you can try your own HuggingChat! Start chatting!

## Read more

- [HF Docker Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker)
- [chat-ui GitHub Repository](https://github.com/huggingface/chat-ui)
- [text-generation-inference GitHub repository](https://github.com/huggingface/text-generation-inference)

### Pricing and Billing

https://huggingface.co/docs/hub/jobs-pricing.md

# Pricing and Billing

Hugging Face Jobs let you run compute tasks on Hugging Face infrastructure without managing it yourself. Simply define a command, a Docker image, and a hardware flavor among various CPU and GPU options.

> [!TIP]
> Jobs are available to any user or organization with [pre-paid credits](https://huggingface.co/pricing).

Billing on Jobs is based on hardware usage and is computed by the minute: you are charged for every minute the Job runs on the requested hardware. During a Job's lifecycle, it is only billed when the Job is Starting or Running. This means that there is no cost during build.
If a running Job starts to fail, it is automatically suspended and billing stops.

## Pricing

Jobs are billed per minute based on the hardware used. Below are the available hardware options and their pricing.

### CPU

| **Hardware**    | **CPU** | **Memory** | **Hourly Price** |
|-----------------|---------|------------|------------------|
| CPU Basic       | 2 vCPU  | 16 GB      | $0.01            |
| CPU Upgrade     | 8 vCPU  | 32 GB      | $0.03            |
| CPU XL          | 16 vCPU | 124 GB     | $1.00            |
| CPU Performance | 32 vCPU | 256 GB     | $1.90            |

### GPU

| **Hardware**           | **CPU**  | **Memory** | **GPU Memory** | **Hourly Price** |
|------------------------|----------|------------|----------------|------------------|
| Nvidia T4 - small      | 4 vCPU   | 15 GB      | 16 GB          | $0.40            |
| Nvidia T4 - medium     | 8 vCPU   | 30 GB      | 16 GB          | $0.60            |
| 1x Nvidia L4           | 8 vCPU   | 30 GB      | 24 GB          | $0.80            |
| 4x Nvidia L4           | 48 vCPU  | 186 GB     | 96 GB          | $3.80            |
| 1x Nvidia L40S         | 8 vCPU   | 62 GB      | 48 GB          | $1.80            |
| 4x Nvidia L40S         | 48 vCPU  | 382 GB     | 192 GB         | $8.30            |
| 8x Nvidia L40S         | 192 vCPU | 1534 GB    | 384 GB         | $23.50           |
| Nvidia A10G - small    | 4 vCPU   | 15 GB      | 24 GB          | $1.00            |
| Nvidia A10G - large    | 12 vCPU  | 46 GB      | 24 GB          | $1.50            |
| 2x Nvidia A10G - large | 24 vCPU  | 92 GB      | 48 GB          | $3.00            |
| 4x Nvidia A10G - large | 48 vCPU  | 184 GB     | 96 GB          | $5.00            |
| Nvidia A100 - large    | 12 vCPU  | 142 GB     | 80 GB          | $2.50            |
| 4x Nvidia A100 - large | 48 vCPU  | 568 GB     | 320 GB         | $10.00           |
| 8x Nvidia A100 - large | 96 vCPU  | 1136 GB    | 640 GB         | $20.00           |
| Nvidia H200            | 23 vCPU  | 256 GB     | 141 GB         | $5.00            |
| 2x Nvidia H200         | 46 vCPU  | 512 GB     | 282 GB         | $10.00           |
| 4x Nvidia H200         | 92 vCPU  | 1024 GB    | 564 GB         | $20.00           |
| 8x Nvidia H200         | 184 vCPU | 2048 GB    | 1128 GB        | $40.00           |

You can also retrieve available hardware and pricing programmatically via the API at `GET /api/jobs/hardware` or via the CLI:

```bash
>>> hf jobs hardware
```

## Manage billing

### Bill to your organization

Billing is done to the user's namespace by default,
but you can bill to your organization instead by specifying the right `namespace`:

```bash
hf jobs run --namespace my-org-name ...
```

In this case the Job runs under the organization account, and you can see it in your organization's Jobs page (organization page > settings > Jobs).

### View current compute usage

You can look at your current billing information for Jobs in your [Billing](https://huggingface.co/settings/billing) page, under the "Compute Usage" section.

Additional information about billing can be found in the [dedicated Hub documentation](https://huggingface.co/docs/hub/en/billing).

### Recommendations

#### Set timeout limits

Set a `timeout` when creating the Job to ensure it can't run beyond a certain duration. A Job run that reaches the `timeout` duration is automatically stopped, and so is its billing. Here is how to set a timeout with the CLI:

```bash
hf jobs run --timeout 3h ...
```

Note that the default timeout is set to **30 minutes**. You must therefore specify a longer timeout if your Job requires more time to run.

#### Cancel irrelevant Jobs

If a running Job is no longer relevant, you can cancel it prematurely to stop its billing, either via the Job page or the CLI:

```bash
hf jobs cancel <job-id>
```

### Displaying carbon emissions for your model

https://huggingface.co/docs/hub/model-cards-co2.md

# Displaying carbon emissions for your model

## Why is it beneficial to calculate the carbon emissions of my model?

Training ML models is often energy-intensive and can produce a substantial carbon footprint, as described by [Strubell et al.](https://arxiv.org/abs/1906.02243). It's therefore important to *track* and *report* the emissions of models to get a better idea of the environmental impacts of our field.

## What information should I include about the carbon footprint of my model?

If you can, you should include information about:

- where the model was trained (in terms of location)
- the hardware used -- e.g.
GPU, TPU, or CPU, and how many
- training type: pre-training or fine-tuning
- the estimated carbon footprint of the model, calculated in real-time with the [Code Carbon](https://github.com/mlco2/codecarbon) package or after training using the [ML CO2 Calculator](https://mlco2.github.io/impact/)

## Carbon footprint metadata

You can add the carbon footprint data to the model card metadata (in the README.md file). The structure of the metadata should be:

```yaml
---
co2_eq_emissions:
  emissions: number (in grams of CO2)
  source: "source of the information, either directly from AutoTrain, code carbon or from a scientific article documenting the model"
  training_type: "pre-training or fine-tuning"
  geographical_location: "as granular as possible, for instance Quebec, Canada or Brooklyn, NY, USA. To check your compute's electricity grid, you can check out https://app.electricitymap.org."
  hardware_used: "how much compute and what kind, e.g. 8 v100 GPUs"
---
```

## How is the carbon footprint of my model calculated? 🌎

Considering the computing hardware, location, usage, and training time, you can estimate how much CO2 the model produced. The math is pretty simple! ➕

First, you take the *carbon intensity* of the electric grid used for the training -- this is how much CO2 is produced per kWh of electricity used. The carbon intensity depends on the location of the hardware and the [energy mix](https://electricitymap.org/) used at that location -- whether it's renewable energy like solar 🌞, wind 🌬️ and hydro 💧, or non-renewable energy like coal ⚫ and natural gas 💨. The more renewable energy gets used for training, the less carbon-intensive it is!

Then, you take the power consumption of the GPUs during training using the `pynvml` library.

Finally, you multiply the power consumption and carbon intensity by the training time of the model, and you have an estimate of the CO2 emission.
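As a sketch of this arithmetic (the function name and the numbers below are illustrative, not measurements):

```python
def estimate_co2_grams(avg_power_kw, training_hours, carbon_intensity_g_per_kwh):
    """CO2 estimate = energy used (kWh) x carbon intensity of the grid (gCO2/kWh)."""
    energy_kwh = avg_power_kw * training_hours
    return energy_kwh * carbon_intensity_g_per_kwh

# Hypothetical run: a 0.5 kW GPU for 10 hours on a 400 gCO2/kWh grid
print(estimate_co2_grams(0.5, 10, 400))  # 2000.0 grams of CO2
```

Tools like Code Carbon perform this same multiplication for you, with measured power draw and per-region carbon intensity data.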
Keep in mind that this isn't an exact number because other factors come into play -- like the energy used for data center heating and cooling -- which will increase carbon emissions. But this will give you a good idea of the scale of CO2 emissions that your model is producing!

To add **Carbon Emissions** metadata to your models:

1. If you are using **AutoTrain**, this is tracked for you 🔥
2. Otherwise, use a tracker like Code Carbon in your training code, then specify

```yaml
co2_eq_emissions:
  emissions: 1.2345
```

in your model card metadata, where `1.2345` is the emissions value in **grams**.

To learn more about the carbon footprint of Transformers, check out the [video](https://www.youtube.com/watch?v=ftWlj4FBHTg), part of the Hugging Face Course!

### Spaces ZeroGPU: Dynamic GPU Allocation for Spaces

https://huggingface.co/docs/hub/spaces-zerogpu.md

# Spaces ZeroGPU: Dynamic GPU Allocation for Spaces

ZeroGPU is a shared infrastructure that optimizes GPU usage for AI models and demos on Hugging Face Spaces. It dynamically allocates and releases NVIDIA H200 GPUs as needed, offering:

1. **Free GPU Access**: Enables cost-effective GPU usage for Spaces.
2. **Multi-GPU Support**: Allows Spaces to leverage multiple GPUs concurrently on a single application.

Unlike traditional single-GPU allocations, ZeroGPU's efficient system lowers barriers for developers, researchers, and organizations to deploy AI models by maximizing resource utilization and power efficiency.

## Using and hosting ZeroGPU Spaces

- **Using existing ZeroGPU Spaces**
  - ZeroGPU Spaces are available to use for free to all users. (Visit [the curated list](https://huggingface.co/spaces/enzostvs/zero-gpu-spaces).)
  - [PRO users](https://huggingface.co/subscribe/pro) get 8× more daily usage quota, the highest priority in GPU queues, and can go beyond their daily quota using pre-paid credits when using any ZeroGPU Spaces.
- **Hosting your own ZeroGPU Spaces**
  - Personal accounts: [Subscribe to PRO](https://huggingface.co/settings/billing/subscription) to access ZeroGPU in the hardware options when creating a new Gradio SDK Space.
  - Organizations: [Subscribe to a Team or Enterprise plan](https://huggingface.co/enterprise) to enable ZeroGPU Spaces for all organization members.

## Technical Specifications

ZeroGPU supports two GPU sizes:

| GPU size            | Backing hardware | VRAM  | Quota cost |
|---------------------|------------------|-------|------------|
| `large` *(default)* | Half NVIDIA H200 | 70GB  | 1×         |
| `xlarge`            | Full NVIDIA H200 | 141GB | 2×         |

> [!NOTE]
> See [GPU size selection](#gpu-size-selection) to learn how to use sizes.

## Compatibility

ZeroGPU Spaces are designed to be compatible with most PyTorch-based GPU Spaces. While compatibility is enhanced for high-level Hugging Face libraries like `transformers` and `diffusers`, users should be aware that:

- Currently, ZeroGPU Spaces are exclusively compatible with the **Gradio SDK**.
- ZeroGPU Spaces may have limited compatibility compared to standard GPU Spaces.
- Unexpected issues may arise in some scenarios.

### Supported Versions

- **Gradio**: 4+
- **PyTorch**: Almost all versions from **2.1.0** to **latest** are supported. Full list: 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.2, 2.4.0, 2.5.1, 2.6.0, 2.7.1, 2.8.0, 2.9.1
- **Python**: 3.10.13, 3.12.12

## Getting started with ZeroGPU

To utilize ZeroGPU in your Space, follow these steps:

1. Make sure the ZeroGPU hardware is selected in your Space settings.
2. Import the `spaces` module.
3. Decorate GPU-dependent functions with `@spaces.GPU`.

This decoration process allows the Space to request a GPU when the function is called and release it upon completion.

> [!NOTE]
> The `@spaces.GPU` decorator is designed to be effect-free in non-ZeroGPU environments, ensuring compatibility across different setups.
### Example Usage

```python
import gradio as gr
import spaces
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(...)
pipe.to('cuda')

@spaces.GPU
def generate(prompt):
    return pipe(prompt).images

gr.Interface(
    fn=generate,
    inputs=gr.Text(),
    outputs=gr.Gallery(),
).launch()
```

### Model loading

Even though a real GPU is only available inside `@spaces.GPU` functions, models must be placed on `cuda` at the root module level (as shown in the example above). Lazy-loading or moving models to CUDA inside `@spaces.GPU` is discouraged, as it is significantly less efficient (CUDA transfers are optimized for placements done during startup).

> [!NOTE]
> Loading models on `cuda` at module level works because a PyTorch CUDA emulation mode is enabled outside `@spaces.GPU` functions, allowing CUDA operations without a real GPU. Inside `@spaces.GPU`, real CUDA is used.

## GPU size selection

The default size used by `@spaces.GPU` is `large` (half H200). You can explicitly request a full H200 by specifying `size="xlarge"`:

```python
@spaces.GPU(size="xlarge")
def generate(prompt):
    return pipe(prompt).images
```

> [!NOTE]
> - `xlarge` consumes **2×** more daily quota than `large` (e.g. a 45s **effective** task duration consumes 90s of quota)
> - `xlarge` usually means a higher queuing probability and longer wait times
> - Only use `xlarge` when your workload truly benefits from the additional compute or memory

## Duration Management

For functions expected to exceed the default 60 seconds of GPU runtime, you can specify a custom duration:

```python
@spaces.GPU(duration=120)
def generate(prompt):
    return pipe(prompt).images
```

This sets the maximum function runtime to 120 seconds. Specifying shorter durations for quicker functions will improve queue priority for Space visitors.

### Dynamic duration

`@spaces.GPU` also supports dynamic durations.
Instead of directly passing a duration, simply pass a callable that takes the same inputs as your decorated function and returns a duration value:

```python
def get_duration(prompt, steps):
    step_duration = 3.75
    return steps * step_duration

@spaces.GPU(duration=get_duration)
def generate(prompt, steps):
    return pipe(prompt, num_inference_steps=steps).images
```

## Compilation

ZeroGPU does not support `torch.compile`, but you can use PyTorch **ahead-of-time** compilation (requires torch `2.8+`). Check out this [blog post](https://huggingface.co/blog/zerogpu-aoti) for a complete guide on ahead-of-time compilation on ZeroGPU.

## Usage Tiers

GPU usage is subject to **daily** quotas, per account tier:

| Account type                   | Included daily GPU quota | Queue priority |
| ------------------------------ | ------------------------ | -------------- |
| Unauthenticated                | 2 minutes                | Low            |
| Free account                   | 3.5 minutes              | Medium         |
| PRO account                    | 25 minutes (extensible)  | Highest        |
| Team organization member       | 25 minutes (extensible)  | Highest        |
| Enterprise organization member | 45 minutes (extensible)  | Highest        |

The included daily quota resets exactly 24 hours after your first GPU usage.

> [!NOTE]
> Remaining quota directly impacts priority in ZeroGPU queues.

### Extending quota with credits

PRO, Team, and Enterprise users can continue using ZeroGPU Spaces beyond their included daily quota by consuming pre-paid credits at the rate of **$1 per 10 minutes** of GPU time. Once your daily quota is exhausted, any additional GPU usage is automatically billed against your credit balance. You can add credits from your [billing settings](https://huggingface.co/settings/billing).

## Hosting Limitations

- **Personal accounts ([PRO subscribers](https://huggingface.co/subscribe/pro))**: Maximum of 10 ZeroGPU Spaces.
- **Organization accounts ([Team & Enterprise](https://huggingface.co/enterprise))**: Maximum of 50 ZeroGPU Spaces.
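As a back-of-the-envelope sketch of the credit billing described under "Extending quota with credits" (the function name is hypothetical; only the $1-per-10-minutes rate comes from this page):

```python
def zerogpu_credit_cost(gpu_minutes_used, included_quota_minutes):
    """Credits (in dollars) consumed after the daily quota is exhausted, at $1 per 10 minutes."""
    overage_minutes = max(0.0, gpu_minutes_used - included_quota_minutes)
    return overage_minutes / 10.0

# Hypothetical PRO user (25-minute quota) using 45 minutes of GPU time in one day
print(zerogpu_credit_cost(45, 25))  # 2.0 dollars of credits
```

Usage below the included quota consumes no credits at all.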
By leveraging ZeroGPU, developers can create more efficient and scalable Spaces, maximizing GPU utilization while minimizing costs.

## Recommendations

If your demo uses a large model, we recommend using optimizations like ahead-of-time compilation and flash-attention 3. You can learn how to leverage these with ZeroGPU in [this post](https://huggingface.co/blog/zerogpu-aoti). These optimizations will help you maximize the advantages of ZeroGPU hours and provide a better user experience.

## Feedback

You can share your feedback on Spaces ZeroGPU directly on the HF Hub: https://huggingface.co/spaces/zero-gpu-explorers/README/discussions

### Webhooks Automation

https://huggingface.co/docs/hub/jobs-webhooks.md

# Webhooks Automation

Webhooks allow you to listen for new changes on specific repositories or to all repositories belonging to a particular set of users/organizations (not just your repos, but any repo) on Hugging Face.

Use `create_webhook` in the `huggingface_hub` Python client to create a webhook that triggers a Job when a change happens in a Hugging Face repository:

```python
from huggingface_hub import create_webhook

# Example: Creating a webhook that triggers a Job
webhook = create_webhook(
    job_id=job_id,
    watched=[{"type": "user", "name": "your-username"}, {"type": "org", "name": "your-org-name"}],
    domains=["repo", "discussion"],
    secret="your-secret"
)
```

The webhook triggers the Job with the following environment variables:

- `WEBHOOK_PAYLOAD`: the full webhook payload as a JSON string
- `WEBHOOK_REPO_ID`: the repository name (e.g., `user/repo-name`)
- `WEBHOOK_REPO_TYPE`: the repository type (`model`, `dataset`, or `space`)
- `WEBHOOK_SECRET`: the webhook secret, if one was configured

The webhook payload contains multiple fields; here are a few useful ones:

```
- event:
  - action: one of "create", "delete", "move", "update"
  - scope: string
- repo:
  - owner: string
  - headSha: string
  - name: string
  - type: one of "dataset", "model", "space"
```

You can find
more information on webhooks in the [`huggingface_hub` Webhooks documentation](https://huggingface.co/docs/huggingface_hub/en/guides/webhooks).

### Using mlx-image at Hugging Face

https://huggingface.co/docs/hub/mlx-image.md

# Using mlx-image at Hugging Face

[`mlx-image`](https://github.com/riccardomusmeci/mlx-image) is an image models library developed by [Riccardo Musmeci](https://github.com/riccardomusmeci) built on Apple [MLX](https://github.com/ml-explore/mlx). It tries to replicate the great [timm](https://github.com/huggingface/pytorch-image-models), but for MLX models.

## Exploring mlx-image on the Hub

You can find `mlx-image` models by filtering using the `mlx-image` library name, like in [this query](https://huggingface.co/models?library=mlx-image&sort=trending). There's also an open [mlx-vision](https://huggingface.co/mlx-vision) community for contributors converting and publishing weights in MLX format.

## Installation

```bash
pip install mlx-image
```

## Models

Model weights are available in the [`mlx-vision`](https://huggingface.co/mlx-vision) community on HuggingFace.

To load a model with pre-trained weights:

```python
from mlxim.model import create_model

# loading weights from HuggingFace (https://huggingface.co/mlx-vision/resnet18-mlxim)
model = create_model("resnet18") # pretrained weights loaded from HF

# loading weights from local file
model = create_model("resnet18", weights="path/to/resnet18/model.safetensors")
```

To list all available models:

```python
from mlxim.model import list_models
list_models()
```

## ImageNet-1K Results

Go to [results-imagenet-1k.csv](https://github.com/riccardomusmeci/mlx-image/blob/main/results/results-imagenet-1k.csv) to check every model converted to `mlx-image` and its performance on ImageNet-1K with different settings.

> **TL;DR** performance is comparable to the original models from PyTorch implementations.
## Similarity to PyTorch and other familiar tools

`mlx-image` tries to be as close as possible to PyTorch:

- `DataLoader` -> you can define your own `collate_fn` and also use `num_workers` to speed up data loading
- `Dataset` -> `mlx-image` already supports `LabelFolderDataset` (the good and old PyTorch `ImageFolder`) and `FolderDataset` (a generic folder with images in it)
- `ModelCheckpoint` -> keeps track of the best model and saves it to disk (similar to PyTorchLightning). It also suggests early stopping

## Training

Training is similar to PyTorch. Here's an example of how to train a model:

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlxim.model import create_model
from mlxim.data import LabelFolderDataset, DataLoader

train_dataset = LabelFolderDataset(
    root_dir="path/to/train",
    class_map={0: "class_0", 1: "class_1", 2: ["class_2", "class_3"]}
)
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)
model = create_model("resnet18")  # pretrained weights loaded from HF
optimizer = optim.Adam(learning_rate=1e-3)

def train_step(model, inputs, targets):
    logits = model(inputs)
    loss = mx.mean(nn.losses.cross_entropy(logits, targets))
    return loss

model.train()

for epoch in range(10):
    for batch in train_loader:
        x, target = batch
        train_step_fn = nn.value_and_grad(model, train_step)
        loss, grads = train_step_fn(x, target)
        optimizer.update(model, grads)
        mx.eval(model.state, optimizer.state)
```

## Additional Resources

* [mlx-image repository](https://github.com/riccardomusmeci/mlx-image)
* [mlx-vision community](https://huggingface.co/mlx-vision)

## Contact

If you have any questions, please email `riccardomusmeci92@gmail.com`.

### Panel on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-panel.md

# Panel on Spaces

[Panel](https://panel.holoviz.org/) is an open-source Python library that lets you easily build powerful tools, dashboards and complex applications entirely in Python.
It has a batteries-included philosophy, putting the PyData ecosystem, powerful data tables and much more at your fingertips. High-level reactive APIs and lower-level callback-based APIs ensure you can quickly build exploratory applications, but you aren't limited if you build complex, multi-page apps with rich interactivity. Panel is a member of the [HoloViz](https://holoviz.org/) ecosystem, your gateway into a connected ecosystem of data exploration tools.

Visit [Panel documentation](https://panel.holoviz.org/) to learn more about making powerful applications.

## 🚀 Deploy Panel on Spaces

You can deploy Panel on Spaces with just a few clicks:

There are a few key parameters you need to define: the Owner (either your personal account or an organization), a Space name, and Visibility. In case you intend to execute computationally intensive deep learning models, consider upgrading to a GPU to boost performance.

Once you have created the Space, it will start out in "Building" status, which will change to "Running" once your Space is ready to go.

## ⚡️ What will you see?

When your Space is built and ready, you will see this image classification Panel app which will let you fetch a random image and run the OpenAI CLIP classifier model on it. Check out our [blog post](https://blog.holoviz.org/building_an_interactive_ml_dashboard_in_panel.html) for a walkthrough of this app.

## 🛠️ How to customize and make your own app?

The Space template will populate a few files to get your app started. Three files are important:

### 1. app.py

This file defines your Panel application code. You can start by modifying the existing application or replace it entirely to build your own application. To learn more about writing your own Panel app, refer to the [Panel documentation](https://panel.holoviz.org/).

### 2. Dockerfile

The Dockerfile contains a sequence of commands that Docker will execute to construct and launch an image as a container that your Panel app will run in. Typically, to serve a Panel app, we use the command `panel serve app.py`. In this specific file, we divide the command into a list of strings. Furthermore, we must define the address and port because Hugging Face expects to serve your application on port 7860. Additionally, we need to specify the `allow-websocket-origin` flag to enable the connection to the server's websocket.

### 3. requirements.txt

This file defines the required packages for our Panel app. When using Spaces, dependencies listed in the requirements.txt file will be automatically installed. You have the freedom to modify this file by removing unnecessary packages or adding additional ones that are required for your application. Feel free to make the necessary changes to ensure your app has the appropriate packages installed.

## 🌐 Join Our Community

The Panel community is vibrant and supportive, with experienced developers and data scientists eager to help and share their knowledge. Join us and connect with us:

- [Discord](https://discord.gg/aRFhC3Dz9w)
- [Discourse](https://discourse.holoviz.org/)
- [Twitter](https://twitter.com/Panel_Org)
- [LinkedIn](https://www.linkedin.com/company/panel-org)
- [Github](https://github.com/holoviz/panel)

### Access Patterns

https://huggingface.co/docs/hub/storage-buckets-access.md

# Access Patterns

Beyond the [CLI and Python SDK](./storage-buckets#managing-files), there are several ways to access bucket data from your existing tools and workflows.
## Choosing an Access Method

| Method | Best for | Details |
|--------|----------|---------|
| **hf-mount** | Mount as local filesystem - any tool works | [See below](#mount-as-a-local-filesystem) |
| **Volume mounts** | HF Jobs & Spaces (same idea, managed for you) | [See below](#volume-mounts-in-jobs-and-spaces) |
| **hf:// paths** (fsspec) | Python data tools (pandas, DuckDB) | [See below](#python-data-tools) |
| **CLI sync** | Batch transfers, backups | [Sync docs](./storage-buckets#syncing-directories) |

Access through the S3 API is not currently supported, but is on the roadmap.

## Mount as a Local Filesystem

[hf-mount](https://github.com/huggingface/hf-mount) lets you mount buckets (and repos) as local filesystems via NFS (recommended) or FUSE. Files are fetched lazily: only the bytes your code reads hit the network.

Install with [Homebrew](https://brew.sh/):

```bash
brew install hf-mount
```

Mount a bucket:

```bash
hf-mount start bucket username/my-bucket /mnt/data
```

Once mounted, any tool that reads or writes files works with your bucket: pandas, DuckDB, vLLM, training scripts, shell commands, etc.

> [!TIP]
> Buckets are mounted read-write; repos are read-only.

See the [hf-mount repository](https://github.com/huggingface/hf-mount) for full documentation including backend options, caching, and write modes.

## Volume Mounts in Jobs and Spaces

Volume mounts in [Jobs](./jobs) and [Spaces](./spaces) are the same idea as `hf-mount`, managed for you by the platform - no extra setup needed. Buckets are mounted read-write by default.

```bash
hf jobs run -v hf://buckets/username/my-bucket:/data python:3.12 python script.py
```

For the full volume mount syntax and Python API, see the [Jobs configuration docs](./jobs-configuration#volumes) and the [Spaces volume mount guide](/docs/huggingface_hub/guides/manage-spaces#mount-volumes-in-your-space).
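Inside the job, the mounted bucket behaves like an ordinary directory, so plain filesystem code works without any Hub-specific API. A minimal sketch (the `/data` mount point comes from the `-v` flag above; file names are illustrative):

```python
from pathlib import Path

def uppercase_copy(data_dir: Path) -> str:
    """Read input.txt from the mounted bucket and write back an uppercased copy."""
    text = (data_dir / "input.txt").read_text()
    result = text.upper()
    # Buckets are mounted read-write by default, so this write lands in the bucket
    (data_dir / "output.txt").write_text(result)
    return result

# Inside the job above, the bucket is visible at /data:
# uppercase_copy(Path("/data"))
```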
## Python Data Tools

The [`HfFileSystem`](/docs/huggingface_hub/guides/hf_file_system) provides [fsspec](https://filesystem-spec.readthedocs.io)-compatible access to buckets using `hf://buckets/` paths. Any Python library that supports fsspec can read and write bucket data directly.

**pandas:**

```python
import pandas as pd

df = pd.read_parquet("hf://buckets/username/my-bucket/data.parquet")
df.to_parquet("hf://buckets/username/my-bucket/output.parquet")
```

**DuckDB** (Python client):

```python
import duckdb
from huggingface_hub import HfFileSystem

duckdb.register_filesystem(HfFileSystem())
duckdb.sql("SELECT * FROM 'hf://buckets/username/my-bucket/data.parquet' LIMIT 10")
```

For more on `hf://` paths and supported operations, see the [`HfFileSystem` guide](/docs/huggingface_hub/guides/hf_file_system) and the [Buckets Python guide](/docs/huggingface_hub/guides/buckets).

### Dash on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-dash.md

# Dash on Spaces

With Dash Open Source, you can create data apps on your laptop in pure Python, no JavaScript required. Get familiar with Dash by building a [sample app](https://dash.plotly.com/tutorial) with open source. Scale up with [Dash Enterprise](https://plotly.com/dash/) when your Dash app is ready for department or company-wide consumption. Or, launch your initiative with Dash Enterprise from the start to unlock developer productivity gains and hands-on acceleration from Plotly's team.

## Deploy Dash on Spaces

To get started with Dash on Spaces, click the button below. This will start building your Space using Plotly's Dash Docker template. If successful, you should see an application similar to the [Dash template app](https://huggingface.co/spaces/dash/dash-app-template).

## Customizing your Dash app

If you have never built with Dash before, we recommend getting started with our [Dash in 20 minutes tutorial](https://dash.plotly.com/tutorial).
When you create a Dash Space, you'll get a few key files to help you get started:

### 1. app.py

This is the main app file that defines the core logic of your project. Dash apps are often structured as modules, and you can optionally separate your layout, callbacks, and data into other files, like `layout.py`, etc. Inside of `app.py` you will see:

1. `from dash import Dash, html`

   We import the `Dash` object to define our app, and the `html` library, which gives us building blocks to assemble our project.

2. `app = Dash()`

   Here, we define our app. Layout, server, and callbacks are _bound_ to the `app` object.

3. `server = app.server`

   Here, we define our server variable, which is used to run the app in production.

4. `app.layout = `

   The starter app layout is defined as a list of Dash components, an individual Dash component, or a function that returns either. The `app.layout` is your initial layout, which will be updated as a single-page application by callbacks and other logic in your project.

5. `if __name__ == '__main__': app.run(debug=True)`

   If you are running your project locally with `python app.py`, `app.run(...)` will execute and start a development server to work on your project, with features including hot reloading, the callback graph, and more. In production, we recommend `gunicorn`, which is a production-grade server. Debug features will not be enabled when running your project with `gunicorn`, so this line will never be reached.

### 2. Dockerfile

The Dockerfile for a Dash app is minimal since Dash has few system dependencies. The key requirements are:

- It installs the dependencies listed in `requirements.txt` (using `uv`)
- It creates a non-root user for security
- It runs the app with `gunicorn` using `gunicorn app:server --workers 4`

You may need to modify this file if your application requires additional system dependencies, permissions, or other CLI flags.

### 3. requirements.txt

The Space will automatically install dependencies listed in the `requirements.txt` file. At minimum, you must include `dash` and `gunicorn` in this file. You will want to add any other packages your app requires. The Dash Space template provides a basic setup that you can extend based on your needs.

## Additional Resources and Support

- [Dash documentation](https://dash.plotly.com)
- [Dash GitHub repository](https://github.com/plotly/dash)
- [Dash Community Forums](https://community.plotly.com)
- [Dash Enterprise](https://plotly.com/dash)
- [Dash template Space](https://huggingface.co/spaces/plotly/dash-app-template)

## Troubleshooting

If you encounter issues:

1. Make sure your app runs locally with `python app.py`
2. Check that all required packages are listed in `requirements.txt`
3. Verify the port configuration matches (7860 is the default for Spaces)
4. Check Space logs for any Python errors

For more help, visit the [Plotly Community Forums](https://community.plotly.com) or [open an issue](https://github.com/plotly/dash/issues).

### Audit Logs

https://huggingface.co/docs/hub/audit-logs.md

# Audit Logs

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Audit Logs enable organization admins to easily review actions taken by members, including organization membership, repository settings, and billing changes.

_screenshot of the Hugging Face Audit Logs feature_

## Accessing Audit Logs

Audit Logs are accessible through your organization settings. Each log entry includes:

- Who performed the action
- What type of action was taken
- A description of the change
- Location and anonymized IP address
- Date and time of the action

You can also download the complete audit log as a JSON file for further analysis.

## What Events Are Tracked?

Each action has an **event name** in `scope.action` format (e.g. `repo.create`, `collection.delete`).
This is the `type` field in each log entry and in the exported JSON; use it when searching or filtering logs.

### Organization Management & Security

- **Core organization changes** - Creation, deletion, restoration, renaming, and profile/settings updates.
  - **Events:** `org.create`, `org.delete`, `org.restore`, `org.rename`, `org.update_settings`
- **Security management**
  - Organization API token rotation.
    - **Event:** `org.rotate_token`
  - Token approval system - Enabling or disabling the policy, authorization requests, approvals, denials, and revocations.
    - **Events:** `org.token_approval.enabled`, `org.token_approval.disabled`, `org.token_approval.authorization_request`, `org.token_approval.authorization_request.authorized`, `org.token_approval.authorization_request.revoked`, `org.token_approval.authorization_request.denied`
  - SSO - Logins and joins via SSO.
    - **Events:** `org.sso_login`, `org.sso_join`
- **Join settings** - Domain-based access and automatic join configuration.
  - **Event:** `org.update_join_settings`

### Membership and Access Control

- **Member lifecycle** - Adding and removing members, role changes, and members leaving the organization.
  - **Events:** `org.add_user`, `org.remove_user`, `org.change_role`, `org.leave`
- **Invitations** - Sending invites, invitation links by email, and users accepting invites.
  - **Events:** `org.invite_user`, `org.invite.accept`, `org.invite.email`
- **Automatic joins** - Joins via verified email domain or "request access".
  - **Events:** `org.join.from_domain`, `org.join.automatic`

### Content and Resource Management

- **Repository administration** - Creation, deletion, moving, disabling/re-enabling, duplication settings, DOI removal, resource group assignment, and general repo settings (visibility, gating, discussions, etc.). Also LFS file deletion.
  - **Events:** `repo.create`, `repo.delete`, `repo.move`, `repo.disable`, `repo.removeDisable`, `repo.duplication`, `repo.delete_doi`, `repo.update_resource_group`, `repo.update_settings`, `repo.delete_lfs_file`
- **Collections** - Creation and deletion of collections.
  - **Events:** `collection.create`, `collection.delete`
- **Repository security** - Secrets and variables (individual and bulk add/update/remove).
  - **Events (secrets):** `repo.add_secret`, `repo.update_secret`, `repo.remove_secret`, `repo.add_secrets`, `repo.remove_secrets`
  - **Events (variables):** `repo.add_variable`, `repo.update_variable`, `repo.remove_variable`, `repo.add_variables`, `repo.remove_variables`
- **Spaces configuration** - Storage tier changes, hardware (flavor) updates, and sleep time adjustments.
  - **Events:** `spaces.add_storage`, `spaces.remove_storage`, `spaces.update_hardware`, `spaces.update_sleep_time`

### Resource Groups

- **Resource group administration** - Creation, deletion, and settings changes.
  - **Events:** `resource_group.create`, `resource_group.delete`, `resource_group.settings`
- **Resource group members** - Adding and removing users, and role changes.
  - **Events:** `resource_group.add_users`, `resource_group.remove_users`, `resource_group.change_role`

### Jobs and Scheduled Jobs

- **Jobs** - Job creation (e.g. on a Space) and cancellation.
  - **Events:** `jobs.create`, `jobs.cancel`
- **Scheduled jobs** - Creating, deleting, resuming, suspending, and triggering runs.
  - **Events:** `scheduled_job.create`, `scheduled_job.delete`, `scheduled_job.resume`, `scheduled_job.suspend`, `scheduled_job.run`

### Billing and Cloud Integration

- **Payment and customers** - Payment method updates, attachment, and removal; customer account creation.
  - **Events:** `billing.update_payment_method`, `billing.create_customer`, `billing.remove_payment_method`
- **Cloud marketplaces** - AWS and GCP marketplace linking/unlinking and marketplace approval.
  - **Events:** `billing.aws_add`, `billing.aws_remove`, `billing.gcp_add`, `billing.gcp_remove`, `billing.marketplace_approve`
- **Subscriptions** - Starting, renewing, cancelling, reactivating, and updating subscriptions (including plan and contract details).
  - **Events:** `billing.start_subscription`, `billing.renew_subscription`, `billing.cancel_subscription`, `billing.un_cancel_subscription`, `billing.update_subscription`, `billing.update_subscription_plan`, `billing.update_subscription_contract_details`

## Event reference

The list above covers every event type shown in the audit log UI and export. Event names follow the `scope.action` pattern; scopes include `org`, `repo`, `collection`, `spaces`, `resource_group`, `jobs`, `scheduled_job`, and `billing`. The export action itself is recorded as `org.audit_log.export`, but that event is not included in the default audit log view.

### Downloading models

https://huggingface.co/docs/hub/models-downloading.md

# Downloading models

## Integrated libraries

If a model on the Hub is tied to a [supported library](./models-libraries), loading the model can be done in just a few lines. For information on accessing the model, you can click on the "Use in _Library_" button on the model page to see how to do so. For example, `distilbert/distilgpt2` shows how to do so with 🤗 Transformers below.

## Using the Hugging Face Client Library

You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to create, delete, update and retrieve information from repos. For example, to download the `HuggingFaceH4/zephyr-7b-beta` model from the command line, run:

```bash
hf download HuggingFaceH4/zephyr-7b-beta
```

See the [CLI download documentation](https://huggingface.co/docs/huggingface_hub/en/guides/cli#download-an-entire-repository) for more information.

You can also integrate this into your own library. For example, you can quickly load a Scikit-learn model with a few lines.
```py
from huggingface_hub import hf_hub_download
import joblib

REPO_ID = "YOUR_REPO_ID"
FILENAME = "sklearn_model.joblib"

model = joblib.load(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
)
```

## Using Git

Since all models on the Model Hub are Xet-backed Git repositories, you can clone the models locally by [installing git-xet](./xet/using-xet-storage#git-xet) and running:

```bash
git xet install
git lfs install
git clone git@hf.co: # example: git clone git@hf.co:bigscience/bloom
```

If you have write-access to the particular model repo, you'll also have the ability to commit and push revisions to the model. Add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes and/or access private repos.

## Faster downloads

`hf_xet` is a Rust-based package leveraging the [Xet storage backend](https://huggingface.co/docs/hub/en/xet/index) to optimize file transfers with chunk-based deduplication.

By default, `hf_xet` uses **adaptive concurrency**: it automatically tunes the number of parallel transfer streams based on real-time network conditions, starting conservatively (1 stream) and scaling up to 64 concurrent streams as bandwidth permits. For most machines, including data center environments, the default settings will already saturate the available network bandwidth.

For advanced users on machines with high bandwidth **and at least 64 GB of RAM**, `HF_XET_HIGH_PERFORMANCE=1` raises concurrency bounds and significantly increases memory buffer sizes, which can help when downloading many large files in parallel.

```bash
HF_XET_HIGH_PERFORMANCE=1 hf download ...
```

## Using hf-mount

For large models, you can mount a repo as a local filesystem with [hf-mount](https://github.com/huggingface/hf-mount) instead of downloading the full repo. Files are fetched lazily: only the bytes your code reads hit the network.
```bash
brew install hf-mount
hf-mount start repo openai-community/gpt2 /tmp/gpt2
```

Repos are mounted read-only. See [Mount as a Local Filesystem](./storage-buckets-access#mount-as-a-local-filesystem) for full setup details, backend options, and caching.

### The Model Hub

https://huggingface.co/docs/hub/models-the-hub.md

# The Model Hub

## What is the Model Hub?

The Model Hub is where the members of the Hugging Face community can host all of their model checkpoints for simple storage, discovery, and sharing. Download pre-trained models with the [`huggingface_hub` client library](https://huggingface.co/docs/huggingface_hub/index), with 🤗 [`Transformers`](https://huggingface.co/docs/transformers/index) for fine-tuning and other usages, or with any of the over [15 integrated libraries](./models-libraries). You can even leverage [Inference Providers](/docs/inference-providers/) or [Inference Endpoints](https://huggingface.co/docs/inference-endpoints) to use models in production settings.

You can refer to the following video for a guide on navigating the Model Hub. To learn how to upload models to the Hub, you can refer to the [Repositories Getting Started Guide](./repositories-getting-started).

### Spaces Dev Mode: Seamless development in Spaces

https://huggingface.co/docs/hub/spaces-dev-mode.md

# Spaces Dev Mode: Seamless development in Spaces

> [!WARNING]
> This feature is still in Beta stage.

> [!WARNING]
> The Spaces Dev Mode is part of PRO or Team & Enterprise plans.

## Spaces Dev Mode

Spaces Dev Mode is a feature that eases the debugging of your application and makes iterating on Spaces faster. Whenever you commit some changes to your Space repo, the underlying Docker image gets rebuilt, and then a new virtual machine is provisioned to host the new container. Dev Mode allows you to update your Space much quicker by overriding the Docker image.
The Dev Mode Docker image starts your application as a sub-process, allowing you to restart it without stopping the Space container itself. It also starts a VS Code server and an SSH server in the background for you to connect to the Space. The ability to connect to the running Space unlocks several use cases:

- You can make changes to the app code without the Space rebuilding every time
- You can debug a running application and monitor resources live

Overall it makes developing and experimenting with Spaces much faster by skipping the Docker image rebuild phase.

## Interface

Once Dev Mode is enabled on your Space, you should see a modal like the following. The application does not restart automatically when you change the code. For your changes to appear in the Space, you need to use the `Refresh` button, which will restart the app.

If you're using the Gradio SDK, or if your application is Python-based, note that requirements are not installed automatically. You will need to manually run `pip install` from VS Code or SSH.

### SSH connection and VS Code

Dev Mode allows you to connect to your Space's Docker container using SSH. Instructions to connect are listed in the Dev Mode controls modal. You will need to add your machine's SSH public key to [your user account](https://huggingface.co/settings/keys) to be able to connect to the Space using SSH. Check out the [Git over SSH](./security-git-ssh#add-a-ssh-key-to-your-account) documentation for more detailed instructions.

You can also use a local install of VS Code to connect to the Space container. To do so, you will need to install the [SSH Remote](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) extension.

### Persisting changes

The changes you make when Dev Mode is enabled are not persisted to the Space repo automatically. By default, they will be discarded when Dev Mode is disabled or when the Space goes to sleep.
If you wish to persist changes made while Dev Mode is enabled, you need to use `git` from inside the Space container (using VS Code or SSH). For example:

```shell
# Add changes and commit them
git add .
git commit -m "Persist changes from Dev Mode"

# Push the commit to persist them in the repo
git push
```

The modal will display a warning if you have uncommitted or unpushed changes in the Space:

## Enabling Dev Mode

You can enable Dev Mode on your Space from the web interface or via the API.

### Via the API

You can toggle Dev Mode programmatically:

```
POST https://huggingface.co/api/spaces/{namespace}/{repo}/dev-mode
Content-Type: application/json
Authorization: Bearer {token}

{
  "enabled": true
}
```

### Via the web interface

You can also create a Space with Dev Mode enabled:

## Limitations

Dev Mode is currently not available for static Spaces. Docker Spaces also have some additional requirements.

### Docker Spaces

Dev Mode is supported for Docker Spaces. However, your Space needs to comply with the following rules for Dev Mode to work properly.

1. The following packages must be installed:
   - `bash` (required to establish SSH connections)
   - `curl`, `wget` and `procps` (required by the VS Code server process)
   - `git` and `git-lfs` to be able to commit and push changes from your Dev Mode environment
2. Your application code must be located in the `/app` folder for the Dev Mode daemon to be able to detect changes.
3. The `/app` folder must be owned by the user with uid `1000` to allow you to make changes to the code.
4. The Dockerfile must contain a `CMD` instruction for startup. Check out [Docker's documentation](https://docs.docker.com/reference/dockerfile/#cmd) about the `CMD` instruction for more details.

Dev Mode works well when the base image is Debian-based (e.g., Ubuntu). More exotic Linux distributions (e.g., Alpine) are not tested, and Dev Mode is not guaranteed to work on them.
### Example of compatible Dockerfiles

This is an example of a Dockerfile compatible with Spaces Dev Mode. It installs the required packages with `apt-get`, along with a couple more for developer convenience (namely: `htop`, `vim` and `nano`). It then starts a NodeJS application from `/app`.

```Dockerfile
FROM node:19-slim

RUN apt-get update && \
    apt-get install -y \
    bash \
    git git-lfs \
    wget curl procps \
    htop vim nano && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY --link ./ /app

RUN npm i

RUN chown 1000 /app
USER 1000

CMD ["node", "index.js"]
```

There are several examples of Dev Mode compatible Docker Spaces in this organization. Feel free to duplicate them in your namespace!

Example Python app (FastAPI HTTP server): https://huggingface.co/spaces/dev-mode-explorers/dev-mode-python

Example JavaScript app (Express.js HTTP server): https://huggingface.co/spaces/dev-mode-explorers/dev-mode-javascript

## Feedback

You can share your feedback on Spaces Dev Mode directly on the HF Hub: https://huggingface.co/spaces/dev-mode-explorers/README/discussions

### SQL Console: Query Hugging Face datasets in your browser

https://huggingface.co/docs/hub/datasets-viewer-sql-console.md

# SQL Console: Query Hugging Face datasets in your browser

You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by [DuckDB](https://duckdb.org/) WASM and runs entirely in the browser. You can access the SQL Console from the Data Studio.

To learn more about the SQL Console, see the SQL Console blog post.
Through the SQL Console, you can:

- Run [DuckDB SQL queries](https://duckdb.org/docs/sql/query_syntax/select) on the dataset (_check out [SQL Snippets](https://huggingface.co/spaces/cfahlgren1/sql-snippets) for useful queries_)
- Share results of the query with others via a link (_check out [this example](https://huggingface.co/datasets/gretelai/synthetic-gsm8k-reflection-405b?sql_console=true&sql=FROM+histogram%28%0A++train%2C%0A++topic%2C%0A++bin_count+%3A%3D+10%0A%29)_)
- Download the results of the query to a Parquet or CSV file
- Embed the results of the query in your own webpage using an iframe
- Query datasets with natural language

> [!TIP]
> You can also use DuckDB locally through the CLI to query the dataset via the `hf://` protocol. See the DuckDB Datasets documentation for more information. The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL query for creating views and executing your query in the DuckDB CLI.

## Examples

### Filtering

The SQL Console makes filtering datasets really easy. For example, if you want to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length of at least 10, you can use the following query to filter by the length of the reasoning:

```sql
SELECT *
FROM train
WHERE LENGTH(reasoning_chains) > 10;
```

### Histogram

Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values. For example, to plot a histogram of the `Rating` column in the [Lichess/chess-puzzles](https://huggingface.co/datasets/Lichess/chess-puzzles) dataset, you can use the following query. Learn more about the `histogram` function and parameters here.

```sql
FROM histogram(train, Rating)
```

### Regex Matching

One of the most powerful features of DuckDB is the deep support for regular expressions.
You can use the `regexp` functions to match patterns in your data. Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the [GeneralReasoning/GeneralThought-195k](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-195K) dataset for instructions that contain markdown code blocks. Learn more about the DuckDB regex functions here.

```sql
SELECT *
FROM train
WHERE regexp_matches(model_answer, '```')
LIMIT 10;
```

### Saved Queries and Embeds API

You can create, update, and delete SQL Console embeds programmatically. Embeds are saved queries that can be shared via link or embedded in other pages.

**Create an embed:**

```
POST /api/datasets/{namespace}/{repo}/sql-console/embed
Content-Type: application/json
Authorization: Bearer {token}

{
  "sql": "SELECT * FROM train LIMIT 10",
  "title": "Sample rows",
  "private": false,
  "views": [{"key": "default/train", "displayName": "Train", "viewName": "train"}]
}
```

**Update an embed:**

```
PATCH /api/datasets/{namespace}/{repo}/sql-console/embed/{embed_id}
Content-Type: application/json
Authorization: Bearer {token}

{
  "sql": "SELECT * FROM train LIMIT 20",
  "title": "Updated title",
  "private": true
}
```

**Delete an embed:**

```
DELETE /api/datasets/{namespace}/{repo}/sql-console/embed/{embed_id}
Authorization: Bearer {token}
```

### Leakage Detection

Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set.

Learn more about leakage detection here.

```sql
WITH overlapping_rows AS (
    SELECT COUNT(*) AS overlap_count
    FROM (
        SELECT * FROM train
        INTERSECT
        SELECT * FROM test
    )
),
total_unique_rows AS (
    SELECT COUNT(*) AS total_count
    FROM (
        SELECT * FROM train
        UNION
        SELECT * FROM test
    ) combined
)
SELECT
    overlap_count,
    total_count,
    CASE WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count) ELSE 0 END AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
```

### Single Sign-On (SSO)

https://huggingface.co/docs/hub/enterprise-sso.md

# Single Sign-On (SSO)

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Hugging Face offers two distinct SSO models, each designed for different organizational needs. Understanding the differences between these two approaches is key to choosing the right setup for your team.

## At a glance

| | **Basic SSO** | **Managed SSO** |
| --- | --- | --- |
| **Plan** | Team & Enterprise | Enterprise Plus |
| **Scope** | Organization resources only | Entire Hugging Face platform |
| **Replaces the Hugging Face login** | No - users keep their existing Hugging Face credentials | Yes - your IdP becomes the only login method |
| **User accounts** | Users keep their personal Hugging Face account | Accounts are owned and managed by the organization |
| **Personal content** | Users can create content in their personal namespace | Users can only create content within the organization |
| **Multi-org membership** | Users can belong to multiple organizations | Users are restricted to their managing organization |
| **User provisioning** | Manual (SSO join link) - or invitation-based [SCIM](./enterprise-scim) on Enterprise | Full lifecycle ([SCIM](./enterprise-scim)) |
| **Setup** | Self-service from organization settings | Requires setup with the Hugging Face team |
| **External collaborators** | Yes | Yes |
| **Protocols** | SAML 2.0 and OIDC | SAML 2.0 and OIDC |
| **Role mapping** | Yes | Yes |
| **Resource group mapping** | Yes | Yes |

## Basic SSO

Basic SSO adds an access-control layer on top of the standard Hugging Face login. It does **not** replace the Hugging Face login - members keep their existing credentials and are prompted to complete SSO only when accessing your organization's resources.

This is well suited for teams that want to **secure access to their organizational resources while preserving the flexibility of individual Hugging Face accounts**. Setup is self-service from your organization's settings.

[Getting started with Basic SSO →](./security-sso-basic)

## Managed SSO

Managed SSO **replaces the Hugging Face login entirely**. Your Identity Provider becomes the sole authentication method across the entire Hugging Face platform. The organization controls the full user lifecycle, from account creation to deactivation.

This is designed for companies that require **complete control over identity, access, and data governance**. Managed accounts have [specific restrictions](./enterprise-advanced-sso#restrictions-on-managed-accounts) (no personal content, organization-bound collaboration). Setup requires coordination with the Hugging Face team.

[Getting started with Managed SSO →](./enterprise-advanced-sso)

## User Provisioning (SCIM)

Both SSO models support [SCIM](./enterprise-scim) (System for Cross-domain Identity Management) to automate user provisioning from your Identity Provider. The two models use SCIM differently, consistent with their respective philosophies:

- **Basic SSO** (Enterprise plan): SCIM automates the **invitation** of existing Hugging Face users to your organization. Users must accept the invitation to join.
- **Managed SSO** (Enterprise Plus plan): SCIM manages the **entire user lifecycle** - account creation, profile updates, and deactivation.

Learn more in the [User Provisioning (SCIM) guide](./enterprise-scim).

## Which model should you choose?
**Choose Basic SSO** if your team needs to secure access to organizational resources while allowing members to maintain their own Hugging Face accounts and participate in the broader community.

**Choose Managed SSO** if your enterprise requires centralized control over all user accounts, automated provisioning and deprovisioning, and strict data governance policies that prevent any content from being created outside the organization.

Both models support SAML 2.0 and OIDC protocols and can be integrated with popular identity providers such as Okta, Microsoft Entra ID (Azure AD), and Google Workspace.

## Further reading

- [User Management](./security-sso-user-management) - Role mapping, resource group mapping, session timeout, and more
- [Configuration Guides](./security-sso-configuration-guides) - Step-by-step setup instructions for Okta, Microsoft Entra ID, and Google Workspace

### Managing Spaces with Github Actions

https://huggingface.co/docs/hub/spaces-github-actions.md

# Managing Spaces with Github Actions

You can keep your Space in sync with your GitHub repository using the official [`huggingface/hub-sync`](https://github.com/marketplace/actions/sync-github-to-hugging-face-hub) GitHub Action. `hub-sync` also works for Models and Datasets. See [GitHub Actions](./repositories-github-actions) for general usage.

## Setup

1. Create a [GitHub secret](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-an-environment) called `HF_TOKEN` with a Hugging Face [access token](https://huggingface.co/settings/tokens).
2. Add a workflow file (e.g. `.github/workflows/sync-to-hub.yml`) to your repository:

```yaml
name: Sync to Hugging Face Hub
on:
  push:
    branches: [main]
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: huggingface/hub-sync@v0.1.0
        with:
          github_repo_id: ${{ github.repository }}
          huggingface_repo_id: username/my-space
          hf_token: ${{ secrets.HF_TOKEN }}
```

You can configure the Space SDK with `space_sdk` (defaults to `gradio`). See [all parameters](./repositories-github-actions#parameters).

## How it works

The action mirrors your files to the Hub using the `hf` CLI (`hf repo create` + `hf upload`). It is not a git-to-git sync - it uploads the file contents and automatically excludes the `.github/` and `.git/` directories. Files removed from your GitHub repository will also be removed from the Hub.

For more complex workflows (e.g. build steps, custom logic), you can install and use the [`hf` CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) directly in your workflow instead.

## File size considerations

For files larger than 10MB, Spaces requires [Git-LFS](./repositories-getting-started#terminal). Make sure large files in your GitHub repository are tracked with LFS before syncing.

## Alternative: manual git push

If you prefer a direct git-to-git sync instead of file mirroring, you can push to your Space's git remote directly:

```yaml
name: Sync to Hugging Face hub
on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  sync-to-hub:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
          lfs: true
      - name: Push to hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: git push https://HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/HF_USERNAME/SPACE_NAME main
```

Replace `HF_USERNAME` with your username and `SPACE_NAME` with your Space name.
### Billing

https://huggingface.co/docs/hub/billing.md

# Billing

At Hugging Face, we build a collaboration platform for the ML community (i.e., the Hub) and monetize by providing advanced features and simple access to compute for AI. Any feedback or support request related to billing is welcome at billing@huggingface.co.

## Team and Enterprise subscriptions

We offer advanced security and compliance features for organizations through our Team or Enterprise plans, which include [Single Sign-On](./enterprise-sso), [Advanced Access Control](./enterprise-resource-groups) for repositories, control over your data location, higher [storage capacity](./storage-limits) for public and private repositories, and more.

Team and Enterprise plans are billed like a typical subscription. They renew automatically, but you can choose to cancel at any time in the organization's billing settings. You can pay for a Team subscription with a credit card or your AWS account, or upgrade to Enterprise via an annual contract. Upon renewal, the number of seats in your subscription will be updated to match the number of members of your organization.

Private repository storage above the [included storage](./storage-limits) will be billed along with your subscription renewal.
## PRO subscription

The PRO subscription unlocks essential features for serious users, including:

- Higher [storage capacity](./storage-limits) for public and private repositories
- Higher bandwidth and API [rate limits](./rate-limits)
- Included credits for [Inference Providers](/docs/inference-providers/)
- Higher tier for ZeroGPU Spaces usage, and pay-as-you-go quota extension
- Ability to create ZeroGPU Spaces and use Dev Mode
- Ability to publish Social Posts and Community Blogs
- Leverage the [Data Studio](./data-studio) on private datasets
- Run and schedule serverless [CPU/GPU Jobs](https://huggingface.co/docs/huggingface_hub/en/guides/jobs)

View the full list of benefits at https://huggingface.co/pro, then subscribe at https://huggingface.co/subscribe/pro.

Similarly to the Team & Enterprise subscriptions, PRO subscriptions are billed like a typical subscription. The subscription renews automatically. You can cancel the subscription at any time in your billing settings: https://huggingface.co/settings/billing

You can only pay for the PRO subscription with a credit card. The subscription is billed separately from any pay-as-you-go compute usage. Private repository storage above the [included storage](./storage-limits) will be billed along with your subscription renewal.

Note: PRO benefits are also included in the [Enterprise subscription](https://huggingface.co/enterprise).

## Pay-as-you-go private storage

Above the included 1TB (or 1TB per seat) of private storage in PRO, Team, and Enterprise, additional private storage is billed in 1TB increments, at a base price of **$18/TB/month**. Overage is charged to your payment method in pay-as-you-go mode. Additional discounts are available for large-scale volumes through our account executives. See the full pricing tiers at [huggingface.co/pricing](https://huggingface.co/pricing#storage).
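As a rough illustration of the rule above (1TB included per seat, then $18 per additional TB per month), the overage can be computed as follows. This is a sketch, not the actual billing code: the helper name is our own, and we assume a started TB is rounded up to a full 1TB increment:

```python
import math

INCLUDED_TB_PER_SEAT = 1   # included private storage (1TB, or 1TB per seat)
PRICE_PER_TB = 18.0        # $/TB/month for additional private storage

def monthly_storage_overage(used_tb: float, seats: int = 1) -> float:
    """Estimated monthly charge for private storage above the included amount.

    Overage is billed in 1TB increments; we assume a partial TB
    counts as a full increment.
    """
    included = INCLUDED_TB_PER_SEAT * seats
    extra_tb = max(0.0, used_tb - included)
    return math.ceil(extra_tb) * PRICE_PER_TB
```

For example, 3TB of private storage on a single-seat plan would incur an estimated $36/month of overage under these assumptions.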
## Compute Services on the Hub

We also directly provide compute services with [Spaces](./spaces), [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) and [Inference Providers](https://huggingface.co/docs/inference-providers/index).

While most of our compute services have a comprehensive free tier, users and organizations can pay to access more powerful hardware accelerators. The billing for our compute services is usage-based, meaning you only pay for what you use. You can monitor your usage at any time from your billing dashboard, located in your user's or organization's settings menu.

Compute services usage is billed separately from PRO and Team / Enterprise subscriptions (and potential private storage). Invoices for compute services are issued at the beginning of each month.

## Available payment methods

Hugging Face uses [Stripe](https://stripe.com) to securely process your payment information. The only payment method supported for Hugging Face compute services is a credit card. You can add a credit card to your account from your billing settings.

### Billing thresholds & Invoicing

When using a credit card as a payment method, you'll be billed for Hugging Face compute usage each time the accrued usage goes above a billing threshold for your user or organization. On the 1st of every month, Hugging Face issues an invoice for usage accrued during the prior month. Any usage that has yet to be charged will be charged at that time.

For example, if your billing threshold is set at $100.00, and you incur $254.00 of usage during a given month, your credit card will be charged a total of three times during the month:

- Once for usage between $0 and $100: $100
- Once for usage between $100 and $200: $100
- Once at the end of the month for the remaining $54: $54

Note: this will be detailed in your monthly invoice. You can view invoices and receipts for the last 3 months in your billing dashboard.
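The threshold mechanics above can be sketched as a simple model (this is an illustration of the worked example, not the actual billing code; the function name is our own):

```python
def charge_schedule(total_usage: float, threshold: float) -> list[float]:
    """Charges fired each time accrued usage crosses the threshold,
    plus a final charge for any remainder on the monthly invoice."""
    charges = []
    remaining = total_usage
    while remaining >= threshold:
        charges.append(threshold)
        remaining -= threshold
    if remaining > 0:
        charges.append(remaining)  # billed with the end-of-month invoice
    return charges
```

With a $100 threshold and $254 of usage this yields three charges of $100, $100, and $54, matching the example above.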
## Cloud providers partnerships

We partner with cloud providers like [AWS](https://huggingface.co/blog/aws-partnership), [Azure](https://huggingface.co/blog/hugging-face-endpoints-on-azure), and [Google Cloud](https://huggingface.co/blog/llama31-on-vertex-ai) to make it easy for customers to use Hugging Face directly in their cloud of choice. These solutions and usage are billed directly by the cloud provider. Ultimately, we want people to have great options for using Hugging Face wherever they build ML-powered products.

You also have the option to link your Hugging Face organization to your AWS account via [AWS Marketplace](https://aws.amazon.com/marketplace/pp/prodview-n6vsyhdjkfng2). Hugging Face compute service usage will then be included in your AWS bill. Read more in our [blog post](https://huggingface.co/blog/aws-marketplace).

## Support FAQ

**Q. Why do I need to add credits? What can I use them for?**

A. Credits let you use HF pay-as-you-go services:

- Jobs: run any workload on GPUs
- Inference Providers: call 250k+ models via API
- Inference Endpoints: dedicated deployments
- GPU Spaces: host on custom hardware
- ZeroGPU: extra quota beyond daily allowance
- Private Storage: extra storage for private repos

**Q. What happens if I run out of credits?**

A. We recommend enabling automatic recharge to avoid service disruptions after credits are exhausted.

**Q. I'm having issues adding my card. What's up?**

A. Please ensure the card supports 3D-Secure authentication and is properly configured for recurring online payments. We do not yet support credit cards issued in India while we work on compliance with the latest RBI directives.
Until we add support for Indian credit cards, you can:

* Link an organization account to an AWS account in order to access pay-as-you-go features (Endpoints, Spaces, AutoTrain): [Hugging Face Platform on the AWS Marketplace: Pay with your AWS Account](https://huggingface.co/blog/aws-marketplace)
* Use a credit card issued in another country

**Q. How can I add my tax ID or update the billing details?**

A. Email billing@huggingface.co and we can help!

**Q. I was just billed for the PRO/Team subscription a few days ago. Why did you charge me again?**

A. All subscriptions renew on the 1st of each month. We prorate the subscription charge if you sign up mid-month for your first month of Team or PRO.

**Q. I need copies of my past invoices, where can I find these?**

A. View and download all invoices here: https://huggingface.co/settings/billing/invoices. Invoices are also emailed.

**Q. I need to update my credit card in my account. What to do?**

A. Head to https://huggingface.co/settings/billing/payment and update your payment method at any time.

**Subscriptions**

**Q. I need to pause my PRO subscription for a bit, where can I do this?**

A. You can cancel your subscription at any time here: https://huggingface.co/settings/billing/subscription. Drop us a line at billing@huggingface.co with your feedback.

**Q. My org has a Team or Enterprise subscription and I need to update the number of seats. How can I do this?**

A. The number of seats will automatically be adjusted at the time of the subscription renewal to reflect any increases in the number of members in the organization during the previous period. There's no need to update the subscribed number of seats during the month or year, as it's a flat-fee subscription.

### More ways to create Spaces

https://huggingface.co/docs/hub/spaces-more-ways-to-create.md

# More ways to create Spaces

## Duplicating a Space

You can duplicate a Space by clicking the three dots at the top right and selecting **Duplicate this Space**.
Learn more about it [here](./spaces-overview#duplicating-a-space).

## Creating a Space from a model

You can create a Gradio demo directly from most model pages, using the "Deploy -> Spaces" button. As another example of how to create a Space from a set of models, the [Model Comparator Space Builder](https://huggingface.co/spaces/farukozderim/Model-Comparator-Space-Builder) from [@farukozderim](https://huggingface.co/farukozderim) can be used to create a Space directly from any model hosted on the Hub.

### Hugging Face Hub documentation

https://huggingface.co/docs/hub/index.md

# Hugging Face Hub documentation

The Hugging Face Hub is the reference AI platform for open ML. It hosts over 2M models, 1.5M datasets, and 1.5M AI apps (Spaces), all open and publicly available. Beyond open AI, the Hub is also a great collaboration platform for internal and private teams. Explore, experiment, collaborate, and build, all in one place! 🤗
## What's the Hugging Face Hub?

We are helping the community work together towards the goal of advancing Machine Learning 🔥. No single company, including the Tech Titans, will be able to "solve AI" by themselves; the only way we'll achieve this is by sharing knowledge and resources in a community-centric approach. We are building the largest open-source collection of models, datasets, and demos on the Hugging Face Hub to democratize and advance ML for everyone 🚀.

We encourage you to read the [Code of Conduct](https://huggingface.co/code-of-conduct) and the [Content Guidelines](https://huggingface.co/content-guidelines) to familiarize yourself with the values that we expect our community members to uphold 🤗.

## What can you find on the Hub?

The Hugging Face Hub hosts Git-based repositories, which are version-controlled folders that can contain all your files. For non-versioned, mutable object storage, the Hub also offers [Storage Buckets](./storage-buckets). On it, you'll be able to upload and discover...

- Models: _hosting the latest state-of-the-art models for LLM, text, vision, and audio tasks_
- Datasets: _featuring a wide variety of data for different domains and modalities_
- Spaces: _interactive apps for demonstrating ML models directly in your browser_

The Hub offers **versioning, commit history, diffs, branches, and over a dozen library integrations**!
All repositories build on [Xet](./xet/index), a technology for efficiently storing large files in Git that intelligently splits files into unique chunks to accelerate uploads and downloads. You can learn more about the features that all repositories share in the [**Repositories documentation**](./repositories).

## Models

You can discover and use tens of thousands of open-source ML models shared by the community. To promote responsible model usage and development, model repos are equipped with [Model Cards](./model-cards) to inform users of each model's limitations and biases. Additional [metadata](./model-cards#model-card-metadata) about info such as their tasks, languages, and evaluation results can be included, with training metrics charts even added if the repository contains [TensorBoard traces](./tensorboard). It's also easy to add an [**inference widget**](./models-widgets) to your model, allowing anyone to play with the model directly in the browser! For programmatic access, a serverless API is provided by [**Inference Providers**](./models-inference).

To upload models to the Hub, or download models and integrate them into your work, explore the [**Models documentation**](./models). You can also choose from [**over a dozen libraries**](./models-libraries) such as 🤗 Transformers, Asteroid, and ESPnet that support the Hub.

## Datasets

The Hub is home to over 500k public datasets in more than 8k languages that can be used for a broad range of tasks across NLP, Computer Vision, and Audio. The Hub makes it simple to find, download, and upload datasets. Datasets are accompanied by extensive documentation in the form of [**Dataset Cards**](./datasets-cards) and [**Data Studio**](./datasets-viewer) to let you explore the data directly in your browser.

While many datasets are public, [**organizations**](./organizations) and individuals can create private datasets to comply with licensing or privacy requirements.
You can learn more about [**Datasets here on the Hugging Face Hub documentation**](./datasets-overview).

The [🤗 `datasets`](https://huggingface.co/docs/datasets/index) library allows you to programmatically interact with datasets, so you can easily use datasets from the Hub in your projects. With a single line of code, you can access a dataset; even if it is so large that it doesn't fit on your computer, you can use streaming to efficiently access the data.

## Spaces

[Spaces](https://huggingface.co/spaces) is a simple way to host ML demo apps on the Hub. They allow you to build your ML portfolio, showcase your projects at conferences or to stakeholders, and work collaboratively with other people in the ML ecosystem.

We currently support two awesome Python SDKs (**[Gradio](https://gradio.app/)** and **[Streamlit](./spaces-sdks-streamlit)**) that let you build cool apps in a matter of minutes. Users can also create static Spaces, which are simple HTML/CSS/JavaScript pages, or deploy any Docker-based application. If you need GPU power for your demos, try [**ZeroGPU**](./spaces-zerogpu): it dynamically provides NVIDIA H200 GPUs, in real time, only when needed.

After you've explored a few Spaces (take a look at our [Space of the Week!](https://huggingface.co/spaces)), dive into the [**Spaces documentation**](./spaces-overview) to learn all about how you can create your own Space. You'll also be able to upgrade your Space to run on a GPU or other accelerated hardware. ⚡️

## Storage Buckets

[Storage Buckets](./storage-buckets) provide S3-like object storage on Hugging Face, powered by the Xet storage backend. Unlike repositories (which are git-based and track file history), buckets are remote object storage containers designed for large-scale files with content-addressable deduplication.
They are designed for use cases where you need simple, fast, mutable storage, such as storing training checkpoints, logs, intermediate artifacts, or any large collection of files that doesn't need version control.

## Organizations

Companies, universities, and non-profits are an essential part of the Hugging Face community! The Hub offers [**Organizations**](./organizations), which can be used to group accounts and manage datasets, models, and Spaces. Educators can also create collaborative organizations for students using [Hugging Face for Classrooms](https://huggingface.co/classrooms). An organization's repositories will be featured on the organization's page, and every member of the organization will have the ability to contribute to the repository.

In addition to conveniently grouping all of an organization's work, the Hub allows admins to set roles to [**control access to repositories**](./organizations-security) and manage their organization's [payment method and billing info](https://huggingface.co/pricing). Machine Learning is more fun when collaborating! 🔥

[Explore existing organizations](https://huggingface.co/organizations), create a new organization [here](https://huggingface.co/organizations/new), and then visit the [**Organizations documentation**](./organizations) to learn more.

## Security

The Hugging Face Hub supports security and access control features to give you the peace of mind that your code, models, and data are safe. Visit the [**Security**](./security) section in these docs to learn about:

- User Access Tokens
- Access Control for Organizations
- Signing commits with GPG
- Malware scanning

### Advanced Security

https://huggingface.co/docs/hub/enterprise-advanced-security.md

# Advanced Security

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Team & Enterprise organizations can improve their security with advanced security controls for both members and repositories.
## Members Security

Configure additional security settings to protect your organization:

- **Two-Factor Authentication (2FA)**: Require all organization members to enable 2FA for enhanced account security.
- **User Approval**: For organizations with a verified domain name, require admin approval for new users with matching email addresses. This adds a verified badge to your organization page.
- **Hide members list**: When enabled, the list of members will not be visible on the organization page. Note that users can potentially find organization membership information through other means, so do not rely on this for critical use cases.

## Repository Visibility Controls

Manage the default visibility of repositories in your organization:

- **Public by default**: New repositories are created with public visibility.
- **Private by default**: New repositories are created with private visibility. Note that changing this setting will not affect existing repositories.
- **Private only**: Enforce private visibility for all new repositories, with only organization admins able to change visibility settings.

These settings help organizations maintain control over their content while enabling collaboration when needed.

### Using Unity Sentis Models from Hugging Face

https://huggingface.co/docs/hub/unity-sentis.md

# Using Unity Sentis Models from Hugging Face

[Unity 3D](https://unity.com/) is one of the most popular game engines in the world. [Unity Sentis](https://unity.com/products/sentis) is the inference engine that runs on Unity 2023 or above. It is an API that allows you to easily integrate and run neural network models in your game or application, making use of hardware acceleration. Because Unity can export to many different form factors, including PC, mobile, and consoles, this is an easy way to run neural network models on many different types of hardware.
## Exploring Sentis Models in the Hub

You will find `unity-sentis` models by filtering at the left of the [models page](https://huggingface.co/models?library=unity-sentis). All the Sentis models in the Hub come with code and instructions to easily get you started using the model in Unity. All Sentis models under the `unity` namespace (for example, [unity/sentis-yolotinyv7](https://huggingface.co/unity/sentis-yolotinyv7)) have been validated to work, so you can be sure they will run in Unity.

To get more details about using Sentis, you can read its [documentation](https://docs.unity3d.com/Packages/com.unity.sentis@latest). To get help from others using Sentis, you can ask in its [discussion forum](https://discussions.unity.com/c/ai-beta/sentis).

## Types of files

Each repository will contain several types of files:

* `sentis` files: These are the main model files that contain the neural networks that run on Unity.
* `ONNX` files: This is an alternative format you can include in addition to, or instead of, the Sentis files. It can be useful for visualization with third-party tools such as [Netron](https://github.com/lutzroeder/netron).
* `cs` files: These are C# files that contain the code to run the model on Unity.
* `info.json`: This file contains information about the files in the repository.
* Data files: These are other files that are needed to run the model. They could include vocabulary files, lists of class names, etc. Some typical files will have the extensions `json` or `txt`.
* `README.md`: This is the model card. It contains instructions on how to use the model and other relevant information.

## Running the model

Always refer to the instructions on the model card. It is expected that you have some knowledge of Unity and some basic knowledge of C#.

1. Open Unity 2023 or above and create a new scene.
2. Install the `com.unity.sentis` package from the [package manager](https://docs.unity3d.com/Manual/upm-ui-quick.html).
3.
   Download your model files (`*.sentis`) and data files and put them in the StreamingAssets folder, which is a subfolder inside the Assets folder. (If this folder does not exist, you can create it.)
4. Place your C# file on an object in the scene, such as the Main Camera.
5. Refer to the model card to see if there are any other objects you need to create in the scene.

In most cases, we only provide the basic implementation to get you up and running. It is up to you to find creative uses. For example, you may want to combine two or more models to do interesting things.

## Sharing your own Sentis models

We encourage you to share your own Sentis models on Hugging Face. These may be models you trained yourself or models you have converted to the [Sentis format](https://docs.unity3d.com/Packages/com.unity.sentis@1.3/manual/serialize-a-model.html) and have tested to run in Unity.

Please provide the models in the Sentis format for each repository you upload. This provides an extra check that they will run in Unity and is also the preferred format for large models. You can also include the original ONNX versions of the model files.

Provide a C# file with a minimal implementation. For example, an image processing model should have code that shows how to prepare the image for the input and construct the image from the output. Alternatively, you can link to some external sample code. This will make it easy for others to download and use the model in Unity. Provide any data files needed to run the model, for example, vocabulary files.

Finally, please provide an `info.json` file, which lists your project's files. This helps in counting the downloads.
Some examples of the contents of `info.json` are:

```json
{
  "code": ["mycode.cs"],
  "models": ["model1.sentis", "model2.sentis"],
  "data": ["vocab.txt"]
}
```

Or if your code sample is external:

```json
{
  "sampleURL": ["http://sampleunityproject"],
  "models": ["model1.sentis", "model2.sentis"]
}
```

## Additional Information

We also have some full [sample projects](https://github.com/Unity-Technologies/sentis-samples) to help you get started using Sentis.

### FiftyOne

https://huggingface.co/docs/hub/datasets-fiftyone.md

# FiftyOne

FiftyOne is an open-source toolkit for curating, visualizing, and managing unstructured visual data. The library streamlines data-centric workflows, from finding low-confidence predictions to identifying poor-quality samples and uncovering hidden patterns in your data. The library supports all sorts of visual data, from images and videos to PDFs, point clouds, and meshes.

FiftyOne accommodates object detections, keypoints, polylines, and custom schemas. FiftyOne is integrated with the Hugging Face Hub so that you can load and share FiftyOne datasets directly from the Hub.

🚀 Try the FiftyOne 🤝 Hugging Face Integration in [Colab](https://colab.research.google.com/drive/1l0kzfbJ2wtUw1EGS1tq1PJYoWenMlihp?usp=sharing)!
## Prerequisites

First [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login):

```bash
hf auth login
```

Make sure you have `fiftyone>=0.24.0` installed:

```bash
pip install -U fiftyone
```

## Loading Visual Datasets from the Hub

With `load_from_hub()` from FiftyOne's Hugging Face utils, you can load:

- Any FiftyOne dataset uploaded to the hub
- Most image-based datasets stored in Parquet files (which is the standard for datasets uploaded to the hub via the `datasets` library)

### Loading FiftyOne datasets from the Hub

Any dataset pushed to the hub in one of FiftyOne's [supported common formats](https://docs.voxel51.com/user_guide/dataset_creation/datasets.html#supported-import-formats) should have all of the necessary configuration info in its dataset repo on the hub, so you can load the dataset by specifying its `repo_id`. As an example, to load the [VisDrone detection dataset](https://huggingface.co/datasets/Voxel51/VisDrone2019-DET):

```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

## load from the hub
dataset = load_from_hub("Voxel51/VisDrone2019-DET")

## visualize in app
session = fo.launch_app(dataset)
```

![FiftyOne VisDrone dataset](https://cdn-uploads.huggingface.co/production/uploads/63127e2495407887cb79c5ea/0eKxe_GSsBjt8wMjT9qaI.jpeg)

You can [customize the download process](https://docs.voxel51.com/integrations/huggingface.html#configuring-the-download-process), including the number of samples to download, the name of the created dataset object, or whether or not it is persisted to disk.

You can list all the available FiftyOne datasets on the Hub using:

```python
from huggingface_hub import HfApi

api = HfApi()
api.list_datasets(tags="fiftyone")
```

### Loading Parquet Datasets from the Hub with FiftyOne

You can also use the `load_from_hub()` function to load datasets from Parquet files. Type conversions are handled for you, and images are downloaded from URLs if necessary.
With this functionality, [you can load](https://docs.voxel51.com/integrations/huggingface.html#basic-examples) any of the following:

- [FiftyOne-Compatible Image Classification Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-classification-datasets-665dfd51020d8b66a56c9b6f), like [Food101](https://huggingface.co/datasets/food101) and [ImageNet-Sketch](https://huggingface.co/datasets/imagenet_sketch)
- [FiftyOne-Compatible Object Detection Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-object-detection-datasets-665e0279c94ae552c7159a2b), like [CPPE-5](https://huggingface.co/datasets/cppe-5) and [WIDER FACE](https://huggingface.co/datasets/wider_face)
- [FiftyOne-Compatible Segmentation Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-segmentation-datasets-665e15b6ddb96a4d7226a380), like [SceneParse150](https://huggingface.co/datasets/scene_parse_150) and [Sidewalk Semantic](https://huggingface.co/datasets/segments/sidewalk-semantic)
- [FiftyOne-Compatible Image Captioning Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-captioning-datasets-665e16e29350244c06084505), like [COYO-700M](https://huggingface.co/datasets/kakaobrain/coyo-700m) and [New Yorker Caption Contest](https://huggingface.co/datasets/jmhessel/newyorker_caption_contest)
- [FiftyOne-Compatible Visual Question-Answering Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-vqa-datasets-665e16424ecc8a718156248a), like [TextVQA](https://huggingface.co/datasets/textvqa) and [ScienceQA](https://huggingface.co/datasets/derek-thomas/ScienceQA)

As an example, we can load the first 1,000 samples from the [WikiArt dataset](https://huggingface.co/datasets/huggan/wikiart) into FiftyOne with:

```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "huggan/wikiart",   ## repo_id
    format="parquet",   ## for Parquet format
    classification_fields=["artist", "style", "genre"],  ## columns to treat as classification labels
    max_samples=1000,  # number of samples to load
    name="wikiart",    # name of the dataset in FiftyOne
)
```

![WikiArt Dataset](https://cdn-uploads.huggingface.co/production/uploads/63127e2495407887cb79c5ea/PCqCvTlNTG5SLtcK5fwuQ.jpeg)

## Pushing FiftyOne Datasets to the Hub

You can push a dataset to the hub with:

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone.utils.huggingface import push_to_hub

## load example dataset
dataset = foz.load_zoo_dataset("quickstart")

## push to hub
push_to_hub(dataset, "my-hf-dataset")
```

When you call `push_to_hub()`, the dataset will be uploaded to the repo with the specified repo name under your username, and the repo will be created if necessary. A [Dataset Card](./datasets-cards) will automatically be generated and populated with instructions for loading the dataset from the hub. You can upload a thumbnail image/gif to appear on the Dataset Card with the `preview_path` argument.
Here's an example using many of these arguments, which would upload the first three samples of FiftyOne's [Quickstart Video](https://docs.voxel51.com/user_guide/dataset_zoo/datasets.html#quickstart-video) dataset to the private repo `username/my-quickstart-video-dataset` with tags, an MIT license, a description, and a preview image:

```python
dataset = foz.load_zoo_dataset("quickstart-video", max_samples=3)

push_to_hub(
    dataset,
    "my-quickstart-video-dataset",
    tags=["video", "tracking"],
    license="mit",
    description="A dataset of video samples for tracking tasks",
    private=True,
    preview_path=""
)
```

## 📚 Resources

- [🚀 Code-Along Colab Notebook](https://colab.research.google.com/drive/1l0kzfbJ2wtUw1EGS1tq1PJYoWenMlihp?usp=sharing)
- [🗺️ User Guide for FiftyOne Datasets](https://docs.voxel51.com/user_guide/using_datasets.html#)
- [🤗 FiftyOne 🤝 Hub Integration Docs](https://docs.voxel51.com/integrations/huggingface.html#huggingface-hub)
- [🤗 FiftyOne 🤝 Transformers Integration Docs](https://docs.voxel51.com/integrations/huggingface.html#transformers-library)
- [🧩 FiftyOne Hugging Face Hub Plugin](https://github.com/voxel51/fiftyone-huggingface-plugins)

### Using timm at Hugging Face

https://huggingface.co/docs/hub/timm.md

# Using timm at Hugging Face

`timm`, also known as [pytorch-image-models](https://github.com/rwightman/pytorch-image-models), is an open-source collection of state-of-the-art PyTorch image models, pretrained weights, and utility scripts for training, inference, and validation.

This documentation focuses on `timm` functionality in the Hugging Face Hub instead of the `timm` library itself. For detailed information about the `timm` library, visit [its documentation](https://huggingface.co/docs/timm).

You can find a number of `timm` models on the Hub using the filters on the left of the [models page](https://huggingface.co/models?library=timm&sort=downloads).

All models on the Hub come with several useful features:

1.
An automatically generated model card, which model authors can complete with [information about their model](./model-cards).
2. Metadata tags that help users discover relevant `timm` models.
3. An [interactive widget](./models-widgets) you can use to play with the model directly in the browser.
4. [Inference Providers](./models-inference) support that allows users to make inference requests.

## Using existing models from the Hub

Any `timm` model from the Hugging Face Hub can be loaded with a single line of code as long as you have `timm` installed! Once you've selected a model from the Hub, pass the model's ID prefixed with `hf-hub:` to `timm`'s `create_model` method to download and instantiate the model.

```py
import timm

# Loading https://huggingface.co/timm/eca_nfnet_l0
model = timm.create_model("hf-hub:timm/eca_nfnet_l0", pretrained=True)
```

If you want to see how to load a specific model, you can click **Use in timm** and you will be given a working snippet to load it!

### Inference

The snippet below shows how you can perform inference on a `timm` model loaded from the Hub:

```py
import timm
import torch
from PIL import Image
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

# Load from Hub 🔥
model = timm.create_model(
    'hf-hub:nateraw/resnet50-oxford-iiit-pet',
    pretrained=True
)

# Set model to eval mode for inference
model.eval()

# Create Transform
transform = create_transform(**resolve_data_config(model.pretrained_cfg, model=model))

# Get the labels from the model config
labels = model.pretrained_cfg['label_names']
top_k = min(len(labels), 5)

# Use your own image file here...
image = Image.open('boxer.jpg').convert('RGB')

# Process PIL image with transforms and add a batch dimension
x = transform(image).unsqueeze(0)

# Pass inputs to model forward function to get outputs
out = model(x)

# Apply softmax to get predicted probabilities for each class
probabilities = torch.nn.functional.softmax(out[0], dim=0)

# Grab the values and indices of top 5 predicted classes
values, indices = torch.topk(probabilities, top_k)

# Prepare a nice dict of top k predictions
predictions = [
    {"label": labels[i], "score": v.item()} for i, v in zip(indices, values)
]
print(predictions)
```

This should leave you with a list of predictions, like this:

```py
[
    {'label': 'american_pit_bull_terrier', 'score': 0.9999998807907104},
    {'label': 'staffordshire_bull_terrier', 'score': 1.0000000149011612e-07},
    {'label': 'miniature_pinscher', 'score': 1.0000000149011612e-07},
    {'label': 'chihuahua', 'score': 1.0000000149011612e-07},
    {'label': 'beagle', 'score': 1.0000000149011612e-07}
]
```

## Sharing your models

You can share your `timm` models directly to the Hugging Face Hub. This will publish a new version of your model to the Hugging Face Hub, creating a model repo for you if it doesn't already exist.

Before pushing a model, make sure that you've logged in to Hugging Face:

```sh
python -m pip install huggingface_hub
hf auth login
```

Alternatively, if you prefer working from a Jupyter or Colaboratory notebook, once you've installed `huggingface_hub` you can log in with:

```py
from huggingface_hub import notebook_login
notebook_login()
```

Then, push your model using the `push_to_hf_hub` method:

```py
import timm

# Build or load a model, e.g. timm's pretrained resnet18
model = timm.create_model('resnet18', pretrained=True, num_classes=4)

###########################
# [Fine tune your model...]
###########################

# Push it to the 🤗 Hub
timm.models.hub.push_to_hf_hub(
    model,
    'resnet18-random-classifier',
    model_config={'labels': ['a', 'b', 'c', 'd']}
)

# Load your model from the Hub
model_reloaded = timm.create_model(
    'hf-hub:/resnet18-random-classifier',
    pretrained=True
)
```

## Inference Widget and API

All `timm` models on the Hub are automatically equipped with an [inference widget](./models-widgets), pictured below for [nateraw/timm-resnet50-beans](https://huggingface.co/nateraw/timm-resnet50-beans). Additionally, `timm` models are available through the [Inference Providers](./models-inference), which you can access through HTTP with cURL, Python's `requests` library, or your preferred method for making network requests.

```sh
curl https://api-inference.huggingface.co/models/nateraw/timm-resnet50-beans \
    -X POST \
    --data-binary '@beans.jpeg' \
    -H "Authorization: Bearer ${HF_API_TOKEN}"
# [{"label":"angular_leaf_spot","score":0.9845947027206421},{"label":"bean_rust","score":0.01368315052241087},{"label":"healthy","score":0.001722085871733725}]
```

## Additional resources

* timm (pytorch-image-models) [GitHub Repo](https://github.com/rwightman/pytorch-image-models).
* timm [documentation](https://huggingface.co/docs/timm).
* Additional documentation at [timmdocs](https://timm.fast.ai) by [Aman Arora](https://github.com/amaarora).
* [Getting Started with PyTorch Image Models (timm): A Practitioner's Guide](https://towardsdatascience.com/getting-started-with-pytorch-image-models-timm-a-practitioners-guide-4e77b4bf9055) by [Chris Hughes](https://github.com/Chris-hughes10).

### Git over SSH

https://huggingface.co/docs/hub/security-git-ssh.md

# Git over SSH

You can access and write data in repositories on huggingface.co using SSH (Secure Shell Protocol). When you connect via SSH, you authenticate using a private key file on your local machine.
Some actions, such as pushing changes, or cloning private repositories, will require you to upload your SSH public key to your account on huggingface.co.

You can use a pre-existing SSH key, or generate a new one specifically for huggingface.co.

## Checking for existing SSH keys

If you have an existing SSH key, you can use that key to authenticate Git operations over SSH.

SSH keys are usually located under `~/.ssh` on Mac & Linux, and under `C:\Users\<username>\.ssh` on Windows. List files under that directory and look for files of the form:

- id_rsa.pub
- id_ecdsa.pub
- id_ed25519.pub

Those files contain your SSH public key.

If you don't have such a file under `~/.ssh`, you will have to [generate a new key](#generating-a-new-ssh-keypair). Otherwise, you can [add your existing SSH public key(s) to your huggingface.co account](#add-a-ssh-key-to-your-account).

## Generating a new SSH keypair

If you don't have any SSH keys on your machine, you can use `ssh-keygen` to generate a new SSH key pair (public + private keys):

```
$ ssh-keygen -t ed25519 -C "your.email@example.co"
```

We recommend entering a passphrase when you are prompted to. A passphrase is an extra layer of security: it is a password that will be prompted whenever you use your SSH key.

Once your new key is generated, add it to your SSH agent with `ssh-add`:

```
$ ssh-add ~/.ssh/id_ed25519
```

If you chose a different location than the default to store your SSH key, you would have to replace `~/.ssh/id_ed25519` with the file location you used.

## Add a SSH key to your account

To access private repositories with SSH, or to push changes via SSH, you will need to add your SSH public key to your huggingface.co account. You can manage your SSH keys [in your user settings](https://huggingface.co/settings/keys).

To add a SSH key to your account, click on the "Add SSH key" button. Then, enter a name for this key (for example, "Personal computer"), and copy and paste the content of your **public** SSH key in the area below.
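If you're unsure which files are safe to paste, remember that only the `.pub` files are public. As a small sketch (plain `ls ~/.ssh/*.pub` does the same job), here is a standard-library Python helper that lists candidate public-key files in a directory:

```python
import tempfile
from pathlib import Path

def find_public_keys(ssh_dir):
    # Public keys conventionally end in ".pub"; everything else in
    # ~/.ssh (private keys, config, known_hosts) should never be shared.
    return sorted(p.name for p in Path(ssh_dir).glob("*.pub"))

# Demo on a throwaway directory; in practice pass Path.home() / ".ssh"
with tempfile.TemporaryDirectory() as d:
    for name in ("id_ed25519", "id_ed25519.pub", "known_hosts"):
        (Path(d) / name).touch()
    keys = find_public_keys(d)

print(keys)  # ['id_ed25519.pub']
```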
The public key is located in the `~/.ssh/id_XXXX.pub` file you found or generated in the previous steps.

Click on "Add key", and voilà! You have added a SSH key to your huggingface.co account.

## Testing your SSH authentication

Once you have added your SSH key to your huggingface.co account, you can test that the connection works as expected.

In a terminal, run:

```
$ ssh -T git@hf.co
```

If you see a message with your username, congrats! Everything went well, you are ready to use git over SSH.

Otherwise, if the message states something like the following, make sure your SSH key is actually used by your SSH agent.

```
Hi anonymous, welcome to Hugging Face.
```

## HuggingFace's SSH key fingerprints

Public key fingerprints can be used to validate a connection to a remote server.

These are HuggingFace's public key fingerprints:

> SHA256:aBG5R7IomF4BSsx/h6tNAUVLhEkkaNGB8Sluyh/Q/qY (ECDSA)
> SHA256:skgQjK2+RuzvdmHr24IIAJ6uLWQs0TGtEUt3FtzqirQ (DSA - deprecated)
> SHA256:dVjzGIdV7d6cwKIeZiCoRMa2gMvSKfGZAvHf4gMiMao (ED25519)
> SHA256:uqjYymysBGCXXiMVebB8L8RIuWbPSKGBxQQNhcT5a3Q (RSA)

You can add the following ssh key entries to your ~/.ssh/known_hosts file to avoid manually verifying HuggingFace hosts:

```
hf.co ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDtPB+snz63eZvTrbMY2Qt39a6HYile89JOum55z3lhIqAqUHxLtXFd+q+ED8izQvyORFPSmFIaPw05rtXo37bm+ixL6wDmvWrHN74oUUWmtrv2MNCLHE5VDb3+Q6MJjjDVIoK5QZIuTStlq0cUbGGxQk7vFZZ2VXdTPqgPjw4hMV7MGp3RFY/+Wy8rIMRv+kRCIwSAOeuaLPT7FzL0zUMDwj/VRjlzC08+srTQHqfoh0RguZiXZQneZKmM75AFhoMbP5x4AW2bVoZam864DSGiEwL8R2jMiyXxL3OuicZteZqll0qfRlNopKnzoxS29eBbXTr++ILqYz1QFqaruUgqSi3MIC9sDYEqh2Q8UxP5+Hh97AnlgWDZC0IhojVmEPNAc7Y2d+ctQl4Bt91Ik4hVf9bU+tqMXgaTrTMXeTURSXRxJEm2zfKQVkqn3vS/zGVnkDS+2b2qlVtrgbGdU/we8Fux5uOAn/dq5GygW/DUlHFw412GtKYDFdWjt3nJCY8=
hf.co ssh-dss AAAAB3NzaC1kc3MAAACBAORXmoE8fn/UTweWy7tCYXZxigmODg71CIvs/haZQN6GYqg0scv8OFgeIQvBmIYMnKNJ7eoo5ZK+fk1yPv8aa9+8jfKXNJmMnObQVyObxFVzB51x8yvtHSSrL4J3z9EAGX9l9b+Fr2+VmVFZ7a90j2kYC+8WzQ9HaCYOlrALzz2VAAAAFQC0RGD5dE5Du2vKoyGsTaG/mO2E5QAAAIAHXRCMYdZij+BYGC9cYn5Oa6ZGW9rmGk98p1Xc4oW+O9E/kvu4pCimS9zZordLAwHHWwOUH6BBtPfdxZamYsBgO8KsXOWugqyXeFcFkEm3c1HK/ysllZ5kM36wI9CUWLedc2vj5JC+xb5CUzhVlGp+Xjn59rGSFiYzIGQC6pVkHgAAAIBve2DugKh3x8qq56sdOH4pVlEDe997ovEg3TUxPPIDMSCROSxSR85fa0aMpxqTndFMNPM81U/+ye4qQC/mr0dpFLBzGuum4u2dEpjQ7B2UyJL9qhs1Ubby5hJ8Z3bmHfOK9/hV8nhyN8gf5uGdrJw6yL0IXCOPr/VDWSUbFrsdeQ==
hf.co ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBL0wtM52yIjm8gRecBy2wRyEMqr8ulG0uewT/IQOGz5K0ZPTIy6GIGHsTi8UXBiEzEIznV3asIz2sS7SiQ311tU=
hf.co ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINJjhgtT9FOQrsVSarIoPVI1jFMh3VSHdKfdqp/O776s
```

### Using `Transformers.js` at Hugging Face

https://huggingface.co/docs/hub/transformers-js.md

# Using `Transformers.js` at Hugging Face

Transformers.js is a JavaScript library for running 🤗 Transformers directly in your browser, with no need for a server! It is designed to be functionally equivalent to the original [Python library](https://github.com/huggingface/transformers), meaning you can run the same pretrained models using a very similar API.

## Exploring `transformers.js` in the Hub

You can find `transformers.js` models by filtering by library in the [models page](https://huggingface.co/models?library=transformers.js).

## Quick tour

It's super simple to translate from existing code! Just like the Python library, we support the `pipeline` API. Pipelines group together a pretrained model with preprocessing of inputs and postprocessing of outputs, making it the easiest way to run models with the library.
**Python (original):**

```python
from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
pipe = pipeline('sentiment-analysis')
out = pipe('I love transformers!')
# [{'label': 'POSITIVE', 'score': 0.999806941}]
```

**JavaScript (ours):**

```javascript
import { pipeline } from '@huggingface/transformers';

// Allocate a pipeline for sentiment-analysis
let pipe = await pipeline('sentiment-analysis');
let out = await pipe('I love transformers!');
// [{'label': 'POSITIVE', 'score': 0.999817686}]
```

You can also use a different model by specifying the model id or path as the second argument to the `pipeline` function. For example:

```javascript
// Use a different model for sentiment-analysis
let pipe = await pipeline('sentiment-analysis', 'nlptown/bert-base-multilingual-uncased-sentiment');
```

Refer to the [documentation](https://huggingface.co/docs/transformers.js) for the full list of supported tasks and models.

## Installation

To install via [NPM](https://www.npmjs.com/package/@huggingface/transformers), run:

```bash
npm i @huggingface/transformers
```

For more information, including how to use it in vanilla JS (without any bundler) via a CDN or static hosting, refer to the [README](https://github.com/huggingface/transformers.js/blob/main/README.md#installation).

## Additional resources

* Transformers.js [repository](https://github.com/huggingface/transformers.js)
* Transformers.js [docs](https://huggingface.co/docs/transformers.js)
* Transformers.js [demo](https://huggingface.github.io/transformers.js/)

### Spaces

https://huggingface.co/docs/hub/spaces.md

# Spaces

[Hugging Face Spaces](https://huggingface.co/spaces) offer a simple way to host ML demo apps directly on your profile or your organization's profile. This allows you to create your ML portfolio, showcase your projects at conferences or to stakeholders, and work collaboratively with other people in the ML ecosystem.
We have built-in support for an awesome SDK that lets you build cool apps in Python in a matter of minutes: **[Gradio](https://gradio.app/)**, but you can also unlock the whole power of Docker and host an arbitrary Dockerfile. Finally, you can create static Spaces using JavaScript and HTML.

You'll also be able to upgrade your Space to run [on a GPU or other accelerated hardware](./spaces-gpus). ⚡️

## Contents

- [Spaces Overview](./spaces-overview)
- [Handling Spaces Dependencies](./spaces-dependencies)
- [Spaces Settings](./spaces-settings)
- [Using OpenCV in Spaces](./spaces-using-opencv)
- [Using Spaces for Organization Cards](./spaces-organization-cards)
- [More ways to create Spaces](./spaces-more-ways-to-create)
- [Managing Spaces with Github Actions](./spaces-github-actions)
- [How to Add a Space to ArXiv](./spaces-add-to-arxiv)
- [Spaces Dev Mode](./spaces-dev-mode)
- [Spaces GPU Upgrades](./spaces-gpus)
- [Spaces Disk Usage & Storage](./spaces-storage)
- [Gradio Spaces](./spaces-sdks-gradio)
- [Docker Spaces](./spaces-sdks-docker)
- [Static HTML Spaces](./spaces-sdks-static)
- [Custom Python Spaces](./spaces-sdks-python)
- [Embed your Space](./spaces-embed)
- [Run your Space with Docker](./spaces-run-with-docker)
- [Reference](./spaces-config-reference)
- [Changelog](./spaces-changelog)

## Contact

Feel free to ask questions on the [forum](https://discuss.huggingface.co/c/spaces/24) if you need help with making a Space, or if you run into any other issues on the Hub.

If you're interested in infra challenges, custom demos, advanced GPUs, or something else, please reach out to us by sending an email to **website at huggingface.co**.

You can also tag us [on Twitter](https://twitter.com/huggingface)!
🤗

### Using OpenCV in Spaces

https://huggingface.co/docs/hub/spaces-using-opencv.md

# Using OpenCV in Spaces

In order to use OpenCV in your Gradio or Python Spaces, you'll need to make the Space install both the Python and Debian dependencies. This means adding `python3-opencv` to the `packages.txt` file, and adding `opencv-python` to the `requirements.txt` file. If those files don't exist, you'll need to create them.

To see an example, [see this Gradio project](https://huggingface.co/spaces/templates/gradio_opencv/tree/main).

### Storage Buckets: Security & Compliance

https://huggingface.co/docs/hub/storage-buckets-security.md

# Storage Buckets: Security & Compliance

Storage Buckets are built on the same infrastructure that powers the Hugging Face Hub, with enterprise-grade security and compliance built in.

## Encryption

All data stored in buckets is encrypted at rest using **AES-256** encryption. Data in transit is protected via **TLS**.

## Access Control

Buckets use the Hub's standard access control mechanisms:

- **SSO**: Authenticate through your organization's identity provider via [Single Sign-On](./security-sso)
- **RBAC**: Fine-grained permissions through [Resource Groups](./security-resource-groups) let you control who can read, write, or admin each bucket
- **Tokens**: Programmatic access is managed through [User Access Tokens](./security-tokens) with scoped permissions

## Audit Logs

All bucket operations — uploads, downloads, deletions, and permission changes — are recorded in your organization's [Audit Logs](./audit-logs), giving you a full trail of who accessed what and when.

## Data Residency

Bucket data is stored in **US and EU regions**. You can choose where your data lives when creating a bucket, and [pre-warming](./storage-buckets#pre-warming-and-cdn) lets you cache data closer to your compute in specific cloud regions.
## Compliance

Hugging Face maintains the following certifications and compliance standards:

- **SOC 2 Type 2** certified — active monitoring and patching of security vulnerabilities
- **GDPR** compliant — data processing agreements available through [Enterprise Plans](https://huggingface.co/pricing)

For more details on Hugging Face's overall security posture, see the [Security](./security) page. For questions, contact [security@huggingface.co](mailto:security@huggingface.co).

### Ingesting Datasets

https://huggingface.co/docs/hub/datasets-ingesting.md

# Ingesting Datasets

Data generally lives in databases or cloud storage in forms that are not suited for AI workflows. Ingesting data to the [Hub](https://huggingface.co/datasets) is a good way to publish it as AI-ready datasets, enabling easy and efficient data loading, processing, model training, and evaluation.

## Using `huggingface_hub`

The simplest way to ingest data is to simply upload the data files with `huggingface_hub`. The `huggingface_hub` Python library provides a rich feature set that allows you to manage repositories, including creating repos and uploading datasets to the Hub. Visit [the client library's documentation](/docs/huggingface_hub/index) to learn more.

This is relevant if your data is static/frozen and if you can easily obtain a local dump of the data in a format supported by the Hub (e.g., Parquet or JSON Lines) with a usable structure (e.g., well-defined fields for training and evaluation).

## Using `dlt`

[dlt](http://github.com/dlt-hub/dlt) is an open-source Python library for data movement (ETL), and is useful for developers (and their agents) building data pipelines. It can ingest data from diverse source types:

* Cloud storage or files
* REST APIs
* SQL databases
* Python generators

Examples of source types:

* `filesystem` (includes s3, gs, az, abff, etc.)
* `sql_database`, `mongodb`, `google_sheets`
* `notion`, `hubspot`, `rest_api`

Find your source type from the [list of sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources) and create your `dlt` project:

```
dlt init filesystem
```

You can then create a configuration file `.dlt/secrets.toml` in the root of your dlt project to define the Hub as a filesystem destination for your datasets, based on the `hf://` protocol:

```toml
[destination.filesystem]
bucket_url = "hf://datasets/<namespace>"

[destination.filesystem.credentials]
hf_token = "hf_..."  # Your Hugging Face Access Token
```

The `<namespace>` should be your username or the name of the organization/team where you want to ingest your dataset.

Then each dlt dataset creates or updates a Hugging Face dataset repository. The repository name is `<namespace>/<dataset_name>`, where `<namespace>` is the same one you used in the `bucket_url` (your organization or team), and `<dataset_name>` is the pipeline's `dataset_name`. Here is an example pipeline:

```python
import dlt
import requests

@dlt.resource
def my_data():
    # One of the functions auto-generated by `dlt init` that you can customize,
    # or you can define your own python generator function.
    # Here is an example from the `chess` source type:
    for player in ['magnuscarlsen', 'rpragchess']:
        response = requests.get(f'https://api.chess.com/pub/player/{player}')
        response.raise_for_status()
        yield response.json()

# Requires bucket_url = "hf://datasets/<namespace>" in .dlt/secrets.toml
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="filesystem",
    dataset_name="dataset_name",
)
pipeline.run(my_data())
```

Customize the `dlt` resource to load the data you want and parse the fields you want to publish in your dataset, e.g. the text you need for training and evaluation.
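To make the "parse the fields" step concrete, here is a loose, dlt-free sketch of the kind of record transformation you might put inside a resource. The field names (`content`, `category`) are hypothetical placeholders, not part of any real source schema:

```python
def parse_record(record):
    # Keep only the fields needed for training/evaluation
    # (hypothetical field names, for illustration only).
    return {
        "text": record.get("content", "").strip(),
        "label": record.get("category", "unknown"),
    }

def parsed_records(raw_records):
    # In a real pipeline, this generator body would live inside a
    # @dlt.resource-decorated function.
    for record in raw_records:
        parsed = parse_record(record)
        if parsed["text"]:  # drop rows with no usable text
            yield parsed

rows = list(parsed_records([
    {"content": "  hello world ", "category": "greeting"},
    {"content": "   "},  # filtered out: no text after stripping
]))
print(rows)  # [{'text': 'hello world', 'label': 'greeting'}]
```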
## Using other libraries

Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](./datasets-pandas), [Polars](./datasets-polars), [Dask](./datasets-dask), [DuckDB](./datasets-duckdb), [Spark](./datasets-spark), or [Daft](./datasets-daft) can ingest data from various places to the Hub. See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information.

## Ingest raw data

If you are ingesting raw data that needs further curation before being published as an AI-ready dataset, or if you need an S3-like experience, consider ingesting it to [Hugging Face Storage Buckets](./storage-buckets).

## Scheduled ingestion

There are some limitations when updating the same file on the Hub thousands of times. For instance, you might want to ingest the generations of a running LLM inference server, live agent traces, or the logs of a running model training. In such cases, uploading the data as a dataset on the Hub makes sense, but it can be hard to do properly. The main reason is that you don't want to version every update of your data, because it'll make the git repository unusable.

Three options are available:

* **Use a Storage Bucket instead of a Dataset repository:** [Storage Buckets](/docs/hub/storage-buckets) offer an S3-like experience that allows updating files very frequently, since they are not based on git. Storage Buckets are especially useful for data that are not ready to be published as a dataset, e.g. data that are still evolving or that need more curation.
* **Use a `CommitScheduler`:** The `CommitScheduler` in `huggingface_hub` offers near real-time ingestion and keeps the git history of a Dataset repository manageable. It can be configured to do git commits at intervals defined in minutes.
* **Use Hugging Face Jobs to schedule ingestion scripts:** Hugging Face Jobs provides a way to run and schedule Python scripts on Hugging Face infrastructure. Schedule ingestion scripts to run at intervals defined using the Cron syntax.
### High frequency using Storage Buckets

Contrary to Dataset repositories, which are based on git, you can update files in Storage Buckets at a very high rate, offering quasi real-time ingestion.

Use `batch_bucket_files()` in `huggingface_hub` to update files in a bucket:

```python
import os

from huggingface_hub import batch_bucket_files

def update_bucket(local_files):
    destinations = [os.path.basename(local_file) for local_file in local_files]
    batch_bucket_files(
        bucket_id="username/bucket_name",
        add=[(local_file, dst) for local_file, dst in zip(local_files, destinations)],
    )
```

Alternatively, you can append to files in a Bucket and `flush()` on every new item:

```python
import json

from huggingface_hub import hffs

with hffs.open("buckets/username/bucket_name/texts.jsonl", "a") as f:
    for text in live_texts_stream:
        f.write(json.dumps({"text": text}) + "\n")
        f.flush()
```

The `HfFileSystem` is based on `fsspec`, which has a default blocksize of 5MiB, which means flushing actually uploads the data once a full chunk of 5MiB of new data has been appended. If you want to upload more often, lower `blocksize` in `hffs.open()` (e.g. `hffs.open(..., blocksize=100 * 2 ** 10)` for 100 kiB) or use `f.flush(force=True)`.

Hugging Face storage is based on Xet, which enables efficient I/O when appending to files: uploads are deduplicated and only new data is uploaded. Find more information on doing dynamic data ingestion in buckets in the [buckets documentation on uploads](/docs/hub/storage-buckets#uploading-files) and in the [dataset editing documentation](./datasets-editing#only-upload-the-new-data).

### Near real-time using a `CommitScheduler`

The idea is to run a background job that regularly pushes a local folder to the Hub. You want to save data to the Hub (potentially millions of entries), but you don't need to save each user's input in real time. Instead, you can save the data locally in a JSON file and upload it every 10 minutes.
For example:

```python
import json

from huggingface_hub import CommitScheduler

folder_path = "path/to/files/to/ingest"
every = 10  # ingest every 10min

with CommitScheduler(
    repo_id="username/dataset_name",
    repo_type="dataset",
    folder_path=folder_path,
    every=every,
) as scheduler:
    # Write to the folder to ingest every 10min
    # For example:
    with open(folder_path + "/texts.jsonl", "a") as f:
        f.write(json.dumps({"text": text}) + "\n")
    ...
```

Check out how to ingest dynamic data without having to reupload everything every time in the documentation on [dataset editing](./datasets-editing#only-upload-the-new-data). Find more information on scheduled uploads in the [huggingface_hub documentation](/docs/huggingface_hub/guides/upload#scheduled-uploads).

### Cron-based using Hugging Face Jobs

Schedule Python scripts to ingest data on a schedule. For example, to run a script `ingest.py` every 5 minutes:

```bash
hf jobs scheduled uv run "*/5 * * * *" ingest.py
```

Declare the script dependencies [in the header of the script](https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies) or use `--with`. For example, to run a `dlt` pipeline every day at midnight:

```bash
hf jobs scheduled uv run --with "dlt[hf]" "0 0 * * *" pipeline.py
```

You can check the logs of every run using `hf jobs logs` or directly in the Jobs page on your account on Hugging Face. Find more information about Hugging Face Jobs in the [Jobs documentation](/docs/hub/jobs-overview).

### How to Add a Space to ArXiv

https://huggingface.co/docs/hub/spaces-add-to-arxiv.md

# How to Add a Space to ArXiv

Demos on Hugging Face Spaces allow a wide audience to try out state-of-the-art machine learning research without writing any code. [Hugging Face and ArXiv have collaborated](https://huggingface.co/blog/arxiv) to embed these demos directly alongside papers on ArXiv! Thanks to this integration, users can now find the most popular demos for a paper on its arXiv abstract page.
For example, if you want to try out demos of the LayoutLM document classification model, you can go to [the LayoutLM paper's arXiv page](https://arxiv.org/abs/1912.13318), and navigate to the Demos tab. You will see open-source demos built by the machine learning community for this model, which you can try out immediately in your browser:

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/layout-lm-space-arxiv.gif)

We'll cover two different ways to add your Space to ArXiv and have it show up in the Demos tab.

**Prerequisites**

* There's an existing paper on ArXiv that you'd like to create a demo for
* You have built (or can build) a demo for the model on Spaces

**Method 1 (Recommended): Linking from the Space README**

The simplest way to add a Space to an ArXiv paper is to include the link to the paper in the Space README file (`README.md`). It's good practice to include a full citation as well. You can see an example of a link and a citation on this [Echocardiogram Segmentation Space README](https://huggingface.co/spaces/abidlabs/echocardiogram-arxiv/blob/main/README.md).

And that's it! Your Space should appear in the Demos tab next to the paper on ArXiv in a few minutes 🤗

**Method 2: Linking a Related Model**

An alternative approach can be used to link Spaces to papers by linking an intermediate model to the Space. This requires that the paper is **associated with a model** that is on the Hugging Face Hub (or can be uploaded there).

1. First, upload the model associated with the ArXiv paper onto the Hugging Face Hub if it is not already there. ([Detailed instructions are here](./models-uploading))
2. When writing the model card (README.md) for the model, include a link to the ArXiv paper. It's good practice to include a full citation as well.
You can see an example of a link and a citation on the [LayoutLM model card](https://huggingface.co/microsoft/layoutlm-base-uncased).

*Note*: you can verify this step has been carried out successfully by seeing if an ArXiv button appears above the model card. In the case of LayoutLM, the button says: "arxiv:1912.13318" and links to the LayoutLM paper on ArXiv.

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/arxiv-button.png)

3. Then, create a demo on Spaces that loads this model. Somewhere within the code, the model name must be included in order for Hugging Face to detect that a Space is associated with it. For example, the [docformer_for_document_classification](https://huggingface.co/spaces/iakarshu/docformer_for_document_classification) Space loads the LayoutLM model [like this](https://huggingface.co/spaces/iakarshu/docformer_for_document_classification/blob/main/modeling.py#L484) and includes the string `"microsoft/layoutlm-base-uncased"`:

```py
from transformers import LayoutLMForTokenClassification
layoutlm_dummy = LayoutLMForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased", num_labels=1)
```

*Note*: Here's an [overview on building demos on Hugging Face Spaces](./spaces-overview) and here are more specific instructions for [Gradio](./spaces-sdks-gradio) and [Streamlit](./spaces-sdks-streamlit).

4. As soon as your Space is built, Hugging Face will detect that it is associated with the model. A "Linked Models" button should appear in the top right corner of the Space, as shown here:

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/linked-models.png)

*Note*: You can also add linked models manually by explicitly updating them in the [README metadata for the Space, as described here](https://huggingface.co/docs/hub/spaces-config-reference).
Your Space should appear in the Demos tab next to the paper on ArXiv in a few minutes 🤗

### Dask

https://huggingface.co/docs/hub/datasets-dask.md

# Dask

[Dask](https://www.dask.org/?utm_source=hf-docs) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem. In particular, we can use [Dask DataFrame](https://docs.dask.org/en/stable/dataframe.html?utm_source=hf-docs) to scale up pandas workflows. Dask DataFrame parallelizes pandas to handle large tabular data. It closely mirrors the pandas API, making it simple to transition from testing on a single dataset to processing the full dataset. Dask is particularly effective with Parquet, the default format on Hugging Face Datasets, as it supports rich data types, efficient columnar filtering, and compression.

A good practical use case for Dask is running data processing or model inference on a dataset in a distributed manner. See, for example, [Coiled's](https://www.coiled.io/?utm_source=hf-docs) excellent blog post on [Scaling AI-Based Data Processing with Hugging Face + Dask](https://huggingface.co/blog/dask-scaling).

## Read and Write

Since Dask uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub.

First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:

```
hf auth login
```

Then you can [Create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:

```python
from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
```

Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in Dask.
Dask DataFrame supports distributed writing to Parquet on Hugging Face, which uses commits to track dataset changes:

```python
import dask.dataframe as dd

df.to_parquet("hf://datasets/username/my_dataset")

# or write in separate directories if the dataset has train/validation/test splits
df_train.to_parquet("hf://datasets/username/my_dataset/train")
df_valid.to_parquet("hf://datasets/username/my_dataset/validation")
df_test.to_parquet("hf://datasets/username/my_dataset/test")
```

Since this creates one commit per file, it is recommended to squash the history after the upload:

```python
from huggingface_hub import HfApi

HfApi().super_squash_history(repo_id="username/my_dataset", repo_type="dataset")
```

This creates a dataset repository `username/my_dataset` containing your Dask dataset in Parquet format. You can reload it later:

```python
import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/username/my_dataset")

# or read from separate directories if the dataset has train/validation/test splits
df_train = dd.read_parquet("hf://datasets/username/my_dataset/train")
df_valid = dd.read_parquet("hf://datasets/username/my_dataset/validation")
df_test = dd.read_parquet("hf://datasets/username/my_dataset/test")
```

For more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).
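As a rough sketch of how these `hf://` paths are structured (an illustration only, not the actual `HfFileSystem` implementation, which also handles details like revisions), a path decomposes into a repo type, a repo id, and a path inside the repo:

```python
def split_hf_path(path):
    # hf://<repo_type>/<namespace>/<repo_name>[/<path_in_repo>]
    # Simplified decomposition, for illustration only.
    prefix = "hf://"
    if not path.startswith(prefix):
        raise ValueError(f"not a Hugging Face path: {path}")
    parts = path[len(prefix):].split("/")
    repo_type, namespace, repo_name = parts[0], parts[1], parts[2]
    path_in_repo = "/".join(parts[3:])
    return repo_type, f"{namespace}/{repo_name}", path_in_repo

print(split_hf_path("hf://datasets/username/my_dataset/train"))
# ('datasets', 'username/my_dataset', 'train')
```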
## Process data

To process a dataset in parallel using Dask, you can first define your data processing function for a pandas DataFrame or Series, and then use the Dask `map_partitions` function to apply this function to all the partitions of a dataset in parallel:

```python
import pandas as pd

def dummy_count_words(texts):
    return pd.Series([len(text.split(" ")) for text in texts])
```

or a similar function using pandas string methods (faster):

```python
def dummy_count_words(texts):
    return texts.str.count(" ") + 1
```

In pandas you can use this function on a text column:

```python
# pandas API
df["num_words"] = dummy_count_words(df.text)
```

And in Dask you can run this function on every partition:

```python
# Dask API: run the function on every partition
df["num_words"] = df.text.map_partitions(dummy_count_words, meta=int)
```

Note that you also need to provide `meta`, which is the type of the pandas Series or DataFrame output by your function. This is needed because Dask DataFrame uses a lazy API. Since Dask only runs the data processing once `.compute()` is called, it needs the `meta` argument to know the type of the new column in the meantime.

## Predicate and Projection Pushdown

When reading Parquet data from Hugging Face, Dask automatically leverages the metadata in Parquet files to skip entire files or row groups if they are not needed. For example, if you apply a filter (predicate) on a Hugging Face dataset in Parquet format, or if you select a subset of the columns (projection), Dask will read the metadata of the Parquet files to discard the parts that are not needed without downloading them.

This is possible thanks to a [reimplementation of the Dask DataFrame API](https://docs.coiled.io/blog/dask-dataframe-is-fast.html?utm_source=hf-docs) to support query optimization, which makes Dask faster and more robust. For example, this subset of FineWeb-Edu contains many Parquet files.
If you filter the dataset to keep only the text from recent CC dumps, Dask will skip most of the files and only download the data that match the filter:

```python
import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT/*.parquet")

# Dask will skip the files or row groups that don't
# match the query without downloading them.
df = df[df.dump >= "CC-MAIN-2023"]
```

Dask will also read only the columns required for your computation and skip the rest. For example, if you drop a column late in your code, it will not bother to load it early on in the pipeline if it's not needed. This is useful when you want to manipulate a subset of the columns or for analytics:

```python
# Dask will download the 'dump' and 'token_count' columns needed
# for the filtering and computation and skip the other columns.
df.token_count.mean().compute()
```

## Client

Most features in `dask` are optimized for a cluster or a local `Client` to launch the parallel computations:

```python
import dask.dataframe as dd
from distributed import Client

if __name__ == "__main__":  # needed for creating new processes
    client = Client()
    df = dd.read_parquet(...)
    ...
```

For local usage, the `Client` uses a Dask `LocalCluster` with multiprocessing by default. You can manually configure the multiprocessing of `LocalCluster` with:

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=8, threads_per_worker=8)
client = Client(cluster)
```

Note that if you use the default threaded scheduler locally without a `Client`, a DataFrame can become slower after certain operations (more details [here](https://github.com/dask/dask-expr/issues/1181)).

Find more information on setting up a local or cloud cluster in the [Deploying Dask documentation](https://docs.dask.org/en/latest/deploying.html).
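The file-skipping behavior described in the Predicate and Projection Pushdown section above can be pictured with a toy sketch. This is an illustration of the idea only, not Dask's implementation: each Parquet file records min/max statistics per column, so a reader can discard files whose value range cannot match the filter, without downloading them. File names and statistics below are made up.

```python
# Per-file min/max statistics for the 'dump' column (illustrative values).
files = {
    "part-0.parquet": {"dump_min": "CC-MAIN-2019-04", "dump_max": "CC-MAIN-2020-50"},
    "part-1.parquet": {"dump_min": "CC-MAIN-2021-04", "dump_max": "CC-MAIN-2023-06"},
    "part-2.parquet": {"dump_min": "CC-MAIN-2023-14", "dump_max": "CC-MAIN-2024-10"},
}

def files_matching(files, lower_bound):
    # Keep only files whose max statistic can satisfy `dump >= lower_bound`;
    # the others are skipped entirely.
    return [name for name, stats in files.items() if stats["dump_max"] >= lower_bound]

print(files_matching(files, "CC-MAIN-2023"))
# ['part-1.parquet', 'part-2.parquet']
```

Row-group pruning works the same way, one level deeper: each row group inside a file carries its own statistics.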
### Hub API Endpoints

https://huggingface.co/docs/hub/api.md

# Hub API Endpoints

We have open endpoints that you can use to retrieve information from the Hub as well as perform certain actions such as creating model, dataset or Space repos. We offer a wrapper Python client, [`huggingface_hub`](https://github.com/huggingface/huggingface_hub), and a JS client, [`huggingface.js`](https://github.com/huggingface/huggingface.js), that allow easy access to these endpoints. We also provide [webhooks](./webhooks) to receive real-time incremental info about repos. Enjoy!

> [!NOTE]
> We've moved the Hub API Endpoints documentation to our [OpenAPI Playground](https://huggingface.co/spaces/huggingface/openapi), which provides a comprehensive reference that's always up-to-date. You can also access the OpenAPI specification directly at [https://huggingface.co/.well-known/openapi.json](https://huggingface.co/.well-known/openapi.json), or in a Markdown version if you want to send it to your Agent: [https://huggingface.co/.well-known/openapi.md](https://huggingface.co/.well-known/openapi.md).

> [!NOTE]
> All API calls are subject to the HF-wide [rate limits](./rate-limits). Upgrade your account if you need elevated, large-scale access.

### Argilla on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-argilla.md

# Argilla on Spaces

Argilla is a free and open source tool to build and iterate on data for AI. It can be deployed on the Hub with a few clicks and Hugging Face OAuth enabled. This enables other HF users to join your Argilla server to annotate datasets, perfect for running community annotation initiatives!

With Argilla you can:

- Configure datasets for collecting human feedback with a growing number of question types (Label, NER, Ranking, Rating, free text, etc.)
- Use model outputs/predictions to evaluate them or to speed up the annotation process.
- UI users can explore, find, and label the most interesting/critical subsets using Argilla's search and semantic similarity features.
- Pull and push datasets from the Hugging Face Hub for versioning and model training.

The best place to get started with Argilla on Spaces is [this guide](http://docs.argilla.io/latest/getting_started/quickstart/).

### Using 🤗 Datasets

https://huggingface.co/docs/hub/datasets-usage.md

# Using 🤗 Datasets

Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using 🤗 Datasets. You can click on the [**Use this dataset** button](https://huggingface.co/datasets/nyu-mll/glue?library=datasets) to copy the code to load a dataset.

First you need to [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:

```
hf auth login
```

And then you can load a dataset from the Hugging Face Hub using:

```python
from datasets import load_dataset

dataset = load_dataset("username/my_dataset")

# or load the separate splits if the dataset has train/validation/test splits
train_dataset = load_dataset("username/my_dataset", split="train")
valid_dataset = load_dataset("username/my_dataset", split="validation")
test_dataset = load_dataset("username/my_dataset", split="test")
```

You can also upload datasets to the Hugging Face Hub:

```python
my_new_dataset.push_to_hub("username/my_new_dataset")
```

This creates a dataset repository `username/my_new_dataset` containing your Dataset in Parquet format that you can reload later.

For more information about using 🤗 Datasets, check out the [tutorials](/docs/datasets/tutorial) and [how-to guides](/docs/datasets/how_to) available in the 🤗 Datasets documentation.

### Model Cards

https://huggingface.co/docs/hub/model-cards.md

# Model Cards

## What are Model Cards?

Model cards are files that accompany the models and provide handy information. Under the hood, model cards are simple Markdown files with additional metadata.
Model cards are essential for discoverability, reproducibility, and sharing! You can find a model card as the `README.md` file in any model repo.

The model card should describe:

- the model
- its intended uses & potential limitations, including biases and ethical considerations as detailed in [Mitchell, 2018](https://arxiv.org/abs/1810.03993)
- the training params and experimental info (you can embed or link to an experiment tracking platform for reference)
- which datasets were used to train your model
- the model's evaluation results

The model card template is available [here](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md). How to fill out each section of the model card is described in [the Annotated Model Card](https://huggingface.co/docs/hub/model-card-annotated).

Model Cards on the Hub have two key parts, with overlapping information:

- [Metadata](#model-card-metadata)
- [Text descriptions](#model-card-text)

## Model card metadata

A model repo will render its `README.md` as a model card. The model card is a [Markdown](https://en.wikipedia.org/wiki/Markdown) file, with a [YAML](https://en.wikipedia.org/wiki/YAML) section at the top that contains metadata about the model.

The metadata you add to the model card supports discovery and easier use of your model. For example:

* Allowing users to filter models at https://huggingface.co/models.
* Displaying the model's license.
* Adding datasets to the metadata will add a message reading `Datasets used to train:` to your model page and link the relevant datasets, if they're available on the Hub.

Dataset and language identifiers are those listed on the [Datasets](https://huggingface.co/datasets) and [Languages](https://huggingface.co/languages) pages.
### Adding metadata to your model card

There are a few different ways to add metadata to your model card, including:

- Using the metadata UI
- Directly editing the YAML section of the `README.md` file
- Via the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub) Python library; see the [docs](https://huggingface.co/docs/huggingface_hub/guides/model-cards#update-metadata) for more details.

Many libraries with [Hub integration](./models-libraries) will automatically add metadata to the model card when you upload a model.

#### Using the metadata UI

You can add metadata to your model card using the metadata UI. To access the metadata UI, go to the model page and click on the `Edit model card` button in the top right corner of the model card. This will open an editor showing the model card `README.md` file, as well as a UI for editing the metadata.

This UI will allow you to add key metadata to your model card, and many of the fields will autocomplete based on the information you provide. Using the UI is the easiest way to add metadata to your model card, but it doesn't support all of the metadata fields. If you want to add metadata that isn't supported by the UI, you can edit the YAML section of the `README.md` file directly.

#### Editing the YAML section of the `README.md` file

You can also directly edit the YAML section of the `README.md` file. If the model card doesn't already have a YAML section, you can add one by adding three `---` at the top of the file, then include all of the relevant metadata, and close the section with another group of `---` like the example below:

```yaml
---
language:
- "List of ISO 639-1 code for your language"
- lang1
- lang2
thumbnail: "url to a thumbnail used in social sharing"
tags:
- tag1
- tag2
license: "any valid license identifier"
datasets:
- dataset1
- dataset2
base_model: "base model Hub identifier"
---
```

You can find the detailed model card metadata specification here.
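To make the delimiting rule concrete, here is a minimal sketch (not part of any Hugging Face library) of how the metadata block can be separated from the rest of a `README.md`: the YAML section is everything between the first two `---` lines. Parsing the YAML itself would normally be done with a YAML library.

```python
def split_front_matter(readme_text: str):
    # Return (yaml_section, body). If there is no front matter,
    # yaml_section is None and the whole text is the body.
    lines = readme_text.splitlines()
    if lines and lines[0].strip() == "---":
        try:
            end = lines[1:].index("---") + 1  # line index of the closing ---
        except ValueError:
            return None, readme_text          # unterminated: treat as no metadata
        return "\n".join(lines[1:end]), "\n".join(lines[end + 1:])
    return None, readme_text

readme = "---\nlicense: mit\ntags:\n- tag1\n---\n# My model"
meta, body = split_front_matter(readme)
print(meta)  # license: mit / tags: / - tag1
print(body)  # # My model
```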
### Specifying a library

You can specify the supported libraries in the model card metadata section. Find more about our supported libraries [here](./models-libraries). The library will be specified in the following order of priority:

1. Specifying `library_name` in the model card (recommended if your model is not a `transformers` model). This information can be added via the metadata UI or directly in the model card YAML section:

```yaml
library_name: flair
```

2. Having a tag with the name of a library that is supported:

```yaml
tags:
- flair
```

If it's not specified, the Hub will try to automatically detect the library type. However, this approach is discouraged, and repo creators should use the explicit `library_name` as much as possible.

1. By looking into the presence of files such as `*.nemo` or `*.mlmodel`, the Hub can determine if a model is from NeMo or CoreML.
2. In the past, if nothing was detected and there was a `config.json` file, it was assumed the library was `transformers`. For model repos created after August 2024, this is not the case anymore, so you need to set `library_name: transformers` explicitly.

### Specifying a base model

If your model is a fine-tune, an adapter, or a quantized version of a base model, you can specify the base model in the model card metadata section. This information can also be used to indicate if your model is a merge of multiple existing models. Hence, the `base_model` field can either be a single model ID, or a list of one or more base models (specified by their Hub identifiers).

```yaml
base_model: HuggingFaceH4/zephyr-7b-beta
```

This metadata will be used to display the base model on the model page.
Users can also use this information to filter models by base model or find models that are derived from a specific base model, whether it is a fine-tuned model, an adapter (LoRA, PEFT, etc.), a quantized version of another model, or a merge of two or more models. In the merge case, you specify a list of two or more base models:

```yaml
base_model:
- Endevor/InfinityRP-v1-7B
- l3utterfly/mistral-7b-v0.1-layla-v4
```

The Hub will infer the type of relationship from the current model to the base model (`adapter`, `merge`, `quantized`, or `finetune`), but you can also set it explicitly if needed: `base_model_relation: quantized`, for instance.

### Specifying a new version

If a new version of your model is available on the Hub, you can specify it in a `new_version` field. For example, on `l3utterfly/mistral-7b-v0.1-layla-v3`:

```yaml
new_version: l3utterfly/mistral-7b-v0.1-layla-v4
```

This metadata will be used to display a link to the latest version of a model on the model page. If the model linked in `new_version` also has a `new_version` field, the very latest version will always be displayed.

### Specifying a dataset

You can specify the datasets used to train your model in the model card metadata section. The datasets will be displayed on the model page and users will be able to filter models by dataset. You should use the Hub dataset identifier, which is the same as the dataset's repo name:

```yaml
datasets:
- stanfordnlp/imdb
- HuggingFaceFW/fineweb
```

### Specifying a bucket

You can specify the [storage buckets](./storage-buckets) linked to your model in the model card metadata section. The buckets will be shown as tags on the model page and the linked bucket pages will show the model in return. You should use the Hub bucket identifier, which is the same as the bucket's repo name:

```yaml
buckets:
- my-org/my-bucket
- my-org/another-bucket
```

### Specifying a task (`pipeline_tag`)

You can specify the `pipeline_tag` in the model card metadata.
The `pipeline_tag` indicates the type of task the model is intended for. This tag will be displayed on the model page and users can filter models on the Hub by task. This tag is also used to determine which [widget](./models-widgets#enabling-a-widget) to use for the model and which APIs to use under the hood.

For `transformers` models, the pipeline tag is automatically inferred from the model's `config.json` file, but you can override it in the model card metadata if required. Editing this field in the metadata UI will ensure that the pipeline tag is valid. Some other libraries with Hub integration will also automatically add the pipeline tag to the model card metadata.

### Specifying a license

You can specify the license in the model card metadata section. The license will be displayed on the model page and users will be able to filter models by license. Using the metadata UI, you will see a dropdown of the most common licenses.

If required, you can also specify a custom license by adding `other` as the license value and specifying the name and a link to the license in the metadata:

```yaml
# Example from https://huggingface.co/coqui/XTTS-v1
---
license: other
license_name: coqui-public-model-license
license_link: https://coqui.ai/cpml
---
```

If the license is not available via a URL, you can link to a LICENSE file stored in the model repo.

### Evaluation Results

You can specify your **model's evaluation results** in a structured way in the model card metadata. Results are parsed by the Hub and displayed in a widget on the model page, as for example on the [bigcode/starcoder](https://huggingface.co/bigcode/starcoder) model.

The initial metadata spec was based on Papers with Code's [model-index specification](https://github.com/paperswithcode/model-index). This allowed us to directly index the results into Papers with Code's leaderboards when appropriate. You can also link the source from which the eval results have been computed.
> [!TIP]
> NEW: We have a new, simpler metadata format for eval results. Check it out in [the dedicated doc page](./eval-results).

Here is a partial example of a model-index describing [01-ai/Yi-34B](https://huggingface.co/01-ai/Yi-34B)'s score on the ARC benchmark. The result comes from the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), which is defined as the `source`:

```yaml
---
model-index:
- name: Yi-34B
  results:
  - task:
      type: text-generation
    dataset:
      name: ai2_arc
      type: ai2_arc
    metrics:
    - name: AI2 Reasoning Challenge (25-Shot)
      type: AI2 Reasoning Challenge (25-Shot)
      value: 64.59
    source:
      name: Open LLM Leaderboard
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
---
```

For more details on how to format this data, check out the [Model Card specifications](https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1).

### CO2 Emissions

The model card is also a great place to show information about the CO2 impact of your model. Visit our [guide on tracking and reporting CO2 emissions](./model-cards-co2) to learn more.

### Linking a Paper

If the model card includes a link to a Paper page (either on HF or an arXiv abstract/PDF), the Hugging Face Hub will extract the arXiv ID and include it in the model tags with the format `arxiv:<PAPER ID>`. Clicking on the tag will let you:

* Visit the Paper page
* Filter for other models on the Hub that cite the same paper

Read more about Paper pages [here](./paper-pages).

## Model Card text

Details on how to fill out the human-readable portion of the model card (so that it may be printed out, cut+pasted, etc.) are available in the [Annotated Model Card](./model-card-annotated).

## FAQ

### How are model tags determined?

Each model page lists all the model's tags in the page header, below the model name.
These are primarily computed from the model card metadata, although some are added automatically, as described in [Enabling a Widget](./models-widgets#enabling-a-widget).

### Can I add custom tags to my model?

Yes, you can add custom tags to your model by adding them to the `tags` field in the model card metadata. The metadata UI will suggest some popular tags, but you can add any tag you want. For example, you could indicate that your model is focused on finance by adding a `finance` tag.

### How can I indicate that my model is not suitable for all audiences?

You can add a `not-for-all-audiences` tag to your model card metadata. When this tag is present, a message will be displayed on the model page indicating that the model is not for all audiences. Users can click through this message to view the model card.

### How can I display different images for dark and light mode?

You can display different versions of an image optimized for each theme. This is particularly useful for logos, diagrams, or screenshots that need different color schemes to maintain visibility and aesthetics across light and dark modes. To use this feature, you'll need to provide both versions of your image.
**For images uploaded via the markdown editor**

When you upload an image directly from the markdown editor (using drag-and-drop), append the URI fragment `#hf-light-mode-only` or `#hf-dark-mode-only` to the end of the image URL to specify which theme it should display in:

```markdown
Image only displays when viewing in light mode
![Logo](https://cdn-uploads.huggingface.co/production/uploads/logo-light.png#hf-light-mode-only)

Image only displays when viewing in dark mode
![Logo](https://cdn-uploads.huggingface.co/production/uploads/logo-dark.png#hf-dark-mode-only)
```

**For already hosted images**

If you want to reference images that are already hosted without re-uploading them, use HTML `<img>` tags with the following Tailwind CSS classes to specify which theme they should display in:

```html
<!-- Image only displays when viewing in dark mode -->
<img src="https://cdn-uploads.huggingface.co/production/uploads/logo-dark.png" class="hidden dark:block" alt="Logo">

<!-- Image only displays when viewing in light mode -->
<img src="https://cdn-uploads.huggingface.co/production/uploads/logo-light.png" class="block dark:hidden" alt="Logo">
```

### Can I write LaTeX in my model card?

Yes! The Hub uses the [KaTeX](https://katex.org/) math typesetting library to render math formulas server-side before parsing the Markdown. You have to use the following delimiters:

- `$$ ... $$` for display mode
- `\\(...\\)` for inline mode (no space between the slashes and the parenthesis)

Then you'll be able to write:

$$ \LaTeX $$

$$ \mathrm{MSE} = \left(\frac{1}{n}\right)\sum_{i=1}^{n}(y_{i} - x_{i})^{2} $$

$$ E=mc^2 $$

### Evidence on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-evidence.md

# Evidence on Spaces

**Evidence** is an open-source framework designed for building data-driven applications, reports, and dashboards using SQL and Markdown. With Evidence, you can quickly create decision-support tools, reports, and interactive dashboards without relying on traditional drag-and-drop business intelligence (BI) platforms.

Evidence enables you to:

- Write reports and dashboards directly in Markdown with SQL-backed components.
- Integrate data from multiple sources, including SQL databases and APIs.
- Use templated pages to automatically generate multiple pages based on a single template.
- Deploy reports seamlessly to various hosting solutions.

Visit [Evidence's documentation](https://docs.evidence.dev/) for guides, examples, and best practices for using Evidence to create data products.

## Deploy Evidence on Spaces

You can deploy Evidence on Hugging Face Spaces with just a few clicks. Once created, the Space will display a `Building` status. Refresh the page if the status doesn't automatically update to `Running`. Your Evidence app will automatically be deployed on Hugging Face Spaces.

## Editing your Evidence app from the CLI

To edit your app, clone the Space and edit the files locally:

```bash
git clone https://huggingface.co/spaces/your-username/your-space-name
cd your-space-name
npm install
npm run sources
npm run dev
```

You can then modify `pages/index.md` to change the content of your app.

## Editing your Evidence app from VS Code

The easiest way to develop with Evidence is using the [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=Evidence.evidence-vscode):

1. Install the extension from the VS Code Marketplace
2. Open the Command Palette (Ctrl/Cmd + Shift + P) and enter `Evidence: Copy Existing Project`
3. Paste the URL of the Hugging Face Spaces Evidence app you'd like to copy (e.g. `https://huggingface.co/spaces/your-username/your-space-name`) and press Enter
4. Select the folder you'd like to clone the project to and press Enter
5. Press `Start Evidence` in the bottom status bar

Check out the docs for [alternative install methods](https://docs.evidence.dev/getting-started/install-evidence), GitHub Codespaces, and installing alongside dbt.
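For context on what you would edit in `pages/index.md`, an Evidence page mixes Markdown prose, named SQL blocks, and chart components. A minimal hypothetical page might look like the following (the `orders` table and the chart columns are placeholders, assuming a data source is already configured):

````markdown
# Sales report

```sql orders_by_category
select category, count(*) as n
from orders
group by category
```

<BarChart data={orders_by_category} x=category y=n/>
````

The SQL block's name (`orders_by_category`) becomes the variable the component reads its data from.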
## Learning More

- [Docs](https://docs.evidence.dev/)
- [GitHub](https://github.com/evidence-dev/evidence)
- [Slack Community](https://slack.evidence.dev/)
- [Evidence Home Page](https://www.evidence.dev)

### Datasets Download Stats

https://huggingface.co/docs/hub/datasets-download-stats.md

# Datasets Download Stats

## How are downloads counted for datasets?

Counting the number of downloads for datasets is not a trivial task, as a single dataset repository might contain multiple files, from multiple subsets and splits (e.g. train/validation/test), and sometimes with many files in a single split. To solve this issue and avoid counting one person's download multiple times, we treat all files downloaded by a user (based on their IP address) within a 5-minute window in a given repository as a single dataset download. This counting happens automatically on our servers when files are downloaded (through GET or HEAD requests), with no need to collect any user information or make additional calls.

## Before September 2024

The Hub used to provide download stats only for the datasets loadable via the `datasets` library. To determine the number of downloads, the Hub previously counted every time `load_dataset` was called in Python, excluding Hugging Face's CI tooling on GitHub. No information was sent from the user, and no additional calls were made for this. The count was done server-side as we served files for downloads.

This meant that:

* The download count was the same regardless of whether the data is directly stored on the Hub repo or if the repository has a [script](/docs/datasets/dataset_script) to load the data from an external source.
* If a user manually downloaded the data using tools like `wget` or the Hub's user interface (UI), those downloads were not included in the download count.
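The current counting rule described at the top of this section can be sketched as follows. This is a toy model of the idea, not Hugging Face's actual implementation: all file requests from the same IP to the same repo within a 5-minute window collapse into one download.

```python
WINDOW = 5 * 60  # 5 minutes, in seconds

def count_downloads(events):
    # events: list of (timestamp_seconds, ip, repo), sorted by timestamp.
    last_counted = {}  # (ip, repo) -> timestamp of the last counted download
    downloads = 0
    for ts, ip, repo in events:
        key = (ip, repo)
        if key not in last_counted or ts - last_counted[key] >= WINDOW:
            downloads += 1
            last_counted[key] = ts  # open a new 5-minute window
    return downloads

events = [
    (0, "1.2.3.4", "username/my_dataset"),    # counted
    (10, "5.6.7.8", "username/my_dataset"),   # different user: counted
    (60, "1.2.3.4", "username/my_dataset"),   # same window as t=0: not counted
    (400, "1.2.3.4", "username/my_dataset"),  # new window: counted
]
print(count_downloads(events))  # 3
```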
### User Provisioning (SCIM)

https://huggingface.co/docs/hub/enterprise-scim.md

# User Provisioning (SCIM)

> [!WARNING]
> This feature is part of the Enterprise and Enterprise Plus plans.

SCIM (System for Cross-domain Identity Management) is a standard for automating user provisioning. It allows you to connect your Identity Provider (IdP) to Hugging Face to manage your organization's members. SCIM works differently depending on your SSO model. For a detailed comparison, see the [SSO overview](./enterprise-sso#user-provisioning-scim).

## Basic SSO: invitation-based provisioning

With [Basic SSO](./security-sso-basic) (Enterprise plan), SCIM automates the **invitation** of existing Hugging Face users to your organization.

- Users **must already have a Hugging Face account** before they can be provisioned via SCIM
- When your IdP provisions a user, Hugging Face sends them an **invitation email** to join the organization
- The user must **accept the invitation** to become a member; provisioning does not grant immediate access
- SCIM **cannot modify** user profile information (name, email, username); the user retains full control of their Hugging Face account
- When a user is deprovisioned in your IdP, their invitation is deactivated and their access to the organization is revoked

## Managed SSO: full lifecycle provisioning

With [Managed SSO](./enterprise-advanced-sso) (Enterprise Plus plan), SCIM manages the **entire user lifecycle** on Hugging Face.
- SCIM **creates a new Hugging Face account** when a user is provisioned; no pre-existing account is needed
- The user is **immediately added** to the organization as a member, with no invitation step
- SCIM **can update** user profile information (name, email, username) as changes occur in your IdP
- When a user is deprovisioned in your IdP, their Hugging Face account is deactivated and their access is revoked

## How to enable SCIM

To enable SCIM, go to your organization's settings, navigate to the **SSO** tab, and then select the **SCIM** sub-tab. You will find the **SCIM Tenant URL** and a button to generate a **SCIM token**. You will need both of these to configure your IdP. The SCIM token is a secret and should be stored securely in your IdP's configuration.

Once SCIM is enabled in your IdP, provisioned users will appear in the **Users Management** tab and provisioned groups will appear in the **SCIM** tab in your organization's settings.

## Group provisioning

In addition to user provisioning, SCIM supports **group provisioning**. Groups pushed from your IdP are stored as SCIM groups on Hugging Face and can be linked to [Resource Groups](./enterprise-resource-groups) from the **SCIM** tab in your organization's settings.

### Linking a SCIM group to a Resource Group

To link a SCIM group, go to your organization's **SSO → SCIM** tab. Provisioned groups are listed in a table. In the **Resource Groups** column, each group shows either a **Link resource groups** button (if no links exist yet) or the number of currently linked resource groups (e.g. "2 resource groups"). Clicking either opens a modal where you can add one or more Resource Groups, each with its own role assignment. You can also change or remove existing links from the same modal.

Before linking, make sure the following conditions are met:

- The Resource Group must have **no existing members**. Linking to a non-empty Resource Group is not allowed.
- The Resource Group must **not have auto-join enabled**. Auto-join (which automatically adds every new org member to the RG) is mutually exclusive with SCIM management. Disable auto-join on the RG before linking.

A SCIM group can be linked to multiple Resource Groups, each with its own role.

### What happens after linking

Once a SCIM group is linked to a Resource Group:

- **Backfill**: Any members already in the SCIM group are immediately added to the Resource Group at the configured role.
- **Ongoing sync**: Membership changes in your IdP are automatically reflected:
  - When a user is **added** to the group in your IdP, they are added to all linked Resource Groups.
  - When a user is **removed** from the group in your IdP, they are removed from all linked Resource Groups, except those the user is linked to through other SCIM groups. For those, the user's role will be updated to the "highest" role granted by the other SCIM groups.
  - When a SCIM group is **deleted** in your IdP, all its members are removed from the linked Resource Groups, except for users who belong to those Resource Groups through other SCIM groups. For each of those Resource Groups, users' roles are updated to the "highest" role granted by the other SCIM groups.
- **Role changes**: If you update the role on a link, all current group members' roles in that Resource Group are updated immediately.

### SCIM-managed Resource Groups

A Resource Group linked to a SCIM group is considered **SCIM-managed**. The IdP is the sole source of truth for its membership. As a result:

- Manual membership changes via the Hub UI or API are **blocked**; any attempt to add, remove, or change a member's role on a SCIM-managed Resource Group will return a `403` error.
- Auto-join **cannot be enabled** on a SCIM-managed Resource Group. To re-enable auto-join, first remove the SCIM link.

Group provisioning works the same way for both Basic SSO and Managed SSO.
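The "highest role wins" rule used during sync can be sketched as follows. This is a toy illustration only; the role names and their ordering below are assumptions for the example, not a documented hierarchy.

```python
# Assumed ordering from lowest to highest privilege (illustrative).
ROLE_RANK = {"read": 0, "contributor": 1, "write": 2, "admin": 3}

def effective_role(roles_from_groups):
    # roles_from_groups: the roles a user is granted on one Resource Group,
    # one per SCIM group link that still includes the user.
    return max(roles_from_groups, key=ROLE_RANK.__getitem__)

# A user kept in the Resource Group by two remaining SCIM groups:
print(effective_role(["read", "write"]))  # write
```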
## Supported user attributes

The Hugging Face SCIM endpoint supports the following user attributes:

| Attribute | Description | Basic SSO | Managed SSO |
| --- | --- | --- | --- |
| `userName` | Hugging Face username | Read-only | Read/Write |
| `name.givenName` | First name | Read-only | Read/Write |
| `name.familyName` | Last name | Read-only | Read/Write |
| `emails[type eq "work"].value` | Email address | Read-only | Read/Write |
| `externalId` | IdP-assigned identifier | Read/Write | Read/Write |
| `active` | Whether the user is an active member | Read/Write | Read/Write |

With Basic SSO, only `active` and `externalId` can be modified via SCIM; all other attributes are controlled by the user on their Hugging Face account. For group provisioning, the supported attributes are `displayName`, `members`, and `externalId`.

## Deprovisioning

Deprovisioning behavior depends on how the user is removed and which SSO model you use.

**Setting `active` to `false`** (soft deprovision):

- The user loses access to the organization
- With Basic SSO: the invitation is deactivated
- With Managed SSO: the user is removed from the organization but their account and content are preserved; this is **reversible** by setting `active` back to `true`

**Deleting the user via SCIM** (hard deprovision):

- With Basic SSO: the user is removed from the organization and all its resource groups. Their Hugging Face account and personal content are **not affected**; they simply lose membership in your organization.
- With Managed SSO: the user's Hugging Face account is **permanently deleted**, along with all content they created. This action is **irreversible**.

## Supported Identity Providers

We support SCIM with any IdP that implements the SCIM 2.0 protocol.
We have specific guides for some of the most popular providers:

- [How to configure SCIM with Microsoft Entra ID](./security-sso-entra-id-scim)
- [How to configure SCIM with Okta](./security-sso-okta-scim)

### Using PaddleNLP at Hugging Face

https://huggingface.co/docs/hub/paddlenlp.md

# Using PaddleNLP at Hugging Face

Leveraging the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) framework, [`PaddleNLP`](https://github.com/PaddlePaddle/PaddleNLP) is an easy-to-use and powerful NLP library with an awesome pre-trained model zoo, supporting a wide range of NLP tasks from research to industrial applications.

## Exploring PaddleNLP in the Hub

You can find `PaddleNLP` models by filtering at the left of the [models page](https://huggingface.co/models?library=paddlenlp&sort=downloads). All models on the Hub come with the following features:

1. An automatically generated model card with a brief description and metadata tags that help with discoverability.
2. An interactive widget you can use to experiment with the model directly in the browser.
3. An Inference Providers widget that lets you make inference requests.
4. Easy deployment of your model as a Gradio app on Spaces.

## Installation

To get started, you can follow the [PaddlePaddle Quick Start](https://www.paddlepaddle.org.cn/en/install) to install the PaddlePaddle framework for your favorite OS, package manager, and compute platform. `paddlenlp` offers a quick one-line install through pip:

```
pip install -U paddlenlp
```

## Using existing models

Similar to `transformers` models, the `paddlenlp` library provides a simple one-liner to load models from the Hugging Face Hub by setting `from_hf_hub=True`! Depending on how you want to use them, you can use the high-level `Taskflow` API, or `AutoModel` and `AutoTokenizer` for more control.
```py
# Taskflow provides a simple end-to-end capability and a more optimized experience for inference
from paddlenlp import Taskflow

taskflow = Taskflow("fill-mask", task_path="PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True)

# If you want more control, you will need to define the tokenizer and model.
from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True)
model = AutoModelForMaskedLM.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True)
```

If you want to see how to load a specific model, you can click `Use in paddlenlp` and you will be given a working snippet to load it!

## Sharing your models

You can share your `PaddleNLP` models by using the `save_to_hf_hub` method, available on all `Model` and `Tokenizer` classes.

```py
from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True)
model = AutoModelForMaskedLM.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True)

tokenizer.save_to_hf_hub(repo_id="/")
model.save_to_hf_hub(repo_id="/")
```

## Additional resources

- PaddlePaddle Installation [guide](https://www.paddlepaddle.org.cn/en/install).
- PaddleNLP [GitHub Repo](https://github.com/PaddlePaddle/PaddleNLP).
- [PaddlePaddle on the Hugging Face Hub](https://huggingface.co/PaddlePaddle)

### How to configure SCIM with Okta

https://huggingface.co/docs/hub/security-sso-okta-scim.md

# How to configure SCIM with Okta

This guide explains how to set up SCIM user and group provisioning between Okta and your Hugging Face organization.

> [!WARNING]
> This feature is part of the Enterprise and Enterprise Plus plans.

## Step 1: Get SCIM configuration from Hugging Face

1. Navigate to your organization's settings page on Hugging Face.
2. Go to the **SSO** tab, then click on the **SCIM** sub-tab.
3.
Copy the **SCIM Tenant URL**. You will need this for the Okta configuration.
4. Click **Generate an access token**. A new SCIM token will be generated. Copy this token immediately and store it securely, as you will not be able to see it again.

## Step 2: Enter Admin Credentials

1. In Okta, go to **Applications** and select your Hugging Face app.
2. Go to the **General** tab and click **Edit** on App Settings.
3. For the Provisioning option, select **SCIM**, then click **Save**.
4. Go to the **Provisioning** tab and click **Edit**.
5. Enter the **SCIM Tenant URL** as the SCIM connector base URL.
6. Enter **userName** as the unique identifier field for users.
7. Select all necessary actions for Supported provisioning actions.
8. Select **HTTP Header** for Authentication Mode.
9. Enter the **Access Token** you generated as the Authorization Bearer Token.
10. Click **Test Connector Configuration** to verify the connection.
11. Save your changes.

## Step 3: Configure Provisioning

1. In the **Provisioning** tab, click **To App** in the side nav.
2. Click **Edit** and check to enable the features you need, e.g. Create, Update, and Delete Users.
3. Click **Save** at the bottom.

## Step 4: Configure Attribute Mappings

1. While still in the **Provisioning** tab, scroll down to the Attribute Mappings section.
2. The default attribute mappings often require adjustments for robust provisioning. We recommend using the following configuration. You can delete attributes that are not listed here:

## Step 5: Assign Users or Groups

1. Visit the **Assignments** tab and click **Assign**.
2. Click **Assign to People** or **Assign to Groups**.
3. After finding the user or group that needs to be assigned, click **Assign** next to their name.
4. In the mapping modal, the Username needs to be edited to comply with the following rules.

> [!WARNING]
>
> Only regular characters and `-` are accepted in the Username.
> `--` (double dash) is forbidden.
> `-` cannot start or end the name.
> Digit-only names are not accepted.
> Minimum length is 2 and maximum length is 42.
> Username has to be unique within your org.

5. Scroll down and click **Save and Go Back**.
6. Click **Done**.
7. Confirm that users or groups are created, updated, or deactivated in your Hugging Face organization as expected.

## Step 6: Push Okta Groups to Hugging Face via SCIM

Before you can link groups to Hugging Face Resource Groups, you need to push your Okta groups to Hugging Face using the **Push Groups** tab. This is separate from assigning users to the app in Step 5.

> [!WARNING]
> Okta does not support using the same group for app assignment (Step 5) and Group Push. Use a dedicated group for pushing, and keep your push groups separate from your assignment groups.

1. In the Okta Admin Console, go to **Applications** and select your Hugging Face app.
2. Click the **Push Groups** tab.
3. Click **+ Push Groups** and select **Find groups by name**.
4. Search for the Okta group you want to push and select it from the results.
5. Choose how to handle the group in Hugging Face:
   - **Create Group**: Creates a new SCIM group in your Hugging Face organization.
   - **Link Group**: Links to an existing group already in your Hugging Face organization.
6. Click **Save**. To push additional groups, click **Save & Add Another** and repeat.

Once pushed, the group will appear under **SCIM Groups** in your Hugging Face organization settings (SSO -> SCIM tab). Any membership changes you make to the group in Okta will automatically sync to Hugging Face.

## Step 7: Link SCIM Groups to Hugging Face Resource Groups

Once your groups are provisioned from Okta, you can link them to Hugging Face Resource Groups to manage permissions at scale. This allows all members of a SCIM group to automatically receive specific roles (like read or write) for a collection of resources.

> [!NOTE]
> Before linking, make sure the Resource Group you want to link is **empty** (has no existing members) and does **not** have auto-join enabled.
Both conditions are required; linking will fail otherwise.

1. In your Hugging Face organization settings, navigate to the **SSO** -> **SCIM** tab. You will see a list of your provisioned groups under **SCIM Groups**.
2. Locate the group you wish to configure and click **Link resource groups** in its row.
3. A dialog will appear. Click **Link a Resource Group**.
4. From the dropdown menus, select the **Resource Group** you want to link and the **Role Assignment** you want to grant to the members of the SCIM group.
5. Click **Link to SCIM group** and save the mapping.

Once linked, the Resource Group becomes **SCIM-managed**: any members already in the SCIM group are immediately added to the Resource Group (backfill), and all future membership changes in Okta are automatically reflected. Manual membership edits on the Resource Group via the Hub UI or API will be blocked.

### Distilabel

https://huggingface.co/docs/hub/datasets-distilabel.md

# Distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable, and scalable pipelines based on verified research papers.

Distilabel can be used to generate synthetic data and AI feedback for a wide variety of projects, including traditional predictive NLP (classification, extraction, etc.) and generative or large language model scenarios (instruction following, dialogue generation, judging, etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.

## What do people build with distilabel?

The Argilla community uses distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel).
- The [1M OpenHermesPreference](https://huggingface.co/datasets/argilla/OpenHermesPreferences) is a dataset of ~1 million AI preferences that have been generated using the [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) LLM. It is a great example of how you can use distilabel to scale up dataset development.
- The [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) was used to fine-tune the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B). This dataset was built by combining human curation in Argilla with AI feedback from distilabel, leading to an improved version of the Intel Orca dataset; models fine-tuned on it outperform models fine-tuned on the original dataset.
- The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) is an example of how anyone can create a synthetic dataset for a specific task, which, after curation and evaluation, can be used for fine-tuning custom LLMs.

## Prerequisites

First, [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login):

```bash
hf auth login
```

Make sure you have `distilabel` installed:

```bash
pip install -U "distilabel[vllm]"
```

## Distilabel pipelines

Distilabel pipelines can be built with any number of interconnected steps or tasks. The output of one step or task is fed as input to another. A series of steps can be chained together to build complex data processing and generation pipelines with LLMs. The input of each step is a batch of data, containing a list of dictionaries, where each dictionary represents a row of the dataset and the keys are the column names. To feed data from and to the Hugging Face Hub, we've defined a `Distiset` class as an abstraction of a `datasets.DatasetDict`.

## Distiset as dataset object

A Pipeline in distilabel returns a special type of Hugging Face `datasets.DatasetDict` called `Distiset`.
The Pipeline can output multiple subsets in the Distiset, which is a dictionary-like object with one entry per subset. A Distiset can then be pushed seamlessly to the Hugging Face Hub, with all the subsets in the same repository.

## Load data from the Hub to a Distiset

To showcase an example of loading data from the Hub, we will reproduce the [Prometheus 2 paper](https://arxiv.org/pdf/2405.01535) and use the PrometheusEval task implemented in distilabel. The PrometheusEval task implements Prometheus 2's direct assessment and pairwise ranking tasks, i.e. assessing the quality of a single isolated response for a given instruction (with or without a reference answer), and assessing the quality of one response against another for a given instruction (with or without a reference answer), respectively. We will use these tasks on a dataset loaded from the Hub, created by the Hugging Face H4 team, named [HuggingFaceH4/instruction-dataset](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset).
```python
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromHub
from distilabel.steps.tasks import PrometheusEval

if __name__ == "__main__":
    with Pipeline(name="prometheus") as pipeline:
        load_dataset = LoadDataFromHub(
            name="load_dataset",
            repo_id="HuggingFaceH4/instruction-dataset",
            split="test",
            output_mappings={"prompt": "instruction", "completion": "generation"},
        )
        task = PrometheusEval(
            name="task",
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
            ),
            mode="absolute",
            rubric="factual-validity",
            reference=False,
            num_generations=1,
            group_generations=False,
        )
        keep_columns = KeepColumns(
            name="keep_columns",
            columns=["instruction", "generation", "feedback", "result", "model_name"],
        )
        load_dataset >> task >> keep_columns
```

Then we need to call `pipeline.run` with the runtime parameters so that the pipeline can be launched and data can be stored in the `Distiset` object.
```python
distiset = pipeline.run(
    parameters={
        task.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 1024,
                    "temperature": 0.7,
                },
            },
        },
    },
)
```

## Push a distilabel Distiset to the Hub

Push the `Distiset` to a Hugging Face repository, where each one of the subsets will correspond to a different configuration:

```python
import os

distiset.push_to_hub(
    "my-org/my-dataset",
    commit_message="Initial commit",
    private=False,
    token=os.getenv("HF_TOKEN"),
)
```

## 📚 Resources

- [🚀 Distilabel Docs](https://distilabel.argilla.io/latest/)
- [🚀 Distilabel Docs - distiset](https://distilabel.argilla.io/latest/sections/how_to_guides/advanced/distiset/)
- [🚀 Distilabel Docs - prometheus](https://distilabel.argilla.io/1.2.0/sections/pipeline_samples/papers/prometheus/)
- [🆕 Introducing distilabel](https://argilla.io/blog/introducing-distilabel-1/)

### Gating Group Collections

https://huggingface.co/docs/hub/enterprise-gating-group-collections.md

# Gating Group Collections

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Gating Group Collections allow organizations to grant (or reject) access to all the models and datasets in a collection at once, rather than per repo. Users will only have to go through **a single access request**.

To enable a Gating Group in a collection:

- the collection owner must be an organization
- the organization must be subscribed to a Team or Enterprise plan
- all models and datasets in the collection must be owned by the same organization as the collection
- each model or dataset in the collection may only belong to one Gating Group Collection (but they can still be included in non-gating, i.e. _regular_, collections).

> [!TIP]
> Gating only applies to models and datasets; any other resource that is part of the collection (such as a Space or a Paper) won't be affected.

## Manage gating group as an organization admin

To enable access requests, go to the collection page and click on **Gating group** in the bottom-right corner.
Hugging Face collection page with gating group collection feature disabled

By default, gating group is disabled: click on **Configure Access Requests** to open the settings.

Hugging Face gating group collection settings with gating disabled

By default, access to the repos in the collection is automatically granted to users when they request it. This is referred to as **automatic approval**. In this mode, any user can access your repos once they've agreed to share their contact information with you.

Hugging Face gating group collection settings with automatic mode selected

If you want to manually approve which users can access repos in your collection, you must set it to **Manual Review**. When this is the case, you will notice a new option, **Notifications frequency**, which lets you configure when to get notified about new users requesting access. It can be set to once a day or real-time. By default, emails are sent to the first 5 admins of the organization. You can also set a different email address in the **Notifications email** field.

Hugging Face gating group collection settings with manual review mode selected

### Review access requests

Once access requests are enabled, you have full control of who can access repos in your gating group collection, whether the approval mode is manual or automatic. You can review and manage requests either from the UI or via the API. **Approving a request for a repo in a gating group collection will automatically approve access to all repos (models and datasets) in that collection.**

#### From the UI

You can review who has access to all the repos in your Gating Group Collection from the settings page of any of the repos in the collection, by clicking on the **Review access requests** button:

Hugging Face repo access settings when repo is in a gating group collection

This will open a modal with 3 lists of users:

- **pending**: the list of users waiting for approval to access your repository.
This list is empty unless you've selected **Manual Review**. You can either **Accept** or **Reject** each request. If the request is rejected, the user cannot access your repository and cannot request access again.
- **accepted**: the complete list of users with access to your repository. You can choose to **Reject** access at any time for any user, whether the approval mode is manual or automatic. You can also **Cancel** the approval, which will move the user to the **pending** list.
- **rejected**: the list of users you've manually rejected. Those users cannot access your repositories. If they go to your repository, they will see a message: _Your request to access this repo has been rejected by the repo's authors_.

Manage access requests modal for a repo in a gating group collection

#### Via the API

You can programmatically manage access requests in a Gating Group Collection through the API of any of its models or datasets. Visit our [gated models](https://huggingface.co/docs/hub/models-gated#via-the-api) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#via-the-api) documentation to learn more.

#### Download access report

You can download access reports for the Gating Group Collection through the settings page of any of its models or datasets. Visit our [gated models](https://huggingface.co/docs/hub/models-gated#download-access-report) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#download-access-report) documentation to learn more.

#### Customize requested information

Organizations can customize the gating parameters as well as the user information that is collected per gated repo. Please visit our [gated models](https://huggingface.co/docs/hub/models-gated#customize-requested-information) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#customize-requested-information) documentation for more details.
> [!WARNING]
> There is currently no way to customize the gate parameters and requested information in a centralized way. If you want to collect the same data no matter which of the collection's repositories a user requests access through, you need to add the same gate parameters to the metadata of all the models and datasets in the collection and keep them in sync.

## Access gated repos in a Gating Group Collection as a user

A Gating Group Collection shows a specific icon next to its name:

Hugging Face collection page with gating group collection feature enabled

To get access to the models and datasets in a Gating Group Collection, a single access request on the page of any of those repositories is needed. Once your request is approved, you will be able to access all the other repositories in the collection, including future ones.

Visit our [gated models](https://huggingface.co/docs/hub/models-gated#access-gated-models-as-a-user) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#access-gated-datasets-as-a-user) documentation to learn more about requesting access to a repository.

### THE LANDSCAPE OF ML DOCUMENTATION TOOLS

https://huggingface.co/docs/hub/model-card-landscape-analysis.md

# THE LANDSCAPE OF ML DOCUMENTATION TOOLS

The development of the model cards framework in 2018 was inspired by the major documentation framework efforts of Data Statements for Natural Language Processing ([Bender & Friedman, 2018](https://aclanthology.org/Q18-1041/)) and Datasheets for Datasets ([Gebru et al., 2018](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf)). Since model cards were proposed, a number of other tools have been proposed for documenting and evaluating various aspects of the machine learning development cycle. These tools, including model cards and related documentation efforts proposed prior to model cards, can be contextualised with regard to their focus (e.g., on which part of the ML system lifecycle does the tool focus?)
and their intended audiences (e.g., who is the tool designed for?). In Figures 1-2 below, we summarise several prominent documentation tools along these dimensions, provide contextual descriptions of each tool, and link to examples. We broadly classify the documentation tools as belonging to the following groups:

* **Data-focused**, including documentation tools focused on datasets used in the machine learning system lifecycle
* **Models-and-methods-focused**, including documentation tools focused on machine learning models and methods; and
* **Systems-focused**, including documentation tools focused on ML systems, including models, methods, datasets, APIs, and non-AI/ML components that interact with each other as part of an ML system

These groupings are not mutually exclusive; they do include overlapping aspects of the ML system lifecycle. For example, **system cards** focus on documenting ML systems that may include multiple models and datasets, and thus might include content that overlaps with data-focused or model-focused documentation tools. The tools described are a non-exhaustive list of documentation tools for the ML system lifecycle.
In general, we included tools that were:

* Focused on documentation of some (or multiple) aspects of the ML system lifecycle
* Included the release of a template intended for repeated use, adoption, and adaptation

## Summary of ML Documentation Tools

### Figure 1

| **Stage of ML System Lifecycle** | **Tool** | **Brief Description** | **Examples** |
|:---:|---|---|---|
| DATA | ***Datasheets*** [(Gebru et al., 2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) | "We recommend that every dataset be accompanied with a datasheet documenting its motivation, creation, composition, intended uses, distribution, maintenance, and other information." | See, for example, [Ivy Lee's repo](https://github.com/ivylee/model-cards-and-datasheets) with examples |
| DATA | ***Data Statements*** [(Bender & Friedman, 2018)(Bender et al., 2021)](https://techpolicylab.uw.edu/wp-content/uploads/2021/11/Data_Statements_Guide_V2.pdf) | "A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software." |
See [Data Statements for NLP Workshop](https://techpolicylab.uw.edu/events/event/data-statements-for-nlp/) |
| DATA | ***Dataset Nutrition Labels*** [(Holland et al., 2018)](https://huggingface.co/papers/1805.03677) | "The Dataset Nutrition Label...is a diagnostic framework that lowers the barrier to standardized data analysis by providing a distilled yet comprehensive overview of dataset "ingredients" before AI model development." | See [The Data Nutrition Label](https://datanutrition.org/labels/) |
| DATA | ***Data Cards for NLP*** [(McMillan-Major et al., 2021)](https://huggingface.co/papers/2108.07374) | "We present two case studies of creating documentation templates and guides in natural language processing (NLP): the Hugging Face (HF) dataset hub[^1] and the benchmark for Generation and its Evaluation and Metrics (GEM). We use the term data card to refer to documentation for datasets in both cases." | See [(McMillan-Major et al., 2021)](https://huggingface.co/papers/2108.07374) |
| DATA | ***Dataset Development Lifecycle Documentation Framework*** [(Hutchinson et al., 2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445918) | "We introduce a rigorous framework for dataset development transparency that supports decision-making and accountability. The framework uses the cyclical, infrastructural and engineering nature of dataset development to draw on best practices from the software development lifecycle." | See [(Hutchinson et al., 2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445918), Appendix A for templates |
| DATA | ***Data Cards*** [(Pushkarna et al., 2021)](https://huggingface.co/papers/2204.01075) | "Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development.
These summaries provide explanations of processes and rationales that shape the data and consequently the models." | See the [Data Cards Playbook github](https://github.com/PAIR-code/datacardsplaybook/) |
| DATA | ***CrowdWorkSheets*** [(Díaz et al., 2022)](https://huggingface.co/papers/2206.08931) | "We introduce a novel framework, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decision points at various stages of the data annotation pipeline: task formulation, selection of annotators, platform and infrastructure choices, dataset analysis and evaluation, and dataset release and maintenance." | See [(Díaz et al., 2022)](https://huggingface.co/papers/2206.08931) |
| MODELS AND METHODS | ***Model Cards*** [Mitchell et al. (2018)](https://huggingface.co/papers/1810.03993) | "Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions...that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information." | See https://huggingface.co/models, the [Model Card Guidebook](https://huggingface.co/docs/hub/model-card-guidebook), and [Model Card Examples](https://huggingface.co/docs/hub/model-card-appendix#model-card-examples) |
| MODELS AND METHODS | ***Value Cards*** [Shen et al. (2021)](https://dl.acm.org/doi/abs/10.1145/3442188.3445971) | "We present Value Cards, a deliberation-driven toolkit for bringing computer science students and practitioners the awareness of the social impacts of machine learning-based decision making systems....Value Cards encourages the investigations and debates towards different ML performance metrics and their potential trade-offs." | See [Shen et al.
(2021)](https://dl.acm.org/doi/abs/10.1145/3442188.3445971), Section 3.3 |
| MODELS AND METHODS | ***Method Cards*** [Adkins et al. (2022)](https://dl.acm.org/doi/pdf/10.1145/3491101.3519724) | "We propose method cards to guide ML engineers through the process of model development...The information comprises both prescriptive and descriptive elements, putting the main focus on ensuring that ML engineers are able to use these methods properly." | See [Adkins et al. (2022)](https://dl.acm.org/doi/pdf/10.1145/3491101.3519724), Appendix A |
| MODELS AND METHODS | ***Consumer Labels for ML Models*** [Seifert et al. (2019)](https://ris.utwente.nl/ws/portalfiles/portal/158031484/Seifert2019_cogmi_consumer_labels_preprint.pdf) | "We propose to issue consumer labels for trained and published ML models. These labels primarily target machine learning lay persons, such as the operators of an ML system, the executors of decisions, and the decision subjects themselves" | See [Seifert et al. (2019)](https://ris.utwente.nl/ws/portalfiles/portal/158031484/Seifert2019_cogmi_consumer_labels_preprint.pdf) |
| SYSTEMS | ***Factsheets*** [Arnold et al. (2019)](https://huggingface.co/papers/1808.07261) | "A FactSheet will contain sections on all relevant attributes of an AI service, such as intended use, performance, safety, and security. Performance will include appropriate accuracy or risk measures along with timing information." | See [IBM's AI Factsheets 360](https://aifs360.res.ibm.com) and [Hind et al., (2020)](https://dl.acm.org/doi/abs/10.1145/3334480.3383051) |
| SYSTEMS | ***System Cards*** [Procope et al.
(2022)](https://ai.facebook.com/research/publications/system-level-transparency-of-machine-learning) | "System Cards aims to increase the transparency of ML systems by providing stakeholders with an overview of different components of an ML system, how these components interact, and how different pieces of data and protected information are used by the system." | See [Meta's Instagram Feed Ranking System Card](https://ai.facebook.com/tools/system-cards/instagram-feed-ranking/) |
| SYSTEMS | ***Reward Reports for RL*** [Gilbert et al. (2022)](https://huggingface.co/papers/2204.10817) | "We sketch a framework for documenting deployed learning systems, which we call Reward Reports...We outline Reward Reports as living documents that track updates to design choices and assumptions behind what a particular automated system is optimizing for. They are intended to track dynamic phenomena arising from system deployment, rather than merely static properties of models or data." | See https://rewardreports.github.io |
| SYSTEMS | ***Robustness Gym*** [Goel et al. (2021)](https://huggingface.co/papers/2101.04840) | "We identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks." | See https://github.com/robustness-gym/robustness-gym |
| SYSTEMS | ***ABOUT ML*** [Raji and Yang, (2019)](https://huggingface.co/papers/1912.06166) | "ABOUT ML (Annotation and Benchmarking on Understanding and Transparency of Machine Learning Lifecycles) is a multi-year, multi-stakeholder initiative led by PAI.
This initiative aims to bring together a diverse range of perspectives to develop, test, and implement machine learning system documentation practices at scale." | See [ABOUT ML's resources library](https://partnershiponai.org/about-ml-resources-library/) |

### DATA-FOCUSED DOCUMENTATION TOOLS

Several proposed documentation tools focus on datasets used in the ML system lifecycle, including to train, develop, validate, finetune, and evaluate machine learning models as part of continuous cycles. These tools generally focus on the many aspects of the data lifecycle (perhaps for a particular dataset, group of datasets, or more broadly), including how the data was assembled, collected, annotated, and how it should be used.

* Extending the concept of datasheets in the electronics industry, [Gebru et al. (2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) propose datasheets for datasets to document details related to a dataset's creation, potential uses, and associated concerns.
* [Bender and Friedman (2018)](https://aclanthology.org/Q18-1041/) propose data statements for natural language processing. [Bender, Friedman and McMillan-Major (2021)](https://techpolicylab.uw.edu/wp-content/uploads/2021/11/Data_Statements_Guide_V2.pdf) update the original data statements framework and provide resources including a guide for writing data statements and translating between the first version of the schema and the newer version[^2].
* [Holland et al. (2018)](https://huggingface.co/papers/1805.03677) propose data nutrition labels, akin to nutrition facts for foodstuffs and nutrition labels for privacy disclosures, as a tool for analyzing and making decisions about datasets. The Data Nutrition Label team released an updated design of and interface for the label in 2020 ([Chmielinski et al., 2020](https://huggingface.co/papers/2201.03954)).
* [McMillan-Major et al.
(2021)](https://huggingface.co/papers/2108.07374) describe the development process and resulting templates for **data cards for NLP** in the form of data cards on the Hugging Face Hub[^3] and data cards for datasets that are part of the NLP benchmark for Generation and its Evaluation Metrics (GEM) environment[^4]. * [Hutchinson et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445918) describe the need for comprehensive dataset documentation and, drawing on software development practices, provide templates for documenting several aspects of the dataset development lifecycle (for the purposes of Tables 1 and 2, we refer to their framework as the **Dataset Development Lifecycle Documentation Framework**). * [Pushkarna et al. (2021)](https://huggingface.co/papers/2204.01075) propose data cards as part of the **data card playbook**, a human-centered documentation tool focused on datasets used in industry and research. ### MODEL-AND-METHOD-FOCUSED DOCUMENTATION TOOLS Another set of documentation tools can be thought of as focusing on machine learning models and machine learning methods. These include: * [Mitchell et al. (2018)](https://huggingface.co/papers/1810.03993) propose **model cards** for model reporting to accompany trained ML models and document issues related to evaluation, use, and other concerns. * [Shen et al. (2021)](https://dl.acm.org/doi/abs/10.1145/3442188.3445971) propose **value cards** for teaching students and practitioners about values related to ML models. * [Seifert et al. (2019)](https://ris.utwente.nl/ws/portalfiles/portal/158031484/Seifert2019_cogmi_consumer_labels_preprint.pdf) propose **consumer labels for ML models** to help non-experts using or affected by the model understand key issues related to the model. * [Adkins et al.
(2022)](https://dl.acm.org/doi/pdf/10.1145/3491101.3519724) analyse aspects of descriptive documentation tools – which they consider to include **model cards** and data sheets – and argue for increased prescriptive tools for ML engineers. They propose method cards, focused on ML methods and designed primarily with technical stakeholders like model developers and reviewers in mind. * They envision the relationship between model cards and method cards, in part, by stating: "The sections and prompts we propose…[in the method card template] focus on ML methods that are sufficient to produce a proper ML model with defined input, output, and task. Examples for these are object detection methods such as Single-shot Detectors and language modelling methods such as Generative Pre-trained Transformers (GPT). *It is possible to create Model Cards for the models created using these methods*." * They also state "While Model Cards and FactSheets put main focus on documenting existing models, Method Cards focus more on the underlying methodical and algorithmic choices that need to be considered when creating and training these models. *As a rough analogy, if Model Cards and FactSheets provide nutritional information about cooked meals, Method Cards provide the recipes*." ### SYSTEM-FOCUSED DOCUMENTATION TOOLS Rather than focusing on particular models, datasets, or methods, system-focused documentation tools look at how models interact with each other, with datasets, with methods, and with other ML components to form ML systems. * [Procope et al. (2022)](https://ai.facebook.com/research/publications/system-level-transparency-of-machine-learning) propose system cards to document and explain AI systems – potentially including multiple ML models, AI tools, and non-AI technologies – that work together to accomplish tasks. * [Arnold et al.
(2019)](https://huggingface.co/papers/1808.07261) extend the idea of declarations of conformity for consumer products to AI services, proposing FactSheets to document aspects of "AI services" which are typically accessed through APIs and may be composed of multiple different ML models. [Hind et al. (2020)](https://dl.acm.org/doi/abs/10.1145/3334480.3383051) share reflections on building factsheets. * [Gilbert et al. (2022)](https://huggingface.co/papers/2204.10817) propose **Reward Reports for Reinforcement Learning** systems, recognizing the dynamic nature of ML systems and the need for documentation efforts to incorporate considerations of post-deployment performance, especially for reinforcement learning systems. * [Goel et al. (2021)](https://huggingface.co/papers/2101.04840) develop **Robustness Gym**, an evaluation toolkit for testing several aspects of deep neural networks in real-world systems, allowing for comparison across evaluation paradigms. * Through the [ABOUT ML project](https://partnershiponai.org/workstream/about-ml/) ([Raji and Yang, 2019](https://huggingface.co/papers/1912.06166)), the Partnership on AI is coordinating efforts across groups of stakeholders in the machine learning community to develop comprehensive, scalable documentation tools for ML systems. ## THE EVOLUTION OF MODEL CARDS Since the proposal for model cards by Mitchell et al. in 2018, model cards have been adopted and adapted by various organisations, including by major technology companies and startups developing and hosting machine learning models[^5], researchers describing new techniques[^6], and government stakeholders evaluating models for various projects[^7]. Model cards also appear as part of AI Ethics educational toolkits, and numerous organisations and developers have created implementations for automating or semi-automating the creation of model cards.
Appendix A provides a set of examples of model cards for various types of ML models created by different organisations (including model cards for large language models), model card generation tools, and model card educational tools. ### MODEL CARDS ON THE HUGGING FACE HUB Since 2018, new platforms and media for hosting and sharing model cards have also emerged. For example, particularly relevant to this project, Hugging Face hosts model cards on the Hugging Face Hub as README files in the repositories associated with ML models. As a result, model cards figure as a prominent form of documentation for users of models on the Hugging Face Hub. As part of our analysis of model cards, we developed and proposed model cards for several dozen ML models on the Hugging Face Hub, using the Hub's Pull Request (PR) and Discussion features to gather feedback on model cards, verify information included in model cards, and publish model cards for models on the Hugging Face Hub. At the time of writing this guidebook, all of Hugging Face's models on the Hugging Face Hub have an associated model card on the Hub[^8]. The high number of models uploaded to the Hugging Face Hub (101,041 models at the time of writing) enabled us to explore the content of model cards on the Hub. We began by analysing language model cards to identify patterns (e.g. repeated sections and subsections), with the aim of answering initial questions such as: 1) How many of these models have model cards? 2) What percent of downloads had an associated model card? From our analysis of all the models on the Hub, we noticed that most downloads come from the top 200 models. With a continued focus on large language models, ordered by most downloaded and considering only models with model cards to begin with, we noted the most recurring sections within their respective model cards.
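The section-recurrence analysis described here can be sketched as a simple heading tally over model card texts. This is an illustrative helper, not the exact analysis we ran: the function name and the sample cards are hypothetical, and it assumes each card is available as a markdown README string.

```python
from collections import Counter

def recurring_sections(model_cards: list[str]) -> Counter:
    """Count, across a collection of model card README strings, how many
    cards contain each markdown heading (each heading counted once per card)."""
    counts: Counter = Counter()
    for card in model_cards:
        headings = {
            line.lstrip("#").strip().lower()
            for line in card.splitlines()
            if line.lstrip().startswith("#")
        }
        counts.update(headings)
    return counts

# Two toy cards: "training data" appears in both, every other heading in one.
cards = [
    "# Model X\n## Intended uses & limitations\n## Training data",
    "# Model Y\n## Training data\n## Evaluation results",
]
print(recurring_sections(cards).most_common(3))
```

In practice, the card texts would come from each model repository's README on the Hub rather than hard-coded strings.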
While some headings within model cards may differ between models, we grouped the components/themes of each section within each model card and then mapped them to the most recurring section headings (mostly found in the top 200 downloaded models, with the guidance of the BLOOM model card). > [!TIP] > [Check out the User Studies](./model-cards-user-studies) > [!TIP] > [See Appendix](./model-card-appendix) [^1]: For each tool, descriptions are excerpted from the linked paper listed in the second column. [^2]: See https://techpolicylab.uw.edu/data-statements/ . [^3]: See https://techpolicylab.uw.edu/data-statements/ . [^4]: See https://techpolicylab.uw.edu/data-statements/ . [^5]: See, e.g., the Hugging Face Hub, Google Cloud's Model Cards https://modelcards.withgoogle.com/about . [^6]: See Appendix A. [^7]: See GSA / US Census Bureau Collaboration on Model Card Generator. [^8]: By "Hugging Face models," we mean models shared by Hugging Face, not another organisation, on the Hub. Formally, these are models without a '/' in their model ID. --- **Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook ### Livebook on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-livebook.md # Livebook on Spaces **Livebook** is an open-source tool for writing interactive code notebooks in [Elixir](https://elixir-lang.org/). It's part of a growing collection of Elixir tools for [numerical computing](https://github.com/elixir-nx/nx), [data science](https://github.com/elixir-nx/explorer), and [Machine Learning](https://github.com/elixir-nx/bumblebee).
Some of Livebook's most exciting features are: - **Reproducible workflows**: Livebook runs your code in a predictable order, all the way down to package management - **Smart cells**: perform complex tasks, such as data manipulation and running machine learning models, with a few clicks using Livebook's extensible notebook cells - **Elixir powered**: use the power of the Elixir programming language to write concurrent and distributed notebooks that scale beyond your machine To learn more about it, watch this [15-minute video](https://www.youtube.com/watch?v=EhSNXWkji6o). Or visit [Livebook's website](https://livebook.dev/). Or follow its [Twitter](https://twitter.com/livebookdev) and [blog](https://news.livebook.dev/) to keep up with new features and updates. ## Your first Livebook Space You can get Livebook up and running in a Space with just a few clicks. Click the button below to start creating a new Space using Livebook's Docker template: Then: 1. Give your Space a name 2. Set the password of your Livebook 3. Set its visibility to public 4. Create your Space ![Creating a Livebook Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-new-space.png) This will start building your Space using Livebook's Docker image. The visibility of the Space must be set to public for the Smart cells feature in Livebook to function properly. However, your Livebook instance will be protected by Livebook authentication. > [!TIP] > Smart cell is a type of Livebook cell that provides a UI component for accomplishing a specific task. The code for the task is generated automatically based on the user's interactions with the UI, allowing for faster completion of high-level tasks without writing code from scratch.
Once the app build is finished, go to the "App" tab in your Space and log in to your Livebook using the password you previously set: ![Livebook authentication](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-authentication.png) That's it! Now you can start using Livebook inside your Space. If this is your first time using Livebook, you can learn how to use it with its interactive notebooks within Livebook itself: ![Livebook's learn notebooks](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-learn-section.png) ## Livebook integration with Hugging Face Models Livebook has an [official integration with Hugging Face models](https://livebook.dev/integrations/hugging-face). With this feature, you can run various Machine Learning models within Livebook with just a few clicks. Here's a quick video showing how to do that: ## How to update Livebook's version To update Livebook to its latest version, go to the Settings page of your Space and click on "Factory reboot this Space": ![Factory reboot a Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-factory-reboot.png) ## Caveats The following caveats apply to running Livebook inside a Space: - The Space's visibility setting must be public. Otherwise, Smart cells won't work. That said, your Livebook instance will still be behind Livebook authentication since you've set the `LIVEBOOK_PASSWORD` secret. - Livebook global configurations will be lost once the Space restarts. Consider using the [desktop app](https://livebook.dev/#install) if you find yourself in need of persisting configuration across deployments. ## Feedback and support If you have improvement suggestions or need specific support, please join the [Livebook community on GitHub](https://github.com/livebook-dev/livebook/discussions). 
### Spaces Custom Domain https://huggingface.co/docs/hub/spaces-custom-domain.md # Spaces Custom Domain > [!WARNING] > This feature is part of PRO or Team & Enterprise plans. ## Getting started with a Custom Domain Spaces Custom Domain allows you to host your Space on a custom domain of your choosing: `yourdomain.example.com`. The custom domain must be a valid DNS name. > [!NOTE] > Custom domains require your Space to have **public** or **protected** visibility. They are not supported on private Spaces. ### Setting up your domain You can submit a custom domain in the settings of your Space, under "Custom Domain". You'll need to add a CNAME record pointing your domain to `hf.space`: ### Verifying your domain After submission, the request will move to "pending" status: Once the DNS is properly configured, you'll see a "ready" status confirming the custom domain is active for your Space. If you've completed all the steps but aren't seeing a "ready" status, you can enter your domain [here](https://toolbox.googleapps.com/apps/dig/#CNAME/) to verify it points to `hf.space`. If it doesn't, please check your domain host to ensure the CNAME record was added correctly. ## Removing a Custom Domain Simply remove a custom domain by using the delete button to the right of "Custom Domain" in the settings of your Space. You can delete it while the custom domain is in the pending or ready state. ### Storage Regions on the Hub https://huggingface.co/docs/hub/storage-regions.md # Storage Regions on the Hub > [!WARNING] > This feature is part of the Team & Enterprise plans. Regions allow you to specify where your organization's models, datasets and Spaces are stored. For non-Team or Enterprise users, repositories are always stored in the US.
This offers two key benefits: - Regulatory and legal compliance - Performance (faster download/upload speeds and lower latency) Currently available regions: - US 🇺🇸 - EU 🇪🇺 - Coming soon: Asia-Pacific 🌏 ## Getting started with Storage Regions Organizations subscribed to a Team or Enterprise plan can access the Regions settings page to manage their repositories' storage locations. screenshot of Hugging Face Storage Regions feature This page displays: - An audit of your organization's repository locations - Options to select where new repositories will be stored > [!TIP] > Some advanced compute options for Spaces, such as ZeroGPU, may not be available in all regions. ## Repository Tag Any repository (model or dataset) stored in a non-default location displays its Region as a tag, allowing organization members to quickly identify repository locations. screenshot of Hugging Face Storage Regions tag feature ## Regulatory and legal compliance Regulated industries often require data storage in specific regions. For EU companies, you can use the Hub for ML development in a GDPR-compliant manner, with datasets, models and inference endpoints stored in EU data centers. ## Performance Storing models and datasets closer to your team and infrastructure significantly improves performance for both uploads and downloads. This impact is substantial given the typically large size of model weights and dataset files. example of Hugging Face Storage Regions feature For example, European users storing repositories in the EU region can expect approximately 4-5x faster upload and download speeds compared to US storage. ## Spaces Both a Space's storage and runtime use the chosen region. Available hardware configurations vary by region, and some features may not be available in all regions. Contact your HF account team for specific requests.
### Basic SSO https://huggingface.co/docs/hub/security-sso-basic.md # Basic SSO > [!WARNING] > This feature is part of the Team & Enterprise plans. Basic SSO adds an access-control layer on top of the standard Hugging Face login. It allows you to enforce authentication through your Identity Provider (IdP) when members access resources under your organization's namespace, such as private models, datasets, and Spaces. For a comparison with Managed SSO, see the [SSO overview](./enterprise-sso). ## How it works > [!NOTE] > **Basic SSO does not replace the Hugging Face login.** Your members will still need to sign in to Hugging Face with their own credentials (email/password, Google, or GitHub) before being prompted to complete SSO authentication to access your organization's resources. This is by design: Basic SSO secures access to your organization without taking over the user's Hugging Face identity. When Single Sign-On is enabled, organization members authenticate through your Identity Provider (IdP). You pick whether SSO is **enforced** or **optional**: - **Enforced** (default): Members have to complete SSO authentication before accessing anything under the organization's namespace. - **Optional**: Members get prompted via a banner at the top of the page to set up SSO, but can skip it and still access the organization. This is handy when you're migrating a lot of users and want to give them time to sort out their accounts before definitively enforcing SSO. Public content is still accessible to everyone, including non-members. **We use email addresses to identify SSO users. As a user, make sure that your organizational email address (e.g. your company email) has been added to [your user account](https://huggingface.co/settings/account).** When users log in, they will be prompted to complete the Single Sign-On authentication flow with a banner similar to the following: Single Sign-On only applies to your organization.
Members may belong to other organizations on Hugging Face. ## Getting started Basic SSO can be configured directly from your organization's settings. Hugging Face Hub can work with any OIDC-compliant or SAML Identity Provider, including Okta, OneLogin, and Microsoft Entra ID (Azure AD). See our [Configuration Guides](./security-sso-configuration-guides) for step-by-step setup instructions. ## User provisioning Once SSO is enabled on your organization, a direct join link can be copied and shared with new members. This SSO join link is available in both the **SSO** and **Members** settings tabs. Since organizations with SSO enabled cannot use classic invite links, the SSO join link is the primary method for inviting teammates to your organization. Simply click the copy button to copy the link to your clipboard and share it with the members you want to invite. When recipients click the shared link, they will be able to authenticate via SSO and directly join your organization. Organizations on the Enterprise plan can also use [SCIM](./enterprise-scim) to automate invitation-based provisioning from your Identity Provider. See the [SCIM guide](./enterprise-scim) for more details. ## SSO features Basic SSO supports [role mapping, resource group mapping, session timeout, matching email domains, and external collaborators](./security-sso-user-management). These features are configurable from your organization's settings. ### SSO Configuration Guides https://huggingface.co/docs/hub/security-sso-configuration-guides.md # SSO Configuration Guides > [!WARNING] > This feature is part of the Team & Enterprise plans. These guides help you configure SAML 2.0 and OpenID Connect (OIDC) with your Identity Provider for [Basic SSO](./security-sso-basic). Hugging Face Hub can work with any SAML or OIDC-compliant Identity Provider. > [!NOTE] > If you are looking to set up [Managed SSO](./enterprise-advanced-sso), the configuration is done in collaboration with the Hugging Face team. 
Please contact us to get started. ## Okta - [How to configure OIDC with Okta](./security-sso-okta-oidc) - [How to configure SAML with Okta](./security-sso-okta-saml) - [How to configure SCIM with Okta](./security-sso-okta-scim) ## Microsoft Entra ID (Azure AD) - [How to configure SAML with Entra ID](./security-sso-azure-saml) - [How to configure OIDC with Entra ID](./security-sso-azure-oidc) - [How to configure SCIM with Entra ID](./security-sso-entra-id-scim) ## Google Workspace - [How to configure SAML with Google Workspace](./security-sso-google-saml) - [How to configure OIDC with Google Workspace](./security-sso-google-oidc) ### Storage limits https://huggingface.co/docs/hub/storage-limits.md # Storage limits At Hugging Face we aim to provide the AI community with significant volumes of **free storage space for public repositories**, with options to buy more storage if necessary. We also bill for storage space for **private repositories**, above a free tier (see table below). > [!TIP] > Storage limits and policies apply to all types of repositories (models, datasets, buckets, …) on the Hub. We [optimize our infrastructure](https://huggingface.co/blog/xethub-joins-hf) continuously to [scale our storage](https://x.com/julien_c/status/1821540661973160339) for the coming years of growth in AI and machine learning. We do have mitigations in place to prevent abuse of free public storage, and in general we ask users and organizations to make sure any uploaded large model or dataset is **as useful to the community as possible** (as represented by numbers of likes or downloads, for instance). Upgrade to a paid Organization or User (PRO) account to unlock higher limits.
## Storage plans | Type of account | Public storage | Private storage | | ------------------------ | ------------------------------------------------------------------- | ---------------------------- | | Free user or org | Best-effort\* | 100GB | | PRO | Up to 10TB included\* + [add-on](#public-storage-add-on) ✅ grants available for impactful work† | 1TB + pay-as-you-go | | Team Organizations | 12TB base + 1TB per seat + [add-on](#public-storage-add-on) ✅ | 1TB per seat + pay-as-you-go | | Enterprise Organizations | 200TB base + 1TB per seat + [add-on](#public-storage-add-on) 🏆 Up to 1,000TB for large contracts | 1TB per seat + pay-as-you-go | 💡 [Team or Enterprise Organizations](https://huggingface.co/enterprise) include 1TB of private storage per seat in the subscription: for example, if your organization has 40 members, then you have 40TB of included private storage. \* We aim to continue providing the AI community with generous free storage space for public repositories. Beyond the first few gigabytes, please use this resource responsibly by uploading content that offers genuine value to other users. If you need substantial storage space, you will need to upgrade to [PRO, Team or Enterprise](https://huggingface.co/pricing). † In some cases, additional storage grants are available for high-impact open-source work where a paid plan genuinely cannot cover the need. Contact us with evidence of community impact (likes, downloads, citations). ### Public Storage add-on Users on a paid plan (PRO, Team, or Enterprise) can subscribe to a **Public Storage add-on** for additional public storage on top of their plan's base limit.
| Storage add-on | Price | Per TB | | -------------- | -------------- | ---------------- | | 1 TB | $12/month | $12/TB/month | | 5 TB | $60/month | $12/TB/month | | 10 TB | $120/month | $12/TB/month | | 20 TB | $240/month | $12/TB/month | | 50 TB | $500/month | $10/TB/month | You can subscribe or change your tier from the **Billing** settings page of your account or organization. Upgrades take effect immediately; downgrades are scheduled to take effect at the start of the next month. If you need more storage, you can [contact us](https://huggingface.co/contact/sales) to take advantage of [custom large-scale pricing](https://huggingface.co/pricing#storage). ### Private storage Pay-as-you-go Above the included 1TB (or 1TB per seat) of private storage in [PRO](https://huggingface.co/subscribe/pro) and [Team or Enterprise Organizations](https://huggingface.co/enterprise), additional private storage is charged to your payment method in Pay-as-you-go mode, at a base price of $18/TB/mo. Additional discounts are available for large-scale volumes through our account executives: | Volume | Price (private repos) | | ------ | --------------------- | | Base | $18/TB/mo | | 50TB+ | $16/TB/mo | | 200TB+ | $14/TB/mo | | 500TB+ | $12/TB/mo | See our [billing doc](./billing) for more details, or view the latest pricing at [huggingface.co/pricing](https://huggingface.co/pricing#storage). ## Repository limitations and recommendations > [!NOTE] > This section does not apply to [Storage Buckets](./storage-buckets) In addition to storage limits at the account (user or organization) level, there are some limitations to be aware of when dealing with large amounts of data in a specific Git-backed repository. Given the time it takes to stream the data, getting an upload/push to fail at the end of the process or encountering a degraded experience, be it on hf.co or when working locally, can be very annoying. 
In the following section, we describe our recommendations on how to best structure your large repos. ### Recommendations We gathered a list of tips and recommendations for structuring your repo. If you are looking for more practical tips, check out [this guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#tips-and-tricks-for-large-uploads) on how to upload large amounts of data using the Python library. | Characteristic | Recommended | Tips | | ---------------- | ------------------ | ------------------------------------------------------ | | Repo size | - | upgrade your [storage plan](#storage-plans) or contact us for large repos (TBs of data) | | Files per repo | <100k | merge data into fewer files | | Entries per folder | <10k | use subdirectories in repo | | File size | <200GB | split data into chunked files | | Commit size | <100 files* | upload files in multiple commits | | Commits per repo | - | upload multiple files per commit and/or squash history | _\* Not relevant when using `git` CLI directly_ Please read the next section to better understand those limits and how to deal with them. ### Explanations What are we talking about when we say "large uploads", and what are their associated limitations? Large uploads can be very diverse, from repositories with a few huge files (e.g. model weights) to repositories with thousands of small files (e.g. an image dataset). Under the hood, the Hub uses Git to version the data, which has structural implications on what you can do in your repo. If your repo is crossing some of the numbers mentioned in the previous section, **we strongly encourage you to check out [`git-sizer`](https://github.com/github/git-sizer)**, which has very detailed documentation about the different factors that will impact your experience. Here is a TL;DR of factors to consider: - **Repository size**: The total size of the data you're planning to upload.
There is no per-repo size limit for models and datasets, but uploads count against your account's total storage quota (see [Storage plans](#storage-plans) above). If you need more storage, [upgrade your plan](https://huggingface.co/pricing) or purchase a [storage add-on](#public-storage-add-on). - **Number of files**: - For optimal experience, we recommend keeping the total number of files under 100k, and ideally much less. Try merging the data into fewer files if you have more. For example, json files can be merged into a single jsonl file, or large datasets can be exported as Parquet files or in [WebDataset](https://github.com/webdataset/webdataset) format. - The number of files in a single folder cannot exceed 10k. A simple solution is to create a repository structure that uses subdirectories. For example, a repo with 1k folders from `000/` to `999/`, each containing at most 1000 files, is already enough. - **File size**: In the case of uploading large files (e.g. model weights), we strongly recommend splitting them **into chunks of less than 200GB each**. There are a few reasons for this: - Uploading and downloading smaller files is much easier both for you and the other users. Connection issues can always happen when streaming data, and smaller files avoid resuming from the beginning in case of errors. - Files are served to users using CloudFront. From our experience, huge files are not cached by this service, leading to slower download speeds. In all cases, no single file may exceed 500GB; 500GB is the hard limit for a single file's size. - **Number of commits**: There is no hard limit for the total number of commits in your repo history. However, from our experience, the user experience on the Hub starts to degrade after a few thousand commits. We are constantly working to improve the service, but one must always remember that a git repository is not meant to work as a database with a lot of writes.
If your repo's history gets very large, it is always possible to squash all the commits to get a fresh start using `huggingface_hub`'s [`super_squash_history`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.super_squash_history). Be aware that this is an irreversible operation. - **Number of operations per commit**: Once again, there is no hard limit here. When a commit is uploaded to the Hub, each git operation (addition or deletion) is checked by the server. When a hundred LFS files are committed at once, each file is checked individually to ensure it has been correctly uploaded. When pushing data through HTTP, a timeout of 60s is set on the request, meaning that if the process takes more time, an error is raised. However, it can happen (in rare cases) that even if the timeout is raised client-side, the process is still completed server-side. This can be checked manually by browsing the repo on the Hub. To prevent this timeout, we recommend adding around 50-100 files per commit. ### Sharing large datasets on the Hub One key way Hugging Face supports the machine learning ecosystem is by hosting datasets on the Hub, including very large ones. Large datasets count against your account's total storage quota, so make sure your [storage plan](#storage-plans) has sufficient capacity before uploading. Additional public storage can be purchased as an [add-on](#public-storage-add-on). For hosting large datasets on the Hub, we require the following: - A dataset card: we want to ensure that your dataset can be used effectively by the community, and one of the key ways of enabling this is via a dataset card. This [guidance](./datasets-cards) provides an overview of how to write a dataset card. - You are sharing the dataset to enable community reuse. If you plan to upload a dataset you anticipate won't have any further reuse, other platforms are likely more suitable. - You must follow the repository limitations outlined above.
- Using file formats that are well integrated with the Hugging Face ecosystem. We have good support for [Parquet](https://huggingface.co/docs/datasets/main/en/loading#parquet) and [WebDataset](https://huggingface.co/docs/datasets/main/en/loading#webdataset) formats, which are often good options for sharing large datasets efficiently. This will also ensure the dataset viewer works for your dataset. - Avoid the use of custom loading scripts when using datasets. In our experience, datasets that require custom code to use often end up with limited reuse. Please get in touch with us if any of these requirements are difficult for you to meet because of the type of data or domain you are working in. ### Sharing large volumes of models on the Hub Similarly to datasets, large models or large volumes of models (for instance, hundreds of automated quants) count against your account's total storage quota. Make sure your [storage plan](#storage-plans) has sufficient capacity, or purchase a [storage add-on](#public-storage-add-on). ### Grants for research teams and non-profits We recommend that academic and research institutions upgrade to [Team, Enterprise, or Academia Hub](https://huggingface.co/pricing) for guaranteed storage limits. In some cases, storage grants may be available for high-impact open-source work where a paid plan genuinely cannot cover the need. These are evaluated on a case-by-case basis and require demonstrated community impact (downloads, citations, community adoption, etc.). Contact datasets@huggingface.co or models@huggingface.co with a detailed proposal. ## How can I free up storage space in my account/organization? There are several ways to manage and free some storage space in your account or organization. First, if you need more storage space, upgrade to a PRO, Team or Enterprise plan for increased storage limits. ⚠️ **Important**: Deleting Large Files is a destructive operation that cannot be undone.
Make sure to back up your files before proceeding.

Key points to remember:

- Deleting only LFS pointers doesn't free up space
- If you do not rewrite the Git history, future checkouts of branches/tags containing deleted LFS files with existing LFS pointers will fail (to avoid errors, add the following line to your `.gitconfig` file: `lfs.skipdownloaderrors=true`)

### Deleting individual LFS files

1. Navigate to your repository's Settings page
2. Click on "List LFS files" in the "Storage" section
3. Use the actions menu to delete specific files

### Deleting Pull request refs

[Pull requests](./repositories-pull-requests-discussions) create git refs that store their commits. After closing or merging a PR, you can delete its ref to free up storage space. This is especially useful when:

- PRs contain large files that were never merged
- You've squashed the main branch and removed files later on: those files remain in the PR branch history even if they weren't added by the PR itself

To delete a PR ref, open the closed or merged PR and look for the storage notice at the bottom showing the estimated space that could be freed. Click "Delete ref" to permanently remove it.

> [!NOTE]
> Deleting a PR ref is irreversible and will prevent anyone from fetching or checking out those commits locally.

### Super-squash your repository using the API

The super-squash operation compresses your entire Git history into a single commit. Consider using super-squash when you need to reclaim storage from old LFS versions you're no longer using. This operation is only available through the [Hub Python Library](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.super_squash_history) or the API.
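As a minimal sketch, squashing via `huggingface_hub` looks like the following. The function and repo names are placeholders, and the destructive call is wrapped in a function so nothing runs by accident:

```python
def super_squash(repo_id: str, repo_type: str = "model", branch: str = "main") -> None:
    """Squash the full history of `branch` into a single commit (irreversible)."""
    from huggingface_hub import HfApi  # requires `pip install huggingface_hub`

    api = HfApi()  # uses the token cached by `hf auth login`
    api.super_squash_history(repo_id=repo_id, repo_type=repo_type, branch=branch)

# Example (destructive; only run once you have backups):
# super_squash("username/my-large-repo", repo_type="dataset")
```

The quota effects are not immediate (see the note below about the 36-hour delay), so don't expect your storage usage to drop right after the call returns.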
โš ๏ธ **Important**: This is a destructive operation that cannot be undone, commit history will be permanently lost and **LFS file history will be removed** The effects from the squash operation on your storage quota are not immediate and will be reflected on your quota within 36 hours. ### Advanced: Track LFS file references When you find an LFS file in your repository's "List LFS files" but don't know where it came from, you can trace its history using its SHA-256 OID by using the git log command: ```bash git log --all -p -S ``` For example: ```bash git log --all -p -S 68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3 commit 5af368743e3f1d81c2a846f7c8d4a028ad9fb021 Date: Sun Apr 28 02:01:18 2024 +0200 Update LayerNorm tensor names to weight and bias diff --git a/model.safetensors b/model.safetensors index a090ee7..e79c80e 100644 --- a/model.safetensors +++ b/model.safetensors @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3 +oid sha256:0bb7a1683251b832d6f4644e523b325adcf485b7193379f5515e6083b5ed174b size 440449768 commit 0a6aa9128b6194f4f3c4db429b6cb4891cdb421b (origin/pr/28) Date: Wed Nov 16 15:15:39 2022 +0000 Adding `safetensors` variant of this model (#15) - Adding `safetensors` variant of this model (18c87780b5e54825a2454d5855a354ad46c5b87e) Co-authored-by: Nicolas Patry diff --git a/model.safetensors b/model.safetensors new file mode 100644 index 0000000..a090ee7 --- /dev/null +++ b/model.safetensors @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3 +size 440449768 commit 18c87780b5e54825a2454d5855a354ad46c5b87e (origin/pr/15) Date: Thu Nov 10 09:35:55 2022 +0000 Adding `safetensors` variant of this model diff --git a/model.safetensors b/model.safetensors new file mode 100644 index 0000000..a090ee7 --- /dev/null +++ b/model.safetensors @@ -0,0 +1,3 @@ +version 
https://git-lfs.github.com/spec/v1 +oid sha256:68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3 +size 440449768 ``` ### Model Card Guidebook https://huggingface.co/docs/hub/model-card-guidebook.md # Model Card Guidebook Model cards are an important documentation and transparency framework for machine learning models. We believe that model cards have the potential to serve as *boundary objects*, a single artefact that is accessible to users who have different backgrounds and goals when interacting with model cards โ€“ including developers, students, policymakers, ethicists, those impacted by machine learning models, and other stakeholders. We recognize that developing a single artefact to serve such multifaceted purposes is difficult and requires careful consideration of potential users and use cases. Our goal as part of the Hugging Face science team over the last several months has been to help operationalize model cards towards that vision, taking into account these challenges, both at Hugging Face and in the broader ML community. To work towards that goal, it is important to recognize the thoughtful, dedicated efforts that have helped model cards grow into what they are today, from the adoption of model cards as a standard practice at many large organisations to the development of sophisticated tools for hosting and generating model cards. Since model cards were proposed by Mitchell et al. (2018), the landscape of machine learning documentation has expanded and evolved. A plethora of documentation tools and templates for data, models, and ML systems have been proposed and have developed โ€“ reflecting the incredible work of hundreds of researchers, impacted community members, advocates, and other stakeholders. Important discussions about the relationship between ML documentation and theories of change in responsible AI have created continued important discussions, and at times, divergence. 
We also recognize the challenges facing model cards, which in some ways mirror the challenges facing machine learning documentation and responsible AI efforts more generally, and we see opportunities ahead to help shape both model cards and the ecosystems in which they function positively in the months and years ahead.

Our work presents a view of where we think model cards stand right now and where they could go in the future, at Hugging Face and beyond. This work is a "snapshot" of the current state of model cards, informed by a landscape analysis of the many ways ML documentation artefacts have been instantiated. It represents one perspective amongst multiple about both the current state and more aspirational visions of model cards. In this blog post, we summarise our work, including a discussion of the broader, growing landscape of ML documentation tools, the diverse audiences for and opinions about model cards, and potential new templates for model card content. We also explore and develop model cards for machine learning models in the context of the Hugging Face Hub, using the Hub's features to collaboratively create, discuss, and disseminate model cards for ML models.

With the launch of this Guidebook, we introduce several new resources and connect together previous work on Model Cards:

1) An updated Model Card template, released in the `huggingface_hub` library [modelcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md), drawing together Model Card work in academia and throughout the industry.

2) An [Annotated Model Card Template](./model-card-annotated), which details how to fill the card out.

3) A [Model Card Creator Tool](https://huggingface.co/spaces/huggingface/Model_Cards_Writing_Tool), to ease card creation without needing to program, and to help teams share the work of different sections.
4) A [User Study](./model-cards-user-studies) on Model Card usage at Hugging Face.

5) A [Landscape Analysis and Literature Review](./model-card-landscape-analysis) of the state of the art in model documentation.

We also include an [Appendix](./model-card-appendix) with further details from this work.

---

**Please cite as:**
Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook

### Pandas

https://huggingface.co/docs/hub/datasets-pandas.md

# Pandas

[Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit. Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub.

## Load a DataFrame

You can load data from local files or from remote storage like Hugging Face Datasets. Pandas supports many formats, including CSV, JSON, and Parquet:

```python
>>> import pandas as pd
>>> df = pd.read_csv("path/to/data.csv")
```

To load a file from Hugging Face, the path needs to start with `hf://`. For example, the path to the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset repository is `hf://datasets/stanfordnlp/imdb`. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file `plain_text/train-00000-of-00001.parquet`:

```python
>>> import pandas as pd
>>> df = pd.read_parquet("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> df
                                                    text  label
0      I rented I AM CURIOUS-YELLOW from my video sto...      0
1      "I Am Curious: Yellow" is a risible and preten...      0
2      If only to avoid making this type of film in t...      0
3      This film was probably inspired by Godard's Ma...      0
4      Oh, brother...after hearing about this ridicul...      0
...                                                  ...    ...
24995  A hit at the time but now better categorised a...      1
24996  I love this movie like no other. Another time ...      1
24997  This film and it's sequel Barry Mckenzie holds...      1
24998  'The Adventures Of Barry McKenzie' started lif...      1
24999  The story centers around Barry McKenzie who mu...      1
```

For more information on Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).

> [!TIP]
> The same `hf://` paths also work with [Storage Buckets](./storage-buckets):
> ```python
> >>> df = pd.read_parquet("hf://buckets/username/my-bucket/data.parquet")
> >>> df.to_parquet("hf://buckets/username/my-bucket/output.parquet")
> ```

## Save a DataFrame

You can save a pandas DataFrame using `to_csv`/`to_json`/`to_parquet`, to a local file or to Hugging Face directly.

To save the DataFrame on Hugging Face, you first need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:

```
hf auth login
```

Then you can [Create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:

```python
from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
```

Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in Pandas:

```python
import pandas as pd

df.to_parquet("hf://datasets/username/my_dataset/imdb.parquet")

# or write in separate files if the dataset has train/validation/test splits
df_train.to_parquet("hf://datasets/username/my_dataset/train.parquet")
df_valid.to_parquet("hf://datasets/username/my_dataset/validation.parquet")
df_test.to_parquet("hf://datasets/username/my_dataset/test.parquet")
```

Note that Parquet files on Hugging Face are optimized to improve storage efficiency, accelerate downloads and uploads, and enable
efficient dataset streaming and editing:

* [Parquet Content Defined Chunking](https://huggingface.co/blog/parquet-cdc) optimizes Parquet for [Xet](https://huggingface.co/docs/hub/en/xet/index), Hugging Face's storage backend. It accelerates uploads and downloads thanks to chunk-based deduplication and allows efficient file editing
* Page index accelerates filters when streaming and enables efficient random access, e.g. in the [Dataset Viewer](https://huggingface.co/docs/dataset-viewer)

Pandas requires extra arguments to write optimized Parquet files:

```python
import pandas as pd

df.to_parquet(
    "hf://datasets/username/my_dataset/imdb.parquet",
    # Optimize for Xet
    use_content_defined_chunking=True,
    write_page_index=True,
)
```

* `use_content_defined_chunking=True` to enable Parquet Content Defined Chunking, for [deduplication](https://huggingface.co/blog/parquet-cdc) and [editing](./datasets-editing) (it requires `pyarrow>=21.0`)
* `write_page_index=True` to include a page index in the Parquet metadata, for [streaming and random access](./datasets-streaming)

> [!TIP]
> Content defined chunking (CDC) makes the Parquet writer chunk the data pages in a way that makes duplicate data chunked and compressed identically.
> Without CDC, the pages are arbitrarily chunked, and duplicate data is therefore impossible to detect because of compression.
> Thanks to CDC, Parquet uploads and downloads from Hugging Face are faster, since duplicate data is uploaded or downloaded only once.

Find more information about Xet [here](https://huggingface.co/join/xet).

## Leverage Xet deduplication for Parquet

Optimized Parquet files are written with Content Defined Chunking, which enables deduplication. This accelerates uploads since chunks of data that already exist on Hugging Face don't need to be uploaded again, which saves a lot of I/O.
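The deduplication idea can be illustrated with a toy chunker. This is an illustration only: the real Xet/Parquet CDC picks boundaries with a rolling hash over a window, not with this single-byte rule. The key property is the same: because boundaries are derived from the content itself rather than from byte offsets, inserting bytes at the front only changes the first chunk.

```python
def toy_cdc(data: bytes) -> list[bytes]:
    """Toy content-defined chunker: cut after any byte divisible by 16."""
    chunks, start = [], 0
    for i, b in enumerate(data):
        if b % 16 == 0:  # boundary decided by content, not position
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

original = bytes(range(1, 200))
edited = b"new" + original  # an insertion shifts every byte offset

# With fixed-size chunks, every chunk after the insertion point would differ;
# with content-defined chunks, only the first chunk changes:
shared = set(toy_cdc(original)) & set(toy_cdc(edited))
print(f"{len(shared)} of {len(toy_cdc(original))} chunks unchanged")  # 12 of 13
```

Only the changed chunks need to be transferred, which is why re-uploading an edited Parquet file written with `use_content_defined_chunking=True` is fast.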
For example, this code uploads the content of `df` and then, for `edited_df`, the upload is faster since it only uploads the chunks that changed:

```python
import pandas as pd

df.to_parquet(
    "hf://datasets/username/my_dataset/imdb.parquet",
    # Optimize for Xet
    use_content_defined_chunking=True,
    write_page_index=True,
)

edited_df = ...  # e.g. with added/modified/removed rows or columns
edited_df.to_parquet(
    "hf://datasets/username/my_dataset/imdb.parquet",
    # Optimize for Xet
    use_content_defined_chunking=True,
    write_page_index=True,
)
```

Chunks are ~64kB and Parquet stores data column by column, so in practice this is what happens when editing an optimized Parquet file:

* add a new column -> only the chunks of the new column are uploaded
* add/edit/delete a row -> one chunk per column is uploaded

In addition, the chunks of the Parquet footer containing the metadata are also uploaded.

## Use Images

You can load a folder with a metadata file containing a field for the names or paths to the images, structured like this:

```
Example 1:             Example 2:
folder/                folder/
├── metadata.csv       ├── metadata.csv
├── img000.png         └── images
├── img001.png             ├── img000.png
...                        ...
└── imgNNN.png             └── imgNNN.png
```

You can iterate over the image paths like this:

```python
import pandas as pd

folder_path = "path/to/folder/"
df = pd.read_csv(folder_path + "metadata.csv")
for image_path in (folder_path + df["file_name"]):
    ...
```

Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.csv` or `.jsonl` file with a `file_name` field), you can save this dataset to Hugging Face, and the Dataset Viewer shows both the metadata and the images on Hugging Face.
```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_image_dataset",
    repo_type="dataset",
)
```

### Image methods and Parquet

Using [pandas-image-methods](https://github.com/lhoestq/pandas-image-methods), you can enable `PIL.Image` methods on an image column. It also enables saving the dataset as a single Parquet file containing both the images and the metadata:

```python
import pandas as pd
from pandas_image_methods import PILMethods

pd.api.extensions.register_series_accessor("pil")(PILMethods)

df["image"] = (folder_path + df["file_name"]).pil.open()
df.to_parquet("data.parquet")
```

All the `PIL.Image` methods are available, e.g.

```python
df["image"] = df["image"].pil.rotate(90)
```

## Use Audios

You can load a folder with a metadata file containing a field for the names or paths to the audio files, structured like this:

```
Example 1:             Example 2:
folder/                folder/
├── metadata.csv       ├── metadata.csv
├── rec000.wav         └── audios
├── rec001.wav             ├── rec000.wav
...                        ...
└── recNNN.wav             └── recNNN.wav
```

You can iterate over the audio paths like this:

```python
import pandas as pd

folder_path = "path/to/folder/"
df = pd.read_csv(folder_path + "metadata.csv")
for audio_path in (folder_path + df["file_name"]):
    ...
```

Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-audio#additional-columns) (a `metadata.csv` or `.jsonl` file with a `file_name` field), you can save it to Hugging Face, and the Hub Dataset Viewer shows both the metadata and the audio.

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_audio_dataset",
    repo_type="dataset",
)
```

### Audio methods and Parquet

Using [pandas-audio-methods](https://github.com/lhoestq/pandas-audio-methods), you can enable `soundfile` methods on an audio column.
It also enables saving the dataset as a single Parquet file containing both the audio and the metadata:

```python
import pandas as pd
from pandas_audio_methods import SFMethods

pd.api.extensions.register_series_accessor("sf")(SFMethods)

df["audio"] = (folder_path + df["file_name"]).sf.open()
df.to_parquet("data.parquet")
```

This makes it easy to use with `librosa`, e.g. for resampling:

```python
df["audio"] = [librosa.load(audio, sr=16_000) for audio in df["audio"]]
df["audio"] = df["audio"].sf.write()
```

## Use Transformers

You can use `transformers` pipelines on pandas DataFrames to classify or generate text, images, etc. This section shows a few examples with `tqdm` for progress bars.

> [!TIP]
> Pipelines don't accept a `tqdm` object as input, but you can use a Python generator instead, in the form `x for x in tqdm(...)`

### Text Classification

```python
from transformers import pipeline
from tqdm import tqdm

pipe = pipeline("text-classification", model="clapAI/modernBERT-base-multilingual-sentiment")

# Compute labels
df["label"] = [y["label"] for y in pipe(x for x in tqdm(df["text"]))]
# Compute labels and scores
df[["label", "score"]] = [(y["label"], y["score"]) for y in pipe(x for x in tqdm(df["text"]))]
```

### Text Generation

```python
from transformers import pipeline
from tqdm import tqdm

pipe = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

# Generate chat response
prompt = "What is the main topic of this sentence? REPLY IN LESS THAN 3 WORDS. Sentence: '{}'"
df["output"] = [y["generated_text"][1]["content"] for y in pipe([{"role": "user", "content": prompt.format(x)}] for x in tqdm(df["text"]))]
```

### GGUF usage with LM Studio

https://huggingface.co/docs/hub/lmstudio.md

# GGUF usage with LM Studio

![cover](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/gguf-lmstudio-coverimage.png)

[LM Studio](https://lmstudio.ai) is a desktop application for experimenting and developing with local AI models directly on your computer. LM Studio is built on llama.cpp and works on Mac (Apple Silicon), Windows, and Linux!

## Getting models from Hugging Face into LM Studio

First, enable LM Studio under your [Local Apps Settings](https://huggingface.co/settings/local-apps) on Hugging Face.

### Option 1: Use the 'Use this model' button right from Hugging Face

For any GGUF or MLX LLM, click the "Use this model" dropdown and select LM Studio. This will run the model directly in LM Studio if you already have it, or show you a download option if you don't.

To try LM Studio with a trending model, find one here: [https://huggingface.co/models?library=gguf\&sort=trending](https://huggingface.co/models?library=gguf&sort=trending)

### Option 2: Use LM Studio's In-App Downloader

Open the LM Studio app and search for any model by pressing ⌘ + Shift + M on Mac, or Ctrl + Shift + M on PC (M stands for Models). You can even paste entire Hugging Face URLs into the search bar!

For each model, you can expand the dropdown to view multiple quantization options. LM Studio highlights the recommended choice for your hardware and indicates which options are supported.

### Option 3: Use lms, LM Studio's CLI

If you prefer a terminal-based workflow, use `lms`, LM Studio's CLI.
#### **Search for models from the terminal:**

Search with a keyword:

```bash
lms get qwen
```

Filter search by MLX or GGUF results:

```bash
lms get qwen --mlx # or --gguf
```

#### **Download any model from Hugging Face:**

Use a full Hugging Face URL:

```bash
lms get https://huggingface.co/lmstudio-community/Ministral-3-8B-Reasoning-2512-GGUF
```

#### **Choose a model quantization**

You can choose a model quantization level that balances performance, memory usage, and accuracy. This is done with the `@` qualifier, for example:

```bash
lms get https://huggingface.co/lmstudio-community/Ministral-3-8B-Reasoning-2512-GGUF@Q6_K
```

## You downloaded the model. Now what?

You've downloaded a model following one of the options above; now let's get started in LM Studio!

### Getting started with the LM Studio Application

In the LM Studio application, head to the model loader to view a list of downloaded models and select one to load. You may customize the model load parameters, though by default LM Studio selects the load parameters that optimize model performance on your hardware.

Once the model has completed loading (as indicated by the progress bar), you may start chatting away using the app's chat interface!

### Or, use LM Studio's CLI to interact with your models

See a list of commands [here](https://lmstudio.ai/docs/cli). Note that you need to run LM Studio ***at least once*** before you can use `lms`.

## **Keeping up with the latest models**

Follow the [LM Studio Community](https://huggingface.co/lmstudio-community) page on Hugging Face to stay updated on the latest and greatest local LLMs as soon as they come out.

### Hugging Face Dataset Upload Decision Guide

https://huggingface.co/docs/hub/datasets-upload-guide-llm.md

# Hugging Face Dataset Upload Decision Guide

> [!TIP]
> This guide is primarily designed for LLMs to help users upload datasets to the Hugging Face Hub in the most compatible format.
> Users can also reference this guide to understand the upload process and best practices.

> Decision guide for uploading datasets to Hugging Face Hub. Optimized for Dataset Viewer compatibility and integration with the Hugging Face ecosystem.

## Overview

Your goal is to help a user upload a dataset to the Hugging Face Hub. Ideally, the dataset should be compatible with the Dataset Viewer (and thus the `load_dataset` function) to ensure easy access and usability. You should aim to meet the following criteria:

| **Criteria** | Description | Priority |
| --- | --- | --- |
| **Respect repository limits** | Ensure the dataset adheres to Hugging Face's storage limits for file sizes, repository sizes, and file counts. See the Critical Constraints section below for specific limits. | Required |
| **Use hub-compatible formats** | Use Parquet format when possible (best compression, rich typing, large dataset support). For smaller datasets, CSV or JSON files also work. | Recommended |

## Quick Reference by Data Type

| Data type | Approach | How |
| --- | --- | --- |
| **Many files (>10k)** | Use upload_large_folder to avoid Git limitations | `api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset")` |
| **Streaming large media** | WebDataset format for efficient streaming | Create .tar shards, then `upload_large_folder()` |
| **Scientific data (HDF5, NetCDF)** | Convert to Parquet with Array features | See [Scientific Data](#scientific-data) section |
| **Custom/proprietary formats** | Document thoroughly if conversion impossible | `upload_large_folder()` with comprehensive README |

## Upload Workflow

0. ✓ **Gather dataset information** (if needed):
   - What type of data? (images, text, audio, CSV, etc.)
   - How is it organized? (folder structure, single file, multiple files)
   - What's the approximate size?
   - What format are the files in?
   - Any special requirements? (e.g., streaming, private access)
   - Check for existing README or documentation files that describe the dataset

1. ✓ **Authenticate**:
   - CLI: `hf auth login`
   - Or use a token: `HfApi(token="hf_...")` or set the `HF_TOKEN` environment variable

2. ✓ **Identify your data type**: Check the [Quick Reference](#quick-reference-by-data-type) table above

3. ✓ **Choose upload method**:
   - **Small files (<1GB) with hub-compatible format**: Can use the [Hub UI](https://huggingface.co/new-dataset) for quick uploads
   - **Built-in loader available**: Use the loader + `push_to_hub()` (see Quick Reference table)
   - **Large datasets or many files**: Use `upload_large_folder()` for files >100GB or >10k files
   - **Custom formats**: Convert to a hub-compatible format if possible, otherwise document thoroughly

4. ✓ **Test locally** (if using a built-in loader):

   ```python
   # Validate your dataset loads correctly before uploading
   dataset = load_dataset("loader_name", data_dir="./your_data")
   print(dataset)
   ```

5. ✓ **Upload to Hub**:

   ```python
   # Basic upload
   dataset.push_to_hub("username/dataset-name")

   # With options for large datasets
   dataset.push_to_hub(
       "username/dataset-name",
       max_shard_size="5GB",  # Control memory usage
       private=True  # For private datasets
   )
   ```

6. ✓ **Verify your upload**:
   - Check the Dataset Viewer: `https://huggingface.co/datasets/username/dataset-name`
   - Test loading: `load_dataset("username/dataset-name")`
   - If the viewer shows errors, check the [Troubleshooting](#common-issues--solutions) section

## Common Conversion Patterns

When built-in loaders don't match your data structure, use the datasets library as a compatibility layer.
Convert your data to a Dataset object, then use `push_to_hub()` for maximum flexibility and Dataset Viewer compatibility. ### From DataFrames If you already have your data working in pandas, polars, or other dataframe libraries, you can convert directly: ```python # From pandas DataFrame import pandas as pd from datasets import Dataset df = pd.read_csv("your_data.csv") dataset = Dataset.from_pandas(df) dataset.push_to_hub("username/dataset-name") # From polars DataFrame (direct method) import polars as pl from datasets import Dataset df = pl.read_csv("your_data.csv") dataset = Dataset.from_polars(df) # Direct conversion dataset.push_to_hub("username/dataset-name") # From PyArrow Table (useful for scientific data) import pyarrow as pa from datasets import Dataset # If you have a PyArrow table table = pa.table({'data': [1, 2, 3], 'labels': ['a', 'b', 'c']}) dataset = Dataset(table) dataset.push_to_hub("username/dataset-name") # For Spark/Dask dataframes, see https://huggingface.co/docs/hub/datasets-libraries ``` ## Custom Format Conversion When built-in loaders don't match your data format, convert to Dataset objects following these principles: ### Design Principles **1. Prefer wide/flat structures over joins** - Denormalize relational data into single rows for better usability - Include all relevant information in each example - Lean towards bigger but more usable data - Hugging Face's infrastructure uses advanced deduplication (XetHub) and Parquet optimizations to handle redundancy efficiently **2. 
Use configs for logical dataset variations** - Beyond train/test/val splits, use configs for different subsets or views of your data - Each config can have different features or data organization - Example: language-specific configs, task-specific views, or data modalities ### Conversion Methods **Small datasets (fits in memory) - use `Dataset.from_dict()`**: ```python # Parse your custom format into a dictionary data_dict = { "text": ["example1", "example2"], "label": ["positive", "negative"], "score": [0.9, 0.2] } # Create dataset with appropriate features from datasets import Dataset, Features, Value, ClassLabel features = Features({ 'text': Value('string'), 'label': ClassLabel(names=['negative', 'positive']), 'score': Value('float32') }) dataset = Dataset.from_dict(data_dict, features=features) dataset.push_to_hub("username/dataset") ``` **Large datasets (memory-efficient) - use `Dataset.from_generator()`**: ```python def data_generator(): # Parse your custom format progressively for item in parse_large_file("data.custom"): yield { "text": item["content"], "label": item["category"], "embedding": item["vector"] } # Specify features for Dataset Viewer compatibility from datasets import Features, Value, ClassLabel, List features = Features({ 'text': Value('string'), 'label': ClassLabel(names=['cat1', 'cat2', 'cat3']), 'embedding': List(feature=Value('float32'), length=768) }) dataset = Dataset.from_generator(data_generator, features=features) dataset.push_to_hub("username/dataset", max_shard_size="1GB") ``` **Tip**: For large datasets, test with a subset first by adding a limit to your generator or using `.select(range(100))` after creation. 
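The subset-testing tip above can be sketched with `itertools.islice`. Here `data_generator` is a hypothetical stand-in for your own parser, and the `Dataset.from_generator` call is only indicated in a comment:

```python
from itertools import islice

def data_generator():
    # Stand-in for progressively parsing a large custom file
    for i in range(1_000_000):
        yield {"text": f"example {i}", "label": i % 2}

def limited(generator_fn, limit):
    """Wrap a generator function so it yields at most `limit` examples."""
    def gen():
        yield from islice(generator_fn(), limit)
    return gen

# Inspect a small sample before committing to a full conversion
sample = list(limited(data_generator, 100)())
print(len(sample), sample[0])

# Then smoke-test the real pipeline on the same subset:
# dataset = Dataset.from_generator(limited(data_generator, 100), features=features)
```

Because the wrapper stops after `limit` examples, the full file is never parsed, so the smoke test stays fast even for very large inputs.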
### Using Configs for Dataset Variations ```python # Push different configurations of your dataset dataset_en = Dataset.from_dict(english_data, features=features) dataset_en.push_to_hub("username/multilingual-dataset", config_name="english") dataset_fr = Dataset.from_dict(french_data, features=features) dataset_fr.push_to_hub("username/multilingual-dataset", config_name="french") # Users can then load specific configs dataset = load_dataset("username/multilingual-dataset", "english") ``` ### Multi-modal Examples **Text + Audio (speech recognition)**: ```python def speech_generator(): for audio_file in Path("audio/").glob("*.wav"): transcript_file = audio_file.with_suffix(".txt") yield { "audio": str(audio_file), "text": transcript_file.read_text().strip(), "speaker_id": audio_file.stem.split("_")[0] } features = Features({ 'audio': Audio(sampling_rate=16000), 'text': Value('string'), 'speaker_id': Value('string') }) dataset = Dataset.from_generator(speech_generator, features=features) dataset.push_to_hub("username/speech-dataset") ``` **Multiple images per example**: ```python # Before/after images, medical imaging, etc. data = { "image_before": ["img1_before.jpg", "img2_before.jpg"], "image_after": ["img1_after.jpg", "img2_after.jpg"], "treatment": ["method_A", "method_B"] } features = Features({ 'image_before': Image(), 'image_after': Image(), 'treatment': ClassLabel(names=['method_A', 'method_B']) }) dataset = Dataset.from_dict(data, features=features) dataset.push_to_hub("username/before-after-images") ``` **Note**: For text + images, consider using ImageFolder with metadata.csv which handles this automatically. ## Essential Features Features define the schema and data types for your dataset columns. 
Specifying correct features ensures: - Proper data handling and type conversion - Dataset Viewer functionality (e.g., image/audio previews) - Efficient storage and loading - Clear documentation of your data structure For complete feature documentation, see: [Dataset Features](https://huggingface.co/docs/datasets/about_dataset_features) ### Feature Types Overview **Basic Types**: - `Value`: Scalar values - `string`, `int64`, `float32`, `bool`, `binary`, and other numeric types - `ClassLabel`: Categorical data with named classes - `Sequence`: Lists of any feature type - `LargeList`: For very large lists **Media Types** (enable Dataset Viewer previews): - `Image()`: Handles various image formats, returns PIL Image objects - `Audio(sampling_rate=16000)`: Audio with array data and optional sampling rate - `Video()`: Video files - `Pdf()`: PDF documents with text extraction **Array Types** (for tensors/scientific data): - `Array2D`, `Array3D`, `Array4D`, `Array5D`: Fixed or variable-length arrays - Example: `Array2D(shape=(224, 224), dtype='float32')` - First dimension can be `None` for variable length **Translation Types**: - `Translation`: For translation pairs with fixed languages - `TranslationVariableLanguages`: For translations with varying language pairs **Note**: New feature types are added regularly. Check the documentation for the latest additions. 
## Upload Methods **Dataset objects (use push_to_hub)**: Use when you've loaded/converted data using the datasets library ```python dataset.push_to_hub("username/dataset", max_shard_size="5GB") ``` **Pre-existing files (use upload_large_folder)**: Use when you have hub-compatible files (e.g., Parquet files) already prepared and organized ```python from huggingface_hub import HfApi api = HfApi() api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset", num_workers=16) ``` **Important**: Before using `upload_large_folder`, verify the files meet repository limits: - Check folder structure if you have file access: ensure no folder contains >10k files - Ask the user to confirm: "Are your files in a hub-compatible format (Parquet/CSV/JSON) and organized appropriately?" - For non-standard formats, consider converting to Dataset objects first to ensure compatibility ## Validation **Consider small reformatting**: If data is close to a built-in loader format, suggest minor changes: - Rename columns (e.g., 'filename' → 'file_name' for ImageFolder) - Reorganize folders (e.g., move images into class subfolders) - Rename files to match expected patterns (e.g., 'data.csv' → 'train.csv') **Pre-upload**: - Test locally: `load_dataset("imagefolder", data_dir="./data")` - Verify features work correctly: ```python # Test first example print(dataset[0]) # For images: verify they load if 'image' in dataset.features: dataset[0]['image'] # Should return PIL Image # Check dataset size before upload print(f"Size: {len(dataset)} examples") ``` - Check metadata.csv has 'file_name' column - Verify relative paths, no leading slashes - Ensure no folder >10k files **Post-upload**: - Check viewer: `https://huggingface.co/datasets/username/dataset` - Test loading: `load_dataset("username/dataset")` - Verify features preserved: `print(dataset.features)` ## Common Issues → Solutions | Issue | Solution | | -------------------------- |
------------------------------------ | | "Repository not found" | Run `hf auth login` | | Memory errors | Use `max_shard_size="500MB"` | | Dataset viewer not working | Wait 5-10min, check README.md config | | Timeout errors | Use `multi_commits=True` | | Files >50GB | Split into smaller files | | "File not found" | Use relative paths in metadata | ## Dataset Viewer Configuration **Note**: This section is primarily for datasets uploaded directly to the Hub (via UI or `upload_large_folder`). Datasets uploaded with `push_to_hub()` typically configure the viewer automatically. ### When automatic detection works The Dataset Viewer automatically detects standard structures: - Files named: `train.csv`, `test.json`, `validation.parquet` - Directories named: `train/`, `test/`, `validation/` - Split names with delimiters: `test-data.csv` ✓ (not `testdata.csv` ✗) ### Manual configuration For custom structures, add YAML to your README.md: ```yaml --- configs: - config_name: default # Required even for single config! data_files: - split: train path: "data/train/*.parquet" - split: test path: "data/test/*.parquet" --- ``` Multiple configurations example: ```yaml --- configs: - config_name: english data_files: "en/*.parquet" - config_name: french data_files: "fr/*.parquet" --- ``` ### Common viewer issues - **No viewer after upload**: Wait 5-10 minutes for processing - **"Config names error"**: Add `config_name` field (required!)
- **Files not detected**: Check naming patterns (needs delimiters) - **Viewer disabled**: Remove `viewer: false` from README YAML ## Quick Templates ```python # ImageFolder with metadata dataset = load_dataset("imagefolder", data_dir="./images") dataset.push_to_hub("username/dataset") # Memory-efficient upload dataset.push_to_hub("username/dataset", max_shard_size="500MB") # Multiple CSV files dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'}) dataset.push_to_hub("username/dataset") ``` ## Documentation **Core docs**: [Adding datasets](https://huggingface.co/docs/hub/datasets-adding) | [Dataset viewer](https://huggingface.co/docs/hub/datasets-viewer) | [Storage limits](https://huggingface.co/docs/hub/storage-limits) | [Upload guide](https://huggingface.co/docs/datasets/upload_dataset) ## Dataset Cards Remind users to add a dataset card (README.md) with: - Dataset description and usage - License information - Citation details See [Dataset Cards guide](https://huggingface.co/docs/hub/datasets-cards) for details. 
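As a starting point, a minimal dataset card could look like the sketch below (the license, language, and section contents are placeholders to adapt):

```markdown
---
license: mit
language:
- en
---

# Dataset Name

One-paragraph description of what the dataset contains, how it was collected, and what it is useful for.

## License

State the license and any usage restrictions here.

## Citation

Provide a citation (e.g., BibTeX) if you want users to credit the dataset.
```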
--- ## Appendix: Special Cases ### WebDataset Structure For streaming large media datasets: - Create 1-5GB tar shards - Consistent internal structure - Upload with `upload_large_folder` ### Scientific Data - HDF5/NetCDF → Convert to Parquet with Array features - Time series → Array2D(shape=(None, n)) - Complex metadata → Store as JSON strings ### Community Resources For very specialized or bespoke formats: - Search the Hub for similar datasets: `https://huggingface.co/datasets` - Ask for advice on the [Hugging Face Forums](https://discuss.huggingface.co/c/datasets/10) - Join the [Hugging Face Discord](https://hf.co/join/discord) for real-time help - Many domain-specific formats already have examples on the Hub ### How to configure SAML SSO with Google Workspace https://huggingface.co/docs/hub/security-sso-google-saml.md # How to configure SAML SSO with Google Workspace In this guide, we will use Google Workspace as the SSO provider, with the Security Assertion Markup Language (SAML) protocol as our preferred identity protocol. We currently support SP-initiated and IdP-initiated authentication. For user provisioning, see [SCIM](./enterprise-scim). > [!WARNING] > This feature is part of the Team & Enterprise plans. ## Step 1: Create SAML App in Google Workspace - In your Google Workspace admin console, navigate to `Admin` > `Apps` > `Web and mobile apps`. - Click `Add app` and then `Add custom SAML app`. - You must provide a name for your application in the "App name" field. - Click `Continue`. ## Step 2: Configure Hugging Face with Google's IdP Details - The next screen in the Google setup contains the SSO information for your application. - In your Hugging Face organization settings, go to the `SSO` tab and select the `SAML` protocol. - Copy the **SSO URL** from Google into the **Sign-on URL** field on Hugging Face. - Copy the **Certificate** from Google into the corresponding field on Hugging Face.
The public certificate must have the following format: ``` -----BEGIN CERTIFICATE----- {certificate} -----END CERTIFICATE----- ``` - In the Google Workspace setup, click `Continue`. ## Step 3: Configure Google with Hugging Face's SP Details - In the "Service provider details" screen, you'll need the `Assertion Consumer Service URL` and `SP Entity ID` from your Hugging Face SSO settings. Copy them into the corresponding `ACS URL` and `Entity ID` fields in Google. - Ensure the following are set: - Check the **Signed response** box. - Name ID format: `EMAIL` - Name ID: `Basic Information > Primary email` - Click `Continue`. ## Step 4: Attribute Mapping - On the "Attribute mapping" screen, click `Add mapping` and configure the attributes you want to send. This step is optional and depends on whether you want to use [Role Mapping](./security-sso-user-management#role-mapping) or [Resource Group Mapping](./security-sso-user-management#resource-group-mapping) on Hugging Face. - Click `Finish`. ## Step 5: Test and Enable SSO > [!WARNING] > Before testing, ensure you have granted access to the application for the appropriate users in the Google Workspace admin console under the app's "User access" settings. The admin performing the test must have access. It may take a few minutes for user access changes to apply on Google Workspace. - Now, in your Hugging Face SSO settings, click on **"Update and Test SAML configuration"**. - You should be redirected to your Google login prompt. Once logged in, you'll be redirected to your organization's settings page. - A green check mark near the SAML selector will confirm that the test was successful. - Once the test is successful, you can enable SSO for your organization by clicking the "Enable" button. - Once enabled, members of your organization must complete the SSO authentication flow described in the [How it works](./security-sso-basic#how-it-works) section. 
### Third-party scanner: JFrog https://huggingface.co/docs/hub/security-jfrog.md # Third-party scanner: JFrog [JFrog](https://jfrog.com/)'s security scanner detects malicious behavior in machine learning models. ![JFrog report for the danger.dat file contained in mcpotato/42-eicar-street](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/jfrog-report.png) *Example of a report for [danger.dat](https://huggingface.co/mcpotato/42-eicar-street/blob/main/danger.dat)* We [partnered with JFrog](https://hf.co/blog/jfrog) to provide scanning in order to make the Hub safer. Model files are scanned by the JFrog scanner, and we expose the scanning results on the Hub interface. JFrog's scanner is built with the goal of reducing false positives: in our experience, code contained within model weights is not always malicious. When code is detected in a file, JFrog's scanner will parse and analyze it to check for potentially malicious usage. Here is an example repository you can check out to see the feature in action: [mcpotato/42-eicar-street](https://huggingface.co/mcpotato/42-eicar-street). ## Model security refresher To share models, we serialize the data structures we use to interact with the models, in order to facilitate storage and transport. Some serialization formats are vulnerable to nasty exploits, such as arbitrary code execution (looking at you, pickle), making sharing models potentially dangerous. As Hugging Face has become a popular platform for model sharing, we'd like to protect the community from this, which is why we have developed tools like [picklescan](https://github.com/mmaitre314/picklescan) and why we integrate third-party scanners. Pickle is not the only exploitable format out there, [see for reference](https://github.com/Azure/counterfit/wiki/Abusing-ML-model-file-formats-to-create-malware-on-AI-systems:-A-proof-of-concept) how one can exploit Keras Lambda layers to achieve arbitrary code execution.
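To see why pickle is singled out, consider this minimal, harmless sketch of the mechanism such exploits rely on: the `__reduce__` hook lets a pickled object name an arbitrary callable to run at load time (here the innocuous `sorted`, where a real payload might use `os.system`):

```python
import pickle

class Exploit:
    def __reduce__(self):
        # Whatever callable and arguments we return here get invoked
        # automatically during unpickling; a real payload could return
        # (os.system, ("malicious command",)) instead.
        return (sorted, ([3, 1, 2],))

malicious_bytes = pickle.dumps(Exploit())

# The loading side never imports or references Exploit, yet the
# attacker-chosen callable still runs inside pickle.loads.
result = pickle.loads(malicious_bytes)
print(result)  # [1, 2, 3], the return value of the smuggled call
```

This load-time code path, not the model weights themselves, is what scanners look for in pickle-based files.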
### Managing Spaces with CircleCI Workflows https://huggingface.co/docs/hub/spaces-circleci.md # Managing Spaces with CircleCI Workflows You can keep your app in sync with your GitHub repository with a **CircleCI workflow**. [CircleCI](https://circleci.com) is a continuous integration and continuous delivery (CI/CD) platform that helps automate the software development process. A [CircleCI workflow](https://circleci.com/docs/workflows/) is a set of automated tasks defined in a configuration file, orchestrated by CircleCI, to streamline the process of building, testing, and deploying software applications. *Note: For files larger than 10MB, Spaces requires Git-LFS. If you don't want to use Git-LFS, you may need to review your files and check your history. Use a tool like [BFG Repo-Cleaner](https://rtyley.github.io/bfg-repo-cleaner/) to remove any large files from your history. BFG Repo-Cleaner will keep a local copy of your repository as a backup.* First, set up your GitHub repository and Spaces app together. Add your Spaces app as an additional remote to your existing Git repository. ```bash git remote add space https://huggingface.co/spaces/HF_USERNAME/SPACE_NAME ``` Then force push to sync everything for the first time: ```bash git push --force space main ``` Next, set up a [CircleCI workflow](https://circleci.com/docs/workflows/) to push your `main` git branch to Spaces. In the example below: * Replace `HF_USERNAME` with your username and `SPACE_NAME` with your Space name. * [Create a context in CircleCI](https://circleci.com/docs/contexts/) and add an env variable into it called *HF_PERSONAL_TOKEN* (you can give it any name, use the key you create in place of HF_PERSONAL_TOKEN) and the value as your Hugging Face API token. You can find your Hugging Face API token under **API Tokens** on [your Hugging Face profile](https://huggingface.co/settings/tokens). 
```yaml
version: 2.1

workflows:
  main:
    jobs:
      - sync-to-huggingface:
          context:
            - HuggingFace
          filters:
            branches:
              only:
                - main

jobs:
  sync-to-huggingface:
    docker:
      - image: alpine
    resource_class: small
    steps:
      - run:
          name: install git
          command: apk update && apk add openssh-client git
      - checkout
      - run:
          name: push to Huggingface hub
          command: |
            git config user.email ""
            git config user.name ""
            git push -f https://HF_USERNAME:${HF_PERSONAL_TOKEN}@huggingface.co/spaces/HF_USERNAME/SPACE_NAME main
```

### Perform SQL operations

https://huggingface.co/docs/hub/datasets-duckdb-sql.md

# Perform SQL operations

Performing SQL operations with DuckDB opens up a world of possibilities for querying datasets efficiently. Let's dive into some examples showcasing the power of DuckDB functions. For our demonstration, we'll explore a fascinating dataset. The [MMLU](https://huggingface.co/datasets/cais/mmlu) dataset is a multitask test containing multiple-choice questions spanning various knowledge domains. To preview the dataset, let's select a sample of 3 rows:

```bash
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3;

┌──────────────────────┬──────────────────────┬────────────────────────────────────────────────────────────────┬────────┐
│       question       │       subject        │                            choices                             │ answer │
│       varchar        │       varchar        │                           varchar[]                            │ int64  │
├──────────────────────┼──────────────────────┼────────────────────────────────────────────────────────────────┼────────┤
│ The model of light…  │ conceptual_physics   │ [wave model, particle model, Both of these, Neither of these]  │      1 │
│ A person who is lo…  │ professional_psych…  │ [his/her life scripts., his/her own feelings, attitudes, an…   │      1 │
│ The thermic effect…  │ nutrition            │ [is substantially higher for carbohydrate than for protein,…   │      2 │
└──────────────────────┴──────────────────────┴────────────────────────────────────────────────────────────────┴────────┘
```

This command retrieves a random sample of 3 rows from the dataset for us to examine. Let's start by examining the schema of our dataset.
The following table outlines the structure of our dataset:

```bash
DESCRIBE FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3;

┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ question    │ VARCHAR     │ YES     │         │         │         │
│ subject     │ VARCHAR     │ YES     │         │         │         │
│ choices     │ VARCHAR[]   │ YES     │         │         │         │
│ answer      │ BIGINT      │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
```

Next, let's analyze if there are any duplicated records in our dataset:

```bash
SELECT *, COUNT(*) AS counts
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
GROUP BY ALL
HAVING counts > 1;

┌──────────┬─────────┬───────────┬────────┬────────┐
│ question │ subject │  choices  │ answer │ counts │
│ varchar  │ varchar │ varchar[] │ int64  │ int64  │
├──────────┴─────────┴───────────┴────────┴────────┤
│ 0 rows                                           │
└──────────────────────────────────────────────────┘
```

Fortunately, our dataset doesn't contain any duplicate records.
Let's see the proportion of questions based on the subject in a bar representation:

```bash
SELECT
    subject,
    COUNT(*) AS counts,
    BAR(COUNT(*), 0, (SELECT COUNT(*) FROM 'hf://datasets/cais/mmlu/all/test-*.parquet')) AS percentage
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
GROUP BY subject
ORDER BY counts DESC;

┌────────────────────────────┬────────┬──────────────┐
│          subject           │ counts │  percentage  │
│          varchar           │ int64  │   varchar    │
├────────────────────────────┼────────┼──────────────┤
│ professional_law           │   1534 │ ████████▋    │
│ moral_scenarios            │    895 │ █████        │
│ miscellaneous              │    783 │ ████▍        │
│ professional_psychology    │    612 │ ███▍         │
│ high_school_psychology     │    545 │ ███          │
│ high_school_macroeconomics │    390 │ ██▏          │
│ elementary_mathematics     │    378 │ ██▏          │
│ moral_disputes             │    346 │ █▉           │
├────────────────────────────┴────────┴──────────────┤
│ 57 rows (8 shown)                        3 columns │
└────────────────────────────────────────────────────┘
```

Now, let's prepare a subset of the dataset containing questions related to **nutrition** and create a mapping of questions to correct answers. Notice that we have the column **choices** from which we can get the correct answer using the **answer** column as an index. Keep in mind that DuckDB list indexing is 1-based while the **answer** column is 0-based, so `choices[answer]` can be off by one and returns NULL when `answer` is `0`; we filter out the resulting empty values below, and you can use `choices[answer + 1]` if you need the labeled answer.

```bash
SELECT *
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
WHERE subject = 'nutrition'
LIMIT 3;

┌──────────────────────┬───────────┬──────────────────────────────────────────────────────────────┬────────┐
│       question       │  subject  │                           choices                            │ answer │
│       varchar        │  varchar  │                          varchar[]                           │ int64  │
├──────────────────────┼───────────┼──────────────────────────────────────────────────────────────┼────────┤
│ Which foods tend t…  │ nutrition │ [Meat, Confectionary, Fruits and vegetables, Potatoes]       │      2 │
│ In which one of th…  │ nutrition │ [If the incidence rate of the disease falls., If survival…   │      1 │
│ Which of the follo…  │ nutrition │ [The flavonoid class comprises flavonoids and isoflavonoi…   │      0 │
└──────────────────────┴───────────┴──────────────────────────────────────────────────────────────┴────────┘
```

```bash
SELECT question, choices[answer] AS correct_answer
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
WHERE subject = 'nutrition'
LIMIT 3;

┌────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────┐
│                                          question                                          │               correct_answer                │
│                                          varchar                                           │                   varchar                   │
├────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────┤
│ Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)?\n  │ Confectionary                               │
│ In which one of the following circumstances will the prevalence of a disease in the popu…  │ If the incidence rate of the disease falls. │
│ Which of the following statements is correct?\n                                            │                                             │
└────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────┘
```

To ensure data cleanliness, let's remove any newline characters at the end of the questions and filter out any empty answers:

```bash
SELECT regexp_replace(question, '\n', '') AS question, choices[answer] AS correct_answer
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
WHERE subject = 'nutrition' AND LENGTH(correct_answer) > 0
LIMIT 3;

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────┐
│                                                             question                                                              │               correct_answer                │
│                                                              varchar                                                              │                   varchar                   │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────┤
│ Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)?                                           │ Confectionary                               │
│ In which one of the following circumstances will the prevalence of a disease in the population increase, all else being constant? │ If the incidence rate of the disease falls. │
│ Which vitamin is a major lipid-soluble antioxidant in cell membranes?                                                             │ Vitamin D                                   │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────┘
```

Finally, let's highlight some of the DuckDB functions used in this section:

- `DESCRIBE`, returns the table schema.
- `USING SAMPLE`, randomly selects a subset of the dataset.
- `BAR`, draws a band whose width is proportional to (x - min) and equal to width characters when x = max. Width defaults to 80.
- `string[begin:end]`, extracts a string using slice conventions. Missing begin or end arguments are interpreted as the beginning or end of the list respectively. Negative values are accepted.
- `regexp_replace`, if the string contains the regexp pattern, replaces the matching part with replacement. - `LENGTH`, gets the number of characters in the string. > [!TIP] > There are plenty of useful functions available in DuckDB's [SQL functions overview](https://duckdb.org/docs/sql/functions/overview). The best part is that you can use them directly on Hugging Face datasets. ### Static HTML Spaces https://huggingface.co/docs/hub/spaces-sdks-static.md # Static HTML Spaces Spaces also accommodate custom HTML for your app instead of using Streamlit or Gradio. Set `sdk: static` inside the `YAML` block at the top of your Spaces **README.md** file. Then you can place your HTML code within an **index.html** file. Here are some examples of Spaces using custom HTML: * [Smarter NPC](https://huggingface.co/spaces/mishig/smarter_npc): Display a PlayCanvas project with an iframe in Spaces. * [Huggingfab](https://huggingface.co/spaces/pierreant-p/huggingfab): Display a Sketchfab model in Spaces. * [Diffuse the rest](https://huggingface.co/spaces/huggingface-projects/diffuse-the-rest): Draw and diffuse the rest ## Adding a build step before serving Static Spaces support adding a custom build step before serving your static assets. This is useful for frontend frameworks like React, Svelte and Vue that require a build process before serving the application. The build command runs automatically when your Space is updated. Add `app_build_command` inside the `YAML` block at the top of your Spaces **README.md** file, and `app_file`. For example: - `app_build_command: npm run build` - `app_file: dist/index.html` Example spaces: - [Svelte App](https://huggingface.co/spaces/julien-c/vite-svelte) - [React App](https://huggingface.co/spaces/coyotte508/static-vite) Under the hood, it will [launch a build](https://huggingface.co/spaces/huggingface/space-build), storing the generated files in a special `refs/convert/build` ref. 
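Putting the two settings together, the front matter of a static Space with a build step might look like this (the title and commands are placeholders; `sdk: static` is what marks the Space as static):

```yaml
---
title: My Static App
sdk: static
app_build_command: npm run build
app_file: dist/index.html
---
```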
## Space variables Custom [environment variables](./spaces-overview#managing-secrets) can be passed to your Space. OAuth information such as the client ID and scope are also available as environment variables, if you have [enabled OAuth](./spaces-oauth) for your Space. To use these variables in JavaScript, you can use the `window.huggingface.variables` object. For example, to access the `OAUTH_CLIENT_ID` variable, you can use `window.huggingface.variables.OAUTH_CLIENT_ID`. Here is an example of a Space with custom environment variables and OAuth enabled that displays the variables in its HTML: * [Static Variables](https://huggingface.co/spaces/huggingfacejs/static-variables) ### Using spaCy at Hugging Face https://huggingface.co/docs/hub/spacy.md # Using spaCy at Hugging Face `spaCy` is a popular library for advanced Natural Language Processing used widely across industry. `spaCy` makes it easy to use and train pipelines for tasks like named entity recognition, text classification, part-of-speech tagging and more, and lets you build powerful applications to process and analyze large volumes of text. ## Exploring spaCy models in the Hub The official models from `spaCy` 3.3 are in the `spaCy` [Organization Page](https://huggingface.co/spacy). Anyone in the community can also share their `spaCy` models, which you can find by filtering at the left of the [models page](https://huggingface.co/models?library=spacy). All models on the Hub come with useful features: 1. An automatically generated model card with label scheme, metrics, components, and more. 2. An evaluation section at the top right where you can look at the metrics. 3. Metadata tags that help with discoverability and contain information such as license and language. 4. An interactive widget you can use to play with the model directly in the browser. 5. An Inference Providers widget that allows you to make inference requests.
## Using existing models All `spaCy` models from the Hub can be installed directly with `pip`. ```bash pip install "en_core_web_sm @ https://huggingface.co/spacy/en_core_web_sm/resolve/main/en_core_web_sm-any-py3-none-any.whl" ``` To find the link of interest, you can go to a repository with a `spaCy` model. When you open the repository, you can click `Use in spaCy` and you will be given a working snippet that you can use to install and load the model! Once installed, you can load the model like any spaCy pipeline. ```python # Using spacy.load(). import spacy nlp = spacy.load("en_core_web_sm") # Importing as module. import en_core_web_sm nlp = en_core_web_sm.load() ``` ## Sharing your models ### Using the spaCy CLI (recommended) The `spacy-huggingface-hub` library extends spaCy's native CLI so people can easily push their packaged models to the Hub. You can install `spacy-huggingface-hub` with pip: ```bash pip install spacy-huggingface-hub ``` You can then check that the command has been registered successfully: ```bash python -m spacy huggingface-hub --help ``` To push with the CLI, you can use the `huggingface-hub push` command as seen below. ```bash python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo] [--verbose] ``` | Argument | Type | Description | | -------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------- | | `whl_path` | str / `Path` | The path to the `.whl` file packaged with [`spacy package`](https://spacy.io/api/cli#package). | | `--org`, `-o` | str | Optional name of the organization to which the pipeline should be uploaded. | | `--msg`, `-m` | str | Commit message to use for the update. Defaults to `"Update spaCy pipeline"`. | | `--local-repo`, `-l` | str / `Path` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory.
| | `--verbose`, `-V` | bool | Output additional info for debugging, e.g. the full generated hub metadata. | You can then upload any pipeline packaged with [`spacy package`](https://spacy.io/api/cli#package). Make sure to set `--build wheel` to output a binary `.whl` file. The uploader will read all metadata from the pipeline package, including the auto-generated pretty `README.md` and the model details available in the `meta.json`. ```bash hf auth login python -m spacy package ./en_ner_fashion ./output --build wheel cd ./output/en_ner_fashion-0.0.0/dist python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl ``` In just a minute, you can get your packaged model on the Hub, try it out directly in the browser, and share it with the rest of the community. All the required metadata will be uploaded for you and you even get a cool model card. The command will output two things: * Where to find your repo on the Hub! For example, https://huggingface.co/spacy/en_core_web_sm * And how to install the pipeline directly from the Hub! ### From a Python script You can use the `push` function from Python. It returns a dictionary containing the `"url"` and `"whl_url"` of the published model and the wheel file, which you can later install with `pip install`. ```py from spacy_huggingface_hub import push result = push("./en_ner_fashion-0.0.0-py3-none-any.whl") print(result["url"]) ``` ## Additional resources * spacy-huggingface-hub [library](https://github.com/explosion/spacy-huggingface-hub).
* Launch [blog post](https://huggingface.co/blog/spacy) * spaCy v3.1 [Announcement](https://explosion.ai/blog/spacy-v3-1#huggingface-hub) * spaCy [documentation](https://spacy.io/universe/project/spacy-huggingface-hub/) ### Webhook guide: build a Discussion bot based on BLOOM https://huggingface.co/docs/hub/webhooks-guide-discussion-bot.md # Webhook guide: build a Discussion bot based on BLOOM Here's a short guide on how to use Hugging Face Webhooks to build a bot that replies to Discussion comments on the Hub with a response generated by BLOOM, a multilingual language model, using Inference Providers. ## Create your Webhook in your user profile First, let's create a Webhook from your [settings](https://huggingface.co/settings/webhooks). - Input a few target repositories that your Webhook will listen to. - You can put a dummy Webhook URL for now, but defining your webhook will let you look at the events that will be sent to it (and you can replay them, which will be useful for debugging). - Input a secret to make your Webhook more secure. - Subscribe to Community (PR & discussions) events, as we are building a Discussion bot. Your Webhook will look like this: ![webhook-creation](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/webhook-creation.png) ## Create a new `Bot` user profile In this guide, we create a separate user account to host a Space and to post comments: ![discussion-bot-profile](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/discussion-bot-profile.png) > [!TIP] > When creating a bot that will interact with other users on the Hub, we ask that you clearly label the account as a "Bot" (see profile screenshot). ## Create a Space that will react to your Webhook The third step is actually to listen to the Webhook events. An easy way is to use a Space for this.
We use the user account we created, but you could do it from your main user account if you wanted to. The Space's code is [here](https://huggingface.co/spaces/discussion-bot/webhook/tree/main). We used NodeJS and Typescript to implement it, but any language or framework would work equally well. Read more about Docker Spaces [here](https://huggingface.co/docs/hub/spaces-sdks-docker). **The main `server.ts` file is [here](https://huggingface.co/spaces/discussion-bot/webhook/blob/main/server.ts)** Let's walk through what happens in this file: ```ts app.post("/", async (req, res) => { if (req.header("X-Webhook-Secret") !== process.env.WEBHOOK_SECRET) { console.error("incorrect secret"); return res.status(400).json({ error: "incorrect secret" }); } ... ``` Here, we listen to POST requests made to `/`, and then we check that the `X-Webhook-Secret` header is equal to the secret we had previously defined (you need to also set the `WEBHOOK_SECRET` secret in your Space's settings to be able to verify it). ```ts const event = req.body.event; if ( event.action === "create" && event.scope === "discussion.comment" && req.body.comment.content.includes(BOT_USERNAME) ) { ... ``` The event's payload is encoded as JSON. Here, we specify that we will run our Webhook only when: - the event concerns a discussion comment - the event is a creation, i.e. a new comment has been posted - the comment's content contains `@discussion-bot`, i.e. our bot was just mentioned in a comment. 
In that case, we will continue to the next step: ```ts const INFERENCE_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"; const PROMPT = `Pretend that you are a bot that replies to discussions about machine learning, and reply to the following comment:\n`; const response = await fetch(INFERENCE_URL, { method: "POST", body: JSON.stringify({ inputs: PROMPT + req.body.comment.content }), }); if (response.ok) { const output = await response.json(); const continuationText = output[0].generated_text.replace( PROMPT + req.body.comment.content, "" ); ... ``` This is the coolest part: we call Inference Providers for the BLOOM model, prompting it with `PROMPT`, and we get the continuation text, i.e., the part generated by the model. Finally, we will post it as a reply in the same discussion thread: ```ts const commentUrl = req.body.discussion.url.api + "/comment"; const commentApiResponse = await fetch(commentUrl, { method: "POST", headers: { Authorization: `Bearer ${process.env.HF_TOKEN}`, "Content-Type": "application/json", }, body: JSON.stringify({ comment: continuationText }), }); const apiOutput = await commentApiResponse.json(); ``` ## Configure your Webhook to send events to your Space Last but not least, you'll need to configure your Webhook to send POST requests to your Space. Let's first grab our Space's "direct URL" from the contextual menu. Click on "Embed this Space" and copy the "Direct URL". 
![embed this Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/embed-space.png) ![direct URL](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/direct-url.png) Update your webhook to send requests to that URL: ![webhook settings](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/webhook-creation.png) ## Result ![discussion-result](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/discussion-result.png) ### Academia Hub https://huggingface.co/docs/hub/academia-hub.md # Academia Hub > [!TIP] > Ask your university's IT or Procurement Team to get in touch from a university-affiliated email address to initiate the subscription process. ## Accelerate your university's AI research, publication pipeline, and collaboration at scale The Hugging Face Hub is where leading researchers and developers across academia and industry collaborate on AI throughout the whole research lifecycle. Academia Hub brings that proven ecosystem to your university, giving your researchers everything they need to work securely, reproducibly, and at scale: compute, storage, collaboration, and governance, all managed through your institution. With Academia Hub, you get **university-level seat management and accounts for researchers, professors, and/or students.** ## Why Academia Hub: Built for the complete research lifecycle Academia Hub scales with your research from early prototypes to large-scale published models while ensuring security, reproducibility, and seamless collaboration across your entire institution. 1. **Store & version** datasets, models, and results with 1TB private storage per seat. 2. **Collaborate & review** with co-authors and lab members in shared workspaces. 3. 
**Prototype** ideas using interactive Spaces and hosted notebooks. 4. **Train & scale** with managed GPUs and tracked runs. 5. **Publish & share** using model cards, DOIs, and dataset releases. 6. **Preserve** your work for reproducibility and future research. All backed by enterprise-grade infrastructure, institutional governance, and storage that grows with your needs. ## Key features of Academia Hub ***For researchers and students*** - **Storage**: 1 TB private storage per seat (e.g., 400 seats = 400 TB) powered by Xet, purpose-built for versioning large AI models and datasets; expanded public storage; Dataset Viewer for private datasets. - **Hosting & demos**: Spaces Hosting for scalable AI demos and applications powered by ZeroGPU (5x priority quota); Dev Mode with SSH/VS Code access for development. - **Compute**: Priority GPU access (H100/H200) for training; managed runs with experiment tracking. - **Collaboration**: Team workspaces with version control, peer review, and shared governance. - **Publishing**: Share research artifacts like models and datasets with the global AI community through citable releases with model cards, dataset cards, and DOIs. ***For administrators*** - **Pricing**: $10/seat/month (volume-based pricing); $2/seat/month compute credits included (top-ups available). - **Admin & security**: University-level seat management for researchers, professors, and students; centralized administration; SSO with university domain. ***For your research community*** - **Community & resources**: Connect with peers; curated models/datasets/projects for academia. ## How to get started Researchers and students: Contact us to express interest in Academia Hub and help us connect with your university's IT or Procurement Team. IT or Procurement staff: Get in touch directly to set up your institution's Academia Hub subscription or find out more about how your institution can benefit from Academia Hub.
### GGUF usage with llama.cpp https://huggingface.co/docs/hub/gguf-llamacpp.md # GGUF usage with llama.cpp > [!TIP] > You can now deploy any llama.cpp compatible GGUF on Hugging Face Endpoints, read more about it [here](https://huggingface.co/docs/inference-endpoints/en/others/llamacpp_container) Llama.cpp allows you to download and run inference on a GGUF simply by providing the Hugging Face repo path and the file name. llama.cpp downloads the model checkpoint and automatically caches it. The location of the cache is defined by the `LLAMA_CACHE` environment variable; read more about it [here](https://github.com/ggerganov/llama.cpp/pull/7826). You can install llama.cpp through brew (works on Mac and Linux), or you can build it from source. There are also pre-built binaries and Docker images that you can [check in the official documentation](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage). ### Option 1: Install with brew / winget ```bash brew install llama.cpp ``` or, on Windows, via winget ```bash winget install llama.cpp ``` ### Option 2: build from source Step 1: Clone llama.cpp from GitHub. ``` git clone https://github.com/ggerganov/llama.cpp ``` Step 2: Move into the llama.cpp folder and build it. You can also add hardware-specific flags (e.g. `-DGGML_CUDA=ON` for Nvidia GPUs). ``` cd llama.cpp cmake -B build # optionally, add -DGGML_CUDA=ON to activate CUDA cmake --build build --config Release ``` Note: for other hardware support (e.g. AMD ROCm, Intel SYCL), please refer to [llama.cpp's build guide](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) Once installed, you can use the `llama-cli` or `llama-server` as follows: ```bash llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 ``` Note: You can explicitly add `-no-cnv` to run the CLI in raw completion mode (non-chat mode).
Additionally, you can invoke an OpenAI-compatible chat completions endpoint directly using the llama.cpp server: ```bash llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 ``` After running the server, you can use the endpoint as shown below: ```bash curl http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer no-key" \ -d '{ "messages": [ { "role": "system", "content": "You are an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests." }, { "role": "user", "content": "Write a limerick about Python exceptions" } ] }' ``` Pass any valid Hugging Face Hub repo name to `-hf` - off you go! 🦙 ### Evaluation Results https://huggingface.co/docs/hub/eval-results.md # Evaluation Results > [!WARNING] > This is a work in progress feature. The Hub provides a decentralized system for tracking model evaluation results. Benchmark datasets host leaderboards, and model repos store evaluation scores that automatically appear on both the model page and the benchmark's leaderboard. ## Benchmark Datasets Dataset repos can be defined as **Benchmarks** (e.g., [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro), [HLE](https://huggingface.co/datasets/cais/hle), [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa)). These display a "Benchmark" tag, automatically aggregate evaluation results from model repos across the Hub, and display a leaderboard of top models. ![Benchmark Dataset](https://huggingface.co/huggingface/documentation-images/resolve/main/evaluation-results/benchmark-preview.png) ## Model Evaluation Results Evaluation scores are stored in model repos as YAML files in the `.eval_results/` folder.
These results: - Appear on the model page with links to the benchmark leaderboard - Are aggregated into the benchmark dataset's leaderboards - Can be submitted via PRs and marked as "community-provided" ![Model Evaluation Results](https://huggingface.co/huggingface/documentation-images/resolve/main/evaluation-results/eval-results-previw.png) ### Adding Evaluation Results To add evaluation results to a model, you can submit a PR to the model repo with a YAML file in the `.eval_results/` folder. Create a YAML file in `.eval_results/*.yaml` in your model repo: ```yaml - dataset: id: cais/hle # Required. Hub dataset ID (must be a Benchmark) task_id: default # Required. ID of the Task, as defined in the dataset's eval.yaml revision: # Optional. Dataset revision hash value: 20.90 # Required. Metric value verifyToken: # Optional. Cryptographic proof of auditable evaluation date: "2025-01-15" # Optional. ISO-8601 date or datetime of when the eval was run (defaults to git commit time) source: # Optional. Attribution for this result, for instance a repo containing output traces or a Paper url: https://huggingface.co/spaces/SaylorTwift/smollm3-mmlu-pro # Required if source provided name: Eval traces # Optional. Display name user: SaylorTwift # Optional. HF username/org notes: "no-tools" # Optional. Details about the evaluation setup (e.g., "tools", "no-tools", etc.) 
``` Or, with only the required attributes: ```yaml - dataset: id: Idavidrein/gpqa task_id: gpqa_diamond value: 0.412 ``` Results display badges based on their metadata in the YAML file: | Badge | Condition | |-------|-----------| | verified | A `verifyToken` is valid (evaluation ran in HF Jobs with inspect-ai) | | community | Result submitted via open PR (not merged to main) | | leaderboard | Links to the benchmark dataset | | source | Links to evaluation logs or external source | For more details on how to format this data, check out the [Eval Results](https://github.com/huggingface/hub-docs/blob/main/eval_results.yaml) specifications. ### Community Contributions Anyone can submit evaluation results to any model via Pull Request: 1. Go to the model page, click on the "Community" tab, and open a Pull Request. 2. Add a `.eval_results/*.yaml` file with your results. 3. The PR will show as "community-provided" on the model page while open. For help evaluating a model, see the [Evaluating models with Inspect](https://huggingface.co/docs/inference-providers/guides/evaluation-inspect-ai) guide. > [!TIP] > Community scores are visible while the PR is open. If a score is disputed, the model author can close the PR to remove it. The goal is to surface existing evaluation data transparently while building toward a fully reproducible standard via verified scores. ## Registering a Benchmark To register your dataset as a benchmark: 1. Create a dataset repo containing your evaluation data 2. Add an `eval.yaml` file to the repo root with your benchmark configuration, conforming to the specification defined below. 3. The file is validated at push time 4. (**Beta**) Get in touch so we can add it to the allow-list.
Examples can be found in these benchmarks: [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa/blob/main/eval.yaml), [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro/blob/main/eval.yaml), [HLE](https://huggingface.co/datasets/cais/hle/blob/main/eval.yaml), [GSM8K](https://huggingface.co/datasets/openai/gsm8k/blob/main/eval.yaml). ## Eval.yaml specification The `eval.yaml` should contain the following fields: - `name`: Human-readable display name for the benchmark (e.g. `"Humanity's Last Exam"`). - `description`: Short description of what the benchmark measures. - `evaluation_framework`: Canonical evaluation framework identifier for this benchmark. This is an enumeration maintained by the Hugging Face team. Add your own to the list [here](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/eval.ts). Exactly one framework is supported per benchmark. - `tasks[]`: List of tasks (sub-leaderboards) defined by this benchmark (see below). Required fields in each `tasks[]` item: - `id`: Unique identifier for the task (e.g. `"gpqa_diamond"`). A single benchmark can define several tasks, each producing its own leaderboard. Feel free to choose a leaderboard identifier for each task. Optional fields in each `tasks[]` item: - `config`: Configuration of the Hugging Face dataset to evaluate (e.g. `"default"`). Defaults to the dataset's default config. - `split`: Split of the Hugging Face dataset to evaluate (e.g. `"test"`). Defaults to `"test"`. When setting `evaluation_framework: inspect-ai`, you must also set the following fields: - `field_spec`: Specification of the input and output fields. Consists of `input`, `target`, `choices` and optional `input_image` subfields. See the [docs](https://inspect.aisi.org.uk/tasks.html#hugging-face) for more details. - `solvers`: Solvers used to go from input to output using the AI model.
This can range from a simple system prompt to self-critique loops. See the [docs](https://inspect.aisi.org.uk/solvers.html) for more details. - `scorers`: Scorers used. Scorers determine whether solvers were successful in finding the right output for the target defined in the dataset, and in what measure. See the [docs](https://inspect.aisi.org.uk/scorers.html) for more details. Minimal example (required fields only): ```yaml name: MathArena AIME 2026 description: The American Invitational Mathematics Exam (AIME). evaluation_framework: math-arena tasks: - id: MathArena/aime_2026 ``` Extended example: ```yaml name: MathArena AIME 2026 description: The American Invitational Mathematics Exam (AIME). evaluation_framework: "math-arena" tasks: - id: MathArena/aime_2026 config: default split: test ``` Extended example (`"inspect-ai"`-specific): ```yaml name: Humanity's Last Exam description: > Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading.
evaluation_framework: "inspect-ai" tasks: - id: hle config: default split: test field_spec: input: question input_image: image target: answer solvers: - name: system_message args: template: | Your response should be in the following format: Explanation: {your explanation for your answer choice} Answer: {your chosen answer} Confidence: {your confidence score between 0% and 100% for your answer} - name: generate scorers: - name: model_graded_fact args: model: openai/o3-mini ``` ### Network Security https://huggingface.co/docs/hub/enterprise-network-security.md # Network Security > [!WARNING] > This feature is part of the Enterprise Plus plan. ## Define your organization IP Ranges You can list the IP addresses of your organization's outbound traffic to apply for higher rate limits and/or to enforce authenticated access to Hugging Face from your corporate network. The outbound IP address ranges are defined in CIDR format. For example, `52.219.168.0/24` or `2600:1f69:7400::/40`. You can set multiple ranges, one per line. Once organization admins populate the "Organization IP Ranges" in the Network Security settings, a manual verification, carried out jointly by Hugging Face Solution Engineers and the organization's admins, is required for the "Require login for users in your IP ranges" setting to become available. After the "Organization IP Ranges" have been manually verified, and the organization admins have enabled both "Restrict organization access to your IP ranges only" and "Require login for users in your IP ranges", the following flow applies: - When a user arrives on the platform, their IP address is checked. - If the IP falls within the organization's defined ranges, the user must authenticate (via the organization's SSO if enabled). - Once authenticated, the Content Access Policy determines which resources the user can access.
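The IP check in the flow above is plain CIDR matching. As a rough illustration (not Hugging Face's actual implementation), Python's standard `ipaddress` module can express the same logic, using the example ranges from this section:

```python
import ipaddress

# Ranges as they would be entered in the Network Security settings, one per line
ORG_RANGES = [
    ipaddress.ip_network("52.219.168.0/24"),
    ipaddress.ip_network("2600:1f69:7400::/40"),
]

def in_org_ranges(ip: str) -> bool:
    """Return True if a client IP falls inside any configured CIDR range."""
    addr = ipaddress.ip_address(ip)
    # Membership tests across IPv4/IPv6 versions simply return False,
    # so IPv4 and IPv6 ranges can live in the same list.
    return any(addr in net for net in ORG_RANGES)

print(in_org_ranges("52.219.168.42"))  # True: inside the IPv4 range
print(in_org_ranges("8.8.8.8"))        # False: outside all ranges
```

Note that `/24` and `/40` are prefix lengths, so the first range covers `52.219.168.0`–`52.219.168.255` and the second covers every address starting with the first 40 bits of `2600:1f69:7400::`.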
## Higher Hub Rate Limits Most of the actions on the Hub have limits; for example, users are limited to creating a certain number of repositories per day. Enterprise Plus automatically gives your users the highest rate limits possible for every action. Additionally, once your IP ranges are set, enabling the "Higher Hub Rate Limits" option allows your organization to benefit from the highest HTTP rate limits on the Hub API, unlocking large volumes of model or dataset downloads. For more information about rate limits, see the [Hub Rate limits](./rate-limits) documentation. ## Restrict organization access to your IP ranges only This option restricts access to your organization's resources to only those coming from your defined IP ranges. No one can access your organization resources outside your IP ranges. The rules also apply to access tokens. When enabled, this option unlocks additional nested security settings below. ### Require login for users in your IP ranges When this option is enabled, anyone visiting Hugging Face from your corporate network must be logged in and belong to your organization (requires a manual verification when IP ranges have changed). If enabled, you can optionally define a content access policy. All public pages will show the following message if access is unauthenticated: ### Content Access Policy Define a fine-grained Content Access Policy by blocking specific content of the Hugging Face Hub. For example, you can block your organization's members from accessing Spaces. When users of your organization navigate to blocked content, they'll be presented the following page: To define Blocked content, add rules that target a repository type, an organization, a specific repository, or a combination such as a repository type within a given organization (e.g. all Spaces from a specific organization). The Always allowed field lets you define exceptions to the blocking rules. 
You can target content that should remain accessible even when a block rule would otherwise apply. ## Manage Network Security via API You can read and update your organization's network security settings programmatically via the Hub API. **OpenAPI reference:** - GET /api/organizations//settings/network-security - PATCH /api/organizations//settings/network-security ### Advanced Topics https://huggingface.co/docs/hub/spaces-advanced.md # Advanced Topics ## Contents - [Using OpenCV in Spaces](./spaces-using-opencv) - [More ways to create Spaces](./spaces-more-ways-to-create) - [Managing Spaces with Github Actions](./spaces-github-actions) - [Managing Spaces with CircleCI Workflows](./spaces-circleci) - [Custom Python Spaces](./spaces-sdks-python) - [How to Add a Space to ArXiv](./spaces-add-to-arxiv) - [Cookie limitations in Spaces](./spaces-cookie-limitations) - [How to handle URL parameters in Spaces](./spaces-handle-url-parameters) - [How to get user status and plan in Spaces](./spaces-get-user-plan) ### GitHub Actions https://huggingface.co/docs/hub/repositories-github-actions.md # GitHub Actions You can use [GitHub Actions](https://docs.github.com/en/actions) to automatically sync your GitHub repository to the Hugging Face Hub. The official [`huggingface/hub-sync`](https://github.com/marketplace/actions/sync-github-to-hugging-face-hub) action supports syncing **Models**, **Datasets**, and **Spaces**. ## Setup 1. Create a Hugging Face [access token](https://huggingface.co/settings/tokens) with **write** permission to the target repo. For better security, use a [fine-grained token](https://huggingface.co/settings/tokens) scoped to only the repository you're syncing to. 2. Add the token as a [GitHub secret](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-an-environment) called `HF_TOKEN` in your repository settings. 3. Add a workflow file (e.g. `.github/workflows/sync-to-hub.yml`) to your repository. 
## Basic usage ```yaml name: Sync to Hugging Face Hub on: push: branches: [main] jobs: sync: runs-on: ubuntu-latest steps: - uses: actions/checkout@v6 - uses: huggingface/hub-sync@v0.1.0 with: github_repo_id: ${{ github.repository }} huggingface_repo_id: username/repo-name hf_token: ${{ secrets.HF_TOKEN }} ``` By default, this syncs to a **Space**. To sync a model or dataset, set the `repo_type` parameter: ```yaml - uses: huggingface/hub-sync@v0.1.0 with: github_repo_id: ${{ github.repository }} huggingface_repo_id: username/my-dataset hf_token: ${{ secrets.HF_TOKEN }} repo_type: dataset ``` ## Parameters | Parameter | Required | Default | Description | |---|---|---|---| | `github_repo_id` | Yes | - | GitHub repository (use `${{ github.repository }}`) | | `huggingface_repo_id` | Yes | - | Target repo on the Hub (`username/repo-name`) | | `hf_token` | Yes | - | Hugging Face access token | | `repo_type` | No | `space` | `space`, `model`, or `dataset` | | `space_sdk` | No | `gradio` | `gradio`, `streamlit`, `docker`, or `static` | | `private` | No | `false` | Whether to create the repo as private | | `subdirectory` | No | `.` | Sync a specific subdirectory (useful for monorepos) | The action mirrors your files to the Hub using the `hf` CLI; it is not a git-to-git sync. It automatically excludes `.github/` and `.git/` directories and mirrors deletions (files removed from GitHub will be removed from the Hub). For more complex workflows (e.g. build steps, custom upload logic), you can install and use the [`hf` CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) directly in your workflow instead. For Spaces-specific guidance (file size limits, LFS handling), see [Managing Spaces with GitHub Actions](./spaces-github-actions).
### How to configure OIDC SSO with Okta https://huggingface.co/docs/hub/security-sso-okta-oidc.md # How to configure OIDC SSO with Okta In this guide, we will use Okta as the SSO provider, with OpenID Connect (OIDC) as our preferred identity protocol. > [!WARNING] > This feature is part of the Team & Enterprise plans. ## Step 1: Create a new application in your Identity Provider Open a new tab/window in your browser and sign in to your Okta account. Navigate to "Admin/Applications" and click the "Create App Integration" button. Then choose an "OIDC - OpenID Connect" application, select the application type "Web Application" and click "Create". ## Step 2: Configure your application in Okta Open a new tab/window in your browser and navigate to the SSO section of your organization's settings. Select the OIDC protocol. Copy the "Redirection URI" from the organization's settings on Hugging Face, and paste it in the "Sign-in redirect URI" field on Okta. The URL looks like this: `https://huggingface.co/organizations/[organizationIdentifier]/oidc/consume`. You can leave the optional Sign-out redirect URIs blank. Save your new application. ## Step 3: Finalize configuration on Hugging Face In your Okta application, under "General", find the following fields: - Client ID - Client secret - Issuer URL You will need these to finalize the SSO setup on Hugging Face. The Okta Issuer URL is generally a URL like `https://tenantId.okta.com`; you can refer to their [guide](https://support.okta.com/help/s/article/What-is-theIssuerlocated-under-the-OpenID-Connect-ID-Token-app-settings-used-for?language=en_US) for more details. In the SSO section of your organization's settings on Hugging Face, copy-paste these values from Okta: - Client ID - Client Secret You can now click on "Update and Test OIDC configuration" to save the settings. You should be redirected to your SSO provider (IdP) login prompt.
Once logged in, you'll be redirected to your organization's settings page. A green check mark near the OIDC selector will attest that the test was successful. ## Step 4: Enable SSO in your organization Now that Single Sign-On is configured and tested, you can enable it for members of your organization by clicking on the "Enable" button. Once enabled, members of your organization must complete the SSO authentication flow described in the [How it works](./security-sso-basic#how-it-works) section. ### Digital Object Identifier (DOI) https://huggingface.co/docs/hub/doi.md # Digital Object Identifier (DOI) The Hugging Face Hub lets you generate DOIs for your models or datasets. DOIs (Digital Object Identifiers) are strings uniquely identifying a digital object, anything from articles to figures, including datasets and models. DOIs are tied to object metadata, including the object's URL, version, creation date, description, etc. They are a commonly accepted reference to digital resources across research and academic communities; they are analogous to a book's ISBN. ## How to generate a DOI? To do this, you must go to the settings of your model or dataset. In the DOI section, a button called "Generate DOI" should appear: To generate the DOI for this model or dataset, you need to click on this button and acknowledge that some features on the Hub will be restricted and some of your information (your full name) will be transferred to our partner DataCite. When generating a DOI, you can optionally personalize the author name list, allowing you to credit all contributors to your model or dataset. After you agree to those terms, your model or dataset will get a DOI assigned, and a new tag should appear in your model or dataset header allowing you to cite it. ## Can I regenerate a new DOI if my model or dataset changes? If there's ever a new version of a model or dataset, a new DOI can easily be assigned, and the previous version of the DOI gets outdated.
This makes it easy to refer to a specific version of an object, even if it has changed. You just need to click on "Generate new DOI" and tadaam! 🎉 a new DOI is assigned for the current revision of your model or dataset.

## Why is there a 'locked by DOI' message on the delete, rename, and change visibility actions for my model or dataset?

DOIs make it easier to find information about a model or dataset and to share it with the world via a permanent link that will never expire or change. As such, datasets/models with DOIs are intended to persist perpetually and may only be deleted, renamed, or have their visibility changed by filing a request with our support (website at huggingface.co).

## Further Reading

- [Introducing DOI: the Digital Object Identifier to Datasets and Models](https://huggingface.co/blog/introducing-doi)

### How to get a user's plan and status in Spaces

https://huggingface.co/docs/hub/spaces-get-user-plan.md

# How to get a user's plan and status in Spaces

From inside a Space's iframe, you can check if a user is logged in or not on the main site, and if they have a PRO subscription or if one of their orgs has a paid subscription.

```js
window.addEventListener("message", (event) => {
  if (event.data.type === "USER_PLAN") {
    console.log("plan", event.data.plan);
  }
});

window.parent.postMessage({ type: "USER_PLAN_REQUEST" }, "https://huggingface.co");
```

`event.data.plan` will be of type:

```ts
{ user: "anonymous", org: undefined } | { user: "pro" | "free", org: undefined | "team" | "enterprise" | "plus" | "academia" }
```

You will get both the user's status (logged out = `"anonymous"`) and their plan.
## Examples

- https://huggingface.co/spaces/huggingfacejs/plan

### Programmatic User Access Control Management

https://huggingface.co/docs/hub/programmatic-user-access-control.md

# Programmatic User Access Control Management

This guide describes how to manage organization member roles and resource group membership via the Hub API: changing a member's organization role and resource group assignments, listing resource groups, adding users to groups, and batch workflows.

**Table of contents:**

- [Change member role via API](#change-member-role-via-api): Set a member's org role and resource group assignments (one member per request).
- [Resource Groups API](#resource-groups-api): List resource groups and add users to them.
- [Configure auto-join via API](#configure-auto-join-via-api): Enable or disable auto-join on a Resource Group.

---

## Change member role via API

You can change a member's **organization role** (No Access / Read / Contributor / Write / Admin) and, optionally, their roles in **resource groups** using the Hub API. The API updates **one member per request**. To change roles for multiple members, call the API in a loop (examples below).

**OpenAPI reference:** PUT /api/organizations/{name}/members/{username}/role

### Prerequisites

- Your organization must have a **subscription plan** (e.g. Team or Enterprise). The endpoint returns 402 otherwise.
- You must be authenticated as an organization member with **Write** (or Admin) permission on the organization.
- The target user must already be a **member** of the organization.

### Base URL and authentication

- **Base URL:** `https://huggingface.co`
- **Authentication:** Send your token in the request header:

```http
Authorization: Bearer <token>
```

Create a fine-grained token with the "Write access to organizations settings / member management" permission scoped to your org at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
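The role-change endpoint referenced above (`PUT /api/organizations/{name}/members/{username}/role`) is easy to get subtly wrong in batch scripts. As a minimal sketch, you can validate the role and assemble the URL and payload before sending anything; the helper name `build_role_change_request` is illustrative, not part of any official client:

```python
# Illustrative helper for the role-change endpoint described in this guide.
# It only builds and validates the request; sending it is up to the caller.

BASE_URL = "https://huggingface.co"
ORG_ROLES = {"no_access", "read", "contributor", "write", "admin"}


def build_role_change_request(org: str, username: str, role: str, resource_groups=None):
    """Return (url, payload) for the role-change endpoint, validating the role first."""
    if role not in ORG_ROLES:
        raise ValueError(f"invalid org role: {role!r}")
    payload = {"role": role, "resourceGroups": resource_groups or []}
    url = f"{BASE_URL}/api/organizations/{org}/members/{username}/role"
    return url, payload
```

You would then send the result with `requests.put(url, json=payload, headers={"Authorization": f"Bearer {token}"})`, as in the curl and Python examples that follow.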
### Change member role endpoint

**Request**

```http
PUT /api/organizations/{org_name}/members/{username}/role
Authorization: Bearer <token>
Content-Type: application/json

{ "role": "read", "resourceGroups": [] }
```

- **Path parameters**
  - `org_name`: Organization slug (e.g. `my-org`).
  - `username`: Hugging Face **username** of the member whose role you are changing.
- **Body**
  - `role` (required): The member's **organization-level** role. One of: `"no_access"`, `"read"`, `"contributor"`, `"write"`, or `"admin"`.
  - `resourceGroups` (optional): Array of resource group assignments for this user. Each item:
    - `id`: Resource group ID (24-character hex string; get IDs from the [resource groups list API](#list-resource-groups)).
    - `role`: Role in that resource group: `"read"`, `"contributor"`, `"write"`, or `"admin"`.
  - If you omit `resourceGroups` or pass `[]`, the user is removed from all resource groups. To only change the org role and leave resource groups unchanged, pass their current resource group memberships (the body always sets both the org role and the resource group list).

**Example (curl) – set org role to "read", no resource groups (removes any the user was previously in)**

```bash
curl -s -X PUT \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"role":"read","resourceGroups":[]}' \
  "https://huggingface.co/api/organizations/my-org/members/member1/role"
```

**Example (curl) – set org role and resource group roles (overrides any current groups)**

```bash
curl -s -X PUT \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"role":"write","resourceGroups":[{"id":"507f1f77bcf86cd799439011","role":"read"}]}' \
  "https://huggingface.co/api/organizations/my-org/members/member2/role"
```

**Success response:** Status `200 OK`; body: `{ "success": true }`.

**Typical errors**

- `400`: Invalid body (e.g. invalid role or resource group `id`).
- `402`: Organization does not have a subscription plan.
- `403`: Not allowed (e.g. you lack Write on the org, or a resource group is not in the org).
- `404`: Organization or user not found.

### Updating multiple members

The API changes **one member per request**. There is no bulk endpoint. To update many members, call the endpoint once per username (e.g. from a list or CSV).

**Example: Bash – loop over usernames, same role for all**

```bash
ORG_NAME="my-org"
ROLE="read"

for username in member1 member2 member3 member4; do
  echo "Setting $username to $ROLE ..."
  curl -s -w "\n%{http_code}" -X PUT \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"role\":\"$ROLE\",\"resourceGroups\":[]}" \
    "https://huggingface.co/api/organizations/$ORG_NAME/members/$username/role"
  echo ""
done
```

**Example: Python – loop over usernames**

```python
import os
import requests

BASE_URL = "https://huggingface.co"
HF_TOKEN = os.environ.get("HF_TOKEN", "")

def change_member_role(org_name: str, username: str, role: str, resource_groups: list | None = None):
    payload = {"role": role, "resourceGroups": resource_groups or []}
    r = requests.put(
        f"{BASE_URL}/api/organizations/{org_name}/members/{username}/role",
        headers={"Authorization": f"Bearer {HF_TOKEN}", "Content-Type": "application/json"},
        json=payload,
    )
    if r.status_code != 200:
        raise RuntimeError(f"{r.status_code}: {r.text}")
    return r.json()

org_name = "my-org"
role = "read"
for username in ["member1", "member2", "member3", "member4"]:
    print(f"Setting {username} to {role} ... ", end="")
    try:
        change_member_role(org_name, username, role)
        print("OK")
    except Exception as e:
        print(f"Failed: {e}")
```

For different roles per user, loop over `(username, role)` pairs (e.g. from a CSV) and call `change_member_role` for each.

---

## Resource Groups API

The following endpoints let you **list** resource groups and **add** users to them.
To **change** an existing member's organization-level role or their resource group assignments, see [Change member role via API](#change-member-role-via-api) above.

**OpenAPI reference:** [Resource groups](https://huggingface.co/spaces/huggingface/openapi#tag/resource-groups)

**Table of contents – API approaches:**

| Goal | Section |
| --- | --- |
| Add many users to **one** resource group | [Add users to a resource group](#add-users-to-a-resource-group) |
| Add the **same** users to **many** resource groups | [Batch-add by looping over the API](#batch-add-by-looping-over-the-api) |
| Add **different** users per group | [Batch-add by looping over the API](#batch-add-by-looping-over-the-api) |

### Base URL and authentication

- **Base URL:** `https://huggingface.co`
- **Authentication:** Use one of:
  - **Access token (recommended for scripts):** Create a fine-grained token with the "Write access to organizations settings / member management" permission scoped to your org at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). Send it in the request header:

    ```http
    Authorization: Bearer <token>
    ```

  - **Session cookie:** If calling from a browser or tool that shares the same session as the Hub UI, the cookie is sent automatically.

### List resource groups

Get all resource groups you can manage for the organization. Use this to obtain each group's `id` for the add-users calls.

**Request**

```http
GET /api/organizations/{org_name}/resource-groups
Authorization: Bearer <token>
```

**Example (curl)**

```bash
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  "https://huggingface.co/api/organizations/my-org/resource-groups"
```

**Example response (trimmed)**

```json
[
  {
    "id": "507f1f77bcf86cd799439011",
    "name": "Cohort 2024",
    "description": "Members in this group",
    "users": [...],
    "repos": [...]
  }
]
```

Use the `id` of each resource group when adding users.

### Add users to a resource group

Add one or more users to a single resource group in one request. You can send multiple users in the same request.

**Request**

```http
POST /api/organizations/{org_name}/resource-groups/{resource_group_id}/users
Authorization: Bearer <token>
Content-Type: application/json

{
  "users": [
    { "user": "member1", "role": "read" },
    { "user": "member2", "role": "read" },
    { "user": "member3", "role": "write" }
  ]
}
```

- **Path parameters**
  - `org_name`: Organization slug (e.g. `my-org`).
  - `resource_group_id`: The resource group's `id` (24-character hex string from the list endpoint).
- **Body**
  - `users`: Array of objects. Each object must have:
    - `user`: Hugging Face **username** (required).
    - `role`: One of `"read"`, `"contributor"`, `"write"`, `"admin"`.

**Example (curl)**

```bash
curl -s -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"users":[{"user":"member1","role":"read"},{"user":"member2","role":"read"}]}' \
  "https://huggingface.co/api/organizations/my-org/resource-groups/507f1f77bcf86cd799439011/users"
```

**Success:** Status `200 OK`; body is the updated resource group object (includes the new users in `users`).

**Typical errors:**

- `400`: e.g. user not found, duplicate usernames, or invalid body.
- `403`: Not allowed (e.g. not in org, or already in the resource group). The message will indicate whether users are not in the organization or already in the group.

### Adding members via email (workaround)

The add-users endpoint only accepts **Hugging Face usernames**, not emails. If you have a list of **emails** (e.g. member emails), you can resolve email → username first, then call the add-users API.
Note that email filtering **only** works when the email's domain matches one of the organization's allowed domains: the **Organization email domain** (Settings → Account → Organization email domain) and/or the org's **SSO allowed domains** (if SSO is configured).

**Step 1 – Resolve email to username**

```http
GET /api/organizations/{org_name}/members?email={email}&limit=1
Authorization: Bearer <token>
```

Response is an array of members; each member has `user` (username). Use `user` for the add-users call.

**Step 2 – Add to resource group**

Use the username from step 1 in a normal add-users request:

```http
POST /api/organizations/{org_name}/resource-groups/{resource_group_id}/users
Content-Type: application/json

{ "users": [{ "user": "<username-from-step-1>", "role": "read" }] }
```

**Example: one email (bash)**

```bash
ORG_NAME="my-org"
RG_ID="507f1f77bcf86cd799439011"
EMAIL="member@org.com"

# Step 1: look up member by email (domain must match org's Organization email domain or SSO allowed domains)
MEMBERS=$(curl -s -H "Authorization: Bearer $HF_TOKEN" \
  "https://huggingface.co/api/organizations/$ORG_NAME/members?email=$EMAIL&limit=1")
USERNAME=$(echo "$MEMBERS" | jq -r '(.[0] // {} | .user // "")')

if [ -z "$USERNAME" ]; then
  echo "No member found for $EMAIL"
  exit 1
fi

# Step 2: add to resource group
curl -s -X POST -H "Authorization: Bearer $HF_TOKEN" -H "Content-Type: application/json" \
  -d "{\"users\":[{\"user\":\"$USERNAME\",\"role\":\"read\"}]}" \
  "https://huggingface.co/api/organizations/$ORG_NAME/resource-groups/$RG_ID/users"
```

**Example: multiple emails in a loop (Python)**

```python
import os
import requests

BASE = "https://huggingface.co"
ORG = "my-org"
RG_ID = "507f1f77bcf86cd799439011"
ROLE = "read"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}", "Content-Type": "application/json"}

emails = ["member1@org.com", "member2@org.com"]
for email in emails:
    # Step 1: resolve email → username (email domain must match org's Organization email domain or SSO allowed domains)
    r = requests.get(f"{BASE}/api/organizations/{ORG}/members", params={"email": email, "limit": 1}, headers=headers)
    r.raise_for_status()
    members = r.json()
    if not members:
        print(f"No member found for {email}")
        continue
    username = members[0]["user"]

    # Step 2: add that user to the resource group
    add_r = requests.post(
        f"{BASE}/api/organizations/{ORG}/resource-groups/{RG_ID}/users",
        headers=headers,
        json={"users": [{"user": username, "role": ROLE}]},
    )
    if add_r.status_code == 200:
        print(f"Added {username} ({email})")
    else:
        print(f"Failed {email}: {add_r.status_code} {add_r.text}")
```

If a user is already in the resource group, the add call returns `403`; the script reports it as a failure, and you can skip or ignore that case if you prefer.

**Limitation:** The email filter only applies when the org has an **Organization email domain** and/or **SSO allowed domains** set, and the email's domain matches one of them. Otherwise you cannot look up by email via the members API; you'd need another source for the email → username mapping (e.g. your own directory).

### Batch-add by looping over the API

You can add many users to **one** resource group in one or a few requests (e.g. chunk your list of usernames), or add users to **several** resource groups by looping over groups and calling the add-users endpoint for each.

**Example: Bash – one group, multiple users in one request**

```bash
#!/bin/bash
# Add a list of users to a single resource group.
# Usage: ./add-users-to-rg.sh

ORG_NAME="${1:-my-org}"
RG_ID="${2:-507f1f77bcf86cd799439011}"
ROLE="${3:-read}"
USERS="member1 member2 member3 member4"

USERS_JSON=$(echo "$USERS" | tr ' ' '\n' | while read u; do
  [ -n "$u" ] && echo "{\"user\":\"$u\",\"role\":\"$ROLE\"}"
done | paste -sd ',' -)

curl -s -w "\nHTTP_STATUS:%{http_code}" -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"users\":[$USERS_JSON]}" \
  "https://huggingface.co/api/organizations/$ORG_NAME/resource-groups/$RG_ID/users"
```

**Example: Bash – loop over multiple groups**

```bash
# Get group IDs and add users to each (reuses USERS_JSON from the previous snippet)
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  "https://huggingface.co/api/organizations/my-org/resource-groups" \
  | jq -r '.[].id' \
  | while read -r RG_ID; do
      [ -z "$RG_ID" ] && continue
      echo "Adding users to resource group $RG_ID ..."
      curl -s -X POST -H "Authorization: Bearer $HF_TOKEN" -H "Content-Type: application/json" \
        -d "{\"users\":[$USERS_JSON]}" \
        "https://huggingface.co/api/organizations/my-org/resource-groups/$RG_ID/users"
    done
```

**Example: Python – batch-add to one or many groups**

```python
import os
import requests

BASE_URL = "https://huggingface.co"
HF_TOKEN = os.environ.get("HF_TOKEN", "")

def list_resource_groups(org_name: str):
    r = requests.get(
        f"{BASE_URL}/api/organizations/{org_name}/resource-groups",
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
    )
    r.raise_for_status()
    return r.json()

def add_users_to_resource_group(org_name: str, resource_group_id: str, users_with_roles: list):
    """users_with_roles: list of {"user": "username", "role": "read"|"write"|"contributor"|"admin"}"""
    r = requests.post(
        f"{BASE_URL}/api/organizations/{org_name}/resource-groups/{resource_group_id}/users",
        headers={"Authorization": f"Bearer {HF_TOKEN}", "Content-Type": "application/json"},
        json={"users": users_with_roles},
    )
    if r.status_code != 200:
        raise RuntimeError(f"Add users failed {r.status_code}: {r.text}")
    return r.json()

# Example: same users added to every resource group
org_name = "my-org"
role = "read"
usernames = ["member1", "member2", "member3"]
users_with_roles = [{"user": u, "role": role} for u in usernames]
for rg in list_resource_groups(org_name):
    add_users_to_resource_group(org_name, rg["id"], users_with_roles)
```

For a long list of usernames, chunk them (e.g. 50 per request) and call the API once per chunk to avoid large request bodies or timeouts.

### Important notes

1. **Usernames only**: The API accepts Hugging Face **usernames**, not emails. You need a mapping from email → username (e.g. from your directory or the org members list) before calling the API.
2. **Users must be in the organization**: Every user in the request must already be a member of the organization. Otherwise the request returns `403` with a message that some users are not in the org.
3. **Idempotency**: If a user is already in the resource group, the backend may return `403` for that request. Your script can catch errors and continue, or skip users already in the group if you first fetch the group's `users` list.
4. **Rate limits**: For large batches, consider adding a short delay between requests (e.g. 0.5–1 second) to avoid hitting rate limits.
5. **Token scope**: The access token must have sufficient permissions for the organization (typically at least "Write access to organizations settings / member management"). Create and store the token securely; do not commit it to version control.

---

## Configure auto-join via API

[Auto-join](./security-resource-groups#auto-join) automatically adds every org member to a Resource Group at a specified role. You can enable or disable it via the API.

**Enable auto-join**

```http
POST /api/organizations/{org_name}/resource-groups/{resource_group_id}/settings
Authorization: Bearer <token>
Content-Type: application/json

{ "autoJoin": { "enabled": true, "role": "read" } }
```

- **Path parameters**
  - `org_name`: Organization slug (e.g. `my-org`).
  - `resource_group_id`: The Resource Group's ID (24-character hex string; get IDs from the [list resource groups endpoint](#list-resource-groups)).
- **Body**
  - `role`: The role to assign to all org members. One of `"read"`, `"contributor"`, `"write"`, or `"admin"`.

Enabling auto-join on an existing Resource Group immediately adds all current org members (backfill).

**Disable auto-join**

Send the same request with `"enabled": false`. The `role` field is not required when disabling:

```http
POST /api/organizations/{org_name}/resource-groups/{resource_group_id}/settings
Authorization: Bearer <token>
Content-Type: application/json

{ "autoJoin": { "enabled": false } }
```

> [!NOTE]
> Disabling auto-join does **not** remove members who were previously auto-joined. It only stops future org members from being added automatically. Existing members remain in the Resource Group.

### Datasets Overview

https://huggingface.co/docs/hub/datasets-overview.md

# Datasets Overview

## Datasets on the Hub

The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/nyu-mll/glue), include a [Dataset Viewer](./data-studio) to showcase the data.

Each dataset is a [Git repository](./repositories) that contains the data required to generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Data files Configuration page](./datasets-data-files-configuration). Following the supported repo structure will ensure that the dataset page on the Hub will have a Viewer.

## Search for datasets

Like models and spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets).
There's a large number of languages, tasks, and licenses that you can use to filter your results to find a dataset that's right for you. ## Privacy Since datasets are repositories, you can [toggle their visibility between private and public](./repositories-settings#private-repositories) through the Settings tab. If a dataset is owned by an [organization](./organizations), the privacy settings apply to all the members of the organization. ### Widgets https://huggingface.co/docs/hub/models-widgets.md # Widgets ## What's a widget? Many model repos have a widget that allows anyone to run inferences directly in the browser. These widgets are powered by [Inference Providers](https://huggingface.co/docs/inference-providers), which provide developers streamlined, unified access to hundreds of machine learning models, backed by our serverless inference partners. Here are some examples of current popular models: - [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324) - State-of-the-art open-weights conversational model - [Flux Kontext](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev) - Open-weights transformer model for image editing - [Falconsai's NSFW Detection](https://huggingface.co/Falconsai/nsfw_image_detection) - Image content moderation - [ResembleAI's Chatterbox](https://huggingface.co/ResembleAI/chatterbox) - Production-grade open source text-to-speech model. You can explore more models and their widgets on the [models page](https://huggingface.co/models?inference_provider=all&sort=trending) or try them interactively in the [Inference Playground](https://huggingface.co/playground). ## Enabling a widget Widgets are displayed when the model is hosted by at least one Inference Provider, ensuring optimal performance and reliability for the model's inference. Providers autonomously choose and control what models they deploy. 
The type of widget displayed (text-generation, text-to-image, etc.) is inferred from the model's `pipeline_tag`, a special tag that the Hub tries to compute automatically for all models. The only exception is the `conversational` widget, which is shown on models with a `pipeline_tag` of either `text-generation` or `image-text-to-text`, as long as they're also tagged as `conversational`. We choose to expose **only one** widget per model for simplicity.

For some libraries, such as `transformers`, the model type can be inferred automatically based on configuration files (`config.json`). The architecture can determine the type: for example, `AutoModelForTokenClassification` corresponds to `token-classification`. If you're interested in this, you can see pseudo-code in [this gist](https://gist.github.com/julien-c/857ba86a6c6a895ecd90e7f7cab48046).

For most other use cases, we use the model tags to determine the model task type. For example, if there is `tag: text-classification` in the [model card metadata](./model-cards), the inferred `pipeline_tag` will be `text-classification`.

**You can always manually override your pipeline type with `pipeline_tag: xxx` in your [model card metadata](./model-cards#model-card-metadata).** (You can also use the metadata GUI editor to do this.)

### How can I control my model's widget example input?

You can specify the widget input in the model card metadata section:

```yaml
widget:
  - text: "This new restaurant has amazing food and great service!"
    example_title: "Positive Review"
  - text: "I'm really disappointed with this product. Poor quality and overpriced."
    example_title: "Negative Review"
  - text: "The weather is nice today."
    example_title: "Neutral Statement"
```

You can provide more than one example input. In the examples dropdown menu of the widget, they will appear as `Example 1`, `Example 2`, etc. Optionally, you can supply `example_title` as well.

```yaml
widget:
  - text: "Is this review positive or negative? Review: Best cast iron skillet you will ever buy."
    example_title: "Sentiment analysis"
  - text: "Barack Obama nominated Hilary Clinton as his secretary of state on Monday. He chose her because she had ..."
    example_title: "Coreference resolution"
  - text: "On a shelf, there are five books: a gray book, a red book, a purple book, a blue book, and a black book ..."
    example_title: "Logic puzzles"
  - text: "The two men running to become New York City's next mayor will face off in their first debate Wednesday night ..."
    example_title: "Reading comprehension"
```

Moreover, you can specify non-text example inputs in the model card metadata. Refer [here](./models-widgets-examples) for a complete list of sample input formats for all widget types. For vision & audio widget types, provide example inputs with `src` rather than `text`. For example, allow users to choose from two sample audio files for automatic speech recognition tasks by:

```yaml
widget:
  - src: https://example.org/somewhere/speech_samples/sample1.flac
    example_title: Speech sample 1
  - src: https://example.org/somewhere/speech_samples/sample2.flac
    example_title: Speech sample 2
```

Note that you can also include example files in your model repository and use them as:

```yaml
widget:
  - src: https://huggingface.co/username/model_repo/resolve/main/sample1.flac
    example_title: Custom Speech Sample 1
```

Even more conveniently, if the file lives in the corresponding model repo, you can just use the filename or file path inside the repo:

```yaml
widget:
  - src: sample1.flac
    example_title: Custom Speech Sample 1
```

or, if it is nested inside the repo:

```yaml
widget:
  - src: nested/directory/sample1.flac
```

We provide example inputs for some languages and most widget types in the [default-widget-inputs.ts file](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/default-widget-inputs.ts). If some examples are missing, we welcome PRs from the community to add them!
## Example outputs As an extension to example inputs, for each widget example, you can also optionally describe the corresponding model output, directly in the `output` property. This is useful when the model is not yet supported by Inference Providers, so that the model page can still showcase how the model works and what results it gives. For instance, for an [automatic-speech-recognition](./models-widgets-examples#automatic-speech-recognition) model: ```yaml widget: - src: sample1.flac output: text: "Hello my name is Julien" ``` The `output` property should be a YAML dictionary that represents the output format from Inference Providers. For a model that outputs text, see the example above. For a model that outputs labels (like a [text-classification](./models-widgets-examples#text-classification) model for instance), output should look like this: ```yaml widget: - text: "I liked this movie" output: - label: POSITIVE score: 0.8 - label: NEGATIVE score: 0.2 ``` Finally, for a model that outputs an image, audio, or any other kind of asset, the output should include a `url` property linking to either a file name or path inside the repo or a remote URL. For example, for a text-to-image model: ```yaml widget: - text: "picture of a futuristic tiger, artstation" output: url: images/tiger.jpg ``` We can also surface the example outputs in the Hugging Face UI, for instance, for a text-to-image model to display a gallery of cool image generations. ## Widget Availability and Provider Support Not all models have widgets available. Widget availability depends on: 1. **Task Support**: The model's task must be supported by at least one provider in the Inference Providers network 2. **Provider Availability**: At least one provider must be serving the specific model 3. 
**Model Configuration**: The model must have proper metadata and configuration files To view the full list of supported tasks, check out [our dedicated documentation page](https://huggingface.co/docs/inference-providers/tasks/index). The list of all providers and the tasks they support is available in [this documentation page](https://huggingface.co/docs/inference-providers/index#partners). For models without provider support, you can still showcase functionality using [example outputs](#example-outputs) in your model card. You can also click _Ask for provider support_ directly on the model page to encourage providers to serve the model, given there is enough community interest. ## Exploring Models with the Inference Playground Before integrating models into your applications, you can test them interactively with the [Inference Playground](https://huggingface.co/playground). The playground allows you to: - Test different [chat completion models](https://huggingface.co/models?inference_provider=all&sort=trending&other=conversational) with custom prompts - Compare responses across different models - Experiment with inference parameters like temperature, max tokens, and more - Find the perfect model for your specific use case The playground uses the same Inference Providers infrastructure that powers the widgets, so you can expect similar performance and capabilities when you integrate the models into your own applications. ### Search https://huggingface.co/docs/hub/search.md # Search You can easily search anything on the Hub with **Full-text search**. We index model cards, dataset cards, and Spaces app.py files. Go directly to https://huggingface.co/search or, using the search bar at the top of https://huggingface.co, you can select "Try Full-text search" to help find what you seek on the Hub across models, datasets, and Spaces: ## Filter with ease By default, models, datasets, & spaces are being searched when a user enters a query. 
If you prefer, you can filter the search to only models, datasets, or spaces. Moreover, you can copy and share the URL from your browser's address bar, which contains the filter information as query parameters. For example, searching for the query `llama` with a filter to show `Spaces` only gives the URL https://huggingface.co/search/full-text?q=llama&type=space

### Data Studio

https://huggingface.co/docs/hub/data-studio.md

# Data Studio

Each dataset page includes a table with the contents of the dataset, arranged by pages of 100 rows. You can navigate between pages using the buttons at the bottom of the table.

## Inspect data distributions

At the top of the columns you can see graphs representing the distribution of their data. This gives you a quick insight into how balanced your classes are, the range and distribution of numerical data and text lengths, and what portion of the column data is missing.

## Filter by value

If you click on a bar of a histogram from a numerical column, the dataset viewer will filter the data and show only the rows with values that fall in the selected range. Similarly, if you select one class from a categorical column, it will show only the rows from the selected category.

## Search a word in the dataset

You can search for a word in the dataset by typing it in the search bar at the top of the table. The search is case-insensitive and will match any row containing the word. The text is searched in the `string` columns, even if the values are nested in a dictionary or a list.

## Run SQL queries on the dataset

You can run SQL queries on the dataset in the browser using the SQL Console. This feature also leverages our [auto-conversion to Parquet](data-studio#access-the-parquet-files). For more information see our guide on [SQL Console](./datasets-viewer-sql-console).

## Share a specific row

You can share a specific row by clicking on it, and then copying the URL in the address bar of your browser.
For example https://huggingface.co/datasets/nyu-mll/glue/viewer/mrpc/test?p=2&row=241 will open the dataset studio on the MRPC dataset, on the test split, and on the 241st row.

## Large scale datasets

The Dataset Viewer supports large scale datasets, but depending on the data format it may only show the first 5GB of the dataset:

- For Parquet datasets: the Dataset Viewer shows the full dataset, but sorting, filtering and search are only enabled on the first 5GB.
- For datasets >5GB in other formats (e.g. [WebDataset](https://github.com/webdataset/webdataset) or JSON Lines): the Dataset Viewer only shows the first 5GB, and sorting, filtering and search are enabled on these first 5GB. In this case, an informational message lets you know that the Viewer is partial. This should be a large enough sample to represent the full dataset accurately; let us know if you need a bigger sample.

## Access the parquet files

To power the dataset viewer, the first 5GB of every dataset are auto-converted to the Parquet format (unless it was already a Parquet dataset). In the dataset viewer (for example, see [GLUE](https://huggingface.co/datasets/nyu-mll/glue)), you can click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/nyu-mll/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Please refer to the [dataset viewer docs](/docs/datasets-server/parquet_process) to learn how to query the dataset parquet files with libraries such as Polars, Pandas or DuckDB.

> [!TIP]
> Parquet is a columnar storage format optimized for querying and processing large datasets. It is a popular choice for big data processing and analytics and is widely used for data processing and machine learning. You can learn more about the advantages associated with this format in the documentation.

### Conversion bot

When you create a new dataset, the [`parquet-converter` bot](https://huggingface.co/parquet-converter) notifies you once it converts the dataset to Parquet.
The [discussion](./repositories-pull-requests-discussions) it opens in the repository provides details about the Parquet format and links to the Parquet files.

### Programmatic access

You can also access the list of Parquet files programmatically using the [Hub API](./api#get-apidatasetsrepoidparquet); for example, the endpoint [`https://huggingface.co/api/datasets/nyu-mll/glue/parquet`](https://huggingface.co/api/datasets/nyu-mll/glue/parquet) lists the Parquet files of the `nyu-mll/glue` dataset. We also provide dedicated documentation for the [Dataset Viewer API](https://huggingface.co/docs/dataset-viewer), which you can call directly. That API lets you access the contents, metadata and basic statistics of all Hugging Face Hub datasets, and powers the Dataset Viewer frontend.

## Dataset preview

For the biggest datasets, the page shows a preview of the first 100 rows instead of a full-featured viewer. This restriction only applies to datasets over 5GB that are not natively in Parquet format or that have not been auto-converted to Parquet.

## Embed the Dataset Viewer in a webpage

You can embed the Dataset Viewer in your own webpage using an iframe. The URL to use is `https://huggingface.co/datasets/<namespace>/<dataset-name>/embed/viewer`, where `<namespace>` is the owner of the dataset and `<dataset-name>` is the name of the dataset. You can also pass other parameters like the subset, split, filter, search or selected row. For more information see our guide on [How to embed the Dataset Viewer in a webpage](./datasets-viewer-embed).

## Configure the Dataset Viewer

To have a properly working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure. There is also an option to configure your dataset using YAML. You can specify which files to display in the Dataset Viewer by adding a YAML configuration block at the top of your dataset's `README.md` file.
For example, to choose which file goes into which split:

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "data.csv"
  - split: test
    path: "holdout.csv"
---
```

You can also select multiple files per split or use glob patterns:

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "data/train_part1.csv"
    - "data/train_part2.csv"
  - split: test
    path: "data/*.csv"
---
```

For **private** datasets, the Dataset Viewer is enabled for [PRO users](https://huggingface.co/pricing) and [Team or Enterprise organizations](https://huggingface.co/enterprise). For more information see our guide on [How to configure the Dataset Viewer](./datasets-viewer-configure).

### Data files Configuration

https://huggingface.co/docs/hub/datasets-data-files-configuration.md

# Data files Configuration

There are no constraints on how to structure dataset repositories. However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly. Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`.

## What are splits and subsets?

Machine learning datasets typically have splits and may also have subsets. A dataset is generally made of _splits_ (e.g. `train` and `test`) that are used during different stages of training and evaluating a model. A _subset_ (also called _configuration_) is a sub-dataset contained within a larger dataset. Subsets are especially common in multilingual speech datasets where there may be a different subset for each language. If you're interested in learning more about splits and subsets, check out the [Splits and subsets](/docs/datasets-server/configs_and_splits) guide!
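To make the `data_files` mapping concrete, here is a small, self-contained sketch (plain Python with hypothetical file names, not the Hub's actual implementation) of how split entries, including glob patterns, resolve to concrete files in a repository:

```python
from fnmatch import fnmatch

# Mirrors a YAML `configs` block: the train split lists files explicitly,
# while the test split uses a glob pattern (file names are hypothetical).
configs = [
    {
        "config_name": "default",
        "data_files": [
            {"split": "train", "path": ["data/train_part1.csv", "data/train_part2.csv"]},
            {"split": "test", "path": "holdout/*.csv"},
        ],
    }
]

def resolve_split(repo_files, configs, config_name, split):
    """Return the repository files that belong to one split of one subset."""
    for cfg in configs:
        if cfg["config_name"] != config_name:
            continue
        for entry in cfg["data_files"]:
            if entry["split"] != split:
                continue
            patterns = entry["path"]
            if isinstance(patterns, str):
                patterns = [patterns]
            return [f for f in repo_files if any(fnmatch(f, p) for p in patterns)]
    return []

repo_files = ["README.md", "data/train_part1.csv", "data/train_part2.csv", "holdout/part1.csv"]
print(resolve_split(repo_files, configs, "default", "train"))
# ['data/train_part1.csv', 'data/train_part2.csv']
print(resolve_split(repo_files, configs, "default", "test"))
# ['holdout/part1.csv']
```

This only illustrates the resolution logic; on the Hub, the actual pattern matching is performed by the dataset libraries and the Viewer backend.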
![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)

## Automatic splits detection

Splits are automatically detected based on file and directory names. For example, this is a dataset with `train`, `test`, and `validation` splits:

```
my_dataset_repository/
├── README.md
├── train.csv
├── test.csv
└── validation.csv
```

To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation and the [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/file-names-and-splits-655e28af4471bd95709eb135).

## Manual splits and subsets configuration

You can choose the data files to show in the Dataset Viewer for your dataset using YAML. This is useful if you want to manually specify which file goes into which split. You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files). Here is an example of a configuration defining a subset called "benchmark" with a `test` split:

```yaml
configs:
- config_name: benchmark
  data_files:
  - split: test
    path: benchmark.csv
```

See the documentation on [Manual configuration](./datasets-manual-configuration) for more information. Also take a look at the [example datasets](https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87).

## Supported file formats

See the [File formats](./datasets-adding#file-formats) doc page to find the list of supported formats and recommendations for your dataset. If your dataset uses CSV or TSV files, you can find more information in the [example datasets](https://huggingface.co/collections/datasets-examples/format-csv-and-tsv-655f681cb9673a4249cccb3d).
### Dataset Viewer size-limit errors (`TooBigContentError`)

If you see `Error code: TooBigContentError`, the dataset viewer could not read a preview within its limits. Common messages include `Parquet error: Scan size limit exceeded` and `The size of the content of the first rows exceeds the maximum supported size`.

What you can do:

- For Parquet files, use smaller row groups and include a page index (`write_page_index=True`) so the Viewer can read only what it needs.
- Avoid very large values in the first rows (very long strings, large JSON blobs, base64 payloads). Move large payloads to separate files when possible.
- Split very large files into smaller shards or splits, then re-upload.
- If the issue remains, review [Configure the Dataset Viewer](./datasets-viewer-configure) and open a discussion on your dataset page with the full error text.

## Image, Audio and Video datasets

For image/audio/video classification datasets, you can also use directories to name the image/audio/video classes. And if your images/audio/video files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them. We provide a few guides that you can check out:

- [How to create an image dataset](./datasets-image) ([example datasets](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65))
- [How to create an audio dataset](./datasets-audio) ([example datasets](https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607))
- [How to create a video dataset](./datasets-video)

### Transforming your dataset

https://huggingface.co/docs/hub/datasets-polars-operations.md

# Transforming your dataset

On this page we'll guide you through some of the most common operations used when doing data analysis. This is only a small subset of what's possible in Polars. For more information, please visit the [Documentation](https://docs.pola.rs/).
For the example we will use the [Common Crawl statistics](https://huggingface.co/datasets/commoncrawl/statistics) dataset. These statistics include: number of pages, distribution of top-level domains, crawl overlaps, etc. For more detailed information and graphs please visit their [official statistics page](https://commoncrawl.github.io/cc-crawl-statistics/plots/tlds).

## Reading

```python
import polars as pl

df = pl.read_csv(
    "hf://datasets/commoncrawl/statistics/tlds.csv",
    try_parse_dates=True,
)
df.head(3)
```

```bash
┌─────┬────────┬───────────────────┬────────────┬───┬───────┬──────┬───────┬─────────┐
│     ┆ suffix ┆ crawl             ┆ date       ┆ … ┆ pages ┆ urls ┆ hosts ┆ domains │
│ --- ┆ ---    ┆ ---               ┆ ---        ┆   ┆ ---   ┆ ---  ┆ ---   ┆ ---     │
│ i64 ┆ str    ┆ str               ┆ date       ┆   ┆ i64   ┆ i64  ┆ f64   ┆ f64     │
╞═════╪════════╪═══════════════════╪════════════╪═══╪═══════╪══════╪═══════╪═════════╡
│ 0   ┆ a.se   ┆ CC-MAIN-2008-2009 ┆ 2009-01-12 ┆ … ┆ 18    ┆ 18   ┆ 1.0   ┆ 1.0     │
│ 1   ┆ a.se   ┆ CC-MAIN-2009-2010 ┆ 2010-09-25 ┆ … ┆ 3462  ┆ 3259 ┆ 166.0 ┆ 151.0   │
│ 2   ┆ a.se   ┆ CC-MAIN-2012      ┆ 2012-11-02 ┆ … ┆ 6957  ┆ 6794 ┆ 172.0 ┆ 150.0   │
└─────┴────────┴───────────────────┴────────────┴───┴───────┴──────┴───────┴─────────┘
```

## Selecting columns

The dataset contains some columns we don't need.
To remove them, we will use the `select` method:

```python
df = df.select("suffix", "crawl", "date", "tld", "pages", "domains")
df.head(3)
```

```bash
┌────────┬───────────────────┬────────────┬─────┬───────┬─────────┐
│ suffix ┆ crawl             ┆ date       ┆ tld ┆ pages ┆ domains │
│ ---    ┆ ---               ┆ ---        ┆ --- ┆ ---   ┆ ---     │
│ str    ┆ str               ┆ date       ┆ str ┆ i64   ┆ f64     │
╞════════╪═══════════════════╪════════════╪═════╪═══════╪═════════╡
│ a.se   ┆ CC-MAIN-2008-2009 ┆ 2009-01-12 ┆ se  ┆ 18    ┆ 1.0     │
│ a.se   ┆ CC-MAIN-2009-2010 ┆ 2010-09-25 ┆ se  ┆ 3462  ┆ 151.0   │
│ a.se   ┆ CC-MAIN-2012      ┆ 2012-11-02 ┆ se  ┆ 6957  ┆ 150.0   │
└────────┴───────────────────┴────────────┴─────┴───────┴─────────┘
```

## Filtering

We can filter the dataset using the `filter` method. This method accepts complex expressions, but let's start simple by filtering based on the crawl date:

```python
import datetime

df = df.filter(pl.col("date") >= datetime.date(2020, 1, 1))
```

You can combine multiple predicates with `&` or `|` operators:

```python
df = df.filter(
    (pl.col("date") >= datetime.date(2020, 1, 1))
    | pl.col("crawl").str.contains("CC")
)
```

## Transforming

In order to add new columns to the dataset, use `with_columns`. In the example below we calculate the total number of pages per domain and add a new column `pages_per_domain` using the `alias` method. The entire statement within `with_columns` is called an expression.
Read more about expressions and how to use them in the [Polars user guide](https://docs.pola.rs/user-guide/expressions/).

```python
df = df.with_columns(
    (pl.col("pages") / pl.col("domains")).alias("pages_per_domain")
)
df.sample(3)
```

```bash
┌────────┬─────────────────┬────────────┬─────┬───────┬─────────┬──────────────────┐
│ suffix ┆ crawl           ┆ date       ┆ tld ┆ pages ┆ domains ┆ pages_per_domain │
│ ---    ┆ ---             ┆ ---        ┆ --- ┆ ---   ┆ ---     ┆ ---              │
│ str    ┆ str             ┆ date       ┆ str ┆ i64   ┆ f64     ┆ f64              │
╞════════╪═════════════════╪════════════╪═════╪═══════╪═════════╪══════════════════╡
│ net.bt ┆ CC-MAIN-2014-41 ┆ 2014-10-06 ┆ bt  ┆ 4     ┆ 1.0     ┆ 4.0              │
│ org.mk ┆ CC-MAIN-2016-44 ┆ 2016-10-31 ┆ mk  ┆ 1445  ┆ 430.0   ┆ 3.360465         │
│ com.lc ┆ CC-MAIN-2016-44 ┆ 2016-10-31 ┆ lc  ┆ 1     ┆ 1.0     ┆ 1.0              │
└────────┴─────────────────┴────────────┴─────┴───────┴─────────┴──────────────────┘
```

## Aggregation & Sorting

In order to aggregate data together you can use the `group_by`, `agg` and `sort` methods. Within the aggregation context you can combine expressions to create powerful statements which are still easy to read.
First, we aggregate all the data to the top-level domain `tld` per scraped date:

```python
df = df.group_by("tld", "date").agg(
    pl.col("pages").sum(),
    pl.col("domains").sum(),
)
```

Now we can calculate several statistics per top-level domain:

- Number of unique scrape dates
- Average number of domains in the scraped period
- Average growth rate in terms of number of pages

```python
df = df.group_by("tld").agg(
    pl.col("date").unique().count().alias("number_of_scrapes"),
    pl.col("domains").mean().alias("avg_number_of_domains"),
    pl.col("pages").sort_by("date").pct_change().mean().alias("avg_page_growth_rate"),
)
df = df.sort("avg_number_of_domains", descending=True)
df.head(10)
```

```bash
┌─────┬───────────────────┬───────────────────────┬──────────────────────┐
│ tld ┆ number_of_scrapes ┆ avg_number_of_domains ┆ avg_page_growth_rate │
│ --- ┆ ---               ┆ ---                   ┆ ---                  │
│ str ┆ u32               ┆ f64                   ┆ f64                  │
╞═════╪═══════════════════╪═══════════════════════╪══════════════════════╡
│ com ┆ 101               ┆ 1.9571e7              ┆ 0.022182             │
│ de  ┆ 101               ┆ 1.8633e6              ┆ 0.5232               │
│ org ┆ 101               ┆ 1.5049e6              ┆ 0.019604             │
│ net ┆ 101               ┆ 1.5020e6              ┆ 0.021002             │
│ cn  ┆ 101               ┆ 1.1101e6              ┆ 0.281726             │
│ ru  ┆ 101               ┆ 1.0561e6              ┆ 0.416303             │
│ uk  ┆ 101               ┆ 827453.732673         ┆ 0.065299             │
│ nl  ┆ 101               ┆ 710492.623762         ┆ 1.040096             │
│ fr  ┆ 101               ┆ 615471.594059         ┆ 0.419181             │
│ jp  ┆ 101               ┆ 615391.455446         ┆ 0.246162             │
└─────┴───────────────────┴───────────────────────┴──────────────────────┘
```

### Shiny on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-shiny.md

# Shiny on Spaces

[Shiny](https://shiny.posit.co/) is an open-source framework for building simple, beautiful, and performant data applications. The goal when developing Shiny was to build something simple enough to teach someone in an afternoon but extensible enough to power large, mission-critical applications. You can create a useful Shiny app in a few minutes, but if the scope of your project grows, you can be sure that Shiny can accommodate that application.

The main feature that differentiates Shiny from other frameworks is its reactive execution model. When you write a Shiny app, the framework infers the relationships between inputs, outputs, and intermediary calculations and uses those relationships to render only the things that need to change as a result of a user's action. The result is that users can easily develop efficient, extensible applications without explicitly caching data or writing callback functions.

## Shiny for Python

[Shiny for Python](https://shiny.rstudio.com/py/) is a pure Python implementation of Shiny. This gives you access to all of the great features of Shiny like reactivity, complex layouts, and modules without needing to use R. Shiny for Python is ideal for Hugging Face applications because it integrates smoothly with other Hugging Face tools.

To get started deploying a Space, click this button to select your hardware and specify if you want a public or private Space. The Space template will populate a few files to get your app started.

_app.py_

This file defines your app's logic.
To learn more about how to modify this file, see [the Shiny for Python documentation](https://shiny.rstudio.com/py/docs/overview.html). As your app gets more complex, it's a good idea to break your application logic up into [modules](https://shiny.rstudio.com/py/docs/workflow-modules.html).

_Dockerfile_

The Dockerfile for a Shiny for Python app is very minimal because the library doesn't have many system dependencies, but you may need to modify this file if your application has additional system dependencies. The one essential feature of this file is that it exposes and runs the app on the port specified in the Space README file (which is 7860 by default).

_requirements.txt_

The Space will automatically install dependencies listed in the `requirements.txt` file. Note that you must include `shiny` in this file.

## Shiny for R

[Shiny for R](https://shiny.rstudio.com/) is a popular and well-established application framework in the R community and is a great choice if you want to host an R app on Hugging Face infrastructure or make use of some of the great [Shiny R extensions](https://github.com/nanxstats/awesome-shiny-extensions). To integrate Hugging Face tools into an R app, you can either use [httr2](https://httr2.r-lib.org/) to call Hugging Face APIs, or [reticulate](https://rstudio.github.io/reticulate/) to call one of the Hugging Face Python SDKs.

To deploy an R Shiny Space, click this button and fill out the Space metadata. This will populate the Space with all the files you need to get started.

_app.R_

This file contains all of your application logic. If you prefer, you can break this file up into `ui.R` and `server.R`.

_Dockerfile_

The Dockerfile builds on the [rocker shiny](https://hub.docker.com/r/rocker/shiny) image. You'll need to modify this file to use additional packages. If you are using a lot of tidyverse packages, we recommend switching the base image to [rocker/shinyverse](https://hub.docker.com/r/rocker/shiny-verse).
You can install additional R packages by adding them under the `RUN install2.r` section of the Dockerfile, and GitHub packages can be installed by adding the repository under `RUN installGithub.r`. There are two main requirements for this Dockerfile:

- First, the file must expose the port that you have listed in the README. The default is 7860 and we recommend not changing this port unless you have a reason to.
- Second, for the moment you must use the development version of [httpuv](https://github.com/rstudio/httpuv), which resolves an issue with app timeouts on Hugging Face.

### Using SetFit with Hugging Face

https://huggingface.co/docs/hub/setfit.md

# Using SetFit with Hugging Face

SetFit is an efficient and prompt-free framework for few-shot fine-tuning of [Sentence Transformers](https://sbert.net/). It achieves high accuracy with little labeled data; for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples 🤯!

Compared to other few-shot learning methods, SetFit has several unique features:

* 🗣 **No prompts or verbalizers:** Current techniques for few-shot fine-tuning require handcrafted prompts or verbalizers to convert examples into a format suitable for the underlying language model. SetFit dispenses with prompts altogether by generating rich embeddings directly from text examples.
* 🏎 **Fast to train:** SetFit doesn't require large-scale models like [T0](https://huggingface.co/bigscience/T0) or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.
* 🌎 **Multilingual support**: SetFit can be used with any [Sentence Transformer](https://huggingface.co/models?library=sentence-transformers&sort=downloads) on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint.
## Exploring SetFit on the Hub

You can find SetFit models by filtering at the left of the [models page](https://huggingface.co/models?library=setfit).

All models on the Hub come with these useful features:

1. An automatically generated model card with a brief description.
2. An interactive widget you can use to play with the model directly in the browser.
3. An Inference Providers widget that allows you to make inference requests.

## Installation

To get started, you can follow the [SetFit installation guide](https://huggingface.co/docs/setfit/installation). You can also use the following one-line install through pip:

```
pip install -U setfit
```

## Using existing models

All `setfit` models can easily be loaded from the Hub.

```py
from setfit import SetFitModel

model = SetFitModel.from_pretrained("tomaarsen/setfit-paraphrase-mpnet-base-v2-sst2-8-shot")
```

Once loaded, you can use [`SetFitModel.predict`](https://huggingface.co/docs/setfit/reference/main#setfit.SetFitModel.predict) to perform inference.

```py
model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
```

```bash
'positive'
```

If you want to load a specific SetFit model, you can click `Use in SetFit` and you will be given a working snippet!

## Additional resources

* [All SetFit models available on the Hub](https://huggingface.co/models?library=setfit)
* SetFit [repository](https://github.com/huggingface/setfit)
* SetFit [docs](https://huggingface.co/docs/setfit)
* SetFit [paper](https://arxiv.org/abs/2209.11055)

### Using 🤗 `transformers` at Hugging Face

https://huggingface.co/docs/hub/transformers.md

# Using 🤗 `transformers` at Hugging Face

🤗 `transformers` is a library maintained by Hugging Face and the community, for state-of-the-art Machine Learning for PyTorch, TensorFlow and JAX. It provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.
We are a bit biased, but we really like 🤗 `transformers`!

## Exploring 🤗 transformers in the Hub

There are over 630,000 `transformers` models in the Hub which you can find by filtering at the left of [the models page](https://huggingface.co/models?library=transformers&sort=downloads).

You can find models for many different tasks:

* Extracting the answer from a context ([question-answering](https://huggingface.co/models?library=transformers&pipeline_tag=question-answering&sort=downloads)).
* Creating summaries from a large text ([summarization](https://huggingface.co/models?library=transformers&pipeline_tag=summarization&sort=downloads)).
* Classifying text (e.g. as spam or not spam, [text-classification](https://huggingface.co/models?library=transformers&pipeline_tag=text-classification&sort=downloads)).
* Generating new text with models such as GPT ([text-generation](https://huggingface.co/models?library=transformers&pipeline_tag=text-generation&sort=downloads)).
* Identifying parts of speech (verb, subject, etc.) or entities (country, organization, etc.) in a sentence ([token-classification](https://huggingface.co/models?library=transformers&pipeline_tag=token-classification&sort=downloads)).
* Transcribing audio files to text ([automatic-speech-recognition](https://huggingface.co/models?library=transformers&pipeline_tag=automatic-speech-recognition&sort=downloads)).
* Classifying the speaker or language in an audio file ([audio-classification](https://huggingface.co/models?library=transformers&pipeline_tag=audio-classification&sort=downloads)).
* Detecting objects in an image ([object-detection](https://huggingface.co/models?library=transformers&pipeline_tag=object-detection&sort=downloads)).
* Segmenting an image ([image-segmentation](https://huggingface.co/models?library=transformers&pipeline_tag=image-segmentation&sort=downloads)).
* Doing Reinforcement Learning ([reinforcement-learning](https://huggingface.co/models?library=transformers&pipeline_tag=reinforcement-learning&sort=downloads))!

You can try out the models directly in the browser if you want to test them without downloading them, thanks to the in-browser widgets!

## Transformers repository files

A [Transformers](https://hf.co/docs/transformers/index) model repository generally contains model files and preprocessor files.

### Model

- The **`config.json`** file stores details about the model architecture such as the number of hidden layers, vocabulary size, number of attention heads, the dimensions of each head, and more. This metadata is the model blueprint.
- The **`model.safetensors`** file stores the model's pretrained layers and weights. For large models, the safetensors file is sharded to limit the amount of memory required to load it. Browse the **`model.safetensors.index.json`** file to see which safetensors file the model weights are being loaded from.

  ```json
  {
    "metadata": {
      "total_size": 16060522496
    },
    "weight_map": {
      "lm_head.weight": "model-00004-of-00004.safetensors",
      "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
      ...
    }
  }
  ```

  You can also visualize this mapping by clicking on the ↗ button on the model card.

  [Safetensors](https://hf.co/docs/safetensors/index) is a safer and faster serialization format (compared to [pickle](./security-pickle#use-your-own-serialization-format)) for storing model weights. You may encounter weights pickled in formats such as **`bin`**, **`pth`**, or **`ckpt`**, but **`safetensors`** is increasingly adopted in the model ecosystem as a better alternative.

- A model may also have a **`generation_config.json`** file which stores details about how to generate text, such as whether to sample, the top tokens to sample from, the temperature, and the special tokens for starting and stopping generation.
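As a small illustration of how the sharded-checkpoint index works, here is a sketch (pure Python, using the truncated `weight_map` shown above) of looking up which shard stores a given weight:

```python
import json

# A trimmed-down model.safetensors.index.json, as in the example above.
index_json = """
{
  "metadata": {"total_size": 16060522496},
  "weight_map": {
    "lm_head.weight": "model-00004-of-00004.safetensors",
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors"
  }
}
"""
index = json.loads(index_json)

def shard_for(weight_name: str) -> str:
    """Look up which safetensors shard holds a given weight tensor."""
    return index["weight_map"][weight_name]

print(shard_for("lm_head.weight"))               # model-00004-of-00004.safetensors
print(index["metadata"]["total_size"] / 2**30)   # checkpoint size in GiB (≈ 15)
```

Loaders use this mapping to open only the shards that contain the tensors they need, which is what keeps peak memory low when loading a large model.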
### Preprocessor

- The **`tokenizer_config.json`** file stores the special tokens added by a model. These special tokens signal many things to a model, such as the beginning of a sentence, specific formatting for chat templates, or indicating an image. This file also shows the maximum input sequence length the model can accept, the preprocessor class, and the outputs it returns.
- The **`tokenizer.json`** file stores the model's learned vocabulary.
- The **`special_tokens_map.json`** file is a mapping of the special tokens. For example, in [Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/special_tokens_map.json), the beginning of string token is `"<|begin_of_text|>"`.

> [!TIP]
> For other modalities, the `tokenizer_config.json` file is replaced by `preprocessor_config.json`.

## Using existing models

All `transformers` models are one line away from being used! Depending on how you want to use them, you can use the high-level API via the `pipeline` function, or use `AutoModel` for more control.

```py
# With pipeline, just specify the task and the model id from the Hub.
from transformers import pipeline

pipe = pipeline("text-generation", model="distilbert/distilgpt2")

# If you want more control, you will need to define the tokenizer and model.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
```

You can also load a model from a specific version (based on commit hash, tag name, or branch) as follows:

```py
model = AutoModel.from_pretrained(
    "julien-c/EsperBERTo-small",
    revision="v2.0.1"  # tag name, or branch name, or commit hash
)
```

If you want to see how to load a specific model, you can click `Use in Transformers` and you will be given a working snippet to load it!
If you need further information about the model architecture, you can also click "Read model documentation" at the bottom of the snippet.

## Sharing your models

To read all about sharing models with `transformers`, head over to the [Share a model](https://huggingface.co/docs/transformers/model_sharing) guide in the official documentation.

Many classes in `transformers`, such as the models and tokenizers, have a `push_to_hub` method that allows you to easily upload the files to a repository.

```py
# Pushing model to your own account
model.push_to_hub("my-awesome-model")

# Pushing your tokenizer
tokenizer.push_to_hub("my-awesome-model")

# Pushing all things after training
trainer.push_to_hub()
```

There is much more you can do, so we suggest reviewing the [Share a model](https://huggingface.co/docs/transformers/model_sharing) guide.

## Additional resources

* Transformers [library](https://github.com/huggingface/transformers).
* Transformers [docs](https://huggingface.co/docs/transformers/index).
* Share a model [guide](https://huggingface.co/docs/transformers/model_sharing).

### Team & Enterprise plans

https://huggingface.co/docs/hub/enterprise.md

# Team & Enterprise plans

> [!TIP]
> Subscribe to a Team or Enterprise plan to get access to advanced features for your organization.

Team & Enterprise organization plans add advanced capabilities to organizations, enabling safe, compliant and managed collaboration for companies and teams on Hugging Face.
## Compare our plans at a quick glance

### Core usage, storage, rate limits

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | --- | --- | --- | --- |
| Storage – Public repos | Best effort | 12TB base + 1TB/seat | 200TB base + 1TB/seat | 500TB base + 1TB/seat |
| Storage – Private repos | 100GB | 1TB/seat + PAYG | 1TB/seat + PAYG | 1TB/seat + PAYG |
| [Extra storage](./storage-limits#pay-as-you-go-price) | ❌ | ✅ PAYG | ✅ PAYG | ✅ PAYG |
| API requests / period\* | 1,000 | 3,000 | 6,000 | 10,000 up to 100,000† |
| Resolver requests / period\* | 5,000 | 20,000 | 50,000 | 100,000 up to 500,000† |
| Pages requests / period\* | 200 | 400 | 600 | 1,000 up to 10,000† |

\* All quotas are calculated over 5-minute fixed windows

† When Organization IP Ranges are defined

### Inference & Hub credits

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | --- | --- | --- | --- |
| Serve models with Inference Providers | ✅ PAYG | ✅ $2/seat/mo included + PAYG | ✅ $2/seat/mo included + PAYG | ✅ $2/seat/mo included + PAYG |
| [Usage & billing control](https://huggingface.co/docs/inference-providers/pricing#inference-providers-usage-breakdown) | ❌ | ✅ | ✅ | ✅ |
| Scale deployment with Inference Endpoints | ✅ PAYG | ✅ PAYG | ✅ PAYG | ✅ PAYG |
| Hub credits\* included in plan | ❌ | ❌ (bulk purchase available) | $2k included | 5% of ACV included |

\* Hub credits can be used for Inference Providers, Inference Endpoints, Jobs, Space upgrades, and ZeroGPU quota extensions

### Spaces & Jobs

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | --- | --- | --- |--- |
| Spaces – CPU-based runtime | 8 units\* | ✅ No limit | ✅ No limit | ✅ No limit |
| Spaces – ZeroGPU usage tiers | 3.5 min† | 25 min† | 45 min† | 45 min† |
| Spaces – Upgraded hardware | PAYG | PAYG | PAYG | PAYG |
| Dev Mode / Custom domain for Spaces | ❌ | ✅ | ✅ | ✅ |
| Jobs & Scripts (train/fine-tune, eval) | PAYG | PAYG | PAYG | PAYG |

\* running at the same time

† included daily quota; paid plans can extend beyond quota using credits at $1 per 10 min of GPU time

### Repo rules, access control, visibility

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| Access control granularity | [Standard](./organizations-security) | ✅ Fine-grained | ✅ Fine-grained | ✅ Fine-grained + policies |
| Org controls | ❌ | ✅ | ✅ | ✅ |
| Hub controls | ❌ | ❌ | ❌ | ✅ |
| Default private repos | ❌ | ✅ | ✅ | ✅ |
| Disable public repositories (org-wide) | ❌ | ✅ | ✅ | ✅ |
| [Data residency](./storage-regions) | ❌ | ✅ | ✅ | ✅ |
| Data Studio (private datasets) | ❌ | ✅ | ✅ | ✅ |
| Gating Group Collections | ❌ | ✅ | ✅ | ✅ |

### Identity, authentication, org security

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| [SSO to private org](./enterprise-sso) | ❌ | ✅ Basic SSO | ✅ Basic SSO | ✅ Managed SSO |
| [SSO to public Hub](./enterprise-advanced-sso) | ❌ | ❌ | ❌ | ✅ |
| [Enforce 2FA](./enterprise-advanced-security) | ❌ | ✅ | ✅ | ✅ |
| [OAuth Token Exchange](./oauth#token-exchange-for-organizations-rfc-8693) | ❌ | ❌ | ✅ | ✅ |
| Disable personal public repos for users | ❌ | ❌ | ❌ | ✅ |
| Disable joining other orgs for users | ❌ | ❌ | ❌ | ✅ |
| Disable PRO subscription | ❌ | ❌ | ❌ | ✅ |
| Hide members list | ❌ | ✅ | ✅ | ✅ |

### Governance, auditing, compliance

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| RBAC | ✅ | ✅ Advanced | ✅ Advanced | ✅ Advanced |
| [Audit logs](./audit-logs) | ❌ | ✅ | ✅ | ✅ |
| [Resource groups](./enterprise-advanced-security) | ❌ | ✅ | ✅ | ✅ |
| [Tokens admin / management](./enterprise-tokens-management) | ❌ | ✅ | ✅ | ✅ |
| [Token revocation](./enterprise-tokens-management#revoking-via-api) | ❌ | ❌ | ✅ | ✅ |
| [Users Download analytics](./enterprise-network-security) | ❌ | ❌ | ❌ | ✅ |
| [Content access / policy controls](./enterprise-network-security) | ❌ | ❌ | ❌ | ✅ |
| [Network access controls](./enterprise-network-security) | ❌ | ❌ | ❌ | ✅ |
| [Enforced authentication (advanced)](./enterprise-network-security) | ❌ | ❌ | ❌ | ✅ |

### User provisioning & admin

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| Onboarding/Offboarding | ✅ manual | ✅ controlled | ✅ controlled | ✅ automated |
| SCIM provisioning | ❌ | ❌ | ✅ Invitation-based | ✅ Full lifecycle |
| Managed users | ❌ | ❌ | ❌ | ✅ |

### Support, billing, procurement

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| Support | Forum access | Best effort | Email support with SLA | Advanced Slack support |
| Billing | | Credit card self-serve | Pay with Invoice | Pay with Invoice |
| Contract (including Purchase Order) | ❌ | ❌ | ✅ HF template | ✅ customer paper |
| Legal Review | ❌ | ❌ | ❌ | ✅ |
| Vendor onboarding & Security questionnaires | ❌ | ❌ | ❌ | ✅ |

### Community

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| Org Article | ❌ | ✅ | ✅ | ✅ |
| [Publisher Analytics Dashboard](./publisher-analytics) | ❌ | ✅ | ✅ | ✅ |
| [Set your primary org on your profile](https://huggingface.co/changelog/primary-organization-on-profiles) | ❌ | ✅ | ✅ | ✅ |

### Pricing

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| Pricing | - | $20/user/month | from $50/user/month | custom |
| Pilot availability | ❌ | ❌ | ❌ | ✅ |

## Dive more

In the following sections we document the following Team & Enterprise features:

- [Single Sign-On (SSO)](./enterprise-sso)
- [Audit Logs](./audit-logs)
- [Storage Regions](./storage-regions)
- [Data Studio for Private datasets](./enterprise-datasets)
- [Resource Groups](./security-resource-groups)
- [Advanced Compute Options](./advanced-compute-options)
- [Advanced Security](./enterprise-advanced-security)
- [Tokens Management](./enterprise-tokens-management)
- [OAuth Token Exchange](./oauth#token-exchange-for-organizations-rfc-8693)
- [Publisher Analytics](./publisher-analytics)
- [Gating Group Collections](./enterprise-gating-group-collections)
- [Network Security](./enterprise-network-security)
- [Higher Rate limits](./rate-limits)
- [Blog Articles](./enterprise-blog-articles)

Finally, Team & Enterprise plans include vastly more [included public storage](./storage-limits), as well as 1TB of [private storage](./storage-limits) per seat in the subscription; e.g. if your organization has 40 members, you have 40TB of included storage for your private models and datasets. 
### Langfuse on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-langfuse.md # Langfuse on Spaces This guide shows you how to deploy Langfuse on Hugging Face Spaces and start instrumenting your LLM application for observability. This integration helps you to experiment with LLM APIs on the Hugging Face Hub, manage your prompts in one place, and evaluate model outputs. ## What is Langfuse? [Langfuse](https://langfuse.com) is an open-source LLM engineering platform that helps teams collaboratively debug, evaluate, and iterate on their LLM applications. Key features of Langfuse include LLM tracing to capture the full context of your application's execution flow, prompt management for centralized and collaborative prompt iteration, evaluation metrics to assess output quality, dataset creation for testing and benchmarking, and a playground to experiment with prompts and model configurations. _This video is a 10 min walkthrough of the Langfuse features:_ ## Why LLM Observability? - As language models become more prevalent, understanding their behavior and performance is important. - **LLM observability** involves monitoring and understanding the internal states of an LLM application through its outputs. - It is essential for addressing challenges such as: - **Complex control flows** with repeated or chained calls, making debugging challenging. - **Non-deterministic outputs**, adding complexity to consistent quality assessment. - **Varied user intents**, requiring deep understanding to improve user experience. - Building LLM applications involves intricate workflows, and observability helps in managing these complexities. ## Step 1: Set up Langfuse on Spaces The Langfuse Hugging Face Space allows you to get up and running with a deployed version of Langfuse with just a few clicks. To get started, click the button above or follow these steps: 1. Create a [**new Hugging Face Space**](https://huggingface.co/new-space) 2. Select **Docker** as the Space SDK 3. 
Select **Langfuse** as the Space template 4. Attach a **[Storage Bucket](https://huggingface.co/docs/hub/storage-buckets)** to ensure your Langfuse data is persisted across restarts 5. Ensure the space is set to **public** visibility so the Langfuse API/SDKs can access the app (see note below for more details) 6. [Optional but recommended] For a secure deployment, replace the default values of the **environment variables**: - `NEXTAUTH_SECRET`: Used to validate login session cookies. Generate a secret with at least 256 bits of entropy using `openssl rand -base64 32`. - `SALT`: Used to salt hashed API keys. Generate a secret with at least 256 bits of entropy using `openssl rand -base64 32`. - `ENCRYPTION_KEY`: Used to encrypt sensitive data. Must be 256 bits, i.e. 64 hex characters; generate via `openssl rand -hex 32`. 7. Click **Create Space**! ![Clone the Langfuse Space](https://langfuse.com/images/cookbook/huggingface/huggingface-space-setup.png) ### User Access Your Langfuse Space is pre-configured with Hugging Face OAuth for secure authentication, so you'll need to authorize `read` access to your Hugging Face account upon first login by following the instructions in the pop-up. Once inside the app, you can use [the native Langfuse features](https://langfuse.com/docs/rbac) to manage Organizations, Projects, and Users. The Langfuse space _must_ be set to **public** visibility so that the Langfuse API/SDKs can reach the app. This means that by default, _any_ logged-in Hugging Face user will be able to access the Langfuse space. You can prevent new users from signing up and accessing the space via two different methods: #### 1. (Recommended) Hugging Face native org-level OAuth restrictions If you want to restrict access to members of one or more specified organizations, you can simply set the `hf_oauth_authorized_org` metadata field in the space's `README.md` file, as shown [here](https://huggingface.co/docs/hub/spaces-oauth#create-an-oauth-app). 
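As a sketch, the relevant part of the Space's `README.md` front matter could look roughly like the following; the title and organization name are placeholders, and the rest of your Space's metadata fields are omitted here:

```yaml
---
title: Langfuse
sdk: docker
hf_oauth: true
hf_oauth_authorized_org:
  - your-org-name
---
```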
Once configured, only users who are members of the specified organization(s) will be able to access the space. #### 2. Manual access control You can also restrict access on a per-user basis by setting the `AUTH_DISABLE_SIGNUP` environment variable to `true`. Be sure that you've first signed in and authenticated to the space before setting this variable; otherwise, your own user profile won't be able to authenticate. > [!TIP] > **Note:** If you've set the `AUTH_DISABLE_SIGNUP` environment variable to `true` to restrict access and want to grant a new user access to the space, you'll need to first set it back to `false` (and wait for the rebuild to complete), add the user and have them authenticate with OAuth, and then set it back to `true`. ## Step 2: Use Langfuse Now that you have Langfuse running, you can start instrumenting your LLM application to capture traces and manage your prompts. Let's see how! ### Monitor Any Application Langfuse is model agnostic and can be used to trace any application. Follow the [get-started guide](https://langfuse.com/docs) in the Langfuse documentation to see how you can instrument your code. Langfuse maintains native integrations with many popular LLM frameworks, including [Langchain](https://langfuse.com/docs/integrations/langchain/tracing), [LlamaIndex](https://langfuse.com/docs/integrations/llama-index/get-started) and [OpenAI](https://langfuse.com/docs/integrations/openai/python/get-started), and offers Python and JS/TS SDKs to instrument your code. Langfuse also offers various API endpoints to ingest data and has been integrated by other open-source projects such as [Langflow](https://langfuse.com/docs/integrations/langflow), [Dify](https://langfuse.com/docs/integrations/dify) and [Haystack](https://langfuse.com/docs/integrations/haystack/get-started). 
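As a mental model of what these SDKs do under the hood, tracing amounts to wrapping each call and recording its inputs, outputs, and latency. The sketch below is a toy illustration only, not the Langfuse API; `observe`, `fake_llm_call`, and the in-memory `traces` list are all hypothetical:

```python
import functools
import time

traces = []  # in a real setup, spans are sent to the observability backend

def observe(fn):
    """Toy tracing decorator: records input, output, and latency of each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        traces.append({
            "name": fn.__name__,
            "input": {"args": args, "kwargs": kwargs},
            "output": result,
            "latency_s": time.perf_counter() - start,
        })
        return result
    return wrapper

@observe
def fake_llm_call(prompt: str) -> str:
    # stand-in for a real model call
    return f"echo: {prompt}"

fake_llm_call("What is observability for LLMs?")
print(traces[0]["name"])  # fake_llm_call
```

A real SDK adds nesting (spans within traces), batching, and an export pipeline, but the core idea is the same.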
### Example 1: Trace Calls to Inference Providers As a simple example, here's how to trace LLM calls to [Inference Providers](https://huggingface.co/docs/inference-providers/en/index) using the Langfuse Python SDK. Be sure to first configure your `LANGFUSE_HOST`, `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` environment variables, and make sure you've [authenticated with your Hugging Face account](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication).

```python
# Langfuse's drop-in replacement for the OpenAI SDK: calls are traced automatically
from langfuse.openai import openai
from huggingface_hub import get_token

# Route OpenAI-compatible requests through Hugging Face Inference Providers
client = openai.OpenAI(
    base_url="https://router.huggingface.co/hf-inference/models/meta-llama/Llama-3.3-70B-Instruct/v1",
    api_key=get_token(),
)

messages = [{"role": "user", "content": "What is observability for LLMs?"}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    max_tokens=100,
)
```

### Example 2: Monitor a Gradio Application We created a Gradio template space that shows how to create a simple chat application using a Hugging Face model and trace model calls and user feedback in Langfuse, without leaving Hugging Face. To get started, [duplicate this Gradio template space](https://huggingface.co/spaces/langfuse/langfuse-gradio-example-template?duplicate=true) and follow the instructions in the [README](https://huggingface.co/spaces/langfuse/langfuse-gradio-example-template/blob/main/README.md). ## Step 3: View Traces in Langfuse Once you have instrumented your application and ingested traces or user feedback into Langfuse, you can view your traces in Langfuse. 
![Example trace with Gradio](https://langfuse.com/images/cookbook/huggingface/huggingface-gradio-example-trace.png) _[Example trace in the Langfuse UI](https://langfuse-langfuse-template-space.hf.space/project/cm4r1ajtn000a4co550swodxv/traces/9cdc12fb-71bf-4074-ab0b-0b8d212d839f?timestamp=2024-12-20T12%3A12%3A50.089Z&view=preview)_ ## Additional Resources and Support - [Langfuse documentation](https://langfuse.com/docs) - [Langfuse GitHub repository](https://github.com/langfuse/langfuse) - [Langfuse Discord](https://langfuse.com/discord) - [Langfuse template Space](https://huggingface.co/spaces/langfuse/langfuse-template-space) For more help, open a support thread on [GitHub discussions](https://langfuse.com/discussions) or [open an issue](https://github.com/langfuse/langfuse/issues). ### Using SpeechBrain at Hugging Face https://huggingface.co/docs/hub/speechbrain.md # Using SpeechBrain at Hugging Face `speechbrain` is an open-source, all-in-one conversational toolkit for audio/speech. The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, speech separation, language identification, multi-microphone signal processing, and many others. ## Exploring SpeechBrain in the Hub You can find `speechbrain` models by filtering at the left of the [models page](https://huggingface.co/models?library=speechbrain). All models on the Hub come with the following features: 1. An automatically generated model card with a brief description. 2. Metadata tags that help with discoverability, with information such as the language, license, paper, and more. 3. An interactive widget you can use to play with the model directly in the browser. 4. An Inference Providers widget that allows you to make inference requests. 
## Using existing models `speechbrain` offers different interfaces to manage pretrained models for different tasks, such as `EncoderClassifier`, `SepformerSeparation`, and `SpectralMaskEnhancement`. These classes have a `from_hparams` method you can use to load a model from the Hub. Here is an example of running inference for sound recognition on urban sounds.

```py
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
  source="speechbrain/urbansound8k_ecapa"
)
out_prob, score, index, text_lab = classifier.classify_file('speechbrain/urbansound8k_ecapa/dog_bark.wav')
```

If you want to see how to load a specific model, you can click `Use in speechbrain` and you will be given a working snippet you can use to load it! ## Additional resources * SpeechBrain [website](https://speechbrain.github.io/). * SpeechBrain [docs](https://speechbrain.readthedocs.io/en/latest/index.html). ### ZenML on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-zenml.md # ZenML on Spaces [ZenML](https://github.com/zenml-io/zenml) is an extensible, open-source MLOps framework for creating portable, production-ready MLOps pipelines. It's built for Data Scientists, ML Engineers, and MLOps Developers to collaborate as they develop to production. ZenML offers a simple and flexible syntax, is cloud- and tool-agnostic, and has interfaces/abstractions catered toward ML workflows. With ZenML you'll have all your favorite tools in one place, so you can tailor a workflow that caters to your specific needs. The ZenML Hugging Face Space allows you to get up and running with a deployed version of ZenML with just a few clicks. Within a few minutes, you'll have this default ZenML dashboard deployed and ready for you to connect to from your local machine. In the sections that follow, you'll learn to deploy your own instance of ZenML and use it to view and manage your machine learning pipelines right from the Hub. 
ZenML on Hugging Face Spaces is a **self-contained application completely hosted on the Hub using Docker**. The diagram below illustrates the complete process. ![ZenML on Hugging Face Spaces -- default deployment](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/zenml/hf_spaces_chart.png) Visit [the ZenML documentation](https://docs.zenml.io/) to learn more about its features and how to get started with running your machine learning pipelines through your Hugging Face Spaces deployment. You can check out [some small sample examples](https://github.com/zenml-io/zenml/tree/main/examples) of ZenML pipelines to get started, or take your pick of some more complex production-grade projects at [the ZenML Projects repository](https://github.com/zenml-io/zenml-projects). ZenML integrates with many of your favorite tools out of the box, [including Hugging Face](https://zenml.io/integrations/huggingface) of course! If there's something else you want to use, ZenML is built to be extensible and you can easily make it work with whatever your custom tool or workflow is. ## ⚡️ Deploy ZenML on Spaces You can deploy ZenML on Spaces with just a few clicks: To set up your ZenML app, you need to specify three main components: the Owner (either your personal account or an organization), a Space name, and the Visibility (a bit lower down the page). Note that the space visibility needs to be set to 'Public' if you wish to connect to the ZenML server from your local machine. ![Choose the ZenML Docker template](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/zenml/choose_space.png) You also have the option to select a higher-tier machine for your server. The advantage of selecting a paid CPU instance is that it is not subject to auto-shutdown policies and thus will stay up as long as you leave it up. 
In order to make use of a persistent CPU, you'll likely want to create and set up a MySQL database to connect to (see below). To personalize your Space's appearance, such as the title, emojis, and colors, navigate to "Files and Versions" and modify the metadata in your README.md file. Full information on Spaces configuration parameters can be found in the Hugging Face [documentation reference guide](https://huggingface.co/docs/hub/spaces-config-reference). After creating your Space, you'll notice a 'Building' status along with logs displayed on the screen. When this switches to 'Running', your Space is ready for use. If the ZenML login UI isn't visible, try refreshing the page. In the upper-right hand corner of your space you'll see a button with three dots which, when you click on it, will offer you a menu option to "Embed this Space". (See [the Hugging Face documentation](https://huggingface.co/docs/hub/spaces-embed) for more details on this feature.) Copy the "Direct URL" shown in the box that you can now see on the screen. This should look something like `https://<owner>-<space-name>.hf.space`. Open that URL and use our default login to access the dashboard (username: 'default', password: leave it empty). ## Connecting to your ZenML Server from your Local Machine Once you have your ZenML server up and running, you can connect to it from your local machine. To do this, you'll need to get your Space's 'Direct URL' (see above).

> [!WARNING]
> Your Space's URL will only be available and usable for connecting from your local machine if the visibility of the space is set to 'Public'.

You can use the 'Direct URL' to connect to your ZenML server from your local machine with the following CLI command (after installing ZenML, and using your own Direct URL instead of the placeholder):

```shell
zenml connect --url '<DIRECT_URL>' --username='default' --password=''
```

You can also use the Direct URL in your browser to use the ZenML dashboard as a fullscreen application (i.e. 
without the Hugging Face Spaces wrapper around it).

> [!WARNING]
> The ZenML dashboard will currently not work when viewed from within the Hugging Face webpage (i.e. wrapped in the main `https://huggingface.co/...` website). This is on account of a limitation in how cookies are handled between ZenML and Hugging Face. You **must** view the dashboard from the 'Direct URL' (see above).

## Extra Configuration Options

By default, the ZenML application will be configured to use a non-persistent SQLite database. If you want to use a persistent database, you can configure this by amending the `Dockerfile` in your Space's root directory. For full details on the various parameters you can change, see [our reference documentation](https://docs.zenml.io/getting-started/deploying-zenml/docker#zenml-server-configuration-options) on configuring ZenML when deployed with Docker.

> [!TIP]
> If you are using the space just for testing and experimentation, you don't need to make any changes to the configuration. Everything will work out of the box.

You can also use an external secrets backend together with your Hugging Face Spaces as described in [our documentation](https://docs.zenml.io/getting-started/deploying-zenml/docker#zenml-server-configuration-options). You should be sure to use Hugging Face's inbuilt 'Repository secrets' functionality to configure any secrets you need to use in your `Dockerfile` configuration. [See the documentation](https://huggingface.co/docs/hub/spaces-sdks-docker#secret-management) for more details on how to set this up.

> [!WARNING]
> If you wish to use a cloud secrets backend together with ZenML for secrets management, **you must take the following minimal security precautions** on your ZenML Server via the Dashboard:
>
> - change the password on the `default` account that you get when you start. You can do this from the Dashboard or via the CLI.
> - create a new user account with a password and assign it the `admin` role. This can also be done from the Dashboard (by 'inviting' a new user) or via the CLI.
> - reconnect to the server using the new user account and password as described above, and use this new user account as your working account.
>
> This is because the default user created by the Hugging Face Spaces deployment process has no password assigned to it, and as the Space is publicly accessible (since the Space is public), *potentially anyone could access your secrets without this extra step*. To change your password, navigate to the Settings page by clicking the button in the upper right hand corner of the Dashboard and then click 'Update Password'.

## Upgrading your ZenML Server on HF Spaces

The default space will use the latest version of ZenML automatically. If you want to update your version, you can simply select the 'Factory reboot' option within the 'Settings' tab of the space. Note that this will wipe any data contained within the space, so if you are not using a MySQL persistent database (as described above) you will lose any data contained within your ZenML deployment on the space. You can also configure the space to use an earlier version by updating the `Dockerfile`'s `FROM` statement at the very top.

## Next Steps

As a next step, check out our [Starter Guide to MLOps with ZenML](https://docs.zenml.io/starter-guide/pipelines), which is a series of short practical pages on how to get going quickly. Alternatively, check out [our `quickstart` example](https://github.com/zenml-io/zenml/tree/main/examples/quickstart), which is a full end-to-end example of many of the features of ZenML.

## 🤗 Feedback and support

If you are having trouble with your ZenML server on Hugging Face Spaces, you can view the logs by clicking on the "Open Logs" button at the top of the space. This will give you more context of what's happening with your server. 
If you have suggestions or need specific support for anything else that isn't working, please [join the ZenML Slack community](https://zenml.io/slack-invite/) and we'll be happy to help you out! ### Using SpanMarker at Hugging Face https://huggingface.co/docs/hub/span_marker.md # Using SpanMarker at Hugging Face [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) is a framework for training powerful Named Entity Recognition models using familiar encoders such as BERT, RoBERTa and DeBERTa. Tightly implemented on top of the 🤗 Transformers library, SpanMarker can take good advantage of it. As a result, SpanMarker will be intuitive to use for anyone familiar with Transformers. ## Exploring SpanMarker in the Hub You can find `span_marker` models by filtering at the left of the [models page](https://huggingface.co/models?library=span-marker). All models on the Hub come with these useful features: 1. An automatically generated model card with a brief description. 2. An interactive widget you can use to play with the model directly in the browser. 3. An Inference Providers widget that allows you to make inference requests. ## Installation To get started, you can follow the [SpanMarker installation guide](https://tomaarsen.github.io/SpanMarkerNER/install.html). You can also use the following one-line install through pip:

```
pip install -U span_marker
```

## Using existing models All `span_marker` models can easily be loaded from the Hub.

```py
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
```

Once loaded, you can use [`SpanMarkerModel.predict`](https://tomaarsen.github.io/SpanMarkerNER/api/span_marker.modeling.html#span_marker.modeling.SpanMarkerModel.predict) to perform inference. 
```py
model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
```

```json
[
  {"span": "Amelia Earhart", "label": "person-other", "score": 0.7629689574241638, "char_start_index": 0, "char_end_index": 14},
  {"span": "Lockheed Vega 5B", "label": "product-airplane", "score": 0.9833564758300781, "char_start_index": 38, "char_end_index": 54},
  {"span": "Atlantic", "label": "location-bodiesofwater", "score": 0.7621214389801025, "char_start_index": 66, "char_end_index": 74},
  {"span": "Paris", "label": "location-GPE", "score": 0.9807717204093933, "char_start_index": 78, "char_end_index": 83}
]
```

If you want to load a specific SpanMarker model, you can click `Use in SpanMarker` and you will be given a working snippet! ## Additional resources * SpanMarker [repository](https://github.com/tomaarsen/SpanMarkerNER) * SpanMarker [docs](https://tomaarsen.github.io/SpanMarkerNER) ### Downloading datasets https://huggingface.co/docs/hub/datasets-downloading.md # Downloading datasets ## Integrated libraries If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use this dataset" button on the dataset page. For example, [`samsum`](https://huggingface.co/datasets/knkarthick/samsum?library=datasets) shows how to do so with `datasets` below. ## Using the Hugging Face Client Library You can use the [`huggingface_hub`](/docs/huggingface_hub) library to create, delete, update and retrieve information from repos. For example, to download the `HuggingFaceH4/ultrachat_200k` dataset from the command line, run

```bash
hf download HuggingFaceH4/ultrachat_200k --repo-type dataset
```

See the [HF CLI download documentation](https://huggingface.co/docs/huggingface_hub/en/guides/cli#download-a-dataset-or-a-space) for more information. You can also integrate this into your own library! 
For example, you can quickly load a CSV dataset with a few lines using Pandas.

```py
from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "YOUR_REPO_ID"
FILENAME = "data.csv"

dataset = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
)
```

## Using Git Since all datasets on the Hub are Xet-backed Git repositories, you can clone the datasets locally by [installing git-xet](./xet/using-xet-storage#git-xet) and running:

```bash
git xet install
git lfs install
git clone git@hf.co:datasets/<dataset ID>
# example: git clone git@hf.co:datasets/allenai/c4
```

If you have write access to the particular dataset repo, you'll also have the ability to commit and push revisions to the dataset. Add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes and/or access private repos. ## Using hf-mount For large datasets, you can mount a repo as a local filesystem with [hf-mount](https://github.com/huggingface/hf-mount) instead of downloading the full repo. Files are fetched lazily: only the bytes your code reads hit the network. This is useful when your workflow expects local file paths (e.g. `tarfile`, `zipfile`, `imagefolder`) rather than Python iterators.

```bash
brew install hf-mount
hf-mount start repo datasets/stanfordnlp/imdb /tmp/imdb
```

Repos are mounted read-only. See [Mount as a Local Filesystem](./storage-buckets-access#mount-as-a-local-filesystem) for full setup details, backend options, and caching. ### Using MLX at Hugging Face https://huggingface.co/docs/hub/mlx.md # Using MLX at Hugging Face [MLX](https://github.com/ml-explore/mlx) is a model training and serving framework for Apple silicon made by Apple Machine Learning Research. It comes with a variety of examples: - [Generate text with MLX-LM](https://github.com/ml-explore/mlx-lm/tree/main) and [generate text with MLX-LM for models in GGUF format](https://github.com/ml-explore/mlx-examples/tree/main/llms/gguf_llm). 
- Large-scale text generation with [LLaMA](https://github.com/ml-explore/mlx-examples/tree/main/llms/llama). - Fine-tuning with [LoRA](https://github.com/ml-explore/mlx-examples/tree/main/lora). - Generating images with [Stable Diffusion](https://github.com/ml-explore/mlx-examples/tree/main/stable_diffusion). - Speech recognition with [OpenAI's Whisper](https://github.com/ml-explore/mlx-examples/tree/main/whisper). ## Exploring MLX on the Hub You can find MLX models by filtering at the left of the [models page](https://huggingface.co/models?library=mlx&sort=trending). There's also an open [MLX community](https://huggingface.co/mlx-community) of contributors converting and publishing weights in MLX format. Thanks to the MLX Hugging Face Hub integration, you can load MLX models with a few lines of code. ## Installation MLX comes as a standalone package, and there's a subpackage called MLX-LM with Hugging Face integration for Large Language Models. To install MLX-LM, you can use the following one-line install through `pip`:

```bash
pip install mlx-lm
```

You can get more information about it [here](https://github.com/ml-explore/mlx-lm/tree/main). If you install `mlx-lm`, you don't need to install `mlx`. If you don't want to use `mlx-lm` but only MLX, you can install MLX itself as follows. With `pip`:

```bash
pip install mlx
```

With `conda`:

```bash
conda install -c conda-forge mlx
```

## Using Existing Models MLX-LM has useful utilities to generate text. The following line directly downloads and loads the model and starts generating text. 
```bash
python -m mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.2 --prompt "hello"
```

For a full list of generation options, run

```bash
python -m mlx_lm.generate --help
```

You can also load a model and start generating text through Python like below:

```python
from mlx_lm import load, generate

model, tokenizer = load("mistralai/Mistral-7B-Instruct-v0.2")
response = generate(model, tokenizer, prompt="hello", verbose=True)
```

MLX-LM supports popular LLM architectures including LLaMA, Phi-2, Mistral, and Qwen. Models other than the supported ones can easily be downloaded with the `hf` CLI. Setting `HF_XET_HIGH_PERFORMANCE=1` raises concurrency bounds and buffer sizes for machines with high bandwidth and at least 64 GB of RAM:

```bash
pip install -U huggingface_hub
export HF_XET_HIGH_PERFORMANCE=1
hf download <user_id>/<model_name> --local-dir <local_folder_path>
```

## Converting and Sharing Models

You can convert, and optionally quantize, LLMs from the Hugging Face Hub as follows:

```bash
python -m mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.1 -q
```

If you want to directly push the model after the conversion, you can do it like below.

```bash
python -m mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-v0.1 \
    -q \
    --upload-repo <user_id>/<model_name>
```

## Additional Resources

* [MLX Repository](https://github.com/ml-explore/mlx)
* [MLX Docs](https://ml-explore.github.io/mlx/)
* [MLX-LM](https://github.com/ml-explore/mlx-lm/tree/main)
* [MLX Examples](https://github.com/ml-explore/mlx-examples/tree/main)
* [All MLX models on the Hub](https://huggingface.co/models?library=mlx&sort=trending)

### Third-party scanner: Protect AI

https://huggingface.co/docs/hub/security-protectai.md

# Third-party scanner: Protect AI

> [!TIP]
> Interested in joining our security partnership / providing scanning information on the Hub?
Please get in touch with us over at security@huggingface.co.

[Protect AI](https://protectai.com/)'s [Guardian](https://protectai.com/guardian) catches pickle, Keras, and other exploits as detailed on their [Knowledge Base page](https://protectai.com/insights/knowledge-base/). Guardian also benefits from reports sent in by their community of bounty hunters on [Huntr](https://huntr.com/).

![Protect AI report for the danger.dat file contained in mcpotato/42-eicar-street](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/protect-ai-report.png)

*Example of a report for [danger.dat](https://huggingface.co/mcpotato/42-eicar-street/blob/main/danger.dat)*

We partnered with Protect AI to provide scanning and make the Hub safer. Just as files are scanned by our internal scanning system, public repositories' files are scanned by Guardian. Our frontend has been redesigned specifically for this purpose, to accommodate new scanners.

Here is an example repository you can check out to see the feature in action: [mcpotato/42-eicar-street](https://huggingface.co/mcpotato/42-eicar-street).

## Model security refresher

To share models, we serialize the data structures we use to interact with the models, in order to facilitate storage and transport. Some serialization formats are vulnerable to nasty exploits, such as arbitrary code execution (looking at you, pickle), making sharing models potentially dangerous. As Hugging Face has become a popular platform for model sharing, we'd like to protect the community from this, which is why we have developed tools like [picklescan](https://github.com/mmaitre314/picklescan) and why we integrate third-party scanners. Pickle is not the only exploitable format out there; [see for reference](https://github.com/Azure/counterfit/wiki/Abusing-ML-model-file-formats-to-create-malware-on-AI-systems:-A-proof-of-concept) how one can exploit Keras Lambda layers to achieve arbitrary code execution.
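To make the pickle risk concrete, here is a minimal, harmless sketch of the mechanism such exploits rely on: `__reduce__` lets a pickled object name an arbitrary callable that `pickle.loads` will invoke at load time. The `Payload` class and the `PWNED` flag are our own illustrative stand-ins, not part of any real exploit.

```python
import builtins
import pickle

class Payload:
    # pickle calls __reduce__ to learn how to rebuild an object; a
    # malicious file can return ANY callable plus its arguments, and
    # pickle.loads will invoke it during deserialization.
    def __reduce__(self):
        # Harmless stand-in for what could be os.system("rm -rf /")
        return (exec, ("import builtins; builtins.PWNED = True",))

blob = pickle.dumps(Payload())
pickle.loads(blob)  # merely loading the bytes executes the payload
print(getattr(builtins, "PWNED", False))  # True
```

This is why you should never call `pickle.loads` on untrusted bytes, and why scanners like picklescan inspect the opcodes of pickle files rather than loading them.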
### Getting Started with Repositories

https://huggingface.co/docs/hub/repositories-getting-started.md

# Getting Started with Repositories

This beginner-friendly guide will help you get the basic skills you need to create and manage your repository on the Hub. Each section builds on the previous one, so feel free to choose where to start!

## Requirements

This document shows how to handle repositories through the web interface as well as through the terminal. There are no requirements if working with the UI. If you want to work with the terminal, please follow these installation instructions. If you do not have `git` available as a CLI command yet, you will need to [install Git](https://git-scm.com/downloads) for your platform. You will also need to [install Git-Xet](./xet/using-xet-storage#git-xet), which will be used to handle large files such as images and model weights.

> [!TIP]
> To be able to download and upload large files from Git, you need to install the [Git Xet](./xet/using-xet-storage#git) extension.

To be able to push your code to the Hub, you'll need to authenticate. The easiest way to do this is by installing the [`hf` CLI](https://huggingface.co/docs/huggingface_hub/guides/cli) and running the login command:

```bash
# Install the hf CLI first, e.g. with:
#   pip install -U "huggingface_hub[cli]"
hf auth login
```

**The content in the Getting Started section of this document is also available as a video!**

## Creating a repository

Using the Hub's web interface you can easily create repositories, add files (even large ones!), explore models, visualize diffs, and much more. There are three kinds of repositories on the Hub, and in this guide you'll be creating a **model repository** for demonstration purposes. For information on creating and managing models, datasets, and Spaces, refer to their respective documentation.

1. To create a new repository, visit [huggingface.co/new](http://huggingface.co/new):
2.
Specify the owner of the repository: this can be either you or any of the organizations you're affiliated with.
3. Enter your model's name. This will also be the name of the repository.
4. Specify whether you want your model to be public or private.
5. Specify the license. You can leave the *License* field blank for now. To learn about licenses, visit the [**Licenses**](repositories-licenses) documentation.

After creating your model repository, you should see a page like this:

Note that the Hub prompts you to create a *Model Card*, which you can learn about in the [**Model Cards documentation**](./model-cards). Including a Model Card in your model repo is best practice, but since we're only making a test repo at the moment we can skip this.

## Adding files to a repository (Web UI)

To add files to your repository via the web UI, start by selecting the **Files** tab, navigating to the desired directory, and then clicking **Add file**. You'll be given the option to create a new file or upload a file directly from your computer.

### Creating a new file

Choosing to create a new file will take you to the following editor screen, where you can choose a name for your file, add content, and save your file with a message that summarizes your changes. Instead of directly committing the new file to your repo's `main` branch, you can select `Open as a pull request` to create a [Pull Request](./repositories-pull-requests-discussions).

### Uploading a file

If you choose _Upload file_ you'll be able to choose a local file to upload, along with a message summarizing your changes to the repo. As with creating new files, you can select `Open as a pull request` to create a [Pull Request](./repositories-pull-requests-discussions) instead of adding your changes directly to the `main` branch of your repo.
## Adding files to a repository (CLI)[[cli]]

You can upload files to your repository directly from the terminal using the [`hf` CLI](https://huggingface.co/docs/huggingface_hub/guides/cli). Use the `hf upload` command to push local files or entire folders:

```bash
# Upload a single file to your model repo
hf upload your-username/your-model-name model.safetensors

# Upload an entire directory
hf upload your-username/your-model-name ./my-model-directory

# Upload to a dataset repo
hf upload your-username/your-dataset-name ./data --repo-type dataset
```

The `hf` CLI handles large files automatically; no extra setup is required.

## Adding files to a repository (git)[[terminal]]

### Cloning repositories

Downloading repositories to your local machine is called *cloning*. You can use the following commands to load your repo and navigate to it:

```bash
git clone https://huggingface.co/<your-username>/<your-model-name>
cd <your-model-name>
```

Or for a dataset repo:

```bash
git clone https://huggingface.co/datasets/<your-username>/<your-dataset-name>
cd <your-dataset-name>
```

You can clone over SSH with the following command:

```bash
git clone git@hf.co:<your-username>/<your-model-name>
cd <your-model-name>
```

You'll need to add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes or access private repositories.

### Set up

Now you can add any files you want to the repository! 🔥 Do you have files larger than 10MB? Those files should be tracked with [`git-xet`](./xet/using-xet-storage#git-xet), which you can initialize with:

```bash
git xet install
```

When you use Hugging Face to create a repository, Hugging Face automatically provides a list of common file extensions for common Machine Learning large files in the `.gitattributes` file, which `git-xet` uses to efficiently track changes to your large files. However, you might need to add new extensions if your file types are not already handled. You can do so with `git xet track "*.your_extension"`.
### Pushing files

You can use Git to save new files and any changes to already existing files as a bundle of changes called a *commit*, which can be thought of as a "revision" to your project. To create a commit, you have to `add` the files to let Git know that we're planning on saving the changes and then `commit` those changes. In order to sync the new commit with the Hugging Face Hub, you then `push` the commit to the Hub.

```bash
# Create any files you like! Then...
git add .
git commit -m "First model version"  # You can choose any descriptive message
git push
```

And you're done! You can check your repository on Hugging Face with all the recently added files. For example, in the screenshot below the user added a number of files. Note that some files in this example have a size of `1.04 GB`, so the repo uses Xet to track them.

> [!TIP]
> If you cloned the repository with HTTP, you might be asked to enter your username and password on every push operation. The simplest way to avoid this repetition is to [switch to SSH](#cloning-repositories) instead of HTTP. Alternatively, if you have to use HTTP, you might find it helpful to set up a [git credential helper](https://git-scm.com/docs/gitcredentials#_avoiding_repetition) to autofill your username and password.

## Viewing a repo's history

Every time you go through the `add`-`commit`-`push` cycle, the repo will keep track of every change you've made to your files. The UI allows you to explore the model files and commits and to see the difference (also known as *diff*) introduced by each commit. To see the history, you can click on the **History: X commits** link. You can click on an individual commit to see what changes that commit introduced:

### Embedding Atlas

https://huggingface.co/docs/hub/datasets-embedding-atlas.md

# Embedding Atlas

[Embedding Atlas](https://apple.github.io/embedding-atlas/) is an interactive visualization tool for exploring large embedding spaces.
It enables you to visualize, cross-filter, and search embeddings alongside associated metadata, helping you understand patterns and relationships in high-dimensional data. All computation happens on your computer, ensuring your data remains private and secure. Here is an [example atlas](https://huggingface.co/spaces/davanstrien/megascience) for the [MegaScience](https://huggingface.co/datasets/MegaScience/MegaScience) dataset hosted as a Static Space:

## Key Features

- **Interactive exploration**: Navigate through millions of embeddings with smooth, responsive visualization
- **Browser-based computation**: Compute embeddings and projections locally without sending data to external servers
- **Cross-filtering**: Link and filter data across multiple metadata columns
- **Search capabilities**: Find similar data points to a given query or existing item
- **Multiple integration options**: Use via command line, Jupyter widgets, or web interface

## Prerequisites

First, install Embedding Atlas:

```bash
pip install embedding-atlas
```

If you plan to load private datasets from the Hugging Face Hub, you'll also need to [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login):

```bash
hf auth login
```

## Loading Datasets from the Hub

Embedding Atlas provides seamless integration with the Hugging Face Hub, allowing you to visualize embeddings from any dataset directly.

### Using the Command Line

The simplest way to visualize a Hugging Face dataset is through the command line interface.
Try it with the IMDB dataset: ```bash # Load the IMDB dataset from the Hub embedding-atlas stanfordnlp/imdb # Specify the text column for embedding computation embedding-atlas stanfordnlp/imdb --text "text" # Load only a sample for faster exploration embedding-atlas stanfordnlp/imdb --text "text" --sample 5000 ``` For your own datasets, use the same pattern: ```bash # Load your dataset from the Hub embedding-atlas username/dataset-name # Load multiple splits embedding-atlas username/dataset-name --split train --split test # Specify custom text column embedding-atlas username/dataset-name --text "content" ``` ### Using Python and Jupyter You can also use Embedding Atlas in Jupyter notebooks for interactive exploration: ```python from embedding_atlas.widget import EmbeddingAtlasWidget from datasets import load_dataset import pandas as pd # Load the IMDB dataset from Hugging Face Hub dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]") # Convert to pandas DataFrame df = dataset.to_pandas() # Create interactive widget widget = EmbeddingAtlasWidget(df) widget ``` For your own datasets: ```python from embedding_atlas.widget import EmbeddingAtlasWidget from datasets import load_dataset import pandas as pd # Load your dataset from the Hub dataset = load_dataset("username/dataset-name", split="train") df = dataset.to_pandas() # Create interactive widget widget = EmbeddingAtlasWidget(df) widget ``` ### Working with Pre-computed Embeddings If you have datasets with pre-computed embeddings, you can load them directly: ```bash # Load dataset with pre-computed coordinates embedding-atlas username/dataset-name \ --x "embedding_x" \ --y "embedding_y" # Load with pre-computed nearest neighbors embedding-atlas username/dataset-name \ --neighbors "neighbors_column" ``` ## Customizing Embeddings Embedding Atlas uses [SentenceTransformers](https://huggingface.co/sentence-transformers) by default but supports custom embedding models: ```bash # Use a specific embedding model 
embedding-atlas stanfordnlp/imdb \ --text "text" \ --model "sentence-transformers/all-MiniLM-L6-v2" # For models requiring remote code execution embedding-atlas username/dataset-name \ --model "custom/model" \ --trust-remote-code ``` ### UMAP Projection Parameters Fine-tune the dimensionality reduction for your specific use case: ```bash embedding-atlas stanfordnlp/imdb \ --text "text" \ --umap-n-neighbors 30 \ --umap-min-dist 0.1 \ --umap-metric "cosine" ``` ## Use Cases ### Exploring Text Datasets Visualize and explore text corpora to identify clusters, outliers, and patterns: ```python from embedding_atlas.widget import EmbeddingAtlasWidget from datasets import load_dataset import pandas as pd # Load a text classification dataset dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]") df = dataset.to_pandas() # Visualize with metadata widget = EmbeddingAtlasWidget(df) widget ``` ## Additional Resources - [Embedding Atlas GitHub Repository](https://github.com/apple/embedding-atlas) - [Official Documentation](https://apple.github.io/embedding-atlas/) - [Interactive Demo](https://apple.github.io/embedding-atlas/upload/) - [Command Line Reference](https://apple.github.io/embedding-atlas/tool.html) ### Using PEFT at Hugging Face https://huggingface.co/docs/hub/peft.md # Using PEFT at Hugging Face ๐Ÿค— [Parameter-Efficient Fine-Tuning (PEFT)](https://huggingface.co/docs/peft/index) is a library for efficiently adapting pre-trained language models to various downstream applications without fine-tuning all the modelโ€™s parameters. ## Exploring PEFT on the Hub You can find PEFT models by filtering at the left of the [models page](https://huggingface.co/models?library=peft&sort=trending). ## Installation To get started, you can check out the [Quick Tour in the PEFT docs](https://huggingface.co/docs/peft/quicktour). To install, follow the [PEFT installation guide](https://huggingface.co/docs/peft/install). 
You can also use the following one-line install through pip:

```bash
pip install peft
```

## Using existing models

All PEFT models can be loaded from the Hub. To use a PEFT model you also need to load the base model that was fine-tuned, as shown below. Every fine-tuned model has the base model in its model card.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

base_model = "mistralai/Mistral-7B-v0.1"
adapter_model = "dfurman/Mistral-7B-Instruct-v0.2"

model = AutoModelForCausalLM.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

model = model.to("cuda")
model.eval()
```

Once loaded, you can pass your inputs to the tokenizer to prepare them, and call `model.generate()` in regular `transformers` fashion.

```py
import torch

inputs = tokenizer("Tell me the recipe for chocolate chip cookie", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])
```

It outputs the following:

```text
Tell me the recipe for chocolate chip cookie dough.

1. Preheat oven to 375 degrees F (190 degrees C).
2. In a large bowl, cream together 1/2 cup (1 stick) of butter or margarine, 1/2 cup granulated sugar, and 1/2 cup packed brown sugar.
3. Beat in 1 egg and 1 teaspoon vanilla extract.
4. Mix in 1 1/4 cups all-purpose flour.
5. Stir in 1/2 teaspoon baking soda and 1/2 teaspoon salt.
6. Fold in 3/4 cup semisweet chocolate chips.
7. Drop by
```

If you want to load a specific PEFT model, you can click `Use in PEFT` in the model card and you will be given a working snippet!
## Additional resources

* PEFT [repository](https://github.com/huggingface/peft)
* PEFT [docs](https://huggingface.co/docs/peft/index)
* PEFT [models](https://huggingface.co/models?library=peft&sort=trending)

### Using Keras at Hugging Face

https://huggingface.co/docs/hub/keras.md

# Using Keras at Hugging Face

Keras is an open-source multi-backend deep learning framework, with support for JAX, TensorFlow, and PyTorch. You can find more details about it on [keras.io](https://keras.io/).

## Exploring Keras in the Hub

You can list `keras` models on the Hub by filtering by library name on the [models page](https://huggingface.co/models?library=keras&sort=downloads). Keras models on the Hub come with useful features when uploaded directly from the Keras library:

1. A generated model card with a description, a plot of the model, and more.
2. A download count to monitor the popularity of a model.
3. A code snippet to quickly get started with the model.

## Using existing models

Keras is deeply integrated with the Hugging Face Hub. This means you can load and save models on the Hub directly from the library. To do that, you need to install a recent version of Keras and `huggingface_hub`. The `huggingface_hub` library is a lightweight Python client used by Keras to interact with the Hub.

```bash
pip install -U keras huggingface_hub
```

Once you have the library installed, you just need to use the regular `keras.saving.load_model` method by passing as argument a Hugging Face path. An HF path is a `repo_id` prefixed by `hf://`, e.g. `"hf://keras-io/weather-prediction"`. Read more about `load_model` in the [Keras documentation](https://keras.io/api/models/model_saving_apis/model_saving_and_loading/#load_model-function).

```py
import keras

model = keras.saving.load_model("hf://Wauplin/mnist_example")
```

If you want to see how to load a specific model, you can click **Use this model** on the model page to get a working code snippet!
## Sharing your models Similarly to `load_model`, you can save and share a `keras` model on the Hub using `model.save()` with an HF path: ```py model = ... model.save("hf://your-username/your-model-name") ``` If the repository does not exist on the Hub, it will be created for you. The uploaded model contains a model card, a plot of the model, the `metadata.json` and `config.json` files, and a `model.weights.h5` file containing the model weights. By default, the repository will contain a minimal model card. Check out the [Model Card guide](https://huggingface.co/docs/hub/model-cards) to learn more about model cards and how to complete them. You can also programmatically update model cards using `huggingface_hub.ModelCard` (see [guide](https://huggingface.co/docs/huggingface_hub/guides/model-cards)). > [!TIP] > You might be already familiar with `.keras` files. In fact, a `.keras` file is simply a zip file containing the `.json` and `model.weights.h5` files. When pushed to the Hub, the model is saved as an unzipped folder in order to let you navigate through the files. Note that if you manually upload a `.keras` file to a model repository on the Hub, the repository will automatically be tagged as `keras` but you won't be able to load it using `keras.saving.load_model`. ## Additional resources * Keras Developer [Guides](https://keras.io/guides/). * Keras [examples](https://keras.io/examples/). ### Models https://huggingface.co/docs/hub/models.md # Models The Hugging Face Hub hosts many models for a [variety of machine learning tasks](https://huggingface.co/tasks). Models are stored in repositories, so they benefit from [all the features](./repositories) possessed by every repo on the Hugging Face Hub. Additionally, model repos have attributes that make exploring and using models as easy as possible. These docs will take you through everything you'll need to know to find models on the Hub, upload your models, and make the most of everything the Model Hub offers! 
## Contents

- [The Model Hub](./models-the-hub)
- [Model Cards](./model-cards)
- [CO2 emissions](./model-cards-co2)
- [Eval Results](./eval-results)
- [Gated models](./models-gated)
- [Uploading Models](./models-uploading)
- [Downloading Models](./models-downloading)
- [Libraries](./models-libraries)
- [Widgets](./models-widgets)
- [Widget Examples](./models-widgets-examples)
- [Model Inference](./models-inference)
- [Local Apps](./local-apps)
- [Frequently Asked Questions](./models-faq)
- [Advanced Topics](./models-advanced)
- [Integrating libraries with the Hub](./models-adding-libraries)
- [Tasks](./models-tasks)

### Gated models

https://huggingface.co/docs/hub/models-gated.md

# Gated models

To give more control over how models are used, the Hub allows model authors to enable **access requests** for their models. When enabled, users must agree to share their contact information (username and email address) with the model authors to access the model files. Model authors can configure this request with additional fields. A model with access requests enabled is called a **gated model**. Access requests are always granted to individual users rather than to entire organizations. A common use case of gated models is to provide access to early research models before the wider release.

## Manage gated models as a model author

To enable access requests, go to the model settings page. By default, the model is not gated. Click on **Enable Access request** in the top-right corner. By default, access to the model is automatically granted to the user when requesting it. This is referred to as **automatic approval**. In this mode, any user can access your model once they've shared their personal information with you. If you want to manually approve which users can access your model, you must set it to **manual approval**. When this is the case, you will notice more options:

- **Add access** allows you to search for a user and grant them access even if they did not request it.
- **Notification frequency** lets you configure when to get notified if new users request access. It can be set to once a day or real-time. By default, an email is sent to your primary email address. For models hosted under an organization, emails are by default sent to the first 5 admins of the organization. In both cases (user or organization) you can set a different email address in the **Notifications email** field.

### Review access requests

Once access requests are enabled, you have full control over who can access your model, whether the approval mode is manual or automatic. You can review and manage requests either from the UI or via the API.

#### From the UI

You can review who has access to your gated model from its settings page by clicking on the **Review access requests** button. This will open a modal with 3 lists of users:

- **pending**: the list of users waiting for approval to access your model. This list is empty unless you've selected **manual approval**. You can either **Accept** or **Reject** the request. If the request is rejected, the user cannot access your model and cannot request access again.
- **accepted**: the complete list of users with access to your model. You can choose to **Reject** access at any time for any user, whether the approval mode is manual or automatic. You can also **Cancel** the approval, which will move the user to the *pending* list.
- **rejected**: the list of users you've manually rejected. Those users cannot access your models. If they go to your model repository, they will see a message *Your request to access this repo has been rejected by the repo's authors*.

#### Via the API

You can automate the approval of access requests by using the API. You must pass a `token` with `write` access to the gated repository. To generate a token, go to [your user settings](https://huggingface.co/settings/tokens).
| Method | URI | Description | Headers | Payload |
| ------ | --- | ----------- | ------- | ------- |
| `GET` | `/api/models/{repo_id}/user-access-request/pending` | Retrieve the list of pending requests. | `{"authorization": "Bearer $token"}` | |
| `GET` | `/api/models/{repo_id}/user-access-request/accepted` | Retrieve the list of accepted requests. | `{"authorization": "Bearer $token"}` | |
| `GET` | `/api/models/{repo_id}/user-access-request/rejected` | Retrieve the list of rejected requests. | `{"authorization": "Bearer $token"}` | |
| `POST` | `/api/models/{repo_id}/user-access-request/handle` | Change the status of a given access request to `status`. | `{"authorization": "Bearer $token"}` | `{"status": "accepted"/"rejected"/"pending", "user": "username", "rejectionReason": "Optional rejection reason that will be visible to the user (max 200 characters)."}` |
| `POST` | `/api/models/{repo_id}/user-access-request/grant` | Allow a specific user to access your repo. | `{"authorization": "Bearer $token"}` | `{"user": "username"}` |

The base URL for the HTTP endpoints above is `https://huggingface.co`.

**NEW!** Those endpoints are now officially supported in our Python client `huggingface_hub`. List the access requests to your model with [`list_pending_access_requests`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_pending_access_requests), [`list_accepted_access_requests`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_accepted_access_requests) and [`list_rejected_access_requests`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_rejected_access_requests).
You can also accept, cancel and reject access requests with [`accept_access_request`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.accept_access_request), [`cancel_access_request`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.cancel_access_request), [`reject_access_request`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.reject_access_request). Finally, you can grant access to a user with [`grant_access`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.grant_access). ### Download access report You can download a report of all access requests for a gated model with the **download user access report** button. Click on it to download a json file with a list of users. For each entry, you have: - **user**: the user id. Example: *julien-c*. - **fullname**: name of the user on the Hub. Example: *Julien Chaumond*. - **status**: status of the request. Either `"pending"`, `"accepted"` or `"rejected"`. - **email**: email of the user. - **time**: datetime when the user initially made the request. ### Customize requested information By default, users landing on your gated model will be asked to share their contact information (email and username) by clicking the **Agree and send request to access repo** button. If you want to collect more user information, you can configure additional fields. This information will be accessible from the **Settings** tab. To do so, add an `extra_gated_fields` property to your [model card metadata](./model-cards#model-card-metadata) containing a list of key/value pairs. The *key* is the name of the field and *value* its type or an object with a `type` field. The list of field types is: - `text`: a single-line text field. - `checkbox`: a checkbox field. - `date_picker`: a date picker field. - `country`: a country dropdown. 
The list of countries is based on the [ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) standard. - `select`: a dropdown with a list of options. The list of options is defined in the `options` field. Example: `options: ["option 1", "option 2", {label: "option3", value: "opt3"}]`. Finally, you can also personalize the message displayed to the user with the `extra_gated_prompt` extra field. Here is an example of customized request form where the user is asked to provide their company name and country and acknowledge that the model is for non-commercial use only. ```yaml --- extra_gated_prompt: "You agree to not use the model to conduct experiments that cause harm to human subjects." extra_gated_fields: Company: text Country: country Specific date: date_picker I want to use this model for: type: select options: - Research - Education - label: Other value: other I agree to use this model for non-commercial use ONLY: checkbox --- ``` In some cases, you might also want to modify the default text in the gate heading, description, and button. For those use cases, you can modify `extra_gated_heading`, `extra_gated_description` and `extra_gated_button_content` like this: ```yaml --- extra_gated_heading: "Acknowledge license to accept the repository" extra_gated_description: "Our team may take 2-3 days to process your request" extra_gated_button_content: "Acknowledge license" --- ``` ### Example use cases of programmatically managing access requests Here are a few interesting use cases of programmatically managing access requests for gated repos we've seen organically emerge in the community. As a reminder, the model repo needs to be set to manual approval, otherwise users get access to it automatically. Possible use cases of programmatic management include: - If you have advanced user request screening requirements (for advanced compliance requirements, etc) or you wish to handle the user requests outside the Hub. 
- An example of this was Meta's [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) initial release, where users had to request access on a Meta website. - You can ask users for their HF username in your access flow, and then use a script to programmatically accept user requests on the Hub based on your set of conditions. - If you want to condition access to a model on completing a payment flow (note that the actual payment flow happens outside of the Hub). - Here's an [example repo](https://huggingface.co/Trelis/openchat_3.5-function-calling-v3) from TrelisResearch that implements this use case. - [@RonanMcGovern](https://huggingface.co/RonanMcGovern) has posted a [video about the flow](https://www.youtube.com/watch?v=2OT2SI5auQU) and tips on how to implement it. ## Manage gated models as an organization (Team & Enterprise) [Team & Enterprise](https://huggingface.co/docs/hub/en/enterprise) subscribers can create a Gating Group Collection to grant (or reject) access to all the models and datasets in a collection at once. More information about Gating Group Collections can be found in [our dedicated doc](https://huggingface.co/docs/hub/en/enterprise-gating-group-collections). ## Access gated models as a user As a user, if you want to use a gated model, you will need to request access to it. This means that you must be logged in to a Hugging Face user account. Requesting access can only be done from your browser. Go to the model on the Hub and you will be prompted to share your information: By clicking on **Agree**, you agree to share your username and email address with the model authors. In some cases, additional fields might be requested. To help the model authors decide whether to grant you access, try to fill out the form as completely as possible. Once the access request is sent, there are two possibilities. If the approval mechanism is automatic, you immediately get access to the model files.
Otherwise, the requests have to be approved manually by the authors, which can take more time. > [!WARNING] > The model authors have complete control over model access. In particular, they can decide at any time to block your access to the model without prior notice, regardless of the approval mechanism, even if your request has already been approved. ### Download files To download files from a gated model, you'll need to be authenticated. In the browser, this is automatic as long as you are logged in with your account. If you are using a script, you will need to provide a [user token](./security-tokens). In the Hugging Face Python ecosystem (`transformers`, `diffusers`, `datasets`, etc.), you can log in on your machine using the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/index) library by running the following in your terminal: ```bash hf auth login ``` Alternatively, you can log in programmatically with `login()` in a notebook or a script: ```python >>> from huggingface_hub import login >>> login() ``` You can also provide the `token` parameter to most loading methods in the libraries (`from_pretrained`, `hf_hub_download`, `load_dataset`, etc.), directly from your scripts. For more details about how to log in, check out the [login guide](https://huggingface.co/docs/huggingface_hub/quick-start#login). ### Restricting Access for EU Users For gated models, you can add an additional layer of access control to specifically restrict users from European Union countries. This is useful if your model's license or terms of use prohibit its distribution in the EU. To enable this, add the `extra_gated_eu_disallowed: true` property to your model card's metadata. **Important:** This feature will only activate if your model is already gated. If `gated: false` or the property is not set, this restriction will not apply. ```yaml --- license: mit gated: true extra_gated_eu_disallowed: true --- ``` The system identifies a user's location based on their IP address.
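The user access report described earlier in this page can also be processed locally, for example to triage pending requests before accepting or rejecting them programmatically. Below is a minimal sketch; the exact file layout is an assumption based on the fields listed above (`user`, `fullname`, `status`, `email`, `time`):

```python
import json
from collections import Counter

def summarize_access_report(path):
    """Summarize a downloaded user access report.

    Assumption: the report is a JSON list of entries, each with the
    fields documented above (`user`, `fullname`, `status`, `email`, `time`).
    Returns the per-status counts and the usernames still pending review.
    """
    with open(path) as f:
        entries = json.load(f)
    counts = Counter(entry["status"] for entry in entries)
    pending = [entry["user"] for entry in entries if entry["status"] == "pending"]
    return counts, pending
```

The resulting `pending` list can then be fed to the `accept_access_request` / `reject_access_request` calls shown earlier.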
### Run with Docker https://huggingface.co/docs/hub/spaces-run-with-docker.md # Run with Docker You can use Docker to run most Spaces locally. To view instructions to download and run Spaces' Docker images, click on the "Run with Docker" button in the top-right corner of your Space page: ## Login to the Docker registry Some Spaces will require you to log in to Hugging Face's Docker registry. To do so, you'll need to provide: - Your Hugging Face username as `username` - A User Access Token as `password`. Generate one [here](https://huggingface.co/settings/tokens). ### Integrate your library with the Hub https://huggingface.co/docs/hub/models-adding-libraries.md # Integrate your library with the Hub The Hugging Face Hub aims to facilitate sharing machine learning models, checkpoints, and artifacts. This endeavor includes integrating the Hub into many of the amazing third-party libraries in the community. Some of the ones already integrated include [spaCy](https://spacy.io/usage/projects#huggingface_hub), [Sentence Transformers](https://sbert.net/), [OpenCLIP](https://github.com/mlfoundations/open_clip), and [timm](https://huggingface.co/docs/timm/index), among many others. Integration means users can download and upload files to the Hub directly from your library. We hope you will integrate your library and join us in democratizing artificial intelligence for everyone. Integrating the Hub with your library provides many benefits, including: - Free model hosting for you and your users. - Built-in file versioning - even for huge files - made possible by [Git-Xet](./xet/using-xet-storage#git-xet). - Community features (discussions, pull requests, likes). - Usage metrics for all models run with your library. This tutorial will help you integrate the Hub into your library so your users can benefit from all the features offered by the Hub.
Before you begin, we recommend you create a [Hugging Face account](https://huggingface.co/join) from which you can manage your repositories and files. If you need help with the integration, feel free to open an [issue](https://github.com/huggingface/huggingface_hub/issues/new/choose), and we would be more than happy to help you. ## Implementation Implementing an integration of a library with the Hub often means providing built-in methods to load models from the Hub and to allow users to push new models to the Hub. This section will cover the basics of how to do that using the `huggingface_hub` library. For more in-depth guidance, check out [this guide](https://huggingface.co/docs/huggingface_hub/guides/integrations). ### Installation To integrate your library with the Hub, you will need to add the `huggingface_hub` library as a dependency: ```bash pip install huggingface_hub ``` For more details about `huggingface_hub` installation, check out [this guide](https://huggingface.co/docs/huggingface_hub/installation). > [!TIP] > In this guide, we will focus on Python libraries. If you've implemented your library in JavaScript, you can use [`@huggingface/hub`](https://www.npmjs.com/package/@huggingface/hub) instead. The rest of the logic (i.e. hosting files, code samples, etc.) does not depend on the code language. > > ``` > npm add @huggingface/hub > ``` Users will need to authenticate once they have successfully installed the `huggingface_hub` library. The easiest way to authenticate is to save the token on the machine. Users can do that from the terminal using the `hf auth login` command: ``` hf auth login ``` The command tells them if they are already logged in and prompts them for their token. The token is then validated and saved in their `HF_HOME` directory (defaults to `~/.cache/huggingface/token`). Any script or library interacting with the Hub will use this token when sending requests.
Alternatively, users can log in programmatically with `login()` in a notebook or a script: ```py from huggingface_hub import login login() ``` Authentication is optional when downloading files from public repos on the Hub. ### Download files from the Hub Integrations allow users to download a model from the Hub and instantiate it directly from your library. This is often made possible by providing a method (usually called `from_pretrained` or `load_from_hf`) that is specific to your library. To instantiate a model from the Hub, your library has to: - download files from the Hub. This is what we will discuss now. - instantiate the Python model from these files. Use the [`hf_hub_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.hf_hub_download) method to download files from a repository on the Hub. Downloaded files are stored in the cache: `~/.cache/huggingface/hub`. Users won't have to re-download the file the next time they use it, which saves a lot of time for large files. Furthermore, if the repository is updated with a new version of the file, `huggingface_hub` will automatically download the latest version and store it in the cache. Users don't have to worry about updating their files manually. For example, download the `config.json` file from the [lysandre/arxiv-nlp](https://huggingface.co/lysandre/arxiv-nlp) repository: ```python >>> from huggingface_hub import hf_hub_download >>> config_path = hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json") >>> config_path '/home/lysandre/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade/config.json' ``` `config_path` now contains a path to the downloaded file. You are guaranteed that the file exists and is up-to-date.
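To make the "instantiate the model from downloaded files" step concrete, here is one possible shape for such a method. Everything here is a sketch: `MyModel` and its `n_layers` config key are made up, and the `downloader` argument (defaulting to `hf_hub_download`) is only injected so the logic can be exercised without network access:

```python
import json

class MyModel:
    """Toy model class; the `n_layers` config key is made up for this sketch."""

    def __init__(self, config):
        self.config = config

    @classmethod
    def from_pretrained(cls, repo_id, downloader=None):
        # `downloader` is injected for testability; by default it is
        # `huggingface_hub.hf_hub_download` (imported lazily so the sketch
        # can be run without the library installed).
        if downloader is None:
            from huggingface_hub import hf_hub_download
            downloader = hf_hub_download
        config_path = downloader(repo_id=repo_id, filename="config.json")
        with open(config_path) as f:
            config = json.load(f)
        # A real integration would also fetch weight files here and
        # build the actual model object from them.
        return cls(config)
```

A real `from_pretrained` would typically accept extra options (`revision`, `cache_dir`, `token`, etc.) and forward them to `hf_hub_download`.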
If your library needs to download an entire repository, use [`snapshot_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.snapshot_download). It will take care of downloading all the files in parallel. The return value is a path to the directory containing the downloaded files. ```py >>> from huggingface_hub import snapshot_download >>> snapshot_download(repo_id="lysandre/arxiv-nlp") '/home/lysandre/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade' ``` Many options exist to download files from a specific revision, to filter which files to download, to provide a custom cache directory, to download to a local directory, etc. Check out the [download guide](https://huggingface.co/docs/huggingface_hub/en/guides/download) for more details. ### Upload files to the Hub You might also want to provide a method so that users can push their own models to the Hub. This allows the community to build an ecosystem of models compatible with your library. The `huggingface_hub` library offers methods to create repositories and upload files: - `create_repo` creates a repository on the Hub. - `upload_file` and `upload_folder` upload files to a repository on the Hub. The `create_repo` method creates a repository on the Hub. Use the `repo_id` parameter to provide a name for your repository: ```python >>> from huggingface_hub import create_repo >>> create_repo(repo_id="test-model") 'https://huggingface.co/lysandre/test-model' ``` When you check your Hugging Face account, you should now see a `test-model` repository under your namespace. The [`upload_file`](https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api#huggingface_hub.HfApi.upload_file) method uploads a file to the Hub. This method requires the following: - A path to the file to upload. - The final path in the repository. - The repository you wish to push the files to.
For example: ```python >>> from huggingface_hub import upload_file >>> upload_file( ... path_or_fileobj="/home/lysandre/dummy-test/README.md", ... path_in_repo="README.md", ... repo_id="lysandre/test-model" ... ) 'https://huggingface.co/lysandre/test-model/blob/main/README.md' ``` If you check your Hugging Face account, you should see the file inside your repository. Usually, a library will serialize the model to a local directory and then upload the entire folder to the Hub at once. This can be done using [`upload_folder`](https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api#huggingface_hub.HfApi.upload_folder): ```py >>> from huggingface_hub import upload_folder >>> upload_folder( ... folder_path="/home/lysandre/dummy-test", ... repo_id="lysandre/test-model", ... ) ``` For more details about how to upload files, check out the [upload guide](https://huggingface.co/docs/huggingface_hub/en/guides/upload). ## Model cards Model cards are files that accompany the models and provide handy information. Under the hood, model cards are simple Markdown files with additional metadata. Model cards are essential for discoverability, reproducibility, and sharing! You can find a model card as the README.md file in any model repo. See the [model cards guide](./model-cards) for more details about how to create a good model card. If your library allows pushing a model to the Hub, it is recommended to generate a minimal model card with prefilled metadata (typically `library_name`, `pipeline_tag` or `tags`) and information on how the model has been trained. This helps provide a standardized description for all models built with your library. ## Register your library Well done! You should now have a library able to load a model from the Hub and, optionally, push new models to it. The next step is to make sure that your models on the Hub are well-documented and integrated with the platform.
To do so, libraries can be registered on the Hub, which comes with a few benefits for the users: - a pretty label can be shown on the model page (e.g. `KerasNLP` instead of `keras-nlp`) - a link to your library repository and documentation is added to each model page - a custom download count rule can be defined - code snippets can be generated to show how to load the model using your library To register a new library, please open a Pull Request [here](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts) following the instructions below: - The library id should be lowercased and hyphen-separated (example: `"adapter-transformers"`). Make sure to preserve alphabetical order when opening the PR. - set `repoName` and `prettyLabel` with user-friendly casing (example: `DeepForest`). - set `repoUrl` with a link to the library source code (usually a GitHub repository). - (optional) set `docsUrl` with a link to the docs of the library. If the documentation is in the GitHub repo referenced above, no need to set it twice. - set `filter` to `false`. - (optional) define how downloads must be counted by setting `countDownload`. Downloads can be tracked by file extensions or filenames. Make sure to not duplicate the counting. For instance, if loading a model requires 3 files, the download count rule must count downloads only on 1 of the 3 files. Otherwise, the download count will be overestimated. **Note:** if the library uses one of the default config files (`config.json`, `config.yaml`, `hyperparams.yaml`, `params.json`, and `meta.yaml`, see [here](https://huggingface.co/docs/hub/models-download-stats#which-are-the-query-files-for-different-libraries)), there is no need to manually define a download count rule. - (optional) define `snippets` to let the user know how they can quickly instantiate a model. More details below. 
Before opening the PR, make sure that at least one model is referenced on https://huggingface.co/models?other=my-library-name. If not, the model card metadata of the relevant models must be updated with `library_name: my-library-name` (see [example](https://huggingface.co/google/gemma-scope/blob/main/README.md?code=true#L3)). If you are not the owner of the models on the Hub, please open PRs (see [example](https://huggingface.co/MCG-NJU/VFIMamba/discussions/1)). Here is a minimal [example](https://github.com/huggingface/huggingface.js/pull/885/files) adding integration for VFIMamba. ### Code snippets We recommend adding a code snippet to explain how to use a model in your downstream library. To add a code snippet, you should update the [model-libraries-snippets.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts) file with instructions for your model. For example, the [Asteroid](https://huggingface.co/asteroid-team) integration includes a brief code snippet for how to load and use an Asteroid model: ```typescript const asteroid = (model: ModelData) => `from asteroid.models import BaseModel model = BaseModel.from_pretrained("${model.id}")`; ``` Doing so will also add a tag to your model so users can quickly identify models from your library. Once your snippet has been added to [model-libraries-snippets.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts), you can reference it in [model-libraries.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts) as described above. ## Document your library Finally, you can add your library to the Hub's documentation. Check for example the [Setfit PR](https://github.com/huggingface/hub-docs/pull/1150) that added [SetFit](./setfit) to the documentation. 
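As noted in the Model cards section above, generating a minimal prefilled card when pushing a model helps standardize models built with your library. Here is a plain-Python sketch of such a generator; the heading and field values are illustrative, and in practice the `ModelCard`/`ModelCardData` helpers in `huggingface_hub` handle escaping and validation for you:

```python
def minimal_model_card(library_name, pipeline_tag=None, tags=()):
    """Render a README.md string with prefilled model card metadata.

    A stdlib-only sketch of the "minimal model card" recommended above.
    """
    lines = ["---", f"library_name: {library_name}"]
    if pipeline_tag:
        lines.append(f"pipeline_tag: {pipeline_tag}")
    if tags:
        lines.append("tags:")
        lines.extend(f"- {tag}" for tag in tags)
    lines += ["---", "", f"# Model trained with {library_name}"]
    return "\n".join(lines)
```

The resulting string can be written to `README.md` and pushed with `upload_file`, exactly like any other file in the repository.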
### The HF PRO subscription 🔥 https://huggingface.co/docs/hub/pro.md # The HF PRO subscription 🔥 The PRO subscription unlocks essential features for serious users, including: - Higher [storage capacity](./storage-limits) for public and private repositories - Higher bandwidth and API [rate limits](./rate-limits) - Included credits for [Inference Providers](/docs/inference-providers/) - Higher tier for [ZeroGPU Spaces](./spaces-zerogpu) usage, and pay-as-you-go quota extension - Ability to create ZeroGPU Spaces and use [Dev Mode](./spaces-dev-mode) - Ability to publish Social Posts and Community Blogs - Leverage the [Data Studio](./data-studio) on private datasets - Run and schedule serverless [CPU/GPU Jobs](./jobs) View the full list of benefits at **https://huggingface.co/pro**, then subscribe at https://huggingface.co/subscribe/pro ### Using BERTopic at Hugging Face https://huggingface.co/docs/hub/bertopic.md # Using BERTopic at Hugging Face [BERTopic](https://github.com/MaartenGr/BERTopic) is a topic modeling framework that leverages 🤗 transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions. BERTopic supports all kinds of topic modeling techniques: guided, supervised, semi-supervised, manual, multi-topic distributions, hierarchical, class-based, dynamic, online/incremental, multimodal, multi-aspect, text generation/LLM, zero-shot (new!), merge models (new!), and seed words (new!). ## Exploring BERTopic on the Hub You can find BERTopic models by filtering at the left of the [models page](https://huggingface.co/models?library=bertopic&sort=trending). BERTopic models hosted on the Hub have a model card with useful information about the models. Thanks to the BERTopic Hugging Face Hub integration, you can load BERTopic models with a few lines of code. You can also deploy these models using [Inference Endpoints](https://huggingface.co/inference-endpoints).
## Installation To get started, you can follow the [BERTopic installation guide](https://github.com/MaartenGr/BERTopic#installation). You can also use the following one-line install through pip: ```bash pip install bertopic ``` ## Using Existing Models All BERTopic models can easily be loaded from the Hub: ```py from bertopic import BERTopic topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia") ``` Once loaded, you can use BERTopic's features to predict the topics for new instances: ```py topic, prob = topic_model.transform("This is an incredible movie!") topic_model.topic_labels_[topic] ``` Which gives us the following topic: ```text 64_rating_rated_cinematography_film ``` ## Sharing Models When you have created a BERTopic model, you can easily share it with others through the Hugging Face Hub. To do so, we can make use of the `push_to_hf_hub` function that allows us to directly push the model to the Hugging Face Hub: ```python from bertopic import BERTopic # Train model topic_model = BERTopic().fit(my_docs) # Push to HuggingFace Hub topic_model.push_to_hf_hub( repo_id="MaartenGr/BERTopic_ArXiv", save_ctfidf=True ) ``` Note that the saved model does not include the dimensionality reduction and clustering algorithms. Those are removed since they are only necessary to train the model and find relevant topics. Inference is done through a straightforward cosine similarity between the topic and document embeddings. This not only speeds up the model but allows us to have a tiny BERTopic model that we can work with. ## Additional Resources * [BERTopic repository](https://github.com/MaartenGr/BERTopic) * [BERTopic docs](https://maartengr.github.io/BERTopic/) * [BERTopic models in the Hub](https://huggingface.co/models?library=bertopic&sort=trending) ### Models Download Stats https://huggingface.co/docs/hub/models-download-stats.md # Models Download Stats ## How are downloads counted for models? 
Counting the number of downloads for models is not a trivial task, as a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models) and different formats depending on the library (GGUF, PyTorch, TensorFlow, etc.). To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting. No information is sent from the user, and no additional calls are made for this. The count is done server-side as the Hub serves files for downloads. Every HTTP request to these files, including `GET` and `HEAD`, will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` or `adapter_config.json`. ## Which are the query files for different libraries? By default, the Hub looks at `config.json`, `config.yaml`, `hyperparams.yaml`, `params.json`, and `meta.yaml`. Some libraries override these defaults by specifying their own filter (specifying `countDownloads`). The code that defines these overrides is [open-source](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts). For example, for the `nemo` library, all files with `.nemo` extension are used to count downloads. ## Can I add my query files for my library? Yes, you can open a Pull Request [here](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts). Here is a minimal [example](https://github.com/huggingface/huggingface.js/pull/885/files) adding download metrics for VFIMamba. Check out the [integration guide](./models-adding-libraries#register-your-library) for more details. ## How are `GGUF` files handled? GGUF files are self-contained and are not tied to a single library, so all of them are counted for downloads. 
This can double-count downloads when a user clones a whole repository, but most users and interfaces download a single GGUF file for a given repo. ## How is `diffusers` handled? The `diffusers` library is an edge case and has its filter configured in the internal codebase. The filter ensures repos tagged as `diffusers` count both files loaded via the library as well as through UIs that require users to manually download the top-level safetensors. ``` filter: [ { bool: { /// Include documents that match at least one of the following rules should: [ /// Downloaded from diffusers lib { term: { path: "model_index.json" }, }, /// Direct downloads (LoRa, Auto1111 and others) /// Filter out nested safetensors and pickle weights to avoid double counting downloads from the diffusers lib { regexp: { path: "[^/]*\\.safetensors" }, }, { regexp: { path: "[^/]*\\.ckpt" }, }, { regexp: { path: "[^/]*\\.bin" }, }, ], minimum_should_match: 1, }, }, ] } ``` ### Docker Spaces Examples https://huggingface.co/docs/hub/spaces-sdks-docker-examples.md # Docker Spaces Examples We gathered some example demos in the [Spaces Examples](https://huggingface.co/SpacesExamples) organization. Please check them out!
* Dummy FastAPI app: https://huggingface.co/spaces/DockerTemplates/fastapi_dummy * FastAPI app serving a static site and using `transformers`: https://huggingface.co/spaces/DockerTemplates/fastapi_t5 * Phoenix app for https://huggingface.co/spaces/DockerTemplates/single_file_phx_bumblebee_ml * HTTP endpoint in Go with query parameters https://huggingface.co/spaces/XciD/test-docker-go?q=Adrien * Shiny app written in Python https://huggingface.co/spaces/elonmuskceo/shiny-orbit-simulation * Genie.jl app in Julia https://huggingface.co/spaces/nooji/GenieOnHuggingFaceSpaces * Argilla app for data labelling and curation: https://huggingface.co/spaces/argilla/live-demo and [write-up about hosting Argilla on Spaces](./spaces-sdks-docker-argilla) by [@dvilasuero](https://huggingface.co/dvilasuero) 🎉 * JupyterLab and VSCode: https://huggingface.co/spaces/DockerTemplates/docker-examples by [@camenduru](https://twitter.com/camenduru) and [@nateraw](https://hf.co/nateraw). * Zeno app for interactive model evaluation: https://huggingface.co/spaces/zeno-ml/diffusiondb and [instructions for setup](https://zenoml.com/docs/deployment#hugging-face-spaces) * Gradio App: https://huggingface.co/spaces/sayakpaul/demo-docker-gradio ### Advanced Compute Options https://huggingface.co/docs/hub/advanced-compute-options.md # Advanced Compute Options > [!WARNING] > This feature is part of the Team & Enterprise plans. Team & Enterprise organizations gain access to advanced compute options to accelerate their machine learning journey. ## Host ZeroGPU Spaces in your organization ZeroGPU is a dynamic GPU allocation system that optimizes AI deployment on Hugging Face Spaces. By automatically allocating and releasing NVIDIA H200 GPU slices (70GB VRAM) as needed, organizations can efficiently serve their AI applications without dedicated GPU instances.
screenshot of Hugging Face Advanced Compute Options (ZeroGPU) **Key benefits for organizations** - **Free GPU Access**: Access powerful NVIDIA H200 GPUs at no additional cost through dynamic allocation - **Enhanced Resource Management**: Host up to 50 ZeroGPU Spaces for efficient team-wide AI deployment - **Simplified Deployment**: Easy integration with PyTorch-based models, Gradio apps, and other Hugging Face libraries - **Enterprise-Grade Infrastructure**: Access to high-performance NVIDIA H200 GPUs with 70GB VRAM per workload [Learn more about ZeroGPU →](https://huggingface.co/docs/hub/spaces-zerogpu) ### Managed SSO https://huggingface.co/docs/hub/enterprise-advanced-sso.md # Managed SSO > [!WARNING] > This feature is part of the Enterprise Plus plan. Managed SSO **replaces the Hugging Face login entirely**. Your Identity Provider becomes the sole authentication method for your organization's members across the entire Hugging Face platform. The organization controls the full user lifecycle, from account creation to deactivation. For a comparison with Basic SSO, see the [SSO overview](./enterprise-sso). ## How it works > [!NOTE] > **Managed SSO replaces the Hugging Face login.** Your IdP is the only way for managed users to authenticate on Hugging Face; there is no separate Hugging Face login. Unlike Basic SSO, members do not need a pre-existing Hugging Face account. When a user authenticates through your IdP for the first time, an account is automatically created for them. Your IdP is the mandatory authentication route for all your organization's members interacting with any part of the Hugging Face platform. Members are required to authenticate via your IdP for all Hugging Face services, not just when accessing private or organizational repositories. When a user is deactivated in your IdP, their Hugging Face account is deactivated as well. This gives your organization complete control over identity, access, and data governance.
## Getting started Managed SSO cannot be self-configured. To enable Managed SSO for your organization, please contact the Hugging Face team. The setup is done in collaboration with our technical team to ensure a smooth transition for your organization. Both SAML 2.0 and OIDC protocols are supported and can be integrated with popular identity providers such as Okta, Microsoft Entra ID (Azure AD), and Google Workspace. ## User provisioning Managed SSO introduces automated user provisioning through [SCIM](./enterprise-scim), which manages the entire user lifecycle on Hugging Face. SCIM allows your IdP to communicate user identity information to Hugging Face, enabling automatic creation, updates (e.g., name changes, role changes), and deactivation of user accounts as changes occur in your IdP. Learn more about how to set up and manage SCIM in our [dedicated guide](./enterprise-scim). ## SSO features Managed SSO supports [role mapping, resource group mapping, session timeout, and external collaborators](./security-sso-user-management). These features are configurable from your organization's settings. ## Restrictions on managed accounts > [!WARNING] > Important considerations for managed accounts. To ensure organizational control and data governance, managed user accounts have specific restrictions: * **No personal content creation**: Managed users cannot create any content (models, datasets, or Spaces) in their personal user namespace. All content must be created within the organization. * **Organization-bound collaboration**: Managed users are restricted to collaborating solely within their managing organization. They cannot join other organizations or contribute to repositories outside of their managing organization. * **Content visibility**: Content created by managed users resides within the organization. 
While managed users cannot create public content in their personal profile, they can **create public content within the organization** if the organization's settings permit it. These restrictions maintain your enterprise's security boundaries. For personal projects or broader collaboration outside your organization, members should use a separate, unmanaged Hugging Face account. ### Webhooks https://huggingface.co/docs/hub/webhooks.md # Webhooks Webhooks are a foundation for MLOps-related features. They allow you to listen for new changes on specific repos or to all repos belonging to a particular set of users/organizations (not just your repos, but any repo). You can use them to auto-convert models, build community bots, or build CI/CD for your models, datasets, and Spaces (and much more!). Webhooks can also [trigger Jobs](./jobs-webhooks) to automate compute tasks in response to repo events. The documentation for Webhooks is below, or you can also browse our **guides** showcasing a few possible use cases of Webhooks: - [Fine-tune a new model whenever a dataset gets updated (Python)](./webhooks-guide-auto-retrain) - [Create a discussion bot on the Hub, using an LLM API (NodeJS)](./webhooks-guide-discussion-bot) - [Create metadata quality reports (Python)](./webhooks-guide-metadata-review) - and more to come… ## Create your Webhook You can create new Webhooks and edit existing ones in your Webhooks [settings](https://huggingface.co/settings/webhooks): ![Settings of an individual webhook](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhook-settings.png) Webhooks can watch for repo updates, Pull Requests, discussions, and new comments. It's even possible to create a Space to react to your Webhooks! ## Webhook Payloads After registering a Webhook, you will be notified of new events via an `HTTP POST` call on the specified target URL. The payload is encoded in JSON.
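Before looking at the payload in detail, here is a minimal sketch of the receiving side, reduced to a pure function for clarity. It assumes you configured a secret for your webhook and that the Hub sends it back in an `X-Webhook-Secret` header; wiring this into FastAPI, Flask, or `http.server` is left out:

```python
import hmac
import json

def handle_webhook(headers, body, expected_secret):
    """Validate and parse one webhook delivery.

    Assumptions: the webhook was configured with a secret (echoed back in an
    `X-Webhook-Secret` header) and `body` is the raw JSON payload bytes.
    Returns an HTTP status code plus the (scope, action) pair, or None
    when the delivery is rejected.
    """
    received = headers.get("X-Webhook-Secret", "")
    if not hmac.compare_digest(received, expected_secret):
        return 401, None  # reject deliveries that don't carry our secret
    payload = json.loads(body)
    event = payload["event"]
    # Dispatch on scope/action here, e.g. rebuild docs on "repo.content" updates.
    return 200, (event["scope"], event["action"])
```

Using `hmac.compare_digest` rather than `==` avoids leaking the secret through timing differences.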
You can view the history of payloads sent in the activity tab of the webhook settings page; it's also possible to replay past webhooks for easier debugging: ![image.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhook-activity.png) As an example, here is the full payload when a Pull Request is opened: ```json { "event": { "action": "create", "scope": "discussion" }, "repo": { "type": "model", "name": "openai-community/gpt2", "id": "621ffdc036468d709f17434d", "private": false, "url": { "web": "https://huggingface.co/openai-community/gpt2", "api": "https://huggingface.co/api/models/openai-community/gpt2" }, "owner": { "id": "628b753283ef59b5be89e937" } }, "discussion": { "id": "6399f58518721fdd27fc9ca9", "title": "Update co2 emissions", "url": { "web": "https://huggingface.co/openai-community/gpt2/discussions/19", "api": "https://huggingface.co/api/models/openai-community/gpt2/discussions/19" }, "status": "open", "author": { "id": "61d2f90c3c2083e1c08af22d" }, "num": 19, "isPullRequest": true, "changes": { "base": "refs/heads/main" } }, "comment": { "id": "6399f58518721fdd27fc9caa", "author": { "id": "61d2f90c3c2083e1c08af22d" }, "content": "Add co2 emissions information to the model card", "hidden": false, // Note: when `hidden` is `true`, `content` will be undefined "url": { "web": "https://huggingface.co/openai-community/gpt2/discussions/19#6399f58518721fdd27fc9caa" } }, "webhook": { "id": "6390e855e30d9209411de93b", "version": 3 } } ``` ### Event The top-level property `event` is always specified and is used to determine the nature of the event. It has two sub-properties: `event.action` and `event.scope`. `event.scope` will be one of the following values: - `"repo"` - Global events on repos. Possible values for the associated `action`: `"create"`, `"delete"`, `"update"`, `"move"`. - `"repo.content"` - Events on the repo's content, such as new commits or tags.
It triggers on new Pull Requests as well due to the newly created reference/commit. The associated `action` is always `"update"`. - `"repo.config"` - Events on the config: update Space secrets, update settings, update DOIs, disabled or not, etc. The associated `action` is always `"update"`. - `"discussion"` - Creating a discussion or Pull Request, updating the title or status, and merging. Possible values for the associated `action`: `"create"`, `"delete"`, `"update"`. - `"discussion.comment"` - Creating, updating, and hiding a comment. Possible values for the associated `action`: `"create"`, `"update"`. More scopes can be added in the future. To handle unknown events, your webhook handler can consider any action on a narrowed scope to be an `"update"` action on the broader scope. For example, if the `"repo.config.dois"` scope is added in the future, any event with that scope can be considered by your webhook handler as an `"update"` action on the `"repo.config"` scope. ### Repo In the current version of webhooks, the top-level property `repo` is always specified, as events can always be associated with a repo. For example, consider the following value: ```json "repo": { "type": "model", "name": "some-user/some-repo", "id": "6366c000a2abcdf2fd69a080", "private": false, "url": { "web": "https://huggingface.co/some-user/some-repo", "api": "https://huggingface.co/api/models/some-user/some-repo" }, "headSha": "c379e821c9c95d613899e8c4343e4bfee2b0c600", "owner": { "id": "61d2000c3c2083e1c08af22d" } } ``` `repo.headSha` is the sha of the latest commit on the repo's `main` branch. It is only sent when `event.scope` starts with `"repo"`, not on community events like discussions and comments. ### Code changes On code changes, the top-level property `updatedRefs` is specified on repo events. It is an array of references that have been updated. 
Here is an example value: ```json "updatedRefs": [ { "ref": "refs/heads/main", "oldSha": "ce9a4674fa833a68d5a73ec355f0ea95eedd60b7", "newSha": "575db8b7a51b6f85eb06eee540738584589f131c" }, { "ref": "refs/tags/test", "oldSha": null, "newSha": "575db8b7a51b6f85eb06eee540738584589f131c" } ] ``` Newly created references will have `oldSha` set to `null`. Deleted references will have `newSha` set to `null`. You can react to new commits on specific pull requests, new tags, or new branches. ### Config changes When the top-level property `event.scope` is `"repo.config"`, the `updatedConfig` property is specified. It is an object containing the updated config. Here is an example value: ```json "updatedConfig": { "private": false } ``` When the updated config key is not supported by the webhook, the object will be empty: ```json "updatedConfig": {} ``` For now, only `private` is supported. If you would benefit from more config keys being present here, please let us know at website@huggingface.co. ### Discussions and Pull Requests The top-level property `discussion` is specified on community events (discussions and Pull Requests). The `discussion.isPullRequest` property is a boolean indicating if the discussion is also a Pull Request (on the Hub, a PR is a special type of discussion). Here is an example value: ```json "discussion": { "id": "639885d811ae2bad2b7ba461", "title": "Hello!", "url": { "web": "https://huggingface.co/some-user/some-repo/discussions/3", "api": "https://huggingface.co/api/models/some-user/some-repo/discussions/3" }, "status": "open", "author": { "id": "61d2000c3c2083e1c08af22d" }, "isPullRequest": true, "changes": { "base": "refs/heads/main" }, "num": 3 } ``` ### Comment The top-level property `comment` is specified when a comment is created (including on discussion creation) or updated.
Here is an example value: ```json "comment": { "id": "6398872887bfcfb93a306f18", "author": { "id": "61d2000c3c2083e1c08af22d" }, "content": "This adds an env key", "hidden": false, "url": { "web": "https://huggingface.co/some-user/some-repo/discussions/4#6398872887bfcfb93a306f18" } } ``` ## Webhook secret Setting a Webhook secret is useful to make sure payloads sent to your Webhook handler URL are actually from Hugging Face. If you set a secret for your Webhook, it will be sent along as an `X-Webhook-Secret` HTTP header on every request. Only ASCII characters are supported. > [!TIP] > It's also possible to add the secret directly in the handler URL. For example, setting it as a query parameter: https://example.com/webhook?secret=XXX. > > This can be helpful if accessing the HTTP headers of the request is complicated for your Webhook handler. ## Rate limiting Each Webhook is limited to 1,000 triggers per 24 hours. You can view your usage in the Webhook settings page in the "Activity" tab. If you need to increase the number of triggers for your Webhook, upgrade to PRO, Team or Enterprise and contact us at website@huggingface.co. ## Developing your Webhooks If you do not have an HTTPS endpoint/URL, you can try out public tools for webhook testing. These tools act as a catch-all for requests sent to them and respond with a 200 OK status code. [Beeceptor](https://beeceptor.com/) is one tool you can use to create a temporary HTTP endpoint and review the incoming payload. Another such tool is [Webhook.site](https://webhook.site/). Additionally, you can route a real Webhook payload to the code running locally on your machine during development. This is a great way to test and debug for faster integrations. You can do this by exposing your localhost port to the Internet. To go this route, you can use [ngrok](https://ngrok.com/) or [localtunnel](https://theboroer.github.io/localtunnel-www/).
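While developing your handler, you can also check the `X-Webhook-Secret` header described above before trusting a payload. A minimal sketch (the secret value and helper name are hypothetical; `hmac.compare_digest` is used to make the comparison constant-time):

```python
import hmac

WEBHOOK_SECRET = "my-secret"  # hypothetical value configured in the Webhook settings

def is_authentic(headers: dict) -> bool:
    """Return True if the request carries the expected X-Webhook-Secret header."""
    received = headers.get("X-Webhook-Secret", "")
    # compare_digest performs a constant-time comparison, avoiding timing leaks
    return hmac.compare_digest(received, WEBHOOK_SECRET)

print(is_authentic({"X-Webhook-Secret": "my-secret"}))       # True
print(is_authentic({"Content-Type": "application/json"}))    # False
```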
## Debugging Webhooks You can easily find recently generated events for your webhooks. Open the activity tab for your webhook. There you will see the list of recent events. ![image.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhook-payload.png) Here you can review the HTTP status code and the payload of the generated events. Additionally, you can replay these events by clicking on the `Replay` button! Note: When changing the target URL or secret of a Webhook, replaying an event will send the payload to the updated URL. ## FAQ ##### Can I define webhooks on my organization vs my user account? No, this is not currently supported. ##### How can I subscribe to all events on HF (or across a whole repo type, like on all models)? This is not currently exposed to end users but we can toggle this for you if you send an email to website@huggingface.co. ### marimo on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-marimo.md # marimo on Spaces [marimo](https://github.com/marimo-team/marimo) is a reactive notebook for Python that models notebooks as dataflow graphs. When you run a cell or interact with a UI element, marimo automatically runs affected cells (or marks them as stale), keeping code and outputs consistent and preventing bugs before they happen. Every marimo notebook is stored as pure Python, executable as a script, and deployable as an app. 
Key features: - ⚡️ **reactive:** run a cell, and marimo reactively runs all dependent cells or marks them as stale - 🖐️ **interactive:** bind sliders, tables, plots, and more to Python, no callbacks required - 🔬 **reproducible:** no hidden state, deterministic execution, built-in package management - 🏃 **executable:** execute as a Python script, parametrized by CLI args - 🛜 **shareable:** deploy as an interactive web app or slides, run in the browser via WASM - 🛢️ **designed for data:** query dataframes and databases with SQL, filter and search dataframes ## Deploying marimo apps on Spaces To get started with marimo on Spaces, click the button below: This will start building your Space using marimo's Docker template. If successful, you should see an application similar to the [marimo introduction notebook](https://huggingface.co/spaces/marimo-team/marimo-app-template). ## Customizing your marimo app When you create a marimo Space, you'll get a few key files to help you get started: ### 1. app.py This is your main marimo notebook file that defines your app's logic. marimo notebooks are pure Python files that use the `@app.cell` decorator to define cells. To learn more about building notebooks and apps, see [the marimo documentation](https://docs.marimo.io). As your app grows, you can organize your code into modules and import them into your main notebook. ### 2. Dockerfile The Dockerfile for a marimo app is minimal since marimo has few system dependencies. The key requirements are: - It installs the dependencies listed in `requirements.txt` (using `uv`) - It creates a non-root user for security - It runs the app using `marimo run app.py` You may need to modify this file if your application requires additional system dependencies, permissions, or other CLI flags. ### 3. requirements.txt The Space will automatically install dependencies listed in the `requirements.txt` file. At minimum, you must include `marimo` in this file.
You will want to add any other required packages your app needs. The marimo Space template provides a basic setup that you can extend based on your needs. When deployed, your notebook will run in "app mode", which hides the code cells and only shows the interactive outputs - perfect for sharing with end users. You can opt to include the code cells in your app by adding `--include-code` to the `marimo run` command in the Dockerfile. ## Additional Resources and Support - [marimo documentation](https://docs.marimo.io) - [marimo GitHub repository](https://github.com/marimo-team/marimo) - [marimo Discord](https://marimo.io/discord) - [marimo template Space](https://huggingface.co/spaces/marimo-team/marimo-app-template) ## Troubleshooting If you encounter issues: 1. Make sure your notebook runs locally in app mode using `marimo run app.py` 2. Check that all required packages are listed in `requirements.txt` 3. Verify the port configuration matches (7860 is the default for Spaces) 4. Check Space logs for any Python errors For more help, visit the [marimo Discord](https://marimo.io/discord) or [open an issue](https://github.com/marimo-team/marimo/issues). ### How to configure SAML SSO with Okta https://huggingface.co/docs/hub/security-sso-okta-saml.md # How to configure SAML SSO with Okta In this guide, we will use Okta as the SSO provider with the Security Assertion Markup Language (SAML) protocol as our preferred identity protocol. We currently support SP-initiated and IdP-initiated authentication. For user provisioning, see [SCIM](./enterprise-scim). > [!WARNING] > This feature is part of the Team & Enterprise plans. ## Step 1: Create a new application in your Identity Provider Open a new tab/window in your browser and sign in to your Okta account. Navigate to "Admin/Applications" and click the "Create App Integration" button. Then choose a "SAML 2.0" application and click "Create".
## Step 2: Configure your application on Okta Open a new tab/window in your browser and navigate to the SSO section of your organization's settings. Select the SAML protocol. Copy the "Assertion Consumer Service URL" from the organization's settings on Hugging Face, and paste it in the "Single sign-on URL" field on Okta. The URL looks like this: `https://huggingface.co/organizations/[organizationIdentifier]/saml/consume`. On Okta, set the following settings: - Set Audience URI (SP Entity Id) to match the "SP Entity ID" value on Hugging Face. - Set Name ID format to EmailAddress. - Under "Show Advanced Settings", verify that Response and Assertion Signature are set to: Signed. Save your new application. ## Step 3: Finalize configuration on Hugging Face In your Okta application, under "Sign On/Settings/More details", find the following fields: - Sign-on URL - Public certificate - SP Entity ID You will need them to finalize the SSO setup on Hugging Face. In the SSO section of your organization's settings, copy-paste these values from Okta: - Sign-on URL - SP Entity ID - Public certificate The public certificate must have the following format: ``` -----BEGIN CERTIFICATE----- {certificate} -----END CERTIFICATE----- ``` You can now click on "Update and Test SAML configuration" to save the settings. You should be redirected to your SSO provider (IdP) login prompt. Once logged in, you'll be redirected to your organization's settings page. A green check mark near the SAML selector will attest that the test was successful. ## Step 4: Enable SSO in your organization Now that Single Sign-On is configured and tested, you can enable it for members of your organization by clicking on the "Enable" button. Once enabled, members of your organization must complete the SSO authentication flow described in the [How it works](./security-sso-basic#how-it-works) section. 
### DuckDB https://huggingface.co/docs/hub/datasets-duckdb.md # DuckDB [DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system. You can use Hugging Face paths (`hf://`) to access data on the Hub. The [DuckDB CLI](https://duckdb.org/docs/api/cli/overview.html) (Command Line Interface) is a single, dependency-free executable. There are also other APIs available for running DuckDB, including Python, C++, Go, Java, Rust, and more. For additional details, visit their [clients](https://duckdb.org/docs/api/overview.html) page. > [!TIP] > For installation details, visit the [installation page](https://duckdb.org/docs/installation). Starting from version `v0.10.3`, the DuckDB CLI includes native support for accessing datasets on the Hugging Face Hub via URLs with the `hf://` scheme. Here are some features you can leverage with this powerful tool: - Query public datasets and your own gated and private datasets - Analyze datasets and perform SQL operations - Combine datasets and export them to different formats - Conduct vector similarity search on embedding datasets - Implement full-text search on datasets For a complete list of DuckDB features, visit the DuckDB [documentation](https://duckdb.org/docs/). To start the CLI, execute the following command in the installation folder: ```bash ./duckdb ``` ## Forging the Hugging Face URL To access Hugging Face datasets, use the following URL format: ```plaintext hf://datasets/{my-username}/{my-dataset}/{path_to_file} ``` - **my-username**, the user or organization of the dataset, e.g. `ibm` - **my-dataset**, the dataset name, e.g. `duorc` - **path_to_file**, the file path, which supports glob patterns, e.g. `**/*.parquet`, to query all parquet files > [!TIP] > You can query auto-converted Parquet files using the @~parquet branch, which corresponds to the `refs/convert/parquet` revision.
> For more details, refer to the documentation at https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet. > > To reference the `refs/convert/parquet` revision of a dataset, use the following syntax: > > ```plaintext > hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file} > ``` > > Here is a sample URL following the above syntax: > > ```plaintext > hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/0000.parquet > ``` Let's start with a quick demo to query all the rows of a dataset: ```sql FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3; ``` Or using traditional SQL syntax: ```sql SELECT * FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3; ``` In the following sections, we will cover more complex operations you can perform with DuckDB on Hugging Face datasets. > [!TIP] > **Querying Storage Buckets**: When using the DuckDB Python client, you can query data stored in [Storage Buckets](./storage-buckets) by registering the Hugging Face filesystem: > ```python > import duckdb > from huggingface_hub import HfFileSystem > duckdb.register_filesystem(HfFileSystem()) > duckdb.sql("SELECT * FROM 'hf://buckets/username/my-bucket/data.parquet' LIMIT 10") > ``` Native `hf://buckets/` support in DuckDB is expected in a future release. ### Using Stanza at Hugging Face https://huggingface.co/docs/hub/stanza.md # Using Stanza at Hugging Face `stanza` is a collection of accurate and efficient tools for the linguistic analysis of many human languages. Starting from raw text to syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to languages of your choosing. ## Exploring Stanza in the Hub You can find `stanza` models by filtering at the left of the [models page](https://huggingface.co/models?library=stanza&sort=downloads). You can find over 70 models for different languages! All models on the Hub come with the following features: 1.
An automatically generated model card with a brief description and metadata tags that help with discoverability. 2. An interactive widget you can use to play with the model directly in the browser (for named entity recognition and part of speech). 3. An Inference Providers widget that lets you make inference requests (for named entity recognition and part of speech). ## Using existing models The `stanza` library automatically downloads models from the Hub. You can use `stanza.Pipeline` to download the model from the Hub and do inference. ```python import stanza nlp = stanza.Pipeline('en') # download the English model and initialize an English neural pipeline doc = nlp("Barack Obama was born in Hawaii.") # run annotation over a sentence ``` ## Sharing your models To add new official Stanza models, you can follow the process to [add a new language](https://stanfordnlp.github.io/stanza/new_language.html) and then [share your models with the Stanza team](https://stanfordnlp.github.io/stanza/new_language.html#contributing-back-to-stanza). You can also find the official script to upload models to the Hub [here](https://github.com/stanfordnlp/huggingface-models/blob/main/hugging_stanza.py). ## Additional resources * `stanza` [docs](https://stanfordnlp.github.io/stanza/). ### Notifications https://huggingface.co/docs/hub/notifications.md # Notifications Notifications allow you to know when new activities (**Pull Requests or discussions**) happen on models, datasets, and Spaces belonging to users or organizations you are watching. By default, you'll receive a notification if: - Someone mentions you in a discussion/PR. - A new comment is posted in a discussion/PR you participated in. - A new discussion/PR or comment is posted in one of the repositories of an organization or user you are watching. - Someone replies to one of your posts, blog articles, or paper pages.
![Notifications page](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-page.png) You'll get new notifications by email and [directly on the website](https://huggingface.co/notifications); you can change this in your [notifications settings](#notifications-settings). ## Filtering and managing notifications On the [notifications page](https://huggingface.co/notifications), you have several options for filtering and managing your notifications more effectively: - Filter by Repository: Choose to display notifications from a specific repository only. - Filter by Read Status: Display only unread notifications or all notifications. - Filter by Participation: Show notifications you have participated in or those in which you have been directly mentioned. Additionally, you can take the following actions to manage your notifications: - Mark as Read/Unread: Change the status of notifications to mark them as read or unread. - Mark as Done: Once marked as done, notifications will no longer appear in the notification center (they are deleted). By default, changes made to notifications will only apply to the selected notifications on the screen. However, you can also apply changes to all matching notifications (like in Gmail, for instance) for greater convenience. ## Watching users and organizations By default, you'll be watching all the organizations you are a member of and will be notified of any new activity on those. You can also choose to get notified on arbitrary users or organizations. To do so, use the "Watch repos" button on their HF profiles. Note that you can also quickly watch/unwatch users and organizations directly from your [notifications settings](#notifications-settings). Finally, you can choose to watch a specific repository and get notified about any new activity without having to watch the whole organization or user account.
## Notifications settings In your [notifications settings](https://huggingface.co/settings/notifications) page, you can choose specific channels to get notified on depending on the type of activity; for example, receiving an email for direct mentions but only a web notification for new activity on watched users and organizations. By default, you'll get an email and a web notification for any new activity, but feel free to adjust your settings depending on your needs. _Note that clicking the unsubscribe link in an email will unsubscribe you for the type of activity, e.g. direct mentions._ ![Notifications settings page](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/notifications-settings.png) You can quickly add any user/organization to your watch list by searching them by name using the dedicated search bar. Unsubscribe from a specific user/organization simply by unticking the corresponding checkbox. ## Mute notifications for a specific repository It's possible to mute notifications for a particular repository by using the "Mute notifications" action in the repository's contextual menu. This will prevent you from receiving any new notifications for that particular repository. You can unmute the repository at any time by clicking the "Unmute notifications" action in the same repository menu.
![mute notification menu](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-mute-menu.png) _Note: if a repository is muted, you won't receive any new notifications unless you're directly mentioned or participating in a discussion._ The list of muted repositories is available from the notifications settings page: ![Notifications settings page muted repositories](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-settings-muted.png) ## Mute notifications for a specific discussion or PR You can also mute notifications for individual discussions or pull requests by clicking the mute icon in the header. Doing this prevents you from receiving any further notifications from that specific discussion or PR, including direct mentions. You can unmute at any time by clicking the same icon again. ![Notifications mute discussions](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-mute-discussion.png) ### Gradio Spaces https://huggingface.co/docs/hub/spaces-sdks-gradio.md # Gradio Spaces **Gradio** provides an easy and intuitive interface for running a model from a list of inputs and displaying the outputs in formats such as images, audio, 3D objects, and more. Gradio now even has a [Plot output component](https://gradio.app/docs/#o_plot) for creating data visualizations with Matplotlib, Bokeh, and Plotly! For more details, take a look at the [Getting started](https://gradio.app/getting_started/) guide from the Gradio team. Selecting **Gradio** as the SDK when [creating a new Space](https://huggingface.co/new-space) will initialize your Space with the latest version of Gradio by setting the `sdk` property to `gradio` in your `README.md` file's YAML block. If you'd like to change the Gradio version, you can edit the `sdk_version` property.
Visit the [Gradio documentation](https://gradio.app/docs/) to learn all about its features and check out the [Gradio Guides](https://gradio.app/guides/) for some handy tutorials to help you get started! ## Your First Gradio Space: Hot Dog Classifier In the following sections, you'll learn the basics of creating a Space, configuring it, and deploying your code to it. We'll create a **Hot Dog Classifier** Space with Gradio that'll be used to demo the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which can detect whether a given picture contains a hot dog 🌭 You can find a completed version of this hosted at [NimaBoscarino/hotdog-gradio](https://huggingface.co/spaces/NimaBoscarino/hotdog-gradio). ## Create a new Gradio Space We'll start by [creating a brand new Space](https://huggingface.co/new-space) and choosing **Gradio** as our SDK. Hugging Face Spaces are Git repositories, meaning that you can work on your Space incrementally (and collaboratively) by pushing commits. Take a look at the [Getting Started with Repositories](./repositories-getting-started) guide to learn about how you can create and edit files before continuing. ## Add the dependencies For the **Hot Dog Classifier** we'll be using a [🤗 Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to use the model, so we need to start by installing a few dependencies. This can be done by creating a **requirements.txt** file in our repository, and adding the following dependencies to it: ``` transformers torch ``` The Spaces runtime will handle installing the dependencies!
## Create the Gradio interface To create the Gradio app, make a new file in the repository called **app.py**, and add the following code: ```python import gradio as gr from transformers import pipeline pipeline = pipeline(task="image-classification", model="julien-c/hotdog-not-hotdog") def predict(input_img): predictions = pipeline(input_img) return input_img, {p["label"]: p["score"] for p in predictions} gradio_app = gr.Interface( predict, inputs=gr.Image(label="Select hot dog candidate", sources=['upload', 'webcam'], type="pil"), outputs=[gr.Image(label="Processed Image"), gr.Label(label="Result", num_top_classes=2)], title="Hot Dog? Or Not?", ) if __name__ == "__main__": gradio_app.launch() ``` This Python script uses a [🤗 Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to load the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which is used by the Gradio interface. The Gradio app will expect you to upload an image, which it'll then classify as *hot dog* or *not hot dog*. Once you've saved the code to the **app.py** file, visit the **App** tab to see your app in action! ## Embed Gradio Spaces on other webpages You can embed a Gradio Space on other webpages by using either Web Components or an HTML `<iframe>` element. For instance, you can embed the [NimaBoscarino/hotdog-gradio](https://huggingface.co/spaces/NimaBoscarino/hotdog-gradio) Space this way. ## Embedding with WebComponents If the Space you wish to embed is Gradio-based, you can use Web Components to embed your Space. WebComponents are faster than IFrames and automatically adjust to your web page so that you do not need to configure `width` or `height` for your element. First, you need to import the Gradio JS library that corresponds to the Gradio version in the Space by adding the following script to your HTML. Then, add a `gradio-app` element where you want to embed your Space.
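A sketch of the two elements described above (the Gradio version in the script URL and the `*.hf.space` embed URL are assumptions; use the version and Space URL that match your deployment):

```html
<!-- Import the Gradio JS library matching the Space's Gradio version (version here is an assumption) -->
<script type="module" src="https://gradio.s3-us-west-2.amazonaws.com/4.36.1/gradio.js"></script>

<!-- Point the gradio-app element at the Space's embed URL -->
<gradio-app src="https://nimaboscarino-hotdog-gradio.hf.space"></gradio-app>
```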
Check out the [Gradio documentation](https://www.gradio.app/guides/sharing-your-app#embedding-hosted-spaces) for more details. ### Spark https://huggingface.co/docs/hub/datasets-spark.md # Spark Spark enables real-time, large-scale data processing in a distributed environment. You can use `pyspark_huggingface` to access Hugging Face dataset repositories in PySpark via the "huggingface" Data Source. Try out [Spark Notebooks](https://huggingface.co/spaces/Dataset-Tools/Spark-Notebooks) on Hugging Face Spaces to get Notebooks with PySpark and `pyspark_huggingface` pre-installed. ## Set up ### Installation To be able to read and write to Hugging Face Datasets, you need to install the `pyspark_huggingface` library: ``` pip install pyspark_huggingface ``` This will also install required dependencies like `huggingface_hub` for authentication, and `pyarrow` for reading and writing datasets. ### Authentication You need to authenticate to Hugging Face to read private/gated dataset repositories or to write to your dataset repositories. You can use the CLI, for example: ``` hf auth login ``` It's also possible to provide your Hugging Face token with the `HF_TOKEN` environment variable or by passing the `token` option to the reader. For more details about authentication, check out [this guide](https://huggingface.co/docs/huggingface_hub/quick-start#authentication). ### Enable the "huggingface" Data Source PySpark 4 came with a new Data Source API which allows using datasets from custom sources. If `pyspark_huggingface` is installed, PySpark auto-imports it and enables the "huggingface" Data Source. The library also backports the Data Source API for the "huggingface" Data Source for PySpark 3.5, 3.4 and 3.3.
However, in this case `pyspark_huggingface` should be imported explicitly to activate the backport and enable the "huggingface" Data Source: ```python >>> import pyspark_huggingface huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4) ``` ## Read The "huggingface" Data Source lets you read datasets from Hugging Face, using `pyarrow` under the hood to stream Arrow data. This is compatible with all the datasets in a [supported format](https://huggingface.co/docs/hub/datasets-adding#file-formats) on Hugging Face, like Parquet datasets. For example, here is how to load the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset: ```python >>> import pyspark_huggingface >>> from pyspark.sql import SparkSession >>> spark = SparkSession.builder.appName("demo").getOrCreate() >>> df = spark.read.format("huggingface").load("stanfordnlp/imdb") ``` Here is another example with the [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) dataset. It is a gated repository; users have to accept the terms of use before accessing it. It also has multiple subsets, namely "3M" and "7M", so we need to specify which one to load. We use the `.format()` function to use the "huggingface" Data Source, and `.load()` to load the dataset (more precisely, the config or subset named "7M" containing 7M samples). Then we compute the number of dialogues per language and filter the dataset.
After logging in to access the gated repository, we can run: ```python >>> import pyspark_huggingface >>> from pyspark.sql import SparkSession >>> spark = SparkSession.builder.appName("demo").getOrCreate() >>> df = spark.read.format("huggingface").option("config", "7M").load("BAAI/Infinity-Instruct") >>> df.show() +---+----------------------------+-----+----------+--------------------+ | id| conversations|label|langdetect| source| +---+----------------------------+-----+----------+--------------------+ | 0| [{human, def exti...| | en| code_exercises| | 1| [{human, See the ...| | en| flan| | 2| [{human, This is ...| | en| flan| | 3| [{human, If you d...| | en| flan| | 4| [{human, In a Uni...| | en| flan| | 5| [{human, Read the...| | en| flan| | 6| [{human, You are ...| | en| code_bagel| | 7| [{human, I want y...| | en| Subjective| | 8| [{human, Given th...| | en| flan| | 9|[{human, 因果联系原则是法...| | zh-cn| Subjective| | 10| [{human, Provide ...| | en|self-oss-instruct...| | 11| [{human, The univ...| | en| flan| | 12| [{human, Q: I am ...| | en| flan| | 13| [{human, What is ...| | en| OpenHermes-2.5| | 14| [{human, In react...| | en| flan| | 15| [{human, Write Py...| | en| code_exercises| | 16| [{human, Find the...| | en| MetaMath| | 17| [{human, Three of...| | en| MetaMath| | 18| [{human, Chandra ...| | en| MetaMath| | 19|[{human, 用经济学知识分析...| | zh-cn| Subjective| +---+----------------------------+-----+----------+--------------------+ ``` This loads the dataset in a streaming fashion, and the output DataFrame has one partition per data file in the dataset to enable efficient distributed processing. To compute the number of dialogues per language, we run this code that uses the `columns` option and a `groupBy()` operation. The `columns` option is useful to only load the data we need, since PySpark doesn't enable predicate push-down with the Data Source API.
There is also a `filters` option to only load data with values within a certain range. ```python >>> df_langdetect_only = ( ... spark.read.format("huggingface") ... .option("config", "7M") ... .option("columns", '["langdetect"]') ... .load("BAAI/Infinity-Instruct") ... ) >>> df_langdetect_only.groupBy("langdetect").count().show() +----------+-------+ |langdetect| count| +----------+-------+ | en|6697793| | zh-cn| 751313| +----------+-------+ ``` To filter the dataset and only keep dialogues in Chinese: ```python >>> df_chinese_only = ( ... spark.read.format("huggingface") ... .option("config", "7M") ... .option("filters", '[("langdetect", "=", "zh-cn")]') ... .load("BAAI/Infinity-Instruct") ... ) >>> df_chinese_only.show() +---+----------------------------+-----+----------+----------+ | id| conversations|label|langdetect| source| +---+----------------------------+-----+----------+----------+ | 9|[{human, ๅ› ๆžœ่”็ณปๅŽŸๅˆ™ๆ˜ฏๆณ•...| | zh-cn|Subjective| | 19|[{human, ็”จ็ปๆตŽๅญฆ็Ÿฅ่ฏ†ๅˆ†ๆž...| | zh-cn|Subjective| | 38| [{human, ๆŸไธช่€ƒ่ฏ•ๅ…ฑๆœ‰Aใ€...| | zh-cn|Subjective| | 39|[{human, ๆ’ฐๅ†™ไธ€็ฏ‡ๅ…ณไบŽๆ–ๆณข...| | zh-cn|Subjective| | 57|[{human, ๆ€ป็ป“ไธ–็•ŒๅކๅฒไธŠ็š„...| | zh-cn|Subjective| | 61|[{human, ็”Ÿๆˆไธ€ๅˆ™ๅนฟๅ‘Š่ฏใ€‚...| | zh-cn|Subjective| | 66|[{human, ๆ่ฟฐไธ€ไธชๆœ‰ๆ•ˆ็š„ๅ›ข...| | zh-cn|Subjective| | 94|[{human, ๅฆ‚ๆžœๆฏ”ๅˆฉๅ’Œ่’‚่Š™ๅฐผ...| | zh-cn|Subjective| |102|[{human, ็”Ÿๆˆไธ€ๅฅ่‹ฑๆ–‡ๅ่จ€...| | zh-cn|Subjective| |106|[{human, ๅ†™ไธ€ๅฐๆ„Ÿ่ฐขไฟก๏ผŒๆ„Ÿ...| | zh-cn|Subjective| |118| [{human, ็”Ÿๆˆไธ€ไธชๆ•…ไบ‹ใ€‚}...| | zh-cn|Subjective| |174|[{human, ้ซ˜่ƒ†ๅ›บ้†‡ๆฐดๅนณ็š„ๅŽ...| | zh-cn|Subjective| |180|[{human, ๅŸบไบŽไปฅไธ‹่ง’่‰ฒไฟกๆฏ...| | zh-cn|Subjective| |192|[{human, ่ฏทๅ†™ไธ€็ฏ‡ๆ–‡็ซ ๏ผŒๆฆ‚...| | zh-cn|Subjective| |221|[{human, ไปฅ่ฏ—ๆญŒๅฝขๅผ่กจ่พพๅฏน...| | zh-cn|Subjective| |228|[{human, ๆ นๆฎ็ป™ๅฎš็š„ๆŒ‡ไปค๏ผŒ...| | zh-cn|Subjective| |236|[{human, ๆ‰“ๅผ€ไธ€ไธชๆ–ฐ็š„็”Ÿๆˆ...| | zh-cn|Subjective| |260|[{human, 
็”Ÿๆˆไธ€ไธชๆœ‰ๅ…ณๆœชๆฅ...| | zh-cn|Subjective| |268|[{human, ๅฆ‚ๆžœๆœ‰ไธ€ๅฎšๆ•ฐ้‡็š„...| | zh-cn|Subjective| |273| [{human, ้ข˜็›ฎ๏ผšๅฐๆ˜Žๆœ‰5ไธช...| | zh-cn|Subjective| +---+----------------------------+-----+----------+----------+ ``` It is also possible to apply filters or remove columns on the loaded DataFrame, but it is more efficient to do it while loading, especially on Parquet datasets. Indeed, Parquet contains metadata at the file and row group level, which allows to skip entire parts of the dataset that don't contain samples that satisfy the criteria. Columns in Parquet can also be loaded independently, which allows to skip the excluded columns and avoid loading unnecessary data. ### Options Here is the list of available options you can pass to `read..option()`: * `config` (string): select a dataset subset/config * `split` (string): select a dataset split (default is "train") * `token` (string): your Hugging Face token Instead of specifying a config or split, you can select which files to load manually: * `data_dir` (string): select a directory * `data_files` (string): select one or many files, e.g. `"data/*.parquet"` or `'["part1.parquet", "par2.parquet"]'` For Parquet datasets: * `columns` (string): select a subset of columns to load, e.g. `'["id"]'` * `filters` (string): to skip files and row groups that don't match a criteria, e.g. `'[("source", "=", "code_exercises")]'`. Filters are passed to [pyarrow.parquet.ParquetDataset](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html). Any other option is passed as an argument to [datasets.load_dataset] (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset) ### Run SQL queries Once you have your PySpark Dataframe ready, you can run SQL queries using `spark.sql`: ```python >>> import pyspark_huggingface >>> from pyspark.sql import SparkSession >>> spark = SparkSession.builder.appName("demo").getOrCreate() >>> df = ( ... 
spark.read.format("huggingface") ... .option("config", "7M") ... .option("columns", '["source"]') ... .load("BAAI/Infinity-Instruct") ... ) >>> spark.sql("SELECT source, count(*) AS total FROM {df} GROUP BY source ORDER BY total DESC", df=df).show() +--------------------+-------+ | source| total| +--------------------+-------+ | flan|2435840| | Subjective|1342427| | OpenHermes-2.5| 855478| | MetaMath| 690138| | code_exercises| 590958| |Orca-math-word-pr...| 398168| | code_bagel| 386649| | MathInstruct| 329254| |python-code-datas...| 88632| |instructional_cod...| 82920| | CodeFeedback| 79513| |self-oss-instruct...| 50467| |Evol-Instruct-Cod...| 43354| |CodeExercise-Pyth...| 27159| |code_instructions...| 23130| | Code-Instruct-700k| 10860| |Glaive-code-assis...| 9281| |python_code_instr...| 2581| |Python-Code-23k-S...| 2297| +--------------------+-------+ ``` Again, specifying the `columns` option is not necessary, but it is useful to avoid loading unnecessary data and to speed up the query. ## Write You can write a PySpark DataFrame to Hugging Face with the "huggingface" Data Source. It uploads Parquet files in parallel in a distributed manner, and only commits the files once they're all uploaded. It works like this: ```python >>> import pyspark_huggingface >>> df.write.format("huggingface").save("username/dataset_name") ``` Here is how we can use this function to write the filtered version of the [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) dataset back to Hugging Face. First, you need to [create a dataset repository](https://huggingface.co/new-dataset), e.g. `username/Infinity-Instruct-Chinese-Only` (you can set it to private if you want).
Then, once you are authenticated, you can use the "huggingface" Data Source: set the `mode` to "overwrite" (or "append" if you want to extend an existing dataset) and push to Hugging Face with `.save()`: ```python >>> df_chinese_only.write.format("huggingface").mode("overwrite").save("username/Infinity-Instruct-Chinese-Only") ``` ### Mode Two modes are available when pushing a dataset to Hugging Face: * "overwrite": overwrite the dataset if it already exists * "append": append the dataset to an existing dataset ### Options Here is the list of available options you can pass to `write.option()`: * `token` (string): your Hugging Face token Contributions are welcome to add more options here, in particular `subset` and `split`. ## Storage Buckets It is common to process raw data in [Storage Buckets](/docs/hub/storage-buckets) and experiment there before publishing AI-ready data in Dataset repositories. Access Storage Buckets the same way as Dataset repositories, but with the `buckets/` prefix and the `data_dir` or `data_files` options: ```python >>> df = spark.read.format("huggingface").option("data_dir", "data").load("buckets/username/my-bucket") >>> # OR with a glob pattern >>> # df = spark.read.format("huggingface").option("data_files", "data/*.parquet").load("buckets/username/my-bucket") >>> df.write.format("huggingface").option("data_dir", "new-data").save("buckets/username/my-bucket") ``` ### Using ESPnet at Hugging Face https://huggingface.co/docs/hub/espnet.md # Using ESPnet at Hugging Face `espnet` is an end-to-end toolkit for speech processing, including automatic speech recognition, text to speech, speech enhancement, diarization, and other tasks. ## Exploring ESPnet in the Hub You can find hundreds of `espnet` models by filtering at the left of the [models page](https://huggingface.co/models?library=espnet&sort=downloads). All models on the Hub come with useful features: 1.
An automatically generated model card with a description, a training configuration, licenses, and more. 2. Metadata tags that help with discoverability and contain information such as license, language, and datasets. 3. An interactive widget you can use to play with the model directly in the browser. 4. An Inference Providers widget that lets you make inference requests. ## Using existing models For a full guide on loading pre-trained models, we recommend checking out the [official guide](https://github.com/espnet/espnet_model_zoo). If you're interested in doing inference, different classes for different tasks have a `from_pretrained` method that allows loading models from the Hub. For example: * `Speech2Text` for Automatic Speech Recognition. * `Text2Speech` for Text to Speech. * `SeparateSpeech` for Audio Source Separation. Here is an inference example: ```py import soundfile from espnet2.bin.tts_inference import Text2Speech text2speech = Text2Speech.from_pretrained("model_name") speech = text2speech("foobar")["wav"] soundfile.write("out.wav", speech.numpy(), text2speech.fs, "PCM_16") ``` If you want to see how to load a specific model, you can click `Use in ESPnet` and you will be given a working snippet you can use to load it! ## Sharing your models `ESPnet` outputs a `zip` file that can be uploaded to Hugging Face easily. For a full guide on sharing models, we recommend checking out the [official guide](https://github.com/espnet/espnet_model_zoo#register-your-model). The `run.sh` script lets you upload a given model to a Hugging Face repository. ```bash ./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo ``` ## Additional resources * ESPnet [docs](https://espnet.github.io/espnet/index.html). * ESPnet model zoo [repository](https://github.com/espnet/espnet_model_zoo). * Integration [docs](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md).
### Daft https://huggingface.co/docs/hub/datasets-daft.md # Daft [Daft](https://daft.ai/) is a high-performance data engine providing simple and reliable data processing for any modality and scale. Daft has native support for reading from and writing to Hugging Face datasets. ## Getting Started To get started, pip install `daft` with the `huggingface` feature: ```bash pip install 'daft[huggingface]' ``` ## Read Daft is able to read datasets directly from the Hugging Face Hub using the [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface) function or via the `hf://datasets/` protocol. ### Reading an Entire Dataset Using [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface), you can easily load a dataset. ```python import daft df = daft.read_huggingface("username/dataset_name") ``` This will read the entire dataset into a DataFrame. ### Reading Specific Files Not only can you read entire datasets, but you can also read individual files from a dataset repository. Using a read function that takes in a path (such as [`daft.read_parquet()`](https://docs.daft.ai/en/stable/api/io/#daft.read_parquet), [`daft.read_csv()`](https://docs.daft.ai/en/stable/api/io/#daft.read_csv), or [`daft.read_json()`](https://docs.daft.ai/en/stable/api/io/#daft.read_json)), specify a Hugging Face dataset path via the `hf://datasets/` prefix: ```python import daft # read a specific Parquet file df = daft.read_parquet("hf://datasets/username/dataset_name/file_name.parquet") # or a csv file df = daft.read_csv("hf://datasets/username/dataset_name/file_name.csv") # or a set of Parquet files using a glob pattern df = daft.read_parquet("hf://datasets/username/dataset_name/**/*.parquet") ``` ## Write Daft is able to write Parquet files to a Hugging Face dataset repository using [`daft.DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface).
Daft supports [Content-Defined Chunking](https://huggingface.co/blog/parquet-cdc) and [Xet](https://huggingface.co/blog/xet-on-the-hub) for faster, deduplicated writes. Basic usage: ```python import daft df: daft.DataFrame = ... df.write_huggingface("username/dataset_name") ``` See the [`DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface) API page for more info. ## Authentication The `token` parameter in [`daft.io.HuggingFaceConfig`](https://docs.daft.ai/en/stable/api/config/#daft.io.HuggingFaceConfig) can be used to specify a Hugging Face access token for requests that require authentication (e.g. reading private dataset repositories or writing to a dataset repository). Example of loading a dataset with a specified token: ```python from daft.io import IOConfig, HuggingFaceConfig io_config = IOConfig(hf=HuggingFaceConfig(token="your_token")) df = daft.read_parquet("hf://datasets/username/dataset_name", io_config=io_config) ``` ### Dataset Cards https://huggingface.co/docs/hub/datasets-cards.md # Dataset Cards ## What are Dataset Cards? Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used. You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration) options. Tags are defined in a YAML metadata section at the top of the `README.md` file. 
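Concretely, this metadata lives in a YAML "front matter" block delimited by `---` markers at the very top of the file. As a minimal sketch (a hypothetical stdlib-only helper, not a Hub API), here is how such a block can be isolated from a card's text:

```python
# Hypothetical sketch: extract the YAML metadata block ("front matter")
# that sits between the two `---` markers at the top of a README.md.
def front_matter(readme_text):
    lines = readme_text.splitlines()
    if lines and lines[0].strip() == "---":
        try:
            end = lines[1:].index("---") + 1  # position of the closing marker
            return "\n".join(lines[1:end])
        except ValueError:  # no closing marker found
            pass
    return None

readme = """---
license: mit
tags:
- tag1
---
# My dataset
"""
print(front_matter(readme))  # -> "license: mit\ntags:\n- tag1"
```

A real card would then feed this block to a YAML parser; the sketch only shows where the metadata sits in the file.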
## Dataset card metadata A dataset repo will render its README.md as a dataset card. To control how the Hub displays the card, you should create a YAML section in the README file to define some metadata. Start by adding three `---` at the top, then include all of the relevant metadata, and close the section with another group of `---`, like in the example below: ```yaml language: - "List of ISO 639-1 code for your language" - lang1 - lang2 pretty_name: "Pretty Name of the Dataset" tags: - tag1 - tag2 license: "any valid license identifier" task_categories: - task1 - task2 ``` The metadata that you add to the dataset card enables certain interactions on the Hub. For example: * Allow users to filter and discover datasets at https://huggingface.co/datasets. * If you choose a license using the keywords listed in the right column of [this table](./repositories-licenses), the license will be displayed on the dataset page. When creating a README.md file in a dataset repository on the Hub, you can use the Metadata UI to fill in the main metadata. For the full list of metadata fields, see the detailed [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1). ### Dataset card creation guide For a step-by-step guide on creating a dataset card, check out the [Create a dataset card](https://huggingface.co/docs/datasets/dataset_card) guide. Reading through existing dataset cards, such as the [ELI5 dataset card](https://huggingface.co/datasets/eli5/blob/main/README.md), is a great way to familiarize yourself with the common conventions. ### Linking a Paper If the dataset card includes a link to a Paper page (either on HF or an arXiv abstract/PDF), the Hub will extract the arXiv ID and include it in the dataset tags in the format `arxiv:<paper ID>`. Clicking on the tag will let you: * Visit the Paper page * Filter for other models on the Hub that cite the same paper. Read more about paper pages [here](./paper-pages).
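To illustrate the tag format, here is a rough sketch of the kind of extraction involved (hypothetical code, not the Hub's actual implementation; it only handles new-style arXiv IDs):

```python
import re

# Hypothetical sketch: pull a (new-style) arXiv ID out of an
# abstract/PDF URL and build the corresponding dataset tag.
def arxiv_tag(url):
    m = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", url)
    return f"arxiv:{m.group(1)}" if m else None

print(arxiv_tag("https://arxiv.org/abs/1907.11692"))  # arxiv:1907.11692
```

The resulting tag is what appears on the dataset page and powers the paper-based filtering described above.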
### Force set a dataset modality The Hub will automatically detect the modality of a dataset based on the files it contains (audio, video, geospatial, etc.). If you want to force a specific modality, you can add a tag to the dataset card metadata: `3d`, `audio`, `geospatial`, `image`, `tabular`, `text`, `timeseries`, `video`. For example, to force the modality to `audio`, add the following to the dataset card metadata: ```yaml tags: - audio ``` ### Associate a library to the dataset The dataset page automatically shows libraries and tools that are able to natively load the dataset, but if you want to show another specific library, you can add a tag to the dataset card metadata: `argilla`, `dask`, `datasets`, `distilabel`, `fiftyone`, `mlcroissant`, `pandas`, `webdataset`. See the [list of supported libraries](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/dataset-libraries.ts) for more information, or to propose adding a new library. For example, to associate the `argilla` library to the dataset card, add the following to the dataset card metadata: ```yaml tags: - argilla ``` ### Docker Spaces https://huggingface.co/docs/hub/spaces-sdks-docker.md # Docker Spaces Spaces accommodate custom [Docker containers](https://docs.docker.com/get-started/) for apps outside the scope of Streamlit and Gradio. Docker Spaces allow users to go beyond the limits of what was previously possible with the standard SDKs. From FastAPI and Go endpoints to Phoenix apps and ML Ops tools, Docker Spaces can help in many different setups. ## Setting up Docker Spaces Selecting **Docker** as the SDK when [creating a new Space](https://huggingface.co/new-space) will initialize your Space by setting the `sdk` property to `docker` in your `README.md` file's YAML block. Alternatively, given an existing Space repository, set `sdk: docker` inside the `YAML` block at the top of your Spaces **README.md** file.
You can also change the default exposed port `7860` by setting `app_port` in the YAML block. Afterwards, you can create a usual `Dockerfile`. ```Yaml --- title: Basic Docker SDK Space emoji: ๐Ÿณ colorFrom: purple colorTo: gray sdk: docker app_port: 7860 --- ``` Internally, you can have as many open ports as you want. For instance, you can install Elasticsearch inside your Space and call it internally on its default port 9200. If you want to expose apps served on multiple ports to the outside world, a workaround is to use a reverse proxy like Nginx to dispatch requests from the broader internet (on a single port) to different internal ports. ## Secrets and Variables Management You can manage a Space's environment variables in the Space Settings. Read more [here](./spaces-overview#managing-secrets). ### Variables #### Buildtime Variables are passed as `build-arg`s when building your Docker Space. Read [Docker's dedicated documentation](https://docs.docker.com/engine/reference/builder/#arg) for a complete guide on how to use this in the Dockerfile. ```Dockerfile FROM python:latest # Declare your environment variables with the ARG directive ARG MODEL_REPO_NAME # [...] # You can use them like environment variables RUN python predict.py $MODEL_REPO_NAME ``` #### Runtime Variables are injected in the container's environment at runtime. ### Secrets #### Buildtime In Docker Spaces, the secrets management is different for security reasons. Once you create a secret in the [Settings tab](./spaces-overview#managing-secrets), you can expose it in your Dockerfile. For example, if `SECRET_EXAMPLE` is the name of the secret you created in the Settings tab, you can read it at build time by mounting it to a file and reading it with `$(cat /run/secrets/SECRET_EXAMPLE)`.
See an example below: ```Dockerfile # Expose the secret SECRET_EXAMPLE at buildtime and use its value as git remote URL RUN --mount=type=secret,id=SECRET_EXAMPLE,mode=0444,required=true \ git init && \ git remote add origin $(cat /run/secrets/SECRET_EXAMPLE) ``` ```Dockerfile # Expose the secret SECRET_EXAMPLE at buildtime and use its value as a Bearer token for a curl request RUN --mount=type=secret,id=SECRET_EXAMPLE,mode=0444,required=true \ curl test -H 'Authorization: Bearer $(cat /run/secrets/SECRET_EXAMPLE)' ``` #### Runtime Same as for public Variables, at runtime, you can access the secrets as environment variables. For example, in Python you would use `os.environ.get("SECRET_EXAMPLE")`. Check out this [example](https://huggingface.co/spaces/DockerTemplates/secret-example) of a Docker Space that uses secrets. ## Permissions The container runs with user ID 1000. To avoid permission issues you should create a user and set its `WORKDIR` before any `COPY` or download. ```Dockerfile # Set up a new user named "user" with user ID 1000 RUN useradd -m -u 1000 user # Switch to the "user" user USER user # Set home to the user's home directory ENV HOME=/home/user \ PATH=/home/user/.local/bin:$PATH # Set the working directory to the user's home directory WORKDIR $HOME/app # Try and run pip command after setting the user with `USER user` to avoid permission issues with Python RUN pip install --no-cache-dir --upgrade pip # Copy the current directory contents into the container at $HOME/app setting the owner to the user COPY --chown=user . $HOME/app # Download a checkpoint RUN mkdir content ADD --chown=user https:// content/ ``` Always specify the `--chown=user` with `ADD` and `COPY` to ensure the new files are owned by your user. If you still face permission issues, you might need to use `chmod` or `chown` in your `Dockerfile` to grant the right permissions. 
For example, if you want to use the directory `/data`, you can do: ```Dockerfile RUN mkdir -p /data RUN chmod 777 /data ``` You should always avoid superfluous chowns. > [!WARNING] > Updating metadata for a file creates a new copy stored in the new layer. Therefore, a recursive chown can result in a very large image due to the duplication of all affected files. Rather than fixing permissions by running `chown`: ``` COPY checkpoint . RUN chown -R user checkpoint ``` you should always do: ``` COPY --chown=user checkpoint . ``` (the same goes for the `ADD` command) ## Data Persistence Data written to disk is lost whenever your Docker Space restarts. To persist data across restarts, you can attach a [Storage Bucket](./storage-buckets) to your Space. At the moment, the `/data` volume is only available at runtime, i.e. you cannot use `/data` during the build step of your Dockerfile. You can also use our Datasets Hub for specific cases, where you can store state and data in a Git LFS repository. You can find an example of persistence [here](https://huggingface.co/spaces/Wauplin/space_to_dataset_saver), which uses the [`huggingface_hub` library](https://huggingface.co/docs/huggingface_hub/index) for programmatically uploading files to a dataset repository. This Space example along with [this guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#scheduled-uploads) will help you define which solution best fits your data type. Finally, in some cases, you might want to use an external storage solution from your Space's code, such as an externally hosted DB, S3, etc. ### Docker container with GPU You can run Docker containers with GPU support by using one of our GPU-flavored [Spaces Hardware](./spaces-gpus). We recommend using the [`nvidia/cuda`](https://hub.docker.com/r/nvidia/cuda) image from Docker Hub as a base image, which comes with CUDA and cuDNN pre-installed. During Docker buildtime, you don't have access to GPU hardware.
Therefore, you should not try to run any GPU-related command during the build step of your Dockerfile. For example, you can't run `nvidia-smi` or `torch.cuda.is_available()` while building an image. Read more [here](https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#description). ## Read More - [Full Docker demo example](spaces-sdks-docker-first-demo) - [List of Docker Spaces examples](spaces-sdks-docker-examples) - [Spaces Examples](https://huggingface.co/SpacesExamples) ### Streamlit Spaces https://huggingface.co/docs/hub/spaces-sdks-streamlit.md # Streamlit Spaces **Streamlit** gives users freedom to build a full-featured web app with Python in a *reactive* way. Your code is rerun each time the state of the app changes. Streamlit is also great for data visualization and supports several charting libraries such as Bokeh, Plotly, and Altair. Read this [blog post](https://huggingface.co/blog/streamlit-spaces) about building and hosting Streamlit apps in Spaces. Selecting **Streamlit** as the SDK when [creating a new Space](https://huggingface.co/new-space) will initialize your Space with the latest version of Streamlit by setting the `sdk` property to `streamlit` in your `README.md` file's YAML block. If you'd like to change the Streamlit version, you can edit the `sdk_version` property. To use Streamlit in a Space, select **Streamlit** as the SDK when you create a Space through the [**New Space** form](https://huggingface.co/new-space). This will create a repository with a `README.md` that contains the following properties in the YAML configuration block: ```yaml sdk: streamlit sdk_version: 1.25.0 # The latest supported version ``` You can edit the `sdk_version`, but note that issues may occur when you use an unsupported Streamlit version. Not all Streamlit versions are supported, so please refer to the [reference section](./spaces-config-reference) to see which versions are available.
For in-depth information about Streamlit, refer to the [Streamlit documentation](https://docs.streamlit.io/). > [!WARNING] > Only port 8501 is allowed for Streamlit Spaces (default port). As a result, if you provide a `config.toml` file for your Space, make sure the default port is not overridden. ## Your First Streamlit Space: Hot Dog Classifier In the following sections, you'll learn the basics of creating a Space, configuring it, and deploying your code to it. We'll create a **Hot Dog Classifier** Space with Streamlit that'll be used to demo the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which can detect whether a given picture contains a hot dog ๐ŸŒญ. You can find a completed version of this hosted at [NimaBoscarino/hotdog-streamlit](https://huggingface.co/spaces/NimaBoscarino/hotdog-streamlit). ## Create a new Streamlit Space We'll start by [creating a brand new Space](https://huggingface.co/new-space) and choosing **Streamlit** as our SDK. Hugging Face Spaces are Git repositories, meaning that you can work on your Space incrementally (and collaboratively) by pushing commits. Take a look at the [Getting Started with Repositories](./repositories-getting-started) guide to learn about how you can create and edit files before continuing. ## Add the dependencies For the **Hot Dog Classifier** we'll be using a [๐Ÿค— Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to run the model, so we need to start by installing a few dependencies. This can be done by creating a **requirements.txt** file in our repository, and adding the following dependencies to it: ``` transformers torch ``` The Spaces runtime will handle installing the dependencies!
## Create the Streamlit app To create the Streamlit app, make a new file in the repository called **app.py**, and add the following code: ```python import streamlit as st from transformers import pipeline from PIL import Image pipeline = pipeline(task="image-classification", model="julien-c/hotdog-not-hotdog") st.title("Hot Dog? Or Not?") file_name = st.file_uploader("Upload a hot dog candidate image") if file_name is not None: col1, col2 = st.columns(2) image = Image.open(file_name) col1.image(image, use_column_width=True) predictions = pipeline(image) col2.header("Probabilities") for p in predictions: col2.subheader(f"{ p['label'] }: { round(p['score'] * 100, 1)}%") ``` This Python script uses a [๐Ÿค— Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to load the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which is used by the Streamlit interface. The Streamlit app will expect you to upload an image, which it'll then classify as *hot dog* or *not hot dog*. Once you've saved the code to the **app.py** file, visit the **App** tab to see your app in action! ## Embed Streamlit Spaces on other webpages You can use the HTML `<iframe>` tag to embed your Streamlit Space in other webpages. Additionally, you can check out [our documentation](./spaces-embed). ### Annotated Model Card Template https://huggingface.co/docs/hub/model-card-annotated.md # Annotated Model Card Template ## Template [modelcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md) ## Directions Fully filling out a model card requires input from a few different roles. (One person may have more than one role.)
Weโ€™ll refer to these roles as the **developer**, who writes the code and runs training; the **sociotechnic**, who is skilled at analyzing the interaction of technology and society long-term (this includes lawyers, ethicists, sociologists, or rights advocates); and the **project organizer**, who understands the overall scope and reach of the model, can roughly fill out each part of the card, and who serves as a contact person for model card updates. * The **developer** is necessary for filling out [Training Procedure](#training-procedure-optional) and [Technical Specifications](#technical-specifications-optional). They are also particularly useful for the โ€œLimitationsโ€ section of [Bias, Risks, and Limitations](#bias-risks-and-limitations). They are responsible for providing [Results](#results) for the Evaluation, and ideally work with the other roles to define the rest of the Evaluation: [Testing Data, Factors & Metrics](#testing-data-factors--metrics). * The **sociotechnic** is necessary for filling out โ€œBiasโ€ and โ€œRisksโ€ within [Bias, Risks, and Limitations](#bias-risks-and-limitations), and particularly useful for โ€œOut of Scope Useโ€ within [Uses](#uses). * The **project organizer** is necessary for filling out [Model Details](#model-details) and [Uses](#uses). They might also fill out [Training Data](#training-data). Project organizers could also be in charge of [Citation](#citation-optional), [Glossary](#glossary-optional), [Model Card Contact](#model-card-contact), [Model Card Authors](#model-card-authors-optional), and [More Information](#more-information-optional). _Instructions are provided below, in italics._ Template variable names appear in `monospace`. --- # Model Name **Section Overview:** Provide the model name and a 1-2 sentence summary of what the model is. 
`model_id` `model_summary` # Table of Contents **Section Overview:** Provide this with links to each section, to enable people to easily jump around/use the file in other locations with the preserved TOC/print out the content/etc. # Model Details **Section Overview:** This section provides basic information about what the model is, its current status, and where it came from. It should be useful for anyone who wants to reference the model. ## Model Description `model_description` _Provide basic details about the model. This includes the architecture, version, if it was introduced in a paper, if an original implementation is available, and the creators. Any copyright should be attributed here. General information about training procedures, parameters, and important disclaimers can also be mentioned in this section._ * **Developed by:** `developers` _List (and ideally link to) the people who built the model._ * **Funded by:** `funded_by` _List (and ideally link to) the funding sources that financially, computationally, or otherwise supported or enabled this model._ * **Shared by [optional]:** `shared_by` _List (and ideally link to) the people/organization making the model available online._ * **Model type:** `model_type` _You can name the โ€œtypeโ€ as:_ _1. Supervision/Learning Method_ _2. Machine Learning Type_ _3. Modality_ * **Language(s)** [NLP]: `language` _Use this field when the system uses or processes natural (human) language._ * **License:** `license` _Name and link to the license being used._ * **Finetuned From Model [optional]:** `base_model` _If this model has another model as its base, link to that model here._ ## Model Sources [optional] * **Repository:** `repo` * **Paper [optional]:** `paper` * **Demo [optional]:** `demo` _Provide sources for the user to directly see the model and its details. Additional kinds of resources โ€“ training logs, lessons learned, etc. โ€“ belong in the [More Information](#more-information-optional) section. 
If you include one thing for this section, link to the repository._ # Uses **Section Overview:** This section addresses questions around how the model is intended to be used in different applied contexts, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model. Note this section is not intended to include the license usage details. For that, link directly to the license. ## Direct Use `direct_use` _Explain how the model can be used without fine-tuning, post-processing, or plugging into a pipeline. An example code snippet is recommended._ ## Downstream Use [optional] `downstream_use` _Explain how this model can be used when fine-tuned for a task or when plugged into a larger ecosystem or app. An example code snippet is recommended._ ## Out-of-Scope Use `out_of_scope_use` _List how the model may foreseeably be misused (used in a way it will not work for) and address what users ought not do with the model._ # Bias, Risks, and Limitations **Section Overview:** This section identifies foreseeable harms, misunderstandings, and technical and sociotechnical limitations. It also provides information on warnings and potential mitigations. Bias, risks, and limitations can sometimes be inseparable/refer to the same issues. Generally, bias and risks are sociotechnical, while limitations are technical: - A **bias** is a stereotype or disproportionate performance (skew) for some subpopulations. - A **risk** is a socially-relevant issue that the model might cause. - A **limitation** is a likely failure mode that can be addressed following the listed Recommendations. `bias_risks_limitations` _What are the known or foreseeable issues stemming from this model?_ ## Recommendations `bias_recommendations` _What are recommendations with respect to the foreseeable issues? 
This can include everything from "downsample your image" to filtering explicit content._ # Training Details **Section Overview:** This section provides information to describe and replicate training, including the training data, the speed and size of training elements, and the environmental impact of training. This relates heavily to the [Technical Specifications](#technical-specifications-optional) as well, and content here should link to that section when it is relevant to the training procedure. It is useful for people who want to learn more about the model inputs and training footprint. It is relevant for anyone who wants to know the basics of what the model is learning. ## Training Data `training_data` _Write 1-2 sentences on what the training data is. Ideally this links to a Dataset Card for further information. Links to documentation related to data pre-processing or additional filtering may go here as well as in [More Information](#more-information-optional)._ ## Training Procedure [optional] ### Preprocessing `preprocessing` _Detail tokenization, resizing/rewriting (depending on the modality), etc._ ### Speeds, Sizes, Times `speeds_sizes_times` _Detail throughput, start/end time, checkpoint sizes, etc._ # Evaluation **Section Overview:** This section describes the evaluation protocols, what is being measured in the evaluation, and provides the results. Evaluation ideally has at least two parts, with one part looking at quantitative measurement of general performance ([Testing Data, Factors & Metrics](#testing-data-factors--metrics)), such as may be done with benchmarking; and another looking at performance with respect to specific social safety issues ([Societal Impact Assessment](#societal-impact-assessment-optional)), such as may be done with red-teaming. You can also specify your model's evaluation results in a structured way in the model card metadata. Results are parsed by the Hub and displayed in a widget on the model page. 
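A minimal sketch of what that structured metadata can look like in the card's YAML front matter, using the Hub's `model-index` format (all model, dataset, and metric values below are hypothetical placeholders):

```yaml
model-index:
- name: my-model                # hypothetical model name
  results:
  - task:
      type: text-classification # task type the result applies to
    dataset:
      name: GLUE (SST-2)        # hypothetical evaluation dataset
      type: glue
    metrics:
    - type: accuracy
      value: 0.91               # hypothetical score
```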
See https://huggingface.co/docs/hub/model-cards#evaluation-results. ## Testing Data, Factors & Metrics _Evaluation is ideally **disaggregated** with respect to different factors, such as task, domain and population subgroup; and calculated with metrics that are most meaningful for foreseeable contexts of use. Equal evaluation performance across different subgroups is said to be "fair" across those subgroups; target fairness metrics should be decided based on which errors are more likely to be problematic in light of the model use. However, this section is most commonly used to report aggregate evaluation performance on different task benchmarks._ ### Testing Data `testing_data` _Describe testing data or link to its Dataset Card._ ### Factors `testing_factors` _What are the foreseeable characteristics that will influence how the model behaves? Evaluation should ideally be disaggregated across these factors in order to uncover disparities in performance._ ### Metrics `testing_metrics` _What metrics will be used for evaluation?_ ## Results `results` _Results should be based on the Factors and Metrics defined above._ ### Summary `results_summary` _What do the results say? This can function as a kind of tl;dr for general audiences._ ## Societal Impact Assessment [optional] _Use this free text section to explain how this model has been evaluated for risk of societal harm, such as for child safety, NCII, privacy, and violence. This might take the form of answers to the following questions:_ - _Is this model safe for kids to use? Why or why not?_ - _Has this model been tested to evaluate risks pertaining to non-consensual intimate imagery (including CSEM)?_ - _Has this model been tested to evaluate risks pertaining to violent activities, or depictions of violence? 
What were the results?_ _Quantitative numbers on each issue may also be provided._ # Model Examination [optional] **Section Overview:** This is an experimental section some developers are beginning to add, where work on explainability/interpretability may go. `model_examination` # Environmental Impact **Section Overview:** Summarizes the information necessary to calculate environmental impacts such as electricity usage and carbon emissions. * **Hardware Type:** `hardware_type` * **Hours used:** `hours_used` * **Cloud Provider:** `cloud_provider` * **Compute Region:** `cloud_region` * **Carbon Emitted:** `co2_emitted` _Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700)._ # Technical Specifications [optional] **Section Overview:** This section includes details about the model objective and architecture, and the compute infrastructure. It is useful for people interested in model development. Writing this section usually requires the model developer to be directly involved. ## Model Architecture and Objective `model_specs` ## Compute Infrastructure `compute_infrastructure` ### Hardware `hardware_requirements` _What are the minimum hardware requirements, e.g. processing, storage, and memory requirements?_ ### Software `software` # Citation [optional] **Section Overview:** The developers' preferred citation for this model. This is often a paper. ### BibTeX `citation_bibtex` ### APA `citation_apa` # Glossary [optional] **Section Overview:** This section defines common terms and how metrics are calculated. `glossary` _Clearly define terms in order to be accessible across audiences._ # More Information [optional] **Section Overview:** This section provides links to writing on dataset creation, technical specifications, lessons learned, and initial results. 
`more_information` # Model Card Authors [optional] **Section Overview:** This section lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction. `model_card_authors` # Model Card Contact **Section Overview:** Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors `model_card_contact` # How to Get Started with the Model **Section Overview:** Provides a code snippet to show how to use the model. `get_started_code` --- **Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook ### How to handle URL parameters in Spaces https://huggingface.co/docs/hub/spaces-handle-url-parameters.md # How to handle URL parameters in Spaces You can use URL query parameters as a data sharing mechanism, for instance to be able to deep-link into an app with a specific state. On a Space page (`https://huggingface.co/spaces//`), the actual application page (`https://*.hf.space/`) is embedded in an iframe. The query string and the hash attached to the parent page URL are propagated to the embedded app on initial load, so the embedded app can read these values without special consideration. In contrast, updating the query string and the hash of the parent page URL from the embedded app is slightly more complex. If you want to do this in a Docker or static Space, you need to add the following JS code that sends a message to the parent page that has a `queryString` and/or `hash` key. ```js const queryString = "..."; const hash = "..."; window.parent.postMessage({ queryString, hash, }, "https://huggingface.co"); ``` **This is only for Docker or static Spaces.** For Streamlit apps, Spaces automatically syncs the URL parameters. 
Gradio apps can read the query parameters from the Spaces page, but do not sync updated URL parameters with the parent page. Note that the URL parameters of the parent page are propagated to the embedded app *only* on the initial load. So `location.hash` in the embedded app will not change even if the parent URL hash is updated using this method. An example of this method can be found in this static Space, [`whitphx/static-url-param-sync-example`](https://huggingface.co/spaces/whitphx/static-url-param-sync-example). ### Signing commits with GPG https://huggingface.co/docs/hub/security-gpg.md # Signing commits with GPG `git` has an authentication layer to control who can push commits to a repo, but it does not authenticate the actual commit authors. In other words, you can commit changes as `Elon Musk `, push them to your preferred `git` host (for instance github.com), and your commit will link to Elon's GitHub profile. (Try it! But don't blame us if Elon gets mad at you for impersonating him.) The reasons we implemented GPG signing were: - To provide finer-grained security, especially as more and more Enterprise users rely on the Hub. - To provide ML benchmarks backed by a cryptographically-secure source. See Ale Segala's [How (and why) to sign `git` commits](https://withblue.ink/2020/05/17/how-and-why-to-sign-git-commits.html) for more context. You can prove a commit was authored by you with GNU Privacy Guard (GPG) and a key server. GPG is a cryptographic tool used to verify the authenticity of a message's origin. We'll explain how to set this up on Hugging Face below. The Pro Git book is, as usual, a good resource about commit signing: [Pro Git: Signing your work](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work). ## Setting up signed commits verification You will need to install [GPG](https://gnupg.org/) on your system in order to execute the following commands. > It's included by default in most Linux distributions. 
> On Windows, it is included in Git Bash (which comes with `git` for Windows). You can sign your commits locally using [GPG](https://gnupg.org/). Then configure your profile to mark these commits as **verified** on the Hub, so other people can be confident that they come from a trusted source. For a more in-depth explanation of how git and GPG interact, please visit the [git documentation on the subject](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work) Commits can have the following signing statuses: | Status | Explanation | | ----------------- | ------------------------------------------------------------ | | Verified | The commit is signed and the signature is verified | | Unverified | The commit is signed but the signature could not be verified | | No signing status | The commit is not signed | For a commit to be marked as **verified**, you need to upload the public key used to sign it on your Hugging Face account. Use the `gpg --list-secret-keys` command to list the GPG keys for which you have both a public and private key. A private key is required for signing commits or tags. If you don't have a GPG key pair or you don't want to use the existing keys to sign your commits, go to **Generating a new GPG key**. Otherwise, go straight to [Adding a GPG key to your account](#adding-a-gpg-key-to-your-account). ## Generating a new GPG key To generate a GPG key, run the following: ```bash gpg --gen-key ``` GPG will then guide you through the process of creating a GPG key pair. Make sure you specify an email address for this key, and that the email address matches the one you specified in your Hugging Face [account](https://huggingface.co/settings/account). ## Adding a GPG key to your account 1. First, select or generate a GPG key on your computer. Make sure the email address of the key matches the one in your Hugging Face [account](https://huggingface.co/settings/account) and that the email of your account is verified. 2. 
Export the public part of the selected key: ```bash gpg --armor --export ``` 3. Then visit your profile [settings page](https://huggingface.co/settings/keys) and click on **Add GPG Key**. Copy & paste the output of the `gpg --export` command in the text area and click on **Add Key**. 4. Congratulations! 🎉 You've just added a GPG key to your account! ## Configure git to sign your commits with GPG The last step is to configure git to sign your commits: ```bash git config user.signingkey git config user.email ``` Then add the `-S` flag to your `git commit` commands to sign your commits! ```bash git commit -S -m "My first signed commit" ``` Once pushed on the Hub, you should see the commit with a "Verified" badge. > [!TIP] > To sign all commits by default in any local repository on your computer, you can run `git config --global commit.gpgsign true`. ### User Studies https://huggingface.co/docs/hub/model-cards-user-studies.md # User Studies ## Model Card Audiences and Use Cases During our investigation into the landscape of model documentation tools (data cards, etc.), we noted how different stakeholders make use of existing infrastructure to create a kind of model card with information focused on their needed domain. One such example is "business analysts", or those whose focus is on B2B as well as an internal-only audience. The static and more manual approach for this audience is using Confluence pages (*if PMs write the page, we are detaching the model creators from its theoretical consumption; if ML engineers write the page, they may tend to stress only a certain type of information.* [^1]), or a proposed combination of HTML (Jinja) templates, Metaflow classes, and external API keys, in order to create model cards that include the perspective of the model information that is needed for their domain/use case. 
We conducted a user study with the aim of validating a literature-informed model card structure and understanding which sections different stakeholders rank as most important. The study aimed to validate the following components: * **Model Card Layout** Our examination of the state of the art of model cards noted recurring sections from the top ~100 downloaded models on the Hub that had model cards. From this analysis we catalogued the top recurring model card sections and recurring information; this, coupled with the structure of the Bloom model card, led us to the initial version of a standard model card structure. As we began to structure our user studies, two variations of model cards that made use of the [initial model card structure](./model-card-annotated) were used as interactive demonstrations. The aim of these demos was to understand not only the different user perspectives on the visual elements of the model cards but also the content presented to users. The desired outcome would enable us to further understand what makes a model card easier to read, while still providing some level of interactivity and presenting the information in an approachable manner. * **Stakeholder Perspectives** As different people, of varying technical backgrounds, could be collaborating on a model and subsequently the model card, we sought to validate the need for different stakeholders' perspectives. Participants ranked the different sections of model cards first from the perspective of someone reading a model card, and then as an author of a model card, based on which sections one would read first and on the ease of writing each section. 
An ordering scheme - 1 being the highest weight and 10 being the lowest - was applied to the different sections that the user would usually read first in a model card, and to the sections of a model card that a model card author would find easiest to write. ## Summary of Responses to the User Studies Survey Our user studies provided further clarity on the sections that different user profiles/stakeholders would find more challenging or easier to write. The results illustrated below show that while the Bias, Risks and Limitations section ranks second for both model card writers and model card readers (for *In what order do you write the model card* and *What section do you look at first*, respectively), it is also noted as the most challenging and longest section to write. This endorsed the need to further evaluate the Bias, Risks and Limitations section in order to assist with writing this imperative section. These templates were then used to generate model cards for the top 200 most downloaded Hugging Face (HF) models. * We first began by pulling all Hugging Face model cards on the Hub and, in particular, their subsections on Limitations and Bias ("Risks" subsections were largely not present). * Based on the inputs that were most continuously used in models with a higher number of downloads, grouped by model type, the tool provides prompted text within the Bias, Risks and Limitations sections. We also prompt a default text if the model type is not specified. Using this information, we returned to our analysis of all model cards on the Hub, coupled with suggestions from other researchers and peers at HF and additional research on the type of prompted information we could provide to users while they are creating model cards. 
This default prompted text allowed us to satisfy two aims: 1) For those who have not created model cards before, or who do not usually make a model card or any other type of model documentation for their models, the prompted text enables these users to easily create a model card. This in turn increased the number of model cards created. 2) For users who already write model cards, the prompted text invites them to add more to their model card, further developing the content and standard of model cards. ## User Study Details We selected people from a variety of different backgrounds relevant to machine learning and model documentation. Below, we detail their demographics, the questions they were asked, and the corresponding insights from their responses. Full details on responses are available in [Appendix A](./model-card-appendix#appendix-a-user-study). ### Respondent Demographics * Tech & Regulatory Affairs Counsel * ML Engineer (x2) * Developer Advocate * Executive Assistant * Monetization Lead * Policy Manager/AI Researcher * Research Intern **What are the key pieces of information you want or need to know about a model when interacting with a machine learning model?** **Insight:** * Respondents prioritised information about the model task/domain (x3), training data/training procedure (x2), how to use the model (with code) (x2), bias and limitations, and the model licence. ### Feedback on Specific Model Card Formats #### Format 1: **Current [distilbert/distilgpt2 model card](https://huggingface.co/distilbert/distilgpt2) on the Hub** **Insights:** * Respondents found this model card format to be concise, complete, and readable. * There was no consensus about the collapsible sections (some liked them and wanted more, some disliked them). 
* Some respondents said "Risks and Limitations" should go with "Out of Scope Uses". #### Format 2: **Nazneen Rajani's [Interactive Model Card space](https://huggingface.co/spaces/nazneen/interactive-model-cards)** **Insights:** * While a few respondents really liked this format, most found it overwhelming, an overload of information. Several suggested this could be a nice tool to layer onto a base model card for more advanced audiences. #### Format 3: **Ezi Ozoani's [Semi-Interactive Model Card Space](https://huggingface.co/spaces/Ezi/ModelCardsAnalysis)** **Insights:** * Several respondents found this format overwhelming, but they generally found it less overwhelming than Format 2. * Several respondents disagreed with the current layout and gave specific feedback about which sections should be prioritised within each column. ### Section Rankings *Ordered based on average ranking. Arrows are shown relative to the order of the associated section in the question on the survey.* **Insights:** * When writing model cards, respondents generally said they would write a model card in the same order in which the sections were listed in the survey question. * When ranking the sections of the model card by ease/quickness of writing, the consensus was that the sections on uses and on limitations and risks were the most difficult. * When reading model cards, respondents said they looked at the cards' sections in an order that was close to – but not perfectly aligned with – the order in which the sections were listed in the survey question. 
![user studies results 1](https://huggingface.co/datasets/huggingface/documentation-images/blob/main/hub/usaer-studes-responses(1).png) ![user studies results 2](https://huggingface.co/datasets/huggingface/documentation-images/blob/main/hub/user-studies-responses(2).png) > [!TIP] > [Check out the Appendix](./model-card-appendix) Acknowledgements ================ We want to acknowledge and thank [Bibi Ofuya](https://www.figma.com/proto/qrPCjWfFz5HEpWqQ0PJSWW/Bibi's-Portfolio?page-id=0%3A1&node-id=1%3A28&viewport=243%2C48%2C0.2&scaling=min-zoom&starting-point-node-id=1%3A28) for her question creation and her guidance on user-focused ordering and presentation during the user studies. [^1]: See https://towardsdatascience.com/dag-card-is-the-new-model-card-70754847a111 --- **Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook ### Skills https://huggingface.co/docs/hub/agents-skills.md # Skills > [!TIP] > Looking for the `hf` CLI Skill? It's the quickest way to connect your agent to the Hugging Face ecosystem. See the [Hugging Face CLI for AI Agents](./agents-cli) guide. Hugging Face provides a curated set of Skills built for AI builders. Train models, create datasets, run evaluations, track experiments. Each Skill is a self-contained `SKILL.md` that your agent follows while working on the task. Skills work with all major coding agents: Claude Code, OpenAI Codex, Google Gemini CLI, and Cursor. Learn more about the format at [agentskills.io](https://agentskills.io). ## Installation ```bash # register the skills marketplace /plugin marketplace add huggingface/skills # install a specific Skill /plugin install @huggingface/skills ``` Copy or symlink skills from the [repository](https://github.com/huggingface/skills) into one of Codex's standard `.agents/skills` locations (e.g. `$REPO_ROOT/.agents/skills` or `$HOME/.agents/skills`). 
Codex discovers them automatically via the Agent Skills standard. Alternatively, use the bundled [`agents/AGENTS.md`](https://github.com/huggingface/skills/blob/main/agents/AGENTS.md) as a fallback. ```bash gemini extensions install https://github.com/huggingface/skills.git --consent ``` Install via the Cursor plugin flow using the [repository URL](https://github.com/huggingface/skills). The repo includes `.cursor-plugin/plugin.json` and `.mcp.json` manifests. ## Available Skills | Skill | What it does | | ----- | ------------ | | [`hf-cli`](https://github.com/huggingface/skills/tree/main/skills/hf-cli) | Hub operations via the `hf` CLI: download, upload, manage repos, run jobs | | [`huggingface-datasets`](https://github.com/huggingface/skills/tree/main/skills/huggingface-datasets) | Explore datasets, paginate rows, search text, apply filters | | [`huggingface-llm-trainer`](https://github.com/huggingface/skills/tree/main/skills/huggingface-llm-trainer) | Train or fine-tune LLMs with TRL (SFT, DPO, GRPO) on HF Jobs | | [`huggingface-vision-trainer`](https://github.com/huggingface/skills/tree/main/skills/huggingface-vision-trainer) | Train object detection and image classification models | | [`huggingface-community-evals`](https://github.com/huggingface/skills/tree/main/skills/huggingface-community-evals) | Run evaluations against models on the Hugging Face Hub on local hardware | | [`huggingface-trackio`](https://github.com/huggingface/skills/tree/main/skills/huggingface-trackio) | Track and visualize ML training experiments with Trackio | | [`huggingface-papers`](https://github.com/huggingface/skills/tree/main/skills/huggingface-papers) | Look up and read Hugging Face paper pages in markdown | | [`huggingface-paper-publisher`](https://github.com/huggingface/skills/tree/main/skills/huggingface-paper-publisher) | Publish and manage research papers on the Hub | | 
[`huggingface-tool-builder`](https://github.com/huggingface/skills/tree/main/skills/huggingface-tool-builder) | Build reusable scripts for HF API operations | | [`gradio`](https://github.com/huggingface/skills/tree/main/skills/huggingface-gradio) | Build Gradio web UIs and demos | | [`transformers-js`](https://github.com/huggingface/skills/tree/main/skills/transformers-js) | Run ML models in JavaScript/TypeScript with WebGPU/WASM | ## Using Skills Once installed, mention the Skill directly in your prompt: - "Use the HF model trainer Skill to fine-tune Qwen3-0.6B with SFT on the Capybara dataset" - "Use the HF evaluation Skill to add benchmark results to my model card" - "Use the HF datasets Skill to create a new dataset from these examples" Your agent loads the corresponding `SKILL.md` instructions and helper scripts automatically. ## Resources - [Skills Repository](https://github.com/huggingface/skills) - Browse and contribute - [Agent Skills format](https://agentskills.io/home) - Specification and docs - [CLI Guide](./agents-cli) - Hugging Face CLI for AI Agents - [MCP Guide](./agents-mcp) - Use alongside Skills ### GGUF usage with GPT4All https://huggingface.co/docs/hub/gguf-gpt4all.md # GGUF usage with GPT4All [GPT4All](https://gpt4all.io/) is an open-source LLM application developed by [Nomic](https://nomic.ai/). Version 2.7.2 introduces a brand new, experimental feature called `Model Discovery`. `Model Discovery` provides a built-in way to search for and download GGUF models from the Hub. To get started, open GPT4All and click `Download Models`. From here, you can use the search bar to find a model. After you have selected and downloaded a model, you can go to `Settings` and provide an appropriate prompt template in the GPT4All format (`%1` and `%2` placeholders). Then from the main page, you can select the model from the list of installed models and start a conversation. 
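As an illustrative sketch of such a template (this Alpaca-style layout is an assumption, not taken from GPT4All's documentation, and the right template depends on the model you downloaded), `%1` conventionally marks where the user's message is inserted and `%2` where the model's response goes:

```
### Instruction:
%1

### Response:
%2
```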
### GGUF https://huggingface.co/docs/hub/gguf.md # GGUF Hugging Face Hub supports all file formats, but has built-in features for the [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md), a binary format that is optimized for quick loading and saving of models, making it highly efficient for inference purposes. GGUF is designed for use with GGML and other executors. GGUF was developed by [@ggerganov](https://huggingface.co/ggerganov), who is also the developer of [llama.cpp](https://github.com/ggerganov/llama.cpp), a popular C/C++ LLM inference framework. Models initially developed in frameworks like PyTorch can be converted to GGUF format for use with those engines. As we can see in this graph, unlike tensor-only file formats like [safetensors](https://huggingface.co/docs/safetensors) – which is also a recommended model format for the Hub – GGUF encodes both the tensors and a standardized set of metadata. ## Finding GGUF files You can browse all models with GGUF files by filtering by the GGUF tag: [hf.co/models?library=gguf](https://huggingface.co/models?library=gguf). Moreover, you can use the [ggml-org/gguf-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) tool to convert/quantize your model weights into GGUF weights. For example, you can check out [TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF) to see GGUF files in action. ## Viewer for metadata & tensors info The Hub has a viewer for GGUF files that lets a user check out metadata & tensors info (name, shape, precision). The viewer is available on the model page ([example](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF?show_tensors=mixtral-8x7b-instruct-v0.1.Q4_0.gguf)) & the files page ([example](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/tree/main?show_tensors=mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf)). 
## Usage with open-source tools

* [llama.cpp](./gguf-llamacpp)
* [LM Studio](./lmstudio)
* [GPT4All](./gguf-gpt4all)
* [Ollama](./ollama)

## Parsing the metadata with @huggingface/gguf

We've also created a JavaScript GGUF parser that works on remotely hosted files (e.g. on the Hugging Face Hub).

```bash
npm install @huggingface/gguf
```

```ts
import { gguf } from "@huggingface/gguf";

// remote GGUF file from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
const URL_LLAMA = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/191239b/llama-2-7b-chat.Q2_K.gguf";

const { metadata, tensorInfos } = await gguf(URL_LLAMA);
```

Find more information [here](https://github.com/huggingface/huggingface.js/tree/main/packages/gguf).

## Quantization Types

| type | source | description |
|------|--------|-------------|
| F64 | [Wikipedia](https://en.wikipedia.org/wiki/Double-precision_floating-point_format) | 64-bit standard IEEE 754 double-precision floating-point number. |
| I64 | [GH](https://github.com/ggerganov/llama.cpp/pull/6062) | 64-bit fixed-width integer number. |
| F32 | [Wikipedia](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) | 32-bit standard IEEE 754 single-precision floating-point number. |
| I32 | [GH](https://github.com/ggerganov/llama.cpp/pull/6045) | 32-bit fixed-width integer number. |
| F16 | [Wikipedia](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) | 16-bit standard IEEE 754 half-precision floating-point number. |
| BF16 | [Wikipedia](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) | 16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number. |
| I16 | [GH](https://github.com/ggerganov/llama.cpp/pull/6045) | 16-bit fixed-width integer number. |
| Q8_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 8-bit quantization (`q`). Each block has 256 weights. Only used for quantizing intermediate results. All 2-6 bit dot products are implemented for this quantization type. Weight formula: `w = q * block_scale`. |
| I8 | [GH](https://github.com/ggerganov/llama.cpp/pull/6045) | 8-bit fixed-width integer number. |
| Q6_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 6-bit quantization (`q`). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: `w = q * block_scale(8-bit)`, resulting in 6.5625 bits-per-weight. |
| Q5_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 5-bit quantization (`q`). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: `w = q * block_scale(6-bit) + block_min(6-bit)`, resulting in 5.5 bits-per-weight. |
| Q4_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 4-bit quantization (`q`). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: `w = q * block_scale(6-bit) + block_min(6-bit)`, resulting in 4.5 bits-per-weight. |
| Q3_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 3-bit quantization (`q`). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: `w = q * block_scale(6-bit)`, resulting in 3.4375 bits-per-weight. |
| Q2_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 2-bit quantization (`q`). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: `w = q * block_scale(4-bit) + block_min(4-bit)`, resulting in 2.625 bits-per-weight. |
| IQ4_NL | [GH](https://github.com/ggerganov/llama.cpp/pull/5590) | 4-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`. |
| IQ4_XS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 4-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 4.25 bits-per-weight. |
| IQ3_S | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 3-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 3.44 bits-per-weight. |
| IQ3_XXS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 3-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 3.06 bits-per-weight. |
| IQ2_XXS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 2-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 2.06 bits-per-weight. |
| IQ2_S | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 2-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 2.5 bits-per-weight. |
| IQ2_XS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 2-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 2.31 bits-per-weight. |
| IQ1_S | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 1-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 1.56 bits-per-weight. |
| IQ1_M | [GH](https://github.com/ggerganov/llama.cpp/pull/6302) | 1-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 1.75 bits-per-weight. |
| TQ1_0 | [GH](https://github.com/ggml-org/llama.cpp/pull/8151) | Ternary quantization. |
| TQ2_0 | [GH](https://github.com/ggml-org/llama.cpp/pull/8151) | Ternary quantization. |
| MXFP4 | [GH](https://github.com/ggml-org/llama.cpp/pull/15091) | 4-bit Microscaling Block Floating Point. |
| **Legacy types** | | |
| Q8_0 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249) | 8-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). |
| Q8_1 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290) | 8-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today). |
| Q5_0 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249) | 5-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). |
| Q5_1 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290) | 5-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today). |
| Q4_0 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249) | 4-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). |
| Q4_1 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290) | 4-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today). |

*If there's any inaccuracy in the table above, please open a PR on [this file](https://github.com/huggingface/huggingface.js/blob/main/packages/gguf/src/quant-descriptions.ts).*

### Jupyter Notebooks on the Hugging Face Hub

https://huggingface.co/docs/hub/notebooks.md

# Jupyter Notebooks on the Hugging Face Hub

[Jupyter notebooks](https://jupyter.org/) are a very popular format for sharing code and data analysis for machine learning and data science. They are interactive documents that can contain code, visualizations, and text.

## Open models in Google Colab and Kaggle

When you visit a model page on the Hugging Face Hub, you'll see a new "Google Colab"/"Kaggle" button in the "Use this model" dropdown. Clicking it generates a ready-to-run notebook with basic code to load and test the model. This is perfect for quick prototyping, inference testing, or fine-tuning experiments, all without leaving your browser.

![Google Colab and Kaggle option for models on the Hub](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/hf-google-colab/gemma3-4b-it-dark.png)

You can also access a ready-to-run notebook by appending `/colab` to the model card's URL. As an example, for the latest Gemma 3 4B IT model, take the model card URL:

https://huggingface.co/google/gemma-3-4b-it

and append `/colab` to it:

https://huggingface.co/google/gemma-3-4b-it/colab

and similarly for Kaggle:

https://huggingface.co/google/gemma-3-4b-it/kaggle

If a model repository includes a file called `notebook.ipynb`, we will use it for Colab and Kaggle instead of the auto-generated notebook content. Model authors can provide tailored examples, detailed walkthroughs, or advanced use cases while still benefiting from one-click Colab integration. [NousResearch/Genstruct-7B](https://huggingface.co/NousResearch/Genstruct-7B) is one such example.
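The URL scheme above is easy to script. A minimal sketch (the helper name `notebook_urls` is ours for illustration, not a Hub API):

```python
# Build the one-click notebook URLs for a Hub model repo
# by appending /colab or /kaggle to the model card URL.
def notebook_urls(repo_id: str) -> dict[str, str]:
    base = f"https://huggingface.co/{repo_id}"
    return {"colab": f"{base}/colab", "kaggle": f"{base}/kaggle"}

print(notebook_urls("google/gemma-3-4b-it")["colab"])
# https://huggingface.co/google/gemma-3-4b-it/colab
```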
![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/hf-google-colab/genstruct-notebook-dark.png)

## Rendering .ipynb Jupyter notebooks on the Hub

Under the hood, Jupyter Notebook files (usually shared with a `.ipynb` extension) are JSON files. While viewing these files directly is possible, it's not a format intended to be read by humans. The Hub has rendering support for notebooks hosted on the Hub. This means that notebooks are displayed in a human-readable format.

![Before and after notebook rendering](https://huggingface.co/blog/assets/135_notebooks-hub/before_after_notebook_rendering.png)

Notebooks will be rendered when included in any type of repository on the Hub. This includes models, datasets, and Spaces.

### Launch in Google Colab

[Google Colab](https://colab.google/) is a free Jupyter Notebook environment that requires no setup and runs entirely in the cloud. It's a great way to run Jupyter Notebooks without having to install anything on your local machine. All `.ipynb` files hosted on the Hub are automatically given an "Open in Colab" button. This allows you to open the notebook in Colab with a single click.

### Pull requests and Discussions

https://huggingface.co/docs/hub/repositories-pull-requests-discussions.md

# Pull requests and Discussions

Hub Pull requests and Discussions allow users to make community contributions to repositories. Pull requests and discussions work the same way for all repo types.

At a high level, the aim is to build a simpler version of other git hosts' (like GitHub's) PRs and Issues:

- no forks are involved: contributors push to a special `ref` branch directly on the source repo.
- there's no hard distinction between discussions and PRs: they are essentially the same, so they are displayed in the same lists.
- they are streamlined for ML (i.e. models/datasets/spaces repos), not arbitrary repos.
_Note: Pull requests and discussions can be enabled or disabled from the [repository settings](./repositories-settings#disabling-discussions--pull-requests)._

## List

By going to the community tab in any repository, you can see all Discussions and Pull requests. You can also filter to only see the ones that are open.

## View

The Discussion page allows you to see the comments from different users. If it's a Pull Request, you can see all the changes by going to the Files changed tab.

## Editing a Discussion / Pull request title

If you opened a PR or discussion, are the author of the repository, or have write access to it, you can edit the discussion title by clicking on the pencil button.

## Pin a Discussion / Pull Request

If you have write access to a repository, you can pin discussions and Pull Requests. Pinned discussions appear at the top of all the discussions.

## Lock a Discussion / Pull Request

If you have write access to a repository, you can lock discussions or Pull Requests. Once a discussion is locked, previous comments are still visible and users won't be able to add new comments.

## Comment editing and moderation

If you wrote a comment or have write access to the repository, you can edit the content of the comment from the contextual menu in the top-right corner of the comment box. Once the comment has been edited, a new link will appear above the comment. This link shows the edit history.

You can also hide a comment. Hiding a comment is irreversible: nobody will be able to see its content or edit it anymore. Read also [moderation](./moderation) to see how to report an abusive comment.

## Can I use Markdown and LaTeX in my comments and discussions?

Yes! You can use Markdown to add formatting to your comments. Additionally, you can use LaTeX for mathematical typesetting: your formulas will be rendered with [KaTeX](https://katex.org/) before being parsed in Markdown. For LaTeX equations, you have to use the following delimiters:

- `$$ ... $$` for display mode
- `\\(...\\)` for inline mode (no space between the slashes and the parenthesis).

## How do I manage Pull requests locally?

Let's assume your PR number is 42.

```bash
git fetch origin refs/pr/42:pr/42
git checkout pr/42
# Do your changes
git add .
git commit -m "Add your change"
git push origin pr/42:refs/pr/42
```

### Draft mode

Draft mode is the default status when opening a new Pull request from scratch in "Advanced mode". With this status, other contributors know that your Pull request is a work in progress and it cannot be merged. When your branch is ready, just hit the "Publish" button to change the status of the Pull request to "Open". Note that once published you cannot go back to draft mode.

## Deleting a Pull request ref

When a Pull request is closed or merged, you can delete its associated git ref (the branch storing the PR's commits) to free up storage space. After closing or merging a PR, you'll see a notice at the bottom of the discussion showing the estimated storage that could be freed by deleting the ref. Click the "Delete ref" button to permanently remove the PR's git ref and reclaim the storage.

> [!TIP]
> This is especially useful when the main branch has been squashed and files removed later on. Those files remain in the PR branch history even if they weren't added by the PR itself, taking up storage that could be freed.

> [!WARNING]
> Deleting a PR ref is irreversible. Once deleted, you won't be able to fetch or checkout the PR's commits locally anymore.

## Pull requests advanced usage

### Where in the git repo are changes stored?

Our Pull requests do not use forks and branches, but instead custom "branches" called `refs` that are stored directly on the source repo. [Git References](https://git-scm.com/book/en/v2/Git-Internals-Git-References) are the internal machinery of git which already stores tags and branches.
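Because PR refs live directly on the source repo, you can list them without cloning, for instance with `git ls-remote`. The sketch below demonstrates this on a throwaway local repo; with a real Hub repo you would pass its URL instead of `$remote`:

```shell
# Demo on a scratch repo: push one commit to a PR ref, then list all PR refs.
set -e
tmp=$(mktemp -d)
remote="$tmp/remote.git"
git init -q --bare "$remote"
git init -q "$tmp/work" && cd "$tmp/work"
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "init"
git push -q "$remote" HEAD:refs/pr/1

# Each output line is: <commit-sha>  refs/pr/<number>
git ls-remote "$remote" "refs/pr/*"
```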
The advantage of using custom refs (like `refs/pr/42`, for instance) instead of branches is that they're not fetched by default when people (including the repo "owner") clone the repo, but they can still be fetched on demand.

### Fetching all Pull requests: for git magicians 🧙‍♀️

You can tweak your local **refspec** to fetch all Pull requests:

1. Fetch:

```bash
git fetch origin refs/pr/*:refs/remotes/origin/pr/*
```

2. Create a local branch tracking the ref:

```bash
git checkout pr/{PR_NUMBER}
# for example: git checkout pr/42
```

3. If you make local changes, push to the PR ref:

```bash
git push origin pr/{PR_NUMBER}:refs/pr/{PR_NUMBER}
# for example: git push origin pr/42:refs/pr/42
```

### JupyterLab on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-jupyter.md

# JupyterLab on Spaces

[JupyterLab](https://jupyter.org/) is a web-based interactive development environment for Jupyter notebooks, code, and data. It is a great tool for data science and machine learning, and it is widely used by the community. With Hugging Face Spaces, you can deploy your own JupyterLab instance and use it for development directly from the Hugging Face website.

## ⚡️ Deploy a JupyterLab instance on Spaces

You can deploy JupyterLab on Spaces with just a few clicks. First, go to [this link](https://huggingface.co/new-space?template=SpacesExamples/jupyterlab) or click the button below:

Creating the Space requires you to define:

* An **Owner**: either your personal account or an organization you're a part of.
* A **Space name**: the name of the Space within the account you're creating it in.
* The **Visibility**: _private_ if you want the Space to be visible only to you or your organization, or _public_ if you want it to be visible to other users.
* The **Hardware**: the hardware you want to use for your JupyterLab instance. This ranges from CPUs to H100s.
* You can optionally configure a `JUPYTER_TOKEN` password to protect your JupyterLab workspace.
When unspecified, it defaults to `huggingface`. We strongly recommend setting this up if your Space is public or if the Space is in an organization.

Storage in Hugging Face Spaces is ephemeral, and the data you store in the default configuration can be lost in a reboot or reset of the Space. We recommend saving your work to a remote location or attaching a [Storage Bucket](https://huggingface.co/docs/hub/storage-buckets) to your Space for persistent data.

## Read more

- [HF Docker Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker)

If you have any feedback or change requests, please don't hesitate to reach out to the owners on the [Feedback Discussion](https://huggingface.co/spaces/SpacesExamples/jupyterlab/discussions/3).

## Acknowledgments

This template was created by [camenduru](https://twitter.com/camenduru) and [nateraw](https://huggingface.co/nateraw), with contributions from [osanseviero](https://huggingface.co/osanseviero) and [azzr](https://huggingface.co/azzr).

### Perform vector similarity search

https://huggingface.co/docs/hub/datasets-duckdb-vector-similarity-search.md

# Perform vector similarity search

The Fixed-Length Arrays feature was added in DuckDB version 0.10.0. This lets you use vector embeddings in DuckDB tables, making your data analysis even more powerful.

Additionally, the `array_cosine_similarity` function was introduced. This function measures the cosine of the angle between two vectors, indicating their similarity: a value of 1 means they're perfectly aligned, 0 means they're perpendicular, and -1 means they're completely opposite.

Let's explore how to use this function for similarity searches. In this section, we'll show you how to perform similarity searches using DuckDB. We will use the [asoria/awesome-chatgpt-prompts-embeddings](https://huggingface.co/datasets/asoria/awesome-chatgpt-prompts-embeddings) dataset.
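As a quick sanity check of what `array_cosine_similarity` computes, here is the same measure in plain Python (an illustrative sketch, not DuckDB code):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1 = aligned, 0 = perpendicular, -1 = opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0
```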
First, let's preview a few records from the dataset:

```bash
FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet'
SELECT act, prompt, len(embedding) as embed_len
LIMIT 3;

┌──────────────────────┬────────────────────────────────────────────────────────────────────────────┬───────────┐
│         act          │                                   prompt                                   │ embed_len │
│       varchar        │                                  varchar                                   │   int64   │
├──────────────────────┼────────────────────────────────────────────────────────────────────────────┼───────────┤
│ Linux Terminal       │ I want you to act as a linux terminal. I will type commands and you will… │       384 │
│ English Translator…  │ I want you to act as an English translator, spelling corrector and improv… │       384 │
│ `position` Intervi…  │ I want you to act as an interviewer. I will be the candidate and you will… │       384 │
└──────────────────────┴────────────────────────────────────────────────────────────────────────────┴───────────┘
```

Next, let's choose an embedding to use for the similarity search:

```bash
FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet'
SELECT embedding
WHERE act = 'Linux Terminal';

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                                                   embedding                                                                                                    │
│                                                                                                    float[]                                                                                                     │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ [-0.020781303, -0.029143505, -0.0660217, -0.00932716, -0.02601602, -0.011426172, 0.06627567, 0.11941507, 0.0013917526, 0.012889079, 0.053234346, -0.07380514, 0.04871567, -0.043601237, -0.0025319182, 0.0448… │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

Now, let's use the selected embedding to find similar records:

```bash
SELECT act, prompt, array_cosine_similarity(embedding::float[384],
    (SELECT embedding FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' WHERE act = 'Linux Terminal')::float[384]) AS similarity
FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet'
ORDER BY similarity DESC
LIMIT 3;
┌──────────────────────┬────────────────────────────────────────────────────────────────────────────┬────────────┐
│         act          │                                   prompt                                   │ similarity │
│       varchar        │                                  varchar                                   │   float    │
├──────────────────────┼────────────────────────────────────────────────────────────────────────────┼────────────┤
│ Linux Terminal       │ I want you to act as a linux terminal. I will type commands and you will… │        1.0 │
│ JavaScript Console   │ I want you to act as a javascript console. I will type commands and you w… │  0.7599728 │
│ R programming Inte…  │ I want you to act as a R interpreter. I'll type commands and you'll reply… │  0.7303775 │
└──────────────────────┴────────────────────────────────────────────────────────────────────────────┴────────────┘
```

That's it! You have successfully performed a vector similarity search using DuckDB.

### Blog Articles for Organizations

https://huggingface.co/docs/hub/enterprise-blog-articles.md

# Blog Articles for Organizations

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Blog Articles allow Team and Enterprise organizations to publish long-form content directly under your organization profile, enabling you to share model releases, research updates, and announcements with the broader community.

## Publishing as an Organization

When creating a new article at [huggingface.co/new-blog](https://huggingface.co/new-blog), select your organization from the dropdown to publish as the organization rather than as an individual. Once published, the article will appear on your organization's profile page.

## Permissions

To publish blog articles under an organization namespace, members need a `write` or `admin` role at the organization level. See [Access Control in Organizations](./organizations-security) for more details on roles.

> [!NOTE]
> Blog article permissions are currently tied to organization-level roles and cannot be scoped using [Resource Groups](./security-resource-groups).
> Resource Groups only control access to repositories (models, datasets, and Spaces), not blog articles.

### Uploading models

https://huggingface.co/docs/hub/models-uploading.md

# Uploading models

To upload models to the Hub, you'll need to create an account at [Hugging Face](https://huggingface.co/join). Models on the Hub are [Git-based repositories](./repositories), which give you versioning, branches, discoverability and sharing features, integration with dozens of libraries, and more! You have control over what you want to upload to your repository, which could include checkpoints, configs, and any other files.

You can link repositories with an individual user, such as [osanseviero/fashion_brands_patterns](https://huggingface.co/osanseviero/fashion_brands_patterns), or with an organization, such as [facebook/bart-large-xsum](https://huggingface.co/facebook/bart-large-xsum). Organizations can collect models related to a company, community, or library! If you choose an organization, the model will be featured on the organization's page, and every member of the organization will have the ability to contribute to the repository. You can create a new organization [here](https://huggingface.co/organizations/new).

> **_NOTE:_** Models do NOT need to be compatible with the Transformers/Diffusers libraries to get download metrics. Any custom model is supported. Read more below!

There are several ways to upload models so that they are nicely integrated into the Hub and get [download metrics](models-download-stats), described below.

- In case your model is designed for a library that has [built-in support](#upload-from-a-library-with-built-in-support), you can use the methods provided by the library. Custom models that use `trust_remote_code=True` can also leverage these methods.
- In case your model is a custom PyTorch model, you can leverage the [`PyTorchModelHubMixin` class](#upload-a-pytorch-model-using-huggingfacehub), which adds `from_pretrained` and `push_to_hub` to any `nn.Module` class, just like models in the Transformers, Diffusers and Timm libraries.
- In addition to programmatic uploads, you can always use the [web interface](#using-the-web-interface) or [the git command line](#using-git).

Once your model is uploaded, we suggest adding a [Model Card](./model-cards) to your repo to document your model and make it more discoverable.

Example [repository](https://huggingface.co/LiheYoung/depth_anything_vitl14) that leverages [PyTorchModelHubMixin](#upload-a-pytorch-model-using-huggingfacehub). Downloads are shown on the right.

## Using the web interface

To create a brand new model repository, visit [huggingface.co/new](http://huggingface.co/new). Then follow these steps:

1. In the "Files and versions" tab, select "Add File" and specify "Upload File":

2. From there, select a file from your computer to upload and leave a helpful commit message so that it's clear what you are uploading:

3. Afterwards, click **Commit changes** to upload your model to the Hub!

4. Inspect files and history

   You can check your repository with all the recently added files! The UI allows you to explore the model files and commits and to see the diff introduced by each commit:

5. Add metadata

   You can add metadata to your model card. You can specify:
   * the type of task this model is for, enabling widgets and Inference Providers.
   * the used library (`transformers`, `spaCy`, etc.)
   * the language
   * the dataset
   * metrics
   * license
   * a lot more!

   Read more about model tags [here](./model-cards#model-card-metadata).

6. Add TensorBoard traces

   Any repository that contains TensorBoard traces (filenames that contain `tfevents`) is categorized with the [`TensorBoard` tag](https://huggingface.co/models?filter=tensorboard).
As a convention, we suggest that you save traces under the `runs/` subfolder. The "Training metrics" tab then makes it easy to review charts of the logged variables, like the loss or the accuracy.

Models trained with 🤗 Transformers will generate [TensorBoard traces](https://huggingface.co/docs/transformers/main_classes/callback#transformers.integrations.TensorBoardCallback) by default if [`tensorboard`](https://pypi.org/project/tensorboard/) is installed.

## Upload from a library with built-in support

First check if your model is from a library that has built-in support to push to/load from the Hub, like Transformers, Diffusers, Timm, Asteroid, etc.: https://huggingface.co/docs/hub/models-libraries. Below we'll show how easy this is for a library like Transformers:

```python
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)
model.push_to_hub("nielsr/my-awesome-bert-model")

# reload
model = BertModel.from_pretrained("nielsr/my-awesome-bert-model")
```

Some libraries, like Transformers, support loading [code from the Hub](https://huggingface.co/docs/transformers/custom_models). This is a way to make your model work with Transformers using the `trust_remote_code=True` flag. You may want to consider this option instead of a full-fledged library integration.

## Upload a PyTorch model using huggingface_hub

In case your model is a (custom) PyTorch model, you can leverage the `PyTorchModelHubMixin` [class](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) available in the [huggingface_hub](https://github.com/huggingface/huggingface_hub) Python library. It is a minimal class which adds `from_pretrained` and `push_to_hub` capabilities to any `nn.Module`, along with download metrics.
Here is how to use it (assuming you have run `pip install huggingface_hub`):

```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin


class MyModel(
    nn.Module,
    PyTorchModelHubMixin,
    # optionally, you can add metadata which gets pushed to the model card
    repo_url="your-repo-url",
    pipeline_tag="text-to-image",
    license="mit",
):
    def __init__(self, num_channels: int, hidden_size: int, num_classes: int):
        super().__init__()
        self.param = nn.Parameter(torch.rand(num_channels, hidden_size))
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        return self.linear(x + self.param)


# create model
config = {"num_channels": 3, "hidden_size": 32, "num_classes": 10}
model = MyModel(**config)

# save locally
model.save_pretrained("my-awesome-model")

# push to the hub
model.push_to_hub("your-hf-username/my-awesome-model")

# reload
model = MyModel.from_pretrained("your-hf-username/my-awesome-model")
```

As you can see, the only requirement is that your model inherits from `PyTorchModelHubMixin`. All instance attributes will be automatically serialized to a `config.json` file. Note that the `__init__` method can only take arguments which are JSON serializable. Python dataclasses are supported.

This comes with automated download metrics, meaning that you'll be able to see how many times the model is downloaded, the same way they are available for models integrated natively in the Transformers, Diffusers or Timm libraries. With this mixin class, each separate checkpoint is stored on the Hub in a single repository consisting of 2 files:

- a `pytorch_model.bin` or `model.safetensors` file containing the weights
- a `config.json` file which is a serialized version of the model configuration.

This class is used for counting download metrics: every time a user calls `from_pretrained` to load a `config.json`, the count goes up by one. See [this guide](https://huggingface.co/docs/hub/models-download-stats) regarding automated download metrics.
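Conceptually, the `config.json` round-trip is just JSON serialization of the `__init__` kwargs. A plain-Python sketch of the idea (not the library's actual code):

```python
import json

# The kwargs passed to the model's __init__ ...
config = {"num_channels": 3, "hidden_size": 32, "num_classes": 10}

# ... are serialized to config.json on save/push,
# and read back to re-instantiate the model on from_pretrained.
serialized = json.dumps(config)
restored = json.loads(serialized)
print(restored == config)  # True
```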
It's recommended to add a model card to each checkpoint so that people can read what the model is about, have a link to the paper, etc. Visit [the huggingface_hub's documentation](https://huggingface.co/docs/huggingface_hub/guides/integrations) to learn more.

Alternatively, one can also simply programmatically upload files or folders to the hub: https://huggingface.co/docs/huggingface_hub/guides/upload.

## Using Git

Finally, since model repos are just Git repositories, you can also use Git to push your model files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started#terminal) to learn about using the `git` CLI to commit and push your models.

### Appendix

https://huggingface.co/docs/hub/model-card-appendix.md

# Appendix

## Appendix A: User Study

_Full text responses to key questions_

### How would you define model cards?

***Insight: Respondents had generally similar views of what model cards are: documentation focused on issues like training, use cases, and bias/limitations***

* Model cards are model descriptions, both of how they were trained, their use cases, and potential biases and limitations
* Documents describing the essential features of a model in order for the reader/user to understand the artefact he/she has in front, the background/training, how it can be used, and its technical/ethical limitations.
* They serve as a living artefact of models to document them. Model cards contain information that go from a high level description of what the specific model can be used to, to limitations, biases, metrics, and much more. They are used primarily to understand what the model does.
* Model cards are to models what GitHub READMEs are to GitHub projects. It tells people all the information they need to know about the model. If you don't write one, nobody will use your model.
* From what I understand, a model card uses certain benchmarks (geography, culture, sex, etc) to define both a model's usability and limitations.
It's essentially a model's 'nutrition facts label' that can show how a model was created and educates others on its reusability.
* Model cards are the metadata and documentation about the model, everything I need to know to use the model properly: info about the model, what paper introduced it, what dataset was it trained on or fine-tuned on, whom does it belong to, are there known risks and limitations with this model, any useful technical info.
* IMO model cards are a brief presentation of a model which includes:
  * short summary of the architectural particularities of the model
  * describing the data it was trained on
  * what is the performance on reference datasets (accuracy and speed metrics if possible)
  * limitations
  * how to use it in the context of the Transformers library
  * source (original article, Github repo, ...)
* Easily accessible documentation that any background can read and learn about critical model components and social impact

### What do you like about model cards?

* They are interesting to teach people about new models
* As a non-technical guy, the possibility of getting to know the model, to understand the basics of it, it's an opportunity for the author to disclose its innovation in a transparent & explainable (i.e. trustworthy) way.
* I like interactive model cards with visuals and widgets that allow me to try the model without running any code.
* What I like about good model cards is that you can find all the information you need about that particular model.
* Model cards are revolutionary to the world of AI ethics. It's one of the first tangible steps in mitigating/educating on biases in machine learning. They foster greater awareness and accountability!
* Structured, exhaustive, the more info the better.
* It helps to get an understanding of what the model is good (or bad) at.
* Conciseness and accessibility

### What do you dislike about model cards?
* Might get to technical and/or dense
* They contain lots of information for different audiences (researchers, engineers, non engineers), so it's difficult to explore model cards with an intended use cases.
  * [NOTE: this comment could be addressed with toggle views for different audiences]
* Good ones are time consuming to create. They are hard to test to make sure the information is up to date. Often times, model cards are formatted completely differently - so you have to sort of figure out how that certain individual has structured theirs.
  * [NOTE: this comment helps demonstrate the value of a standardized format and automation tools to make it easier to create model cards]
* Without the help of the community to pitch in supplemental evals, model cards might be subject to inherent biases that the developer might not be aware of. It's early days for them, but without more thorough evaluations, a model card's information might be too limited.
* Empty model cards. No license information - customers need that info and generally don't have it.
* They are usually either too concise or too verbose.
* writing them lol bless you

### Other key new insights

* Model cards are best filled out when done by people with different roles: Technical specifications can generally only be filled out by the developers; ethical considerations throughout are generally best informed by people who tend to work on ethical issues.
* Model users care a lot about licences -- specifically, whether a model can legally be used for a specific task.
## Appendix B: Landscape Analysis

_Overview of the state of model documentation in Machine Learning_

### MODEL CARD EXAMPLES

Examples of model cards and closely-related variants include:

* Google Cloud: [Face Detection](https://modelcards.withgoogle.com/face-detection), [Object Detection](https://modelcards.withgoogle.com/object-detection)
* Google Research: [ML Kit Vision Models](https://developers.google.com/s/results/ml-kit?q=%22Model%20Card%22), [Face Detection](https://sites.google.com/view/perception-cv4arvr/blazeface), [Conversation AI](https://github.com/conversationai/perspectiveapi/tree/main/model-cards)
* OpenAI: [GPT-3](https://github.com/openai/gpt-3/blob/master/model-card.md), [GPT-2](https://github.com/openai/gpt-2/blob/master/model_card.md), [DALL-E dVAE](https://github.com/openai/DALL-E/blob/master/model_card.md), [CLIP](https://github.com/openai/CLIP-featurevis/blob/master/model-card.md)
* [NVIDIA Model Cards](https://catalog.ngc.nvidia.com/models?filters=&orderBy=weightPopularASC&query=)
* [Salesforce Model Cards](https://blog.salesforceairesearch.com/model-cards-for-ai-model-transparency/)
* [Allen AI Model Cards](https://github.com/allenai/allennlp-models/tree/main/allennlp_models/modelcards)
* [Co:here AI Model Cards](https://docs.cohere.ai/responsible-use/)
* [Duke PULSE Model Card](https://arxiv.org/pdf/2003.03808.pdf)
* [Stanford Dynasent](https://github.com/cgpotts/dynasent/blob/main/dynasent_modelcard.md)
* [GEM Model Cards](https://gem-benchmark.com/model_cards)
* Parl.AI: [Parl.AI sample model cards](https://github.com/facebookresearch/ParlAI/tree/main/docs/sample_model_cards), [BlenderBot 2.0 2.7B](https://github.com/facebookresearch/ParlAI/blob/main/parlai/zoo/blenderbot2/model_card.md)
* [Perspective API Model Cards](https://github.com/conversationai/perspectiveapi/tree/main/model-cards)
* See https://github.com/ivylee/model-cards-and-datasheets for more examples!
### MODEL CARDS FOR LARGE LANGUAGE MODELS

Large language models are often released with associated documentation. Large language models that have an associated model card (or related documentation tool) include:

* [Big Science BLOOM model card](https://huggingface.co/bigscience/bloom)
* [GPT-2 Model Card](https://github.com/openai/gpt-2/blob/master/model_card.md)
* [GPT-3 Model Card](https://github.com/openai/gpt-3/blob/master/model-card.md)
* [DALL-E 2 Preview System Card](https://github.com/openai/dalle-2-preview/blob/main/system-card.md)
* [OPT-175B model card](https://arxiv.org/pdf/2205.01068.pdf)

### MODEL CARD GENERATION TOOLS

Tools for programmatically or interactively generating model cards include:

* [Salesforce Model Card Creation](https://help.salesforce.com/s/articleView?id=release-notes.rn_bi_edd_model_card.htm&type=5&release=232)
* [TensorFlow Model Card Toolkit](https://ai.googleblog.com/2020/07/introducing-model-card-toolkit-for.html)
  * [Python library](https://pypi.org/project/model-card-toolkit/)
* [GSA / US Census Bureau Collaboration on Model Card Generator](https://bias.xd.gov/resources/model-card-generator/)
* [Parl.AI Auto Generation Tool](https://parl.ai/docs/tutorial_model_cards.html)
* [VerifyML Model Card Generation Web Tool](https://www.verifyml.com)
* [RMarkdown Template for Model Card as part of vetiver package](https://cran.r-project.org/web/packages/vetiver/vignettes/model-card.html)
* [Databaseline ML Cards toolkit](https://databaseline.tech/ml-cards/)

### MODEL CARD EDUCATIONAL TOOLS

Tools for understanding model cards and understanding how to create model cards include:

* [Hugging Face Hub docs](https://huggingface.co/course/chapter4/4?fw=pt)
* [Perspective API](https://developers.perspectiveapi.com/s/about-the-api-model-cards)
* [Kaggle](https://www.kaggle.com/code/var0101/model-cards/tutorial)
* [Code.org](https://studio.code.org/s/aiml-2021/lessons/8)
* [UNICEF](https://unicef.github.io/inventory/data/model-card/)

---

**Please
cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook

### Quickstart

https://huggingface.co/docs/hub/jobs-quickstart.md

# Quickstart

In this guide you will run a Job to fine-tune an open source model on Hugging Face infrastructure in only a few minutes. Make sure you are logged in to Hugging Face and have access to your [Jobs page](https://huggingface.co/settings/jobs). Jobs are available to any user or organization with [pre-paid credits](https://huggingface.co/pricing).

## Getting started

First install the Hugging Face CLI:

### 1. Install the CLI

Recommended approach:

```bash
>>> curl -LsSf https://hf.co/cli/install.sh | bash
```

Or using Homebrew:

```bash
>>> brew install hf
```

Or using uv:

```bash
>>> uv tool install hf
```

### 2. Login to your Hugging Face account

```bash
>>> hf auth login
```

### 3. Create your first jobs using the `hf jobs` command

Run a UV command or script:

```bash
>>> hf jobs uv run python -c 'print("Hello from the cloud!")'
Job started with ID: 693aef401a39f67af5a41c0e
View at: https://huggingface.co/jobs/lhoestq/693aef401a39f67af5a41c0e
Hello from the cloud!
```

```bash
>>> echo "print('Hello from uv script!')" > script.py
>>> hf jobs uv run script.py
Job started with ID: 695f6cd8d2f3efac77e8cf7f
View at: https://huggingface.co/jobs/lhoestq/695f6cd8d2f3efac77e8cf7f
Hello from uv script!
```

Run a Docker command:

```bash
>>> hf jobs run ubuntu echo 'Hello from the cloud!'
Job started with ID: 693aee76c67c9f186cfe233e
View at: https://huggingface.co/jobs/lhoestq/693aee76c67c9f186cfe233e
Hello from the cloud!
```

### 4. Check your first jobs

The job logs appear in your terminal, but you can also see them in your jobs page.
Open the job page to see the job information, status and logs:

## The training script

Here is a simple training script to fine-tune a base model into a conversational model using Supervised Fine-Tuning (SFT). It uses the [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) model, the [trl-lib/Capybara](https://huggingface.co/datasets/trl-lib/Capybara) dataset and the [TRL](https://huggingface.co/docs/trl/en/index) library, and saves the resulting model to your Hugging Face account under the name `"Qwen2.5-0.5B-SFT"`:

```python
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()
trainer.push_to_hub("Qwen2.5-0.5B-SFT")
```

Save this script as `train.py`, and we can now run it with UV on Hugging Face Jobs.

## Run the training job

`hf jobs` takes several arguments: select the hardware with `--flavor`, choose a maximum duration with `--timeout`, and pass environment variables with `--env` and `--secrets`. Here we use the A100 Large GPU flavor with `--flavor a100-large` and pass your Hugging Face token as a secret with `--secrets HF_TOKEN` in order to be able to push the resulting model to your account. Moreover, UV accepts the `--with` argument to define Python dependencies, so we use `--with trl` to have the `trl` library available. You can now run the final command, which looks like this:

```bash
hf jobs uv run \
    --flavor a100-large \
    --timeout 6h \
    --with trl \
    --secrets HF_TOKEN \
    train.py
```

The logs appear in your terminal, and you can safely Ctrl+C to stop streaming the logs; the job will keep running.

```
...
Downloaded nvidia-cudnn-cu12
Downloaded torch
Installed 66 packages in 233ms
Generating train split: 100%|██████████| 15806/15806 [00:00<00:00, 76686.50 examples/s]
Generating test split: 100%|██████████| 200/200 [00:00<00:00, 43880.36 examples/s]
Tokenizing train dataset: 100%|██████████| 15806/15806 [00:41<00:00, 384.97 examples/s]
Truncating train dataset: 100%|██████████| 15806/15806 [00:00<00:00, 212272.92 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
{'loss': 1.7357, 'grad_norm': 4.8733229637146, 'learning_rate': 1.9969635627530365e-05, 'entropy': 1.7238958358764649, 'num_tokens': 59528.0, 'mean_token_accuracy': 0.6124177813529968, 'epoch': 0.01}
{'loss': 1.6239, 'grad_norm': 6.200186729431152, 'learning_rate': 1.9935897435897437e-05, 'entropy': 1.644005584716797, 'num_tokens': 115219.0, 'mean_token_accuracy': 0.6259662985801697, 'epoch': 0.01}
{'loss': 1.4449, 'grad_norm': 6.167325496673584, 'learning_rate': 1.990215924426451e-05, 'entropy': 1.5156117916107177, 'num_tokens': 171787.0, 'mean_token_accuracy': 0.6586395859718323, 'epoch': 0.02}
{'loss': 1.6023, 'grad_norm': 5.133708953857422, 'learning_rate': 1.986842105263158e-05, 'entropy': 1.6885507702827454, 'num_tokens': 226067.0, 'mean_token_accuracy': 0.6271904468536377, 'epoch': 0.02}
```

Follow the job's progress on its page on Hugging Face:

Monitor GPU usage and other metrics in the CLI or use the [MacOS menu bar](./jobs-manage#macos-menu-bar).
Here is what you get with the CLI:

```bash
>>> hf jobs stats
JOB ID                   CPU %  NUM CPU  MEM %  MEM USAGE         NET I/O          GPU UTIL %  GPU MEM %  GPU MEM USAGE
------------------------ -----  -------  -----  ----------------  ---------------  ----------  ---------  ---------------
695e83c5d2f3efac77e8cf18 8%     12.0     7.18%  10.9GB / 152.5GB  0.0bps / 0.0bps  100%        31.92%     25.9GB / 81.2GB
```

Once the job is done, find your model on your account:

Congrats! You just ran your first Job to fine-tune an open source model 🔥

Feel free to try out your model locally and evaluate it using e.g. [transformers](https://huggingface.co/docs/transformers) by clicking on "Use this model", or deploy it to [Inference Endpoints](https://huggingface.co/docs/inference-endpoints) in one click using the "Deploy" button.

### Image Dataset

https://huggingface.co/docs/hub/datasets-image.md

# Image Dataset

This guide will show you how to configure your dataset repository with image files. You can find accompanying examples of repositories in this [Image datasets examples collection](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65).

A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub.

Additional information about your images - such as captions or bounding boxes for object detection - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`/`metadata.parquet`).

Alternatively, images can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format.
## Only images

If your dataset only consists of one column with images, you can simply store your image files at the root:

```
my_dataset_repository/
├── 1.jpg
├── 2.jpg
├── 3.jpg
└── 4.jpg
```

or in a subdirectory:

```
my_dataset_repository/
└── images
    ├── 1.jpg
    ├── 2.jpg
    ├── 3.jpg
    └── 4.jpg
```

Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including PNG, JPEG, TIFF and WebP.

```
my_dataset_repository/
└── images
    ├── 1.jpg
    ├── 2.png
    ├── 3.tiff
    └── 4.webp
```

If you have several splits, you can put your images into directories named accordingly:

```
my_dataset_repository/
├── train
│   ├── 1.jpg
│   └── 2.jpg
└── test
    ├── 3.jpg
    └── 4.jpg
```

See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits.

## Additional columns

If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like [text captioning](https://huggingface.co/tasks/image-to-text) or [object detection](https://huggingface.co/tasks/object-detection).
```
my_dataset_repository/
└── train
    ├── 1.jpg
    ├── 2.jpg
    ├── 3.jpg
    ├── 4.jpg
    └── metadata.csv
```

Your `metadata.csv` file must have a `file_name` column which links image files with their metadata:

```csv
file_name,text
1.jpg,a drawing of a green pokemon with red eyes
2.jpg,a green and yellow toy with a red nose
3.jpg,a red and white ball with an angry look on its face
4.jpg,a cartoon ball with a smile on its face
```

You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`:

```jsonl
{"file_name": "1.jpg","text": "a drawing of a green pokemon with red eyes"}
{"file_name": "2.jpg","text": "a green and yellow toy with a red nose"}
{"file_name": "3.jpg","text": "a red and white ball with an angry look on its face"}
{"file_name": "4.jpg","text": "a cartoon ball with a smile on its face"}
```

And for bigger datasets, or if you are interested in advanced data retrieval features, you can use a [Parquet](https://parquet.apache.org/) file `metadata.parquet`.

## Relative paths

The metadata file must be located either in the same directory as the images it is linked to, or in any parent directory, like in this example:

```
my_dataset_repository/
└── train
    ├── images
    │   ├── 1.jpg
    │   ├── 2.jpg
    │   ├── 3.jpg
    │   └── 4.jpg
    └── metadata.csv
```

In this case, the `file_name` column must be a full relative path to the images, not just the filename:

```csv
file_name,text
images/1.jpg,a drawing of a green pokemon with red eyes
images/2.jpg,a green and yellow toy with a red nose
images/3.jpg,a red and white ball with an angry look on its face
images/4.jpg,a cartoon ball with a smile on its face
```

Metadata files cannot be put in subdirectories of a directory with the images. More generally, any column named `file_name` or `*_file_name` should contain the full relative path to the images.
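Writing a `metadata.csv` by hand gets tedious beyond a handful of images. As a sketch (this `write_metadata_csv` helper is hypothetical, not part of any Hugging Face library), here is how you could generate one with the standard library, using relative paths as required above:

```python
import csv
from pathlib import Path


def write_metadata_csv(root: str, captions: dict[str, str]) -> Path:
    """Hypothetical helper: pair each image under <root>/images/ with its
    caption and write a metadata.csv at <root>, with file_name values
    relative to the metadata file's directory."""
    root_path = Path(root)
    metadata_path = root_path / "metadata.csv"
    with open(metadata_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "text"])
        for image_path in sorted(root_path.glob("images/*.jpg")):
            # e.g. "images/1.jpg" — full relative path, not just the filename
            relative = image_path.relative_to(root_path).as_posix()
            writer.writerow([relative, captions[image_path.name]])
    return metadata_path


# Example layout: train/images/1.jpg, train/images/2.jpg
# write_metadata_csv("train", {"1.jpg": "a drawing...", "2.jpg": "a toy..."})
```

The same loop could emit `metadata.jsonl` instead by writing one JSON object per line.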
## Image classification

For image classification datasets, you can also use a simple setup: use directories to name the image classes. Store your image files in a directory structure like:

```
my_dataset_repository/
├── green
│   ├── 1.jpg
│   └── 2.jpg
└── red
    ├── 3.jpg
    └── 4.jpg
```

The dataset created with this structure contains two columns: `image` and `label` (with values `green` and `red`).

You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information):

```
my_dataset_repository/
├── test
│   ├── green
│   │   └── 2.jpg
│   └── red
│       └── 4.jpg
└── train
    ├── green
    │   └── 1.jpg
    └── red
        └── 3.jpg
```

You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header:

```yaml
configs:
  - config_name: default  # Name of the dataset subset, if applicable.
    drop_labels: true
```

## Large scale datasets

### WebDataset format

The [WebDataset](./datasets-webdataset) format is well suited for large scale image datasets (see [timm/imagenet-12k-wds](https://huggingface.co/datasets/timm/imagenet-12k-wds) for example). It consists of TAR archives containing images and their metadata and is optimized for streaming. It is useful if you have a large number of images and want streaming data loaders for large scale training.

```
my_dataset_repository/
├── train-0000.tar
├── train-0001.tar
├── ...
└── train-1023.tar
```

To make a WebDataset TAR archive, create a directory containing the images and metadata files to be archived and create the TAR archive using e.g. the `tar` command. The usual size per archive is generally around 1GB.
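As an alternative to the `tar` command, a shard can also be assembled with Python's standard library. This is a minimal sketch (the `make_shard` helper is illustrative, not part of any library; production pipelines often use the `webdataset` library instead):

```python
import tarfile
from pathlib import Path


def make_shard(source_dir: str, shard_path: str) -> None:
    """Illustrative helper: pack every file in source_dir into one
    uncompressed TAR shard, in sorted order so that files sharing a
    prefix (e.g. 000.jpg and 000.json) end up adjacent in the archive."""
    with tarfile.open(shard_path, "w") as tar:
        for file_path in sorted(Path(source_dir).iterdir()):
            # store flat names (000.jpg, 000.json) without the directory prefix
            tar.add(file_path, arcname=file_path.name)


# make_shard("train-0000", "train-0000.tar")
```

Keeping paired files adjacent matters because WebDataset readers stream the archive sequentially and group consecutive entries by their shared prefix.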
Make sure each image and metadata pair share the same file prefix, for example:

```
train-0000/
├── 000.jpg
├── 000.json
├── 001.jpg
├── 001.json
├── ...
├── 999.jpg
└── 999.json
```

Note that for user convenience and to enable the [Dataset Viewer](./data-studio), every dataset hosted on the Hub is automatically converted to Parquet format, up to 5GB. Read more about it in the [Parquet format](./data-studio#access-the-parquet-files) documentation.

### Parquet format

Instead of uploading the images and metadata as individual files, you can embed everything inside a [Parquet](https://parquet.apache.org/) file. This is useful if you have a large number of images, if you want to embed multiple image columns, or if you want to store additional information about the images in the same file. Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.

```
my_dataset_repository/
└── train.parquet
```

Parquet files with image data can be created using `pandas` or the `datasets` library. To create Parquet files with image data in `pandas`, you can use [pandas-image-methods](https://github.com/lhoestq/pandas-image-methods) and `df.to_parquet()`. In `datasets`, you can set the column type to `Image()` and use the `ds.to_parquet(...)` method or `ds.push_to_hub(...)`. You can find a guide on loading image datasets in `datasets` [here](/docs/datasets/image_load).

Alternatively, you can manually set the image type of Parquet files created using other tools. First, make sure your image columns are of type _struct_, with a binary field `"bytes"` for the image data and a string field `"path"` for the image file name or path.
Then you should specify the feature types of the columns directly in YAML in the README header, for example:

```yaml
dataset_info:
  features:
    - name: image
      dtype: image
    - name: caption
      dtype: string
```

Note that Parquet is recommended for small images (<1MB per image) and small row groups (100 rows per row group, which is what `datasets` uses for images). For larger images, it is recommended to use the WebDataset format, or to share the original image files (optionally with metadata files, and following the [repositories recommendations and limits](https://huggingface.co/docs/hub/en/storage-limits) for storage and number of files).

### Uploading datasets

https://huggingface.co/docs/hub/datasets-adding.md

# Uploading datasets

The [Hub](https://huggingface.co/datasets) is home to an extensive collection of community-curated and research datasets. We encourage you to share your dataset to the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away!

Start by [creating a Hugging Face Hub account](https://huggingface.co/join) if you don't have one yet.

## Upload using the Hub UI

The Hub's web-based interface allows users without any developer experience to upload a dataset.

### Create a repository

A repository hosts all your dataset files, including the revision history, making it possible to store more than one dataset version.

1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset).
2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization.

### Upload dataset

1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files.
We support many text, audio, image and other data extensions such as `.csv`, `.mp3`, and `.jpg` (see the full list of [File formats](#file-formats)).
2. Drag and drop your dataset files.
3. After uploading your dataset files, they are stored in your dataset repository.

### Create a Dataset card

Adding a Dataset card is super valuable for helping users find your dataset and understand how to use it responsibly.

1. Click on **Create Dataset Card** to create a [Dataset card](./datasets-cards). This button creates a `README.md` file in your repository.
2. At the top, you'll see the **Metadata UI** with several fields to select from, such as license, language, and task categories. These are the most important tags for helping users discover your dataset on the Hub (when applicable). When you select an option for a field, it will be automatically added to the top of the dataset card. You can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1), which has a complete set of allowed tags, including optional ones like `annotations_creators`, to help you choose the ones that are useful for your dataset.
3. Write your dataset documentation in the Dataset Card to introduce your dataset to the community and help users understand what is inside: what are the use cases and limitations, where the data comes from, what are important ethical considerations, and any other relevant details. You can click on the **Import dataset card template** link at the top of the editor to automatically create a dataset card template. For a detailed example of what a good Dataset card should look like, take a look at the [CNN DailyMail Dataset card](https://huggingface.co/datasets/cnn_dailymail).

## Using the `huggingface_hub` client library

The rich feature set of the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Hub.
Visit [the client library's documentation](/docs/huggingface_hub/index) to learn more.

## Using other libraries

Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Polars](https://pola.rs), [Dask](https://www.dask.org/), [DuckDB](https://duckdb.org/), or [Daft](https://daft.ai/) can upload files to the Hub. See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information.

## Using Git

Since dataset repos are Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.

## Ingest datasets

If you have data in databases, cloud storage or behind APIs, you can ingest them to Hugging Face as ready-to-use datasets. Find more information in the [documentation on ingesting datasets](./datasets-ingesting).

## File formats

The Hub natively supports multiple file formats:

- Parquet (.parquet)
- CSV (.csv, .tsv)
- JSON Lines, JSON (.jsonl, .json)
- Arrow streaming and IPC formats (.arrow)
- Text (.txt)
- Images (.png, .jpg, etc.)
- Audio (.wav, .mp3, etc.)
- Video (.mp4, .mov, .avi, etc.)
- PDF (.pdf)
- [WebDataset](./datasets-webdataset) (.tar)
- [Lance](./datasets-lance) (.lance)

It supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz).

Image and audio files can also have additional metadata files. See the [Data files Configuration](./datasets-data-files-configuration#image-and-audio-datasets) on image and audio datasets, as well as the collections of [example datasets](https://huggingface.co/datasets-examples) for CSV, TSV and images.

You may want to convert your files to these formats to benefit from all the Hub features. Other formats and structures may not be recognized by the Hub.

### Which file format should I use?
For most types of datasets, **Parquet** is the recommended format due to its efficient compression, rich typing, and because a variety of tools support it with optimized reads and batched operations. Alternatively, CSV or JSON Lines/JSON can be used for tabular data (prefer JSON Lines for nested data). Although easier to parse than Parquet, these formats are not recommended for data larger than several GBs.

For image and audio datasets, uploading raw files is the most practical for most use cases since it's easy to access individual files. For streaming large scale image and audio datasets, [WebDataset](https://github.com/webdataset/webdataset) should be preferred over raw image and audio files to avoid the overhead of accessing individual files. Though for more general use cases involving analytics, data filtering or metadata parsing, Parquet is the recommended option for large scale image and audio datasets.

### Data Studio

The [Data Studio](./data-studio) is useful for seeing what the data actually looks like before you download it. It is enabled by default for all public datasets. It is also available for private datasets owned by a [PRO user](https://huggingface.co/pricing) or a [Team or Enterprise organization](https://huggingface.co/enterprise).

After uploading your dataset, make sure the Dataset Viewer correctly shows your data, or [Configure the Dataset Viewer](./datasets-viewer-configure).

## Large scale datasets

The Hugging Face Hub supports large scale datasets, usually uploaded in Parquet (e.g. via `push_to_hub()` using [🤗 Datasets](/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.push_to_hub)) or [WebDataset](https://github.com/webdataset/webdataset) format. You can upload large scale datasets at high speed using the `huggingface_hub` library.
See [how to upload a folder by chunks](/docs/huggingface_hub/guides/upload#upload-a-folder-by-chunks), the [tips and tricks for large uploads](/docs/huggingface_hub/guides/upload#tips-and-tricks-for-large-uploads) and the [repository storage limits and recommendations](./storage-limits).

### DDUF

https://huggingface.co/docs/hub/dduf.md

# DDUF

## Overview

DDUF (**D**DUF's **D**iffusion **U**nified **F**ormat) is a single-file format for diffusion models that aims to unify the different model distribution methods and weight-saving formats by packaging all model components into a single file. It is language-agnostic and built to be parsable from a remote location without downloading the entire file. This work draws inspiration from the [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) format.

Check out the [DDUF](https://huggingface.co/DDUF) org to start using some of the most popular diffusion models in DDUF.

> [!TIP]
> We welcome contributions with open arms!
>
> To create a widely adopted file format, we need early feedback from the community. Nothing is set in stone, and we value everyone's input. Is your use case not covered? Please let us know in the DDUF organization [discussions](https://huggingface.co/spaces/DDUF/README/discussions/2).

Its key features include the following.

1. **Single file** packaging.
2. Based on the **ZIP file format** to leverage existing tooling.
3. No compression, ensuring **`mmap` compatibility** for fast loading and saving.
4. **Language-agnostic**: tooling can be implemented in Python, JavaScript, Rust, C++, etc.
5. **HTTP-friendly**: metadata and file structure can be fetched remotely using HTTP Range requests.
6. **Flexible**: each model component is stored in its own directory, following the current Diffusers structure.
7. **Safe**: uses [Safetensors](https://huggingface.co/docs/diffusers/using-diffusers/other-formats#safetensors) as a weight-saving format and prohibits nested directories to prevent ZIP bombs.
## Technical specifications

Technically, a `.dduf` file **is** a [`.zip` archive](https://en.wikipedia.org/wiki/ZIP_(file_format)). By building on a universally supported file format, we ensure robust tooling already exists. However, some constraints are enforced to meet diffusion models' requirements:

- Data must be stored uncompressed (flag `0`), allowing lazy loading using memory-mapping.
- Data must be stored using the ZIP64 protocol, enabling saving files above 4GB.
- The archive can only contain `.json`, `.safetensors`, `.model` and `.txt` files.
- A `model_index.json` file must be present at the root of the archive. It must contain a key-value mapping with metadata about the model and its components.
- Each component must be stored in its own directory (e.g., `vae/`, `text_encoder/`). Nested files must use UNIX-style path separators (`/`).
- Each directory must correspond to a component in the `model_index.json` index.
- Each directory must contain a JSON config file (one of `config.json`, `tokenizer_config.json`, `preprocessor_config.json`, `scheduler_config.json`).
- Sub-directories are forbidden.

Want to check if your file is valid? Check it out using this Space: https://huggingface.co/spaces/DDUF/dduf-check.

## Usage

The `huggingface_hub` library provides tooling to handle DDUF files in Python. It includes built-in rules to validate file integrity and helpers to read and export DDUF files. The goal is to see this tooling adopted in the Python ecosystem, such as in the `diffusers` integration. Similar tooling can be developed for other languages (JavaScript, Rust, C++, etc.).

### How to read a DDUF file?

Pass a path to `read_dduf_file` to read a DDUF file. Only the metadata is read, meaning this is a lightweight call that won't explode your memory. In the example below, we consider that you've already downloaded the [`FLUX.1-dev.dduf`](https://huggingface.co/DDUF/FLUX.1-dev-DDUF/blob/main/FLUX.1-dev.dduf) file locally.
```python
>>> from huggingface_hub import read_dduf_file

# Read DDUF metadata
>>> dduf_entries = read_dduf_file("FLUX.1-dev.dduf")
```

`read_dduf_file` returns a mapping where each entry corresponds to a file in the DDUF archive. A file is represented by a `DDUFEntry` dataclass that contains the filename, offset, and length of the entry in the original DDUF file. This information is useful to read its content without loading the whole file. In practice, you won't have to handle low-level reading but can rely on helpers instead.

For instance, here is how to load the `model_index.json` content:

```python
>>> import json
>>> json.loads(dduf_entries["model_index.json"].read_text())
{'_class_name': 'FluxPipeline', '_diffusers_version': '0.32.0.dev0', '_name_or_path': 'black-forest-labs/FLUX.1-dev', ...
```

For binary files, you'll want to access the raw bytes using `as_mmap`. This returns bytes as a memory-mapping on the original file. The memory-mapping allows you to read only the bytes you need without loading everything in memory. For instance, here is how to load safetensors weights:

```python
>>> import safetensors.torch
>>> with dduf_entries["vae/diffusion_pytorch_model.safetensors"].as_mmap() as mm:
...     state_dict = safetensors.torch.load(mm)  # `mm` is a bytes object
```

> [!TIP]
> `as_mmap` must be used in a context manager to benefit from the memory-mapping properties.

### How to write a DDUF file?

Pass a folder path to `export_folder_as_dduf` to export a DDUF file.

```python
# Export a folder as a DDUF file
>>> from huggingface_hub import export_folder_as_dduf
>>> export_folder_as_dduf("FLUX.1-dev.dduf", folder_path="path/to/FLUX.1-dev")
```

This tool scans the folder, adds the relevant entries, and ensures the exported file is valid. If anything goes wrong during the process, a `DDUFExportError` is raised.
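Because a `.dduf` file is just an uncompressed ZIP archive, the container layout is easy to illustrate with the standard library alone. The sketch below builds a toy DDUF-shaped archive in memory and checks two of the spec's constraints (illustration only; real exports should go through the `huggingface_hub` helpers, which also enforce ZIP64 and the full validation rules):

```python
import io
import json
import zipfile

# Toy DDUF-shaped payload: an index at the root plus one component directory.
entries = {
    "model_index.json": json.dumps({"_class_name": "MyPipeline"}),
    "vae/config.json": json.dumps({"sample_size": 512}),
}

buffer = io.BytesIO()
# ZIP_STORED = no compression (flag 0), which is what keeps DDUF mmap-friendly.
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
    for name, payload in entries.items():
        zf.writestr(name, payload)

# Re-open the archive and verify: everything stored uncompressed,
# and the mandatory index is readable at the root.
with zipfile.ZipFile(buffer) as zf:
    assert all(info.compress_type == zipfile.ZIP_STORED for info in zf.infolist())
    index = json.loads(zf.read("model_index.json"))
print(index["_class_name"])  # prints: MyPipeline
```

Because nothing is compressed, the byte ranges reported in the ZIP central directory map directly onto the file, which is what makes memory-mapped and HTTP Range access practical.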
For more flexibility, use `export_entries_as_dduf` to explicitly specify a list of files to include in the final DDUF file:

```python
# Export specific files from the local disk.
>>> from huggingface_hub import export_entries_as_dduf
>>> export_entries_as_dduf(
...     dduf_path="stable-diffusion-v1-4-FP16.dduf",
...     entries=[  # List entries to add to the DDUF file (here, only FP16 weights)
...         ("model_index.json", "path/to/model_index.json"),
...         ("vae/config.json", "path/to/vae/config.json"),
...         ("vae/diffusion_pytorch_model.fp16.safetensors", "path/to/vae/diffusion_pytorch_model.fp16.safetensors"),
...         ("text_encoder/config.json", "path/to/text_encoder/config.json"),
...         ("text_encoder/model.fp16.safetensors", "path/to/text_encoder/model.fp16.safetensors"),
...         # ... add more entries here
...     ]
... )
```

`export_entries_as_dduf` works well if you've already saved your model on disk. But what if you have a model loaded in memory and want to serialize it directly into a DDUF file? `export_entries_as_dduf` lets you do that by providing a Python generator that specifies how to serialize the data iteratively:

```python
(...)
# Export state_dicts one by one from a loaded pipeline
>>> def as_entries(pipe: DiffusionPipeline) -> Generator[Tuple[str, bytes], None, None]:
...     # Build a generator that yields the entries to add to the DDUF file.
...     # The first element of the tuple is the filename in the DDUF archive. The second element is the content of the file.
...     # Entries will be evaluated lazily when the DDUF file is created (only 1 entry is loaded in memory at a time)
...     yield "vae/config.json", pipe.vae.to_json_string().encode()
...     yield "vae/diffusion_pytorch_model.safetensors", safetensors.torch.save(pipe.vae.state_dict())
...     yield "text_encoder/config.json", pipe.text_encoder.config.to_json_string().encode()
...     yield "text_encoder/model.safetensors", safetensors.torch.save(pipe.text_encoder.state_dict())
...     # ... add more entries here

>>> export_entries_as_dduf(dduf_path="my-cool-diffusion-model.dduf", entries=as_entries(pipe))
```

### Loading a DDUF file with Diffusers

Diffusers has a built-in integration for DDUF files. Here is an example of how to load a pipeline from a checkpoint stored on the Hub:

```py
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "DDUF/FLUX.1-dev-DDUF", dduf_file="FLUX.1-dev.dduf", torch_dtype=torch.bfloat16
).to("cuda")
image = pipe(
    "photo of a cat holding a sign that says Diffusers", num_inference_steps=50, guidance_scale=3.5
).images[0]
image.save("cat.png")
```

## F.A.Q.

### Why build on top of ZIP?

ZIP provides several advantages:

- Universally supported file format
- No additional dependencies for reading
- Built-in file indexing
- Wide language support

### Why not use a TAR with a table of contents at the beginning of the archive?

See the explanation in this [comment](https://github.com/huggingface/huggingface_hub/pull/2692#issuecomment-2519863726).

### Why no compression?

- Enables direct memory mapping of large files
- Ensures consistent and predictable remote file access
- Prevents CPU overhead during file reading
- Maintains compatibility with safetensors

### Can I modify a DDUF file?

No. For now, DDUF files are designed to be immutable. To update a model, create a new DDUF file.

### Which frameworks/apps support DDUFs?

- [Diffusers](https://github.com/huggingface/diffusers)

We are constantly reaching out to other libraries and frameworks. If you are interested in adding support to your project, open a Discussion in the [DDUF org](https://huggingface.co/spaces/DDUF/README/discussions).

### Using sample-factory at Hugging Face

https://huggingface.co/docs/hub/sample-factory.md

# Using sample-factory at Hugging Face

[`sample-factory`](https://github.com/alex-petrenko/sample-factory) is a codebase for high-throughput asynchronous reinforcement learning.
It has integrations with the Hugging Face Hub to share models with evaluation results and training metrics.

## Exploring sample-factory in the Hub

You can find `sample-factory` models by filtering at the left of the [models page](https://huggingface.co/models?library=sample-factory).

All models on the Hub come with useful features:

1. An automatically generated model card with a description, a training configuration, and more.
2. Metadata tags that help with discoverability.
3. Evaluation results to compare with other models.
4. A video widget where you can watch your agent performing.

## Install the library

To install the `sample-factory` library, you need to install the package:

`pip install sample-factory`

SF is known to work on Linux and macOS. There is no Windows support at this time.

## Loading models from the Hub

### Using load_from_hub

To download a model from the Hugging Face Hub to use with Sample-Factory, use the `load_from_hub` script:

```
python -m sample_factory.huggingface.load_from_hub -r <repo_id> -d <train_dir_path>
```

The command line arguments are:

- `-r`: The repo ID for the HF repository to download from. The repo ID should be in the format `<username>/<model_name>`
- `-d`: An optional argument to specify the directory to save the experiment to. Defaults to `./train_dir`, which will save the repo to `./train_dir/<model_name>`

### Download Model Repository Directly

Hugging Face repositories can be downloaded directly using `git clone`:

```
git clone git@hf.co:<repo_id>
# example: git clone git@hf.co:bigscience/bloom
```

## Using Downloaded Models with Sample-Factory

After downloading the model, you can run the models in the repo with the enjoy script corresponding to your environment.
For example, if you are downloading a `mujoco-ant` model, it can be run with:

```
python -m sf_examples.mujoco.enjoy_mujoco --algo=APPO --env=mujoco_ant --experiment=<experiment_name> --train_dir=./train_dir
```

Note that you may have to specify the `--train_dir` if your local train_dir has a different path than the one in the `cfg.json`.

## Sharing your models

### Using push_to_hub

If you want to upload without generating evaluation metrics or a replay video, you can use the `push_to_hub` script:

```
python -m sample_factory.huggingface.push_to_hub -r <username>/<repo_name> -d <experiment_dir_path>
```

The command line arguments are:

- `-r`: The repo ID to save on the HF Hub. This is the same as `hf_repository` in the enjoy script and must be in the form `<username>/<repo_name>`
- `-d`: The full path to your experiment directory to upload

### Using enjoy.py

You can upload your models to the Hub using your environment's `enjoy` script with the `--push_to_hub` flag. Uploading using `enjoy` can also generate evaluation metrics and a replay video. The evaluation metrics are generated by running your model on the specified environment for a number of episodes and reporting the mean and std reward of those runs.

Other relevant command line arguments are:

- `--hf_repository`: The repository to push to. Must be of the form `<username>/<repo_name>`. The model will be saved to `https://huggingface.co/<username>/<repo_name>`
- `--max_num_episodes`: Number of episodes to evaluate on before uploading. Used to generate evaluation metrics. It is recommended to use multiple episodes to generate an accurate mean and std.
- `--max_num_frames`: Number of frames to evaluate on before uploading. An alternative to `max_num_episodes`.
- `--no_render`: A flag that disables rendering and showing the environment steps. It is recommended to set this flag to speed up the evaluation process.

You can also save a video of the model during evaluation to upload to the Hub with the `--save_video` flag:

- `--video_frames`: The number of frames to be rendered in the video.
Defaults to -1, which renders an entire episode.
- `--video_name`: The name of the video to save as. If `None`, will save to `replay.mp4` in your experiment directory.

For example:

```
python -m sf_examples.mujoco.enjoy_mujoco --algo=APPO --env=mujoco_ant --experiment=<experiment_name> --train_dir=./train_dir --max_num_episodes=10 --push_to_hub --hf_username=<username> --hf_repository=<repo_name> --save_video --no_render
```

### Using Asteroid at Hugging Face

https://huggingface.co/docs/hub/asteroid.md

# Using Asteroid at Hugging Face

`asteroid` is a PyTorch toolkit for audio source separation. It enables fast experimentation on common datasets, with support for a large range of datasets and recipes to reproduce papers.

## Exploring Asteroid in the Hub

You can find `asteroid` models by filtering at the left of the [models page](https://huggingface.co/models?filter=asteroid).

All models on the Hub come with the following features:

1. An automatically generated model card with a description, training configuration, metrics, and more.
2. Metadata tags that help with discoverability and contain information such as licenses and datasets.
3. An interactive widget you can use to play with the model directly in the browser.
4. An Inference Providers widget that lets you make inference requests.

## Using existing models

For a full guide on loading pre-trained models, we recommend checking out the [official guide](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md).

All model classes (`BaseModel`, `ConvTasNet`, etc.) have a `from_pretrained` method that allows you to load models from the Hub.

```py
from asteroid.models import ConvTasNet
model = ConvTasNet.from_pretrained('mpariente/ConvTasNet_WHAM_sepclean')
```

If you want to see how to load a specific model, you can click `Use in Asteroid` and you will be given a working snippet that you can use to load it!
## Sharing your models

At the moment there is no automatic method to upload your models to the Hub, but the process to upload them is documented in the [official guide](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md#share-your-models).

All the recipes create all the files needed to upload a model to the Hub. The process usually involves the following steps:

1. Create and clone a model repository.
2. Move files from the recipe output to the repository (model card, model file, TensorBoard traces).
3. Push the files (`git add` + `git commit` + `git push`).

Once you do this, you can try out your model directly in the browser and share it with the rest of the community.

## Additional resources

* Asteroid [website](https://asteroid-team.github.io/).
* Asteroid [library](https://github.com/asteroid-team/asteroid).
* Integration [docs](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md).

### Security

https://huggingface.co/docs/hub/security.md

# Security

The Hugging Face Hub offers several security features to ensure that your code and data are secure. Beyond offering [private repositories](./repositories-settings#private-repositories) for models, datasets, and Spaces, the Hub supports access tokens, resource groups, MFA, commit signatures, malware scanning, and more.

Hugging Face is GDPR compliant. If a contract or specific data storage is something you'll need, we recommend taking a look at our [Team & Enterprise Support](https://huggingface.co/support). Hugging Face can also offer Business Associate Addendums or GDPR data processing agreements through an [Enterprise Plan](https://huggingface.co/pricing).

Hugging Face is also [SOC2 Type 2 certified](https://us.aicpa.org/interestareas/frc/assuranceadvisoryservices/aicpasoc2report.html), meaning we provide security certification to our customers and actively monitor and patch any security weaknesses.
For any other security questions, please feel free to send us an email at security@huggingface.co.

## Contents

- [User Access Tokens](./security-tokens)
- [Two-Factor Authentication (2FA)](./security-2fa)
- [Git over SSH](./security-git-ssh)
- [Signing commits with GPG](./security-gpg)
- [Single Sign-On (SSO)](./security-sso)
- [Advanced Access Control (Resource Groups)](./security-resource-groups)
- [Malware Scanning](./security-malware)
- [Pickle Scanning](./security-pickle)
- [Secrets Scanning](./security-secrets)
- [Third-party scanner: Protect AI](./security-protectai)
- [Third-party scanner: JFrog](./security-jfrog)

### Popular Images

https://huggingface.co/docs/hub/jobs-popular-images.md

# Popular Images

Here is the list of ready-to-use Docker images from popular frameworks that you can use in Jobs with uv. These Docker images already have uv installed; if you want to use an image that doesn't ship with uv, you'll need to make sure uv is installed first. This will work well in many cases, but for LLM inference libraries, which can have quite specific requirements, it can be useful to use a specific image that has the library installed.

## vLLM

vLLM is a very well known and heavily used inference engine, known for its ability to scale inference for LLMs. They provide the `vllm/vllm-openai` Docker image with vLLM and uv ready. This image is ideal for running batch inference.

Use the `--image` argument to use this Docker image:

```bash
>>> hf jobs uv run --image vllm/vllm-openai --flavor l4x4 generate-responses.py
```

You can find more information on vLLM batch inference on Jobs in [Daniel van Strien's blog post](https://danielvanstrien.xyz/posts/2025/hf-jobs/vllm-batch-inference.html).

## TRL

TRL is a library designed for post-training models using techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO).
An up-to-date Docker image with uv and all TRL dependencies is available at `huggingface/trl` and can be used directly with Hugging Face Jobs.

Use the `--image` argument to use this Docker image:

```bash
>>> hf jobs uv run --image huggingface/trl --flavor a100-large -s HF_TOKEN train.py
```

### Webhook guide: Set up an automatic system to re-train a model when a dataset changes

https://huggingface.co/docs/hub/webhooks-guide-auto-retrain.md

# Webhook guide: Set up an automatic system to re-train a model when a dataset changes

This guide will walk you through the setup of an automatic training pipeline on the Hugging Face platform using HF Datasets, Webhooks, Spaces, and AutoTrain.

We will build a Webhook that listens to changes on an image classification dataset and triggers a fine-tuning of [microsoft/resnet-50](https://huggingface.co/microsoft/resnet-50) using [AutoTrain](https://huggingface.co/autotrain).

## Prerequisite: Upload your dataset to the Hub

We will use a [simple image classification dataset](https://huggingface.co/datasets/huggingface-projects/auto-retrain-input-dataset) for the sake of the example. Learn more about uploading your data to the Hub [here](https://huggingface.co/docs/datasets/upload_dataset).

![dataset](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/dataset.png)

## Create a Webhook to react to the dataset's changes

First, let's create a Webhook from your [settings](https://huggingface.co/settings/webhooks).

- Select your dataset as the target repository. We will target [huggingface-projects/input-dataset](https://huggingface.co/datasets/huggingface-projects/input-dataset) in this example.
- You can put a dummy Webhook URL for now. Defining your Webhook will let you look at the events that will be sent to it. You can also replay them, which will be useful for debugging!
- Input a secret to make it more secure.
- Subscribe to "Repo update" events as we want to react to data changes.

Your Webhook will look like this:

![webhook-creation](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/webhook-creation.png)

## Create a Space to react to your Webhook

We now need a way to react to your Webhook events. An easy way to do this is to use a [Space](https://huggingface.co/docs/hub/spaces-overview)! You can find an example Space [here](https://huggingface.co/spaces/huggingface-projects/auto-retrain/tree/main).

This Space uses Docker, Python, [FastAPI](https://fastapi.tiangolo.com/), and [uvicorn](https://www.uvicorn.org) to run a simple HTTP server. Read more about Docker Spaces [here](https://huggingface.co/docs/hub/spaces-sdks-docker).

The entry point is [src/main.py](https://huggingface.co/spaces/huggingface-projects/auto-retrain/blob/main/src/main.py). Let's walk through this file and detail what it does:

1. It spawns a FastAPI app that will listen to HTTP `POST` requests on `/webhook`:

```python
from fastapi import FastAPI

# [...]

@app.post("/webhook")
async def post_webhook(
    # ...
):
    # ...
```

2. This route checks that the `X-Webhook-Secret` header is present and that its value is the same as the one you set in your Webhook's settings. The `WEBHOOK_SECRET` secret must be set in the Space's settings and be the same as the secret set in your Webhook.

```python
# [...]

WEBHOOK_SECRET = os.getenv("WEBHOOK_SECRET")

# [...]

@app.post("/webhook")
async def post_webhook(
    # [...]
    x_webhook_secret: Optional[str] = Header(default=None),
    # ^ checks for the X-Webhook-Secret HTTP header
):
    if x_webhook_secret is None:
        raise HTTPException(401)
    if x_webhook_secret != WEBHOOK_SECRET:
        raise HTTPException(403)
    # [...]
```

3. The event's payload is encoded as JSON. Here, we'll be using pydantic models to parse the event payload.
We also specify that we will run our Webhook only when:

- the event concerns the input dataset
- the event is an update on the repo's content, i.e., there has been a new commit

```python
# defined in src/models.py
class WebhookPayloadEvent(BaseModel):
    action: Literal["create", "update", "delete", "move"]
    scope: str

class WebhookPayloadRepo(BaseModel):
    type: Literal["dataset", "model", "space"]
    name: str
    id: str
    private: bool
    headSha: str

class WebhookPayload(BaseModel):
    event: WebhookPayloadEvent
    repo: WebhookPayloadRepo

# [...]

@app.post("/webhook")
async def post_webhook(
    # [...]
    payload: WebhookPayload,
    # ^ Pydantic model defining the payload format
):
    # [...]
    if not (
        payload.event.action == "update"
        and payload.event.scope.startswith("repo.content")
        and payload.repo.name == config.input_dataset
        and payload.repo.type == "dataset"
    ):
        # no-op if the payload does not match our expectations
        return {"processed": False}
    # [...]
```

4. If the payload is valid, the next step is to create a project on AutoTrain, schedule a fine-tuning of the input model (`microsoft/resnet-50` in our example) on the input dataset, and create a discussion on the dataset when it's done!

```python
def schedule_retrain(payload: WebhookPayload):
    # Create the autotrain project
    try:
        project = AutoTrain.create_project(payload)
        AutoTrain.add_data(project_id=project["id"])
        AutoTrain.start_processing(project_id=project["id"])
    except requests.HTTPError as err:
        print("ERROR while requesting AutoTrain API:")
        print(f" code: {err.response.status_code}")
        print(f" {err.response.json()}")
        raise
    # Notify in the community tab
    notify_success(project["id"])
```

Visit the link inside the comment to review the training cost estimate, and start fine-tuning the model!
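The guard condition from step 3 can be exercised on its own with plain dictionaries. This is a stdlib-only sketch of the same logic, without pydantic validation; the `input_dataset` argument stands in for the `config.input_dataset` value from the Space:

```python
def should_retrain(payload: dict, input_dataset: str) -> bool:
    """Mirror of the Space's no-op guard: only react to content updates
    on the configured input dataset."""
    event, repo = payload["event"], payload["repo"]
    return (
        event["action"] == "update"
        and event["scope"].startswith("repo.content")
        and repo["name"] == input_dataset
        and repo["type"] == "dataset"
    )

# A new commit on the input dataset -> triggers a retrain
commit_event = {
    "event": {"action": "update", "scope": "repo.content"},
    "repo": {"type": "dataset", "name": "huggingface-projects/input-dataset"},
}
# A new discussion -> ignored
discussion_event = {
    "event": {"action": "create", "scope": "discussion"},
    "repo": {"type": "dataset", "name": "huggingface-projects/input-dataset"},
}

print(should_retrain(commit_event, "huggingface-projects/input-dataset"))      # True
print(should_retrain(discussion_event, "huggingface-projects/input-dataset"))  # False
```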
![community tab notification](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/notification.png)

In this example, we used Hugging Face AutoTrain to fine-tune our model quickly, but you can of course plug in your own training infrastructure!

Feel free to duplicate the Space to your personal namespace and play with it. You will need to provide two secrets:

- `WEBHOOK_SECRET`: the secret from your Webhook.
- `HF_ACCESS_TOKEN`: a User Access Token with `write` rights. You can create one [from your settings](https://huggingface.co/settings/tokens).

You will also need to tweak the [`config.json` file](https://huggingface.co/spaces/huggingface-projects/auto-retrain/blob/main/config.json) to use the dataset and model of your choice:

```json
{
  "target_namespace": "the namespace where the trained model should end up",
  "input_dataset": "the dataset on which the model will be trained",
  "input_model": "the base model to re-train",
  "autotrain_project_prefix": "A prefix for the AutoTrain project"
}
```

## Configure your Webhook to send events to your Space

Last but not least, you'll need to configure your Webhook to send POST requests to your Space.

Let's first grab our Space's "direct URL" from the contextual menu. Click on "Embed this Space" and copy the "Direct URL".

![embed this Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/duplicate-space.png)

![direct URL](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/direct-url.png)

Update your Webhook to send requests to that URL:

![webhook settings](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/update-webhook.png)

And that's it!
Now every commit to the input dataset will trigger a fine-tuning of ResNet-50 with AutoTrain 🎉

### Sign in with Hugging Face

https://huggingface.co/docs/hub/oauth.md

# Sign in with Hugging Face

You can use the HF OAuth / OpenID Connect flow to create a **"Sign in with HF"** flow in any website or app. This will allow users to sign in to your website or app using their HF account, by clicking a button similar to this one:

![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-xl-dark.svg)

After clicking this button your users will be presented with a permissions modal to authorize your app:

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/oauth-accept-application.png)

## Creating an oauth app

You can create your application in your [settings](https://huggingface.co/settings/applications/new):

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/oauth-create-application.png)

### Public OAuth apps (no secret)

You can create or use OAuth apps without a client secret. This is useful for native apps, CLIs, or other contexts where keeping a secret is impractical.

- **At app creation**: When creating a new OAuth app, you can choose to create it without a secret.
- **After creation**: For an existing app, you can delete the client secret in the app settings. The app will then work as a public app.

Public apps authenticate using only the client ID (e.g. in device code or authorization code flows with PKCE). Apps that have a secret can still use the secret when needed (e.g. `Authorization: Basic` for token requests).

### If you are hosting in Spaces

> [!TIP]
> If you host your app on Spaces, then the flow will be even easier to implement (and built into Gradio directly); check our [Spaces OAuth guide](https://huggingface.co/docs/hub/spaces-oauth).
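Since public apps rely on PKCE rather than a client secret, it can help to see how small that machinery is. Here is a sketch of the RFC 7636 S256 method using only the standard library (variable names are illustrative, not part of the HF API):

```python
import base64
import hashlib
import secrets

def b64url(data: bytes) -> str:
    # Base64url without padding, as required by RFC 7636
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

# 1. The client generates a random code_verifier and derives its S256 challenge.
code_verifier = b64url(secrets.token_bytes(32))
code_challenge = b64url(hashlib.sha256(code_verifier.encode("ascii")).digest())

# 2. The challenge goes in the authorization request
#    (&code_challenge=...&code_challenge_method=S256); the verifier is sent
#    later with the token request, so the server can check they match
#    without any shared secret.
assert b64url(hashlib.sha256(code_verifier.encode("ascii")).digest()) == code_challenge
print(len(code_verifier), len(code_challenge))  # 43 43
```

Because the verifier never leaves the client until the token request, an attacker who intercepts the authorization code alone cannot redeem it.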
### Automated oauth app creation

Hugging Face supports CIMD, aka [Client ID Metadata Documents](https://datatracker.ietf.org/doc/draft-ietf-oauth-client-id-metadata-document/), which allows you to create an oauth app for your website in an automated manner:

- Add an endpoint to your website at `/.well-known/oauth-cimd` which returns the following JSON:

```json
{
  "client_id": "[your website url]/.well-known/oauth-cimd",
  "client_name": "Your Website",
  "redirect_uris": ["[your website url]/oauth/callback/huggingface"],
  "token_endpoint_auth_method": "none",
  "logo_uri": "https://....", // optional
  "client_uri": "[your website url]" // optional
}
```

- Use `"[your website url]/.well-known/oauth-cimd"` as the client ID, and PKCE as the auth mechanism.

This is particularly useful for ephemeral environments or MCP clients. See an [implementation example](https://github.com/huggingface/chat-ui/pull/1978) in Hugging Chat.

## Device code OAuth

The device code flow lets users authorize an app on one device (e.g. a CLI) by entering a short code on another device (e.g. a phone or browser). No redirect URI or browser on the device running the app is required.

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/oauth-device-first-step.png)

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/oauth-device-second-step.png)

### Testing with a sample script

You can test a device-code OAuth app with the following script. Replace `<your-client-id>` with your app's client ID.

For **public apps** (no secret), the script works as-is. For **apps with a secret**, add an `Authorization: Basic` header (Base64 of `client_id:client_secret`) to both the device and token requests.
```sh
#!/bin/bash
CLIENT_ID="<your-client-id>"

# Step 1: Get device code
RESPONSE=$(curl -s -X POST https://huggingface.co/oauth/device \
  -d "client_id=$CLIENT_ID")

DEVICE_CODE=$(echo $RESPONSE | jq -r '.device_code')
USER_CODE=$(echo $RESPONSE | jq -r '.user_code')
VERIFICATION_URI=$(echo $RESPONSE | jq -r '.verification_uri')

echo "Device Code: $DEVICE_CODE"
echo "User Code: $USER_CODE"
echo ""
echo "Open: ${VERIFICATION_URI}"
echo "Enter the user code: $USER_CODE"
echo ""

# Step 2: Wait for the user to authorize on another device
read -p "Press Enter after authorizing..."

# Step 3: Get token
curl -X POST https://huggingface.co/oauth/token \
  -d "grant_type=urn:ietf:params:oauth:grant-type:device_code" \
  -d "device_code=$DEVICE_CODE" \
  -d "client_id=$CLIENT_ID"
```

> [!NOTE]
> For OAuth apps that have a client secret, include an `Authorization: Basic` header (with Base64-encoded `client_id:client_secret`) on both the device code request and the token request.

## Currently supported scopes

The currently supported scopes are:

- `openid`: Get the ID token in addition to the access token.
- `profile`: Get the user's profile information (username, avatar, etc.)
- `email`: Get the user's email address.
- `read-billing`: Know whether the user has a payment method set up.
- `read-repos`: Get read access to the user's personal repos.
- `gated-repos`: Get read access to the content of public gated repos the user has been granted access to. Unlike `read-repos`, this does not grant access to private repos.
- `contribute-repos`: Can create repositories and access those created by this app. Cannot access any other repositories unless additional permissions are granted.
- `write-repos`: Get write/read access to the user's personal repos.
- `manage-repos`: Get full access to the user's personal repos. Also grants repo creation and deletion.
- `read-collections`: Get read access to the user's personal collections.
- `write-collections`: Get write/read access to the user's personal collections. Also grants collection creation and deletion.
- `inference-api`: Get access to [Inference Providers](https://huggingface.co/docs/inference-providers/index); you will be able to make inference requests on behalf of the user.
- `jobs`: Run [jobs](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs).
- `webhooks`: Manage [webhooks](https://huggingface.co/docs/huggingface_hub/main/en/guides/webhooks).
- `write-discussions`: Open discussions and Pull Requests on behalf of the user, as well as interact with discussions (including reactions, posting/editing comments, closing discussions, ...). To open Pull Requests on private repos, you need to request the `read-repos` scope as well.

All other information is available in the [OpenID metadata](https://huggingface.co/.well-known/openid-configuration).

> [!WARNING]
> Please contact us if you need any extra scopes.

## Accessing organization resources

By default, the oauth app does not need to access organization resources. But some scopes like `read-repos` or `read-billing` apply to organizations as well. The user can select which organizations to grant access to when authorizing the app. If you require access to a specific organization, you can add `orgIds=ORG_ID` as a query parameter to the OAuth authorization URL. You have to replace `ORG_ID` with the organization ID, which is available in the `organizations.sub` field of the userinfo response.

## Branding

You are free to use your own design for the button. Below are some SVG images helpfully provided. Check out [our badges](https://huggingface.co/datasets/huggingface/badges#sign-in-with-hugging-face) with explanations for integrating them in markdown or HTML.
[![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-sm.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-sm-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-md.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-md-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-lg.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-lg-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-xl.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging 
Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-xl-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE)

## Token Exchange for Organizations (RFC 8693)

> [!WARNING]
> This feature is part of the Enterprise plan.

Token Exchange allows organizations to programmatically issue access tokens for their members without requiring interactive user consent. This is particularly useful for building internal tools, automation pipelines, and enterprise integrations that need to access Hugging Face resources on behalf of organization members.

This feature implements [RFC 8693 - OAuth 2.0 Token Exchange](https://www.rfc-editor.org/rfc/rfc8693.html), a standard protocol for token exchange scenarios.

### Use cases

Token Exchange is designed for scenarios where your organization needs to:

- **Build internal platforms**: Create dashboards or portals that access Hugging Face resources on behalf of your team members, without requiring each user to manually authenticate.
- **Automate CI/CD pipelines**: Issue short-lived, scoped tokens for automated workflows that need to push models or datasets to organization repositories.
- **Integrate with enterprise identity systems**: Bridge your existing identity provider with Hugging Face by issuing tokens based on your internal user directory.
- **Implement custom access controls**: Build middleware that issues tokens with specific scopes based on your organization's internal policies.

### How it works

1. Your organization has an OAuth application bound to it with the `token-exchange` privilege.
2. Your backend service authenticates with this OAuth app using client credentials.
3. Your service requests an access token for a specific organization member (identified by email).
4. Hugging Face verifies the user is a member of your organization and issues a scoped token.
5. The issued token can only access resources within your organization's scope.

### Prerequisites

To use Token Exchange, you need an organization-bound OAuth application with the `token-exchange` privilege. Contact Hugging Face support to set up an eligible OAuth app for your organization. Once configured, you will receive:

- A **Client ID** (e.g., `a1b2c3d4-e5f6-7890-abcd-ef1234567890`)
- A **Client Secret** (keep this secure!)

> [!WARNING]
> Organization administrators can manage the OAuth app after creation, including refreshing the client secret and configuring the token duration.

### Authentication

Token Exchange uses HTTP Basic Authentication with your OAuth app credentials. Create the authorization header by Base64-encoding your `client_id:client_secret`:

```bash
# Create the authorization header
export CLIENT_ID="your-client-id"
export CLIENT_SECRET="your-client-secret"
export AUTH_HEADER=$(echo -n "${CLIENT_ID}:${CLIENT_SECRET}" | base64)
```

### Issuing tokens by email

To issue an access token for an organization member using their email address:

```bash
curl -X POST "https://huggingface.co/oauth/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -H "Authorization: Basic ${AUTH_HEADER}" \
  -d "grant_type=urn:ietf:params:oauth:grant-type:token-exchange" \
  -d "subject_token=user@yourorg.com" \
  -d "subject_token_type=urn:huggingface:token-type:user-email"
```

### Response

A successful request returns an access token:

```json
{
  "access_token": "hf_oauth_...",
  "token_type": "bearer",
  "expires_in": 28800,
  "scope": "openid profile email read-repos",
  "id_token": "eyJhbGciOiJS...",
  "issued_token_type": "urn:ietf:params:oauth:token-type:access_token"
}
```

The `id_token` field is included when the `openid` scope is requested.
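The same exchange can be scripted. Here is a minimal Python sketch that assembles the Basic authentication header and form body used by the curl example above; it only builds the request, and sending it (e.g. with `requests.post` or `urllib.request`) is left to your HTTP client of choice:

```python
import base64

TOKEN_URL = "https://huggingface.co/oauth/token"

def build_token_exchange_request(client_id, client_secret, user_email, scope=None):
    """Assemble headers and form fields for an RFC 8693 token-exchange call.

    Returns (headers, data); no HTTP request is sent here.
    """
    creds = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    headers = {
        "Authorization": f"Basic {creds}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    data = {
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": user_email,
        "subject_token_type": "urn:huggingface:token-type:user-email",
    }
    if scope:
        data["scope"] = scope  # e.g. "openid profile"
    return headers, data

headers, data = build_token_exchange_request(
    "my-client-id", "my-client-secret", "user@yourorg.com"
)
```

POSTing `data` as a form body with these headers to `TOKEN_URL` yields the JSON response shown above.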
You can then use this token to make API requests on behalf of the user:

```bash
curl "https://huggingface.co/api/whoami-v2" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}"
```

### Scope control

By default, issued tokens inherit all scopes configured on the OAuth app. You can request specific scopes by adding the `scope` parameter. See [Currently supported scopes](#currently-supported-scopes) for available values. The token's effective permissions are limited both by the requested scope and by the user's role within the organization.

```bash
curl -X POST "https://huggingface.co/oauth/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -H "Authorization: Basic ${AUTH_HEADER}" \
  -d "grant_type=urn:ietf:params:oauth:grant-type:token-exchange" \
  -d "subject_token=user@yourorg.com" \
  -d "subject_token_type=urn:huggingface:token-type:user-email" \
  -d "scope=openid profile"
```

> [!TIP]
> Follow the principle of least privilege: request only the scopes your application actually needs.

### Security considerations

Tokens issued via Token Exchange have built-in security restrictions:

- **Organization-scoped**: Tokens can only access resources within your organization (models, datasets, Spaces, and collections owned by the org). Outside the org, access is read-only and limited to: public collections from any user or organization, and public gated repos the user has been individually granted access to.
- **No personal access**: Tokens cannot access the user's personal private repositories or private repos from other organizations.
- **Short-lived**: Tokens expire after 8 hours by default. Organization administrators can configure the token duration (up to 30 days) in the OAuth app settings. No refresh tokens are provided.
- **Auditable**: All token exchanges are logged and visible in your organization's [audit logs](./audit-logs).

> [!WARNING]
> Protect your OAuth app credentials carefully. Anyone with access to your client secret can issue tokens for any member of your organization.

### Error responses

| Error | Description |
|-------|-------------|
| `invalid_client` | Client is not authorized to use token exchange, or the app is not bound to an organization |
| `invalid_grant` | User not found in the bound organization |
| `invalid_scope` | Requested scope is not valid |

### Reference

**Grant type:**

```
urn:ietf:params:oauth:grant-type:token-exchange
```

**Request parameter (`subject_token_type`):**

| Value | Description |
|-------|-------------|
| `urn:huggingface:token-type:user-email` | Identify the user by their email address |

**Response field (`issued_token_type`):**

| Value | Description |
|-------|-------------|
| `urn:ietf:params:oauth:token-type:access_token` | Indicates an access token was issued |

**Related documentation:**

- [RFC 8693 - OAuth 2.0 Token Exchange](https://www.rfc-editor.org/rfc/rfc8693.html)
- [Audit Logs](./audit-logs)

### Using Spaces for Organization Cards

https://huggingface.co/docs/hub/spaces-organization-cards.md

# Using Spaces for Organization Cards

Organization cards are a way to describe your organization to other users. They take the form of a `README.md` static file, inside a Space repo named `README`. Please read more in the [dedicated doc section](./organizations-cards).

### Query datasets

https://huggingface.co/docs/hub/datasets-duckdb-select.md

# Query datasets

Querying datasets is a fundamental step in data analysis. Here, we'll guide you through querying datasets using various methods. There are [several ways](https://duckdb.org/docs/data/parquet/overview.html) to select your data.
Using the `FROM` syntax:

```bash
FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' SELECT city, country, region LIMIT 3;

┌────────────────┬─────────────┬───────────────┐
│      city      │   country   │    region     │
│    varchar     │   varchar   │    varchar    │
├────────────────┼─────────────┼───────────────┤
│ Kabul          │ Afghanistan │ Southern Asia │
│ Kandahar       │ Afghanistan │ Southern Asia │
│ Mazar-e Sharif │ Afghanistan │ Southern Asia │
└────────────────┴─────────────┴───────────────┘
```

Using the `SELECT` and `FROM` syntax:

```bash
SELECT city, country, region FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' USING SAMPLE 3;

┌──────────┬─────────┬────────────────┐
│   city   │ country │     region     │
│ varchar  │ varchar │    varchar     │
├──────────┼─────────┼────────────────┤
│ Wenzhou  │ China   │ Eastern Asia   │
│ Valdez   │ Ecuador │ South America  │
│ Aplahoue │ Benin   │ Western Africa │
└──────────┴─────────┴────────────────┘
```

Count all JSONL files matching a glob pattern:

```bash
SELECT COUNT(*) FROM 'hf://datasets/jamescalam/world-cities-geo/*.jsonl';

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         9083 │
└──────────────┘
```

You can also query Parquet files using the `read_parquet` function (or its alias `parquet_scan`).
This function, along with other [parameters](https://duckdb.org/docs/data/parquet/overview.html#parameters), provides flexibility in handling Parquet files, especially if they don't have a `.parquet` extension. Let's explore these functions using the auto-converted Parquet files from the same dataset.

Select using the [read_parquet](https://duckdb.org/docs/guides/file_formats/query_parquet.html) function:

```bash
SELECT * FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet') LIMIT 3;

┌────────────────┬─────────────┬───────────────┬───────────┬────────────┬────────────┬────────────────────┬───────────────────┬────────────────────┐
│      city      │   country   │    region     │ continent │  latitude  │ longitude  │         x          │         y         │         z          │
│    varchar     │   varchar   │    varchar    │  varchar  │   double   │   double   │       double       │      double       │       double       │
├────────────────┼─────────────┼───────────────┼───────────┼────────────┼────────────┼────────────────────┼───────────────────┼────────────────────┤
│ Kabul          │ Afghanistan │ Southern Asia │ Asia      │ 34.5166667 │ 69.1833344 │  1865.546409629258 │ 4906.785732164055 │ 3610.1012966606136 │
│ Kandahar       │ Afghanistan │ Southern Asia │ Asia      │      31.61 │ 65.6999969 │  2232.782351694877 │ 4945.064042683584 │  3339.261233224765 │
│ Mazar-e Sharif │ Afghanistan │ Southern Asia │ Asia      │ 36.7069444 │ 67.1122208 │ 1986.5057687360124 │  4705.51748048584 │  3808.088900172991 │
└────────────────┴─────────────┴───────────────┴───────────┴────────────┴────────────┴────────────────────┴───────────────────┴────────────────────┘
```

Read all files that match a glob pattern and include a filename column specifying which file each row came from:

```bash
SELECT city, country, filename FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet', filename = true) LIMIT 3;

┌────────────────┬─────────────┬───────────────────────────────────────────────────────────────────────────────┐
│      city      │   country   │                                   filename                                    │
│    varchar     │   varchar   │                                    varchar                                    │
├────────────────┼─────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ Kabul          │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │
│ Kandahar       │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │
│ Mazar-e Sharif │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │
└────────────────┴─────────────┴───────────────────────────────────────────────────────────────────────────────┘
```

## Get metadata and schema

The [parquet_metadata](https://duckdb.org/docs/data/parquet/metadata.html) function can be used to query the metadata contained within a Parquet file.

```bash
SELECT * FROM parquet_metadata('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet');

┌───────────────────────────────────────────────────────────────────────────────┬──────────────┬────────────────────┬─────────────┐
│                                   file_name                                   │ row_group_id │ row_group_num_rows │ compression │
│                                    varchar                                    │    int64     │       int64        │   varchar   │
├───────────────────────────────────────────────────────────────────────────────┼──────────────┼────────────────────┼─────────────┤
│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │            0 │               1000 │ SNAPPY      │
│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │            0 │               1000 │ SNAPPY      │
│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │            0 │               1000 │ SNAPPY      │
└───────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────────┘
```

Fetch the column names and column types:

```bash
DESCRIBE SELECT * FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet';

┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ city        │ VARCHAR     │ YES     │         │         │         │
│ country     │ VARCHAR     │ YES     │         │         │         │
│ region      │ VARCHAR     │ YES     │         │         │         │
│ continent   │ VARCHAR     │ YES     │         │         │         │
│ latitude    │ DOUBLE      │ YES     │         │         │         │
│ longitude   │ DOUBLE      │ YES     │         │         │         │
│ x           │ DOUBLE      │ YES     │         │         │         │
│ y           │ DOUBLE      │ YES     │         │         │         │
│ z           │ DOUBLE      │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
```

Fetch the internal schema (excluding the file name):

```bash
SELECT * EXCLUDE (file_name) FROM parquet_schema('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet');

┌───────────┬────────────┬─────────────┬─────────────────┬──────────────┬────────────────┬───────┬───────────┬──────────┬──────────────┐
│   name    │    type    │ type_length │ repetition_type │ num_children │ converted_type │ scale │ precision │ field_id │ logical_type │
│  varchar  │  varchar   │   varchar   │     varchar     │    int64     │    varchar     │ int64 │   int64   │  int64   │   varchar    │
├───────────┼────────────┼─────────────┼─────────────────┼──────────────┼────────────────┼───────┼───────────┼──────────┼──────────────┤
│ schema    │            │             │ REQUIRED        │            9 │                │       │           │          │              │
│ city      │ BYTE_ARRAY │             │ OPTIONAL        │              │ UTF8           │       │           │          │ StringType() │
│ country   │ BYTE_ARRAY │             │ OPTIONAL        │              │ UTF8           │       │           │          │ StringType() │
│ region    │ BYTE_ARRAY │             │ OPTIONAL        │              │ UTF8           │       │           │          │ StringType() │
│ continent │ BYTE_ARRAY │             │ OPTIONAL        │              │ UTF8           │       │           │          │ StringType() │
│ latitude  │ DOUBLE     │             │ OPTIONAL        │              │                │       │           │          │              │
│ longitude │ DOUBLE     │             │ OPTIONAL        │              │                │       │           │          │              │
│ x         │ DOUBLE     │             │ OPTIONAL        │              │                │       │           │          │              │
│ y         │ DOUBLE     │             │ OPTIONAL        │              │                │       │           │          │              │
│ z         │ DOUBLE     │             │ OPTIONAL        │              │                │       │           │          │              │
└───────────┴────────────┴─────────────┴─────────────────┴──────────────┴────────────────┴───────┴───────────┴──────────┴──────────────┘
```

## Get statistics

The `SUMMARIZE` command can be used to get various aggregates over a query (min, max, approx_unique, avg, std, q25, q50, q75, count). It returns these statistics along with the column name, column type, and the percentage of NULL values.

```bash
SUMMARIZE SELECT latitude, longitude FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet';

┌─────────────┬─────────────┬──────────────┬─────────────┬───────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬───────┬─────────────────┐
│ column_name │ column_type │     min      │     max     │ approx_unique │        avg         │        std         │        q25         │        q50         │        q75         │ count │ null_percentage │
│   varchar   │   varchar   │   varchar    │   varchar   │     int64     │      varchar       │      varchar       │      varchar       │      varchar       │      varchar       │ int64 │  decimal(9,2)   │
├─────────────┼─────────────┼──────────────┼─────────────┼───────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼───────┼─────────────────┤
│ latitude    │ DOUBLE      │ -54.8        │ 67.8557214  │          7324 │ 22.5004568364307   │ 26.770454684690925 │ 6.089858461951687  │ 29.321258648324747 │ 44.90191158328915  │  9083 │            0.00 │
│ longitude   │ DOUBLE      │ -175.2166595 │ 179.3833313 │          7802 │ 14.699333721953098 │ 63.93672742608224  │ -6.877990418604821 │ 19.12963979385393  │ 43.873513093419966 │  9083 │            0.00 │
└─────────────┴─────────────┴──────────────┴─────────────┴───────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴───────┴─────────────────┘
```

### Manual Configuration

https://huggingface.co/docs/hub/datasets-manual-configuration.md

# Manual Configuration

This guide will show you how to configure a custom structure for your dataset repository.
The [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87) showcases each section of the documentation.

A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to define the splits, subsets and builder parameters that are used by the Viewer.

It is also possible to define multiple subsets (also called "configurations") for the same dataset (e.g. if the dataset has various independent files).

## Splits

If you have multiple files and want to define which file goes into which split, you can use YAML at the top of your README.md.

For example, given a repository like this one:

```
my_dataset_repository/
├── README.md
├── data.csv
└── holdout.csv
```

You can define a subset for your splits by adding the `configs` field in the YAML block at the top of your README.md:

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "data.csv"
  - split: test
    path: "holdout.csv"
---
```

You can select multiple files per split using a list of paths:

```
my_dataset_repository/
├── README.md
├── data/
│   ├── abc.csv
│   └── def.csv
└── holdout/
    └── ghi.csv
```

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "data/abc.csv"
    - "data/def.csv"
  - split: test
    path: "holdout/ghi.csv"
---
```

Or you can use glob patterns to automatically list all the files you need:

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*.csv"
  - split: test
    path: "holdout/*.csv"
---
```

> [!WARNING]
> Note that the `config_name` field is required even if you have a single subset.

## Multiple Subsets

Your dataset might have several subsets of data that you want to be able to use separately. For example, each subset has its own dropdown in the Dataset Viewer on the Hugging Face Hub.
In that case you can define a list of subsets inside the `configs` field in YAML:

```
my_dataset_repository/
├── README.md
├── main_data.csv
└── additional_data.csv
```

```yaml
---
configs:
- config_name: main_data
  data_files: "main_data.csv"
- config_name: additional_data
  data_files: "additional_data.csv"
---
```

Note that the order of subsets shown in the viewer is the default one first, then alphabetical.

> [!TIP]
> You can set a default subset using `default: true`
>
> ```yaml
> - config_name: main_data
>   data_files: "main_data.csv"
>   default: true
> ```
>
> This is useful to set which subset the Dataset Viewer shows first, and which subset data libraries load by default.

## Data Directory

Instead of listing individual files with `data_files`, you can use `data_dir` to point to a directory. Files inside that directory are resolved automatically based on file extensions. This is especially useful when your data is organized in subdirectories.

For example, in a case like this, you can simply use `data_dir` since each subset's data lives in its own directory:

```
my_dataset_repository/
├── README.md
├── main/
│   ├── train.csv
│   └── test.csv
└── extra/
    ├── train.csv
    └── test.csv
```

```yaml
---
configs:
- config_name: main
  data_dir: "main"
- config_name: extra
  data_dir: "extra"
---
```

When `data_dir` is set, the builder resolves files relative to that directory. If the directory contains files matching the default split naming pattern (e.g. `train.csv`, `test.csv`), splits are assigned automatically without needing explicit `data_files`.

You can also combine `data_dir` with `data_files` for more control:

```yaml
---
configs:
- config_name: default
  data_dir: "data"
  data_files:
  - split: train
    path: "training_*.csv"
  - split: test
    path: "eval_*.csv"
---
```

In this case, the `path` patterns in `data_files` are resolved relative to the `data_dir`.
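To make the pattern-to-split mapping concrete, here is a simplified Python sketch of how glob patterns like the ones above could resolve repository files into splits. It uses the standard library's `fnmatch` as an approximation; the actual resolver in the `datasets` library is more sophisticated:

```python
from fnmatch import fnmatch

def resolve_splits(repo_files, data_files):
    """Map repository files to splits from a `data_files` spec.

    `data_files` mirrors the YAML above: a list of {"split": ..., "path": ...}
    entries, where `path` is a glob pattern or a list of patterns.
    """
    splits = {}
    for entry in data_files:
        patterns = entry["path"] if isinstance(entry["path"], list) else [entry["path"]]
        splits[entry["split"]] = [
            f for f in repo_files if any(fnmatch(f, p) for p in patterns)
        ]
    return splits

files = ["data/abc.csv", "data/def.csv", "holdout/ghi.csv", "README.md"]
data_files = [
    {"split": "train", "path": "data/*.csv"},
    {"split": "test", "path": "holdout/*.csv"},
]
splits = resolve_splits(files, data_files)
```

Here `splits["train"]` collects the two CSVs under `data/`, `splits["test"]` collects `holdout/ghi.csv`, and files matching no pattern (like `README.md`) are ignored.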
## Builder parameters

Not only `data_files`, but other builder-specific parameters can be passed via YAML, allowing for more flexibility on how to load the data while not requiring any custom code. For example, define which separator to use in which subset to load your `csv` files:

```yaml
---
configs:
- config_name: tab
  data_files: "main_data.csv"
  sep: "\t"
- config_name: comma
  data_files: "additional_data.csv"
  sep: ","
---
```

Refer to the [specific builders' documentation](/docs/datasets/package_reference/builder_classes) to see what parameters they have.

### Spaces Changelog

https://huggingface.co/docs/hub/spaces-changelog.md

# Spaces Changelog

## [2026-03-18] - Protected Spaces visibility

- Spaces now support a **protected** visibility option, in addition to public and private. In Space settings, visibility is now set through a dropdown with three options instead of a simple public/private toggle.
- Protected visibility is available on [PRO](https://huggingface.co/pro) and [Team & Enterprise](https://huggingface.co/enterprise) plans.
- A protected Space keeps its source code private on the Hub, while the app remains publicly accessible through its embed URL or [custom domain](./spaces-custom-domain).
- This is especially useful for hosting websites without publishing source code.
- Read more in the [Spaces Overview](./spaces-overview#space-visibility).

## [2025-04-30] - Deprecate Streamlit SDK

- Streamlit is no longer provided as a default built-in SDK option. Streamlit applications are now created using the Docker template.

## [2023-07-28] - Upstream Streamlit frontend for `>=1.23.0`

- Streamlit SDK uses the upstream packages published on PyPI for `>=1.23.0`, so the newly released versions are available from the day of release.

## [2023-05-30] - Add support for Streamlit 1.23.x and 1.24.0

- Added support for Streamlit `1.23.0`, `1.23.1`, and `1.24.0`.
- Since `1.23.0`, the Streamlit frontend has been changed to the upstream version from the HF-customized one.
## [2023-05-30] - Add support for Streamlit 1.22.0

- Added support for Streamlit `1.22.0`.

## [2023-05-15] - The default Streamlit version

- The default Streamlit version is set as `1.21.0`.

## [2023-04-12] - Add support for Streamlit up to 1.19.0

- Support for `1.16.0`, `1.17.0`, `1.18.1`, and `1.19.0` is added and the default SDK version is set as `1.19.0`.

## [2023-03-28] - Bug fix

- Fixed a bug causing inability to scroll on iframe-embedded or directly accessed Streamlit apps, which was reported at https://discuss.huggingface.co/t/how-to-add-scroll-bars-to-a-streamlit-app-using-space-direct-embed-url/34101. The patch has been applied to Streamlit>=1.18.1.

## [2022-12-15] - Spaces supports Docker Containers

- Read more doc about: [Docker Spaces](./spaces-sdks-docker)

## [2022-12-14] - Ability to set a custom `sleep` time

- Read more doc here: [Spaces sleep time](./spaces-gpus#sleep-time)

## [2022-12-07] - Add support for Streamlit 1.15

- Announcement: https://twitter.com/osanseviero/status/1600881584214638592.

## [2022-06-07] - Add support for Streamlit 1.10.0

- The new multipage apps feature is working out-of-the-box on Spaces.
- Streamlit blogpost: https://blog.streamlit.io/introducing-multipage-apps.

## [2022-05-23] - Spaces speedup and reactive system theme

- All Spaces using Gradio 3+ and Streamlit 1.x.x have a significant speedup in loading.
- System theme is now reactive inside the app. If the user changes to dark mode, it automatically changes.

## [2022-05-21] - Default Debian packages and Factory Reboot

- Spaces environments now come with pre-installed popular packages (`ffmpeg`, `libsndfile1`, etc.).
- This way, most of the time, you don't need to specify any additional package for your Space to work properly.
- The `packages.txt` file can still be used if needed.
- Added a factory reboot button to Spaces, which allows users to do a full restart, avoiding cached requirements and freeing GPU memory.
## [2022-05-17] - Add support for Streamlit 1.9.0

- All `1.x.0` versions are now supported (up to `1.9.0`).

## [2022-05-16] - Gradio 3 is out!

- This is the default version when creating a new Space, don't hesitate to [check it out](https://huggingface.co/blog/gradio-blocks).

## [2022-03-04] - SDK version lock

- The `sdk_version` field is now automatically pre-filled at Space creation time.
- It ensures that your Space stays on the same SDK version after an update.

## [2022-03-02] - Gradio version pinning

- The `sdk_version` configuration field now works with the Gradio SDK.

## [2022-02-21] - Python versions

- You can specify the version of Python that you want your Space to run on.
- Only Python 3 versions are supported.

## [2022-01-24] - Automatic model and dataset linking from Spaces

- We attempt to automatically extract model and dataset repo ids used in your code
- You can always manually define them with `models` and `datasets` in your YAML.

## [2021-10-20] - Add support for Streamlit 1.0

- We now support all versions between 0.79.0 and 1.0.0

## [2021-09-07] - Streamlit version pinning

- You can now choose which version of Streamlit will be installed within your Space

## [2021-09-06] - Upgrade Streamlit to `0.84.2`

- Supporting Session State API
- [Streamlit changelog](https://github.com/streamlit/streamlit/releases/tag/0.84.0)

## [2021-08-10] - Upgrade Streamlit to `0.83.0`

- [Streamlit changelog](https://github.com/streamlit/streamlit/releases/tag/0.83.0)

## [2021-08-04] - Debian packages

- You can now add your `apt-get` dependencies into a `packages.txt` file

## [2021-08-03] - Streamlit components

- Add support for [Streamlit components](https://streamlit.io/components)

## [2021-08-03] - Flax/Jax GPU improvements

- For GPU-activated Spaces, make sure Flax / Jax runs smoothly on GPU

## [2021-08-02] - Upgrade Streamlit to `0.82.0`

- [Streamlit changelog](https://github.com/streamlit/streamlit/releases/tag/0.82.0)

## [2021-08-01] - Raw logs available

- Add
link to raw logs (build and container) from the Space repository (viewable by users with write access to a Space). ### Jobs Overview https://huggingface.co/docs/hub/jobs-overview.md # Jobs Overview Run compute jobs on Hugging Face infrastructure with a familiar UV- and Docker-like interface! - UV & Docker-like CLI: `uv run`, `ps`, `logs`, `stats`, `inspect` - Any hardware: CPUs to A100s & TPUs - Run anything: UV, Docker, HF Spaces & more - Pay-as-you-go: pay only for the seconds used. The Hugging Face Hub provides compute for AI and data workflows via Jobs. Jobs run on Hugging Face infrastructure and aim to give AI builders, data engineers, developers, and AI agents easy access to cloud infrastructure for their workloads. They are ideal for fine-tuning AI models and running inference on GPUs, but also for data ingestion and processing. A Job is defined by a command to run (e.g. a UV or Python command), a hardware flavor (CPU, GPU, TPU), and optionally a Docker image from Hugging Face Spaces or Docker Hub. Many Jobs can run in parallel, which is useful e.g. for parameter tuning or parallel inference and data processing. ## Run Jobs from anywhere There are multiple tools you can use to run Jobs: * the `hf` Command Line Interface (see the [CLI installation steps](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) and the [Jobs CLI documentation](https://huggingface.co/docs/huggingface_hub/guides/cli#hf-jobs) for more information) * the `huggingface_hub` Python client (see the [`huggingface_hub` Jobs documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs) for more information) * the Jobs HTTP API (see the [Jobs OpenAPI](https://huggingface-openapi.hf.space/#tag/jobs) for more information) ## Run any workload The `hf` Jobs CLI and the `huggingface_hub` Python client offer a UV-like interface to run Python workloads. UV installs the required Python dependencies and runs the Python script in a single command.
Python dependencies may also be defined in a self-contained UV script, in which case there is no need to specify anything but the UV script to run the Job.

```diff
- uv run
+ hf jobs uv run
```

More generally, Hugging Face Jobs supports any workload based on a Docker image and a command. Jobs offers a Docker-like interface to run Jobs, where you can specify a Docker image from Hugging Face Spaces or Docker Hub, as well as the command to run. Docker provides the ability to package ready-to-use environments as Docker images that are shared by the community or custom-made. You may therefore choose or define your Docker image based on what your workloads need (e.g. Python, torch, vLLM) and run any command. This is more advanced than using UV but provides more flexibility.

```diff
- docker run
+ hf jobs run
```

## Automate Jobs Trigger Jobs automatically with a schedule or using webhooks. With a schedule, you can run Jobs every X minutes, hours, days, weeks or months. Scheduling Jobs uses the `cron` syntax, like `"*/5 * * * *"` for "every 5 minutes", or aliases like `"@hourly"`, `"@daily"`, `"@weekly"` or `"@monthly"`. With webhooks, Jobs can run whenever there is an update on a Hugging Face repository. For example, you can configure webhooks to trigger for every model update under a given account, and retrieve the updated model from the webhook payload in the Job. ### Cookie limitations in Spaces https://huggingface.co/docs/hub/spaces-cookie-limitations.md # Cookie limitations in Spaces In Hugging Face Spaces, applications have certain limitations when using cookies. This is primarily due to the structure of the Spaces' pages (`https://huggingface.co/spaces//`), which contain applications hosted on a different domain (`*.hf.space`) within an iframe. For security reasons, modern browsers tend to restrict the use of cookies from iframe pages hosted on a different domain than the parent page.
## Impact on Hosting Streamlit Apps with Docker SDK One instance where these cookie restrictions become problematic is when hosting Streamlit applications using the Docker SDK. By default, Streamlit enables cookie-based XSRF protection. As a result, certain components that submit data to the server, such as `st.file_uploader()`, will not work properly on HF Spaces where cookie usage is restricted. To work around this issue, set the `server.enableXsrfProtection` option in Streamlit to `false`. There are two ways to do this: 1. Command line argument: the option can be specified as a command line argument when running the Streamlit application. Here is an example command:

```shell
streamlit run app.py --server.enableXsrfProtection false
```

2. Configuration file: alternatively, you can specify the option in the Streamlit configuration file `.streamlit/config.toml`:

```toml
[server]
enableXsrfProtection = false
```

> [!TIP] > When you are using the Streamlit SDK, you don't need to worry about this because the SDK does it for you. ### Editing datasets https://huggingface.co/docs/hub/datasets-editing.md # Editing datasets The [Hub](https://huggingface.co/datasets) enables collaborative curation of community and research datasets. We encourage you to explore the datasets available on the Hub and contribute to their improvement to help grow the ML community and accelerate progress for everyone. All contributions are welcome! Start by [creating a Hugging Face Hub account](https://huggingface.co/join) if you don't have one yet. ## Edit using the Hub UI > [!WARNING] > This feature is only available for CSV, TSV, and Parquet datasets for now. The Hub's web interface allows users without any technical expertise to edit a dataset. Open the dataset page and navigate to the **Data Studio** tab to begin editing. Click on **Toggle edit mode** to enable dataset editing.
Edit as many cells as you want, then click **Commit** to commit your changes with a commit message. ## Using the `huggingface_hub` client library The `huggingface_hub` library can manage Hub repositories, including editing datasets. For example, here is how to edit a CSV file using the [Hugging Face FileSystem API](https://huggingface.co/docs/huggingface_hub/en/guides/hf_file_system):

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
path = f"datasets/{repo_id}/data.csv"

# Read, edit, and write the file back through the filesystem API
with fs.open(path, "r") as f:
    content = f.read()
edited_content = content.replace("foo", "bar")
with fs.open(path, "w") as f:
    f.write(edited_content)
```

You can also apply edits locally on your disk and commit the changes:

```python
from huggingface_hub import hf_hub_download, upload_file

# Download the file, edit it locally, then upload it back
local_path = hf_hub_download(repo_id=repo_id, filename="data.csv", repo_type="dataset")
with open(local_path, "r") as f:
    content = f.read()
edited_content = content.replace("foo", "bar")
with open(local_path, "w") as f:
    f.write(edited_content)
upload_file(path_or_fileobj=local_path, path_in_repo="data.csv", repo_id=repo_id, repo_type="dataset")
```

> [!TIP] > To get the entire dataset repository locally and edit many files at once, use `snapshot_download` and `upload_folder` instead of `hf_hub_download` and `upload_file`. Visit [the client library's documentation](/docs/huggingface_hub/index) to learn more. ## Integrated libraries If a dataset on the Hub is compatible with a [supported library](./datasets-libraries), loading, editing, and pushing the dataset takes just a few lines. Here is how to edit a CSV file with Pandas:

```python
import pandas as pd

# Load the dataset
df = pd.read_csv(f"hf://datasets/{repo_id}/data.csv")
# Edit
df = df.apply(...)
# Commit the changes
df.to_csv(f"hf://datasets/{repo_id}/data.csv")
```

Libraries like Polars and DuckDB also implement the `hf://` protocol to read, edit and write files on Hugging Face. Other libraries, like Spark, Dask or 🤗 Datasets, are useful for editing datasets made of many files.
See the full list of supported libraries [here](./datasets-libraries). For information on accessing the dataset on the website, you can click on the "Use this dataset" button on the dataset page to see how to do so. For example, [`samsum`](https://huggingface.co/datasets/knkarthick/samsum?library=datasets) shows how to do so with 🤗 Datasets. ## Only upload the new data Hugging Face's storage is powered by [Xet](https://huggingface.co/docs/hub/en/xet), which uses chunk deduplication to make uploads more efficient. Unlike traditional cloud storage, Xet doesn't require the entire dataset to be re-uploaded to commit changes. Instead, it automatically detects which parts of the dataset have changed and instructs the client library to upload only the updated parts. To do that, Xet uses a smart algorithm to find ~64kB chunks that already exist on Hugging Face. Let's revisit our previous example with Pandas:

```python
import pandas as pd

# Load the dataset
df = pd.read_csv(f"hf://datasets/{repo_id}/data.csv")
# Edit part of the dataset
df = df.apply(...)
# Commit the changes
df.to_csv(f"hf://datasets/{repo_id}/data.csv")
```

This code first loads a dataset and then edits it. Once the edits are done, `to_csv()` materializes the file in memory, chunks it, asks Xet which chunks are already on Hugging Face and which chunks have changed, and then uploads only the new data. ## Optimized Parquet editing The amount of data to upload depends on the edits and the file structure. The Parquet format is columnar and compressed at the page level (pages are ~1MB). We optimized Parquet for Xet with [Parquet Content Defined Chunking](https://huggingface.co/blog/parquet-cdc), which ensures unchanged data generally results in unchanged pages.
For example, this code uploads the content of `df` and then, for `edited_df`, the upload is faster since only the chunks that changed are uploaded:

```python
import pandas as pd

# df is an existing DataFrame
df.to_parquet(
    "hf://datasets/username/my_dataset/imdb.parquet",
    # Optimize for Xet
    use_content_defined_chunking=True,
    write_page_index=True,
)

edited_df = ...  # e.g. with added/modified/removed rows or columns
edited_df.to_parquet(
    "hf://datasets/username/my_dataset/imdb.parquet",
    # Optimize for Xet
    use_content_defined_chunking=True,
    write_page_index=True,
)
```

Chunks are ~64kB and Parquet saves data column by column, so in practice this is what happens when editing an Optimized Parquet file: * add a new column -> only the chunks of the new column are uploaded * add/edit/delete a row -> one chunk per column is uploaded In addition, the chunks of the Parquet footer containing the metadata are also uploaded. Check whether your library supports Optimized Parquet in the [supported libraries](./datasets-libraries) page. ## Streaming For big datasets, libraries with dataset streaming features for end-to-end streaming pipelines are recommended. In this case, the dataset processing runs progressively as the old data arrives and the new data is uploaded to the Hub. Check whether your library supports streaming in the [supported libraries](./datasets-libraries) page. ### Manage Jobs https://huggingface.co/docs/hub/jobs-manage.md # Manage Jobs ## List Jobs Find your list of Jobs in the Jobs page or your organization Jobs page (user/organization page > settings > Jobs). It is also available in the Hugging Face CLI. Show the list of running Jobs with `hf jobs ps` and use `-a` to show all Jobs:

```bash
>>> hf jobs ps
JOB ID       IMAGE/SPACE      COMMAND     CREATED             STATUS
------------ ---------------- ----------- ------------------- -------
69402ea6c... ghcr.io/astra... uv run p...
2025-12-15 15:52:06 RUNNING

>>> hf jobs ps -a
JOB ID       IMAGE/SPACE COMMAND         CREATED             STATUS
------------ ---------- --------------- ------------------- ---------
69402ea6c... ghcr.io... uv run pytho... 2025-12-15 15:52:06 RUNNING
693b06b8c... ghcr.io... uv run pytho... 2025-12-11 18:00:24 CANCELED
693b069fc... ghcr.io... uv run pytho... 2025-12-11 17:59:59 ERROR
693aef401... ghcr.io... uv run pytho... 2025-12-11 16:20:16 COMPLETED
693aee76c... ubuntu     echo Hello f... 2025-12-11 16:16:54 COMPLETED
693ae8e3c... python:... python -c pr... 2025-12-11 15:53:07 COMPLETED
```

Specify your organization `namespace` to list Jobs under your organization:

```bash
>>> hf jobs ps --namespace
```

## Filter Jobs Click on a Job's label to filter Jobs by label. In the CLI, you can filter Jobs based on the conditions provided, using the format key=value. Filter by labels:

```bash
>>> hf jobs ps --filter label=fine-tuning --filter label=model=Qwen3-06B -a
JOB ID       IMAGE/SPACE  COMMAND          CREATED             STATUS
------------ ------------ ---------------- ------------------- ---------
6978b1254... ghcr.io/a... uv run --with... 2026-01-27 12:35:49 COMPLETED
6978b11d4... ghcr.io/a... uv run --with... 2026-01-27 12:33:53 COMPLETED
```

Filter on any condition:

```bash
>>> hf jobs ps --filter status=error -a
JOB ID       IMAGE/SPACE COMMAND            CREATED             STATUS
------------ ---------- ------------------ ------------------- ------
693b069fc... ghcr.io... uv run python -... 2025-12-11 17:59:59 ERROR
693996dec... ghcr.io... bash -c python ... 2025-12-10 15:50:54 ERROR
69399695c... ghcr.io... uv run --with t... 2025-12-10 15:49:41 ERROR
693994bdc... ghcr.io... uv run --with t... 2025-12-10 15:41:49 ERROR
68d3c1af3... ghcr.io... uv run bash -c ...
2025-09-24 10:02:23 ERROR
```

Filtering supports negation `!=` and glob patterns (including `*` and `?`):

```bash
# Show Jobs that are not completed
>>> hf jobs ps -a --filter status!=completed

# Show Jobs with a command that ends with "train.py"
>>> hf jobs ps -a --filter "command=*train.py"

# Show Jobs with a "fine-tuning" label
>>> hf jobs ps -a --filter label=fine-tuning

# Show Jobs that don't have the "prod" label and have a label that starts with "data-"
>>> hf jobs ps -a --filter label!=prod --filter "label=data-*"

# Show Jobs based on key=value labels
>>> hf jobs ps -a --filter label=model=Qwen3-06B --filter label=dataset!=Capybara
```

## Monitor resource usage Use `hf jobs stats` to get the usage statistics for CPU, memory, network and GPU (if any) of running Jobs:

```bash
>>> hf jobs stats
JOB ID                   CPU % NUM CPU MEM % MEM USAGE        NET I/O         GPU UTIL % GPU MEM % GPU MEM USAGE
------------------------ ----- ------- ----- ---------------- --------------- ---------- --------- ---------------
695e83c5d2f3efac77e8cf18 8%    12.0    7.18% 10.9GB / 152.5GB 0.0bps / 0.0bps 100%       31.92%    25.9GB / 81.2GB
```

Specify one or several Job ids to only show the statistics of certain Jobs:

```bash
>>> hf jobs stats [job-ids]...
```

## Inspect a Job You can see the status and logs of a Job on the Job page. Alternatively, using the CLI:

```bash
>>> hf jobs inspect 693994e21a39f67af5a41ad0
[
  {
    "id": "693994e21a39f67af5a41ad0",
    "created_at": "2025-12-10 15:42:26.835000+00:00",
    "docker_image": "ghcr.io/astral-sh/uv:python3.12-bookworm",
    "space_id": null,
    "command": ["bash", "-c", "python -c \"import urllib.request; import os; from pathlib import Path; o = urllib.request.build_opener(); o.addheaders = [(\\\"Authorization\\\", \\\"Bearer \\\" + os.environ[\\\"UV_SCRIPT_HF_TOKEN\\\"])]; Path(\\\"/tmp/script.py\\\").write_bytes(o.open(os.environ[\\\"UV_SCRIPT_URL\\\"]).read())\" && uv run --with trl /tmp/script.py"],
    "arguments": [],
    "environment": {"UV_SCRIPT_URL": "https://huggingface.co/datasets/lhoestq/hf-cli-jobs-uv-run-scripts/resolve/728cc5682eb402d7ffe66a2f6f97645b34cb08dd/train.py"},
    "secrets": ["HF_TOKEN", "UV_SCRIPT_HF_TOKEN"],
    "flavor": "a100-large",
    "status": {"stage": "COMPLETED", "message": null},
    "owner": {"id": "5e9ecfc04957053f60648a3e", "name": "lhoestq", "type": "user"},
    "endpoint": "https://huggingface.co",
    "url": "https://huggingface.co/jobs/lhoestq/693994e21a39f67af5a41ad0"
  }
]
```

and for the logs:

```bash
>>> hf jobs logs 693994e21a39f67af5a41ad0
Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB)
Downloading numpy (15.8MiB)
Downloading nvidia-cuda-cupti-cu12 (9.8MiB)
Downloading tokenizers (3.1MiB)
Downloading nvidia-cusolver-cu12 (255.1MiB)
Downloading nvidia-cufft-cu12 (184.2MiB)
Downloading transformers (11.4MiB)
Downloading setuptools (1.1MiB)
...
```

Specify your organization `namespace` to inspect a Job under your organization:

```bash
hf jobs inspect --namespace
hf jobs logs --namespace
```

## Debug a Job If a Job has an error, you can see it on the Job page. Look at the status message and the logs on the Job page to see what went wrong. You may also look at the last lines of logs to see what happened before a Job failed.
You can see that on the Job page, or using the CLI:

```bash
>>> hf jobs logs 69405cf51a39f67af5a41f29 | tail -n 10
Downloaded nvidia-cudnn-cu12
Downloaded torch
Installed 66 packages in 226ms
Generating train split: 100%|██████████| 15806/15806 [00:00<00:00, 73330.17 examples/s]
Generating test split: 100%|██████████| 200/200 [00:00<00:00, 45427.32 examples/s]
Traceback (most recent call last):
  File "/tmp/script.py", line 7, in
    train_dataset=train_dataset,
    ^^^^^^^^^^^^^
NameError: name 'train_dataset' is not defined. Did you mean: 'load_dataset'?
```

Debug a Job locally using your local UV or Docker setup: * `hf jobs uv run ...` -> `uv run ...` * `hf jobs run ...` -> `docker run ...` The status message may say "Job timeout": the Job didn't finish before the timeout (the default is 30min) and was therefore stopped. In this case, specify a higher timeout using `--timeout` in the CLI, e.g.

```bash
hf jobs uv run --timeout 3h ...
```

## Cancel Jobs Use the "Cancel" button on the Job page to cancel a Job, or in the CLI:

```bash
hf jobs cancel 693b06b8c67c9f186cfe239e
```

Specify your organization `namespace` to cancel a Job under your organization:

```bash
hf jobs cancel --namespace
```

## macOS menu bar Find your list of Jobs in the macOS [`hfjobs-menubar`](https://github.com/drbh/hfjobs-menubar) client, and use it to get Jobs information and to monitor logs and resource usage statistics. ### Using Xet Storage https://huggingface.co/docs/hub/xet/using-xet-storage.md # Using Xet Storage ## Python To access a Xet-aware version of `huggingface_hub`, simply install the latest version:

```bash
pip install -U huggingface_hub
```

As of `huggingface_hub` 0.32.0, this will also install `hf_xet`. The `hf_xet` package integrates `huggingface_hub` with [`xet-core`](https://github.com/huggingface/xet-core), the Rust client for the Xet backend.
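To check which of these packages are present in your environment, you can query the installed package metadata with the standard library. A minimal sketch (it simply reports the versions pip sees, nothing more):

```python
from importlib.metadata import PackageNotFoundError, version

# huggingface_hub >= 0.32.0 pulls in hf_xet automatically;
# older 0.30.x/0.31.x setups need hf_xet installed explicitly.
versions = {}
for pkg in ("huggingface_hub", "hf_xet"):
    try:
        versions[pkg] = version(pkg)
    except PackageNotFoundError:
        versions[pkg] = "not installed"
print(versions)
```

If both report a version, uploads and downloads can take the Xet path described above without any further configuration.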
If you use the `transformers` or `datasets` libraries, they already use `huggingface_hub`. As long as the version of `huggingface_hub` is >= 0.32.0, no further action needs to be taken. If a version of `huggingface_hub` >= 0.30.0 and < 0.32.0 is installed, `hf_xet` must be installed explicitly:

```bash
pip install -U hf-xet
```

And that's it! You now get the benefits of Xet deduplication for both uploads and downloads. Team members using a version of `huggingface_hub` < 0.30.0 will still be able to upload and download repositories through the [backwards compatibility provided by the LFS bridge](legacy-git-lfs#backward-compatibility-with-lfs). For more detailed usage docs, refer to the `huggingface_hub` docs for: - [Upload](https://huggingface.co/docs/huggingface_hub/guides/upload#faster-uploads) - [Download](https://huggingface.co/docs/huggingface_hub/guides/download#hfxet) - [Managing the `hf_xet` cache](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#chunk-based-caching-xet) ## Git Git users can access the benefits of Xet by downloading and installing the Git Xet extension. Once installed, simply use the [standard workflows for managing Hub repositories with Git](../repositories-getting-started) - no additional changes necessary. ### Prerequisites Install [Git](https://git-scm.com/) and [Git LFS](https://git-lfs.com/).
### Install on macOS or Linux (amd64 or aarch64) Install using the installation script with the following command in your terminal (requires `curl` and `unzip`):

```
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/huggingface/xet-core/refs/heads/main/git_xet/install.sh | sh
```

Or, install using [Homebrew](https://brew.sh/):

```
brew install git-xet
git xet install
```

To verify the installation, run:

```
git xet --version
```

### Windows (amd64) Using `winget`:

```
winget install git-xet
```

Using an installer: - Download `git-xet-windows-installer-x86_64.zip` ([available here](https://github.com/huggingface/xet-core/releases/download/git-xet-v0.2.0/git-xet-windows-installer-x86_64.zip)) and unzip. - Run the `msi` installer file and follow the prompts. Manual installation: - Download `git-xet-windows-x86_64.zip` ([available here](https://github.com/huggingface/xet-core/releases/download/git-xet-v0.2.0/git-xet-windows-x86_64.zip)) and unzip. - Place the extracted `git-xet.exe` under a `PATH` directory. - Run `git xet install` in a terminal. To verify the installation, run:

```
git xet --version
```

### Using Git Xet Once installed on your platform, using Git Xet is as simple as following the Hub's standard Git workflows. Make sure all [prerequisites are installed and configured](https://huggingface.co/docs/hub/repositories-getting-started#requirements), follow the [setup instructions for working with repositories on the Hub](https://huggingface.co/docs/hub/repositories-getting-started#set-up), then commit your changes and `push` to the Hub:

```
# Create any files you like! Then...
git add .
git commit -m "Uploading new models"  # You can choose any descriptive message
git push
```

Under the hood, the [Xet protocol](https://huggingface.co/docs/xet/index) is invoked to upload large files directly to Xet storage, increasing upload speeds through the power of [chunk-level deduplication](./deduplication).
### Uninstall on macOS or Linux Using Homebrew:

```bash
git xet uninstall
brew uninstall git-xet
```

If you used the installation script (for macOS or Linux), run the following in your terminal:

```bash
git xet uninstall
sudo rm $(which git-xet)
```

### Uninstall on Windows If you used `winget`:

```
winget uninstall git-xet
```

If you used the installer: - Navigate to Settings -> Apps -> Installed apps - Find "Git-Xet". - Select the "Uninstall" option available in the context menu. If you manually installed: - Run `git xet uninstall` in a terminal. - Delete the `git-xet.exe` file from the location where it was originally placed. ## Recommendations Xet integrates seamlessly with all of the Hub's workflows. However, there are a few steps you can take to get the most benefit from Xet storage. When uploading or downloading with Python: - **Make sure `hf_xet` is installed**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files. - **Adaptive concurrency is on by default**: `hf_xet` automatically adjusts the number of parallel transfer streams based on real-time network conditions; no configuration is required. The default settings will saturate most network paths without any tuning. - **Advanced tuning**: For fine-grained control, `HF_XET_FIXED_DOWNLOAD_CONCURRENCY` and `HF_XET_FIXED_UPLOAD_CONCURRENCY` let you pin concurrency to a fixed value, bypassing the adaptive controller. See `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) for the full list of options. When uploading or downloading in Git or Python: - **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient.
- **Be specific in `.gitattributes`**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage. - **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need. ## Environment Variables Both `hf_xet` and Git Xet are powered by `xet-core`, which can be configured via environment variables. The tables below list the individual variables for fine-grained control. Most users will not need to change any of these; the defaults are tuned to saturate most network paths automatically. > [!NOTE] > `HF_XET_HIGH_PERFORMANCE=1` is a convenience flag that adjusts several settings at once (concurrency bounds, buffer sizes, and parallel file limits). It is intended for machines with high bandwidth **and at least 64 GB of RAM** for buffering. On machines with less memory, it may degrade performance. ### Adaptive Concurrency By default, `xet-core` uses adaptive concurrency, dynamically adjusting parallelism based on real-time network conditions. These are advanced settings that are unlikely to be needed in most cases. The variables below control the adaptive controller's behavior:

| Environment Variable | Default | Description |
|---|---|---|
| `HF_XET_CLIENT_ENABLE_ADAPTIVE_CONCURRENCY` | `true` | Enable or disable adaptive concurrency control. When disabled, concurrency stays at the initial value. |
| `HF_XET_CLIENT_AC_INITIAL_UPLOAD_CONCURRENCY` | `1` | Starting number of concurrent upload streams. HP mode: `16`. |
| `HF_XET_CLIENT_AC_INITIAL_DOWNLOAD_CONCURRENCY` | `1` | Starting number of concurrent download streams. HP mode: `16`. |
| `HF_XET_CLIENT_AC_MIN_UPLOAD_CONCURRENCY` | `1` | Lower bound for upload concurrency. HP mode: `4`. |
| `HF_XET_CLIENT_AC_MIN_DOWNLOAD_CONCURRENCY` | `1` | Lower bound for download concurrency. HP mode: `4`. |
| `HF_XET_CLIENT_AC_MAX_UPLOAD_CONCURRENCY` | `64` | Upper bound for upload concurrency. HP mode: `124`. |
| `HF_XET_CLIENT_AC_MAX_DOWNLOAD_CONCURRENCY` | `64` | Upper bound for download concurrency. HP mode: `124`. |
| `HF_XET_CLIENT_AC_TARGET_RTT` | `60s` | Target round-trip time. Concurrency increases as long as the predicted round-trip time for a full transfer is below this value. |
| `HF_XET_CLIENT_AC_MAX_HEALTHY_RTT` | `90s` | Maximum acceptable round-trip time. Transfers taking longer than this are counted as failures by the adaptive controller. |
| `HF_XET_CLIENT_AC_HEALTHY_SUCCESS_RATIO_THRESHOLD` | `0.8` | Success ratio above which the controller increases concurrency. |
| `HF_XET_CLIENT_AC_UNHEALTHY_SUCCESS_RATIO_THRESHOLD` | `0.5` | Success ratio below which the controller decreases concurrency. |
| `HF_XET_CLIENT_AC_LOGGING_INTERVAL_MS` | `10000` | Interval (in ms) at which concurrency status is logged. |

> [!TIP] > To pin concurrency to a fixed value (bypassing the adaptive controller), use the convenience aliases `HF_XET_FIXED_UPLOAD_CONCURRENCY` and `HF_XET_FIXED_DOWNLOAD_CONCURRENCY`. These set the initial, minimum, and maximum concurrency to the same value.

### Network and Retry

| Environment Variable | Default | Description |
|---|---|---|
| `HF_XET_CLIENT_RETRY_MAX_ATTEMPTS` | `5` | Maximum number of retry attempts for failed requests. |
| `HF_XET_CLIENT_RETRY_BASE_DELAY` | `3000ms` | Base delay between retries (with exponential backoff). |
| `HF_XET_CLIENT_RETRY_MAX_DURATION` | `360s` | Maximum total time to spend retrying a request. |
| `HF_XET_CLIENT_CONNECT_TIMEOUT` | `60s` | TCP connection timeout. |
| `HF_XET_CLIENT_READ_TIMEOUT` | `120s` | Read timeout for HTTP responses. |
| `HF_XET_CLIENT_IDLE_CONNECTION_TIMEOUT` | `60s` | Timeout before idle connections are closed. |
| `HF_XET_CLIENT_MAX_IDLE_CONNECTIONS` | `16` | Maximum number of idle connections in the pool. |

### Data Transfer

| Environment Variable | Default | Description |
|---|---|---|
| `HF_XET_DATA_MAX_CONCURRENT_FILE_INGESTION` | `8` | Maximum number of files processed concurrently during upload. HP mode: `100`. |
| `HF_XET_DATA_MAX_CONCURRENT_FILE_DOWNLOADS` | `8` | Maximum number of files downloaded concurrently. |
| `HF_XET_DATA_INGESTION_BLOCK_SIZE` | `8mb` | Size of blocks read during file ingestion. |
| `HF_XET_DATA_PROGRESS_UPDATE_INTERVAL` | `200ms` | How often progress bars are updated. |
| `HF_XET_DATA_PROGRESS_UPDATE_SPEED_SAMPLING_WINDOW` | `10s` | Time window used for aggregating transfer speed measurements in progress reporting. |

### Download Buffers These control memory usage during downloads. `HF_XET_HIGH_PERFORMANCE=1` raises these significantly.

| Environment Variable | Default | HP Mode | Description |
|---|---|---|---|
| `HF_XET_RECONSTRUCTION_MIN_RECONSTRUCTION_FETCH_SIZE` | `256mb` | `1gb` | Minimum fetch size for reconstruction requests. |
| `HF_XET_RECONSTRUCTION_MAX_RECONSTRUCTION_FETCH_SIZE` | `8gb` | `16gb` | Maximum fetch size for reconstruction requests. |
| `HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE` | `2gb` | `16gb` | Total download buffer size. |
| `HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_PERFILE_SIZE` | `512mb` | `2gb` | Per-file download buffer size. |
| `HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_LIMIT` | `8gb` | `64gb` | Hard limit on total download buffer memory. |
| `HF_XET_RECONSTRUCTION_TARGET_BLOCK_COMPLETION_TIME` | `15m` | — | Target time for completing a prefetch block. Used to determine how much data to prefetch ahead during downloads. |
| `HF_XET_RECONSTRUCTION_MIN_PREFETCH_BUFFER` | `1gb` | — | Minimum amount of data to keep prefetched during downloads, regardless of estimated completion time. |
### Logging

| Environment Variable | Default | Description |
|---|---|---|
| `HF_XET_LOG_DEST` | (none) | Log destination. Accepts a file path or directory path (ending with `/`). When set to a directory, log files are created with timestamped names. When set to an empty string, logs go to the console. When unset, logs go to the `logs/` subdirectory in the Hugging Face Xet cache directory. |
| `HF_XET_LOG_FORMAT` | (none) | Log format. Set to `json` for JSON-formatted logs; otherwise plain text. By default, file logging uses JSON and console logging uses text. |
| `HF_XET_LOG_PREFIX` | `xet` | Prefix for log file names when logging to a directory. |
| `HF_XET_LOG_DIR_DISABLE_CLEANUP` | `false` | Disable automatic cleanup of old log files in the log directory. |
| `HF_XET_LOG_DIR_MAX_SIZE` | `250mb` | Maximum total size of log files in the log directory. Old files are pruned to stay under this limit. |
| `HF_XET_LOG_DIR_MIN_DELETION_AGE` | `1d` | Minimum age before a log file can be deleted during cleanup. |
| `HF_XET_LOG_DIR_MAX_RETENTION_AGE` | `14d` | Maximum age for log files. Files older than this are always deleted during cleanup. |

## Current Limitations While Xet brings fine-grained deduplication and enhanced performance to Git-based storage, some features and platform compatibilities are still in development. As a result, keep the following constraints in mind when working with a Xet-enabled repository: - **64-bit systems only**: Both `hf_xet` and Git Xet currently require a 64-bit architecture; 32-bit systems are not supported. ### Xet: our Storage Backend https://huggingface.co/docs/hub/xet/index.md # Xet: our Storage Backend Repositories on the Hugging Face Hub are different from those on software development platforms. They contain files that are: - Large - model or dataset files are in the range of GB and above. We have a few TB-scale files!
- Binary - not in a human-readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet)) While the Hub leverages modern version control with the support of Git, these differences make [Model](https://huggingface.co/docs/hub/models) and [Dataset](https://huggingface.co/docs/hub/datasets) repositories quite different from those that contain only source code. Storing these files directly in a pure Git repository is impractical. Not only are the typical storage systems behind Git repositories unsuited for such files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing you to download gigabytes of historical data you may never need. Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like [Amazon S3](https://aws.amazon.com/s3/)). As a result, the repository stays small and typical Git workflows remain efficient. Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported (see [Backwards Compatibility & Legacy](./legacy-git-lfs)), the Hub has adopted Xet, a modern custom storage system built specifically for AI/ML development. It enables chunk-level deduplication, smaller uploads, and faster downloads than Git LFS. ## Open Source Xet Protocol If you want to understand the underlying Xet protocol or build a new client library to access Xet Storage, check out the [Xet Protocol Specification](https://huggingface.co/docs/xet/index). These pages will get you started using Xet Storage.
## Contents

- [Xet History & Overview](./overview)
- [Using Xet Storage](./using-xet-storage)
- [Security](./security)
- [Backwards Compatibility & Legacy](./legacy-git-lfs)
- [Deduplication](./deduplication)

### Xet History & Overview

https://huggingface.co/docs/hub/xet/overview.md

# Xet History & Overview

[In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage startup based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace Git LFS on the Hub.

Like Git LFS, a Xet-backed repository utilizes S3 as the remote storage, with a `.gitattributes` file at the repository root identifying which files should be stored remotely. A Git LFS pointer file provides metadata to locate the actual file contents in remote storage:

- **SHA256**: A unique identifier for the actual large file, generated by computing the SHA-256 hash of the file's contents.
- **Pointer size**: The size of the pointer file stored in the Git repository.
- **Size of the remote file**: The size of the actual large file in bytes.

This metadata is useful both for verification and for managing storage and transfer operations. A Xet pointer includes all of this information by design, with the addition of a `Xet backed hash` field for referencing the file in Xet storage. Refer to the section on [backwards compatibility with Git LFS](legacy-git-lfs#backward-compatibility-with-lfs) for more details.

Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories deduplicate at the level of bytes. When a file backed by Xet storage is updated, only the modified data is uploaded to remote storage, significantly saving on network transfers. For many workflows, like incremental updates to model checkpoints or appending/inserting new data into a dataset, this improves iteration speed for yourself and your collaborators.
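As a concrete illustration of the pointer metadata described above, here is a minimal sketch that parses a Git LFS pointer file. The field layout (a version line, `oid sha256:<hash>`, `size <bytes>`) follows the Git LFS pointer specification; the sample hash below is hypothetical, and this is not the Hub's own parsing code.

```python
# Minimal sketch: parse a Git LFS pointer file into the metadata fields
# described above. The sample hash is hypothetical.

def parse_lfs_pointer(text: str) -> dict:
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],
        "sha256": fields["oid"].removeprefix("sha256:"),  # identifies the large file
        "size": int(fields["size"]),           # size of the remote file in bytes
        "pointer_size": len(text.encode()),    # size of the pointer file itself
    }

sample = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef\n"
    "size 1073741824\n"
)
meta = parse_lfs_pointer(sample)
```

The pointer is all Git ever sees: a few short lines standing in for a gigabyte-scale file held in remote storage.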
To learn more about deduplication in Xet storage, refer to [Deduplication](deduplication).

### Deduplication

https://huggingface.co/docs/hub/xet/deduplication.md

# Deduplication

Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate at the level of bytes: files are split into spans of roughly 64KB, each referred to as a "chunk". Chunk boundaries are determined by a rolling hash over the actual file contents, making the chunking resilient to insertions or deletions anywhere in the file. When a file is uploaded to a Xet-backed repository using a Xet-aware client, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking; everything else is discarded.

## How Content-Defined Chunking Works

To understand content-defined chunking, imagine a file as a long passage of text. The system scans the data using a rolling hash, a small mathematical function that slides over the bytes. Whenever the hash hits a special pattern, a chunk boundary is placed at that position. Because the boundaries are determined by the *content itself* (not by fixed positions), identical regions of data always produce the same chunks, even if surrounding content changes.

### Why Not Fixed-Size Chunks?

Consider what happens when you insert a small amount of data in the middle of a file. With fixed-size chunking, every chunk boundary after the insertion shifts, invalidating all downstream chunks, even though most of the data is unchanged:

```text
Original file, fixed 6-byte chunks:

|The qu|ick br|own fo|x jump|s over| the l|azy do|g     |
 chunk1 chunk2 chunk3 chunk4 chunk5 chunk6 chunk7 chunk8

Insert "very " before "lazy":

|The qu|ick br|own fo|x jump|s over| the v|ery la|zy dog|
 chunk1 chunk2 chunk3 chunk4 chunk5 chunk6 chunk7 chunk8
                                    ~~~~~~ ~~~~~~ ~~~~~~
                                      3 chunks changed!
```

Even though only 5 bytes were inserted, **3 out of 8 chunks changed** because all boundaries after the edit shifted by 5 positions. In real files at a 64KB chunk size, a small edit can invalidate hundreds of megabytes of chunks.

### Content-Defined Chunking Keeps Boundaries Stable

With CDC, boundaries are placed where the *content* matches a pattern, not at fixed intervals. This means an insertion only affects the chunk where the edit occurs. Chunks before and after remain identical:

```text
Original file, content-defined chunks (boundaries marked by "|"):

|The quick |brown fox |jumps over |the lazy dog|
  chunk 1    chunk 2    chunk 3      chunk 4

Insert "very " before "lazy":

|The quick |brown fox |jumps over |the very lazy dog|
  chunk 1    chunk 2    chunk 3        chunk 4'
  (same)     (same)     (same)        (changed)
```

Only **1 out of 4 chunks changed**: the one containing the edit. The other three are byte-for-byte identical and are deduplicated. This is why CDC is so effective for versioned data: when you update a model checkpoint or append rows to a dataset, only the modified portions need to be uploaded and stored.

### From Chunks to Storage

The full deduplication pipeline works as follows:

```mermaid
flowchart LR
    A["File"] --> B["Content-Defined\nChunking"]
    B --> C{"Chunk already\nstored?"}
    C -- "Yes (duplicate)" --> D["Skip upload\n(reuse existing)"]
    C -- "No (new)" --> E["Group into\n64 MB blocks"]
    E --> F["Upload to\nXet Storage"]
```

When a file is chunked, each chunk's hash is checked against what is already stored. This happens at multiple levels: first against chunks already seen in the current upload session, then against a local cache of previously uploaded metadata, and finally a subset of chunks are checked against all of Xet storage via a global deduplication query. Duplicate chunks are skipped entirely. New chunks are grouped into 64 MB blocks and uploaded. Each block is stored once in a content-addressed store (CAS), keyed by its hash.
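The boundary-stability property can be demonstrated with a few lines of Python. This is a toy illustration, not the actual Xet chunker: the window size, boundary mask, and ~64-byte target chunk size are chosen for readability, and real chunkers use a cheap rolling hash rather than recomputing SHA-256 at every position.

```python
import hashlib
import random

WINDOW = 8    # bytes of context that decide a boundary (real chunkers use more)
MASK = 0x3F   # boundary when the low 6 bits are zero -> ~64-byte chunks (toy scale)

def chunk(data: bytes) -> list[bytes]:
    """Place a boundary wherever the hash of the last WINDOW bytes matches
    the pattern. Boundaries depend only on local content, not position."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        if hashlib.sha256(data[i - WINDOW:i]).digest()[0] & MASK == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])   # trailing chunk
    return chunks

def chunk_hashes(data: bytes) -> set[bytes]:
    return {hashlib.sha256(c).digest() for c in chunk(data)}

# Edit a file in the middle: with CDC, chunks away from the edit are reused.
original = random.Random(0).randbytes(4096)
edited = original[:2000] + b"INSERTED DATA" + original[2000:]

old, new = chunk_hashes(original), chunk_hashes(edited)
reused = len(old & new) / len(old)   # fraction of old chunks found unchanged
```

Because a boundary depends only on the preceding `WINDOW` bytes, boundaries re-synchronize almost immediately after the edit, so `reused` comes out close to 1 even though every byte after position 2000 has shifted.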
## Storage Savings in Practice

The Hub's [current recommendation](https://huggingface.co/docs/hub/storage-limits#recommendations) is to limit files to 200 GB. At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS can notice only that a file has changed, so it stores the entirety of each revision. By deduplicating at the level of chunks, the Xet backend stores only the modified content in a file (which might be just a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times.

For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to its acquisition by Hugging Face.

### Backward Compatibility with LFS

https://huggingface.co/docs/hub/xet/legacy-git-lfs.md

# Backward Compatibility with LFS

Uploads from legacy, non-Xet-aware clients still follow the standard Git LFS path, even if the repo is already Xet-backed. Once a file is uploaded to LFS, a background process automatically migrates it to Xet storage.

The Xet architecture provides backwards compatibility for legacy clients downloading files from Xet-backed repos by offering a Git LFS bridge. While a Xet-aware client receives file reconstruction information from the CAS to download a Xet-backed file, a legacy client gets a single URL from the bridge, which does the work of reconstructing the requested file and returning a URL to the resource. This allows downloading files through a URL so that you can continue to use the Hub's web interface or `curl`.
By having LFS file uploads automatically migrate, and by letting older clients continue to download files from Xet-backed repositories, maintainers and the rest of the Hub can update their pipelines at their own pace. Xet storage provides a seamless transition for existing Hub repositories; it isn't necessary to know whether the Xet backend is involved at all.

Xet-backed repositories continue to use the Git LFS pointer file format; the `Xet backed hash` is only surfaced in the web interface as a convenience. Practically, this means existing repos and newly created repos will not look any different if you do a bare clone of them. Each of the large (binary) files will continue to have a pointer file that matches the Git LFS pointer file specification. This symmetry allows non-Xet-aware clients (e.g., older versions of `huggingface_hub`) to interact with Xet-backed repositories without concern. In fact, a mixture of Git LFS and Xet-backed files is supported within a single repository. The Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services to request the proper URL(s) from S3, regardless of which storage system holds the content.

## Legacy Storage: Git LFS

The legacy storage system on the Hub, Git LFS, utilizes many of the same conventions as Xet-backed repositories. The Hub's Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3, using the SHA256 hash to name the file for future access. This storage architecture is relatively simple and has allowed the Hub to store millions of models, datasets, and Spaces repositories' files.

The primary limitation of Git LFS is its file-centric approach to deduplication.
Any change to a file, irrespective of how large or small that change is, means the entire file is versioned, incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). This leads to a worse developer experience along with a proliferation of additional storage.

### Security Model

https://huggingface.co/docs/hub/xet/security.md

# Security Model

Xet storage provides data deduplication over all chunks stored on Hugging Face. This is done via cryptographic hashing in a privacy-preserving way. The contents of chunks are protected and are associated with repository permissions, i.e., you can only read chunks that are required to reproduce files you have access to, and no more. More information and details on how deduplication is done in a privacy-preserving way are described in the [Xet Protocol Specification](https://huggingface.co/docs/xet/deduplication).