Explore skills
5,474 skills found
tensorboard
Visualize training metrics, debug models with histograms, compare experiments, visualize model graphs, and profile performance with TensorBoard - Google's ML visualization toolkit.
optimizing-attention-flash
Optimizes transformer attention with Flash Attention, achieving 2-4x speedup and 10-20x memory reduction. Use it for long sequences (>512 tokens), GPU memory issues, or faster inference. Supports PyTorch native SDPA, flash-attn, H100 FP8, and sliding window attention.
axolotl
Expert guidance for fine-tuning LLMs with Axolotl, covering YAML configurations, over 100 models, LoRA/QLoRA, DPO/KTO/ORPO/GRPO, and multimodal support.
huggingface-accelerate
The simplest distributed training API, adding distributed support to any PyTorch script in just 4 lines. It offers a unified API for DeepSpeed/FSDP/Megatron/DDP, automatic device placement, and mixed precision, and is a HuggingFace ecosystem standard.
training-llms-megatron
Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use it for models with more than 1B parameters, when you need maximum GPU efficiency (47% MFU on H100), or when training requires various parallelism types. A production-ready framework used for Nemotron, LLaMA, and DeepSeek.
peft-fine-tuning
Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use it for fine-tuning large models (7B-70B) with limited GPU memory, training <1% of parameters with minimal accuracy loss, or multi-adapter serving. It is HuggingFace's official library, integrated with the transformers ecosystem.
mamba-architecture
Mamba is a state-space model with O(n) complexity in sequence length, compared with O(n²) for Transformers. It offers 5x faster inference, million-token sequences, and no KV cache. It uses a selective SSM with a hardware-aware design, and Mamba-1 and Mamba-2 models are available on HuggingFace.
sparse-autoencoder-training
Provides guidance for training and analyzing Sparse Autoencoders (SAEs) using SAELens to decompose neural network activations into interpretable features. Use when discovering interpretable features, analyzing superposition, or studying monosemantic representations in language models.
autoresearch
Orchestrates end-to-end autonomous AI research projects using a two-loop architecture. The inner loop runs rapid experiment iterations with optimization targets, while the outer loop synthesizes results to steer research direction.
rwkv-architecture
An RNN+Transformer hybrid with O(n) (linear-time) inference and infinite context without a KV cache. It trains like GPT and runs inference like an RNN. Used in Windows, Office, and NeMo, with models up to 14B parameters.
nemo-curator
GPU-accelerated data curation for LLM training, supporting text, image, video, and audio. It features fuzzy deduplication (16x faster), quality filtering, semantic deduplication, PII redaction, and NSFW detection, scaling across GPUs with RAPIDS to prepare high-quality datasets.
quantizing-models-bitsandbytes
Quantizes LLMs to 8-bit or 4-bit, reducing memory by 50-75% with minimal accuracy loss. Ideal when GPU memory is limited or faster inference is needed. It supports INT8, NF4, FP4, QLoRA training, and 8-bit optimizers, and works with HuggingFace Transformers.