Published skills
tensorrt-llm
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
autogpt-agents
Autonomous AI agent platform for building and deploying continuous agents. Use when creating visual workflow agents, deploying persistent autonomous agents, or building complex multi-step AI automation systems.
guidance
Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework.
nanogpt
A ~300-line educational GPT implementation by Andrej Karpathy, reproducing GPT-2 (124M) on OpenWebText. It offers clean, hackable code perfect for learning transformers and understanding GPT architecture from scratch, trainable on Shakespeare (CPU) or OpenWebText (multi-GPU).
pytorch-lightning
A high-level PyTorch framework featuring a Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), and a callbacks system, designed for minimal boilerplate. It scales from laptops to supercomputers with the same code, providing clean training loops with built-in best practices.
skypilot-multi-cloud-orchestration
Multi-cloud orchestration for ML workloads with automatic cost optimization. Use when you need to run training or batch jobs across multiple clouds, leverage spot instances with auto-recovery, or optimize GPU costs across providers.
serving-llms-vllm
Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization, and tensor parallelism.
hqq-quantization
Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.
weights-and-biases
Track ML experiments with automatic logging, visualize training in real-time, optimize hyperparameters with sweeps, and manage model registry with W&B - a collaborative MLOps platform.
evolving-ai-agents
Provides guidance for automatically evolving and optimizing AI agents across any domain using LLM-driven evolution algorithms. Use when building self-improving agents, optimizing agent prompts and skills against benchmarks, or implementing automated agent evaluation loops.
llama-cpp
Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. It supports GGUF quantization (1.5-8 bit) for reduced memory and a 4-10x speedup vs PyTorch on CPU.
sglang
Fast structured generation and serving for LLMs using RadixAttention prefix caching. Use for JSON/regex outputs, constrained decoding, agentic workflows, or when you need 5x faster inference than vLLM through prefix sharing. It is deployed on over 300,000 GPUs at major tech companies.
deepspeed
Expert guidance for distributed training with DeepSpeed, covering ZeRO optimization stages, pipeline parallelism, FP16/BF16/FP8, 1-bit Adam, and sparse attention.
evaluating-llms-harness
Evaluates LLMs across 60+ academic benchmarks like MMLU and HumanEval. It's an industry standard for benchmarking model quality, comparing models, and tracking training progress, supporting HuggingFace, vLLM, and APIs.
nemo-guardrails
NVIDIA's runtime safety framework for LLM applications. It provides jailbreak, hallucination, and toxicity detection alongside input/output validation, fact-checking, and PII filtering. Programmable rails are written in the Colang 2.0 DSL; the framework is production-ready and runs on T4 GPUs.
mlflow
Track ML experiments, manage model registry with versioning, deploy models to production, and reproduce experiments with MLflow, a framework-agnostic ML lifecycle platform.
constitutional-ai
Anthropic's method for training harmless AI through self-improvement. It uses a two-phase approach: supervised learning with self-critique and revision, followed by RLAIF. Use for safety alignment and for reducing harmful outputs without human labels. It powers Claude's safety system.
ray-train
Orchestrates distributed training for PyTorch/TensorFlow/HuggingFace across clusters, scaling from laptops to thousands of nodes. It includes built-in hyperparameter tuning (Ray Tune), fault tolerance, and elastic scaling. Use for massive models or distributed hyperparameter sweeps.
nnsight-remote-interpretability
Provides guidance for interpreting and manipulating neural network internals using nnsight, with optional NDIF remote execution. Use when needing to run interpretability experiments on massive models (70B+) without local GPU resources, or with any PyTorch architecture.
grpo-rl-training
Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training.
fine-tuning-with-trl
Fine-tune LLMs with TRL: SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use for RLHF, aligning models with preferences, or training from human feedback. Integrates with HuggingFace Transformers.
huggingface-tokenizers
Fast, Rust-based tokenizers optimized for research and production, processing 1GB in under 20 seconds. They support BPE, WordPiece, and Unigram, offering custom vocabulary training and seamless integration with transformers for high-performance tokenization.
openrlhf-training
A high-performance RLHF framework for PPO, GRPO, RLOO, and DPO training of large models (7B-70B+). Built on Ray, vLLM, and ZeRO-3, it trains 2x faster than DeepSpeedChat through its distributed architecture and GPU resource sharing.
gguf-quantization
GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible 2-8 bit quantization without GPU requirements.
evaluating-code-models
Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ other benchmarks with pass@k metrics. This harness from the BigCode Project is an industry standard used by HuggingFace leaderboards. Use for benchmarking code models, comparing coding abilities, and testing multi-language support.
pyvene-interventions
Provides guidance for performing causal interventions on PyTorch models using pyvene's declarative intervention framework. Use when conducting causal tracing, activation patching, interchange intervention training, or testing causal hypotheses about model behavior.
miles-rl-training
Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use when training large MoE models with FP8/INT4, needing train-inference alignment, or requiring speculative RL for maximum throughput.
prompt-guard
Meta's 86M-parameter prompt injection and jailbreak detector filters malicious prompts and third-party data for LLM applications. It reports over 99% TPR and under 1% FPR, runs fast (<2ms on GPU), covers 8 languages, and can be deployed via HuggingFace or batch processing for RAG security.
gptq
Post-training 4-bit quantization for LLMs with minimal accuracy loss. It enables deploying large models (70B, 405B) on consumer GPUs, offering 4x memory reduction with <2% perplexity degradation and 3-4x faster inference than FP16. It integrates with transformers and PEFT for QLoRA fine-tuning.
ray-data
Scalable data processing for ML workloads with streaming execution across CPU/GPU, supporting various formats like Parquet/CSV/JSON/images. It integrates with Ray Train, PyTorch, and TensorFlow, scaling from a single machine to hundreds of nodes for tasks like batch inference, data preprocessing, and distributed ETL.
verl-rl-training
Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL).
lambda-labs-gpu-cloud
Reserved and on-demand GPU cloud instances for ML training and inference. Use when you need dedicated GPU instances with simple SSH access, persistent filesystems, or high-performance multi-node clusters for large-scale training.
instructor
Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, parse complex JSON with type safety, and stream partial results with Instructor - a battle-tested structured output library.
outlines
Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library.
long-context
Extend transformer model context windows using RoPE, YaRN, ALiBi, and position interpolation techniques. This is useful for processing long documents, extending pre-trained models, or implementing efficient positional encodings, covering various embedding and extrapolation strategies for LLMs.
brainstorming-research-ideas
Guides researchers through structured ideation frameworks to discover high-impact research directions. Use when exploring new problem spaces, pivoting between projects, or seeking novel angles on existing work.
qdrant-vector-search
High-performance vector similarity search engine for RAG and semantic search. Use it for production RAG systems needing fast nearest neighbor search, hybrid search with filtering, or scalable Rust-powered vector storage.
ml-paper-writing
Write publication-ready ML/AI papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Use when drafting papers from research repos, structuring arguments, verifying citations, or preparing camera-ready submissions; for systems venues, use 'systems-paper-writing'.
model-merging
Merge multiple fine-tuned models with mergekit to combine capabilities without retraining, ideal for creating specialized models by blending domain-specific expertise or improving performance. It covers various merging techniques like SLERP, TIES-Merging, DARE, Task Arithmetic, and linear merging, plus production deployment strategies.
segment-anything-model
Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.
presenting-conference-talks
Generates conference presentation slides (Beamer LaTeX PDF and editable PPTX) from a compiled paper with speaker notes and talk script. Use when preparing oral talks, spotlight presentations, or invited talks for ML and systems conferences.
implementing-llms-litgpt
Implements and trains LLMs using Lightning AI's LitGPT, supporting over 20 pretrained architectures like Llama, Gemma, Phi, Qwen, and Mistral. It's suitable for clean model implementations, educational understanding of architectures, or production fine-tuning with LoRA/QLoRA, featuring single-file implementations without abstraction layers.
awq-quantization
This 4-bit LLM compression method, winner of the MLSys 2024 Best Paper Award, uses activation-aware weight quantization, providing a 3x speedup and minimal accuracy loss. It's ideal for deploying large models on limited GPU memory or for faster, more accurate inference than GPTQ, especially for instruction-tuned and multimodal models.
nemo-evaluator-sdk
NVIDIA's enterprise-grade platform evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. It provides scalable evaluation on local Docker, Slurm HPC, or cloud platforms, featuring a container-first architecture for reproducible benchmarking.
pytorch-fsdp2
Adds PyTorch FSDP2 (fully_shard) to training scripts with correct init, sharding, mixed precision/offload config, and distributed checkpointing. Use when models exceed single-GPU memory or when you need DTensor-based sharding with DeviceMesh.
distributed-llm-pretraining-torchtitan
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). It is ideal for pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs, leveraging Float8, torch.compile, and distributed checkpointing.
tensorboard
Visualize training metrics, debug models with histograms, compare experiments, visualize model graphs, and profile performance with TensorBoard - Google's ML visualization toolkit.
optimizing-attention-flash
Optimizes transformer attention with Flash Attention, achieving 2-4x speedup and 10-20x memory reduction. Use for long sequences (>512 tokens), GPU memory issues, or faster inference. Supports PyTorch native SDPA, flash-attn, H100 FP8, and sliding window attention.