DataRobot Model Training Skill

This skill provides guidance for the complete model training workflow in DataRobot, from project creation through model selection and validation.

Quick Start

Most common use case: Create a project and train models

Upload dataset: upload_dataset(file_path, dataset_name) uploads the training data
Create project: create_project(dataset_id, project_name) creates a new project
Start training: start_automl(project_id, mode) begins AutoML training

Example: "Create a new project with sales_data.csv, set 'revenue' as target, and start Quick AutoML training"
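The three quick-start steps can be sketched as thin wrappers over the Python SDK. This is a minimal sketch rather than the skill's actual implementation. It assumes the SDK client is already configured. The target argument on start_automl is an addition, because the SDK sets the target and starts Autopilot in a single set_target call. resolve_mode_name is a hypothetical helper.

```python
VALID_MODES = ("QUICK", "MANUAL", "COMPREHENSIVE")


def resolve_mode_name(mode):
    """Normalize a user-facing mode such as 'Quick' to an AUTOPILOT_MODE attribute name."""
    name = mode.strip().upper()
    if name not in VALID_MODES:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {VALID_MODES}")
    return name


def upload_dataset(file_path, dataset_name):
    """Upload a local file and return the new dataset's ID."""
    import datarobot as dr  # imported lazily so resolve_mode_name works without the SDK

    dataset = dr.Dataset.create_from_file(file_path=file_path)
    dataset.modify(name=dataset_name)  # the name otherwise defaults to the file name
    return dataset.id


def create_project(dataset_id, project_name):
    """Create a project from an uploaded dataset and return the project ID."""
    import datarobot as dr

    project = dr.Project.create_from_dataset(
        dataset_id=dataset_id, project_name=project_name
    )
    return project.id


def start_automl(project_id, mode, target):
    """Set the target, which also starts Autopilot in the requested mode."""
    import datarobot as dr

    project = dr.Project.get(project_id)
    autopilot_mode = getattr(dr.AUTOPILOT_MODE, resolve_mode_name(mode))
    project.set_target(target=target, mode=autopilot_mode)
    return project
```

With these in place, the example request above maps to upload_dataset("sales_data.csv", "Sales Data"), then create_project(...), then start_automl(project_id, "Quick", "revenue").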

When to use this skill

Use this skill when you need to:

Create new DataRobot projects
Upload training datasets
Configure AutoML experiments
Monitor training progress
Select and compare models
Understand feature engineering results
Export trained models

Key capabilities

1. Project Management

Create new projects with appropriate settings
Upload datasets (CSV, Parquet, database connections)
Configure project settings (target, partitioning, time series)
Manage multiple projects and experiments

2. AutoML Configuration

Set training modes (Quick, Manual, Comprehensive)
Configure feature engineering options
Set time limits and resource constraints
Choose algorithms and model types

3. Training Execution

Start AutoML training runs
Monitor training progress
Handle training errors and warnings
Pause/resume training if needed

4. Model Analysis

Compare model performance metrics
Review feature importance
Analyze model insights and explanations
Select best models for deployment

Workflow examples

Example 1: Create and train a new project

User request: "Create a new project using my sales_data.csv file, predict 'revenue' as the target, and start AutoML training."

Agent workflow:

Upload the dataset to DataRobot
Create a new project with the dataset
Set 'revenue' as the target variable
Configure project settings (detect partitioning, handle time series if needed)
Start AutoML training with appropriate mode
Monitor training progress
Report when training completes with top model metrics

Example 2: Configure advanced training options

User request: "Train a model with time series settings: datetime column 'date', series ID 'store_id', forecast window 1-7 days."

Agent workflow:

Create project with time series configuration
Set datetime column and series ID columns
Configure forecast window (1-7 days)
Set appropriate time series validation
Start training with time series-aware algorithms
Monitor progress and report results
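The settings in this request can be collected and checked before any API call is made. The sketch below is one way to do that, under stated assumptions: time_series_settings and build_partitioning_spec are hypothetical helper names, the forecast window values are in the time unit DataRobot detects from the data (assumed to be days here), and the keyword names follow the SDK's DatetimePartitioningSpecification as I understand it.

```python
def time_series_settings(datetime_column, series_id_columns, window_start, window_end):
    """Collect and validate keyword arguments for a datetime partitioning spec."""
    if window_start > window_end:
        raise ValueError("forecast window start must not exceed its end")
    if not series_id_columns:
        raise ValueError("a multiseries project needs at least one series ID column")
    return {
        "datetime_partition_column": datetime_column,
        "use_time_series": True,
        "multiseries_id_columns": list(series_id_columns),
        "forecast_window_start": window_start,
        "forecast_window_end": window_end,
    }


def build_partitioning_spec(settings):
    """Turn validated settings into an SDK object (requires the datarobot package)."""
    import datarobot as dr  # imported lazily so validation works without the SDK

    return dr.DatetimePartitioningSpecification(**settings)


# Settings for the request above: 'date', 'store_id', forecast window 1-7
settings = time_series_settings("date", ["store_id"], 1, 7)
```

Validating first means a reversed window or a missing series ID is reported to the user immediately instead of surfacing later as a project configuration error.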

Using DataRobot SDK

This skill guides you to use the DataRobot Python SDK directly. Install the SDK if needed:

pip install datarobot

Key SDK Operations

Use these DataRobot SDK methods for model training:

Projects:

dr.Project.create_from_dataset(dataset_id, project_name) - Create project
dr.Project.get(project_id) - Get project details
dr.Project.list() - List all projects
project.set_target(target, mode) - Set target variable and start Autopilot

Training:

project.wait_for_autopilot() - Block until Autopilot finishes
project.get_status() - Check training status (stage, autopilot_done)
project.get_models() - List trained models on the Leaderboard
dr.Model.get(project_id, model_id) - Get model details

Model Analysis:

model.metrics - Performance metrics, keyed by metric name and then by partition
model.get_or_request_feature_impact() - Get feature importance, computing it if needed

See the Common Patterns section below for complete examples.

Helper Scripts

This skill includes executable helper scripts that Claude can run directly:

scripts/create_project.py - Create a new project from a dataset
scripts/start_training.py - Start AutoML training
scripts/list_models.py - List trained models with metrics

Usage example:

# Create project and set target
python scripts/create_project.py dataset_123 "Sales Prediction" revenue

# Start training
python scripts/start_training.py project_456 Quick

# List models
python scripts/list_models.py project_456 AUC

Claude can run these scripts directly or use them as reference when writing code.

Best practices

Data preparation: Ensure data is clean and properly formatted before upload
Target selection: Choose appropriate target variable (avoid leakage)
Partitioning: Use proper partitioning for time-aware or grouped data
Feature engineering: Let AutoML handle feature engineering, but review results
Model selection: Compare multiple models, not just the top performer
Validation: Review validation strategy and ensure it matches your use case

Common patterns

Pattern 1: Standard classification/regression

import os

import datarobot as dr

# Initialize client (the endpoint must include the /api/v2 path)
client = dr.Client(
    token=os.getenv("DATAROBOT_API_TOKEN"),
    endpoint=os.getenv("DATAROBOT_ENDPOINT")
)

# Upload dataset (the name defaults to the file name, so rename afterwards)
dataset = dr.Dataset.create_from_file(file_path="training_data.csv")
dataset.modify(name="Sales Data")

# Create project
project = dr.Project.create_from_dataset(
    dataset_id=dataset.id,
    project_name="Sales Prediction"
)

# Set target; this call also starts Autopilot in the chosen mode (Quick)
project.set_target(
    target="revenue",
    mode=dr.AUTOPILOT_MODE.QUICK
)

# Block until Autopilot finishes (timeout in seconds)
project.wait_for_autopilot(timeout=3600)

# Get trained models; 'revenue' is numeric, so rank by RMSE (lower is better), not AUC
models = project.get_models()
scored = [m for m in models if m.metrics.get("RMSE", {}).get("validation") is not None]
best_model = min(scored, key=lambda m: m.metrics["RMSE"]["validation"])
print(f"Best model: {best_model.id}, RMSE: {best_model.metrics['RMSE']['validation']}")
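When the agent needs to report progress while training runs, the monitoring step can be written as an explicit polling loop. The sketch below is an illustration, not part of the SDK: wait_for_autopilot_done is a hypothetical helper, and it assumes the status dictionary carries autopilot_done and stage keys, which is my understanding of what project.get_status() returns.

```python
import time


def wait_for_autopilot_done(get_status, interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Poll get_status() until it reports autopilot_done, or raise on timeout.

    get_status is any zero-argument callable returning a dict; with the SDK
    this would be project.get_status. sleep is injectable for testing.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status.get("autopilot_done"):
            return status
        if waited >= timeout:
            raise TimeoutError(
                f"Autopilot not done after {waited:.0f}s (stage: {status.get('stage')})"
            )
        # Progress line the agent can relay to the user
        print(f"stage={status.get('stage')}, waited {waited:.0f}s")
        sleep(interval)
        waited += interval
```

With a live project this would be called as wait_for_autopilot_done(project.get_status). The SDK's own project.wait_for_autopilot() is the simpler choice when no custom progress reporting is needed.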

Pattern 2: Time series forecasting

import datarobot as dr

# Assumes dr.Client(...) has already been initialized (see SDK Setup)

# Upload dataset (the name defaults to the file name, so rename afterwards)
dataset = dr.Dataset.create_from_file(file_path="sales_data.csv")
dataset.modify(name="Sales Forecast Data")

# Create project
project = dr.Project.create_from_dataset(
    dataset_id=dataset.id,
    project_name="Sales Forecast"
)

# Configure time series settings
time_partition = dr.DatetimePartitioningSpecification(
    datetime_partition_column="date",
    use_time_series=True,
    multiseries_id_columns=["store_id"],
    forecast_window_start=1,
    forecast_window_end=7
)

# Set target; this call also starts Autopilot with datetime partitioning
project.set_target(
    target="sales",
    mode=dr.AUTOPILOT_MODE.COMPREHENSIVE,
    partitioning_method=time_partition
)

# Wait for completion and get results
project.wait_for_autopilot(timeout=7200)
models = project.get_models()

Model selection criteria

When selecting models, consider:

Performance metrics: Accuracy, AUC, RMSE, MAPE (depending on problem type)
Prediction speed: Important for real-time deployments
Interpretability: Some models are more explainable
Feature requirements: Some models need specific feature types
Deployment constraints: Consider model size and resource requirements
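The performance-metric criterion can be applied programmatically by ranking models on one metric and partition. This is a minimal sketch: rank_models is a hypothetical helper, and it assumes each model exposes a metrics dictionary shaped like {"RMSE": {"validation": ..., "holdout": ...}}, which is how I understand the SDK's model.metrics attribute. The stand-in objects and scores are made up for illustration.

```python
from types import SimpleNamespace


def rank_models(models, metric, partition="validation", higher_is_better=False):
    """Return models sorted best-first by metrics[metric][partition].

    Models without a score for that metric and partition are skipped.
    """
    scored = [
        m for m in models
        if (m.metrics.get(metric) or {}).get(partition) is not None
    ]
    return sorted(
        scored,
        key=lambda m: m.metrics[metric][partition],
        reverse=higher_is_better,
    )


# Stand-in objects shaped like SDK models; IDs and scores are illustrative only
leaderboard = [
    SimpleNamespace(id="m1", metrics={"RMSE": {"validation": 12.4, "holdout": None}}),
    SimpleNamespace(id="m2", metrics={"RMSE": {"validation": 9.8, "holdout": None}}),
    SimpleNamespace(id="m3", metrics={"RMSE": {"validation": None, "holdout": None}}),
]
top = rank_models(leaderboard, "RMSE")
print([m.id for m in top])  # ['m2', 'm1']
```

Pass higher_is_better=True for metrics such as AUC or accuracy. Keeping the full ranked list, rather than only the first entry, makes it easier to weigh the other criteria (speed, interpretability, deployment constraints) across the top few candidates.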

Error handling

Common errors and solutions:

Dataset upload failures: Check file format, size limits, encoding
Target errors: Ensure target column exists and has appropriate values
Training failures: Check data quality, feature types, missing values
Timeout errors: Adjust time limits or use Quick mode for initial exploration
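Several of these errors can be caught locally before anything is uploaded. The sketch below checks a CSV file for the most common causes. check_csv_before_upload is a hypothetical helper, the size limit is supplied by the caller (no particular platform limit is assumed), and only the header row is decoded, so encoding problems later in the file would not be detected.

```python
import csv
import os


def check_csv_before_upload(path, target, max_bytes=None):
    """Return a list of problems found before upload; an empty list means none found."""
    if not os.path.isfile(path):
        return [f"file not found: {path}"]
    if os.path.splitext(path)[1].lower() != ".csv":
        return ["only .csv files are checked by this sketch"]

    problems = []
    if max_bytes is not None and os.path.getsize(path) > max_bytes:
        problems.append(f"file is larger than {max_bytes} bytes")

    try:
        # Read just the header row to confirm the encoding and the target column
        with open(path, newline="", encoding="utf-8") as handle:
            header = next(csv.reader(handle), [])
    except UnicodeDecodeError:
        return problems + ["header could not be decoded as UTF-8"]

    if target not in header:
        problems.append(f"target column {target!r} not found in header {header}")
    return problems
```

An agent could run this on sales_data.csv with target 'revenue' and report any problems to the user before calling the upload step.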

SDK Setup

Install DataRobot SDK

pip install datarobot

Initialize Client

import datarobot as dr
import os

client = dr.Client(
    token=os.getenv("DATAROBOT_API_TOKEN"),
    endpoint=os.getenv("DATAROBOT_ENDPOINT", "https://app.datarobot.com/api/v2")
)
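A missing token or an endpoint without the API path tends to surface only as an authentication or connection error on the first request. The sketch below checks both up front. load_credentials and connect are hypothetical helpers, and the assumption that the endpoint must end in /api/v2 reflects my understanding of the SDK's expected base URL.

```python
import os

DEFAULT_ENDPOINT = "https://app.datarobot.com/api/v2"


def load_credentials(env=None):
    """Read the token and endpoint from the environment, failing early if the token is missing."""
    env = os.environ if env is None else env
    token = env.get("DATAROBOT_API_TOKEN")
    if not token:
        raise RuntimeError("DATAROBOT_API_TOKEN is not set")
    endpoint = env.get("DATAROBOT_ENDPOINT", DEFAULT_ENDPOINT).rstrip("/")
    if not endpoint.endswith("/api/v2"):
        endpoint += "/api/v2"  # normalize a bare host URL to the API base path
    return token, endpoint


def connect(env=None):
    """Create the SDK client from validated credentials (requires the datarobot package)."""
    import datarobot as dr  # imported lazily so load_credentials works without the SDK

    token, endpoint = load_credentials(env)
    return dr.Client(token=token, endpoint=endpoint)
```

Passing a plain dictionary as env makes the credential check easy to exercise without touching the real environment.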

Resources

DataRobot Python SDK Documentation
DataRobot AutoML Documentation
General Modeling Documentation – Time Series
General Modeling Documentation – Feature Engineering
