# Hardware Selection Guide

Choosing the right hardware (flavor) is critical for cost-effective workloads.

## Available Hardware

### CPU

- `cpu-basic` - Basic CPU, testing only
- `cpu-upgrade` - Enhanced CPU

**Use cases:** Data processing, testing scripts, lightweight workloads

**Not recommended for:** Model training, GPU-accelerated workloads

### GPU Options

| Flavor | GPU | Memory | Use Case | Cost/hour |
|--------|-----|--------|----------|-----------|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos, batch inference | ~$0.50-1 |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient workloads | ~$2-3 |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU workloads | ~$8-12 |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast workloads | ~$8-12 |

## Selection Guidelines

### By Workload Type

**Data Processing**

- **Recommended:** `cpu-upgrade` or `l4x1`
- **Use case:** Transform, filter, analyze datasets
- **Batch size:** Depends on data size
- **Time:** Varies by dataset size

**Batch Inference**

- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Run inference on thousands of samples
- **Batch size:** 8-32 depending on model
- **Time:** Depends on number of samples

**Experiments & Benchmarks**

- **Recommended:** `a10g-small` or `a10g-large`
- **Use case:** Reproducible ML experiments
- **Batch size:** Varies
- **Time:** Depends on experiment complexity

**Model Training** (see `model-trainer` skill for details)

- **Recommended:** See `model-trainer` skill
- **Use case:** Fine-tuning models
- **Batch size:** Depends on model size
- **Time:** Hours to days

**Synthetic Data Generation**
- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Generate datasets using LLMs
- **Batch size:** Depends on generation method
- **Time:** Hours for large datasets

### By Budget

**Minimal Budget (<$5 total)**

- Use `cpu-basic` or `t4-small`
- Process small datasets
- Quick tests and demos

**Small Budget ($5-20)**

- Use `t4-medium` or `a10g-small`
- Process medium datasets
- Run experiments

**Medium Budget ($20-50)**

- Use `a10g-small` or `a10g-large`
- Process large datasets
- Production workloads

**Large Budget ($50-200)**

- Use `a10g-large` or `a100-large`
- Large-scale processing
- Multiple experiments

### By Model Size (for inference/processing)

**Tiny Models (<1B parameters)**

- **Recommended:** `t4-small`
- **Example:** Qwen2.5-0.5B, TinyLlama
- **Batch size:** 8-16

**Small Models (1-3B parameters)**

- **Recommended:** `t4-medium` or `a10g-small`
- **Example:** Qwen2.5-1.5B, Phi-2
- **Batch size:** 4-8

**Medium Models (3-7B parameters)**

- **Recommended:** `a10g-small` or `a10g-large`
- **Example:** Qwen2.5-7B, Mistral-7B
- **Batch size:** 2-4

**Large Models (7-13B parameters)**

- **Recommended:** `a10g-large` or `a100-large`
- **Example:** Llama-3-8B
- **Batch size:** 1-2

**Very Large Models (13B+ parameters)**

- **Recommended:** `a100-large`
- **Example:** Llama-2-13B, Llama-3-70B
- **Batch size:** 1

## Memory Considerations

### Estimating Memory Requirements

**For inference:**

```
Memory (GB) ≈ (Model params in billions) × 2-4
```

**For training:**

```
Memory (GB) ≈ (Model params in billions) × 20 (full) or × 4 (LoRA)
```

**Examples:**

- Qwen2.5-0.5B inference: ~1-2GB ✅ fits t4-small
- Qwen2.5-7B inference: ~14-28GB ✅ fits a10g-large
- Qwen2.5-7B training: ~140GB ❌ not feasible without LoRA

### Memory Optimization

If hitting memory limits:

1. **Reduce batch size**

   ```python
   batch_size = 1
   ```

2. **Process in chunks**

   ```python
   for chunk in chunks:
       process(chunk)
   ```

3. **Use smaller models**
   - Use quantized models
   - Use LoRA adapters
4. **Upgrade hardware**
   - cpu → t4 → a10g → a100

## Cost Estimation

### Formula

```
Total Cost = (Hours of runtime) × (Cost per hour)
```

### Example Calculations

**Data processing:**

- Hardware: `cpu-upgrade` ($0.50/hour)
- Time: 1 hour
- Cost: $0.50

**Batch inference:**

- Hardware: `a10g-large` ($5/hour)
- Time: 2 hours
- Cost: $10.00

**Experiments:**

- Hardware: `a10g-small` ($3.50/hour)
- Time: 4 hours
- Cost: $14.00

### Cost Optimization Tips

1. **Start small:** Test on `cpu-basic` or `t4-small`
2. **Monitor runtime:** Set appropriate timeouts
3. **Optimize code:** Reduce unnecessary compute
4. **Choose the right hardware:** Don't over-provision
5. **Use checkpoints:** Resume if a job fails
6. **Monitor costs:** Check running jobs regularly

## Multi-GPU Workloads

Multi-GPU flavors make multiple GPUs available to a single job, so your code (for example, a data-parallel framework) can distribute work across them.

**Multi-GPU flavors:**

- `l4x4` - 4x L4 GPUs
- `a10g-largex2` - 2x A10G GPUs
- `a10g-largex4` - 4x A10G GPUs

**When to use:**

- Large models (>13B parameters)
- Need faster processing (near-linear speedup for parallel workloads)
- Large datasets (>100K samples)
- Parallel workloads

**Example:**

```python
hf_jobs("uv", {
    "script": "process.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

## Choosing Between Options

### CPU vs GPU

**Choose CPU when:**

- No GPU acceleration needed
- Data processing only
- Budget constrained
- Simple workloads

**Choose GPU when:**

- Model inference/training
- GPU-accelerated libraries
- Need faster processing
- Large models

### a10g vs a100

**Choose a10g when:**

- Model <13B parameters
- Budget conscious
- Processing time not critical

**Choose a100 when:**

- Model 13B+ parameters
- Need fastest processing
- Memory requirements high
- Budget allows

### Single vs Multi-GPU

**Choose single GPU when:**

- Model <7B parameters
- Budget constrained
- Simpler debugging

**Choose multi-GPU when:**

- Model >13B parameters
- Need faster processing
- Large batch sizes required
- Cost-effective for large jobs

## Quick Reference
```python
# Workload type → hardware selection
HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "training": "see model-trainer skill",
}
```
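The memory-estimation formulas and the flavor table above can be combined into a rough sizing helper. This is a sketch, not an official API: the function names (`estimate_memory_gb`, `smallest_flavor`) are hypothetical, the per-flavor memory figures come from the table, and the 20% headroom factor for activations/KV cache is an assumption.

```python
from typing import Optional

# GPU memory per flavor, from the table above (total across all GPUs).
FLAVOR_MEMORY_GB = {
    "t4-small": 16,
    "t4-medium": 16,
    "l4x1": 24,
    "l4x4": 96,
    "a10g-small": 24,
    "a10g-large": 24,
    "a10g-largex2": 48,
    "a10g-largex4": 96,
    "a100-large": 40,
}

# Multipliers from the estimation formulas: fp16 inference ≈ ×2 GB per
# billion params, LoRA training ≈ ×4, full fine-tuning ≈ ×20.
MODE_MULTIPLIER = {"inference": 2, "lora": 4, "full": 20}


def estimate_memory_gb(params_b: float, mode: str = "inference") -> float:
    """Rough memory estimate in GB for a model with `params_b` billion params."""
    return params_b * MODE_MULTIPLIER[mode]


def smallest_flavor(params_b: float, mode: str = "inference") -> Optional[str]:
    """Smallest-memory flavor that fits the estimate plus ~20% headroom
    for activations/KV cache (an assumption), or None if nothing fits."""
    need = estimate_memory_gb(params_b, mode) * 1.2
    for flavor, mem in sorted(FLAVOR_MEMORY_GB.items(), key=lambda kv: kv[1]):
        if mem >= need:
            return flavor
    return None
```

For example, `smallest_flavor(7)` lands on a 24GB flavor, consistent with the 3-7B recommendations above, while `smallest_flavor(7, "full")` returns `None`, echoing the "not feasible without LoRA" example in the memory section.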