# Data for Paper Table Reproduction

This directory contains the minimal experimental results needed to reproduce the main results table from the Aurora paper.

## Directory Structure

```
data-for-export/
├── README.md (this file)
├── deep_drebin_svc/
│   ├── parallel_ce_no_aug_v2.pkl     # DeepDrebin experiments (AndroZoo, API-Graph)
│   ├── parallel_svc_v2.pkl            # SVM baseline experiments (AndroZoo, API-Graph)
│   └── transcendent/                  # Transcendent dataset results
│       ├── deep_drebin_full_*.pkl
│       ├── deep_drebin_subsampled_*.pkl
│       ├── drebin_full_*.pkl
│       └── drebin_subsampled_*.pkl
└── others_v2/
    ├── hcc_mlp_warm-androzoo.json
    ├── hcc_mlp_warm-androzoo-subsampling.json
    ├── hcc_mlp_warm-apigraph.json
    ├── hcc_mlp_warm-apigraph-subsampling.json
    ├── hcc_mlp_warm-transcendent.json
    ├── hcc_mlp_warm-transcendent-subsampling.json
    ├── cade_mlp_cold-androzoo.json
    ├── cade_mlp_cold-apigraph.json
    ├── cade_mlp_cold-transcendent.json
    ├── cade_mlp_warm-androzoo.json
    ├── cade_mlp_warm-apigraph.json
    └── cade_mlp_warm-transcendent.json
```

## File Descriptions

### DeepDrebin Results (Pickle Format)

**`parallel_ce_no_aug_v2.pkl`** (~1.3GB)
- Deep neural network classifier (DeepDrebin architecture)
- Cross-entropy training without augmentation
- Multiple sampling strategies and label budgets
- AndroZoo and API-Graph datasets (Transcendent results are stored separately under `transcendent/`)
- 5 random seeds per configuration

**`parallel_svc_v2.pkl`** (~1.6GB)
- Linear SVM baseline (Drebin)
- Margin-based uncertainty estimation
- Same experimental configurations as DeepDrebin

### HCC Results (JSON Format)

**`hcc_mlp_warm-*.json`** (~2GB total)
- HCC (Hierarchical Contrastive Classifier) adaptation method
- Warm-start training (uses validation data for adaptation)
- Includes both full first-year and subsampled variants
- NCM expansion creates Pseudo-Loss and MSP uncertainty variants

### CADE Results (JSON Format)

**`cade_mlp_*.json`** (~3GB total)
- CADE (Contrastive Autoencoder for Drifting detection and Explanation) method
- Both cold-start and warm-start configurations
- NCM expansion creates OOD and MSP uncertainty variants

## Data Format

### Pickle Files (.pkl)

Each file is a Python pickle containing a list of result dictionaries. Each result dict contains:
- `Dataset`: str - "androzoo", "apigraph", or "transcendent"
- `Sampler-Mode`: str - Training data sampling strategy
- `Monthly-Label-Budget`: int - Number of labels per month (50, 100, 200, 400)
- `Random-Seed`: int - Random seed (0-4)
- `Predictions`: List[np.ndarray] - Monthly predictions
- `Labels`: List[np.ndarray] - Monthly ground truth labels
- `Uncertainties (Month Ahead)`: List[np.ndarray] - Uncertainty scores
- `Hyperparameters`: dict - Training hyperparameters
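
As an illustration of how these fields fit together, the following sketch computes a per-month F1 score from `Predictions` and `Labels`. The record is synthetic and built only from the field names listed above. It assumes binary labels with 1 as the positive (malware) class, which this README does not state, and `f1_score` and `monthly_f1` are hypothetical helpers rather than repository code.

```python
from typing import List, Sequence


def f1_score(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Binary F1 for the positive class (assumed to be label 1)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def monthly_f1(result: dict) -> List[float]:
    """One F1 value per test month, pairing predictions with labels."""
    return [f1_score(p, y) for p, y in zip(result["Predictions"], result["Labels"])]


# Synthetic record; plain lists stand in for the np.ndarray entries.
example = {
    "Dataset": "androzoo",
    "Sampler-Mode": "hypothetical-mode",  # placeholder, not a real mode name
    "Monthly-Label-Budget": 50,
    "Random-Seed": 0,
    "Predictions": [[1, 0, 1, 1], [0, 0, 1, 0]],
    "Labels": [[1, 0, 0, 1], [0, 1, 1, 0]],
}
scores = monthly_f1(example)
print([round(s, 3) for s in scores])
```

The helpers only iterate over each month's values, so they accept NumPy arrays as well as lists.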

### JSON Files (.json)

JSON format with the same structure as the pickle files, but with:
- NumPy arrays serialized as nested lists
- Additional trainer-specific metadata
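
A quick structural check can catch truncated or mismatched records before any metrics are computed. The sketch below tests a record against the documented field names and relies only on `len()`, so it applies to pickle and JSON records alike. `check_record` is a hypothetical helper, the sample records are synthetic, and the check assumes that `Predictions` and `Labels` cover the same months, which the format description implies but does not state outright.

```python
REQUIRED_KEYS = (
    "Dataset", "Sampler-Mode", "Monthly-Label-Budget", "Random-Seed",
    "Predictions", "Labels", "Uncertainties (Month Ahead)", "Hyperparameters",
)


def check_record(record: dict) -> list:
    """Return a list of problems found; an empty list means the record looks fine."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in record]
    if problems:
        return problems
    preds, labels = record["Predictions"], record["Labels"]
    if len(preds) != len(labels):
        problems.append(f"{len(preds)} prediction months vs {len(labels)} label months")
    for month, (p, y) in enumerate(zip(preds, labels)):
        if len(p) != len(y):
            problems.append(f"month {month}: {len(p)} predictions vs {len(y)} labels")
    return problems


# Synthetic records for illustration only.
good = {
    "Dataset": "apigraph",
    "Sampler-Mode": "hypothetical-mode",  # placeholder, not a real mode name
    "Monthly-Label-Budget": 100,
    "Random-Seed": 0,
    "Predictions": [[1, 0], [0, 1, 1]],
    "Labels": [[1, 0], [0, 1, 0]],
    "Uncertainties (Month Ahead)": [[0.1, 0.9], [0.2, 0.3, 0.8]],
    "Hyperparameters": {},
}
bad = dict(good, Labels=[[1, 0]])  # one month of labels is missing

print(check_record(good))
print(check_record(bad))
```

For numeric work on JSON records, `np.asarray(month)` converts each nested list back into an array.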

## Usage

These files are loaded and processed by:
- `examples/reproduce_paper_table/load_all_aurora_data.py`
- `examples/reproduce_paper_table/generate_main_table.py`

See `examples/reproduce_paper_table/README.md` for detailed instructions.

## Size Information

| Category | Size | Files |
|----------|------|-------|
| DeepDrebin & SVC (AndroZoo, API-Graph) | ~2.9GB | 2 pickle files |
| Transcendent (DeepDrebin & SVC) | ~0.4GB | 4 pickle files |
| HCC (JSON) | ~2.0GB | 6 JSON files |
| CADE (JSON) | ~3.0GB | 6 JSON files |
| **Total** | **~8.3GB** | **18 files** |
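
Before copying or downloading, it may be worth confirming that the target filesystem has room for the full set. The sketch below uses the approximate total from the table and treats 1 GB as 10^9 bytes, which is an assumption about how the sizes were measured. `has_room` is a hypothetical helper.

```python
import shutil

REQUIRED_GB = 8.3  # approximate total from the size table


def has_room(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """True if the filesystem holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


print("enough space" if has_room(".") else "not enough free space")
```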

## Obtaining the Data

This directory is **not tracked by git** due to its large size.

### Option 1: Copy from Full Results

If you have access to the full `results/` directory:

```bash
# From repository root
mkdir -p data-for-export/{deep_drebin_svc,others_v2}

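# Copy DeepDrebin and SVC
cp results/deep_drebin_svc/parallel_ce_no_aug_v2.pkl data-for-export/deep_drebin_svc/
cp results/deep_drebin_svc/parallel_svc_v2.pkl data-for-export/deep_drebin_svc/
# Note: the Transcendent pickles (data-for-export/deep_drebin_svc/transcendent/*.pkl)
# are not covered by these commands and must be copied separately.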

# Copy HCC and CADE
cp results/others_v2/hcc_mlp_warm-*.json data-for-export/others_v2/
cp results/others_v2/cade_mlp_*.json data-for-export/others_v2/
```

### Option 2: Download from Archive

Contact the authors for access to the pre-computed experimental results archive.

### Option 3: Re-run Experiments

Re-run the experiments from scratch (requires significant computational resources):
- See main `README.md` for experiment execution instructions
- See `src/aurora/experiments.py` for experiment configurations
- Estimated time: several days on a multi-core machine

## Verification

To verify file integrity after copying/downloading:

```bash
# From repository root
python -c "
import pickle, json
from pathlib import Path

# Check pickle files
for pkl_file in Path('data-for-export/deep_drebin_svc').glob('*.pkl'):
    with open(pkl_file, 'rb') as f:
        data = pickle.load(f)
        print(f'✓ {pkl_file.name}: {len(data)} results')

# Check JSON files
for json_file in Path('data-for-export/others_v2').glob('*.json'):
    with open(json_file, 'r') as f:
        data = json.load(f)
        print(f'✓ {json_file.name}: {len(data)} results')
"
```

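Expected output: one line per file, 14 in total (2 pickle files and 12 JSON files), each with its result count. The pickle glob is not recursive, so the files under `transcendent/` are not covered by this check.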
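
The check above loads every file in full, which can be slow for multi-gigabyte files, and its glob does not descend into `transcendent/`. As a lighter complement, the sketch below only tests whether the files named in the directory structure exist, without opening them. The file names and patterns are copied from the tree at the top of this README; `missing_patterns` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

DATASETS = ("androzoo", "apigraph", "transcendent")

# Expected files per subdirectory, mirroring the directory structure section.
EXPECTED = {
    "deep_drebin_svc": ["parallel_ce_no_aug_v2.pkl", "parallel_svc_v2.pkl"],
    "deep_drebin_svc/transcendent": [
        "deep_drebin_full_*.pkl",
        "deep_drebin_subsampled_*.pkl",
        "drebin_full_*.pkl",
        "drebin_subsampled_*.pkl",
    ],
    "others_v2": (
        [f"hcc_mlp_warm-{d}{s}.json" for d in DATASETS for s in ("", "-subsampling")]
        + [f"cade_mlp_{m}-{d}.json" for m in ("cold", "warm") for d in DATASETS]
    ),
}


def missing_patterns(root: Path) -> list:
    """Return every expected name or pattern that matches no file under `root`."""
    missing = []
    for subdir, patterns in EXPECTED.items():
        for pattern in patterns:
            if not any((root / subdir).glob(pattern)):
                missing.append(f"{subdir}/{pattern}")
    return missing


# Run from the repository root.
for entry in missing_patterns(Path("data-for-export")):
    print(f"missing: {entry}")
```

No output means every expected name or pattern matched at least one file. This confirms presence only, not that the contents are intact.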

## Notes

- All experiments are repeated with 5 random seeds (0-4) to account for run-to-run variance
- Results include predictions, labels, and uncertainty scores for each test month
- Hyperparameter configurations are embedded in result dictionaries
- Data is already filtered to the appropriate cutoff months for each dataset
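
Because each configuration is repeated over several seeds, a typical first step is to group results by configuration and summarize a metric across seeds. The sketch below shows one way to do that with the documented keys. The records and their `score` field are synthetic placeholders (a real pipeline would compute a metric from `Predictions` and `Labels`), and `summarize_by_config` is a hypothetical helper, not the repository's aggregation code.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_config(results, metric):
    """Group results by configuration and return {config: (mean, std)} over seeds."""
    groups = defaultdict(list)
    for r in results:
        config = (r["Dataset"], r["Sampler-Mode"], r["Monthly-Label-Budget"])
        groups[config].append(metric(r))
    return {
        config: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for config, vals in groups.items()
    }


# Synthetic records: "score" stands in for a metric computed from real results.
records = [
    {"Dataset": "apigraph", "Sampler-Mode": "hypothetical-mode",
     "Monthly-Label-Budget": 100, "Random-Seed": seed, "score": score}
    for seed, score in enumerate([0.5, 0.6, 0.7])
]
summary = summarize_by_config(records, metric=lambda r: r["score"])
for config, (avg, std) in summary.items():
    print(config, f"mean={avg:.3f} std={std:.3f}")
```

Here `stdev` is the sample standard deviation; whether the paper table reports that or another spread measure is not specified in this README.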
