# Aurora: Evaluation Framework for Malware Classifiers Under Concept Drift

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

**Aurora** is a comprehensive evaluation framework for assessing malware classifiers under temporal concept drift, focusing on **stability**, **reliability**, and **uncertainty quality**.

---

## Overview

Modern malware classifiers report promising performance figures, but do these translate into genuine operational reliability? Aurora evaluates malware classifiers along three dimensions:

- **Reliability**: Traditional metrics (F1, FNR, FPR) plus uncertainty quality (AURC)
- **Stability**: Temporal consistency (CV[F1], Mann-Kendall trend, Max Drawdown)
- **Operational Resilience**: Selective classification and rejection-based analysis

### Key Features

- Temporal evaluation with monthly retraining under labeling budgets
- Multiple uncertainty quantification methods (MSP, Pseudo-Loss, OOD, Margin)
- Selective classification simulation with configurable rejection budgets
- Pareto-optimal method identification across multiple objectives
- Support for three malware datasets: AndroZoo, API-Graph, Transcendent
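
To make the selective classification feature concrete, here is a minimal, dependency-free sketch of rejecting the most uncertain predictions under a fixed budget. It is an illustration only and not Aurora's implementation; the function names `reject_most_uncertain` and `error_rate` and the toy data are assumptions introduced here.

```python
def reject_most_uncertain(uncertainty, budget):
    """Split sample indices into (accepted, rejected) for a fixed rejection budget.

    `uncertainty` holds one score per sample, higher meaning less confident.
    `budget` is the number of samples the classifier may abstain on.
    """
    ranked = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i], reverse=True)
    rejected = sorted(ranked[:budget])
    accepted = sorted(ranked[budget:])
    return accepted, rejected


def error_rate(y_true, y_pred, indices):
    """Fraction of misclassified samples among `indices` (0.0 if empty)."""
    if not indices:
        return 0.0
    return sum(y_true[i] != y_pred[i] for i in indices) / len(indices)


# Toy data, invented for illustration (1 = malware, 0 = goodware).
y_true = [1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 0]
uncertainty = [0.1, 0.2, 0.9, 0.3, 0.8]

accepted, rejected = reject_most_uncertain(uncertainty, budget=2)
risk_before = error_rate(y_true, y_pred, list(range(len(y_true))))
risk_after = error_rate(y_true, y_pred, accepted)
print(f"rejected={rejected}, risk before={risk_before:.2f}, after={risk_after:.2f}")
```

In this toy case both errors carry the highest uncertainty, so rejecting two samples removes every mistake from the accepted set. A poorly ranked uncertainty score would spend the same budget on correct predictions.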

---

## Installation

```bash
# Clone the repository
git clone https://github.com/aurora-framework/aurora.git
cd aurora

# Install in development mode
pip install -e .
```

**Requirements**: Python 3.9+, NumPy, Pandas, SciPy, scikit-learn, PyTorch, tqdm

---

## Reproducing Paper Results

### Step 1: Download Experimental Data

Download the experimental results from Zenodo:

**DOI**: [Coming soon - will be updated upon publication]

```bash
# Extract to data-for-export/ in the repository root
unzip aurora-data-v1.0.zip -d .

# Verify structure
ls data-for-export/
# Should show:
#   deep_drebin_svc/
#   others_v2/
```

**Data Contents** (~8 GB total):
- `deep_drebin_svc/parallel_ce_no_aug_v2.pkl` - DeepDrebin results (AndroZoo, API-Graph)
- `deep_drebin_svc/parallel_svc_v2.pkl` - Drebin/SVC baseline results
- `deep_drebin_svc/transcendent/` - DeepDrebin/Drebin results (Transcendent)
- `others_v2/*.json` - HCC and CADE method results (all datasets)

### Step 2: Generate Paper Tables

```bash
# Generate the main comprehensive results table
python examples/reproduce_paper_table/generate_main_table.py
```

**Output**: `examples/reproduce_paper_table/results/comprehensive_results_mirror_new_data.tex`

### Step 3: Verify Results

Compare the generated table with the pre-computed reference. `diff` prints nothing when the two files are identical:

```bash
diff examples/reproduce_paper_table/results/comprehensive_results_mirror_new_data.tex \
     examples/reproduce_paper_table/results/comprehensive_results_mirror_new_data.tex.reference
```

---

## Repository Structure

```
aurora/
├── src/aurora/                     # Core framework (22 modules)
│   ├── analyzer.py                 # AuroraAnalyzer class
│   ├── metrics.py                  # F1, FNR, FPR, stability metrics
│   ├── tools.py                    # compute_metrics_numba, compute_aurc
│   ├── ingestion.py                # PickleResultsLoader, JSONResultsLoader
│   ├── filters.py                  # FilterChain for result filtering
│   ├── pareto.py                   # Pareto dominance analysis
│   ├── performance_rejection.py    # FNR/FPR-based rejection simulation
│   └── ...
│
├── tests/                          # Test suite
│   └── test_*.py
│
├── examples/reproduce_paper_table/ # Paper reproduction scripts
│   ├── README.md                   # Reproduction guide
│   ├── generate_main_table.py      # Main table generation
│   ├── generate_pareto_table.py    # Pareto analysis
│   ├── generate_rejection_table.py # Rejection analysis
│   ├── generate_figures.py         # Paper figures
│   └── results/                    # Generated tables
│
├── docs/
│   └── QUICK_REFERENCE.md          # Metrics reference
│
└── data-for-export/                # Experimental data (download separately)
```

---

## Metrics Reference

### Baseline Performance

| Metric | Description | Direction |
|--------|-------------|-----------|
| F1 (%) | Harmonic mean of precision and recall | Higher is better |
| FNR (%) | False Negative Rate | Lower is better |
| AUROC (%) | Area Under ROC Curve | Higher is better |
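
As a reference for how F1, FNR, and FPR relate to confusion-matrix counts, the following dependency-free sketch computes them for binary labels. It is not Aurora's `compute_metrics_numba`; the helper name `binary_metrics`, the zero-denominator convention, and the toy labels are assumptions made for this example. AUROC is omitted because it needs ranking scores rather than hard predictions (scikit-learn's `roc_auc_score` is one way to obtain it).

```python
def binary_metrics(y_true, y_pred):
    """Return (f1, fnr, fpr) as fractions, with 1 = malware and 0 = goodware."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif not truth and pred:
            fp += 1
        elif truth and not pred:
            fn += 1
        else:
            tn += 1
    # Undefined ratios are reported as 0.0 here (an assumption of this sketch).
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return f1, fnr, fpr


# Toy labels, invented for illustration: TP=2, FN=1, FP=1, TN=4.
f1, fnr, fpr = binary_metrics(
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
)
print(f"F1={100 * f1:.1f}%  FNR={100 * fnr:.1f}%  FPR={100 * fpr:.1f}%")
```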

### Reliability (Uncertainty Quality)

| Metric | Description | Direction |
|--------|-------------|-----------|
| AURC | Area Under Risk-Coverage Curve | Lower is better |
| AURC[F1]* | F1-based risk over coverage spectrum | Lower is better |
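
The sketch below shows one common way to compute an area under the risk-coverage curve: rank samples from most to least confident, compute the error rate of the top-k samples for every k, and average those selective risks. It assumes 0/1 error as the risk and is not necessarily how Aurora's `compute_aurc` or the F1-based variant is defined; the function name `aurc` and the toy data are illustrative.

```python
def aurc(errors, uncertainty):
    """Average selective risk over all coverage levels k/n, k = 1..n.

    `errors[i]` is 1 if sample i was misclassified, else 0.
    `uncertainty[i]` is higher when the model is less confident.
    """
    order = sorted(range(len(errors)), key=lambda i: uncertainty[i])  # most confident first
    cumulative_errors = 0
    risks = []
    for kept, index in enumerate(order, start=1):
        cumulative_errors += errors[index]
        risks.append(cumulative_errors / kept)  # error rate among the `kept` most confident
    return sum(risks) / len(risks)


# Toy data, invented for illustration: the last two samples are misclassified.
errors = [0, 0, 1, 1]
well_ranked = aurc(errors, [0.1, 0.2, 0.8, 0.9])    # errors receive high uncertainty
poorly_ranked = aurc(errors, [0.9, 0.8, 0.2, 0.1])  # errors receive low uncertainty
print(f"well ranked: {well_ranked:.3f}, poorly ranked: {poorly_ranked:.3f}")
```

Both calls use the same predictions, so the accuracy is identical. Only the ordering induced by the uncertainty score changes, which is why a lower value indicates better uncertainty quality.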

### Stability (Temporal Consistency)

| Metric | Description | Direction |
|--------|-------------|-----------|
| σ[F1] | Standard deviation of monthly F1 | Lower is better |
| τ | Mann-Kendall trend coefficient | >0 = improving |
| BF* (%) | Benefit Fraction at budget B | Higher is better |
| ΔRej* | Rejection bias at budget B | Closer to 0 is better |
| σ[Rej]* | Rejection std at budget B | Lower is better |
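
For intuition about the temporal metrics, here is a small standard-library sketch of σ[F1], CV[F1], the Mann-Kendall τ, and max drawdown on a series of monthly F1 scores. It makes several assumptions that may differ from Aurora's `StabilityMetrics`: population standard deviation, τ computed as Kendall's tau-a without tie correction, and drawdown measured as the largest drop from a running peak. The budget-based metrics (BF, ΔRej, σ[Rej]) are not covered.

```python
from statistics import mean, pstdev


def mann_kendall_tau(series):
    """Kendall's tau-a of the series against time (no tie correction)."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # +1 for a later increase, -1 for a decrease
    return s / (n * (n - 1) / 2)


def max_drawdown(series):
    """Largest drop from any earlier peak to a later value."""
    peak = series[0]
    worst = 0.0
    for value in series:
        peak = max(peak, value)
        worst = max(worst, peak - value)
    return worst


# Toy monthly F1 scores, invented for illustration: a steadily degrading classifier.
monthly_f1 = [0.90, 0.85, 0.80, 0.75, 0.70]
std_f1 = pstdev(monthly_f1)
cv_f1 = std_f1 / mean(monthly_f1)
tau = mann_kendall_tau(monthly_f1)
drawdown = max_drawdown(monthly_f1)
print(f"σ[F1]={std_f1:.3f}  CV[F1]={cv_f1:.3f}  τ={tau:.1f}  max drawdown={drawdown:.2f}")
```

A strictly decreasing series yields τ = -1, matching the table's reading that positive τ indicates improvement over time.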

### Pareto Status

| Symbol | Meaning |
|--------|---------|
| ★ | Universal Pareto-optimal (on frontier in all datasets) |
| △ | Partial Pareto-optimal (on frontier in some datasets) |
| ○ | Dominated (never on Pareto frontier) |
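
The three symbols can be derived from per-dataset Pareto frontiers. The sketch below is a generic dominance check, not the `ParetoAnalyzer` API: the helper names, the method names `A`, `B`, `C`, and all numbers are hypothetical. Every objective is treated as "lower is better", so higher-is-better metrics such as F1 are negated first.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_frontier(points):
    """Names of the methods that no other method dominates."""
    return sorted(
        name
        for name, point in points.items()
        if not any(dominates(other, point) for key, other in points.items() if key != name)
    )


def pareto_status(frontiers_by_dataset, method):
    """Map frontier membership across datasets to the ★ / △ / ○ symbols."""
    hits = sum(method in frontier for frontier in frontiers_by_dataset.values())
    if hits == len(frontiers_by_dataset):
        return "★"
    return "△" if hits else "○"


# Hypothetical (-F1, AURC) pairs for two made-up datasets.
metrics = {
    "dataset_1": {"A": (-0.90, 0.10), "B": (-0.80, 0.05), "C": (-0.70, 0.20)},
    "dataset_2": {"A": (-0.90, 0.05), "B": (-0.80, 0.10), "C": (-0.70, 0.20)},
}
frontiers = {name: pareto_frontier(points) for name, points in metrics.items()}
statuses = {method: pareto_status(frontiers, method) for method in ("A", "B", "C")}
print(frontiers, statuses)
```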

---

## Datasets

| Dataset | Training | Validation | Test | Months |
|---------|----------|------------|------|--------|
| AndroZoo | 2019 | 2020-01 to 2020-06 | 2020-07 to 2021-12 | 18 |
| API-Graph | 2012 | 2013-01 to 2013-06 | 2013-07 to 2018-12 | 66 |
| Transcendent | 2014 | 2015-01 to 2015-06 | 2015-07 to 2018-12 | 42 |
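
The Months column counts test months with both endpoints included. As a quick arithmetic check, the following snippet recomputes it from the Test ranges in the table; the helper name `months_inclusive` is introduced here for illustration.

```python
def months_inclusive(start, end):
    """Number of calendar months from `start` to `end` (both 'YYYY-MM'), inclusive."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


# Test ranges copied from the table above.
test_periods = {
    "AndroZoo": ("2020-07", "2021-12"),
    "API-Graph": ("2013-07", "2018-12"),
    "Transcendent": ("2015-07", "2018-12"),
}
test_months = {name: months_inclusive(*period) for name, period in test_periods.items()}
print(test_months)
```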

---

## Methods Evaluated

| Method | Uncertainty | Initial Training |
|--------|-------------|------------------|
| DeepDrebin | MSP (Max Softmax Probability) | Full D₀ or Subsampled (4800) |
| HCC | Pseudo-Loss or MSP | Full D₀ or Subsampled (4800) |
| CADE | OOD or MSP | Cold (0) or Warm |
| Drebin (SVM) | Margin distance | Full D₀ or Subsampled (4800) |

---

## API Usage

```python
from aurora import (
    PickleResultsLoader,
    compute_metrics_numba,
    compute_aurc,
    StabilityMetrics,
)
from aurora.pareto import ParetoAnalyzer

# Load experimental results
loader = PickleResultsLoader()
collection = loader.load("data-for-export/deep_drebin_svc/parallel_ce_no_aug_v2.pkl")

# Compute metrics for a single experiment
result = collection.results[0]
f1, fnr, fpr = compute_metrics_numba(result.labels, result.predictions)

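# Compute stability metrics
# `monthly_f1_list` is a placeholder: supply one F1 score per test month,
# in chronological order.
stability = StabilityMetrics.from_monthly_f1_scores(monthly_f1_list)
print(f"CV[F1]: {stability.cv_f1:.3f}")
print(f"σ[F1]: {stability.std_f1:.3f}")
print(f"τ: {stability.mann_kendall_tau:.3f}")

# Pareto analysis
# `methods_metrics` is a placeholder for the per-method metric values to compare;
# see src/aurora/pareto.py for the expected structure.
analyzer = ParetoAnalyzer()
frontier = analyzer.find_pareto_frontier(methods_metrics)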
```

---

## Citation

```bibtex
@inproceedings{aurora2026,
  title={Aurora: Evaluating Malware Classifiers Under Concept Drift},
  author={[Authors]},
  booktitle={Proceedings of the ACM Conference on Computer and Communications Security (CCS)},
  year={2026}
}
```

---

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## Acknowledgments

- HCC Dataset: https://github.com/wagner-group/active-learning
- Transcendent/Tesseract: https://github.com/AliGhahremanian/Tesseract

---

## Contact

For questions or issues, please open a GitHub issue or contact the authors.
