Research

Heterogeneous Feature Augmentation

Deep hurdle models + CTABGAN for telework prediction — 55% → 75% accuracy

Rutgers RUCI Lab

Problem

Telework prediction at the Census Block Group level is a zero-inflated prediction problem: most areas have zero remote workers, and the feature space is heterogeneous (demographic, geographic, economic, infrastructure). Standard classifiers underperform because they treat the zero-inflation as class imbalance rather than as a structural characteristic of the data that calls for a two-stage model.

Approach

The project implements deep hurdle models in PyTorch: a first-stage binary classifier predicts whether telework is non-zero, and a second-stage regressor models the count conditional on being non-zero. Heterogeneous feature augmentation applies focal loss and subspace tuning to handle the mixed feature types across 40,000+ geographic units. CTABGAN generators synthesize additional training samples for underrepresented geographic configurations. A joint multi-task variant trains both stages end-to-end.
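A minimal sketch of what the joint multi-task variant could look like in PyTorch, under stated assumptions. The class name `HurdleNet`, the layer sizes, the `log1p` squared-error loss for the count stage, and the synthetic data are all illustrative choices made here; the source does not specify the actual architecture or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HurdleNet(nn.Module):
    """Hypothetical joint hurdle network: one shared trunk, two heads."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.zero_head = nn.Linear(hidden, 1)   # stage 1: logit of P(y > 0)
        self.count_head = nn.Linear(hidden, 1)  # stage 2: log1p(count) given y > 0

    def forward(self, x):
        h = self.trunk(x)
        return self.zero_head(h).squeeze(-1), self.count_head(h).squeeze(-1)


def hurdle_loss(logit, log1p_count, y, count_weight: float = 1.0):
    """BCE on the zero/non-zero indicator plus a regression loss on positives only."""
    nonzero = (y > 0).float()
    stage1 = F.binary_cross_entropy_with_logits(logit, nonzero)
    mask = nonzero.bool()
    if mask.any():
        # Assumption: squared error on log1p(count). A zero-truncated Poisson or
        # negative binomial likelihood would be the textbook alternative.
        stage2 = F.mse_loss(log1p_count[mask], torch.log1p(y[mask]))
    else:
        stage2 = logit.new_zeros(())
    return stage1 + count_weight * stage2


def predict(model, x):
    """Point prediction: P(y > 0) times the back-transformed stage-2 output."""
    with torch.no_grad():
        logit, log1p_count = model(x)
        return torch.sigmoid(logit) * torch.expm1(log1p_count).clamp(min=0.0)


# Synthetic stand-in data (not the ACS/HPS features): roughly 70% zeros.
torch.manual_seed(0)
x = torch.randn(256, 10)
positive = torch.poisson(torch.full((256,), 5.0)) + 1.0
y = torch.where(torch.rand(256) < 0.7, torch.zeros(256), positive)

model = HurdleNet(n_features=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = hurdle_loss(*model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Masking the stage-2 loss to non-zero targets is what makes this a hurdle model rather than a plain multi-output regressor: zeros inform only the first head.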

Architecture

Heterogeneous Feature Augmentation: system diagram

Diagram nodes: ACS / HPS Survey D… | CTABGAN Feature Au… | Stage 1: Binary Cl… | Stage 2: Count Model | Hurdle Model Output | Moran's I Spatial …

Key Technical Decisions

01

Hurdle model over SMOTE for zero-inflation

SMOTE treats zero-inflation as class imbalance and synthesizes minority samples. A hurdle model treats it as a structural data characteristic: the zero-generating process is fundamentally different from the count-generating process. Modeling them separately produced much better-calibrated predictions.
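The separation between the two processes can be written as the standard hurdle likelihood. The notation is introduced here for illustration and does not come from the project: $\pi_i$ is the stage-1 probability that unit $i$ has any telework, and $f(\cdot;\theta_i)$ is an assumed count distribution (for example Poisson) for stage 2.

```latex
P(Y_i = y) =
\begin{cases}
1 - \pi_i, & y = 0, \\[4pt]
\pi_i \, \dfrac{f(y;\theta_i)}{1 - f(0;\theta_i)}, & y = 1, 2, \dots
\end{cases}

\ell(\pi, \theta) =
\underbrace{\sum_i \Big[ \mathbb{1}\{y_i = 0\}\log(1 - \pi_i)
  + \mathbb{1}\{y_i > 0\}\log \pi_i \Big]}_{\text{stage 1: zero vs. non-zero}}
+ \underbrace{\sum_{i:\, y_i > 0} \Big[ \log f(y_i;\theta_i)
  - \log\big(1 - f(0;\theta_i)\big) \Big]}_{\text{stage 2: truncated counts}}
```

Because the log-likelihood splits into two sums with no shared terms, the zero process and the count process can be fit separately. Resampling the minority class does not provide that separation.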

02

Focal loss for heterogeneous subspace tuning

The feature space mixes demographic, geographic, and economic variables with very different scales and distributions. Focal loss down-weights easy, well-classified examples and concentrates training on hard subspace configurations, which matters when the model must generalize across 40,000+ geographically diverse census units.
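The down-weighting can be sketched as a binary focal loss on the stage-1 logits. The function name `binary_focal_loss` is hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are the commonly cited values from the original focal loss paper, not settings reported for this project.

```python
import torch
import torch.nn.functional as F


def binary_focal_loss(logits, targets, gamma: float = 2.0, alpha: float = 0.25):
    """Binary focal loss computed from raw logits; targets are 0.0 or 1.0."""
    # Per-example cross-entropy, i.e. -log(p_t).
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # Probability the model assigns to the true class of each example.
    p_t = p * targets + (1 - p) * (1 - targets)
    # Class-balancing weight: alpha for positives, 1 - alpha for negatives.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t) ** gamma shrinks the loss of confidently correct examples.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

With `gamma=0` and `alpha=0.5` this reduces to half the ordinary binary cross-entropy, which is a convenient sanity check when tuning the two parameters.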

Results

  • Classification accuracy improved from 55% to 75% with the deep hurdle approach
  • CTABGAN augmentation improved generalization across underrepresented geographic configurations
  • Moran's I spatial autocorrelation analysis validated spatial consistency of predictions
  • Research conducted at Rutgers Urban and Civic Informatics Laboratory
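For reference, the Moran's I statistic mentioned above can be computed in a few lines of NumPy. This is a minimal sketch of the global statistic only. The function name `morans_i`, the binary adjacency weights, and the four-unit toy example are assumptions for illustration; the source does not say how the spatial weights were built or which values the statistic was computed on.

```python
import numpy as np


def morans_i(values, weights):
    """Global Moran's I for n values and an n x n spatial weight matrix."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()                      # deviations from the mean
    num = (w * np.outer(z, z)).sum()      # sum_ij w_ij * z_i * z_j
    den = (z ** 2).sum()                  # sum_i z_i^2
    return (x.size / w.sum()) * num / den


# Toy example: four units on a line, each adjacent to its immediate neighbours.
chain = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
clustered = morans_i([1, 1, 0, 0], chain)    # similar values sit next to each other
alternating = morans_i([1, 0, 1, 0], chain)  # neighbours always differ
```

Positive values indicate that neighbouring units carry similar values, and negative values indicate that neighbours tend to differ. In the toy example the clustered layout gives a positive statistic and the alternating layout a negative one.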

Tech Stack

PyTorch, XGBoost, Python, NumPy, Scikit-learn, Jupyter
