Heterogeneous Feature Augmentation
Deep hurdle models + CTABGAN for telework prediction — 55% → 75% accuracy
Rutgers RUCI Lab
Problem
Telework prediction at the Census Block Group level is a zero-inflated prediction problem: most areas have zero remote workers, and the feature space is heterogeneous (demographic, geographic, economic, and infrastructure variables). Standard classifiers fail because they treat the excess zeros as ordinary class imbalance rather than as a structural property of the data that calls for a two-stage model.
Approach
The project implements deep hurdle models in PyTorch: a first-stage binary classifier predicts whether telework is non-zero, and a second-stage regressor models the count conditional on being non-zero. Heterogeneous feature augmentation applies focal loss and subspace tuning to handle the mixed feature types across 40,000+ geographic units. CTABGAN generators synthesize additional training samples for underrepresented geographic configurations. A joint multi-task variant trains both stages end-to-end.
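As a rough illustration of how the two stages can share one training objective, the sketch below writes a per-sample hurdle loss in plain Python so the arithmetic is visible. It is not the project's PyTorch code: the function names, the squared-error loss on the log count, and the expected-value prediction rule are all assumptions made for the example.

```python
import math


def hurdle_loss(p_nonzero, log_mu, y, eps=1e-7):
    """Per-sample joint loss for a two-stage hurdle model (illustrative only).

    p_nonzero: stage-1 predicted probability that the count is non-zero.
    log_mu:    stage-2 predicted log of the count, given that it is non-zero.
    y:         observed count (>= 0).
    """
    z = 1.0 if y > 0 else 0.0                # hurdle indicator
    p = min(max(p_nonzero, eps), 1.0 - eps)  # clamp to avoid log(0)
    # Stage 1: binary cross-entropy on the zero / non-zero indicator.
    bce = -(z * math.log(p) + (1.0 - z) * math.log(1.0 - p))
    # Stage 2: regression loss, applied only to non-zero samples.
    # Squared error on the log count is an assumed choice, not the source's.
    reg = (log_mu - math.log(y)) ** 2 if y > 0 else 0.0
    return bce + reg


def hurdle_expected(p_nonzero, log_mu):
    """Expected count: P(non-zero) times the conditional count estimate."""
    return p_nonzero * math.exp(log_mu)
```

Because the regression term is masked out whenever `y` is zero, the second stage never sees the structural zeros; summing the two terms is one simple way to train both stages jointly.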
Architecture
[Figure: Heterogeneous Feature Augmentation system diagram]
Key Technical Decisions
Hurdle model over SMOTE for zero-inflation
SMOTE treats zero-inflation as class imbalance and synthesizes minority samples. A hurdle model treats it as a structural data characteristic: the zero-generating process is fundamentally different from the count-generating process. Modeling the two processes separately produced better-calibrated predictions.
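The separation can be stated compactly. The formulation below is the generic hurdle decomposition, not a specification of the project's model: $\pi(x)$ denotes the probability of a non-zero count, $g(y \mid x)$ an unspecified distribution over strictly positive counts, and $z_i = \mathbf{1}[y_i > 0]$; all three symbols are introduced here for illustration.

```latex
\begin{align}
P(Y = 0 \mid x) &= 1 - \pi(x), \\
P(Y = y \mid x) &= \pi(x)\, g(y \mid x), \qquad y > 0, \\
\log L &= \sum_{i} \Big[ z_i \log \pi(x_i) + (1 - z_i) \log\big(1 - \pi(x_i)\big) \Big]
          + \sum_{i \,:\, y_i > 0} \log g(y_i \mid x_i), \\
\mathbb{E}[Y \mid x] &= \pi(x)\, \mathbb{E}[Y \mid Y > 0,\, x].
\end{align}
```

The log-likelihood splits into a term that involves only the zero/non-zero indicator and a term that involves only the positive observations, which is why the two stages can be fit separately or summed into a single joint objective.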
Focal loss for heterogeneous subspace tuning
The feature space mixes demographic, geographic, and economic variables with very different scales and distributions. Focal loss down-weights easy, well-classified examples and concentrates training on hard subspace configurations, which matters when the model must generalize across more than 40,000 geographically diverse census units.
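For reference, the standard binary focal loss can be sketched in a few lines of plain Python. The defaults `gamma=2.0` and `alpha=0.25` are the commonly cited values from the original focal loss paper, not settings confirmed for this project, and the function name is hypothetical.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for a single example (illustrative sketch).

    p:     predicted probability of the positive class.
    y:     true label, 0 or 1.
    gamma: focusing parameter; gamma = 0 recovers weighted cross-entropy.
    alpha: weight on the positive class; (1 - alpha) goes to the negative class.
    """
    p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # The (1 - p_t)^gamma factor shrinks the loss of confident, correct predictions.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=2.0`, an example classified correctly at probability 0.9 has its cross-entropy scaled by roughly 0.01, while one at 0.6 is scaled by 0.16, so the harder example contributes far more to the gradient.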
Results
- ✓ Classification accuracy improved from 55% to 75% with the deep hurdle approach
- ✓ CTABGAN augmentation improved generalization across underrepresented geographic configurations
- ✓ Moran's I spatial autocorrelation analysis validated spatial consistency of predictions
- ✓ Research conducted at Rutgers Urban and Civic Informatics Laboratory
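The Moran's I check listed above uses a standard statistic that is easy to state in code. The sketch below is a minimal pure-Python version that assumes a dense, symmetric spatial weight matrix; the function name and the toy adjacency are hypothetical, and whether the project applied the statistic to raw predictions or to residuals is not stated in the source.

```python
def morans_i(values, weights):
    """Global Moran's I for values observed at n spatial units.

    values:  list of n numbers (e.g. one model output per unit).
    weights: n x n list of lists; weights[i][j] > 0 when units i and j are
             neighbors, with a zero diagonal. Values must not all be equal.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations over neighboring pairs.
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * num / den


# Toy example: four units in a row, each adjacent to the next.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, 0, 0], chain)    # similar values sit together
alternating = morans_i([1, 0, 1, 0], chain)  # neighbors always differ
```

Positive values indicate that similar values cluster in space and negative values indicate that neighbors tend to differ; in the toy example the clustered pattern gives a positive statistic and the alternating pattern a negative one.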
Tech Stack
Links