
Yeah, tree-based models are great for tabular datasets that are primarily numeric, with only a few categorical variables. But as soon as your categorical variables have 1000+ potential values that would need one-hot encoding, or you have any natural language text associated with your rows, deep learning almost always outperforms, especially once you have over 50K instances, in my experience.
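
For context, one common way DL models sidestep giant one-hot vectors is a learned embedding per categorical column. A minimal PyTorch sketch (all sizes and values here are made up for illustration):

    import torch
    import torch.nn as nn

    # A categorical column with 1200 distinct values: instead of a
    # 1200-wide one-hot vector, learn a dense 16-dim embedding per value.
    embed = nn.Embedding(num_embeddings=1200, embedding_dim=16)

    # Batch of integer-encoded category ids, shape [batch]
    cat_ids = torch.tensor([3, 517, 42])
    dense = embed(cat_ids)  # shape [3, 16]; fed into the rest of the net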

The major downside of DL is the slow training, and therefore the slow iteration feedback loop. Couple that with an exponentially growing number of hparams to tune, and you get something very powerful but costly in terms of time to use.

But if you want the best possible accuracy, and data collection isn't expensive, DL is the way to go. Just expect to spend 10x the amount of time tuning it vs trees to get a 10% to 20% reduction in error.
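
To make the tuning cost concrete, here's a sketch of what a search loop looks like (Optuna here; the search space is illustrative, and train_and_validate is a hypothetical helper that runs one full training run and returns validation error):

    import optuna

    def objective(trial):
        # Illustrative slice of a DL search space; real spaces are far larger.
        lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
        n_layers = trial.suggest_int("n_layers", 1, 6)
        dropout = trial.suggest_float("dropout", 0.0, 0.5)
        # Hypothetical helper: one full train + validation cycle, the slow part.
        return train_and_validate(lr, n_layers, dropout)

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=100)  # 100 trials = 100 training runs

Each DL trial can take orders of magnitude longer than a tree-model trial, which is where the time cost bites.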



>categorical variables have a 1000+ potential values that need 1-hot encoding

You typically do not need to one-hot encode categorical variables, as common implementations like LightGBM and CatBoost have efficient native ways to handle them. Googling around, I can't easily find cases where people get better results with GBM + one-hot, and I haven't either, though I haven't worked with 1000+-value categorical variables much.
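
For what it's worth, a minimal sketch of the native handling via LightGBM's sklearn API (the toy data is made up, and min_child_samples is lowered only so the tiny frame actually splits):

    import lightgbm as lgb
    import pandas as pd

    # Toy frame; "city" stands in for a high-cardinality categorical column.
    df = pd.DataFrame({
        "city": pd.Categorical(["tokyo", "lima", "oslo", "lima"]),
        "amount": [10.0, 3.5, 7.2, 1.1],
        "label": [1, 0, 1, 0],
    })

    # With pandas 'category' dtype, LightGBM treats the column natively --
    # no one-hot encoding step needed.
    model = lgb.LGBMClassifier(n_estimators=50, min_child_samples=1)
    model.fit(df[["city", "amount"]], df["label"])

CatBoost does the same if you pass cat_features to fit().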

>deep learning almost always outperforms

This isn't the case in the article we are commenting on, nor on Kaggle, but given that DL models occasionally (though rarely) outperform, I'm willing to believe this is one of those cases. Any recommendations for which DL models in particular I should use to test this claim?



