
n=50000 rows of tabular data is a good sample size, and results will likely have a low standard error, assuming no systematic bias. (Although it's not "big" data.)
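To put a rough number on "low standard error", here is a minimal sketch. It assumes the quantity being estimated is a simple proportion from an unbiased random sample, which is my assumption for illustration and not something stated in the original post. The function name is made up for this example.

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)


n = 50_000
# p = 0.5 maximizes p * (1 - p), so this is the worst case for a proportion.
se = proportion_standard_error(0.5, n)
# Approximate 95% margin of error using the normal approximation.
margin_95 = 1.96 * se

print(f"standard error: {se:.5f}")
print(f"95% margin of error: {margin_95:.5f}")
```

Even in the worst case the standard error is about 0.2 percentage points, so sampling noise is small at this size. None of this helps if the sample itself is biased.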

n=50000 of text data is different, since there will be less repetition of contextual structures and words (particularly proper nouns). The fact that the dataset only uses "hundreds", as mentioned in the original post, is interesting.


