
>> ...in the past 2 years, % of any given project that involves ML: 15%, that involves moving, monitoring, and counting data to feed ML: 85%

As it should be. In order to have confidence in your ML you need to really understand your data and data processing.



Yes. The point I took away from this is that this is not at all a focus of most academic settings. That leaves a huge gap, and it leaves candidates with an academic DS background woefully unprepared and far less attractive to employers.


That seems strange to me. People on forums like this often describe Data Science practitioners as "statisticians that can code". If academic Data Science programs aren't emphasizing data engineering as part of their curriculum, what differentiates a Data Science program from statistics or business intelligence?


> If academic Data Science programs aren't emphasizing data engineering as part of their curriculum, what differentiates a Data Science program from statistics or business intelligence?

In my experience, they're emphasizing software-based data work like machine learning, but not the (vital) peripherals like cleaning/studying/loading data or monitoring and sanity-checking outputs.

A data science student might get a process-first task like making predictions from data using KNN, regressions, t-tests, or neural nets, choosing a method and optimizing based on performance. A statistics student might focus on theory, choosing an appropriate analysis method in advance based on the dataset, and reasoning about the effects of error instead of just trying to reduce it.
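For example, a process-first exercise might look like this (a sketch, assuming scikit-learn and its bundled digits dataset; the grid of k values is just illustrative):

    # Process-first: try several k values and keep whichever
    # cross-validates best, without reasoning about why in advance.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=5).mean()
              for k in (1, 3, 5, 9, 15)}
    best_k = max(scores, key=scores.get)
    print(f"best k = {best_k}, CV accuracy = {scores[best_k]:.3f}")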

But the data scientist could still be training on a clean, wholly-theoretical dataset or a highly predictable online-training environment. The result is a lot of entry-level data scientists who are mechanically talented but stymied by real-world hurdles. Issues handling dirty or inconsistent data, for one. But there are a lot of others: a tendency to do analysis in a vacuum, without taking advantage of knowledge about the domain and data source; or judging output effectiveness based on training accuracy, without asking whether the dataset is (and will stay) well-matched to the actual task.
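To make the training-accuracy trap concrete, a toy sketch (hypothetical; an unpruned decision tree simply memorizes its training set):

    # Training accuracy can look perfect while held-out accuracy
    # tells the real story.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print("train accuracy:", model.score(X_tr, y_tr))  # ~1.00
    print("test accuracy: ", model.score(X_te, y_te))  # noticeably lower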

I don't mean that to sound dismissive; there are lots of people who do all of that well, even newly-trained. But it does seem to be a common gap in a lot of data science education.


4th year EE undergraduate student here, taking both "Data Analysis/Pattern Rec" and "Computer Vision" electives this term. My early courses prepared me more for a path focused on circuit design, but I jumped ship through exposure to wonderful, wonderful DSP. A lot of what I'm learning now is very new to me, so I appreciate comments like yours that give a sense of potential gaps in my learning. Thank you.

I'm currently working on an assignment for CV in which we extract Histogram of Oriented Gradient features from the CIFAR-10 dataset using Python, then use them to train one of three classifiers (SVM, Gaussian Naive Bayes, Logistic Regression). I had asked about preprocessing, but was told it was outside the scope of this assignment, so we're just using the dataset as-is. :(
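For anyone curious, the pipeline looks roughly like this (a sketch that assumes the CIFAR-10 images X, shape (N, 32, 32, 3), and labels y are already loaded; skimage for HOG, scikit-learn for the classifier):

    # HOG features from grayscale 32x32 images, then a classifier.
    import numpy as np
    from skimage.color import rgb2gray
    from skimage.feature import hog
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    feats = np.array([hog(rgb2gray(img), orientations=9,
                          pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                      for img in X])
    F_tr, F_te, y_tr, y_te = train_test_split(feats, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(F_tr, y_tr)
    print("test accuracy:", clf.score(F_te, y_te))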

The nice bit is, I have a research internship coming up in a lab that will have me working on actual datasets, rather than toy examples. And, there's a data science club on campus that has an explicit focus on cleaning data which I plan on regularly attending. So... hopefully I'm on the right track!


Don't worry: when you have real problems, you'll have time to learn. Most of the time isn't even data cleaning, but debugging, digging into the details of the data or of code written by somebody else to understand why something isn't working (and there's always something that isn't working :) ). The main differentiator is whether you have the interest and patience for that or not.


I'm not familiar with academic Data Science programs but I've worked with statisticians for over fifteen years and they are usually very involved on the data engineering side. If they aren't running the systems then they are working closely with those people to test and confirm inputs and outputs before running analyses.


> they are working closely with those people to test and confirm inputs and outputs before running analyses

In terms of data science training, at least, this is often a missing element. It's easy to create classroom tasks that focus on teaching how to do analyses and neglect practical aspects like validating data and sanity-checking results. People pick it up on the job, of course, but I wouldn't be surprised if statisticians get a better academic grounding from things like reasoning about uncertainty.
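Even a handful of pre-analysis checks would go a long way, and they rarely show up in coursework. Something like this (hypothetical file and column names):

    # Hypothetical sanity checks before any analysis runs.
    import pandas as pd

    df = pd.read_csv("measurements.csv", parse_dates=["timestamp"])  # assumed input
    assert df["age"].between(0, 120).all(), "implausible ages"
    assert df["timestamp"].is_monotonic_increasing, "out-of-order records"
    assert df["reading"].notna().mean() > 0.95, "too many missing readings"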

(It's not a problem specific to data science, either. I've heard plenty of complaints about new engineers who are so used to made-up problems that they don't balk at ludicrous data or results when they start doing real work.)


> It's not a problem specific to data science, either. I've heard plenty of complaints about new engineers who are so used to made-up problems that they don't balk at ludicrous data or results when they start doing real work.

This is one of the reasons I think we need to integrate technology (and general data analysis, by the same reasoning) better across the curriculum. An increasing share of work, and a more rapidly increasing share of well-paying work, is knowledge work that involves both data analysis and collaborating with people doing automation, on top of the work that is primarily automation or data analysis. But we don't teach other knowledge skills in relation to automation and analysis, so people specialized in automation and analysis and people specialized in domain skills too often end up talking to each other across a wide gap, with a lot falling through the cracks.


The differentiator is breadth over depth, without getting much into the theoretical underpinnings.



