
Sometimes an idea has a time. It may seem obvious now but it isn't like Instagram and Snapchat were the first image sharing applications developed. Slack was nowhere near the first chat app.

I happened to have a long discussion on the topic of data businesses last night with a friend. We brainstormed datasets that are hard or expensive to obtain but can also be resold to thousands of customers willing to pay a high price for them. I don't want to get involved in datasets that are easy to obtain (too many competitors, no barrier to entry) or datasets specific to a particular company (too much dependence on a small set of customers, cost of acquiring new customers also includes cost of acquiring data, no economy of scale).

It's easy to start with the tech problem: how to collect, clean and analyze data. But reasoning backwards from the business side is much more difficult. Expensive data I can sell once feels easy. Cheap data I can sell frequently feels like a race to the bottom. Expensive data I can re-sell 1000s of times to a niche audience feels like a perfect middle ground ... I just can't think of any examples.



I think the machine learning problem is not data. It's not models. It's not compute.

It's annotation. That is a workforce problem. You want to automate contracts? You need attorneys. You want to automate radiology? You need radiologists. You want to automate driving? You need drivers.

This makes ML less like a SaaS business, more like a mining business. There's tons, literally tons, of data/ore for any interesting problem. That's why it's an interesting problem. There are buyers of iron, gold, and marble. There are buyers of driverless cars, physician decision support systems, and contract automation solutions. But recovering the data from the mine (digitization) and enriching it (annotation) cost money. So much that the market variation may make it lucrative at some times and not others. If you are near peak employment, the value of a model goes up, but the cost of annotators is high also.

I'm not sure how finance guys capture that problem: how do you make a profit when there's high demand on both sides at one time, and low demand on both sides at other times? I submit that the time to do annotation is when both sides are low, and the time to sell models is when both sides are high.

But then you need an investor who can ride out the market.


> Sometimes an idea has a time.

Isn't it more like "nearly always"? It's pretty hard to find examples of things that haven't been tried multiple times before in some variant. You can argue whether it was timing or execution that worked "this time", of course, but almost nothing happens in isolation.


> Isn't it more like "nearly always"?

Maybe? Hard to see past survivorship bias. My intuition says some ideas will never see their time. Hard to quantify how often that is the case as a percentage of all ideas.


Oh fair enough - I wasn't thinking of that direction.

Lots of ideas are just bad. The ones that work out though, are very rarely original I think (in this context, at least).


See: flying cars


I imagine black box data from plane crashes, or in general data that comes out of a tragic event that no one can or would want to replicate, but is otherwise extremely valuable.

Of course, if you can sell it to one person, they can just pass it off to others, so this will quickly turn into a DRM business profiting off tragedy. Probably not a good idea.


What makes a dataset hard/expensive to obtain?


I had a friend tell me recently about a client using commercial real estate data for lead gen. He mentioned https://compstak.com/

Basically, identifying companies that are doing well / expanding by how big the space is they leased. This sort of data is apparently very hard to get, but gives users a competitive advantage.


Real estate data, and companies like compstak are exactly the kind of niche markets I'm talking about. Agents are willing to spend large sums to get access to this data and it can be resold multiple times. Unfortunately it is also a market full of existing competition with some established players.

What other markets for data are similar? In general, data that leads to prospect generation is desirable because sales agents are willing to spend money to make money. Are there any other markets like that?


So it sounds like the salient aspect here isn't necessarily the type of data, but the manner in which that data is collected. Looks like compstak's success is a result of creating a platform that facilitates crowdsourcing data points that are difficult to acquire using traditional data collection approaches. That scarcity is what makes the data valuable, especially since the data can be used for leverage in a negotiation. Also, they appear to prop up the overall scarcity by only granting access to existing data to users who provide new data.[1]

I'm curious how they figure out how much to charge companies for this data? And also how they stop real estate insiders from gaining access without sharing new data?

[1] https://techcrunch.com/2012/10/18/compstak/


The medical markets, but I don't want to go any further because that's what I'm doing right now :p

SMART on FHIR is a newish standard for medical applications that is getting a HUGE push from large companies like Cerner and Epic, along with all the tech giants. Hospitals are itching for more FHIR apps that can integrate directly into their Electronic Health Record system (and web apps that can be delivered directly on a doctor's web portal within the hospital's IT system).

So that might be a good place to start poking around...

Here's a good brief overview: https://healthtechmagazine.net/article/2018/10/everything-yo...


A really good example of this is labeled medical imaging data.

Some key contributing factors: multiple stakeholders and consent/approval issues, legal and technical constraints on access, and, depending on the application, labeling that may only be possible using very expensive experts. Lots of human interaction.


I agree it is expensive to gather but is it something that can be sold at high cost to 1000s of customers? It seems the market for purchasers of that data might be limited to a small number of companies, probably hoping to build ML models.


The question I responded to was what made it hard and/or expensive to obtain, which I think I answered.

Commercial viability of doing so for profit is a different issue, but I see that's the other part of your original comment. It's not an obvious answer, partially because there are a lot of different scenarios within that blanket "medical imaging", and what the putative customer might want to do with it.


Yes, I should have followed the thread better. My mind is focused on a particular kind of commercial viability which is niche markets (in the 1000s to 10,000s) willing to pay for access to data.


One aspect I know of is internal private-business analytics. For example:

- How many truckloads of widgets did the widget company ship out of Warehouse A compared to Warehouse B in 2019 vs 2018?

- What is the purchase ratio of titanium to steel for Company X over the past 5 years?

This type of data is valuable for spotting emerging trends, minimizing risk, stock analysis, etc. It is very hard to find legitimate data like this on your own.


One example is that it could require a significant amount of human legwork - e.g. Google street view. Another example is it might require significant dev effort to clean and combine several raw data sets into a refined output data set.


The costs of digitization and annotation.


> I just can't think of any examples

My first thought is industry-specific business analysis data. I've often been an hour into an online deep-dive only to hit a paywall related to this. However, I'd think it would be hard to acquire the valuable aspects of this data without some kind of insider access (compared to web scraping, creative API mining, etc).

New data source needs seem to pop up out of nowhere - what about building a platform that would facilitate the "collect, clean and analyze data" aspect of this for non-technical business owners?


> what about building a platform that would facilitate the "collect, clean and analyze data" aspect of this for non-technical business owners?

One of the problems with this is how custom each data set for each client would be. My mind has been on this topic since the a16z article on "The new business of AI ..." [1] which was posted to HN in the last couple of days. The key idea is the question of how to decouple the process of collecting, cleaning and analyzing data from the process of acquiring customers for that data, and not how to solve the problem of collecting, cleaning and analyzing data. Developers want to solve the technical challenge (how to build the processes) but not the business challenge (how to find customers willing to buy the resultant data).

I do believe there is a market for start-ups to partner with existing companies to help them wrangle their data. It just isn't the market I'm thinking of.

1. https://a16z.com/2020/02/16/the-new-business-of-ai-and-how-i...


Sounds like you're thinking of creating a gold-mine business model for data: do the hard work of searching for, digging up, and processing rare and valuable data, then sell it at a premium.

A few questions: What types of businesses would be "customers of that data"? Brainstorming all potential customers separated by industry would be a good start. Are there any "data purchasing" trends you've seen lately?


> What types of businesses would be "customers of that data"?

That is the exact question! That is the wall we hit. If I could consistently answer that question then there is a business to be had.

Who is willing to pay for that kind of data? I actually considered getting together a larger group of friends to do that exact brainstorm. But even then I'm not sure it is such an easy question to answer.

Also a bit funny you called it a gold-mine business. I called data that meets those criteria Goldilocks data.


Can you be more specific as to what you mean by industry analysis data?


Not OP but I think he means market research data. I've thought about this as well. You pay some researcher some money to write a report on growth of a particular segment of an industry. Trade groups often do this and you hear stats like "Mobile usage expected to grow by X% in developing countries over the next Z years". But it is a multi-page report, probably including graphs, on some particular topic.

It matches roughly the kind of data I was talking about. It is expensive to generate since you have to pay a researcher some amount of money to write the report. The resultant report generally can be re-sold multiple times.

My problem with this kind of data is that you will be competing against AIs pretty soon, which will drive down the cost of generating such reports. And the price you can charge per report will be tied to how good a report you are capable of generating. It is also already a saturated market, so the real play is driving the cost of generation down, which is not what I want.


Instead of selling reports, would it make more sense to create dashboards that let users slice + dice data and view the insights?

In other words, instead of being a Gartner, focus on being a Crunchbase. That way, you can sell to both the end users of these insights (the companies in these industries) as well as the market research companies, themselves.



