You know, I was playing with a couple of scripts (mainly a Markov chain implementation) that used Craigslist as a data source. I got a good ways into it and had a working model (I could generate text based on the handful of inputs I had set up for it). It wasn't until that point (I know, stupid) that I went back through the ToS and realized I really couldn't do that.
Now, I know I might be able to get somewhere with them if I emailed them and asked for the data I wanted. Then again, it sounds like I probably wouldn't. Either way, while the data was surprisingly well suited to what I wanted to do, it isn't really available, at least not in any way I could actually use.
It makes me kind of sad every time I see someone do something like this. You've created something super cool, something with tons and tons of data. For what I was using it for (localized text samples), I can't think of a better, more complete place to get data from. But when you see someone build something cool on top of it, something you never thought of or never thought was important enough to implement, and you go out of your way to squash it, it makes the part of me that loves data feel bad. Craigslist has, at any given time, literally gigs of written text with all sorts of metadata on it. In the short bit I processed, I took all of the data from just a handful of boards and ended up with over 150 MB. Considering that most of that was less than 2 weeks old, they likely process up to a terabyte of data every year, with fairly good metadata on it. Seeing all of that only being used at face value seems like such a waste.
(As a side note, if you have a bunch of written text samples with information about where each of them was written, I would really like to talk to you. Actually, that goes for anyone with any sizeable amount of data. Just because you don't see a use for it doesn't mean that there isn't someone out there who would absolutely love to get their hands on it.)
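For anyone wondering what the Markov chain side of this looks like, here is a minimal sketch of a word-level text generator. This is not the script I actually ran; the function names (`build_chain`, `generate`), the `order` parameter, and the sample sentence are all made up for illustration, and the input is just a plain string you supply yourself.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return dict(chain)


def generate(chain, length=20, seed=None):
    """Walk the chain from a random starting state, emitting up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end: this state only appeared at the very end of the input.
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)


if __name__ == "__main__":
    # Hypothetical sample text, standing in for whatever corpus you are allowed to use.
    sample = "the cat sat on the mat and the cat ran off the mat"
    print(generate(build_chain(sample, order=2), length=10, seed=1))
```

With a tiny input like the one above the output mostly parrots the source. It only starts producing interesting text once the corpus is large and varied, which is exactly why a big pile of localized text is so appealing for this kind of thing.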