Hacker News

I think this is a cool project to undertake, but yes, I would group myself among those who think Summly was trivial. Not in a technical sense, but in a do-people-need-it sense. That is, the best news sites already have relevant headlines and good meta-descriptions. What else could even the best extraction algorithm pull out -- from the text -- that would aid my decision to read the article? In this problem space, more is not better, because the point is to keep things as concise as possible. I argue that this is largely already done by the content editors. And for sites that do it poorly, I would suspect that their articles will yield equally poor extractions.

This is not to say that there aren't valuable things that can be added to an article's overview...but I think direct quotes/passages, or summaries thereof, are the least interesting things. I'd rather see something like: how many popular users on Twitter (i.e. not spambots) have tweeted this post, and what are the best comments that are not merely tweets of the title? Or maybe just pull out the best comments from the stories if Disqus has been enabled. These are the kinds of things that would be useful auxiliary information before actually reading the article.



Danso, you just opened up an entirely new wave of thinking in my head with this line: "I'd rather see something like, how many popular users on Twitter (i.e. not spambots) have tweeted this post, and what are the best comments that are not merely tweets of the title? Or, maybe just pull out the best comments from the stories if Disqus has been enabled." :)

Do-people-need-it? No, probably not. But I think it helps me get to the news articles I really want to read. If I read an interesting extracted sentence or two, I feel like I am more likely to go on and read the article.

>> And for sites that do it poorly, I would suspect that their articles will have equally poor extractions.

You could not be more right. I have been experimenting with an "Entertainment" section. You would not believe how poor some of the extractions are, purely because of the writing style.

I don't think that this solution is perfect, but the wheels are turning inside my head on where I should take this.


Well, I'm glad you approve, because as a former content writer, I resent anyone who thinks that their algorithm can provide more value than I do ;)

No, really though: I think that because human writing (and, just as importantly, HTML layout) is so diverse, even a very good algorithm risks extracting something banal or redundant. Cross-referencing an article with social data, however, is low-hanging fruit and, at the same time, pretty useful.

For example, take a post like the currently popular "Introduction to Position Based Fluids" (http://physxinfo.com/news/11109/introduction-to-position-bas...)

An extractor is immediately going to have a problem here because most of the meat is in the video...there is some descriptive text, but it doesn't add much more than the title and the meta description ("Position Based Fluids is a way of simulating liquids using Position Based Dynamics (PBD) approach").

However, the top HN comments are insightful...for example:

"The only uncanny valley effect here is that the water doesn't wet anything - everything is perfect Teflon. So it's hard to judge whether the movement of the water is perfectly realistic, because I keep looking at the transitions. If I don't look at surfaces, it seems very convincing."

And since the mechanism is the same, why not extract the top comment from Reddit?

"Wow. Not much else to say but wow."

OK, not that great in this case...but the retrieval work is trivial. More importantly, both of these sites have built-in metadata telling you the value of each comment...not just upvotes, but how much discussion it generated...there are plenty of useful ways to slice this data numerically. And the same pattern would apply to Twitter/Facebook...pull in the comments/reactions from users with high social cred.
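To make the "slice this data numerically" idea concrete, here's a minimal sketch of one way to rank retrieved comments by upvotes plus the discussion they generated. The record shape (`score`, `replies`) and the weighting are hypothetical stand-ins, not what any site's API actually returns:

```python
# Hypothetical comment records: {"score": upvotes, "replies": [child comments]}.

def comment_value(comment, reply_weight=2.0):
    """Score a comment by its upvotes plus how much discussion it spawned."""
    return comment["score"] + reply_weight * len(comment.get("replies", []))

def top_comments(comments, n=3):
    """Return the n highest-value comments, best first."""
    return sorted(comments, key=comment_value, reverse=True)[:n]
```

With this weighting, a 4-upvote comment that drew four replies outranks a 10-upvote comment with none, which is exactly the "how much discussion it generated" signal.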

In other words, why resort to machine summarization when you have perfectly fine humans doing it, who are (in some cases) actually adding new insights?

And there's nothing stopping you from applying text mining to the social data/commentary you've collected...for example, showing only comments that directly refer to the main keywords/proper nouns in the article (this will keep your algorithm from collecting comments like "WOW, AWESOME!!" that get upvotes merely for being positive).
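A hedged sketch of that keyword filter, assuming you already have the article's keywords/proper nouns from its title or body (keyword extraction itself is out of scope here):

```python
import re

def tokenize(text):
    """Lowercased word tokens from a comment."""
    return set(re.findall(r"[a-z']+", text.lower()))

def substantive_comments(comments, article_keywords, min_overlap=1):
    """Keep only comments sharing at least min_overlap keywords with the article."""
    keywords = {k.lower() for k in article_keywords}
    return [c for c in comments if len(tokenize(c) & keywords) >= min_overlap]
```

Run against the fluid-simulation example above, the Teflon comment survives while "Wow. Not much else to say but wow." gets dropped.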

Anyway, this seems like an easier route with a lot of value...maybe it's so obvious and technically easy that it's not sexy...but I think it would be very useful.


>> I argue that this is largely already done by the content editors.

Certainly true for professional media outlets, and therefore true of the mass market most summarizers are trying to hit. But just take most of the best stuff that reaches HN: personal blogs, GitHub repos, documentation. Really hard but potentially rewarding stuff to summarize. I currently do this by hand in the email newsletter space, but am certainly interested in any advancements that'll save me time one day :-)


Peter,

So funny that you mentioned GitHub docs, as I am working on something custom for that (my sentence parsing works really badly with code... who would've thought!). You are exactly right about the potential of the things that can be summarized. Currently Summary.io only shows sites with images, but the summaries are still being generated! Appreciate the comment. If you have any other ideas on what I could summarize, let me know!
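For what it's worth, one common workaround (a sketch, not how Summary.io actually handles it) is to strip fenced and inline code out of a README before sentence parsing, so the parser only ever sees prose. This assumes Markdown-style triple-backtick fences:

```python
import re

def strip_code(markdown):
    """Remove fenced blocks and inline code spans before sentence parsing."""
    # `{3} matches a run of three backticks, i.e. a Markdown fence.
    no_fences = re.sub(r"`{3}.*?`{3}", "", markdown, flags=re.DOTALL)
    # Drop inline `code` spans too, which otherwise confuse tokenizers.
    return re.sub(r"`[^`]+`", "", no_fences)
```

This is crude (it ignores indented code blocks and fences longer than three backticks), but it keeps `x = 1` style lines from being glued into neighboring sentences.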



