
(I wrote the classifier4j summariser, as outlined here: https://hackertimes.com/item?id=1803020)

In your version you said you weren't happy with the HTML extractor. It's pretty hard to generalize that part, but one technique I found useful was having a flag that told the program to ignore all text until it found the first <p> tag.

In my testing, that removed ~90% of the navigation text. (I note you are only looking inside <p> tags; I had a flag for that too, but found it was unnecessary most of the time.)
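To make the "ignore everything until the first <p>" idea concrete, here is a minimal Python sketch of it. This is not the classifier4j code (that was Java); the class name, the function name and the skip_until_first_p flag are all made up for illustration, and it only uses the standard library's html.parser.

```python
from html.parser import HTMLParser


class TextAfterFirstParagraph(HTMLParser):
    """Collects page text, optionally ignoring everything before the first <p>.

    Hypothetical helper for illustration; not taken from classifier4j.
    """

    def __init__(self, skip_until_first_p=True):
        super().__init__()
        # When the flag is set, start out ignoring text (navigation, headers).
        self.collecting = not skip_until_first_p
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # The first <p> switches collection on; it stays on afterwards.
        if tag == "p":
            self.collecting = True

    def handle_data(self, data):
        text = data.strip()
        if self.collecting and text:
            self.chunks.append(text)


def extract_text(html, skip_until_first_p=True):
    """Return the collected text chunks joined by single spaces."""
    parser = TextAfterFirstParagraph(skip_until_first_p)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<div>Home | About</div><p>First sentence.</p><p>Second one.</p>"
print(extract_text(page))
print(extract_text(page, skip_until_first_p=False))
```

With the flag on, the "Home | About" navigation text is dropped; with it off, it comes through. Note that this still keeps any text after the first <p> that is not inside a paragraph (footers, for instance), and it does nothing special about script or style content.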

Also, I found regular expressions weren't terrible for sentence boundary detection. OTOH, there was nothing like NLTK for Java when I wrote it anyway.
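For a rough idea of what regex-based sentence boundary detection can look like, here is a small Python sketch. The pattern is my own assumption for illustration, not the expression classifier4j used: it splits on whitespace that follows ., ! or ? and precedes a capital letter or an opening quote.

```python
import re

# Assumed pattern: whitespace preceded by sentence-ending punctuation
# and followed by an uppercase letter or a quote character.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'])")


def split_sentences(text):
    """Split text into sentences using a simple regex heuristic."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


print(split_sentences("It works. Does it? Yes!"))
```

The obvious weakness is abbreviations: "Dr. Smith arrived." gets split after "Dr.", which is the kind of case a trained tokenizer such as the one in NLTK handles better.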


