In your version you said you weren't happy with the HTML extractor. It's pretty hard to generalize that part, but one technique I found useful was having a flag that told the program to ignore all text until it found the first <p> tag.
In my testing, that removed ~90% of navigation text. (I notice you are only looking inside <p> tags. I had a flag for that too, but found it was unnecessary most of the time.)
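Here is a minimal sketch of the "ignore everything until the first <p>" flag, in Python with the standard library's html.parser. The names `AfterFirstParagraphExtractor`, `extract_text` and `skip_until_first_p` are made up for illustration. It is not the original code, just one way the idea could look.

```python
from html.parser import HTMLParser


class AfterFirstParagraphExtractor(HTMLParser):
    """Collects page text, optionally dropping everything before the first <p>."""

    def __init__(self, skip_until_first_p=True):
        super().__init__()
        self.skip_until_first_p = skip_until_first_p  # hypothetical flag name
        self.seen_p = False
        self.chunks = []
        self._ignored_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.seen_p = True
        elif tag in ("script", "style"):
            self._ignored_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._ignored_depth > 0:
            self._ignored_depth -= 1

    def handle_data(self, data):
        if self._ignored_depth:
            return  # never keep script or style contents
        if self.skip_until_first_p and not self.seen_p:
            return  # still in the leading navigation, so drop it
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html, skip_until_first_p=True):
    parser = AfterFirstParagraphExtractor(skip_until_first_p)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Note that this only drops text that comes before the first <p>. Anything after it (footers, sidebars later in the markup) still gets through, so it is a cheap heuristic rather than a real content extractor.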
Also, I found regular expressions weren't terrible for sentence boundary detection. OTOH, there was nothing like NLTK for Java when I wrote it anyway.
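For what it's worth, a regex-based splitter can be very small. The sketch below is an assumption about what such a rule might look like, not the pattern I actually used: it splits on whitespace that follows `.`, `!` or `?` and is followed by a capital letter or an opening quote. The name `split_sentences` is hypothetical.

```python
import re

# Boundary: whitespace preceded by sentence-ending punctuation and
# followed by something that looks like the start of a new sentence.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')


def split_sentences(text):
    """Split text into sentences using a single regex heuristic."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]
```

Calling `split_sentences("It works. Does it? Yes, mostly.")` gives three sentences. The obvious weakness is abbreviations: "Dr. Smith arrived." gets cut after "Dr.", which is the kind of case a trained tokenizer like the one in NLTK handles better.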