I work at a startup where we need to parse semi-structured documents and understand them. Are there any pointers that would help me get started with this? I have been trying out a naive classifier with some success. I would also like to know about other techniques and/or libraries (I am currently using Ruby).
Thanks