Ask HN: How to extract text from popular document formats
6 points by JanezStupar on Oct 14, 2010 | 1 comment
I guess this is probably the best place on teh interwebz to get this kind of information.

I'm looking for a library, tool, or widget of any kind that would let me extract raw text from popular document formats (PDF, Word 97/2003/2007, OpenOffice, RTF; any others would be a bonus).

The tool does not need to be open source; commercial tools are also welcome, as long as they are not prohibitively expensive.

The use case is extracting text for full-text indexing via Apache Solr. I am aware that Solr can handle indexing whole documents - however I would rather not have it juggle loads of raw documents. And I simply haven't enough time/motivation to roll my own parsers.

Update: Too lazy to google it for myself? Apache Tika: http://tika.apache.org/

Thanks a lot.
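
For anyone landing here later, a minimal sketch of how the Tika route mentioned in the update might look from Python. It assumes a Tika server is already running locally and listening on its usual port 9998 (the server component postdates this thread, so treat that as an assumption); the function names `build_tika_request` and `extract_text` are made up for illustration.

```python
import urllib.request

# Assumed address of a locally running Tika server; adjust to your setup.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(document_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking the Tika server for plain text."""
    return urllib.request.Request(
        tika_url,
        data=document_bytes,
        headers={"Accept": "text/plain"},  # ask for raw text, not XHTML
        method="PUT",
    )


def extract_text(path, tika_url=TIKA_URL):
    """Send the file at `path` to Tika and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), tika_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

The string returned by `extract_text` could then be placed in a text field of whatever document gets sent to Solr, so Solr never has to see the original file.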



antiword (http://freshmeat.net/projects/antiword) works well for Microsoft Word documents.

Are you trying to create an index/search utility for documents? If so, maybe we can combine efforts. I was planning to build one and wrote a small amount of code, but then got busy with other stuff and the project fell through the cracks.
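
A rough sketch of calling antiword from a script, in case it helps. It assumes the `antiword` binary is installed and on the PATH, and that the `UTF-8.txt` character mapping file shipped with antiword is available; the helper names are made up for illustration. Note that antiword reads the older binary .doc format, so Word 2007 (.docx) files would need a different tool.

```python
import subprocess


def antiword_command(doc_path):
    """Return the argument list for an antiword run that emits UTF-8 text."""
    # -m selects the character mapping file; UTF-8.txt ships with antiword.
    return ["antiword", "-m", "UTF-8.txt", doc_path]


def extract_doc_text(doc_path):
    """Run antiword on a .doc file and return its text output."""
    result = subprocess.run(
        antiword_command(doc_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if antiword fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```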



