I guess this is probably the best place on teh interwebz to get this kind of information.
I'm looking for a library, tool, or any other widget that would let me extract raw text from popular document formats (PDF, Word 97/2003/2007, OpenOffice, RTF; any others would be a bonus).
The tool does not need to be open source. Commercial tools are also welcome, as long as they are not prohibitively expensive.
The use case is extracting text for full-text indexing via Apache Solr. I am aware that Solr can index whole documents itself; however, I would rather not have it juggle loads of raw documents. And I simply don't have the time or motivation to roll my own parsers.
Update: Too lazy to google it for myself? Apache Tika: http://tika.apache.org/
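For anyone landing here with the same use case, here is a rough sketch of how the pieces might fit together: send each file to a running Tika server, take back the plain text, and hand only that text to Solr. It assumes tika-server and Solr are both running locally on their default ports; the core name "documents", the field names "id" and "text", and the helper names are made up for illustration, so adjust them to your own setup.

```python
import json
import urllib.request

# Assumptions: tika-server and Solr run locally on their default ports.
# The Solr core "documents" and the fields "id" and "text" are hypothetical.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/documents/update?commit=true"


def build_tika_request(data, tika_url=TIKA_URL):
    """Build a PUT request asking tika-server for the plain text of a document."""
    return urllib.request.Request(
        tika_url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def build_solr_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """Build a POST request that adds one document to Solr as JSON."""
    body = json.dumps([{"id": doc_id, "text": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path):
    """Extract text from one file via Tika, then send only that text to Solr."""
    with open(path, "rb") as f:
        tika_req = build_tika_request(f.read())
    with urllib.request.urlopen(tika_req) as resp:
        # Assumes the server answers with UTF-8 encoded plain text.
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(path, text)) as resp:
        return resp.status
```

Calling `index_file("report.pdf")` would then index that one file, using its path as the document id. Keeping the request builders separate from the network calls makes it easy to check what will be sent before pointing this at a real server.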
Thanks a lot.
Are you trying to create an index/search utility for documents? If so, maybe we can combine efforts. I was planning to build one and wrote a small amount of code, but then got busy with other stuff and the project fell through the cracks.