Ask HN: How to extract text from popular document formats | Hacker News


		Ask HN: How to extract text from popular document formats
		6 points by JanezStupar on Oct 14, 2010 | 1 comment
		I guess this is probably the best place on the interwebs to get this kind of information. I'm looking for a library or tool that can extract raw text from popular document formats (PDF, Word 97/2003/2007, OpenOffice, RTF; any others would be a bonus). The tool does not need to be open source; commercial tools are also welcome, as long as they are not prohibitively expensive.

		The use case is extracting text for full-text indexing via Apache Solr. I am aware that Solr can index whole documents, but I would rather not have it juggle loads of raw documents. And I simply don't have enough time or motivation to roll my own parsers.

		Update: Too lazy to google it for myself? Apache Tika: http://tika.apache.org/

		Thanks a lot.
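
For anyone landing here with the same question, a minimal sketch of the Tika route, assuming Java is on the PATH and a tika-app jar has been downloaded (the `tika-app.jar` path and the function names are placeholders, not anything from the post). It shells out to Tika's command-line app and asks for plain-text output, which could then be sent to Solr as an ordinary text field.

```python
import subprocess

# Assumption: path to a locally downloaded tika-app jar (placeholder name).
TIKA_JAR = "tika-app.jar"


def tika_command(doc_path, jar=TIKA_JAR):
    """Build the command line that asks the Tika CLI for plain text in UTF-8."""
    return ["java", "-jar", str(jar), "--encoding=UTF-8", "--text", str(doc_path)]


def extract_text(doc_path, jar=TIKA_JAR, run=subprocess.run):
    """Return the plain text Tika extracts from doc_path.

    `run` is injectable so the function can be exercised without Java installed.
    """
    result = run(tika_command(doc_path, jar), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("report.pdf")` would return the document body as one string. Starting a JVM per file is slow, so for large batches a long-running Tika process is probably a better fit than this one-shot call.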

tsycho on Oct 14, 2010

antiword (http://freshmeat.net/projects/antiword) works well for Microsoft Word documents.
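
A rough sketch of calling it from a script, assuming the `antiword` binary is installed and on the PATH (the function names are made up for illustration). antiword prints the extracted text to stdout; the `-m UTF-8.txt` option selects its UTF-8 character mapping. It reads the older binary `.doc` format, not Word 2007 `.docx`, so the sketch rejects other extensions.

```python
import subprocess
from pathlib import Path


def antiword_command(doc_path):
    """Build the antiword command line, requesting UTF-8 output."""
    return ["antiword", "-m", "UTF-8.txt", str(doc_path)]


def doc_to_text(doc_path, run=subprocess.run):
    """Return the text antiword extracts from a binary .doc file.

    `run` is injectable so the function can be exercised without antiword installed.
    """
    if Path(doc_path).suffix.lower() != ".doc":
        # antiword does not handle .docx or other formats.
        raise ValueError(f"not a .doc file: {doc_path}")
    result = run(antiword_command(doc_path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```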

Are you trying to create an index/search utility for documents? If so, maybe we can combine efforts. I was planning to build one and wrote a small amount of code, but then got busy with other stuff and the project fell through the cracks.
