libtika-java

Apache Tika - content analysis toolkit
  http://tika.apache.org



The Apache Tika toolkit detects and extracts metadata and text content from many document formats (PPT, CSV, PDF, MP3, HTML, and more) using existing parser libraries. Tika unifies these parsers under a single interface, so you can parse over a thousand different file types through one API. Tika is useful for search engine indexing, content analysis, translation, and much more.
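
The "single interface" idea above can be illustrated with a minimal, standard-library-only Java sketch. This is not Tika's API: the class `UnifiedParserSketch`, the interface `TextParser`, and the methods `detectType` and `parseToText` are hypothetical names invented for illustration, and the three toy parsers stand in for the real parser libraries that Tika wraps. Type detection here is assumed to be a simple file-extension lookup.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these names come from Tika itself.
class UnifiedParserSketch {

    /** The one interface that every format-specific parser implements. */
    interface TextParser {
        String extractText(String rawContent);
    }

    // Registry mapping a detected type (here: a file extension) to its parser.
    private static final Map<String, TextParser> PARSERS = new LinkedHashMap<>();

    static {
        // Toy CSV parser: turn field separators into spaces.
        PARSERS.put("csv", raw -> raw.replace(',', ' ').trim());
        // Toy HTML parser: strip tags with a naive regex, then collapse whitespace.
        PARSERS.put("html", raw -> raw.replaceAll("<[^>]*>", " ")
                                      .replaceAll("\\s+", " ")
                                      .trim());
        // Plain text needs no conversion beyond trimming.
        PARSERS.put("txt", String::trim);
    }

    /** Detects the type from the file extension (assumption: lowercase extension = type). */
    static String detectType(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Single entry point: callers never pick a parser themselves. */
    static String parseToText(String fileName, String rawContent) {
        TextParser parser = PARSERS.get(detectType(fileName));
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for: " + fileName);
        }
        return parser.extractText(rawContent);
    }

    public static void main(String[] args) {
        System.out.println(parseToText("report.csv", "name,age"));
        System.out.println(parseToText("page.HTML", "<p>Hello <b>Tika</b></p>"));
    }
}
```

The design point is that callers only ever use `parseToText`; supporting a new format means registering one more `TextParser`, with no change to calling code.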