apache-tika
A content analysis toolkit that automatically detects file types and extracts both metadata and text content.
What is apache-tika?
The Apache Tika image packages Apache Tika, a content analysis toolkit that automatically detects file types and extracts both metadata and text content. It supports hundreds of formats, including PDFs, Microsoft Office documents, HTML, XML, images, audio, and video files.
Tika is widely used in search and indexing pipelines, document management systems, and content analytics workflows. It provides a uniform API for parsing diverse file formats, making it especially useful for integrating unstructured data into search engines like Apache Solr or Elasticsearch.
In containerized environments, the Apache Tika image allows developers to run Tika as a standalone server or microservice without installing the toolkit directly.
How to use this image
The Apache Tika image is commonly run in server mode to expose REST endpoints for parsing content.
Examples:
<code># Run Tika server on port 9998
docker run -d -p 9998:9998 apache/tika:latest</code>
<code># Extract plain text from a file using curl
curl -T document.pdf http://localhost:9998/tika --header "Accept: text/plain"</code>
<code># Detect file type
curl -T image.png http://localhost:9998/detect/stream</code>
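The same two calls can be made from application code. The following is a minimal sketch in Python using only the standard library. It assumes a server is listening on localhost:9998 as in the docker run example; the helper names (build_put_request, send, extract_text, detect_type) are illustrative and not part of Tika.

```python
import urllib.request

# Assumption: a Tika server is reachable here, e.g. started with the docker run command above.
TIKA_URL = "http://localhost:9998"


def build_put_request(endpoint, data, accept="text/plain"):
    """Build a PUT request for a Tika server endpoint (hypothetical helper)."""
    return urllib.request.Request(
        url=f"{TIKA_URL}{endpoint}",
        data=data,
        method="PUT",
        headers={
            "Accept": accept,
            # Set explicitly so urllib does not fall back to a form-encoded
            # content type; the server is assumed to detect the real type
            # from the bytes themselves.
            "Content-Type": "application/octet-stream",
        },
    )


def send(request, timeout=60):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_text(path):
    """PUT the file to /tika and return the extracted plain text."""
    with open(path, "rb") as handle:
        return send(build_put_request("/tika", handle.read()))


def detect_type(path):
    """PUT the file to /detect/stream and return the reported media type."""
    with open(path, "rb") as handle:
        return send(build_put_request("/detect/stream", handle.read())).strip()
```

With the server running, detect_type("image.png") returns the media type reported by the server, and extract_text("document.pdf") returns the extracted text as a string.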
The image can also serve as a processing step in batch pipelines, where a job sends each document to the running server and stores the extracted output.
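A batch job of this kind might look like the sketch below. It assumes a server on localhost:9998 and walks a single directory without recursion; the function names and the one-output-file-per-input layout are illustrative choices, not something the image prescribes.

```python
import urllib.request
from pathlib import Path

# Assumption: a Tika server is listening on this address.
TIKA_ENDPOINT = "http://localhost:9998/tika"


def extract_text(path, endpoint=TIKA_ENDPOINT):
    """PUT one file to the server and return the extracted plain text."""
    request = urllib.request.Request(
        endpoint,
        data=Path(path).read_bytes(),
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Generic type; the server is assumed to detect the real format.
            "Content-Type": "application/octet-stream",
        },
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_directory(source_dir, output_dir, extract=extract_text):
    """Write one .txt file per input document; return (path, error) pairs for failures."""
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    failures = []
    for path in sorted(p for p in source_dir.iterdir() if p.is_file()):
        try:
            text = extract(path)
        except OSError as error:  # urllib's URLError and HTTPError derive from OSError
            failures.append((path, error))
            continue
        (output_dir / (path.name + ".txt")).write_text(text, encoding="utf-8")
    return failures
```

Passing the extractor as a parameter keeps the directory walk independent of the network call, so a failed document is recorded and the rest of the batch continues.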
Image variants
The Apache Tika image is generally published in the apache/tika repository in these forms:
apache/tika:<version>
Version-pinned images (e.g., apache/tika:2.9.0) tied to specific releases. Recommended for production use.
apache/tika:latest
Tracks the most recent stable release. Useful for evaluation, but not ideal for production because the tag moves with each release and deployments can change without notice.
apache/tika:alpine
A lightweight variant based on Alpine Linux, providing a smaller image for environments where download size and disk footprint matter.