apache-tika

A content analysis toolkit that automatically detects file types and extracts both metadata and text content.

solr

elasticsearch

apache-nutch

tika-server

What is apache-tika?

The Apache Tika image packages Apache Tika, a content analysis toolkit that automatically detects file types and extracts both metadata and text content. It supports hundreds of formats, including PDFs, Microsoft Office documents, HTML, XML, images, audio, and video files.

Tika is widely used in search and indexing pipelines, document management systems, and content analytics workflows. It provides a uniform API for parsing diverse file formats, making it especially useful for integrating unstructured data into search engines like Apache Solr or Elasticsearch.

In containerized environments, the Apache Tika image allows developers to run Tika as a standalone server or microservice without installing the toolkit directly.

How to use this image

The Apache Tika image is commonly run in server mode to expose REST endpoints for parsing content.

Examples:

<code># Run Tika server on port 9998
docker run -d -p 9998:9998 apache/tika:latest</code>

<code># Extract plain text from a file using curl (-T uploads the file with HTTP PUT)
curl -T document.pdf --header "Accept: text/plain" http://localhost:9998/tika</code>

<code># Detect file type
curl -T image.png http://localhost:9998/detect/stream</code>
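The server may not accept requests the instant the container starts, so scripts that call it right after `docker run` can fail intermittently. The following is a minimal sketch of a readiness check, assuming the server answers on its `/version` endpoint at the mapped port. The function name `wait_for_tika` and the default of 30 one-second attempts are illustrative choices, not part of the image.

```shell
#!/usr/bin/env bash

# Poll the Tika server until it responds or the attempts run out.
# Arguments: base URL, optional number of attempts (hypothetical default: 30).
wait_for_tika() {
  local base_url="$1" attempts="${2:-30}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail makes curl return non-zero on HTTP errors as well as refused connections.
    if curl --silent --fail --output /dev/null "$base_url/version"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage (not executed here):
#   wait_for_tika http://localhost:9998 && curl -T document.pdf http://localhost:9998/tika
```

Polling a cheap endpoint keeps the check independent of any particular document or parser.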

The image can also serve as a parsing step in batch pipelines, where other jobs send documents to the running server and collect the extracted output.
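For the batch case, one possible approach is to loop over a directory and send each file to a running server. This is a sketch under stated assumptions: the server is already listening at `http://localhost:9998`, and the helper names (`out_path`, `extract_one`, `extract_dir`), the `TIKA_URL` variable, and the `.txt` output naming are all hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical setting; override it to point at your own server.
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Map an input file to its output path, e.g. docs/a.pdf -> out/a.pdf.txt
out_path() {
  local out_dir="$1" file="$2"
  printf '%s/%s.txt\n' "$out_dir" "$(basename "$file")"
}

# Upload one file to the /tika endpoint and save the plain-text response.
extract_one() {
  local file="$1" dest="$2"
  curl --silent --show-error --fail \
       --header "Accept: text/plain" \
       -T "$file" "$TIKA_URL/tika" > "$dest"
}

# Process every regular file in in_dir, writing results into out_dir.
extract_dir() {
  local in_dir="$1" out_dir="$2" file
  mkdir -p "$out_dir"
  for file in "$in_dir"/*; do
    [ -f "$file" ] || continue
    extract_one "$file" "$(out_path "$out_dir" "$file")" \
      || echo "failed: $file" >&2
  done
}

# Usage (not executed here):
#   extract_dir ./docs ./out
```

Failures are reported per file rather than aborting the run, so one unparseable document does not stop the rest of the batch.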

Image variants

The Apache Tika image is generally published in the apache/tika repository in these forms:

apache/tika:<version>

Version-pinned images (e.g., apache/tika:2.9.0) tied to specific releases. Recommended for production use.

apache/tika:latest

Tracks the most recent stable release. Useful for evaluation but not ideal for production due to version drift.

apache/tika:alpine

A lightweight variant based on Alpine Linux, providing smaller image sizes for environments that value efficiency.
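To keep deployments on a pinned release rather than on `latest`, the tag can be held in a single variable. A minimal sketch, assuming the `2.9.0` example tag from above; the `TIKA_VERSION` variable and the `tika_image` and `run_tika` helpers are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash

# Illustrative default; 2.9.0 is the example version mentioned above.
TIKA_VERSION="${TIKA_VERSION:-2.9.0}"

# Build the full image reference for a version tag (defaults to TIKA_VERSION).
tika_image() {
  printf 'apache/tika:%s\n' "${1:-$TIKA_VERSION}"
}

# Start the server from the pinned image on the default port.
run_tika() {
  docker run -d -p 9998:9998 "$(tika_image "$@")"
}

# Usage (not executed here):
#   run_tika          # uses TIKA_VERSION
#   run_tika 2.9.0    # explicit tag
```

Changing the version then becomes a one-line, reviewable edit rather than an implicit update pulled in by `latest`.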
