Release 1.24.1 - 4/17/2020

   * Allow gzip compression of input and output streams
     for tika-server (TIKA-3073).

Release 1.24 - 3/11/2020

   * Add scripts to run tika-server as a service via Eric Pugh,
     and add these scripts and jar as a new artifact in the
     release (TIKA-3010).

   * Upgrade Drew Noakes' metadata-extractor (TIKA-2952).

   * Enable optional extraction of structural tags in PDFs
     (alpha-grade) (TIKA-3026).

   * Tika app's --extract mode now outputs to STDOUT (TIKA-3035).

   * Add an optional Preflight parser for PDFs (TIKA-3055).

   * Improve detection of some zip-based formats (TIKA-3057).

   * Upgrade metadata-extractor to 2.13.0 (TIKA-2952).

   * Upgrade to POI 4.1.2 (TIKA-3047).

   * Extract XMP from PSD files (TIKA-3050).

   * Added XMLProfiler as an optional parser to profile XFA and XMP
     in PDFs (TIKA-3045).

   * Extract inline images that rely on the DCT filter
     from PDFs (TIKA-3041).

   * Upgrade to PDFBox 2.0.19 (TIKA-3033).

   * Fix bug in ASM parser configuration (TIKA-2992).

   * Upgrade to java-libpst 0.9.3 (TIKA-2546).

   * Fixed XLIFF12Parser failures with ToXMLHandler (TIKA-3014).

Release 1.23 - 12/02/2019

   * NOTE: The PDFParser now relies on OCRDPI to render page images
     when users configure OCR on rendered page images. This will
     have the effect of increasing rendered image size (TIKA-2624).

   * NOTE: tika-server no longer returns 415 for file types
     for which there is no parser.

   * Fix bug in AUTO OCR strategy in the PDFParser (TIKA-3002).

   * Fix incorrect height and width metadata extraction
     from JPEG images (TIKA-2630).

   * Upgrade to POI 4.1.1 (TIKA-2851).

   * Upgrade to PDFBox 2.0.17 (TIKA-2951).

   * Ensure that the PDFParser respects custom configuration
     of Tesseract from tika-config.xml via Eric Pugh (TIKA-2970).

   * Add parser for XLIFF v1.2 files (TIKA-2975).

   * Add mime type detection support for WebAssembly (TIKA-2894),
     HEIF / HEIC images (TIKA-2942), Digilite FDF (TIKA-2988);
     and xml-root detection for XFDF (TIKA-2990) and
     XDP (TIKA-2989).

   * Add an XLZ Parser (TIKA-2976).

   * Fix deadlock with ForkParser when InputStream throws
     IOException (TIKA-2892).

Release 1.22 - 07/29/2019

   * NOTE: tika-server no longer hard-codes the HtmlParser to handle
     XML files (TIKA-2910). Users must now configure that behavior
     via a tika-config.xml file.

   * NOTE: Known regression: PDFBOX-4587 -- PDF passwords with
     codepoints between 0xF000 and 0XF0000 will cause an exception.

   * Add parser for HWP v5 files via SooMyung Lee (soomyung) and
     JinSup Kim (ddoleye) (TIKA-2909).

   * Fix order of closing streams to avoid "Failed to close temporary
     resource" exception in TesseractOCRParser (TIKA-2908).

   * Improve AutoDetectReader performance by caching encoding
     detector (TIKA-1568).

   * Prevent RTFParser from outputting illegal tag
     combinations (TIKA-2889).

   * Fix RereadableInputStream to release all resources (TIKA-2903).

   * Implement custom language identifier in the tika-eval module
     based on OpenNLP's language detector; add 18 languages and
     add common words lists for all 121 languages (TIKA-2790).

   * Fix NPE in MimeTypesReader.releaseParser() via
     Eamonn Saunders (TIKA-2896).

   * Fix RTFParser to extract more content (TIKA-2883).

   * Add clientSubmitTime to the metadata extracted from
     PST files (TIKA-2898).

   * Improve StreamingZipContainerDetector for xltx, xltm and
     several other file formats (TIKA-2886).

Release 1.21 - 05/14/2019

   * Add optional AUTO mode to OCR'ing of PDFs. If tesseract is
     installed and on the path, and this option is selected
     programmatically or via TikaConfig(), the PDFParser will use
     heuristics to decide whether or not to run OCR per page
     on PDFs. (TIKA-2749)

   * The ZipContainerDetector's default behavior was changed to run
     streaming detection up to its markLimit. Users can get the
     legacy behavior
     (spool-to-file/rely-on-underlying-file-in-TikaInputStream)
     by setting markLimit=-1.
     The POIFSContainerDetector requires an underlying file; it will
     try to spool the file to disk; if the file's length is
     > markLimit, it will not attempt detection; set markLimit to -1
     for legacy behavior (TIKA-2849).

   * Upgrade PDFBox to 2.0.14 (TIKA-2834).

   * Add CSV detection and replace TXTParser with TextAndCSVParser;
     users can turn off CSV detection by excluding the
     TextAndCSVParser and adding back the TXTParser via
     tika-config (TIKA-2833).

   * Add a CSVParser. CSV detection is currently based solely on
     filename and/or information conveyed via Metadata (TIKA-2826).

   * General upgrades: asm, bouncycastle, commons-codec,
     commons-lang3, cxf, guava, h2, httpcomponents, jackcess,
     junrar, Lucene, mime4j, opennlp, parso, sqlite-jdbc (provided),
     zstd-jni (provided) (TIKA-2824)

   * Bundle xerces2 with tika-parsers (TIKA-2802).

   * Upgrade jaxb to 2.3.2 (TIKA-2819).

   * Upgrade jackson to 2.9.8 (TIKA-2717).

   * Update tika-eval's common tokens lists (TIKA-2822).

   * Handle bad tags in tika-eval more robustly (TIKA-2810).

   * Add reports for tags in tika-eval (TIKA-2809).

   * Extract text from SDT element within textboxes in
     .docx files (TIKA-2807).

   * Try to handle truncated OOXML files more robustly (TIKA-2765).

Release 1.20 - 12/17/2018

   * Upgrade to POI 4.0.1 (TIKA-2751).

   * Integrate/parameterize new angles handling in
     PDFBox (TIKA-2779).

   * Upgrade to PDFBox 2.0.13 (TIKA-2788).

   * Prevent content within