Release 2.0.0-ALPHA - 01/13/2021

   BREAKING CHANGES in 2.0.0

   * General
     * OCR is now triggered automatically for PDFs if tesseract is on the
       user's path. See
       https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
       for how to disable OCR.
     * Remove deprecated Metadata keys/properties (TIKA-1974).
     * Removed dangerous calls to read an InputStream or convert to bytes
       without specifying a charset.

   * tika-parsers
     * The parser modules have been broken into three main modules:
       tika-parsers-classic, tika-parsers-extended and
       tika-parsers-advanced. Users may now need to add
       tika-parsers-extended to tika-app and tika-server to include
       parsers that used to be included by default (for example: envi,
       gdal, grib, isatab, netcdf).
     * ChmParser was moved to org.apache.tika.parser.microsoft.chm
     * RTFParser was moved to org.apache.tika.parser.microsoft.rtf

   * tika-app

   * tika-server
     * tika-server now forks a process by default to isolate parsing in
       the forked process (this was called the -spawnChild option in
       tika-1.x). Clients must now expect that tika-server will restart
       on OOM, timeouts, crashes or after parsing a large number of
       files. When this happens, tika-server will restart and not receive
       connections for brief periods. The less robust, legacy behavior of
       not forking a process is available with "-noFork".
     * tika-server's /metadata endpoint requires tika-server-classic to
       write XMP/rdf output. This output is not available in
       tika-server-core.

   Other changes:

   * General code cleanup (PeterAlfredLee)

   * Great optimization in ForkParser (TIKA-3237).

   * Fix parsing of emails attached to other emails in PST files
     (TIKA-3004).

Release 1.25 - 11/25/2020

   * Fix inconsistent license in xmpcore (TIKA-3204).

   * General upgrades including some dependencies with recently found
     security vulnerabilities (TIKA-3119).

   * Add detection and a parser for flat ODF files (TIKA-3159).

   * Add extraction of macros from ODF files (TIKA-3161).
   * Add mime detection for hprof and hprof text files (TIKA-3144).

   * Add TextSignature and TextProfileSignature to tika-eval (TIKA-3145
     and TIKA-3146).

   * Create a metadata filter to trigger tika-eval stats post parsing
     (TIKA-3140).

   * Add a configurable metadata-filter for the RecursiveParserWrapper
     (TIKA-3137).

   * Parameterize writeLimit and maxEmbeddedResources for
     RecursiveParserWrapper in tika-server (TIKA-3133).

   * Add status endpoint to tika-server (TIKA-3129).

   * Remove whitelist/blacklist terminology (TIKA-3120).

   * Add detection for parquet files (TIKA-3115).

   * Add detection and parsing for bplist (TIKA-3104).

   * Enable metadata value filtering for RecursiveParserWrapper
     (TIKA-3137).

   * Add a basic parser for plist files based on
     com.googlecode.plist:dd-plist (TIKA-3104).

   * Read hyperlinked images from ODT files (TIKA-3156).

   * Updated GrobidRESTParser to use new API location (TIKA-3191).

   * Add FileProfiler to tika-eval (TIKA-3216).

   * Improved handling of zip files with STORED entries with data
     descriptor (TIKA-3196).

   * Add parsers for XLZ, IDML and MIF (TIKA-2976, TIKA-3188 and
     TIKA-3189).

   * Add the beginnings of a format-aware fuzzing module (TIKA-3083).

   * Add wrapper for Linux 'file' command for mime detection (TIKA-3215).

   * Added ability to skip parsing of embedded files in Tika Server
     (TIKA-3227).

Release 1.24.1 - 4/17/2020

   * Allow gzip compression of input and output streams for tika-server
     (TIKA-3073).

Release 1.24 - 3/11/2020

   * Add scripts to run tika-server as a service via Eric Pugh, and add
     these scripts and jar as a new artifact in the release (TIKA-3010).

   * Upgrade Drew Noakes' metadata-extractor (TIKA-2952).

   * Enable optional extraction of structural tags in PDFs (alpha-grade)
     (TIKA-3026).

   * Tika app's --extract mode now outputs to STDOUT (TIKA-3035).

   * Add an optional Preflight parser for PDFs (TIKA-3055).

   * Improve detection of some zip-based formats (TIKA-3057).
   * Upgrade metadata-extractor to 2.13.0 (TIKA-2952).

   * Upgrade to POI 4.1.2 (TIKA-3047).

   * Extract XMP from PSD files (TIKA-3050).

   * Added XMLProfiler as an optional parser to profile XFA and XMP in
     PDFs (TIKA-3045).

   * Extract inline images that rely on the DCT filter from PDFs
     (TIKA-3041).

   * Upgrade to PDFBox 2.0.19 (TIKA-3033).

   * Fix bug in ASM parser configuration (TIKA-2992).

   * Upgrade to java-libpst 0.9.3 (TIKA-2546).

   * Fixed XLIFF12Parser failures with ToXMLHandler (TIKA-3014).

Release 1.23 - 12/02/2019

   * NOTE: The PDFParser now relies on OCRDPI to render page images when
     users configure OCR on rendered page images. This will have the
     effect of increasing rendered image size (TIKA-2624).

   * NOTE: tika-server no longer returns 415 for file types for which
     there is no parser.

   * Fix bug in AUTO OCR strategy in the PDFParser (TIKA-3002).

   * Fix incorrect height and width metadata extraction from JPEG images
     (TIKA-2630).

   * Upgrade to POI 4.1.1 (TIKA-2851).

   * Upgrade to PDFBox 2.0.17 (TIKA-2951).

   * Ensure that the PDFParser respects custom configuration of Tesseract
     from tika-config.xml via Eric Pugh (TIKA-2970).

   * Add parser for XLIFF v1.2 files (TIKA-2975).

   * Add mime type detection support for WebAssembly (TIKA-2894), HEIF /
     HEIC images (TIKA-2942), Digilite FDF (TIKA-2988); and xml-root
     detection for XFDF (TIKA-2990) and XDP (TIKA-2989).

   * Add an XLZ Parser (TIKA-2976).

   * Fix deadlock with ForkParser when InputStream throws IOException
     (TIKA-2892).

Release 1.22 - 07/29/2019

   * NOTE: tika-server no longer hard-codes the HtmlParser to handle XML
     files (TIKA-2910). Users must now configure that behavior via a
     tika-config.xml file.

   * NOTE: Known regression: PDFBOX-4587 -- PDF passwords with codepoints
     between 0xF000 and 0XF0000 will cause an exception.

   * Add parser for HWP v5 files via SooMyung Lee (soomyung) and JinSup
     Kim (ddoleye) (TIKA-2909).

   * Fix order of closing streams to avoid "Failed to close temporary
     resource" exception in TesseractOCRParser (TIKA-2908).
   * Improve AutoDetectReader performance by caching encoding detector
     (TIKA-1568).

   * Prevent RTFParser from outputting illegal tag combinations
     (TIKA-2889).

   * Fix RereadableInputStream to release all resources (TIKA-2903).

   * Implement custom language identifier in the tika-eval module based
     on OpenNLP's language detector; add 18 languages and add common
     words lists for all 121 languages (TIKA-2790).

   * Fix NPE in MimeTypesReader.releaseParser() via Eamonn Saunders
     (TIKA-2896).

   * Fix RTFParser to extract more content (TIKA-2883).

   * Add clientSubmitTime to the metadata extracted from PST files
     (TIKA-2898).

   * Improve StreamingZipContainerDetector for xltx, xltm and several
     other file formats (TIKA-2886).

Release 1.21 - 05/14/2019

   * Add optional AUTO mode to OCR'ing of PDFs. If tesseract is installed
     and on the path, and this option is selected programmatically or via
     TikaConfig(), the PDFParser will use heuristics to decide whether or
     not to run OCR per page on PDFs (TIKA-2749).

   * The ZipContainerDetector's default behavior was changed to run
     streaming detection up to its markLimit. Users can get the legacy
     behavior (spool-to-file/rely-on-underlying-file-in-TikaInputStream)
     by setting markLimit=-1. The POIFSContainerDetector requires an
     underlying file; it will try to spool the file to disk; if the
     file's length is > markLimit, it will not attempt detection; set
     markLimit to -1 for legacy behavior (TIKA-2849).

   * Upgrade PDFBox to 2.0.14 (TIKA-2834).

   * Add CSV detection and replace TXTParser with TextAndCSVParser; users
     can turn off CSV detection by excluding the TextAndCSVParser and
     adding back the TXTParser via tika-config (TIKA-2833).

   * Add a CSVParser. CSV detection is currently based solely on filename
     and/or information conveyed via Metadata (TIKA-2826).
   * General upgrades: asm, bouncycastle, commons-codec, commons-lang3,
     cxf, guava, h2, httpcomponents, jackcess, junrar, Lucene, mime4j,
     opennlp, parso, sqlite-jdbc (provided), zstd-jni (provided)
     (TIKA-2824).

   * Bundle xerces2 with tika-parsers (TIKA-2802).

   * Upgrade jaxb to 2.3.2 (TIKA-2819).

   * Upgrade jackson to 2.9.8 (TIKA-2717).

   * Update tika-eval's common tokens lists (TIKA-2822).

   * Handle bad tags in tika-eval more robustly (TIKA-2810).

   * Add reports for tags in tika-eval (TIKA-2809).

   * Extract text from SDT element within textboxes in .docx files
     (TIKA-2807).

   * Try to handle truncated OOXML files more robustly (TIKA-2765).

Release 1.20 - 12/17/2018

   * Upgrade to POI 4.0.1 (TIKA-2751).

   * Integrate/parameterize new angles handling in PDFBox (TIKA-2779).

   * Upgrade to PDFBox 2.0.13 (TIKA-2788).

   * Prevent content within