Release 1.19 - 9/14/2018

* Require Java 8 (TIKA-2679).
* Enable building with Java 11 (TIKA-2668)
* Add an option to make tika-server robust against infinite loops, OOMs, and memory leaks (TIKA-2725).
* Allow configuration of the Tesseract parser via the standard tika-config.xml options (TIKA-2705).
* Improve handling of empty cells across table-based formats (TIKA-2479).
* Add a standards-compliant HTML encoding detector via Gerard Bouchar (TIKA-2673).
* Improved XML parsing -- limited default entity expansions to 20. To raise this limit, add -Djdk.xml.entityExpansionLimit=XXX to your command line.
* Mime magic improvements for Olympus RAW (TIKA-2658), interpreted server-side languages via HTTP (TIKA-2648), MHTML (TIKA-2723)
* Add absolute timeout to ForkParser rather than testing for active (TIKA-2656).
* Make the RecursiveParserWrapper work with the ForkParser (TIKA-2655).
* Allow the ForkParser to specify a directory containing tika-app.jar for use by the ForkServer. This allows users to keep most of the parser dependencies out of their code, and it allows for easy addition of optional jars for Parser dependencies, such as the xerial sqlite jar (TIKA-2653).
* Use a pool for SAXParsers and DOMBuilders rather than creating a new parser/builder for every parse. For better performance, set XMLReaderUtils.setPoolSize() to the number of threads you're using with Tika (TIKA-2645).
* Add the RecursiveParserWrapperHandler to improve the RecursiveParserWrapper API slightly (TIKA-2644).
* Upgraded to Commons-Compress 1.18 (TIKA-2707).
* Upgraded to Apache POI 4.0.0 (TIKA-2552).
* Upgraded to Apache PDFBox 2.0.11 (TIKA-2681).
* Upgraded to deeplearning4j 1.0.0-beta2 (TIKA-2672).
* Upgraded jmatio to 1.4 (TIKA-2667)
* Upgraded Apache Lucene to 7.4.0 in tika-eval and tika-examples (TIKA-2695).
* Upgraded junrar to 1.0.1 (TIKA-2664).
* Numerous other upgrades (TIKA-2692).
* Excluded Spring as a transitive dependency (TIKA-2721).
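
The parser pooling described under TIKA-2645 can be pictured with a small sketch. It uses only the JDK's JAXP classes and is not Tika's XMLReaderUtils code. The class name SaxParserPoolSketch and its methods are hypothetical, and the blocking-queue design is an assumption chosen for illustration. The pool is sized to the number of worker threads, each parse borrows a parser, and the parser goes back to the pool afterwards.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; this is NOT Tika's XMLReaderUtils.
final class SaxParserPoolSketch {
    private final BlockingQueue<SAXParser> pool;

    // Creates all parsers up front; poolSize would normally match the thread count.
    SaxParserPoolSketch(int poolSize) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(factory.newSAXParser());
        }
    }

    // Borrows a parser, parses the XML string, and always returns the parser.
    void parse(String xml, DefaultHandler handler) throws Exception {
        SAXParser parser = pool.take(); // blocks while every parser is in use
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            parser.parse(new ByteArrayInputStream(bytes), handler);
        } finally {
            try {
                parser.reset(); // restore the original configuration before reuse
            } catch (UnsupportedOperationException e) {
                // some parser implementations do not support reset(); reuse as-is
            }
            pool.put(parser);
        }
    }

    // Number of parsers currently idle in the pool.
    int available() {
        return pool.size();
    }

    public static void main(String[] args) throws Exception {
        int threads = 4; // assumed number of worker threads
        SaxParserPoolSketch parsers = new SaxParserPoolSketch(threads);
        StringBuilder text = new StringBuilder();
        parsers.parse("<doc>hello</doc>", new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        System.out.println(text);
    }
}
```

With a pool smaller than the thread count, threads wait in take() for a free parser, which is why the release note suggests matching the pool size to the number of threads.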
Release 1.18 - 4/20/2018

* Upgrade jackson to 2.9.5 (TIKA-2634).
* Add support for brotli (TIKA-2621).
* Upgrade PDFBox to 2.0.9 and include new jbig2-imageio from org.apache.pdfbox (TIKA-2579 and TIKA-2607).
* Support for TIFF images in PDF files (TIKA-2338)
* Detection of fully encrypted 7z files (TIKA-2568)
* Various new mimes and typo fixes in tika-mimetypes.xml via Andreas Meier (TIKA-2527).
* Revert to listenForAllRecords=false in ExcelExtractor via Grigoriy Alekseev (TIKA-2590)
* Add workaround to identify TIFFs that might confuse commons-compress's tar detection via Daniel Schmidt (TIKA-2591)
* Ignore non-IANA supported charsets in HTML meta-headers during charset detection in HTMLEncodingDetector via Andreas Meier (TIKA-2592)
* Add detection and parsing of zstd (if user provides com.github.luben:zstd-jni) via Andreas Meier (TIKA-2576)
* Allow for RFC822 detection for files starting with "dkim-" and/or "x-" via Andreas Meier (TIKA-2578 and TIKA-2587)
* Extract xlsx files embedded in OLE objects within PPT and PPTX via Brian McColgan (TIKA-2588).
* Extract files embedded in HTML and JavaScript inside HTML that are stored in the Data URI scheme (TIKA-2563).
* Extract text from grouped text boxes in PPT (TIKA-2569).
* Extract language metadata item from PDF files via Matt Sheppard (TIKA-2559)
* RFC822 with multipart/mixed, first text element should be treated as the main body of the email, not an attachment (TIKA-2547).
* Swap out com.tdunning:json for com.github.openjson:openjson to avoid jar conflicts (TIKA-2556).
* No longer hardcode HtmlParser for XML files in tika-server (TIKA-2551).
* Require Java 8 (TIKA-2553).
* Add a parser for XPS (TIKA-2524).
* Mime magic for Dolby Digital AC3 and EAC3 files
* Fixed bug where TesseractOCRParser ignores configured ImageMagickPath, and set rotation script to ignore Python warnings (TIKA-2509)
* Upgrade geo-apis to 3.0.1 (TIKA-2535)
* Mime definition and magic improvements for text-based programming and config formats (TIKA-2554, TIKA-2567, TIKA-1141)
* Added local Docker image build using dockerfile-maven-plugin to allow images to be built from source (TIKA-1518).
* Support for SAS7BDAT data files (TIKA-2462)
* Handle .epub files using .htm rather than .html extensions for the embedded contents (TIKA-1288)
* Mime magic for ACES Images (TIKA-2628) and DPX Images (TIKA-2629)
* For sparse XLSX and XLSB files, always output missing cells to the left of filled ones (matching XLS), and optionally output missing rows on all 3 formats if requested via the OfficeParserContext (TIKA-2479)

Release 1.17 - 12/8/2017

***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN ON Java 7. The next versions will require Java 8***

* Fix thread-safety in ChmExtractor (TIKA-2519).
* Upgrade cxf to 3.0.16 (TIKA-2516).
* Allow users to configure maxMainMemoryBytes for PDFs via shrike (PR-213).
* Extract underline and strikethrough in docx (TIKA-2347 and TIKA-2512).
* Cache TikaConfig in EmbeddedDocumentUtil for better performance in documents with large number of attachments (TIKA-2511).
* Extract media files from ooxml (TIKA-2510).
* Standardize the way the Image and Video captioning dockers and extraction work (TIKA-2400, GitHub-208)
* Upgrade to xmpcore 5.1.3 (TIKA-2034).
* Upgrade to metadata-extractor 2.10.1 (TIKA-2486).
* Upgrade to OpenNLP 1.8.3 (TIKA-2502).
* Upgrade to Jackson 2.9.2 (TIKA-2501).
* Catch potential NPE in getting InputStream for attachments in PST file (TIKA-2488).
* Upgrade to PDFBox 2.0.8 (TIKA-2489).
* Allow configuration of markLimit in EncodingDetectors via tika-config.xml (TIKA-2485).
* RFC822Parser now selects the best alternative for multipart/alternative body components. This aligns with the behavior of the OutlookParser (TIKA-2478). Users can select legacy behavior via the "extractAllAlternatives" parameter in the RFC822 parser definition in tika-config.xml.
* Narrow mime detection for ms-owner files and add detection for .nls files (TIKA-2469).
* Fix bug in CharsetDetector that led to different detected charsets depending on whether the user called setText with a byte[] or an InputStream via Sean Story (TIKA-2475).
* Remove JAXB for easier use with Java 9 via Robert Munteanu (TIKA-2466).
* Upgrade to POI 3.17 (TIKA-2429).
* Enable extraction of standard references from text (TIKA-2449).
* Load external custom mimetypes XML from system property tika.custom-mimetypes (TIKA-2460).
* Extract number of tiffs in a multi-page tiff (TIKA-2451).
* Fix detection of emails extracted from mbox (TIKA-2456).
* Add OverrideDetector and allow PSTParser to specify body content type as text or html -- to avoid incorrect auto-detection of rfc/mbox, etc. (TIKA-2454)
* AutoDetectParser throws ZeroByteFileException for zero-byte files after detection on the file extension (TIKA-2450).
* Extract phonetic runs in docx with experimental SAX parser (TIKA-2448).
* Extract phonetic runs from xls and allow users to turn off extraction of phonetic runs in both xls and xlsx (TIKA-2440).
* OOXML locale should be set by POI's LocaleUtil not Locale.getDefault(). Fix unit tests to be robust against different locales in OOXML and ExcelParser (TIKA-2438).
* Upgrade to PDFBox 2.0.7 (TIKA-2431).
* Tika now has support for automatic image captioning, which combines Computer Vision and Natural Language Processing to automatically generate a readable caption for an image (TIKA-2262, TIKA-2355, TIKA-2402, Gh-198, Gh-196, Gh-189).
* Add TestCorruptedFiles to allow devs to test parsers against corrupted input files (TIKA-2430).
* Correct Mimetype definition for Windows batch files (CMD and BAT) which are the same (TIKA-2445)
* PSDParser memory use improvements (TIKA-2447)
* Add underline extraction from Word documents (doc/docx) via Stuart Hendren as well as strikethrough extraction in docx (TIKA-2347, GitHub-173)
* Corrected Tesseract OCR rotation.py script and made it a configurable option via Peter Weiss (TIKA-2385)

Release 1.16 - 7/7/2017

* Exclude jj2000 from edu.ucar grib to avoid potential license conflicts with ASL 2.0
* Add Age recognition using Ensemble model for Linear regression and Apache OpenNLP Maximum Entropy. Tika can now detect age from text (TIKA-1988).
* Add Tika Deep Learning support for the VGG16 model for Very Deep Convolutional Networks for Large-Scale Image Recognition. Now Tika supports both Inception v3/v4 and VGG16 based image recognition (TIKA-2298).
* Extract macros from PPT (TIKA-2089).
* Extract absolute path for last saved location when available in .xlsx and .xlsb (TIKA-2335).
* Rename SentimentParser to SentimentAnalysisParser to prevent conflict with dependency (TIKA-2368).
* tika-app now extracts inline images in PDFs by default, and it includes a warning to users that this is not the default behavior elsewhere in Tika (TIKA-2374).
* Allow configurability of warnings for problems during parser initialization (TIKA-2389).
* Upgrade to Jackcess 2.1.8 (TIKA-2380).
* Upgrade to POI 3.17-beta1 (TIKA-2336).
* Remove non-ASL-2.0-compatible org.json (TIKA-1804).
* Allow extraction of