Release 1.17 - December 8, 2017

***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN ON Java 7.
The next versions will require Java 8***

* Fix thread-safety in ChmExtractor (TIKA-2519).
* Upgrade cxf to 3.0.16 (TIKA-2516).
* Allow users to configure maxMainMemoryBytes for PDFs via shrike
  (PR-213).
* Extract underline and strikethrough in docx (TIKA-2347 and
  TIKA-2512).
* Cache TikaConfig in EmbeddedDocumentUtil for better performance in
  documents with a large number of attachments (TIKA-2511).
* Extract media files from ooxml (TIKA-2510).
* Standardize the way the Image and Video captioning dockers and
  extraction work (TIKA-2400, GitHub-208).
* Upgrade to xmpcore 5.1.3 (TIKA-2034).
* Upgrade to metadata-extractor 2.10.1 (TIKA-2486).
* Upgrade to OpenNLP 1.8.3 (TIKA-2502).
* Upgrade to Jackson 2.9.2 (TIKA-2501).
* Catch potential NPE in getting InputStream for attachments in PST
  file (TIKA-2488).
* Upgrade to PDFBox 2.0.8 (TIKA-2489).
* Allow configuration of markLimit in EncodingDetectors via
  tika-config.xml (TIKA-2485).
* RFC822Parser now selects the best alternative for
  multipart/alternative body components. This aligns with the
  behavior of the OutlookParser (TIKA-2478). Users can select legacy
  behavior via the "extractAllAlternatives" parameter in the RFC822
  parser definition in tika-config.xml.
* Narrow mime detection for ms-owner files and add detection for
  .nls files (TIKA-2469).
* Fix bug in CharsetDetector that led to different detected charsets
  depending on whether the user called setText with a byte[] or an
  InputStream via Sean Story (TIKA-2475).
* Remove JAXB for easier use with Java 9 via Robert Munteanu
  (TIKA-2466).
* Upgrade to POI 3.17 (TIKA-2429).
* Enable extraction of standard references from text (TIKA-2449).
* Load external custom mimetypes XML from system property
  tika.custom-mimetypes (TIKA-2460).
* Extract number of tiffs in a multi-page tiff (TIKA-2451).
* Fix detection of emails extracted from mbox (TIKA-2456).
* Add OverrideDetector and allow PSTParser to specify body content
  type as text or html -- to avoid incorrect auto-detection of
  rfc/mbox, etc. (TIKA-2454).
* AutoDetectParser throws ZeroByteFileException for zero-byte files
  after detection on the file extension (TIKA-2450).
* Extract phonetic runs in docx with experimental SAX parser
  (TIKA-2448).
* Extract phonetic runs from xls and allow users to turn off
  extraction of phonetic runs in both xls and xlsx (TIKA-2440).
* OOXML locale should be set by POI's LocaleUtil not
  Locale.getDefault(). Fix unit tests to be robust against different
  locales in OOXML and ExcelParser (TIKA-2438).
* Upgrade to PDFBox 2.0.7 (TIKA-2431).
* Tika now has support for automatic image captioning, which
  combines Computer Vision and Natural Language Processing to
  automatically generate a readable caption for an image (TIKA-2262,
  TIKA-2355, TIKA-2402, Gh-198, Gh-196, Gh-189).
* Add TestCorruptedFiles to allow devs to test parsers against
  corrupted input files (TIKA-2430).
* Correct Mimetype definition for Windows batch files (CMD and BAT)
  which are the same (TIKA-2445).
* PSDParser memory use improvements (TIKA-2447).
* Add underline extraction from Word documents (doc/docx) via Stuart
  Hendren as well as strikethrough extraction in docx (TIKA-2347,
  GitHub-173).
* Corrected Tesseract OCR rotation.py script and made it a
  configurable option via Peter Weiss (TIKA-2385).

Release 1.16 - 7/7/2017

* Exclude jj2000 from edu.ucar grib to avoid potential license
  conflicts with ASL 2.0.
* Add Age recognition using Ensemble model for Linear regression and
  Apache OpenNLP Maximum Entropy. Tika can now detect age from text
  (TIKA-1988).
* Add Tika Deep Learning support for the VGG16 model for Very Deep
  Convolutional Networks for Large-Scale Image Recognition. Now Tika
  supports both Inception v3/v4 and VGG16 based image recognition
  (TIKA-2298).
* Extract macros from PPT (TIKA-2089).
* Extract absolute path for last saved location when available in
  .xlsx and .xlsb (TIKA-2335).
* Rename SentimentParser to SentimentAnalysisParser to prevent
  conflict with dependency (TIKA-2368).
* tika-app now extracts inline images in PDFs by default, and it
  includes a warning to users that this is not the default behavior
  elsewhere in Tika (TIKA-2374).
* Allow configurability of warnings for problems during parser
  initialization (TIKA-2389).
* Upgrade to Jackcess 2.1.8 (TIKA-2380).
* Upgrade to POI 3.17-beta1 (TIKA-2336).
* Remove non-ASL-2.0-compatible org.json (TIKA-1804).
* Allow extraction of