Release 1.20 - 12/17/2018
* Upgrade to POI 4.0.1 (TIKA-2751).
* Integrate/parameterize new angles handling in
PDFBox (TIKA-2779).
* Upgrade to PDFBox 2.0.13 (TIKA-2788).
* Prevent content within <style/> and <script/> elements
from being written in the ToTextContentHandler (TIKA-2550).
* Switch child to parent communication to a shared memory-mapped
file in tika-server's -spawnChild mode.
* Fix bug in tika-server when run in legacy mode (not -spawnChild)
that caused it to return 503 on documents submitted after
it hit an OutOfMemoryError (TIKA-2776).
* Upgrade jaxb-runtime and javax.activation (TIKA-2778).
* tika-app in batch mode now requires an interrupt or
kill signal to the parent process to stop the parent
and the child processes (TIKA-2780).
* Bulk upgrade of dependencies (TIKA-2775).
* Improve language id efficiency in tika-eval (TIKA-2777).
* Upgrade sqlite "provided" dependency to 3.25.2 (TIKA-2773).
* Remove duplication of notes in PPT slides (TIKA-2735)
* Use -javaHome or $JAVA_HOME (if they exist) when
spawning child in tika-server's -spawnChild mode.
* Fixed closing of styles around Hyperlinks in Word Parser,
contributed by Ronan O'Sullivan (TIKA-2599).
Release 1.19.1 - 10/4/2018
* Update PDFBox to 2.0.12, jempbox to 1.8.16
and jbig2 to 3.0.2 (TIKA-2745).
* Fix regression in parser for MP3 files (TIKA-2730).
* Updated Python Dependency Check for TesseractOCR (TIKA-2740).
* Improve SAXParser robustness (TIKA-2727).
* Remove dependency on slf4j-log4j12 by upgrading jmatio (TIKA-2742).
* Replace com.sun.xml.bind:jaxb-impl and jaxb-core with
org.glassfish.jaxb:jaxb-runtime and jaxb-core (TIKA-2743)
Release 1.19 - 9/14/2018
* Require Java 8 (TIKA-2679).
* Enable building with Java 11 (TIKA-2668)
* Add an option to make tika-server robust against infinite loops,
OOMs, and memory leaks (TIKA-2725).
* Allow configuration of the Tesseract parser via the standard
tika-config.xml options (TIKA-2705).
* Improve handling of empty cells across table-based
formats (TIKA-2479).
* Add a Standards compliant HTML encoding detector
via Gerard Bouchar (TIKA-2673).
* Improved XML parsing -- limited default entity expansions to 20.
To raise this limit, add -Djdk.xml.entityExpansionLimit=XXX to
your commandline.
* Mime magic improvements for Olympus RAW (TIKA-2658), interpreted
server-side languages via HTTP (TIKA-2648), MHTML (TIKA-2723)
* Add absolute timeout to ForkParser rather than testing
for active (TIKA-2656).
* Make the RecursiveParserWrapper work with the ForkParser (TIKA-2655).
* Allow the ForkParser to specify a directory containing tika-app.jar
for use by the ForkServer. This lets users keep most of the
parser dependencies out of their code, and it makes it easy to
add optional jars for Parser dependencies,
such as the xerial sqlite jar (TIKA-2653).
* Use a pool for SAXParsers and DOMBuilders rather than creating
a new parser/builder for every parse.
For better performance, set XMLReaderUtils.setPoolSize() to the
number of threads you're using with Tika (TIKA-2645).
* Add the RecursiveParserWrapperHandler to improve the RecursiveParserWrapper
API slightly (TIKA-2644).
* Upgraded to Commons-Compress 1.18 (TIKA-2707).
* Upgraded to Apache POI 4.0.0 (TIKA-2552).
* Upgraded to Apache PDFBox 2.0.11 (TIKA-2681).
* Upgraded to deeplearning4j 1.0.0-beta2 (TIKA-2672).
* Upgraded jmatio to 1.4 (TIKA-2667)
* Upgraded Apache Lucene to 7.4.0 in tika-eval and tika-examples (TIKA-2695).
* Upgraded junrar to 1.0.1 (TIKA-2664).
* Numerous other upgrades (TIKA-2692).
* Excluded Spring as a transitive dependency (TIKA-2721).
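Note on the entity expansion limit listed above for 1.19: the sketch
below uses only JDK classes (no Tika code) to show what such a limit
does. It sets the per-parser JAXP property that corresponds to the
jdk.xml.entityExpansionLimit system property, then parses documents
with more or fewer entity references than the limit. The class name
EntityLimitDemo and its helper methods are hypothetical, and this is
an illustration of the JDK mechanism, not Tika's own implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Tika.
class EntityLimitDemo {
    // JAXP per-parser counterpart of -Djdk.xml.entityExpansionLimit.
    static final String LIMIT_PROPERTY =
        "http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit";

    // Build a small document that references the entity "a" count times.
    static String xmlWithExpansions(int count) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\"?>");
        sb.append("<!DOCTYPE r [<!ENTITY a \"x\">]>");
        sb.append("<r>");
        for (int i = 0; i < count; i++) {
            sb.append("&a;");
        }
        sb.append("</r>");
        return sb.toString();
    }

    // Return true if the document parses under the given expansion limit,
    // false if the parser rejects it.
    static boolean parses(String xml, int limit) {
        SAXParser parser;
        try {
            parser = SAXParserFactory.newInstance().newSAXParser();
            parser.setProperty(LIMIT_PROPERTY, Integer.valueOf(limit));
        } catch (ParserConfigurationException | SAXException e) {
            // The parser could not be created or does not know the property.
            throw new IllegalStateException(e);
        }
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            parser.parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Raised when the expansion limit is exceeded (or the XML is invalid).
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("10 expansions, limit 20: "
            + parses(xmlWithExpansions(10), 20));
        System.out.println("30 expansions, limit 20: "
            + parses(xmlWithExpansions(30), 20));
        System.out.println("30 expansions, limit 100: "
            + parses(xmlWithExpansions(30), 100));
    }
}
```

If legitimate documents are rejected with the default of 20, raising
the limit on the command line as described in the 1.19 entry is the
documented way to allow them.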
Release 1.18 - 4/20/2018
* Upgrade jackson to 2.9.5 (TIKA-2634).
* Add support for brotli (TIKA-2621).
* Upgrade PDFBox to 2.0.9 and include new jbig2-imageio
from org.apache.pdfbox (TIKA-2579 and TIKA-2607).
* Support for TIFF images in PDF files (TIKA-2338)
* Detection of fully encrypted 7z files (TIKA-2568)
* Various new mimes and typo fixes in tika-mimetypes.xml
via Andreas Meier (TIKA-2527).
* Revert to listenForAllRecords=false in ExcelExtractor
via Grigoriy Alekseev (TIKA-2590)
* Add workaround to identify TIFFs that might confuse
commons-compress's tar detection via Daniel Schmidt
(TIKA-2591)
* Ignore non-IANA supported charsets in HTML meta-headers
during charset detection in HTMLEncodingDetector
via Andreas Meier (TIKA-2592)
* Add detection and parsing of zstd (if user provides
com.github.luben:zstd-jni) via Andreas Meier (TIKA-2576)
* Allow for RFC822 detection for files starting with "dkim-"
and/or "x-" via Andreas Meier (TIKA-2578 and TIKA-2587)
* Extract xlsx files embedded in OLE objects within PPT and PPTX
via Brian McColgan (TIKA-2588).
* Extract files embedded in HTML and javascript inside HTML
that are stored in the Data URI scheme (TIKA-2563).
* Extract text from grouped text boxes in PPT (TIKA-2569).
* Extract language metadata item from PDF files via Matt Sheppard (TIKA-2559)
* In RFC822 messages with multipart/mixed, treat the first text element
as the main body of the email, not as an attachment (TIKA-2547).
* Swap out com.tdunning:json for com.github.openjson:openjson to avoid
jar conflicts (TIKA-2556).
* No longer hardcode HtmlParser for XML files in tika-server (TIKA-2551).
* Require Java 8 (TIKA-2553).
* Add a parser for XPS (TIKA-2524).
* Mime magic for Dolby Digital AC3 and EAC3 files
* Fixed bug where TesseractOCRParser ignores configured ImageMagickPath,
and set rotation script to ignore Python warnings (TIKA-2509)
* Upgrade geo-apis to 3.0.1 (TIKA-2535)
* Mime definition and magic improvements for text-based programming
and config formats (TIKA-2554, TIKA-2567, TIKA-1141)
* Added local Docker image build using dockerfile-maven-plugin to allow
images to be built from source (TIKA-1518).
* Support for SAS7BDAT data files (TIKA-2462)
* Handle .epub files using .htm rather than .html extensions for the
embedded contents (TIKA-1288)
* Mime magic for ACES Images (TIKA-2628) and DPX Images (TIKA-2629)
* For sparse XLSX and XLSB files, always output missing cells to
the left of filled ones (matching XLS), and optionally output
missing rows on all 3 formats if requested via the
OfficeParserContext (TIKA-2479)
Release 1.17 - 12/8/2017
***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN
ON Java 7. The next versions will require Java 8***
* Fix thread-safety in ChmExtractor (TIKA-2519).
* Upgrade cxf to 3.0.16 (TIKA-2516).
* Allow users to configure maxMainMemoryBytes for PDFs via shrike (PR-213).
* Extract underline and strikethrough in docx (TIKA-2347 and TIKA-2512).
* Cache TikaConfig in EmbeddedDocumentUtil for better performance
in documents with large number of attachments (TIKA-2511).
* Extract media files from ooxml (TIKA-2510).
* Standardize the way the Image and Video captioning
dockers and extraction work (TIKA-2400, GitHub-208)
* Upgrade to xmpcore 5.1.3 (TIKA-2034).
* Upgrade to metadata-extractor 2.10.1 (TIKA-2486).
* Upgrade to OpenNLP 1.8.3 (TIKA-2502).
* Upgrade to Jackson 2.9.2 (TIKA-2501).
* Catch potential NPE in getting InputStream for attachments
in PST file (TIKA-2488).
* Upgrade to PDFBox 2.0.8 (TIKA-2489).
* Allow configuration of markLimit in EncodingDetectors
via tika-config.xml (TIKA-2485).
* RFC822Parser now selects the best alternative for
multipart/alternative body components. This aligns with the
behavior of the OutlookParser (TIKA-2478). Users can select
legacy behavior via the "extractAllAlternatives" parameter
in the RFC822 parser definition in tika-config.xml.
* Narrow mime detection for ms-owner files and add detection
for .nls files (TIKA-2469).
* Fix bug in CharsetDetector that led to different detected charsets
depending on whether user setText with a byte[] or an InputStream
via Sean Story (TIKA-2475).
* Remove JAXB for easier use with Java 9 via Robert Munteanu (TIKA-2466).
* Upgrade to POI 3.17 (TIKA-2429).
* Enabling extraction of standard references from text (TIKA-2449).
* Load external custom mimetypes XML from system property
tika.custom-mimetypes (TIKA-2460).
* Extract number of tiffs in a multi-page tiff (TIKA-2451).
* Fix detection of emails extracted from mbox (TIKA-2456).
* Add OverrideDetector and allow PSTParser to specify body content type
as text or html -- to avoid incorrect auto-detection of
rfc/mbox, etc. (TIKA-2454)
* AutoDetectParser throws ZeroByteFileException for zero-byte files after
detection on the file extension (TIKA-2450).
* Extract phonetic runs in docx with experimental SAX parser (TIKA-2448).
* Extract phonetic runs from xls and allow users to turn off extraction
of phonetic runs in both xls and xlsx (TIKA-2440).
* OOXML locale should be set by POI's LocaleUtil not Locale.getDefault().
Fix unit tests to be robust against different locales in OOXML
and ExcelParser (TIKA-2438).
* Upgrade to PDFBox 2.0.7 (TIKA-2431).
* Tika now supports automatic image captioning, which
combines Computer Vision and Natural Language Processing to
generate a readable caption for an image
(TIKA-2262, TIKA-2355, TIKA-2402, Gh-198, Gh-196, Gh-189).
* Add TestCorruptedFiles to allow devs to test parsers against
corrupted input files (TIKA-2430).
* Correct Mimetype definition for Windows batch files (CMD and BAT)
which are the same (TIKA-2445)
* PSDParser memory use improvements (TIKA-2447)
* Add underline extraction from Word documents (doc/docx) via Stuart Hendren
as well as strikethrough extraction in docx (TIKA-2347, GitHub-173)
* Corrected Tesseract OCR rotation.py script and made it a configurable
option via Peter Weiss (TIKA-2385)
Release 1.16 - 7/7/2017
* Exclude jj2000 from edu.ucar grib to avoid potential
license conflicts with ASL 2.0
* Add Age recognition using Ensemble model for Linear regression
and Apache OpenNLP Maximum Entropy. Tika can now detect age from
text (TIKA-1988).
* Add Tika Deep Learning support for the VGG16 model for
Very Deep Convolutional Networks for Large-Scale Image Recognition.
Now Tika supports both Inception v3/v4 and VGG16 based image
recognition (TIKA-2298).
* Extract macros from PPT (TIKA-2089).
* Extract absolute path for last saved location when available
in .xlsx and .xlsb (TIKA-2335).
* Rename SentimentParser to SentimentAnalysisParser to
prevent conflict with dependency (TIKA-2368).
* tika-app now extracts inline images in PDFs by
default, and it includes a warning to users that this is not the
default behavior elsewhere in Tika (TIKA-2374).
* Allow configurability of warnings for problems during
parser initialization (TIKA-2389).
* Upgrade to Jackcess 2.1.8 (TIKA-2380).
* Upgrade to POI 3.17-beta1 (TIKA-2336).
* Remove non-ASL-2.0-compatible org.json (TIKA-1804).
* Allow extraction of