Can’t be handled as Microsoft document
… Can’t be handled as Microsoft document.
java.lang.ArrayIndexOutOfBoundsException …
If you see this kind of exception while parsing a Word document with Nutch, it indicates that the document contains problematic content, including weird special characters, that the parser did not handle properly.
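To illustrate how a low-level exception can end up reported under this message, here is a minimal sketch. It is not Nutch's actual code: `DocumentParser`, `ParseOutcome` and `parseSafely` are hypothetical names made up for this example, and the "broken" parser is a toy that simply reads past the end of an array. The idea is that the wrapper catches the runtime exception thrown on malformed input and turns it into a failure message, so one bad document does not stop the rest of the run.

```java
public class SafeParseSketch {

    /** Hypothetical stand-in for a document parser; not a Nutch interface. */
    interface DocumentParser {
        String extractText(byte[] content);
    }

    /** Hypothetical result holder: either the extracted text or a failure message. */
    static final class ParseOutcome {
        final boolean success;
        final String text;
        final String message;

        private ParseOutcome(boolean success, String text, String message) {
            this.success = success;
            this.text = text;
            this.message = message;
        }

        static ParseOutcome ok(String text) {
            return new ParseOutcome(true, text, null);
        }

        static ParseOutcome failed(String message) {
            return new ParseOutcome(false, null, message);
        }
    }

    /**
     * Runs the parser and converts any runtime exception (for example an
     * ArrayIndexOutOfBoundsException on malformed content) into a failed
     * outcome instead of letting it propagate.
     */
    static ParseOutcome parseSafely(DocumentParser parser, byte[] content) {
        try {
            return ParseOutcome.ok(parser.extractText(content));
        } catch (RuntimeException e) {
            // The message prefix is copied from the log line quoted in this post.
            return ParseOutcome.failed("Can't be handled as Microsoft document. " + e);
        }
    }

    public static void main(String[] args) {
        // Toy parser that indexes past the end of a two-element table,
        // mimicking a parser tripping over unexpected content.
        DocumentParser broken = content -> {
            int[] table = new int[2];
            return String.valueOf(table[content.length]);
        };

        ParseOutcome outcome = parseSafely(broken, new byte[] {1, 2, 3});
        if (!outcome.success) {
            // Log the failure and move on to the next document.
            System.out.println(outcome.message);
        }
    }
}
```

With a wrapper like this, a document that triggers the exception is reported and skipped, while documents that parse cleanly are returned as usual.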
August 6th, 2008 in
Nutch - (Crawler), Open Source