I got the following pretty obscure error the other day from a cfscheduler job that runs nightly to index documents uploaded to our site:
org/apache/pdfbox/pdmodel/PDDocument null
Turns out that the error is caused by a file having the extension .PDF instead of .pdf. No, really. Luckily I only had one offending file, but what if I had many? Also, what if users uploaded more after I renamed the problematic one? There are two parts to “future proofing” my situation. The first part is to address the .PDF extensions in the uploads. The second part, and what I’m going to pass on to you, is a custom tag that will look in a directory you specify and rename all .PDF extensions to .pdf.
To implement:
- Download the pdf_cleanup custom tag
- Unzip it to whatever directory you keep your custom tags in
- Call it using the following syntax just before you run your <cfindex> operation(s):
<cf_pdf_cleanup dirToClean="C:\mysuperdocs">
Be forewarned: I take no responsibility for your use of the tag 😉
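If you would rather see the idea than download the zip, here is a minimal sketch of how a tag like this could work. It is not the contents of the pdf_cleanup download, just my rough approximation: the file name pdf_rename_sketch.cfm and the variable names are made up for illustration, and only the dirToClean attribute comes from the call shown above. It assumes the ColdFusion account has permission to rename files in that directory.

```cfm
<!--- pdf_rename_sketch.cfm: hypothetical stand-in, NOT the original pdf_cleanup tag --->
<cfparam name="attributes.dirToClean" type="string">

<cfif NOT directoryExists(attributes.dirToClean)>
    <cfthrow message="Directory not found: #attributes.dirToClean#">
</cfif>

<!--- List everything and test the extension ourselves, because the filter
      attribute of cfdirectory is not case-sensitive on Windows --->
<cfdirectory action="list" directory="#attributes.dirToClean#" name="dirContents">

<cfloop query="dirContents">
    <!--- compare() is case-sensitive, so only an exact upper-case ".PDF" matches --->
    <cfif dirContents.type EQ "File"
          AND len(dirContents.name) GT 4
          AND compare(right(dirContents.name, 4), ".PDF") EQ 0>
        <cfset baseName = left(dirContents.name, len(dirContents.name) - 4)>
        <cfset sourcePath = attributes.dirToClean & "/" & dirContents.name>
        <cfset tempPath = attributes.dirToClean & "/" & createUUID() & ".tmp">
        <cfset targetPath = attributes.dirToClean & "/" & baseName & ".pdf">
        <!--- Rename in two steps via a temporary name, in case a rename that
              changes only the case is treated as a no-op by the file system --->
        <cffile action="rename" source="#sourcePath#" destination="#tempPath#">
        <cffile action="rename" source="#tempPath#" destination="#targetPath#">
    </cfif>
</cfloop>
```

Saved in your custom tag directory, this sketch would be called as <cf_pdf_rename_sketch dirToClean="C:\mysuperdocs">. It only touches files ending in exactly .PDF, so mixed-case extensions such as .Pdf are left alone. On a case-sensitive file system it will throw an error if a lowercase file with the same name already exists. It does not look into subdirectories.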
Did you try this in 901+CHF? Solr fixes were added that may have addressed this. If it still happens there, please be sure to file a bug report. Adobe does NOT search out blog posts like this so it’s up to us guys to use the public bug tracker.
Unfortunately I did have 901+CHF. I have filed a bug as you suggested. Thanks!
Hi, this happened to me too, but in my particular case there was a PDF file without an extension. So instead of File.pdf it was only File. Thanks for this info.
Wow! Thanks so much for this. I was going crazy trying to find the single PDF in my collection that was causing SOLR to crash with a 500 error. I narrowed it down to one PDF (after putting one in a folder, re-indexing, putting a second pdf file, re-indexing, etc. etc — until I narrowed it down to a single file that would always bomb the indexing.) Anyway, I didn’t even notice the upper-case PDF extension.
Wow — this is a *BIG* bug in SOLR. Crazy, crazy. Thank you so much!