Killa Hertz & The Case of the Missing Documents – Part 4

… continued from Part 3 [All Episodes]

Killa and Trudy were delving deeper into the problem. The crawl log indicated that the crawl had finished successfully, but something wasn’t right…

“How many documents are in the docbase, Trudy?”

“500,000,” she said, but she didn’t look me in the eye. I could see that she wasn’t really sure.

“OK, Trudy, let’s swap seats and find out what’s going on.” She jumped up and I took her place.

Starting up Documentum Administrator (DA), I opened the DQL screen. This would let me query the docbase using Documentum’s query language. I prefer Repoint for this type of job, but running a query through DA was just as good. I ran a count on the number of objects in the docbase.

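If you’re following along at home, the count query was nothing fancier than this, assuming all the documents are subtypes of dm_document (adjust the type name to suit your own object model):

    SELECT count(*) FROM dm_document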

The result I got back was 824,129. Trudy was surprised. “Wow!” she said in that squeaky, excited voice. “That’s a lot more than I had expected.”

I was curious about what these documents were. What types of files were they?

Quickly I ran another DQL query and got a list of the content types. I looked through the list. There were Excel 3.0 spreadsheets, Excel 8 spreadsheets, Word 6 documents, Word 8 documents, PDF files, scanned TIFFs, several MPGs, MP3 files, log files, JPEG files, HTML files, and several rich-text format files.
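For the curious, that breakdown came from a simple group-by on the content type attribute, again assuming the documents are all subtypes of dm_document:

    SELECT a_content_type, count(*) FROM dm_document GROUP BY a_content_type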

“OK – now let’s look at the file types that SharePoint can crawl.” Starting up SharePoint Central Administration, I navigated to the Shared Services Provider, and then to Search Administration. After clicking on the File Types link, I noted down the file types listed.

Cross-referencing the two lists, I could see that for most of the files in the docbase there was a suitable iFilter in SharePoint, so SharePoint could read them. (The MPG, MP3, JPEG and TIFF files weren’t getting crawled.) For the PDFs, they were using the PDF iFilter from Foxit.

I ran the count query again, this time including only the content types that SharePoint could crawl. The count came to 801,232 objects. Even after excluding the documents that SharePoint couldn’t crawl, there was still a discrepancy. Why was SharePoint stopping happily after about 350 thousand documents?
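That filtered count, by the way, was just the first query with a WHERE clause limited to the crawlable formats. Something along these lines, where the format names are purely illustrative (the real list came from the formats registered in the docbase):

    SELECT count(*) FROM dm_document
    WHERE a_content_type IN ('msw8', 'excel8book', 'pdf', 'html', 'crtext')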

Grabbing my notepad, I jotted down what I knew so far:

  1. Documents were stored in Documentum
  2. Documents were also stored in SharePoint doclibs
  3. There were two content sources set up
    1. One for SharePoint
    2. One for Documentum
  4. Most of the file formats had a suitable iFilter that allowed SharePoint to crawl them.
  5. According to SharePoint’s crawl log there were no errors, but the crawl was stopping too early, missing a large chunk of the documents.

So … what was I missing? I looked further into the process…

to be continued…