Technical Post ahead…
So you’ve been able to configure SharePoint so that it can crawl and index content in a Documentum docbase.
Cool.
You go and look at the crawl log, and can see that x number of documents have been crawled. But what does that actually mean?
You fire up your favourite DQL query tool (Repoint, dqMan, Documentum Administrator) and, with panache, type in a query that will return a count on the dm_document table.
For example:
SELECT count(*) FROM dm_document
But…the value returned does not quite match what SharePoint’s crawl log was showing.
This is because SharePoint crawls files based on the extension of the file (PDF, DWG, DOC, etc). And when you run a query against Documentum, the table you are querying against does not have extensions listed.
As shown above, the table to query against is dm_document. This contains all the document objects it has, and lists the object’s “content_type”. This is not the same as the file type. (It is also not the same as what SharePoint defines as a “content type”.) In Documentum, Content_types are the file format of the object’s content.
Here’s an example list of the sort of Content Types that can be found in the dm_document table.
abt | amipro | bmp |
cgm | crtext | css |
csv | ddd | did |
dtd | excel | excel12book |
excel5book | excel8book | excel8template |
gif | html | jpeg |
js | mp3 | mpeg |
ms_access7 | ms_access8 | msg |
msproject | msw12 | msw12me |
msw6 | msw6template | msw8 |
msw8template | msww | |
pdx | powerpoint | ppt8 |
ppt8_template | ps | quicktime |
rtf | swf | text |
tiff | trn | txt |
wld | wmv | wp6 |
wp7 | wp8 | xml |
xpt | xsl | zip |
So even if you ran a DQL query that listed the objects in dm_document you would only see the “content_type”, which doesn’t help you work out the documents that are crawled.
The “content_type” in dm_document references the table dm_format. This table lists the content types and their associated file formats. Here’s an example of this table:
name | description | ext. | name | description | ext. |
bmp | BMP Image (Windows, OS/2) | bmp | crtext | Text Document (Windows) | txt |
css | Cascading Style Sheet Document | css | csv | Comma Seperated values | csv |
dtd | Document Type Definition | dtd | excel | Excel 3.x worksheet (MacOS, Windows) | xls |
excel12book | Microsoft Office Excel Workbook | xlsx | excel5book | Excel workbook 5.0 (MacOS), 5.0-7.0 (Windows) | xls |
excel8book | Microsoft Office Excel Worksheet 8.0-2003 (Windows) | xls | excel8template | Microsoft Office Excel Template 8.0-2003 (Windows) | xlt |
gif | GIF image | gif | html | HTML Document | htm |
jpeg | JPEG Image | jpg | js | JavaScript File | js |
mdoc55 | FrameMaker 5.5 document | fm | mp3 | MP3 File | mp3 |
msg | Outlook Message Format | msg | msproject | MS Project – project | mpp |
msw12 | Microsoft Office Word Document | docx | msw12me | Microsoft Office Word Macro-enabled Document | docm |
msw6 | Word 6.0 (MacOS), 6.0-7.0 (Windows) | doc | msw6template | Word 6.x template (MacOS, Windows) | dot |
msw8 | Microsoft Office Word Document 8.0-2003 (Windows) | doc | msww | Word 1.x, 2.x (Windows) | doc |
ms_access7 | Access 95 database | mdb | ms_access8 | Access 97 / 2000 database | mdb |
Watermarked Acrobat PDF | pdx | Acrobat Catalog index file | pdx | ||
powerpoint | PowerPoint pre-3.0 (MacOS, Windows) | ppt | ppt8 | Microsoft Office PowerPoint Presentation 8.0-2003 (windows) | ppt |
ppt8_template | Microsoft Office PowerPoint Template 8.0-2003 (Windows) | pot | ps | PostScript | ps |
quicktime | QuickTime Movie | mov | rtf | Rich Text Format (RTF) | rtf |
swf | Shockwave Flash File |
swf | text | Text Document (Unix) | txt |
tga | Targa Image | tga | tiff | TIFF Image | tif |
trn | Transaction file | trn | txt | Log File | txt |
wmv | Windows Media Video | wmv | wp6 | WordPerfect 6.0 | wpd |
wp7 | WordPerfect 7.0 | wpd | wp8 | WordPerfect 8.0 | wpd |
xml | XML Document | xml | xsl | XSL File | xsl |
From this you can see that multiple content_types can have the same dos extension.
So – the trick is to work out what the file formats (or DOS extensions) are of the documents in the docbase.
What follows is what I did. (Note – I am not the world’s smartest guy, so there may be a better way to do this.)
1. Determine the content types that are in use in the docbase
SELECT distinct d.a_content_type, f.dos_extension FROM dm_document d, dm_format f WHERE d.a_content_type=f.name
This gives you the list of :
- the unique content types of the documents in the docbase,
- and the extensions.
Now with this list, you can do a quick cross-reference.
2. Compare the dos extensions in the list with the File Types list in SharePoint, and you can confirm which objects (documents) in Documentum are actually being crawled by SharePoint.
Take the content type/dos extension list from above and remove all the ones that SharePoint doesn’t crawl.
3. Then run another DQL command. This time use the following query:
SELECT count(*) FROM dm_document WHERE a_content_type IN ( list content_types from above )
And this should give you a count of the documents in the Documentum docbase that SharePoint crawls, and this should be much, much closer to the number listed in SharePoint’s Crawl log.
One thing to be aware of – if you docbase is very active (that is, documents are being worked on most of the day), then the count that the DQL query gives will be as of the time the query was run.The number listed in the crawl log is from the time the crawl was last run. This may cause a difference.