MarkJOwen

Thinking Business Analysis; Thinking Information Management; Looking at UX

Main menu

Skip to primary content
Skip to secondary content
  • Home
    • All My Posts
    • Random Post
  • Business Analysis
  • Information Management
    • Search
    • Social Media
  • Other
    • Innovation
    • Project Management
    • Psychology
  • Fun
  • Who am I?
  • Random Post

Tag Archives: crawl log

Determining the real number of crawled documents in a SharePoint-Documentum system

Posted on August 17, 2010 by markjowen
Reply

Technical Post ahead…

So you’ve been able to configure SharePoint so that it can crawl and index content in a Documentum docbase.

Cool.

You go and look at the crawl log, and can see that x number of documents have been crawled. But what does that actually mean?

You fire up your favourite DQL query tool (Repoint, dqMan, Documentum Administrator) and, with panache, type in a query that will return a count on the dm_document table.

For example:

SELECT count(*) FROM dm_document

But…the value returned does not quite match what SharePoint’s crawl log was showing.

This is because SharePoint crawls files based on the extension of the file (PDF, DWG, DOC, etc). And when you run a query against Documentum, the table you are querying against does not have extensions listed.

As shown above, the table to query against is dm_document. This contains all the document objects it has, and lists the object’s “content_type”. This is not the same as the file type. (It is also not the same as what SharePoint defines as a “content type”.) In Documentum, Content_types are the file format of the object’s content.

Here’s an example list of the sort of Content Types that can be found in the dm_document table.

Content Types
abt amipro bmp
cgm crtext css
csv ddd did
dtd excel excel12book
excel5book excel8book excel8template
gif html jpeg
js mp3 mpeg
ms_access7 ms_access8 msg
msproject msw12 msw12me
msw6 msw6template msw8
msw8template msww pdf
pdx powerpoint ppt8
ppt8_template ps quicktime
rtf swf text
tiff trn txt
wld wmv wp6
wp7 wp8 xml
xpt xsl zip

So even if you ran a DQL query that listed the objects in dm_document you would only see the “content_type”, which doesn’t help you work out the documents that are crawled.

The “content_type” in dm_document references the table dm_format. This table lists the content types and their associated file formats. Here’s an example of this table:


Content Types, Descriptions, and associated File Extensions 
name description ext. name description ext.
bmp BMP Image (Windows, OS/2) bmp crtext Text Document (Windows) txt
css Cascading Style Sheet Document css csv Comma Seperated values csv
dtd Document Type Definition dtd excel Excel 3.x worksheet (MacOS, Windows) xls
excel12book Microsoft Office Excel Workbook xlsx excel5book Excel workbook 5.0 (MacOS), 5.0-7.0 (Windows) xls
excel8book Microsoft Office Excel Worksheet 8.0-2003 (Windows) xls excel8template Microsoft Office Excel Template 8.0-2003 (Windows) xlt
gif GIF image gif html HTML Document htm
jpeg JPEG Image jpg js JavaScript File js
mdoc55 FrameMaker 5.5 document fm mp3 MP3 File mp3
msg Outlook Message Format msg msproject MS Project – project mpp
msw12 Microsoft Office Word Document docx msw12me Microsoft Office Word Macro-enabled Document docm
msw6 Word 6.0 (MacOS), 6.0-7.0 (Windows) doc msw6template Word 6.x template (MacOS, Windows) dot
msw8 Microsoft Office Word Document 8.0-2003 (Windows) doc msww Word 1.x, 2.x (Windows) doc
ms_access7 Access 95 database mdb ms_access8 Access 97 / 2000 database mdb
pdf Watermarked Acrobat PDF pdf pdx Acrobat Catalog index file pdx
powerpoint PowerPoint pre-3.0 (MacOS, Windows) ppt ppt8 Microsoft Office PowerPoint Presentation 8.0-2003 (windows) ppt
ppt8_template Microsoft Office PowerPoint Template 8.0-2003 (Windows) pot ps PostScript ps
quicktime QuickTime Movie mov rtf Rich Text Format (RTF) rtf
swf Shockwave Flash File
swf text Text Document (Unix) txt
tga Targa Image tga tiff TIFF Image tif
trn Transaction file trn txt Log File txt
wmv Windows Media Video wmv wp6 WordPerfect 6.0 wpd
wp7 WordPerfect 7.0 wpd wp8 WordPerfect 8.0 wpd
xml XML Document xml xsl XSL File xsl

From this you can see that multiple content_types can have the same dos extension.

So – the trick is to work out what the file formats (or DOS extensions) are of the documents in the docbase.

What follows is what I did. (Note – I am not the world’s smartest guy, so there may be a better way to do this.)

1. Determine the content types that are in use in the docbase

SELECT distinct d.a_content_type, f.dos_extension FROM dm_document d, dm_format f WHERE d.a_content_type=f.name

This gives you the list of :

  • the unique content types of the documents in the docbase,
  • and the extensions.

Now with this list, you can do a quick cross-reference.

2. Compare the dos extensions in the list with the  File Types list in SharePoint, and you can confirm which objects (documents) in Documentum are actually being crawled by SharePoint.
Take the content type/dos extension list from above and remove all the ones that SharePoint doesn’t crawl.

3. Then run another DQL command. This time use the following query:

SELECT  count(*) FROM  dm_document WHERE a_content_type IN  (   list content_types from above  )

And this should give you a count of the documents in the Documentum docbase that SharePoint crawls, and this should be much, much closer to the number listed in SharePoint’s Crawl log.

One thing to be aware of  – if you docbase is very active (that is, documents are being worked on most of the day), then the count that the DQL query gives will be as of the time the query was run.The number listed in the crawl log is from the time the crawl was last run. This may cause a difference.

Related articles
  • Will EMC Dump Documentum?
  • ECM Noir – Killa Hertz & The Case of the Missing Documents – Part 9
  • Documentum, Not Going Away Without a Fight

Share this:

  • Click to share on Twitter (Opens in new window)
  • Click to share on Facebook (Opens in new window)
  • Click to share on Google+ (Opens in new window)
  • Click to share on LinkedIn (Opens in new window)
  • More
  • Click to email this to a friend (Opens in new window)
  • Click to share on Pocket (Opens in new window)
  • Click to share on Reddit (Opens in new window)
  • Click to share on Pinterest (Opens in new window)
Posted in Information Management, Search | Tagged BA-Insight, content types, crawl log, Documentum, DQL, file types, metadata, search, SharePoint | Leave a reply

Social

  • View @markjowen’s profile on Twitter
  • View markjowen’s profile on LinkedIn

Top Posts

Scott Berkun’s lecture on the "Myths of Innovation"
The BA is the least knowledgeable about Agile
I think I underestimated what AIIM's
SPX Series - SharePoint eXperience - (aka SPX) - Series Introduction
Big Pharma and the Prisoner's Dilemma
Post-move SharePoint site Comparison

Archives

“Mark Owen is a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for sites to earn advertising fees by advertising and linking to amazon.com.”
Proudly powered by WordPress
Cancel
Post was not sent - check your email addresses!
Email check failed, please try again
Sorry, your blog cannot share posts by email.