Google, 1997, PageRank & Barrels

Did you know Google originally used barrels?

Here’s an overview of Google’s system architecture (at least as it was in 1997).

Can you see the barrels?

Messrs Sergey Brin and Lawrence Page presented a paper at a conference back in 1997. According to the paper, once a page has been indexed, the word occurrences are stored in “barrels”.
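
To make the idea a little more concrete, here is a toy sketch in Python of what "storing word occurrences in barrels" could look like. As I read the paper, each barrel holds a range of word IDs. Everything else here is my own assumption for illustration: the function name `build_barrels`, the four barrels, the two-words-per-barrel ranges, and the simple (word ID, position) hit format. It is not Google's code.

```python
from collections import defaultdict

# Assumption: four barrels, each owning a contiguous range of word IDs.
# The numbers are purely illustrative.
NUM_BARRELS = 4
WORDS_PER_BARREL = 2


def build_barrels(pages):
    """Distribute word occurrences ("hits") into barrels by word ID.

    `pages` maps a document ID to its text. Returns (lexicon, barrels),
    where lexicon maps word -> word ID and barrels[i] maps
    document ID -> list of (word ID, position) hits.
    """
    lexicon = {}
    barrels = [defaultdict(list) for _ in range(NUM_BARRELS)]
    for doc_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            # Assign the next free ID the first time a word is seen.
            word_id = lexicon.setdefault(word, len(lexicon))
            # Word IDs beyond the last range all land in the last barrel.
            barrel_no = min(word_id // WORDS_PER_BARREL, NUM_BARRELS - 1)
            barrels[barrel_no][doc_id].append((word_id, position))
    return lexicon, barrels


lexicon, barrels = build_barrels({
    1: "google stores words in barrels",
    2: "barrels hold words",
})
for number, barrel in enumerate(barrels):
    print(number, dict(barrel))
```

Because each barrel only covers a slice of the vocabulary, a lookup for one word only has to touch one barrel, which is the appeal of the design.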

More about these barrels, the PageRank algorithm, and how Google does what it does (at least as of 1997) can be read here: The Anatomy of a Large-Scale Hypertextual Web Search Engine


We want Google!


One of the things I come across frequently when discussing clients’ requirements for a new enterprise content management system is that they “want Google search”.

When I delve deeper I often discover that the client wants to be able to type one or two words in a small text box, and get back exactly the document that they were looking for. Within seconds.

Unfortunately, what the client doesn’t always see, or isn’t aware of, is that a lot of work goes into making Google work efficiently, and that there is a staff of Google employees monitoring, tweaking, and caring for the Google search engine. What the client also doesn’t realise is that Google (the company) has over 100 million servers around the world dedicated to indexing the internet.

Here is an interesting article from CMS Wire about Google and what it actually does: Enterprise Search and Pursuit of the Google Experience. (The included video by Google engineer Matt Cutts is also interesting)

Also of interest:

Learn how Google works: In Gory Detail

Google reveal the real technology behind their system.

http://www.cmswire.com/cms/enterprise-cms/enterprise-search-and-pursuit-of-the-google-experience-008150.php

Killa Hertz & The Case of the Missing Documents – Part 5

An ECM Detective story - Killa Hertz and the case of the Missing Documents
… continued from Part 4 —  [All Episodes]

Killa Hertz had taken control of the computer. Running a few DQL queries he had been able to determine how many, and what sort of, documents there actually were in the docbase. The number that SharePoint was crawling didn’t match…

 


After taking a swig of cold coffee, I decided to learn more about eResults.

It was a protocol handler that allowed SharePoint to talk to Documentum. It also kept track of the security of the documents in Documentum, which ensured that security trimming was applied correctly to the documents returned in the search results.

Looking at SharePoint’s Search Administration screen and the configuration screen for eResults, I deduced the following:

  • A content source has been set up that points to the Documentum docbase.
  • At regular intervals, SharePoint connects to Documentum using the information defined in the content source.
  • Based on a custom filter, SharePoint retrieves a list of the documents held in Documentum, 20,000 at a time.
  • Using a web server as the intermediary, SharePoint copies the documents from the docbase to a folder on the Index server.
  • The documents are then crawled, and as soon as each document has been crawled, it is deleted from the folder.
  • At the same time as the list of documents is retrieved from Documentum, the Documentum security is translated so that it maps to a corresponding group in Active Directory.
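
For readers who think better in code, the flow in that list can be sketched roughly as follows. This is only an illustration in Python: `crawl`, `fetch`, `index` and `acl_to_ad_group` are hypothetical stand-ins of mine, not eResults or SharePoint APIs, and the only detail taken from the list is the batch size of 20,000.

```python
import tempfile
from pathlib import Path

BATCH_SIZE = 20_000  # the batch size mentioned in the list above


def batches(doc_ids, size=BATCH_SIZE):
    """Yield successive slices of at most `size` document IDs."""
    for start in range(0, len(doc_ids), size):
        yield doc_ids[start:start + size]


def crawl(doc_ids, fetch, index, acl_to_ad_group, staging_dir):
    """Hypothetical crawl loop: stage, index, then delete each document.

    fetch(doc_id) -> (bytes, acl_name) stands in for the web-server hop,
    index(path, ad_group) stands in for the indexer, and acl_to_ad_group
    maps a Documentum ACL name to an Active Directory group.
    Returns the number of documents indexed.
    """
    staging = Path(staging_dir)
    indexed = 0
    for batch in batches(doc_ids):
        for doc_id in batch:
            content, acl_name = fetch(doc_id)
            path = staging / f"{doc_id}.bin"
            path.write_bytes(content)               # copy to the staging folder
            index(path, acl_to_ad_group[acl_name])  # crawl with translated security
            path.unlink()                           # delete once crawled
            indexed += 1
    return indexed


# Tiny dry run with dummy data instead of a real docbase.
with tempfile.TemporaryDirectory() as staging_dir:
    seen = []
    count = crawl(
        doc_ids=list(range(5)),
        fetch=lambda doc_id: (b"dummy", "legal_acl"),
        index=lambda path, group: seen.append((path.name, group)),
        acl_to_ad_group={"legal_acl": "AD-Legal"},
        staging_dir=staging_dir,
    )
    leftover = list(Path(staging_dir).iterdir())

print(count, leftover)
```

The point of the sketch is that a failure in any one of these hops (fetching, staging, indexing, security mapping) could quietly shrink the number of documents that end up in the index.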

Luckily eResults kept detailed logs about its activities. I opened the latest one. There was a lot of information. I started looking through it.

There were several places where the word “error” appeared in the log. They looked like pretty harmless entries, but I wanted to be sure. I’d have to call in a favour.

Mike Budrewski was an old friend of mine. He was born with a copy of “The Geek’s Guide to Being a Geek” in his hands. What Mike didn’t know about technology wasn’t worth knowing. Problem with Mike was, if you didn’t have a keyboard and a monitor, he didn’t really feel comfortable talking to you. Mike wasn’t really a people person.

I looked at Trudy. She was looking tired. “Let’s call it a night,” I said. “Copy these log files onto a CD for me. I’ve got a guy who will look over them tonight, and I should have an answer by tomorrow.”

Trudy looked pleased. It was 3 o’clock in the morning, and the caffeine was starting to wear off.


to be continued…

Part 6

 



This made me smile…

Came across this image while cruising the web. It made me smile.

Original link: The Next Web

SharePoint Will Not Own ECM (At Least, Not Anytime Soon) – My 2¢ worth

Came across a post in a roundabout way the other day…

@SharePointBuzz retweeted @JoeShepley about a post that was written by Linda Andrews in response to a post Joe had originally written about SharePoint 2010. (You may have to read that again.)

Here’s Joe’s original post: http://joeshepley.wordpress.com/2010/06/08/sharepoint-will-own-ecm/.

And here’s Linda’s response: http://www.doculabs.com/?p=1260.

I read Linda’s post first, and started writing a response. Once finished, I thought it might be prudent to read what Joe had originally said, and tweaked my response slightly. Originally I was going to post my response as a comment on Linda’s post. But then I thought “Nah – I’ll post this on my blog to give it the glory that it deserves”…

Here is my contribution to the debate:

Hi Linda

Interesting article – thanks.

I’ve been working in ECM for about 16 years now, having cut my teeth on FileNet, and have worked for the last three years with Documentum (and also SharePoint).

When SharePoint 2003 appeared on the scene, it did not even show on my radar. I was aware of the name, but that was it. When SharePoint 2007 took to the stage, I watched the hype and excitement that it brought with it for the first 6 months, but watched that die quickly. While its strengths definitely didn’t lie with ECM, it did offer a lot for collaboration.

SharePoint 2010, on the other hand, I am treating with a modicum of respect, and I have been looking at the “threat” that it is supposedly bringing with it.

I have read Joe Shepley’s original post. He makes some very valid points, and while, in principle, you do too, I’d like to share my own thoughts…

For many companies that already have an existing ECM solution in place, the cost, as you pointed out, of swapping to SharePoint is more a reason not to. To uproot a working system, as well as to migrate the documents is not something undertaken lightly.

However, consider a minus of some of the big ECM products. The cost of licences can be quite hefty. This does make SharePoint attractive (even taking into account the points you have made in Reason #3). Any smart company will try and reduce the cost of something that is considered an overhead. As a result, during times of document management system upgrades, it may be that the move to SharePoint could be worthy of consideration.

And, with that in mind, I would like to reiterate Joe Shepley’s closing paragraph, by saying that it is not unreasonable to consider that, for the sake of reducing costs, a change in expectations may also be considered. Analyse the actual business process and, if the cost savings are really worth it, adapt it. Maybe a less complex process, built around the “reduced” functionality that SharePoint has, could be put into place.

I’m not going to make any hard predictions, but, maybe SharePoint will actually start owning more and more of the ECM world…


You want to complain? Go ahead.

Just read a great post by Steve Radick. His post discusses how, even though we now have such a great opportunity to communicate how we feel about something in the workplace, we still don’t.

And we never have. Who ever wanted to send an “anonymous” e-mail to the HR department about something that was royally pissing you off? Or to fill in an on-line survey about management in a truly honest way? We all know that in an organisation, if the communication is via the computer, then it ain’t that anonymous – “Oh look, unknown employee using computer WEX321 (IP address 10.15.1.243) has criticised the way the Director runs the company. Let’s just make note of that.”

Steve’s post is quite interesting and he discusses ways to encourage honest feedback.

Here’s a link to his post: Got a Problem with the Organization? Speak Up!!


My document is not being crawled!!!

OK – here’s one that I struggled with.

I was installing SharePoint Search Server 2008. The objective was to index, and crawl, several file shares. This would allow the users to “find” all those documents that they had been hoarding in the file shares over several years.

As well as the standard Office documents, there were also a lot of PDF documents. As you may already know, out of the box, SharePoint doesn’t have an iFilter that will allow it to crawl PDFs. You have to install a third-party iFilter.

Adobe used to offer a separate download that could be installed. Since version 8 of Adobe Reader, the iFilter is included as part of the application. At the time I was doing this project, Adobe Reader 9 was the latest version, so I installed it.

In the beginning I thought that all I had to do was install the application, and then ensure that PDFs were added to the File Types that SharePoint would crawl.

But no… there is some fancy work that has to be done in the registry. Although there are several excellent articles/posts on the Internet that discuss this, I still found that, even when I had followed them to the letter, my PDFs remained happily “undiscovered”.

I know that we are now in the age of SharePoint 2010, and that, as just mentioned, this has been documented to the nth degree, but for posterity’s sake I have recorded here the process that worked for me. (There is also a list of useful links at the end of this post.)

Here’s what I did:

  1. Installed Adobe Reader on the same server that SharePoint was installed on.
  2. Added the PDF icon (instructions for this can be found in any of the articles/posts mentioned above).
  3. Added PDF as a new File Type (also a standard SharePoint procedure).
  4. Opened Regedit, and navigated to:
    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\.pdf]
  5. If necessary, changed the existing Multi-string value to:
    {E8978DA6-047F-4E3D-9C78-CDBE46041603}
  6. Navigated to the following key:
    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\.pdf]  (note that this is different from Step 4)
  7. If necessary, changed the existing Multi-string value to:
    {E8978DA6-047F-4E3D-9C78-CDBE46041603}
  8. Navigated to the following key:
    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\Filters\.pdf]
  9. If not already present, added the following entries:

Name: Default
Type: REG_SZ
Data: (value not set)

Name: Extension
Type: REG_SZ
Data: pdf

Name: FileTypeBucket
Type: REG_DWORD
Data: 0x00000001 (1)

Name: MimeTypes
Type: REG_SZ
Data: application/pdf
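
If you would rather script the registry part than click through Regedit, something along these lines might help. It is a rough Python sketch that only prints `reg add` commands for you to review; it changes nothing by itself. Two assumptions of mine that you should check on your own server: that the "existing Multi-string value" in steps 5 and 7 is the key's default value (hence `/ve`), and that the Default entry in step 9 can be left unset. The helper name `reg_add_command` is made up for this example.

```python
PDF_IFILTER_GUID = "{E8978DA6-047F-4E3D-9C78-CDBE46041603}"

WSS = (r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools"
       r"\Web Server Extensions\12.0\Search\Setup")
MOSS = r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup"

# (key, value name or None for the key's default value, type, data)
CHANGES = [
    # Steps 4-5 (assumption: the multi-string is the default value)
    (WSS + r"\ContentIndexCommon\Filters\Extension\.pdf", None,
     "REG_MULTI_SZ", PDF_IFILTER_GUID),
    # Steps 6-7
    (MOSS + r"\ContentIndexCommon\Filters\Extension\.pdf", None,
     "REG_MULTI_SZ", PDF_IFILTER_GUID),
    # Steps 8-9 (the Default entry is left unset, so it is omitted)
    (WSS + r"\Filters\.pdf", "Extension", "REG_SZ", "pdf"),
    (WSS + r"\Filters\.pdf", "FileTypeBucket", "REG_DWORD", "1"),
    (WSS + r"\Filters\.pdf", "MimeTypes", "REG_SZ", "application/pdf"),
]


def reg_add_command(key, name, reg_type, data):
    """Build one `reg add` command line; /f overwrites without prompting."""
    value_switch = "/ve" if name is None else f'/v "{name}"'
    return f'reg add "{key}" {value_switch} /t {reg_type} /d "{data}" /f'


commands = [reg_add_command(*change) for change in CHANGES]
for command in commands:
    print(command)
```

Printing the commands rather than running them is deliberate: you can compare each line against what Regedit shows before committing to anything.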

This whole process seems pretty straightforward – but it cost me a lot of pain, and many lost hours.

One other thing to be aware of:

If you install SP2 for SharePoint after setting this up, you will need to go back and change those GUIDs again. Chris Even, a giant in the Search world, points this out in his blog. (I strongly recommend having a look at it.)

Useful Links

  • Chris Even on SharePoint Search
  • http://support.microsoft.com/kb/2018558
  • http://support.microsoft.com/kb/944447
  • http://blogs.officezealot.com/zaidi/default.aspx

Of course, there are plenty more – just check with your favorite search engine :O)

http://sharepointsearch.com/cs/blogs/notorioustech/archive/2009/07/28/moss-and-wss-sp2-can-break-pdf-searching.aspx

Lose 9 Words, Improve Your Results

Just read a great article. Not related specifically to ECM, or compliance, but still some good advice.

Lose 9 Words, Improve Your Results

Killa Hertz & The Case of the Missing Documents – Part 4

An ECM Detective story - Killa Hertz and the case of the Missing Documents
… continued from Part 3 [All Episodes]

Killa and Trudy were delving deeper into the problem. The crawl log indicated that it had finished crawling successfully, but something wasn’t right…

 


“How many documents are in the docbase, Trudy?”

“500,000,” she said, but she didn’t look me in the eye. I could see that she wasn’t really sure.

“OK, Trudy, let’s swap seats, and let’s find out what’s going on.” She jumped up and I took her place.

Starting Documentum’s Administrator tool (DA), I opened up the DQL screen. This would let me query the docbase using Documentum’s query language. I prefer Repoint for this type of job, but running a query through DA was just as good. I ran a count on the number of objects in the docbase.

The result I got back was 824,129. Trudy was surprised. “Wow!” she said in that squeaky, excited voice. “That’s a lot more than I had expected.”

I was curious about what these documents were. What type of files they were.

Quickly I ran another DQL query and got a list of the content types. I looked through the list. There were Excel 3.0 spreadsheets, Excel 8 spreadsheets, Word 6 documents, Word 8 documents, PDF files, scanned TIFFs, several MPG and MP3 files, log files, JPEG files, HTML files, and several rich-text format files.

“OK – now let’s look at the files that SharePoint can crawl.” Starting up SharePoint’s Central Administration, I navigated to the Shared Service Provider, and then to Search Administration. After clicking on the File Types link, I noted down the file types listed.

Cross-referencing the two, I could see that for most of the files in the docbase there was a suitable iFilter in SharePoint, allowing SharePoint to read the document. (The MPG files, the MP3 files, the JPEG files and the TIFF files weren’t getting crawled.) They were using the PDF iFilter from Foxit.

I ran the count query again, this time only including the content types that were crawled. The count came to 801,232 objects. Even taking into account the documents that SharePoint didn’t crawl, there was still a discrepancy. Why was SharePoint stopping happily after about 350 thousand documents?
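
For anyone following along at home, the arithmetic behind that discrepancy can be sketched in a few lines of Python. The three figures are the ones quoted in the story (the crawl-log figure comes from Part 3); the variable names are just mine.

```python
# Figures quoted in the story.
total_objects = 824_129      # count of all objects in the docbase
crawlable_objects = 801_232  # objects whose content type SharePoint crawls
crawled_objects = 354_054    # documents reported in the crawl log (Part 3)

no_ifilter = total_objects - crawlable_objects
missing = crawlable_objects - crawled_objects
coverage = crawled_objects / crawlable_objects

print(f"Not crawlable (expected to be skipped): {no_ifilter:,}")
print(f"Crawlable but never crawled:            {missing:,}")
print(f"Share of crawlable documents crawled:   {coverage:.1%}")
```

In other words, the formats without an iFilter explain only 22,897 of the gap, which leaves 447,178 crawlable documents unaccounted for.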

Grabbing my notepad, I jotted down what I knew so far:

  1. Documents were stored in Documentum
  2. Documents were also stored in SharePoint doclibs
  3. There were two content sources set up
    1. One for SharePoint
    2. One for Documentum.
  4. Most of the file formats had a suitable iFilter that allowed SharePoint to crawl them.
  5. According to SharePoint’s crawl log, there were no errors, but it would stop crawling too early, missing the bulk of the documents.

So … what was I missing? I looked further into the process…

to be continued…

Part 5



Killa Hertz & The Case of the Missing Documents – Part 3

An ECM Detective story - Killa Hertz and the case of the Missing Documents
… continued from Part 2 [All Episodes]

Trudy had explained the set-up at the law firm to Killa. Now Killa wants to find out more.


“OK, let’s see if I’ve got this right,” I said. “You know that the documents are in the system, but they aren’t showing up in the search?”

The group of lawyers in cheap suits all nodded together like a set of those toy dogs you see in the back of old people’s cars.

Trudy explained that they had browsed to a document directly in the system using the client interface, so they knew it was there, but when they did a search in SharePoint nothing was being found.

“And”, I asked, while looking around for a coffee machine, “do you know whether these documents are actually being indexed?”

“Oh yes”, Trudy replied, her voice hitting a high C. “The crawl runs every 2 hours.”

“Yeah – but is it working properly? Is the crawl actually crawling the documents?” There was silence. Trudy glanced nervously at the suits, and then back at me.

In a plaintive voice, she said, “I’m not sure”.

I’ve worked cases like this for more years than I’d like to remember. You get to learn that things are not always as they appear.

“Let me see the logs.”

Trudy led the way to her office. It was a real contrast to the reception area which was uncomfortably clean, and lifeless. In her office, there was a desk covered with papers. On the wall was a whiteboard on which she had written several numbers and drawn arrows. Next to her computer was a picture of a dog – some small, fluffy thing…

Why was I not surprised?

An ECM Detective story - Killa Hertz and the case of the Missing Documents - Trudy's dog

The office wasn’t big, and fortunately the suits had dispersed to their dark corners with neon lighting.

Trudy lifted a pile of paper from a chair, and dragged it over to her desk, so I could sit on it. With her thin fingers, she logged into the system and opened the crawl log. She slid her chair to one side, so I could see it.

Because the documents were stored in Documentum, as well as SharePoint, there were two content sources. (That’s another name for the place where the documents are.)

One was the default one, which pointed at all the documents in the SharePoint repositories (as well as the web pages, etc. that made up SharePoint). The other pointed to the Documentum repository.

I pulled out my notepad, grabbed a pen that was lurking under some industry magazine on Trudy’s desk, and started writing things down.

“OK, Trudy – according to this crawl log, 354,054 documents in the Documentum repository have been crawled. But how many are actually there?”

to be continued…

Part 4

 
