Problem with Enterprise Search - New Version of GSA Launched

Google has launched a new version of the GSA (Google Search Appliance), meant for deployment within the firewall for enterprise search. There are some improvements built into the yellow box in which the standard GSA ships: Google claims the GSA can now index many million more pages, run searches faster than ever, and show role-based search results. So far so good. The problem is that enterprise search is not about searching millions of pages or spitting out query results a bit faster. The problem is that enterprise search engines don't work: it is simply not easy to find anything relevant. And the fault does not lie with a search engine like GSA. The real problem lies in the very nature of the enterprise and its content management and intranet systems.

I worked a bit with the GSA some time back. It is extremely easy to install and use. Mostly, you point it at the sites and sources you want to index, trigger its crawler, let it build its index, and then you start searching with the familiar Google UI, all out of the box. Google has spent a decade perfecting its search algorithms. It uses complex rules to bubble the most relevant results for a user query to the top of the result pages. The page rank, or relevancy, of a page is calculated from how the page is cited by other pages, their interlinking, and the metadata the crawler extracts from page elements like the title, headers, and other structural elements (typical SEO territory). And it works so well on the web because the web has huge scale. The web is a huge mass of interlinked pages where cross-referencing through links is the norm. This interlinking of pages is the single most important factor in determining the relevancy of result pages for a query.
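To make the link-based idea concrete, here is a minimal sketch of a simplified PageRank-style iteration. The tiny link graph, the page names, and the damping factor are made up for illustration; this is not GSA's actual algorithm.

```python
# Simplified PageRank sketch: pages that are linked to by many
# (important) pages accumulate a higher score. The link graph is made up.
links = {
    "home":   ["trends", "pov"],
    "trends": ["pov"],
    "pov":    ["home"],
    "orphan": [],  # a page nobody links to, like most enterprise documents
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # simplification: a dangling page's rank is dropped
            share = rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

With no inbound links, the orphan page's score collapses to the floor value, which is exactly the position most enterprise documents are in.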

When I say that enterprise search doesn't work, what I mean is that the results these engines return are hardly relevant. Unlike web search, where the first few results are almost always the most relevant to a query, in enterprise search this is rarely true. You can get results that are nowhere near relevant to your query, while the most important documents lie buried somewhere in a huge pile of results you will never wade through. The reason is simple: enterprise systems and intranets don't cross-link pages and sources. Cross-references are almost non-existent. For example, how often would you find a "Banking Industry Opportunity PoV" document or page cross-referencing another document such as a "Banking Industry Trend Analysis"? In fact, many enterprise systems are little more than document storage systems where all the documents, in the form of Excels, PPTs, and PDFs, are dumped as equals. The search engine has no way to figure out which document is most relevant for a query. On top of that, most enterprise systems and intranets are not optimized for search engines. How many times have you seen intranet pages with no title or proper metadata? Content creators don't follow even the basic practices that make their content "findable", so another vital signal the engine uses to determine page relevancy is lost. What we get, in effect, is a sputtering, struggling search engine trying hard to fish out that one important document for you.
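To illustrate what the crawler is being starved of, here is a rough sketch of the structural signals an indexer typically pulls from a page, using Python's standard html.parser. The sample page and the choice of signals are illustrative, not how GSA's indexer is actually built.

```python
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Collects the structural signals an indexer typically weighs:
    <title>, <meta> name/content pairs, and heading text."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.headings = []
        self._capture = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag in ("title", "h1", "h2"):
            self._capture = tag

    def handle_data(self, data):
        if self._capture == "title":
            self.title += data
        elif self._capture in ("h1", "h2") and data.strip():
            self.headings.append(data.strip())

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

# A typical intranet page: no title, no meta tags, no headings --
# the engine is left with nothing but raw body text to score on.
page = "<html><body>Q3 banking trends deck, final version ...</body></html>"
extractor = MetadataExtractor()
extractor.feed(page)
print(extractor.title or "(no title)", extractor.meta or "(no meta tags)")
```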

In a way, the problem is not with the technology but with the very nature and realities of the enterprise. IMHO, an effective intranet search engine has to provide more than "out-of-the-box" features; it should be "tweakable". It has to work with the understanding that:

  1. Enterprise content is not cross-linked and cross-referenced, so relevancy logic that succeeds on the web won't be of much use.
  2. The scale of content is limited; unlike the web, where millions of pages cross-link to each other, enterprise content is not so vast.
  3. Pages are not optimized for search engines (though that should be fixed by company policy).
  4. Some documents written by certain "experts" or "communities" in the enterprise are naturally more important or relevant. The relevancy logic has to account for that, but how? Engine administrators should be able to feed new relevancy rules into the engine.
  5. User ratings of documents and META tags on web pages should be given more weight in calculating relevancy. On the web these are mostly ignored because of their misuse for search engine spamming, but in the enterprise the problem is the opposite. (A rough sketch of such tweakable scoring follows this list.)
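As a thought experiment, here is a minimal sketch of what tweakable relevancy scoring could look like. The boost weights, the field names, and the expert list are all assumptions made for illustration, not features of GSA or any real engine.

```python
# A hypothetical, tweakable relevancy score for an enterprise document.
# The weights are the knobs an engine administrator would tune; none of
# this reflects GSA's actual scoring.
EXPERT_AUTHORS = {"knowledge.manager", "industry.expert"}  # assumed list

def relevancy(doc, text_match_score,
              rating_weight=0.3, meta_weight=0.2, expert_boost=1.5):
    score = text_match_score
    # In the enterprise, user ratings are trustworthy, so weigh them in.
    score += rating_weight * doc.get("avg_user_rating", 0.0)
    # Reward documents whose creators bothered to add meta tags.
    score += meta_weight * len(doc.get("meta_tags", []))
    # Documents from recognised experts or communities get a flat boost.
    if doc.get("author") in EXPERT_AUTHORS:
        score *= expert_boost
    return score

doc = {
    "title": "Banking Industry Trend Analysis",
    "author": "industry.expert",
    "avg_user_rating": 4.5,               # say, on a 0-5 scale
    "meta_tags": ["banking", "trends"],
}
# (1.0 + 0.3*4.5 + 0.2*2) * 1.5 = (1.0 + 1.35 + 0.4) * 1.5 = 4.125
print(relevancy(doc, text_match_score=1.0))
```

The point is not the particular formula but that every weight is exposed as a knob, so each enterprise can tune relevancy to its own realities.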


From another perspective, a human-edited search engine could be more useful and effective within the firewall. An automated search engine can still be used to find out what users search for most of the time (trends), while experts, knowledge managers, or users contribute to the index and the ranking of pages manually. The new version of GSA also seems to have a similar do-it-yourself (DIY) key-match feature, along with some features that let administrators influence search results.
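A key-match is essentially a manual override: for a given query term, a curated result is pinned above the algorithmic results. A minimal sketch of the idea, with made-up query terms and intranet URLs:

```python
# Hand-curated key-matches: a query term mapped to the result a
# knowledge manager wants pinned on top. Terms and URLs are made up.
KEYMATCHES = {
    "banking pov": "http://intranet/docs/banking-industry-opportunity-pov",
    "trend analysis": "http://intranet/docs/banking-industry-trend-analysis",
}

def search(query, engine_results):
    """Pin any curated match ahead of the engine's own ranking."""
    pinned = [url for term, url in KEYMATCHES.items()
              if term in query.lower()]
    return pinned + [r for r in engine_results if r not in pinned]

print(search("latest banking pov", ["http://intranet/docs/random-deck"]))
```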

2 comments:

baradas said...

The fault, dear Gaurav, lies not in our search but in ourselves.

The problem is that enterprise content, as pointed out, is not structured along the lines of the web.
With the web, publishing has traditionally been in the form of HTML, whereas in the case of the big bad (well, OK, not so bad..) Enterprise, content is mostly published as Word or PDF documents.
Now, it's probably not such a bad idea to have documents in these formats. But hey, they are really not search-friendly unless we structure them with metatags. How many documents do we see that have tags relevant to the document's purpose? Enterprises have always faced this problem with knowledge management, and so they tend to have knowledge managers to organise content.

Gaurav said...

Barada, I agree. That's why I said we would need search engines whose results and ranking can be influenced. That's the direction even GSA seems to be heading in. At the same time, enterprises have to have policies in place so that content creators ensure their content is "findable" through the search engine (like using proper metadata). What is the point of the most incredible presentation on market trends if nobody can find it?