Building Country Level Search Engines
The top tier of search is Google, Yahoo and MSN. Below it are the country level search engines. They may be second tier, but they are often superior to the big three for the simple reason that they know their market: they are based in it and can tell dross from gold. Building a country level search engine is not just a case of lobbing a few URLs into a PHP script and hoping for the best.
Every few years in Ireland, some web developer gets the idea that it would be a good thing to start an Irish search engine without really understanding what a country level search engine involves. This typically means that the aforesaid web developer will scrape pages from WhoisIreland.com for data on .ie websites. It is, after all, the world’s biggest resource on Irish domains and websites. But there is a lot more to building a country level search engine than merely ripping the URLs from a few key directories.
This year has been no different. The latest “irish search engine” is a PHP script on a US server which its web developer claims is “Ireland’s only dedicated search engine”. This, of course, is quite funny considering that Ireland already has several dedicated search engines: Google, Yahoo and MSN all have Dublin operations, and WhoisIreland.com and IndexIreland.com have been spidering the Irish web for years.
The quality of country level search engines can vary considerably. Some can provide competition for Google and others can give mediocrity a bad name. Eventually, every country level search engine hits the brick wall of keeping the index fresh. This is the search engine killer.
Apart from the obvious lack of a business plan for monetizing search, there has to be a strategy for acquiring new websites to keep an index fresh. Google et al have an advantage in that they are perpetually crawling the web and following links to detect new websites. While country level search engines can dedicate a computer or two to the task, it is a very inefficient way of finding new sites. Many of the lower tier country level search engines rely on their users to submit URLs. This might work for well-trafficked search engines, but a few new URLs a day is not enough to keep a search engine in the game with Google and the other large search engines. For Ireland, the number of new domains and sites per month can be in the thousands. For the UK and Germany, that figure can get into the tens of thousands. For the web as a whole, that figure is in the millions of new sites and domains every month.
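To put rough numbers on the freshness problem, here is a back-of-the-envelope sketch. The figures are assumptions picked only to match the wording above ("a few new URLs a day" taken as 3, "thousands" of new Irish sites a month taken as 2000); they are not measured data.

```python
def submission_coverage(new_sites_per_month: int,
                        submissions_per_day: float,
                        days_per_month: int = 30) -> float:
    """Fraction of one month's new sites that user submissions alone would cover."""
    submitted = submissions_per_day * days_per_month
    return submitted / new_sites_per_month

# Assumed, illustrative figures: 3 submissions a day against 2000 new sites a month.
coverage = submission_coverage(new_sites_per_month=2000, submissions_per_day=3)
print(f"Submissions cover about {coverage:.1%} of the month's new sites")
```

Even if the assumed submission rate were several times higher, the index would still fall further behind every month.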
Building a country level search engine is quite different from building a generic search engine. With a generic search engine the strategy is to blindly follow links. This strategy has been discussed from both sides on WebmasterWorld. The Blind Spidering strategy is not a good one because it depends on effectively spidering the whole web and relying on the search engine algorithms to filter the results. That’s fine if you have unlimited resources. But building a country level search engine requires intelligence rather than computing brawn. The thread on WebmasterWorld illustrates the complete difference in view between the “Blind Spidering” approach used by Google et al and the view of a country level search operation. It is not simply a case of dropping a slice of Dmoz into the mix and hoping for the best.
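The difference between the two approaches can be sketched with a toy crawler over an in-memory link graph. Everything here is hypothetical: the link graph, the domain names and the `looks_irish` test are made up for illustration, and a real country filter would need far more than a check on the domain ending.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List

def crawl(link_graph: Dict[str, List[str]],
          seeds: Iterable[str],
          follow: Callable[[str], bool]) -> List[str]:
    """Breadth-first crawl that only queues links accepted by `follow`."""
    seeds = list(seeds)
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for target in link_graph.get(url, []):
            if target not in seen and follow(target):
                seen.add(target)
                queue.append(target)
    return visited

# Toy link graph (hypothetical sites).
LINKS = {
    "a.ie": ["b.ie", "bigportal.com"],
    "b.ie": ["c.ie"],
    "bigportal.com": ["x.com", "y.com"],
    "x.com": ["z.com"],
}

def looks_irish(url: str) -> bool:
    # Placeholder for a real country test; the .ie check is only for the demo.
    return url.endswith(".ie")

blind = crawl(LINKS, ["a.ie"], lambda url: True)   # follows every link
focused = crawl(LINKS, ["a.ie"], looks_irish)      # filters before queuing
print(len(blind), "sites fetched blind,", len(focused), "fetched with a filter")
```

The blind crawl fetches everything reachable and leaves the filtering for later. The filtered crawl does the thinking before it spends a request, which is the only option when the crawler is a computer or two rather than a data centre.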
In addition to providing statistical reports on the hosting industry of a few countries, some of the work here involves building country level search engine indices. In some respects it is like the Planet of Magrathea for search engines. These indices are not simplistic ones limited to just the country code top level domain (ccTLD, e.g. .ie for Ireland). They cover a country’s data footprint in its ccTLD and in .com, .net, .org, .biz and .info. While it may sound easy, it is anything but.
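As a small illustration of what the footprint spans, the sketch below sorts domain names into the TLD groups named above. The function name and the sample domains are hypothetical. Landing in the gTLD bucket says nothing about whether a domain belongs to the country; it only shows which part of the footprint it would sit in if it did.

```python
from collections import Counter

# The generic TLDs named above as part of a country's footprint.
FOOTPRINT_GTLDS = {"com", "net", "org", "biz", "info"}

def footprint_bucket(domain: str, cctld: str) -> str:
    """Return 'cctld', 'gtld' or 'other' based only on the final label."""
    tld = domain.lower().rstrip(".").rsplit(".", 1)[-1]
    if tld == cctld:
        return "cctld"
    if tld in FOOTPRINT_GTLDS:
        return "gtld"
    return "other"

# Hypothetical sample of domains already known to be Irish.
sample = ["example.ie", "example.com", "example.net", "example.info", "example.de"]
counts = Counter(footprint_bucket(name, "ie") for name in sample)
print(counts)
```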
Depending on the state of its hosting infrastructure, a country may have anything from 10% to 90% of its websites and domains hosted outside of its IP space. This is the dark web of a country’s webspace. It is not easily defined and it is hard to detect. The big problem for all search engine operators is in correlating a website with a particular country. The simplistic method that Google, Yahoo and Microsoft use is based on IP and ccTLD. If the IP or ccTLD is associated with a particular country, then pages from those sites will be included in the “pages from $country” search.
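A minimal sketch of that IP-or-ccTLD test shows where it falls down. This is an assumed reading of the method, not the actual code of any of the big three. The IP ranges are reserved documentation ranges standing in for "inside the country" and "foreign hosting", and the site list is made up.

```python
import ipaddress

def simplistic_country_match(domain, ip, cctld, country_networks):
    """True if the domain is in the ccTLD or its IP sits in the country's IP space."""
    if domain.lower().rstrip(".").endswith("." + cctld):
        return True
    address = ipaddress.ip_address(ip)
    return any(address in network for network in country_networks)

# Stand-in for the country's IP space (a documentation range, not real Irish space).
COUNTRY_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

# Hypothetical sites, all assumed to belong to the country.
SITES = [
    ("shop.example.ie", "198.51.100.10"),  # ccTLD, hosted abroad: matched by ccTLD
    ("example.com", "192.0.2.5"),          # gTLD, hosted in-country: matched by IP
    ("example.net", "198.51.100.20"),      # gTLD, hosted abroad: missed
]

missed = [domain for domain, ip in SITES
          if not simplistic_country_match(domain, ip, "ie", COUNTRY_NETWORKS)]
print("Invisible to the IP/ccTLD test:", missed)
```

The third site is exactly the kind that ends up in the dark web: a local business on a .net with foreign hosting gives the test nothing to hold on to.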
Some hosters will have the DNS of their websites handled by big registrars such as Network Solutions or Register.com and their website IPs may also be on the IP ranges of large registrars. Hosting on large US and UK resellers is actually very common with small hosters. Typically these are small web developer businesses hosting their clients on shared hosting. They haven’t made the jump to their own dedicated hosting with dedicated DNS so they remain almost invisible to the country level search engines. For Ireland, this dark web could be as many as thirty thousand websites.
The new ‘Ghosthunter’ algorithm being tested here managed to detect Irish websites on these large registrars. In the first few minutes of operation, it detected over 1100 Irish sites with a 95% hit rate. At the moment, it is being used exclusively to detect Irish sites. However, there is no reason why the same algorithm should not be able to detect the dark web of other countries’ websites.
The big question, of course, is whether it would be a good thing to market a continually updated feed of Irish website URLs to Irish search and directory operators.
Tags: Irishblogs - Search - Irish Search Engines - Web Directories
Written by John McCormac on July 2nd, 2005.