Search Engines

Searchtheowl.com Search Engine Goes Dead Parrot

Searchtheowl.com has given up being “Ireland’s only dedicated search engine”. The move from trying to be a search engine to being a web directory has much in common with Monty Python’s Dead Parrot Sketch. A post on the Searchtheowl.com blog seemed to be upset with the internet community in Dublin for questioning Searchtheowl.com’s pretensions, even though the post on Scrudu.ie merely mentioned that Searchtheowl.com had finally given up on trying to run an Irish search engine.

I really don’t think that Mike Russen (the operator of searchtheowl.com) understood the magnitude of the work involved in running a country-level search engine. The software used by searchtheowl.com was a glorified site-search script rather than something suitable for running a search engine that covers at least 100K sites. And that is only indexing the index page on each site. Going beyond that would involve indexing nearly a million webpages, depending on how deeply the spider was to crawl each site.

Then there is the site acquisition process, where new websites have to be added to the search engine and indexed. User submissions only work when there is sufficient traffic - a Catch-22 situation for any new search engine or directory. And more than one search index has to be maintained. I don’t think that Searchtheowl.com ever got sufficient traffic to create a genuine flow of new website submissions.

It just is not as easy as sticking a PHP script on a website and hoping for the best. That said, we in the Irish search engine business wish Mike the best of luck with his web directory.

Written by John McCormac on June 27th, 2006 with 2 comments.

Google Blogsearch - A Threat To Technorati?

Google’s new blogsearch launched this week. It is still in beta but it has one major advantage over Technorati - it is blazingly fast. Google’s blogsearch is still limited: it has only indexed blog posts back to March 2005. However, it seems to be using much the same algorithm as Google’s web search, and the advanced search is elegantly minimal.

Back in the late dot.bomb period, the first mover advantage used to be important. Whichever company was first to market, even if the product was shoddy, was supposed to have an advantage over the later entrants to the market. It didn’t always work out, and some of the first movers learned the hard way that it is the pioneers that end up with the arrows in their backs.

While Google’s blogsearch does not have tags to classify blogs (Technorati’s main selling point), it does apply Google’s algorithm to the text.

Written by John McCormac on September 16th, 2005 with comments disabled.

Finding The Missing Irish Websites

Over the last few years, the Irish web has changed drastically. Most of it had been hosted outside of Ireland due to the extortionate fees charged by Irish ISPs to host websites. It was cheaper to host Irish websites in the UK and the US. But with the growth of the Irish hosting industry, it is now easier and cheaper than it was to host a website in Ireland. The Irish ISPs now only account for 25% of the Irish website business and that continues to fall. Irish Hosting Service Providers (HSPs) account for the rest. But there is still a section of the Irish web that hosts on US and UK servers.

The reasons for this are many. The traditionally high cost of hosting in Ireland is a factor, but one of the more important reasons is that these websites are hosted abroad because of the web developers. The Irish hosting business is a curious mix of dedicated HSPs like Hosting365, Novara and Blacknightsolutions.com and web development companies with a lot of clients.

While the HSPs are easy to categorise, many of the web development companies still use legacy solutions and host their clients in the US or UK on shared or dedicated hosting. WhoisIreland.com tracks the hosting patterns of approximately 970 Irish hosters. These are the easily identified Irish hosters, and the statistics on these hosters form part of the Irish hosting industry reports that WhoisIreland.com provides each month. But there is a part of the Irish web that has been, up until now, difficult to track - the missing Irish web.

For the Irish web, that missing section hosted in the UK, US and elsewhere could be as much as 15% of the overall Irish web. That’s potentially thousands of Irish websites that are missing from the “pages from Ireland” searches of the large search engines.

To date, the large search engines like Google, Yahoo and MSN have not solved this problem of the missing web. WhoisIreland.com has come a lot closer to solving it. The Ghosthunter algorithm developed to detect these missing hosters has identified 2979 potential Irish micro hosters on US IP space and 1814 potential Irish hosters on UK IP space. In real terms, these figures will drop as the different levels of the algorithm are applied.

Three significant Irish web development companies on UK IP space were identified with 100% accuracy on the first run along with the websites hosted by one of IEDR’s resellers. The size of these Irish micro-hosters varies from a couple of sites hosted to over a hundred sites hosted.

Written by John McCormac on August 25th, 2005 with comments disabled.

Yahoo and Google Argue Over Search Index Size

It seems that Yahoo and Google have different views as to which has the bigger search index. A post on the Yahoo Search blog announced that Yahoo’s index had grown to 19.2 billion web documents. The New York Times quoted Sergey Brin of Google as saying: “The comprehensiveness of any search engine should be measured by real Web pages that can be returned in response to real search queries and verified to be unique.” In the same article, he was quoted as saying: “We [Google] report the total index size of Google based on this approach.” Google’s webpage count currently stands at 8,168,684,336 web pages.

A brief study at the US National Center for Supercomputing Applications put the claims to the test. The study could only approximate the sizes of the indices, but it does express doubt over Yahoo’s claims.

With hundreds of thousands of domains and websites being deleted globally each day and hundreds of thousands of new domains and websites being created, it would only be possible to give an approximation as to the size of the web.

Written by John McCormac on August 16th, 2005 with comments disabled.

Content Filter Company Scraping Around?

Last year, Secure Computing Corporation claimed that the .ie ccTLD had tens of thousands of pages of iffy content. It claimed to have done a “global study” of the number of porn pages on the web and it found that the ccTLDs were riddled with the stuff. They had millions of pages of it. Of course, .com/net/org/biz/info websites were not included in this “study”. This “global study” amounted to nothing more than entering a few obviously dodgy keywords into Google and limiting the results by using site:.cctld.

It was a very crude attempt by SCC to market its content filtering software. ENN ran it without question but later corrected the article after getting the headline “Study reveals 60,000 Irish porn sites” seriously wrong. (It was 68,000 webpages rather than websites.) Silicon Republic did a good analysis.

So what has this got to do with the present day? Well, it seems that an IP from SCC has been sniffing around on some Irish sites with an incompetently forged browser User Agent: Microsoft_Internet_Explorer_5.00.438. Perhaps a new press release on another dubious “study” should be expected. I wonder if, this time, editors will be so eager to run SCC’s claims without verification.
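
As a rough illustration of why that User Agent stands out: genuine Internet Explorer browsers of that era identified themselves in the form "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)". The sketch below is only an assumption about how such a check might be written, not the method actually used on the WhoisIreland.com logs; the function name and the pattern are hypothetical.

```python
import re

# Assumed layout of a genuine Internet Explorer User Agent,
# e.g. "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)".
GENUINE_MSIE = re.compile(r"^Mozilla/\d\.\d \(compatible; MSIE \d+\.\d+[a-z]?;")


def looks_like_forged_ie(user_agent: str) -> bool:
    """Return True if the string claims to be Internet Explorer but does not
    follow the usual "Mozilla/x.x (compatible; MSIE ...)" layout."""
    lowered = user_agent.lower().replace(" ", "_")
    claims_ie = "internet_explorer" in lowered or "msie" in lowered
    return claims_ie and GENUINE_MSIE.match(user_agent) is None


if __name__ == "__main__":
    for ua in (
        "Microsoft_Internet_Explorer_5.00.438",
        "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)",
    ):
        print(ua, "->", looks_like_forged_ie(ua))
```

A check like this only flags clumsy forgeries; a spider that copies a real browser string exactly would pass it.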

Written by John McCormac on July 28th, 2005 with comments disabled.

Microsoft Sues Google Over Competition

Microsoft does not like competition. It either assimilates it or crushes it. But with Google, it may have found its match. A key player in Microsoft’s search operation apparently defected to Google. Microsoft is suing. Dr Kai-Fu Lee was corporate vice president of Microsoft’s Interactive Services Division. Google wants him to head its China operation.

The News.com article has some interesting items from the lawsuit. It states that Dr Kai-Fu Lee had been “responsible for overall development of the MSN Internet search application” and had been involved in Microsoft’s China strategy.

The general reaction to Microsoft’s search offering has been mixed. Some people think it is good for Google to have competition. On country-level search (specifically Ireland), it does not seem to move much beyond the simplistic country IP/country code TLD (ccTLD) grouping of websites. In this respect it seems that Microsoft, like Google, cannot identify websites from specific countries hosted outside that country’s IPs/ccTLDs - a problem that WhoisIreland.com has solved.
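
The simplistic grouping described above can be sketched roughly as follows. This is a guess at the general shape of such a rule, not the code of any search engine and not the WhoisIreland.com approach. The function name is hypothetical, and the network list uses a reserved documentation range as a stand-in for real Irish IP allocations.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical sample only: a documentation range standing in for Irish IP space.
ASSUMED_IRISH_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_irish_by_simple_rule(url: str, server_ip: str) -> bool:
    """Classify a site as Irish if it uses the .ie ccTLD or sits on Irish IP space."""
    host = urlparse(url).hostname or ""
    has_cctld = host.endswith(".ie")
    addr = ipaddress.ip_address(server_ip)
    on_irish_ip = any(addr in net for net in ASSUMED_IRISH_NETWORKS)
    return has_cctld or on_irish_ip


if __name__ == "__main__":
    # An Irish business on a .com domain hosted abroad fails both tests.
    print(is_irish_by_simple_rule("http://www.example.com/", "198.51.100.7"))
```

The last call shows the gap: an Irish site on a .com domain hosted on foreign IP space fails both tests and so never appears in the country-level results.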

Ultimately, this lawsuit may not be about the search business. It may really be about China as a future market for Microsoft. Like all great empires, Microsoft needs to continue expanding. China may be crucial to the survival of Microsoft. However, Google also needs to keep expanding.

Written by John McCormac on July 22nd, 2005 with comments disabled.

Metadata - A great idea?

Metadata is a great idea. It will be even better when websites actually use it on a widespread basis. The statistics for .ie websites show how poorly metadata is used:

Websites With Title, Keywords and Description: 10460

That’s out of approximately 36198 .ie websites. There are significant opportunities for Search Engine Optimisation companies in Ireland, if only the owners of the websites can be convinced of the importance of SEO and its effects.
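
To put those two figures in proportion, the share can be worked out directly. The function name below is just for illustration; the two numbers are the ones quoted above.

```python
def share_with_full_metadata(with_metadata: int, total: int) -> float:
    """Fraction of websites that carry a title, keywords and description."""
    return with_metadata / total


share = share_with_full_metadata(10460, 36198)
print(f"{share:.1%}")  # prints 28.9%
```

In other words, fewer than three in ten .ie websites carry all three elements.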

Metadata was great in the 1990s. Back then technology was expensive and it was a lot cheaper and easier for search engines to strip the metadata from a page and use it instead of the body text of the page. The falling cost of hard drive space and processing power changed all that and made it possible for search engines to cache complete copies of the webpages and implement better searching algorithms.
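
Stripping the metadata from a page is a small job, which is part of why it was attractive. The following is a minimal sketch using Python’s standard html.parser module; it is not the code behind the .ie statistics above, and the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict:
    """Return whatever title, keywords and description the page declares."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}


def has_full_metadata(html: str) -> bool:
    """True only if title, keywords and description are all present and non-empty."""
    found = extract_metadata(html)
    return all(found.get(key) for key in ("title", "keywords", "description"))
```

Running has_full_metadata over the index page of each site is one way a count like the one above could be produced, assuming a page only counts when all three elements are non-empty.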

Google and its link-based algorithm changed the emphasis from metadata to link structure. It was quite innovative but turned out to be as easily gamed as many other algorithms.

Now there is talk of the Semantic Web being the next big thing in search. Again, this is another nice theoretical solution that ignores the reality of the patchy nature of the web. These “solutions” have to be easy to use. They have to be integrated at an almost organic level in the web design programs.

It is over ten years since the appearance of the web’s first major search engines. Full metadata is still not included in the bulk of webpages. So what hope is there for the Semantic Web? Will it end up being an academic exercise in futility?

Written by John McCormac on July 19th, 2005 with comments disabled.

Local Search - Defining “Near”

Local search is more than just matching websites to a location. A few years ago, I did a lot of research on local search, theorising and also building experimental local search engines. One of them was a mobile phone based search engine. It was perhaps a bit more advanced than a simplistic SMS based query search engine interface in that it took the user’s location into account in generating the results.

The operator of the searchtheowl website quoted an entire Google Labs newsgroup post of mine from around that time in a post on his blog, which shows how badly understood “local search” is, even today.

Localised and local search is more than just stuffing a pile of URLs in a database and claiming that they are local because they are in the same country or even in the same county. The problem with local search is that the user wants to know what websites or resources are “near” to them. It is the definition of the term “near” that is at the heart of local search.
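
One possible reading of “near” is plain distance from the user, with a cut-off. The sketch below uses the standard haversine great-circle formula. It is only an illustration of that one definition, not the method used in the experimental search engines mentioned above; the function names, the cut-off and the sample coordinates (rough values for Dublin, Galway and Cork) are all assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest(user, resources, max_km):
    """Return names of resources within max_km of the user, closest first.

    user is a (lat, lon) pair; resources is a list of (name, lat, lon) tuples.
    """
    hits = []
    for name, lat, lon in resources:
        distance = haversine_km(user[0], user[1], lat, lon)
        if distance <= max_km:
            hits.append((distance, name))
    return [name for _, name in sorted(hits)]


if __name__ == "__main__":
    user_in_dublin = (53.35, -6.26)
    shops = [
        ("Cork shop", 51.90, -8.47),
        ("Dublin shop", 53.34, -6.27),
        ("Galway shop", 53.27, -9.05),
    ]
    print(nearest(user_in_dublin, shops, max_km=50))
```

Even this simple version shows why “same country” is not the same as “near”: all three sample shops are in Ireland, but only one is within 50 km of the user.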

Written by John McCormac on July 16th, 2005 with comments disabled.
