A “search engine” wannabe mistook the number of webpages that Google has indexed for the number of websites on the net. The blog post is linked here but this is what it says:
“There I was sitting in the garden with a laptop, wireless broadband is great isn’t it, anyway, taking in the rays when it occurred to me how hard it is for a website to be seen when currently there are around 8,058,044,651 (yes all 8,058,044,651 of them.)”
Perhaps it was a bit too warm today. The number of web pages that Google has indexed is not the same as the number of websites on the net.
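The difference is easy to show on a small scale. The following is a minimal sketch that counts pages and websites separately from a list of URLs. The sample URLs are hypothetical, and treating each distinct hostname as one website is a simplifying assumption.

```python
from urllib.parse import urlsplit


def count_pages_and_sites(urls):
    """Return (distinct pages, distinct hostnames) for a list of URLs."""
    pages = set()
    sites = set()
    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if not host:
            continue  # skip anything that is not an absolute URL
        # A page is identified by host, path and query string.
        pages.add((host, parts.path or "/", parts.query))
        # Assumption: one hostname counts as one website.
        sites.add(host)
    return len(pages), len(sites)


# Hypothetical sample, not real index data.
sample = [
    "http://www.example.ie/",
    "http://www.example.ie/about.html",
    "http://www.example.ie/contact.html",
    "http://www.example.com/index.html",
]

pages, sites = count_pages_and_sites(sample)
print(pages, "pages on", sites, "sites")
```

An index size quoted by a search engine corresponds to the first number, not the second.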
As a country level search engine operator, I frequently see these claims. Even Microsoft now claims to have an “Irish” search engine. But someone confusing the number of webpages with the number of websites is definitely a first.
Tags: Irishblogs - Search - Irish Search Engines
Written by John McCormac on July 12th, 2005 with 7 comments.
Read more articles on General.
The European Parliament voted to reject software patents. The vote was 648 to 14, with 18 abstentions, to reject the legislation. The software industry in Europe has reason to celebrate. The legislation drafted by the European Commission ended up pleasing nobody. Apparently the European Commission will not submit new legislation on the subject. The legislation would have made software patents legal in Europe and reduced the European patents system to the completely broken level of the US Patents Office.
The Irish software industry has been distinctly anti-software patents. The problem is that the big players were the ones with the budgets for PR and for planting stories in the Irish media claiming that the Irish software industry was pro-software patents.
A bunch of self-appointed operations claiming to represent the Irish software industry were lobbying hard to portray the Irish software industry as being pro-software patents. The reality is that these organisations really were just shills for Microsoft and their friends. The average small programming business is not going to waste money on joining these organisations. Big companies however will. Thus what these organisations end up representing is the party line of large multinationals rather than the Irish software industry.
The Irish software industry has been anti-software patents with a few exceptions. This has been a victory for the European software industry and the people of Europe.
Tags: Irishblogs - Software Patents - EU - Open Source - Free Software
Written by John McCormac on July 6th, 2005 with comments disabled.
Read more articles on Irish Tech News.
A number of online news sources including Reuters are reporting that the proposed European Software Patents legislation may be in trouble. This is good news for the anti-Software Patents lobby but the vote has to be taken on July 6th. The EuObserver also reports that the legislation may be in trouble.
Lobbying has been rife and the pro-patent lobby has been planting stories all over the Irish media. PR operations claiming to represent the Irish software industry have been trying to get MEPs to vote for software patents. The reality is that these people only represent their backers, typically large multinationals with armies of lawyers ready to corrupt the European patents system so that it becomes a mirror of the failed US patents system. The Irish software industry is very much against software patents.
Tags: Irishblogs - Software Patents - EU
Written by John McCormac on July 5th, 2005 with comments disabled.
Read more articles on Tech Commentary.
One of the biggest problems with running a country level search engine is getting a good search index. Microsoft has a reasonable index but the Irish index seems to be based on links from Irish sites and Irish IPs. Unfortunately Microsoft is not good at building country level search indices. Then again even Google has problems when it comes to country level searching.
The promo for the launch didn’t quite go as planned. ENN.ie picked up on the UAE results problem. But the results were coming from an unfiltered search. The ‘pages from Ireland’ option is not the default option. Perhaps Microsoft Search should spend less on researchers looking at the “semantic web” and more on people who know how to build search engines and search indices.
Tags: Irishblogs - Search - Irish Search Engines - Microsoft - MSN
Written by John McCormac on July 5th, 2005 with comments disabled.
Read more articles on Search Engines.
The top tier is Google, Yahoo and MSN. Below it are the country level search engines. The country level search engines may be second tier but are often superior to the big three for the simple reason that they know their market. These country level search engines are based in the market and can tell dross from gold. Building a country level search engine is not just a case of lobbing a few URLs into a php script and hoping for the best.
Every few years in Ireland, some web developer gets the idea that it would be a good thing to start an Irish search engine without really understanding what a country level search engine involves. This typically means that the aforesaid web developer will scrape pages from WhoisIreland.com for data on .ie websites. It is, after all, the world’s biggest resource on Irish domains and websites. But there is a lot more to building a country level search engine than merely ripping the URLs from a few key directories.
This year has been no different and the latest “irish search engine” is a php script on a US server which its web developer claims is “Ireland’s only dedicated search engine”. This of course is quite funny considering that Ireland has a few dedicated search engines - Google, Yahoo, MSN all have Dublin operations and WhoisIreland.com and IndexIreland.com have been spidering the Irish web for years.
The quality of country level search engines can vary considerably. Some can provide competition for Google and others can give mediocrity a bad name. Eventually, every country level search engine hits the brick wall of keeping the index fresh. This is the search engine killer.
Apart from the obvious lack of a business plan on monetizing search, there has to be a strategy for acquiring new websites to keep an index fresh. Google et al have an advantage in that they are perpetually crawling the web and following links to detect new websites. While country level search engines can dedicate a computer or two to the task, it is a very inefficient way of getting new sites. Many of the lower tier country level search engines rely on their users to submit URLs. This might work for well-trafficked search engines but a few new URLs a day is not enough to keep a search engine in the game with Google and the other large search engines. For Ireland, the number of new domains and sites per month can be in the thousands. For the UK and Germany that figure can get into the tens of thousands. For the web as a whole, that figure is in the millions of new sites and domains every month.
Building a country level search engine is quite different from building a generic search engine. With a generic search engine the strategy is to blindly follow links. This strategy has been discussed from both sides on WebmasterWorld. The Blind Spidering strategy is not a good one because it depends on effectively spidering the whole web and relying on the search engine algorithms to filter the results. That’s fine if you have unlimited resources. But building a country level search engine requires intelligence rather than computing brawn. The thread on WebmasterWorld illustrates the complete difference in view between the “Blind Spidering” approach used by Google et al and the view of a country level search operation. It is not simply a case of dropping a slice of Dmoz into the mix and hoping for the best.
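To make the contrast concrete, here is a rough sketch of link following over a toy link graph. It is only an illustration: the `crawl` function, the `TOY_WEB` graph and the `.ie` filter are all hypothetical, and the filter is a crude stand-in rather than the approach used by any real search engine.

```python
from collections import deque


def crawl(seeds, links, keep=lambda url: True):
    """Breadth-first link following over a link graph.

    links: dict mapping a URL to the URLs it links to
           (a stand-in for fetching and parsing pages).
    keep:  predicate deciding whether a discovered URL is queued.
    """
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen and keep(target):
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical link graph, not real crawl data.
TOY_WEB = {
    "a.example.ie": ["b.example.ie", "c.example.com"],
    "b.example.ie": ["d.example.com"],
    "c.example.com": ["e.example.com", "f.example.ie"],
}

# Blind spidering: follow every link and filter the results later.
blind = crawl(["a.example.ie"], TOY_WEB)

# Naive country filter: only queue hosts ending in .ie.
filtered = crawl(["a.example.ie"], TOY_WEB, keep=lambda u: u.endswith(".ie"))

print(len(blind), "hosts visited blind;", len(filtered), "with the .ie filter")
```

In this toy graph the blind crawl visits every host, while the filtered crawl does far less work but never reaches f.example.ie, which is only linked from a .com host. Neither brute force nor a simple filter is enough on its own.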
In addition to providing statistical reports on the hosting industry of a few countries, some of the work here involves building country level search engine indices. In some respects it is like the Planet of Magrathea for search engines. These indices are not simplistic ones limited to just the country code top level domain (cctld - eg: .ie for Ireland). These are indices covering a country’s data footprint in its cctld, .com, .net, .org, .biz and .info. While it may sound easy, it is anything but.
Depending on the state of its hosting infrastructure, a country may have anything from 10% to 90% of its websites and domains hosted outside of its IP space. This is the dark web of a country’s webspace. It is not easily defined and it is hard to detect. The big problem for all search engine operators is in correlating a website with a particular country. The simplistic method that Google, Yahoo and Microsoft use is based on IP and cctld. If the IP or cctld is associated with a particular country then pages from those sites will be included in the “pages from $country” search.
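The IP and cctld test can be sketched in a few lines. This is a guess at the general shape of such a check, not the actual code of any of the big three. The function name `simple_country_match` and the `COUNTRY_NETWORKS` table are hypothetical, and the address ranges are reserved documentation ranges standing in for a country's real IP allocations.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical table: documentation ranges standing in for
# the IP space allocated to a country.
COUNTRY_NETWORKS = {
    "ie": [ipaddress.ip_network("192.0.2.0/24")],
}


def simple_country_match(url, ip, country):
    """Return True if the site's cctld or hosting IP points to the country."""
    host = (urlsplit(url).hostname or "").lower()
    # Test 1: the hostname ends in the country's cctld.
    if host.endswith("." + country):
        return True
    # Test 2: the hosting IP falls inside the country's IP space.
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in COUNTRY_NETWORKS.get(country, []))


# A .ie site hosted abroad is caught by the cctld test.
print(simple_country_match("http://www.example.ie/", "198.51.100.7", "ie"))
# A .com site hosted on an Irish IP is caught by the IP test.
print(simple_country_match("http://www.example.com/", "192.0.2.10", "ie"))
# A .com site hosted abroad fails both tests, whoever owns it.
print(simple_country_match("http://www.example.com/", "198.51.100.7", "ie"))
```

The third case is the dark web problem: a site that belongs to the country but sits on a generic domain and a foreign IP is invisible to this kind of check.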
Some hosters will have the DNS of their websites handled by big registrars such as Network Solutions or Register.com and their website IPs may also be on the IP ranges of large registrars. Hosting on large US and UK resellers is actually very common with small hosters. Typically these are small web developer businesses hosting their clients on shared hosting. They haven’t made the jump to their own dedicated hosting with dedicated DNS so they remain almost invisible to the country level search engines. For Ireland, this dark web could be as many as thirty thousand websites.
The new ‘Ghosthunter’ algorithm being tested here managed to detect Irish websites on these large registrars. In the first few minutes of operation, it detected over 1100 Irish sites with a 95% hit rate. At the moment, it is being used exclusively to detect Irish sites. However there is no reason why the same algorithm should not be able to detect the dark web of other countries’ websites.
The big question of course is whether it would be a good thing to market a continually updated feed of Irish website URLs to Irish search/directory operators.
Tags: Irishblogs - Search - Irish Search Engines - Web Directories
Written by John McCormac on July 2nd, 2005 with comments disabled.
Read more articles on Search Engines.