Search Engines


Search Engines Moving On Blogs?

Steve Rubel’s Micro Persuasion blog came across Yahoo’s test RSS search and posted images on this post. Yahoo pulled the test site. But it is interesting that Yahoo is taking RSS search so seriously.

The large search engines (Google, Yahoo, MSN) have been slowly evaluating adding blog search to their indices. Some like Google have incorporated a lot of blogs in their live index. Yahoo and MSN have also been busy. This Business Week article outlines some of the background.

Written by John McCormac on July 16th, 2005 with comments disabled.
Read more articles on Search Engines.

Microsoft Launches “Irish” Search Engine

One of the biggest problems with running a country level search engine is getting a good search index. Microsoft has a reasonable index but the Irish index seems to be based on links from Irish sites and Irish IPs. Unfortunately Microsoft is not good at building country level search indices. Then again even Google has problems when it comes to country level searching.

The promo for the launch didn’t quite go as planned. ENN.ie picked up on the UAE results problem. But the results were coming from an unfiltered search. The ‘pages from Ireland’ option is not the default option. Perhaps Microsoft Search should spend less on researchers looking at the “semantic web” and more on people who know how to build search engines and search indices.

Written by John McCormac on July 5th, 2005 with comments disabled.

Building Country Level Search Engines

The top tier is Google, Yahoo and MSN. Below it are the country level search engines. The country level search engines may be second tier but are often superior to the big three for the simple reason that they know their market. These country level search engines are based in the market and can tell dross from gold. Building a country level search engine is not just a case of lobbing a few URLs into a php script and hoping for the best.

Every few years in Ireland, some web developer gets the idea that it would be a good thing to start an Irish search engine without really understanding what a country level search engine involves. This typically means that the aforesaid web developer will scrape pages from WhoisIreland.com for data on .ie websites. It is, after all, the world’s biggest resource on Irish domains and websites. But there is a lot more to building a country level search engine than merely ripping the URLs from a few key directories.

This year has been no different and the latest “irish search engine” is a php script on a US server which its web developer claims is “Ireland’s only dedicated search engine”. This of course is quite funny considering that Ireland has a few dedicated search engines - Google, Yahoo, MSN all have Dublin operations and WhoisIreland.com and IndexIreland.com have been spidering the Irish web for years.

The quality of country level search engines can vary considerably. Some can provide competition for Google and others can give mediocrity a bad name. Eventually, every country level search engine hits the brick wall of keeping the index fresh. This is the search engine killer.

Apart from the obvious lack of a business plan on monetizing search, there has to be a strategy for acquiring new websites to keep an index fresh. Google et al have an advantage in that they are perpetually crawling the web and following links to detect new websites. While country level search engines can dedicate a computer or two to the task it is a very inefficient way of getting new sites. Many of the lower tier country level search engines rely on their users to submit URLs. This might work for well trafficked search engines but a few new URLs a day is not enough to keep a search engine in the game with Google and the other large search engines. For Ireland, the number of new domains and sites per month can be in the thousands. For the UK and Germany that figure can get into the tens of thousands. For the web as a whole, that figure is in millions of new sites and domains every month.

Building a country level search engine is quite different from building a generic search engine. With a generic search engine the strategy is to blindly follow links. This strategy has been discussed from both sides on WebmasterWorld. The Blind Spidering strategy is not a good one because it depends on effectively spidering the whole web and relying on the search engine algorithms to filter the results. That’s fine if you have unlimited resources. But building a country level search engine requires intelligence rather than computing brawn. The thread on WebmasterWorld illustrates the complete difference in view between the “Blind Spidering” approach used by Google et al and the view of a country level search operation. It is not simply a case of dropping a slice of Dmoz into the mix and hoping for the best.

In addition to providing statistical reports on the hosting industry of a few countries, some of the work here involves building country level search engine indices. In some respects it is like the Planet of Magrathea for search engines. These indices are not simplistic ones limited to just the country code top level domain (cctld - eg: .ie for Ireland). These are indices covering a country’s data footprint in its cctld, .com, .net, .org, .biz and .info. While it may sound easy, it is anything but.

Depending on the state of its hosting infrastructure, a country may have anything from 10% to 90% of its websites and domains hosted outside of its IP space. This is the dark web of a country’s webspace. It is not easily defined and it is hard to detect. The big problem for all search engine operators is in correlating a website with a particular country. The simplistic method that Google, Yahoo and Microsoft use is based on IP and cctld. If the IP or cctld is associated with a particular country then pages from those sites will be included in the “pages from $country” search.
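The IP and cctld test described above can be sketched in a few lines. This is only an illustration of the general idea, not how any of the large search engines actually implement it. The function names and the `COUNTRY` table are hypothetical, and the IP ranges are reserved documentation blocks rather than real Irish allocations.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipToInt(ip) {
  return ip.split(".").reduce(function (acc, octet) {
    return ((acc << 8) | parseInt(octet, 10)) >>> 0;
  }, 0);
}

// True if the address falls inside a CIDR block such as "192.0.2.0/24".
function inCidr(ip, cidr) {
  var parts = cidr.split("/");
  var bits = parseInt(parts[1], 10);
  var mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipToInt(ip) & mask) >>> 0) === ((ipToInt(parts[0]) & mask) >>> 0);
}

// Hypothetical country table. The ranges are documentation blocks,
// used here only as stand-ins for a country's IP space.
var COUNTRY = { cctld: ".ie", ranges: ["192.0.2.0/24", "198.51.100.0/24"] };

// A site counts as "from the country" if either its cctld or its IP matches.
function looksLikeCountrySite(hostname, ip, country) {
  var byTld = hostname.toLowerCase().endsWith(country.cctld);
  var byIp = country.ranges.some(function (range) {
    return inCidr(ip, range);
  });
  return byTld || byIp;
}

console.log(looksLikeCountrySite("example.ie", "203.0.113.5", COUNTRY));  // matched by cctld
console.log(looksLikeCountrySite("example.com", "192.0.2.77", COUNTRY));  // matched by IP
console.log(looksLikeCountrySite("example.com", "203.0.113.5", COUNTRY)); // missed by both tests
```

The third call is the dark web case: a .com site hosted outside the country's IP space fails both tests even if it belongs to a business in that country.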

Some hosters will have the DNS of their websites handled by big registrars such as Network Solutions or Register.com and their website IPs may also be on the IP ranges of large registrars. Hosting on large US and UK resellers is actually very common with small hosters. Typically these are small web developer businesses hosting their clients on shared hosting. They haven’t made the jump to their own dedicated hosting with dedicated DNS so they remain almost invisible to the country level search engines. For Ireland, this dark web could be as many as thirty thousand websites.

The new ‘Ghosthunter’ algorithm being tested here managed to detect Irish websites on these large registrars. In the first few minutes of operation, it detected over 1100 Irish sites with a 95% hit rate. At the moment, it is being used exclusively to detect Irish sites. However there is no reason why the same algorithm should not be able to detect the dark web of other countries’ websites.

The big question of course is whether it would be a good thing to market a continually updated feed of Irish website URLs to Irish search/directory operators?

Written by John McCormac on July 2nd, 2005 with comments disabled.

Why HTML Scrambling Is Not Secure Encryption

Scrambling the HTML of a webpage with Javascript is not unbreakable encryption and it could be a great way to get a site kicked out of the search engines. Indeed, to a search engine operator, a webpage with only meta data and no page text typically looks like spam.

Some websites use Javascript based HTML scrambling to protect the source code. Others use it to prevent the saving or printing of the webpages. But this obfuscation is sometimes sold to gullible website owners as unbreakable encryption. The reality is that it is very simple to break - well it would be. The algorithm to decode it is included in the webpage.

This particular HTML scrambling scheme relies on the browser to decode the “encrypted” HTML source and display it in the browser. The algorithm itself is typically included in a fragment of escaped Javascript. It often looks like this:

eval(unescape('%6B%3D%75%6E%65%73%63%61
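Because the loader is only percent-escaped, it can be read by decoding the string instead of handing it to eval. The sketch below uses only the truncated prefix quoted above, since the rest of the string is not reproduced here. It uses `decodeURIComponent` as a stand-in for `unescape`; the two give the same result for plain `%XX` ASCII sequences like these.

```javascript
// The truncated fragment quoted above. Only this prefix appears in the text.
var escapedPrefix = "%6B%3D%75%6E%65%73%63%61";

// Decode it rather than evaluating it.
var revealed = decodeURIComponent(escapedPrefix);

console.log(revealed); // prints: k=unesca
```

Even this short prefix shows that the "encrypted" block starts out as ordinary Javascript source.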

Basically the Javascript is unescaped, interpreted and run to unscramble the HTML source. The unscrambled webpage is displayed in the browser. The algorithm from one example is below:

function und1(s) {
  // 'un' is the unscrambled HTML
  var un = "";
  // l is the length of the scrambled HTML block in characters
  var l = s.length;
  // oh is half of l, rounded
  var oh = Math.round(l / 2);
  // the loop. Take the character at i and the character at
  // i + oh and put them on to the end of the 'un' string
  for (var i = 0; i <= oh; i++) {
    var a = s.charAt(i);
    var b = s.charAt(i + oh);
    un = un + a + b;
  }
  // X is the unscrambled HTML trimmed back to the original length
  var X = un.substr(0, l);
  return X;
}

The scrambled HTML is not that difficult to read. The first character is read, then the character at half the length of the scrambled HTML block is read. And the scrambled HTML is decoded two characters at a time. To a cryppie, this kind of scrambled text looks different to text encrypted with a hard algorithm. It still has the characteristics of natural language - something that ciphertext does not have.
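To see how little work is involved, here is a sketch of what the scrambling side presumably looks like, inferred from the decoding routine rather than taken from any vendor's code. The `scramble` function is hypothetical. `unscramble` is a compact restatement of the routine above, included so that the example runs on its own.

```javascript
// Hypothetical scrambler: even-indexed characters go in the first half,
// odd-indexed characters in the second half.
function scramble(html) {
  var evens = "";
  var odds = "";
  for (var i = 0; i < html.length; i++) {
    if (i % 2 === 0) {
      evens += html.charAt(i);
    } else {
      odds += html.charAt(i);
    }
  }
  return evens + odds;
}

// Compact restatement of the unscrambling routine: interleave the two
// halves again and trim any overrun back to the original length.
function unscramble(s) {
  var out = "";
  var half = Math.round(s.length / 2);
  for (var i = 0; i <= half; i++) {
    out += s.charAt(i) + s.charAt(i + half);
  }
  return out.substring(0, s.length);
}

var page = "<p>Hello, Ireland</p>";
var scrambled = scramble(page);

console.log(scrambled);
console.log(unscramble(scrambled) === page); // prints: true
```

No key is involved at any point. Anyone who has the page has everything needed to reverse it.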

From a cryptographic viewpoint, a Javascript scrambled webpage offers only the most elementary protection. It may stop casual printing and saving of webpages but that is it. The model itself is flawed because the unscrambled HTML has to be displayed in the browser and therefore the algorithm to unscramble the HTML has to be included.

So why do people use it? Some people want to protect their HTML. Others want to protect the links in their pages from poaching. Some sites have rather dubious links that they want to keep away from the attention of search engines. By using this kind of HTML obfuscation, they think that it evades search engines and content filtering.

However the downside is that search engine operators are aware of this kind of cloaking and so are content filter programmers. It would be an easy win for search engines to drop such sites from their indices and some content filters now apparently block websites with obfuscated HTML.
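Spotting such pages does not take much either. The following is a rough sketch of the kind of check a spider or content filter could run, not a description of what any particular search engine does. The function names and the text-length threshold are assumptions made for the example.

```javascript
// Strip script blocks and tags, then return whatever visible text is left.
function visibleText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical heuristic: an eval(unescape(...)) loader combined with
// almost no visible page text.
function looksObfuscated(html, minTextLength) {
  var hasLoader = /eval\s*\(\s*unescape\s*\(/i.test(html);
  return hasLoader && visibleText(html).length < minTextLength;
}

var scrambledPage =
  "<html><head><title>x</title></head><body>" +
  "<script>eval(unescape('%61%62%63'))</script></body></html>";
var ordinaryPage =
  "<html><body><p>An ordinary page with enough readable text " +
  "for a spider to index without any trouble.</p></body></html>";

console.log(looksObfuscated(scrambledPage, 50)); // prints: true
console.log(looksObfuscated(ordinaryPage, 50));  // prints: false
```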

Written by John McCormac on June 30th, 2005 with comments disabled.

Searching For A Clue?

A few days ago, the contact e-mail for the WhoisIreland.com spider got an e-mail to say that the site had been included in Ireland’s only dedicated search engine. Considering that WhoisIreland.com is the web’s biggest Irish search engine, it was a rather surreal thing. The fact that there is at least one other Irish search engine and a pile of other Irish web directories made it all the stranger. That and the fact that the e-mail began “A Chara”. This is the way that all the Irish government agencies used to start letters like tax demands.

Some of us Irish search engine and directory operators invest thousands of Euros in dedicated servers and research and building sites. But the Irish search and directory business is not exactly the business for the clueless. It is a tough battleground where only the best survive against the behemoths like Google and Yahoo.

Fergal O’Byrne’s OMNI SEO blog posted an interview with the operator of the site. It didn’t seem to be quite on the level of a real interview. Sure the buzz words and the marketing speak were all in place but the cornerstone of the business was missing. It seems that everyone sees search engine results and thinks that building and maintaining search engines is as easy as sticking a few URLs in an off the shelf php script on a shared hosting account and calling it a search engine.

The comments on Fergal’s blog were interesting in that some others contacted regarded the e-mails as spam. Though I’m still trying to figure out why someone would e-mail a search engine’s contact e-mail to tell it that it was included in a search engine. Such are the perils of being a search engine operator. I wonder how Google deals with it?

Written by John McCormac on June 17th, 2005 with 3 comments.

The Irish Blogosphere - How Big Is It?

Knowing the size of the Irish Blogosphere is an important part of building a search engine for Irish blogs. So far there is no accurate figure for the number of Irish blogs, and the rate at which the IrishBlogs group on Yahoo Groups is growing suggests that it could be in the high hundreds. The IrishBlogs tag on Technorati shows only a small cross section of the Irish Blogosphere and depends mainly on the use of tags. Without a quicktag solution (Wordpress), these tags are cumbersome to include with each post.

The main WhoisIreland.com Irish search engine spiders are running here and I’ve just been checking a few keywords on the raw search database. This part of the index (the Irish ie/com/net/org/biz/info websites) is almost complete and the next section is the user/personal/subs websites. The surprising thing is that the term “blog” shows 1771 hits with at least 150 of these sites having their own distinct top level domain. The Irish blogosphere could be somewhat larger, though more fragmented, than was first thought.

Written by John McCormac on March 29th, 2005 with 5 comments.

Irish Search Engine Update

For the past few days, the WhoisIreland.com spiders have been busy updating the index of Irish websites. The number of .ie websites has grown considerably and there are approximately 31,000 .ie sites in this update and slightly more Irish .com/net/org/biz/info websites. The Irish section of the Dmoz directory will also be included. This will result in an index of approximately 75,000 Irish websites making it the biggest index of Irish websites in the world. This new index will have a dynamic submission facility that will allow Irish sites not included in the index to be spidered within minutes of being submitted. The beta version of this new index will be available next week.

The beta test of the Irishblogosphere search engine will follow. The three fold strategy of detection, submission and monitoring will form the basis for this search engine and it will be separate from the main WhoisIreland.com search engine. As a result it will be faster and combined with a dynamically updated directory of Irish related blogs. It should make it easier to distill the voices of Irish blogs from the cacophony of the web’s blogosphere.

Written by John McCormac on March 19th, 2005 with comments disabled.

A Search Engine For Irish Blogs - 2

A central resource for the Irish blogosphere is evidently needed. From the discussion on the Connecting The Irish Blogosphere post, determining what is and is not an Irish related blog is going to be difficult but the interconnectedness of blogs is a very useful aspect compared to ordinary business websites. A business website is less likely to have links to other sites. A blog, on the other hand, tends to rely on linking to other blogs.

Over the last week or so, I’ve been doing the preliminary work on building the search engine for the Irish blogosphere. The linkage structure for a blog search engine is quite different to that of an ordinary website search engine. An ordinary search engine links websites - a blogosphere search engine links people. The hierarchical model of links and authority hubs does not work well. Sure you’ve got the star bloggers but to concentrate solely on that aspect is wrong - it is the old model of authoritative hubs and trickle down relevancy. The key aspect of a blog site is its posts rather than the blog site itself. The discussions and referrals generally concern posts and articles rather than websites.
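As a rough illustration of the post-centred view, the sketch below counts inbound links per post rather than per site. It is a simplified picture, not the structure of the search engine being built here. The `posts` data, its field names and the example URLs are all hypothetical.

```javascript
// Hypothetical crawl output: each entry is a post and the post URLs it links to.
var posts = [
  {
    url: "http://blog-a.example/2005/03/search",
    links: ["http://blog-b.example/2005/03/blogs"]
  },
  {
    url: "http://blog-c.example/2005/03/reply",
    links: [
      "http://blog-b.example/2005/03/blogs",
      "http://blog-a.example/2005/03/search"
    ]
  },
  { url: "http://blog-b.example/2005/03/blogs", links: [] }
];

// Count inbound links for each post, ignoring a post linking to itself.
function inboundByPost(posts) {
  var counts = {};
  posts.forEach(function (post) {
    post.links.forEach(function (target) {
      if (target !== post.url) {
        counts[target] = (counts[target] || 0) + 1;
      }
    });
  });
  return counts;
}

var counts = inboundByPost(posts);
console.log(counts["http://blog-b.example/2005/03/blogs"]); // prints: 2
```

Keying on the post URL means a single much-discussed post can surface even when it sits on a blog that is otherwise rarely linked.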

The lifetime of a blog post discussion is brief, varying from a few hours to a few weeks. This makes it somewhat different to the ordinary web, and triggered spidering (by blog ping or monitoring) is necessary. It may be necessary to split the search engine into a historical search and a current search.
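A minimal sketch of both ideas follows, assuming each indexed post carries a timestamp of its last activity. The function names, the `lastActivity` field and the 21 day cut-off are assumptions made for the example, not details of the planned search engine.

```javascript
var DAY_MS = 24 * 60 * 60 * 1000;

// Split posts into a current index and a historical index, based on
// how long ago the last activity on each post was seen.
function splitIndex(posts, now, currentDays) {
  var current = [];
  var historical = [];
  posts.forEach(function (post) {
    var age = now - post.lastActivity;
    if (age <= currentDays * DAY_MS) {
      current.push(post);
    } else {
      historical.push(post);
    }
  });
  return { current: current, historical: historical };
}

// A ping queues a URL for immediate spidering. Duplicate pings are ignored.
function onPing(queue, url) {
  if (queue.indexOf(url) === -1) {
    queue.push(url);
  }
  return queue;
}

var now = Date.UTC(2005, 2, 7); // 7 March 2005
var indexed = [
  { url: "http://blog-a.example/new-post", lastActivity: now - 2 * DAY_MS },
  { url: "http://blog-b.example/old-post", lastActivity: now - 60 * DAY_MS }
];

var split = splitIndex(indexed, now, 21);
console.log(split.current.length, split.historical.length); // prints: 1 1

var queue = [];
onPing(queue, "http://blog-a.example/new-post");
onPing(queue, "http://blog-a.example/new-post");
console.log(queue.length); // prints: 1
```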

The search engine will provide the infrastructure but the resource will also extract topics under active discussion and show the linkages. In some respects it would be a meta-blog but the topic and post monitoring will be automated. It will effectively provide a single page (or maybe more than one page) insight into what is going on in the Irish blogosphere.

Tags: Irish Search Engines

Written by John McCormac on March 7th, 2005 with 1 comment.
