Search Innovations

Making the Web more useful

Principles of Spamless Search

I’ll present some principles of spamless search; most are obvious, but some may be controversial. These principles serve as desirable properties of an ideal SST (spamless search technique). Note that an SST involves not just the actual searching, say, for a specific query, but also all the pre-processing (e.g., crawling and indexing) required before that searching. While some of these principles have been culled from existing techniques and published papers, this seems to be the first explicit listing of spamless search principles.

  • SST should include automated spam detection. Since there are billions of web pages and many of them are frequently updated, manual detection of spam pages is prohibitively expensive, and almost impossible to do in a timely fashion.
  • Spam detection should not be just binary. A page X may have more spam than a page Y, and everything else being equal, X should not be ranked before Y in any search result. Thus, rather than just judging whether a page is spam or not, SST should assign it a spam score reflecting the amount of spam in the page. A complete spam rating assigns such a score to each page. The scores in a rating may be restricted to a pre-specified range of values, say, real numbers in the interval [0,1], [0,∞), [-1,1], or (-∞,∞).
  • SST should leverage external guidance in spam detection. External guidance may include a “black list” of spam pages and a “white list” of non-spam pages, either computed by an algorithm, say, using content or statistical analysis, or assigned manually, say, by trusted experts. The spam scores computed by SST should be consistent with the external guidance.
  • The spam score of a page should depend on both its analyzed content and its outgoing links. The spam scores computed by SST should be consistent with the link structure; for example, a page whose only outgoing links point to spam pages should get a high spam score. The content may be analyzed for a variety of spam, such as term spamming. However, using unanalyzed content to determine the spam score would be unreasonable.
  • The relative distribution of links is more important than the quantity of links. For example, a page with just 2 outgoing links, both to spam pages, should get a higher spam score than a page that has 3 links to spam pages and 3 links to non-spam pages.
  • SST should allow censure links. A censure link from page X to page Y may reflect X’s opinion that Y is a spam page. The current convention of treating each link as a (positive) endorsement discourages pages from linking to spam pages, even to explicitly illustrate spam pages! Even the rel="nofollow" link attribute is not sufficient; for example, it does not provide any benefit to a page that compiles a list of spam pages to warn its users to avoid them.
  • SST should accept a trust score for any link. Designating a link as just either a (positive) endorsement or a censure rules out finer nuances of opinion, say, that one page is more spam than another. Thus, just as pages get a range of spam scores, links should be allowed a similar range of trust scores. The trust score of a link may be assigned manually, say, by its page owner, or may be computed by an algorithm, say, using adjacent-text analysis, or may be inherited from a default value set at a higher level like page, site, host, domain, or Web. To avoid retaliation, SST could allow hiding of trust scores.
  • SST should encourage endorsing non-spam pages and censuring spam pages, and discourage endorsing spam pages and censuring non-spam pages. Endorsing spam pages and censuring non-spam pages should increase the spam score of the linking page, while endorsing non-spam pages and censuring spam pages should decrease it.
  • SST should allow a variety of customizations. SST should allow different interpretations of spam scores and trust scores, like quality, reputation, badness, recommendation, etc., as long as spam indicates some undesirable trait and trust indicates some desirable trait. SST should also allow customization of spam scores to a specific social or professional group; for example, advertisers may have a slightly different view of spam, and some researchers may deliberately search for spam pages! SST should also allow personalization of spam scores based on individual preferences.
  • SST should be robust against malicious attacks. This one is obvious!
  • SST should scale to the already-huge and still-growing Web. Its convergence and scalability should be similar to, or even better than, those of techniques like PageRank.
  • The spam rating should be usable for further processing. The further processing may include computing a popularity rating like PageRank, or integrating with other approaches, say, based on IR techniques.
  • SST should remove spam even before generating the search results. Filtering spam pages out of the search results afterwards (or even re-ranking them) is a less efficient approach.
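
To make a few of these principles concrete (non-binary scores, relative link distribution, censure links, and per-link trust scores), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the Link class, the function names, the choice of [0,1] for spam scores and [-1,1] for trust scores, the averaging formula, and the content_weight parameter. It is a single scoring step that takes the spam scores of linked pages as given, not a complete SST.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Link:
    """Hypothetical outgoing link, used only for this illustration."""
    target_spam: float   # spam score of the linked page, in [0, 1]
    trust: float = 1.0   # in [-1, 1]: +1 full endorsement, -1 full censure


def link_spam_score(links: Sequence[Link]) -> Optional[float]:
    """Mean per-link penalty in [0, 1], or None if there are no links.

    Averaging (rather than summing) makes the result depend on the
    relative distribution of links, not on how many there are.
    """
    if not links:
        return None
    penalties = []
    for link in links:
        goodness = 1.0 - 2.0 * link.target_spam   # +1 clean, -1 spam
        agreement = link.trust * goodness         # > 0: endorses clean or censures spam
        penalties.append((1.0 - agreement) / 2.0)
    return sum(penalties) / len(penalties)


def spam_score(content_spam: float, links: Sequence[Link],
               content_weight: float = 0.5) -> float:
    """Blend an (already analyzed) content score with the link score."""
    link_part = link_spam_score(links)
    if link_part is None:
        return content_spam
    return content_weight * content_spam + (1.0 - content_weight) * link_part


# Two endorsements, both of spam pages.
two_spam = [Link(1.0), Link(1.0)]
# Three endorsements of spam pages and three of non-spam pages.
mixed = [Link(1.0)] * 3 + [Link(0.0)] * 3
# A warning list: two censure links to spam pages.
warning_list = [Link(1.0, trust=-1.0)] * 2

for name, links in [("two_spam", two_spam), ("mixed", mixed),
                    ("warning_list", warning_list)]:
    print(name, link_spam_score(links))
```

Under these assumptions, the page with only 2 links to spam pages scores higher than the page with 3 spam and 3 non-spam links, and the page that censures spam pages is not penalized for linking to them.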

I welcome your comments on these principles, especially if you disagree with any of them or want to add something specific.



July 27, 2006 | Spam

Spamless Search: An Introduction

Spamdexing (or search engine spamming) increases web search failures and costs. Combating spam is an arms race, where continually evolving spamdexing techniques require developing even more powerful techniques for removing spam from search results, that is, spamless search. This series of posts will:

  • introduce principles of spamless search;
  • present a new approach for spamless search;
  • analyze various approaches using these principles; and
  • compare these approaches using illustrative examples.

We plan to analyze and compare the following approaches:

Background Info:
Some recent statistics on search failures and costs, partly due to spamdexing:

  • The cost of not finding the right information is about $5.3 million per year for a company with 1,000 knowledge workers (IDC).
  • Internet searches fail 30% of the time (Outsell).

Wikipedia defines spamdexing as dishonest practices that mislead search and indexing programs to give a page a search result ranking it does not deserve. In contrast, search engine optimization (SEO) uses “white hat” techniques for making a website indexable by search engines, without misleading the indexing process.

Common spamdexing techniques involve one or both of the following:

  • Content spam: Dishonest web page content, for example, keyword stuffing and invisible text.
  • Link spam: Dishonest web links, for example, link farms and hidden links.
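
As a small illustration of content spam, keyword stuffing tends to make one term dominate a page's text. The Python sketch below computes the share of word tokens taken up by the most frequent term. The function name top_term_share and the tokenization rule are assumptions made for illustration; a real detector would need far more than this single signal (for example, stop-word handling and comparison against typical pages).

```python
import re
from collections import Counter


def top_term_share(text: str) -> float:
    """Fraction of word tokens accounted for by the most frequent term.

    Returns 0.0 for text with no word tokens. A high value is only a
    hint of keyword stuffing, not proof.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return most_common_count / len(tokens)


stuffed = "cheap pills cheap pills cheap pills buy cheap pills"
print(top_term_share(stuffed))
```

Such a share could feed into a non-binary spam score rather than a yes/no decision.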


July 24, 2006 | Spam

Search Innovations: Introduction

The purpose of the Search Innovations (SI) blog is to discuss novel technical ideas for improving web search and its applications. We specifically seek improvements that will make the web more useful. Important topics of current interest are:

  • Web spam prevention: How to remove spam pages from search results? How to avoid crawling spam pages? How to discourage creation of spam pages?
  • Personalized search: How to tailor search results to a specific user?
  • Social search: How to customize search results based on the preferences of a community?
  • Contextual search: How to customize search results based on the current context?
  • Others to be added, as requested or needed!

You are invited to comment on the ideas presented here and to contribute new ideas. If you do not wish to participate in a public discussion, you are invited to email your comments to Mukesh, the current moderator of the SI blog.



July 18, 2006 | Admin

   
