Search Innovations

Making the Web more useful

Principles of Spamless Search

I’ll present some principles of spamless search, mostly obvious but some may be controversial. These principles will serve as desirable properties of an ideal SST (spamless search technique). Note that a SST involves not just the actual searching, say, based on a specific query, but also all the pre-processing (e.g., crawling and indexing) required before that searching. While some of these principles have been culled from existing techniques and published papers, this seems to be the first explicit listing of spamless search principles.

  • SST should include automated spam detection. Since there are billions of web pages and many of them are frequently updated, manual detection of spam pages is prohibitively expensive, and almost impossible to do in a timely fashion.
  • Spam detection should not be just binary. A Page X may have more spam than Page Y, and everything else being equal, X should not be ranked before Y in any search result. Thus, rather than just judging whether a page is spam or not, it should be assigned a spam score, reflecting the amount of spam in the page. A complete spam rating assigns such a score to each page. The scores in a rating may be restricted to a pre-specified range of values, say, real numbers in the set [0,1], [0,∞), [-1,1], or (∞,∞).
  • SST should leverage external guidance in spam detection. External guidance may include “black list” of spam pages and “white list” of non-spam pages, computed by an algorithm, say, using content or statistical analysis, or assigned manually, say, by trusted experts. The spam scores computed by SST should be consistent with the external guidance.
  • The spam score of a page should depend on both its analyzed content as well as its outgoing links. The spam scores computed by SST should be consistent with the link structure; for e.g., a page with only outgoing links to spam pages should get a high spam score. The content may be analyzed for a variety of spam, like, term spamming. However, using unanalyzed content to determine spam score would be unreasonable.
  • The relative distribution of links is more important than the quantity of links. For e.g., a page with just 2 outgoing links, both to spam pages, should get a higher spam score than a page that has 3 links to spam pages and 3 links to non-spam pages.
  • SST should allow censure links. A censure link from page X to Y may reflect X’s opinion that Y is a spam page. The current convention of treating each link as a (positive) endorsement prevents pages to link to spam pages, even to explicitly illustrate spam pages! Even the nofollow tag is not sufficient; for e.g., it does not provide any benefit to a page that compiles a list of spam pages for warning its users to avoid them.
  • SST should accept a trust score for any link. Designating a link just as either (positive) endorsement or censure prevents finer nuances in opinion, say, that one page is more spam than another. Thus, just like using a range of spam scores for pages, links should also be allowed a similar range of trust scores. A trust score of a link may be assigned manually, say, by its page owner, or may be computed by an algorithm, say, using adjacent-text analysis, or may be inherited from a default value set at a higher level like page, site, host, domain, or Web. To avoid retaliation, SST could allow hiding of trust scores.
  • SST should encourage endorsing non-spam pages and censuring spam pages, and discourage endorsing spam pages and censuring non-spam pages. Endorsing spam pages and censuring non-spam pages should increase the spam score, while endorsing non-spam pages and censuring spam pages should decrease the spam score.
  • SST should allow a variety of customizations. SST should allow different interpretations of spam scores and trust scores, like quality, reputation, badness, recommendation, etc., as long as spam indicates some undesirable trait and trust indicates some desirable trait. SST should also allow customization of spam scores to a specific social or professional group; for e.g. advertisers may have a slightly different view of spam and some researchers may deliberately search for spam pages! SST should also allow personalization of spam scores based on an individual preference.
  • SST should be robust against malicious attacks. obvious!
  • SST should scale to the already-huge and still-growing Web. The convergence and scalability should be similar or even better than that of techniques like PageRank.
  • Spam rating should be usable for further processing. The further processing may include computing popularity rating like PageRank, or integrating with other approaches, say, based of IR techniques.
  • SST should remove spam even before generating the search results. Filtering (or even re-ranking) spam pages out from search results is a less efficient approach.

I welcome your comments on these principles, especially if you disagree with or if you want to add something specific.


Submit post to: Digg | Del.icio.us | BlinkList | Furl | Spurl | Reddit | Simpy | RawSugar
Subscribe to RSS feed: Entries | Comments
Site Search Tags: , , , , , , , , , , , , , , , , ,

July 27, 2006 Posted by Mukesh | Spam | | No Comments Yet

Spamless Search: An Introduction

Spamdexing (or search engine spamming) increases web search failures and costs. Combating spam is an arms race, where continually evolving spamdexing techniques require developing even more powerful techniques for removing spam from search results, that is, spamless search. This series of posts will:

  • introduce principles of spamless search;
  • present a new approach for spamless search;
  • analyze various approaches using these principles; and
  • compare these approaches using illustrative examples.

We plan to analyze and compare the following approaches:

Background Info:
Some recent statistics on search failures and costs, partly due to spamdexing:

  • The cost of not finding the right information is about $5.3 million per year for a company with 1,000 knowledge workers (IDC).
  • Internet searches fail 30% of the time (Outsell).

Wikipedia defines spamdexing as dishonest practices that mislead search and indexing programs to give a page a search result ranking it does not deserve. In contrast, search engine optimization (SEO) uses “white hat” techniques for making a website indexable by search engines, without misleading the indexing process.

Common spamdexing techniques involve one or both of the following:

  • Content spam: Dishonest web page content, for example, keyword stuffing and invisible text.
  • Link spam: Dishonest web links, for example, link farms and hidden links.

Submit post to: Digg | Del.icio.us | BlinkList | Furl | Spurl | Reddit | Simpy | RawSugar
Subscribe to RSS feed: Entries | Comments
Site Search Tags: , , , , , , , , , , , ,

July 24, 2006 Posted by Mukesh | Spam | | No Comments Yet

Thanks

We are grateful to the following people, sites, and organizations for direct or indirect help or influence in creating this SI blog:

(Disclaimer: The following may not necessarily endorse or agree with the contents of this blog. Please send email if you would like your name to be removed from here. This list will be updated periodically.)

  • Lorelle VanFossen of Lorelle on WordPress: excellent tips on how to conveniently add “Site search tags” and “Sumbit post to” on each post.

July 19, 2006 Posted by Mukesh | Uncategorized | | No Comments Yet

Search Innovations: Introduction

The purpose of Search Innovations (SI) blog is to discuss novel technical ideas for improving web search and its applications. We specifically seek improvements that will make the web more useful. Important topics of current interest are:

  • Web spam prevention: How to remove spam pages from search results? How to avoid crawling spam pages? How to discourage creation of spam pages?
  • Personalized search: How to tailor search results to a specific user?
  • Social search: How to customize search results based on the preferences of a community?
  • Contextual search: How to customize search results based on the current context?
  • Others to be added, as requested or needed!

You are invited to comment on the ideas presented here and to contribute new ideas. If you do not wish to participate in a public discussion, you are invited to email your comments to Mukesh, the current moderator of SI blog.


Submit post to: Digg | Del.icio.us | BlinkList | Furl | Spurl | Reddit | Simpy | RawSugar
Subscribe to RSS feeds: Entries | Comments
Site Search Tags: , , , , , , , , , , ,

July 18, 2006 Posted by Mukesh | Admin | | No Comments Yet

Awards

  • Research Initiation Award, National Science Foundation, 1994.
  • Best Paper Award (runner-up), Ninth Canadian Conference on Artificial Intelligence, 1992.
  • Best Thesis Award, Computer Science & Engineering, Indian Institute of Technology, Kanpur, India, 1986.
  • CAREER Award, National Science Foundation, 1997  (Offered, but I declined upon leaving Columbia University)

July 13, 2006 Posted by Mukesh | Uncategorized | | No Comments Yet

Patents

Back to Mukesh

  1. Method and System for Optimizing Request-Promise Workflows, USA No. 6,895,384, May 2005.
  2. Optimizing active decision making using simulated decision making, PCT No. 050401, April 2003.
  3. Computer Implemented Scheduling and Process Using Abstract Local Search Technique, USA No. 6,456,996, September 2002.
  4. Query optimization by type lattices in object-oriented logic programs and deductive databases, USA No. 5,307,445, April 1994 (also granted in several other countries).
  5. Pending:

  6. Method and system for a variety of custom searches, Filed 2006.
  7. Quality scores and popularity ranking of linked objects, Filed 2006.
  8. A Personalized Web-Based Recommendation System, Filed 2005.
  9. Generating Motions in Real-Time With Model-Based Methods, Filed 2005.
  10. Model-based optimal optical network planning and operations, Filed 2005.
  11. Making decisions with POMDPs and related formalisms, Filed 2005.
  12. Efficient and effective pathfinding heuristics based on hierarchies, Filed 2005.
  13. Generating an Optimized Price Schedule for a Product, Filed 2001.
  14. System and Method for Multi-Party Constrained Optimization, Filed 2000.
  15. Method and System for Multi-Enterprise Optimization using Flexible Trade contracts, Filed 2000.

July 13, 2006 Posted by Mukesh | Uncategorized | | No Comments Yet

Publications

Back to Mukesh

  1. M. Dalal, B. Mukherjee and A. Prieditis. Model-based Optimal Optical Network Planning and Operations, OFC/NFOEC Conference, Anaheim, CA, March 2006.
  2. Prieditis, A. and Dalal, M., Applying Model-Based Decision-Making Methods to Games: Applying the Locust AI Engine to Quake III, in Game Programming Gems 6, Michael Dickheiser (ed.), Charles River Media, March 2006.
  3. Prieditis, A. and Dalal, M. Power Tools for Project Schedulers, Constructech Magazine, December 2005.
  4. Prieditis, A, Dalal, M, Groel, B, Bair, A, and Connelly, L., Optimizing an Emergency Department Through Simulation, Proceedings of the International Conference on Health Science Simulation, New Orleans, January 23-27, 2005, pages 32-37.
  5. M. Dalal, B. Groel, and A. Prieditis. Intelligent Decision Making Using Real-Time Simulations of Continuous Processes. Proceedings of the 2004 Advance Simulation Technologies Conference (ASTC), Arlington, VA, April 2004.
  6. A. Prieditis and M. Dalal. Model-Based Swarm Control of Unmanned Ground Vehicles. Proceedings of the 2004 Advance Simulation Technologies Conference (ASTC), Arlington, VA, April 2004.
  7. A. Prieditis and M. Dalal. Model, Simulate, and Optimize Hospital workflows. 2004 Annual HIMMS Conference and Exhibition, Orlando, Florida.
  8. A. Prieditis and M. Dalal. Simulation-Based Real-Time Resource Allocation for Hospital Workflows. Proceedings of 2004 International Health Sciences Simulation Conference.
  9. A. Prieditis, M. Dalal, A. Arcilla, B. Groel, M. Van Der Bock and R. Kong. SmartSwarms: Distributed UAVs that Think. 2004 Command and Control Research and Technology Symposium (CCRTS), San Diego, CA, June 2004.
  10. M. Dalal, B. Groel, and A. Prieditis. Real-time decision making using simulation. Proceedings of the 2003 Winter Simulation Conference, New Orleans, LA, December 2003.
  11. M. Dalal. Anytime Families of Tractable Propositional Reasoners. Annals of Mathematics and Artificial Intelligence, 22(3-4):297-318, 1998.
  12. K. McKeown, S. F. Feiner, M. Dalal, and S. Chang. Generating Multimedia Briefings: Coordinating Language and Illustration. Artificial Intelligence, 103(1-2): 95-116, 1998.
  13. J.M. Crawford, M. Dalal, and J. P. Walser. Abstract local search. Proceedings of the AIPS-98 Workshop on Planning as Combinatorial Search, Pittsburgh, PA, June 1998.
  14. R. Roy-Chowdhury and M. Dalal. Model-Theoretic Semantics and Tractable Algorithm for CNF-BCP. Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), Providence, Rhode Island, 227-232, July, 1997.
  15. L. Yang and M. Dalal. Empirical results on anytime propositional reasoning. Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI-97), Newport Beach, California, 77-83, November, 1997.
  16. L. Khreisat and M. Dalal. Anytime reasoning with probabilistic inequalities. Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI-97), Newport Beach, California, 60-66, November, 1997.
  17. M. Dalal and L. Yang. Preliminary Empirical Results on Anytime Propositional Reasoning (Extended Abstract). Proceedings of the AAAI-97 Workshop on Building Resource-Bounded Reasoning Systems, Providence, Rhode Island, July, 1997.
  18. M. Dalal, S. Feiner, K. McKeown, S. Pan, M. Zhou, T. Hollerer, J. Shaw, Y. Feng, and J. Fromer. Negotiation for automated generation of temporal multimedia presentations. In Proceedings of the Fourth International Multimedia Conference (ACM Multimedia’96), Boston, 55-64, MA, 1996. Acceptance rate: 36 out of 162.
  19. M. Dalal, S. Feiner, K. McKeown, D. Jordan, B. Allen, and Y. alSafadi. MAGIC: An Experimental System for Generating Multimedia Briefings about Post-Bypass Patient Status. In Proceedings American Medical Informatics Association Annual Fall Symposium, Washington DC, 1996.
  20. M. Dalal and Y. Feng. Anytime Temporal Reasoning Based on Propositional Satisfiability (Extended Abstract). In Proceedings of Second International Conference on Principles and Practice of Constraint Programming (CP96), Pages 535-536, Cambridge, Massachusetts, 1996.
  21. M. Dalal. Semantics of an Anytime Family of Reasoners. In Proceedings of the European Conference on Artificial Intelligence (ECAI-96), Pages 360-364, Budapest, Hungary, 1996. Acceptance rate: 131 out of 450.
  22. M. Dalal. An Almost Quadratic Class of Satisfiability Problems. In Proceedings of the European Conference on Artificial Intelligence (ECAI-96), Pages 355-359, Budapest, Hungary, 1996. Acceptance rate: 131 out of 450.
  23. M. Dalal. Semantics of an Efficient Propositional Reasoner: Preliminary Report. In Proceedings of the Ninth Florida AI Research Symposium (FLAIRS-96), Pages 101-105, Florida 1996.
  24. M. Dalal. Anytime Families of Tractable Propositional Reasoners. In Proceedings of the Fourth International Symposium on Artificial Intelligence and Mathematics, (AI/Math-96), Pages 42-45, Florida, 1996.
  25. K. McKeown, M. Dalal, et. al. The JANUS digital library. In Proceedings of Digital Libraries ‘94, Conference on the Theory and Practice of Digital Libraries, College Station, Texas, June 1994.
  26. M. Dalal and D. W. Etherington. A hierarchy of tractable satisfiability problems. Information Processing Letters, 44(4):173-180, December 1992.
  27. M. Dalal. Tractable deduction in knowledge representation systems. In Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning (KR ‘92), pages 393-402, Cambridge, Massachusetts, 1992. Acceptance rate: 67 out of 300.
  28. M. Dalal and D. W. Etherington. Tractable approximate deduction using limited vocabularies. In Proceedings of the Ninth Canadian Conference on Artificial Intelligence (AI ‘92), pages 206-212, Vancouver, Canada, 1992. Runner’s-up for the best paper award. Also accepted for the Tenth National Conference on Artificial Intelligence (AAAI-92), San Jose, California, but withdrawn in favor of AI ‘92.
  29. M. Dalal. Tractable instances of some hard deduction problems. In B. Neumann, editor, Proceedings of the Tenth European Conference on Artificial Intelligence (ECAI ‘92), pages 354-358, Vienna, Austria, 1992.
  30. M. Dalal. Efficient propositional constraint propagation. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), pages 409-414, San Jose, California, 1992.
  31. M. Dalal and D. W. Etherington. Tractable approximate deduction using limited vocabularies. In Proceedings of the Workshop on Tractable Reasoning, pages 23-28, San Jose, California, 1992. American Association for Artificial Intelligence.
  32. T. Imielinski and M. Dalal. Your time is up: approximate query answering in deductive databases. In Proceedings of the Workshop on Automatic Generation of Approximations and Abstractions, pages 175-185, Boston, Massachusetts, 1990.
  33. M. Dalal and D. Gangopadhyay. OOLP : A translation approach to object-oriented logic programming. In W. Kim, J.-M. Nicolas, and S. Nishio, editors, Proceedings of the First International Conference on Deductive and Object-Oriented Databases, pages 555-568, Kyoto, Japan, 1989. Acceptance rate: 27 out of 85.
  34. M. Dalal. Investigations into a theory of knowledge base revision: Preliminary report. In Proceedings of the Seventh National Conference on Artificial Intelligence (AAAI-88), pages 475-479, Saint Paul, Minnesota, 1988.
  35. M. Dalal. Logic-based Computer Programming Paradigms. In Handbook of Discrete and Combinatorial Mathematics, Ken Rosen (ed.), CRC Press.
  36. M. Dalal. Tractable reasoning in knowledge representation systems. Technical Report CUCS-017-95, Department of Computer Science, Columbia University, New York, NY, July 1995. An earlier version was published as Technical Report DCS-TR-321, Department of Computer Science, Rutgers University, New Brunswick, NJ, May 1995.
  37. M. Dalal. Updates in Propositional Databases. Technical Report DCS-TR-222, Department of Computer Science, Rutgers University, New Brunswick, NJ, February 1988.
  38. N.H. Minsky, M. Dalal, et. al. Specifications of the Darwin System. Technical Report CAIP-TR-053, CAIP Center, Rutgers University, New Brunswick, NJ, November 1987.
  39. M. Dalal and S. Arya. Segmenting and Recognizing Handwritten Script. Bachelors’ thesis, Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India, April 1986.
  40. S. Marwah, M. Dalal, M. Agarwal, et. al. Development of a Bridge Playing Program. Technical Report TRCS-85-25, Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India, January 1985.

Back to Mukesh

July 13, 2006 Posted by Mukesh | Uncategorized | | No Comments Yet

Mukesh

Click for …PatentsProductsPublications

  • 20+ years of IT experience across industry and academia.
  • Senior-level architect & technology executive.
  • Directly responsible for 10+ software products – from concept to deployment.
  • Expertise in R&D, product innovation & commercialization, software development, intellectual property, P&L management, and team leadership.
  • Revenue of $10+ million from direct innovations & $200+ million from contributed products.
  • Awarded 3 US and 1 European patent. 11 more patents pending.
  • Invited speaker & published author of 35+ technical papers.
  • Ph.D. in Computer Science from Rutgers University.
  • B.Tech. in CS from Indian Institute of Technology, Kanpur (IIT).
  • Principal architect, President & Co-Founder of Lookahead Decisions Inc.
  • Senior Member of Technical Staff at i2 Technologies, until 2002.
  • Assistant Professor of Computer Science at Columbia University, until 1997.
  • Alumni of AT&T Bell Laboratories and IBM T.J. Watson Research Center.
  • Research and technology grants from NIST, NSF, DARPA and NYS CAT.
  • Background in artificial intelligence, software engineering, web search, computer simulation, supply chain, real-time decision making, intelligent pricing, planning, & optimization.

July 13, 2006 Posted by Mukesh | Uncategorized | | No Comments Yet

SI

The purpose of Search Innovations (SI) blog is to discuss novel technical ideas for improving web search and its applications. We specifically seek improvements that will make the web more useful. Important topics of current interest are:

  • Web spam prevention: How to remove spam pages from search results? How to avoid crawling spam pages? How to discourage creation of spam pages?
  • Personalized search: How to tailor search results to a specific user?
  • Social search: How to customize search results based on the preferences of a community?
  • Contextual search: How to customize search results based on the current context?
  • Others to be added, as requested or needed!

You are invited to comment on the ideas presented here and to contribute new ideas. If you do not wish to participate in a public discussion, you are invited to email your comments to Mukesh, the current moderator of SI blog.


Subscribe to RSS feeds: Entries | Comments

July 13, 2006 Posted by Mukesh | Uncategorized | | No Comments Yet