[Security Paper Reading Notes] Survey on Web Spam Detection: Principles and Algorithms

Date: 2016-12-16

This paper was published in SIGKDD Explorations 2013. The authors are Nikita Spirin and Jiawei Han from UIUC.

The paper surveys the main classes of web spam detection algorithms. It focuses on search engine spam rather than social media spam.

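Types of spam and their techniques
1. Content Spam
Search engines rank page content with a TF-IDF model, so these spammers stuff popular terms into the page content to raise its rank.
2. Link Spam
Search engines use PageRank to rank pages, so these spammers raise the target page's rank by increasing the number and quality of its incoming links. They also buy abandoned domains to obtain domains that already carry some reputation.
3. Cloaking and Redirection
For the same page, spammers show different content to different clients. To the search engine crawler they can serve content that helps the ranking, while to ordinary users they can serve advertising.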
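
To make the content spam point concrete, here is a minimal sketch of how keyword stuffing inflates a TF-IDF score. The scoring function, the smoothed IDF formula, and the toy documents are all assumptions made for illustration; they are not taken from the paper and real search engines use far more signals.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, corpus):
    """Sum of TF-IDF weights of the query terms in one document (toy scoring)."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = counts[term] / len(doc_tokens)            # relative term frequency
        df = sum(1 for d in corpus if term in d)       # documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (an assumption)
        score += tf * idf
    return score

# Hypothetical pages: an honest page and the same page with stuffed keywords.
honest = "we review cheap flights and compare airline prices".split()
stuffed = honest + ["cheap", "flights"] * 10
others = ["news about local weather".split(), "recipes for quick dinners".split()]

query = ["cheap", "flights"]
honest_score = tf_idf_score(query, honest, [honest] + others)
stuffed_score = tf_idf_score(query, stuffed, [stuffed] + others)
print(honest_score, stuffed_score)
```

The stuffed page scores higher for the query only because the popular terms make up a larger share of its tokens, which is exactly the behavior a content spammer exploits.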

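Existing detection methods fall roughly into three categories
1. content-based methods
These methods mainly analyze word counts, language models, the structure of the HTML page, and a cloaking score.
2. link-based methods
These methods mainly analyze properties of the graph formed by links: label propagation, link pruning and reweighting, and graph regularization. (Readers who intend to do detection based on link structure are advised to read this part of the paper closely.)
3. data-based methods, e.g., user behavior, clicks, HTTP sessions.
These methods use a Markov model to analyze user behavior and similar data.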
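
As an illustration of the label propagation idea in the link-based category, the sketch below spreads trust from a hand-picked seed of good pages along outgoing links, in the spirit of seed-biased PageRank. The graph, the node names, the damping value, and the handling of dangling nodes are assumptions for this example, not the exact formulation of any algorithm in the survey.

```python
def propagate_trust(out_links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along out-links (simplified sketch)."""
    nodes = list(out_links)
    # Teleport mass goes only to the trusted seeds, not to every node.
    seed_mass = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    trust = dict(seed_mass)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * seed_mass[n] for n in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue  # dangling node: its trust is dropped in this sketch
            share = damping * trust[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        trust = nxt
    return trust

# Hypothetical web graph: a good cluster and a link farm with no inbound
# links from the good cluster.
graph = {
    "university": ["news", "blog"],
    "news": ["blog"],
    "blog": ["university"],
    "farm1": ["spam_target", "farm2"],
    "farm2": ["spam_target", "farm1"],
    "spam_target": ["farm1"],
}
trust = propagate_trust(graph, seeds={"university"})
print(sorted(trust.items(), key=lambda kv: -kv[1]))
```

Because no trusted page links into the farm, the farm pages end up with zero trust however densely they link to each other. A real link farm usually manages to get a few links from good pages, so in practice the trust values would be thresholded rather than compared with zero.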

Spam Filter Challenge

Adaptation of Adversaries [1]

  • The adversaries are motivated to transform the test data to reduce the learner’s effectiveness. 
  • Spam filter designers
    • Attempt to learn good filters by training their algorithms on Spam (and legitimate) email messages received in the recent past. 
  • Spammers
    • Are motivated to reverse-engineer existing Spam filters and use this knowledge to generate messages which are different enough from the (inferred) training data to circumvent the filters. 

Solutions

  • Increase the robustness of the learning algorithm to generic training/test data differences via standard methods such as regularization or minimization of worst-case loss [1]
    • However, these techniques do not account for the adversarial nature of the training/test set discrepancies and may be overly conservative.
  • Predictive analytics to anticipate and counter the adversaries [1]
    • For example, predictions can be made using extrapolation or game-theoretic considerations, and can be employed to transform training instances so that they become similar to (future) test data and therefore provide a more appropriate basis for learning.
  • Time-varying posture to increase uncertainty [1]
    • Pros
      • This approach is flexible, scalable, easy to implement, and hard to reverse-engineer.
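
One possible reading of a time-varying posture is a filter that picks, per message, one of several classifiers that each look at a different feature subset, so an attacker cannot tune messages against a single fixed decision rule. The sketch below is only an illustration of that idea and is not the construction from [1]; the filter names, features, thresholds, and the example domain are all hypothetical.

```python
import random

# Hypothetical filters, each relying on a different feature of the message.
FILTERS = {
    "subject_words": lambda msg: "free" in msg["subject"].lower(),
    "link_count": lambda msg: msg["num_links"] > 3,
    "sender_domain": lambda msg: msg["sender"].endswith(".example-spam.test"),
}

class RotatingFilter:
    """Chooses one filter at random for every message it classifies."""

    def __init__(self, filters, seed=None):
        self._filters = dict(filters)
        self._rng = random.Random(seed)  # seed only to make demos repeatable

    def classify(self, msg):
        name = self._rng.choice(sorted(self._filters))
        return name, self._filters[name](msg)

msg = {"subject": "FREE prize", "num_links": 5, "sender": "a@mail.example-spam.test"}
rotating = RotatingFilter(FILTERS, seed=0)
decisions = [rotating.classify(msg) for _ in range(20)]
used = {name for name, _ in decisions}
print(used)
```

A spammer who evades only one of the filters, say by removing the word "free" from the subject, is still caught whenever another filter is selected, which is the source of the added uncertainty.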

Reference

[1] Moving Target Defense for Adaptive Adversaries, by Richard Colbaugh and Kristin Glass, in ISI 2013.