Anomaly Detection

1.  Data Labels

  • Supervised Anomaly Detection
    • labels available for both normal data and anomalies
    • similar to rare class mining
  • Semi-supervised Anomaly Detection
    • labels available only for normal data
  • Unsupervised Anomaly Detection
    • No labels assumed
    • Based on the assumption that anomalies are very rare compared to normal data
2. Types of Anomalies

  • Point Anomalies
    • An individual data instance is anomalous w.r.t. the data
  • Contextual Anomalies
    • An individual data instance is anomalous within a context 
    • requires a notion of context
    • also referred to as conditional anomalies

  • Collective Anomalies
    • A collection of related data instances is anomalous
    • Requires a relationship among data instances
      • sequential data
      • spatial data
      • graph data
    • The individual instances within a collective anomaly are not anomalous by themselves

3. Evaluation of Anomaly Detection

  • Accuracy is not a sufficient metric for evaluation: because anomalies are rare, a detector that labels everything as normal still achieves high accuracy
  • Recall $R = \frac{TP}{TP+FN}$
  • Precision $P = \frac{TP}{TP+FP}$
  • F-measure $F = \frac{2 \cdot P \cdot R}{P+R}$
  • Recall (detection rate)
    • ratio between the number of correctly detected anomalies and the total number of anomalies
  • False alarm (false positive)
    • ratio between the number of data records from normal class that are misclassified as anomalies and the total number of data records from normal class.
  • ROC curve: plots the tradeoff between detection rate and false alarm rate.
  • Area under the ROC curve (AUC) is computed using the trapezoid rule (see the sketch below).
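
A minimal sketch of these metrics, assuming numpy and made-up labels, scores, and a 0.5 decision threshold: recall, precision, F-measure, and false alarm rate are computed from confusion counts, and the AUC is obtained by sweeping thresholds over the scores and integrating the ROC curve with the trapezoid rule (np.trapz).

```python
import numpy as np

# Toy ground truth (1 = anomaly, 0 = normal) and anomaly scores from some detector.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.2, 0.45, 0.05, 0.6, 0.25, 0.9, 0.35, 0.4])

def confusion_counts(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp, fp, fn, tn

# Point metrics at a fixed (illustrative) threshold of 0.5.
tp, fp, fn, tn = confusion_counts(y_true, (scores >= 0.5).astype(int))
recall      = tp / (tp + fn)          # detection rate
precision   = tp / (tp + fp)
f_measure   = 2 * precision * recall / (precision + recall)
false_alarm = fp / (fp + tn)          # false positive rate

# ROC curve: sweep thresholds, collect (FPR, TPR) pairs, integrate with the trapezoid rule.
tpr, fpr = [], []
for t in np.sort(np.unique(scores))[::-1]:
    tp, fp, fn, tn = confusion_counts(y_true, (scores >= t).astype(int))
    tpr.append(tp / (tp + fn))
    fpr.append(fp / (fp + tn))
auc = np.trapz([0.0] + tpr + [1.0], [0.0] + fpr + [1.0])

print(f"recall={recall:.2f} precision={precision:.2f} F={f_measure:.2f} "
      f"false_alarm={false_alarm:.3f} AUC={auc:.3f}")
```
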
4. Taxonomy


5. Classification Based Techniques

  • Main idea: build a classification model for normal (and, where labeled, anomalous) events from labeled training data, and use it to classify each new unseen event (a sketch follows this list).
  • Classification models must be able to handle skewed (imbalanced) class distribution
  • Categories
    • supervised learning techniques
      • require knowledge of both normal and anomaly class
      • build classifier to distinguish between normal and known anomalies
    • semi-supervised classification techniques
      • require knowledge of normal class only
      • use modified classification model to learn the normal behavior and then detect any deviations from normal behavior as anomalous
  • Advantage
    • supervised learning techniques
      • models that can be easily understood
      • high accuracy in detecting many kinds of known anomalies
    • semi-supervised classification techniques
      • models that can be easily understood
      • normal behavior can be accurately learned
  • Drawbacks
    • supervised learning techniques
      • require labels from both the normal and anomaly classes
      • cannot detect unknown and emerging anomalies
    • semi-supervised classification techniques
      • require labels from the normal class
      • possible high false alarm rate
        • previously unseen normal data records may be recognized as anomalies
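
The notes do not prescribe a particular classifier; as one illustration of the semi-supervised case, the sketch below trains scikit-learn's OneClassSVM on normal data only and flags deviations in unseen instances. The synthetic data and the hyperparameters (gamma, nu) are assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Semi-supervised setting: training data contains only (assumed) normal instances.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Learn a model of normal behavior.
clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X_normal)

# New, unseen instances: one close to the normal region, two far away.
X_test = np.array([[0.2, -0.1],    # near the training cloud -> likely inlier
                   [4.5,  5.0],    # far from training data  -> likely flagged
                   [-3.8, 4.2]])

# predict() returns +1 for inliers (normal) and -1 for outliers (anomalies).
print(clf.predict(X_test))
```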

6. Nearest Neighbor Based

  • Key assumption
    • normal points have close neighbors while anomalies are located far from other points
  • Two-step approach (sketched after this list)
    • compute the neighborhood for each data record
    • analyze the neighborhood to determine whether the data record is an anomaly or not
  • Categories
    • distance-based methods
      • anomalies are data points most distant from other points
    • density-based methods
      • anomalies are data points in low density regions
  • Advantage
    • can be used in an unsupervised or semi-supervised setting (makes no assumptions about the data distribution)
  • Drawbacks
    • if normal points do not have a sufficient number of neighbors, the techniques may fail
    • computationally expensive
    • in high dimensional spaces, data is sparse and the concept of similarity may not be meaningful anymore
      • due to the sparseness, distances between any two data records may become quite similar
        • each data record may be considered as potential outlier
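
A minimal distance-based sketch, assuming scikit-learn's NearestNeighbors and synthetic 2-D data: the anomaly score of each record is its distance to the k-th nearest neighbor, and the most distant records are flagged. The value of k and the percentile threshold are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),   # dense "normal" cloud
               [[6.0, 6.0], [-5.0, 7.0]]])            # two isolated points

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)       # +1 because each point is its own neighbor
dists, _ = nn.kneighbors(X)
score = dists[:, -1]                                   # distance to the k-th nearest neighbor

# Flag the records with the largest k-NN distance as anomalies.
threshold = np.quantile(score, 0.99)
print(np.where(score > threshold)[0])                  # should include indices 300 and 301
```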

7. Clustering-based Techniques

  • Key assumption
    • normal data records belong to large and dense clusters, while anomalies do not belong to any of the clusters or form very small clusters
  • Categorization according to labels
    • semi-supervised
      • cluster the normal data to create models of normal behavior; if a new instance does not belong to any cluster, or is not close to any cluster, it is declared an anomaly
    • unsupervised
      • post-processing is needed after the clustering step: the size of each cluster and the distance of a point from its nearest cluster determine whether the point is an anomaly (a sketch follows this list)
  • Advantage
    • No supervision (labels) required
    • Easily adaptable to an online/incremental mode, which suits anomaly detection in temporal data
  • Drawbacks
    • computationally expensive
      • using indexing structures may alleviate this problem
    • if normal points do not create any clusters the techniques may fail
    • in high dimensional spaces, data is sparse and distances between any two data records may become quite similar
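
One possible unsupervised sketch, using k-means (scikit-learn's KMeans) on synthetic data as an assumed clustering step: each point's distance to its nearest cluster centroid serves as the anomaly score. The number of clusters and the percentile threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Normal data forms two dense clusters; a couple of stray points fit neither.
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(200, 2)),
               rng.normal([5.0, 5.0], 0.5, size=(200, 2)),
               [[2.5, 8.0], [-4.0, 6.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Anomaly score: distance to the nearest cluster centroid.
dist_to_centroid = np.min(km.transform(X), axis=1)
threshold = np.quantile(dist_to_centroid, 0.99)
print(np.where(dist_to_centroid > threshold)[0])       # should include indices 400 and 401
```
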
8. Statistics Based Techniques
  • Data points are modeled using a stochastic (probability) distribution
    • points are determined to be outliers depending on their relationship to this model
  • Advantage
    • utilize existing statistical modeling techniques to model various types of distributions
  • Drawback
    • with high dimensions, it is difficult to estimate distributions
    • parametric assumptions often do not hold for real data sets
  • Parametric Techniques
    • assume that the normal (and possibly anomalous) data is generated from underlying parametric distribution
    • learn the parameters from the normal sample
    • determine the likelihood that a test instance was generated from this distribution to detect anomalies (see the sketch after this list)
  • Non-parametric Techniques
    • do not assume any knowledge of parameters
    • use non-parametric techniques to learn a distribution 
      • e.g., Parzen-window estimation
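
A parametric sketch under an assumed multivariate Gaussian model (scipy's multivariate_normal): the mean and covariance are estimated from the normal sample, and test instances whose likelihood under the fitted distribution is unusually low are flagged. The synthetic data and the 1st-percentile cutoff are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
X_normal = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(1000, 2))

# Parametric assumption: normal data ~ N(mu, Sigma); estimate the parameters from the sample.
mu = X_normal.mean(axis=0)
sigma = np.cov(X_normal, rowvar=False)
model = multivariate_normal(mean=mu, cov=sigma)

# Likelihood of unseen instances under the fitted distribution.
X_test = np.array([[0.1, -0.2], [5.0, 5.0]])
log_p = model.logpdf(X_test)

# Flag instances whose log-likelihood falls below the 1st percentile of training likelihoods.
threshold = np.percentile(model.logpdf(X_normal), 1)
print(log_p < threshold)                 # expected: [False  True] -> second point anomalous
```
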
9. Information Theory Based Techniques
  • Compute the information content of the data using information-theoretic measures
    • e.g., entropy, relative entropy, etc.
  • Key idea: outliers significantly alter the information content in a dataset
  • Approach: detect data instances that significantly alter the information content
    • require an information theoretic measure
  • Advantage
    • operate in an unsupervised mode
  • Drawback
    • require an information-theoretic measure sensitive enough to detect the irregularity induced by a few outliers
  • Example
    • Kolmogorov complexity
      • detect the smallest data subset whose removal leads to the maximum reduction in Kolmogorov complexity
    • Entropy-based approaches
      • find a k-sized subset whose removal leads to the maximal decrease in entropy (a greedy sketch follows this list)
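
A simplified greedy sketch of the entropy-based idea, with two assumptions not in the notes: the records are categorical, and the dataset entropy is measured as the sum of per-attribute Shannon entropies. The record whose removal most reduces this entropy is reported as the outlier.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of categorical values."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def dataset_entropy(records):
    """Total entropy = sum of per-attribute entropies (attributes treated independently)."""
    return sum(entropy([r[j] for r in records]) for j in range(len(records[0])))

# Mostly homogeneous categorical records plus one unusual record.
data = [("a", "x")] * 10 + [("a", "y")] * 5 + [("b", "z")]

# Greedy search for a k-sized subset: repeatedly remove the record whose
# removal decreases the dataset entropy the most.
k, removed = 1, []
for _ in range(k):
    base = dataset_entropy(data)
    gains = [base - dataset_entropy(data[:i] + data[i + 1:]) for i in range(len(data))]
    removed.append(data.pop(int(np.argmax(gains))))

print(removed)                           # expected: [('b', 'z')] -- the rare record
```
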
10. Spectral Techniques
  • Analysis based on the eigendecomposition of the data
  • key idea
    • find combination of attributes that capture bulk of variability
    • reduced set of attributes can explain normal data well, but not necessarily the outliers
  • Advantage
    • can operate in an unsupervised mode
  • Drawback
    • based on the assumption that anomalies and normal instances are distinguishable in the reduced space
  • Several methods use Principal Component Analysis
    • top few principal components capture variability in normal data
    • the smallest principal component should have (near-)constant values for normal data
    • outliers show variability in the smallest component (see the sketch below)
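
A PCA-based sketch (scikit-learn's PCA on synthetic, strongly correlated 2-D data): only the top principal component is kept, and points with a large reconstruction error, i.e. variability left in the smallest component, are flagged. The data and the percentile threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Normal data varies mostly along one direction; the smallest component is nearly constant.
t = rng.normal(size=500)
X_normal = np.column_stack([t, 0.5 * t + rng.normal(scale=0.05, size=500)])
X = np.vstack([X_normal, [[0.0, 3.0]]])          # one point violating the correlation structure

# Keep the top principal component and measure the reconstruction error.
pca = PCA(n_components=1).fit(X_normal)
X_proj = pca.inverse_transform(pca.transform(X))
recon_error = np.linalg.norm(X - X_proj, axis=1)

threshold = np.quantile(recon_error, 0.99)
print(np.where(recon_error > threshold)[0])      # should include the last index (500)
```
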
11. Collective Anomaly Detection
  • Detect collective anomalies
  • Exploit the relationship among data instances
  • Sequential anomaly detection
    • detect anomalous sequences (a simple sketch follows this list)
  • Spatial anomaly detection
    • detect anomalous sub-regions within a spatial data set
  • Graph anomaly detection
    • detect anomalous sub-graphs in graph data
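
A toy illustration of a sequential collective anomaly, using an assumed sliding-window (n-gram) frequency model that is not part of the notes: the individual events 'a' and 'b' are both common, but the windows "bb" and "aa" occur only once, so the anomalous part is the ordering, not any single event.

```python
from collections import Counter

# Event sequence with a repeated normal pattern and one unusual subsequence ("...bbaa...").
seq = "ababababbaabababab"
n = 2

windows = [seq[i:i + n] for i in range(len(seq) - n + 1)]
freq = Counter(windows)

for i, w in enumerate(windows):
    if freq[w] == 1:                     # windows seen only once are candidate collective anomalies
        print(f"rare window {w!r} at position {i}")
```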
