# Time Series Models (ARIMA)

Reply

Reply

**1. Data Labels**

**Supervised Anomaly Detection**- labels available for both normal data and anomalies
- similar to rare class mining
**Semi-supervised Anomaly Detection**- labels available only for normal data
**Unsupervised Anomaly Detection**- No labels assumed
- Based on the assumption that anomalies are very rare compared to normal data

P

**Point Anomalies**- An individual data instance is anomalous w.r.t. the data

**Contextual Anomalies**- An individual data instance is anomalous within a context
- requires a notion of context
- also referred to as conditional anomalies

**Collective Anomalies**- A collection of related data instances is anomalous
- Requires a relationship among data instances
- sequential data
- spatial data
- graph data
- The individual instances within a collective anomaly are not anomalous by themselves

- Accuracy is not sufficient metric for evaluation

- Recall:
- Precision
- F-measure
- Recall (detection rate)
- ratio between the number of correctly detected anomalies and the total number of anomalies
- False alarm (false positive)
- ratio between the number of data records from normal class that are misclassified as anomalies and the total number of data records from normal class.
- ROC curve: a tradeoff between detection rate and false alarm rate.
- Area under the ROC curve (AUC) is computed using a tapeziod rule.

**5. Classification Based Techniques**

**Main idea:**build a classification model for normal (and anomalous (rare)) events based on labeled training data, and use it to classify each new unseen event.**Classification models**must be able to handle skewed (imbalanced) class distribution- Categories
- supervised learning techniques
- require knowledge of both normal and anomaly class
- build classifier to distinguish between normal and known anomalies
- semi-supervised classification techniques
- require knowledge of normal class only
- use modified classification model to learn the normal behavior and then detect any deviations from normal behavior as anomalous
- Advantage
- supervised learning techniques
- models that can be easily understood
- high accuracy in detecting many kinds of known anomalies
- semi-supervised classification techniques
- models that can be easily understood
- normal behavior can be accurately learned
- Drawbacks
- supervised learning techniques
- require both labels from both normal and anomaly class
- cannot detect unknown and emerging anomalies
- semi-supervised classification techniques
- require labels from normal class
- possible high false alarm rate
- previous unseen data records may be recognized as anomalies

**6. Nearest Neighbor Based**

- Key assumption
- normal points have close neighbors while anomalies are located far from other points
- Two approach
- compute neighborhood for each data record
- analyze the neighbor to determine whether data record is anomaly or not
- Categories
- distance-based methods
- anomalies are data points most distant from other points
- density-based methods
- anomalies are data points in low density regions
- Advantage
- can be used in unsupervised or semi-supervised setting (do not make any assumptions about data distribution)
- Drawbacks
- if normal points do not have sufficient number of neighbors the techniques may fail
- computationally expensive
- in high dimensional spaces, data is sparse and the concept of similarity may not be meaningful anymore
- due to the sparseness, distances between any two data records may become quite similar
- each data record may be considered as potential outlier

- Key assumption
- normal data records belong to large and dense clusters, while anomalies do not belong to any of the clusters or form very small clusters
- Categorization according to labels
- semi-supervised
- cluster normal data to create models for normal behavior. If a new instance do not belong to any of the clusters or it is not close to any cluster, is anomaly.
- unsupervised
- post processing is needed after a clustering step to determine the size of the clusters and the distance from the cluster is required for the point to be anomaly
- Advantage
- No need to be supervised
- Easily adaptable to online/incremental mode suitable for anomaly detection from temporal data
- Drawbacks
- computationally expensive
- using indexing structures may alleviate this problem
- if normal points do not create any clusters the techniques may fail
- in high dimensional spaces, data is sparse and distances between any two data records may become quite similar

- Data points are modeled using stochastic distribution
- points are determined to be outliers depending on their relationship with their model
- Advantage
- utilize existing statistical modeling techniques to model various type of distributions
- Drawback
- with high dimensions, difficult to estimate distributions
- parametric assumptions often do no hold for real data sets
- Parametric Techniques
- assume that the normal (and possibly anomalous) data is generated from underlying parametric distribution
- learn the parameters from the normal sample
- determine the likelihood of a test instance to be generated from this distribution to detect anomalies
- Non-parametric Techniques
- do not assume any knowledge of parameters
- use non-parametric techniques to learn a distribution
- e.g., parzen window estimation

- Computer information content in data using information theoretic measures
- e.g., entropy, relative entropy, etc.
- Key idea: outliers significantly alter the information content in a dataset
- Approach: detect data instances that significantly alter the information content
- require an information theoretic measure
- Advantage
- operate in an unsupervised mode
- Drawback
- require an information theoretic measure sensitive enough to detect irregularity induced by few outliers
- Example
- Kolmogorov complexity
- detect smallest data subset whose removal leads to maximum reduction in kolmogorov complexity
- Entropy-based approaches
- find a k-sized subset whose removal leads to the maximal decrease in entropy

- Analysis based on eigen decomposition of data
- key idea
- find combination of attributes that capture bulk of variability
- reduced set of attributes can explain normal data well, but not necessarily the outliers
- Advantage
- can operate in an unsupervised mode
- Drawback
- based on the assumption that anomalies and normal instances are distinguishable in the reduced space
- Several methods use Principal Component Analysis
- top few principal components capture variability in normal data
- smallest principal component should have constant values
- outliers have variability in the smallest component

- Detect collective anomalies
- Exploit the relationship among data instances
- Sequential anomaly detection
- detect anomalous sequence
- Spatial anomaly detection
- detect anomalous sub-regions within a spatial data set
- Graph anomaly detection
- detect anomalous sub-graphs in graph data

**1. Introduction**

**Motivation**- understanding differences between groups
**Task**- provide an efficient algorithm for mining contrast contrast sets and pruning rules to reduce complexity
- provide post processing techniques to present subsets that are surprising
- control the false positives
- be statistically sound
**Goal**- To find contrast-sets whose support differs meaningfully (statistically) across groups
- ,

- Add an attribute to the set (group type) and use Association Rule Mining to find the differences
- Problems
- this will not return group differences
- the results will be difficult to interpret

**1. Definition**

**Association**: Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrence of other items in the transaction.**Support count ():**frequency of occurrence of an itemset**Support:**fraction of transactions that contain an item set**Frequent Itemset:**An itemset whose support is greater or equal to a__minsup__threshold

**Association rule****support (s)**- fraction of transactions that contains both X and Y
**confidence**- measures how often items in Y appear in transactions that contains X

- Given a set of transactions T, the goal of association rule mining is to find all rules having
- support >= minsup threshold
- confidence >= minconf threshold
- Brute-force approach
- list all possible association rules
- compute the support and confidence for each rule
- prune rules that fail the minsup and minconf thresholds
- this is computationally prohibitive!

- Step 1: frequent itemset generation
- generate all itemset whose support >= minsup
- Step 2: rule generation
- generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Drawback: frequent itemset generation is still computationally expensive

- Given d unique items
- total number of item set is:
- total number of possible association rules is:
- =

- Apriori principal
- If an itemset is frequent, then all of its subsets must also be frequent
- i.e.,
- Apriori algorithm
- Let k = 1
- generate frequent itemset of length 1
- repeat until no new frequent itemsets are identified
- generate length (k+1) candidate itemsets from length k frequent itemsets
- prune candidate itemsets containing subsets of length k that are infrequent
- count the support of each candidate by scanning the db
- eliminate candidates that are infrequent, leaving only those that are frequent

- An itemset is closed if none of its immediate supersets has the same support as the itemset.

- How to set the appropriate minsup threshold?
- if minsup is set too high, we could miss items involving interesting rate items (e.g., expensive products)
- if minsup is set too low, it is computatinally expensive and the number of itemset is very large.
- Using a single minimum support threshold may not be effective.

- Numerical measures that are applied to judge various aspects of cluster validity, are classified into the following three types
- External Index: Used to measure the extent to which cluster labels match externally supplied class labels
- entropy
- Internal Index: Used to measure the goodness of a clustering structure without respect to the external information
- sum of squared error (SSE)
- Relative Index: used to compare two different clustrings of clusters
- often an external or internal index is used for this function, e.g., SSE or entropy

**Two matrix**- Proximity Matrix
- Incidence Matrix
- one row and one column for each data point
- an entry is 1 if the associated pair of points belong to the same cluster, else 0
- Compute the correlation between the two matrices
- since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated
- High correlation indicates that points that belong the same cluster are closed to each other
- Not a good measure for some density

**1. The Canopies Approach**

- Two distance metrics
- cheap & expensive
- First pass
- very inexpensive distance metric
- create overlapping canopies
- Second pass
- expensive, accurate distance metric
- canopies determine which distances calculated

**2. Using Canopies**

- Calculate expensive distances between points in the same canopy
- All other distances default to infinity
- Use finite distances and iteratively merge closest

- Small, disjoint canopies
- big time savings
- Large, overlapping canopies
- original accurate clustering
- Goal: fast and accurate
- For every cluster, there exists a canopy such that all points in the cluster are in the canopy

**1. DBScan**

- DBScan is a density-based algorithm
- density = number of points with a specified radius (Eps)
- a point is a core point if it has more than a specified number of point (MinPts) within Eps
- There are points that are at the interior of a cluster
- A boarder point has fewer then MinPts within Eps, but is the neighborhood of a core point
- A noise point is any point that is not a core point or a boarder point.

**Strongness**- Resistant to noise
- Can handle clusters of different shapes and sizes
**Weakness**- when dataset has varying densities
- high dimensional data

- The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
- Noise points have the kth nearest neighbor at farther distance

**1. Hierarchical Clustering**

- Produces a set of nested clusters
- organized as a hierarchical tree
- Can be visualized as a dendrogram
- a tree like diagram that records the split

- Do not have to assume any particular number of clusters
- any desired number of clusters can be obtained by “cutting” the dendogram at the proper level
- They may correspond to meaning taxonomies
- example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

**Agglomerative**- Start with the points as individual clusters
- At each step, merge the closest pair of clusters until only one cluster (or k clusters) left.

**Divisive**- Start with one, all-inclusive cluster
- At each step, split a cluster until each cluster contains a point (or there are k clusters)

- Min
__strength__: can handle non-elliptical shapes__weakness__: sensitive to noise and outliers- Max
__strength__: less susceptible to noise and outliers__weakness__: tend to break large clusters; biased towards globular clusters- Group Average
- proximity of two clusters is the average of pairwise similarity between points in the two clsuters

__Strongness__: less susceptible to noise and outliers__Weakness__: biased towards globular clusters- Distance between centroids
- Other methods driven by an objective function
- Ward’s method users squared error
- Similarity of two clusters is based on the increase in squared error when two clusters are merged
- Similar to gorup average if distance between points is distance squared.
__Strongness__: less susceptible to noise and outliers__Weakness__:- Biased towards globular clusters
- Hierarchical analogue of K-means

- Once a decision is made to combine two clusters, it cannot be undone
- No objective function is directly minimized
- Different schemes have problems with one or more of the following
- sensitivity to noise and outliers
- difficulty handling different sized clusters and convex shapes
- breaking large clusters

**1. Partitional Clustering v.s. Hierarchical Clustering**

- Partitioning clustering: A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
- Hierarchical clustering: A set of nested clusters organized as a hierarchical tree.

**Well-Separated Clusters:**a cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than any point not in the cluster.**Center-based Clusters**: A cluster is a set of objects such that any point in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster.- The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster.
**Contiguous Clusters (Nearest Neighbor or Transitive):**A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

**Density-based:**A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density.- Used when cluster are irregular or intertwined.

**Shared Property or Conceptual Clusters**: Finds clusters that share some common property or represent a particular concept.

- Cluster Defined by an Objective Function:
- Finds clusters that minimize or maximize an objective function
- Enumerate all possible ways of dividing points into clusters and evaluate the “goodness” of each potential set of clusters by using the given objective function (NP hard).
- Can have global or local objectives
- hierarchical clustering algorithms typically have local objectives
- partitional algorithms typically have global objectives
- A variant of the global objective function approach is to fit the data to a parameterized model
- parameters for the model are determined from the data
- mixture models assume that the data is a “mixture” of a number of statistical distributions.
- Map the clustering problem to a different domain and solve a related problem in that domain
- proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points.
- Clustering is equivalent to breaking the graph into connected components, one for each cluster.

**3. K-means Clustering**

**Code**

- Initial centroids are often chosen randomly.
- Clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster.
- “Closeness” is measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for common similarity measures mentioned above.
- Most of the converge happens in the few iterations.
- often the stopping condition is changed to “until relatively few points change clusters”
- Complexity is O(n*k*i*d)
- n: number of points
- k: number of clusters
- i: number of iterations
- d: number of attributes

- Most common measure is Sum of Squared Error (SSE)
- for each point, the error in the distance to the nearest cluster
- to get SSE, we square these errors and sum them
- SSE = dist^2
- where x is the data point in cluster and is the representative point for cluster

- If there are K “real” clusters then the chance of selecting one centroid from each cluster is small
- Chances is relatively small when K is large.
- If clusters are the same size, n, then
- P =
- Solutions to Initial Centroids Problem
- Multiple runs: helps, but probability is not on your side
- Sample and user hierarchical clustering to determine initial centroids
- Select more than k initial centroids and then select among these initial centroids
- select most widely separated
- Postprocessing
- Bisecting K-means
- Not as susceptible to initialization issues

- Basic K-means algorithm can yield empty clusters
- Several strategies
- Choose the point that contributes most to SSE
- Choose a point from the cluster with the highest SSE
- If there are several empty clusters, the above can be repeated several times

- In the basic K-means algorithm, centroids are updated after all points are assigned to a centriod
- An alternative is to update the centriods after each assignment (incremental approach)
- each assignment updates zero or two centroids
- more expensive
- introduces an order dependency
- never get an empty cluster
- can use “weights” to change the impact

**Pre-processing**- normalize the data
- eliminate outliers
**Post-processing**- Eliminate small clusters that may represent outliers
- Split “loose” clusters, i.e., clusters with relatively high SSE
- Merge clusters that are “close” and that have relatively low SSE
- Can use these steps during the clustering process
- ISODATA

- Bisecting K-means algorithm
- variant of K-means that can produce a partitional or a hierarchical clustering

- K-means has problems when clusters are of differing
- sizes
- densities
- non-globular shapes
- K-means has problems when the data contains outliers.
- One solution is to use many clusters
- find parts of clusters, but need to put together