Large Clustering Problems

1. The Canopies Approach

  • Two distance metrics
    • cheap & expensive
  • First pass
    • very inexpensive distance metric
    • create overlapping canopies
  • Second pass
    • expensive, accurate distance metric
    • canopies determine which distances are calculated
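
The first pass can be sketched as follows. This is a minimal illustration, not a definitive implementation: the outline above does not say how canopies are formed, so the loose and tight thresholds `t1` and `t2` (with `t1 >= t2`) are assumptions taken from the usual description of canopy clustering, and the function and parameter names are hypothetical.

```python
def build_canopies(points, cheap_distance, t1, t2):
    """Return a list of canopies, each a set of point indices.

    t1 (loose) and t2 (tight) are assumed thresholds, not from the outline.
    """
    if t2 > t1:
        raise ValueError("expected t1 >= t2")
    remaining = list(range(len(points)))  # points that may still become centers
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within t1 of the center joins this canopy,
        # including points that already belong to other canopies.
        canopy = {
            i for i in range(len(points))
            if cheap_distance(points[center], points[i]) <= t1
        }
        canopies.append(canopy)
        # Points within t2 of the center can no longer start a canopy.
        remaining = [
            i for i in remaining
            if cheap_distance(points[center], points[i]) > t2
        ]
    return canopies


# Toy 1-D data; absolute difference stands in for the cheap metric.
points = [0.0, 1.0, 2.0, 3.0, 9.0, 10.0]
canopies = build_canopies(points, lambda a, b: abs(a - b), t1=2.5, t2=1.5)
```

Because `t1` is larger than `t2`, a point can fall inside several canopies, which is what makes them overlap.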

2. Using Canopies

  • Calculate expensive distances between points in the same canopy
  • All other distances default to infinity
  • Using only the finite distances, iteratively merge the closest clusters
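
A rough sketch of the second pass, under stated assumptions: the outline does not name a linkage rule, so single-link merging is assumed here, and `canopy_distances`, `merge_closest`, and `max_distance` are hypothetical names. Pairs that share no canopy are simply absent from the distance table, which is equivalent to treating their distance as infinite.

```python
import math
from itertools import combinations


def canopy_distances(points, canopies, expensive_distance):
    """Expensive distance for each pair sharing a canopy; absent pairs count as infinity."""
    dist = {}
    for canopy in canopies:
        for i, j in combinations(sorted(canopy), 2):
            if (i, j) not in dist:  # compute each pair at most once
                dist[(i, j)] = expensive_distance(points[i], points[j])
    return dist


def merge_closest(n_points, dist, max_distance=math.inf):
    """Single-link agglomeration (an assumed linkage) over the finite distances only."""
    cluster_of = list(range(n_points))  # each point starts in its own cluster
    for (i, j), d in sorted(dist.items(), key=lambda kv: kv[1]):
        if d > max_distance:
            break
        a, b = cluster_of[i], cluster_of[j]
        if a != b:
            # Merge cluster b into cluster a.
            cluster_of = [a if c == b else c for c in cluster_of]
    return cluster_of


# Toy 1-D data with hand-written overlapping canopies (sets of point indices).
points = [0.0, 1.0, 2.0, 3.0, 9.0, 10.0]
canopies = [{0, 1, 2}, {0, 1, 2, 3}, {4, 5}]
dist = canopy_distances(points, canopies, lambda a, b: abs(a - b))
labels = merge_closest(len(points), dist)
```

In this toy run only 7 of the 15 possible pairs are measured with the expensive metric, and points in different canopies are never merged because no finite distance connects them.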

3. Preserve Good Clustering

  • Small, disjoint canopies
    • big time savings
  • Large, overlapping canopies
    • original accurate clustering
  • Goal: fast and accurate
    • For every cluster, there exists a canopy such that all points in the cluster are in the canopy
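
When a reference clustering is available, the condition in the last bullet can be checked directly. The sketch below assumes clusters and canopies are both given as sets of point indices; the function name is hypothetical.

```python
def canopies_preserve_clusters(clusters, canopies):
    """True if every cluster is fully contained in at least one canopy."""
    return all(
        any(cluster <= canopy for canopy in canopies)  # subset test
        for cluster in clusters
    )


clusters = [{0, 1, 2, 3}, {4, 5}]
overlapping = [{0, 1, 2}, {0, 1, 2, 3}, {4, 5}]  # large, overlapping canopies
disjoint = [{0, 1, 2}, {3}, {4, 5}]              # small, disjoint canopies

ok = canopies_preserve_clusters(clusters, overlapping)
broken = canopies_preserve_clusters(clusters, disjoint)
```

Here the overlapping canopies satisfy the condition, while the disjoint ones split the first cluster across two canopies, so the second pass could never merge it back together.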
