abstract |
Systems, methods and apparatuses are disclosed to cluster a plurality of documents located in any number of local and/or remote systems and applications. Preprocessed text is generated for each document, and a hash and a feature vector are determined based on the preprocessed text. A set of clusters is retrieved, wherein each cluster is associated with a hash list and a cumulative feature vector. Each of the documents may then be associated with a cluster by comparing the hash of the document to the hash lists of the clusters and/or by determining similarities between the feature vector of the document and the cumulative feature vectors of the clusters. |
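The two-stage association described above (hash lookup first, then feature-vector similarity) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the disclosure: the lowercase/alphanumeric preprocessing, SHA-256 as the hash, term counts as the feature vector, cosine similarity as the similarity measure, the 0.8 threshold, and the hypothetical names `preprocess`, `doc_hash`, `feature_vector`, `Cluster`, and `assign`.

```python
import hashlib
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def preprocess(text: str) -> str:
    # Assumed preprocessing: lowercase and keep only alphanumeric tokens.
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def doc_hash(pre: str) -> str:
    # Assumed hash: SHA-256 of the preprocessed text.
    return hashlib.sha256(pre.encode("utf-8")).hexdigest()


def feature_vector(pre: str) -> Counter:
    # Assumed features: raw term counts.
    return Counter(pre.split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class Cluster:
    hashes: set = field(default_factory=set)          # hash list
    cumulative: Counter = field(default_factory=Counter)  # cumulative feature vector


def assign(text: str, clusters: list, threshold: float = 0.8) -> int:
    """Return the index of the cluster the document is associated with."""
    pre = preprocess(text)
    h = doc_hash(pre)
    vec = feature_vector(pre)

    # Stage 1: a matching hash places the document in that cluster directly.
    for i, cluster in enumerate(clusters):
        if h in cluster.hashes:
            return i

    # Stage 2: otherwise pick the most similar cumulative feature vector.
    best_i, best_sim = -1, 0.0
    for i, cluster in enumerate(clusters):
        sim = cosine(vec, cluster.cumulative)
        if sim > best_sim:
            best_i, best_sim = i, sim

    # Assumption: a document below the threshold starts a new cluster.
    if best_sim < threshold:
        clusters.append(Cluster())
        best_i = len(clusters) - 1

    clusters[best_i].hashes.add(h)
    clusters[best_i].cumulative.update(vec)
    return best_i
```

In this sketch, two documents that differ only in case or punctuation share a hash after preprocessing and land in the same cluster without any similarity computation, while a near-duplicate with one changed word is caught by the feature-vector comparison. Whether an unmatched document starts a new cluster, and how the cumulative vector is updated, are design choices made here only for the example.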