r - Cluster assignment after estimation in a large dataset (Mclust)


I've been doing a clustering analysis on a relatively large dataset (~50,000 observations, 16 variables).

    library(mclust)
    load(file = "mdper.f.rdata")  # mdper.f = stored data

As my computer was unable to handle all of it at once, I took a few subsets of the data (10 x 5,000 rows; 16,000 rows in the example below, about 15 min of computing) and used Mclust to determine the optimal number of groups.

    ind  <- sample(1:nrow(mdper.f), size = 16000)  # special sample of 16,000 rows, 15 min computing
    nfac <- mdper.f[ind, ]                         # sampling
    fnac <- scale(nfac)                            # scale data
    mod  <- Mclust(fnac)                           # determining optimal number of clusters
    summary(mod)                                   # summary

    # results:
    ----------------------------------------------------
    Gaussian finite mixture model fitted by EM algorithm
    ----------------------------------------------------

    Mclust VII (spherical, varying volume) model with 9 components:

     log.likelihood     n df    BIC      ICL
           128118.2 16000 80 255462 254905.3

    Clustering table:
       1    2    3    4    5    6    7    8    9
    1879 2505 3452 3117 2846  464  822  590  325
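The repeated-subsample check can be sketched roughly as follows. This is only an illustration under assumptions: the built-in `iris` data stands in for `mdper.f`, and the seed, the subsample size and the names `X`, `n_rep` and `G_hat` are made up for the example.

```r
library(mclust)

set.seed(42)                         # assumption: seed used only for reproducibility
X <- scale(iris[, 1:4])              # stand-in for the real (scaled) data

n_rep <- 10                          # number of random subsamples
G_hat <- integer(n_rep)              # selected number of components per subsample

for (i in seq_len(n_rep)) {
  idx      <- sample(seq_len(nrow(X)), size = 75)  # draw a random subset of rows
  fit      <- Mclust(X[idx, ], verbose = FALSE)    # model and G chosen by BIC
  G_hat[i] <- fit$G
}

table(G_hat)                         # how often each number of components was selected
```

If the same number of components comes up in most or all subsamples, that gives some reassurance that the choice is not an artifact of one particular sample.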

The result was 9 clusters (in 10 out of 10 of the 5,000-row datasets), so I guess it's okay. Now I want to assign the rest of the data to the estimated clusters, in order to work with the multidimensional parts of each cluster.

How can I do it?

I started to play with the Mclust object, but I can't see how to handle it and apply it to the rest of the data. The optimal solution would be, for example, the original data with an extra column holding the assigned cluster number (1 to 9).

I got the answer after a few minutes of work:

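First of all, there was a conceptual mistake: the dataset must be scaled before partitioning, so that the sample and the full data share the same scale. The assignment itself is then done using predict():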

    library(mclust)
    load(file = "mdper.f.rdata")                        # mdper.f = stored data
    mdper.f.s <- scale(mdper.f)                         # scale the full data first
    ind   <- sample(1:nrow(mdper.f.s), size = 16000)    # sample 16,000 rows
    nfac  <- mdper.f.s[ind, ]                           # sampled, already scaled data
    mod16 <- Mclust(nfac)                               # determining optimal number of clusters, 15 min computing, 7 vars
    prediction <- predict(mod16, mdper.f.s)             # predict with the fitted model on all scaled data
    mdper.f.pred <- cbind(mdper.f, prediction$classification)  # attach assignment to the original data
    colnames(mdper.f.pred)[ncol(mdper.f.pred)] <- "cluster"    # name the new (last) column; column 8 with 7 vars
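Since `mdper.f.rdata` is not available to readers, here is a small reproducible sketch of the same fit-on-a-sample, predict-on-everything workflow. It assumes the built-in `iris` data as a stand-in, and the seed, the sample size and the names `X`, `fit`, `pred` and `out` are chosen only for illustration.

```r
library(mclust)

set.seed(1)                                   # assumption: seed used only for reproducibility
X   <- scale(iris[, 1:4])                     # scale the full data once, before sampling
ind <- sample(seq_len(nrow(X)), size = 100)   # rows used to fit the model
fit <- Mclust(X[ind, ], verbose = FALSE)      # fit the mixture model on the sample only

pred <- predict(fit, newdata = X)             # classify every row with the fitted model
out  <- cbind(iris, cluster = pred$classification)  # original data plus cluster column

table(out$cluster)                            # cluster sizes over the full data
head(pred$z)                                  # posterior membership probabilities per cluster
```

Besides `classification`, the object returned by `predict()` also contains `z`, the matrix of membership probabilities, which can help flag rows that sit between clusters. If further rows arrive later, they would need to be scaled with the same values, which `scale()` stores in `attr(X, "scaled:center")` and `attr(X, "scaled:scale")`.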
