I've been doing a clustering analysis on a relatively large dataset (~50,000 observations, 16 variables).
library(mclust)
load(file="mdper.f.rdata")  # mdper.f = stored data
As my computer is unable to handle it all at once, I took a few subsets of the data (10 subsets of 5,000 observations; 16,000 in the example below, which takes about 15 minutes of computing), using Mclust to determine the optimal number of groups.
ind <- sample(1:nrow(mdper.f), size=16000)  # sample 16,000 row indices (about 15 min of computing)
nfac <- mdper.f[ind,]  # take the sampled rows
fnac <- scale(nfac)  # scale the data
mod <- Mclust(fnac)  # determine the optimal number of clusters
summary(mod)  # summary
# results:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VII (spherical, varying volume) model with 9 components:
 log.likelihood     n df    BIC      ICL
       128118.2 16000 80 255462 254905.3
Clustering table:
   1    2    3    4    5    6    7    8    9
1879 2505 3452 3117 2846  464  822  590  325
The result is 9 clusters (in 10 out of 10 of the 5,000-observation subsets), so I guess it's okay. Now I want to assign the rest of the data to the estimated cluster divisions, in order to examine the multidimensional parts of each cluster.
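The repeated-subset check described above could be scripted roughly as follows. This is only a sketch on simulated data: the `sim` matrix, the number of repeats `n_rep` and the subset size `n_sub` are made-up stand-ins, not the original data or settings.

```r
library(mclust)

set.seed(1)
# Simulated stand-in for the real data: three well-separated 2-D groups
sim <- rbind(
  matrix(rnorm(600, mean = 0), ncol = 2),
  matrix(rnorm(600, mean = 6), ncol = 2),
  matrix(rnorm(600, mean = 12), ncol = 2)
)
sim.s <- scale(sim)

n_rep <- 5    # number of random subsets (hypothetical value)
n_sub <- 300  # rows per subset (hypothetical value)

# Fit Mclust on each random subset and record the selected number of components
G_sel <- sapply(seq_len(n_rep), function(i) {
  ind <- sample(nrow(sim.s), size = n_sub)
  Mclust(sim.s[ind, ], verbose = FALSE)$G
})

# How often each number of components was selected
table(G_sel)
```

If the same number of components keeps coming up across subsets, that supports fitting on a subset rather than on the full data.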
How can I do it?
I started to play with the Mclust object, but I can't see how to handle it and apply it to the rest of the data. The optimal solution would be, for example, the original data with an extra column holding the assigned cluster number (1 to 9).
I've got the answer after a few minutes of working on it:
First of all, there was a conceptual mistake: the dataset must be scaled before partitioning, and the fitted model is then applied to the whole scaled dataset using predict().
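The reason the order matters is that scale() centres and rescales using the means and standard deviations of whatever rows it is given, so a model fitted on a separately scaled subset lives in slightly different coordinates from the full data. A small base-R sketch on made-up data (the `full` matrix is hypothetical) shows two ways to keep the coordinates consistent:

```r
set.seed(1)
full <- matrix(rnorm(2000, mean = 10, sd = 3), ncol = 4)  # hypothetical stand-in data
ind <- sample(nrow(full), size = 100)

# Option used in the answer: scale everything first, then take the subset
full.s <- scale(full)
sub.a <- full.s[ind, ]

# Alternative: scale the subset, then reuse its centre and spread for all rows
sub.b <- scale(full[ind, ])
full.b <- scale(full,
                center = attr(sub.b, "scaled:center"),
                scale  = attr(sub.b, "scaled:scale"))

# With the alternative, the subset rows of the full data match the scaled subset
all.equal(full.b[ind, ], sub.b, check.attributes = FALSE)
```

Either way, the data passed to predict() should be on the same scale as the data the model was fitted on.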
library(mclust)
load(file="mdper.f.rdata")  # mdper.f = stored data
mdper.f.s <- scale(mdper.f)  # scale the full dataset first
ind <- sample(1:nrow(mdper.f.s), size=16000)  # sample 16,000 row indices
nfac <- mdper.f.s[ind,]  # scaled subset used for fitting
mod16 <- Mclust(nfac)  # determine the optimal number of clusters (15 min of computing, 7 vars)
prediction <- predict(mod16, mdper.f.s)  # apply the fitted model to all of the scaled data
mdper.f.pred <- cbind(mdper.f, prediction$classification)  # append the assignment to the original data
colnames(mdper.f.pred)[8] <- "cluster"  # name the new column
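Besides the hard assignment in `classification`, the object returned by predict() also carries `z`, the matrix of posterior membership probabilities, which can be used to spot observations that sit between clusters. A minimal sketch on simulated data follows; `dat` and the sample size are hypothetical stand-ins for mdper.f, and the 0.9 cutoff is an arbitrary choice for illustration.

```r
library(mclust)

set.seed(42)
# Hypothetical stand-in for mdper.f: two groups in three variables
dat <- as.data.frame(rbind(
  matrix(rnorm(900, mean = 0), ncol = 3),
  matrix(rnorm(900, mean = 5), ncol = 3)
))
dat.s <- scale(dat)  # scale the full data before sampling

# Fit on a random subset, then classify every row
ind <- sample(nrow(dat.s), size = 200)
mod <- Mclust(dat.s[ind, ], verbose = FALSE)
pred <- predict(mod, newdata = dat.s)

# Hard cluster label for each row of the original data
dat$cluster <- pred$classification

# Largest posterior probability per row; low values mean an ambiguous assignment
dat$max_prob <- apply(pred$z, 1, max)

table(dat$cluster)
sum(dat$max_prob < 0.9)  # number of rows below the (arbitrary) 0.9 cutoff
```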