One type of data particularly well suited to multiDA is microarray data. In this example, we demonstrate the power of multiDA in predicting cancer type using the provided SRBCT data.

Small Round Blue Cell Tumors (SRBCT) data

The SRBCT dataset (Khan et al., 2001) concerns the classification of 4 classes of childhood tumours that share similar visual features during routine histology. These classes are Ewing’s family of tumours (EWS), neuroblastoma (NB), Burkitt’s lymphoma (BL), and rhabdomyosarcoma (RMS). The data were collected from 63 cDNA microarrays, with 1586 features remaining after filtering out genes with zero median absolute deviation. The data can be accessed by typing SRBCT. We assign our feature data to X, and our response data to y.
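
A minimal sketch of this setup, assuming SRBCT is shipped as a list with components X and y (consistent with the summary output below):

library(multiDA)
X <- SRBCT$X   # 63 x 1586 numeric feature matrix
y <- SRBCT$y   # factor of 4 tumour class labels
summary(SRBCT)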

##   Length Class  Mode   
## X 99918  -none- numeric
## y    63  factor numeric

Fitting the multiDA model to the SRBCT data

We will fit a multiDA model to describe the group structure for each feature.

res  <- multiDA(X=X, y=y, penalty="EBIC", equal.var=TRUE, set.options="exhaustive")
  • We use the default EBIC penalty as we have many more features than observations.
  • We use the option equal.var = TRUE to fit a model which assumes each Gaussian curve has the same variance.
  • The option set.options = "exhaustive" tells the multiDA algorithm to consider all 15 ways of grouping 4 classes (see What does multiDA actually do? here for a description of possible group structures). If the number of classes is much larger, say 5 or 6, the number of possible groupings grows rapidly (see Bell Number here), and so another set.options value may be chosen to restrict the number of groupings explored, as sketched below.
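
To illustrate how quickly the number of possible groupings grows, here is a quick sketch in plain R (independent of multiDA) computing Bell numbers via the standard recurrence B(n+1) = sum over k of choose(n, k) * B(k):

bell <- function(n) {
  B <- numeric(n + 1)
  B[1] <- 1                        # B(0) = 1
  for (m in seq_len(n)) {
    B[m + 1] <- sum(choose(m - 1, 0:(m - 1)) * B[1:m])
  }
  B[n + 1]
}
sapply(4:6, bell)                  # 15, 52, 203 groupings for 4, 5, 6 classes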

We can explore the results of the multiDA model using the print() function:
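
print(res)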

## Sample Size:
## [1] 63
## Number of Features:
## [1] 1586
## Classes:
## [1] 4
## Equal Variance Assumption:
## [1] TRUE
## Number of Significant Features:
## [1] 215
## Summary of Significant Features:
##    rank feature.ID gamma.hat partition
## 1     1      V1172 0.9997741         8
## 2     2      V1232 0.9997576         4
## 3     3      V1233 0.9997563         3
## 4     4      V1324 0.9997506         3
## 5     5       V706 0.9997441         5
## 6     6       V434 0.9997437         3
## 7     7       V527 0.9997424         3
## 8     8      V1189 0.9997368         3
## 9     9      V1166 0.9997347         7
## 10   10       V148 0.9997342         4
## Partition Matrix:
## [1] "exhaustive"
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
## [1,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1     1
## [2,]    1    1    1    2    2    2    1    2    2     1     2     2     2     2     2
## [3,]    1    2    1    1    2    2    2    1    3     2     1     3     2     3     3
## [4,]    1    1    2    1    2    1    2    2    1     3     3     2     3     3     4

The key information returned is that 215 of the 1586 features are deemed significant, with high gamma.hat values indicating a higher probability of membership to a particular partitioning. For example, for the most significant feature V1172, there is a 0.999 probability that it is described by partition 8, which, looking at the Partition Matrix output, consists of classes 1 and 3 described by one Gaussian curve, and classes 2 and 4 described by another (2 curves altogether).

A low top value of gamma.hat indicates high uncertainty in the model fit, even for the most significant features; in such a case, another machine learning algorithm should be considered.

Exploring the plot functionality

We can visualise the partitioning of the classes within each feature using the plot() method for multiDA. By default, the plot function plots the top 10 ranked features. If ranked = FALSE, the user can specify which features are to be plotted (specified by column names).

An example using specified features:
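
A sketch of both usages; the argument name used below to select named features (features) is an assumption for illustration, so check ?plot.multiDA for the exact interface:

plot(res)                                                 # default: top 10 ranked features
plot(res, ranked = FALSE, features = c("V1172", "V706"))  # assumed argument for named features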

Prediction using the multiDA algorithm

If we want to predict class labels, we can use the predict() function. In this case, we will find the resubstitution error rate for this dataset using the multiDA algorithm. We can either extract the probability of class membership for each point ($probabilities), or directly extract the predicted class membership (the class with the maximum probability) using $y.pred.
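
A sketch of this workflow (the newdata argument name is an assumption; check ?predict.multiDA):

pred <- predict(res, newdata = X)   # newdata assumed to name the observations to predict
head(pred$probabilities)            # probability of class membership (output below)
head(pred$y.pred)                   # predicted class labels
sum(pred$y.pred != y) / length(y)   # resubstitution error rate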

##               [,1] [,2]          [,3]         [,4]
## [1,] 1.908807e-135    1  1.336194e-99 2.596398e-68
## [2,] 1.214488e-123    1  3.550677e-80 8.116775e-55
## [3,] 1.530691e-162    1 3.396422e-123 8.374009e-63
## [4,] 1.601513e-108    1  3.320017e-67 3.077376e-35
## [5,] 1.860622e-157    1 7.604828e-114 4.035431e-64
## [6,] 1.469110e-171    1 5.841163e-129 1.691079e-80
## [1] 2 2 2 2 2 2
## Levels: 1 2 3 4
## [1] 0

Using the multiDA “tidy” functions

Tidy functions have been created for ease of use, namely glimpse_multiDA() and tidy_multiDA(), both of which take a trained multiDA object (in this case, res) as input.

multiDA “glimpse” function

This function returns a one-row data frame with quick summaries from the algorithm, in the spirit of the glance() function from the broom package.
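
glimpse_multiDA(res)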

##   Classes Number.of.Partitions Number.of.Significant.Features Top.Gamma
## 1       4                   15                            215 0.9997741

multiDA “tidy” function

This returns a tidy data frame with key results from the trained multiDA object, namely a data frame of significant features and their ranks, in the spirit of the tidy() function from the broom package.
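
head(tidy_multiDA(res))   # first few of the 215 significant features shown below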

##   rank feature.ID gamma.hat partition
## 1    1      V1172 0.9997741         8
## 2    2      V1232 0.9997576         4
## 3    3      V1233 0.9997563         3
## 4    4      V1324 0.9997506         3
## 5    5       V706 0.9997441         5
## 6    6       V434 0.9997437         3