Additional Usage of Statomatic

This vignette will demonstrate additional usage of the statomatic package beyond that shown in the basic workflow vignette. It is recommended to read the basic workflow vignette before reading this one.

Begin by loading the statomatic package

library(statomatic)

A general overview

In this example we’ll demonstate how to use statomatic without creating a statomatic data set.

All the functions used by statomatic were first developed to be compatible with data.frames. The statomatic data set was added later as an optional way to keep data and results organized all in one place. The data.frame compatibility remains, and we can use most of statomatic’s functions individually.

We’ll use the same normalized counts that were used in the multi-group multi-factor example of the basic workflow vignette.

x <- read.csv("normalized_counts.csv")
head(x)
#>                    X Sample_600 Sample_601 Sample_602 Sample_616 Sample_617
#> 1 ENSMUSG00000051285  169.21935  309.83231 417.709088  214.03944  202.50044
#> 2 ENSMUSG00000057363   47.53353   66.64997  54.805802   41.23696   49.30862
#> 3 ENSMUSG00000100555   23.76676   55.84187   0.000000   48.10978   18.19153
#> 4 ENSMUSG00000078184   50.38554   59.44457   5.924952   12.76382   24.41495
#> 5 ENSMUSG00000061024  273.79311  293.62015 232.554350  144.32935  360.00079
#> 6 ENSMUSG00000025911  154.95929  525.99438 309.578721  208.14845  256.59631
#>   Sample_603 Sample_604 Sample_605 Sample_606 Sample_621 Sample_607 Sample_608
#> 1  110.24045  171.38160   61.79141 190.642528  427.33305  228.73359  410.89116
#> 2   57.16171    0.00000    0.00000 103.558904   58.42021   63.76202   88.17407
#> 3   28.58086  151.21906   68.65713   5.884029   30.29196   21.25401   15.87133
#> 4   32.66384   10.08127    0.00000  44.718618   37.86495   19.22981   18.51655
#> 5  145.62627   55.44699  109.85140 222.416283  315.90190  253.02388  319.19013
#> 6  155.15322  171.38160    0.00000 264.781289  529.02750  301.60446  336.82494
#>   Sample_609 Sample_610 Sample_623 Sample_611 Sample_612 Sample_613 Sample_614
#> 1  313.05044  412.25499  133.76758  93.303195 248.040131   46.21469  417.35080
#> 2   60.75650   14.45406   99.46820 103.366522  18.445254    0.00000   65.25144
#> 3   45.30993   40.01052   30.86944   5.093783   1.456204  390.25738    0.00000
#> 4   39.13131   36.86833   56.59398  57.025521  49.996347    0.00000   82.04884
#> 5  209.04355  211.36447  221.23100 240.525945 248.040131  302.96296  440.60874
#> 6  375.86649  528.09696  265.82019 148.589375 171.832106   10.26993  323.02694
#>   Sample_615
#> 1  513.48498
#> 2   39.36603
#> 3    0.00000
#> 4   14.61838
#> 5  213.52043
#> 6  646.77697

We always need to make sure that the data inside the data.frame is purely numeric, and purely the numbers we’re trying to analyze. We can use the cf() function to set a specified column (defaults to column 1) as the rownames of the data.frame.

x <- cf(x)
head(x)
#>                    Sample_600 Sample_601 Sample_602 Sample_616 Sample_617
#> ENSMUSG00000051285  169.21935  309.83231 417.709088  214.03944  202.50044
#> ENSMUSG00000057363   47.53353   66.64997  54.805802   41.23696   49.30862
#> ENSMUSG00000100555   23.76676   55.84187   0.000000   48.10978   18.19153
#> ENSMUSG00000078184   50.38554   59.44457   5.924952   12.76382   24.41495
#> ENSMUSG00000061024  273.79311  293.62015 232.554350  144.32935  360.00079
#> ENSMUSG00000025911  154.95929  525.99438 309.578721  208.14845  256.59631
#>                    Sample_603 Sample_604 Sample_605 Sample_606 Sample_621
#> ENSMUSG00000051285  110.24045  171.38160   61.79141 190.642528  427.33305
#> ENSMUSG00000057363   57.16171    0.00000    0.00000 103.558904   58.42021
#> ENSMUSG00000100555   28.58086  151.21906   68.65713   5.884029   30.29196
#> ENSMUSG00000078184   32.66384   10.08127    0.00000  44.718618   37.86495
#> ENSMUSG00000061024  145.62627   55.44699  109.85140 222.416283  315.90190
#> ENSMUSG00000025911  155.15322  171.38160    0.00000 264.781289  529.02750
#>                    Sample_607 Sample_608 Sample_609 Sample_610 Sample_623
#> ENSMUSG00000051285  228.73359  410.89116  313.05044  412.25499  133.76758
#> ENSMUSG00000057363   63.76202   88.17407   60.75650   14.45406   99.46820
#> ENSMUSG00000100555   21.25401   15.87133   45.30993   40.01052   30.86944
#> ENSMUSG00000078184   19.22981   18.51655   39.13131   36.86833   56.59398
#> ENSMUSG00000061024  253.02388  319.19013  209.04355  211.36447  221.23100
#> ENSMUSG00000025911  301.60446  336.82494  375.86649  528.09696  265.82019
#>                    Sample_611 Sample_612 Sample_613 Sample_614 Sample_615
#> ENSMUSG00000051285  93.303195 248.040131   46.21469  417.35080  513.48498
#> ENSMUSG00000057363 103.366522  18.445254    0.00000   65.25144   39.36603
#> ENSMUSG00000100555   5.093783   1.456204  390.25738    0.00000    0.00000
#> ENSMUSG00000078184  57.025521  49.996347    0.00000   82.04884   14.61838
#> ENSMUSG00000061024 240.525945 248.040131  302.96296  440.60874  213.52043
#> ENSMUSG00000025911 148.589375 171.832106   10.26993  323.02694  646.77697

We also always need a sample_info object to map our samples (columns of the data.frame) to the experimental factors (groups, etc.).

sample_info <- data.frame(samples = colnames(x), sex = c(rep("Male", 10), rep("Female", 10)), 
                          genotype = c(rep("WT", 5), rep("KO", 5), rep("WT", 5), rep("KO", 5)), 
                          group = c(rep("M_WT", 5), rep("M_KO", 5), rep("F_WT", 5), rep("F_KO", 5)))
sample_info
#>       samples    sex genotype group
#> 1  Sample_600   Male       WT  M_WT
#> 2  Sample_601   Male       WT  M_WT
#> 3  Sample_602   Male       WT  M_WT
#> 4  Sample_616   Male       WT  M_WT
#> 5  Sample_617   Male       WT  M_WT
#> 6  Sample_603   Male       KO  M_KO
#> 7  Sample_604   Male       KO  M_KO
#> 8  Sample_605   Male       KO  M_KO
#> 9  Sample_606   Male       KO  M_KO
#> 10 Sample_621   Male       KO  M_KO
#> 11 Sample_607 Female       WT  F_WT
#> 12 Sample_608 Female       WT  F_WT
#> 13 Sample_609 Female       WT  F_WT
#> 14 Sample_610 Female       WT  F_WT
#> 15 Sample_623 Female       WT  F_WT
#> 16 Sample_611 Female       KO  F_KO
#> 17 Sample_612 Female       KO  F_KO
#> 18 Sample_613 Female       KO  F_KO
#> 19 Sample_614 Female       KO  F_KO
#> 20 Sample_615 Female       KO  F_KO

Now, instead of building the statomatic data set, we can just jump right in to the analysis. To do all the same things that sds_analyze does, we can use the multigroup_main() function. This function will sort each row of the data based on it’s distributions and scedasticity and perform the applicable tests. The return value of the function is a list, which contains all the results of the analysis.

res <- multigroup_main(x, var1 = sample_info$sex, var2 = sample_info$genotype, var3 = sample_info$group)
#> sorting data 100% done...
#> Using two way anova 
#> anova 100% done
#> Tukey test 100% done
#> Calculating fold changes...
#> Welch test 100% done
#> DunnettT3 test 100% done
#> Calculating fold changes...
#> Kruskal-Wallis Test 100% done
#> Dunn Test 100% done
#> Calculating fold changes...
names(res)
#> [1] "anova_results" "welch_results" "kw_results"    "foldchanges"  
#> [5] "ne"            "nu"            "nn"

You can access the elements of this list either by doing something like res[[1]] (double brackets recommended) or res$anova_results. Also notice that instead of specifying a design like we did when building a statomatic data set, we used the var1, 2, and 3 to tell statomatic what the experimental factors are. Based on these, statomatic knew to use a two way anova.

The multigroup_main function can handle both a multi factor and a single factor analysis. We could also do this:

res <- multigroup_main(x, var1 = sample_info$group)
#> sorting data 100% done...
#> Using one way anova 
#> anova 100% done
#> Tukey test 100% done
#> Calculating fold changes...
#> Welch test 100% done
#> DunnettT3 test 100% done
#> Calculating fold changes...
#> Kruskal-Wallis Test 100% done
#> Dunn Test 100% done
#> Calculating fold changes...
names(res)
#> [1] "anova_results" "welch_results" "kw_results"    "foldchanges"  
#> [5] "ne"            "nu"            "nn"

And notice that now statomatic used a one way anova.

There is also a twogroup_main function, which works very similarly to the example directly above. Note that the twogroup_main function can only handle one factor with exactly two groups.

res <- twogroup_main(x, var = sample_info$sex)
#> sorting data 100% done...
#> t-test 100% done
#> Calculating fold changes...
#> Welch test 100% done
#> Calculating fold changes...
#> Wilcox-test 100% done
#> Calculating fold changes...
names(res)
#> [1] "ttest_results"  "welch_results"  "wilcox_results" "foldchanges"   
#> [5] "ne"             "nu"             "nn"

The sds_analyze function is comprised of these two main functions, along with some control flow stuff to make sure the data is routed to the correct one. In the same way that sds_analyze is comprised of the multigroup_main and twogroup_main functions above, each of the two main functions is comprised of yet more functions which can also be used independently.

We’ll first dig into the multigroup_main function.

By doing this:

multigroup_main
#> function (x, var1, var2 = NULL, var3 = NULL, padj = FALSE, write_files = FALSE, 
#>     file_names = c("anova_results.csv", "welch_results.csv", 
#>         "kw_results.csv")) 
#> {
#>     if (write_files == TRUE && length(file_names) != 3) 
#>         stop("This function creates three files but has not been provided three names")
#>     if (is.null(var3)) {
#>         var3 <- var1
#>     }
#>     nenunn <- test_norm(x, var1, var2)
#>     ne <- nenunn[[1]]
#>     nu <- nenunn[[2]]
#>     nn <- nenunn[[3]]
#>     all_fcs <- NULL
#>     anova_results <- data.frame()
#>     welch_results <- data.frame()
#>     kw_results <- data.frame()
#>     if (nrow(ne) > 0) {
#>         my_anova <- run_anova(ne, var1 = var1, var2 = var2, padj)
#>         if (padj == TRUE) {
#>             if (is.null(var2)) {
#>                 colnames(my_anova) <- c(substitute(var1), paste0(deparse(substitute(var1)), 
#>                   "_padj"))
#>             }
#>             else {
#>                 colnames(my_anova) <- c(substitute(var1), substitute(var2), 
#>                   "interaction", paste0(deparse(substitute(var1)), 
#>                     "_padj"), paste0(deparse(substitute(var2)), 
#>                     "_padj"), "interaction_padj")
#>             }
#>         }
#>         else {
#>             if (is.null(var2)) {
#>                 colnames(my_anova) <- "Pval"
#>             }
#>             else {
#>                 colnames(my_anova) <- c(substitute(var1), substitute(var2), 
#>                   "interaction")
#>             }
#>         }
#>         tukey_res <- run_tukey(ne, var3)
#>         fc_ne <- fold_change(ne, var3)
#>         tukey_res <- combine_fc_pvals(tukey_res, fc_ne)
#>         anova_results <- cbind(my_anova, tukey_res)
#>         all_fcs <- rbind(all_fcs, fc_ne)
#>     }
#>     if (nrow(nu) > 0) {
#>         welch_res <- run_welch(nu, var3, padj)
#>         dunnett_res <- run_dunnett(nu, var3)
#>         fc_nu <- fold_change(nu, var3)
#>         dunnett_res <- combine_fc_pvals(dunnett_res, fc_nu)
#>         welch_results <- cbind(welch_res, dunnett_res)
#>         all_fcs <- rbind(all_fcs, fc_nu)
#>     }
#>     if (nrow(nn) > 0) {
#>         krusk_res <- run_kruskal(nn, var3, padj)
#>         dunn_res <- run_dunn(nn, var3)
#>         fc_nn <- fold_change(nn, var3)
#>         dunn_res <- combine_fc_pvals(dunn_res, fc_nn)
#>         kw_results <- cbind(krusk_res, dunn_res)
#>         all_fcs <- rbind(all_fcs, fc_nn)
#>     }
#>     if (write_files == TRUE) {
#>         write.csv(anova_results, file_names[1])
#>         write.csv(welch_results, file_names[2])
#>         write.csv(kw_results, file_names[3])
#>         return(NULL)
#>     }
#>     res <- list(anova_results = anova_results, welch_results = welch_results, 
#>         kw_results = kw_results, foldchanges = all_fcs, ne = ne, 
#>         nu = nu, nn = nn)
#>     return(res)
#> }
#> <bytecode: 0x158f7fc70>
#> <environment: namespace:statomatic>

We can see the source code of the function.

Don’t worry about the whole thing, but notice the functions: test_norm, run_anova, run_tukey, run_welch, run_dunnett, run_kruskal, run_dunn, and fold_change.

Each of these functions is available to use independently of the rest.

The test_norm function is the function that sorts the data by distribution and scedasticity. It can handle either 1 or 2 factors.

res <- test_norm(x, sample_info$sex, sample_info$genotype)
#> sorting data 100% done...

The return value is a list with three data.frames. The first, res[[1]], is data determined to have normally distributed residuals and equal variances amongst the factors. The second, res[[2]], has normally distributed residuals but unequal variances amongst the factors. The third, res[[3]], is data with non-normally distributed residuals.

To use test_norm with only one factor, simply supply only one.

res <- test_norm(x, sample_info$group)
#> sorting data 100% done...

The usage of the rest of the functions is identical to that of the test_norm function, and their names are descriptive of what they do.

run_anova runs an anova test, run_tukey runs a tukey test, etc.

The run_anova function can handle either one factor or two, and either a one way or two way anova will be performed in accordance with the number of factors supplied. The rest of the functions only support using one factor.

We’ll demonstate the run_anova function.

anova_res <- run_anova(x, sample_info$sex, sample_info$genotype)
#> Using two way anova 
#> anova 100% done
head(anova_res)
#>                    sample_info$sex sample_info$genotype interaction
#> ENSMUSG00000051285       0.4181785            0.4267951  0.79596064
#> ENSMUSG00000057363       0.6363190            0.3757298  0.70346317
#> ENSMUSG00000100555       0.7760540            0.3688330  0.80312281
#> ENSMUSG00000078184       0.3813036            0.9576038  0.57458343
#> ENSMUSG00000061024       0.1879463            0.5524404  0.08030823
#> ENSMUSG00000025911       0.5146527            0.3078191  0.83172345

And the run_tukey function.

tukey_res <- run_tukey(x, sample_info$group)
#> Tukey test 100% done...
head(tukey_res)
#>                    Pval_F_KO_vs_F_WT Pval_F_KO_vs_M_KO Pval_F_KO_vs_M_WT
#> ENSMUSG00000051285         0.9790520         0.8652082         0.9999995
#> ENSMUSG00000057363         0.7955281         0.9998882         0.9899203
#> ENSMUSG00000100555         0.8379946         0.9800907         0.8257741
#> ENSMUSG00000078184         0.9699575         0.7280883         0.9050970
#> ENSMUSG00000061024         0.8094858         0.1413002         0.9469754
#> ENSMUSG00000025911         0.8062227         0.9883949         0.9925533
#>                    Pval_F_WT_vs_M_KO Pval_F_WT_vs_M_WT Pval_M_KO_vs_M_WT
#> ENSMUSG00000051285         0.6568296         0.9773126         0.8699999
#> ENSMUSG00000057363         0.7597698         0.9258447         0.9820468
#> ENSMUSG00000100555         0.9688062         0.9999939         0.9635951
#> ENSMUSG00000078184         0.9310571         0.9954529         0.9824965
#> ENSMUSG00000061024         0.5161829         0.9850262         0.3321566
#> ENSMUSG00000025911         0.6260176         0.9228793         0.9330056
#>                          test
#> ENSMUSG00000051285 tukey-test
#> ENSMUSG00000057363 tukey-test
#> ENSMUSG00000100555 tukey-test
#> ENSMUSG00000078184 tukey-test
#> ENSMUSG00000061024 tukey-test
#> ENSMUSG00000025911 tukey-test

We’ll discuss only briefly about the twogroup_main function. You can view it’s source code the same way we did for the multigroup_main function.

This function is made up of the same test_norm function, run_ttest, run_welch, and run_wilcox, which are all available to use independently.

Run_ttest runs a t-test, run welch runs a welch test (and is the same welch test used in the multigroup case), and run_wilcox runs a wilcox test. Each of these three functions can handle only one factor, and their usage is identical to the run_tukey function.

To end this vignette, we’ll demonstrate a more advanced use case for using the multigroup_main function.

A more advanced example

In this example we have six different tables we want to analyze. These tables are from a metagenomic analysis, and contain abundance information at the phylum, class, order, family, genus, and species level. We looked at this species data before, in the basic workflow vignette.

There are 24 samples in this experiment, and 3 experiemental factors, with 8 samples per factor. We will call the factors groups 1-3. Therefore, we’ll use the multigroup main function with one factor.

We’re going to apply the statomatic method to all six of these tables in order to identify differences between groups at each taxonomic level.

We read in the data all together and store it in a list.

#get all files in the current directory with abundance.csv in the name 
a <- list.files(".", pattern = "abundance.csv")
a
#> [1] "class_abundance.csv"   "family_abundance.csv"  "genus_abundance.csv"  
#> [4] "order_abundance.csv"   "phylum_abundance.csv"  "species_abundance.csv"

Notice that the file names are in alphabetical order.

#read them in with read.csv
read_in <- lapply(a, read.csv)
#look at first three rows of data
lapply(read_in, head, 3)
#> [[1]]
#>              class sample_1 sample_2 sample_3 sample_4 sample_5 sample_6
#> 1       Clostridia      102      475     2910     1188     1278      324
#> 2      Bacteroidia     1066     1005      383      801      623      628
#> 3 Erysipelotrichia      588      186      141      157      127      123
#>   sample_7 sample_8 sample_9 sample_10 sample_11 sample_12 sample_13 sample_14
#> 1     4350      641     1176       883       426      5880        82       158
#> 2      596      549      259       264       551       270       207       582
#> 3      752      139      129       132       229       249       131       464
#>   sample_15 sample_16 sample_17 sample_18 sample_19 sample_20 sample_21
#> 1      1632       531       384       829      1708      4864       799
#> 2       972       700       477       724      1227       661      1754
#> 3       144       135       121       746       752       171       251
#>   sample_22 sample_23 sample_24
#> 1       841      1262       524
#> 2      1656       688      1523
#> 3       230       410        62
#> 
#> [[2]]
#>             family sample_1 sample_2 sample_3 sample_4 sample_5 sample_6
#> 1  Lachnospiraceae       36      123     1888      426      586       69
#> 2 Oscillospiraceae       39      215      501      487      344      145
#> 3     unclassified       76      204      384      342      421       83
#>   sample_7 sample_8 sample_9 sample_10 sample_11 sample_12 sample_13 sample_14
#> 1     2786      258      808       266       217      3684        38        56
#> 2     1066      255      175       324        66      1301        24        54
#> 3      585      159      310       229        56      1308        45        67
#>   sample_15 sample_16 sample_17 sample_18 sample_19 sample_20 sample_21
#> 1      1099       395        39       136       764      3133       217
#> 2       173        71        32       258       491       896       293
#> 3       369       124        71       286       514      1188       273
#>   sample_22 sample_23 sample_24
#> 1       290       508        98
#> 2       292       480       284
#> 3       288       262       197
#> 
#> [[3]]
#>          genus sample_1 sample_2 sample_3 sample_4 sample_5 sample_6 sample_7
#> 1 unclassified      281      304     1902      564      688      235     2828
#> 2   Dubosiella      583      181      118      145      112      118      665
#> 3  Bacteroides      406       92       91      106      111      185      106
#>   sample_8 sample_9 sample_10 sample_11 sample_12 sample_13 sample_14 sample_15
#> 1      370      758       322       228      3085       114       152      1477
#> 2      131      121       117       113       210       120       460       117
#> 3      156       91        93        78        95        97       421       165
#>   sample_16 sample_17 sample_18 sample_19 sample_20 sample_21 sample_22
#> 1       611       185       510      1002      3544       611       665
#> 2       105       114       737       728       123       245       220
#> 3       129       102       129       128       170       617       677
#>   sample_23 sample_24
#> 1       717       323
#> 2       402        57
#> 3       140       774
#> 
#> [[4]]
#>                order sample_1 sample_2 sample_3 sample_4 sample_5 sample_6
#> 1      Eubacteriales      100      468     2888     1184     1261      318
#> 2      Bacteroidales     1065     1005      380      800      621      626
#> 3 Erysipelotrichales      588      186      141      157      127      123
#>   sample_7 sample_8 sample_9 sample_10 sample_11 sample_12 sample_13 sample_14
#> 1     4338      632     1167       874       426      5858        82       158
#> 2      593      549      258       263       551       270       206       582
#> 3      752      139      129       132       229       249       131       464
#>   sample_15 sample_16 sample_17 sample_18 sample_19 sample_20 sample_21
#> 1      1625       529       382       812      1696      4845       783
#> 2       967       696       477       716      1224       656      1745
#> 3       144       135       121       746       752       171       251
#>   sample_22 sample_23 sample_24
#> 1       824      1245       521
#> 2      1650       687      1520
#> 3       230       410        62
#> 
#> [[5]]
#>           phylum sample_1 sample_2 sample_3 sample_4 sample_5 sample_6 sample_7
#> 1      Bacillota      710      790     3214     1471     1573      495     5279
#> 2   Bacteroidota     1071     1007      385      801      627      631      603
#> 3 Pseudomonadota       14      269      475      372      396       29      197
#>   sample_8 sample_9 sample_10 sample_11 sample_12 sample_13 sample_14 sample_15
#> 1      836     1506      1117      1109      6689       234       668      1948
#> 2      551      261       267       551       274       207       584       978
#> 3      266      402       395       263       352       413        14       462
#>   sample_16 sample_17 sample_18 sample_19 sample_20 sample_21 sample_22
#> 1       712       999      1707      2770      5492      1231      1214
#> 2       708       477       736      1235       670      1764      1671
#> 3       176       409        25        35       516        35       163
#>   sample_23 sample_24
#> 1      1862       630
#> 2       704      1532
#> 3        29        18
#> 
#> [[6]]
#>                            species sample_1 sample_2 sample_3 sample_4 sample_5
#> 1        Lachnospiraceae bacterium       14       33      245      224      167
#> 2          Dubosiella newyorkensis      583      181      118      145      112
#> 3 Parasutterella excrementihominis        4      241      431      327      343
#>   sample_6 sample_7 sample_8 sample_9 sample_10 sample_11 sample_12 sample_13
#> 1        9      996       71      312        85        91      1731        22
#> 2      118      665      131      121       117       113       210       120
#> 3       21      128      241      367       364       246       255       393
#>   sample_14 sample_15 sample_16 sample_17 sample_18 sample_19 sample_20
#> 1        12       221       289        11        35       168      1907
#> 2       460       117       105       114       737       728       123
#> 3         8       386       120       385         7         4       433
#>   sample_21 sample_22 sample_23 sample_24
#> 1        42       103       136        28
#> 2       245       220       402        57
#> 3         4       117        10         1

Each of these tables needs the cf function.

#apply cf to each table
tables <- lapply(read_in, cf)

We can use the same sample info object for all of these tables, since they all contain the same samples.

#tables[[1]] is the first table in the tables list. all the tables have the same column names
sample_info <- data.frame(sample = colnames(tables[[1]]), group = c(rep("group1", 8), rep("group2", 8), rep("group3", 8)))
sample_info
#>       sample  group
#> 1   sample_1 group1
#> 2   sample_2 group1
#> 3   sample_3 group1
#> 4   sample_4 group1
#> 5   sample_5 group1
#> 6   sample_6 group1
#> 7   sample_7 group1
#> 8   sample_8 group1
#> 9   sample_9 group2
#> 10 sample_10 group2
#> 11 sample_11 group2
#> 12 sample_12 group2
#> 13 sample_13 group2
#> 14 sample_14 group2
#> 15 sample_15 group2
#> 16 sample_16 group2
#> 17 sample_17 group3
#> 18 sample_18 group3
#> 19 sample_19 group3
#> 20 sample_20 group3
#> 21 sample_21 group3
#> 22 sample_22 group3
#> 23 sample_23 group3
#> 24 sample_24 group3

Now we’ll apply the multigroup_main function to each of these tables, again using lapply.

res <- lapply(tables, multigroup_main, sample_info$group)
#> 
#> Using one way anova 
#> anova 100% done
#> Tukey test 100% done
#> Calculating fold changes...
#> Welch test 100% done
#> DunnettT3 test 100% done
#> Calculating fold changes...
#> Kruskal-Wallis Test 100% done
#> Dunn Test 100% done
#> Calculating fold changes...
#> sorting data 100% done...
#> Using one way anova 
#> anova 100% done
#> Tukey test 100% done
#> Calculating fold changes...
#> Welch test 100% done
#> DunnettT3 test 100% done
#> Calculating fold changes...
#> Kruskal-Wallis Test 60% done...Kruskal-Wallis Test 100% done
#> Dunn Test 60% done...Dunn Test 100% done
#> Calculating fold changes...
#> sorting data 100% done...
#> Using one way anova 
#> anova 100% done
#> Tukey test 100% done
#> Calculating fold changes...
#> Kruskal-Wallis Test 71% done...Kruskal-Wallis Test 100% done
#> Dunn Test 71% done...Dunn Test 100% done
#> Calculating fold changes...
#> sorting data 100% done...
#> Using one way anova 
#> anova 100% done
#> Tukey test 100% done
#> Calculating fold changes...
#> Welch test 100% done
#> DunnettT3 test 100% done
#> Calculating fold changes...
#> Kruskal-Wallis Test 60% done...Kruskal-Wallis Test 100% done
#> Dunn Test 60% done...Dunn Test 100% done
#> Calculating fold changes...
#> 
#> Using one way anova 
#> anova 100% done
#> Tukey test 100% done
#> Calculating fold changes...
#> Welch test 100% done
#> DunnettT3 test 100% done
#> Calculating fold changes...
#> Kruskal-Wallis Test 100% done
#> Dunn Test 100% done
#> Calculating fold changes...
#> sorting data 100% done...
#> Using one way anova 
#> anova 100% done
#> Tukey test 100% done
#> Calculating fold changes...
#> Welch test 100% done
#> DunnettT3 test 100% done
#> Calculating fold changes...
#> Kruskal-Wallis Test 63% done...Kruskal-Wallis Test 100% done
#> Dunn Test 63% done...Dunn Test 100% done
#> Calculating fold changes...

We know from before that multigroup_main returns a list, and we just applied this function to a list of data.frames. So what is res? It’s a list of lists.

#res has 6 elements, one per table we analyzed
length(res)
#> [1] 6
#each element of res has 7 elements
length(res[[1]])
#> [1] 7
names(res[[1]])
#> [1] "anova_results" "welch_results" "kw_results"    "foldchanges"  
#> [5] "ne"            "nu"            "nn"

Each element of res is the same as it was when we were only analyzing one table. The only difference is now we’re working with several of them at once.

We’ll look at the results of the first one just to orient ourselves, and we’re only going to look at the first three elements, the anova, welch, and kruskal-wallis results.

#get the first element of res (which is a list of results of the first analysis). This will be the class abundances since it is first alphabetically
x <- res[[1]]
lapply(x[1:3], head)
#> $anova_results
#>                           Pval Pval_group1_vs_group2
#> Betaproteobacteria  0.20000097             0.8263342
#> Coriobacteriia      0.06770511             0.5440510
#> Negativicutes       0.04907064             0.6193168
#> Tissierellia        0.15297673             0.1656226
#> Alphaproteobacteria 0.08962253             0.9493171
#> Flavobacteriia      0.16490962             0.5901316
#>                     log2FoldChange_group1_vs_group2 Pval_group1_vs_group3
#> Betaproteobacteria                       -0.2803858             0.4456629
#> Coriobacteriia                           -0.5572681             0.3576415
#> Negativicutes                             0.6147098             0.2415234
#> Tissierellia                              1.2537566             0.9467810
#> Alphaproteobacteria                       0.2630344             0.1795866
#> Flavobacteriia                           -0.8744691             0.1411272
#>                     log2FoldChange_group1_vs_group3 Pval_group2_vs_group3
#> Betaproteobacteria                        0.8590642            0.18512474
#> Coriobacteriia                            1.3940111            0.05543407
#> Negativicutes                            -0.6890709            0.04158533
#> Tissierellia                              0.1468414            0.27729400
#> Alphaproteobacteria                      -1.0000000            0.10326409
#> Flavobacteriia                           -1.4150375            0.59013161
#>                     log2FoldChange_group2_vs_group3 group1_mean group2_mean
#> Betaproteobacteria                        1.1394500     229.000     278.125
#> Coriobacteriia                            1.9512792     173.125     254.750
#> Negativicutes                            -1.3037807       6.125       4.000
#> Tissierellia                             -1.1069152       3.875       1.625
#> Alphaproteobacteria                      -1.2630344       1.500       1.250
#> Flavobacteriia                           -0.5405684       0.750       1.375
#>                     group3_mean       test
#> Betaproteobacteria      126.250 tukey-test
#> Coriobacteriia           65.875 tukey-test
#> Negativicutes             9.875 tukey-test
#> Tissierellia              3.500 tukey-test
#> Alphaproteobacteria       3.000 tukey-test
#> Flavobacteriia            2.000 tukey-test
#> 
#> $welch_results
#>                     Pval Pval_group1_vs_group2 log2FoldChange_group1_vs_group2
#> Bacteroidia   0.03093208             0.2372938                       0.5706097
#> Chitinophagia 0.04254664             0.6030114                       1.2223924
#>               Pval_group1_vs_group3 log2FoldChange_group1_vs_group3
#> Bacteroidia               0.2150544                      -0.6241665
#> Chitinophagia             0.2606331                      -1.2801079
#>               Pval_group2_vs_group3 log2FoldChange_group2_vs_group3 group1_mean
#> Bacteroidia              0.03416268                       -1.194776     706.375
#> Chitinophagia            0.05656468                       -2.502500       0.875
#>               group2_mean group3_mean         test
#> Bacteroidia       475.625    1088.750 dunnett-test
#> Chitinophagia       0.375       2.125 dunnett-test
#> 
#> $kw_results
#>                           Pval Pval_group1_vs_group2
#> Clostridia          0.81668648                     1
#> Erysipelotrichia    0.65079423                     1
#> unclassified        0.73334806                     1
#> Verrucomicrobiae    0.12045536                     1
#> Bacilli             0.08206044                     1
#> Gammaproteobacteria 0.79132952                     1
#>                     log2FoldChange_group1_vs_group2 Pval_group1_vs_group3
#> Clostridia                               0.06548115            1.00000000
#> Erysipelotrichia                         0.45625701            1.00000000
#> unclassified                            -0.28237212            1.00000000
#> Verrucomicrobiae                        -0.47375524            0.19709373
#> Bacilli                                 -2.15324626            0.08117608
#> Gammaproteobacteria                     -0.46247849            1.00000000
#>                     log2FoldChange_group1_vs_group3 Pval_group2_vs_group3
#> Clostridia                              0.007316499             1.0000000
#> Erysipelotrichia                       -0.309751171             1.0000000
#> unclassified                           -0.561999137             1.0000000
#> Verrucomicrobiae                       -0.652276613             0.2581564
#> Bacilli                                -2.106153552             0.4871113
#> Gammaproteobacteria                    -0.173648087             1.0000000
#>                     log2FoldChange_group2_vs_group3 group1_mean group2_mean
#> Clostridia                              -0.05816465    1408.500    1346.000
#> Erysipelotrichia                        -0.76600818     276.625     201.625
#> unclassified                            -0.27962702     145.125     176.500
#> Verrucomicrobiae                        -0.17852137      82.000     113.875
#> Bacilli                                  0.04709271      25.375     112.875
#> Gammaproteobacteria                      0.28883041      21.500      29.625
#>                     group3_mean      test
#> Clostridia             1401.375 dunn-test
#> Erysipelotrichia        342.875 dunn-test
#> unclassified            214.250 dunn-test
#> Verrucomicrobiae        128.875 dunn-test
#> Bacilli                 109.250 dunn-test
#> Gammaproteobacteria      24.250 dunn-test

Since all the results of this analysis were tested in the same way, and all of the results from each taxonomic level have the same columns, we can collate each one into a single data.frame.

What we’ll do is for each list of results in res, we’ll rbind (row bind) the first three elements, save this collated table to a new list and then give it a name.

For the names, we’ll use the result of the list.files function above, but we’ll trim off the .csv extension first.

#gsub .csv with nothing in list a
name_list <- gsub(".csv", "", a)
name_list
#> [1] "class_abundance"   "family_abundance"  "genus_abundance"  
#> [4] "order_abundance"   "phylum_abundance"  "species_abundance"

Now we’ll collate and save the tables

#define an empty list for our new tables
collated_tables <- list()
#use a for loop to collate and save our tables
for (i in 1:length(res)) {
  #the current result were processing
  temp <- res[[i]]
  #get just the p values and fold change results 
  first_three <- temp[1:3]
  #rbind into one new table
  new_table <- rbind(first_three[[1]], first_three[[2]], first_three[[3]])
  #save the new table to the collated_tables list
  collated_tables[[i]] <- new_table
}
#set the names of collated_tables to be the names from name_list
names(collated_tables) <- name_list
#look at some of the results
lapply(collated_tables, head, 3)
#> $class_abundance
#>                          Pval Pval_group1_vs_group2
#> Betaproteobacteria 0.20000097             0.8263342
#> Coriobacteriia     0.06770511             0.5440510
#> Negativicutes      0.04907064             0.6193168
#>                    log2FoldChange_group1_vs_group2 Pval_group1_vs_group3
#> Betaproteobacteria                      -0.2803858             0.4456629
#> Coriobacteriia                          -0.5572681             0.3576415
#> Negativicutes                            0.6147098             0.2415234
#>                    log2FoldChange_group1_vs_group3 Pval_group2_vs_group3
#> Betaproteobacteria                       0.8590642            0.18512474
#> Coriobacteriia                           1.3940111            0.05543407
#> Negativicutes                           -0.6890709            0.04158533
#>                    log2FoldChange_group2_vs_group3 group1_mean group2_mean
#> Betaproteobacteria                        1.139450     229.000     278.125
#> Coriobacteriia                            1.951279     173.125     254.750
#> Negativicutes                            -1.303781       6.125       4.000
#>                    group3_mean       test
#> Betaproteobacteria     126.250 tukey-test
#> Coriobacteriia          65.875 tukey-test
#> Negativicutes            9.875 tukey-test
#> 
#> $family_abundance
#>                       Pval Pval_group1_vs_group2
#> Sutterellaceae 0.197457096             0.8104024
#> Atopobiaceae   0.066157338             0.5388124
#> Prevotellaceae 0.001577777             0.6487139
#>                log2FoldChange_group1_vs_group2 Pval_group1_vs_group3
#> Sutterellaceae                      -0.2966753            0.45486902
#> Atopobiaceae                        -0.5663191            0.35585660
#> Prevotellaceae                       0.7670228            0.01360419
#>                log2FoldChange_group1_vs_group3 Pval_group2_vs_group3
#> Sutterellaceae                       0.8589077            0.18096687
#> Atopobiaceae                         1.4272001            0.05406228
#> Prevotellaceae                      -1.2857835            0.00171722
#>                log2FoldChange_group2_vs_group3 group1_mean group2_mean
#> Sutterellaceae                        1.155583      219.00     269.000
#> Atopobiaceae                          1.993519      168.75     249.875
#> Prevotellaceae                       -2.052806       24.25      14.250
#>                group3_mean       test
#> Sutterellaceae     120.750 tukey-test
#> Atopobiaceae        62.750 tukey-test
#> Prevotellaceae      59.125 tukey-test
#> 
#> $genus_abundance
#>                          Pval Pval_group1_vs_group2
#> Parasutterella     0.19869563             0.8113967
#> Muricaecibacterium 0.06812654             0.5464864
#> Anaerotruncus      0.51580712             0.8242216
#>                    log2FoldChange_group1_vs_group2 Pval_group1_vs_group3
#> Parasutterella                          -0.2961576             0.4560574
#> Muricaecibacterium                      -0.5635121             0.3573489
#> Anaerotruncus                            0.3156143             0.8353710
#>                    log2FoldChange_group1_vs_group3 Pval_group2_vs_group3
#> Parasutterella                           0.8580840            0.18218500
#> Muricaecibacterium                       1.4364879            0.05582506
#> Anaerotruncus                           -0.2503234            0.48405009
#>                    log2FoldChange_group2_vs_group3 group1_mean group2_mean
#> Parasutterella                           1.1542416     218.875     268.750
#> Muricaecibacterium                       2.0000000     163.750     242.000
#> Anaerotruncus                           -0.5659377      35.625      28.625
#>                    group3_mean       test
#> Parasutterella         120.750 tukey-test
#> Muricaecibacterium      60.500 tukey-test
#> Anaerotruncus           42.375 tukey-test
#> 
#> $order_abundance
#>                        Pval Pval_group1_vs_group2
#> Burkholderiales  0.19886795             0.8240061
#> Coriobacteriales 0.06626659             0.5465790
#> Bacillales       0.04799237             0.1124440
#>                  log2FoldChange_group1_vs_group2 Pval_group1_vs_group3
#> Burkholderiales                       -0.2828907             0.4457946
#> Coriobacteriales                      -0.5561823             0.3504407
#> Bacillales                             1.1020982             0.9373723
#>                  log2FoldChange_group1_vs_group3 Pval_group2_vs_group3
#> Burkholderiales                        0.8616358            0.18382079
#> Coriobacteriales                       1.4224441            0.05427840
#> Bacillales                            -0.1202942            0.05767428
#>                  log2FoldChange_group2_vs_group3 group1_mean group2_mean
#> Burkholderiales                         1.144526     228.500     278.000
#> Coriobacteriales                        1.978626     170.875     251.250
#> Bacillales                             -1.222392      20.125       9.375
#>                  group3_mean       test
#> Burkholderiales      125.750 tukey-test
#> Coriobacteriales      63.750 tukey-test
#> Bacillales            21.875 tukey-test
#> 
#> $phylum_abundance
#>                               Pval Pval_group1_vs_group2
#> Pseudomonadota          0.21409442             0.7870908
#> Actinomycetota          0.07304152             0.5632674
#> Thermodesulfobacteriota 0.80778822             0.8845479
#>                         log2FoldChange_group1_vs_group2 Pval_group1_vs_group3
#> Pseudomonadota                               -0.2956677             0.5019108
#> Actinomycetota                               -0.5316898             0.3622041
#> Thermodesulfobacteriota                       0.2995603             0.9863854
#>                         log2FoldChange_group1_vs_group3 Pval_group2_vs_group3
#> Pseudomonadota                               0.71426786            0.19346553
#> Actinomycetota                               1.32040063            0.06022461
#> Thermodesulfobacteriota                     -0.08746284            0.80490509
#>                         log2FoldChange_group2_vs_group3 group1_mean group2_mean
#> Pseudomonadota                                1.0099356      252.25     309.625
#> Actinomycetota                                1.8520905      177.00     255.875
#> Thermodesulfobacteriota                      -0.3870231        2.00       1.625
#>                         group3_mean       test
#> Pseudomonadota              153.750 tukey-test
#> Actinomycetota               70.875 tukey-test
#> Thermodesulfobacteriota       2.125 tukey-test
#> 
#> $species_abundance
#>                                         Pval Pval_group1_vs_group2
#> Parasutterella excrementihominis 0.199182490             0.8062039
#> Muricaecibacterium torontonense  0.068126536             0.5464864
#> Muribaculaceae bacterium         0.001032886             0.4923547
#>                                  log2FoldChange_group1_vs_group2
#> Parasutterella excrementihominis                      -0.3011695
#> Muricaecibacterium torontonense                       -0.5635121
#> Muribaculaceae bacterium                               0.6818240
#>                                  Pval_group1_vs_group3
#> Parasutterella excrementihominis            0.46121140
#> Muricaecibacterium torontonense             0.35734892
#> Muribaculaceae bacterium                    0.01419407
#>                                  log2FoldChange_group1_vs_group3
#> Parasutterella excrementihominis                       0.8531586
#> Muricaecibacterium torontonense                        1.4364879
#> Muribaculaceae bacterium                              -1.0093379
#>                                  Pval_group2_vs_group3
#> Parasutterella excrementihominis           0.182048409
#> Muricaecibacterium torontonense            0.055825056
#> Muribaculaceae bacterium                   0.000975412
#>                                  log2FoldChange_group2_vs_group3 group1_mean
#> Parasutterella excrementihominis                        1.154328      217.00
#> Muricaecibacterium torontonense                         2.000000      163.75
#> Muribaculaceae bacterium                               -1.691162       38.50
#>                                  group2_mean group3_mean       test
#> Parasutterella excrementihominis     267.375     120.125 tukey-test
#> Muricaecibacterium torontonense      242.000      60.500 tukey-test
#> Muribaculaceae bacterium              24.000      77.500 tukey-test

From here, we could do some further analysis/filtering of the results. One way we could export these results is as follows. We write.csv for each of the elements in collated_tables and give it the same file name as the original abundance table with results_ at the beginning

for (i in 1:length(collated_tables)) {
  write.csv(collated_tables[[i]], paste0("results_", a[i]))
}