basic_workflow

This basic workflow vignette is intended to demonstate some basic usage of the statomatic package to analyze count data from an experiment.

There will be three different analyses included here, meant to demonstrate how statomatic handles different experiment designs. The three designs demonstrated here will be a multi group (> 2) two factor design, a two group one factor design, and a multi group one factor design.

Part1: Multi group 2 factor design

The data analyzed here is from an RNA Sequencing experiement which aimed to identify differentially expressed genes across several experiemental conditions. The details of this experiment won’t be discussed here.

Load the data to be analyzed. Here we analyze the normalized counts, as created by DESeq2.

norm_counts <- read.csv("normalized_counts.csv")
head(norm_counts)
#>                    X Sample_600 Sample_601 Sample_602 Sample_616 Sample_617
#> 1 ENSMUSG00000051285  169.21935  309.83231 417.709088  214.03944  202.50044
#> 2 ENSMUSG00000057363   47.53353   66.64997  54.805802   41.23696   49.30862
#> 3 ENSMUSG00000100555   23.76676   55.84187   0.000000   48.10978   18.19153
#> 4 ENSMUSG00000078184   50.38554   59.44457   5.924952   12.76382   24.41495
#> 5 ENSMUSG00000061024  273.79311  293.62015 232.554350  144.32935  360.00079
#> 6 ENSMUSG00000025911  154.95929  525.99438 309.578721  208.14845  256.59631
#>   Sample_603 Sample_604 Sample_605 Sample_606 Sample_621 Sample_607 Sample_608
#> 1  110.24045  171.38160   61.79141 190.642528  427.33305  228.73359  410.89116
#> 2   57.16171    0.00000    0.00000 103.558904   58.42021   63.76202   88.17407
#> 3   28.58086  151.21906   68.65713   5.884029   30.29196   21.25401   15.87133
#> 4   32.66384   10.08127    0.00000  44.718618   37.86495   19.22981   18.51655
#> 5  145.62627   55.44699  109.85140 222.416283  315.90190  253.02388  319.19013
#> 6  155.15322  171.38160    0.00000 264.781289  529.02750  301.60446  336.82494
#>   Sample_609 Sample_610 Sample_623 Sample_611 Sample_612 Sample_613 Sample_614
#> 1  313.05044  412.25499  133.76758  93.303195 248.040131   46.21469  417.35080
#> 2   60.75650   14.45406   99.46820 103.366522  18.445254    0.00000   65.25144
#> 3   45.30993   40.01052   30.86944   5.093783   1.456204  390.25738    0.00000
#> 4   39.13131   36.86833   56.59398  57.025521  49.996347    0.00000   82.04884
#> 5  209.04355  211.36447  221.23100 240.525945 248.040131  302.96296  440.60874
#> 6  375.86649  528.09696  265.82019 148.589375 171.832106   10.26993  323.02694
#>   Sample_615
#> 1  513.48498
#> 2   39.36603
#> 3    0.00000
#> 4   14.61838
#> 5  213.52043
#> 6  646.77697

Notice that the row names of norm_counts are numbers, and that the gene ids are actually inside the data.frame. Statomatic will not be able to handle this textual data, so we need to make sure to set the gene ids as the row names and remove the gene ids column from inside the data.frame.

Statomatic has a built in function for this, cf(). This function will set a specified column of a data.frame to be the new row names, and then remove the column from inside the data.frame. The column which should be the row names can be specified by column_index = an integer, but this value defaults to 1 (i.e., the first column).

norm_counts <- cf(norm_counts)
head(norm_counts)
#>                    Sample_600 Sample_601 Sample_602 Sample_616 Sample_617
#> ENSMUSG00000051285  169.21935  309.83231 417.709088  214.03944  202.50044
#> ENSMUSG00000057363   47.53353   66.64997  54.805802   41.23696   49.30862
#> ENSMUSG00000100555   23.76676   55.84187   0.000000   48.10978   18.19153
#> ENSMUSG00000078184   50.38554   59.44457   5.924952   12.76382   24.41495
#> ENSMUSG00000061024  273.79311  293.62015 232.554350  144.32935  360.00079
#> ENSMUSG00000025911  154.95929  525.99438 309.578721  208.14845  256.59631
#>                    Sample_603 Sample_604 Sample_605 Sample_606 Sample_621
#> ENSMUSG00000051285  110.24045  171.38160   61.79141 190.642528  427.33305
#> ENSMUSG00000057363   57.16171    0.00000    0.00000 103.558904   58.42021
#> ENSMUSG00000100555   28.58086  151.21906   68.65713   5.884029   30.29196
#> ENSMUSG00000078184   32.66384   10.08127    0.00000  44.718618   37.86495
#> ENSMUSG00000061024  145.62627   55.44699  109.85140 222.416283  315.90190
#> ENSMUSG00000025911  155.15322  171.38160    0.00000 264.781289  529.02750
#>                    Sample_607 Sample_608 Sample_609 Sample_610 Sample_623
#> ENSMUSG00000051285  228.73359  410.89116  313.05044  412.25499  133.76758
#> ENSMUSG00000057363   63.76202   88.17407   60.75650   14.45406   99.46820
#> ENSMUSG00000100555   21.25401   15.87133   45.30993   40.01052   30.86944
#> ENSMUSG00000078184   19.22981   18.51655   39.13131   36.86833   56.59398
#> ENSMUSG00000061024  253.02388  319.19013  209.04355  211.36447  221.23100
#> ENSMUSG00000025911  301.60446  336.82494  375.86649  528.09696  265.82019
#>                    Sample_611 Sample_612 Sample_613 Sample_614 Sample_615
#> ENSMUSG00000051285  93.303195 248.040131   46.21469  417.35080  513.48498
#> ENSMUSG00000057363 103.366522  18.445254    0.00000   65.25144   39.36603
#> ENSMUSG00000100555   5.093783   1.456204  390.25738    0.00000    0.00000
#> ENSMUSG00000078184  57.025521  49.996347    0.00000   82.04884   14.61838
#> ENSMUSG00000061024 240.525945 248.040131  302.96296  440.60874  213.52043
#> ENSMUSG00000025911 148.589375 171.832106   10.26993  323.02694  646.77697

Now the data is ready to be analyzed.

We also must provide statomatic with metadata about our samples. In the case of this experiment, our samples come from different experimental groups. The way to do this is by creating another data.frame which contains our sample names and their experimental conditions. We’ll use the column names of norm_counts, since these are the samples, and we’ll specify conditions of the experiment in additional columns. The experimental conditions here are sex and genotype.

sample_info <- data.frame(samples = colnames(norm_counts), sex = c(rep("Male", 10), rep("Female", 10)), 
                          genotype = c(rep("WT", 5), rep("KO", 5), rep("WT", 5), rep("KO", 5)), 
                          group = c(rep("M_WT", 5), rep("M_KO", 5), rep("F_WT", 5), rep("F_KO", 5)))
sample_info
#>       samples    sex genotype group
#> 1  Sample_600   Male       WT  M_WT
#> 2  Sample_601   Male       WT  M_WT
#> 3  Sample_602   Male       WT  M_WT
#> 4  Sample_616   Male       WT  M_WT
#> 5  Sample_617   Male       WT  M_WT
#> 6  Sample_603   Male       KO  M_KO
#> 7  Sample_604   Male       KO  M_KO
#> 8  Sample_605   Male       KO  M_KO
#> 9  Sample_606   Male       KO  M_KO
#> 10 Sample_621   Male       KO  M_KO
#> 11 Sample_607 Female       WT  F_WT
#> 12 Sample_608 Female       WT  F_WT
#> 13 Sample_609 Female       WT  F_WT
#> 14 Sample_610 Female       WT  F_WT
#> 15 Sample_623 Female       WT  F_WT
#> 16 Sample_611 Female       KO  F_KO
#> 17 Sample_612 Female       KO  F_KO
#> 18 Sample_613 Female       KO  F_KO
#> 19 Sample_614 Female       KO  F_KO
#> 20 Sample_615 Female       KO  F_KO

Notice that the final column, group, is a linear combination of the first two columns. This is necessary to statomatic’s process. The final column of the sample_info should always describe the distribution of the rest of the experimental conditions (if there are any).

Now we build the statomatic data set. We’ll include the columns, the sample metdata (sample_info), and specify the experimental design. The experimental design here is sex + genotype.

sds <- make_sds(x = norm_counts, colData = sample_info, design = ~ sex + genotype)
sds
#> class: StatomaticDataset 
#> dim: 100 20 
#> metadata(0):
#> assays(1): counts
#> rownames(100): ENSMUSG00000051285 ENSMUSG00000057363 ...
#>   ENSMUSG00000026384 ENSMUSG00000036975
#> rowData names(0):
#> colnames(20): Sample_600 Sample_601 ... Sample_614 Sample_615
#> colData names(4): samples sex genotype group

By viewing sds we can see information about our data such as the number of rows, column names, etc.

To analyze the data contained in sds, we use the sds_analyze() function. This functions automatically takes information from sds in order to perform the analysis. This function will use the experimental design we specified when we created sds

sds <- sds_analyze(sds)
#> sorting data 100% done...
#> Using two way anova 
#> anova 100% done
#> Tukey test 100% done
#> Calculating fold changes...
#> Welch test 100% done
#> DunnettT3 test 100% done
#> Calculating fold changes...
#> Kruskal-Wallis Test 100% done
#> Dunn Test 100% done
#> Calculating fold changes...
sds
#> class: StatomaticDataset 
#> dim: 100 20 
#> metadata(0):
#> assays(1): counts
#> rownames(100): ENSMUSG00000051285 ENSMUSG00000057363 ...
#>   ENSMUSG00000026384 ENSMUSG00000036975
#> rowData names(0):
#> colnames(20): Sample_600 Sample_601 ... Sample_614 Sample_615
#> colData names(4): samples sex genotype group

Notice that sds appears unchanged. The results of the analysis are stored within sds, and we can access them as follows.

To access all of the p-values from the multi-comparison tests and the fold changes, we can use the get_results() function.

res <- get_results(sds)
head(res)
#>                    Pval_F_KO_vs_F_WT log2FoldChange_F_KO_vs_F_WT
#> ENSMUSG00000051285         0.9790520                  -0.1849281
#> ENSMUSG00000057363         0.7955281                  -0.5285301
#> ENSMUSG00000078184         0.9699575                   0.2579516
#> ENSMUSG00000061024         0.8094858                   0.2521327
#> ENSMUSG00000025911         0.8062227                  -0.4755035
#> ENSMUSG00000057335         0.8020005                  -0.5990134
#>                    Pval_F_KO_vs_M_KO log2FoldChange_F_KO_vs_M_KO
#> ENSMUSG00000051285         0.8652082                   0.4555891
#> ENSMUSG00000057363         0.9998882                   0.0472020
#> ENSMUSG00000078184         0.7280883                   0.7006521
#> ENSMUSG00000061024         0.1413002                   0.7674774
#> ENSMUSG00000025911         0.9883949                   0.2151199
#> ENSMUSG00000057335         0.9992630                   0.1067049
#>                    Pval_F_KO_vs_M_WT log2FoldChange_F_KO_vs_M_WT
#> ENSMUSG00000051285         0.9999995                 0.005584147
#> ENSMUSG00000057363         0.9899203                -0.196868100
#> ENSMUSG00000078184         0.9050970                 0.413461170
#> ENSMUSG00000061024         0.9469754                 0.148453255
#> ENSMUSG00000025911         0.9925533                -0.162232725
#> ENSMUSG00000057335         0.7267167                -0.672042886
#>                    Pval_F_WT_vs_M_KO log2FoldChange_F_WT_vs_M_KO
#> ENSMUSG00000051285         0.6568296                   0.6405172
#> ENSMUSG00000057363         0.7597698                   0.5757321
#> ENSMUSG00000078184         0.9310571                   0.4427005
#> ENSMUSG00000061024         0.5161829                   0.5153447
#> ENSMUSG00000025911         0.6260176                   0.6906234
#> ENSMUSG00000057335         0.7340525                   0.7057183
#>                    Pval_F_WT_vs_M_WT log2FoldChange_F_WT_vs_M_WT
#> ENSMUSG00000051285         0.9773126                  0.19051226
#> ENSMUSG00000057363         0.9258447                  0.33166197
#> ENSMUSG00000078184         0.9954529                  0.15550958
#> ENSMUSG00000061024         0.9850262                 -0.10367949
#> ENSMUSG00000025911         0.9228793                  0.31327074
#> ENSMUSG00000057335         0.9990121                 -0.07302953
#>                    Pval_M_KO_vs_M_WT log2FoldChange_M_KO_vs_M_WT F_KO_mean
#> ENSMUSG00000051285         0.8699999                  -0.4500049 263.67876
#> ENSMUSG00000057363         0.9820468                  -0.2440701  45.28585
#> ENSMUSG00000078184         0.9824965                  -0.2871910  40.73782
#> ENSMUSG00000061024         0.3321566                  -0.6190242 289.13164
#> ENSMUSG00000025911         0.9330056                  -0.3773527 260.09906
#> ENSMUSG00000057335         0.6535879                  -0.7787478 189.02091
#>                    F_WT_mean M_KO_mean M_WT_mean       test
#> ENSMUSG00000051285 299.73955 192.27781 262.66013 tukey-test
#> ENSMUSG00000057363  65.32297  43.82817  51.90698 tukey-test
#> ENSMUSG00000078184  34.06800  25.06574  30.58677 tukey-test
#> ENSMUSG00000061024 242.77061 169.84857 260.85955 tukey-test
#> ENSMUSG00000025911 361.64261 224.06872 291.05543 tukey-test
#> ENSMUSG00000057335 286.30626 175.54500 301.17223 tukey-test

Notice that in addition to the p-values and fold changes, the mean of each group as well as the test used is at the end of the data.frame.

If you want to see the anova results, they can be accessed with get_anova().

anova_res <- get_anova(sds)
head(anova_res)
#>                          sex  genotype interaction Pval_F_KO_vs_F_WT
#> ENSMUSG00000051285 0.4181785 0.4267951  0.79596064         0.9790520
#> ENSMUSG00000057363 0.6363190 0.3757298  0.70346317         0.7955281
#> ENSMUSG00000078184 0.3813036 0.9576038  0.57458343         0.9699575
#> ENSMUSG00000061024 0.1879463 0.5524404  0.08030823         0.8094858
#> ENSMUSG00000025911 0.5146527 0.3078191  0.83172345         0.8062227
#> ENSMUSG00000057335 0.9928111 0.1615905  0.85432131         0.8020005
#>                    log2FoldChange_F_KO_vs_F_WT Pval_F_KO_vs_M_KO
#> ENSMUSG00000051285                  -0.1849281         0.8652082
#> ENSMUSG00000057363                  -0.5285301         0.9998882
#> ENSMUSG00000078184                   0.2579516         0.7280883
#> ENSMUSG00000061024                   0.2521327         0.1413002
#> ENSMUSG00000025911                  -0.4755035         0.9883949
#> ENSMUSG00000057335                  -0.5990134         0.9992630
#>                    log2FoldChange_F_KO_vs_M_KO Pval_F_KO_vs_M_WT
#> ENSMUSG00000051285                   0.4555891         0.9999995
#> ENSMUSG00000057363                   0.0472020         0.9899203
#> ENSMUSG00000078184                   0.7006521         0.9050970
#> ENSMUSG00000061024                   0.7674774         0.9469754
#> ENSMUSG00000025911                   0.2151199         0.9925533
#> ENSMUSG00000057335                   0.1067049         0.7267167
#>                    log2FoldChange_F_KO_vs_M_WT Pval_F_WT_vs_M_KO
#> ENSMUSG00000051285                 0.005584147         0.6568296
#> ENSMUSG00000057363                -0.196868100         0.7597698
#> ENSMUSG00000078184                 0.413461170         0.9310571
#> ENSMUSG00000061024                 0.148453255         0.5161829
#> ENSMUSG00000025911                -0.162232725         0.6260176
#> ENSMUSG00000057335                -0.672042886         0.7340525
#>                    log2FoldChange_F_WT_vs_M_KO Pval_F_WT_vs_M_WT
#> ENSMUSG00000051285                   0.6405172         0.9773126
#> ENSMUSG00000057363                   0.5757321         0.9258447
#> ENSMUSG00000078184                   0.4427005         0.9954529
#> ENSMUSG00000061024                   0.5153447         0.9850262
#> ENSMUSG00000025911                   0.6906234         0.9228793
#> ENSMUSG00000057335                   0.7057183         0.9990121
#>                    log2FoldChange_F_WT_vs_M_WT Pval_M_KO_vs_M_WT
#> ENSMUSG00000051285                  0.19051226         0.8699999
#> ENSMUSG00000057363                  0.33166197         0.9820468
#> ENSMUSG00000078184                  0.15550958         0.9824965
#> ENSMUSG00000061024                 -0.10367949         0.3321566
#> ENSMUSG00000025911                  0.31327074         0.9330056
#> ENSMUSG00000057335                 -0.07302953         0.6535879
#>                    log2FoldChange_M_KO_vs_M_WT F_KO_mean F_WT_mean M_KO_mean
#> ENSMUSG00000051285                  -0.4500049 263.67876 299.73955 192.27781
#> ENSMUSG00000057363                  -0.2440701  45.28585  65.32297  43.82817
#> ENSMUSG00000078184                  -0.2871910  40.73782  34.06800  25.06574
#> ENSMUSG00000061024                  -0.6190242 289.13164 242.77061 169.84857
#> ENSMUSG00000025911                  -0.3773527 260.09906 361.64261 224.06872
#> ENSMUSG00000057335                  -0.7787478 189.02091 286.30626 175.54500
#>                    M_WT_mean       test
#> ENSMUSG00000051285 262.66013 tukey-test
#> ENSMUSG00000057363  51.90698 tukey-test
#> ENSMUSG00000078184  30.58677 tukey-test
#> ENSMUSG00000061024 260.85955 tukey-test
#> ENSMUSG00000025911 291.05543 tukey-test
#> ENSMUSG00000057335 301.17223 tukey-test

This data.frame also includes the Tukey multi-comparison results, as well as fold changes.

To access the welch and kruskal-wallis results, use get_welch_test and get_kw.

You can also access the data.frames ne, nu, and nn by using get_ne, get_nu, and get_nn.

To see what other results are stored in sds, you can do this:

names(sds@results)
#> [1] "anova_results"      "welch_test_results" "kw_results"        
#> [4] "all_results"        "ne"                 "nu"                
#> [7] "nn"                 "foldchange"

You can access any of these results by doing something like sds@results$anova_results.

Part2: Two group, one factor design

First load the data. For this example we’ll look at some gene expression data from an RNA-Seq experiment. These counts were normalized by DESeq2. There are two groups in this data set. Group1 is samples 1-4, and group2 is samples 5-8. For the purposes of this example, we’ll say that group1 is “treated” and group2 is “untreated”.

x <- read.csv("two_group_normalized_counts.csv")
head(x)
#>                    X   sample_1   sample_2   sample_3   sample_4   sample_5
#> 1 ENSMUSG00000000001 1707.89280 1843.60723 1687.02686 1501.02388 1806.65191
#> 2 ENSMUSG00000000028   27.84608   30.17596   29.61100   69.66777   48.57703
#> 3 ENSMUSG00000000056  468.74232  517.30211  556.20667  404.54810  455.79719
#> 4 ENSMUSG00000000058   60.33317   81.90617   75.22795   51.45915  116.79157
#> 5 ENSMUSG00000000078 2706.87086 3110.99739 2504.93077 2310.91176 2468.12630
#> 6 ENSMUSG00000000085  158.95470  145.13198  226.48416  192.37806  248.05289
#>     sample_6   sample_7   sample_8
#> 1 1916.63101 1984.70503 2100.84931
#> 2   51.32071   42.77382   52.52123
#> 3  468.79492  497.03174  504.00183
#> 4   88.82430   71.86001   68.68161
#> 5 2762.43574 2557.01869 2697.77332
#> 6  224.03462  203.60336  198.97467

We need to format the data so that the gene ids in the first column are rownames, and not inside the data.frame. We can use the cf() function for this.

x <- cf(x)
head(x)
#>                      sample_1   sample_2   sample_3   sample_4   sample_5
#> ENSMUSG00000000001 1707.89280 1843.60723 1687.02686 1501.02388 1806.65191
#> ENSMUSG00000000028   27.84608   30.17596   29.61100   69.66777   48.57703
#> ENSMUSG00000000056  468.74232  517.30211  556.20667  404.54810  455.79719
#> ENSMUSG00000000058   60.33317   81.90617   75.22795   51.45915  116.79157
#> ENSMUSG00000000078 2706.87086 3110.99739 2504.93077 2310.91176 2468.12630
#> ENSMUSG00000000085  158.95470  145.13198  226.48416  192.37806  248.05289
#>                      sample_6   sample_7   sample_8
#> ENSMUSG00000000001 1916.63101 1984.70503 2100.84931
#> ENSMUSG00000000028   51.32071   42.77382   52.52123
#> ENSMUSG00000000056  468.79492  497.03174  504.00183
#> ENSMUSG00000000058   88.82430   71.86001   68.68161
#> ENSMUSG00000000078 2762.43574 2557.01869 2697.77332
#> ENSMUSG00000000085  224.03462  203.60336  198.97467

Now, we have to make a sample_info object to map our samples to their experimental factors.

sample_info <- data.frame(sample = colnames(x), group = c(rep("treated", 4), rep("untreated", 4)))
sample_info
#>     sample     group
#> 1 sample_1   treated
#> 2 sample_2   treated
#> 3 sample_3   treated
#> 4 sample_4   treated
#> 5 sample_5 untreated
#> 6 sample_6 untreated
#> 7 sample_7 untreated
#> 8 sample_8 untreated

Now we build the statomatic data set

sds <- make_sds(x, colData = sample_info, design = ~ group)
sds
#> class: StatomaticDataset 
#> dim: 100 8 
#> metadata(0):
#> assays(1): counts
#> rownames(100): ENSMUSG00000000001 ENSMUSG00000000028 ...
#>   ENSMUSG00000000787 ENSMUSG00000000791
#> rowData names(0):
#> colnames(8): sample_1 sample_2 ... sample_7 sample_8
#> colData names(2): sample group

And we run sds_analyze.

sds <- sds_analyze(sds)
#> sorting data 100% done...
#> t-test 100% done
#> Calculating fold changes...
#> Welch test 100% done
#> Calculating fold changes...
#> Wilcox-test 100% done
#> Calculating fold changes...

And we can extract the results by doing:

res <- get_results(sds)
head(res)
#>                          Pval   test log2FoldChange_treated_vs_untreated
#> ENSMUSG00000000001 0.02895439 t-test                         -0.21245532
#> ENSMUSG00000000056 0.88359469 t-test                          0.01577680
#> ENSMUSG00000000058 0.18825794 t-test                         -0.36421697
#> ENSMUSG00000000078 0.84664860 t-test                          0.02026956
#> ENSMUSG00000000085 0.12619557 t-test                         -0.27483781
#> ENSMUSG00000000088 0.82980304 t-test                         -0.02965276
#>                    treated_mean untreated_mean
#> ENSMUSG00000000001   1684.88769     1952.20932
#> ENSMUSG00000000056    486.69980      481.40642
#> ENSMUSG00000000058     67.23161       86.53937
#> ENSMUSG00000000078   2658.42770     2621.33851
#> ENSMUSG00000000085    180.73722      218.66639
#> ENSMUSG00000000088    593.82628      606.15792

You can access most results stored in sds with a function that starts with get_ and ends with the name of the result. To see a full list of results stored in sds, you can do this:

names(sds@results)
#> [1] "t_test_results"      "welch_test_results"  "wilcox_test_results"
#> [4] "all_results"         "ne"                  "nu"                 
#> [7] "nn"                  "foldchange"

And any of these can be accessed by sds@results$result_name

Notice that we used the same function (sds_analyze) to analyze both multi-group and two-group data. The sds_analyze function uses the design information supplied during the creation of sds to decide whether to apply two-group or multi-group tests.

We’re going to use the same sds_analyze function again for the next part of this vignette.

Part3: Multi-group one factor design

The data we use here is some species level data from a metagenomic analysis. There are 24 samples, and three experimental factors with 8 samples per group. For our purposes here, we’ll call the factors group1, 2, and 3. The pace of this analysis will be a bit faster than the previous one, as most steps are very similar.

#read in data
x <- read.csv("species_abundance.csv")
#fix rownames
x <- cf(x)
head(x)
#>                                  sample_1 sample_2 sample_3 sample_4 sample_5
#> Lachnospiraceae bacterium              14       33      245      224      167
#> Dubosiella newyorkensis               583      181      118      145      112
#> Parasutterella excrementihominis        4      241      431      327      343
#> Barnesiella sp. CU968                 261      146      168      138      128
#> Bacteroides thetaiotaomicron          356       74       79       91       95
#> Lachnospiraceae bacterium 28-4          2        2     1176        7       92
#>                                  sample_6 sample_7 sample_8 sample_9 sample_10
#> Lachnospiraceae bacterium               9      996       71      312        85
#> Dubosiella newyorkensis               118      665      131      121       117
#> Parasutterella excrementihominis       21      128      241      367       364
#> Barnesiella sp. CU968                 175      251      144      129       129
#> Bacteroides thetaiotaomicron          155       80      129       76        83
#> Lachnospiraceae bacterium 28-4          5     1170       43       19         4
#>                                  sample_11 sample_12 sample_13 sample_14
#> Lachnospiraceae bacterium               91      1731        22        12
#> Dubosiella newyorkensis                113       210       120       460
#> Parasutterella excrementihominis       246       255       393         8
#> Barnesiella sp. CU968                  393       130        25        15
#> Bacteroides thetaiotaomicron            75        82        77       371
#> Lachnospiraceae bacterium 28-4          42       272         0         3
#>                                  sample_15 sample_16 sample_17 sample_18
#> Lachnospiraceae bacterium              221       289        11        35
#> Dubosiella newyorkensis                117       105       114       737
#> Parasutterella excrementihominis       386       120       385         7
#> Barnesiella sp. CU968                  417       191       138       132
#> Bacteroides thetaiotaomicron           119        91        78        76
#> Lachnospiraceae bacterium 28-4         671        25         3         3
#>                                  sample_19 sample_20 sample_21 sample_22
#> Lachnospiraceae bacterium              168      1907        42       103
#> Dubosiella newyorkensis                728       123       245       220
#> Parasutterella excrementihominis         4       433         4       117
#> Barnesiella sp. CU968                  123       116       213       284
#> Bacteroides thetaiotaomicron            75       116       513       582
#> Lachnospiraceae bacterium 28-4          14       461         3         5
#>                                  sample_23 sample_24
#> Lachnospiraceae bacterium              136        28
#> Dubosiella newyorkensis                402        57
#> Parasutterella excrementihominis        10         1
#> Barnesiella sp. CU968                  123       420
#> Bacteroides thetaiotaomicron            81       721
#> Lachnospiraceae bacterium 28-4           8         4

#make sample_info
sample_info <- data.frame(sample = colnames(x), group = c(rep("group1", 8), rep("group2", 8), rep("group3", 8)))
sample_info
#>       sample  group
#> 1   sample_1 group1
#> 2   sample_2 group1
#> 3   sample_3 group1
#> 4   sample_4 group1
#> 5   sample_5 group1
#> 6   sample_6 group1
#> 7   sample_7 group1
#> 8   sample_8 group1
#> 9   sample_9 group2
#> 10 sample_10 group2
#> 11 sample_11 group2
#> 12 sample_12 group2
#> 13 sample_13 group2
#> 14 sample_14 group2
#> 15 sample_15 group2
#> 16 sample_16 group2
#> 17 sample_17 group3
#> 18 sample_18 group3
#> 19 sample_19 group3
#> 20 sample_20 group3
#> 21 sample_21 group3
#> 22 sample_22 group3
#> 23 sample_23 group3
#> 24 sample_24 group3

#make sds
sds <- make_sds(x, colData = sample_info, design = ~ group)
sds <- sds_analyze(sds)
#> sorting data 100% done...
#> Using one way anova 
#> anova 100% done
#> Tukey test 100% done
#> Calculating fold changes...
#> Welch test 100% done
#> DunnettT3 test 100% done
#> Calculating fold changes...
#> Kruskal-Wallis Test 63% done...Kruskal-Wallis Test 100% done
#> Dunn Test 63% done...Dunn Test 100% done
#> Calculating fold changes...

Notice here that statomatic tells us that it’s using a one way anova to analyze the normal-equal variance portion of the data. This is because of our specified design.

res <- get_results(sds)
head(res)
#>                                          Pval Pval_group1_vs_group2
#> Parasutterella excrementihominis 0.1991824896             0.8062039
#> Muricaecibacterium torontonense  0.0681265365             0.5464864
#> Muribaculaceae bacterium         0.0010328865             0.4923547
#> Sangeribacter muris              0.3985184169             0.8117231
#> Anaerotruncus colihominis        0.6083336357             0.7787006
#> Bacteroides sp. CAG:927          0.0009232984             1.0000000
#>                                  log2FoldChange_group1_vs_group2
#> Parasutterella excrementihominis                      -0.3011695
#> Muricaecibacterium torontonense                       -0.5635121
#> Muribaculaceae bacterium                               0.6818240
#> Sangeribacter muris                                    0.2115041
#> Anaerotruncus colihominis                              0.4340958
#> Bacteroides sp. CAG:927                                0.0000000
#>                                  Pval_group1_vs_group3
#> Parasutterella excrementihominis           0.461211401
#> Muricaecibacterium torontonense            0.357348922
#> Muribaculaceae bacterium                   0.014194068
#> Sangeribacter muris                        0.727643346
#> Anaerotruncus colihominis                  0.949127148
#> Bacteroides sp. CAG:927                    0.002511996
#>                                  log2FoldChange_group1_vs_group3
#> Parasutterella excrementihominis                       0.8531586
#> Muricaecibacterium torontonense                        1.4364879
#> Muribaculaceae bacterium                              -1.0093379
#> Sangeribacter muris                                   -0.2250666
#> Anaerotruncus colihominis                             -0.1610624
#> Bacteroides sp. CAG:927                               -1.6455040
#>                                  Pval_group2_vs_group3
#> Parasutterella excrementihominis           0.182048409
#> Muricaecibacterium torontonense            0.055825056
#> Muribaculaceae bacterium                   0.000975412
#> Sangeribacter muris                        0.366905054
#> Anaerotruncus colihominis                  0.593457381
#> Bacteroides sp. CAG:927                    0.002511996
#>                                  log2FoldChange_group2_vs_group3 group1_mean
#> Parasutterella excrementihominis                       1.1543281     217.000
#> Muricaecibacterium torontonense                        2.0000000     163.750
#> Muribaculaceae bacterium                              -1.6911619      38.500
#> Sangeribacter muris                                   -0.4365707      19.250
#> Anaerotruncus colihominis                             -0.5951583      15.875
#> Bacteroides sp. CAG:927                               -1.6455040       8.750
#>                                  group2_mean group3_mean       test
#> Parasutterella excrementihominis     267.375     120.125 tukey-test
#> Muricaecibacterium torontonense      242.000      60.500 tukey-test
#> Muribaculaceae bacterium              24.000      77.500 tukey-test
#> Sangeribacter muris                   16.625      22.500 tukey-test
#> Anaerotruncus colihominis             11.750      17.750 tukey-test
#> Bacteroides sp. CAG:927                8.750      27.375 tukey-test