The main resources used to learn tidymodels were Alison Hill’s slides from Introduction to Machine Learning with the Tidyverse, the course she prepared with Garrett Grolemund for RStudio::conf(2020), and Edgar Ruiz’s Gentle introduction to tidymodels on the RStudio website.

Objectives


 KEGG enzyme abundances from the Sixth Hospital metagenome data are used as the input for the analysis.


 The results below demonstrate the comparison between the diagnostic groups (NC vs. AD; aMCI samples are excluded in this analysis).


 To obtain variable importance scores from the machine learning models.

Getting set up

First we need to load some libraries: tidymodels and tidyverse, plus a few helper packages used below.

# load the relevant tidymodels libraries
library(tidymodels)
library(tidyverse)
library(workflows)
library(tune)
library(here)
library(readxl)
library("Hotelling")
library(DT)
library(recipeselectors)
library("yardstick")
library(RCurl)
library(xgboost)
library(glmnet)
# load the dataset
df <- read.csv("./data/data/transform_data_KO.csv")

Clean data

# drop the aMCI samples so the comparison is NC vs. AD
df.filter <- subset(df, Diagnosis != "aMCI")
df.filter$Diagnosis <- factor(df.filter$Diagnosis)
target_group <- unique(df.filter$Diagnosis) %>% as.character()

# drop the first column and make sure the outcome column is named "Diagnosis"
df.filter.clean <- df.filter[, -c(1)]
colnames(df.filter.clean)[1] <- "Diagnosis"

# you can restrict this vector if you only want to use a subset of columns
df.filter.clean_colnames <- colnames(df.filter.clean)

# reset the row names and add a pseudocount of 1 so the later log transform is defined for zero counts
rownames(df.filter.clean) <- NULL
df.filter.clean <- df.filter.clean %>% mutate_if(is.numeric, function(x) { x + 1 })

Select top 20 features

#####################
# Define a recipe on the full (filtered) data
sixthhospital_recipe_all <-
  # which consists of the formula (outcome ~ predictors)
  recipe(Diagnosis ~ ., data = df.filter.clean) %>%
  # and some pre-processing steps
  step_log(all_numeric()) %>%
  # Hotelling::clr() %>%
  step_normalize(all_numeric())

sixthhospital_all_top20 <- sixthhospital_recipe_all %>%
  # keep the 20 predictors with the highest ROC AUC against the outcome
  step_select_roc(all_predictors(), top_p = 20, outcome = "Diagnosis") %>%
  # apply the recipe to the training data
  prep() %>%
  # extract the pre-processed training dataset
  juice()

colnames_top <- names(sixthhospital_all_top20)

# keep only the selected top-20 columns (plus Diagnosis) in the unprocessed data
df.filter.clean.top <- df.filter.clean[, colnames_top]

Split train and test data

set.seed(123)
# split the data into training (67%) and testing (33%)
sixthhospital_split <- initial_split(df.filter.clean.top, prop = 2/3)
sixthhospital_split
## <Analysis/Assess/Total>
## <30/14/44>
# extract training and testing sets
sixthhospital_train <- training(sixthhospital_split)
sixthhospital_test <- testing(sixthhospital_split)
#####################

Define a new recipe

sixthhospital_recipe <- 
  # which consists of the formula (outcome ~ predictors)
  recipe(Diagnosis ~ ., data = sixthhospital_train) %>%
  # and some pre-processing steps
  step_log(all_numeric()) %>%
  # Hotelling::clr() %>% 
  step_normalize(all_numeric()) %>%  
  prep(sixthhospital_train) 
  # %>% juice()

# bake the recipe to get the preprocessed train and test data
sixthhospital_train_preprocessed <- bake(sixthhospital_recipe, sixthhospital_train)
sixthhospital_test_preprocessed <- bake(sixthhospital_recipe, sixthhospital_test)

# create CV object from training data
set.seed(1234)
sixthhospital_cv <- vfold_cv(sixthhospital_train, v = 5, repeats = 10, strata = "Diagnosis")

Method 1

Specify the model

So far we’ve split our data into training/testing, and we’ve specified our pre-processing steps using a recipe. The next thing we want to specify is our model (using the parsnip package).

Parsnip offers a unified interface for the massive variety of models that exist in R. This means that you only have to learn one way of specifying a model, and you can use this specification and have it generate a linear model, a random forest model, a support vector machine model, and more with a single line of code.
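For example, switching between model families only changes the top-level function and (possibly) the engine; the pipe-based pattern stays the same. A minimal sketch (these particular specifications are not used elsewhere in this analysis):

# three different model families, one common specification pattern
rf_spec    <- rand_forest(mtry = 3) %>% set_engine("ranger") %>% set_mode("classification")
svm_spec   <- svm_poly(degree = 2) %>% set_engine("kernlab") %>% set_mode("classification")
lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>% set_engine("glmnet") %>% set_mode("classification")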

There are a few primary components that you need to provide for the model specification:

  1. The model type: what kind of model you want to fit, set using a different function depending on the model, such as rand_forest() for random forest, logistic_reg() for logistic regression, svm_poly() for a polynomial SVM model, etc. The full list of models available via parsnip can be found on the parsnip website.

  2. The arguments: the model parameter values (now consistently named across different models), set using set_args().

  3. The engine: the underlying package the model should come from (e.g. “ranger” for the ranger implementation of Random Forest), set using set_engine().

  4. The mode: the type of prediction - since several packages can do both classification (binary/categorical prediction) and regression (continuous prediction), set using set_mode().

For instance, if we want to fit a random forest model as implemented by the ranger package for the purpose of classification and we want to tune the mtry parameter (the number of randomly selected variables to be considered at each split in the trees), then we would define the following model specification:

rf_model <- 
  # specify that the model is a random forest
  rand_forest() %>%
  # specify that the `mtry` parameter needs to be tuned
  set_args(mtry = tune()) %>%
  # select the engine/package that underlies the model
  set_engine("ranger", importance = "impurity") %>%
  # choose either the continuous regression or binary classification mode
  set_mode("classification") 

If you want to be able to examine the variable importance of your final model later, you will need to set the importance argument when setting the engine. For ranger, the importance options are "impurity" or "permutation".
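For example, a permutation-based variant of the same random forest specification would only change the engine arguments (a minimal sketch; the rest of this analysis sticks with impurity importance):

rf_model_perm <-
  rand_forest() %>%
  set_args(mtry = tune()) %>%
  # permutation importance is slower but typically less biased than impurity importance
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification")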

As another example, the following code would instead specify a logistic regression model from the glm package.

lr_model <- 
  # specify that the model is a logistic regression
  logistic_reg() %>%
  # select the engine/package that underlies the model
  set_engine("glm") %>%
  # choose either the continuous regression or binary classification mode
  set_mode("classification") 

Note that this code doesn’t actually fit the model. Like the recipe, it just outlines a description of the model. Moreover, setting a parameter to tune() means that it will be tuned later in the tune stage of the pipeline (i.e. the value of the parameter that yields the best performance will be chosen). You could also just specify a particular value of the parameter if you don’t want to tune it e.g. using set_args(mtry = 4).
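For instance, a fixed-parameter version of the random forest specification (a minimal sketch, shown only for comparison) would skip the tuning step entirely:

rf_model_fixed <-
  rand_forest() %>%
  # fix mtry rather than tuning it
  set_args(mtry = 4) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")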

Put it all together in a workflow

We’re now ready to put the model and recipes together into a workflow. You initiate a workflow using workflow() (from the workflows package) and then you can add a recipe and add a model to it.

# set the workflow
rf_workflow <- workflow() %>%
  # add the recipe
  add_recipe(sixthhospital_recipe) %>%
  # add the model
  add_model(rf_model)

Note that we still haven’t yet implemented the pre-processing steps in the recipe nor have we fit the model. We’ve just written the framework. It is only when we tune the parameters or fit the model that the recipe and model frameworks are actually implemented.

Tune the parameters

Since we had a parameter that we designated to be tuned (mtry), we need to tune it (i.e. choose the value that leads to the best performance) before fitting our model. If you don’t have any parameters to tune, you can skip this step.

Note that we will do our tuning using the cross-validation object (sixthhospital_cv). To do this, we specify the range of mtry values we want to try, and then we add a tuning layer to our workflow using tune_grid() (from the tune package). Note that we focus on two metrics: accuracy and roc_auc (from the yardstick package).

# specify which values to try
rf_grid <- expand.grid(mtry = c(3,4,5))
# extract results
rf_tune_results <- rf_workflow %>%
  tune_grid(resamples = sixthhospital_cv, #CV object
            grid = rf_grid, # grid of values to try
            metrics = metric_set(accuracy, roc_auc) # metrics we care about
            )

You can tune multiple parameters at once by providing multiple parameters to the expand.grid() function, e.g. expand.grid(mtry = c(3, 4, 5), trees = c(100, 500)).
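Note that each extra parameter also needs a tune() placeholder in the model specification; tune_grid() matches the grid columns to these placeholders by name. A minimal sketch of a two-parameter setup (not run here):

# both mtry and trees are flagged for tuning
rf_model_2param <-
  rand_forest() %>%
  set_args(mtry = tune(), trees = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# grid columns must be named after the tuned parameters
rf_grid_2param <- expand.grid(mtry = c(3, 4, 5), trees = c(100, 500))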

It’s always a good idea to explore the results of the cross-validation. collect_metrics() is a really handy function that can be used in a variety of circumstances to extract any metrics that have been calculated within the object it’s being used on. In this case, the metrics come from the cross-validation performance across the different values of the parameters.

# print results
rf_tune_results %>% collect_metrics()
## # A tibble: 6 x 7
##    mtry .metric  .estimator  mean     n std_err .config
##   <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>  
## 1     3 accuracy binary     0.823    50  0.0193 Model1 
## 2     3 roc_auc  binary     0.898    50  0.0202 Model1 
## 3     4 accuracy binary     0.812    50  0.0234 Model2 
## 4     4 roc_auc  binary     0.893    50  0.0210 Model2 
## 5     5 accuracy binary     0.814    50  0.0205 Model3 
## 6     5 roc_auc  binary     0.886    50  0.0213 Model3

Across both accuracy and AUC, mtry = 3 yields the best performance (just).
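If you only want to see the top-ranked candidates for a given metric, show_best() (also from the tune package) is a convenient shortcut; a minimal sketch:

# show the top parameter values ranked by cross-validated ROC AUC
rf_tune_results %>% show_best(metric = "roc_auc", n = 3)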

Finalize the workflow

We want to add a layer to our workflow that corresponds to the tuned parameter, i.e. sets mtry to be the value that yielded the best results. If you didn’t tune any parameters, you can skip this step.

We can extract the best value for the roc_auc metric by applying the select_best() function to the tune object.

param_final <- rf_tune_results %>% select_best(metric = "roc_auc")
param_final
## # A tibble: 1 x 2
##    mtry .config
##   <dbl> <chr>  
## 1     3 Model1

Then we can add this parameter to the workflow using the finalize_workflow() function.

rf_workflow <- rf_workflow %>%
  finalize_workflow(param_final)

Evaluate the model on the test set

Now we’ve defined our recipe, our model, and tuned the model’s parameters, we’re ready to actually fit the final model. Since all of this information is contained within the workflow object, we will apply the last_fit() function to our workflow and our train/test split object. This will automatically train the model specified by the workflow using the training data, and produce evaluations based on the test set.

rf_fit <- rf_workflow %>%
  # fit on the training set and evaluate on test set
  last_fit(sixthhospital_split)

Note that the fit object that is created is a data-frame-like object; specifically, it is a tibble with list columns.

rf_fit
## # Resampling results
## # Monte Carlo cross-validation (0.68/0.32) with 1 resamples  
## # A tibble: 1 x 6
##   splits       id           .metrics      .notes       .predictions    .workflow
##   <list>       <chr>        <list>        <list>       <list>          <list>   
## 1 <split [30/… train/test … <tibble [2 ×… <tibble [0 … <tibble [14 × … <workflo…

This is a really nice feature of tidymodels (and is what makes it work so nicely with the tidyverse) since you can do all of your tidyverse operations to the model object. While truly taking advantage of this flexibility requires proficiency with purrr, if you don’t want to deal with purrr and list-columns, there are functions that can extract the relevant information from the fit object that remove the need for purrr as we will see below.
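For example, the per-resample predictions live in a list column called .predictions, and purrr-style code can pull them out directly; a minimal sketch that is broadly equivalent to the collect_predictions() call used below:

# pull the .predictions list column and take the tibble for the single train/test split
rf_fit %>%
  dplyr::pull(.predictions) %>%
  purrr::pluck(1)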

Since we supplied the train/test object when we fit the workflow, the metrics are evaluated on the test set. Now when we use the collect_metrics() function (recall we used this when tuning our parameters), it extracts the performance of the final model (since rf_fit now consists of a single final model) applied to the test set.

test_performance <- rf_fit %>% collect_metrics()
test_performance
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.786
## 2 roc_auc  binary         0.857

Overall the performance is good: an accuracy of 0.786 and a ROC AUC of 0.857 on the test set.

You can also extract the test set predictions themselves using the collect_predictions() function. Note that there are 14 rows in the predictions object below, which matches the number of test set observations (just to give you some evidence that these are based on the test set rather than the training set).

# generate predictions from the test set
test_predictions <- rf_fit %>% collect_predictions()
# store the predicted probability of the first target group in a column called Value
test_predictions$Value <- test_predictions[, c(paste0(".pred_", target_group[1]))] %>% unlist()
# test_predictions <- rf_fit %>% pull(.predictions) ## purrr alternative
test_predictions
## # A tibble: 14 x 7
##    id               .pred_AD .pred_NC  .row .pred_class Diagnosis  Value
##    <chr>               <dbl>    <dbl> <int> <fct>       <fct>      <dbl>
##  1 train/test split   0.758     0.242     3 AD          AD        0.758 
##  2 train/test split   0.897     0.103     5 AD          AD        0.897 
##  3 train/test split   0.831     0.169     9 AD          AD        0.831 
##  4 train/test split   0.376     0.624    14 NC          AD        0.376 
##  5 train/test split   0.152     0.848    15 NC          NC        0.152 
##  6 train/test split   0.541     0.459    25 AD          NC        0.541 
##  7 train/test split   0.414     0.586    26 NC          NC        0.414 
##  8 train/test split   0.151     0.849    27 NC          NC        0.151 
##  9 train/test split   0.679     0.321    28 AD          AD        0.679 
## 10 train/test split   0.805     0.195    31 AD          NC        0.805 
## 11 train/test split   0.856     0.144    36 AD          AD        0.856 
## 12 train/test split   0.0801    0.920    37 NC          NC        0.0801
## 13 train/test split   0.341     0.659    38 NC          NC        0.341 
## 14 train/test split   0.521     0.479    42 AD          AD        0.521

Since this is just a normal data frame/tibble object, we can generate summaries and plots such as a confusion matrix.

# generate a confusion matrix
cm <- test_predictions %>% 
  conf_mat(truth = Diagnosis, estimate = .pred_class)
autoplot(cm, type = "heatmap")

We could also plot the distribution of the predicted probabilities for each class.

test_predictions %>%
  ggplot(aes(x = Value, fill = Diagnosis)) +
  geom_density(alpha = 0.5)

# ROC curve for the test-set predictions
roc_curve(test_predictions, Diagnosis, Value) %>% autoplot()

Fitting and using your final model (you may skip the steps below; they are intended for external validation)

The previous section evaluated the model trained on the training data using the testing data. But once you’ve determined your final model, you often want to train it on your full dataset and then use it to predict the response for new data.

If you want to use your model to predict the response for new observations, you need to use the fit() function on your workflow and the dataset that you want to fit the final model on (e.g. the complete training + testing dataset). Below we use our own original dataset, but ideally this would be done on an external dataset.

final_model <- fit(rf_workflow, df.filter.clean)

The final_model object contains a few things, including the ranger object trained with the parameters established through the workflow contained in rf_workflow, based on the data in df.filter.clean (the combined training and testing data).

final_model
## ══ Workflow [trained] ═══════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## ● step_log()
## ● step_normalize()
## 
## ── Model ────────────────────────────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(formula = ..y ~ ., data = data, mtry = ~3, importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      44 
## Number of independent variables:  20 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1559749
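With this trained workflow, scoring new observations is then a single predict() call; the recipe inside the workflow takes care of the log and normalization steps. A minimal sketch, reusing the held-out test rows as a stand-in for genuinely new samples (in practice this would be an external cohort):

# predicted class and class probabilities for "new" samples
predict(final_model, new_data = sixthhospital_test)
predict(final_model, new_data = sixthhospital_test, type = "prob")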

Variable importance

If you want to extract the variable importance scores from your model, as far as I can tell, for now you need to extract the model object from the fit() object (which for us is called final_model). The function that extracts the model is pull_workflow_fit() and then you need to grab the fit object that the output contains.

ranger_obj <- pull_workflow_fit(final_model)$fit
ranger_obj
## Ranger result
## 
## Call:
##  ranger::ranger(formula = ..y ~ ., data = data, mtry = ~3, importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      44 
## Number of independent variables:  20 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1559749

Then you can extract the variable importance from the ranger object itself (variable.importance is a specific object contained within ranger output - this will need to be adapted for the specific object type of other models).

# convert the named importance vector into a data frame with feature and value columns
ranger_obj_importance <- ranger_obj$variable.importance %>% data.frame()
ranger_obj_importance$feature <- rownames(ranger_obj_importance)
ranger_obj_importance$value <- ranger_obj_importance$.
ranger_obj_importance$. <- NULL
print(ranger_obj_importance$feature)
##  [1] "K00052" "K00210" "K00836" "K01226" "K02082" "K02745" "K02747" "K02818"
##  [9] "K03465" "K03722" "K03802" "K03828" "K05895" "K06919" "K07148" "K07506"
## [17] "K11063" "K13275" "K13282" "K20151"
# ko<-c("ko00010","ko00020","ko00030")
ko<- ranger_obj_importance$feature
# get name
print(length(ko))  
## [1] 20
ko_names <- c()
for (i in ko){
  # ko_name <- getURL(paste0("http://togows.dbcls.jp/entry/pathway/",i,"/name"))
  ko_name <- getURL(paste0("http://togows.org/entry/kegg-enzyme/",i,"/name"))
  ko_names <- c(ko_names,ko_name)
}
ko_names <- gsub("\n","",ko_names)
ranger_obj_importance$ko_names <- ko_names
c2 <- ggplot(ranger_obj_importance, aes(x = reorder(ko_names, abs(value)), y = value)) +
  geom_bar(stat = "identity", position = position_dodge(), fill = "red") +
  coord_flip() +
  theme_bw() + theme_minimal() +
  theme(legend.position = "right", axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 0), axis.title.y = element_blank()) +
  theme(axis.title.x = element_text(size = 5)) +
  theme(axis.text.x = element_text(size = 6, color = "black"),
        axis.text.y = element_text(size = 10, color = "black")) +
  ggtitle(label = paste0(target_group, collapse = "|"))
c2

DT::datatable(ranger_obj_importance)

Method 2

# Model training: fit several classifiers directly on the preprocessed training data
df_training <- sixthhospital_train_preprocessed
df_testing <- sixthhospital_test_preprocessed
df_ranger <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("ranger") %>%
  fit(Diagnosis ~ ., data = df_training)

df_rf <-  rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest",importance=T) %>%
  fit(Diagnosis ~ ., data = df_training)

df_lr <- logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(Diagnosis ~ ., data = df_training)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# df_lr1 <-logistic_reg(mode = "classification") %>%
#    set_engine("glmnet") %>%
#   fit(Diagnosis ~ ., data = df_training)

df_bt <- boost_tree(mode = "classification") %>%
  set_engine("xgboost") %>%
  fit(Diagnosis ~ ., data = df_training)

#Predictions
predict(df_ranger, df_testing)
## # A tibble: 14 x 1
##    .pred_class
##    <fct>      
##  1 AD         
##  2 AD         
##  3 AD         
##  4 NC         
##  5 NC         
##  6 AD         
##  7 NC         
##  8 NC         
##  9 AD         
## 10 AD         
## 11 AD         
## 12 NC         
## 13 NC         
## 14 AD
df_ranger %>%
  predict(df_testing) %>%
  bind_cols(df_testing) %>%
  glimpse()
## Rows: 14
## Columns: 22
## $ .pred_class <fct> AD, AD, AD, NC, NC, AD, NC, NC, AD, AD, AD, NC, NC, AD
## $ K00052      <dbl> 0.7809951, 0.7154290, 0.4955381, -1.9531014, -0.6582282, …
## $ K00210      <dbl> 0.04263319, 0.48673764, -0.30707079, 0.58210332, 1.060592…
## $ K00836      <dbl> 1.18039366, 0.30308450, 0.68858184, -0.29729681, -2.01205…
## $ K01226      <dbl> 0.6187337, 0.4047961, 0.1071349, -0.1536828, -0.4446155, …
## $ K02082      <dbl> 0.182809072, 0.216945332, 0.004933035, -0.078061360, -0.5…
## $ K02745      <dbl> -0.14280003, 0.49761653, 0.13336357, -0.91186224, -2.0340…
## $ K02747      <dbl> -0.22619636, 0.33823247, -0.57639907, -1.17783541, -2.489…
## $ K02818      <dbl> 0.773034643, 0.441263044, 0.832145189, -0.317001082, -1.0…
## $ K03465      <dbl> 1.1573995, 0.7557580, 0.4755377, 1.0905615, -1.2985911, 1…
## $ K03722      <dbl> -0.0008661466, 0.3697163702, 0.2065728392, 0.3292798358, …
## $ K03802      <dbl> 1.6063919, 0.3594838, 1.0415371, -1.2850429, -1.2850429, …
## $ K03828      <dbl> -0.475451865, -0.067380613, -0.006041819, 0.745534150, -0…
## $ K05895      <dbl> 0.5955950764, 1.2858381255, 1.1834216895, -1.9622380674, …
## $ K06919      <dbl> 1.13979314, 0.56840099, 0.88737509, -0.47433355, -0.87809…
## $ K07148      <dbl> -0.51552323, -1.28366872, -0.95999259, 0.82430696, 0.5110…
## $ K07506      <dbl> -0.1336175, 0.4644781, -0.2274853, 0.6614558, 0.1908874, …
## $ K11063      <dbl> 1.36504163, -0.65849286, -0.06322721, 0.12384402, 1.08145…
## $ K13275      <dbl> 0.83256702, 1.36381334, 2.99535382, -1.23227870, -1.23227…
## $ K13282      <dbl> 1.28817709, 0.58632296, 0.33411787, 1.02720776, -1.452071…
## $ K20151      <dbl> 0.41135829, 0.96504509, 0.60437360, -0.54268371, -1.77115…
## $ Diagnosis   <fct> AD, AD, AD, AD, NC, NC, NC, NC, AD, NC, AD, NC, NC, AD
df_lr %>%
  predict(df_testing) %>%
  bind_cols(df_testing) %>%
  glimpse()
## Rows: 14
## Columns: 22
## $ .pred_class <fct> AD, AD, AD, AD, NC, NC, AD, NC, NC, AD, NC, AD, NC, AD
## $ K00052      <dbl> 0.7809951, 0.7154290, 0.4955381, -1.9531014, -0.6582282, …
## $ K00210      <dbl> 0.04263319, 0.48673764, -0.30707079, 0.58210332, 1.060592…
## $ K00836      <dbl> 1.18039366, 0.30308450, 0.68858184, -0.29729681, -2.01205…
## $ K01226      <dbl> 0.6187337, 0.4047961, 0.1071349, -0.1536828, -0.4446155, …
## $ K02082      <dbl> 0.182809072, 0.216945332, 0.004933035, -0.078061360, -0.5…
## $ K02745      <dbl> -0.14280003, 0.49761653, 0.13336357, -0.91186224, -2.0340…
## $ K02747      <dbl> -0.22619636, 0.33823247, -0.57639907, -1.17783541, -2.489…
## $ K02818      <dbl> 0.773034643, 0.441263044, 0.832145189, -0.317001082, -1.0…
## $ K03465      <dbl> 1.1573995, 0.7557580, 0.4755377, 1.0905615, -1.2985911, 1…
## $ K03722      <dbl> -0.0008661466, 0.3697163702, 0.2065728392, 0.3292798358, …
## $ K03802      <dbl> 1.6063919, 0.3594838, 1.0415371, -1.2850429, -1.2850429, …
## $ K03828      <dbl> -0.475451865, -0.067380613, -0.006041819, 0.745534150, -0…
## $ K05895      <dbl> 0.5955950764, 1.2858381255, 1.1834216895, -1.9622380674, …
## $ K06919      <dbl> 1.13979314, 0.56840099, 0.88737509, -0.47433355, -0.87809…
## $ K07148      <dbl> -0.51552323, -1.28366872, -0.95999259, 0.82430696, 0.5110…
## $ K07506      <dbl> -0.1336175, 0.4644781, -0.2274853, 0.6614558, 0.1908874, …
## $ K11063      <dbl> 1.36504163, -0.65849286, -0.06322721, 0.12384402, 1.08145…
## $ K13275      <dbl> 0.83256702, 1.36381334, 2.99535382, -1.23227870, -1.23227…
## $ K13282      <dbl> 1.28817709, 0.58632296, 0.33411787, 1.02720776, -1.452071…
## $ K20151      <dbl> 0.41135829, 0.96504509, 0.60437360, -0.54268371, -1.77115…
## $ Diagnosis   <fct> AD, AD, AD, AD, NC, NC, NC, NC, AD, NC, AD, NC, NC, AD
df_bt %>%
  predict(df_testing) %>%
  bind_cols(df_testing) %>%
  glimpse()
## Rows: 14
## Columns: 22
## $ .pred_class <fct> AD, AD, AD, NC, NC, AD, AD, NC, AD, AD, AD, NC, NC, AD
## $ K00052      <dbl> 0.7809951, 0.7154290, 0.4955381, -1.9531014, -0.6582282, …
## $ K00210      <dbl> 0.04263319, 0.48673764, -0.30707079, 0.58210332, 1.060592…
## $ K00836      <dbl> 1.18039366, 0.30308450, 0.68858184, -0.29729681, -2.01205…
## $ K01226      <dbl> 0.6187337, 0.4047961, 0.1071349, -0.1536828, -0.4446155, …
## $ K02082      <dbl> 0.182809072, 0.216945332, 0.004933035, -0.078061360, -0.5…
## $ K02745      <dbl> -0.14280003, 0.49761653, 0.13336357, -0.91186224, -2.0340…
## $ K02747      <dbl> -0.22619636, 0.33823247, -0.57639907, -1.17783541, -2.489…
## $ K02818      <dbl> 0.773034643, 0.441263044, 0.832145189, -0.317001082, -1.0…
## $ K03465      <dbl> 1.1573995, 0.7557580, 0.4755377, 1.0905615, -1.2985911, 1…
## $ K03722      <dbl> -0.0008661466, 0.3697163702, 0.2065728392, 0.3292798358, …
## $ K03802      <dbl> 1.6063919, 0.3594838, 1.0415371, -1.2850429, -1.2850429, …
## $ K03828      <dbl> -0.475451865, -0.067380613, -0.006041819, 0.745534150, -0…
## $ K05895      <dbl> 0.5955950764, 1.2858381255, 1.1834216895, -1.9622380674, …
## $ K06919      <dbl> 1.13979314, 0.56840099, 0.88737509, -0.47433355, -0.87809…
## $ K07148      <dbl> -0.51552323, -1.28366872, -0.95999259, 0.82430696, 0.5110…
## $ K07506      <dbl> -0.1336175, 0.4644781, -0.2274853, 0.6614558, 0.1908874, …
## $ K11063      <dbl> 1.36504163, -0.65849286, -0.06322721, 0.12384402, 1.08145…
## $ K13275      <dbl> 0.83256702, 1.36381334, 2.99535382, -1.23227870, -1.23227…
## $ K13282      <dbl> 1.28817709, 0.58632296, 0.33411787, 1.02720776, -1.452071…
## $ K20151      <dbl> 0.41135829, 0.96504509, 0.60437360, -0.54268371, -1.77115…
## $ Diagnosis   <fct> AD, AD, AD, AD, NC, NC, NC, NC, AD, NC, AD, NC, NC, AD
#Model Validation
df_ranger %>%
  predict(df_testing) %>%
  bind_cols(df_testing) %>%
  metrics(truth = Diagnosis, estimate = .pred_class)
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.786
## 2 kap      binary         0.571
df_rf %>%
  predict(df_testing) %>%
  bind_cols(df_testing) %>%
  metrics(truth = Diagnosis, estimate = .pred_class)
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.786
## 2 kap      binary         0.571
df_lr %>%
  predict(df_testing) %>%
  bind_cols(df_testing) %>%
  metrics(truth = Diagnosis, estimate = .pred_class)
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.643
## 2 kap      binary         0.286
# df_lr1 %>%
#   predict(df_testing) %>%
#   bind_cols(df_testing) %>%
#   metrics(truth = Diagnosis, estimate = .pred_class)

df_bt %>%
  predict(df_testing) %>%
  bind_cols(df_testing) %>%
  metrics(truth = Diagnosis, estimate = .pred_class)
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.714
## 2 kap      binary         0.429
# Per-classifier metrics
df_probs <- df_ranger %>%
  predict(df_testing, type = "prob") %>%
  bind_cols(df_testing)

roc_fig <- predict(df_ranger, df_testing, type = "prob") %>%
    bind_cols(predict(df_ranger, df_testing)) %>%
    bind_cols(dplyr::select(df_testing, Diagnosis))
# V1 holds the predicted probability of the first class, used below for log loss and the ROC curve
roc_fig$V1 <- as.data.frame(roc_fig)[, 1]

roc_fig_lr <- predict(df_lr, df_testing, type = "prob") %>%
    bind_cols(predict(df_lr, df_testing)) %>%
    bind_cols(dplyr::select(df_testing, Diagnosis))
roc_fig_lr$V1 <- as.data.frame(roc_fig_lr)[,1]

# roc_fig_lr1 <- multi_predict(df_lr1, df_testing, type = "prob") %>%
#     bind_cols(multi_predict(df_lr1, df_testing)) %>%
#     bind_cols(dplyr::select(df_testing, Diagnosis))
# roc_fig_lr1$V1 <- as.data.frame(roc_fig_lr1)[,1]

roc_fig_bt <- predict(df_bt, df_testing, type = "prob") %>%
    bind_cols(predict(df_bt, df_testing)) %>%
    bind_cols(dplyr::select(df_testing, Diagnosis))
roc_fig_bt$V1 <- as.data.frame(roc_fig_bt)[,1]

# get accuracy, kappa, log loss and ROC AUC for each model
# predict(df_ranger, df_testing, type = "prob") %>%
#     bind_cols(predict(df_ranger, df_testing)) %>%
#     bind_cols(select(df_testing, Diagnosis)) %>%
yardstick::metrics(roc_fig, Diagnosis,  estimate = .pred_class, V1)
## # A tibble: 4 x 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 accuracy    binary         0.786
## 2 kap         binary         0.571
## 3 mn_log_loss binary         0.498
## 4 roc_auc     binary         0.816
yardstick::metrics(roc_fig_lr, Diagnosis,  estimate = .pred_class, V1)
## # A tibble: 4 x 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 accuracy    binary         0.643
## 2 kap         binary         0.286
## 3 mn_log_loss binary        10.4  
## 4 roc_auc     binary         0.714
# yardstick::metrics(roc_fig_lr1, Diagnosis,  estimate = .pred_class, V1)
yardstick::metrics(roc_fig_bt, Diagnosis,  estimate = .pred_class, V1)
## # A tibble: 4 x 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 accuracy    binary         0.714
## 2 kap         binary         0.429
## 3 mn_log_loss binary         0.521
## 4 roc_auc     binary         0.806
#plot ROC
roc_curve(roc_fig, Diagnosis,V1 )%>% autoplot()

roc_curve(roc_fig_lr, Diagnosis,V1 )%>% autoplot()

# roc_curve(roc_fig_lr1, Diagnosis,V1 )%>% autoplot()
roc_curve(roc_fig_bt, Diagnosis,V1 )%>% autoplot()

# feature importance from the randomForest model (mean decrease in accuracy)
ko <- rownames(df_rf$fit$importance)
ko_names <- c()
for (i in ko){
  # ko_name <- getURL(paste0("http://togows.dbcls.jp/entry/pathway/",i,"/name"))
  ko_name <- getURL(paste0("http://togows.org/entry/kegg-enzyme/",i,"/name"))
  ko_names <- c(ko_names,ko_name)
}
ko_names <- gsub("\n","",ko_names)
df_rf_imp <- cbind(rownames(df_rf$fit$importance), df_rf$fit$importance, ko_names) %>%
  as_tibble() %>%
  dplyr::select(predictor = V1, MeanDecreaseAccuracy, enzyme = ko_names) %>%
  mutate(MeanDecreaseAccuracy = as.numeric(MeanDecreaseAccuracy)) %>%
  arrange(desc(MeanDecreaseAccuracy))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
datatable(df_rf_imp)