The main resources for learning tidymodels were Alison Hill’s slides from Introduction to Machine Learning with the Tidyverse, which contain all of the slides for the course she prepared with Garrett Grolemund for RStudio::conf(2020), and Edgar Ruiz’s Gentle introduction to tidymodels on the RStudio website.
KEGG enzyme abundances from the sixth hospital metagenome data are used as the input for this analysis.
The results below show the comparison between groups (NC, AD, and aMCI).
The goal is to obtain variable importance via machine learning.
First we need to load some libraries: tidymodels and tidyverse.
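For reference, a minimal sketch of the setup (RCurl is assumed to be available as well, since getURL() is used later to fetch KEGG enzyme names):
library(tidymodels)
library(tidyverse)
library(RCurl)   # provides getURL(), used below to query togows.org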
df.filter <- subset(df, Diagnosis != "aMCI")
df.filter$Diagnosis<-factor(df.filter$Diagnosis)
target_group<-unique(df.filter$Diagnosis) %>% as.character()
df.filter.clean <- df.filter[,-c(1)]
colnames(df.filter.clean)[1] <- "Diagnosis"
# you can select only the columns you want to use here
df.filter.clean_colnames <- colnames(df.filter.clean)
# reset the row names of df.filter.clean
rownames(df.filter.clean) <- NULL
# add a pseudo-count of 1 to all numeric columns so the log transform below is defined for zeros
df.filter.clean <- df.filter.clean %>% mutate_if(is.numeric, function(x){x + 1})
#####################
# Define a recipe
sixthhospital_recipe_all <-
# which consists of the formula (outcome ~ predictors)
recipe(Diagnosis ~ ., data = df.filter.clean) %>%
# and some pre-processing steps
step_log(all_numeric()) %>%
# Hotelling::clr() %>%
step_normalize(all_numeric())
sixthhospital_all_top20<- sixthhospital_recipe_all %>%
# step_select_roc() keeps the top predictors ranked by ROC AUC; it comes from the recipeselectors (colino) package rather than core recipes
step_select_roc(all_predictors(), top_p = 20, outcome = "Diagnosis") %>%
# apply the recipe to the training data
prep() %>%
# extract the pre-processed training dataset
juice()
colnames_top<-names(sixthhospital_all_top20)
# get a filtered data frame with the top-20 predictors
df.filter.clean.top<-df.filter.clean[,colnames_top]
set.seed(123)
# split the data into training (67%) and testing (33%)
sixthhospital_split <- initial_split(df.filter.clean.top, prop = 2/3)
sixthhospital_split
## <Analysis/Assess/Total>
## <30/14/44>
# extract the training and testing sets from the split
sixthhospital_train <- training(sixthhospital_split)
sixthhospital_test <- testing(sixthhospital_split)
sixthhospital_recipe <-
# which consists of the formula (outcome ~ predictors)
recipe(Diagnosis ~ ., data = sixthhospital_train) %>%
# and some pre-processing steps
step_log(all_numeric()) %>%
# Hotelling::clr() %>%
step_normalize(all_numeric()) %>%
prep(sixthhospital_train)
# %>% juice()
# get the preprocessed training and testing data
sixthhospital_train_preprocessed <- bake(sixthhospital_recipe, sixthhospital_train)
sixthhospital_test_preprocessed <- bake(sixthhospital_recipe, sixthhospital_test)
# create CV object from training data
set.seed(1234)
sixthhospital_cv <- vfold_cv(sixthhospital_train, v = 5, repeats = 10, strata = "Diagnosis")
So far we’ve split our data into training/testing, and we’ve specified our pre-processing steps using a recipe. The next thing we want to specify is our model (using the parsnip package).
Parsnip offers a unified interface for the massive variety of models that exist in R. This means that you only have to learn one way of specifying a model, and you can use this specification and have it generate a linear model, a random forest model, a support vector machine model, and more with a single line of code.
There are a few primary components that you need to provide for the model specification:
The model type: what kind of model you want to fit, set using a different function depending on the model, such as rand_forest() for random forest, logistic_reg() for logistic regression, svm_poly() for a polynomial SVM model, etc. The full list of models available via parsnip can be found here.
The arguments: the model parameter values (now consistently named across different models), set using set_args().
The engine: the underlying package the model should come from (e.g. “ranger” for the ranger implementation of Random Forest), set using set_engine().
The mode: the type of prediction, since several packages can do both classification (binary/categorical prediction) and regression (continuous prediction), set using set_mode().
For instance, if we want to fit a random forest model as implemented by the ranger package for the purpose of classification, and we want to tune the mtry parameter (the number of randomly selected variables to be considered at each split in the trees), then we would define the following model specification:
rf_model <-
# specify that the model is a random forest
rand_forest() %>%
# specify that the `mtry` parameter needs to be tuned
set_args(mtry = tune()) %>%
# select the engine/package that underlies the model
set_engine("ranger", importance = "impurity") %>%
# choose either the continuous regression or binary classification mode
set_mode("classification")
If you want to be able to examine the variable importance of your final model later, you will need to set the importance argument when setting the engine. For ranger, the importance options are "impurity" or "permutation".
As another example, the following code would instead specify a logistic regression model using the glm engine.
lr_model <-
# specify that the model is a logistic regression
logistic_reg() %>%
# select the engine/package that underlies the model
set_engine("glm") %>%
# choose either the continuous regression or binary classification mode
set_mode("classification")
Note that this code doesn’t actually fit the model. Like the recipe, it just outlines a description of the model. Moreover, setting a parameter to tune() means that it will be tuned later in the tune stage of the pipeline (i.e. the value of the parameter that yields the best performance will be chosen). You could also just specify a particular value of the parameter if you don’t want to tune it, e.g. using set_args(mtry = 4).
We’re now ready to put the model and recipes together into a workflow. You initiate a workflow using workflow() (from the workflows package), and then you can add a recipe and add a model to it.
# set the workflow
rf_workflow <- workflow() %>%
# add the recipe
add_recipe(sixthhospital_recipe) %>%
# add the model
add_model(rf_model)
Note that we still haven’t yet implemented the pre-processing steps in the recipe nor have we fit the model. We’ve just written the framework. It is only when we tune the parameters or fit the model that the recipe and model frameworks are actually implemented.
Since we had a parameter that we designated to be tuned (mtry), we need to tune it (i.e. choose the value that leads to the best performance) before fitting our model. If you don’t have any parameters to tune, you can skip this step.
Note that we will do our tuning using the cross-validation object (sixthhospital_cv). To do this, we specify the range of mtry values we want to try, and then we add a tuning layer to our workflow using tune_grid() (from the tune package). Note that we focus on two metrics: accuracy and roc_auc (from the yardstick package).
# specify which values to try
rf_grid <- expand.grid(mtry = c(3,4,5))
# extract results
rf_tune_results <- rf_workflow %>%
tune_grid(resamples = sixthhospital_cv, #CV object
grid = rf_grid, # grid of values to try
metrics = metric_set(accuracy, roc_auc) # metrics we care about
)
You can tune multiple parameters at once by providing multiple parameters to the expand.grid() function, e.g. expand.grid(mtry = c(3, 4, 5), trees = c(100, 500)).
It’s always a good idea to explore the results of the cross-validation. collect_metrics() is a really handy function that can be used in a variety of circumstances to extract any metrics that have been calculated within the object it’s being used on. In this case, the metrics come from the cross-validation performance across the different values of the parameters.
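For example, applied to the tuning results object from above (presumably what produced the table below):
# summarise cross-validated performance across the mtry grid
rf_tune_results %>% collect_metrics()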
## # A tibble: 6 x 7
## mtry .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 3 accuracy binary 0.823 50 0.0193 Model1
## 2 3 roc_auc binary 0.898 50 0.0202 Model1
## 3 4 accuracy binary 0.812 50 0.0234 Model2
## 4 4 roc_auc binary 0.893 50 0.0210 Model2
## 5 5 accuracy binary 0.814 50 0.0205 Model3
## 6 5 roc_auc binary 0.886 50 0.0213 Model3
Across both accuracy and AUC, mtry = 3 yields the best performance (just).
We want to add a layer to our workflow that corresponds to the tuned parameter, i.e. sets mtry to be the value that yielded the best results. If you didn’t tune any parameters, you can skip this step.
We can extract the best value for the roc_auc metric by applying the select_best() function to the tune object.
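A minimal sketch, selecting on roc_auc and storing the result under an illustrative name, param_final:
# pick the mtry value with the best cross-validated roc_auc
param_final <- rf_tune_results %>%
  select_best(metric = "roc_auc")
param_final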
## # A tibble: 1 x 2
## mtry .config
## <dbl> <chr>
## 1 3 Model1
Then we can add this parameter to the workflow using the finalize_workflow() function.
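Continuing the sketch with the illustrative param_final from above:
# lock the tuned mtry value into the workflow
rf_workflow <- rf_workflow %>%
  finalize_workflow(param_final)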
Now that we’ve defined our recipe, our model, and tuned the model’s parameters, we’re ready to actually fit the final model. Since all of this information is contained within the workflow object, we will apply the last_fit() function to our workflow and our train/test split object. This will automatically train the model specified by the workflow using the training data, and produce evaluations based on the test set.
rf_fit <- rf_workflow %>%
# fit on the training set and evaluate on test set
last_fit(sixthhospital_split)
Note that the fit object that is created is a data-frame-like object; specifically, it is a tibble with list columns.
## # Resampling results
## # Monte Carlo cross-validation (0.68/0.32) with 1 resamples
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [30/… train/test … <tibble [2 ×… <tibble [0 … <tibble [14 × … <workflo…
This is a really nice feature of tidymodels (and is what makes it work so nicely with the tidyverse) since you can do all of your tidyverse operations to the model object. While truly taking advantage of this flexibility requires proficiency with purrr, if you don’t want to deal with purrr and list-columns, there are functions that can extract the relevant information from the fit object that remove the need for purrr as we will see below.
Since we supplied the train/test object when we fit the workflow, the metrics are evaluated on the test set. Now when we use the collect_metrics() function (recall we used this when tuning our parameters), it extracts the performance of the final model (since rf_fit now consists of a single final model) applied to the test set.
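For example, a call along these lines produces the test-set metrics shown below:
# extract test-set performance from the last_fit() object
rf_fit %>% collect_metrics()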
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.786
## 2 roc_auc binary 0.857
Overall the performance is very good.
You can also extract the test set predictions themselves using the collect_predictions() function. Note that there are 14 rows in the predictions object below, which matches the number of test set observations (just to give you some evidence that these are based on the test set rather than the training set).
# generate predictions from the test set
test_predictions <- rf_fit %>% collect_predictions()
test_predictions$Value<- test_predictions[,c(paste0(".pred_",target_group[1]))] %>% unlist()
# test_predictions <- rf_fit %>% pull(.predictions) ## purrr function
test_predictions
## # A tibble: 14 x 7
## id .pred_AD .pred_NC .row .pred_class Diagnosis Value
## <chr> <dbl> <dbl> <int> <fct> <fct> <dbl>
## 1 train/test split 0.758 0.242 3 AD AD 0.758
## 2 train/test split 0.897 0.103 5 AD AD 0.897
## 3 train/test split 0.831 0.169 9 AD AD 0.831
## 4 train/test split 0.376 0.624 14 NC AD 0.376
## 5 train/test split 0.152 0.848 15 NC NC 0.152
## 6 train/test split 0.541 0.459 25 AD NC 0.541
## 7 train/test split 0.414 0.586 26 NC NC 0.414
## 8 train/test split 0.151 0.849 27 NC NC 0.151
## 9 train/test split 0.679 0.321 28 AD AD 0.679
## 10 train/test split 0.805 0.195 31 AD NC 0.805
## 11 train/test split 0.856 0.144 36 AD AD 0.856
## 12 train/test split 0.0801 0.920 37 NC NC 0.0801
## 13 train/test split 0.341 0.659 38 NC NC 0.341
## 14 train/test split 0.521 0.479 42 AD AD 0.521
Since this is just a normal data frame/tibble object, we can generate summaries and plots such as a confusion matrix.
# generate a confusion matrix
cm <- test_predictions %>%
conf_mat(truth = Diagnosis, estimate = .pred_class)
autoplot(cm, type = "heatmap")
We could also plot the distributions of the predicted probabilities for each class.
test_predictions %>%
ggplot( aes_string(x = "Value" , fill = 'Diagnosis') ) +
geom_density( alpha = 0.5)
The previous section evaluated the model trained on the training data using the testing data. But once you’ve determined your final model, you often want to train it on your full dataset and then use it to predict the response for new data.
If you want to use your model to predict the response for new observations, you need to use the fit() function on your workflow and the dataset that you want to fit the final model on (e.g. the complete training + testing dataset). Below I use our original dataset, but ideally this should be done on an external dataset.
The final_model object contains a few things, including the ranger object trained with the parameters established through the workflow contained in rf_workflow, based on the data in sixthhospital.clean (the combined training and testing data).
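A minimal sketch of this step, assuming the combined training + testing data is the df.filter.clean.top object created above (referred to as sixthhospital.clean in the text):
# fit the finalized workflow on the full dataset
final_model <- rf_workflow %>% fit(df.filter.clean.top)
final_model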
## ══ Workflow [trained] ═══════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## ● step_log()
## ● step_normalize()
##
## ── Model ────────────────────────────────────────────────────────────────────────────────────────────
## Ranger result
##
## Call:
## ranger::ranger(formula = ..y ~ ., data = data, mtry = ~3, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 44
## Number of independent variables: 20
## Mtry: 3
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1559749
If you want to extract the variable importance scores from your model, as far as I can tell, for now you need to extract the model object from the fit() object (which for us is called final_model). The function that extracts the model is pull_workflow_fit(), and then you need to grab the fit object that the output contains.
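A sketch of that extraction, assuming final_model is the fitted workflow from the previous step (ranger_obj is the name the importance code below expects):
# pull the parsnip model from the workflow, then grab the underlying ranger object
ranger_obj <- pull_workflow_fit(final_model)$fit
ranger_obj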
## Ranger result
##
## Call:
## ranger::ranger(formula = ..y ~ ., data = data, mtry = ~3, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 44
## Number of independent variables: 20
## Mtry: 3
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1559749
Then you can extract the variable importance from the ranger object itself (variable.importance is a specific object contained within ranger output; this will need to be adapted for the specific object type of other models).
ranger_obj_importance <- ranger_obj$variable.importance %>% data.frame()
ranger_obj_importance$feature <- rownames(ranger_obj_importance)
ranger_obj_importance$value <- ranger_obj_importance$.
ranger_obj_importance$. <- NULL
## [1] "K00052" "K00210" "K00836" "K01226" "K02082" "K02745" "K02747" "K02818"
## [9] "K03465" "K03722" "K03802" "K03828" "K05895" "K06919" "K07148" "K07506"
## [17] "K11063" "K13275" "K13282" "K20151"
# ko<-c("ko00010","ko00020","ko00030")
ko<- ranger_obj_importance$feature
# get name
print(length(ko))
## [1] 20
# fetch a human-readable name for each KEGG enzyme via togows (getURL() is from the RCurl package)
ko_names <- c()
for (i in ko){
  # ko_name <- getURL(paste0("http://togows.dbcls.jp/entry/pathway/",i,"/name"))
  ko_name <- getURL(paste0("http://togows.org/entry/kegg-enzyme/",i,"/name"))
  ko_names <- c(ko_names, ko_name)
}
ko_names <- gsub("\n","",ko_names)
ranger_obj_importance$ko_names <- ko_names
c2<-ggplot(ranger_obj_importance,aes(x=reorder(ko_names, abs(value) ), y=value) ) +
geom_bar(stat="identity", position=position_dodge() ,fill="red") +
coord_flip()+
theme_bw()+theme_minimal()+
theme( legend.position = "right",axis.title.x = element_blank(),axis.text.x =element_text(angle = 0),axis.title.y = element_blank() )+
theme(axis.title.x = element_text(size = 5))+
theme(axis.text.x = element_text(size = 6,color="black"),axis.text.y = element_text(size = 10,color="black"))+
ggtitle(label = paste0(target_group,collapse ='|') )
c2
#Model Training
df_training<-sixthhospital_train_preprocessed
df_testing<-sixthhospital_test_preprocessed
df_ranger <- rand_forest(trees = 100, mode = "classification") %>%
set_engine("ranger") %>%
fit(Diagnosis ~ ., data = df_training)
df_rf <- rand_forest(trees = 100, mode = "classification") %>%
set_engine("randomForest",importance=T) %>%
fit(Diagnosis ~ ., data = df_training)
df_lr <-logistic_reg(mode = "classification") %>%
set_engine("glm") %>%
fit(Diagnosis ~ ., data = df_training)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# df_lr1 <-logistic_reg(mode = "classification") %>%
# set_engine("glmnet") %>%
# fit(Diagnosis ~ ., data = df_training)
df_bt <-boost_tree(mode = "classification") %>%
set_engine("xgboost") %>%
fit(Diagnosis ~ ., data = df_training)
#Predictions
predict(df_ranger, df_testing)
## # A tibble: 14 x 1
## .pred_class
## <fct>
## 1 AD
## 2 AD
## 3 AD
## 4 NC
## 5 NC
## 6 AD
## 7 NC
## 8 NC
## 9 AD
## 10 AD
## 11 AD
## 12 NC
## 13 NC
## 14 AD
## Rows: 14
## Columns: 22
## $ .pred_class <fct> AD, AD, AD, NC, NC, AD, NC, NC, AD, AD, AD, NC, NC, AD
## $ K00052 <dbl> 0.7809951, 0.7154290, 0.4955381, -1.9531014, -0.6582282, …
## $ K00210 <dbl> 0.04263319, 0.48673764, -0.30707079, 0.58210332, 1.060592…
## $ K00836 <dbl> 1.18039366, 0.30308450, 0.68858184, -0.29729681, -2.01205…
## $ K01226 <dbl> 0.6187337, 0.4047961, 0.1071349, -0.1536828, -0.4446155, …
## $ K02082 <dbl> 0.182809072, 0.216945332, 0.004933035, -0.078061360, -0.5…
## $ K02745 <dbl> -0.14280003, 0.49761653, 0.13336357, -0.91186224, -2.0340…
## $ K02747 <dbl> -0.22619636, 0.33823247, -0.57639907, -1.17783541, -2.489…
## $ K02818 <dbl> 0.773034643, 0.441263044, 0.832145189, -0.317001082, -1.0…
## $ K03465 <dbl> 1.1573995, 0.7557580, 0.4755377, 1.0905615, -1.2985911, 1…
## $ K03722 <dbl> -0.0008661466, 0.3697163702, 0.2065728392, 0.3292798358, …
## $ K03802 <dbl> 1.6063919, 0.3594838, 1.0415371, -1.2850429, -1.2850429, …
## $ K03828 <dbl> -0.475451865, -0.067380613, -0.006041819, 0.745534150, -0…
## $ K05895 <dbl> 0.5955950764, 1.2858381255, 1.1834216895, -1.9622380674, …
## $ K06919 <dbl> 1.13979314, 0.56840099, 0.88737509, -0.47433355, -0.87809…
## $ K07148 <dbl> -0.51552323, -1.28366872, -0.95999259, 0.82430696, 0.5110…
## $ K07506 <dbl> -0.1336175, 0.4644781, -0.2274853, 0.6614558, 0.1908874, …
## $ K11063 <dbl> 1.36504163, -0.65849286, -0.06322721, 0.12384402, 1.08145…
## $ K13275 <dbl> 0.83256702, 1.36381334, 2.99535382, -1.23227870, -1.23227…
## $ K13282 <dbl> 1.28817709, 0.58632296, 0.33411787, 1.02720776, -1.452071…
## $ K20151 <dbl> 0.41135829, 0.96504509, 0.60437360, -0.54268371, -1.77115…
## $ Diagnosis <fct> AD, AD, AD, AD, NC, NC, NC, NC, AD, NC, AD, NC, NC, AD
## Rows: 14
## Columns: 22
## $ .pred_class <fct> AD, AD, AD, AD, NC, NC, AD, NC, NC, AD, NC, AD, NC, AD
## $ K00052 <dbl> 0.7809951, 0.7154290, 0.4955381, -1.9531014, -0.6582282, …
## $ K00210 <dbl> 0.04263319, 0.48673764, -0.30707079, 0.58210332, 1.060592…
## $ K00836 <dbl> 1.18039366, 0.30308450, 0.68858184, -0.29729681, -2.01205…
## $ K01226 <dbl> 0.6187337, 0.4047961, 0.1071349, -0.1536828, -0.4446155, …
## $ K02082 <dbl> 0.182809072, 0.216945332, 0.004933035, -0.078061360, -0.5…
## $ K02745 <dbl> -0.14280003, 0.49761653, 0.13336357, -0.91186224, -2.0340…
## $ K02747 <dbl> -0.22619636, 0.33823247, -0.57639907, -1.17783541, -2.489…
## $ K02818 <dbl> 0.773034643, 0.441263044, 0.832145189, -0.317001082, -1.0…
## $ K03465 <dbl> 1.1573995, 0.7557580, 0.4755377, 1.0905615, -1.2985911, 1…
## $ K03722 <dbl> -0.0008661466, 0.3697163702, 0.2065728392, 0.3292798358, …
## $ K03802 <dbl> 1.6063919, 0.3594838, 1.0415371, -1.2850429, -1.2850429, …
## $ K03828 <dbl> -0.475451865, -0.067380613, -0.006041819, 0.745534150, -0…
## $ K05895 <dbl> 0.5955950764, 1.2858381255, 1.1834216895, -1.9622380674, …
## $ K06919 <dbl> 1.13979314, 0.56840099, 0.88737509, -0.47433355, -0.87809…
## $ K07148 <dbl> -0.51552323, -1.28366872, -0.95999259, 0.82430696, 0.5110…
## $ K07506 <dbl> -0.1336175, 0.4644781, -0.2274853, 0.6614558, 0.1908874, …
## $ K11063 <dbl> 1.36504163, -0.65849286, -0.06322721, 0.12384402, 1.08145…
## $ K13275 <dbl> 0.83256702, 1.36381334, 2.99535382, -1.23227870, -1.23227…
## $ K13282 <dbl> 1.28817709, 0.58632296, 0.33411787, 1.02720776, -1.452071…
## $ K20151 <dbl> 0.41135829, 0.96504509, 0.60437360, -0.54268371, -1.77115…
## $ Diagnosis <fct> AD, AD, AD, AD, NC, NC, NC, NC, AD, NC, AD, NC, NC, AD
## Rows: 14
## Columns: 22
## $ .pred_class <fct> AD, AD, AD, NC, NC, AD, AD, NC, AD, AD, AD, NC, NC, AD
## $ K00052 <dbl> 0.7809951, 0.7154290, 0.4955381, -1.9531014, -0.6582282, …
## $ K00210 <dbl> 0.04263319, 0.48673764, -0.30707079, 0.58210332, 1.060592…
## $ K00836 <dbl> 1.18039366, 0.30308450, 0.68858184, -0.29729681, -2.01205…
## $ K01226 <dbl> 0.6187337, 0.4047961, 0.1071349, -0.1536828, -0.4446155, …
## $ K02082 <dbl> 0.182809072, 0.216945332, 0.004933035, -0.078061360, -0.5…
## $ K02745 <dbl> -0.14280003, 0.49761653, 0.13336357, -0.91186224, -2.0340…
## $ K02747 <dbl> -0.22619636, 0.33823247, -0.57639907, -1.17783541, -2.489…
## $ K02818 <dbl> 0.773034643, 0.441263044, 0.832145189, -0.317001082, -1.0…
## $ K03465 <dbl> 1.1573995, 0.7557580, 0.4755377, 1.0905615, -1.2985911, 1…
## $ K03722 <dbl> -0.0008661466, 0.3697163702, 0.2065728392, 0.3292798358, …
## $ K03802 <dbl> 1.6063919, 0.3594838, 1.0415371, -1.2850429, -1.2850429, …
## $ K03828 <dbl> -0.475451865, -0.067380613, -0.006041819, 0.745534150, -0…
## $ K05895 <dbl> 0.5955950764, 1.2858381255, 1.1834216895, -1.9622380674, …
## $ K06919 <dbl> 1.13979314, 0.56840099, 0.88737509, -0.47433355, -0.87809…
## $ K07148 <dbl> -0.51552323, -1.28366872, -0.95999259, 0.82430696, 0.5110…
## $ K07506 <dbl> -0.1336175, 0.4644781, -0.2274853, 0.6614558, 0.1908874, …
## $ K11063 <dbl> 1.36504163, -0.65849286, -0.06322721, 0.12384402, 1.08145…
## $ K13275 <dbl> 0.83256702, 1.36381334, 2.99535382, -1.23227870, -1.23227…
## $ K13282 <dbl> 1.28817709, 0.58632296, 0.33411787, 1.02720776, -1.452071…
## $ K20151 <dbl> 0.41135829, 0.96504509, 0.60437360, -0.54268371, -1.77115…
## $ Diagnosis <fct> AD, AD, AD, AD, NC, NC, NC, NC, AD, NC, AD, NC, NC, AD
#Model Validation
df_ranger %>%
predict(df_testing) %>%
bind_cols(df_testing) %>%
metrics(truth = Diagnosis, estimate = .pred_class)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.786
## 2 kap binary 0.571
df_rf %>%
predict(df_testing) %>%
bind_cols(df_testing) %>%
metrics(truth = Diagnosis, estimate = .pred_class)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.786
## 2 kap binary 0.571
df_lr %>%
predict(df_testing) %>%
bind_cols(df_testing) %>%
metrics(truth = Diagnosis, estimate = .pred_class)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.643
## 2 kap binary 0.286
# df_lr1 %>%
# predict(df_testing) %>%
# bind_cols(df_testing) %>%
# metrics(truth = Diagnosis, estimate = .pred_class)
df_bt %>%
predict(df_testing) %>%
bind_cols(df_testing) %>%
metrics(truth = Diagnosis, estimate = .pred_class)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.714
## 2 kap binary 0.429
#Per classifier metrics
df_probs <- df_ranger %>%
predict(df_testing, type = "prob") %>%
bind_cols(df_testing)
roc_fig <- predict(df_ranger, df_testing, type = "prob") %>%
bind_cols(predict(df_ranger, df_testing)) %>%
bind_cols(dplyr::select(df_testing, Diagnosis))
roc_fig$V1 <- as.data.frame(roc_fig)[,1]
roc_fig_lr <- predict(df_lr, df_testing, type = "prob") %>%
bind_cols(predict(df_lr, df_testing)) %>%
bind_cols(dplyr::select(df_testing, Diagnosis))
roc_fig_lr$V1 <- as.data.frame(roc_fig_lr)[,1]
# roc_fig_lr1 <- multi_predict(df_lr1, df_testing, type = "prob") %>%
# bind_cols(multi_predict(df_lr1, df_testing)) %>%
# bind_cols(dplyr::select(df_testing, Diagnosis))
# roc_fig_lr1$V1 <- as.data.frame(roc_fig_lr1)[,1]
roc_fig_bt <- predict(df_bt, df_testing, type = "prob") %>%
bind_cols(predict(df_bt, df_testing)) %>%
bind_cols(dplyr::select(df_testing, Diagnosis))
roc_fig_bt$V1 <- as.data.frame(roc_fig_bt)[,1]
# get accuracy and auc
# predict(df_ranger, df_testing, type = "prob") %>%
# bind_cols(predict(df_ranger, df_testing)) %>%
# bind_cols(select(df_testing, Diagnosis)) %>%
yardstick::metrics(roc_fig, Diagnosis, estimate = .pred_class, V1)
## # A tibble: 4 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.786
## 2 kap binary 0.571
## 3 mn_log_loss binary 0.498
## 4 roc_auc binary 0.816
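Presumably the corresponding call for the logistic regression predictions, which produces the next table:
yardstick::metrics(roc_fig_lr, Diagnosis, estimate = .pred_class, V1)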
## # A tibble: 4 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.643
## 2 kap binary 0.286
## 3 mn_log_loss binary 10.4
## 4 roc_auc binary 0.714
# yardstick::metrics(roc_fig_lr1, Diagnosis, estimate = .pred_class, V1)
yardstick::metrics(roc_fig_bt, Diagnosis, estimate = .pred_class, V1)
## # A tibble: 4 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.714
## 2 kap binary 0.429
## 3 mn_log_loss binary 0.521
## 4 roc_auc binary 0.806
# roc_curve(roc_fig_lr1, Diagnosis,V1 )%>% autoplot()
roc_curve(roc_fig_bt, Diagnosis,V1 )%>% autoplot()
# feature importance
ko <- rownames(df_rf$fit$importance)
ko_names <- c()
for (i in ko){
  # ko_name <- getURL(paste0("http://togows.dbcls.jp/entry/pathway/",i,"/name"))
  ko_name <- getURL(paste0("http://togows.org/entry/kegg-enzyme/",i,"/name"))
  ko_names <- c(ko_names, ko_name)
}
ko_names <- gsub("\n","",ko_names)
df_rf_imp<- cbind(rownames(df_rf$fit$importance), df_rf$fit$importance, ko_names) %>% as_tibble() %>% dplyr::select(predictor= V1, MeanDecreaseAccuracy, enzyme = ko_names) %>%
mutate(MeanDecreaseAccuracy=as.numeric(MeanDecreaseAccuracy)) %>% arrange(desc((MeanDecreaseAccuracy)))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.