Disclaimer: This analysis has not been peer-reviewed and is not meant to replace the process of species assessment and prioritization undertaken by the IUCN Red List.
An analysis to aid in the predicted classification, and thus triage, of species currently ‘Not Evaluated’ or ‘Data Deficient’ on the IUCN Red List.
Problem Description: The current rate of extinction is far above historical levels, and conservation has limited resources to allocate to both the assessment of species and their protection. The main international organization that evaluates the population status of species is the IUCN. It assigns species to several categories of risk, plus a ‘Data Deficient’ category when there is not enough information to judge a species’ risk status.
Data sources:
- IUCN API (https://apiv3.iucnredlist.org/api/v3/docs) - current threat status, taxonomy, max depth, habitat type, threats, etc.
- Fishbase (www.fishbase.org) - length data, price category, vulnerability index, habitat range? age at sexual maturity? max age?
- Aquamaps (www.aquamaps.org) - area size (sum of cell area where occurrence is predicted)
Proposed Solution: To properly allocate resources for conservation, we need to know which species are most threatened. Species classified as ‘Data Deficient’ could be of high concern (e.g., Threatened) or of low priority (Least Concern or Near Threatened). An improvement to this process would be to predict which species are most likely to be of high conservation concern, both within the Data Deficient category and among species not yet assessed by the IUCN.
After some review, I believe it is more important to predict not the exact category, but whether a species is likely to be ‘Threatened’ (i.e., Vulnerable, Endangered, or Critically Endangered) or ‘Not Threatened’ (i.e., Least Concern or Near Threatened).
Therefore, I aim to train a machine learning or statistical model on available data to predict the IUCN category, or the general level of conservation concern, of species currently listed as Data Deficient or not currently included on the IUCN Red List.
Based on my current knowledge, I begin with a focus on aquatic species, and especially fishes.
ML Methods: logistic regression, decision tree, random forest, naive Bayes, and kNN. I can attempt these as regression on a linear 0-1 scoring of the IUCN status, as a more classical classification on the individual threat categories, or as classification on two large over-arching categories (Not Threatened vs. Threatened).
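For intuition, here is a hypothetical version of that linear scoring. The exact values are an assumption on my part; the actual scores used later come from risk_code_lookup.csv.
#Hypothetical 0-1 scoring of IUCN categories (actual scores come from risk_code_lookup.csv)
cat_scores <- c(LC = 0, NT = 0.2, VU = 0.4, EN = 0.6, CR = 0.8, EX = 1)
#Collapse the score into the two over-arching categories at a chosen cutoff
to_binary <- function(score, cutoff = 0.2) {
  ifelse(score <= cutoff, "Not Threatened", "Threatened")
}
to_binary(cat_scores)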
library(tidyverse)    #dplyr, tidyr, readr, ggplot2, forcats used throughout
library(wesanderson)  #Zissou1 colour palettes
pal <- wes_palette(name = "Zissou1", n = 100, type = "continuous")
zissou5 <- wes_palette(name = "Zissou1")
theme_set(theme_classic())
null_colour <- "white"
#Establish list of marine species
all_species <- read.csv(file.path(dir_data, "iucn_species_list_version_2019-2.csv"))
marine_species <- read.csv(file.path(dir_data, "spp_marine_from_version_2019-2.csv")) %>%
  filter(!is.na(habs))
#Split the comma-separated habitat codes into one row per habitat
marine_species <- marine_species %>%
  separate(habs, into = paste0("habs", 1:50), sep = ",")
marine_species <- marine_species %>%
  gather(key = "habitat_num", value = "value", -iucn_sid, -max_depth) %>%
  mutate(habitat_type = as.factor(value)) %>%
  dplyr::select(-c(habitat_num, value))
marine_species <- marine_species %>% filter(!is.na(habitat_type))
#Collapse habitat subtypes (e.g., "9.1") to their top-level group (e.g., "9")
marine_species <- marine_species %>%
  mutate(habitat_group = gsub(habitat_type, pattern = "\\..*", replacement = "")) %>%
  mutate(habitat_group = as.factor(trimws(habitat_group)))
#### Align with threats and narrative text: ####
species_threats <- read.csv(file.path(dir_data, "species_fishing_threats.csv"), stringsAsFactors = F)
colnames(species_threats) <- gsub(colnames(species_threats), pattern="result\\.", replacement="")
risk_codes <- read.csv(file.path(dir_data, "risk_code_lookup.csv"), stringsAsFactors = F)
species_threats <- species_threats %>% left_join(risk_codes, by=c("category"="code"))
species_threats <- species_threats %>%
dplyr::select(iucn_sid, class_name, scientific_name, code_current, cat_score) %>%
unique()
# threatened_status <- c("VU", "CR", "EN")
# species_threats <- species_threats %>% filter(code_current %in% threatened_status)
text_threats <- read.csv(file.path(dir_data, "narrative_threats_output_tidy.csv"))
num_dd <- nrow(all_species %>% filter(result.category=="DD"))
16011 species are currently classified by the IUCN as Data Deficient (DD; see Figure @ref(fig:iucn-species-plot)). This is a major challenge for conservation: Data Deficient species are not prioritized for conservation efforts, yet many are likely at an elevated risk of extinction.
all_species <- all_species %>% left_join(risk_codes, by=c("result.category"="code"))
all_species <- all_species %>% mutate(code_current = fct_relevel(code_current, c("DD", "LC", "NT", "VU", "EN", "CR", "EX")))
included_species <- all_species %>% filter(result.taxonid %in% species_threats$iucn_sid)
all_species_plot <- ggplot(data=all_species, aes(x=code_current, fill=code_current)) +
geom_bar() +
scale_fill_manual(values= IUCNpalette::iucn_palette(category="All", exclude="NE")) +
ylab("Number of species") +
labs(fill="IUCN Category") +
xlab("") +
scale_y_continuous(expand=c(0,0)) +
theme(legend.position = "none") +
NULL
all_species_plot
Let’s focus on two groups: those that are Not Threatened (Least Concern and Near Threatened) and those that are Threatened or worse (Vulnerable, Endangered, Critically Endangered, and Extinct).
all_species <- all_species %>%
mutate(new_category = if_else(is.na(cat_score), "Data Deficient",
if_else(cat_score <=0.2, "Not Threatened", "Threatened"
#if_else(cat_score >0.8, "Extinct", "Threatened") #Removed this as it probably doesn't add much
))) %>%
mutate(new_category = as.factor(new_category))
all_species_plot <- ggplot(data=all_species, aes(x=new_category, fill=new_category)) +
geom_bar() +
scale_fill_manual(values= c("grey50", zissou5[1], zissou5[5])) +
ylab("Number of species") +
labs(fill="IUCN Category") +
xlab("") +
scale_y_continuous(expand=c(0,0)) +
theme(legend.position = "none") +
NULL
all_species_plot
Here we’re going to take advantage of the rfishbase package to extract relevant predictors for fishes. This limits our analysis to species present in both the IUCN Red List and Fishbase.
I look at the possible columns of Fishbase data and select some that I think may be good predictors of an elevated risk status. I then check how many of these are NAs, as a variable is less useful if I have to fill in a lot of values.
Longevity is probably a good predictor, but it is only available for ~2,000 of the 34,000 species covered by Fishbase. I check whether there is a clear relationship with Length that could be modelled, but there doesn’t seem to be one.
library(rfishbase)  #FishBase data access
fishbase_taxa <- load_taxa()
#Variables of interest: Length, Vulnerability, CommonLength, PriceCateg, Importance
x <- species(species_list = "Bolbometopon muricatum") #Used to find out which fields are important
# colnames(x)
species_list_fb <- fishbase_taxa %>% pull(Species)
fb_data <- species(species_list = species_list_fb, fields=c("Species", "Length", "Vulnerability", "CommonLength", "PriceCateg", "Importance", "LongevityWild", "BodyShapeI", "DepthRangeDeep", "DepthRangeComDeep"))
# popgrowth("Oreochromis niloticus")
# distribution("Oreochromis niloticus")
nrow(fb_data %>% filter(is.na(DepthRangeDeep)))
## [1] 19402
nrow(fb_data %>% filter(is.na(DepthRangeComDeep) & is.na(DepthRangeDeep)))
## [1] 19269
fb_data %>% group_by(DepthRangeDeep) %>%
summarize(n=n())
## # A tibble: 1,451 x 2
## DepthRangeDeep n
## <dbl> <int>
## 1 0 9
## 2 1 144
## 3 2 167
## 4 3 127
## 5 4 68
## 6 5 158
## 7 6 102
## 8 7 44
## 9 8 104
## 10 9 53
## # ... with 1,441 more rows
#Is there a relationship between Length and Longevity?
fb_data %>%
ggplot(aes(x=Length, y=LongevityWild)) +
geom_point()
Aquamaps has extensive data on the distribution of aquatic species. Their database is accessible through their R package, which sets it up as a local SQLite database. I compare the taxa lists between Aquamaps and the IUCN to find where they overlap, since for now I only want to predict for species already included in the IUCN Red List. The main predictor I use here is range size: if a species occurs in a given 0.5 by 0.5 degree cell, it is assumed to occupy the area of that cell, and these cell areas are summed. Species with larger ranges have more places where they can remain unaffected by human or other disturbances.
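As a rough illustration of why we sum CellArea rather than simply counting cells, the area of a 0.5-degree cell shrinks with the cosine of latitude. This back-of-the-envelope function is my own sketch, not Aquamaps’ calculation:
cell_area_km2 <- function(center_lat, cell_deg = 0.5) {
  km_per_deg <- 111.32  #approximate km per degree
  (cell_deg * km_per_deg) * (cell_deg * km_per_deg * cos(center_lat * pi / 180))
}
cell_area_km2(0)   #~3098 km^2 at the equator
cell_area_km2(60)  #~1549 km^2 at 60 degrees latitude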
This data is really helpful but takes a long time to process so I set eval=F on this chunk.
download_db(force = TRUE) #DB is 1gb so takes a long time to do this download.
my_db <- aquamapsdata:::src_sqlite_aquamapsdata()
# for a db table, return a tibble with the columns and their data types
ls_types <- function(table) {
res <- table %>% head %>% collect %>% lapply(type_sum) %>% unlist
colname <- names(res)
title <- as.character(table$ops$x)
tibble(table = title, col_name = colname, col_type = res, desc = NA)
}
# run the above function on all tables
am_schema <- bind_rows(
my_db %>% tbl("nativemaps") %>% ls_types,
my_db %>% tbl("hcaf") %>% ls_types,
my_db %>% tbl("hspen") %>% ls_types,
my_db %>% tbl("occ") %>% ls_types,
my_db %>% tbl("taxa") %>% ls_types
)
#datatable(am_schema)
#Get species list
am_species <- my_db %>% tbl("taxa") %>% filter(is.na(iucn_code)==F) %>% collect %>% mutate(SciName = paste(Genus, Species)) %>% select(SciName, SPECIESID)
am_species <- am_species %>% filter(SciName %in% all_species$result.scientific_name)
#Get maps for species in list
am_species_maps <- my_db %>% tbl("nativemaps") %>% collect %>% filter(SpeciesID %in% am_species$SPECIESID)
#get reference for map cell ids:
am_map <- my_db %>% tbl("hcaf") %>% collect %>% select(CsquareCode, CenterLat, CenterLong, CellArea)
am_species_maps <- am_species_maps %>% left_join(am_map, by = "CsquareCode")
am_species_maps %>%
ggplot(aes(x=CenterLong, y=CenterLat, fill=probability)) +
geom_tile()
write_csv(am_species_maps, "./data/am_species_maps.csv")
am_species_maps_area <- am_species_maps %>%
group_by(SpeciesID) %>%
summarize(CellArea = sum(CellArea)) %>%
left_join(am_species, by=c("SpeciesID"="SPECIESID"))
write_csv(am_species_maps_area, "./data/am_species_area.csv")
Combine the IUCN, Fishbase, and Aquamaps data here, filling in NAs where possible and filtering out observations with NAs when not. I also set aside the Data Deficient (DD) species for later, as these are what I ultimately want to predict.
am_species_maps_area <- read_csv( "../data/am_species_area.csv")
all_species <- read.csv(file.path(dir_data, "iucn_species_list.csv"))
risk_codes <- read.csv(file.path(dir_data, "risk_code_lookup.csv"), stringsAsFactors = F)
risk_codes <- risk_codes %>% dplyr::select(code, code_current, cat_score) %>% distinct()
dat <- all_species %>%
left_join(risk_codes, by=c("result.category"="code")) %>%
left_join(fb_data, by=c("result.scientific_name"="Species")) %>%
left_join(am_species_maps_area, by=c("result.scientific_name"="SciName")) %>%
left_join(marine_species, by=c("result.taxonid"="iucn_sid"))
#Add in feature of larger categories. Might improve classifier performace:
dat$new_category <- NA
dat <- dat %>%
mutate(new_category = if_else(is.na(cat_score), "Data Deficient",
if_else(cat_score <=0.2, "Not Threatened", "Threatened"
#if_else(cat_score >0.8, "Extinct", "Threatened") #Removed this as it probably doesn't add much
))) %>%
mutate(new_category = as.factor(new_category))
dat_no_dd <- dat %>% filter(code_current != "DD")
dat_no_dd <- dat_no_dd %>% filter(is.na(max_depth)==F) %>% droplevels.data.frame()
fish_dat <- dat_no_dd %>%
filter(is.na(Length)==F & is.na(cat_score)==F) %>%
droplevels.data.frame() %>%
filter(is.na(CellArea)==F)
#14963 fishes are found in FB and IUCN list
#dat_no_dd %>% filter(is.na(habitat_type))
#Separate out DD data
dd <- dat %>% filter(code_current == "DD")
fish_dat_dd <- dd %>% filter(is.na(Length)==F) %>% filter(is.na(max_depth)==F) #2220 fishes are found in FB and IUCN list and are Data Deficient
Check how many of the variables I am looking at are missing from these databases. I can fill in some based on physical relationships in these species, but that is harder for other variables (e.g., price category).
#
# unique(fish_dat$PriceCateg)
fish_dat$PriceCateg[is.na(fish_dat$PriceCateg)] <- "unknown"
#Assume common length has a linear relationship to max length and impute from it
nrow(fish_dat %>% filter(is.na(Length)==T))
## [1] 0
length_model <- lm(CommonLength ~ Length, dat=fish_dat)
fish_dat$CommonLengthPred <- NA
fish_dat$CommonLengthPred <- predict(length_model, newdata=fish_dat)
fish_dat$CommonLength[is.na(fish_dat$CommonLength)==T] <- fish_dat$CommonLengthPred[is.na(fish_dat$CommonLength)==T]
summary(fish_dat %>% select(cat_score, Length, Vulnerability, CommonLength))
## cat_score Length Vulnerability CommonLength
## Min. :0.00000 Min. : 1.00 Min. :10.00 Min. : 3.200
## 1st Qu.:0.00000 1st Qu.: 12.00 1st Qu.:16.38 1st Qu.: 9.491
## Median :0.00000 Median : 25.00 Median :28.26 Median : 16.530
## Mean :0.04191 Mean : 49.59 Mean :31.87 Mean : 29.845
## 3rd Qu.:0.00000 3rd Qu.: 51.00 3rd Qu.:41.60 3rd Qu.: 30.067
## Max. :1.00000 Max. :1600.00 Max. :90.00 Max. :1000.000
fish_dat_numeric <- fish_dat %>% select(cat_score, Length, Vulnerability, CommonLength)
cormat <- round(cor(fish_dat_numeric),2)
cormat <- as.data.frame(cormat)
cormat$Var1 <- row.names(cormat)
melted_cormat <- cormat %>% gather(key="Var2", value="value", -Var1)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()
#Descriptors:
#importance
fish_dat %>% group_by(PriceCateg) %>% summarize(n=n())
## # A tibble: 5 x 2
## PriceCateg n
## <chr> <int>
## 1 high 1988
## 2 low 788
## 3 medium 3158
## 4 unknown 6358
## 5 very high 4020
fish_dat %>% group_by(Importance) %>% summarize(n=n())
## # A tibble: 7 x 2
## Importance n
## <chr> <int>
## 1 <NA> 7735
## 2 commercial 2619
## 3 highly commercial 575
## 4 minor commercial 3120
## 5 of no interest 1669
## 6 of potential interest 73
## 7 subsistence fisheries 521
fish_dat$Importance[is.na(fish_dat$Importance)] <- "unknown"
fish_dat$BodyShapeI[is.na(fish_dat$BodyShapeI)] <- "other"
fish_dat$BodyShapeI[fish_dat$BodyShapeI=="other (see remarks)"] <- "other"
#Check for remaining NAs in important variables.
#colSums(is.na(fish_dat))
Interestingly, Vulnerability is a poorer predictor of cat_score than Length.
I set up the train and test datasets so that I can keep coming back to clean versions. Some of the modelling I do later needs the data in particular formats (factors, numerics, etc.), so I modify it for each analysis but keep a ‘clean’ version to return to. I use a 70/30 train/test split here (setting a seed before sampling would make the split reproducible).
train_raw <- sample_frac(fish_dat, size=0.7, replace=F)
test_raw <- fish_dat %>% filter(!result.taxonid %in% train_raw$result.taxonid)
train <- train_raw
test <- test_raw
I am going to try several types of models, but I will focus on binary classification for now. Therefore, I use the F1-score as my primary metric for comparing model types.
Looking back at our training data, the vast majority of species are Least Concern or Near Threatened. That means a model could be ‘correct’ around 90% of the time just by always guessing ‘Not Threatened’: a very high percentage, but one that gives us no new information.
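As a reminder of the metric, F1 is the harmonic mean of precision and recall for the class of interest. A minimal helper of my own for reference (caret’s F_meas is what I actually use below), assuming ‘Threatened’ as the positive class:
f1_score <- function(pred, truth, positive = "Threatened") {
  tp <- sum(pred == positive & truth == positive)  #true positives
  fp <- sum(pred == positive & truth != positive)  #false positives
  fn <- sum(pred != positive & truth == positive)  #false negatives
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}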
lc_accuracy <- nrow(test %>% filter(code_current=="LC")) / nrow(test)
lc_accuracy
## [1] 0.9070905
models_list <- list()
Start with some linear and generalized linear models; these help me see which variables are significant predictors.
train <- train_raw
test <- test_raw
fish1 <- glm(new_category ~ Vulnerability + CommonLength + PriceCateg + Importance + Length + CellArea, data=train, family="binomial")
#summary(fish1)
fish2 <- glm(new_category ~ Vulnerability + CommonLength + PriceCateg + Importance + Length + max_depth + CellArea, data=train, family="binomial")
#summary(fish2)
fish3 <- glm(new_category ~ Vulnerability + CommonLength + PriceCateg + Importance + Length + habitat_group + max_depth + CellArea, data=train, family="binomial")
#summary(fish3)
fish1_pred <- predict(fish1, newdata = test, type = "response") #type="response" returns probabilities rather than log-odds, so the 0.5 cutoff below is meaningful
fish2_pred <- predict(fish2, newdata = test, type = "response")
fish3_pred <- predict(fish3, newdata = test, type = "response")
models_list <- models_list %>% append(list(fish1, fish2, fish3)) #list(), not c(), keeps each model object intact
convert_to_new_category <- function(predictions){
  if(is.numeric(predictions)){
    predictions <- if_else(predictions >= 0.50, "Threatened", "Not Threatened")
    #set both levels explicitly so downstream factors line up even if one class is absent
    predictions <- factor(predictions, levels = c("Not Threatened", "Threatened"))
  }
  predictions
}
fish1_pred <- convert_to_new_category(fish1_pred)
fish2_pred <- convert_to_new_category(fish2_pred)
fish3_pred <- convert_to_new_category(fish3_pred)
predicted_values <- list()
predicted_values <- list("fish1" = fish1_pred,
"fish2" = fish2_pred,
"fish3" = fish3_pred)
Our GLMs perform quite well, with MSPE decreasing with each variable added. Vulnerability, CellArea, and Length are all highly significant predictors. Some levels within the categorical variables also perform well, such as commercial species and high-value species.
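To make the MSPE comparison concrete, here is one way to compute it on the test set (a sketch using the fish1-fish3 models fitted above, with the squared error taken against a 0/1 encoding of new_category):
mspe <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")  #predicted probability of "Threatened"
  truth <- as.numeric(data$new_category == "Threatened")  #0/1 encoding of the observed class
  mean((p - truth)^2, na.rm = TRUE)
}
sapply(list(fish1 = fish1, fish2 = fish2, fish3 = fish3), mspe, data = test)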
Can PCA help us? Maybe less useful because of how many categorical variables we have:
fish.pca <- prcomp(train %>% select(cat_score, Length, Vulnerability, CommonLength, CellArea), center = TRUE,scale. = TRUE)
summary(fish.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.7012 0.9944 0.8050 0.64556 0.22898
## Proportion of Variance 0.5788 0.1978 0.1296 0.08335 0.01049
## Cumulative Proportion 0.5788 0.7766 0.9062 0.98951 1.00000
devtools::install_github("vqv/ggbiplot")
ipak("ggbiplot")
## ggbiplot
## TRUE
ggbiplot(fish.pca, alpha=0.3)
#pca_groups <- train %>% pull(code_current) %>% as.character() #superseded by the new_category grouping below
pca_groups <- train %>% pull(new_category) %>% as.character()
#grid1 <- grid2 <- seq(0.05, 0.2, length=5)
#ggbiplot(fish.pca, ellipse=TRUE, labels = pca_groups, groups=pca_groups)
library(rpart)       #CART decision tree
library(rpart.plot)  #prp() for plotting the tree
tree <- rpart(new_category ~ Vulnerability + PriceCateg + Importance + Length + habitat_group + CellArea, data=train, minbucket=50)
prp(tree)
tree.pred = predict(tree, newdata=test)
tree.pred <- as.data.frame(tree.pred)
tree.pred <- tree.pred %>%
  mutate(prediction = if_else(Threatened >= 0.5, "Threatened", "Not Threatened")) %>%
  mutate(prediction = factor(prediction, levels = c("Not Threatened", "Threatened"))) #set levels explicitly so they line up even if one class is absent
tree.pred <- tree.pred %>% pull(prediction)
predicted_values <- predicted_values %>% append(list("cart_classifier"=tree.pred))
Length and Vulnerability are key here. This might not be surprising given previous research, but it is still helpful for categorizing our data.
kNN is a lazy learning algorithm: it uses points that are similar on the predictors to derive patterns in the data for its prediction.
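Because kNN is distance-based, predictors on large scales (e.g., Length in cm) can swamp the 0/1 dummy variables. A minimal sketch of z-scoring the numeric columns with training-set parameters, an extra step I have not applied in the run below:
#Sketch: scale numeric predictors with training means/SDs before running kNN
scale_with_train <- function(train_df, test_df, cols = c("Vulnerability", "Length")) {
  ctr <- colMeans(train_df[cols], na.rm = TRUE)
  sds <- sapply(train_df[cols], sd, na.rm = TRUE)
  train_df[cols] <- scale(train_df[cols], center = ctr, scale = sds)
  test_df[cols] <- scale(test_df[cols], center = ctr, scale = sds)  #reuse training parameters to avoid leakage
  list(train = train_df, test = test_df)
}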
train <- train_raw
test <- test_raw
train_labels <- train %>% pull(new_category)
test_labels <- test %>% pull(new_category)
train_knn <- train %>% select(Vulnerability, PriceCateg, Importance, Length, habitat_group)
test_knn <- test %>% select(Vulnerability, PriceCateg, Importance, Length, habitat_group)
#using class::knn
colSums(is.na(train_knn))
## Vulnerability PriceCateg Importance Length habitat_group
## 0 0 0 0 0
colSums(is.na(test_knn))
## Vulnerability PriceCateg Importance Length habitat_group
## 0 0 0 0 0
train_y <- train %>% select(new_category)
# colnames(train_knn)
# colnames(test_knn)
#Transform habitat groups,price categ and importance into dummy variables.
ipak("fastDummies")
## fastDummies
## TRUE
train_knn <- fastDummies::dummy_cols(train_knn, select_columns = c("PriceCateg", "Importance", "habitat_group"), remove_first_dummy = TRUE) %>% select(-c(PriceCateg, Importance, habitat_group))
test_knn <- fastDummies::dummy_cols(test_knn, select_columns = c("PriceCateg", "Importance", "habitat_group"), remove_first_dummy = TRUE) %>% select(-c(PriceCateg, Importance, habitat_group))
#test_knn$habitat_gorup_15 <- as.integer(0)
dummy_cols <- colnames(train_knn)[grep(colnames(train_knn), pattern="_")]
knn_cv_model <- class::knn.cv(train=train_knn, cl=train_labels, k=15) #leave-one-out CV on the training data (result not used further here)
knn_pred_category <- class::knn(train=train_knn, cl=train_labels, test=test_knn, k=15)
#Add values to predicted values for later evaluation.
predicted_values <- predicted_values %>% append(list("knn_classifier"=knn_pred_category))
#fct_relevel(factor(test_labels), as.factor(1))
test_labels <- as.factor(test_labels)
table <- table(knn_pred_category, test_labels)
caret::confusionMatrix(knn_pred_category, test_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Threatened Threatened
## Not Threatened 763 39
## Threatened 4 12
##
## Accuracy : 0.9474
## 95% CI : (0.9298, 0.9617)
## No Information Rate : 0.9377
## P-Value [Acc > NIR] : 0.138
##
## Kappa : 0.3385
##
## Mcnemar's Test P-Value : 2.161e-07
##
## Sensitivity : 0.9948
## Specificity : 0.2353
## Pos Pred Value : 0.9514
## Neg Pred Value : 0.7500
## Prevalence : 0.9377
## Detection Rate : 0.9328
## Detection Prevalence : 0.9804
## Balanced Accuracy : 0.6150
##
## 'Positive' Class : Not Threatened
##
The kNN model’s balanced accuracy is poor: it predicts 39 species as Not Threatened when they are actually Threatened.
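For reference, balanced accuracy is just the mean of sensitivity and specificity, which is why it is so low despite the ~95% raw accuracy on this imbalanced test set:
sensitivity <- 763 / (763 + 4)  #Not Threatened correctly identified
specificity <- 12 / (12 + 39)   #Threatened correctly identified
(sensitivity + specificity) / 2 #~0.615, matching the confusionMatrix output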
#Fishy Bayes:
train <- train_raw %>% select(Vulnerability, PriceCateg, Importance, Length, habitat_group, CellArea, new_category)
test <- test_raw %>% select(Vulnerability, PriceCateg, Importance, Length, habitat_group, CellArea, new_category)
library(naivebayes)  #naive_bayes()
newNBclassifier_category <- naive_bayes(new_category ~ Vulnerability + PriceCateg + Importance + Length + habitat_group + CellArea, usekernel=T, data=train, laplace = 1)
printALL <- function(model){
  trainPred <- predict(model, newdata = train)
  trainTable <- table(train$new_category, trainPred)
  testPred <- predict(model, newdata = test)
  testTable <- table(test$new_category, testPred)
  trainAcc <- (trainTable[1,1] + trainTable[2,2]) / sum(trainTable)
  testAcc <- (testTable[1,1] + testTable[2,2]) / sum(testTable)
  message("Contingency Table for Training Data")
  print(trainTable)
  message("Contingency Table for Test Data")
  print(testTable)
  message("Accuracy")
  print(round(cbind(trainAccuracy = trainAcc, testAccuracy = testAcc), 3))
}
printALL(newNBclassifier_category)
## trainPred
## Not Threatened Threatened
## Not Threatened 10241 433
## Threatened 393 351
## testPred
## Not Threatened Threatened
## Not Threatened 745 22
## Threatened 33 18
## trainAccuracy testAccuracy
## [1,] 0.928 0.933
nb_predictions=predict(newNBclassifier_category, newdata=test)
predicted_values <- predicted_values %>% append(list("nb_classifier"=nb_predictions))
train <- train_raw
test <- test_raw
train <- train %>%
mutate_if(is.character, as.factor)
test <- test %>%
mutate_if(is.character, as.factor)
library(randomForest)
rand.class <- randomForest(as.factor(code_current) ~ Vulnerability + PriceCateg + Importance + Length + habitat_group + CellArea, data=train, na.action=na.omit)
print(rand.class)
##
## Call:
## randomForest(formula = as.factor(code_current) ~ Vulnerability + PriceCateg + Importance + Length + habitat_group + CellArea, data = train, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.39%
## Confusion matrix:
## CR EN EX LC NT VU class.error
## CR 107 3 0 14 1 2 0.157480315
## EN 3 142 0 45 1 6 0.279187817
## EX 0 0 0 15 0 0 1.000000000
## LC 1 1 0 10311 6 11 0.001839303
## NT 2 9 0 51 275 7 0.200581395
## VU 4 8 0 77 6 310 0.234567901
rand.class.pred <- predict(rand.class, newdata=test)
# auc <- auc(test$code_current, rand.class.pred)
# plot(roc(test$cat_score, rand.class.pred))
# ipak("MLmetrics")
# LogLoss(rand.class.pred,test$code_current)
round(importance(rand.class), 2)
## MeanDecreaseGini
## Vulnerability 487.52
## PriceCateg 130.78
## Importance 146.01
## Length 533.78
## habitat_group 59.58
## CellArea 439.97
#Now try with just the new_category which is what we're interested in.
rand.category <- randomForest(as.factor(new_category) ~ Vulnerability + PriceCateg + Importance + Length + habitat_group + CellArea, data=train, na.action=na.omit)
print(rand.category)
##
## Call:
## randomForest(formula = as.factor(new_category) ~ Vulnerability + PriceCateg + Importance + Length + habitat_group + CellArea, data = train, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 1.8%
## Confusion matrix:
## Not Threatened Threatened class.error
## Not Threatened 10639 35 0.003278996
## Threatened 170 574 0.228494624
rand.category.pred <- predict(rand.category, newdata=test)
predicted_values <- predicted_values %>% append(list("random_forest"=rand.category.pred))
# auc <- auc(test$code_current, rand.class.pred)
# plot(roc(test$cat_score, rand.class.pred))
Based on the above, we choose the random forest model. It performed well on the test data without simply giving the answer (Least Concern/Not Threatened) that would be right ~90% of the time.
#Score the class of interest ("Threatened") rather than caret's default first factor level
precision_scores <- lapply(predicted_values, precision, reference = as.factor(test$new_category), relevant = "Threatened")
recall_scores <- lapply(predicted_values, recall, reference = as.factor(test$new_category), relevant = "Threatened")
F1_scores <- lapply(predicted_values, F_meas, reference = as.factor(test$new_category), relevant = "Threatened")
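To compare the candidates side by side, I collect the scores into one table (a small convenience step, assuming the predicted_values list built up above):
model_scores <- tibble(
  model = names(F1_scores),
  precision = unlist(precision_scores),
  recall = unlist(recall_scores),
  F1 = unlist(F1_scores)
)
model_scores %>% arrange(desc(F1))  #best model by F1 first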
chosen_model <- rand.category
The real purpose of this analysis is to predict which currently Data Deficient species may be most at risk.
fish_dat_dd$PriceCateg[is.na(fish_dat_dd$PriceCateg)] <- "unknown"
#Assume common length has a linear relationship to max length and impute from it
# nrow(fish_dat_dd %>% filter(is.na(Length)==T)) #0 rows don't have length,
# nrow(fish_dat_dd %>% filter(is.na(CommonLength)==T)) #But 2250 don't have common length
length_model <- lm(CommonLength ~ Length, dat=fish_dat_dd)
fish_dat_dd$CommonLengthPred <- NA
fish_dat_dd$CommonLengthPred <- predict(length_model, newdata=fish_dat_dd)
fish_dat_dd$CommonLength[is.na(fish_dat_dd$CommonLength)==T] <- fish_dat_dd$CommonLengthPred[is.na(fish_dat_dd$CommonLength)==T]
#summary(fish_dat_dd %>% select(cat_score, Length, Vulnerability, CommonLength))
#Descriptors:
#importance
# fish_dat_dd %>% group_by(PriceCateg) %>% summarize(n=n())
# fish_dat_dd %>% group_by(Importance) %>% summarize(n=n())
fish_dat_dd$Importance[is.na(fish_dat_dd$Importance)] <- "unknown"
#Check for remaining NAs in important variables.
colSums(is.na(fish_dat_dd))
## count page result.taxonid
## 0 0 0
## result.kingdom_name result.phylum_name result.class_name
## 0 0 0
## result.order_name result.family_name result.genus_name
## 0 0 0
## result.scientific_name result.infra_rank result.infra_name
## 0 2732 2732
## result.population result.category code_current
## 2605 0 0
## cat_score Length Vulnerability
## 2732 0 0
## CommonLength PriceCateg Importance
## 0 0 0
## LongevityWild BodyShapeI DepthRangeDeep
## 2539 446 676
## DepthRangeComDeep SpeciesID CellArea
## 2483 1063 1063
## max_depth habitat_type habitat_group
## 0 0 0
## new_category CommonLengthPred
## 0 0
fish_dat_dd <- fish_dat_dd %>%
mutate_if(is.character, as.factor)
dd_predictions <- predict(chosen_model, fish_dat_dd)
num_dd <- length(dd_predictions)
num_dd_predicted <- length(dd_predictions[is.na(dd_predictions)==F])
num_dd_predicted_threatened <- length(dd_predictions[dd_predictions=="Threatened" & is.na(dd_predictions)==F])
fish_dat_dd$pred <- as.character(dd_predictions)
fish_dat_dd$pred[is.na(fish_dat_dd$pred)] <- "No Prediction"
We apply our random forest model to the DD data. There are 2732 DD species to try this on, but due to NAs in some of our predictors we only produce predictions for 1669. Of these, only 23 are predicted to be Threatened. While this seems low, it is good news for two reasons: 1. Fewer species in the DD category are threatened than we may have thought (Yay!). 2. 23 is a small number of species that could be re-assessed or prioritized for original research into their current population status. If we instead chose species at random, we would likely waste a lot of time and effort re-assessing DD species that are probably Least Concern.
Due to missing data, we cannot predict for all species; predictions are missing for ~1000 species. The next step would be to impute these values where possible and re-run the prediction, as sketched below.
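A sketch of that imputation idea: from the NA counts above, CellArea is the only model predictor with missing values here, so a crude median fill (a placeholder for a proper imputation of range size) would let the model predict for the remaining species. The names below are hypothetical; this chunk is not run:
#Hypothetical: median-impute CellArea, then re-run the random forest predictions
fish_dat_dd_imputed <- fish_dat_dd %>%
  mutate(CellArea = if_else(is.na(CellArea), median(CellArea, na.rm = TRUE), CellArea))
dd_predictions_imputed <- predict(chosen_model, fish_dat_dd_imputed)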
fish_dat_dd %>% ggplot(aes(x=pred, fill=pred)) +
geom_bar() +
xlab("Prediction Status") +
theme(legend.position = "none") +
labs(y="Number of species") +
scale_fill_manual(values=c("grey50" ,zissou5[1], zissou5[5])) +
scale_y_continuous(expand=c(0,0))