Disclaimer: This analysis has not been peer-reviewed and is not meant to replace the process of species assessment and prioritization undertaken by the IUCN Red List.

Intro

An analysis to aid in predicting the classification, and thus the triage, of IUCN Red List species currently listed as ‘Not Evaluated’ or ‘Data Deficient’.

Problem Description: The current rate of extinction is far above historical levels. Conservation has limited resources to allocate to both the assessment of species and their protection. The main international organization that evaluates the population status of species is the IUCN. It assesses species into several categories of extinction risk, plus a ‘Data Deficient’ category when there is not enough information to judge a species’ risk status.

IUCN Categories. Source: IUCN, 2019

Data sources:

- IUCN API (https://apiv3.iucnredlist.org/api/v3/docs) - current threat status, taxonomy, max depth, habitat type, threats, etc.
- Fishbase (www.fishbase.org) - length data, price category, vulnerability index; possibly habitat range, age at sexual maturity, and max age.
- Aquamaps (www.aquamaps.org) - area size (sum of cell area where occurrence is predicted).

Proposed Solution: To properly allocate resources for conservation, we need to know which species are most threatened. Species classified as ‘Data Deficient’ could be of high concern (e.g., Threatened) or of low priority (Least Concern or Near Threatened). An improvement to this process would be to predict which species are most likely to be of high conservation concern, both within the Data Deficient category and among species not currently assessed by the IUCN.

After some review, I believe it is less important to predict the exact category than to predict whether a species is likely to be ‘Threatened’ (i.e., Vulnerable, Endangered, or Critically Endangered) or of lesser concern (‘Not Threatened’, i.e., Least Concern or Near Threatened).

Therefore, I aim to train a machine learning or statistical model on available data to predict the IUCN category, or the general level of conservation concern, of species currently listed as Data Deficient or not yet included on the IUCN Red List.

Based on my current knowledge, I begin with a focus on aquatic species, and especially fishes.

ML Methods: logistic regression, decision tree, random forest, naive Bayes, and kNN. I can frame these either as regression on a linear 0-1 scoring of IUCN status, as classification over the full set of threat categories, or as classification into two over-arching categories (Not Threatened vs. Threatened).
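As a minimal sketch of the two framings, assuming an evenly spaced 0-1 scoring of the ordered categories (the analysis below takes its actual cat_score values from risk_code_lookup.csv):

#Assumed, evenly spaced 0-1 scoring of the ordered IUCN categories
iucn_score <- c(LC = 0.0, NT = 0.2, VU = 0.4, EN = 0.6, CR = 0.8, EX = 1.0)

#Regression framing: model the continuous score directly.
#Classification framing: collapse to two over-arching categories.
to_binary <- function(code) {
  ifelse(iucn_score[code] <= 0.2, "Not Threatened", "Threatened")
}
to_binary(c("LC", "NT", "VU", "CR"))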

#Plotting setup: Wes Anderson "Zissou1" palettes and a clean base theme
pal <- wes_palette(name = "Zissou1", n = 100, type = "continuous")
zissou5 <- wes_palette(name = "Zissou1")
theme_set(theme_classic())
null_colour <- "white"

Data

#Establish list of marine species
all_species <- read.csv(file.path(dir_data, "iucn_species_list_version_2019-2.csv"))
marine_species <- read.csv(file.path(dir_data, "spp_marine_from_version_2019-2.csv")) %>% 
  filter(!is.na(habs))

#Split comma-separated habitat codes into one row per species-habitat pair
marine_species <- marine_species %>% 
  separate(habs, into = paste0("habs", 1:50), sep = ",") %>% 
  gather(key = "habitat_num", value = "value", -iucn_sid, -max_depth) %>% 
  mutate(habitat_type = as.factor(value)) %>% 
  dplyr::select(-c(habitat_num, value)) %>% 
  filter(!is.na(habitat_type)) %>% 
  #Habitat group is the top-level code before the first "."
  mutate(habitat_group = gsub(habitat_type, pattern = "\\..*", replacement = "")) %>% 
  mutate(habitat_group = as.factor(trimws(habitat_group)))

#### Align with threats and narrative text: ####
species_threats <- read.csv(file.path(dir_data, "species_fishing_threats.csv"), stringsAsFactors = F)
colnames(species_threats) <- gsub(colnames(species_threats), pattern="result\\.", replacement="")
risk_codes <- read.csv(file.path(dir_data, "risk_code_lookup.csv"), stringsAsFactors = F)

species_threats <- species_threats %>% left_join(risk_codes, by=c("category"="code"))
species_threats <- species_threats %>%
  dplyr::select(iucn_sid, class_name, scientific_name, code_current, cat_score) %>%
  unique()

# threatened_status <- c("VU", "CR", "EN")
# species_threats <- species_threats %>% filter(code_current %in% threatened_status)
text_threats <- read.csv(file.path(dir_data, "narrative_threats_output_tidy.csv"))

num_dd <- nrow(all_species %>% filter(result.category=="DD"))

Exploring the Data

16,011 species are currently classified by the IUCN as Data Deficient (DD; see Figure @ref(fig:iucn-species-plot)). This is a major challenge for conservation, as Data Deficient species are not prioritized for conservation efforts, and many are likely at an elevated risk of extinction.

all_species <- all_species %>% left_join(risk_codes, by=c("result.category"="code"))
all_species <- all_species  %>% mutate(code_current = fct_relevel(code_current, c("DD", "LC", "NT", "VU", "EN", "CR", "EX"))) 
included_species <- all_species %>% filter(result.taxonid %in% species_threats$iucn_sid)
all_species_plot <- ggplot(data=all_species, aes(x=code_current, fill=code_current)) +
  geom_bar() + 
  scale_fill_manual(values= IUCNpalette::iucn_palette(category="All", exclude="NE")) +
  ylab("Number of species") + 
  labs(fill="IUCN Category") + 
  xlab("") + 
  scale_y_continuous(expand=c(0,0)) + 
  theme(legend.position = "none") + 
  NULL 

all_species_plot
IUCN Red List species by conservation status

Let’s collapse these into species of lesser concern (‘Not Threatened’: Least Concern and Near Threatened) and those that are Threatened or worse (Vulnerable, Endangered, Critically Endangered, and Extinct).

all_species <- all_species %>%
  mutate(new_category = if_else(is.na(cat_score), "Data Deficient", 
                                if_else(cat_score <=0.2, "Not Threatened", "Threatened"
                                #if_else(cat_score >0.8, "Extinct", "Threatened") #Removed this as it probably doesn't add much 
                                ))) %>% 
  mutate(new_category = as.factor(new_category))
all_species_plot <- ggplot(data=all_species, aes(x=new_category, fill=new_category)) +
  geom_bar() + 
  scale_fill_manual(values= c("grey50", zissou5[1], zissou5[5])) +
  ylab("Number of species") + 
  labs(fill="IUCN Category") + 
  xlab("") + 
  scale_y_continuous(expand=c(0,0)) + 
  theme(legend.position = "none") + 
  NULL 

all_species_plot

Here we’re going to take advantage of the rfishbase package to extract relevant predictors for fishes. This limits our analysis to species present in both the IUCN Red List and Fishbase.

Here I look at the possible columns of Fishbase data and select those I think may be good predictors of an elevated risk status. I then check how many values in each are NA, since a variable is less useful if I have to fill in a lot of missing data.

Longevity is probably a good predictor, but it is only available for ~2,000 of the 34,000 species covered by Fishbase. I check whether there is a clear relationship with Length that could be modelled, but there does not seem to be one (a quick numeric check follows the plot below).

fishbase_taxa <- load_taxa()
#Variables of interest: Length, Vulnerability, CommonLength, PriceCateg, Importance
x <- species(species_list = "Bolbometopon muricatum") #Used to find out which fields are important
# colnames(x)
species_list_fb <- fishbase_taxa %>% pull(Species)
fb_data <- species(species_list = species_list_fb, fields=c("Species", "Length", "Vulnerability", "CommonLength", "PriceCateg", "Importance", "LongevityWild", "BodyShapeI", "DepthRangeDeep", "DepthRangeComDeep"))
# popgrowth("Oreochromis niloticus")
# distribution("Oreochromis niloticus")

nrow(fb_data %>% filter(is.na(DepthRangeDeep)))
## [1] 19402
nrow(fb_data %>% filter(is.na(DepthRangeComDeep) & is.na(DepthRangeDeep)))
## [1] 19269
fb_data %>% group_by(DepthRangeDeep) %>% 
  summarize(n=n())
## # A tibble: 1,451 x 2
##    DepthRangeDeep     n
##             <dbl> <int>
##  1              0     9
##  2              1   144
##  3              2   167
##  4              3   127
##  5              4    68
##  6              5   158
##  7              6   102
##  8              7    44
##  9              8   104
## 10              9    53
## # ... with 1,441 more rows
#Is there a relationship between Length and Longevity? 
fb_data %>% 
  ggplot(aes(x=Length, y=LongevityWild)) +
  geom_point()
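The same question can also be checked numerically (an added convenience, using complete cases only):

#Correlation between max length and longevity in the wild
with(fb_data, cor(Length, LongevityWild, use = "complete.obs"))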

Aquamaps has extensive data on the distribution of aquatic species. Its database is accessible through the aquamapsdata package, which sets it up as a local SQLite database. I compare the taxa lists between Aquamaps and the IUCN to find where they overlap, since for now I only want to predict for species already included on the IUCN Red List. The main predictor I use here is range size: if a species is predicted to occur in a given 0.5 by 0.5 degree cell, it is assumed to occupy the area of that cell. Species with larger ranges have more places where they can remain unaffected by human or other disturbances.

This data is really helpful but takes a long time to process so I set eval=F on this chunk.

download_db(force = TRUE) #The DB is ~1 GB, so this download takes a long time.
my_db <- aquamapsdata:::src_sqlite_aquamapsdata()
# for a db table, return a tibble with the columns and their data types
ls_types <- function(table) {
  res <- table %>% head %>% collect %>% lapply(type_sum) %>% unlist
  colname <- names(res)
  title <- as.character(table$ops$x)
  tibble(table = title, col_name = colname, col_type = res, desc = NA)
}
# run the above function on all tables
am_schema <- bind_rows(
  my_db %>% tbl("nativemaps") %>% ls_types,
  my_db %>% tbl("hcaf") %>% ls_types,
  my_db %>% tbl("hspen") %>% ls_types,
  my_db %>% tbl("occ") %>% ls_types,
  my_db %>% tbl("taxa") %>% ls_types
)
#datatable(am_schema)

#Get species list
am_species <- my_db %>% tbl("taxa") %>% filter(is.na(iucn_code)==F) %>% collect %>% mutate(SciName = paste(Genus, Species)) %>% select(SciName, SPECIESID) 
am_species <- am_species %>% filter(SciName %in% all_species$result.scientific_name)
#Get maps for species in list
am_species_maps <- my_db %>% tbl("nativemaps") %>% collect %>% filter(SpeciesID %in% am_species$SPECIESID) 
#get reference for map cell ids:
am_map <- my_db %>% tbl("hcaf") %>% collect %>% select(CsquareCode, CenterLat, CenterLong, CellArea)


am_species_maps <- am_species_maps %>% left_join(am_map)

am_species_maps %>% 
  ggplot(aes(x=CenterLong, y=CenterLat, fill=probability)) + 
  geom_tile()

write_csv(am_species_maps, "./data/am_species_maps.csv")

am_species_maps_area <- am_species_maps %>% 
  group_by(SpeciesID) %>% 
  summarize(CellArea = sum(CellArea)) %>% 
  left_join(am_species, by=c("SpeciesID"="SPECIESID"))

write_csv(am_species_maps_area, "./data/am_species_area.csv")

Combine the IUCN, Fishbase, and Aquamaps data here, filling in NAs where possible and filtering out observations with NAs where not. I also set aside the Data Deficient (DD) species for later, as these are what I ultimately want to predict.

am_species_maps_area <- read_csv( "../data/am_species_area.csv")

all_species <- read.csv(file.path(dir_data, "iucn_species_list.csv"))
risk_codes <- read.csv(file.path(dir_data, "risk_code_lookup.csv"), stringsAsFactors = F)
risk_codes <- risk_codes %>% dplyr::select(code, code_current, cat_score) %>% distinct()
dat <- all_species %>% 
  left_join(risk_codes, by=c("result.category"="code")) %>% 
  left_join(fb_data, by=c("result.scientific_name"="Species")) %>% 
  left_join(am_species_maps_area, by=c("result.scientific_name"="SciName")) %>% 
  left_join(marine_species, by=c("result.taxonid"="iucn_sid"))

#Add in feature of larger categories. Might improve classifier performance:
dat$new_category <- NA
dat <- dat %>%
  mutate(new_category = if_else(is.na(cat_score), "Data Deficient", 
                                if_else(cat_score <=0.2, "Not Threatened", "Threatened"
                                #if_else(cat_score >0.8, "Extinct", "Threatened") #Removed this as it probably doesn't add much 
                                ))) %>% 
  mutate(new_category = as.factor(new_category))


dat_no_dd <- dat %>% filter(code_current != "DD")
dat_no_dd <- dat_no_dd  %>% filter(is.na(max_depth)==F) %>% droplevels.data.frame()

fish_dat <- dat_no_dd %>%
  filter(is.na(Length)==F & is.na(cat_score)==F) %>% 
  droplevels.data.frame()  %>% 
  filter(is.na(CellArea)==F)

#14963 fishes are found in FB and IUCN list

#dat_no_dd %>% filter(is.na(habitat_type)) 

#Separate out DD data
dd <- dat %>% filter(code_current == "DD")
fish_dat_dd <- dd %>% filter(is.na(Length)==F) %>% filter(is.na(max_depth)==F) #2220 fishes are found in FB and IUCN list and are Data Deficient

Check how many of the variables I am using are missing from these databases. I can fill in some values based on physical relationships among these species, but that is harder for other variables (e.g., price category).

#
# unique(fish_dat$PriceCateg)
fish_dat$PriceCateg[is.na(fish_dat$PriceCateg)] <- "unknown"


#Assume common length has a linear relationship to max length; impute missing values from that fit
nrow(fish_dat %>% filter(is.na(Length)))
## [1] 0
length_model <- lm(CommonLength ~ Length, data = fish_dat)
fish_dat$CommonLengthPred <- predict(length_model, newdata = fish_dat)
fish_dat$CommonLength[is.na(fish_dat$CommonLength)] <- fish_dat$CommonLengthPred[is.na(fish_dat$CommonLength)]


summary(fish_dat %>% select(cat_score, Length, Vulnerability, CommonLength))
##    cat_score           Length        Vulnerability    CommonLength     
##  Min.   :0.00000   Min.   :   1.00   Min.   :10.00   Min.   :   3.200  
##  1st Qu.:0.00000   1st Qu.:  12.00   1st Qu.:16.38   1st Qu.:   9.491  
##  Median :0.00000   Median :  25.00   Median :28.26   Median :  16.530  
##  Mean   :0.04191   Mean   :  49.59   Mean   :31.87   Mean   :  29.845  
##  3rd Qu.:0.00000   3rd Qu.:  51.00   3rd Qu.:41.60   3rd Qu.:  30.067  
##  Max.   :1.00000   Max.   :1600.00   Max.   :90.00   Max.   :1000.000
fish_dat_numeric <- fish_dat %>% select(cat_score, Length, Vulnerability, CommonLength)

cormat <- round(cor(fish_dat_numeric),2)
cormat <- as.data.frame(cormat) 
cormat$Var1 <- row.names(cormat)
melted_cormat <- cormat %>% gather(key="Var2", value="value", -Var1)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()

#Descriptors: 
#importance
fish_dat %>% group_by(PriceCateg) %>% summarize(n=n())
## # A tibble: 5 x 2
##   PriceCateg     n
##   <chr>      <int>
## 1 high        1988
## 2 low          788
## 3 medium      3158
## 4 unknown     6358
## 5 very high   4020
fish_dat %>% group_by(Importance) %>% summarize(n=n())
## # A tibble: 7 x 2
##   Importance                n
##   <chr>                 <int>
## 1 <NA>                   7735
## 2 commercial             2619
## 3 highly commercial       575
## 4 minor commercial       3120
## 5 of no interest         1669
## 6 of potential interest    73
## 7 subsistence fisheries   521
fish_dat$Importance[is.na(fish_dat$Importance)] <- "unknown"


fish_dat$BodyShapeI[is.na(fish_dat$BodyShapeI)] <- "other"
fish_dat$BodyShapeI[fish_dat$BodyShapeI=="other (see remarks)"] <- "other"
#Check for remaining NAs in important variables. 
#colSums(is.na(fish_dat))

Interestingly, vulnerability is a poorer predictor of cat_score than length.
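The relevant row of the correlation matrix computed above can be read off directly to back this up:

#Correlations of each numeric predictor with cat_score
round(cor(fish_dat_numeric)["cat_score", ], 2)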

I set up the train and test datasets so that I can keep coming back to clean versions. Some of the modelling later on needs the data in particular formats (factors, numerics, etc.), so I modify it for each analysis but keep a ‘clean’ copy to return to. I use a 70/30 train/test split here; a stratified alternative is sketched after the chunk below.

train_raw <- sample_frac(fish_dat, size=0.7, replace=F)
test_raw  <- fish_dat %>% filter(!result.taxonid %in% train_raw$result.taxonid)

train <- train_raw
test <- test_raw
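Note that sample_frac() does not stratify by class, so a split could by chance under-represent Threatened species in the train or test set. A stratified alternative using caret::createDataPartition (a sketch, not what is used above):

#Stratified 70/30 split preserving the class balance of new_category
set.seed(42)
in_train  <- caret::createDataPartition(fish_dat$new_category, p = 0.7, list = FALSE)
train_raw <- fish_dat[in_train, ]
test_raw  <- fish_dat[-in_train, ]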

I am going to fit several types of models, but I will focus on binary classification for now. Therefore, I am going to use the F1-score as my primary metric for comparing model types.
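For reference, F1 is the harmonic mean of precision and recall; a minimal helper that is equivalent in spirit to caret::F_meas, which is used for the model comparisons later:

#F1 = harmonic mean of precision and recall
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}
f1_score(precision = 0.95, recall = 0.99)
## [1] 0.9695876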

Looking back at our training data, we know the vast majority of species are Least Concern or Near Threatened. That means a model could be ‘correct’ around 90% of the time just by always guessing ‘Not Threatened’. That is a very high percentage, but it gives us no new information.

#Baseline: proportion of the test set that is Least Concern
lc_accuracy <- nrow(test %>% filter(code_current=="LC")) / nrow(test)
lc_accuracy
## [1] 0.9070905
models_list <- list()

Modeling

Linear Modeling

Start with some linear and generalized linear models. This helps show which variables are significant predictors.

train <- train_raw
test <- test_raw

fish1 <- glm(new_category ~ Vulnerability + CommonLength + PriceCateg + Importance + Length + CellArea, data=train, family="binomial")
#summary(fish1)

fish2 <- glm(new_category ~ Vulnerability + CommonLength + PriceCateg + Importance + Length + max_depth + CellArea, data=train, family="binomial")
#summary(fish2)

fish3 <- glm(new_category ~ Vulnerability +  CommonLength + PriceCateg + Importance + Length + habitat_group + max_depth + CellArea, data=train, family="binomial")
#summary(fish3)

#Predict on the response (probability) scale so the 0.5 threshold below is meaningful
fish1_pred <- predict(fish1, newdata = test, type = "response")
fish2_pred <- predict(fish2, newdata = test, type = "response")
fish3_pred <- predict(fish3, newdata = test, type = "response")

models_list <- models_list %>% append(list(fish1, fish2, fish3)) #list(), not c(), so each model object stays intact

#Convert numeric probabilities into the binary factor used throughout
convert_to_new_category <- function(predictions){
  if(is.numeric(predictions)){
    predictions <- if_else(predictions >= 0.50, "Threatened", "Not Threatened")
    predictions <- factor(predictions, levels = c("Not Threatened", "Threatened"))
  }
  return(predictions)
}

fish1_pred <- convert_to_new_category(fish1_pred)
fish2_pred <- convert_to_new_category(fish2_pred)
fish3_pred <- convert_to_new_category(fish3_pred)

predicted_values <- list()
predicted_values <- list("fish1" = fish1_pred,
                         "fish2" = fish2_pred,
                         "fish3" = fish3_pred)

Our GLMs are performing quite well, with MSPE decreasing for each variable added. Vulnerability, CellArea, and Length all perform well as highly significant predictors. Certain levels within the categorical variables, such as commercial and high-value species, are also informative.
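The MSPE comparison referenced above can be computed like this (a sketch, using response-scale predictions; with a binary outcome this is equivalent to the Brier score):

#MSPE of each GLM on the test set, treating Threatened as 1 and Not Threatened as 0
mspe <- function(pred_prob, actual) {
  mean((pred_prob - as.numeric(actual == "Threatened"))^2, na.rm = TRUE)
}
sapply(list(fish1 = fish1, fish2 = fish2, fish3 = fish3),
       function(m) mspe(predict(m, newdata = test, type = "response"), test$new_category))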

PCA

Can PCA help us? It may be less useful here because of how many categorical variables we have:

fish.pca <- prcomp(train %>% select(cat_score, Length, Vulnerability, CommonLength, CellArea), center = TRUE,scale. = TRUE)
summary(fish.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.7012 0.9944 0.8050 0.64556 0.22898
## Proportion of Variance 0.5788 0.1978 0.1296 0.08335 0.01049
## Cumulative Proportion  0.5788 0.7766 0.9062 0.98951 1.00000
devtools::install_github("vqv/ggbiplot")

ipak("ggbiplot")
## ggbiplot 
##     TRUE
ggbiplot(fish.pca, alpha=0.3)

#pca_groups <- train %>% pull(code_current) %>% as.character() #superseded by the binary grouping below
pca_groups <- train %>% pull(new_category) %>% as.character()

#grid1 <- grid2 <- seq(0.05, 0.2, length=5)
#ggbiplot(fish.pca, ellipse=TRUE, labels = pca_groups, groups=pca_groups)

Decision Trees

tree <- rpart(new_category ~ Vulnerability +  PriceCateg + Importance + Length + habitat_group + CellArea, data=train, minbucket=50)
prp(tree)

tree.pred = predict(tree, newdata=test)

tree.pred <- as.data.frame(tree.pred)
tree.pred <- tree.pred %>% 
  mutate(prediction = if_else(Threatened >= 0.5, "Threatened", "Not Threatened")) %>%
  mutate(prediction = factor(prediction, levels = c("Not Threatened", "Threatened")))

tree.pred <- tree.pred %>% pull(prediction)

predicted_values <- predicted_values %>% append(list("cart_classifier"=tree.pred))

Length and Vulnerability are key here. This might not be surprising given previous research, but it is still helpful for categorizing our data.

kNN

kNN is a lazy learning algorithm: it predicts each observation from the labels of the observations most similar to it on the predictors.

train <- train_raw
test <- test_raw

train_labels <- train %>% pull(new_category)
test_labels <- test %>% pull(new_category)
train_knn <- train %>% select(Vulnerability, PriceCateg, Importance, Length, habitat_group)
test_knn <- test %>% select(Vulnerability, PriceCateg, Importance, Length, habitat_group)


#using class::knn 

colSums(is.na(train_knn))
## Vulnerability    PriceCateg    Importance        Length habitat_group 
##             0             0             0             0             0
colSums(is.na(test_knn))
## Vulnerability    PriceCateg    Importance        Length habitat_group 
##             0             0             0             0             0
train_y <- train  %>% select(new_category)

# colnames(train_knn)
# colnames(test_knn)

#Transform habitat_group, PriceCateg and Importance into dummy variables.
ipak("fastDummies")
## fastDummies 
##        TRUE
train_knn <- fastDummies::dummy_cols(train_knn, select_columns = c("PriceCateg", "Importance", "habitat_group"), remove_first_dummy = TRUE) %>% select(-c(PriceCateg, Importance, habitat_group))
test_knn <- fastDummies::dummy_cols(test_knn, select_columns = c("PriceCateg", "Importance", "habitat_group"), remove_first_dummy = TRUE) %>% select(-c(PriceCateg, Importance, habitat_group))
#test_knn$habitat_gorup_15 <- as.integer(0)
dummy_cols <- colnames(train_knn)[grep(colnames(train_knn), pattern="_")]
knn_cv_model <- class::knn.cv(train=train_knn, cl=train_labels,  k=15)
knn_pred_category <- class::knn(train=train_knn, cl=train_labels, test=test_knn, k=15)

#Add values to predicted values for later evaluation. 
predicted_values <- predicted_values %>% append(list("knn_classifier"=knn_pred_category))

#fct_relevel(factor(test_labels), as.factor(1))

test_labels <- as.factor(test_labels)

table <- table(knn_pred_category, test_labels)
caret::confusionMatrix(knn_pred_category, test_labels)
## Confusion Matrix and Statistics
## 
##                 Reference
## Prediction       Not Threatened Threatened
##   Not Threatened            763         39
##   Threatened                  4         12
##                                           
##                Accuracy : 0.9474          
##                  95% CI : (0.9298, 0.9617)
##     No Information Rate : 0.9377          
##     P-Value [Acc > NIR] : 0.138           
##                                           
##                   Kappa : 0.3385          
##                                           
##  Mcnemar's Test P-Value : 2.161e-07       
##                                           
##             Sensitivity : 0.9948          
##             Specificity : 0.2353          
##          Pos Pred Value : 0.9514          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.9377          
##          Detection Rate : 0.9328          
##    Detection Prevalence : 0.9804          
##       Balanced Accuracy : 0.6150          
##                                           
##        'Positive' Class : Not Threatened  
## 

The balanced accuracy shows kNN does not perform very well: it predicts 39 species as Not Threatened when they are actually Threatened.
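One likely contributor is that kNN distances are scale-sensitive: Length runs from 1 to 1600 while the dummy variables are 0/1, so Length dominates the distance calculation. A sketch of standardizing the numeric predictors with training-set means and SDs before re-running knn (not applied above):

#Standardize numeric columns using the training-set means and SDs
num_cols <- c("Vulnerability", "Length")
mu <- sapply(train_knn[num_cols], mean)
sdev <- sapply(train_knn[num_cols], sd)
train_knn[num_cols] <- scale(train_knn[num_cols], center = mu, scale = sdev)
test_knn[num_cols] <- scale(test_knn[num_cols], center = mu, scale = sdev)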

Naive Bayes

#Fishy Bayes: keep only the predictors and the binary outcome
train <- train_raw %>% select(Vulnerability, PriceCateg, Importance, Length, habitat_group, CellArea, new_category)
test <- test_raw %>% select(Vulnerability, PriceCateg, Importance, Length, habitat_group, CellArea, new_category)

newNBclassifier_category <- naive_bayes(new_category ~ Vulnerability + PriceCateg + Importance + Length + habitat_group + CellArea, usekernel = T, data = train, laplace = 1)

#Print contingency tables and accuracy for the training and test sets
printALL <- function(model){
  trainPred <- predict(model, newdata = train)
  trainTable <- table(train$new_category, trainPred)
  testPred <- predict(model, newdata = test)
  testTable <- table(test$new_category, testPred)
  trainAcc <- (trainTable[1,1] + trainTable[2,2]) / sum(trainTable)
  testAcc <- (testTable[1,1] + testTable[2,2]) / sum(testTable)
  message("Contingency Table for Training Data")
  print(trainTable)
  message("Contingency Table for Test Data")
  print(testTable)
  message("Accuracy")
  print(round(cbind(trainAccuracy = trainAcc, testAccuracy = testAcc), 3))
}
printALL(newNBclassifier_category)
##                 trainPred
##                  Not Threatened Threatened
##   Not Threatened          10241        433
##   Threatened                393        351
##                 testPred
##                  Not Threatened Threatened
##   Not Threatened            745         22
##   Threatened                 33         18
##      trainAccuracy testAccuracy
## [1,]         0.928        0.933
nb_predictions=predict(newNBclassifier_category, newdata=test)

predicted_values <- predicted_values %>% append(list("nb_classifier"=nb_predictions))

Random Forest

train <- train_raw
test <- test_raw

train <- train %>% 
  mutate_if(is.character, as.factor)
test <- test %>% 
  mutate_if(is.character, as.factor)


rand.class <- randomForest(as.factor(code_current) ~ Vulnerability +  PriceCateg + Importance + Length + habitat_group + CellArea, data=train, na.action=na.omit)
print(rand.class)
## 
## Call:
##  randomForest(formula = as.factor(code_current) ~ Vulnerability +      PriceCateg + Importance + Length + habitat_group + CellArea,      data = train, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 2.39%
## Confusion matrix:
##     CR  EN EX    LC  NT  VU class.error
## CR 107   3  0    14   1   2 0.157480315
## EN   3 142  0    45   1   6 0.279187817
## EX   0   0  0    15   0   0 1.000000000
## LC   1   1  0 10311   6  11 0.001839303
## NT   2   9  0    51 275   7 0.200581395
## VU   4   8  0    77   6 310 0.234567901
rand.class.pred <- predict(rand.class, newdata=test)
# auc <- auc(test$code_current, rand.class.pred)
# plot(roc(test$cat_score, rand.class.pred))

# ipak("MLmetrics")
# LogLoss(rand.class.pred,test$code_current)

round(importance(rand.class), 2)
##               MeanDecreaseGini
## Vulnerability           487.52
## PriceCateg              130.78
## Importance              146.01
## Length                  533.78
## habitat_group            59.58
## CellArea                439.97
#Now try with just the new_category which is what we're interested in. 
rand.category <- randomForest(as.factor(new_category) ~ Vulnerability +  PriceCateg + Importance + Length + habitat_group + CellArea, data=train, na.action=na.omit)
print(rand.category)
## 
## Call:
##  randomForest(formula = as.factor(new_category) ~ Vulnerability +      PriceCateg + Importance + Length + habitat_group + CellArea,      data = train, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 1.8%
## Confusion matrix:
##                Not Threatened Threatened class.error
## Not Threatened          10639         35 0.003278996
## Threatened                170        574 0.228494624
rand.category.pred <- predict(rand.category, newdata=test)

predicted_values <- predicted_values %>% append(list("random_forest"=rand.category.pred))


# auc <- auc(test$code_current, rand.class.pred)
# plot(roc(test$cat_score, rand.class.pred))

Model Performance Summaries

Based on the above, we use the random forest model. It performed well on the test data without simply defaulting to the answer (Least Concern / Not Threatened) that would be right ~90% of the time. The precision, recall, and F1 scores computed below are gathered into a single table after the chunk.

precision_scores <- lapply(predicted_values, precision, reference = as.factor(test$new_category))
recall_scores <- lapply(predicted_values, recall, reference = as.factor(test$new_category))
F1_scores <- lapply(predicted_values, F_meas, reference = as.factor(test$new_category))
chosen_model <- rand.category
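For an at-a-glance comparison, the scores can be collected into one table (a small convenience sketch; model names follow predicted_values):

#One row per model, sorted by F1
model_performance <- tibble::tibble(
  model     = names(predicted_values),
  precision = unlist(precision_scores),
  recall    = unlist(recall_scores),
  F1        = unlist(F1_scores)
)
model_performance %>% arrange(desc(F1))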

The real purpose of this analysis is to predict which currently Data Deficient species may be most at risk.

fish_dat_dd$PriceCateg[is.na(fish_dat_dd$PriceCateg)] <- "unknown"


#Assume common length has a linear relationship to max length; impute missing values from that fit
# nrow(fish_dat_dd %>% filter(is.na(Length)==T)) #0 rows don't have length,
# nrow(fish_dat_dd %>% filter(is.na(CommonLength)==T)) #But 2250 don't have common length

length_model <- lm(CommonLength ~ Length, data = fish_dat_dd)
fish_dat_dd$CommonLengthPred <- predict(length_model, newdata = fish_dat_dd)
fish_dat_dd$CommonLength[is.na(fish_dat_dd$CommonLength)] <- fish_dat_dd$CommonLengthPred[is.na(fish_dat_dd$CommonLength)]


#summary(fish_dat_dd %>% select(cat_score, Length, Vulnerability, CommonLength))


#Descriptors: 
#importance
# fish_dat_dd %>% group_by(PriceCateg) %>% summarize(n=n())
# fish_dat_dd %>% group_by(Importance) %>% summarize(n=n())

fish_dat_dd$Importance[is.na(fish_dat_dd$Importance)] <- "unknown"

#Check for remaining NAs in important variables. 
colSums(is.na(fish_dat_dd))
##                  count                   page         result.taxonid 
##                      0                      0                      0 
##    result.kingdom_name     result.phylum_name      result.class_name 
##                      0                      0                      0 
##      result.order_name     result.family_name      result.genus_name 
##                      0                      0                      0 
## result.scientific_name      result.infra_rank      result.infra_name 
##                      0                   2732                   2732 
##      result.population        result.category           code_current 
##                   2605                      0                      0 
##              cat_score                 Length          Vulnerability 
##                   2732                      0                      0 
##           CommonLength             PriceCateg             Importance 
##                      0                      0                      0 
##          LongevityWild             BodyShapeI         DepthRangeDeep 
##                   2539                    446                    676 
##      DepthRangeComDeep              SpeciesID               CellArea 
##                   2483                   1063                   1063 
##              max_depth           habitat_type          habitat_group 
##                      0                      0                      0 
##           new_category       CommonLengthPred 
##                      0                      0
fish_dat_dd <- fish_dat_dd  %>% 
  mutate_if(is.character, as.factor)


dd_predictions <- predict(chosen_model, fish_dat_dd)
num_dd <- length(dd_predictions)
num_dd_predicted <- length(dd_predictions[is.na(dd_predictions)==F])
num_dd_predicted_threatened <- length(dd_predictions[dd_predictions=="Threatened" & is.na(dd_predictions)==F])


fish_dat_dd$pred <- as.character(dd_predictions)
fish_dat_dd$pred[is.na(fish_dat_dd$pred)] <- "No Prediction"

We apply our random forest model to the DD data. There are 2732 DD species to try this on, but due to NAs in some of our predictors we only produce predictions for 1669. Of these, only 23 are predicted to be Threatened. While this seems low, it is good news for two reasons: 1. Fewer species in the DD category are threatened than we may have thought (Yay!). 2. This is a small enough number of species that they could be re-assessed, or prioritized for original research into their current population status. If we instead chose DD species at random for re-assessment, we would likely waste a lot of time and effort on species that are probably Least Concern.

Due to missing data, we cannot predict for all species; predictions are missing for ~1,000 species. The next step would be to impute these values where possible and re-run the prediction.
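A minimal sketch of what that imputation could look like, assuming a family-level median fill for CellArea (packages like mice or missForest would allow more principled multivariate imputation):

#Fill missing CellArea with the family-level median, then re-predict
fish_dat_dd_imputed <- fish_dat_dd %>%
  group_by(result.family_name) %>%
  mutate(CellArea = if_else(is.na(CellArea),
                            median(CellArea, na.rm = TRUE),
                            CellArea)) %>%
  ungroup()

dd_predictions_v2 <- predict(chosen_model, fish_dat_dd_imputed)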

fish_dat_dd %>% ggplot(aes(x=pred, fill=pred)) +
  geom_bar() +
  xlab("Prediction Status") +
  theme(legend.position = "none") +
  labs(y="Number of species") + 
  scale_fill_manual(values=c("grey50" ,zissou5[1], zissou5[5])) +
  scale_y_continuous(expand=c(0,0))