I learnt about WinBUGS, OpenBUGS, then JAGS, and used data cloning for some projects. And I started abstracting away the workflow and in 2009 I submitted the dclone package to CRAN. The dclone package is still alive and well, now even supports Stan.
Since then, Subhash has retired and I moved jobs. We recently decided that we should do another data cloning workshop, as we have done a few times over the years. It is coming up on April 13, 2023, as a joint meetup of the Edmonton and Calgary R User groups.
Mixed models, also known as hierarchical models and multilevel models, are a useful class of models for the applied sciences. The goal of the workshop is to introduce the logic, theory, and implementation of these models to solve practical problems. The workshop will include a seminar-style overview and hands-on exercises covering common model classes, with examples that participants can extend for their own needs.
Find more about the course on the GitHub site: https://github.com/datacloning/workshop-2023-edmonton.
The paper represents a major step forward in understanding the complexity of population size estimation. We paid special attention to framing the implications for conservation and management of species at risk and for improving future data collection. It was published in the Condor, and it is open access:
Here is the abstract:
Estimating the population abundance of landbirds is a challenging task complicated by the amount, type, and quality of available data. Avian conservationists have relied on population estimates from Partners in Flight (PIF), which primarily uses roadside data from the North American Breeding Bird Survey (BBS). However, the BBS was not designed to estimate population sizes. Therefore, we set out to compare the PIF approach with spatially explicit models incorporating roadside and off-road point-count surveys. We calculated population estimates for 81 landbird species in Bird Conservation Region 6 in Alberta, Canada, using land cover and climate as predictors. We also developed a framework to evaluate how the differences between the detection distance, time-of-day, roadside count, and habitat representation adjustments explain discrepancies between the 2 estimators. We showed that the key assumptions of the PIF population estimator were commonly violated in this region, and that the 2 approaches provided different population estimates for most species. The average differences between estimators were explained by differences in the detection-distance and time-of-day components, but these adjustments left much unexplained variation among species. Differences in the roadside count and habitat representation components explained most of the among-species variation. The variation caused by these factors was large enough to change the population ranking of the species. The roadside count bias needs serious attention when roadside surveys are used to extrapolate over off-road areas. Habitat representation bias is likely prevalent in regions sparsely and non-representatively sampled by roadside surveys, such as the boreal region of North America, and thus population estimates for these regions need to be treated with caution for certain species. 
Additional sampling and integrated modeling of available data sources can contribute towards more accurate population estimates for conservation in remote areas of North America.
I am not going to provide another perspective here; instead, I list the other pieces I wrote in connection with the paper, along with commentaries that appeared in news outlets.
Blog posts:
In the news:
Presentations:
Social activity on Altmetric:
In summary, we evaluated a conventional removal model and a finite mixture removal model, with and without covariates, for 152 bird species. We found that the predicted availability probabilities under the conventional and finite mixture models were very similar with respect to the range of probability values and the shape of the response curves to predictor variables. However, finite mixture models were better supported for the large majority of species. We also found overwhelming support for time-varying models irrespective of the parametrization.
I have written a related post about the journey that led to this paper (Count me in! I am available for detection at 6 AM on May 26th); in this post I describe the math behind removal modeling as implemented in the detect R package.
It has long been recognized that nearly all avian field surveys underestimate abundances, unless the estimates are adjusted for the proportion of birds present but undetected at the times and locations surveyed. Detectability is the product of the probability that birds make themselves available for detection by emitting detectable cues (availability); and the probability that an available bird will be perceived by a bird surveyor (perceptibility).
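As a toy numeric illustration (the probabilities below are assumed values, not estimates from the paper), the two components simply multiply:

```r
## overall detectability = availability x perceptibility (assumed values)
p_avail  <- 0.7   # probability the bird gives a detectable cue during the count
p_percep <- 0.8   # probability the surveyor perceives an available bird
p_detect <- p_avail * p_percep
p_detect          # ~0.56, so almost half of the birds present go uncounted
```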
The time-removal model, originally developed for estimating wildlife and fish abundances from mark-recapture studies, was later reformulated for avian surveys with the goal of improving estimates of bird abundance by accounting for the availability bias inherent in point-count data. The removal model applied to point-count surveys estimates the probability that a bird is available for detection as a function of the average number of detectable cues that an individual bird gives per minute (singing rate), and the known count duration.
Time-removal models are based on a removal experiment whereby animals are trapped and thereby removed from the closed population of animals being sampled. When applying a removal model to avian point-count surveys, the counts of singing birds (\(Y_{ij}, \ldots, Y_{iJ}\)) within a given point-count survey \(i\) (\(i = 1,\ldots, n\)) are tallied relative to when each bird is first detected in multiple and consecutive time intervals, with the survey start time \(t_{i0} = 0\), the end times of the time intervals \(t_{ij}\) (\(j = 1, 2,\ldots, J\)), and the total count duration of the survey \(t_{iJ}\). We count each individual bird once, so individuals are ‘mentally removed’ from a closed population of undetected birds by the surveyor.
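As a toy simulation (not code from the paper), this data structure can be generated by drawing exponential first-cue times under a Poisson singing process and binning them into the survey intervals; the `phi` and `N` values below are assumptions for illustration:

```r
## Simulate removal counts for one survey: each bird's time to first cue
## is Exponential(phi); tally birds by the interval of first detection.
set.seed(1)
phi <- 0.4                 # singing rate, cues per minute (assumed)
N <- 10                    # birds present at the point (assumed)
breaks <- c(0, 3, 5, 10)   # interval end times t_{ij} in minutes
t1 <- rexp(N, rate = phi)  # time to first cue for each bird
## birds first cueing after 10 minutes remain undetected and are not tallied
Y <- table(cut(t1[t1 <= max(breaks)], breaks))
Y
```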
We have just defined the kind of data we need for removal models. In this post, I am going to use a data set from our paper comparing human observer based counts to automated recording units, Paired sampling standardizes point count data from humans and acoustic recorders. The data set is wrapped up in an R package called paired (thanks to Steve Van Wilgenburg for suggestions on this post and for agreeing to share this data set).
## install the paired data package from GitHub if not yet installed
if (!require(paired))
    devtools::install_github("borealbirds/paired")
library(paired)
data(paired)
We will use the counts for Ovenbird, one of the most common species in the data set (abbreviated as "OVEN"). The data is in long format, so I use the mefa4 R package to make the sample-by-species cross-tabulation, then subset the data to retain samples obtained by human observers, and finally drop records with missing predictor values. As predictors, we will use a variable capturing date (JDAY; standardized ordinal day of the year) and another capturing time of day (TSSR; time since local sunrise).
The data frame X contains the predictors. The matrix Y contains the counts of newly counted individuals binned into consecutive time intervals (0–3, 3–5, 5–10 minutes): cell values are the \(Y_{ij}\)'s. The object D is another matrix mirroring the structure of Y, but instead of counts it contains the interval end times: cell values are the \(t_{ij}\)'s.
library(mefa4)
spp <- "OVEN"
xt <- Xtab(Count ~ PKEY + Interval, paired,
subset=paired$SurveyType == "HUM" & paired$SPECIES == spp)
Y <- as.matrix(xt[,c("0-3 min", "3-5 min", "5-10 min")])
X <- nonDuplicated(paired[paired$SurveyType == "HUM",],
PKEY, TRUE)[rownames(Y),]
i <- !is.na(X$Latitude)
Y <- Y[i,]
X <- X[i,c("JDAY", "TSSR")]
D <- matrix(c(3, 5, 10), nrow(Y), 3, byrow=TRUE)
dimnames(D) <- dimnames(Y)
tail(X)
## JDAY TSSR
## 96PA-C2-B_2 0.4383562 0.13468851
## 96PA-C2-C_1 0.4438356 0.08526145
## 96PA-C2-C_2 0.4383562 0.15273704
## 96PA-C2-D_1 0.4438356 0.13597129
## 96PA-C2-E_1 0.4438356 0.10122629
## 96PA-C2-F_1 0.4438356 0.11999388
tail(Y)
## 0-3 min 3-5 min 5-10 min
## 96PA-C2-B_2 5 0 0
## 96PA-C2-C_1 2 0 0
## 96PA-C2-C_2 2 0 0
## 96PA-C2-D_1 7 0 0
## 96PA-C2-E_1 2 0 0
## 96PA-C2-F_1 2 0 0
tail(D)
##             0-3 min 3-5 min 5-10 min
## 96PA-C2-B_2 3 5 10
## 96PA-C2-C_1 3 5 10
## 96PA-C2-C_2 3 5 10
## 96PA-C2-D_1 3 5 10
## 96PA-C2-E_1 3 5 10
## 96PA-C2-F_1 3 5 10
In the simplest continuous time-removal model, singing events by individual birds are assumed to follow a Poisson process. We can use the rate parameter of the Poisson process (\(\phi\)) to estimate the singing rate of birds during a point count.
In the time-invariant conventional removal model (Me0), the individuals of a species at a given location and time are assumed to be homogeneous in their singing rates. The time to first detection follows the exponential distribution \(f(t_{ij}) = \phi \exp(-t_{ij} \phi)\), and the cumulative distribution function of times to first detection over the interval (0, \(t_{iJ}\)) gives the probability that a bird sings at least once during the point count: \(p(t_{iJ}) = 1 - \exp(-t_{iJ} \phi)\).
We use the cmulti function from the detect R package to fit the removal models. The algorithm is based on conditional maximum likelihood and is described in this paper and its supporting material. We use type = "rem" for conventional removal models.
library(detect)
Me0 <- cmulti(Y | D ~ 1, X, type="rem")
summary(Me0)
## Call:
## cmulti(formula = Y | D ~ 1, data = X, type = "rem")
##
## Removal Sampling (homogeneous singing rate)
## Conditional Maximum Likelihood estimates
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## log.phi_(Intercept) -0.91751 0.05826 -15.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: -272.1
## BIC = 549.4
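As a quick sanity check, we can back-transform the log-scale intercept reported above and plug it into the availability formula (a sketch, not detect package code):

```r
## availability under Me0: p(t) = 1 - exp(-t * phi)
phi <- exp(-0.91751)   # back-transformed intercept, ~0.4 cues per minute
p <- function(t) 1 - exp(-t * phi)
round(c(p(3), p(5), p(10)), 3)   # availability for 3, 5, and 10 minute counts
```

So a 10-minute count makes an Ovenbird almost certainly available, while a 3-minute count captures roughly 70% of them.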
Singing rates of birds vary with time of day, time of year, breeding status, and stage of the nesting cycle. Thus, removal model estimates of availability may be improved by accounting for variation in singing rates using covariates for day of year and time of day. In this case \(p(t_{iJ}) = 1 - \exp(-t_{iJ} \phi_{i})\), where \(\log(\phi_{i}) = \beta_{0} + \sum^{K}_{k=1} \beta_{k} x_{ik}\) is the linear predictor with \(K\) covariates and the corresponding unknown coefficients (\(\beta_{k}\), \(k = 0,\ldots, K\)).
We could fit all the possible multivariate and nonlinear models as we did in the paper, but let's keep it simple for now and fit models with JDAY and TSSR as covariates (models Me1 and Me2).
Me1 <- cmulti(Y | D ~ JDAY, X, type="rem")
Me2 <- cmulti(Y | D ~ TSSR, X, type="rem")
Now we compare the three conventional models based on AIC and inspect the summary of the best-supported model, which includes the JDAY effect.
Me_AIC <- AIC(Me0, Me1, Me2)
Me_AIC$dAIC <- Me_AIC$AIC - min(Me_AIC$AIC)
MeBest <- get(rownames(Me_AIC)[Me_AIC$dAIC == 0])
Me_AIC
## df AIC dAIC
## Me0 1 546.1270 0.7187895
## Me1 2 545.4082 0.0000000
## Me2 2 546.4612 1.0529236
summary(MeBest)
## Call:
## cmulti(formula = Y | D ~ JDAY, data = X, type = "rem")
##
## Removal Sampling (homogeneous singing rate)
## Conditional Maximum Likelihood estimates
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## log.phi_(Intercept) 1.460 1.471 0.993 0.321
## log.phi_JDAY -5.235 3.247 -1.612 0.107
##
## Log-likelihood: -270.7
## BIC = 552
To visually capture the time-varying effects, we make some plots using base graphics, with colors matching the values of the time-varying predictor. This way we can assess not only how availability probability (for a fixed count duration) changes with the predictor, but also how the cumulative distribution changes with time.
n <- 100
JDAY <- seq(min(X$JDAY), max(X$JDAY), length.out=n+1)
TSSR <- seq(min(X$TSSR), max(X$TSSR), length.out=n+1)
Duration <- seq(0, 10, length.out=n)
col <- viridis::viridis(n)
b <- coef(MeBest)
op <- par(las=1, mfrow=c(2,1), mar=c(4,4,2,2))
p1 <- 1-exp(-3*exp(b[1]+b[2]*JDAY))
plot(JDAY, p1, ylim=c(0,1), type="n",
main=paste(spp, rownames(Me_AIC)[Me_AIC$dAIC == 0]),
ylab="P(availability)")
for (i in seq_len(n)) {
lines(JDAY[c(i,i+1)], p1[c(i,i+1)], col=col[i], lwd=2)
}
plot(Duration, Duration, type="n", ylim=c(0,1),
ylab="P(availability)")
for (i in seq_len(n)) {
p2 <- 1-exp(-Duration*exp(b[1]+b[2]*JDAY[i]))
lines(Duration, p2, col=col[i])
}
abline(v=3, col="grey")
par(op)
The removal model can also accommodate behavioral heterogeneity in singing by subdividing the sampled population for a species at a given point into a finite mixture of birds with low and high singing rates, which requires the additional estimation of the proportion of birds in the sampled population with low singing rates.
In the continuous-time formulation of the finite mixture (or two-point mixture) removal model, the cumulative distribution function during a point count is \(p(t_{iJ}) = (1 - c) \cdot 1 + c [1 - \exp(-t_{iJ} \phi)] = 1 - c \exp(-t_{iJ} \phi)\), where \(\phi\) is the singing rate for the group of infrequently singing birds, and \(c\) is the proportion of birds during the point count that are infrequent singers. The remaining proportion (\(1 - c\); the intercept of the cumulative distribution function) consists of frequent singers, which are assumed to be detected instantaneously at the start of the first time interval. In the simplest form of the finite mixture model, the proportion and singing rate of the infrequently singing birds are homogeneous across all times and locations (model Mf0). We use type = "fmix" for finite mixture removal models.
Mf0 <- cmulti(Y | D ~ 1, X, type="fmix")
summary(Mf0)
## Call:
## cmulti(formula = Y | D ~ 1, data = X, type = "fmix")
##
## Removal Sampling (heterogeneous singing rate)
## Conditional Maximum Likelihood estimates
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## log.phi_(Intercept) -2.1902 0.4914 -4.457 8.32e-06 ***
## logit.c 0.1182 0.1543 0.766 0.444
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: -257.6
## BIC = 525.8
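Plugging the back-transformed Mf0 estimates into the mixture formula shows the characteristic nonzero intercept (a sketch, not detect package code):

```r
## availability under Mf0: p(t) = 1 - c * exp(-t * phi)
phi <- exp(-2.1902)    # singing rate of the infrequent singers
cc  <- plogis(0.1182)  # proportion of infrequent singers, ~0.53
p <- function(t) 1 - cc * exp(-t * phi)
## p(0) = 1 - cc: the frequent singers are available right away
round(c(p(0), p(3), p(10)), 3)
```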
Previously, researchers (see refs in the paper) have applied covariate effects on the parameter \(\phi_{i}\) of the finite mixture model, similarly to how we modeled these effects in conventional models. This model assumes that the parameter \(c\) is constant irrespective of time and location (i.e. only the infrequent singer group changes its singing behavior).
We can fit finite mixture models with JDAY and TSSR as covariates on \(\phi\) (models Mf1 and Mf2). In this case \(p(t_{iJ}) = 1 - c \exp(-t_{iJ} \phi_{i})\), where \(\log(\phi_{i}) = \beta_{0} + \sum^{K}_{k=1} \beta_{k} x_{ik}\) is the linear predictor with \(K\) covariates and the corresponding unknown coefficients (\(\beta_{k}\), \(k = 0,\ldots, K\)).
Mf1 <- cmulti(Y | D ~ JDAY, X, type="fmix")
Mf2 <- cmulti(Y | D ~ TSSR, X, type="fmix")
We compare the three finite mixture models based on AIC and inspect the summary of the best-supported model, which includes the TSSR effect in this case.
Mf_AIC <- AIC(Mf0, Mf1, Mf2)
Mf_AIC$dAIC <- Mf_AIC$AIC - min(Mf_AIC$AIC)
MfBest <- get(rownames(Mf_AIC)[Mf_AIC$dAIC == 0])
Mf_AIC
## df AIC dAIC
## Mf0 2 519.2222 0.1053855
## Mf1 3 520.4007 1.2838985
## Mf2 3 519.1168 0.0000000
summary(MfBest)
## Call:
## cmulti(formula = Y | D ~ TSSR, data = X, type = "fmix")
##
## Removal Sampling (heterogeneous singing rate)
## Conditional Maximum Likelihood estimates
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## log.phi_(Intercept) -1.1939 0.4195 -2.846 0.00442 **
## log.phi_TSSR -9.0089 4.7712 -1.888 0.05900 .
## logit.c 0.2016 0.1702 1.184 0.23622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: -256.6
## BIC = 529
We produce a similar plot as before.
b <- coef(MfBest)
op <- par(las=1, mfrow=c(2,1), mar=c(4,4,2,2))
p1 <- 1-plogis(b[3])*exp(-3*exp(b[1]+b[2]*TSSR))
plot(TSSR, p1, ylim=c(0,1), type="n",
main=paste(spp, rownames(Mf_AIC)[Mf_AIC$dAIC == 0]),
ylab="P(availability)")
for (i in seq_len(n)) {
lines(TSSR[c(i,i+1)], p1[c(i,i+1)], col=col[i], lwd=2)
}
plot(Duration, Duration, type="n", ylim=c(0,1),
ylab="P(availability)")
for (i in seq_len(n)) {
p2 <- 1-plogis(b[3])*exp(-Duration*exp(b[1]+b[2]*TSSR[i]))
lines(Duration, p2, col=col[i])
}
abline(v=3, col="grey")
par(op)
An alternative parametrization is that \(c_{i}\), rather than \(\phi\), is the time-varying parameter, allowing individuals to switch between the frequent and infrequent singer groups depending on the covariates. We can fit this class of finite mixture models with JDAY and TSSR as covariates on \(c\) (models Mm1 and Mm2) using type = "mix" (instead of "fmix"). In this case \(p(t_{iJ}) = 1 - c_{i} \exp(-t_{iJ} \phi)\), where \(\mathrm{logit}(c_{i}) = \beta_{0} + \sum^{K}_{k=1} \beta_{k} x_{ik}\) is the linear predictor with \(K\) covariates and the corresponding unknown coefficients (\(\beta_{k}\), \(k = 0,\ldots, K\)). Because \(c_{i}\) is a proportion, we model it on the logit scale.
Mm1 <- cmulti(Y | D ~ JDAY, X, type="mix")
Mm2 <- cmulti(Y | D ~ TSSR, X, type="mix")
We did not fit a null model for this parametrization, because it would be identical to model Mf0. We therefore include Mf0 in the AIC comparison and inspect the summary of the best-supported model, which includes the JDAY effect in this case.
Mm_AIC <- AIC(Mf0, Mm1, Mm2)
Mm_AIC$dAIC <- Mm_AIC$AIC - min(Mm_AIC$AIC)
MmBest <- get(rownames(Mm_AIC)[Mm_AIC$dAIC == 0])
Mm_AIC
## df AIC dAIC
## Mf0 2 519.2222 0.1949952
## Mm1 3 519.0272 0.0000000
## Mm2 3 520.8744 1.8471803
summary(MmBest)
## Call:
## cmulti(formula = Y | D ~ JDAY, data = X, type = "mix")
##
## Removal Sampling (heterogeneous singing rate)
## Conditional Maximum Likelihood estimates
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## log.phi -2.1910 0.4914 -4.459 8.24e-06 ***
## logit.c_(Intercept) -4.7600 3.3828 -1.407 0.159
## logit.c_JDAY 10.7368 7.4287 1.445 0.148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: -256.5
## BIC = 528.9
We produce a similar plot as before.
b <- coef(MmBest)
op <- par(las=1, mfrow=c(2,1), mar=c(4,4,2,2))
p1 <- 1-plogis(b[2]+b[3]*JDAY)*exp(-3*exp(b[1]))
plot(JDAY, p1, ylim=c(0,1), type="n",
main=paste(spp, rownames(Mm_AIC)[Mm_AIC$dAIC == 0]),
ylab="P(availability)")
for (i in seq_len(n)) {
lines(JDAY[c(i,i+1)], p1[c(i,i+1)], col=col[i], lwd=2)
}
plot(Duration, Duration, type="n", ylim=c(0,1),
ylab="P(availability)")
for (i in seq_len(n)) {
p2 <- 1-plogis(b[2]+b[3]*JDAY[i])*exp(-Duration*exp(b[1]))
lines(Duration, p2, col=col[i])
}
abline(v=3, col="grey")
par(op)
So which of the three parametrizations proved best for our Ovenbird example data? The finite mixture with a time-varying proportion of infrequent singers won, by a thin margin. Second was the other finite mixture model, while the conventional model lagged far behind.
M_AIC <- AIC(MeBest, MfBest, MmBest)
M_AIC$dAIC <- M_AIC$AIC - min(M_AIC$AIC)
M_AIC
## df AIC dAIC
## MeBest 2 545.4082 26.38106209
## MfBest 3 519.1168 0.08960974
## MmBest 3 519.0272 0.00000000
Finite mixture models provide some really nice insight into how singing behavior changes over time and, due to more parameters, they provide a better fit and thus minimize bias in population size estimates. But all this improvement comes with a price: sample size requirements (or more precisely, the number of detections required) are really high. To have all the benefits with reduced variance, one needs about 1000 non-zero observations to fit finite mixture models, 20 times more than needed to reliably fit conventional removal models. This is much higher than previously suggested minimum sample sizes.
Our findings also indicate that lengthening the count duration from 3 minutes to 5–10 minutes is an important consideration when designing field surveys to increase the accuracy and precision of population estimates. Well-informed survey design combined with various forms of removal sampling are useful in accounting for availability bias in point counts, thereby improving population estimates, and allowing for better integration of disparate studies at larger spatial scales.
To this end, we provide our removal model estimates as part of the QPAD R package and the R functions required to fit all the above outlined removal models as part of the detect R package. We at the Boreal Avian Modelling Project and our collaborators are already utilizing the removal model estimates to correct for availability bias in our continental and regional projects to inform better management and conservation of bird populations. Read more about these projects in our reports.
Please report any issues here and feel free to comment below!
UPDATE: AOS press release, EurekAlert!, ABMI blog.
The first example uses a regular slider that returns a single value. To turn that value into an interval, we use it as a multiplier of the standard deviation (SD, sigma) in a quality control chart (QCC). The code is based on the pistonrings data set from the qcc package. The Shewhart chart sets a 3-sigma limit to indicate the state of control. The slider adjusts the sigma limit, and the GIF below plays it as an animation.
library(shiny)
library(intrval)
library(qcc)
data(pistonrings)
mu <- mean(pistonrings$diameter[pistonrings$trial])
SD <- sd(pistonrings$diameter[pistonrings$trial])
x <- pistonrings$diameter[!pistonrings$trial]
## UI function
ui <- fluidPage(
plotOutput("plot"),
sliderInput("x", "x SD:",
min=0, max=5, value=0, step=0.1,
animate=animationOptions(100)
)
)
# Server logic
server <- function(input, output) {
output$plot <- renderPlot({
Main <- paste("Shewhart quality control chart",
"diameter of piston rings", sprintf("+/- %.1f SD", input$x),
sep="\n")
iv <- mu + input$x * c(-SD, SD)
plot(x, pch = 19, col = x %)(% iv + 1, type = "b",
ylim = mu + 5 * c(-SD, SD), main = Main)
abline(h = mu)
abline(h = iv, lty = 2)
})
}
## Run shiny app
if (interactive()) shinyApp(ui, server)
The second example uses a range slider that returns two values, which form our interval. To spice things up a bit, we combine intervals on two axes to color some random points. A third range slider defines a distance interval and colors the random points inside a ring.
library(shiny)
library(intrval)
set.seed(1)
n <- 10^4
x <- round(runif(n, -2, 2), 2)
y <- round(runif(n, -2, 2), 2)
d <- round(sqrt(x^2 + y^2), 2)
## UI function
ui <- fluidPage(
titlePanel("intrval example with shiny"),
sidebarLayout(
sidebarPanel(
sliderInput("bb_x", "x value:",
min=min(x), max=max(x), value=range(x),
step=round(diff(range(x))/20, 1), animate=TRUE
),
sliderInput("bb_y", "y value:",
min = min(y), max = max(y), value = range(y),
step=round(diff(range(y))/20, 1), animate=TRUE
),
sliderInput("bb_d", "radial distance:",
min = 0, max = max(d), value = c(0, max(d)/2),
step=round(max(d)/20, 1), animate=TRUE
)
),
mainPanel(
plotOutput("plot")
)
)
)
# Server logic
server <- function(input, output) {
output$plot <- renderPlot({
iv1 <- x %[]% input$bb_x & y %[]% input$bb_y
iv2 <- x %[]% input$bb_y & y %[]% input$bb_x
iv3 <- d %()% input$bb_d
op <- par(mfrow=c(1,2))
plot(x, y, pch = 19, cex = 0.25, col = iv1 + iv2 + 3,
main = "Intersecting bounding boxes")
plot(x, y, pch = 19, cex = 0.25, col = iv3 + 1,
main = "Deck the halls:\ndistance range from center")
par(op)
})
}
## Run shiny app
if (interactive()) shinyApp(ui, server)
If you think there are other use cases for intrval in shiny applications, let me know in the comments section!
If you want to learn more about how to host Shiny apps, check out the Hosting Data Apps website!
The R package is hosted on GitHub (no CRAN version yet); please submit any issues here. The package is also archived on Zenodo with DOI 10.5281/zenodo.596410. To install the package, use devtools::install_github("borealbirds/lhreg").
Here, I am going to skim the implementation; the more complete supporting information of the paper has all the reproducible code (try vignette(topic = "lhreg", package = "lhreg") after installing and loading the package). Here is the rendered HTML version.
The most important function is lhreg, which takes the following main arguments:
- Y: response vector,
- X: model matrix for the mean,
- SE: standard error estimate (observation error) for the response,
- V: correlation matrix,
and fits a multivariate normal model to the observed Y vector with phylogenetically based (or any other known) correlations, optionally with observation error (SE) and covariate effects (X).
The function is pretty bare-bones (i.e. no formula interface; the design matrix X needs to be properly specified through e.g. model.matrix()). The lambda argument is a non-negative number modifying the strength of the phylogenetic effects: lambda = 0 is equivalent to lm with weights = 1/(SE^2), lambda = 1 implies Brownian motion evolution, and lambda = NA lets the function estimate lambda from the data.
In terms of optimization, besides the algorithms from stats::optim, we also have a differential evolution algorithm based on the DEoptim package (a bit time consuming but very reliable). The output object class has some methods defined (like logLik and summary), so AIC/BIC work out of the box. The vignette also describes a few techniques that are pretty nice to have in a multivariate setting (i.e. profile likelihood, parametric bootstrap) to support advanced hypothesis testing and model selection.
We used leave-one-out cross-validation to see how well we could predict a species' value based on the data from the other species, the traits, and the phylogeny. The conditional distribution we used for this is described in the paper, which made the exercise very straightforward. Maybe it is just ignorance, but I couldn't find another paper that describes it in a nice and useful manner; if one wishes to make trait/phylogeny based predictions for detectability, this formula is going to be very useful (look inside the loo2 function for the implementation).
At the end of the vignette, there is a hack based on the phytools::contMap function to produce non-rainbow colors. (It was surprisingly non-straightforward to hack the code; modular code, please!)
The following figure shows the two input data vectors mirrored side-by-side:
I realize this is not a very detailed post, but the paper and the vignette should satisfy your curiosity. If you still have unanswered questions, feel free to ask them below!
UPDATE
After downloading the yearly publication numbers using the filters ADDRESS=HUNGARY; CATEGORIES=ECOLOGY, I started where I left off a few years ago. I fit a Ricker growth model to two time intervals of the data: 1978–1997 and 1998–2017.
The R code below uses the PVAClone package that I wrote with Khurram Nadeem, which fits state-space models using MCMC and data cloning with JAGS. The intrval package is a pretty new but handy little helper (see related posts here).
library(PVAClone)
library(intrval)
## the data from WoS
x <- structure(list(years = 1973:2017, records = c(1, 0, 4, 0, 0,
6, 2, 5, 4, 7, 5, 7, 3, 5, 9, 11, 20, 8, 10, 15, 29, 24, 53,
12, 13, 30, 32, 36, 45, 39, 42, 43, 50, 62, 95, 106, 113, 83,
108, 99, 89, 117, 111, 134, 127)), .Names = c("years", "records"
), row.names = c(NA, 45L), class = "data.frame")
## fit the 2 models
ncl <- 10 # number of clones
m1 <- pva(x$records[x$years %[]% c(1978, 1997)], ricker("none"), ncl)
m2 <- pva(x$records[x$years %[]% c(1998, 2017)], ricker("none"), ncl)
## organize estimates
cf <- data.frame(t(sapply(list(early=m1, late=m2), coef)))
cf$K <- with(cf, -a/b)
## growth curve: early period
yr1 <- 1978:1997
pr1 <- numeric(length(yr1))
pr1[1] <- log(x$records[x$years==1978])
for (i in 2:length(pr1))
pr1[i] <- pr1[i-1] + cf["early", "a"] + cf["early", "b"]*exp(pr1[i-1])
pr1 <- exp(pr1)
## growth curve: late period
yr2 <- 1998:2017
pr2 <- numeric(length(yr2))
pr2[1] <- log(x$records[x$years==1998])
for (i in 2:length(pr2))
pr2[i] <- pr2[i-1] + cf["late", "a"] + cf["late", "b"]*exp(pr2[i-1])
pr2 <- exp(pr2)
## and finally the figure using base graphics
op <- par(las=2)
barplot(x$records, names.arg = x$years, space=0,
ylab="# of publications", xlab="years",
col=ifelse(x$years < 1998, "grey", "gold"))
lines(yr1-min(x$years)+0.5, pr1, col=4)
abline(h=cf["early", "K"], col=4, lty=3)
lines(yr2-min(x$years)+0.5, pr2, col=2)
abline(h=cf["late", "K"], col=2, lty=3)
par(op)
Here are the model parameters for the two Ricker models:
|           | a    | b     | sigma | K      |
|-----------|------|-------|-------|--------|
| 1978–1997 | 0.38 | -0.03 | 0.60  | 13.85  |
| 1998–2017 | 0.21 | 0.00  | 0.16  | 119.00 |
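The K column follows from the other parameters: the Ricker equilibrium is K = -a/b (the same formula used for cf$K above), the abundance at which the log-scale growth increment a + b*N vanishes. A quick check with the 1998–2017 values (note that b in the table is rounded to 0.00):

```r
## Ricker equilibrium check: at N = K the growth increment a + b*K is zero
a <- 0.21
K <- 119
b <- -a / K    # about -0.0018, consistent with the rounded 0.00 in the table
a + b * K      # exactly zero at the equilibrium
```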
The K carrying capacity used to be 100 based on the 1998–2012 data, but now K = 119: a significant improvement, so heartfelt kudos to the ecologists in Hungary (more papers please)! The growth rate hasn't changed since that analysis (a = 0.21). So we can conclude that if the rate remained constant while the carrying capacity increased, the change must be related to resource availability (i.e. increased funding, more jobs, improved infrastructure).
This is good news to me! Let me know what you think by leaving a comment below!
parallel::mclapply
with forking (see this post for more background on the issue). Strangely enough, a GitHub issue held the key to the solution that I am going to outline below. Long story short: forking is no longer expensive with pbapply, and as it turns out, it never was.
The issue mentioned parallel::makeForkCluster
as the way to set up a Fork cluster, which, according to the help page, ‘is merely a stub on Windows. On Unix-alike platforms it creates the worker process by forking’.
So I looked at some timings, starting with one of the examples on the ?pbapply help page:
library(pbapply)
set.seed(1234)
n <- 200
x <- rnorm(n)
y <- rnorm(n, crossprod(t(model.matrix(~ x)), c(0, 1)), sd = 0.5)
d <- data.frame(y, x)
mod <- lm(y ~ x, d)
ndat <- model.frame(mod)
B <- 100
bid <- sapply(1:B, function(i) sample(nrow(ndat), nrow(ndat), TRUE))
fun <- function(z) {
if (missing(z))
z <- sample(nrow(ndat), nrow(ndat), TRUE)
coef(lm(mod$call$formula, data=ndat[z,]))
}
## forking with mclapply
system.time(res1 <- pblapply(1:B, function(i) fun(bid[,i]), cl = 2L))
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 01s
## user system elapsed
## 0.587 0.919 0.845
## forking with parLapply
cl <- makeForkCluster(2L)
system.time(res2 <- pblapply(1:B, function(i) fun(bid[,i]), cl = cl))
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 00s
## user system elapsed
## 0.058 0.009 0.215
stopCluster(cl)
## Socket cluster (need to pass objects to workers)
cl <- makeCluster(2L)
clusterExport(cl, c("fun", "mod", "ndat", "bid"))
system.time(res3 <- pblapply(1:B, function(i) fun(bid[,i]), cl = cl))
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 00s
## user system elapsed
## 0.053 0.008 0.169
stopCluster(cl)
Forking with mclapply is still pricey, but the almost equivalent makeForkCluster trick, which does not require objects to be passed to the workers because of the shared memory nature of forking, is pretty close to the ordinary socket cluster option.
What if I used this trick in the package? I would then create a Fork cluster (cl <- makeForkCluster(cl)), run parLapply(cl, ...), and destroy the cluster with on.exit(stopCluster(cl), add = TRUE). So I created a branch to do some tests:
ncl <- 2
B <- 1000
fun <- function(x) {
Sys.sleep(0.01)
x^2
}
library(pbmcapply)
(t1 <- system.time(pbmclapply(1:B, fun, mc.cores = ncl)))
## |========================================================| 100%, ETA 00:00
## user system elapsed
## 0.242 0.114 5.461
library(pbapply) # 1.3-4 CRAN version
(t2 <- system.time(pblapply(1:B, fun, cl = ncl)))
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 07s
## user system elapsed
## 0.667 1.390 6.547
library(pbapply) # 1.3-5 fork-cluster-speedup branch
(t3 <- system.time(pblapply(1:B, fun, cl = ncl)))
## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s
## user system elapsed
## 0.225 0.100 5.710
Really nice so far: pbapply caught up with the forking-based timings of pbmcapply. Let’s run some more extensive tests to see how the number of progress bar updates affects the timings. If things work as I hope, there shouldn’t be an increase with the new forking approach:
timer_fun <- function(X, FUN, nout = 100, ...) {
    pbo <- pboptions(nout = nout)
    on.exit(pboptions(pbo))
    unname(system.time(pblapply(X, FUN, ...))[3])
}
timer_NULL <- list(
nout1 = timer_fun(1:B, fun, nout = 1, cl = NULL),
nout10 = timer_fun(1:B, fun, nout = 10, cl = NULL),
nout100 = timer_fun(1:B, fun, nout = 100, cl = NULL),
nout1000 = timer_fun(1:B, fun, nout = 1000, cl = NULL))
unlist(timer_NULL)
## nout1 nout10 nout100 nout1000
## 12.221 11.899 11.775 11.260
cl <- makeCluster(ncl)
timer_cl <- list(
nout1 = timer_fun(1:B, fun, nout = 1, cl = cl),
nout10 = timer_fun(1:B, fun, nout = 10, cl = cl),
nout100 = timer_fun(1:B, fun, nout = 100, cl = cl),
nout1000 = timer_fun(1:B, fun, nout = 1000, cl = cl))
stopCluster(cl)
unlist(timer_cl)
## nout1 nout10 nout100 nout1000
## 6.033 6.091 6.011 6.273
## forking with 1.3-4 CRAN version
timer_mc <- list(
nout1 = timer_fun(1:B, fun, nout = 1, cl = ncl),
nout10 = timer_fun(1:B, fun, nout = 10, cl = ncl),
nout100 = timer_fun(1:B, fun, nout = 100, cl = ncl),
nout1000 = timer_fun(1:B, fun, nout = 1000, cl = ncl))
unlist(timer_mc)
## nout1 nout10 nout100 nout1000
## 5.563 5.659 6.620 10.692
## forking with 1.3-5 fork-cluster-speedup branch
timer_new <- list(
nout1 = timer_fun(1:B, fun, nout = 1, cl = ncl),
nout10 = timer_fun(1:B, fun, nout = 10, cl = ncl),
nout100 = timer_fun(1:B, fun, nout = 100, cl = ncl),
nout1000 = timer_fun(1:B, fun, nout = 1000, cl = ncl))
unlist(timer_new)
## nout1 nout10 nout100 nout1000
## 5.480 5.574 5.665 6.063
The new implementation with the fork cluster trick beat the old mclapply based implementation hands down. I wonder what is causing the wildly different timing results. Is it due to all the other mclapply arguments that give control over pre-scheduling, cleanup, and RNG seeds?
The new branch can be installed as:
devtools::install_github("psolymos/pbapply", ref = "fork-cluster-speedup")
I am a bit reluctant to merge the new branch for the following reasons:

- makeForkCluster was already an option before, by explicitly passing a fork cluster as the cl argument;
- mclapply wasn’t so bad to begin with, because the number of updates was capped by the nout option.

I would recommend the following workflow, based purely on the stable CRAN version:
cl <- makeForkCluster(2L)
output <- pblapply(..., cl = cl)
stopCluster(cl)
As always, I am keen on hearing what you think: either in the comments or on GitHub.
o in the middle as in c(a1, b1) %[]o[]% c(a2, b2).
Interval 1: | Interval 2: [] | [) | (] | () |
---|---|---|---|---|
[] | %[]o[]% | %[]o[)% | %[]o(]% | %[]o()% |
[) | %[)o[]% | %[)o[)% | %[)o(]% | %[)o()% |
(] | %(]o[]% | %(]o[)% | %(]o(]% | %(]o()% |
() | %()o[]% | %()o[)% | %()o(]% | %()o()% |
The overlap of two closed intervals, [a1, b1] and [a2, b2] (a1 <= b1, a2 <= b2), is evaluated by the %[]o[]% operator (%[o]% is an alias). Endpoints can be defined as a vector with two values (c(a1, b1)) or can be stored in matrix-like objects or lists, in which case comparisons are made element-wise. If lengths do not match, shorter objects are recycled.
These value-to-interval operators work for numeric (integer, real) and ordered vectors, and for object types that are measured at least on an ordinal scale (e.g. dates). Note that interval endpoints are sorted internally, so it is not necessary to ensure the conditions a1 <= b1 and a2 <= b2.
c(2, 3) %[]o[]% c(0, 1)
list(0:4, 1:5) %[]o[]% c(2, 3)
cbind(0:4, 1:5) %[]o[]% c(2, 3)
data.frame(a=0:4, b=1:5) %[]o[]% c(2, 3)
If lengths do not match, shorter objects are recycled. These value-to-interval operators work for numeric (integer, real) and ordered vectors, and object types which are measured at least on ordinal scale (e.g. dates).
%)o(% is used for the negation of the two-closed-interval overlap (%[o]%); directional evaluation is done via the operators %[<o]% and %[o>]%. The overlap of two open intervals is evaluated by %(o)% (an alias for %()o()%). %]o[% is used for the negation of the two-open-interval overlap; directional evaluation is done via the operators %(<o)% and %(o>)%. Overlap operators with mixed endpoints do not have negation and directional counterparts.
Equal | Not equal | Less than | Greater than |
---|---|---|---|
%[o]% | %)o(% | %[<o]% | %[o>]% |
%(o)% | %]o[% | %(<o)% | %(o>)% |
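To make one of the mixed-endpoint cases concrete, here is a base-R illustration (my own sketch, not the package implementation) of the %[]o(]% semantics: a closed interval [a1, b1] overlaps a left-open interval (a2, b2] exactly when b1 > a2 and a1 <= b2.

```r
## Base-R illustration of the %[]o(]% semantics (not the package source):
## closed [a1, b1] vs left-open, right-closed (a2, b2].
overlap_closed_halfopen <- function(int1, int2) {
    a1 <- min(int1); b1 <- max(int1)  # endpoints sorted, as in the package
    a2 <- min(int2); b2 <- max(int2)
    b1 > a2 & a1 <= b2
}

overlap_closed_halfopen(c(1, 2), c(2, 3))    # touch only at the open endpoint: FALSE
overlap_closed_halfopen(c(1, 2.5), c(2, 3))  # genuine overlap: TRUE
```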
Thanks for all the feedback so far and please keep’em coming: leave a comment below or use the issue tracker to provide feedback or report a problem.
%inrange% and %between%. That got me thinking: it would be really cool to generalize this idea for different intervals, for example as x %[]% c(a, b).
We want to evaluate whether values of x satisfy the condition x >= a & x <= b, given that a <= b. Typing x %[]% c(a, b) instead of the previous expression is not much shorter (14 vs. 15 characters, counting spaces). But once the a <= b condition is taken into account, it becomes a real saving (x >= min(a, b) & x <= max(a, b) is 31 characters long). And sorting is really important, because by flipping a and b we get quite different answers:
x <- 5
x >= 1 & x <= 10
# [1] TRUE
x >= 10 & x <= 1
# [1] FALSE
Also, min and max will not be very useful when we want to vectorize the expression. We need to use pmin and pmax, which compare their arguments element-wise:
x >= min(1:10, 10:1) & x <= max(10:1, 1:10)
# [1] TRUE
x >= pmin(1:10, 10:1) & x <= pmax(10:1, 1:10)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
If interval endpoints can also be open or closed, allowing them to flip around makes the semantics of left/right closed/open interval definitions hard to express. We can thus all agree that there is a need for an expression like x %[]% c(a, b) that is compact, flexible, and invariant to endpoint sorting. This is exactly what the intrval package is for!
Functions for evaluating if values of vectors are within different open/closed intervals (x %[]% c(a, b)), or if two closed intervals overlap (c(a1, b1) %[o]% c(a2, b2)). Operators for negation and directional relations are also implemented.
Values of x are compared to interval endpoints a and b (a <= b). Endpoints can be defined as a vector with two values (c(a, b)): these values will be compared as a single interval with each value in x. If endpoints are stored in a matrix-like object or a list, comparisons are made element-wise.
x <- rep(4, 5)
a <- 1:5
b <- 3:7
cbind(x=x, a=a, b=b)
x %[]% cbind(a, b) # matrix
x %[]% data.frame(a=a, b=b) # data.frame
x %[]% list(a, b) # list
If lengths do not match, shorter objects are recycled. Return values are logicals. Note: interval endpoints are sorted internally, so it is not necessary to ensure the condition a <= b.
These value-to-interval operators work for numeric (integer, real) and ordered vectors, and object types which are measured at least on ordinal scale (e.g. dates).
The following special operators are used to indicate closed ([
, ]
) or open ((
, )
) interval endpoints:
Operator | Expression | Condition |
---|---|---|
%[]% | x %[]% c(a, b) | x >= a & x <= b |
%[)% | x %[)% c(a, b) | x >= a & x < b |
%(]% | x %(]% c(a, b) | x > a & x <= b |
%()% | x %()% c(a, b) | x > a & x < b |
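The four variants in the table differ only in whether each endpoint comparison is strict; a base-R sketch (my own illustration, not the package code) makes that explicit:

```r
## Base-R sketch of the four endpoint variants above (illustration only,
## not the intrval source): toggle strictness at either endpoint.
in_interval <- function(x, interval, left_closed = TRUE, right_closed = TRUE) {
    a <- pmin(interval[[1]], interval[[2]])
    b <- pmax(interval[[1]], interval[[2]])
    lo <- if (left_closed) x >= a else x > a
    hi <- if (right_closed) x <= b else x < b
    lo & hi
}

in_interval(1:5, c(2, 4))                        # like %[]%: FALSE TRUE TRUE TRUE FALSE
in_interval(1:5, c(2, 4), right_closed = FALSE)  # like %[)%: FALSE TRUE TRUE FALSE FALSE
```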
Equal | Not equal | Less than | Greater than |
---|---|---|---|
%[]% | %)(% | %[<]% | %[>]% |
%[)% | %)[% | %[<)% | %[>)% |
%(]% | %](% | %(<]% | %(>]% |
%()% | %][% | %(<)% | %(>)% |
The helper function intrval_types
can be used to
print/plot the following summary:
The overlap of two closed intervals, [a1, b1] and [a2, b2] (a1 <= b1, a2 <= b2), is evaluated by the %[o]% operator. Endpoints can be defined as a vector with two values (c(a1, b1)) or can be stored in matrix-like objects or lists, in which case comparisons are made element-wise. Note: interval endpoints are sorted internally, so it is not necessary to ensure the conditions a1 <= b1 and a2 <= b2.
c(2:3) %[o]% c(0:1)
list(0:4, 1:5) %[o]% c(2:3)
cbind(0:4, 1:5) %[o]% c(2:3)
data.frame(a=0:4, b=1:5) %[o]% c(2:3)
If lengths do not match, shorter objects are recycled. These value-to-interval operators work for numeric (integer, real) and ordered vectors, and object types which are measured at least on ordinal scale (e.g. dates).
%)o(% is used for the negation; directional evaluation is done via the operators %[<o]% and %[o>]%.
Equal | Not equal | Less than | Greater than |
---|---|---|---|
%[o]% | %)o(% | %[<o]% | %[o>]% |
The previous operators will return NA
for unordered factors.
Set overlap can be evaluated by the base %in% operator and its negation %nin%. (This feature is really redundant, I know, but I decided to include it regardless…)
Install development version from GitHub (not yet on CRAN):
library(devtools)
install_github("psolymos/intrval")
The package is licensed under GPL-2.
library(intrval)
## bounding box
set.seed(1)
n <- 10^4
x <- runif(n, -2, 2)
y <- runif(n, -2, 2)
d <- sqrt(x^2 + y^2)
iv1 <- x %[]% c(-0.25, 0.25) & y %[]% c(-1.5, 1.5)
iv2 <- x %[]% c(-1.5, 1.5) & y %[]% c(-0.25, 0.25)
iv3 <- d %()% c(1, 1.5)
plot(x, y, pch = 19, cex = 0.25, col = iv1 + iv2 + 1,
main = "Intersecting bounding boxes")
plot(x, y, pch = 19, cex = 0.25, col = iv3 + 1,
main = "Deck the halls:\ndistance range from center")
## time series filtering
x <- seq(0, 4*24*60*60, 60*60)
dt <- as.POSIXct(x, origin="2000-01-01 00:00:00")
f <- as.POSIXlt(dt)$hour %[]% c(0, 11)
plot(sin(x) ~ dt, type="l", col="grey",
main = "Filtering date/time objects")
points(sin(x) ~ dt, pch = 19, col = f + 1)
## QCC
library(qcc)
data(pistonrings)
mu <- mean(pistonrings$diameter[pistonrings$trial])
SD <- sd(pistonrings$diameter[pistonrings$trial])
x <- pistonrings$diameter[!pistonrings$trial]
iv <- mu + 3 * c(-SD, SD)
plot(x, pch = 19, col = x %)(% iv +1, type = "b", ylim = mu + 5 * c(-SD, SD),
main = "Shewhart quality control chart\ndiameter of piston rings")
abline(h = mu)
abline(h = iv, lty = 2)
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
## compare 95% confidence intervals with 0
(CI.D9 <- confint(lm.D9))
# 2.5 % 97.5 %
# (Intercept) 4.56934 5.4946602
# groupTrt -1.02530 0.2833003
0 %[]% CI.D9
# (Intercept) groupTrt
# FALSE TRUE
lm.D90 <- lm(weight ~ group - 1) # omitting intercept
## compare 95% confidence of the 2 groups to each other
(CI.D90 <- confint(lm.D90))
# 2.5 % 97.5 %
# groupCtl 4.56934 5.49466
# groupTrt 4.19834 5.12366
CI.D90[1,] %[o]% CI.D90[2,]
# 2.5 %
# TRUE
DATE <- as.Date(c("2000-01-01", "2000-02-01", "2000-03-31"))
DATE %[<]% as.Date(c("2000-01-15", "2000-03-15"))
# [1] TRUE FALSE FALSE
DATE %[]% as.Date(c("2000-01-15", "2000-03-15"))
# [1] FALSE TRUE FALSE
DATE %[>]% as.Date(c("2000-01-15", "2000-03-15"))
# [1] FALSE FALSE TRUE
For more examples, see the unit-testing script.
Please check out the package and use the issue tracker to suggest a new feature or report a problem.
Sergey Kashin pointed out that some operators are redundant. It is now explained in the manual:
Note that some operators return identical results but are syntactically different: %[<]% and %[<)% both evaluate x < a; %[>]% and %(>]% both evaluate x > b; %(<]% and %(<)% both evaluate x <= a; %[>)% and %(>)% both evaluate x >= b. This is so because we evaluate only one end of the interval while still conceptually referring to the relationship defined by the right-hand-side interval object. This implies two conditional logical evaluations instead of treating it as a single 3-level ordered factor.
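This redundancy can be illustrated in base R (hypothetical helper names, not the package functions): both "less than" forms reduce to the same comparison against the lower endpoint, so they must agree:

```r
## Sketch of why %[<]% and %[<)% coincide (made-up helpers, not the package
## source): "less than the interval" only involves the lower endpoint a,
## so the openness of the far endpoint b cannot change the result.
less_than_closed   <- function(x, interval) x < min(interval)  # like %[<]%
less_than_halfopen <- function(x, interval) x < min(interval)  # like %[<)%

identical(less_than_closed(1:5, c(3, 5)), less_than_halfopen(1:5, c(3, 5)))  # TRUE
```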
intrval R package v0.1 is on CRAN: https://CRAN.R-project.org/package=intrval
%in% R function as x %in% y. Whenever I need the negation of that, I used to write !(x %in% y). Not much of a hassle, but still, wouldn’t it be nicer to have x %notin% y instead? So I decided to code it for my mefa4 package that I maintain primarily to make my data munging time shorter and more efficient. Coding a %special% function was no big deal, but I had to do quite a bit of research and trial and error until I figured out the proper documentation. So here it goes.
The function name needs quotes and exactly two arguments, one for the left and one for the right hand side of the operator in the middle:
"%notin%" <- function(x, table) !(match(x, table, nomatch = 0) > 0)
Let us see what it does:
1:4 %in% 3:5
## [1] FALSE FALSE TRUE TRUE
1:4 %notin% 3:5
## [1] TRUE TRUE FALSE FALSE
We need to export the function, so just add the following entry to the NAMESPACE
file:
export("%notin%")
This is where things get a bit more interesting. The LaTeX engine needs the percent sign to be escaped (\%) throughout the whole documentation. Also pay close attention to the usage section (x \%notin\% table).
\name{\%notin\%}
\alias{\%notin\%}
\title{
Negated Value Matching
}
\description{
\code{\%notin\%} is the negation of \code{\link{\%in\%}},
which returns a logical vector indicating if there is a non-match or not
for its left operand.
}
\usage{
x \%notin\% table
}
\arguments{
\item{x}{
vector or \code{NULL}: the values to be matched.
}
\item{table}{
vector or \code{NULL}: the values to be matched against.
}
}
\value{
A logical vector, indicating if a non-match was located for each element of
\code{x}: thus the values are \code{TRUE} or \code{FALSE} and never \code{NA}.
}
\author{
Peter Solymos <solymos@ualberta.ca>
}
\seealso{
All the opposite of what is written for \code{\link{\%in\%}}.
}
\examples{
1:10 \%notin\% c(1,3,5,9)
sstr <- c("c","ab","B","bba","c",NA,"@","bla","a","Ba","\%")
sstr[sstr \%notin\% c(letters, LETTERS)]
}
\keyword{manip}
\keyword{logic}
UPDATE
Some updates from the comments:
- the %>% (pipe) operator;
- the %nin% function ({match(x, table, nomatch = 0) == 0}). (Note that the unexported Matrix:::"%nin%" is defined as {is.na(match(x, table))}.)