Péter Sólymos

CalgaryR & YEGRUG Meetup: Data Cloning - Hierarchical Models Made Easy

2023-04-11T00:00:00-06:00

I moved to Canada in 2008 to start a postdoctoral fellowship with Prof. Subhash Lele at the stats department of the University of Alberta. Subhash at the time just published a paper about a statistical technique called data cloning. Data cloning is a way to use Bayesian MCMC algorithms to do frequentist inference. Yes, you read that right.

I learnt about WinBUGS, OpenBUGS, then JAGS, and used data cloning for some projects. And I started abstracting away the workflow and in 2009 I submitted the dclone package to CRAN. The dclone package is still alive and well, now even supports Stan.

Since then, Subhash has retired and I moved jobs. We thought recently that we should do another data cloning workshop like we did a few times over the years. And now it is coming up on April 13, 2023, as a joint meetup of the Edmonton and Calgary R User groups.

Mixed models, also known as hierarchical models and multilevel models, is a useful class of models for applied sciences. The goal of the workshop is to give an introduction to the logic, theory, and implementation of these models to solve practical problems. The workshop will include a seminar style overview and hands on exercises including common model classes and examples that participants can extend for their own needs.

Find more about the course on the GitHub site: https://github.com/datacloning/workshop-2023-edmonton.

How many birds are out there?

2020-06-22T00:00:00-06:00

In a recent paper entitled “Lessons learned from comparing spatially explicit models and the Partners in Flight approach to estimate population sizes of boreal birds in Alberta, Canada” we developed improved, spatially explicit models for 81 land bird species in northern Alberta, Canada. We then compared these estimates of bird abundance to a commonly-used but non-spatially explicit estimate by Partners in Flight (PIF v 3.0) that’s based on the North American Breeding Bird Survey (BBS) data set. The publication is a result of years of collaboration between the ABMI, Boreal Avian Modelling (BAM) project, Canadian Wildlife Service (Environment and Climate Change Canada), and United States Geological Survey.

The paper represents a major step forwards in understanding the complexity of population size estimation. We paid special attention to framing the implications for conservation and management of species at risk and for improving future data collection. It was published in the Condor, and it is open access:

publication: https://doi.org/10.1093/condor/duaa007
supporting information: http://doi.org/10.5281/zenodo.3563112

Here is the abstract:

Estimating the population abundance of landbirds is a challenging task complicated by the amount, type, and quality of available data. Avian conservationists have relied on population estimates from Partners in Flight (PIF), which primarily uses roadside data from the North American Breeding Bird Survey (BBS). However, the BBS was not designed to estimate population sizes. Therefore, we set out to compare the PIF approach with spatially explicit models incorporating roadside and off-road point-count surveys. We calculated population estimates for 81 landbird species in Bird Conservation Region 6 in Alberta, Canada, using land cover and climate as predictors. We also developed a framework to evaluate how the differences between the detection distance, time-of-day, roadside count, and habitat representation adjustments explain discrepancies between the 2 estimators. We showed that the key assumptions of the PIF population estimator were commonly violated in this region, and that the 2 approaches provided different population estimates for most species. The average differences between estimators were explained by differences in the detection-distance and time-of-day components, but these adjustments left much unexplained variation among species. Differences in the roadside count and habitat representation components explained most of the among-species variation. The variation caused by these factors was large enough to change the population ranking of the species. The roadside count bias needs serious attention when roadside surveys are used to extrapolate over off-road areas. Habitat representation bias is likely prevalent in regions sparsely and non-representatively sampled by roadside surveys, such as the boreal region of North America, and thus population estimates for these regions need to be treated with caution for certain species. Additional sampling and integrated modeling of available data sources can contribute towards more accurate population estimates for conservation in remote areas of North America.

I am not going to provide another perspective here, but rather, I list all the other sources out there that I wrote in connection with the paper, or commentaries that arose in news outlets.

Blog posts:

Three things you should know about population estimation, the AOS blog
Made in Alberta models help continental bird conservation, the ABMI blog

In the news:

Presentations:

this one walks you through the math nicely

Social activity on Altmetric:

Fitting removal models with the detect R package

2018-08-30T00:00:00-06:00

In a paper recently published in the Condor, titled Evaluating time-removal models for estimating availability of boreal birds during point-count surveys: sample size requirements and model complexity, we assessed different ways of controlling for point-count duration in bird counts using data from the Boreal Avian Modelling Project. As the title indicates, the paper describes a cost-benefit analysis to make recommendations about when to use different types of the removal model. The paper is open access, so feel free to read the whole paper here.

In summary, we evaluated a conventional removal model and a finite mixture removal model, with and without covariates, for 152 bird species. We found that the probabilities of predicted availability under conventional and finite mixture models were very similar with respect to the range of probability values and the shape of the response curves to predictor variables. However, finite mixture models were better supported for the large majority of species. We also found overwhelming support for time-varying models irrespective of the parametrization.

I have written a related post about the journey that led to this paper (Count me in! I am available for detection at 6 AM on May 26th), in this post I describe the math behind the removal modeling as implemented in the detect R package.

Continuous time-removal models

It has long been recognized that nearly all avian field surveys underestimate abundances, unless the estimates are adjusted for the proportion of birds present but undetected at the times and locations surveyed. Detectability is the product of the probability that birds make themselves available for detection by emitting detectable cues (availability); and the probability that an available bird will be perceived by a bird surveyor (perceptibility).

The time-removal model, originally developed for estimating wildlife and fish abundances from mark-recapture studies, was later reformulated for avian surveys with the goal of improving estimates of bird abundance by accounting for the availability bias inherent in point-count data. The removal model applied to point-count surveys estimates the probability that a bird is available for detection as a function of the average number of detectable cues that an individual bird gives per minute (singing rate), and the known count duration.

Time-removal models are based on a removal experiment whereby animals are trapped and thereby removed from the closed population of animals being sampled. When applying a removal model to avian point-count surveys, the counts of singing birds (\(Y_{ij}, \ldots, Y_{iJ}\)) within a given point-count survey \(i\) (\(i = 1,\ldots, n\)) are tallied relative to when each bird is first detected in multiple and consecutive time intervals, with the survey start time \(t_{i0} = 0\), the end times of the time intervals \(t_{ij}\) (\(j = 1, 2,\ldots, J\)), and the total count duration of the survey \(t_{iJ}\). We count each individual bird once, so individuals are ‘mentally removed’ from a closed population of undetected birds by the surveyor.

Data requirements

We have just defined the kind of data we need for the removal models. In this post, I am going to use a data set from our paper about comparing human observer based counts to automated recording units, Paired sampling standardizes point count data from humans and acoustic recorders. The data set we used is wrapped up in an R package called paired (thanks for Steve Van Wilgenburg for suggestions on this post and for agreeing to share this data set).

if (!require(paired))
    devtools::install_github("borealbirds/paired")
library(paired)
data(paired)

We will use the counts for Ovenbird, one of the most common species in the data set (abbreviated as "OVEN"). The data is in long format, so I am using the mefa4 R package to make the sample by species cross-tabulation. Then subsetting the data to retain samples obtained by human observers, then getting rid of missing predictor data. For predictors, we will use a variable capturing date (JDAY; standardized ordinal day of the year) and an other one capturing time of day (TSSR; time since local sunrise).

The data frame X contains the predictors. The matrix Y contains the counts of newly counted individuals binned into consecutive time intervals (0–3, 3–5, 5–10 minutes): cell values are the \(Y_{ij}\)’s. The D object is another matrix mirroring the structure of Y but instead of counts, it contains the interval end times: cell values are the \(t_{ij}\)’s.

library(mefa4)
spp <- "OVEN"
xt <- Xtab(Count ~ PKEY + Interval, paired,
    subset=paired$SurveyType == "HUM" & paired$SPECIES == spp)
Y <- as.matrix(xt[,c("0-3 min", "3-5 min", "5-10 min")])
X <- nonDuplicated(paired[paired$SurveyType == "HUM",],
    PKEY, TRUE)[rownames(Y),]
i <- !is.na(X$Latitude)
Y <- Y[i,]
X <- X[i,c("JDAY", "TSSR")]
D <- matrix(c(3, 5, 10), nrow(Y), 3, byrow=TRUE)
dimnames(D) <- dimnames(Y)

tail(X)

##                  JDAY       TSSR
## 96PA-C2-B_2 0.4383562 0.13468851
## 96PA-C2-C_1 0.4438356 0.08526145
## 96PA-C2-C_2 0.4383562 0.15273704
## 96PA-C2-D_1 0.4438356 0.13597129
## 96PA-C2-E_1 0.4438356 0.10122629
## 96PA-C2-F_1 0.4438356 0.11999388

tail(Y)

##             0-3 min 3-5 min 5-10 min
## 96PA-C2-B_2       5       0        0
## 96PA-C2-C_1       2       0        0
## 96PA-C2-C_2       2       0        0
## 96PA-C2-D_1       7       0        0
## 96PA-C2-E_1       2       0        0
## 96PA-C2-F_1       2       0        0

tail(D)

            0-3 min 3-5 min 5-10 min
## 96PA-C2-B_2       3       5       10
## 96PA-C2-C_1       3       5       10
## 96PA-C2-C_2       3       5       10
## 96PA-C2-D_1       3       5       10
## 96PA-C2-E_1       3       5       10
## 96PA-C2-F_1       3       5       10

Time-invariant conventional removal model

In the simplest continuous time-removal model, singing events by individual birds are assumed to follow a Poisson process. We can use the rate parameter of the Poisson process (\(\phi\)) to estimate the singing rate of birds during a point count.

In the time-invariant conventional removal model (Me0), the individuals of a species at a given location and time are assumed to be homogeneous in their singing rates. The time to first detection follows the exponential distribution \(f(t_{ij}) = \phi exp(-t_{ij} \phi)\), and the cumulative density function of times to first detection in time interval (0, \(t_{iJ}\)) gives us the probability that a bird sings at least once during the point count as \(p(t_{iJ}) = 1 - exp(-t_{iJ} \phi)\).

We use the cmulti function from the detect R package to fit the removal models. The algorithm used in the function is based on conditional maximum likelihood, and is described in this paper its supporting material. We are using the type = "rem" for conventional removal models.

library(detect)
Me0 <- cmulti(Y | D ~ 1, X, type="rem")
summary(Me0)

## Call:
## cmulti(formula = Y | D ~ 1, data = X, type = "rem")
##
## Removal Sampling (homogeneous singing rate)
## Conditional Maximum Likelihood estimates
##
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## log.phi_(Intercept) -0.91751    0.05826  -15.75   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: -272.1
## BIC = 549.4

Time-varying conventional removal models

Singing rates of birds vary with time of day, time of year, breeding status, and stage of the nesting cycle. Thus, removal model estimates of availability may be improved by accounting for variation in singing rates using covariates for day of year and time of day. In this case \(p(t_{iJ}) = 1 - exp(-t_{iJ} \phi_{i})\) and \(log(\phi_{i}) = \beta_{0} + \sum^{K}_{k=1} \beta_{k} x_{ik}\) is the linear predictor with \(K\) covariates and the corresponding unknown coefficients (\(\beta_{k}\), \(k = 0,\ldots, K\)).

We could fit all the possible multivariate and nonlinear models as we did in the paper, but let’s just keep it simple for now and fit models with JDAY and TSSR as covariates (models Me1 and Me2).

Me1 <- cmulti(Y | D ~ JDAY, X, type="rem")
Me2 <- cmulti(Y | D ~ TSSR, X, type="rem")

Now compare the three conventional models based on AIC and inspect the summary for the best supported model with the JDAY effect.

Me_AIC <- AIC(Me0, Me1, Me2)
Me_AIC$dAIC <- Me_AIC$AIC - min(Me_AIC$AIC)

MeBest <- get(rownames(Me_AIC)[Me_AIC$dAIC == 0])
Me_AIC

##     df      AIC      dAIC
## Me0  1 546.1270 0.7187895
## Me1  2 545.4082 0.0000000
## Me2  2 546.4612 1.0529236

summary(MeBest)
## Call:
## cmulti(formula = Y | D ~ JDAY, data = X, type = "rem")
##
## Removal Sampling (homogeneous singing rate)
## Conditional Maximum Likelihood estimates
##
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## log.phi_(Intercept)    1.460      1.471   0.993    0.321
## log.phi_JDAY          -5.235      3.247  -1.612    0.107
##
## Log-likelihood: -270.7
## BIC =   552

To visually capture the time-varying effects, we make some plots using base graphics, colors matching the time-varying predictor. This way we can not only assess how availability probability (given a fixed time interval) is changing with the values of the predictor, but also how the cumulative distribution changes with time.

n <- 100
JDAY <- seq(min(X$JDAY), max(X$JDAY), length.out=n+1)
TSSR <- seq(min(X$TSSR), max(X$TSSR), length.out=n+1)
Duration <- seq(0, 10, length.out=n)
col <- viridis::viridis(n)

b <- coef(MeBest)
op <- par(las=1, mfrow=c(2,1), mar=c(4,4,2,2))
p1 <- 1-exp(-3*exp(b[1]+b[2]*JDAY))
plot(JDAY, p1, ylim=c(0,1), type="n",
    main=paste(spp, rownames(Me_AIC)[Me_AIC$dAIC == 0]),
    ylab="P(availability)")
for (i in seq_len(n)) {
    lines(JDAY[c(i,i+1)], p1[c(i,i+1)], col=col[i], lwd=2)
}
plot(Duration, Duration, type="n", ylim=c(0,1),
    ylab="P(availability)")
for (i in seq_len(n)) {
    p2 <- 1-exp(-Duration*exp(b[1]+b[2]*JDAY[i]))
    lines(Duration, p2, col=col[i])
}
abline(v=3, col="grey")
par(op)

Time-invariant finite mixture removal model

The removal model can also accommodate behavioral heterogeneity in singing by subdividing the sampled population for a species at a given point into a finite mixture of birds with low and high singing rates, which requires the additional estimation of the proportion of birds in the sampled population with low singing rates.

In the continuous-time formulation of the finite mixture (or two-point mixture) removal model, the cumulative density function during a point count is given by \(p(t_{iJ}) = (1 - c) 1 + c [1 - exp(-t_{iJ} \phi)] = 1 - c exp(-t_{iJ} \phi)\), where \(\phi\) is the singing rate for the group of infrequently singing birds, and \(c\) is the proportion of birds during the point count that are infrequent singers. The remaining proportions (\(1 - c\); the intercept of the cumulative density function) of the frequent singers are assumed to be detected instantaneously at the start of the first time interval. In the simplest form of the finite mixture model, the proportion and singing rate of birds that sing infrequently is homogeneous across all times and locations (model Mf0). We are using the type = "fmix" for finite mixture removal models.

Mf0 <- cmulti(Y | D ~ 1, X, type="fmix")
summary(Mf0)

## Call:
## cmulti(formula = Y | D ~ 1, data = X, type = "fmix")
##
## Removal Sampling (heterogeneous singing rate)
## Conditional Maximum Likelihood estimates
##
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## log.phi_(Intercept)  -2.1902     0.4914  -4.457 8.32e-06 ***
## logit.c               0.1182     0.1543   0.766    0.444
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: -257.6
## BIC = 525.8

Time-varying finite mixture removal models

Previously, researchers (see refs in the paper) have applied covariate effects on the parameter \(\phi_{i}\) of the finite mixture model, similarly to how we modeled these effects in conventional models. This model assumes that the parameter \(c\) is constant irrespective of time and location (i.e. only the infrequent singer group changes its singing behavior).

We can fit finite mixture models with JDAY and TSSR as covariates on \(\phi\) (models Mf1 and Mf2). In this case \(p(t_{iJ}) = 1 - c exp(-t_{iJ} \phi_{i})\) and \(log(\phi_{i}) = \beta_{0} + \sum^{K}_{k=1} \beta_{k} x_{ik}\) is the linear predictor with \(K\) covariates and the corresponding unknown coefficients (\(\beta_{k}\), \(k = 0,\ldots, K\)).

Mf1 <- cmulti(Y | D ~ JDAY, X, type="fmix")
Mf2 <- cmulti(Y | D ~ TSSR, X, type="fmix")

Compare the three finite mixture models based on AIC and inspect the summary for the best supported model with the TSSR effect in this case.

Mf_AIC <- AIC(Mf0, Mf1, Mf2)
Mf_AIC$dAIC <- Mf_AIC$AIC - min(Mf_AIC$AIC)

MfBest <- get(rownames(Mf_AIC)[Mf_AIC$dAIC == 0])
Mf_AIC

##     df      AIC      dAIC
## Mf0  2 519.2222 0.1053855
## Mf1  3 520.4007 1.2838985
## Mf2  3 519.1168 0.0000000

summary(MfBest)

## Call:
## cmulti(formula = Y | D ~ TSSR, data = X, type = "fmix")
##
## Removal Sampling (heterogeneous singing rate)
## Conditional Maximum Likelihood estimates
##
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## log.phi_(Intercept)  -1.1939     0.4195  -2.846  0.00442 **
## log.phi_TSSR         -9.0089     4.7712  -1.888  0.05900 .
## logit.c               0.2016     0.1702   1.184  0.23622
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: -256.6
## BIC =   529

We produce a similar plot as before.

b <- coef(MfBest)
op <- par(las=1, mfrow=c(2,1), mar=c(4,4,2,2))
p1 <- 1-plogis(b[3])*exp(-3*exp(b[1]+b[2]*TSSR))
plot(TSSR, p1, ylim=c(0,1), type="n",
    main=paste(spp, rownames(Mf_AIC)[Mf_AIC$dAIC == 0]),
    ylab="P(availability)")
for (i in seq_len(n)) {
    lines(TSSR[c(i,i+1)], p1[c(i,i+1)], col=col[i], lwd=2)
}
plot(Duration, Duration, type="n", ylim=c(0,1),
    ylab="P(availability)")
for (i in seq_len(n)) {
    p2 <- 1-plogis(b[3])*exp(-Duration*exp(b[1]+b[2]*TSSR[i]))
    lines(Duration, p2, col=col[i])
}
abline(v=3, col="grey")
par(op)

An alternative parametrization is that \(c_{i}\) rather than \(\phi\) be the time-varying parameter, allowing the individuals to switch between the frequent and infrequent group depending on covariates. We can fit this class of finite mixture model with JDAY and TSSR as covariates on \(c\) (models Mm1 and Mm2) using type = "mix" (instead of "fmix"). In this case \(p(t_{iJ}) = 1 - c_{i} exp(-t_{iJ} \phi)\) and \(logit(c_{i}) = \beta_{0} + \sum^{K}_{k=1} \beta_{k} x_{ik}\) is the linear predictor with \(K\) covariates and the corresponding unknown coefficients (\(\beta_{k}\), \(k = 0,\ldots, K\)). Because \(c_{i}\) is a proportion, we model it on the logit scale.

Mm1 <- cmulti(Y | D ~ JDAY, X, type="mix")
Mm2 <- cmulti(Y | D ~ TSSR, X, type="mix")

We did not fit a null model for this parametrization, because it is identical to the Mf0 model, so that model Mf0 is what we use to compare AIC values and inspect the summary for the best supported model with the JDAY effect in this case.

Mm_AIC <- AIC(Mf0, Mm1, Mm2)
Mm_AIC$dAIC <- Mm_AIC$AIC - min(Mm_AIC$AIC)

MmBest <- get(rownames(Mm_AIC)[Mm_AIC$dAIC == 0])
Mm_AIC

##     df      AIC      dAIC
## Mf0  2 519.2222 0.1949952
## Mm1  3 519.0272 0.0000000
## Mm2  3 520.8744 1.8471803

summary(MmBest)

## Call:
## cmulti(formula = Y | D ~ JDAY, data = X, type = "mix")
##
## Removal Sampling (heterogeneous singing rate)
## Conditional Maximum Likelihood estimates
##
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## log.phi              -2.1910     0.4914  -4.459 8.24e-06 ***
## logit.c_(Intercept)  -4.7600     3.3828  -1.407    0.159
## logit.c_JDAY         10.7368     7.4287   1.445    0.148
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: -256.5
## BIC = 528.9

We produce a similar plot as before.

b <- coef(MmBest)
op <- par(las=1, mfrow=c(2,1), mar=c(4,4,2,2))
p1 <- 1-plogis(b[2]+b[3]*JDAY)*exp(-3*exp(b[1]))
plot(JDAY, p1, ylim=c(0,1), type="n",
    main=paste(spp, rownames(Mm_AIC)[Mm_AIC$dAIC == 0]),
    ylab="P(availability)")
for (i in seq_len(n)) {
    lines(JDAY[c(i,i+1)], p1[c(i,i+1)], col=col[i], lwd=2)
}
plot(Duration, Duration, type="n", ylim=c(0,1),
    ylab="P(availability)")
for (i in seq_len(n)) {
    p2 <- 1-plogis(b[2]+b[3]*JDAY[i])*exp(-Duration*exp(b[1]))
    lines(Duration, p2, col=col[i])
}
abline(v=3, col="grey")
par(op)

Let the best model win

So which of the 3 parametrizations proved to be best for our Ovenbird example data? It was the finite mixture with time-varying proportion of infrequent singers with a thin margin. Second was the other finite mixture model, while the conventional model was lagging behind.

M_AIC <- AIC(MeBest, MfBest, MmBest)
M_AIC$dAIC <- M_AIC$AIC - min(M_AIC$AIC)
M_AIC

##        df      AIC        dAIC
## MeBest  2 545.4082 26.38106209
## MfBest  3 519.1168  0.08960974
## MmBest  3 519.0272  0.00000000

Conclusions and applications

Finite mixture models provide some really nice insight into how singing behavior changes over time and, due to more parameters, they provide a better fit and thus minimize bias in population size estimates. But all this improvement comes with a price: sample size requirements (or more precisely, the number of detections required) are really high. To have all the benefits with reduced variance, one needs about 1000 non-zero observations to fit finite mixture models, 20 times more than needed to reliably fit conventional removal models. This is much higher than previously suggested minimum sample sizes.

Our findings also indicate that lengthening the count duration from 3 minutes to 5–10 minutes is an important consideration when designing field surveys to increase the accuracy and precision of population estimates. Well-informed survey design combined with various forms of removal sampling are useful in accounting for availability bias in point counts, thereby improving population estimates, and allowing for better integration of disparate studies at larger spatial scales.

To this end, we provide our removal model estimates as part of the QPAD R package and the R functions required to fit all the above outlined removal models as part of the detect R package. We at the Boreal Avian Modelling Project and our collaborators are already utilizing the removal model estimates to correct for availability bias in our continental and regional projects to inform better management and conservation of bird populations. Read more about these projects in our reports.

Please report any issues here and feel free to comment below!

UPDATE: AOS press release, EurekAlert!, ABMI blog.

Shiny slider examples with the intrval R package

2018-03-08T00:00:00-07:00

The intrval R package is lightweight (~11K), standalone (apart from importing from graphics, has exactly 0 non-base dependency), and it has a very narrow scope: it implements relational operators for intervals — very well aligned with the tiny manifesto. In this post we will explore the use of the package in two shiny apps with sliders.

The first example uses a regular slider that returns a single value. To make that an interval, we will use standard deviation (SD, sigma) in a quality control chart (QCC). The code is based on the pistonrings data set from the qcc package. The Shewhart chart sets 3 sigma limit to indicate state of control. The slider is used to adjusts the sigma limit and the GIF below plays is as an animation.

library(shiny)
library(intrval)
library(qcc)

data(pistonrings)
mu <- mean(pistonrings$diameter[pistonrings$trial])
SD <- sd(pistonrings$diameter[pistonrings$trial])
x <- pistonrings$diameter[!pistonrings$trial]

## UI function
ui <- fluidPage(
  plotOutput("plot"),
  sliderInput("x", "x SD:",
    min=0, max=5, value=0, step=0.1,
    animate=animationOptions(100)
  )
)

# Server logic
server <- function(input, output) {
  output$plot <- renderPlot({
    Main <- paste("Shewhart quality control chart",
        "diameter of piston rings", sprintf("+/- %.1f SD", input$x),
        sep="\n")
    iv <- mu + input$x * c(-SD, SD)
    plot(x, pch = 19, col = x %)(% iv +1, type = "b",
        ylim = mu + 5 * c(-SD, SD), main = Main)
    abline(h = mu)
    abline(h = iv, lty = 2)
  })
}

## Run shiny app
if (interactive()) shinyApp(ui, server)

The second example uses range slider returning two values, which is our interval. To spice things up a bit, we combine intervals on two axes to color some random points. The next range slider defines a distance interval and colors the random points inside the ring.

library(shiny)
library(intrval)

set.seed(1)
n <- 10^4
x <- round(runif(n, -2, 2), 2)
y <- round(runif(n, -2, 2), 2)
d <- round(sqrt(x^2 + y^2), 2)

## UI function
ui <- fluidPage(
  titlePanel("intrval example with shiny"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bb_x", "x value:",
        min=min(x), max=max(x), value=range(x),
        step=round(diff(range(x))/20, 1), animate=TRUE
      ),
      sliderInput("bb_y", "y value:",
        min = min(y), max = max(y), value = range(y),
        step=round(diff(range(y))/20, 1), animate=TRUE
      ),
      sliderInput("bb_d", "radial distance:",
        min = 0, max = max(d), value = c(0, max(d)/2),
        step=round(max(d)/20, 1), animate=TRUE
      )
    ),
    mainPanel(
      plotOutput("plot")
    )
  )
)

# Server logic
server <- function(input, output) {
  output$plot <- renderPlot({
    iv1 <- x %[]% input$bb_x & y %[]% input$bb_y
    iv2 <- x %[]% input$bb_y & y %[]% input$bb_x
    iv3 <- d %()% input$bb_d
    op <- par(mfrow=c(1,2))
    plot(x, y, pch = 19, cex = 0.25, col = iv1 + iv2 + 3,
        main = "Intersecting bounding boxes")
    plot(x, y, pch = 19, cex = 0.25, col = iv3 + 1,
         main = "Deck the halls:\ndistance range from center")  
    par(op)
  })
}

## Run shiny app
if (interactive()) shinyApp(ui, server)

If you think there are other use cases for intrval in shiny applications, let me know in the comments section!

If you want to learn more about how to host Shiny apps, check out the Hosting Data Apps website!

Phylogeny and species traits predict bird detectability

2018-02-09T00:00:00-07:00

It all started with this paper in Methods in Ecol. Evol. where we looked at detectability of many species. So we wanted to use life history traits to validate our results. But we had to cut the manuscript, and there was this leftover with some neat patterns, but without much focus. It took a few years, and the most positive peer-review experience ever, and the paper is now early view in Ecography. This post is a quick summary of the goodies stuffed inside the lhreg R package that makes the whole analysis reproducible, and provides some functions for similar PGLMM models.

The R package is hosted on GitHub (no CRAN version yet), please submit any issues here. The package is also archived on Zenodo with DOI 10.5281/zenodo.596410. To install the package, use devtools::install_github("borealbirds/lhreg").

Here, I am going to skim the implementation based on the more complete supporting information of the paper which has all the reproducible code (try vignette(topic = "lhreg", package = "lhreg") after installing and loading the package). Here is the rendered html version.

The most important function is lhreg which takes the following main arguments:

Y: response vector,
X: model matrix for the mean.
SE: standard error estimate (observation error) for the response,
V: correlation matrix,

and fits a Multivariate Normal model to the observed Y vector with phylogenetically based (or any other known) correlations and optionally with observation error (SE), and covariate effects (X). The function is pretty bare-bones (i.e. no formula interface, the design matrix X needs to be properly specified through e.g. model.matrix()). The lambda argument is a non-negative number modifying the strength of phylogenetic effects. lambda = 0 is equivalent to lm with weights = 1/(SE^2), lambda = 1 implies Brownian motion evolution, lambda = NA lets the function estimate it based on the data.

In terms of optimization, besides the algorithms from stats::optim, we also have differential evolution algorithm based on the DEoptim package (a bit time consuming but very reliable). The output object class has some methods defined (like logLik and summary) and as a result AIC/BIC will work out of the box. The vignette also describes a few techniques which are pretty nice to have in a multivariate setting (i.e. profile likelihood, parametric bootstrap) to support advanced hypothesis testing and model selection.

We used leave one out cross-validation to see how well we could predict the values based on data from the other species, traits and phylogeny. The conditional distribution we used for that is described in the paper which made this exercise very straightforward. Maybe it is just ignorance, but I couldn’t find another paper that would have described it in a nice and useful manner, however, if one wishes to make trait/phylogeny based predictions for detectability, this formula is going to be very useful (look inside the loo2 function for implementation).

At the end of the vignette, there is a hack based on phytools::contMap function to produce non-rainbow colors. (It was surprisingly non-straightforward to hack the code — modular code please!) The following figure shows the two input data vectors mirrored side-by-side:

I realize this is not a very detailed post, but the paper and the vignette should satisfy your curiosity. If you still have unanswered questions, feel free to ask them below!

UPDATE

2018-09-26: post on Ecography blog and video abstract.
2018-04-13: a related blog post in Hungarian.

PVA: Publication Viability Analysis, round 3

2018-02-06T00:00:00-07:00

A friend and colleague of mine, Péter Batáry has circulated news from Nature magazine about the EU freezing innovation funds to Bulgaria. The article had a figure about publication trends for Bulgaria, compared with Romania and Hungary. As I have blogged about such trends in ecology before (here and here), I felt the need to update my PVA models with two years worth of data from WoS.

After downloading the yearly publications numbers using filters ADDRESS=HUNGARY; CATEGORIES=ECOLOGY, I started where I left off few years ago. I fit Ricker growth model to two time intervals of the data: 1978–1997, and 1998–2017.

The R code below uses the PVAClone package that I wrote with Khurram Nadeem, and is based on fitting state-space models using MCMC and data cloning with JAGS. The other intrval package is pretty new but handy little helper (see related posts here)

library(PVAClone)
library(intrval)

## the data from WoS
x <- structure(list(years = 1973:2017, records = c(1, 0, 4, 0, 0,
    6, 2, 5, 4, 7, 5, 7, 3, 5, 9, 11, 20, 8, 10, 15, 29, 24, 53,
    12, 13, 30, 32, 36, 45, 39, 42, 43, 50, 62, 95, 106, 113, 83,
    108, 99, 89, 117, 111, 134, 127)), .Names = c("years", "records"
    ), row.names = c(NA, 45L), class = "data.frame")

## fit the 2 models
ncl <- 10 # number of clones
m1 <- pva(x$records[x$years %[]% c(1978, 1997)], ricker("none"), ncl)
m2 <- pva(x$records[x$years %[]% c(1998, 2017)], ricker("none"), ncl)

## organize estimates
cf <- data.frame(t(sapply(list(early=m1, late=m2), coef)))
cf$K <- with(cf, -a/b)

## growth curve: early period
yr1 <- 1978:1997
pr1 <- numeric(length(yr1))
pr1[1] <- log(x$records[x$years==1978])
for (i in 2:length(pr1))
    pr1[i] <- pr1[i-1] + cf["early", "a"] + cf["early", "b"]*exp(pr1[i-1])
pr1 <- exp(pr1)

## growth curve: late period
yr2 <- 1998:2017
pr2 <- numeric(length(yr2))
pr2[1] <- log(x$records[x$years==1998])
for (i in 2:length(pr2))
    pr2[i] <- pr2[i-1] + cf["late", "a"] + cf["late", "b"]*exp(pr2[i-1])
pr2 <- exp(pr2)

## and finally the figure using base graphics
op <- par(las=2)
barplot(x$records, names.arg = x$years, space=0,
    ylab="# of publications", xlab="years",
    col=ifelse(x$years < 1998, "grey", "gold"))
lines(yr1-min(x$years)+0.5, pr1, col=4)
abline(h=cf["early", "K"], col=4, lty=3)
lines(yr2-min(x$years)+0.5, pr2, col=2)
abline(h=cf["late2017", "K"], col=2, lty=3)
par(op)

Here are the model parameters for the two Ricker models:

	a	b	sigma	K
1978–1997	0.38	-0.03	0.60	13.85
1998–2017	0.21	0.00	0.16	119.00

The K carrying capacity used to be 100 based on 1998–2012 data, but now K = 119, which is a significant improvement — heartfelt kudos to the ecologists in Hungary (more papers please)! The growth rate hasn’t changed (a = 0.21). So we can conclude that if the rate remained constant but carrying capacity increased, the change must be related to resource availability (i.e. increased funding, more jobs, improved infrastructure).

This is good news to me! Let me know what you think by leaving a comment below!

The progress bar just got a lot cheaper

2018-01-23T00:00:00-07:00

The pbapply R package that adds progress bar to vectorized functions has been know to accumulate overhead when calling parallel::mclapply with forking (see this post for more background on the issue). Strangely enough, a GitHub issue held the key to the solution that I am going to outline below. Long story short: forking is no longer expensive with pbapply, and as it turns out, it never was.

The issue mentioned parallel::makeForkCluster as the way to set up a Fork cluster, which, according to the help page, ‘is merely a stub on Windows. On Unix-alike platforms it creates the worker process by forking’. So I looked at some timings starting with one of the examples on the ?pbapply help page:

library(pbapply)
set.seed(1234)
n <- 200
x <- rnorm(n)
y <- rnorm(n, crossprod(t(model.matrix(~ x)), c(0, 1)), sd = 0.5)
d <- data.frame(y, x)

mod <- lm(y ~ x, d)
ndat <- model.frame(mod)
B <- 100
bid <- sapply(1:B, function(i) sample(nrow(ndat), nrow(ndat), TRUE))
fun <- function(z) {
    if (missing(z))
        z <- sample(nrow(ndat), nrow(ndat), TRUE)
    coef(lm(mod$call$formula, data=ndat[z,]))
} 

## forking with mclapply
system.time(res1 <- pblapply(1:B, function(i) fun(bid[,i]), cl = 2L))
##   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 01s
##   user  system elapsed 
##  0.587   0.919   0.845 

## forking with parLapply
cl <- makeForkCluster(2L)
system.time(res2 <- pblapply(1:B, function(i) fun(bid[,i]), cl = cl))
##   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 00s
##   user  system elapsed 
##  0.058   0.009   0.215 
stopCluster(cl)

## Socket cluster (need to pass objects to workers)
cl <- makeCluster(2L)
clusterExport(cl, c("fun", "mod", "ndat", "bid"))
system.time(res3 <- pblapply(1:B, function(i) fun(bid[,i]), cl = cl))
##   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 00s
##   user  system elapsed 
##  0.053   0.008   0.169 
stopCluster(cl)

Forking with mclapply is still pricey, but the almost equivalent makeForkCluster trick, that does not require objects to be passed to workers due to the shared memory nature of the process, is pretty close to the ordinary Socket cluster option.

What if I used this trick in the package? I would then create a Fork cluster (cl <- makeForkCluster(cl)), run parLapply(cl, ...), and destroy the cluster with on.exit(stopCluster(cl), add = TRUE). So I created a branch to do some tests:

ncl <- 2
B <- 1000
fun <- function(x) {
    Sys.sleep(0.01)
    x^2
}
library(pbmcapply)
(t1 <- system.time(pbmclapply(1:B, fun, mc.cores = ncl)))
##  |========================================================| 100%, ETA 00:00
##   user  system elapsed 
##  0.242   0.114   5.461 

library(pbapply) # 1.3-4 CRAN version
(t2 <- system.time(pblapply(1:B, fun, cl = ncl)))
##   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 07s
##   user  system elapsed 
##  0.667   1.390   6.547 

library(pbapply) # 1.3-5 fork-cluster-speedup branch
(t3 <- system.time(pblapply(1:B, fun, cl = ncl)))
##   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 06s
##   user  system elapsed 
##  0.225   0.100   5.710 

Really nice so far: pbapply caught up to forking based timings with pbmcapply. Let’s see a bit more extensive runs to see how the number of progress bar updates affects the timings. If things work as I hope, there shouldn’t be an increase with the new forking idea:

timer_fun <- function(X, FUN, nout = 100, ...) {
    pbo <- pboptions(nout = nout)
    on.exit(pboptions(pbo))
    unname(system.time(pblapply(X, FUN, ...))[3])
}
timer_NULL <- list(
    nout1  = timer_fun(1:B, fun, nout = 1,       cl = NULL),
    nout10  = timer_fun(1:B, fun, nout = 10,     cl = NULL),
    nout100  = timer_fun(1:B, fun, nout = 100,   cl = NULL),
    nout1000  = timer_fun(1:B, fun, nout = 1000, cl = NULL))
unlist(timer_NULL)
##   nout1   nout10  nout100 nout1000 
##  12.221   11.899   11.775   11.260 

cl <- makeCluster(ncl)
timer_cl <- list(
    nout1  = timer_fun(1:B, fun, nout = 1,       cl = cl),
    nout10  = timer_fun(1:B, fun, nout = 10,     cl = cl),
    nout100  = timer_fun(1:B, fun, nout = 100,   cl = cl),
    nout1000  = timer_fun(1:B, fun, nout = 1000, cl = cl))
stopCluster(cl)
unlist(timer_cl)
##   nout1   nout10  nout100 nout1000 
##   6.033    6.091    6.011    6.273 


## forking with 1.3-4 CRAN version
timer_mc <- list(
    nout1  = timer_fun(1:B, fun, nout = 1,       cl = ncl),
    nout10  = timer_fun(1:B, fun, nout = 10,     cl = ncl),
    nout100  = timer_fun(1:B, fun, nout = 100,   cl = ncl),
    nout1000  = timer_fun(1:B, fun, nout = 1000, cl = ncl))
unlist(timer_mc)
##   nout1   nout10  nout100 nout1000 
##   5.563    5.659    6.620   10.692 

## forking with 1.3-5 fork-cluster-speedup branch
timer_new <- list(
    nout1  = timer_fun(1:B, fun, nout = 1,       cl = ncl),
    nout10  = timer_fun(1:B, fun, nout = 10,     cl = ncl),
    nout100  = timer_fun(1:B, fun, nout = 100,   cl = ncl),
    nout1000  = timer_fun(1:B, fun, nout = 1000, cl = ncl))
unlist(timer_new)
##   nout1   nout10  nout100 nout1000 
##   5.480    5.574    5.665    6.063 

The new implementation with the Fork cluster trick hands down beat the old implementation using mclapply. I wonder what is causing the wildly different timings results. Is it due to all the other mclapply arguments that give control over pre-scheduling, cleanup, and RNG seeds?

The new branch can be installed as:

devtools::install_github("psolymos/pbapply", ref = "fork-cluster-speedup")

I am a bit reluctant of merging the new branch for the following reasons:

makeForkCluster was already an option before by explicitly stating the cluster to be a Fork;
by hiding the process of creating and destroying the cluster, user options are restricted (i.e. no control over RNGs, which can be a major drawback for simulations);
mclapply wasn’t so bad to begin with, because the number of updates were capped by the nout option.

I would recommend the following workflow that is based purely on the stable CRAN version:

cl <- makeForkCluster(2L)
output <- pblapply(..., cl = cl)
stopCluster(cl)

As always, I am keen on hearing what you think: either in the comments or on GitHub.

What is new in the intrval R package?

2017-01-26T00:00:00-07:00

An update (v 0.1-1) of the intrval package was recently published on CRAN. The package simplifies interval related logical operations (read more about the motivation in this post). So what is new in this version? Some of the inconsistencies in the 1st CRAN release have been cleaned up, and I have been pushed hard (see GitHub issue to implement all the 16 interval-to-interval operators. These operators define the open/closed nature of the lower/upper limits of the intervals on the left and right hand side of the o in the middle as in c(a1, b1) %[]o[]% c(a2, b2).

Interval 1:	Interval 2: `[]`	`[)`	`(]`	`()`
`[]`	`%[]o[]%`	`%[]o[)%`	`%[]o(]%`	`%[]o()%`
`[)`	`%[)o[]%`	`%[)o[)%`	`%[)o(]%`	`%[)o()%`
`(]`	`%(]o[]%`	`%(]o[)%`	`%(]o(]%`	`%(]o()%`
`()`	`%()o[]%`	`%()o[)%`	`%()o(]%`	`%()o()%`

The overlap of two closed intervals, [a1, b1] and [a2, b2], is evaluated by the %[]o[]% (%[o]% is an alias) operator (a1 <= b1, a2 <= b2). Endpoints can be defined as a vector with two values (c(a1, b1)) or can be stored in matrix-like objects or a lists in which case comparisons are made element-wise.

If lengths do not match, shorter objects are recycled. These value-to-interval operators work for numeric (integer, real) and ordered vectors, and object types which are measured at least on ordinal scale (e.g. dates). Note that interval endpoints are sorted internally thus ensuring the conditions a1 <= b1 and a2 <= b2 is not necessary.

c(2, 3) %[]o[]% c(0, 1)
list(0:4, 1:5) %[]o[]% c(2, 3)
cbind(0:4, 1:5) %[]o[]% c(2, 3)
data.frame(a=0:4, b=1:5) %[]o[]% c(2, 3)

%)o(% is used for the negation of two closed interval overlap (%[o]%), directional evaluation is done via the operators %[ and %[o>]%. The overlap of two open intervals is evaluated by the %(o)% (alias for %()o()%). %]o[% is used for the negation of two open interval overlap, directional evaluation is done via the operators %( and %(o>)%. Overlap operators with mixed endpoint do not have negation and directional counterparts.




  
    
      Equal
      Not equal
      Less than
      Greater than
    
  
  
    
      %[o]%
      %)o(%
      %[
      %[o>]%
    
    
      %(o)%
      %]o[%
      %(
      %(o>)%
    
  


Thanks for all the feedback so far and please keep’em coming: 
leave a comment below or use the issue tracker
to provide feedback or report a problem.

Equal	Not equal	Less than	Greater than
`%[o]%`	`%)o(%`	`%[`	`%[o>]%`
`%(o)%`	`%]o[%`	`%(`	`%(o>)%`



Relational operators for intervals with the intrval R package
2016-12-02T00:00:00-07:00
I recently posted a piece about how to write and document special functions in R. I meant that as a prelude for the topic I am writing about in this post. Let me start at the beginning. The other day Dirk Eddelbuettel tweeted about the new release of the data.table package (v1.9.8).
There were new features announced for joins based on %inrange% and %between%. That got me thinking: it would be really cool to generalize this idea for different intervals, for example as x %[]% c(a, b).

Motivation

We want to evaluate if values of x satisfy the condition x >= a & x <= b given that a <= b. Typing x %[]% c(a, b) instead of the previous expression is not much shorter (14 vs. 15 characters with counting spaces). But considering the a <= b condition as well, it becomes a saving (x >= min(a, b) & x <= mmax(a, b) is 31 characters long). And sorting is really important, because by flipping a and b, we get quite different answers:

x <- 5
x >= 1 & x <= 10
# [1] TRUE
x >= 10 & x <= 1
# [1] FALSE


Also, min and max will not be very useful when we want to vectorize the expression. We need to use pmin and pmax for obvious reasons:

x >= min(1:10, 10:1) & x <= max(10:1, 1:10)
# [1] TRUE
x >= pmin(1:10, 10:1) & x <= pmax(10:1, 1:10)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE


If interval endpoints can also be open or closed, and allowing them to flip around makes the semantics of left/right closed/open interval definitions hard. We can thus all agree that there is a need for an expression, like x %[]% c(a, b), that is compact, flexible, and invariant to endpoint sorting. This is exactly what the intrval package is for!

What’s in the package

Functions for evaluating if values of vectors are within
different open/closed intervals
(x %[]% c(a, b)), or if two closed
intervals overlap (c(a1, b1) %[o]% c(a2, b2)).
Operators for negation and directional relations also implemented.

Value-to-interval relations

Values of x are compared to interval endpoints a and b (a <= b).
Endpoints can be defined as a vector with two values (c(a, b)): these values will be compared as a single interval with each value in x.
If endpoints are stored in a matrix-like object or a list,
comparisons are made element-wise.

x <- rep(4, 5)
a <- 1:5
b <- 3:7
cbind(x=x, a=a, b=b)
x %[]% cbind(a, b) # matrix
x %[]% data.frame(a=a, b=b) # data.frame
x %[]% list(a, b) # list


If lengths do not match, shorter objects are recycled. Return values are logicals.
Note: interval endpoints are sorted internally thus ensuring the condition
a <= b is not necessary.

These value-to-interval operators work for numeric (integer, real) and ordered vectors, and object types which are measured at least on ordinal scale (e.g. dates).

Closed and open intervals

The following special operators are used to indicate closed ([, ]) or open ((, )) interval endpoints:


  
    
      Operator
      Expression
      Condition
    
  
  
    
      %[]%
      x %[]% c(a, b)
      x >= a & x <= b
    
    
      %[)%
      x %[)% c(a, b)
      x >= a & x < b
    
    
      %(]%
      x %(]% c(a, b)
      x > a & x <= b
    
    
      %()%
      x %()% c(a, b)
      x > a & x < b
    
  


Negation and directional relations


  
    
      Equal
      Not equal
      Less than
      Greater than
    
  
  
    
      %[]%
      %)(%
      %[<]%
      %[>]%
    
    
      %[)%
      %)[%
      %[<)%
      %[>)%
    
    
      %(]%
      %](%
      %(<]%
      %(>]%
    
    
      %()%
      %][%
      %(<)%
      %(>)%
    
  


The helper function intrval_types can be used to
print/plot the following summary:



Interval-to-interval relations

The overlap of two closed intervals, [a1, b1] and [a2, b2],
is evaluated by the %[o]% operator (a1 <= b1, a2 <= b2).
Endpoints can be defined as a vector with two values
(c(a1, b1))or can be stored in matrix-like objects or a lists
in which case comparisons are made element-wise.
Note: interval endpoints
are sorted internally thus ensuring the conditions
a1 <= b1 and a2 <= b2 is not necessary.

c(2:3) %[o]% c(0:1)
list(0:4, 1:5) %[o]% c(2:3)
cbind(0:4, 1:5) %[o]% c(2:3)
data.frame(a=0:4, b=1:5) %[o]% c(2:3)


If lengths do not match, shorter objects are recycled.
These value-to-interval operators work for numeric (integer, real)
and ordered vectors, and object types which are measured at
least on ordinal scale (e.g. dates).

%)o(% is used for the negation,
directional evaluation is done via the operators %[ and %[o>]%.



  
    
      Equal
      Not equal
      Less than
      Greater than
    
  
  
    
      %[o]%
      %)o(%
      %[
      %[o>]%
    
  


Operators for discrete variables

The previous operators will return NA for unordered factors.
Set overlap can be evaluated by the base %in% operator and its negation
%nin%. (This feature is really redundant, I know, but decided to include regardless…)

Install

Install development version from GitHub (not yet on CRAN):

library(devtools)
install_github("psolymos/intrval")


The package is licensed under GPL-2.

Examples



library(intrval)

## bounding box
set.seed(1)
n <- 10^4
x <- runif(n, -2, 2)
y <- runif(n, -2, 2)
d <- sqrt(x^2 + y^2)
iv1 <- x %[]% c(-0.25, 0.25) & y %[]% c(-1.5, 1.5)
iv2 <- x %[]% c(-1.5, 1.5) & y %[]% c(-0.25, 0.25)
iv3 <- d %()% c(1, 1.5)
plot(x, y, pch = 19, cex = 0.25, col = iv1 + iv2 + 1,
    main = "Intersecting bounding boxes")
plot(x, y, pch = 19, cex = 0.25, col = iv3 + 1,
     main = "Deck the halls:\ndistance range from center")

## time series filtering
x <- seq(0, 4*24*60*60, 60*60)
dt <- as.POSIXct(x, origin="2000-01-01 00:00:00")
f <- as.POSIXlt(dt)$hour %[]% c(0, 11)
plot(sin(x) ~ dt, type="l", col="grey",
    main = "Filtering date/time objects")
points(sin(x) ~ dt, pch = 19, col = f + 1)

## QCC
library(qcc)
data(pistonrings)
mu <- mean(pistonrings$diameter[pistonrings$trial])
SD <- sd(pistonrings$diameter[pistonrings$trial])
x <- pistonrings$diameter[!pistonrings$trial]
iv <- mu + 3 * c(-SD, SD)
plot(x, pch = 19, col = x %)(% iv +1, type = "b", ylim = mu + 5 * c(-SD, SD),
    main = "Shewhart quality control chart\ndiameter of piston rings")
abline(h = mu)
abline(h = iv, lty = 2)


## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

lm.D9 <- lm(weight ~ group)
## compare 95% confidence intervals with 0
(CI.D9 <- confint(lm.D9))
#                2.5 %    97.5 %
# (Intercept)  4.56934 5.4946602
# groupTrt    -1.02530 0.2833003
0 %[]% CI.D9
# (Intercept)    groupTrt
#       FALSE        TRUE

lm.D90 <- lm(weight ~ group - 1) # omitting intercept
## compare 95% confidence of the 2 groups to each other
(CI.D90 <- confint(lm.D90))
#            2.5 %  97.5 %
# groupCtl 4.56934 5.49466
# groupTrt 4.19834 5.12366
CI.D90[1,] %[o]% CI.D90[2,]
# 2.5 %
#  TRUE

DATE <- as.Date(c("2000-01-01","2000-02-01", "2000-03-31"))
DATE %[<]% as.Date(c("2000-01-151", "2000-03-15"))
# [1]  TRUE FALSE FALSE
DATE %[]% as.Date(c("2000-01-151", "2000-03-15"))
# [1] FALSE  TRUE FALSE
DATE %[>]% as.Date(c("2000-01-151", "2000-03-15"))
# [1] FALSE FALSE  TRUE


For more examples, see the unit-testing script.

Feedback

Please check out the package and use the issue tracker
to suggest a new feature or report a problem.

Update (2016-12-04)

Sergey Kashin pointed out that some operators are redundant. It is now explained in the manual:

Note that some operators return identical results but
are syntactically different:
%[<]% and %[<)% both evaluate x < a;
%[>]% and %(>]% both evaluate x > b;
%(<]% and %(<)% evaluate x <= a;
%[>)% and %(>)% both evaluate x >= b.
This is so because we evaluate only one end of the interval
but still conceptually referring to the relationship
defined by the right-hand-side interval object.
This implies 2 conditional logical evaluations
instead of treating it as a single 3-level ordered factor.

Update (2016-12-06)

intrval R package v0.1 is on CRAN: https://CRAN.R-project.org/package=intrval


How to write and document %special% functions in R
2016-11-26T00:00:00-07:00
I spend a considerable portion of my working hours with data processing where I often use the %in% R function as x %in% y. Whenever I need the negation of that, I used to write !(x %in% y). Not much of a hassle, but still, wouldn’t it be nicer to have x %notin% y instead? So I decided to code it for my mefa4 package that I maintain primarily to make my data munging time shorter and more efficient. Coding a %special% function was no big deal. But I had to do quite a bit of research and trial-error until I figured out the proper documentation. So here it goes.

The function

The function name needs quotes and exactly two arguments, one for the left and one for the right hand side of the operator in the middle:

"%notin%" <- function(x, table) !(match(x, table, nomatch = 0) > 0)


Let us see what it does:

1:4 %in% 3:5
## [1] FALSE FALSE  TRUE  TRUE
1:4 %notin% 3:5
## [1]  TRUE  TRUE FALSE FALSE


The NAMESPACE entry

We need to export the function, so just add the following entry to the NAMESPACE file:

export("%notin%")


The Rd file

This is where things get are a bit more interesting. The LaTeX engine needs the percent sign to be escaped (\%) throughout the whole documentation. Also pay close attention to the usage section (x \%notin\% table).

\name{\%notin\%}
\alias{\%notin\%}
\title{
Negated Value Matching
}
\description{
\code{\%notin\%} is the negation of \code{\link{\%in\%}},
which returns a logical vector indicating if there is a non-match or not
for its left operand.
}
\usage{
x \%notin\% table
}
\arguments{
  \item{x}{
vector or \code{NULL}: the values to be matched.
}
  \item{table}{
vector or \code{NULL}: the values to be matched against.
}
}
\value{
A logical vector, indicating if a non-match was located for each element of
\code{x}: thus the values are \code{TRUE} or \code{FALSE} and never \code{NA}.
}
\author{
Peter Solymos 
}
\seealso{
All the opposite of what is written for \code{\link{\%in\%}}.
}
\examples{
1:10 \%notin\% c(1,3,5,9)
sstr <- c("c","ab","B","bba","c",NA,"@","bla","a","Ba","\%")
sstr[sstr \%notin\% c(letters, LETTERS)]
}
\keyword{manip}
\keyword{logic}


UPDATE

Some updates from the comments:


  From Marcin: One can use roxygen2 for writing package documentation, see the magrittr package docs on the %>% (pipe) operator.
  From Andrey: The Hmisc package also has a similar %nin% function ({match(x, table, nomatch = 0) == 0}). (Note that the unexported Matrix:::"%nin%" is defined as {is.na(match(x, table))}.)

Operator	Expression	Condition
`%[]%`	`x %[]% c(a, b)`	`x >= a & x <= b`
`%[)%`	`x %[)% c(a, b)`	`x >= a & x < b`
`%(]%`	`x %(]% c(a, b)`	`x > a & x <= b`
`%()%`	`x %()% c(a, b)`	`x > a & x < b`

Equal	Not equal	Less than	Greater than
`%[]%`	`%)(%`	`%[<]%`	`%[>]%`
`%[)%`	`%)[%`	`%[<)%`	`%[>)%`
`%(]%`	`%](%`	`%(<]%`	`%(>]%`
`%()%`	`%][%`	`%(<)%`	`%(>)%`