Generation of item response theory (IRT) based CASI scores in the ACT study

Adult Changes in Thought (ACT) study, Data Science and Informatics Core, Kaiser Permanente Washington Health Research Institute, Seattle WA; contact email: Rod.L.Walker@kp.org

Introduction

This vignette explains the derivation of the CASI bifactor scores and their standard errors that are included in the ACT study’s 2024 analytic data freeze (and later freezes). The variable names in that data are casi_bifactor_score and casi_bifactor_score_SE, generated using R version 4.3.3 (2024-02-29) with the package mirt version 1.44.0. The ACT study’s previous 2022 analytic data freeze included only the scores, not their standard errors. The variable name in that previous data is casi_BI_score, generated using the package lavaan version 0.6.16. At the end of this document, we also briefly describe the CASI scores included in the ACT analytic data freezes prior to 2022.

Background

At each baseline and biennial ACT in-person study visit, participants are administered the Cognitive Abilities Screening Instrument (CASI), which includes 39 cognitive items [1]. A total CASI score is the sum of these items and ranges from 0 to 100. The ACT study uses this score as a screening tool at each visit to determine which participants should be referred to a diagnostic evaluation for possible dementia. While useful as a screening tool in ACT, the total CASI score is less useful for modeling cognitive change over time due to limited sensitivity at higher levels of cognitive functioning and nonlinear measurement properties [2,3].

To address this potential shortcoming, the ACT study has used item response theory (IRT) to create a score [4] based on the same 39 test items. In IRT, two types of parameters are estimated for each binary item: difficulty (the level of underlying cognitive ability needed to answer correctly) and discrimination (how well the item distinguishes between high and low cognitive ability). The graded response model (GRM) [5] and generalized partial credit model (GPCM) [6] extend this with additional difficulty thresholds for polytomous items.
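
To make the difficulty and discrimination parameters concrete, the following base R sketch evaluates the textbook two-parameter logistic (2PL) response function for a binary item and the GPCM category probabilities for a polytomous item. The function names (p_2pl, p_gpcm) and all parameter values are illustrative assumptions, not CASI estimates, and the sketch uses the difficulty/discrimination form described above rather than any package's internal parameterization.

```r
# 2PL: probability of a correct response given ability theta,
# discrimination a, and difficulty b
p_2pl <- function(theta, a, b) {
  plogis(a * (theta - b))
}

# GPCM: probabilities of categories 0..K for one item with K step
# difficulties b; category k has numerator exp(sum_{v <= k} a * (theta - b_v))
p_gpcm <- function(theta, a, b) {
  z  <- c(0, cumsum(a * (theta - b)))  # category 0 corresponds to an empty sum
  ez <- exp(z - max(z))                # subtract the max for numerical stability
  ez / sum(ez)
}

# illustrative values only (hypothetical, not CASI item parameters)
p_2pl(theta = 0, a = 1.5, b = 0)   # ability equal to difficulty
p_2pl(theta = 1, a = 1.5, b = 0)   # higher ability, same item
p_gpcm(theta = 0, a = 1.2, b = c(-1, 0.5))
```

A larger discrimination makes the probability rise more steeply around the difficulty, which is why such items carry more weight in the resulting score.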

There are several reasons to prefer IRT-based CASI scores to the total CASI sum scores.

IRT scores are generally more accurate, in that items are weighted according to their relationship with the underlying trait - in this case, global cognition.
IRT scores measure change in cognition with less bias because the score has a linear relationship with the underlying trait.
We can compute IRT scores when there are some missing items, whereas that is not possible with the sum scores. Scores for visits with missing items will be on the same metric as those with responses to all the items. The IRT score will be less precise than it would have been without missing items but still is better than not knowing the score at all.
Because missing items are not a problem, we can compute scores for some of the people with noted validity issues (more on this below). For example, we can set items that involved motor skills to missing and estimate IRT scores based on the remaining items. Unlike a sum score, the IRT scores for people with impaired motor skills are on the same metric as everyone else.
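
The point about missing items can be illustrated with a small base R sketch of expected a posteriori (EAP) scoring for binary 2PL items. The function eap_2pl and the item parameters are hypothetical and much simpler than the CASI model described below; items coded NA simply drop out of the likelihood, so a score on the same metric is still produced.

```r
# EAP score and posterior SD for binary 2PL items on a quadrature grid;
# items coded NA are left out of the likelihood
eap_2pl <- function(resp, a, b, grid = seq(-6, 6, length.out = 121)) {
  keep <- !is.na(resp)
  loglik <- vapply(grid, function(th) {
    p <- plogis(a[keep] * (th - b[keep]))
    sum(dbinom(resp[keep], size = 1, prob = p, log = TRUE))
  }, numeric(1))
  post <- exp(loglik) * dnorm(grid)   # standard normal prior on ability
  post <- post / sum(post)
  est  <- sum(grid * post)
  se   <- sqrt(sum((grid - est)^2 * post))
  c(score = est, se = se)
}

# hypothetical item parameters (not CASI estimates)
a <- c(1.2, 0.8, 1.5, 1.0, 2.0)
b <- c(-1.5, -0.5, 0.0, 0.5, 1.0)

full    <- eap_2pl(c(1, 1, 1, 0, 0), a, b)
partial <- eap_2pl(c(1, 1, NA, 0, NA), a, b)  # two items set to missing
all_ok  <- eap_2pl(c(1, 1, 1, 1, 1), a, b)    # every item answered correctly
rbind(full, partial, all_ok)
```

In this toy example the pattern with missing items still receives a score, with a larger standard error than the complete pattern, and the all-correct pattern receives a finite score because of the prior.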

Below we describe the process by which we developed this IRT-based CASI score in ACT, and we briefly discuss how this differs from previous processes.

Identifying the development sample

We first needed to identify the sample that would be used to estimate the CASI item parameters. Ideally, we wanted to use one ACT study visit with valid CASI items from each participant. We mentioned the validity of the CASI above, so we expand on that here. After the CASI is administered at a study visit, the interviewer denotes whether the CASI sum score should be considered valid or if there are factors that call the validity into question. Here are the validity codes for the CASI in the ACT study.

1 = Valid
2 = Poor hearing
3 = Poor eyesight
4 = Impaired motor control
5 = Language barrier
6 = Impaired alertness or attentiveness
7 = Significant physical or mental discomfort
8 = Unable to read or write
9 = Other reasons

For visits with validity codes of 2, 5, 6, 7, or 9, we do not know which of the 39 cognitive items that make up the total CASI score are valid, so these visits were not usable for deriving item parameters. For visits with validity codes of 3, 4, or 8, however, we do know which of the items could be affected. Therefore, for those visits, we set items that could be affected by poor eyesight, being unable to read or write, or impaired motor control, respectively, to missing, but otherwise retained the other items. For visits with a validity code of 1, we could use data from all 39 cognitive items.
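
The masking rule above can be sketched in base R as follows. The item names and the mapping from validity codes to affected items are hypothetical placeholders; this vignette does not list which of the 39 items are treated as affected by each code.

```r
# toy visit-level data; item names are hypothetical
visits <- data.frame(
  id          = 1:5,
  validity    = c(1, 3, 4, 8, 6),
  item_read   = c(1, 1, 1, 0, 1),
  item_draw   = c(2, 1, 2, 1, 2),
  item_write  = c(1, 0, 1, 1, 1),
  item_recall = c(1, 1, 0, 1, 1)
)

# hypothetical mapping: which items each validity code could affect
affected <- list(
  "3" = c("item_read", "item_draw"),   # poor eyesight
  "4" = c("item_draw", "item_write"),  # impaired motor control
  "8" = c("item_read", "item_write")   # unable to read or write
)

# codes 2, 5, 6, 7, 9 give no item-level information, so those visits are dropped
usable <- visits[visits$validity %in% c(1, 3, 4, 8), ]

# codes 3, 4, 8: set only the potentially affected items to missing
for (code in names(affected)) {
  rows <- usable$validity == as.integer(code)
  if (any(rows)) usable[rows, affected[[code]]] <- NA
}
usable
```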

Using data from the ACT study’s 2020 analytic data freeze, we identified each ACT participant’s latest in-person study visit with at least some non-missing, valid CASI items (i.e., validity codes of 1, 3, 4, or 8) to serve as the estimation sample for the item parameters. We chose the latest visit to have the maximum spread of ability in our sample, which included data from all 5,746 ACT participants appearing in that data freeze.
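
As a minimal sketch of this selection step, the base R code below keeps each participant's latest visit among those with a usable validity code and at least one non-missing item. The data and column names (visit, n_nonmiss) are hypothetical and are not the variable names in the ACT data freeze.

```r
# toy long-format data: one row per participant visit (hypothetical columns)
visits <- data.frame(
  id        = c(1, 1, 1, 2, 2, 3),
  visit     = c(0, 1, 2, 0, 1, 0),
  validity  = c(1, 1, 6, 1, 3, 5),
  n_nonmiss = c(39, 39, 0, 39, 30, 0)  # count of non-missing, valid items
)

# keep visits with validity codes 1, 3, 4, or 8 and at least one usable item
eligible <- visits[visits$validity %in% c(1, 3, 4, 8) & visits$n_nonmiss > 0, ]

# sort so each participant's latest eligible visit comes first, then keep it
eligible <- eligible[order(eligible$id, -eligible$visit), ]
latest   <- eligible[!duplicated(eligible$id), ]
latest
```

In this toy data, participant 1's most recent visit is skipped because of its validity code, so the visit before it is used, and participant 3 has no eligible visit at all. In the actual development sample, every participant contributed a visit.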

Deriving the item parameters

For estimation we started with the R package lavaan. In setting up the model, we wanted to treat the non-dichotomous items as ordinal, but the package limited ordinal variables to 10 categories, so we combined some response levels for the items animals and draw, using the same groupings as in prior research [7]; we also collapsed levels for some other items (cas_read, yr, mo, casi_dat) in alignment with that work. Because some item responses were rare, we combined additional categories for items byr (“When Were You Born—Year”), body (“Identify Part of Face/Body”), and jgmt (“Judgement”). We also combined object naming items obja and objb into one item, both correct vs not.
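
The kind of recoding described above can be sketched in base R as follows. The raw values and cutpoints are purely illustrative; they are not the groupings used for the ACT model or in the prior research [7].

```r
# hypothetical raw values for an item with many response levels
animals <- c(0, 3, 7, 12, 18, 25, NA)

# collapse into at most 10 ordered categories (illustrative cutpoints only)
breaks  <- c(-Inf, 2, 4, 6, 8, 10, 12, 14, 16, 18, Inf)
ranimal <- cut(animals, breaks = breaks, labels = FALSE) - 1L  # categories 0..9
table(ranimal, useNA = "ifany")

# combine two binary items into one: 1 if both are correct, 0 otherwise
obja <- c(1, 1, 0, NA)
objb <- c(1, 0, 0, 1)
robj <- as.integer(obja == 1 & objb == 1)  # NA when "both correct" cannot be determined
robj
```

How a combined item should be coded when only one component is missing is a design choice; the sketch simply lets R's logical NA rules decide and is not a statement of how the ACT data were handled.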

We initially considered a unidimensional model. We used the lavaan cfa function’s option to mimic estimation as done by Mplus, using a weighted least squares mean and variance adjusted (WLSMV) estimator.

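# load the packages used in the lavaan code below
library(lavaan)
library(data.table)  # casi_last is a data.table (see the := assignments below)

# unidimensional model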
uni_model <- '
  # latent variables
      casi_bi=~ NA*bpl + byr2 + bday + casi_age + mnt + sun + rgs1 + rdbsum 
                + rc1a + rc1b + rc1c + rsubtra + rmmyear + rmmmonth + rmmdate 
                + mmday + mmseason +  rmmctst + spb + ranimal + sim + rjgmt 
                + rpta + rptb + rmmread + cas_writ + rmmdraw + rcmd + rc2a 
                + rc2b + rc2c + rbody + robj + rcobj           
    # LV mean & variance setup
        casi_bi ~  0*1
        casi_bi ~~ 1*casi_bi
'

# estimate unidimensional model with lavaan using WLSMV equivalent that corresponds closely to Mplus
fit_uni <- cfa(uni_model, data=casi_last,  mimic="Mplus", meanstructure=TRUE,
                   missing="pairwise", parameterization="theta", 
                   ordered = TRUE)

# unidimensional model fit from lavaan
round(summary(fit_uni, fit.measures=TRUE)$fit[c("rmsea.scaled", "rmsea.ci.lower.scaled", "rmsea.ci.upper.scaled", "cfi.scaled", "tli.scaled")], 3)

##          rmsea.scaled rmsea.ci.lower.scaled rmsea.ci.upper.scaled 
##                 0.049                 0.048                 0.050 
##            cfi.scaled            tli.scaled 
##                 0.914                 0.909

As one might expect, the unidimensional model did not fit adequately. The root mean square error of approximation (RMSEA) was 0.049 (90% CI 0.048, 0.050), which is acceptable by the common threshold of 0.05, but the comparative fit index (CFI) was 0.914 and the Tucker–Lewis index (TLI) was 0.909. Many analysts like to see the CFI and TLI above 0.95, though there is some flexibility depending on the constituents of the model and its use.

We next considered a bifactor model [8]. Following prior work [7], we added a secondary factor for the recall items to account for a methods effect.

Figure 1. CASI bifactor model. All CASI items are indicators of the latent trait of global cognition. There is a methods effects factor, F1, for the recall items. Note: This is not the order the items are administered in; the recall items are rearranged so that F1 is easier to illustrate in the figure.

# bifactor model
bi_model <- '
  # latent variables
      casi_bi=~ NA*bpl + byr2 + bday + casi_age + mnt + sun + rgs1 + rdbsum 
                + rc1a + rc1b + rc1c + rsubtra + rmmyear + rmmmonth + rmmdate 
                + mmday + mmseason +  rmmctst + spb + ranimal + sim + rjgmt 
                + rpta + rptb + rmmread + cas_writ + rmmdraw + rcmd + rc2a 
                + rc2b + rc2c + rbody + robj + rcobj
    # methods factor f1
        f1 =~ NA*rc1a + rc1b + rc1c + rc2a + rc2b + rc2c                   
    # LV mean & variance setup
        casi_bi ~  0*1
        casi_bi ~~ 1*casi_bi
        f1      ~  0*1
        f1      ~~ 1*f1
    # secondary factor uncorrelated with CASI (bifactor model)
        f1      ~~ 0*casi_bi
'

# estimate bifactor model with lavaan using WLSMV equivalent that corresponds closely to Mplus
fit_bi <- cfa(bi_model, data=casi_last,  mimic="Mplus", meanstructure=TRUE,
                   missing="pairwise", parameterization="theta", 
                   ordered = TRUE)

# bifactor model fit from lavaan
summary(fit_bi, fit.measures=TRUE)$fit[c("rmsea.scaled", "rmsea.ci.lower.scaled", "rmsea.ci.upper.scaled", "cfi.scaled", "tli.scaled")]

##          rmsea.scaled rmsea.ci.lower.scaled rmsea.ci.upper.scaled 
##            0.03152972            0.03054258            0.03252543 
##            cfi.scaled            tli.scaled 
##            0.96468021            0.96196852

For this bifactor model, the RMSEA was 0.032 (90% CI 0.031, 0.033), the CFI was 0.965, and the TLI was 0.962; we considered this our final model at the time. Figure 2 provides the predicted values for the “CASI bifactor score” (i.e., our latent trait of global cognition) from this model in the development sample using Empirical Bayes Modal (EBM) estimation via the lavaan lavPredict function.

# predict factor scores from this lavaan model in the development sample and attach to the dataset
lav_scores <- lavPredict(fit_bi, newdata=casi_last, type="lv", method="EBM", transform=TRUE)
casi_last[, c("casi_bi_lavaan", "f1_lavaan") := as.data.table(lav_scores)]

Figure 2. Predicted values for the CASI bifactor score in the development sample using lavaan.

# mean/sd of predicted casi bifactor scores from lavaan
round(c(mean(casi_last$casi_bi_lavaan), sd(casi_last$casi_bi_lavaan)), 3)

## [1] 0 1

Updated estimation using mirt

While we were initially satisfied with this bifactor model, we later decided to re-estimate it using the mirt package in R, which has more flexibility in applications of IRT and allows for estimation of standard errors of factor scores in our model setting. We used the function lavaan2mirt from the accompanying R package sirt to convert the lavaan bifactor model into a mirt model for estimation. This allowed for estimation using either a GPCM or a GRM for the ordinal items, so we tried both.

# load the packages used in the mirt code below
library(mirt)
library(sirt)  # provides lavaan2mirt

# modify data for use with mirt (which does not like _ in variable names)
casi_last[, casiAge := casi_age]
casi_last[, casWrit := cas_writ]

# lavaan bifactor model structure with casi_age and cas_writ replaced by casiAge and casWrit to allow use in mirt
bi_model_mirt <- '
    # latent variables
    casiBI=~ NA*bpl + byr2 + bday + casiAge + mnt + sun + rgs1 + rdbsum 
                 + rc1a + rc1b + rc1c + rsubtra + rmmyear + rmmmonth + rmmdate 
                 + mmday + mmseason +  rmmctst + spb + ranimal + sim + rjgmt 
                 + rpta + rptb + rmmread + casWrit + rmmdraw + rcmd + rc2a 
                 + rc2b + rc2c + rbody + robj + rcobj
    # methods factor f1
        f1 =~ NA*rc1a + rc1b + rc1c + rc2a + rc2b + rc2c                   
    # LV mean & variance setup
        casiBI ~  0*1
        casiBI ~~ 1*casiBI
        f1      ~  0*1
        f1      ~~ 1*f1
    # secondary factor uncorrelated with CASI (bifactor model)
        f1      ~~ 0*casiBI
'

# convert model structure from lavaan using sirt and estimate with mirt (using generalized partial credit model for polytomous items)
fit_mirt_gpcm <- lavaan2mirt(dat=casi_last, lavmodel=bi_model_mirt, est.mirt=TRUE, poly.itemtype="gpcm")

# convert model structure from lavaan using sirt and estimate with mirt (using graded response model for polytomous items)
fit_mirt_graded <- lavaan2mirt(dat=casi_last, lavmodel=bi_model_mirt, est.mirt=TRUE, poly.itemtype="graded")

# compare AIC/BIC and fit statistics
fit_mirt_gpcm$mirt

## 
## Call:
## mirt::mirt(data = dat, model = mirtmodel1, itemtype = itemtype, 
##     pars = mirtpars)
## 
## Full-information item factor analysis with 2 factor(s).
## Converged within 1e-04 tolerance after 85 EM iterations.
## mirt version: 1.44.0 
## M-step optimizer: BFGS 
## EM acceleration: Ramsay 
## Number of rectangular quadrature: 31
## Latent density type: Gaussian 
## 
## Log-likelihood = -100998
## Estimated parameters: 140 
## AIC = 202276.1
## BIC = 203207.9; SABIC = 202763.1

M2(fit_mirt_gpcm$mirt)[c("RMSEA", "RMSEA_5", "RMSEA_95", "CFI", "TLI")]

##            RMSEA    RMSEA_5   RMSEA_95       CFI       TLI
## stats 0.02954528 0.02848107 0.03061448 0.9869601 0.9858137

fit_mirt_graded$mirt

## 
## Call:
## mirt::mirt(data = dat, model = mirtmodel1, itemtype = itemtype, 
##     pars = mirtpars)
## 
## Full-information item factor analysis with 2 factor(s).
## Converged within 1e-04 tolerance after 35 EM iterations.
## mirt version: 1.44.0 
## M-step optimizer: BFGS 
## EM acceleration: Ramsay 
## Number of rectangular quadrature: 31
## Latent density type: Gaussian 
## 
## Log-likelihood = -101409.7
## Estimated parameters: 140 
## AIC = 203099.3
## BIC = 204031.2; SABIC = 203586.3

M2(fit_mirt_graded$mirt)[c("RMSEA", "RMSEA_5", "RMSEA_95", "CFI", "TLI")]

##           RMSEA    RMSEA_5   RMSEA_95       CFI       TLI
## stats 0.0304313 0.02936972 0.03149758 0.9861662 0.9849501

The resultant models, estimated using the standard EM algorithm, had fit statistics similar to those from lavaan: RMSEA=0.030 (90% CI 0.028, 0.031), CFI=0.987, and TLI=0.986 for the GPCM; RMSEA=0.030 (90% CI 0.029, 0.031), CFI=0.986, and TLI=0.985 for the GRM. Because AIC and BIC were lower for the GPCM (202276.1 and 203207.9) than for the GRM (203099.3 and 204031.2), with both models having 140 estimated parameters, we decided to use the GPCM as our final model. Predicted CASI bifactor scores from this model in the development sample could be generated via mirt using either expected a posteriori (EAP) or maximum a posteriori (MAP, i.e., Bayes Modal) estimation; Figure 3 provides these predictions under both estimation approaches.
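
As a quick arithmetic check, the AIC and BIC values printed above can be approximately reproduced from the reported log-likelihoods and parameter counts. The sketch assumes that the BIC sample size is the 5,746 participants in the development sample and uses the log-likelihoods as printed (rounded), so agreement is only to within rounding.

```r
n_obs  <- 5746   # assumed sample size for BIC (development sample)
n_par  <- 140    # estimated parameters, the same in both models
loglik <- c(gpcm = -100998, graded = -101409.7)  # as printed above (rounded)

aic <- -2 * loglik + 2 * n_par
bic <- -2 * loglik + log(n_obs) * n_par
round(rbind(aic, bic), 1)

# with equal parameter counts, the ranking depends only on the log-likelihood
names(which.min(aic))
```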

# predict factor scores from the gpcm mirt model in the development sample using EAP and attach to the dataset
mirt_eap_scores <- fscores(fit_mirt_gpcm$mirt, method="EAP", full.scores=TRUE, full.scores.SE=TRUE)
casi_last[, c("casi_bi_mirt_eap", "f1_mirt_eap", "casi_bi_se_mirt_eap", "f1_se_mirt_eap") := as.data.table(mirt_eap_scores)]

# predict factor scores from the gpcm mirt model in the development sample using MAP and attach to the dataset
mirt_map_scores <- fscores(fit_mirt_gpcm$mirt, method="MAP", full.scores=TRUE, full.scores.SE=TRUE)
casi_last[, c("casi_bi_mirt_map", "f1_mirt_map", "casi_bi_se_mirt_map", "f1_se_mirt_map") := as.data.table(mirt_map_scores)]

Figure 3. Predicted values for the CASI bifactor score in the development sample using either expected a posteriori (EAP) or maximum a posteriori (MAP) estimation in mirt.

# mean/sd of EAP predicted casi bifactor scores from gpcm mirt model
round(c(mean(casi_last$casi_bi_mirt_eap), sd(casi_last$casi_bi_mirt_eap)), 3)

## [1] 0.005 0.903

# mean/sd of MAP predicted casi bifactor scores from gpcm mirt model
round(c(mean(casi_last$casi_bi_mirt_map), sd(casi_last$casi_bi_mirt_map)), 3)

## [1] -0.071  0.862

While we specified mean 0 and standard deviation (SD) 1 in estimation of the model, the mean and SD were 0.005 and 0.903 for the EAP estimated scores and -0.071 and 0.862 for the MAP estimated scores. This is due, in part, to a ceiling effect in the CASI; while EAP and MAP estimation allow us to estimate an IRT score when all the items are answered correctly, ceiling effects and the use of these estimation approaches, which pull each score toward the prior mean, can result in a distribution of scores that strays from mean 0, SD 1.

Comparison between model predictions

We compared the bifactor scores from EAP and MAP estimation in mirt directly to the previous bifactor score estimates generated by lavaan in the development sample.

Figure 4. Comparison of lavaan and mirt predicted values for the CASI bifactor score in the development sample.

All estimates were highly correlated, but given the additional capabilities of mirt in IRT applications and the potentially greater estimation stability of EAP relative to MAP, we decided to use the fitted mirt model and EAP estimation to predict bifactor scores for the ACT study’s 2024 analytic data freeze as described below.

Scoring for all visits in the 2024 and later analytic data freezes

We generated predicted CASI bifactor scores (casi_bifactor_score) and their standard errors (casi_bifactor_score_SE) using EAP estimation via mirt for all in-person ACT participant study visits across time (if the visit had at least some non-missing, valid CASI items). So that we can discuss the scores in meaningful units, we standardized the scores based on the distribution at ACT study baseline of the originally enrolled ACT cohort (enrolled in 1994-1996). For that cohort at baseline, the estimated scores had mean 0.3616 and SD 0.5976; therefore, we subtracted 0.3616 from all scores across time and then divided them (and their estimated standard errors) by 0.5976. Thus, the distribution of estimated scores at baseline for the originally enrolled ACT cohort has mean 0 and SD 1. Researchers using the CASI bifactor scores, however, may choose to re-standardize the scores based on a different reference population as is relevant for their specific analysis. The plot below provides a sample comparison between CASI bifactor scores and the original 100-point total CASI scores (casi_total).
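
The standardization described above amounts to the following arithmetic, sketched here in base R. The helper name standardize_casi and the raw score values are hypothetical; only the reference mean (0.3616) and SD (0.5976) come from the text.

```r
ref_mean <- 0.3616  # baseline mean in the originally enrolled cohort
ref_sd   <- 0.5976  # baseline SD in the originally enrolled cohort

standardize_casi <- function(score, se, mean_ref = ref_mean, sd_ref = ref_sd) {
  data.frame(
    score_std = (score - mean_ref) / sd_ref,
    se_std    = se / sd_ref   # standard errors are rescaled but not shifted
  )
}

# hypothetical raw EAP scores and standard errors
raw <- data.frame(score = c(0.3616, 0.9592, -0.2360), se = c(0.30, 0.45, 0.25))
standardize_casi(raw$score, raw$se)
```

Re-standardizing to a different reference population would, under the same assumptions, only require passing that population's mean and SD as mean_ref and sd_ref.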

Figure 5. Spaghetti plot of CASI bifactor scores (top panel) and total CASI scores (bottom panel) for 50 random participants. Note: 3 of the participants have invalid total CASI scores but valid CASI bifactor scores at some of their visits.

Scoring for all visits in the previous 2022 analytic data freeze

Predicted CASI bifactor scores (casi_BI_score) in the ACT study’s previous 2022 analytic data freeze were generated from the prior bifactor model developed and estimated using lavaan. This was because we had not yet switched to mirt at the time of that data release. Therefore, the CASI bifactor scores in that 2022 freeze differ from those in the 2024 freeze, and the 2022 freeze also does not include standard errors of the scores. As in the 2024 freeze, we standardized the scores based on the distribution at ACT study baseline of the originally enrolled ACT cohort (which, per the 2022 scores, had mean 0.0567 and SD 0.7924). Also, the original 100-point total CASI scores were still denoted by the variable name casi_total.

CASI scoring in older analytic data freezes (2010-2020)

Older ACT study analytic data freezes, which only began using the “freeze” moniker in ~2015, included the 100-point total CASI score (CASI_SC), an IRT-based version of the CASI score (CASI_IRT), and its estimated standard error (CASI_IRT_SE). These CASI_IRT scores differ from those in the 2022 and 2024 analytic data freezes, as the model used to generate them was originally constructed in 2009 using now-defunct software, Parscale 4.1 [9], which required assuming that the CASI items represent a unidimensional trait [2]. At the time of that model’s development (it was estimated using data from the 5th biennial visit of ACT participants), we checked for differential item functioning (DIF) due to sex, age (65-75, 76-80, 81-85, 86+; also <=80 vs >80), and education level (<=12, 13-16, 17+; also <=HS vs higher degree). Although there was some statistically significant DIF (which often occurs in large samples), the score impact was minimal. Unlike in the later analytic data freezes, the IRT-based CASI scores in these older freezes were not standardized based on a reference distribution at ACT study baseline of the originally enrolled ACT cohort. Instead, the distribution in the sample used to estimate the model (i.e., the 5th biennial visit) would essentially have served as the reference unless researchers re-standardized the data themselves. In 2022, the demise of the previous software motivated us to construct a new IRT-based score that does not assume the CASI items reflect a unidimensional trait. The CASI bifactor scores in the 2022 and 2024 analytic data freezes, as described in this vignette, were constructed without explicitly addressing DIF.

References

[1] Teng EL, Hasegawa K, Homma A, Imai Y, Larson E, Graves A, Sugimoto K, Yamaguchi T, Sasaki H, Chiu D, et al. The Cognitive Abilities Screening Instrument (CASI): a practical test for cross-cultural epidemiological studies of dementia. Int Psychogeriatr. 1994 Spring;6(1):45-58; discussion 62. doi:10.1017/s1041610294001602. PMID: 8054493.
[2] Crane PK, Narasimhalu K, Gibbons LE, Mungas DM, Haneuse S, Larson EB, Kuller L, Hall K, van Belle G. Item response theory facilitated cocalibrating cognitive tests and reduced bias in estimated rates of decline. J Clin Epidemiol. 2008 Oct;61(10):1018-27.e9. doi:10.1016/j.jclinepi.2007.11.011. Epub 2008 May 5. PMID: 18455909; PMCID: PMC2762121.
[3] Li G, Larson EB, Shofer JB, Crane PK, Gibbons LE, McCormick W, Bowen JD, Thompson ML. Cognitive Trajectory Changes Over 20 Years Before Dementia Diagnosis: A Large Cohort Study. J Am Geriatr Soc. 2017 Dec;65(12):2627-2633. doi:10.1111/jgs.15077. Epub 2017 Sep 21. PMID: 28940184; PMCID: PMC5729097.
[4] Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34(4, Pt. 2), 100.
[5] Samejima, F. (1997). Graded Response Model. In: van der Linden, W.J., Hambleton, R.K. (eds) Handbook of Modern Item Response Theory. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-2691-6_5
[6] Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159–176. https://doi.org/10.1177/014662169201600206
[7] Mukherjee S, Choi SE, Lee ML, Scollard P, Trittschuh EH, Mez J, Saykin AJ, Gibbons LE, Sanders RE, Zaman AF, Teylan MA, Kukull WA, Barnes LL, Bennett DA, Lacroix AZ, Larson EB, Cuccaro M, Mercado S, Dumitrescu L, Hohman TJ, Crane PK. Cognitive domain harmonization and cocalibration in studies of older adults. Neuropsychology. 2023 May;37(4):409-423. doi:10.1037/neu0000835. Epub 2022 Aug 4. PMID: 35925737; PMCID: PMC9898463.
[8] Reise SP, Morizot J, Hays RD. The role of the bifactor model in resolving dimensionality issues in health outcomes measures. Qual Life Res. 2007;16 Suppl 1:19-31. doi:10.1007/s11136-007-9183-7. Epub 2007 May 4. PMID: 17479357.
[9] Muraki E, Bock D. PARSCALE for windows (Version 4.1). Chicago: Scientific Software International; 2003.

Citation

How to cite this document.

Gibbons LE, Shaw PA, and Walker RL (2026, Apr 3). Generation of item response theory (IRT) based CASI scores in the ACT study. Retrieved from https://kpwhri.github.io/actstats/CASI_IRT_Bifactor_2026_04_03.html