
Development of the Biostatistics and Clinical Epidemiology Skills (BACES) assessment for medical residents
  1. Patrick B Barlow1,
  2. Gary Skolits2,
  3. Robert Eric Heidel3,
  4. William Metheny3,
  5. Tiffany Lee Smith4
  1. 1The University of Iowa Carver College of Medicine, Iowa City, Iowa, USA
  2. 2The University of Tennessee, Knoxville, Tennessee, USA
  3. 3The University of Tennessee Graduate School of Medicine, Knoxville, Tennessee, USA
  4. 4The University of Wisconsin-Stout, Menomonie, Wisconsin, USA
  1. Correspondence to Dr Patrick B Barlow, The University of Iowa Carver College of Medicine, 1204 MEB, Iowa City, IA 52242, USA; patrick-barlow{at}


Background Although biostatistics and clinical epidemiology are essential for comprehending medical evidence, research has shown consistently low and variable knowledge among postgraduate medical trainees. Simultaneously, there has been an increase in the complexity of statistical methods among top-tier medical journals.

Aims To develop the Biostatistics and Clinical Epidemiology Skills (BACES) assessment by (1) establishing content validity evidence of the BACES; (2) examining the model fit of the BACES items to an Item Response Theory (IRT) model; and (3) comparing IRT item estimates with those of traditional Classical Test Theory (CTT) indices.

Methods Thirty multiple choice questions were written to focus on interpreting clinical epidemiological and statistical methods. Content validity was assessed through a four-person expert review. The instrument was administered to 150 residents across three academic medical centres in southern USA during the autumn of 2013. Data were fit to a two-parameter logistic IRT model and the item difficulty, discrimination and examinee ability values were compared with traditional CTT item statistics.

Results 147 assessments were used for analysis (mean (SD) score 14.38 (3.38)). Twenty-six items, 13 devoted to statistics and 13 to clinical epidemiology, successfully fit a two-parameter logistic IRT model. These estimates also significantly correlated with their comparable CTT values.

Conclusions The strength of the BACES instrument was supported by (1) establishing content validity evidence; (2) fitting a sample of 147 residents’ responses to an IRT model; and (3) correlating the IRT estimates with their CTT values, which makes it a flexible yet rigorous instrument for measuring biostatistical and clinical epidemiological knowledge.

  • Test Development
  • Assessment


The dominance of evidence-based medicine in graduate medical education (GME) over the past 25 years makes translating medical evidence into clinical decision-making an important skill for residents.1 Although biostatistics and clinical epidemiology are essential components for comprehending the medical evidence,2 decades of research have shown a consistently low and variable knowledge base among postgraduate medical trainees.3–6 Simultaneously, there has been an increase in the frequency and complexity of statistical methods among top-tier medical journals.5–8

Many evidence-based medicine curricula now include content dedicated to biostatistics and/or clinical epidemiological research methods to respond to this problem. However, the length, format and rigour of these courses are quite variable, as are the qualifications of course instructors (eg, resident vs faculty-led),9 ,10 which has made assessment of these skills difficult.1 There have been a number of previous attempts to assess the resident population,3 ,11 ,12 but formal psychometric analysis of these instruments has been scarce. A 2011 review of existing instruments made an explicit call for improved knowledge assessments in this population and highlighted several content-related shortcomings of existing instruments.11

The dominant format for these instruments and the ‘gold standard’ for medical assessment in general has been the use of multiple choice questions (MCQ).13 An extensive list of best practices exists to guide MCQ writing,13–15 yet several previous instruments possessed a number of ‘violations’ of these common practices. Furthermore, these instruments were developed using Classical Test Theory (CTT), which does not allow the test items to be divided up and reorganised to meet specific educational needs without jeopardising the instrument's reliability. If future research is to heed the call for new instrumentation, then a new measurement paradigm that rigorously meets the specific needs of postgraduate medical trainees must be considered.

An alternative to the CTT approach is Item Response Theory (IRT), which postulates that the probability of correctly responding to a given item can be modelled as a function of the item's difficulty, discrimination and participant ability on the trait being measured (eg, interpreting statistics and research methods).16 It has been shown in previous research17–20 that the item difficulty, item discrimination and estimates of participant ability in CTT are almost identical to those from IRT in a single sample; however, the values generated from CTT are dependent upon the sample from which they are taken. This dependence can lead to the same instrument yielding completely different item characteristics in a second sample.21 Conversely, IRT provides stable estimates of an item's difficulty, discrimination and guessing probability that do not vary with changes in sample, item order and test conditions. This characteristic makes it an ideal approach to developing adaptable yet rigorous instruments.
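As a minimal sketch of the response function described above, the following Python function implements the standard logistic IRT form. The function name, the example parameter values and the omission of the optional 1.7 scaling constant are assumptions made for illustration; none of them are taken from the study.

```python
import math

def irt_probability(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under a logistic IRT model.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: pseudo-guessing (lower asymptote). With c=0 this is the 2PL model;
    additionally fixing a to a common value gives a one-parameter model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: an examinee whose ability equals the item difficulty
# has a 50% chance of success when there is no guessing.
p_at_difficulty = irt_probability(theta=0.5, a=1.2, b=0.5)

# Same hypothetical item with guessing fixed at 0.25 (an assumed value
# for a four-option question): the curve now starts from 0.25.
p_with_guessing = irt_probability(theta=0.5, a=1.2, b=0.5, c=0.25)
```

Because difficulty and ability sit on the same θ scale, an item can be described by the ability level at which it is most useful, which is the property the rest of the analysis relies on.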

The purpose of the present study was to establish baseline item characteristics and validity evidence for the Biostatistics and Clinical Epidemiology Skills (BACES) assessment by leveraging the robust power of IRT to create a new adaptive knowledge assessment for postgraduate medical trainees. It addressed three objectives: (1) to establish content validity evidence of the BACES assessment; (2) to examine the model fit of the BACES items to an IRT model; and (3) to compare participant ability and item location (ie, difficulty) estimates from IRT models with those of traditional CTT indices.


Instrument construction

BACES learning goals and item writing

Table 1 shows the five specific learning goals that were used to guide the development of the BACES assessment as well as the concepts and number of test items dedicated to each of these goals. As most instruction in interpreting statistics and research methods occurs within journal clubs or evidence-based medicine courses,9 ,10 the learning goals were developed with attention towards critical appraisal of existing medical literature. Also, focusing the goals on interpreting the medical literature better aligned the BACES assessment with the evidence-based practice and medical knowledge Accreditation Council for Graduate Medical Education (ACGME) core competencies.

Table 1

Proposed test blueprint for BACES assessment by learning goal

All of the BACES items were written in an MCQ format with four response options per question. Each item focused on using clinical or literature-based vignettes to emphasise residents’ application of statistical and epidemiological concepts rather than rote memorisation. These vignettes were unique for each item to avoid interlocking (dependent) items. The clinical terminology and scenarios for the vignettes were generated through consultation with an advanced-trained surgical resident who ensured only ‘common’ medical situations/conditions were included in order to avoid inadvertently measuring residents’ medical knowledge. Specifically, four sources were used to develop content for the BACES items: (1) learning objectives from the biostatistics and clinical epidemiology curriculum taught in the authors’ GME programme; (2) commonly used statistics in medical literature as defined by existing reviews;6–8 (3) common content areas among existing assessment instruments; and (4) content gaps in existing instruments.11

Expert review process

Expert review is the same approach taken by previous researchers in establishing content validity for their biostatistics & epidemiologic knowledge (BEK) assessments.11 The four-person group of experts included areas of expertise relevant to both the content and educational context of the BACES assessment. Each reviewer was given four documents: (1) a copy of the BACES items (see online supplementary file 1); (2) a detailed answer key with answer descriptions and continuum of option ‘correctness’ (see online supplementary file 2); (3) a brief overview of the study, its purpose, objectives and methods; and (4) an item review and scoring rubric (see online supplementary file 3). Although the appraisal of the items varied depending on the content specialty of the reviewer, there were no items that were candidates for removal. For additional feedback, two of the four reviewers were informally interviewed regarding their suggestions for improving the instrument, which led to a number of other improvements. The final BACES instrument contained 30 items.


The BACES instrument was developed for postgraduate medical trainees—specifically, medical residents and subspecialty fellows. The study used a multi-institutional convenience sample procedure of the resident population within the University of Tennessee system (total resident population 1033).22 Although the sample was non-randomised, the intent was to sample as broadly and heterogeneously as possible in order to obtain a representative mix of the resident ability level. Colleagues from three academic medical centres across the state of Tennessee (resident populations 683, 178 and 172) were asked to grant the researcher access to their residents.

Data collection procedures

All data were collected in person using a group administration format. Participants received an informed consent page where they were given more information about the study including its voluntary nature and the risks and benefits of participation. They were given the BACES assessment as well as a brief set of demographic items similar to those used in the study by Windish et al,6 including sex, age, years of training, location of training and any previous training in biostatistics, epidemiology or evidence-based medicine (see online supplementary file 4). To prevent possible cheating, the participants alternatingly received either BACES form A or form B, which presented the same 30 items in two different orders.

The 30-item BACES assessment consistently took 20–30 min to complete across the 10 administrations. After the assessment was completed, the researcher used the remaining journal club or didactic session time to review the answers with the group. Each resident received a copy of the answers for each BACES item as well as the brief description for those answers in order to provide them with immediate feedback on their performance. These answers were the same as the descriptions given to the expert reviewers, and also included a scannable link to the researcher's series of online lecture materials as a small incentive for participation.

Data analysis

Descriptive statistics were used for screening data for coding errors, missing data and outlying values, as well as for comparing key participant demographics. The DIMTEST procedure23–25 was used to assess the degree to which the BACES items violated the IRT assumptions of local independence and essential unidimensionality. The procedure tests the hypothesis that the set of items is made up of only one dimension and, if this hypothesis is rejected (ie, the results are statistically significant), then the conclusion is that the items measure multiple dimensions. A DETECT procedure was then used to cluster the items into different dimensions.26 ,27 This process is analogous to traditional factor analysis in which the procedure finds the combination of factors that maximise the amount of variance which the model takes into account.

One-parameter Rasch, two-parameter logistic (2PL) and three-parameter logistic (3PL) models were compared to identify the best-fitting and most parsimonious model. Overall model fit was assessed through a χ2 goodness of fit index, and by comparing the change in −2 log likelihood statistics between models.16 Item difficulty (‘b’), discrimination (‘a’) and pseudo-guessing (‘c’) parameters were estimated using an expectation-maximisation method, and item fit was assessed using standardised residuals and item characteristic curves (ICCs). Participant ability levels (θ, theta) were estimated using an expected a posteriori method with a standard normal prior distribution (mean (SD) 0.00 (1.00)). Also, a standard error of estimate (SEE) was calculated for each item and used to examine item and total test information (ie, reliability). The parameter estimates for the best fitting IRT model were compared with their CTT equivalents to confirm the accuracy of the IRT estimates.17–20
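The expected a posteriori step can be illustrated with a short Python sketch. It assumes a 2PL model with item parameters treated as known, a standard normal prior, and a simple evenly spaced grid to approximate the posterior; the grid settings, function names and item values are hypothetical, and the sketch is not a description of how Xcalibre performs the calculation internally.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response (logistic metric, no 1.7 constant)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_theta(responses, items, n_points=61, lo=-4.0, hi=4.0):
    """Posterior mean and SD of theta on an evenly spaced grid.

    responses: list of 0/1 item scores; items: list of (a, b) pairs.
    """
    step = (hi - lo) / (n_points - 1)
    sum_w = sum_t = sum_t2 = 0.0
    for k in range(n_points):
        theta = lo + k * step
        # Unnormalised standard normal prior times the response likelihood.
        weight = math.exp(-0.5 * theta * theta)
        for u, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            weight *= p if u == 1 else 1.0 - p
        sum_w += weight
        sum_t += theta * weight
        sum_t2 += theta * theta * weight
    mean = sum_t / sum_w
    sd = math.sqrt(sum_t2 / sum_w - mean * mean)
    return mean, sd

# Hypothetical three-item test with difficulties symmetric around zero.
hypothetical_items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
theta_high, sd_high = eap_theta([1, 1, 1], hypothetical_items)
theta_low, sd_low = eap_theta([0, 0, 0], hypothetical_items)
```

One consequence of the prior is visible here: even a perfect or zero score yields a finite ability estimate, pulled towards the prior mean of 0.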

IRT analyses were conducted using Xcalibre V.4.2 (Guyer and Thompson, 2012). CTT item analyses were conducted using IBM SPSS V.22 (SPSS, Chicago, Illinois, USA).

Ethical committee permission

Ethical approval was obtained from both the University of Tennessee and the University of Tennessee Graduate School of Medicine Institutional Review Boards. A signed Memorandum of Understanding was obtained from each participating site which detailed their institution's role as well as the use and ownership of the data. Copies of IRB approval were sent to each site prior to administering the BACES instrument and an informed consent document was collected from each study participant.


Characteristics of study sample

One hundred and fifty completed assessment forms were collected through the course of the study. Three participants did not indicate their particular BACES form so they could not be accurately scored; therefore, a total of 147 completed BACES assessments (77 form A and 70 form B) were used for analysis. The demographic characteristics and raw BACES scores are shown in table 2. The average raw score varied slightly, with form A having a slightly higher average score. The site locations were similar in average score; however, site A (mean (SD) 14.33 (3.28)) scored the highest. The sample consisted predominantly of men (80, 59.3%), first year residents (53, 36.1%) and those trained in the USA (97, 80.8%). Only 51 (37.8%), 58 (43.3%) and 44 (33.3%) participants had completed a class in epidemiology, biostatistics or evidence-based medicine, respectively.

Table 2

Background characteristics of examinees and raw score examination performance

Examination of the model fit of the BACES items to an IRT model (objective 2)

The first step in examining the model fit of the BACES items is to examine the assessment data for violations of essential unidimensionality and local independence. To assess these two assumptions, the DIMTEST procedure was used to assess the degree to which the BACES items departed from strict unidimensionality.23–25 The procedure showed the 30 items to be significantly multidimensional (T=3.018, p=0.0013), so a DETECT procedure was used to cluster the items into separate dimensions.26 ,27 The DETECT procedure looks for the simplest structure (number of dimensions) within the 30 items by assessing the relationships among the item responses. The analysis detected two 15-item dimensions, which split the test content between (1) clinical epidemiology and (2) statistics. Table 3 lists the item numbers associated with each of the two dimensions. For example, items related to research design (eg, 1, 2, 9 and 25) or common epidemiology concepts (eg, 17, 23 and 27) clustered into the first dimension while items that dealt with statistical tests or concepts (eg, 3, 7 and 11) clustered into the second dimension.

Table 3

Two-dimension solution for DETECT procedure

Item parameter estimates

The IRT parameters ‘a’, ‘b’, ‘c’ and θ were then calculated for the 15 items in each dimension as if they were individual instruments or testlets, where one measured clinical epidemiology and the other measured statistics.

Items 3 and 6 were flagged as being overly difficult (ie, ‘b’ parameters >3.5), and items 2 and 20 were flagged as being poorly discriminating (ie, ‘a’ parameters <0.40). These items were removed and the model was rerun, similar to the approach taken in traditional factor analysis.16 Both of the testlets had adequate model fit after deleting the four flagged items. In other words, the item response patterns predicted by the model and those that were observed in the real data were not significantly different for either the clinical epidemiology testlet (p=0.06) or the statistics testlet (p=0.07).

The CTT and IRT estimates for the final model are shown in table 4. The estimated item discrimination (‘a’ parameter) ranged between 0.42 and 1.51 with a mean (SD) of 0.75 (0.31) for the clinical epidemiology testlet and between 0.35 and 1.07 with a mean (SD) of 0.68 (0.20) for the statistics testlet. The ‘a’ parameter refers to the slope of the ICC across a range of ability levels (θ), such that a higher value indicates a greater ability for an item to discriminate among different ability levels. For example, item 27 had the highest discrimination ability of any item on the test, which means that the probability of correctly answering this item rises sharply across a short span of ability. The ‘b’ parameter defines the difficulty estimates for the items, which can be directly compared with the proportion correct (‘P’ column) to show how items located at a higher level of ability (ie, a higher ‘b’) translate to a smaller proportion of correct responses. Overall, these ranged from −1.61 to 2.73 (mean (SD) 0.29 (1.0)) for the clinical epidemiology testlet and −2.30 to 2.90 (mean (SD) 0.91 (1.45)) for statistics.

Table 4

Classical test theory statistics and item parameter estimates for best fit 2PL model
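The link between the ‘a’ parameter and the steepness of the ICC can be made explicit with a short derivation. It assumes the 2PL model is written on the logistic metric without the optional 1.7 scaling constant; the source does not state which metric was used, so the numerical illustration is indicative only.

```latex
P(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}, \qquad
\frac{dP}{d\theta} = a\,P(\theta)\,\bigl[1 - P(\theta)\bigr], \qquad
\left.\frac{dP}{d\theta}\right|_{\theta = b} = \frac{a}{4}
```

The curve is steepest at θ=b, where P=0.5. Under the stated assumption, the largest and smallest clinical epidemiology discriminations reported above (1.51 and 0.42) would correspond to maximum ICC slopes of about 0.38 and 0.105 probability units per unit of θ.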

The final test characteristic curves for both testlets and the full test are shown in figure 1A, which summarises the average difficulty and discrimination into a single curve as a function of ability (θ). The ability level at which one is 50% likely to answer a question correctly is considered to be the difficulty or ‘location’ of that particular set of items. The statistics testlet was more difficult, so its location was near θ=1, or above average ability. On the other hand, the epidemiology testlet was somewhat easier, so its location was near θ=0, or average ability. The location for the full test fell directly between the two testlets, which was very close to θ=0.

Figure 1

(A) Test characteristic curve for clinical epidemiology testlet, statistics testlet and full test. (B) Distribution of theta estimates for best fit model.
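The idea of a testlet ‘location’ can be sketched in Python by averaging 2PL item curves into a test characteristic curve and searching for the ability at which the expected proportion correct is 0.5. The item parameters and function names below are hypothetical, and bisection is simply one convenient way to find the crossing point; it is not claimed to be the method used to produce figure 1A.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response (logistic metric, no 1.7 constant)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_proportion(theta, items):
    """Test characteristic curve scaled to 0..1: mean of the item curves."""
    return sum(p_2pl(theta, a, b) for a, b in items) / len(items)

def tcc_location(items, lo=-6.0, hi=6.0, tol=1e-6):
    """Ability at which the expected proportion correct equals 0.5.

    Uses bisection, which works because the curve rises monotonically
    whenever every discrimination is positive.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_proportion(mid, items) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical testlets: an easier one and a harder one.
easier_items = [(1.0, -1.0), (1.0, 1.0)]   # difficulties centred on 0
harder_items = [(0.8, 0.6), (0.8, 1.8)]    # difficulties centred on 1.2
easier_location = tcc_location(easier_items)
harder_location = tcc_location(harder_items)
```

In this toy example the harder testlet's location sits further to the right on the θ scale, mirroring the pattern described for the statistics testlet relative to the clinical epidemiology testlet.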

Participant ability estimates

In IRT analyses, the difficulty or discrimination of a certain item can be discussed in terms of the ability level they are most suited to measure. Figure 1B displays the frequency distribution for the person location (ie, θ) estimates for the full test and the two testlets. The highest frequency of participants fell between −0.8 and −0.4 for both testlets; however, the full test was somewhat more spread out with 87 participants falling between −0.8 and 0.4. The raw scores were very closely clustered near the average score, and only a couple of participants fell beyond that average. When these scores were translated into θ estimates on a scale with a mean (SD) of 0.0 (1.0), it is logical that many of the participants’ θ values were very close to 0.00.

Test and item information

The final step in the IRT analysis was to estimate the item and test information for each BACES item and testlet. Item information is the inverse of the squared SEE along different values of θ.16 It is the IRT equivalent of reliability because higher information converts to lower SEE, which indicates a more accurate estimate of θ. Unlike CTT, estimates of IRT information are put in terms of ability level, so each item has a particular range of θ that it is particularly accurate at measuring. Figure 2A–C shows several of these item information functions for the total test and testlets (figure 2A) as well as each item (figure 2B–C). Overall, the results indicated that the clinical epidemiology testlet reached its maximum information of 2.04 at θ=0.15, or a slightly above average level of ability. The statistics testlet reached its peak information of 1.43 at θ=1.20. Similarly to the ICCs, the overall test met in the middle with its highest information of 3.22 at θ=0.45.

Figure 2

(A) Total information curves for clinical epidemiology, statistics and full test. (B) Item information curves for four clinical epidemiology testlet items. (C) Item information curves for four statistics testlet items.
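The relationship between information and the SEE can be sketched in Python using the standard 2PL information function, in which an item contributes most information at θ=b and test information is the sum over items. The item parameters and function names are hypothetical. Converting the reported peak of 3.22 into an SEE assumes that the reported value is Fisher information on the θ metric, which the source does not state explicitly.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response (logistic metric, no 1.7 constant)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P), largest at theta = b."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def total_information(theta, items):
    """Test information is the sum of the item information values."""
    return sum(item_information(theta, a, b) for a, b in items)

def see_from_information(information):
    """Standard error of the ability estimate: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(information)

# Hypothetical three-item testlet.
hypothetical_items = [(0.8, -0.5), (1.1, 0.2), (0.6, 1.4)]
info_at_zero = total_information(0.0, hypothetical_items)

# SEE implied by the reported full-test peak information of 3.22,
# under the assumption stated in the lead-in.
see_at_reported_peak = see_from_information(3.22)
```

Under that assumption, the full-test peak would correspond to an SEE of roughly 0.56 on the θ scale, and precision falls away as θ moves from the peak.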

Comparison of participant ability and item location (ie, difficulty) estimates from IRT models with those of traditional CTT indices (objective 3)

If the IRT model is appropriately estimated, estimates for ability, difficulty and discrimination should be very similar to their CTT counterparts in any single sample. Pearson r correlations were therefore used to quantify the extent to which CTT and IRT estimates were related to one another. The estimates for ‘a’, ‘b’ and θ parameters were strongly correlated with their CTT counterparts on each testlet and the full test. Specifically, CTT difficulty (P) was negatively related to IRT difficulty ‘b’ (r(24)=−0.980, p<0.001), and CTT discrimination (R) was positively associated with IRT discrimination ‘a’ (r(24)=0.91, p<0.001). Theta estimates for participant ability were also positively related to CTT total correct scores for research methods, statistics and the full test.
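The comparison described above can be sketched in Python: CTT difficulty is the proportion of examinees answering each item correctly, and a Pearson correlation relates it to the IRT ‘b’ estimates. The score matrix and ‘b’ values below are invented for illustration and are not the study data.

```python
import math

def proportion_correct(score_rows):
    """CTT difficulty (P) per item from rows of 0/1 examinee scores."""
    n_examinees = len(score_rows)
    n_items = len(score_rows[0])
    return [sum(row[j] for row in score_rows) / n_examinees
            for j in range(n_items)]

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: five examinees (rows) by four items (columns).
scores = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 0],
]
ctt_p = proportion_correct(scores)

# Hypothetical IRT difficulty estimates for the same four items.
irt_b = [-2.0, -0.3, -0.5, 1.6]

r = pearson_r(ctt_p, irt_b)
degrees_of_freedom = len(irt_b) - 2
```

The correlation is negative because a higher ‘b’ means a harder item and therefore a lower proportion correct. The degrees of freedom are the number of items minus 2, which is why 26 retained items give the r(24) reported above.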


The purpose of the present study was to establish baseline item characteristics and validity evidence for the BACES for postgraduate medical trainees. It addressed three objectives: (1) to establish content validity evidence of the BACES assessment; (2) to examine the model fit of the BACES items to an IRT model; and (3) to compare participant ability and item location (ie, difficulty) estimates from IRT models with those of traditional CTT indices. Accordingly, a 30-item MCQ instrument was written and rigorously reviewed by a panel of four experts to establish content validity evidence. Next, the BACES assessment was administered to a sample of 147 medical residents across three academic medical centres. After removing four poorly performing items, the item response data were successfully fit to a 2PL IRT model which estimated the difficulty and discrimination of the remaining 26 BACES items as well as the participants’ ability levels. Finally, the accuracy of the IRT estimates was further assessed by correlating them to the traditional CTT values.

The BACES assessment was developed to build on more than three decades of measuring postgraduate medical trainee ability to interpret the statistics and epidemiological research methods in the medical literature. Traditionally, very brief discussions of psychometric properties were included with previous instruments.11 Additionally, the use of CTT item analyses for these instruments muddied the ability to separate their psychometric properties from the samples on which they were originally tested.18 In contrast, the BACES item parameters can be easily tested in additional samples, and rather than drastically changing across administrations (as with CTT item statistics), the IRT parameters ought to remain fairly stable.16 ,19 ,21 This property allows for the BACES items to be separated, rearranged or reassembled into new test versions without losing the accuracy or consistency in estimating person trait levels.

The flexibility of the BACES assessment has broad implications for the graduate medical student population and the educators who teach the content on which the BACES is focused. For the trainees, each BACES item was specifically designed to mimic a realistic clinical or literature-based example. The content for these examples was derived from broadly applicable medical and surgical conditions, while at the same time incorporating many of the most commonly used statistics in major medical journals.5 Moreover, the BACES assessment provides a descriptive answer key, so that examinees can receive immediate feedback on their success while at the same time getting a thorough explanation as to why their particular answer choice was correct or not.

From an educator's perspective, an instrument which retains its difficulty, discrimination and reliability values across any number of item combinations or administrations would be an attractive option, considering that the methods by which these topics are taught vary considerably across residency programmes.9 ,10 ,28 Instructors could, for example, select the BACES items that target only specific concepts such as measures of odds/risk (items 17, 27, 5, 11 and 24), and be confident that the items will perform reliably. Similarly, an instructor testing first-year residents may select BACES items that most accurately measure lower knowledge levels (eg, items 1, 7, 8 and 9), while someone facilitating a review for an in-training or boards examination would prefer to select items that most accurately measure the high end of statistics and clinical epidemiology knowledge (eg, items 4, 13, 16 and 24).

While these initial results are positive, it is important to note three key limitations to the study design, implementation and interpretation of results. This study sought to establish baseline evidence for the BACES instrument; therefore, any causal conclusions based on these results would be inappropriate without additional studies. Specifically, the invariance of IRT parameters is only possible if the IRT model fits the data.21 Although the results showed a 2PL model fit this sample, the estimates may change as a larger sample of residents is tested. The sample size used for the BACES data was smaller than the ideal size for IRT analysis. Simulation studies have shown that between 100 and 500 participants is an adequate number for estimating a 2PL model,29–31 so the relatively small sample size of 147 respondents could have produced a higher overall SE. Similarly, the study sampled a predominantly PGY1 resident population, which may have skewed the results of the IRT estimates. The non-randomised design used in this study permits possible sampling error that could impact the stability of the findings. For example, departments self-selected to participate in the study and administrations of the instruments were held in a rather uncontrolled environment (ie, residents coming and going as they pleased). Since a completely controlled testing situation was not possible, it is impossible to rule out potential cheating or lack of motivation, which could impact the BACES results.

Now that evidence for content validity and item parameters has been estimated, it is up to future researchers to confirm it. Specific steps that must be taken by future researchers include: (1) modifying the problematic items found during this study; (2) generating additional items to ensure the assessment includes all relevant topics; (3) administering the improved instrument to a far larger population; (4) testing the stability of item parameters found in this study within the much larger sample; and (5) continuing to investigate the BACES items for construct validity evidence and differential item functioning. These five steps will keep the development process moving forward and ultimately create a much stronger instrument. Ideally, this effort would culminate in a large bank of rigorously tested BEK items for use as teaching tools and/or self-assessment modules.

The BACES assessment was developed and tested for its validity evidence as well as its individual item parameters. In contrast to previous studies, this study was developed using an IRT approach, and its results have paved the way for a flexible yet psychometrically rigorous instrument for measuring the statistical and clinical epidemiological knowledge of postgraduate medical trainees.

Main messages

  • The Biostatistics and Clinical Epidemiology Skills (BACES) assessment was developed with primary emphasis on creating an adaptable yet psychometrically rigorous self-assessment tool for medical residents.

  • Item Response Theory (IRT) provided an ideal approach to developing such an instrument because it allows for specific items to be selected for an individual group of participants (eg, items related to a single topic) without sacrificing their difficulty, discrimination or reliability.

  • In this study, 26 of the 30 BACES items fit a two-parameter IRT model with 13 items devoted to statistics and 13 to clinical epidemiology.

  • Now that the baseline estimates of these items have been established, a much broader item writing and piloting process will be needed to improve the instrument.

Current research questions

  • What additional concepts or skills need to be addressed when revising the BACES assessment?

  • Will the characteristics of the BACES items change with a larger sample of residents?

  • How can the BACES assessment and its descriptive answer key be used as a teaching tool?

  • How can IRT be used as an approach to measuring other areas of medical education such as ACGME milestones?

Key references

  • Windish DM, Huot SJ, Green ML. Medicine residents' understanding of the biostatistics and results in the medical literature. JAMA 2007;298:1010–22.

  • Enders F. Evaluating mastery of biostatistics for medical researchers: need for a new assessment tool. Clin Transl Sci 2011;4:448–54.

  • Case S, Swanson D. Constructing written test questions for the basic and clinical sciences. 3rd edn. Philadelphia: National Board of Medical Examiners (NBME), 2002.

  • De Ayala RJ. The theory and practice of item response theory. 1st edn. New York: Guilford Press, 2009.

  • DeMars C. Item response theory. 1st edn. New York: Oxford University Press, 2010.


The authors thank Drs Jennifer Ann Morrow, Kent Wagoner and Shawn Spurgeon for their service on the dissertation committee from which this work arose. Also, an additional thank you to the members of the content review group for their contributions, and particularly to the participants without whom there would be no study.



  • Contributors PBB: conceptualised the instrument, conducted data collection and analyses, and wrote the article and dissertation on which the article is based (available if requested). GJS: major professor of the dissertation on which the article is based as well as assisted with technical editing and revisions of the document. REH: assisted in data collection, instrument review and technical editing. WM: provided technical expertise in Graduate Medical Education and assisted with data collection, analysis and technical editing of the manuscript and is also a member of the dissertation committee on which the article is based. TLS: provided technical editing and formatting for the manuscript.

  • Competing interests None declared.

  • Ethics approval Ethics approval was obtained from the University of Tennessee Graduate School of Medicine and the University of Tennessee-Knoxville.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement The author would be happy to make all of the original data from this study available to whoever wishes to use them but would appreciate working collaboratively if possible.