Now that the final thesis has been approved by the Boise State University Graduate College and I’ve safely graduated, I’m ready to present the research results. The actual thesis will be available for download here after it has been uploaded to the electronic thesis library. As most sane people would prefer not to plow through a 142-page document, I’ll summarize it in plain English.
The really short version: we’re not evaluating training based on on-the-job behaviors and organizational outcomes because there’s not enough time, people, and/or management support. However, we’re also not entirely sure what Level 4 evaluations are.
The really long version: My research was a look at training professionals’ usage and understanding of Kirkpatrick’s Level 3 and Level 4 evaluations. I wanted to explore the factors that helped or hindered the performance of evaluations which examined on-the-job application of the knowledge/skills learned (Level 3) and the impact of training on advancing organizational goals (Level 4). I’d like to thank those individuals who completed the research survey and am particularly grateful to the 22 individuals who allowed me to interview them for this project.
Before beginning my research, I looked at the results of similar surveys conducted by ASTD in 2009 and a doctoral student named Joe Pulichino in 2007. ASTD only asked if respondents had conducted evaluations at the different levels to any extent; Pulichino’s survey questions allowed the respondents to decide on their own interpretation of terms such as “sometimes” and “rarely”.
I tried to be more precise and thus asked respondents to pick a percentage for each level of evaluation. I know this must have been annoying for survey respondents, but “frequently” is a fuzzy word that could mean 50% or 75% or whatever else an individual interprets it to be. I divided my results into five intervals and grouped the top three, sometimes (41-60%), frequently (61-80%), and almost always (81-100%), into a single “at least sometimes” category. And thus we have the percentage of respondents who conduct each level of evaluation at least sometimes:
- Level 1: 88.13%
- Level 2: 74.54%
- Level 3: 43.47%
- Level 4: 18.41%
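The interval grouping above can be sketched in a few lines of Python. This is purely illustrative: the respondent percentages below are hypothetical, and the labels for the bottom two intervals are my own guesses rather than terms from the thesis.

```python
# Illustrative sketch, NOT the thesis data: bucket each respondent's
# reported percentage into five intervals and compute the share who
# evaluate "at least sometimes" (anything above 40%).
def bucket(pct):
    """Map a 0-100 response to one of five interval labels.
    (Bottom two labels are assumptions; top three come from the write-up.)"""
    if pct <= 20:
        return "rarely (0-20%)"
    elif pct <= 40:
        return "occasionally (21-40%)"
    elif pct <= 60:
        return "sometimes (41-60%)"
    elif pct <= 80:
        return "frequently (61-80%)"
    else:
        return "almost always (81-100%)"

def at_least_sometimes(responses):
    """Percentage of responses falling in the top three intervals."""
    hits = sum(1 for r in responses if r > 40)
    return 100 * hits / len(responses)

# Hypothetical Level 3 responses from ten respondents:
level3 = [10, 55, 70, 0, 30, 90, 45, 20, 65, 5]
print(at_least_sometimes(level3))  # → 50.0
```

The point of the exact-percentage question was precisely this: once everyone answers on the same 0-100 scale, the interval boundaries are fixed by the researcher rather than by each respondent's private reading of “frequently”.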
Although the exact figures are lower than the figures ASTD and Pulichino generated, the trend is the same for all three surveys. Level 1 and Level 2 evaluations are commonly performed, Level 3 markedly less so, and Level 4 is a relative rarity.
I asked respondents how sufficient the results of each level were for evaluating the effectiveness of training and how important it was to conduct each of the levels. Now here’s where things start to get interesting.
- Level 1: 52.54% said it was somewhat or very sufficient, 66.10% said it was somewhat or very important
- Level 2: 61.02% said it was somewhat or very sufficient, 94.92% said it was somewhat or very important
- Level 3: 59.32% said it was somewhat or very sufficient, 98.31% said it was somewhat or very important
- Level 4: 42.37% said it was somewhat or very sufficient, 88.14% said it was somewhat or very important
I had expected that Level 3 would have been perceived as decidedly more useful than any other level as it would examine to what extent learners were applying their new knowledge/skills on the job. Almost everyone thought Level 3 was important, but why did so many of those same people not consider it sufficient for evaluating training?
In retrospect, I should have phrased the question more clearly. I intended to ask if a particular level was useful as a component for evaluating training, and it might have been interpreted as asking if a particular level is all you need to do so. I wish I had asked the latter question, now that I think of it; in my opinion, if you’re limited to one shot at evaluation it should be a comprehensive version of level 3.
Why would Level 4 be perceived as the least useful type of training evaluation? I’d love to explore that question more. I suspect that the difficulty in linking organizational outcomes to the training is a major issue. There’s also the perception that Level 4 is irrelevant for many organizations or categories of training interventions, and that is a whole other issue that I’ll discuss in a separate post.
I asked if respondents perceived any sort of positive correlation between the levels, so that a positive result for Level 1 and 2 evaluations strongly indicated that you’d also get a positive result when evaluating at Levels 3 and 4. I bet a lot of respondents thought this was a ridiculous question. However, one of the academic criticisms of Kirkpatrick’s levels is that it leads training professionals to assume this positive correlation. 69.12% of the respondents did not perceive any such correlation.
So let’s get back to conducting evaluations. What factors had the strongest impact on one’s success or lack of success in conducting Level 3 and 4 evaluations? First, let’s look at the reasons why people wanted to evaluate training at those levels.
The top three reasons training professionals wanted to evaluate at Level 3 were to assess the relevance of training (70.54%), to demonstrate their own value to the organization (47.73%), and to look at issues with transfer of training (40.91%). For Level 4, the top three reasons were demonstrating their own value to the organization (70.0%), assessing the relevance of training (50.0%), and looking at the organization’s actions which supported or hindered training efforts and results (50.0%).
What factors within the training department or organization had the most impact on whether or not such evaluations were done? For both Level 3 and Level 4, the top three factors were access to the learners for post-training data collection, the importance placed on conducting the evaluation, and the department’s expertise/experience in evaluating.
What prevented training departments from conducting Level 3 and 4 evaluations? For both levels, the biggest factor was a lack of resources (such as time and budget), with lack of expertise and lack of post-training access to learners also having a notable effect. Level 3 evaluations were more likely to be hindered by a lack of support from organizational management than were Level 4 evaluations; this may be because those lacking organizational support for post-training evaluation would not even consider attempting a Level 4.
Respondents had the opportunity to include a free response explanation of what they perceived to be the most important factors in facilitating and obstructing their attempts to evaluate training. I coded this qualitative data and found that support from organizational management for conducting evaluations was the most important factor, both in helping and in hindering such evaluations. The second most important facilitating factor was access to data needed for evaluation, while the second most important obstructing factor was a lack of resources followed closely by a lack of importance placed on evaluation by the organization.
After the survey, I interviewed 22 of the respondents to collect qualitative data in hopes of putting the numbers in context. I used Gilbert’s Behavior Engineering Model (BEM) to classify interview responses into organizational-level and individual-level factors. As expected, many of the comments fell into the areas of organizational data (example: the organization’s perception of the importance of training evaluation) and organizational resources (example: sufficient time and personnel available to conduct evaluations of training).
The role of organizational perception may be the single most important factor in the success or failure of conducting Level 3 and 4 evaluations. The other types of factors are likely to be a result of this perception. If the organizational management does not support evaluation, it will not make the necessary resources available, nor will it provide any incentives for conducting – or cooperating with – evaluation efforts. Such an organization will not recruit training professionals skilled in evaluation, and those already within the organization may leave (taking their skills with them) or become frustrated, demotivated, or simply resigned.
I ended the data section of my thesis with a comparison between two interviewees and how organizational support affected their efforts. The two were fairly comparable in how I perceived their levels of knowledge, charisma, and ambition. One worked for an organization which placed a very strong value on evaluation and data collection; this person had a wealth of data available for analyzing the efficiency and quality of the training programs in place, plus support for trying new approaches. The other worked for an organization which saw no value in evaluating training and would not approve any active pursuit of the necessary data; this person tried to demonstrate the value to the organization, was rebuffed multiple times, and left the company shortly after our interview.
None of these findings about factors were surprising to anyone in the field, and they tied in tidily with Gilbert’s belief that organizational-level data and resources were by far the most important factors in any workplace performance issue. As the interviews went along, however, I noticed a third critical factor emerging: individual knowledge. The issue was not a lack of expertise or experience, which is what you would expect given the complexity of Level 4 evaluations. Instead, it was the interpretation of what Level 4 is.
Kirkpatrick first presented his Four Levels in a series of articles published between November 1959 and February 1960 in the Journal of the American Society of Training Directors (now T&D, a publication of ASTD). He defined Level 4 as “the measure of final results which occur as a result of training, including increased sales, higher productivity, bigger profits, reduced costs, less employee turnover, and improved quality.” He acknowledged the difficulty in evaluating outcomes which were not readily quantifiable, and of linking such outcomes to a training program; in such cases, he suggested limiting evaluation to the other three levels. It seems that a common interpretation of Level 4 has become not the final results of training, but the final numerical and financial results of training and the organization’s return on investment for the training. If the long-term training goals aren’t things like increased sales or bigger profits, or if you don’t need to justify the investment in training because the organization will allocate resources regardless of the ROI, is Level 4 still relevant?
Well, yes. Even if the training is required by policy or does not directly affect organizational income, it makes sense to verify that the training did what it was supposed to do for the organization. At Level 3, we look at what the training did for the learners – are they performing a set of skills or competencies to the degree defined as “success”? At Level 4, we look at the organizational goal which prompted the training – did the success of the learners in performing those skills or competencies achieve the organizational goal?
In 1998, McLinden and Trochim wrote an article for Performance Improvement introducing the concept of “return on expectations” and their framework for setting expectations for training and then evaluating how well those expectations were met. Recently, Kirkpatrick Partners have been promoting the ROE concept as the new and improved definition of Level 4. Several interviewees were familiar with the term and thought it was a good idea, but did not connect ROE to Level 4. So, there’s the connection. ROE means working with the stakeholders for your training, setting the expectations (intended outcomes which meet the stakeholders’ goals), and then measuring training results versus training expectations. That’s Level 4. (However, don’t use the acronym ROE around the MBAs in your organization; it translates to “return on equity” in MBA-speak and they’ll wonder why you’re referring to profitability and investors.)
Return on investment for training is still important in many contexts, but Phillips presented it as a new fifth level rather than a definition for the fourth.
One of my conclusions is that this misinterpretation of Level 4 as strictly financial/numerical has a strong impact on organizational support for Level 4 evaluations. If you define Level 4 as strictly measurable output that affects income (sales, productivity, manufacturing defects, etc), you cannot effectively present a case for evaluating training that does not directly affect income. Such training should still accomplish its purpose, however. I must note that training professionals often do not have the opportunity to discuss such organizational goals with the stakeholders; many organizations see the training department as an “order taker” that functions only to fulfill training requests without first determining whether training is even the right option for the situation.
Thoughts on Methodology
If you were just interested in the results, you can stop reading now! For those curious about how I went about conducting the research, carry on.
This research project was approved by Boise State University’s Institutional Review Board, which must approve any research that involves humans as subjects. For both the survey and interviews, I was required to make available a letter of informed consent explaining participants’ rights. The IRB had to approve the survey questions, interview questions, letters of informed consent, and even the text of the LinkedIn posting used to solicit participants. My research was classified as Expedited Review as it collected new data from human subjects but did not subject them to anything hazardous to physical or mental health. My own mental health during this process was beside the point!
My methodology was based on Brinkerhoff’s Success Case Method. The survey phase of the SCM helps you identify the extreme cases, defined as individuals who were the most and least successful at benefiting from a program of some sort. This is followed by open-ended interviews with those extreme cases with the intention of identifying why the best performers did so well and why the worst could not succeed. Although I borrowed this structure from the SCM, it would have been difficult to define degrees of success as I did not examine the beneficiaries of a single program. It would also have been difficult to select representative (typical) cases. About 48% of respondents had a master’s or doctoral degree in a field related to instructional design in some way, but this could have meant anything from a Ph.D. in instructional systems technology to an M.Ed. in elementary education. Meanwhile, nearly one-quarter of the respondents had no formal education in instructional design. With such a wide range of backgrounds and only a professional function in common, I decided to do the only sensible thing and interview everyone who volunteered to do so. In a way this weakens the research because case selection is controlled by the subjects rather than the researcher, and is one reason why I would not consider my results truly transferable to other contexts. If I had a larger pool of participants, I might have tried to develop the extreme case selection, but I think that would have required a more focused survey.
I did not perform a statistical analysis on the survey data, although I had originally intended to do so, due to the low number of complete responses (68). I’ve certainly read plenty of academic research which included analysis on much smaller data sets, but I didn’t feel comfortable doing so for the thesis. I plan to teach myself R as my student license for SPSS has expired, so perhaps I’ll run my survey data as practice. What I’d like to see is the correlation between formal education in evaluation methods and the success in conducting Levels 3 and 4.
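That education-versus-success question could be approached with a simple correlation once the variables are coded. A minimal sketch in Python (the same idea ports directly to R); the data and the 0/1 coding scheme here are entirely hypothetical, not drawn from the thesis:

```python
# Illustrative sketch with made-up data: correlate formal education in
# evaluation (coded 0/1) against self-reported success at Level 3/4
# evaluation (coded 0/1) using plain Pearson's r from the stdlib.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical coding: 1 = formal evaluation education / 1 = success case
education = [1, 1, 0, 0, 1, 0, 1, 0]
success   = [1, 1, 0, 1, 1, 0, 0, 0]
print(round(pearson_r(education, success), 2))  # → 0.5
```

With two binary variables this is the phi coefficient; for a real analysis you would also want a significance test, which is exactly where a small n like 68 starts to bite.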
I asked each survey respondent to self-select as a success case (had conducted at least one evaluation and presented the results to the stakeholders), a non-success (had attempted at least one evaluation but could not complete it for whatever reason), or a non-evaluator (had never attempted an evaluation) for Level 3 and for Level 4. The interview subjects were assigned numbers based on interview order and codes based on their self-selection. Interviewee 6SC3NE4 would have been the sixth person interviewed and had classified himself or herself as a success case for Level 3 evaluations and a non-evaluator for Level 4.
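The labeling scheme described above is mechanical enough to decode programmatically. A sketch, assuming the obvious two-letter codes (SC for success case, NE for non-evaluator are given in the text; NS for non-success is my assumption):

```python
import re

# Decode interviewee labels like "6SC3NE4": interview order, then the
# self-selected category for Level 3, then for Level 4.
# SC and NE match the example in the text; NS (non-success) is assumed.
PATTERN = re.compile(r"^(\d+)(SC|NS|NE)3(SC|NS|NE)4$")
LABELS = {"SC": "success case", "NS": "non-success", "NE": "non-evaluator"}

def decode(code):
    m = PATTERN.match(code)
    if not m:
        raise ValueError(f"unrecognized interviewee code: {code}")
    order, l3, l4 = m.groups()
    return {"interview_order": int(order),
            "level_3": LABELS[l3],
            "level_4": LABELS[l4]}

print(decode("6SC3NE4"))
# → {'interview_order': 6, 'level_3': 'success case', 'level_4': 'non-evaluator'}
```

The embedded digits 3 and 4 refer to the evaluation levels, not to the interview order, which is why the pattern treats them as literals.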
ASTD and Pulichino were able to gather much more quantitative data through their surveys than I could. However, neither collected qualitative data. What do the numbers mean in context? One person’s idea of successful evaluation is another’s definition of inadequate and inconclusive. One month after each of his training programs, Bob sends out a 5-question survey to the learners and evaluates his programs based solely on the answers of the few that respond. Beth evaluates only a small selection of mission-critical programs, but does so by triangulating data from surveys, interviews, and observations of the learners, their supervisors, and their clients. If you ask each of them how frequently they evaluate training at Level 3, Bob would answer “almost always” and Beth “seldom”. Which one is the successful evaluator?
I interviewed 22 individuals for this research project, all of whom were involved in organizational training in some capacity. What I had hoped to find was a “magic bullet”, some critical factor that determined the fate of one’s evaluation attempts. There is one – get hired by the right company with the right management. Simple, right? Anyway, the interviews were scheduled for about 15-20 minutes each but I didn’t discourage individuals from continuing to talk if they had more to say. The shortest interview was 15 minutes, and the longest was somewhere over an hour. For each interview I had a semi-structured script so I could cover the relevant topics, but it was often valuable to let the conversation wander a bit.
My research questions were as follows:
Research Question 1:
With what frequency do training professionals conduct Level 3 and/or Level 4 evaluations for their organizations?
- Sub-question 1a: Who are the stakeholders for these evaluations?
- Sub-question 1b: For what reasons do training professionals conduct or attempt to conduct the evaluations?
Research Question 2:
What factors act as facilitators or barriers to conducting Level 3 and Level 4 training evaluations?
- Sub-question 2a: For Success Cases, what are the facilitating factors and the barriers, and how did they impact the performance of evaluations?
- Sub-question 2b: For Non-Success Cases, what are the facilitating factors and the barriers, and how did they impact the attempts to perform evaluations?
- Sub-question 2c: For Non-Evaluators, what are the facilitating factors and the barriers, and why were evaluations not attempted?
In retrospect, I asked the wrong questions. To be fair, I didn’t realize this until the interviews started and I noticed the varying interpretations of Level 4. What I should have asked is “how do training professionals interpret the purpose of Level 3 and Level 4 evaluations, and how do they define successful evaluations?”