Implementation of Item Response Theory (IRT) Rasch Model in Quality Analysis of Final Exam Tests in Mathematics

Article Info. Article History: Received 13 May 2021; Accepted 07 June 2021; Published 30 August 2021


INTRODUCTION
In school learning, teachers play an important role in evaluating and assessing the subjects being taught (Palinussa & Thaib, 2020; Sutama et al., 2017). Teachers conduct assessment to monitor progress (Santoso, 2014), improve the learning process (Umami, 2018), and obtain information about students through assignments, written tests, class questions and answers, daily tests, mid-semester tests, and end-of-semester tests (R. Abdullah, 2017; Nurjannah, 2017). The evaluation process needs to be carried out properly so that it measures students' actual abilities (B, 2017; Talakua et al., 2020). A good assessment also requires a good test instrument (Agustin et al., 2018; Nufus et al., 2017), that is, an instrument with valid measuring power (Muluki et al., 2020). Instruments without valid measuring power cannot provide any information about the test taker's ability (Solichin, 2017). In practice, the quality of the questions in many test instruments is still unknown, which results in a quasi-assessment that fails to measure students' actual abilities (Fauziana & Wulansari, 2021).
Item analysis can help improve the quality of questions by revising or discarding ineffective ones. It can also serve as diagnostic information on whether students have understood the material (Muharromah & Humaisi, 2020). In education, two approaches are used to analyze test quality, namely Classical Test Theory (CTT) and Item Response Theory (IRT) (Fitriani et al., 2019). However, analysis of test quality using the classical approach has largely been abandoned because of its many weaknesses (Pratama, 2020): the measurement results depend on the characteristics of the test used, the item parameters depend on the ability of the test takers, and measurement error can only be determined for groups, not individuals (Mardapi, 2012).
In contrast, Item Response Theory (IRT) is a general statistical theory about item and test performance and how that performance relates to the abilities measured by the items in the test (Aprita & Haryati, 2021). IRT is one way to assess the feasibility of items by comparing the observed performance of the items with the performance that the model predicts for groups of a given ability (van der Linden & Hambleton, 2013). According to Liang et al. (2014), "Item response theory (IRT) is a powerful scaling technique with appealing features such as the invariance of item and ability parameter values". IRT assumes that the probability of a test taker answering an item correctly depends on the ability of the test taker. Thus, test takers with high ability have a greater chance of answering correctly than those with low ability (Retnawati, 2014). According to Hambleton & Swaminathan (2013), IRT has several advantages: the score describes the test taker's ability and does not depend on the difficulty of the test, items can be related to the test taker's ability, and parallel tests are not required to determine the reliability coefficient.
One of the simplest IRT models, widely used by experts in developing tests, is the Rasch model, a one-parameter (1-PL) model (Chan et al., 2021; Falani & Kumala, 2017; Sumaryanto & Khumaedi, 2019). The Rasch model is easy to apply and yields accurate analysis results (Che Lah et al., 2021; Susdelina et al., 2018), and it expresses the probability of answering a question correctly by comparing the student's ability with the difficulty of the question (Nielsen et al., 2021; Sumintono & Widhiarso, 2015). In the Rasch model, item difficulty is estimated independently of the particular sample involved (Scoulas et al., 2021; Wei et al., 2012). Rasch developed a measurement model that relates a student's level of ability (person ability) to the level of item difficulty through a logarithmic function, producing measures on an equal-interval scale (Bambang, 2014; Tseng & Wang, 2021). A main characteristic of the Rasch model is that it considers all responses from a test taker regardless of the order in which the problems were solved (Isnani et al., 2019). In addition, the Rasch model was selected because it meets the principles of a measurement model: it provides a linear measure with equal intervals, handles missing data, gives more precise estimates, detects model misfit, and provides measurement instruments that are independent of the parameters studied (H. Abdullah et al., 2012; Sumintono, 2014).
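The comparison of ability and difficulty described above can be sketched in a few lines of Python. This is a minimal illustration of the standard dichotomous Rasch formula, not the Winsteps estimation routine; the function name `rasch_probability` and the example values are assumptions made for illustration.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the dichotomous Rasch (1-PL) model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# When ability equals difficulty, the chance of a correct answer is 50%.
print(rasch_probability(0.0, 0.0))  # 0.5

# A more able test taker has a higher chance on the same item.
print(rasch_probability(1.0, 0.0) > rasch_probability(-1.0, 0.0))  # True
```

Because only the difference between ability and difficulty enters the formula, persons and items can be placed on one common logit scale.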
This study builds on earlier research. Alfarisa & Purnama (2019) showed that analysis using the Rasch model approach can explain the quality of test items. Dwinata (2019) found that item analysis using Rasch modeling can describe both the items and the students' abilities in learning permutations and combinations in mathematics. Imaroh et al. (2020) showed that analyzing items with the Rasch model can provide information about item quality in the odd-semester final mathematics test for class VII of Junior High School (SMP).
Based on this previous research, the present study examines the quality of the test instrument used to measure students' abilities in the odd-semester final exam in mathematics for class VIII SMP using the Rasch model approach. Quality is measured by several indicators: items that fit the Rasch model, item difficulty level, and item reliability. Therefore, a test instrument was designed, and it was then determined which items fit the Rasch model and which do not. In addition, Cronbach's alpha is calculated to assess the reliability of the items.

METHODS
This research focuses on the analysis of the final semester examination test instrument using the Rasch model approach. The sample was selected by purposive sampling. The subjects were 67 class VIII Junior High School (SMP) students in Yogyakarta. The final semester exam instrument consists of 40 multiple-choice questions, scored 1 for a correct answer and 0 for an incorrect answer. The data analysis technique is descriptive quantitative. The test results, in the form of dichotomous scores, were analyzed using Winsteps software version 3.73. The Winsteps output provides the item parameters used to identify items that fit the Rasch model, as well as Cronbach's alpha value from the overall reliability test. The Outfit MNSQ, the Outfit ZSTD, and the correlation of each item with the total score (point measure correlation) set the limits for declaring an item fit with the model: the Outfit MNSQ value is between 0.5 and 1.5, the Outfit ZSTD value is between -2.0 and 2.0, and the correlation of the item with the total score is positive (Sumintono & Widhiarso, 2015).
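The three fit limits above can be written as a small check. The sketch below is illustrative only: the function names, the `min_passed` parameter, and the example statistics are hypothetical, and requiring all three criteria to pass is an assumption rather than a rule confirmed by the study.

```python
def fit_checks(outfit_mnsq: float, outfit_zstd: float, pt_measure_corr: float) -> dict:
    """Evaluate each of the three fit criteria separately."""
    return {
        "mnsq": 0.5 < outfit_mnsq < 1.5,
        "zstd": -2.0 < outfit_zstd < 2.0,
        "corr": pt_measure_corr > 0.0,
    }


def is_fit(outfit_mnsq: float, outfit_zstd: float, pt_measure_corr: float,
           min_passed: int = 3) -> bool:
    """Declare an item fit when at least `min_passed` criteria hold.

    The default of 3 (all criteria) is an assumption made for this sketch.
    """
    checks = fit_checks(outfit_mnsq, outfit_zstd, pt_measure_corr)
    return sum(checks.values()) >= min_passed


# Hypothetical item statistics, not taken from the study's output.
print(is_fit(1.08, -0.2, 0.35))   # True
print(is_fit(1.92, 2.6, -0.05))   # False
```

Keeping the per-criterion results in `fit_checks` makes it easy to see which limit an item violates before deciding whether to revise or drop it.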

Summary of Statistics
Based on data analysis using Winsteps 3.73, 35 items fit the Rasch model and 5 items do not. The summary statistics are presented in Table 1. Table 1 shows an average person measure of -0.84 logit and an average item measure of 0 logit, so the person measure is lower than the item measure. This indicates that the students' ability tends to be lower than the difficulty of the questions. Meanwhile, person reliability is 0.72, item reliability is 0.87, and Cronbach's alpha is 0.77. The consistency of students' answers is therefore good (0.72), and the quality of the items in the test instrument has very good reliability (0.87). In addition, Cronbach's alpha, which reflects the interaction between persons and items as a whole, has a good value of 0.77.
Another quantity shown in Table 1 is the Outfit Mean Square (Outfit MNSQ), which is 1.08 for both persons and items. This value meets the fit criterion 0.5 < MNSQ < 1.5, meaning that the test instrument conforms to the model for measuring students' competence in the final semester exam. Furthermore, the Outfit Z-Standardized value (Outfit ZSTD) is -0.1 for persons and -0.2 for items. Both values lie within -2.0 < ZSTD < 2.0, which means that the data have reasonable values under the model. Overall, the items are in accordance with the Rasch model and can be used as an achievement test instrument in the final semester exam.
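To make the gap between the person and item measures concrete, the reported logits can be converted into a success probability with the Rasch formula. The sketch below assumes that -0.84 and 0 are the average person and item measures; the function name is hypothetical.

```python
import math


def success_probability(person_measure: float, item_measure: float) -> float:
    """Rasch probability of a correct answer for measures given in logits."""
    return 1.0 / (1.0 + math.exp(-(person_measure - item_measure)))


# Values reported in the summary statistics (logits).
person_measure = -0.84
item_measure = 0.0

p = success_probability(person_measure, item_measure)
print(f"{p:.2f}")  # prints 0.30
```

Under these assumptions, a student of average ability has roughly a 30% chance of answering an item of average difficulty correctly, which is consistent with the statement that the questions tend to be harder than the students' ability.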
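For readers unfamiliar with the Outfit MNSQ, it is the mean of the squared standardized residuals between observed and model-expected responses. The following sketch computes it for a single item, assuming that person abilities and the item difficulty have already been estimated; the function name and the example data are hypothetical, and Winsteps may differ in computational details.

```python
import math


def outfit_mnsq(responses, abilities, difficulty):
    """Outfit mean square for one dichotomous item.

    responses: 0/1 scores, one per person
    abilities: person measures in logits, same order as responses
    difficulty: item measure in logits
    """
    total = 0.0
    for x, theta in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))  # expected score
        total += (x - p) ** 2 / (p * (1.0 - p))             # squared z-residual
    return total / len(responses)


# Hypothetical data: abilities equal to the difficulty give exactly 1.0.
print(outfit_mnsq([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], 0.0))  # 1.0

# A low-ability person succeeding on a hard item inflates the statistic.
print(outfit_mnsq([1], [-2.0], 2.0) > 1.5)  # True
```

Values near 1 indicate responses about as variable as the model expects, while large values signal unexpected responses such as lucky guesses on hard items.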

Level Item Fit
The items considered misfit, that is, not fitting the model, can be seen in Figure 1. Each item is judged against the following criteria: the Outfit MNSQ value is between 0.5 and 1.5; the Outfit ZSTD value is between -2.0 and 2.0; and the correlation of the item with the total score (point measure correlation) is positive (Sumintono & Widhiarso, 2015). Based on the Winsteps 3.73 analysis of the achievement test instrument shown in Figure 1, 5 items are misfit, namely items 1, 29, 36, 37, and 39, and 35 items fit, giving a final instrument of 35 items.

Person Map Item
A feature of Rasch analysis with Winsteps is a map that shows the distribution of subjects' abilities and the distribution of item difficulty levels on the same scale. This map is called the Wright map, also known as the person-item map (Salman & Abd.aziz, 2015).
In Figure 2, the left side shows the distribution of subject abilities, while the right side shows the distribution of items. The map shows that, in general, the questions in the test are more difficult than the subjects' ability. The most difficult items are item 27 and item 36, which are in the topmost position. Theoretically, because every subject's ability is lower than the difficulty of these items, no subject has more than a 50% chance of answering them correctly. The difficulty level of the items is reviewed in more detail below.
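The idea of placing persons and items on one scale can be illustrated with a rough text rendering. The sketch below is not the Winsteps map: it bins hypothetical person and item measures into one-logit rows, with persons shown as `#` on the left and item labels on the right. All names and values are assumptions made for illustration.

```python
import math
from collections import defaultdict


def wright_map(person_measures, item_measures, step=1.0):
    """Return text rows (highest logit first): persons left, items right."""
    persons = defaultdict(int)
    items = defaultdict(list)
    for theta in person_measures:
        persons[math.floor(theta / step)] += 1
    for name, b in item_measures.items():
        items[math.floor(b / step)].append(name)
    bins = set(persons) | set(items)
    rows = []
    for k in range(max(bins), min(bins) - 1, -1):
        rows.append(f"{k * step:+5.1f} | {'#' * persons[k]:<10} | {' '.join(items[k])}")
    return rows


# Hypothetical measures in logits, not taken from the study.
rows = wright_map([-1.2, -0.8, -0.3, 0.4], {"i1": -0.5, "i2": 1.6})
print("\n".join(rows))
```

An item that appears in a row above every person, like `i2` here, is harder than the ability of all subjects in the sample.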

Item Difficulty Level
One thing to note in the results of the Rasch analysis with Winsteps is the logit (measure) value. A high logit value indicates that the item has a high level of difficulty. The measure is related to the item's total score, that is, the number of correct answers the item received: the fewer the correct answers, the higher the measure. These measures are also on the same scale. The classification of the difficulty level of each item can be seen in Table 2.
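The link between an item's total score and its logit can be illustrated with the log-odds of an incorrect answer. This is only a crude starting approximation, not the estimate Winsteps reports; the function name and the example counts are hypothetical, while the sample size of 67 is taken from the study.

```python
import math


def initial_item_logit(n_correct: int, n_persons: int) -> float:
    """Rough item difficulty: log-odds of an incorrect answer."""
    if not 0 < n_correct < n_persons:
        raise ValueError("logit is undefined when nobody or everybody is correct")
    p = n_correct / n_persons
    return math.log((1.0 - p) / p)


n_persons = 67  # sample size reported in the study

# Hypothetical counts of correct answers for a hard and an easy item.
print(initial_item_logit(10, n_persons) > 0)  # True: few correct answers, hard item
print(initial_item_logit(50, n_persons) < 0)  # True: many correct answers, easy item
```

The fewer students answer an item correctly, the larger its logit, which matches the interpretation of a high measure as a difficult item.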

Differential Item Functioning
In the Winsteps program, information about item bias can be found in the Item: DIF, between/within output. An item with a probability value below 0.05 (PROB < 0.05) is flagged as exhibiting DIF. The Winsteps analysis shows PROB < 0.05, indicating bias between male and female students, for five items: item 24 (PROB = 0.0326), item 32 (PROB = 0.0351), item 33 (PROB = 0.0262), item 38 (PROB = 0.0028), and item 39 (PROB = 0.0462). This can also be seen in Figure 3, which shows the relative difficulty of the items for each group. The higher a point on the graph, the more difficult the item is for that group. There are three curves based on gender: L (male), P (female), and * (star), which indicates the average value. The plot shows that the distance between the DIF measures of L and P is greatest at items 24, 32, 33, 38, and 39, while for the other items the distance is small. This shows that the five items differ considerably in difficulty between male and female students. Item 24 favors male students because it is more difficult for female students. Items 32, 33, 38, and 39 favor female students because they are more difficult for male students. Therefore, the five items should be reviewed to check whether they truly favor one gender over the other.
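The flagging rule described above (PROB < 0.05) can be expressed as a simple filter. The sketch uses the five probability values reported in the text; the function name `flag_dif` and the `alpha` parameter are assumptions made for illustration.

```python
def flag_dif(prob_by_item: dict, alpha: float = 0.05) -> list:
    """Return the item numbers whose DIF probability is below alpha."""
    return sorted(item for item, prob in prob_by_item.items() if prob < alpha)


# PROB values reported for the five items in the gender DIF analysis.
dif_prob = {24: 0.0326, 32: 0.0351, 33: 0.0262, 38: 0.0028, 39: 0.0462}

print(flag_dif(dif_prob))  # [24, 32, 33, 38, 39]
```

Lowering `alpha` makes the rule stricter: with `alpha=0.01`, only item 38 would remain flagged.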

CONCLUSION
The test instrument used for the final exam in mathematics fits the Rasch model. This is indicated by an item reliability of 0.87, a person reliability of 0.72, and a Cronbach's alpha of 0.77, with an Outfit Mean Square (Outfit MNSQ) value of 1.08 for both persons and items. The Outfit Z-Standardized (Outfit ZSTD) value is -0.1 for persons and -0.2 for items. Of the 40 items, 35 fit the model and 5 do not. The researchers suggest that further research assess item quality using other models and approaches, such as 2PL, 3PL, or 4PL, with different software applications.