EVALUATING COMPUTERIZED ADAPTIVE TESTING EFFICIENCY IN MEASURING STUDENTS’ PERFORMANCE IN SCIENCE TIMSS

Current assessment standards demand a high level of precision, less testing time, and personalized testing opportunities, which restricts the role of the Paper and Pencil Test that has long dominated the assessment field. Computerized Adaptive Testing (CAT) is viewed as an alternative to the Paper and Pencil Test because its adaptive feature enables it to meet these standards. This research evaluates students' ability in Grade 8 Science of the Trends in International Mathematics and Science Study (TIMSS) using CAT as an alternative instrument to the Paper and Pencil Test, in order to investigate whether CAT can produce a high level of precision with fewer items administered and can differentiate the academic levels of groups of students. The CAT was configured in Concerto and administered to Form 2 and Form 4 students selected through purposive sampling from secondary schools in the northern part of Malaysia. Students' performance was analysed and compared in terms of score (theta value), SEM value, and responses to the items selected by the CAT, using SPSS. The findings show that administering 20 objective items in a fixed-length CAT produced SEM ≤ 0.50, indicating that the CAT increased assessment efficiency with fewer items administered. The t test showed a significant difference between the two groups' CAT scores, with Form 4 students having a higher ability level than Form 2 students. This indicates that the CAT had been configured correctly in Concerto and that the test was more suitable for challenging Form 2 students' Science knowledge; thus the instrument fulfilled known-group validity. © 2019 Science Education Study Program FMIPA UNNES Semarang


INTRODUCTION
In conventional assessment, the Paper and Pencil Test is widely used as a testing tool in the educational field. The items used in this linear instrument are more suitable for testing students of average ability (Chuesathuchon & Waugh, 2010). According to Mansoor Al-A'ali (2007), this instrument is suitable for measuring students' overall performance in a large-scale assessment, but it cannot provide precise evidence about an individual's ability because the administration of items is not adaptive; a student might have to answer very easy or very difficult items that provide little information about his or her ability. Therefore, Computerized Adaptive Testing (CAT) could be an alternative testing instrument to the Paper and Pencil Test in current assessment practice (Bakker, 2014), as it focuses on measuring an individual's ability. The adaptive feature of CAT enables it to select items whose difficulty parameter matches the examinee's level of ability, thus producing more precise individual measurement, shorter testing time, and faster score reporting (Linden & Glas, 2010). This study therefore tests the mechanism of CAT to ensure that it can discriminate between students with different levels of ability while working adaptively, by intelligently sequencing items based on their difficulty levels. This is done by examining the theta values, the unit of measurement for Science TIMSS achievement produced by CAT. It is hypothesized that if CAT produces a range of theta values for the students taking it, it can be inferred that CAT works according to the principle of adaptive testing. It is further assumed that older students will obtain higher theta values than younger students.
Generally, at the beginning of a CAT, the computer selects the first item randomly from items of medium difficulty, as the examinee's ability cannot yet be estimated. The examinee's response to the first item is analysed and his or her current ability is estimated. The Maximum Likelihood Estimator (MLE) is the most frequently used method for estimating an examinee's ability in CAT because it produces little bias in the ability measurement (Linden & Glas, 2010). After that, CAT repeats two steps. According to Davey (2011), in the first step an item is chosen according to the examinee's current ability through one of various methods, such as Maximum Fisher Information, Kullback-Leibler divergence, or random item selection. In the second step, the examinee's response to the item is analysed and his or her ability is re-estimated. CAT repeats these two steps until it meets a test stopping rule: all the items have been used, the required level of measurement precision has been achieved, or the test has reached its maximum testing time (Oppl et al., 2017). During the test, if the examinee answers wrongly, the computer selects an easier item as the next item, and if the examinee answers correctly, it selects a more difficult item. The estimated ability therefore rises as the examinee answers increasingly difficult items correctly. Repeating these two steps makes the estimated ability converge towards a stable point, which yields a more accurate measurement of the individual's ability. At the end of the testing session, when the test has met its stopping rule, the final estimated ability is reported. An examinee's ability in CAT is referred to as the theta value and is estimated in logit units. The higher the theta value, the higher the examinee's ability (Davey, 2011; Linden & Glas, 2010).
CAT can be administered by configuring it in a testing platform, and a web-based platform enables the CAT to be administered online. According to Oppl et al. (2017), there are three criteria for choosing the best testing platform for CAT: (1) the platform offers various testing strategies; (2) the platform offers various item selection methods; and (3) the platform offers various test stopping rules. The web-based open-source platform Concerto meets all of these criteria, because it provides several Item Response Theory testing strategies, offers several item selection methods, and contains three types of test stopping rules. Simple coding in the R language using the catR library in Concerto (Magis & Raîche as cited in Magis & Barrada, 2017) makes the test development process easier, and the test developer is free to design the test process to meet his or her own standards (Aybek as cited in Aybek & Demirtasli, 2017). Therefore, Concerto is an ideal platform for CAT test developers (Scalise & Allen, 2015).
The main advantage of using CAT in assessment is that more precise individual ability measurement can be made with shorter testing time and instant score reporting. Research by Lilley & Barker (2003) showed that administering a linear computerized test and a CAT to similar students produced the same score. Kalender (2012) compared the scores from CAT and the Paper and Pencil Test in a Science subtest and found a high correlation between the scores of the two instruments. CAT provided higher reliability of ability estimation using half the number of items given in the Paper and Pencil format. In addition, the research by Martin & Lazendic (2018) agreed with the finding of Davey (2011). They stated that the items selected by CAT matched the student's ability well, as the selected items were in the examinee's learning zone, leading to less measurement error and thus better precision. Furthermore, research by Mizumoto et al. (2017), which used Concerto as the CAT testing platform, found that CAT can reduce the number of items used and the testing time without affecting testing precision. Other advantages of using CAT are that cheating can be avoided because each examinee receives a unique set of items (Chuesathuchon & Waugh, 2010) and that CAT can improve examinees' motivation towards the test because the test takers' abilities are not compared with each other (Betz & Turner, 2011). Owing to these advantages, this instrument has been widely used in both high-stakes and low-stakes testing environments.
In the educational field, CAT has been used in the Adaptive English Proficiency Test for the Web, the Alberta Computer Adaptive Assessment System, the Japanese Computerized Adaptive Test, Measures of Academic Progress, and the Graduate Record Examination (IACAT, 2016), the Danish National Test (Beuchert & Nandrup, 2015), the School Graduation Exams in Georgia (Bakker, 2014), RISE (Utah State Board Education, 2018), and RoSA, which uses the Concerto platform (Psychometric Unit University of Cambridge, 2018).
Little research has been conducted on the implementation of CAT in Malaysia. Desa & Latif (2007) proposed CAT as an alternative instrument to the Paper and Pencil Test and explained its criteria and its adaptive feature in providing more precise ability measurement, but without conducting a real assessment using this instrument. Norah and Nor Azean (2008) configured a CAT using C++ and implemented it to assess students' ability in a programming subject. They found that Malaysian students showed a positive attitude towards using CAT in assessment and that the students believed there was no difference in scores between the linear test and the adaptive test. Although it was reported that students' readiness to use a more advanced mode of testing with the integration of ICT is at par level (Kiran et al., 2012), the integration of ICT into education in Malaysia is considered not yet satisfactory, especially in the specific field of educational assessment and evaluation (Umar & Hassan, 2015). Therefore, the integration of ICT in education has been one major focus of the Malaysian Educational Development Plan 2013-2025, to increase the quality of educational practice and to keep pace with current technological development (Ministry of Education, 2013). More research on CAT needs to be done in Malaysia, which would at the same time support the Malaysian Educational Development Plan 2013-2025. The CAT in this research is configured using the R language, which is regarded as more powerful and simpler than C++ (Venables et al., 2018).

METHODS
This was a quantitative study which analysed numeric data (Noraini, 2013). A cross-sectional survey method was implemented, in which the instrument was given to the respondents at a specific point in time (Fraenkel & Wallen, 2009). The respondents were selected through purposive sampling (Noraini, 2013) from secondary schools in the northern part of Malaysia. The research was conducted in three stages: CAT configuration, CAT administration, and data analysis. In the first stage, the CAT was configured in Concerto using the R language. Concerto is a web-based open-source platform that enables linear tests as well as adaptive tests to be configured (Magis & Raîche, 2012). The platform consists of three types of modules. The testing module enables the test process to be planned using the R language, the HTML template module enables the test template to be designed, and the table module is used to record all the information and responses given by examinees as well as all the outputs from R (Aybek & Demirtasli, 2017). Five nodes, which together draw on these three modules, were selected to produce a complete test flow: the create_template_definition node, info node, form node, save_data node, and CAT node. The five criteria of the CAT, namely the calibrated item bank, first item selection method, item selection method, ability estimation method, and test stopping rule, were set in the CAT node. The released multiple-choice questions (MCQ) from Grade 8 Science TIMSS 2003-2015 were adopted from https://nces.ed.gov/timss/educators.asp.
The adopted TIMSS item bank was calibrated using an anchor test design (Ryan & Brockmann, 2018), and a total of 122 items were uploaded into the CAT node together with each item's difficulty parameter, measured in logit units by the Rasch model. The most difficult item had the highest difficulty parameter, 3.15 logit, while the easiest item had a difficulty parameter of -5.30 logit. The first item was set to be selected randomly by the CAT from items of average difficulty, Maximum Fisher Information was used as the item selection method, the Maximum Likelihood Estimator was used to estimate the students' ability, and a fixed test length of 20 items was set as the stopping rule.
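As a reference sketch of the model behind these settings, the dichotomous Rasch model and its item information function can be written as below. This is the standard textbook formulation rather than anything taken from the Concerto configuration itself; the symbols $\theta$ (ability), $b_i$ (difficulty of item $i$), $P_i$, and $I_i$ are notation assumed here.

$$P_i(\theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}, \qquad I_i(\theta) = P_i(\theta)\,\bigl[1 - P_i(\theta)\bigr]$$

Because $I_i(\theta)$ peaks at $0.25$ when $b_i = \theta$, selecting by Maximum Fisher Information under this model amounts to choosing the unused item whose difficulty is closest to the current theta estimate.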
In the second stage, the configured CAT was administered concurrently, at the beginning of the school year, to one class of Form 2 students and one class of Form 4 students selected through purposive sampling. Form 2 students were selected because Grade 8 items were used. It was assumed that older students would achieve better results, so Form 4 students were selected in order to compare the two groups' achievement and thereby check that the CAT's adaptive feature functioned properly. A total of 30 Form 2 students and 35 Form 4 students were involved. The test was administered online, and the students had to answer 20 MCQ Grade 8 Science TIMSS items within 30 minutes. As CAT is adaptive, the sequence of items administered depended on the responses given by each student, so every student received a unique set of 20 items. During the test session, the first item selected by the CAT was an item of average difficulty. A correct response was given a score of 1 and a wrong response a score of 0. After that, the student's ability (theta value) was estimated in logit units through Maximum Likelihood Estimation. Through the Maximum Fisher Information method, the next item selected was the one whose difficulty parameter was nearest to the estimated theta. Generally, the CAT selected an easier item, with a lower difficulty parameter than the previous item, if the student answered the previous item wrongly, and a more difficult item if the student answered it correctly. The item selection process continued until the test stopping rule was fulfilled, and then the final theta was estimated and reported.
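The administration procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Concerto or catR implementation: the item bank is hypothetical (122 difficulties evenly spaced between -5.30 and 3.15 logit), the first item is simply the one nearest 0 logit rather than a random medium-difficulty item, the ability estimate is bounded to an assumed range of -4 to 4 logit, and all function names are invented for this example.

```python
import math
import random

THETA_MIN, THETA_MAX = -4.0, 4.0  # assumed bounds for the ability estimate


def p_correct(theta, b):
    """Rasch probability of answering an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def estimate_theta(difficulties, responses, tol=1e-6):
    """Maximum likelihood estimate of theta, found by bisection on the score.

    The score (derivative of the log likelihood) decreases as theta grows,
    so its root is the MLE. All-correct or all-wrong patterns have no finite
    root, and the estimate is then set to the nearest assumed bound.
    """
    def score(theta):
        return sum(r - p_correct(theta, b) for r, b in zip(responses, difficulties))

    low, high = THETA_MIN, THETA_MAX
    if score(high) >= 0:
        return high
    if score(low) <= 0:
        return low
    while high - low > tol:
        mid = (low + high) / 2.0
        if score(mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def run_cat(bank, answer, test_length=20):
    """Run a fixed-length CAT.

    bank: dict mapping item id to difficulty (logit).
    answer: callable taking an item id and returning 1 (correct) or 0 (wrong).
    Returns the final theta, its SEM, and the ordered list of item ids used.
    """
    remaining = dict(bank)
    used, responses = [], []
    theta = 0.0  # start at medium ability, so the first item has medium difficulty
    while remaining and len(used) < test_length:
        # Maximum Fisher Information: pick the unused item most informative at theta.
        item = max(remaining, key=lambda i: item_information(theta, remaining[i]))
        used.append((item, remaining.pop(item)))
        responses.append(answer(item))
        theta = estimate_theta([b for _, b in used], responses)
    total_information = sum(item_information(theta, b) for _, b in used)
    return theta, 1.0 / math.sqrt(total_information), [i for i, _ in used]


# Hypothetical bank: 122 difficulties evenly spaced from -5.30 to 3.15 logit.
bank = {k + 1: -5.30 + 8.45 * k / 121 for k in range(122)}

# Simulated examinee with a hypothetical true ability of 1.0 logit.
rng = random.Random(0)
simulated = lambda item: int(rng.random() < p_correct(1.0, bank[item]))
theta, sem, items = run_cat(bank, simulated)
print(f"theta = {theta:.3f} logit, SEM = {sem:.3f}, items used = {len(items)}")
```

Bisection is used for the estimate only because it cannot overshoot the assumed bounds; it is a design choice of this sketch and says nothing about how the platform computes theta internally.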
In the third stage, the students' ability (the theta value estimated by the CAT), the standard error of measurement (SEM), and the students' responses to the selected items were analysed and compared using SPSS software. The ability measured by CAT is referred to as theta and is measured in logit units. In CAT, the theta value is re-estimated after every item administered. The theta obtained through the Maximum Likelihood Estimation (MLE) method is the value that maximises the log likelihood function given the examinee's responses to the items administered (Özdemir, 2016). The higher the theta value, the higher the examinee's ability. The SEM shows how accurately theta has been measured: the lower the SEM value, the higher the accuracy of the theta estimated by CAT (Barnard, 2018). In addition, the t test was used to investigate whether there was a significant difference in achievement between the two groups of students, in order to test known-group validity.
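The link between SEM and reliability can be sketched as a short worked calculation. It assumes the Rasch model, in which the information of a single item never exceeds 0.25, and a theta distribution with unit variance ($\sigma_\theta^2 = 1$); both are assumptions made for illustration, and $I_i$ (information of item $i$) and $n$ (number of items) are notation introduced here.

$$\mathrm{SEM}(\hat\theta) = \frac{1}{\sqrt{\sum_{i=1}^{n} I_i(\hat\theta)}} \;\ge\; \frac{1}{\sqrt{n/4}} = \frac{2}{\sqrt{20}} \approx 0.447 \quad (n = 20)$$

$$\text{reliability} \approx 1 - \frac{\mathrm{SEM}^2}{\sigma_\theta^2} = 1 - 0.50^2 = 0.75$$

Under these assumptions, a 20-item test can reach an SEM of 0.50 only if most items lie close to the examinee's ability, which is what adaptive item selection aims for.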

RESULTS AND DISCUSSION
Form 2 Students' Ability in Science TIMSS Computerized Adaptive Testing (CAT)
Table 1 shows Form 2 students' ability (theta) in the Science TIMSS CAT by student ID. According to Table 1, a total of 33 Form 2 students were involved in this testing process. The highest theta value was 2.570 logit while the lowest was -3.135 logit. A total of 25 Form 2 students obtained positive theta values while the rest obtained negative theta values. After the administration of 20 items, all the students obtained theta values with a Standard Error of Measurement (SEM) of less than 0.50, which is equivalent to a reliability in the order of 0.75 (Barnard, 2018). The estimated theta values therefore had high reliability.
Next, Table 2 shows the descriptive statistics of Form 2 students' ability in the Science TIMSS CAT. According to Table 2, the mean score for the 33 Form 2 students was 0.740 logit with a standard deviation of 1.657 logit, and the median was 1.413 logit. The estimated theta value of 1.452 logit had the highest frequency. Table 2 also reports that the minimum theta value obtained by Form 2 students was -3.135 logit while the maximum was 2.570 logit, giving a range of 5.705 logit. The percentile analysis in Table 2 shows that 25% of Form 2 students obtained a theta value below -0.162 logit. Half of the students obtained a theta value below 1.413 logit while the other half obtained a theta value above 1.413 logit. A total of 75% of Form 2 students obtained a theta value below 1.881 logit while the rest obtained a theta value above 1.881 logit. This analysis shows that Form 2 students had a medium ability level. Since a higher theta value indicates higher ability, Form 2 students need to improve their understanding of this basic Science knowledge.
Furthermore, the item pattern and the students' responses were also analysed. Table 3 lists the items selected by the Science TIMSS CAT for Form 2 students together with their selection frequency. According to Table 3, 97 of the 122 calibrated Grade 8 released Science TIMSS items were used. Item 15 had the highest frequency, 33 times, indicating that every Form 2 student answered this item in the Science TIMSS CAT. The frequency of the remaining items decreased gradually. Form 2 students' responses affected the items selected by the CAT, and the selected items covered a wide range of item difficulty parameters, from high through medium to low.

Item 11. Which of the following properties of a substance is conserved during thermal expansion? A. Mass B. Volume C. Shape D. Distance between particles [difficulty (logit), score, estimated theta (logit): 3.15, 1, 4.00]
The second selected item was Item 11, which tested the Physics content domain and had the highest difficulty level, 3.15 logit, as this difficulty level was the nearest to the previous estimated theta. The selected item's difficulty level was below the student's previous estimated theta, so this student had a high chance of answering correctly. The student answered the second item correctly, so the estimated theta was 4.00 logit.

Item 8. The diagram below shows an example of interdependence among organisms. During the day the organisms either use up or give off (a) or (b) as shown by the arrows. Choose the right answer for
Item 4, which tested the Biology content domain, was selected as the fourth item because its difficulty level was the nearest to the previous estimated theta and it was positioned just below Item 8 in the item bank list. The student answered wrongly (score 0), so the estimated theta decreased from the previous one. As the student failed to answer the fourth item, an easier item, with a difficulty level below 2.23 logit, was selected as the next item. Item 60, which tested the Physics content domain, was selected as the fifth item; its difficulty level was lower than that of the previous item and it was positioned just below Item 4. The student answered correctly, so the estimated theta increased to 3.62 logit.
Item 18. The figure shows an iron nail with an insulated wire coiled around it. The wire is connected to a battery. What will happen to the nail when current flows through the wire? A. The nail will melt B. Electric current will flow through the nail C. The nail become a magnet D. Nothing will happen to the nail [difficulty (logit), score, estimated theta (logit): 2.04, 2, 2.83]
Item 18, which tested the Physics content domain, was selected as the sixth item. The student had answered the previous item correctly, so an item with a difficulty level above 2.16 logit would be expected next. However, Item 18, with a difficulty level of 2.04 logit, was selected because its difficulty level was the nearest to the previous estimated theta among the remaining items. The student answered wrongly, so the estimated theta decreased to 2.83 logit. This item selection pattern made sense because the Science TIMSS CAT used Maximum Fisher Information as its item selection method, in which the next item selected is the one whose difficulty level is closest to the estimated theta (Özdemir, 2016). An easier item is given after a wrongly answered item and a more difficult item after a correctly answered one. Generally, this item selection procedure was the same for every student, depending on the response given to the previous item.
In addition, Item 11 (Figure 1), which had the highest difficulty level, was answered wrongly by Form 2 students 19 times out of the 22 times it was selected. The correct answer for this item is option A. Of the 19 students who answered Item 11 wrongly, 12 selected option D, 4 selected option C, and 3 selected option B. These students might have had an unclear understanding of the word "conserved".
In conclusion, Form 2 students had a medium ability level in the Science TIMSS CAT. The students' differing abilities led to the selection of items with different difficulty parameters, covering a wide range of difficulty levels. Based on the estimated abilities, Form 2 students need to improve their knowledge of basic Science, especially those with negative theta values.
Form 4 Students' Ability in Science TIMSS Computerized Adaptive Testing (CAT)
Table 6 shows Form 4 students' ability (theta) in the Science TIMSS CAT by student ID. According to Table 6, 35 Form 4 students were involved in the testing session. The highest theta value was 4.000 logit while the lowest was -1.891 logit. Overall, 31 students obtained positive theta values and 4 students obtained negative theta values. Moreover, ten students with theta values above 3.500 logit obtained SEM values higher than 0.50. This might be due to the use of the Maximum Likelihood Estimation method, which cannot produce a precise measurement if nearly all items are answered correctly or wrongly (Song as cited in Özdemir, 2016). Therefore, these Form 4 students needed more difficult items for the Maximum Likelihood Estimation method to measure their ability properly. For the rest of the students, the SEM value of less than 0.50 showed that the obtained theta had high reliability, in the order of 0.75 (Barnard, 2018).
Table 7 shows the descriptive statistics of Form 4 students' ability in the Science TIMSS CAT. According to Table 7, the mean score for the 35 Form 4 students was 2.368 logit with a standard deviation of 1.531 logit. The median was 2.837 logit and the mode was 2.837 logit. Table 7 also reports that the minimum theta value obtained by Form 4 students was -1.891 logit while the maximum was 4.000 logit, giving a range of 5.891 logit. The percentile analysis in Table 7 shows that 25% of Form 4 students obtained a theta value below 1.881 logit. Half of the students obtained a theta value below 2.837 logit while the other half obtained a theta value above 2.837 logit. A total of 75% of Form 4 students obtained a theta value below 3.507 logit while the rest obtained a theta value above 3.507 logit. This analysis shows that the majority of Form 4 students had a high ability level.
This result made sense because Form 4 students had already learnt the Grade 8 Science content and are currently learning pure Science at the upper secondary level. The item pattern and the students' responses were then analysed. Table 8 lists the items selected by the Science TIMSS CAT for Form 4 students together with their selection frequency. According to Table 8, 75 of the 122 calibrated Grade 8 released Science TIMSS items were used. Item 15 had the highest frequency, 35 times, indicating that every Form 4 student answered this item in the Science TIMSS CAT. There were large gaps between the item frequencies. A total of 20 items had a frequency of 30 and above, and the majority of Form 4 students answered these 20 items. They consisted of 19 items with a high difficulty level and 1 item with a negative difficulty level. The 19 items had difficulty levels ranging from 1.33 logit to 3.15 logit and were the most difficult items in the item difficulty list. The remaining item was Item 15, with a difficulty level of -0.02 logit. These selection frequencies indicated that the CAT selected difficult items for Form 4 students, as these students have a higher ability to answer difficult questions.
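The SEM values above 0.50 reported for Form 4 students with theta above 3.500 logit can be illustrated with a small sketch. It assumes the Rasch model and a hypothetical set of 20 difficulties (19 evenly spaced between 1.33 and 3.15 logit plus one at -0.02 logit) that only mimics the range reported above; the real item difficulties are not reproduced here, and the function names are illustrative.

```python
import math


def rasch_information(theta, b):
    """Fisher information of a Rasch item of difficulty b at ability theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)


def sem(theta, difficulties):
    """Standard error of measurement at theta for a set of administered items."""
    total = sum(rasch_information(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(total)


# Hypothetical difficulties: 19 hard items spread over 1.33 to 3.15 logit,
# plus one medium item at -0.02 logit.
hard_items = [1.33 + k * (3.15 - 1.33) / 18 for k in range(19)] + [-0.02]

sem_at_ceiling = sem(4.0, hard_items)  # examinee estimated at 4.0 logit
sem_on_target = sem(2.2, hard_items)   # examinee in the middle of these items

print(f"SEM at theta 4.0: {sem_at_ceiling:.2f}")
print(f"SEM at theta 2.2: {sem_on_target:.2f}")
```

In this sketch the same 20 items measure an examinee at 4.0 logit less precisely than one at 2.2 logit, because every item is easier than the higher ability and so carries less information there. This is consistent with the observation that these students would need more difficult items.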
Further analysis was made of the item pattern and the students' responses. It shows that the first item used in the Science TIMSS CAT was Item 15. If it was answered correctly, Item 11 was given next; if it was answered wrongly, Item 102 was given. This item selection pattern was similar to that for Form 2 students: selection of the next item was based on the student's response to the previous item. This further analysis also shows that 6 students with a theta value of 4.000 logit had answered all 19 or 20 items correctly. As they had completed the Grade 8 Science syllabus, their theta values were considered appropriate for their educational level. Therefore, these students need more difficult items to obtain a more precise ability measurement.
[Table 8. Items selected by the Science TIMSS CAT for Form 4 students, grouped by selection frequency: , 10, 11, 28, 38; 3, 4, 8, 18, 26, 34, 40, 57, 60, 61, 103, 108, 110, 112; 36, 52, 56, 102; 20, 45, 119; 5, 9, 13, 29, 31, 32, 49, 58, 62, 75, 78, 82, 85, 89, 96, 98; 7, 12, 17, 23, 30, 35, 37, 43, 44, 47, 48, 50, 55, 59, 63, 65, 70, 71, 73, 74, 76, 79, 81, 83, 90, 94, 95, 109, 114, 117, 120]
Table 9 compares the descriptive statistics of Form 2 and Form 4 students' ability in the Grade 8 Science TIMSS CAT. Overall, the mean, median, mode, and percentiles of Form 4 students were higher than those of Form 2 students. The scores obtained by the two groups, which were taken at the beginning of the school year, were appropriate to their academic levels. The majority of Form 4 students obtained high theta values and mostly answered the items with the highest difficulty parameters, while Form 2 students obtained medium and slightly low theta values, indicating that these Form 2 students need to learn more Science knowledge. Moreover, the majority of the students obtained an estimated theta with an SEM value less than or equal to 0.50 for the 20 items administered.
The SEM result was in line with the CAT simulation research conducted by Barnard (2018) using the Maximum Fisher Information and Maximum Likelihood Estimation methods. The result indicates that the configured CAT was capable of producing precise ability estimates with fewer items. Therefore, this instrument is suitable for assessing students' basic Science TIMSS knowledge, as the estimated theta had high reliability.

Known-Group Validity
Known-group validity tests how well an instrument can differentiate between two known groups. It can be examined by administering an instrument simultaneously to two different known groups of people. The criterion for known-group validity is met when there is a statistically significant difference in scores between the two known groups, which can be analysed using a t test (Devellis as cited in NSSE, 2009). In this study, the calibrated Grade 8 TIMSS item bank was used in the CAT, so the target group for this instrument was Form 2 students. Therefore, known-group validity was tested using a t test comparing the mean ability of Form 2 students with that of Form 4 students in the Science TIMSS CAT. Table 10 shows the t test for these two groups of students. According to Table 10, the Sig. (2-tailed) value was 0.000, that is, p < 0.001. This value was less than 0.05, so there was a significant difference between the mean scores of Form 2 and Form 4 students in the Science TIMSS CAT (Pallant, 2011).
The eta squared, or magnitude of the difference, was calculated using the following formula (1):

$$\eta^2 = \frac{t^2}{t^2 + (N_1 + N_2 - 2)} \qquad (1)$$

where t = the t value in the t-test for Equality of Means; N1 = the total number of respondents in group 1; and N2 = the total number of respondents in group 2. The calculated eta squared was 0.212. According to Cohen (1988) in Pallant (2011), an eta squared value of 0.01 is considered small, 0.06 is considered medium, and 0.14 is considered large. The eta squared obtained in this study was higher than 0.14, so there was a large difference in magnitude between the two mean scores. As there was a significant difference between the mean scores of these two groups of students, and the previous analysis showed that the Science TIMSS CAT was more challenging for Form 2 students, this instrument met known-group validity.
The Science TIMSS curriculum framework, comprising the intended curriculum, implemented curriculum, and attained curriculum, was developed based on the similarities between the Science curricula of the participating countries. Generally, the items used in Grade 8 Science TIMSS consist of multiple-choice questions and subjective questions. Science TIMSS tests four content domains: 35% Biology items, 25% Physics items, 20% Chemistry items, and 20% Earth Science items. Furthermore, the tested items cover three cognitive domains: 35% knowing items, 35% applying items, and 30% reasoning items. Items in the knowing cognitive domain assess factual knowledge, concepts, and relationships. Items in the applying cognitive domain assess the student's ability to make comparisons, interpret information, and explain scientifically, while items in the reasoning cognitive domain require students to analyse and evaluate data and to make generalisations and justifications (Mullis & Martin, 2017). The calibrated Science TIMSS item bank used in this research involved only multiple-choice questions, covering the knowing and applying cognitive domains, because items in the reasoning cognitive domain are mostly subjective items.
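The effect-size calculation in formula (1) can be sketched in a few lines of Python. The functions below are illustrative helpers, not SPSS output: `pooled_t` assumes equal variances, the sample scores are made up, and the "negligible" label for values below 0.01 is an assumption added here since the cited thresholds start at 0.01.

```python
import math


def pooled_t(group_a, group_b):
    """Independent-samples t statistic with pooled variance (equal variances assumed)."""
    n1, n2 = len(group_a), len(group_b)
    mean1, mean2 = sum(group_a) / n1, sum(group_b) / n2
    ss1 = sum((x - mean1) ** 2 for x in group_a)
    ss2 = sum((x - mean2) ** 2 for x in group_b)
    pooled_variance = (ss1 + ss2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_variance * (1.0 / n1 + 1.0 / n2))


def eta_squared(t, n1, n2):
    """Eta squared from a t value and the two group sizes, as in formula (1)."""
    return t ** 2 / (t ** 2 + (n1 + n2 - 2))


def cohen_label(eta_sq):
    """Label an eta squared value using the thresholds 0.01, 0.06, and 0.14."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # label assumed for values below the smallest threshold


# Made-up theta values for two small groups, for illustration only.
younger = [1.0, 2.0, 3.0]
older = [2.0, 3.0, 4.0]

t_value = pooled_t(younger, older)
effect = eta_squared(t_value, len(younger), len(older))
print(f"t = {t_value:.3f}, eta squared = {effect:.3f} ({cohen_label(effect)})")
```

Applied to the reported eta squared of 0.212, `cohen_label` returns "large", matching the interpretation given in the text.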
According to Piaget, there are four stages of cognitive development, each with a specific age range. Children aged 0-2 years are in the sensorimotor stage, in which cognitive development is based on the interaction of the five senses with the environment. Children aged 2-6 years are in the preoperational stage, in which they use symbols to represent images or words without being able to reason logically. Children aged 7-12 years are in the concrete operational stage, in which they can think logically about concrete events only, while children aged 12 years and above are in the formal operational stage, which involves abstract and logical thinking (Lazarus, 2010).
Based on the age classification in Piaget's theory, both Form 2 and Form 4 students are in the formal operational stage. Form 2 students (Grade 8) are beginners in this stage and are starting to develop the ability to think abstractly. They are still developing the capacity to think abstractly and in depth about a concrete event and to give systematic logical reasons for it, which mainly involves the applying and reasoning cognitive domains, so they have not yet properly mastered these higher cognitive skills. Form 2 students are still learning to engage in problem-solving. Therefore, the majority of these students are less capable of answering more difficult Science items. As the items selected by the CAT depended on the response to the previous item as well as the previous estimated theta, it made sense that, in the previous analysis, the items selected for Form 2 students covered a wide range of item difficulty levels and used almost 80% of the items in the whole item bank.
Form 4 students are more mature than Form 2 students because they have been in the formal operational stage for a longer period. These Form 4 students have mastered more abstract thinking skills, so they can think logically in applying and relating various scientific knowledge and can analyse data better, and thus they are capable of learning more abstract scientific knowledge. Currently, these students are learning the pure Science subjects of Biology, Chemistry, and Physics. Therefore, Form 4 students had more Science knowledge and better higher-order thinking skills than Form 2 students, so they were capable of answering more difficult Science items involving the applying and reasoning cognitive domains. For this reason, the previous analysis showed that most of the 20 items selected by the CAT for Form 4 students were items with a high difficulty level, positioned at the top of the item difficulty list. The higher an item's difficulty, the harder it is to answer correctly. As the students were capable of answering difficult items correctly, their estimated theta increased, showing a higher level of ability.
The overall analysis shows that Form 4 students achieved better results in the CAT than Form 2 students. Form 4 students generally had a better mastery of the Grade 8 Science TIMSS knowing and applying items than Form 2 students, as they were capable of answering more difficult Science items correctly. The analysis indicates that the instrument was configured correctly and was capable of differentiating two levels of knowledge through its adaptive feature, by correctly matching item difficulty to the estimated theta according to the selection method used, thus producing precise ability estimates using fewer items.

CONCLUSION
The CAT was configured in Concerto, a platform containing the computer algorithm that enables the test to operate adaptively. Measurement in CAT is based on the theta values produced by matching the student's ability level with item difficulty level. The CAT was administered to Form 2 and Form 4 students simultaneously at the beginning of the school year, when Form 2 students were still learning the lower secondary Science syllabus, which is equivalent to Grade 8 level, while Form 4 students had already learnt the lower secondary Science syllabus and were learning the upper secondary Science syllabus. The students' ability in the test was evaluated, and the item selection pattern and the students' responses were analysed. Form 2 students used 97 items in total, with a wide range of item difficulties, while Form 4 students used 75 items in total, most of which had a high difficulty level. The Science TIMSS CAT met known-group validity, with Form 4 students obtaining higher ability estimates than Form 2 students, indicating that the instrument is more suitable for challenging lower secondary Form 2 students. This analysis shows that the CAT was configured correctly in Concerto, because it can differentiate significantly between two levels of knowledge. Moreover, the configured CAT shows that it can increase measurement accuracy while using fewer items.