DETERMINATION OF GENDER DIFFERENTIAL ITEM FUNCTIONING IN TEGAL-STUDENTS’ SCIENTIFIC LITERACY SKILLS WITH INTEGRATED SCIENCE (SLiSIS) TEST USING RASCH MODEL

The emergence of Differential Item Functioning (DIF) indicates an external bias in an item. This study aims to identify items in the Scientific Literacy Skills with Integrated Science (SLiSIS) test that exhibit DIF based on gender. It also analyzes the emergence of DIF, especially in relation to the construct measured by the test, and draws a conclusion about the validity of the SLiSIS test in terms of the consequential aspect of construct validity. The study used a quantitative approach with a survey (non-experimental) method. The sample consisted of the SLiSIS test responses of 310 twelfth-grade high school students in the science program at SMA 2 and SMA 3 Tegal. DIF was analyzed using the Wald test with the Rasch model. At a 95% confidence level, eight items contained DIF. At a 99% confidence level, three items contained DIF (items 1, 6, and 38), or about 7% of the items. The DIF is caused by differences in test-takers' ability that follow the measured construct, so it is not test bias. Thus, the emergence of DIF in SLiSIS test items does not threaten the consequential aspect of construct validity. © 2021 Science Education Study Program FMIPA UNNES Semarang


INTRODUCTION
Citizens' scientific literacy has a significant effect on the progress of a nation because the scientific literacy of the community positively affects the quality of economic development, democracy, and community culture (Hanushek & Woessmann, 2016; Roth & Lee, 2016; Rudolph & Horibe, 2016). Therefore, students' scientific literacy must be the main goal of science education (McFarlane, 2013). Accordingly, many attempts have been made to increase students' scientific literacy in science learning by developing science learning models and assessments (Ardianto & Rubini, 2016; Rusilowati et al., 2016; Ratini et al., 2018; Fakhriyah et al., 2019).
In Indonesia, science education in high school aims to: (1) build and apply information, knowledge, and technology logically, critically, creatively, and innovatively; (2) demonstrate the ability to think logically, critically, creatively, innovatively, and independently; (3) demonstrate the ability to analyze and solve complex problems; and (4) demonstrate the ability to analyze natural phenomena, use the environment productively and responsibly, and master the knowledge needed for higher education (Ministry of Education and Culture of the Republic of Indonesia, 2018). These aims are in line with the scientific literacy skills defined by PISA 2015 (Programme for International Student Assessment), which include: (1) explaining phenomena scientifically; (2) evaluating and designing scientific investigations; and (3) interpreting scientific data and evidence (Chiang & Tzou, 2018).
The competency standards set by the Government before 2020 were measured through the National Examination. However, the implementation of the National Examination has some weaknesses. First, the Government does not use the National Examination results to determine graduation, so there is no guarantee that high school graduates comply with the competency standards. Second, not all subjects that build scientific competence are tested, and students may choose just one subject. As a result, students who pass have not been assessed comprehensively against the competency standards they should master. Likewise, at the international level, the Government has never taken a survey of achievement at the high school level, so it has no quality parameters for high school graduates in Indonesia.
In this regard, there needs to be a comprehensive examination that can ensure the competence of high school graduates following the predetermined competency standards. Therefore, SMA 2 and SMA 3 Tegal, Central Java, have conducted integrated science literacy tests on high school students in the twelfth grade of MIPA (Mathematics and Natural Sciences) to ensure that the graduates meet national standards and the main objectives of science education at the international level. The test is called the SLiSIS test (Scientific Literacy Skills with Integrated Science Test). SLiSIS test is a standardized test tested empirically and meets content standards, scientific literacy achievements, and measurement models.
In terms of content, the SLiSIS test covers Mathematics, Physics, Chemistry, and Biology competencies in an integrated manner through integrated science cases. Several studies show that science learning presented in an integrated manner significantly increases students' scientific literacy (Turiman et al., 2012; Tamassia & Frans, 2014). In terms of scientific literacy achievement, the SLiSIS test refers to PISA 2015 (OECD, 2016). The SLiSIS test uses 14 testlets, each consisting of one piece of scientific news and three items that measure the achievement of scientific literacy according to PISA 2015 standards. Item validation and scoring do not use the widely used classical test theory but Rasch modeling. Classical test theory only produces scores at the ordinal level, while Rasch modeling can produce scores at the interval level to meet the measurement assumptions (Mari et al., 2012; Bond & Fox, 2015; Susongko, 2016; Rusch et al., 2017). Validity in Rasch modeling refers to Messick's concept of validity, in which construct validity is considered a single concept consisting of several aspects (Runnels, 2012; Ravand & Firoozi, 2016). Rasch analysis explains construct validity more comprehensively than classical test theory. There are at least six aspects of construct validity: content, substantive, structural, external, generalisability, and consequential (Sabah et al., 2013; Wang et al., 2014; Jong et al., 2015).
The consequential aspect of construct validity relates to the desired and undesired consequences (e.g., bias) of the assessment and the implications derived from the meaning of the score. It focuses on the implications of interpreting the score as information. Evidence regarding the consequential aspect also addresses the actual and potential consequences of using test scores, especially in terms of sources of invalidity related to bias and fairness (Welner, 2013). The consequential aspect receives particular attention in the SLiSIS test because schools use SLiSIS test results to determine the graduation status of Science program students at SMA 2 and SMA 3 Tegal. The SLiSIS test is therefore a high-stakes test, so bias can be a major issue in administering it.
Previous studies have shown gender bias in science tests. For example, gender bias has been found in physics tests and physics teaching (Hofer, 2015; Wilson et al., 2016) and in chemistry and biology tests (Grunspan et al., 2016; Rachmatullah & Ha, 2019). In addition, several studies have shown gender bias in the scientific literacy tests of the PISA and TIMSS surveys (Lisova & Kovalchuk, 2017; Cheema, 2019). Bias is the emergence of different item characteristics for individuals with the same ability but from different ethnic groups, genders, cultures, or religions (Rouquette et al., 2016). In other words, item bias occurs if individuals who have the same ability but come from different groups do not have the same opportunity to answer an item correctly. These conditions originate from item characteristics or situations that are not relevant to the test objectives. Bias is a systematic error that affects the validity of test scores (Demirtasli, 2015). Items must be tested for potential bias to ensure the accuracy of decisions based on test scores. Methods of determining item bias focus on the validity of test items across different subgroups. Measurement bias can arise from the presence of differential item functioning (DIF) in items (Rouquette et al., 2016). DIF describes the phenomenon of one or several test items "functioning" differently in the groups of individuals being compared. These individuals are distinguished by characteristics such as age, gender, ethnicity, religion, or country of origin (Chiang & Tzou, 2018). Statistically, this means that the parameters of the function linking the latent variable (the construct measured) with the observations (responses to items) differ across the groups involved (Kendhammer et al., 2013).
If the data are analyzed using the Rasch model, DIF is present when the item difficulty parameters differ between the groups compared (Millsap, 2012).
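The Rasch formulation of DIF can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the cited sources: $\theta$ is the ability of a test-taker, $g$ denotes the group (M for male, F for female), and $b_i^{(g)}$ is the difficulty of item $i$ estimated within group $g$ and expressed on a common scale.

$$P(X_i = 1 \mid \theta, g) = \frac{e^{\theta - b_i^{(g)}}}{1 + e^{\theta - b_i^{(g)}}}, \qquad g \in \{M, F\}$$

$$\text{no DIF in item } i:\; b_i^{(M)} = b_i^{(F)}, \qquad \text{DIF in item } i:\; b_i^{(M)} \neq b_i^{(F)}$$

Under this sketch, two test-takers with the same $\theta$ but from different groups have the same probability of answering item $i$ correctly only when the two difficulty parameters coincide.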
Many studies show DIF caused by gender on science tests, especially on science tests that are high-stakes tests. All tests in mathematics, physics, chemistry, biology, and measurement of critical thinking ability are not free from the presence of DIF based on gender (French et al., 2012;Steinmayr et al., 2015). In international surveys such as PISA and TIMSS, DIF testing is not only done based on gender but also conducted against the cultural background, country, and level of scientific literacy ability of students (Mesic, 2012;Lyons-Thomas et al., 2014;Choi et al., 2015;Demirtasli, 2015;Huang et al., 2016;Chiang & Tzou, 2018;Cheema, 2019). Thus it is possible that the SLiSIS test items still contain DIF based on gender.
The appearance of DIF on a test item indicates an external bias in an item (Embretson & Reise, 2013). It is one of the things that threaten the validity of the test so that it can reduce the level of confidence in the score generated by the SLiSIS test. A preliminary study of the scientific literacy test using the unit testlet analysis and involving 112 test participants showed the existence of DIF based on gender as many as two items from the 17 items given. These items are related to the theme of nuclear physics and Astronomy (Susongko et al., 2019). The SLiSIS test was applied to 310 students. Therefore, it is necessary to investigate the extent of the existence of DIF on the test.
Various methods have been proposed to identify DIF, such as Mantel-Haenszel (MH), difference of difficulties, Lord's χ², non-compensatory DIF (NCDIF), SIBTEST, logistic regression, and others (Cuevas & Cervantes, 2012). Statistical tests used directly to evaluate DIF in Rasch modeling include the Wald test, the Likelihood Ratio (LR) test, the score test, and the Simultaneous Item Bias Test (SIBTEST). The first three rely on large sample sizes, Maximum Likelihood estimation, and parametric assumptions (Strobl et al., 2015). Several studies comparing the effectiveness of the Wald test, MH, the LR test, and SIBTEST show that the Wald test is the most effective (Hou et al., 2014). Several studies of type I error also show that the Wald test with Maximum Likelihood estimates is the most sensitive compared to other variations (Woods et al., 2013; Battauz, 2019). These findings suggest that the Wald method is the most effective in detecting DIF in the Rasch model approach.
This study aims to identify items on the SLiSIS test that exhibit DIF based on gender. Once these items are known, the occurrence of DIF is analyzed, mainly in relation to the type of scientific literacy skill measured, and a conclusion is drawn about the validity of the SLiSIS test in terms of the consequential aspect of construct validity. Judging whether an item is biased requires more in-depth information and study, primarily on whether the difference in the opportunity to answer correctly follows the construct measured. An item that shows DIF is not automatically biased. However, if the difference in the opportunity to answer correctly between male and female groups is caused by characteristics that do not fit the measurement construct, the item is biased (He & van de Vijver, 2012).

METHODS
The study used a quantitative approach with a survey (non-experimental) method. A survey design provides a quantitative description of the trends, behaviors, or opinions of a population by examining a sample of that population. From this sample, the researcher generalizes or makes claims about the population (Creswell & Poth, 2016). The stages of this study were compiling the SLiSIS test, conducting empirical trials, and analyzing DIF based on student responses to the test.
The SLiSIS test is a scientific literacy test that measures scientific literacy skills with reference to the scientific literacy achievements used by PISA in the 2015 survey. PISA divides scientific literacy skills into three domains: (1) explaining phenomena scientifically, that is, recognizing, offering, and evaluating explanations for various natural and technological phenomena; (2) interpreting scientific data and evidence, that is, analyzing and evaluating data, claims, and arguments in various representations and drawing scientific conclusions; and (3) evaluating and designing scientific investigations, that is, describing and evaluating scientific investigations and making generalizations from explanations (OECD, 2016). Specifically, the SLiSIS items that measure evaluating and designing scientific inquiry focus on the skill of making generalizations from scientific explanations or investigations.
The SLiSIS test material is a brief passage on an integrated science theme or a piece of scientific news. For each scientific reading, there are three multiple-choice items with five alternative answers. The readings are taken from various sources such as www.ScienceNews.org, www.sciencenewsforstudents.org, www.readwork.org, and some of the integrated science exams used in college entrance selection in Indonesia. The three items sequentially measure students' ability to explain phenomena scientifically (first item), interpret data and scientific evidence (second item), and evaluate and design scientific investigations (third item). The SLiSIS test consists of 14 testlets of three items each, for a total of 42 items. Each item is scored independently and dichotomously (1 or 0).
The sample of this study consisted of the responses to the SLiSIS test given by 310 grade XII students in the Science program at SMA 2 and SMA 3 Tegal. The SLiSIS test at SMA 2 Tegal was held on February 11, 2020, at 08.00-10.00 WIB, and the test at SMA 3 Tegal on February 21, 2020, at 07.30-09.30 WIB. Of the 310 students, 102 were male and 208 were female, aged 17 to 19 years. All students come from Tegal city and the surrounding areas. The distribution of the students in this study can be seen in Table 1.
The research procedure began with estimating item difficulty with Rasch modeling, using all responses, only the responses of male students, or only the responses of female students. The Rasch model analysis used R version 3.5.0 with package version 0.15-6 (Mair et al., 2019). The basic Rasch model uses the following formula:
$$P_i(\theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}}$$
with:
$P_i(\theta)$: the probability that someone with ability $\theta$ answers item $i$ correctly
$b_i$: difficulty parameter of item $i$
$i$: 1, 2, 3, ..., 42
$e$: base of the natural logarithm
$\theta$: ability parameter of the participant (Bond & Fox, 2015)
The Wald test was then performed to determine whether the differences in item difficulty between the two gender groups were significant. The Wald (W) test examines the significance of the partial effect of the independent variable ($X_i$) on the dependent variable ($Y$) in the logistic regression model. The Wald statistic is computed as:
$$W_i = \left( \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)} \right)^2$$
The hypotheses for the Wald test are:
$H_0: \beta_i = 0$ (there is no significant effect of gender ($X_i$) on the level of item difficulty ($Y$))
$H_1: \beta_i \neq 0$ (there is a significant effect of gender ($X_i$) on the level of item difficulty ($Y$))
for $i = 1, 2, \ldots, p$.
The decision criteria are: $H_0$ is rejected if $W_i > \chi^2_{(\alpha, 1)}$ or the p-value is less than $\alpha$; $H_0$ is accepted if $W_i \leq \chi^2_{(\alpha, 1)}$ or the p-value is at least $\alpha$ (Woods et al., 2013).
Maple software version 13 was used to draw the characteristic curves of the items detected as containing DIF. Beforehand, an equation was constructed with the linear regression method to place the item difficulty scale from the male participants' responses and the item difficulty scale from the female participants' responses on a common scale.
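The item-wise comparison described in this section can be illustrated with a minimal Python sketch. It assumes one common form of the Wald test for Rasch DIF, in which the difficulty of an item is estimated separately in each group and the difference is divided by its pooled standard error; the squared z value is then the corresponding one-degree-of-freedom Wald statistic. The function names and all numbers are hypothetical and are not values from the SLiSIS analysis or from any particular software package.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct answer under the dichotomous Rasch model."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))


def wald_dif(b_male: float, se_male: float, b_female: float, se_female: float):
    """Return the Wald z statistic and two-sided p-value for a difficulty gap.

    Assumes the two difficulty estimates are independent and already
    expressed on a common scale.
    """
    z = (b_male - b_female) / math.sqrt(se_male ** 2 + se_female ** 2)
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical estimates for one item (illustration only).
z, p = wald_dif(b_male=-0.40, se_male=0.22, b_female=0.35, se_female=0.15)
flag_95 = p < 0.05  # flagged at the 95% confidence level
flag_99 = p < 0.01  # flagged at the 99% confidence level

# Chance of a correct answer for a test-taker of average ability in each group.
p_male = rasch_probability(0.0, -0.40)
p_female = rasch_probability(0.0, 0.35)
```

In this hypothetical case the item is easier for the male group (negative z), so a male and a female test-taker with the same ability would have different chances of answering it correctly.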

RESULTS AND DISCUSSION
The results begin with the item analysis using Rasch modeling on all SLiSIS test responses and on the responses of male or female participants only, followed by the results of the Wald test on the student responses to the SLiSIS test. The data from the analysis can be seen in Table 2. Table 2 shows that eight of the 42 items were detected as containing DIF when using the Wald test at a 95% confidence level; these eight items have a p-value of less than 0.05. Information on the items detected as containing DIF can be seen in Table 3. Table 3 shows that, of the 14 items that measure evaluating and designing scientific inquiry, three items (numbers 6, 9, and 15) exhibit DIF and consistently benefit the male test participants. Likewise, of the 14 items that measure interpreting data and evidence, two items (numbers 26 and 38) exhibit DIF and consistently benefit the female test participants. Of the items that measure explaining phenomena, three items (numbers 1, 22, and 34) exhibit DIF, but they do not consistently benefit one group: items 1 and 34 benefit the male students, while item 22 benefits the female students. The ICCs for the items detected as containing DIF are shown in Figure 2, Figure 3, and Figure 4.
Figure 1 illustrates the difference in the Item Characteristic Curve (ICC) between an item in which DIF is detected (number 22) and an item in which it is not (number 2). For item number 2, where DIF is not detected, the ICCs for male and female test-takers coincide, so it can be concluded that male and female groups have the same opportunity to answer correctly at all ability levels. In contrast, for item number 22, where DIF is detected, the chance of the female group (in red) answering correctly is higher than that of the male group (in blue).
Figure 2 shows that for the two items that measure explaining phenomena and benefit the male students (numbers 1 and 34), the chance of the male group answering correctly (in blue) is higher than that of the female group (in red) at all ability levels. Item number 22 differs from these two items: as shown in Figure 1, the chance of the male group answering correctly is lower than that of the female group.
Figure 3 shows that for the items that measure evaluating and designing scientific inquiry, the chance of the male group answering correctly (in blue) is higher than that of the female group (in red) at all ability levels, whereas Figure 4 shows that for the two items that measure interpreting data and evidence, the chance of the female group answering correctly (in red) is higher than that of the male group (in blue) at all ability levels.
Of the eight items that contain DIF, five benefit the male students and three benefit the female students. Items that measure evaluating and designing scientific inquiry consistently benefit the male students, while items that measure interpreting data and evidence consistently benefit the female students. The result is consistent with some previous studies showing that male students have an advantage in science performance (Ganley et al., 2014; Disenhaus, 2015; Reilly et al., 2015; Lisova & Kovalchuk, 2017; Wang & Degol, 2017; Balart & Oosterveen, 2019).
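The ICC comparisons described above can be reproduced numerically. The following Python sketch assumes that each group has its own difficulty for the item, already placed on a common scale, and checks whether one group has the higher chance of answering correctly at every ability level, as is expected under the Rasch model because both curves share the same slope. The difficulty values and variable names are hypothetical and are not estimates from the SLiSIS data.

```python
import math


def icc(theta: float, b: float) -> float:
    """Rasch item characteristic curve: P(correct) at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Ability grid from -4 to 4 logits in steps of 0.5.
abilities = [x / 2 for x in range(-8, 9)]

# Hypothetical group difficulties for one item (illustration only).
b_male, b_female = 0.6, 0.1

# Gap between the female and male curves at each ability level.
gaps = [icc(t, b_female) - icc(t, b_male) for t in abilities]

# With equal slopes, the group with the lower difficulty has the
# higher curve everywhere, so the curves never cross.
female_favoured_everywhere = all(g > 0 for g in gaps)
largest_gap = max(gaps)
```

When the two difficulties are equal, every gap is zero and the two curves coincide, which corresponds to an item without DIF.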
The following are examples of two items that measure Evaluate and design scientific inquiry skills (numbers 6 and 9) and contain DIF.
6. Based on this research, if X = the possibility of stopping growth or shrinking tumors in cancer patients with high fiber diets and Y = the possibility of stopping growth or shrinking tumors in cancer patients on low fiber diets, the relationship of X and Y is as follows: A. Y=5X B. Y=X/5 C. X=Y/5 D. Y=X E. Y>X
9. Earth's sea surface temperature has risen by about half a degree Celsius from 1930 to 2010. Based on this data, the temperature rise in 2130 is estimated at ...
Answering both items requires strong mathematical reasoning. For example, in item 6, students must state the relationship between variables as a mathematical equation, while in item 9, students must generalize from existing data and then make a prediction. These abilities are needed to answer the 14 SLiSIS test items that measure Evaluate and design scientific inquiry skills. The results of this study are consistent with several previous DIF studies in which male test-takers were favored in the measurement of mathematical reasoning (Coletta et al., 2012; Reilly, 2012; Stoet & Geary, 2012; Taylor & Lee, 2012; Ong et al., 2015; Yildirim, 2019).
The following are examples of two items that measure data interpretation and evidence skills (numbers 26 and 38) and contain DIF.
26. Scientists have succeeded in giving false memories to the mouse that was electrocuted at specific locations. Which evidence from the text supports this conclusion? A. Scientists stimulate the mouse brain areas that are activated in the first location. B. The mouse is allowed to explore the first location calmly. C. The mouse receives shocks in the second location. D. The mouse is afraid of locations where they are not shocked. E. The mouse is afraid of the second location.
These items can be answered quickly if the test-takers pay close attention to the scientific news passage that accompanies the item. Accurate reading then allows the test-takers to provide the correct interpretation. Both of these items contain DIF and benefit the female students. The results of this study are consistent with some previous studies in which DIF favored female test-takers in the measurement of reading ability and textual interpretation (Taylor & Lee, 2012; Hyde, 2014; Voyer & Voyer, 2014; Balart & Oosterveen, 2019).
The following are examples of items that measure the skill of explaining phenomena scientifically and contain DIF. Items 1 and 34 benefit the male students, while item 22 benefits the female students.
1. The chemical formula of MSG is:
22. From these readings, photosynthesis is basically A. One of the big ideas of philosophy. B. The beginning of life on Earth. C. One of humanity's best inventions. D. One of the most famous scientific innovations. E. One of the keys to all processes on Earth.
34. In the human body, ammonia occurs as a result of the breakdown of: A. Amino acids B. Fat C. Carbohydrates D. Vitamins E. Gastric acid
On closer inspection, answering items 1 and 34 requires the ability to think critically rather than the ability to recall, while item 22 requires the ability to read and interpret. For example, when asked about the chemical formula of Monosodium Glutamate (MSG), students who can think critically will look for a chemical formula that contains sodium (Na). Of the five chemical formulas, only option A contains Na, so it is easy to determine that the correct answer is option A. Similarly, item 34 asks about the occurrence of ammonia in the human body. Again, students who can think critically can easily find the answer because, without seeing the molecular formula, they can tell that only amino acids have a name similar to ammonia. Therefore, students can answer that ammonia results from the breakdown of amino acids without remembering the chemical formulas of amino acids and ammonia. The ability to draw conclusions from the data provided is one indicator of critical thinking ability (Fisher, 2011; McPeck, 2016). The results of this study are consistent with several previous studies, which found gender differences in the ability to think critically (Aliakbari & Sadeghdaghighi, 2011; French et al., 2012; Harish, 2013; Preiss et al., 2013). Item 22 benefits the female students because it measures ability and accuracy in reading and interpreting, as is the case with items 26 and 38.
The emergence of items that exhibit DIF is not necessarily a weakness of a measurement instrument. It can arise for methodological reasons as well as for reasons related to test construction. Some methodological studies of DIF show that the larger the sample, the more items are detected as containing DIF (Zwick, 2012). Likewise, when the samples compared are not equivalent, DIF may go undetected (Rahmawati, 2019).
Methodological studies of DIF also show that non-uniform DIF and crossing DIF are not real DIF (Ong et al., 2015; Rouquette et al., 2016; Gómez-Benito et al., 2018). Given these methodological aspects, care is needed in choosing the confidence level when assigning DIF status to an item. The confidence level should be as high as possible so that the risk of wrongly flagging sound items is minimal. For example, at a 99% confidence level, only three items of the SLiSIS test exhibit DIF: items 1, 6, and 38, or about 7% of all items in the test.
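The arithmetic behind these proportions, and the reason a stricter confidence level reduces false flags, can be sketched in a few lines of Python. The sketch assumes that, if no item truly had DIF and each item-level test held its nominal error rate, the expected number of items flagged by chance would be the significance level multiplied by the number of items. The function name is hypothetical; the item counts are those reported in this study.

```python
def summarize_flags(n_items: int, n_flagged: int, alpha: float) -> dict:
    """Share of flagged items and expected number of chance flags.

    The expected count assumes no item truly has DIF and that each
    item-level test has a type I error rate equal to alpha.
    """
    return {
        "share_flagged": n_flagged / n_items,
        "expected_chance_flags": alpha * n_items,
    }


# Counts reported for the SLiSIS test (42 items).
at_95 = summarize_flags(n_items=42, n_flagged=8, alpha=0.05)
at_99 = summarize_flags(n_items=42, n_flagged=3, alpha=0.01)

for label, result in (("95%", at_95), ("99%", at_99)):
    print(
        f"{label}: {result['share_flagged']:.1%} of items flagged, "
        f"about {result['expected_chance_flags']:.2f} expected by chance"
    )
```

Under these assumptions, roughly two of the 42 items could be flagged by chance alone at the 95% level, against fewer than one at the 99% level, which is in line with the caution expressed above.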
From the test construction, it can be seen that mathematical reasoning, verbal reasoning, and critical thinking are part of the construct measured by the SLiSIS test, so the emergence of DIF does not mean that there is test bias. Bias occurs when differences in scores on items or indicators of a particular construct do not correspond to differences in the underlying trait or ability measured by the test (He & van de Vijver, 2012). The score differences on the items flagged for DIF in the SLiSIS test arise from abilities that are indicators of the construct measured by the test. Thus, the consequential validity of the SLiSIS test is not affected by the gender-based DIF in some of its items.
The main finding of this study is that male students tend to benefit more on items that measure evaluating and designing scientific inquiry, while female students tend to benefit more on items that measure interpreting data and evidence. Knowledge of DIF can serve as material for reflection in science learning: mathematical logic needs to be strengthened for female students and verbal logic for male students.

CONCLUSION
At a 95% confidence level, eight items in the SLiSIS test contain DIF based on gender: items 1, 6, 9, 15, 22, 26, 34, and 38, or 19% of the items. At a 99% confidence level, three items contain DIF: items 1, 6, and 38, or 7% of the items. The emergence of DIF in SLiSIS test items is due to differences in test-takers' ability that follow the measured construct, so it is not test bias. Thus, the appearance of DIF in SLiSIS test items does not threaten the consequential aspect of construct validity. If, instead, DIF arose from gender differences that do not follow the construct being measured, the condition would be called test bias and would threaten the validity of the test. In science learning, male students need to strengthen the ability to interpret data and evidence, while female students need to strengthen the ability to evaluate and design scientific inquiry. In addition, in school, mathematics learning must be strengthened in mathematical logic, while language learning must be strengthened in verbal logic.

ACKNOWLEDGMENTS
Thank you to the Ministry of Education and Culture of the Republic of Indonesia for funding this research through the higher education applied research grant projects of 2019 and 2020 no 321/K/A-5/LPPM-UPS/V/2020. Hopefully, the results of this study can strengthen the competence of Science educators in Indonesia when conducting an assessment of the Indonesian students' scientific literacy competence.