Analysis of HOTS Instrument for Prospective Physics Teacher Using Generalized Partial Credit Model

This study aims to analyze item parameter characteristics and estimate prospective physics teachers' abilities on a Higher Order Thinking Skills (HOTS) instrument using the Generalized Partial Credit Model (GPCM). The research subjects were 251 prospective physics teacher students at two universities concerned with producing prospective teacher graduates. The test instrument consists of Two-Tier Multiple Choice (TTMC) items scored polytomously in four categories. Data analysis included two stages: testing the assumptions and testing the fit of the polytomous IRT model. The results showed that the most suitable polytomous-scoring IRT model was the GPCM2PL. The item parameter analysis for the HOTS instrument shows that the discrimination parameter (a) values of all items fall in the good category, namely the interval 0.00 to 2.00. The difficulty analysis (b) likewise shows that 100% of items fall in the medium category, because the b values of all items lie in the interval -2 to 2.


Introduction
Education is an important aspect of the development of a country (Hasudungan & Kurniawan, 2018). Operationally, the implementation of education is regulated through the curriculum. The curriculum is a reference for implementing education at all levels, from basic and secondary education to higher education (Hadi et al., 2018). Implementing education in higher education refers to the higher education curriculum, with graduate competency standards based on the Kerangka Kualifikasi Nasional Indonesia (KKNI). Adding justifications to the second tier of a two-tier question format helps test-takers exercise higher-order thinking skills (Cullinane & Liston, 2011). To prepare test-takers for higher-order thinking tasks, the justifications corresponding to the answer options must be considered before responses are selected. In addition, poor assessment quality is partly due to the selection of conventional multiple-choice test models, which are commonly used to measure only low-level thinking skills (Istiyono et al., 2014). Multiple-choice tests must therefore be modified to measure higher-order thinking skills (Brookhart, 2010). One such effort is the two-tier instrument, often called two-tier multiple choice (TTMC) (Istiyono et al., 2020).
Apart from the form of the test instrument, ensuring that the evaluation results appropriately reflect students' abilities is another factor that must be considered. An evaluation is considered accurate if its results contain as little error as possible. To produce results that correctly reflect students' abilities, the test instrument must be valid, reliable, and have good item parameters. Item response theory and classical test theory are two methods that may be used to estimate item parameters for this purpose. Classical test theory is said to have weaknesses. Its primary weakness is the inability to separate examinee characteristics from test characteristics; each can only be interpreted in the context of the other (Hambleton et al., 1991). In other words, the test determines the examinee's apparent aptitude: when the test is difficult, the test-taker appears to perform poorly, and when the test is easy, the same person appears more skilled. The test-taker's characteristics and the item characteristics are thus strongly confounded, and both will vary as the group of examinees changes. Because the assessment outcomes depend on the particular test-takers, classical test theory cannot separate the two.
Item response theory addresses these flaws by decoupling item characteristics from the sample of test-takers. Examinees' estimated traits or abilities do not change even if they work on items with different characteristics; conversely, item characteristics stay the same regardless of which examinees respond to them. The unit of analysis in item response theory is no longer the test as a whole but the individual items. Item response theory rests on two assumptions: (a) a set of latent traits or abilities can predict (or explain) test-takers' performance on test items; and (b) as ability increases, the probability of answering an item correctly also rises. Item response theory can be applied when the chosen model fits the test data well (Hambleton et al., 1991); item parameter estimation can be distorted when the model does not match the data (Stone & Zhang, 2003). In the IRT approach with polytomous scoring, several models are known, including the Partial Credit Model (PCM), the Graded Response Model (GRM), and the Generalized Partial Credit Model (GPCM). This study uses GPCM analysis. The GPCM is suitable for analyzing multiple-choice data (Si & Schumacker, 2004). This is reinforced by Retnawati (2011), who states that the GPCM is the most suitable model for analyzing test results with polytomous scoring, because items are scored in ordered categories while the step difficulties need not be ordered; a step can be more difficult than the next step.
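To make the GPCM concrete, the following sketch (a minimal illustration, not the PARSCALE implementation used in this study) computes the category response probabilities for a single polytomous item, where `a` is the item discrimination and `b_steps` are the step difficulties:

```python
import math

def gpcm_probs(theta, a, b_steps):
    """Category response probabilities under the Generalized Partial
    Credit Model. `theta` is the ability, `a` the item discrimination,
    and `b_steps` the step difficulties b_1..b_m; category scores run
    0..m, and the cumulative sum for category 0 is 0 by convention."""
    z, cum = [0.0], 0.0
    for b in b_steps:
        cum += a * (theta - b)   # add a*(theta - b_v) for each step passed
        z.append(cum)
    denom = sum(math.exp(v) for v in z)
    return [math.exp(v) / denom for v in z]
```

For example, for a four-category item (as in the TTMC scoring here) with symmetric step difficulties, a test-taker of average ability is equally likely to land in the two middle categories, and a high-ability test-taker is most likely to reach the top category.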
PCM analysis is widely used as an alternative for polytomous data analysis. PCM has been used to analyze students' critical thinking skills (Asysyifa et al., 2019). The results of that study indicate that all items were categorized as good. Further analysis of the estimation of students' critical thinking skills showed that no students had the highest score on critical thinking skills, 1.67% of students had high critical thinking skills, 60% had average critical thinking skills, 1.67% had low critical thinking skills, and 3.33% had the lowest critical thinking skills. Other research using PCM analysis was conducted by Istiyono (2017), which aimed to describe the results of measuring higher-order thinking skills in physics (PhysHOTS). The results showed that each item in the instrument was valid for measuring students' higher-order thinking skills across very low, low, medium, high, and very high categories. Both of these studies focused on PCM analysis, which estimates only one item parameter, namely the item difficulty. In fact, the characteristics of an instrument are represented not only by the difficulty level but also by the discrimination index, which is very important to ensure that the items can distinguish high-ability from low-ability students. Based on this background, this study focuses on item parameter analysis and estimation of the ability of prospective physics teachers on the HOTS instrument.

Table 2 shows that eigenvalues greater than one each indicate a factor; based on this criterion, the HOTS test instrument has three factors. These three factors account for 38.913% of the variance. The eigenvalues are then shown in Figure 1 as a scree plot.

Model Fit Test
Determination of the theoretical model is based on its suitability to the instrument's character. Judging from the character of the instrument developed, the GPCM is well suited to polytomous instruments whose characteristics require each step's difficulty level to be considered when estimating participants' ability (Retnawati, 2011). To support the analysis, however, the model's fit must be tested. The fit can be judged from the probability value (significance, sig): if sig < 0.05, the item is unsuitable or does not fit (Retnawati, 2014). Of the several models considered (GRM, PCM, GPCM2PL, GPCM3PL), the model containing the most fitting items was selected for data analysis. The probability values are obtained from the PARSCALE output. Based on the results of the analysis (see Appendix), the most suitable model is the GPCM2PL.
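The selection rule described above can be sketched as follows. The p-values here are hypothetical placeholders standing in for the significance values read from the PARSCALE output, not values from this study:

```python
# Hypothetical item-fit p-values per candidate model (placeholders
# standing in for the significance column of the PARSCALE output).
model_pvalues = {
    "GRM":     [0.01, 0.20, 0.03, 0.40],
    "PCM":     [0.04, 0.10, 0.02, 0.30],
    "GPCM2PL": [0.12, 0.25, 0.08, 0.44],
    "GPCM3PL": [0.06, 0.02, 0.15, 0.33],
}

def n_fitting_items(pvalues, alpha=0.05):
    """An item fits its model when sig >= alpha (here alpha = 0.05)."""
    return sum(p >= alpha for p in pvalues)

# Select the model under which the largest number of items fit.
best_model = max(model_pvalues, key=lambda m: n_fitting_items(model_pvalues[m]))
```

With these placeholder values, all four items fit under GPCM2PL, so it would be selected, mirroring the decision rule applied in the study.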

Item Parameters
The parameter estimates for the HOTS instrument under the GPCM2PL IRT model are obtained from phase 2 of the PARSCALE program. The estimated difficulty and discrimination parameters for the HOTS Instrument Dynamics test under the GPCM2PL model are presented in Table 3.

Ability Estimation
Students' abilities as measured by the HOTS Instrument Dynamics test are given by the ability estimates in the output of the GPCM2PL IRT analysis. The results of the analysis of students' abilities on the HOTS are presented in Table 4. The distribution of students' abilities on the two tests is presented in a histogram, as shown in Figure 2. Based on Figure 2, it can be concluded that, in general, the distribution of test-takers' abilities is close to the normal curve. Information on item characteristics and estimates of students' abilities also provides information about the test information function: if the test items have high information functions, the test information function will be high. The information functions of the two test devices are presented in Figure 3.
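As an illustration of how a polytomous response pattern maps to an ability estimate, the sketch below computes an expected a posteriori (EAP) estimate under the GPCM with a standard normal prior. It is a minimal stand-in for the estimation PARSCALE performs, and the item parameters are hypothetical:

```python
import math

def gpcm_probs(theta, a, b_steps):
    """GPCM category probabilities for one item (categories 0..m)."""
    z, cum = [0.0], 0.0
    for b in b_steps:
        cum += a * (theta - b)
        z.append(cum)
    denom = sum(math.exp(v) for v in z)
    return [math.exp(v) / denom for v in z]

def eap_theta(responses, items, n_quad=61):
    """EAP ability estimate: posterior mean of theta over a grid on
    [-4, 4] with a standard normal prior. `items` is a list of
    (a, b_steps) pairs; `responses` are the observed category scores."""
    grid = [-4 + 8 * i / (n_quad - 1) for i in range(n_quad)]
    num = den = 0.0
    for t in grid:
        w = math.exp(-0.5 * t * t)  # normal prior, up to a constant
        for x, (a, b_steps) in zip(responses, items):
            w *= gpcm_probs(t, a, b_steps)[x]
        num += t * w
        den += w
    return num / den

# Hypothetical 5-item test with four score categories (0..3) per item.
items = [(1.0, [-1.0, 0.0, 1.0])] * 5
```

A test-taker scoring in the top category on every item receives a high positive ability estimate; one scoring in the bottom category on every item receives the mirror-image negative estimate.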

Discussion
The first stage in data analysis is the dimensionality test. Factor analysis is used to evaluate dimensionality, starting with the KMO test to guarantee the sufficiency of the sample. The sample adequacy test determines whether the sample meets the adequacy criterion (KMO-MSA > 0.5), and the Bartlett test determines whether the data are homogeneous (sig < 0.05), so that factor analysis can be performed. The unidimensionality test using factor analysis and the scree plot found that, in the two scree plots, the eigenvalues decline sharply between factors 1 and 2 and then flatten from factor 3 onward, so that the scree plot almost forms a right angle. The HOTS instrument thus appears to measure at most two dominant factors. Although it appears to measure two factors, the percentage of variance explained by the first factor for the two test sets (38.913%) is greater than 20%. In line with this, the first eigenvalue is about five times the second. Both of these conditions meet the requirements for an instrument to be called unidimensional (Retnawati, 2017; Wells & Purwono, 2009). Based on this test, it can be concluded that the instrument sets contain only a single dimension, i.e., they are unidimensional.

Local independence is another assumption to test. This assumption is met if a participant's response to one item does not affect that participant's responses to the other items (Retnawati, 2014). According to DeMars (2010) and Salkind (2013), local independence can also be inferred from unidimensionality (Retnawati, 2016): if the unidimensional assumption is met, the local independence assumption is also met. Because the unidimensional assumption was met in this study, local independence was also satisfied.
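The eigenvalue-based evidence described above can be sketched as follows. The simulated single-factor responses stand in for the real HOTS data, and the 20%-of-variance and eigenvalue-ratio checks follow the criteria cited in the text:

```python
import numpy as np

def unidimensionality_evidence(scores):
    """Given an (n_persons x n_items) score matrix, return the percent
    of variance carried by the first eigenvalue of the inter-item
    correlation matrix, and the ratio of the first to second eigenvalue."""
    R = np.corrcoef(scores, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # descending order
    pct_first = 100.0 * eigvals[0] / eigvals.sum()
    ratio = eigvals[0] / eigvals[1]
    return pct_first, ratio

# Simulated single-factor data standing in for the real responses:
# 251 "test-takers", 10 "items" loading on one latent ability.
rng = np.random.default_rng(0)
theta = rng.normal(size=(251, 1))
scores = theta + 0.5 * rng.normal(size=(251, 10))

pct_first, ratio = unidimensionality_evidence(scores)
# Evidence for unidimensionality: first factor explains > 20% of the
# variance and the first eigenvalue is several times the second.
is_unidimensional = pct_first > 20.0 and ratio > 4.0
```

With strongly single-factor data like this, the first eigenvalue dominates, so both checks pass, which is the same pattern of evidence reported for the HOTS instrument.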
Based on the model fit test, the results show that the model that fits best, i.e., that provides information for the most items, is the GPCM2PL. This result is in line with Si and Schumacker (2004), who state that the GPCM model is suitable for analyzing multiple-choice data. It is also reinforced by Retnawati (Hidayati & Retnawati, 2011), who states that the GPCM is the most suitable model for analyzing test results with polytomous scoring, because items are scored in ordered categories while the step difficulties need not be ordered; a step can be more difficult than the next step. Therefore, Istiyono asserts that using the GPCM to analyze multiple-choice tests is a fair alternative assessment model in learning (Istiyono et al., 2020).
Further analysis of the item parameters shows that, for the HOTS test items, 100% of the discrimination parameter (a) values fall in the good category, namely the interval 0.00 to 2.00 (Hambleton & Swaminathan, 1985), and the difficulty analysis likewise shows that 100% of items fall in the medium category, because the b values of all items lie in the interval -2 to 2 (Hambleton & Swaminathan, 1985). This indicates that the items that make up the test are worthy of use as a good instrument and can accurately measure students' abilities. The feasibility of the instrument for measuring student abilities can also be viewed from the information function. Based on the results, the HOTS test provides the highest information for students with abilities around -0.3, which is also where the standard error is smallest. In the interval -0.7 to +0.3, the value of the information function is greater than the standard error of measurement (SEM), so the measurement accuracy is considered good (Retnawati, 2014); the smaller the SEM, the greater the reliability of the test (Salkind, 2013). Based on this, the HOTS instrument accurately measures students' ability (θ) in the interval -0.7 to +0.3.

Conclusions
Assessment using test instruments needs to consider the characteristics of the instruments used. This research analyzed the characteristics of the HOTS test instrument items using the polytomous-scoring IRT model that fit the data. Based on the fit test, the most suitable model for the data was the GPCM2PL. The item parameter analysis shows that the discrimination parameter values of 100% of the items fall in the good category, namely the interval 0.00 to 2.00. The difficulty analysis shows that 100% of items fall in the medium category, because the b values of all items lie in the interval -2 to 2.