Validity and reliability of the revised Arabic language test for 2–4-year-old children: cross-sectional study

Language assessment in children using subjective and objective tests remains a subject of debate. The aim of this study was to revise the Arabic language test (ALT) for the age range from 2 to 4 years and to establish its validity and reliability. The test format and test pictures were redesigned and tried in a pilot study of 30 normal children with no language problems, 15 in each 1-year age group, within the same age range as the standardization sample. The test was then applied to a standardization sample of 400 normal Egyptian children aged 2 to 4 years: 200 aged 2–3 years and 200 aged 3–4 years. Thirty children (15 in each group) were retested after an interval of 2 weeks to assess test-retest reliability. Validity was assessed using internal consistency validity, contrasted group validity, factorial validity, face validity, and judgment validity. For contrasted group validity, a sample of 40 children with delayed language was used. All validity measures gave significant results, supporting the high validity of the newly revised test. The reliability measures were also highly significant. The newly revised Arabic language test for 2–4-year-olds is a reliable and valid test for evaluating language development and detecting language deficits among Egyptian children in that age range.


Background
Language and speech communication are essential to mental development and academic learning. Language assessments serve several purposes: initial screening, diagnosis of impairment, decisions about intervention, outcome measurement, and epidemiological study [1].
Previous studies have identified limitations in the psychometric properties of language assessments for school-aged children [2,3]. The first edition of the Peabody Picture Vocabulary Test [4] and the Reynell Developmental Language Scales [5] reported validity only by correlation with the Wechsler Intelligence Scale. The Illinois Test of Psycholinguistic Abilities, when introduced in 1967 by Kirk et al. [6], used only face validity and developmental validity. More recently, the revised version of the Peabody Picture Vocabulary Test [7] used better validity measures, namely concurrent validity with the Expressive Vocabulary Test, 2nd edition. It also used better reliability measures: test-retest and internal consistency.
In the Arabic field, the Arabic language test [8,9] was the first test developed for language evaluation of Arabic-speaking children. In 2009, Aboras et al. [10] investigated a language test for children aged 3-6 years. In 2011, Abu Haseeba [11] modified and standardized the Preschool Language Scale 4 (PLS4) on Egyptian Arabic-speaking children. Although that test has fair psychometric properties, it is not comprehensive, as it does not measure some language domains. Also, the pictures used in the test are not very clear to the child. The items are not arranged from easier to more difficult, which affects reaching a ceiling and yields a false language age. Moreover, some items do not accurately measure the specific target they are meant to test.
The Arabic language test was designed and standardized in 1995 by Kotby et al. [9] as the first language test in the Arabic field to evaluate Arabic-speaking children in the age range from 2 to 8 years. It has been used ever since in Egypt and most of the Arabic countries as the only standardized Arabic tool to test language performance in cases of delayed language development. The test measures semantics through picture vocabulary items, syntax both receptively and expressively, pragmatics, and phonology.
In 2004, the test was revised by Rifaie and Hassan [12]; the test items were re-arranged according to age into 6 groups, each representing a 1-year age range from 2 to 8 years. Test reliability and validity were established and a new scoring system was reached.
Over time, some limitations appeared while applying the test. For example, the test was not sensitive enough to detect a problem in phonological processing or in pragmatics in a specific child, because the items testing phonology were few and not detailed. Items testing more complex syntax were lacking for the older ages. Also, pragmatics was represented by a small number of items and a small share of the score.
Accordingly, the test needed to be revised again to improve its ability to diagnose details of language development and to obtain a more accurate language age based on a larger sample.
In this study, revision of the test was started on the age range from 2 to 4 years old as a first step. In another study, the authors will continue revising the test for the ages from 4 to 8 years old.

Ethics approval and consent to participate
The study was conducted according to the Declaration of Helsinki on Biomedical Research Involving Human Subjects. Written consent was obtained from the parents of all children included in the study. Patient privacy and confidentiality were protected. Deceptive practices were avoided in designing the research. The participants had the right to withdraw from the study at any time they wished.

Study design
This is a cross-sectional study that included normal children attending nurseries and kindergartens and patients with delayed language development attending the phoniatrics unit, during the period from October 2017 to May 2019.

Pilot sample
Thirty normal children (with no language problems), with age range 2-4 years old, divided into two groups each with fifteen children, were included in the pilot study of the new test design.

Standardization sample
Four hundred normal Egyptian children were randomly selected from different public and private nurseries. These were divided into two groups: the first group (group I-A) with age range 2-3 years old and the second group (group I-B) with age range from 3 to 4 years old.

For retest reliability
Thirty children (fifteen in each group) were chosen randomly from the standardization group for retesting after 2 weeks to get data for proving test reliability.

For contrasted group validity
Forty children suffering from delayed language development (randomly chosen in the same age range 2-4 years old) who attended the Phoniatric clinic in Ain Shams University hospitals were included in the validity study.

Inclusion criteria of the standardization sample
Inclusion criteria were Egyptian, native Arabic-speaking children reported by their teachers or caregivers to have good attention, good hearing, good language, and normal mentality. Average intelligence was confirmed by applying the Stanford-Binet intelligence test, Arabic version [13].
Any diagnosed or undiagnosed communication disorders were excluded.
A. Test material: All the test pictures were changed from hand-drawn pictures to real photographs (Figure 1, in Appendix), with the help of a professional photographer. Some pictures were changed from the original ones to keep pace with recent changes.
For syntax, some items were added to evaluate syntactic complexity in older children. For example, for age 3-4 years, the verbs used in testing and the verb tense were changed together with the pictures (the original test showed a girl taking a book and sitting, whereas the revised test shows a boy drinking and a girl drawing). Also, one item was added to test the child's ability to express a 4-5-word sentence.
Testing of phonology was expanded to include detection of phonological processes. In the original test, only one item tested phonology in a general way: whether there were errors, and whether they were few or many. In the revised test, phonology is tested in more detail, distinguishing phonetic errors from phonological errors, which are scored individually.
Pragmatics was also expanded and detailed. In the original test, pragmatics was tested by two questions and a simple conversation. In the revised test, with the help of story pictures, further items test the child's ability to attract attention before starting a conversation, to use simplified words with younger children, to ask permission, to fine-tune his/her words to explain himself/herself, and to correct a listener's errors.
The scoring system accordingly had to be re-distributed to allow equal representation of the different language parameters in the total score of the test. For example, for age 3-4 years, the total scores for semantics, receptive syntax, expressive syntax, pragmatics, and phonology were 40 each. This was not the case in the original test, where the syntax scores carried a larger share of the whole test.
Tables were prepared giving the raw scores for each section with the corresponding T-scores, to detect the presence and the degree of language delay in each section.
B. Pilot study: After the test material was prepared, the test was tried on a small group of children (

Statistical analysis
Statistical analysis was performed using SPSS (Statistical Package for the Social Sciences) version 20. Quantitative variables were presented as means and standard deviations. The following were performed: tests of validity (internal consistency, contrasted group validity, factorial validity, judgment validity, and face validity) and tests of reliability, namely test-retest reliability, the split-half method (correlation between forms, Spearman-Brown coefficient, and Guttman split-half coefficient), and Cronbach's alpha. Pearson's correlation test was used to detect the relation between two variables, and P < 0.05 was considered significant.

Results
A. Descriptive statistics of the results of the Arabic language test of the two age groups:
1. Age groups: Table 1 shows the age groups.
2. Raw scores: Tables 2 and 3 show the raw scores, means, and standard deviations of the two groups.

A. Tests of validity
Validity was measured by the following methods:
1. Judgment validity: All judges agreed on the validity and fitness of the test for the purpose it was designed for, after advising small changes in the way some items were questioned and in the order of presentation of the test items.
2. Face validity: On its face, the test appears to be valid, since it measures the various domains of language, including syntax, semantics, pragmatics, and phonology, in detail, both receptively and expressively.
3. Internal consistency: This is a measure of the homogeneity of the test contents. The internal structure of the Arabic language test was examined by correlating the total score of each item with the total score of the whole test.
Using Pearson's correlation, the correlation coefficients for internal consistency were highly statistically significant, which supports the strong internal consistency of the test (see Tables 4 and 5). Internal consistency showed that all test items are valid.

Contrasted group validity: It was assessed by correlating the control groups (group I-A and group I-B) with the groups of cases of delayed language development (group II-A and group II-B) (see Tables 6 and 7).
Contrasted group validity showed that all items are valid.

Factorial validity
For factorial validity, correlation factors were extracted for the test items and analyzed using Hotelling's principal component method.
Only factors with an eigenvalue of 1 were taken.
A factor was considered a main factor if at least 3 test items loaded significantly on it.
For group I-A, eleven factors were extracted, and their percentage of variance ranged between 2.995 and 27.98. The percentage of variance of the main 5 items of the test was 75.74 (see Table 8).
For group I-B, seven factors were extracted, and their percentage of variance ranged between 1.89 and 58.58. The percentage of variance of the main 5 items of the test was 78.84 (see Table 9).
These results show that all test items are valid, since they were all highly loaded on the common factors of the test.
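The factor-retention rule described above can be illustrated with a minimal sketch. It assumes the usual reading of the eigenvalue criterion (keep factors whose eigenvalue is at least 1) and uses hypothetical eigenvalues for a 6-item correlation matrix; neither the function name nor the numbers come from the study. In a principal component analysis of a correlation matrix the eigenvalues sum to the number of items, so each factor's percentage of variance is its eigenvalue divided by that sum.

```python
def retained_factors(eigenvalues, threshold=1.0):
    """Return (eigenvalue, percent_of_variance) for each retained factor.

    Assumption: a factor is retained when its eigenvalue is >= threshold.
    """
    total = sum(eigenvalues)  # equals the number of items for a correlation matrix
    return [(ev, 100 * ev / total) for ev in eigenvalues if ev >= threshold]


# Hypothetical eigenvalues (illustration only, not the study's data).
eigs = [2.7, 1.5, 0.9, 0.5, 0.3, 0.1]
kept = retained_factors(eigs)
cumulative = sum(pct for _, pct in kept)

for ev, pct in kept:
    print(f"eigenvalue {ev:.2f} explains {pct:.1f}% of variance")
print(f"cumulative variance of retained factors: {cumulative:.1f}%")
```

With these made-up values, two factors pass the threshold and together account for about 70% of the variance, which is the kind of cumulative figure reported for the main factors in each group.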

Tests of reliability
Reliability coefficient is a quantitative expression of the reliability or internal consistency in the measurement of test scores. It was measured by the following:

Test-retest method
Pearson correlation was used to assess the test-retest reliability of the Arabic language test for groups I-A and I-B (see Table 10).
The correlation between the whole-test scores for the 1st and 2nd administrations was statistically highly significant, indicating that the test is highly reliable.
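As an illustration of the test-retest calculation, the following sketch computes Pearson's correlation between two administrations of a test. The scores are hypothetical and the function name `pearson_r` is an assumption for this example; the study itself used SPSS, which also reports the significance level that this sketch omits.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical whole-test scores for six children (illustration only).
first_administration = [112, 95, 130, 101, 120, 88]
second_administration = [115, 97, 128, 104, 118, 90]

r = pearson_r(first_administration, second_administration)
print(f"test-retest r = {r:.3f}")
```

A coefficient close to 1 means the children kept nearly the same rank order and spacing across the two sessions, which is what a 2-week retest is meant to show.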

Split-half method
Split-half reliability was calculated for the Arabic language test for the entire sample of group I-A.
Correlation between forms was 0.587, the Spearman-Brown coefficient was 0.740, and the Guttman split-half coefficient was 0.740, revealing high consistency and reliability (see Table 11).
Split-half reliability was also calculated for the entire sample of group I-B. Correlation between forms was 0.893, the Spearman-Brown coefficient was 0.943, and the Guttman split-half coefficient was 0.497, revealing high consistency and reliability (see Table 12).
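The relation between the correlation between forms and the Spearman-Brown coefficient can be checked with a short sketch. It applies the standard equal-length Spearman-Brown formula, 2r / (1 + r), to the two correlations reported above; the function name is an assumption for this example, and the Guttman coefficient is not reproduced because it needs the variances of the two halves, which are not given here.

```python
def spearman_brown(r_halves: float) -> float:
    """Step a half-test correlation up to an estimate of full-test reliability."""
    return 2 * r_halves / (1 + r_halves)


# Correlations between forms as reported for the two groups.
reported = {"I-A": 0.587, "I-B": 0.893}

for group, r in reported.items():
    print(f"group {group}: Spearman-Brown = {spearman_brown(r):.3f}")
# group I-A: Spearman-Brown = 0.740
# group I-B: Spearman-Brown = 0.943
```

Both values match the Spearman-Brown coefficients reported for groups I-A and I-B.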

Cronbach's alpha
Internal consistency reliability was determined using Cronbach's alpha coefficient. Cronbach's alpha for the Arabic language test was 0.896 for the entire sample of group I-A (aged 2-3 years) and 0.729 for group I-B (aged 3-4 years), indicating high reliability of this tool (see Table 13).
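For readers unfamiliar with the coefficient, the sketch below shows one common way to compute Cronbach's alpha from a score matrix: alpha = k / (k - 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The data are hypothetical section scores for six children, and the function name and data layout are assumptions for this example, not the study's procedure.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows, one row per child, one column per item."""
    k = len(scores[0])
    # Variance of each item (column) across children.
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of each child's total score.
    total_variance = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores on five test sections (illustration only).
scores = [
    [32, 30, 28, 31, 25],
    [20, 22, 18, 19, 15],
    [38, 36, 35, 37, 30],
    [27, 25, 26, 24, 22],
    [15, 18, 14, 16, 12],
    [35, 33, 30, 34, 28],
]

alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha = {alpha:.3f}")
```

The same variance estimator is used in the numerator and the denominator, so the choice between population and sample variance does not change the result.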

Discussion
This study presents a second revision of the Arabic language test (ALT) that was designed and standardized in 1995 by Kotby et al. [9] and first revised by Rifaie and Hassan in 2004 [12].
Tests are usually revised to refine them for better application and more accurate evaluation of results. A revision may change items that proved poorly discriminating in long-term use, change or add materials that have become outdated, re-arrange items in a better order, add items that give a better evaluation of a certain modality, add items that suit a younger or older age range, or increase the sample size to strengthen the reliability and validity measures of the test.
The Boehm test of Basic Concepts 3rd edition [14] is a revised form of the original test in 1971 that extended the number of concepts being tested and achieved good validity information that was not reported in the original test.
The Test of Auditory Comprehension of Language-4 (TACL-4) in its revision by Carrow [15] extended the age range of the normative sample and addressed floor effects and ceiling effects by adding easy and difficult items.
The Arabic language test revised-2 (ALT-revised 2) (for 2-4 years old) in the present study used a larger, current normative sample covering a wide variety of socioeconomic levels in Greater Cairo. The new test tackled two domains of language that have probably been neglected by most of the available standardized language tests: pragmatics and phonology. Few of the available language tests present items related to these two domains, so the test became more comprehensive.
Reliability of the test was assessed by all of the measures used, and they all indicated a highly reliable test.
Validity was assessed by five methods: face validity, internal consistency validity, judgment validity, factorial validity, and contrasted group validity. All methods indicated high validity of the test under study. Factorial validity is the most powerful evidence of test validity, yet few of the available tests for evaluating child language used factorial validity in the statistics for their psychometric properties. The comprehensive language test [10] used only construct-, content-, and criterion-related validity measures, while its reliability was established by test-retest and internal consistency. The modified Preschool Language Scale 4 (mPLS4) [11] did not use factorial validity but only developmental, internal consistency, and contrasted group validity measures. The Arabic Token test for children (A-TTFC) was translated and validated to measure only receptive language impairments in children [16]. Its construct validity was tested using factorial analysis, which supported its appropriateness for assessing Arabic-speaking children with receptive language problems.

Conclusions
The current revised Arabic Language Test showed highly significant results on measures of reliability and validity, which qualifies it to be used in practice to evaluate the language of Arabic-speaking children and to detect subtle deficits in any of the domains of language, helping to build therapy programs for children with delayed language development.