Validity and reliability of the revised Arabic language test for 2–4-year-old children: cross-sectional study
The Egyptian Journal of Otolaryngology volume 37, Article number: 26 (2021)
Language assessment in children using subjective and objective tests remains a matter of debate. The aim of this study is to revise the Arabic language test (ALT) for the age range from 2 to 4 years and to establish its validity and reliability. The test format and test pictures were redesigned and tried in a pilot study of 30 normal children with no language problems, 15 in each 1-year age group, within the same age range as the standardization sample. The test was then applied to a standardization sample of 400 normal Egyptian children aged 2 to 4 years: 200 aged 2–3 years and 200 aged 3–4 years. Thirty children (15 in each group) were retested after an interval of 2 weeks to assess test-retest reliability. Validity was assessed using internal consistency validity, contrasted group validity, factorial validity, face validity, and judgment validity. For contrasted group validity, a sample of 40 children with delayed language was used.
All validity tests used gave significant scores that proved the high validity of the newly revised test. Also, reliability tests were highly significant.
The newly revised Arabic language test for 2–4-year-olds is a reliable and valid test for evaluating language development and detecting language deficits among Egyptian children in this age range.
Language and speech communication are essential to mental development and academic learning. Language assessments include initial screening, diagnosis of impairment, decision for intervention, outcome measurements, and epidemiological purposes.
Previous studies have identified limitations in the psychometric properties of language assessments for school-aged children [2, 3]. The first edition of the Peabody Picture Vocabulary Test and the Reynell Developmental Language Scales reported validity only by correlation with the Wechsler Intelligence Scale. The Illinois Test of Psycholinguistic Abilities, when introduced in 1967 by Kirk et al., used only face validity and developmental validity. More recently, the revised version of the Peabody Picture Vocabulary Test used better validity measures, namely concurrent validity with the Expressive Vocabulary Test, 2nd edition. It also used better reliability measures: test-retest and internal consistency.
In the Arabic field, the Arabic language test [8, 9] was the first, pioneering test developed for language evaluation of Arabic-speaking children. In 2009, Aboras et al. investigated a language test for children aged 3–6 years. In 2011, Abu Haseeba modified and standardized the Preschool Language Scale 4 (PLS4) for Egyptian Arabic-speaking children. Although that test has fair psychometric properties, it is not comprehensive, as it does not measure some language domains. Also, the pictures used in the test are not sufficiently clear to the child. The items are not arranged from easier to more difficult, which affects reaching a ceiling and yields a false language age. Moreover, individual items do not always measure their intended target specifically.
The Arabic language test was designed and standardized in 1995 by Kotby et al. as the first language test in the Arabic field to evaluate Arabic-speaking children in the age range from 2 to 8 years. It has been used ever since in Egypt and most of the Arabic countries as the only Arabic standardized tool to test language performance in cases of delayed language development. The test measures semantics through picture vocabulary items, syntax both receptively and expressively, pragmatics, and phonology.
In 2004, the test was revised by Rifaie and Hassan; the test items were re-arranged according to age into 6 groups, each representing a 1-year age range from 2 to 8 years. Test reliability and validity were established and a new scoring system was developed.
Over time, some limitations appeared while applying the test. For example, the test was not sensitive enough to detect problems in phonological processing or in pragmatics in an individual child, because the items testing phonology were few and not detailed. Items testing more complex syntax were lacking for the older ages. Also, pragmatics was represented by a small number of items carrying few score points.
Accordingly, the test needed to be revised again to improve its ability to capture details of language development and to obtain a more accurate language age based on a larger sample.
In this study, revision of the test was started on the age range from 2 to 4 years old as a first step. In another study, the authors will continue revising the test for the ages from 4 to 8 years old.
Ethics approval and consent to participate
The study was conducted according to the Declaration of Helsinki on Biomedical Research Involving Human Subjects. Written consent was obtained from the parents of all children included in the study. Patient privacy and confidentiality were protected. Deceptive practices were avoided in designing the research. The participants had the right to withdraw from the study at any time they wished.
This is a cross-sectional study including normal children attending nurseries and kindergartens and patients with delayed language development attending the phoniatrics unit, during the period from October 2017 to May 2019.
Thirty normal children (with no language problems), with age range 2–4 years old, divided into two groups each with fifteen children, were included in the pilot study of the new test design.
Four hundred normal Egyptian children were randomly selected from different public and private nurseries. These were divided into two groups: the first group (group I-A) with age range 2–3 years old and the second group (group I-B) with age range from 3 to 4 years old.
For retest reliability
Thirty children (fifteen in each group) were chosen randomly from the standardization group for retesting after 2 weeks to get data for proving test reliability.
For contrasted group validity
Forty children suffering from delayed language development (randomly chosen in the same age range 2–4 years old) who attended the Phoniatric clinic in Ain Shams University hospitals were included in the validity study.
Inclusion criteria of the standardization sample
The inclusion criteria were Egyptian, native Arabic-speaking children reported by their teachers or caregivers to have good attention, good hearing, good language, and normal mentality. Average intelligence was confirmed by applying the Stanford-Binet intelligence test (Arabic version).
Children with any communication disorder, whether previously diagnosed or not, were excluded.
Test material: All the test pictures were changed from hand-drawn pictures to real photographs (Figure 1, in Appendix) with the help of a professional photographer. Some pictures were replaced to keep up with recent changes. For syntax, some items were added to evaluate syntactic complexity in older children. For example, for age 3–4 years, the verbs and verb tense used in testing were changed together with the pictures (in the original test, the girl was taking the book and sitting, whereas in the revised test a boy was drinking and a girl was drawing). One item was also added to test the child’s ability to express a 4–5-word sentence.
Testing of phonology was expanded to include detection of phonological processes. In the original test, only one item tested phonology, in a general way: whether there were errors, and whether they were few or many. In the revised test, phonology was tested in more detail, distinguishing phonetic errors from phonological errors, which were scored individually.
Pragmatics was also expanded and detailed. In the original test, pragmatics was tested by two questions and a simple conversation. In the revised test, with the help of story pictures, further abilities were tested, such as the child’s ability to attract attention before starting a conversation, to use simplified words with younger children, to ask permission, to fine-tune his/her words to explain himself/herself, and to correct a listener’s errors.
The scoring system accordingly had to be redistributed to allow equal representation of the different language parameters in the total score of the test. For example, for age 3–4 years, the total semantic score was 40, the total receptive syntax score was 40, the total expressive syntax score was 40, the total pragmatic score was 40, and the total phonology score was 40. This was not the case in the original test, where the syntax scores made up a larger part of the whole test.
Tables of raw scores for each section, with the corresponding T-scores, are used to detect the presence and the degree of language delay in each section.
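As an illustration of how a raw section score can be mapped to a T-score, the sketch below uses the standard linear definition (T = 50 + 10z, where z is the score's distance from the norm-group mean in standard deviations). The norm mean and standard deviation are hypothetical placeholders, not values from the ALT tables, and the published tables may have been derived differently.

```python
def t_score(raw_score, norm_mean, norm_sd):
    """Linear T-score: mean 50 and SD 10 in the norm group."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw_score - norm_mean) / norm_sd  # standard (z) score
    return 50 + 10 * z


# Hypothetical norms for one section (NOT taken from the ALT tables).
HYPOTHETICAL_MEAN = 30.0
HYPOTHETICAL_SD = 5.0

for raw in (20, 25, 30, 35):
    print(raw, t_score(raw, HYPOTHETICAL_MEAN, HYPOTHETICAL_SD))
```

Under these assumed norms, a raw score of 20 lies two standard deviations below the mean, which corresponds to a T-score of 30.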
Pilot study: After preparing the test material, the test was tried on a small group of children (15 from each 1-year age range; total 30) to check the clarity of the new pictures and the wording of each item. Accordingly, some changes were made.
Test application: The test was applied to the standardization sample (200 from each 1-year age range; total 400). Scores were used to test validity and reliability.
Testing procedure: Test items were presented in colloquial Arabic. Assistance was not allowed during testing. Children had to take all test items, and the test took about 15–20 min. The test environment should be a quiet, well-lit, and well-ventilated room with no distracting elements.
Tests of validity: Validity was assessed by five methods. Internal consistency measures the homogeneity of the test itself by correlating the test scores with the total score. Contrasted group validity correlates the test scores of a group of children with delayed language development (20 from each 1-year age range; total 40) with those of the normal children of the original standardization sample. Factorial validity is a refined statistical technique for analyzing the interrelationships of data; the test can be characterized in terms of the major factors determining its scores, together with the weight or loading of each factor and the correlation of the test with each factor. For judgment validity, the test was shown to seven judges experienced in the subject of the test, who gave their opinion on the appropriateness of individual test items. Face validity is concerned with what the test appears superficially to measure, that is, whether the test “looks valid” to examinees who take it, the administrative personnel who decide on its use, and other technically untrained observers.
Tests of reliability: The reliability coefficient is a quantitative expression of the reliability or consistency of the measurement of test scores. For test-retest reliability, 15 children in each group were tested with the Arabic language test and then re-evaluated with the same test by the same clinician after a 2-week interval, and the scores of the two administrations were correlated (Pearson correlation). The split-half method (correlation between forms, Spearman-Brown coefficient, and Guttman split-half coefficient) and Cronbach’s alpha were also used.
Statistical analysis was done using SPSS (Statistical Package for Social Sciences) version 20. Quantitative variables were presented as means and standard deviations. The following were done: tests of validity (internal consistency, contrasted group validity, factorial validity, judgment validity, and face validity) and tests of reliability (test-retest; the split-half method with correlation between forms, Spearman-Brown coefficient, and Guttman split-half coefficient; and Cronbach’s alpha). Pearson’s correlation test was used to detect the relation between 2 variables, and P was considered significant when P < 0.05.
Descriptive statistics of the results of the Arabic language test of the two age groups:
Age groups: Table 1 shows the age groups
Tests of validity
It was measured by the following methods:
Judgment validity: All judges agreed upon the validity and fitness of the test for the purpose it was designed for, after advising small changes in the patterns of questioning of some items, as well as the order of presentation of test items.
Face validity: From the superficial point of view, the test appears to be valid, since it measures various domains of language including syntax, semantics, pragmatics and phonology, with all details receptively and expressively.
Internal consistency: This is a measure of the homogeneity of the test contents. The internal structure of the Arabic language test was examined by correlating the total score of each item with the total score of the whole test.
Using Pearson’s correlation, the correlation coefficients for internal consistency were highly statistically significant, which indicates strong internal consistency of the test (see Tables 4 and 5).
Internal consistency proved that all test items are valid.
Contrasted group validity: This was assessed by correlating the scores of the control groups (group I-A and group I-B) with those of the groups with delayed language development (group II-A and group II-B) (see Tables 6 and 7).
Contrasted group validity proved that all items are valid.
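Contrasted-group comparisons are commonly summarized by testing whether the mean scores of the two groups differ. The sketch below computes Welch's t statistic for two independent groups; this is an assumed illustration of one such comparison, not necessarily the procedure the authors ran in SPSS, and the scores are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups (unequal variances allowed)."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical total scores, for illustration only (not study data).
typical = [172, 165, 180, 158, 169, 175]
delayed = [96, 110, 88, 104, 99, 92]

print(round(welch_t(typical, delayed), 2))
```

A large positive statistic here would mean the typically developing group scored well above the delayed group, which is the pattern a test with contrasted group validity is expected to show.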
Regarding factorial validity, factors were extracted from the correlations among the test items and analyzed using Hotelling’s principal component method.
Only factors with an eigenvalue of at least 1 were retained.
A factor was considered a main one if at least 3 test items loaded significantly on it.
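The retention rule can be sketched as follows, assuming it means keeping factors whose eigenvalue is at least 1. For a correlation matrix the eigenvalues sum to the number of items, so each factor's share of variance is its eigenvalue divided by that sum. The eigenvalues below are hypothetical and are not taken from Tables 8 or 9.

```python
def retained_factors(eigenvalues, threshold=1.0):
    """Return (eigenvalue, percent_of_variance) for factors at or above the threshold."""
    total = sum(eigenvalues)  # equals the number of items for a correlation matrix
    ordered = sorted(eigenvalues, reverse=True)
    return [(ev, 100.0 * ev / total) for ev in ordered if ev >= threshold]


# Hypothetical eigenvalues for a 5-item correlation matrix (they sum to 5).
eigenvalues = [2.5, 1.2, 0.7, 0.4, 0.2]

for ev, pct in retained_factors(eigenvalues):
    print(f"eigenvalue {ev:.2f} explains {pct:.1f}% of variance")
```

With these assumed values, two factors pass the threshold and together account for 74% of the variance.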
For group I-A, eleven factors were extracted and the % of variance for them ranged between 2.995 and 27.98. The variance % of the main 5 items of the test was 75.74 (see Table 8).
For group I-B, seven factors were extracted and the % of variance for them ranged between 1.89 and 58.58. The variance % of the main 5 items of the test was 78.84 (see Table 9).
These results show that all test items are valid, since they were all highly loaded on the common factors of the test.
Tests of reliability
Reliability coefficient is a quantitative expression of the reliability or internal consistency in the measurement of test scores.
It was measured by the following:
Pearson correlation was used to assess the test-retest reliability of the Arabic language test in groups I-A and I-B (see Table 10).
The correlation between the whole-test scores of the 1st and 2nd administrations was statistically highly significant, indicating that the test is highly reliable.
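A minimal sketch of the test-retest calculation: Pearson's r between the scores from the first and second administrations. The scores are hypothetical and serve only to show the computation; the study's coefficients are those reported in Table 10.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical whole-test scores for six children, two weeks apart.
first_session = [150, 162, 171, 158, 180, 166]
second_session = [153, 160, 175, 157, 182, 169]

print(round(pearson_r(first_session, second_session), 3))
```

A coefficient close to 1 means children kept nearly the same rank order across the two sessions.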
Split-half reliability was used for the Arabic language test for the entire sample of group I-A.
Correlation between forms was 0.587, Spearman-Brown coefficient was 0.740, and Guttman Split-Half coefficient was 0.740 revealing high consistency and reliability (see Table 11).
Split-half reliability was also calculated for the Arabic language test for the entire sample of group I-B. Correlation between forms was 0.893, Spearman-Brown coefficient was 0.943, and Guttman Split-Half coefficient was 0.497, revealing high consistency and reliability (see Table 12).
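The Spearman-Brown coefficient follows directly from the correlation between the two half-forms, using the standard formula r_SB = 2r / (1 + r). The short check below applies that formula to the reported correlations between forms; the function name is illustrative.

```python
def spearman_brown(r_half):
    """Step up a half-test correlation to full-test reliability (equal-length halves)."""
    return 2 * r_half / (1 + r_half)


# Correlations between forms reported for groups I-A and I-B.
print(round(spearman_brown(0.587), 3))  # 0.74
print(round(spearman_brown(0.893), 3))  # 0.943
```

Both results agree with the Spearman-Brown coefficients reported for the two groups.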
Internal consistency reliability was determined using Cronbach’s alpha coefficient. Cronbach’s alpha for the Arabic language test was 0.896 for the entire sample of group I-A (aged 2–3 years) and 0.729 for group I-B (aged 3–4 years), indicating high reliability of this tool (see Table 13).
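For readers unfamiliar with the coefficient, the sketch below computes Cronbach's alpha from a small score matrix using the standard formula alpha = k/(k-1) × (1 - sum of component variances / variance of total scores). The data are hypothetical; whether the authors computed alpha over items or over section scores is not stated, so the three columns are simply assumed components.

```python
def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)


def cronbach_alpha(rows):
    """rows: one list of k component scores per child."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 components and 2 children")
    component_vars = [sample_variance([row[j] for row in rows]) for j in range(k)]
    total_var = sample_variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(component_vars) / total_var)


# Hypothetical scores: 5 children x 3 components (not study data).
scores = [
    [30, 28, 25],
    [35, 33, 30],
    [22, 20, 24],
    [38, 36, 33],
    [27, 29, 26],
]

print(round(cronbach_alpha(scores), 3))
```

Alpha approaches 1 when the components rise and fall together across children, and drops when they vary independently.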
Tests are usually revised to refine them for better application and more accurate evaluation of results. A revision may replace items that proved poorly discriminating in long-term use, update or add materials that have become dated, re-arrange items in a better order, add items that give a better evaluation of a certain modality, add items that suit a younger or an older age range, or increase the sample size to strengthen the reliability and validity measures of the test.
The Boehm Test of Basic Concepts, 3rd edition, is a revised form of the original 1971 test that extended the number of concepts being tested and achieved good validity information that was not reported in the original test.
The Test of Auditory Comprehension of Language-4 (TACL-4), in its revision by Carrow, extended the age range of the normative sample and addressed floor and ceiling effects by adding easy and difficult items.
The Arabic language test revised-2 (ALT-revised 2) (for 2–4 years old) in the present study used a larger, current normative sample covering a wide variety of socioeconomic levels in Greater Cairo. The new test tackled two domains of language that have probably been neglected by most of the available standardized language tests: pragmatics and phonology. Thus, the test became more comprehensive. Few of the available language tests present items related to pragmatics and phonology.
Reliability of the test was proved by all measures and they all revealed a highly reliable test.
Validity was proved by five methods: face validity, internal consistency validity, judgment validity, factorial validity, and contrasted group validity. All methods proved high validity of the test under study. Factorial validity is the most powerful proof of test validity. Few of the available tests for evaluating child language used factorial validity in their psychometric statistics. The comprehensive language test used only construct-, content-, and criterion-related validity measures, while its reliability was proved by test-retest and internal consistency. The modified Preschool Language Scale 4 (mPLS4) did not use factorial validity but only developmental, internal consistency, and contrasted group validity measures. The Arabic Token test for children (A-TTFC) was translated and validated to measure only receptive language impairments in children. Construct validity of the Arabic Token test for children was tested using factor analysis, which supported its appropriateness for assessing Arabic-speaking children with receptive language problems.
The current revised Arabic Language Test has shown highly significant results on measures of reliability and validity. This qualifies the test for routine use to evaluate the language of Arabic-speaking children and to detect subtle deficits in any domain of language, which helps in building therapy programs for children with delayed language development.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
- ALT: Arabic language test
- ALT-revised 2: The Arabic Language Test revised-2
- A-TTFC: The Arabic Token test for children
- mPLS4: The modified Preschool Language Scale 4
- TACL-4: The Test of Auditory Comprehension of Language-4
Paul R, Norbury CF (2012) Assessment in language disorders from infancy through adolescence (chapter 2). In: Paul R, Norbury CF (eds) Listening, speaking, reading, writing and communicating, 4th edn. Mosby Elsevier, St. Louis, pp 22–60
Spaulding TJ, Plante E, Farinella KA (2006) Eligibility criteria for language impairment: is the low end of normal always appropriate? Lang Speech Hear Serv Sch 37(1):61–72. https://doi.org/10.1044/0161-1461(2006/007)
Friberg JC (2010) Consideration for test selection. How do validity and reliability impact diagnostic decisions? Child Lang Teach Ther 26(1):77–92. https://doi.org/10.1177/0265659009349972
Dunn LM, Dunn LM (1981) Peabody picture vocabulary test-revised. American Guidance Service, Circle Pines
Reynell J, Gruber C (1990) Reynell developmental language scales. Western Psychological Services, Los Angeles
Kirk SA, McCarthy J, Kirk WD (1967) The Illinois test of psycholinguistic abilities. University of Illinois Press, Urbana
Dunn LM, Dunn DZ (2007) Peabody picture vocabulary test – fourth edition. Pearson, San Antonio
Rifaie N (1994) The construction of an Arabic test to evaluate child language: MD thesis submitted to phoniatric unit. Ain Shams University, Cairo
Kotby MN, Khairy A, Barakah M, Rifaie N, Elshobary A (1995) Language testing of Arabic speaking children: proceedings of XXIII world congress of the international Association of Logopedics and Phoniatrics. Ain Shams University, Cairo
Aboras Y, Aref S, El-Ragy A, Gaber O, El-Maghraby R (2009) Comprehensive Arabic language test: MD thesis submitted to phoniatric unit. Alexandria University, Alexandria
Abu Haseeba A, El Sady S, Elshobary A, Gamal N, Ibrahim M, Abd El-Azeem A (2011) Standardization, translation and modification of the preschool language scale – 4: MD thesis submitted to phoniatric unit. Ain Shams University, Cairo
Rifaie N, Hassan S (2004) The Arabic language test – revised. Benha Medic J 21(2):205–216
Faraj (2010) Stanford-Binet intelligence scales (SB5), 5th edn. the Anglo Egyptian Bookshop, Cairo
Boehm AE (2000) Boehm test of basic concepts – revised. The psychological corporation, New York
Carrow EW (2018) TACL_4. Test for auditory comprehension of language – fourth edition. PRO – ED inc
Alkhamra RA, Al Jazi AB (2016) Validity and reliability of the Arabic token test for children. Int J Lang Commun Disord 51(2):183–191. https://doi.org/10.1111/1460-6984.12198
Ethics approval and consent to participate
Informed written consent was obtained from the parents of the children before enrollment in the study. The study protocol was approved by Ain Shams institute’s ethical committee of human research in September 2017. The committee’s reference number is not available. The participants had the right to withdraw from the study at any time they wished.
Consent for publication
The authors have no competing interest.
Cite this article
Rifaie, N., Hamza, T.M.A.W. & Elfiky, Y.H. Validity and reliability of the revised Arabic language test for 2–4-year-old children: cross-sectional study. Egypt J Otolaryngol 37, 26 (2021). https://doi.org/10.1186/s43163-021-00088-8
- Speech language pathology
- Speech language impairment
- Speech language assessment
- Test battery