How are IQ tests created?
Ever wondered how IQ tests are created?
In this section, you will explore the procedures behind the creation of the Brain Assessment Center IQ test (and behind any other online or offline IQ test).
After reading it, you will understand the most important test development concepts and know how IQ tests work, including the Brain Assessment Center (BAC) IQ test and other reliable online/digital tests, Raven’s Progressive Matrices, and the Wechsler scales.
Of course, it could take us ages, hundreds of pages, and a master’s degree in psychometrics to explain all the intricacies of these procedures, so what we will be providing here is a very brief, general, and easy summary.
There are two main psychometric theories of test construction: classical test theory and item response theory. As said, what we will see here is an oversimplification, based mostly on classical test theory.
Let’s dive in!
What conditions must IQ tests fulfill?
To be valid, an IQ test must fulfill two conditions (we are simplifying here):
1) The sample of test-takers must be representative of the general population, and
2) The sets of items used must have psychometric validity and reliability.
(Again, the whole science behind it is more complex; this is just a simplification outlining the most important factors.)
Let’s explain these two points in detail.
A representative sample is a subset of a group that seeks to accurately reflect the characteristics of a larger group.
For example, a university classroom of 50 students, 25 female and 25 male, could generate a representative sample of 5 males and 5 females (it is representative because the target variable of the example, gender, is present in the same proportion as in the larger group).
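The classroom example can be sketched in a few lines of Python. This is only an illustration of proportional (stratified) sampling, not the procedure used by any particular test. The function name `stratified_sample`, the record layout, and the 20% sampling fraction are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_sample(people, key, fraction, seed=0):
    """Draw the same fraction from every group defined by `key` (hypothetical helper)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for person in people:
        groups[person[key]].append(person)
    sample = []
    for members in groups.values():
        # Keep each group's share of the sample equal to its share of the population.
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Classroom of 50 students: 25 female, 25 male (made-up records).
classroom = [{"id": i, "gender": "female" if i < 25 else "male"} for i in range(50)]
sample = stratified_sample(classroom, key="gender", fraction=0.2)
counts = {g: sum(1 for p in sample if p["gender"] == g) for g in ("female", "male")}
print(counts)  # {'female': 5, 'male': 5}
```

The same idea extends to several variables at once (for example, age group and country) by grouping on a combination of keys.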
Of course, gender is not the only variable that must be taken into account when developing IQ tests. The sample must also be representative of at least ethnicity/country, age, and performance/intellectual ability, among others.
The latter means that, since IQ tests generate scores by comparing your performance against the performance of others, these other test-takers must include people from all intellectual/IQ levels.
For instance, if your performance were compared only against Einsteins, your resulting score would be very low, well below average, and your ability would be measured wrongly!
In the past, finding samples of test-takers who were also representative of the general population was very hard and costly.
However, nowadays, thanks to the Internet and the new digital technologies, it is possible to obtain thousands of input data points (that is, thousands of participants) in a matter of days.
And thanks to the power of algorithmic classification, it is possible to assess and separate these test-takers into different representative groups, by variables such as gender, age, nationality, or performance.
Last, “items with psychometric validity”, oversimplifying, means that the items of an IQ test are valid and reliable, or in other words, 1) that they measure what they are designed to measure and 2) that they give the same (or very similar) results each time they are taken by the same person.
Let’s see what each of these points means in more detail:
-Validity: The tests measure what they’re designed to measure. In layman’s terms, this means that the items must assess IQ/intellectual ability, and not something different (e.g., plain knowledge or someone’s ability to dance).
There are several ways to psychometrically assess this property. For instance, when designing an IQ test, one way to analyze whether the items are valid is to administer to our test-takers another IQ test that has already been validated. If the correlation between the scores on that IQ test and ours is strong enough (above 0.8, but ideally around 0.9), we have good evidence that our test is valid.
It is measuring the same thing as a test that has already been validated, therefore our test is valid too.
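As a rough sketch of this check, the snippet below computes the Pearson correlation between raw scores on a new test and scores on an already-validated test. The two score lists are invented for illustration, and `pearson_r` is a hand-written helper rather than part of any specific psychometrics package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: the same 8 people took both tests.
new_test_raw = [12, 15, 9, 20, 17, 11, 14, 18]         # correct answers on our test
validated_iq = [95, 104, 88, 125, 112, 92, 101, 118]  # scores on the validated test

r = pearson_r(new_test_raw, validated_iq)
print(f"r = {r:.3f}")
print("above the 0.8 threshold:", r > 0.8)
```

In practice the sample would be far larger than eight people, and the correlation would normally be reported together with its statistical significance.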
-Reliability: the property of a measurement tool to produce consistent measurements across repeated observations of the same sample. For instance, if the same person is given a widely different score each time they take an IQ test, the test has no reliability. Likewise, if a thermometer gives a very different number each time we measure the temperature of an object (assuming that the temperature of that object is not really varying), then there is something wrong with that thermometer.
How are IQ tests made? The process of creating an IQ test
In order to make an IQ test, first of all, we need to collect a panel of experts to develop the items of the test, and then another group of experts to review them. By the way, we have prepared another article explaining what IQ tests are, in case you don’t fully know what these tools are or how they are used.
In the beginning, a large sample of items is developed. The number of items must be considerably higher than the number of items we will want our final version of the test to have.
Once all the items of the test have been developed, we need a sample of test-takers.
We will give them the test, the one with all the extra items we have created.
We will also give them an already-validated test, in order to test for the validity of the scores of the test under development.
Now we have the following data:
- The number of questions each person answered correctly in our test
- The scores of those people in the already-validated test
Remember that IQ tests produce their scores by comparing your performance (your number of correct answers) with the performance of all the other test-takers (the mean number of correct answers).
Now that we know how many questions each person answered correctly, we can compute the mean number of correct answers and the standard deviation.
The standard deviation is a measure of the spread of the measurements around the mean, or in other words, of the variability in the sample. For instance, if everyone in a sample had 5 correct answers, there would be no variance, but in a sample with the same mean (5) in which some had 4 correct answers and others had 6, there is variance.
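The two samples described above can be checked with Python’s standard library. This sketch uses the population standard deviation (`pstdev`), which is an assumption. A test developer might prefer the sample standard deviation (`stdev`) instead.

```python
from statistics import mean, pstdev

no_spread = [5, 5, 5, 5]    # everyone got 5 correct answers
some_spread = [4, 6, 4, 6]  # same mean, but scores vary

for label, scores in (("no spread", no_spread), ("some spread", some_spread)):
    # Same mean in both samples; only the standard deviation differs.
    print(label, "mean =", mean(scores), "sd =", pstdev(scores))
```

Both samples have a mean of 5, but only the second has a non-zero standard deviation.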
With these two statistics (the mean and the standard deviation), and assuming the scores are roughly normally distributed, it is possible to estimate the percentage of people who answered a given number of items correctly.
With those statistics, we can already create our IQ scale and start to assess people. (E.g., if you answered 5 questions correctly, and only 2% of test-takers scored that high, then you are in the top 2%: you performed better than 98% of the people who took the test.)
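One possible way to turn a raw score into an IQ-style score and a percentile is sketched below. The scale with a mean of 100 and a standard deviation of 15 is a common convention but is an assumption here, since the text does not state which scale is used. The norm sample, `raw_to_iq`, and `percentile_from_iq` are likewise hypothetical, and the percentile relies on the scores being approximately normally distributed.

```python
from statistics import NormalDist, mean, pstdev

def raw_to_iq(raw, norm_scores, iq_mean=100, iq_sd=15):
    """Rescale a raw score using the norm sample's mean and standard deviation."""
    z = (raw - mean(norm_scores)) / pstdev(norm_scores)  # distance from the mean in SD units
    return iq_mean + iq_sd * z

def percentile_from_iq(iq, iq_mean=100, iq_sd=15):
    """Percentage of people expected to score below `iq` under a normal curve."""
    return 100 * NormalDist(iq_mean, iq_sd).cdf(iq)

# Hypothetical norm sample: number of correct answers per test-taker.
norm_scores = [2, 3, 3, 4, 4, 4, 5, 5, 6]

for raw in (2, 4, 6):
    iq = raw_to_iq(raw, norm_scores)
    print(raw, round(iq, 1), round(percentile_from_iq(iq), 1))
```

On this scale, a score two standard deviations above the mean (130) falls at roughly the 98th percentile, which matches the “top 2%” example in the text.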
But first, let’s come back to the scores of the already-validated IQ test. Do they present a significant and strong correlation with the results of our test? If so, congrats: the test is valid, it is properly measuring IQ, and we can move to the next steps.
Of course, the process is in reality much longer; we are skipping many steps for the sake of simplicity.
So now we already know our items are valid but… didn’t we say we would drop some of them?
Exactly. Roughly speaking, what we will do now is look at how the items relate to one another and summarize that in a single statistic. This statistic is called “Cronbach’s alpha”, and it indicates the internal consistency of a test: the extent to which items correlate with one another (and broadly speaking, the more they do the better).
For each item, we will then recalculate this statistic with that item dropped from the calculation.
This way, we can see how the statistic varies depending on which item is excluded.
Therefore, if we see that dropping some items increases the internal consistency of the test, we know the entire test would be more reliable without them; we now know which items we must remove from the test.
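The item-dropping procedure can be sketched as follows, using the standard formula for Cronbach’s alpha: alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The response matrix is invented for illustration (1 = correct, 0 = incorrect), and the helper names are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """responses: one row per test-taker, one column per item (scored 0/1)."""
    k = len(responses[0])
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def alpha_if_deleted(responses):
    """Recompute alpha once per item, each time leaving that item out."""
    k = len(responses[0])
    result = {}
    for drop in range(k):
        reduced = [[v for i, v in enumerate(row) if i != drop] for row in responses]
        result[drop] = cronbach_alpha(reduced)
    return result

# Hypothetical data: 6 test-takers, 4 items. Item 3 behaves unlike the others.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]

full_alpha = cronbach_alpha(responses)
to_drop = [i for i, a in alpha_if_deleted(responses).items() if a > full_alpha]
print(round(full_alpha, 3), to_drop)
```

With this made-up data, only item 3 raises alpha when removed, so it would be the candidate to drop. Real item analyses usually look at other item statistics as well, not at alpha alone.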
We now have the final version (the final items) of the test, and we already know it is valid and reliable. We can proudly conclude that our test is finished and that it has been a great success.
Of course, there can be several types of IQ tests, each measuring IQ through its own unique set of batteries and items. However, all of them will have one thing in common: all will be measuring the same thing (IQ/intelligence), since all of them must present a high correlation with already-validated tools to be deemed valid.
And that’s basically it!
Note that we have left out many steps and oversimplified everything in order to make everything easier to understand.
References:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4096146/
Author:
Muhammad Ovais holds a Ph.D. from the Chinese Academy of Sciences (CAS), National Center for Nanoscience and Technology of China. His research interests are in neurodegenerative diseases, nanomedicines, and biomaterials. He is the recipient of over 30 international awards, including 2019 Outstanding International Researcher by the Ministry of Education, China, and the 2019 Premium Award for Best Research Paper by IET-Institution of Engineering and Technology, UK. He has to his credit over 60 scientific articles, including research studies, reviews, editorials, and book chapters in peer-reviewed journals/publishers such as Advanced Materials-Wiley, NanoToday-Elsevier, and Nanomedicine-Future Medicine, with an h-index of 30. He is the co-founder of Synthon Nanotech, a Netherlands-based startup company developing peptides, and also works as a Tech Ethicist for a US-based company, GenoEmote, that is developing novel Brain-Computer Interface technologies.