Some explanations
Please watch the video. What follows is only a very brief and rather superfluous supplementary explanation of what we will do.
We seek to know the size of a set. However, the problem is that we only have access to a few elements of this set.
For example, I have pages 2, 3, and 6 of an unknown book (of course, since I am giving you the explanation, I already know the size of the book: it is 8 pages, but you do not know this yet). I must guess the most probable size of the book to which these pages belong.
The method is simple and intuitive: it consists of adding, to the highest value of the provided data (here the page numbers 2, 3, and 6), the average of the intervals between consecutive values, assuming that the element numbers I know are uniformly distributed.
Let’s go over it again: my data are the numbers 2, 3, and 6. Their intervals are: 1 and 3. The average of 1 and 3 is 2. The sum of 6 and 2 is 8. Therefore, the book has 8 pages.
Another example: if I have the two values 8 and 9, their interval is 1. I add 1 to 9, so the size of the set is estimated to be 10.
If I have the values 1, 6, and 9, the first interval is 5 (distance between 1 and 6), and the second is 3 (distance between 6 and 9). The average between 5 and 3 is 4. I add 4 to 9, so the size of the set is estimated to be 13.
The algorithm is thus as follows:
Add to the maximum value the average of the differences between consecutive values of the sorted data.
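Note that, because the intervals between consecutive sorted values telescope, their sum is simply the maximum minus the minimum; the estimate can therefore also be written as max + (max - min) / (k - 1), where k is the number of known values. For 2, 3, and 6 this gives 6 + (6 - 2) / 2 = 8, as above.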
Now, let's transcribe this into J.
Implementation
Let’s revisit the definition of the estimation algorithm: add to the maximum value the average of the differences between consecutive values of the sorted data.
We know a priori the valence of each verb. We can immediately write the algorithm in J:
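One possible formulation, using verb names chosen here for readability (the names are assumptions, not necessarily the original ones):

    estimate =: maximum + mean@intervals@sort

This is a fork: the maximum of the data is added to the result of the right branch, which sorts the data, takes the intervals between consecutive values, and averages them.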
Assuming that we have defined each verb as follows:
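For instance, the supporting verbs could be defined along these lines (a sketch, not necessarily the original definitions):

    sort      =: /:~        NB. sort ascending
    intervals =: }. - }:    NB. differences between consecutive values (behead minus curtail)
    mean      =: +/ % #     NB. arithmetic mean: sum divided by count
    maximum   =: >./        NB. largest value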
We can also write it all in one go in tacit style:
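A possible tacit one-liner along the same lines:

    estimate =: >./ + (+/ % #)@(}. - }:)@(/:~)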
Usage
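With the estimate verb sketched above, a session could look like this:

    estimate 2 3 6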
Returns:
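    8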
This is the estimated total size of the data sample 2, 3, 6.
Test
We will test the power of this estimation method.
We choose sets of 50, 100, 200, 1000, and 10000 elements.
For each set, we progressively choose from 2% to 100% of elements at random.
Each time, we perform the estimation 100 times in a row and, finally, average everything. We display the evolution of the estimation's accuracy in a graph.
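One way such an experiment could be set up in J (a sketch; the verb names are assumptions and the plotting step is omitted):

    estimate =: >./ + (+/ % #)@(}. - }:)@(/:~)

    NB. x: sample size, y: set size. Draw x distinct values from 1..y,
    NB. estimate the set size, and return the relative error.
    trial =: 4 : '| 1 - y %~ estimate 1 + x ? y'

    NB. mean relative error over 100 repeated trials
    avgerr =: 4 : '(+/ % #) x trial"0 (100 # y)'

For example, 10 avgerr 200 would give the average relative error when estimating a 200-element set from 10 randomly chosen elements.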
Output:
We see that very quickly, regardless of the size of the set, we are below a 3% margin of error. Not bad.
Variation
The same thing, using harmonic sums: