Comparing lists at insertion time: Inputs, Outputs and Repetitions

Thiago iotti
5 min readJan 2, 2021

TL;DR
The implementations and benchmarks are on the bottom of the page
Repository is here

If you have worked on software development, there is a real chance that you needed to compare two sets of informations. Comparing an input of products with your current warehouse stock, or receive a list of users and check who is in, who is out and who remains unchanged.
There is a lot of ways to do that: Sending that list to a temporary table and making a (quite) complex query or, a more common way, make the query to bring a list and comparing to another list (usually, an user’s input); And that last case we will talk here, since it is more common.

Most of languages already have some built-in way to compare two sets:

C#:

Ruby:

JS:

C++:

So, there is a pattern here. At some point we must load both lists and when both lists are loaded in memory, we will work on operations.
Now, what if we have a way to make this comparision while we were filling these lists? The first thing that comes to the mind of many is: At every element that comes in the list A, we must check the if there is the same element on list B and vice-versa.

Your CPU checking if the element is on the list

Of course, it seems to be a terrible idea! At every item entering on any list you will check the opposite list, and that list is increasing and also you might have some moments that a new elements enters on the opposite list before you checking another element.
Well, if we think about it, we are trying to put an algorithm that works good on complete lists to work on incomplete lists. It is like, make the planes flap their wings to fly just because it worked with the birds!

Using complete list operations on incomplete lists

It seems to be clear that common list operations will not work here, it’ll either cost too much or simply give us wrong results and since we are in a new situation, we may need to think it a little bit different.

An Example on how we can make it work

Okay, so, we have used these operations on list for a quite long time and it is quite common to think there is no better way to work on list without using it. I have used this solutions on personal problems long ago and when I presented it to other people, most of it have some difficult to understand, some other people just told me it was impossible to be faster.

Impossible, you say…

When load the lists we have to store it somewere, arrays might be ok, but using hashes will gave us keys to search it faster and since we are adding it to somewhere, why don’t we put it on the same list instead and just mark the ones came from each list? A simple math operation will do!
Let’s just say we were receiving a list with users’ e-mails, this list will shows us which users are new, which users will be kept, any other e-mail we have saved on our database and not on that list, must be deleted. Let’s just say, we have the following database:

And they send to us the following list:

Ok, so, we only need e-mail to compare, let’s create a hash table where e-mail should be the key and, for every element came from the database, we will add 1 and for every element came from the input, we will add 2. Now, let’s take a look on what will happen step by step:

Step 1(from DB): Since the key doesn’t exists before, we will just set its value as 1
Step 2(from input): The key already exists, and since it came from the input, we will add 2 to its value
Step 3(from DB): Again, the key doesn’t exists, since it came from DB, we will set the value as 1
Step 4(from input): Now, the key doesn’t exist but it came from input, so we will set the value as 2
Step 5(from DB): The key exists, and since it came from DB, we will just add 1
Step 6(from input): New key and came from input, so set 2 as value

Well, I think I have made it clear how it’ll do! It doesn’t matter the input order. The final result might be something like this:

When the lists finished loading we already have the results, and know exactly which one we should delete, and which one we should insert. And now, instead of making 3 different operations get us the ins, outs and keeps, we can do it in a single iteration on that hash we generated!

Implementations

Ok, the theory is quite good but how we put things to work?
One solution I have found right “out of the box” was using Redis’s sorted sets(zadd) and if you are interested in using parallelism, this is the easiets shot! Just make the zadd calls adding 1 or 2 to the keys and you are good to go! But it is not all applications makes use of Redis and we know that just put something in an architeture for a specific case might be a huge overkill. So we might need to implement it inside application. So I have put it to work in a few languages, it is not the best ones, but will work.
Let’s have List A as the list that we already have on our database and List B as a list we are receiving.

To make it work, we must assure that List A must have all itens unique.

Right now, I’ll just put Go and Ruby but I’ll add more in the future, on these examples I have used two csv lists with approximatedly 380k rows each.

Go:

A normal comparing(about 3 seconds) x Insertion comparing(about 500ms)

Ruby:

Normal listing (~3.0s) x Insertion listing (~2.5s)

Benchmarks:

Ruby:

We may have some time increase on load, but it pays off on collect

Go:

The difference on data collect is quite big!

--

--

Thiago iotti
0 Followers

Game Engineer @ Wildlife Studios