Finding the difference between two Arrays, or un-LINQ-ing your code

imageLINQ is great, up to the point when it's not.  Then it's really not great at all.  When we trade simplicity of syntax for performance, we have to keep in mind there may come a point when we have to go back the code and factor out the shiny new toys.

My system is complex, but this task is simple.  I have about 95,000 video files to keep track of in a database.  Try as we might, at some point someone is going to make changes, add or remove files, without going through our software, so a periodic file audit is required.

I have two lists of file names, one is the list as it exists in the database, the other as it exists on disk.  This is greatly simplified, but the initial idea was something like this:

List<String> toAdd = (from f in existingFiles
                      where !indexedFiles.Contains(f)
                      select f).ToList();

List<String> toRemove = (from f in indexedFiles
                         where !existingFiles.Contains(f)
                         select f).ToList();

The power of LINQ, short and sweet.  Sadly, it appears to act like this under the hood:

List<String> toAdd = new List<String>();
foreach (String f in existingFiles)
    if (!indexedFiles.Contains(f))
        toAdd.Add(f);

List<String> toRemove = new List<String>();
foreach (String f in indexedFiles)
    if (!existingFiles.Contains(f))
        toRemove.Add(f);

These statements pegged the CPU for minutes at a time - can you see why?  How many times am I looping through each collection of strings?  Twice? Four times?  No, I'm afraid this is classic "Oh 2 Da N" performance here.  The problems lies in using the Contains() method - this method must search through the collection each time it's called, which is often.  To solve this, we'll have to kick it old school C-style:

indexedFiles.Sort();
existingFiles.Sort();

for (int iE = 0, iI = 0; iE < existingFiles.Count || iI < indexedFiles.Count; ) {
    if (iE >= existingFiles.Count) {
        toRemove.Add(indexedFiles[iI]);
        iI++;
    }
    else if (iI >= indexedFiles.Count) {
        toAdd.Add(existingFiles[iE]);
        iE++;
    }
    else {
        Int32 diff = existingFiles[iE].CompareTo(indexedFiles[iI]);
        if (diff == 0) {
            iE++;
            iI++;
        }
        else if (diff > 0) {
            toRemove.Add(indexedFiles[iI]);
            iI++;
        }
        else if (diff < 0) {
            toAdd.Add(existingFiles[iE]);
            iE++;
        }
    }
}

A quick sort of the lists and we can make some time saving assumptions as well as walk through both lists at the same time (.Net had no trouble sorting lists of 95,000 strings in record breaking time).  We start off comparing the strings at index 0 of each list (the first if and else if will be false - I'll come back to those shortly).  If they match, then the file both exists and is indexed - go to the next file in both lists.  If the existing string is greater than the index string, then we know the file no longer exists (remember, lists are sorted).  In this case only advance the index list to see if it "catches up" with the existing list.  If the existing string is less than the index string, the reverse is done.  If we run out of existing list before we finish the index list, those files no longer exist - which is what the first if statement handles (the first else if handles the other case for running out of index list first).

If you have to work through this for loop on scratch paper, don't feel bad - I did.  It's been a very long time since I've had to get this "raw" with my code.  It was worth it as the performance went from several minutes at 100% CPU to so fast I couldn't believe it had run.

I googled a bit to see if there were other solutions out there, and more importantly some method in the .Net framework that would solve this issue and came up empty.  I'm not convinced there isn't something in the bazillion methods living in the framework, so if you know of something, please let me know!

Posted By Mike On Friday, October 03, 2008
Filed under developer linq | Comments (3)

Submit this story to DotNetKicks   

Joe - Wednesday, July 29, 2009 10:54:04 AM

Not that I'm a Linq expert (or even a Linq user, or understander), but you may be able to have your cake and eat it too in this situation. One thing to note is that it isn't much more expensive (and in the best case, log(n) cheaper) to build to Hashtables than it is to sort those lists, and good hashtables have nice, concise ways of expressing (and implementing) union and intersection with a bunch of O(1) lookups.


HashSet tokeep = new HashSet(indexedFiles);
HashSet toremove = new HashSet(tocompare);
HashSet toadd = new HashSet(existingFiles);

tokeep.IntersectWith(existingFiles); // existingFiles.Count ops
toremove.ExceptWith(existingFiles); // existingFiles.Count ops
toadd.ExceptWith(indexedFiles); // indexedFiles.Count ops

If this is all you need (and I'm right about the O(1) Hashtable behavior) then the whole thing above should be O(n), and still pretty fun to read :)

Mike - Wednesday, July 29, 2009 1:19:50 PM

When I posted this, I had overlooked the Except() extension method, which does exactly what I needed and replaced the code with it. On the devlicio.us post, there was a good size comment thread about which solution had the best performance - http://devlicio.us/blogs/vinull/archive/2008/10/03/finding-the-difference-between-two-arrays-or-un-linq-ing-your-code.aspx

Joe - Wednesday, July 29, 2009 2:51:38 PM

Ach! Sorry to come so late to the dance! I'll keep a closer eye on the dates next time.

Thanks!

Leave a comment



Your name:
 

Your email (not shown):
 
Will display your Gravatar image.

Your website (optional):



About Me

Michael C. Neel, born 1976 in Houston, TX and now live in Knoxvile, TN. Software developer, currently .Net focused. Board member and President of ETNUG, and organizes CodeStock, East Tennessee's annual developers conference. .Net speaker, a Microsoft ASP.NET MVP and ASPInsider. Founder of FuncWorks, LLC and Feel The Func podcast.

Proud father of two amazing girls, Rachel and Hannah.

 Subscribe to ViNull.com |  Comments

Follow me on Twitter | Contact Me

Related Posts

XNA 3D Primer Published – Get a free copy!

In June of 2006 I officially became a professional author when ASP.NET Pro published my article “Google Can You Hear Me?”.  (So eager was I to be ... Read more

Critical Thinking and Dissent is a Requirement

So the fires continue to burn, as Joel Spolsky’s internet access hasn’t been disconnected.  For someone who is supposed to be an idiot and irrelevant, ... Read more

You should be using Balsamiq Mockups

I’ve heard many people rave about Balsamiq Mockups, an Adobe AIR app that helps design user interfaces, but until today I had never looked into using it ... Read more

Game Development going Agile

Game review site Giant Bomb has a great article about Mass Effect 2’s development process.  The article is written by one of the review staff, who ... Read more

TeamCity.CodeBetter.com, A continuous integration server farm for open source projects

CodeBetter – in collaboration with JetBrains, IdeaVine, and Devlicio.us – is proud to announce the launch of TeamCity.CodeBetter.com – a continuous integration ... Read more

FeelTheFunc Podcast

CodeStock
Are you going?

ASPInsiders Member

ETNUG Member