Assembly Diff Algorithms – it’s all in the details

A long, long time ago in a galaxy far, far away, Sean Hederman wrote an add-in for Reflector called Reflector.Diff. It was originally created to help him dive into code differences between Framework versions, but he figured it could also work nicely for people who wanted to diff their own assemblies.

Screenshot of the old Reflector.Diff

We got in touch with Sean, and he gave us some more details into how he developed the Algorithm and UI in Reflector.Diff 2


To perform the diffs, I used an excellent library by Michael Potter which described how to create a diff engine. I modified it (with his permission), and hooked it into .NET Reflector (with Lutz Roeder’s very patient help); suffice to say, this compact little tool became pretty popular. I have no idea how popular because I wasn’t tracking downloads or anything like that, but I certainly got a lot of link love for it (although to my now-defunct dotnet.org.za blog – lesson learned: don’t put your work under the control of other people).

Anyway, bygones.

After a while, I decided it was time for a revamp of the project. I wanted Diff to work on the newest and coolest version of Reflector, I wanted to get the UI updated, and I wanted to make it fast. Damn fast.

This is a story, not about me, but about some very clever people and their very clever algorithms.

In the beginning was the algorithm

All differencing algorithms, given 2 inputs, attempt to try and find a set of modifications which will turn Input A into Input B.

The flow from Input A through the modifications to Input B

Counter-intuitively, most difference algorithms focus not so much on finding a set of differences as finding the similarities between the two inputs; this is known as the Longest Common Subsequence problem. Think about it: if you find the commonalities; then all the bits that aren’t common are the changes. If the change is found in the source, but not in the destination, it’s a deletion, and if vice versa, it’s an insert.

As I mentioned, the algorithm I found for Reflector.Diff version 1 was written by Michael Potter, and he kindly allowed me to use it to build my original differencing engine. His algorithm was, in turn, based on one by Aprenot, which was all fine, considering my original intended purpose for the tool, but there’s always room for improvement.

More recently, I wrote a little test harness running diffs against a large project and checking the performance. For most purposes, the original version of Reflector.Diff was good enough for my needs; but I noticed that sluggishness would start to creep in for large diffs (such as looking at the System namespace in mscorlib). With a more detailed analysis, I found that it became slower when given larger texts and more changes within those texts, as would be expected. It also seemed to fall off a cliff at about 4,000 lines (but admittedly, I only tested up to 5,000 lines).

Speed is everything

I did a little digging around and found a well-known algorithm by Eugene Myers, as well as some little-known optimizations to this algorithm (designed by Neil Fraser) that help speed it up significantly in the most common scenarios. The new engine looked good, so I slapped it into Reflector.Diff and reran my tests. The question is, did it help?

In 42% of cases, my new algorithm was actually slower by an average of 622μs per file. Not ideal, but not too bad, either. And in the 58% of cases when it was faster? In those cases, it averaged 64ms faster.

To put that in perspective,  in the58% of cases when it was faster, it was 100 times faster than the old algorithm. Overall (i.e. averaging across all the test scenarios), the new algorithm worked out to be about 9 times faster. So, for a set of 100 files totalling 140,000 lines, it compared them in 0.76 seconds, as opposed to the 7 seconds it would take the original algorithm. I excluded file access times from these calculations, so the figures I’m bandying about were for the straight line comparison of sets of two large strings, held in memory.

Naturally, I’m pretty happy with the improvement, especially considering the scare I originally had while working out the performance benchmarks:

<flashback>

…I did a set of trial runs once I’d initially worked out how to build my performance improvements, and the new algorithm was vastly slower. Hundreds of times slower, in some cases.

I broke out my trusty ANTS Performance Profiler, compared the two algorithms, and found that, for all of its theoretical improvements, my new design was doing hugely more work. A little spelunking through the call stack and I came to the reason why: the old algorithm was differencing the files on a line-by-line basis, and the new one was doing it character-by-character. Once I changed the new one to work line-by-line, I started getting much better figures. I have no idea how long I’d have flailed around if it hadn’t been for that “oh crap” epiphany looking at line hit counts in ANTS.

The Diff algorithm taking a huge performance hit.

560,000 iterations is probably not a good sign for finding differences in a single file, even if it is a couple thousand lines.

</flashback>

Speed isn’t everything

On the other hand, we also need to consider the tried-and-tested technique of faking performance, which I’ve written this before. There are two psychological considerations to bear in mind.

Firstly, when people see your app stop responding (hanging), they get a negative impression of its performance. Secondly, people perceive operations accompanied by progress bars as being faster than the exact same operation without a progress bar.

However, I didn’t want to just pop up a progress bar willy nilly whenever Reflector.Diff was doing anything at all – that would be a User Experience mess, and I wanted to use my new-found knowledge wisely. So, I created a BackgroundWorker subclass to handle the Differencing operation for me, and designed it to pop up a screen with a progress bar, but only if the operation has taken longer than one second.

This way, we get the best of all worlds; a responsive UI, coupled with a progress bar if it starts taking too long. The progress bar itself? Eye candy, nothing more – there for the sole purpose of implying that the application is working faster than it actually is. Or rather, of giving the user confidence that the application is working, not hanging.

The takeaway

Diff is designed to do one thing, and one thing well: compare two assemblies and display the differences between them. With that in mind, the work I’ve done to enhance Reflector.Diff is to make it easier and faster to use – take it for a spin, and I hope it works well for you.

This entry was posted in Community, extensions and tagged , , , on by .

About Chris

A background in technical publishing; editing articles on Simple-Talk and SQLServerCentral for 3 years now. When I’ve not been editing articles, I’ve been editing or proofing books covering everything from .NET performance testing to Exchange Server, XML Schema Design, and the SQL Server Query Optimizer. I built a few websites to help pay my way through college, but the less said about them, the better.

Leave a Reply

Your email address will not be published. Required fields are marked *