The trouble with having a code base that you didn’t write yourself is that you’re often surprised to discover features that you’ve never seen before. I had this revelation some time ago when I was looking through the Reflector code, and had a close look at the IAssemblyManager interface:
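Roughly, the interface has this shape (the member names below are a sketch for illustration, not the verbatim Reflector.CodeModel definition):

```csharp
// Sketch only – member names are assumptions, not the exact
// Reflector.CodeModel.IAssemblyManager definition.
public interface IAssemblyManager
{
    IAssembly LoadFile(string location);
    void Unload(IAssembly assembly);

    // The method that caught my eye: it looks like it should be able
    // to write an assembly back out to disk.
    void SaveFile(IAssembly assembly, string location);
}
```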
I’m interested in using Reflector programmatically to do various kinds of things, and so I’d used most of the methods of this interface in the past. The one that stuck out this time was the SaveFile method, which looked like it would write an assembly back out to disk. I tried it and found that, although it didn’t work, there was already a lot of code inside Reflector for writing metadata and IL as an assembly – I assume this was written back when the code for reading assembly information was being created.
Internally, we use Cecil as a tool for reading and modifying assemblies, and we use it inside Reflector VSPro to add a debugger signature to assemblies that you need to debug. Cecil is also used inside other Red Gate tools (such as Exception Hunter) as a way to get at the IL, and information about assemblies that you don’t necessarily wish to load into your process.
Tools that read and write assemblies have a great number of uses that interest me:
- They allow access to the metadata of an assembly without you having to load the assembly into your process.
- They can be used by the backend of a compiler to produce the final object file.
- They can be used to modify the contents of methods, allowing you to do aspect-oriented programming by weaving extra code into methods that you designate.
- They can be used to link your assemblies, allowing you to pull in code from dependent assemblies and paste it into your own assembly.
- They can be used to experiment with partial compilation.
- They can be used to instrument existing assemblies to find out which parts are actually used.
Why do we need another byte code manipulation library?
We’ve already started adding byte code analysis functionality to .NET Reflector. To improve our decompilation, I recently added some code that analyses IL and determines dead instructions, and I’ve been experimenting with some code that takes IL and determines the types on the stack at various points during its execution. It would be good to make this kind of functionality available to users of .NET Reflector.
As usual with unknown functionality, the best way to get going is to first determine criteria for success, and then write some tests, starting simple and building up to increasingly complicated cases. So I took a set of assemblies with the intention of reading each one in and then writing it out again using the code inside Reflector, and to pass the test, I demanded:
- Using ILDASM on the initial assembly and on the result assembly, the textual representation of the assembly contents should be the same. There are a couple of things that can change when the assembly is written out (the timestamp inside the header and the base address, for example), but apart from a couple of lines of text, everything else should be the same.
- The output should be deterministic apart from these bytes.
- PEVerify should say that the produced assembly is valid.
- The loading and saving should be idempotent. If we take an initial assembly and then write it out, there is no guarantee that a byte-wise comparison of the data will be the same. There is flexibility in the layout of the data, and there are various dead bytes in the output that mean nothing, but can be set to any value. However, if we take the assembly produced by Reflector, load it, and then write it out again, we would expect a byte-wise comparison to be the same.
What does a loaded assembly look like in the object model that Reflector exposes?
It’s easy to think of the loaded assembly as a tree. When you load the assembly, you are given a pointer to an Assembly object which implements the following interface:
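In outline, it looks something like this (a sketch – the real interface has rather more members):

```csharp
// Illustrative sketch of the root of the tree.
public interface IAssembly
{
    string Name { get; }

    // The modules that make up the assembly; nearly always just one.
    IModuleCollection Modules { get; }
}
```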
From this you can walk down into the Modules collection, which allows you access to all kinds of information, such as the bitness and the collection of assembly references:
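Again as an illustrative sketch (member names assumed):

```csharp
public interface IModule
{
    // Target platform information ("bitness": x86, x64, AnyCPU, ...)
    // lives at this level, alongside the references to other assemblies.
    IAssemblyReferenceCollection AssemblyReferences { get; }

    // The types declared in this module.
    ITypeDeclarationCollection Types { get; }
}
```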
From that you can walk into the Types collection, telling you about the type’s contents (the attached methods, fields, properties, etc.):
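Something along these lines (names assumed):

```csharp
public interface ITypeDeclaration
{
    string Name { get; }

    // The members attached to the type.
    IMethodDeclarationCollection Methods { get; }
    IFieldDeclarationCollection Fields { get; }
    IPropertyDeclarationCollection Properties { get; }
}
```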
Then down into the Methods:
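A sketch of the method level (names assumed):

```csharp
public interface IMethodDeclaration
{
    string Name { get; }
    IParameterDeclarationCollection Parameters { get; }
    IType ReturnType { get; }

    // Typed as object because it can take a number of forms (see below).
    object Body { get; }
}
```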
And from this get to a method body, which lets you get to the actual instructions themselves. Note that the Body has the type object because it can have a number of forms. It can be a raw set of IL instructions, or can be used to hold the higher level object model for decompiled code (which we won’t go into here):
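When the body is a raw set of IL instructions, it implements something like this (a sketch):

```csharp
public interface IMethodBody
{
    int MaxStack { get; }
    IVariableDeclarationCollection LocalVariables { get; }
    IInstructionCollection Instructions { get; }
}
```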
At the leaves, we get a set of instructions which contain an offset, an opcode and a Value (which is the extra data that the opcode might need):
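In sketch form:

```csharp
public interface IInstruction
{
    int Offset { get; }   // byte offset of this instruction within the method body
    int Code { get; }     // the opcode
    object Value { get; } // operand, e.g. a string for ldstr, a method reference for call
}
```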
I started with an initial test assembly that has the following form when displayed in Reflector:
This is really the simplest C# method:
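Reconstructed from how it’s used later in the post (the class name is my guess), the source is just a method that returns a constant string:

```csharp
public class TestClass
{
    public string GetValue()
    {
        return "Hello";  // compiles to: ldstr "Hello" / ret
    }
}
```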
We can then write some test code, which we can run using the Resharper test runner:
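Roughly like this – the helper names are assumptions, and the test attribute is plain NUnit, which the ReSharper runner picks up:

```csharp
// Sketch of the round-trip test; helper names are hypothetical.
[Test]
public void LoadThenSaveRoundTrips()
{
    IAssemblyManager manager = IlEditing.GetAssemblyManager();
    IAssembly assembly = manager.LoadFile(@"TestAssemblies\Simplest.dll");
    manager.SaveFile(assembly, @"Output\Simplest.dll");

    // The ILDASM text comparison and PEVerify checks described above
    // then run against the output file.
}
```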
From the simple use of the object model above, it feels like we have a tree of objects. Unfortunately we don’t really have a tree, but a graph – there are back-pointers from various nodes to other parts of the tree, via properties like Owner and Context. This means that editing becomes… interesting.
Some parts of the model are lazily loaded into memory – until that part of the model is needed, the objects just keep pointers to pieces of data, but don’t actually read the data until it is needed. Such pieces of the object model are immutable – if you try to modify them using a property, you get an exception.
Internally, Reflector gets around this by cloning; there is a clone operation which uses a visitor-type pattern to visit all of the nodes of the tree and make a copy of them. I initially used this to clone the entire tree after reading it in, though there were numerous problems with this: Reflector itself only copies subsets of the graph at the moment, and the existing clone functions didn’t always maintain the correct object identity for the back-pointers.
After thinking about this some more though, I realised that a more functional approach to manipulating the graph seems to work a lot better. This is, for example, the path that Microsoft have taken with the Roslyn project. The trees they produce are immutable – when you want to edit something, you generate a new version of it, though you still point to the existing tree for things that you are not editing. This approach is nice, as you don’t side-effect data that is already in memory and which other parts of your application could be using. This means that it is easy to start editing and even throw away your changes without affecting other parts of the world, and you don’t have to clone an entire object graph just to make one small change.
In this initial version, I’ve therefore gone for a model where you ask for an editable version of an object. This gives you back a proxy object which allows the edits and which, in the future, will be responsible for maintaining the invariants on the graph that make it valid. I’d have loved to use a Zipper, my favourite data structure, but we didn’t really have a suitably functional data structure which we were editing.
My first test case was to take the assembly shown above and change the GetValue method to return the string “Modified” instead of “Hello”.
I started by adding various utility functions to a new IlEditing static class… in particular I added a method for getting an IAssemblyManager from a new Reflector instance:
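Something like this – how the fresh Reflector service provider is created is elided, and all the names are illustrative:

```csharp
public static class IlEditing
{
    // Hypothetical sketch: obtain the IAssemblyManager service from a
    // newly created Reflector instance.
    public static IAssemblyManager GetAssemblyManager()
    {
        IServiceProvider reflector = CreateReflectorInstance(); // assumed factory
        return (IAssemblyManager)reflector.GetService(typeof(IAssemblyManager));
    }
}
```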
With that in place, we can then use a series of accessors that return editable objects as we descend the tree to the target method.
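The editor accessor names below are assumptions; the point is the pattern of asking each node for an editable proxy on the way down:

```csharp
// Illustrative only – the *Editor accessors are hypothetical names.
var assemblyEditor = manager.LoadFile(@"TestAssemblies\Simplest.dll").GetEditor();
var typeEditor     = assemblyEditor.FindType("TestClass").GetEditor();
var methodEditor   = typeEditor.FindMethod("GetValue").GetEditor();

// Rewrite the ldstr operand so the method returns "Modified" instead of "Hello".
foreach (var instruction in methodEditor.Body.Instructions)
{
    if (Equals(instruction.Value, "Hello"))
        instruction.Value = "Modified";
}
```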
After that, it is a simple matter to create a new output directory into which we can write the modified assembly:
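Along these lines (the SaveFile shape is assumed, as above):

```csharp
string outputDirectory = Path.Combine(Path.GetTempPath(), "ModifiedAssemblies");
Directory.CreateDirectory(outputDirectory);

string outputPath = Path.Combine(outputDirectory, "Simplest.dll");
manager.SaveFile(assemblyEditor.Assembly, outputPath); // assumed member
```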
And then we can test the modified method, by loading the type using Reflection, creating an instance, and then checking the return value.
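From here on it’s plain System.Reflection (only the file path and type name are assumptions carried over from the sketches above):

```csharp
// Load the rewritten assembly, instantiate the type, and check the
// method now returns the new string.
Assembly modified = Assembly.LoadFile(outputPath);
Type type = modified.GetType("TestClass");
object instance = Activator.CreateInstance(type);

string result = (string)type.GetMethod("GetValue").Invoke(instance, null);
Assert.AreEqual("Modified", result);
```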
Modifying an existing method is all very good, but what I really want to do is add new stuff to the assembly.
To do that, we need to create instances implementing an IMethodBody, which we can fill with instructions. We really need to add utility functions (factory methods) to make the creation easier. I have only just started experimenting with this. There are also a few complications, in that we need to generate the various tokens that are used internally to represent the metadata items. It’s easy to add a method to the assembly proxy which will discover all of the tokens that are currently used:
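A metadata token is a 32-bit value whose top byte identifies the table (0x06 for MethodDef, 0x04 for Field, and so on) and whose bottom three bytes are a row index, so handing out fresh tokens can look something like this (the GetUsedTokens member is a hypothetical name):

```csharp
HashSet<int> usedTokens = assemblyEditor.GetUsedTokens(); // hypothetical member

// Find the next unused token for a given metadata table.
int NextToken(byte table)
{
    int candidate = (table << 24) | 1;
    while (usedTokens.Contains(candidate))
        candidate++;
    return candidate;
}

int newMethodToken = NextToken(0x06); // 0x06 = MethodDef table
```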
It’s then easy to write something that will construct a new static method on a given type. We can pass in the return type as a string, and the utility method can use the assembly references of the containing assembly to convert it into the relevant internal representation:
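A sketch of the factory helper (all names assumed; the returnType string is resolved against the containing assembly’s references):

```csharp
var newMethod = typeEditor.CreateStaticMethod(        // hypothetical factory
    name: "AddedMethod",
    returnType: "System.String");

// Build the body: push a constant string and return it.
newMethod.Body.Instructions.Add(new Instruction(InstructionCode.Ldstr, "Added"));
newMethod.Body.Instructions.Add(new Instruction(InstructionCode.Ret));
```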
We can then extend the unit tests to check that we can see this method using Reflection, and then call it to check that it returns the correct value:
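Again via plain Reflection (the method name matches the sketch above):

```csharp
// The new method should exist, be static, and return the expected string.
MethodInfo added = type.GetMethod("AddedMethod",
    BindingFlags.Public | BindingFlags.Static);
Assert.IsNotNull(added);
Assert.AreEqual("Added", (string)added.Invoke(null, null));
```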
We can look at the modified assembly using Reflector:
The intention is now to expand the set of utility methods to make this more useful. The utility methods can have responsibility for keeping the graph in a valid state, though you’ll still be able to drop down to editing at the level of the normal object model if you need access at this level. I also intend to expose the various IL dataflow analysis options from the extended object model to enable various byte code manipulation scenarios.
Unfortunately, this work was done as part of a Red Gate Down Tools Week (occasional weeks when we are given a week of time to work on anything that takes our fancy), so the future is a little unclear. The API described above isn’t anywhere near as comprehensive as I would have liked; my excuse is that the majority of the time was spent fixing bugs in the metadata-writing code. The metadata of a .NET assembly is described using a set of tables, with the tokens (mentioned above) acting as keys into these tables. Debugging incorrect metadata is fiddly, and often involves taking hex dumps of files and annotating them using a pen to mark out the various sections.
I spent a fair chunk of time getting up to speed with the format of the metadata, and spent one long session debugging the disappearance of unmanaged resources in mscorlib.dll. I’m currently working on the generics-heavy LinqBridge assembly (as and when I have the time), which is generating invalid structures. This is due to a cache firing when a type with method-level generic parameters is found to be equal to a type where the parameter is at the class level.
It is fascinating work, seeing how high level concepts like generic methods in C# are mapped down on to the virtual machine, but it can be time consuming. Getting things working assembly-by-assembly seems to be the way to go – we have a large set of test assemblies that we use for testing Reflector and Reflector VSPro, so I’ll continue working my way through them one by one.
The work that I have done has made it into the latest Reflector EAP, though there is so little there that I might not show it to my Mum. As a user of Reflector, you’ll see some extra interfaces publicly exposed in the Reflector.CodeModel.Editors namespace, and there are obviously implementation classes backing these interfaces (though they are not exposed publicly).
I also exposed the Instruction class which makes it easy to refer to the various IL instructions using their symbolic names:
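For example, building a body equivalent to `return 42;` might look like this (the exact construction API is an assumption):

```csharp
// Sketch: symbolic opcode names instead of raw opcode bytes.
body.Instructions.Add(new Instruction(InstructionCode.Ldc_I4, 42));
body.Instructions.Add(new Instruction(InstructionCode.Ret));
```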
With this exposed functionality, you can run the examples I have given above in the code, though beware that you need to compile against the full .NET 4 profile and not just the client version. The writing of an assembly seems to work fairly well for non-generic types, but this is currently unsupported work. Until I’ve worked my way through a lot more test cases, I’m sure there are lots of bugs to work through.
When (if) the work is ever finished, it will allow add-ins to do even more interesting things, although Reflexil admittedly already allows sophisticated byte code manipulation by way of the Cecil library. I’m hoping to convince Greg of the worth of this project so that I can get some more time on it at some point.
And why is it called Morrissey? Well, there’s already Cecil, and most of the time was spent checking to see if something that started as IL was “Still IL” at the end of the process.