How to acquaint yourself with masses of code?

During the last few months, I've been delving into a huge lump of source code, trying to understand its structure and come up with ways to make the application more structured and expandable. I'm blessed: the code itself is actually rather decent, although the passage of time has certainly deteriorated the structure somewhat (this happens with all code anyway). It's been another great learning process, so I'll share some of the fruit: Jouni's five steps towards turning a huge mass of code into something you actually understand.

1. Scan through the directory structure

Print out a directory tree of the source. Run quickly through the code and make notes on what's in each directory (which classes, what sort of functionality). If you already have this sort of a document, great – but don't use it, do this yourself. Compare with the existing notes and see what you've missed. Be careful, as the old document might be somewhat outdated. Spend no more than one minute per file.

You should end up with a working knowledge of what's in the tree and what's not, although you probably cannot remember it all. Turn your structure document into an electronic form (unless it's there already) or make sure the existing document is up to date.
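As a starting point for the structure document, the directory scan can be scripted. The following is a minimal Python sketch, not a finished tool: the function names `summarize_tree` and `format_tree` are made up for this example, and it only counts files per extension, so the notes on what each directory actually does are still yours to write.

```python
import os
from collections import Counter

def summarize_tree(root):
    """Return a list of (relative_dir, Counter of file extensions) pairs."""
    summary = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable, alphabetical order
        rel = os.path.relpath(dirpath, root)
        # Files without an extension are grouped under "(none)".
        exts = Counter(os.path.splitext(name)[1] or "(none)" for name in filenames)
        summary.append((rel, exts))
    return summary

def format_tree(summary):
    """Render the summary as an indented, printable outline."""
    lines = []
    for rel, exts in summary:
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        name = "." if rel == "." else os.path.basename(rel)
        counts = ", ".join(f"{n} {ext}" for ext, n in sorted(exts.items()))
        lines.append(f"{'  ' * depth}{name}/  [{counts}]")
    return "\n".join(lines)
```

Calling `print(format_tree(summarize_tree("path/to/source")))` gives an outline you can print and annotate by hand.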

2. Gather the requirements, bug reports and ideas for the code

Ask around, take a look at mail archives and whatever you have. Try to find out what's wrong with the code, what kind of structural issues have been bothering people, how the application as a whole should develop and so on. In particular, try to identify the "can-of-worms bugs" – the ones which everybody constantly talks about but which never get fixed. Understanding the needs is important even if you're not planning on massive developments, as it often also tells you how people actually use the code.

If you have a decent bug reporting system and people actually use it, you're going to have an easy time here.

3. Review the codebase

This one takes time, but it's worth it. Go through every class and every method (perhaps not every line, but almost) of the code. Try to identify the major problems you found during the last step on the source code level. If you've heard explanations on why something is difficult to fix, try to understand the reasoning yourself. Make notes to support your memory.

Try to spot patterns. This is very hard unless you're an experienced programmer, but do it anyway. Try to identify the sorts of operations that repeat themselves throughout the codebase. Pay particular attention to these: Is the code required for a frequent task readable? Is it error-prone? Is it easily expandable? For example, if your frequent event is "open a database connection", how do you do it? Do you pass DB connection info and credentials around every time? What if you need to pass a special timeout value – could you do it? Is exception handling done the same way every time? Is it done in any reasonable way?
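To make the database question concrete, here is one shape a "good" answer could take, sketched in Python with the standard `sqlite3` module. Everything here is an assumption for illustration: `DbSettings`, `DataAccessError` and `open_connection` are hypothetical names, and the codebase you are reviewing will have its own driver and conventions. The point is only that connection info lives in one object, a special timeout can be passed without touching other callers, and exceptions are handled the same way every time.

```python
import sqlite3
from contextlib import contextmanager
from dataclasses import dataclass

@dataclass(frozen=True)
class DbSettings:
    # Hypothetical settings object; a real application would load it from config.
    database: str
    timeout_seconds: float = 5.0

class DataAccessError(Exception):
    """The one exception type callers handle, whatever the driver raised."""

@contextmanager
def open_connection(settings, timeout_seconds=None):
    """Open a connection, commit on success, roll back on driver errors."""
    # A caller can override the default timeout for one special operation.
    timeout = settings.timeout_seconds if timeout_seconds is None else timeout_seconds
    try:
        conn = sqlite3.connect(settings.database, timeout=timeout)
    except sqlite3.Error as exc:
        raise DataAccessError(f"cannot open {settings.database}") from exc
    try:
        yield conn
        conn.commit()
    except sqlite3.Error as exc:
        conn.rollback()
        raise DataAccessError("database operation failed") from exc
    finally:
        conn.close()
```

A caller then writes `with open_connection(settings, timeout_seconds=30) as conn: ...` and never sees credentials or driver-specific exceptions. If the code you are reviewing repeats the body of this helper at every call site instead, that is exactly the kind of pattern worth noting.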

If you have special expertise, you can use this review round to spot other issues: security flaws, performance bottlenecks, globalization concerns, whatever. But! Don't make the mistake of thinking you'll find all the problems. For example, an exhaustive search for security flaws cannot be combined with an introductory review of an existing codebase. The same goes for most other non-mechanical hunts for code issues.

Allocate sufficient time. For me, it takes an hour to effectively go through 20-50 k of C# code, depending on the complexity of the operations involved. Your speed will vary a lot based on your experience and working habits. I repeat: This part takes ridiculous amounts of time. However, it will provide you with knowledge that considerably helps you in the next steps.

4. Build the code

If we lived in a perfect world, this task would always be trivial. Automated build environments should turn this task into a no-op (by forcing the code solution/project/package to be very independent of any configuration), but it rarely happens. Even with a decent autobuild, there's often some work required in setting up your personal build environment. Be analytic. Why the requirements? Could they be removed? Is the process of doing a clean build straightforward enough?

Although the process of creating a build isn't particularly strongly related to the code itself, the inter-package relations (such as those implied by project references in Visual Studio solutions or whatever your build environment has) tend to become more clear by looking at the build process. Also, if you don't have a module dependency graph (which modules require which parts to build), draw one. Any format is acceptable as long as it's accurate. Again, verify that any existing documents are up-to-date before relying on them.
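If drawing the dependency graph by hand feels error-prone, a small script can keep it honest. The sketch below is Python and uses the standard-library `graphlib` module (Python 3.9 or later). The module names in `DEPENDENCIES` are invented for illustration; in practice you would fill the table in from your project references. It derives a valid build order, fails loudly on circular dependencies, and emits Graphviz DOT text you can render with any DOT viewer.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical modules: each key needs every module in its set in order to build.
DEPENDENCIES = {
    "WebUI": {"Business", "Common"},
    "Business": {"DataAccess", "Common"},
    "DataAccess": {"Common"},
    "Common": set(),
}

def build_order(deps):
    """Return the modules so that each one comes after everything it needs.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())

def to_dot(deps):
    """Render the graph as Graphviz DOT text, one edge per 'needs' relation."""
    lines = ["digraph modules {"]
    for module in sorted(deps):
        lines.append(f'  "{module}";')
        for needed in sorted(deps[module]):
            lines.append(f'  "{module}" -> "{needed}";')
    lines.append("}")
    return "\n".join(lines)
```

A `CycleError` from `build_order` is worth as much as the picture itself: circular module references are usually one of the structural issues people have been complaining about.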

If the build produces warnings, note them. See if you can figure out the bad practices behind them. If the build produces errors, you're in for a world of hurt. Find a way to fix the issues now or you'll regret it. Return here when you're done. If a broken build is everyday stuff in your dev team and you can't convince them to change the habit, go on. It can't stop you, but you're still going to suffer. You have been warned.
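Noting the warnings is easier if you tally them by warning code first, so the most frequent bad practice floats to the top. This is a rough Python sketch under one assumption: that your build log contains compiler-style lines of the form `File.cs(10,17): warning CS0168: message`, as the C# compiler prints them. Other toolchains format warnings differently, so the regular expression will need adjusting.

```python
import re
from collections import Counter

# Assumed line format (C# compiler style):
#   Foo.cs(10,17): warning CS0168: The variable 'e' is declared but never used
WARNING_RE = re.compile(r"\bwarning\s+([A-Z]+\d+)\s*:")

def count_warnings(log_text):
    """Count how many times each warning code appears in a build log."""
    return Counter(WARNING_RE.findall(log_text))

def report(log_text):
    """Return 'code: count' lines, most frequent warning first."""
    counts = count_warnings(log_text)
    return "\n".join(f"{code}: {n}" for code, n in counts.most_common())
```

Feed it the saved output of a clean build, for example `report(open("build.log").read())`, where `build.log` is whatever file you captured the build output into.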

After this step, you should have a working understanding (not just knowledge!) of how the software is composed (the modules) and how the modules themselves work (from the code review). You could've built the software earlier – actually, most of us do it as the first step. That may work, if you have a strong build environment and everything goes fine. But if you get errors and the build fails, it'll get frustrating. On the other hand, if you already know how the code works, it's likely that exploring the build errors will become a good learning experience.

5. Get a feel for the development

Pick the most trivial of the issues you identified in step 2. Fix it. Make sure everything works thereafter. Repeat a few times, depending on the complexity of the bugs. The purpose of this exercise is not to enhance the product, but to provide a better understanding of how the software gets developed. If the bugs you fixed were more than typo fixes, you should've written or changed at least a few dozen lines of code. Get somebody from the old dev team to review your changes so you'll get feedback.

The more sophisticated your build environment is, the more there is to learn in this phase. Take a look at all the metrics your code changes produced. If you have code churn analysis or unit tests, there will probably be quite a few interesting reports to scroll through, perhaps even some tests to write.

Once you're past this phase, there's little mental virginity left in your head for this project. Therefore, this is the last possible moment of writing up the ideas you got during the process. What frustrated you? Which parts of the code looked most dubious? Which patterns and practices felt uncomfortable? Raise discussion. Propose better approaches. File bugs. You won't have the same edge later on. Beware of the cynicism that naturally comes from the senior members of the development team.

Next up: Jouni's five steps to fixing the issues found in this process ;-) (no, not really, but I _will_ try to post more notes later on)

March 25, 2005 · Jouni Heikniemi · One Comment
Posted in: Misc. programming

One Response

  1. nyhydro50 - May 5, 2006

    Seriously – I once was hired by a company who wanted me to learn their credit checking software inside out, and they gave me the time I needed. I went through every single function, and wrote debug messages "Entering Function 1: Parameters a,b,c values x,y,z", and at the end of every function I had "Exiting function … etc". Each time I ran trials, I got detailed examples of what functions were called, and what values were passed – it took some time and effort, but I found this the best way to understand what was happening. And when I say every single function, I mean it – I missed not one function. If they give you the time, I'd say go in there and write output messages, either to stdout, or to a file, or whatever… it's the only way.
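
The entry/exit tracing described in the comment above does not have to be typed into every function by hand. In a language with decorators, a wrapper can do it. The following Python sketch is only an illustration of the idea: `traced` is a name made up here, and `credit_score` is a hypothetical stand-in function, not anything from the commenter's software.

```python
import functools
import logging

logger = logging.getLogger("trace")

def traced(func):
    """Log a message on entry to and exit from the wrapped function."""
    @functools.wraps(func)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        logger.debug("Entering %s: args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the failure with its traceback, then let it propagate.
            logger.debug("Exiting %s with exception", func.__name__, exc_info=True)
            raise
        logger.debug("Exiting %s: returned %r", func.__name__, result)
        return result
    return wrapper

@traced
def credit_score(income, debts):
    # Hypothetical stand-in for a function you want to trace.
    return max(0, income - debts)
```

With `logging.basicConfig(level=logging.DEBUG)` the messages go to stderr; passing `filename="trace.log"` to `basicConfig` sends them to a file instead.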