Friday, December 27, 2013

Optimizing your Code

The other day I was writing some code and caught myself doing this:

public Document[] ConvertDocuments(DataTable docsSource){
    List<Document> docs = new List<Document>();

    for(int i = 0; i < docsSource.Rows.Count; i++){
        docs.Add(FillNewDocument(docsSource.Rows[i]));
    }

    return docs.ToArray();
}

Although it's valid code, programmers, especially in OO languages, often relax their implementations with excuses like "oh, the compiler optimizes that..." or "if you want it to run fast, write it in C...", but for me these are just excuses to comfort lazy programmers' minds...


Analyzing the code above, we can extract the following:



  • The variable docs is a List; in .NET a List appends in amortized O(1) time, and this is the good part;
  • However, suppose Document has the fields ID (int), Name (char[128]) and Path (char[1024]), so each Document object measures 4 + 128 + 1024 bytes = 1156 bytes ≈ 1.16 KB, counting 1 byte per char (see the sketch after this list);
  • Then you convert it to an array; imagine docsSource.Rows.Count = 1000, which means you have allocated 1000 Document objects on the heap to store your "Converted Documents" => 1000 × 1.16 KB, doing the math it comes to ≈ 1.16 MB;
  • Plus the fact that ToArray() takes O(n) time, where n is the size of the List;
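
To make the size estimate concrete, here is a minimal sketch of the hypothetical Document class assumed above. Note that in C# a char is actually 2 bytes (UTF-16), so the 1-byte-per-char figure is a C-style simplification:

public class Document{
    public int ID;                          // 4 bytes
    public char[] Name = new char[128];     // 128 bytes at 1 byte per char
    public char[] Path = new char[1024];    // 1024 bytes at 1 byte per char
}                                           // ≈ 1156 bytes per object in this model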

So the overall killer cost of the code above reduces to:
  • Memory Usage: 1 List of 1000 elements ≈ 1.16 MB, plus 1 array of 1000 copies of the elements; although the ToArray() documentation says it copies the elements, for a List of objects that means it copies the references, so the extra array sizes 1000 × 4 bytes (32-bit references) = 4000 bytes ≈ 3.9 KB;
  • Computation Costs: Well, this is the funny part. The List in .NET is implemented internally as a dynamically resizable array, so your generic List is in fact an array; but it is not even pre-allocated, so at run-time it starts with a small initial capacity and keeps doubling its size as you add items to it (see the sketch after this list).

    That being said, for n elements the list gets reallocated about log2(n) times; each reallocation copies the current contents, so across all the doublings roughly 2n element copies are made in total, which is why Add is only amortized O(1) and the capacity tends to stabilize;

    The final ToArray() copy typically takes O(n). Unlike in languages like C, here we are dealing with a managed heap, so the time a single allocation takes is non-deterministic, but in big-O terms let's call memory allocation O(n).
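
To see this doubling in action, here is a minimal, runnable sketch; the printed capacities (4, 8, 16, ...) follow the default growth policy of .NET's List<T>, which may vary slightly between runtime versions:

using System;
using System.Collections.Generic;

class CapacityDemo{
    static void Main(){
        List<int> list = new List<int>();
        int lastCapacity = list.Capacity;           // 0 for a brand new list

        for(int i = 0; i < 1000; i++){
            list.Add(i);
            if(list.Capacity != lastCapacity){      // a reallocation just happened
                Console.WriteLine("Count=" + list.Count + "  Capacity=" + list.Capacity);
                lastCapacity = list.Capacity;
            }
        }
        // Each capacity change means a new internal array was allocated and
        // all existing elements were copied across.
    }
}
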
In this quick analysis we can observe how easily an ordinary programmer can fall into the Lazy Programmer Paradigm when dealing with high-level languages.

Dynamic memory management at run-time, heap allocations, Lists that are implemented as arrays and so forth... these are all concepts that we must understand and study so we can write better programs. Once more the key is knowledge, but to be prepared to receive knowledge we must first understand what is wrong with our implementations and what we can do to improve our code.

Don't just type code, bond with it as if it's part of you, because it is!

To finish this post, let me just show you how much easier and more elegant the solution could be:

public Document[] ConvertDocuments(DataTable docsSource){

    Document[] docs = new Document[docsSource.Rows.Count];

    for(int i = 0; i < docsSource.Rows.Count; i++){
        docs[i] = FillNewDocument(docsSource.Rows[i]);
    }

    return docs;
}

Memory Usage: for n elements, n × sizeof(Document).
Time Complexity: O(n), assuming FillNewDocument is O(1).

No intermediate list, no repeated reallocations and copies: one single allocation of exactly the right size, just straightforward and simple code.
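
And if downstream code really needs the intermediate List, a middle-ground sketch is to pre-size it; since docsSource.Rows.Count is known up front, the doubling never kicks in:

public Document[] ConvertDocuments(DataTable docsSource){
    // Pre-sizing avoids the repeated doubling/copying described above.
    List<Document> docs = new List<Document>(docsSource.Rows.Count);

    foreach(DataRow row in docsSource.Rows){
        docs.Add(FillNewDocument(row));
    }

    return docs.ToArray();  // still one O(n) reference copy to produce the array
}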


It's true the compiler does lots of things for you, but the perfect programmer will have read the language's compiler book, so he knows for sure how much he can relax his implementation!


Or you can just code in C, with all the advantages and evils it could bring upon you :) !