Gmvault v1.8-beta, nicknamed 'kick-ass performance', released!

First things first: if all you are interested in is getting Gmvault v1.8-beta, please go here.

On the 21st of March 2013, the second day of Spring 2013 (I waited for all the planets of the solar system to be aligned), a new version of Gmvault was released. It took me 7 months because Gmvault needed to be refactored to be more manageable, and also because I wanted to take some time to improve the Gmvault kernel. Many new features and bug fixes have been implemented in version 1.8-beta, such as support for German, French and Japanese in labels and emails, and an export function that lets you view the emails stored in Gmvault in your favourite email client. For today, however, I have decided to focus this post on the performance improvement journey. If you want more information on the new features implemented in v1.8-beta, please refer to the download page.

So, going back to my train of thought: the Gmvault core had 2 weaknesses in my opinion, performance and internationalisation. Following a normal craftsman's development cycle, I first implemented my ideas, focusing on the usability and friendliness of the tool and leaving the performance issues aside. Then, as users with 300 000 emails and more started to appear, I decided to work on the performance issues.

1) Performance improvements

A Gmvault contributor (thank you Dave Vasilesky) spotted Gmvault's performance issues early on and even provided a solution. His solution was to send batched requests to the Gmail IMAP server during a sync operation instead of querying it once for every individual email, which was a simple but effective idea. I wanted to go even further, so I also experimented with a multi-threaded Gmvault version and a multi-process one to circumvent the GIL problem. This took me some time, but it turned out that the gain from multi-tasking Gmvault was really small because most of the time is spent waiting on I/O from the Gmail server (how surprising). For that reason, and because error management is much more complicated in a multi-threaded environment, I decided to drop multi-tasking for the moment. Experimenting with asynchronous I/O might have been the next step, but the network calls used by IMAPClient and imaplib, the libraries Gmvault uses to talk "IMAP", are blocking calls and I didn't want to rewrite the IMAP layers at this stage.

The result is still quite good: I can now back up my Gmail mailbox of 44000 emails in around 2.5 hours, whereas before it took around 9 hours. Another very interesting side effect is that scanning your complete mailbox to look for modified or new emails is now blazing fast.
It takes 2 to 3 minutes to scan my entire mailbox of 44000 emails, so one could drop the quick mode (only looking for new emails during the last week) and always use the full mode with a medium-sized mailbox.
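
To make the batching idea concrete, here is a minimal sketch of it, not the actual Gmvault code. The names `chunked` and `fetch_in_batches` and the default batch size of 500 are assumptions chosen for illustration. The `fetch` argument stands for any callable that takes a list of message ids and returns a dict keyed by id.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(ids: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `ids` holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_in_batches(
    fetch: Callable[[List[int]], Dict[int, dict]],
    ids: Sequence[int],
    batch_size: int = 500,  # arbitrary value, not taken from Gmvault
) -> Dict[int, dict]:
    """Call `fetch` once per batch instead of once per id, then merge the results."""
    results: Dict[int, dict] = {}
    for batch in chunked(ids, batch_size):
        # One round trip to the server covers the whole batch.
        results.update(fetch(batch))
    return results
```

With IMAPClient, `fetch` could be something like `lambda batch: client.fetch(batch, ['X-GM-LABELS', 'FLAGS'])`, since its `fetch` method accepts a list of message ids. The saving comes from paying the network latency once per batch rather than once per email.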

I also tried to apply the same solution to the restore operation, which recreates your Gmail mailbox on a new account. There I was less successful because you cannot upload emails in batches. I also tried a multi-task approach (threads and processes), but it seems that Gmail serializes the upload of email contents on its side: while one email is being inserted into their backend, it looks like it isn't possible to add another one in parallel. Still, I found some room for improvement in the way labels are applied to an email. Because applying one label takes some time, I now "batch" apply labels to multiple emails (batches of 80) at the same time. I also made more optimisations, which I will not detail here, that are tied to the Gmail back-end implementation. Check the code if you wish to learn more about them: Gmvault on Github. All in all, I can now restore my entire mailbox of 44000 emails in 6 hours instead of 12 hours, which is half of the time required previously.
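
The label batching described above can be sketched roughly as follows. This is an illustration under assumptions, not the Gmvault implementation: the function name `label_batches` and the input shape (a dict mapping an email id to its labels) are hypothetical. Only the batch size of 80 comes from the post.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

BATCH_SIZE = 80  # batch size quoted in the post


def label_batches(
    labels_by_email: Dict[int, Iterable[str]],
    batch_size: int = BATCH_SIZE,
) -> Iterator[Tuple[str, List[int]]]:
    """Yield (label, email_ids) pairs, each holding at most `batch_size` ids."""
    # Invert the mapping: for each label, collect every email that carries it.
    ids_by_label: Dict[str, List[int]] = defaultdict(list)
    for email_id, labels in labels_by_email.items():
        for label in labels:
            ids_by_label[label].append(email_id)
    # Emit the ids of each label in slices, so one request can label many emails.
    for label in sorted(ids_by_label):
        ids = ids_by_label[label]
        for start in range(0, len(ids), batch_size):
            yield label, ids[start:start + batch_size]
```

Each yielded pair could then be turned into a single labelling request for the whole group of emails, instead of one request per email and per label.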

Finally, to mock Microsoft even if I do not hate them anymore (never say never, as I might one day have to be a Windows Phone app developer), I had to do some specific optimisations for Windows because of how slowly its filesystem lists directories. As you may know, Gmvault stores all the emails in directories, with one directory created per month (2012-01, ...), and for the restore operation Gmvault reads all the email ids in all the directories before starting to restore the emails. It turns out that getting the directory content on Windows when a directory is a bit crowded (more than 10 emails) takes lots of time. On the same machine running successively Win7 and Linux, I created a test case where I read 250 000 emails over 50 directories. On Linux it takes around a second, and on Win7 30 seconds. Windows makes you humble: never take anything for granted with it.
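
As an illustration of the "read all the email ids in all the directories" step, here is a small sketch that lists each month directory exactly once. It is not the Gmvault code: the function name `collect_email_ids` and the assumption that each email has a metadata file named `<id>.meta` are mine, and `os.scandir` only exists in recent Python 3 versions.

```python
import os
from typing import Dict, List


def collect_email_ids(db_root: str, suffix: str = ".meta") -> Dict[str, List[str]]:
    """Map each month directory name to the sorted ids found inside it."""
    ids_by_month: Dict[str, List[str]] = {}
    with os.scandir(db_root) as months:
        for month in months:
            if not month.is_dir():
                continue
            # A single listing per directory; no per-file call to build the id list.
            with os.scandir(month.path) as entries:
                ids_by_month[month.name] = sorted(
                    entry.name[:-len(suffix)]
                    for entry in entries
                    if entry.name.endswith(suffix)
                )
    return ids_by_month
```

Timing a call such as `collect_email_ids("/path/to/db")` on both systems is one way to reproduce the kind of comparison described above. The ids are kept as strings, so they sort lexicographically.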

2) Roadmap for the future developments

I now feel that the current version (1.8-beta) of Gmvault is in a stable state and should serve users well. I might even drop the beta in a few weeks if no real problems have been reported. Now, where is Gmvault going?

The next steps and features will be the following: 

  • Build a nice user interface around the Gmvault kernel

Even if Gmvault is quite easy to use, it is necessary to create a better user experience so that average Gmail users can back up their mailbox. Now, should the GUI be built on Python technologies like the rest of Gmvault? I don't think so: it will have to be based on an HTML5/JS stack, which could maybe be used for building a cloud service in the future. This will be my focus for the next few months.

  • Automatic backups and scheduling

Users would like to run Gmvault daily in the background without having to think about it. Gmvault would then only back up new emails and update the gmvault-db with modifications to already saved emails. The information needed to implement that feature is now present in Gmvault, and it is only a matter of saving it in a more organised way. Automatic scheduling should also be added somehow (delegated to the system scheduling capabilities or not).

  • View emails from the local email repository and search their content

A nice-to-have would be the ability to view email contents and search them. This could be integrated in the graphical interface; what is currently missing in Gmvault is some sort of full-text search capability and a database to organise the access to the emails. I think that this feature will appear naturally once the GUI is there. The choice of database technology is not yet definitive, but a simple Sqlite3 DB to handle the email metadata (not the complete email, only part of it), combined with some full-text search engine, might be sufficient.
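
To give an idea of what a small Sqlite3 metadata store could look like, here is a purely speculative sketch. Nothing in it exists in Gmvault: the table name `email_meta`, its columns and both function names are invented for the example, and the `LIKE` query is only a substring match standing in for a real full-text search engine.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical schema: only metadata is stored, not the email body.
SCHEMA = """
CREATE TABLE IF NOT EXISTS email_meta (
    gm_id   INTEGER PRIMARY KEY,
    subject TEXT NOT NULL,
    sender  TEXT NOT NULL,
    month   TEXT NOT NULL  -- storage directory, e.g. '2012-01'
)
"""


def build_index(rows: Iterable[Tuple[int, str, str, str]]) -> sqlite3.Connection:
    """Create an in-memory database and load (gm_id, subject, sender, month) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO email_meta VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def search_subject(conn: sqlite3.Connection, term: str) -> List[Tuple[int, str]]:
    """Return (gm_id, month) for every email whose subject contains `term`."""
    cursor = conn.execute(
        "SELECT gm_id, month FROM email_meta WHERE subject LIKE ? ORDER BY gm_id",
        ("%" + term + "%",),
    )
    return cursor.fetchall()
```

The `month` column would let a viewer locate the stored email on disk from a search hit, so the database only needs to index the emails rather than hold them.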

I have many more ideas in mind that might be added to the roadmap, but I think that with these 3 objectives my plate is already pretty full.
Once the GUI is here, I think that either the road will start to widen or I will hit an invisible wall.

3) Help me to revamp the Gmvault site

The site is one year old and needs to be revamped, but I am not a website designer, so if you wish to help in that matter please contact me.

If you wish to leave comments, please use the HackerNews comments, as comments have not yet been enabled in Posthaven. And please support Posthaven, which seems to be a genuine effort.