Gmvault v1.8-beta, nicknamed 'kick-ass performance', released!

First things first: if all you are interested in is getting Gmvault v1.8-beta, please go here.

On the 21st of March 2013, the second day of Spring 2013, and because I waited for all the planets of the solar system to be aligned, a new version of Gmvault was released. It took me 7 months to release it because Gmvault needed to be refactored to be more manageable and also because I wanted to take some time to improve the Gmvault kernel. Many new features and bug fixes have been implemented in version 1.8-beta, like the support of German, French and Japanese for labels and emails, as well as the export function that allows you to view the emails stored in Gmvault in your favourite email client. For today, though, I have decided to focus this post on the performance improvement journey. If you want more information on the new features implemented in v1.8-beta, please refer to the download page.

So, going back to my train of thought, the Gmvault core had 2 weaknesses in my opinion: performance and internationalisation. Following a normal craftsman development cycle, I first implemented my ideas, focusing on the usability and friendliness of the tool and leaving out the performance issues. Then, as users with 300 000 emails and more started to appear, I decided to work on the performance issues.

1) Performance improvements

A Gmvault contributor (thank you Dave Vasilesky) spotted the performance issues encountered by Gmvault early on and even provided a solution. His solution was to send batch requests to the Gmail IMAP server during a sync operation instead of querying it for every individual email, which was a simple but effective idea. I wanted to go even further and also experimented with a multi-threaded Gmvault version and a multi-process one to circumvent the GIL problem. This took me some time but it turned out that the gain given by multi-tasking Gmvault was really small because most of the time is spent waiting on I/O from the Gmail server (how surprising). For that reason, and because error management is much more complicated in a multi-threaded environment, I decided to drop multi-tasking for the moment. Experimenting with asynchronous I/O might have been the next step, but the network calls used by IMAPClient and imaplib, the libraries Gmvault uses to talk "IMAP", are blocking calls and I didn't want to rewrite the IMAP layers at this stage.

The result is still quite good, as I can now backup my Gmail mailbox of 44000 emails in around 2.5 hours, whereas before it took around 9 hours. Another very interesting side effect is that scanning your complete mailbox to look for modified or new emails is now blazing fast.
It takes 2 to 3 minutes to scan my entire mailbox of 44000 emails, so one could drop the quick mode (only looking for new emails during the last week) and always use the full mode with a medium-sized mailbox.
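
To make the batching idea concrete, here is a minimal Python sketch. It is not the Gmvault code: the names chunked and fetch_in_batches, the fetch callable and the batch size of 500 are all assumptions made for the illustration. The only point is that the server is asked about a whole slice of email ids per round trip instead of one id at a time.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(items: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_in_batches(
    fetch: Callable[[List[int]], Dict[int, dict]],
    uids: Sequence[int],
    batch_size: int = 500,  # hypothetical value, not taken from Gmvault
) -> Dict[int, dict]:
    """Collect the metadata of all uids with one call to `fetch` per batch."""
    results: Dict[int, dict] = {}
    for batch in chunked(uids, batch_size):
        # `fetch` stands for whatever sends one request to the IMAP server.
        results.update(fetch(batch))
    return results


if __name__ == "__main__":
    requests_sent = []

    def fake_fetch(batch: List[int]) -> Dict[int, dict]:
        # Stand-in for the server: records the request and echoes the ids.
        requests_sent.append(len(batch))
        return {uid: {"uid": uid} for uid in batch}

    data = fetch_in_batches(fake_fetch, list(range(1, 1201)), batch_size=500)
    print(len(data), "emails fetched in", len(requests_sent), "requests")
```

With a real IMAP client, fetch would wrap the call that retrieves flags or labels for a list of message ids. Since most of the time is spent waiting on the server, cutting the number of round trips is where the gain comes from.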

I also tried to apply the same solution to the restore operation, which recreates your Gmail mailbox in a new account. There I was less successful because you cannot upload emails in batches. I also tried a multi-task approach (threads and processes) but it seems that Gmail serializes the upload of email contents on its side, so while an email is being inserted in their backend, it looks like it isn't possible to add another one in parallel. Still, I found some room for improvement in the way labels are applied to an email. Now I "batch" apply labels to multiple emails (batches of 80) at the same time because applying one label takes some time. I also did more optimisations that I will not detail here and which are due to the Gmail back-end implementation. Check the code if you wish to learn more about it: Gmvault on Github. All in all, I can now restore my entire mailbox of 44000 emails in 6 hours instead of 12, which is half of the time required previously.
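
One possible reading of the "batch" label application, sketched in Python: group the restored emails by label, then apply each label to up to 80 emails per call. The helper names and the apply_label callable are hypothetical, and the real Gmvault code may organise this differently.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

BATCH_SIZE = 80  # the batch size quoted in the post


def group_ids_by_label(labels_by_id: Dict[int, Iterable[str]]) -> Dict[str, List[int]]:
    """Invert {email id: labels} into {label: [email ids]}."""
    ids_by_label: Dict[str, List[int]] = defaultdict(list)
    for email_id, labels in labels_by_id.items():
        for label in labels:
            ids_by_label[label].append(email_id)
    return dict(ids_by_label)


def apply_labels_in_batches(
    apply_label: Callable[[str, List[int]], None],
    labels_by_id: Dict[int, Iterable[str]],
    batch_size: int = BATCH_SIZE,
) -> int:
    """Call apply_label(label, ids) once per batch and return the number of calls."""
    calls = 0
    for label, ids in group_ids_by_label(labels_by_id).items():
        for start in range(0, len(ids), batch_size):
            # apply_label is a placeholder for the real request to the server.
            apply_label(label, ids[start:start + batch_size])
            calls += 1
    return calls
```

For 200 emails that all carry the label "work", this makes 3 calls (80, 80 and 40 emails) where applying the label email by email would make 200.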

Finally, to mock Microsoft even if I do not hate them anymore (never say never, as I might one day have to be a Windows Phone app developer), I had to do some specific optimisations for Windows because of the slowness of the filesystem on Windows. As you may know, Gmvault stores all the emails in directories, with one directory created per month (2012-01, ...), and for the restore operation Gmvault reads all the email ids in all the directories before starting to restore the emails. It turns out that getting the directory content on Windows when a directory is a bit crowded (more than 10 emails) takes lots of time. On the same machine running successively Win7 and Linux, I created a test case where I read 250 000 emails over 50 directories. On Linux it takes around a second and on Win7, 30 seconds. Windows makes you humble: never take anything for granted with it.
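
For readers who want to measure this on their own machine, here is a small Python sketch that reads the email ids from month directories and times the operation. It is an illustration only, not the optimisation done in Gmvault: the function name, the ".eml" suffix and the file naming are assumptions. It relies on os.scandir, which can usually report whether an entry is a directory without an extra system call per entry.

```python
import os
import tempfile
import time
from typing import Set


def collect_email_ids(db_root: str, suffix: str = ".eml") -> Set[str]:
    """Return the ids of all emails stored in the month directories of db_root."""
    ids: Set[str] = set()
    with os.scandir(db_root) as months:
        for month in months:
            if not month.is_dir():
                continue  # only month directories such as 2012-01 are scanned
            with os.scandir(month.path) as entries:
                for entry in entries:
                    if entry.name.endswith(suffix):
                        # The id is assumed to be the file name without suffix.
                        ids.add(entry.name[: -len(suffix)])
    return ids


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for month in ("2012-01", "2012-02"):
            os.mkdir(os.path.join(root, month))
            for i in range(1000):
                open(os.path.join(root, month, f"{month}-{i}.eml"), "w").close()
        start = time.perf_counter()
        found = collect_email_ids(root)
        elapsed = time.perf_counter() - start
        print(len(found), "ids read in", round(elapsed, 3), "seconds")
```

Running the same script on both systems with a larger number of files gives a rough idea of the gap described above.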

2) Roadmap for the future developments

I now feel that the current version (1.8-beta) of Gmvault is in a stable state and should serve users well. I might even drop the beta in a few weeks if no real problems are reported. Now, where is Gmvault going?

The next steps and features will be the following: 

  • Build a nice user interface around the Gmvault kernel

Even if Gmvault is quite easy to use, it is necessary to create a better user experience to allow average Gmail users to backup their mailbox. Now, should the GUI be built on Python technologies like the rest of Gmvault? I don't think so: it will have to be based on an HTML5/JS stack, which could maybe also be used for building a cloud service in the future. This will be my focus for the next few months.

  • Automatic backups and scheduling

Users would like to use Gmvault in the background daily without having to think about it. Gmvault would then only backup new emails and would update the gmvault-db with modifications to already saved emails. The necessary information is now present in Gmvault to implement that feature and it is only a matter of saving more organised information. Automatic scheduling should also be added to the system somehow (delegated to the system scheduling capabilities or not).

  • View emails from the local email repository and search their content

A nice-to-have would be some capabilities to view email contents and search them. But wait, this could be integrated in the graphical interface; what is currently missing in Gmvault is some sort of full-text search capability and a database to organise access to the emails. I think that this feature will appear naturally once the GUI is there. The choice of database technology is not yet definitive, but a simple Sqlite3 DB to handle the email metadata (not the complete email, only part of it) might be sufficient, combined with some full-text search engine.
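
To give an idea of what such a metadata database could look like, here is a small sketch with the sqlite3 module from the Python standard library. Nothing here exists in Gmvault: the table layout, the column names and the function names are assumptions, and the search uses a plain LIKE as a stand-in for a real full-text search engine.

```python
import sqlite3
from typing import List, Tuple


def create_index(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table holding one row of metadata per email."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS emails ("
        "gm_id INTEGER PRIMARY KEY, subject TEXT, sender TEXT, month_dir TEXT)"
    )


def add_email(conn: sqlite3.Connection, gm_id: int, subject: str,
              sender: str, month_dir: str) -> None:
    """Insert or update the metadata of one email."""
    conn.execute(
        "INSERT OR REPLACE INTO emails (gm_id, subject, sender, month_dir) "
        "VALUES (?, ?, ?, ?)",
        (gm_id, subject, sender, month_dir),
    )


def search_subject(conn: sqlite3.Connection, term: str) -> List[Tuple[int, str]]:
    """Return (gm_id, subject) for the emails whose subject contains term."""
    cursor = conn.execute(
        "SELECT gm_id, subject FROM emails WHERE subject LIKE ? ORDER BY gm_id",
        (f"%{term}%",),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_index(connection)
    add_email(connection, 1, "Invoice March", "shop@example.org", "2012-03")
    add_email(connection, 2, "Holiday pictures", "friend@example.org", "2012-04")
    print(search_subject(connection, "invoice"))
```

The month_dir column would let a viewer locate the email file on disk, so the database only needs to hold the part of the email that is searched or displayed in lists.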

I have many more ideas in mind that might be added to the road-map but I think that with these 3 objectives, my plate is already pretty full.
Once the GUI is here, I think that either the road will start to widen or I will hit an invisible wall.

3) Help me to revamp the Gmvault site

The site is one year old and needs to be revamped, but I am not a web-site designer, so if you wish to help with that, please contact me.

If you wish to leave comments, please use the HackerNews comments, as comments have not yet been enabled in Posthaven, and support Posthaven, which seems to be a genuine effort.

3 months later, Gmvault 1.7-beta: the initial version I would have liked to build

First, if all you are interested in is getting Gmvault 1.7-beta, click here.

Gmvault: Gmail backup simply, v1.0 was released on the 7th of May 2012 on HackerNews and since then I have been tremendously surprised by its success. Gmvault is a command-line tool built to backup your Gmail inbox to your disk and restore it as it was in any Gmail account you wish. It is full of features such as encryption, compression, automatic syncing, GTalk chats backups (and many more), while being very simple to use. It had to be simple because I wanted everybody to be able to use it, not only geeks like me, but also normal users that have a GMail account and want to reclaim their emails from Google. Simplicity and usability are not the topic of this post, so let's go back to it.

There have been more than 30000 downloads and more than 80000 users coming to www.gmvault.org in total since the 7th of May, which is a lot considering that the only advertisement I did was to post it on HackerNews. There have also been many website articles (http://bitly.com/bundles/gaubert/5) and even YouTube videos explaining how to use it (http://bit.ly/OWN4Xc) :-)

I am very happy about this adventure and would like to pursue it by developing v2.0, which will come with a graphical interface in order to be really available to any user in the Gmail users' spectrum: from my Grandma to myself. But before going to v2.0, there is a mandatory stop at v1.7-beta.

v1.7-beta is the fourth version, released 3 months after v1.0-beta, and it is the first version I am starting to be satisfied with. In between there were v1.5-beta and v1.6-beta:

v1.0-beta was the proof of concept that established the name and demonstrated that there was a need for such a tool, but it was not fully ready. The backup-restore engine was already working very well, but being an experienced developer I knew I had not yet hit the performance issues you always encounter when you launch new software. I also knew that the deployment and packaging were not ready but I wanted to release it to see if the idea would get a bit of traction (and it did!). I was in the Fuck it ship it mode (http://bit.ly/PQIuFE) when I did it and am very grateful for having done that.

v1.5-beta fixed the major issues of v1.0-beta, which were that it did not work properly on Linux because of some bash script issues and did not uninstall cleanly on Windows. These were small issues but they could totally stop users from running it, so I had to fix them quickly and I released a new version less than 10 days later. In my hurry, I even messed up one feature (the encryption could not be activated anymore) as I had deactivated some of my tests (so bad).

v1.6-beta started to tackle some performance issues reported by some users. In some cases, the restore operation was eating all the memory of the machine and eventually crashing it. I had to do some profiling in Python, which is not the easiest, and found that the issue was coming from the socket and ssl Python layer. I also worked a lot on the MacOSX deployment as there were some issues with 32-bit machines. It is a bit tricky to make a clean Python OSX app and I will probably write a blog post about it. With that version I was closer to my goal but users had started to request new features that needed to be added and new restore issues started to appear.

So I started to work on v1.7-beta but decided to take my time in building this version as I needed to eliminate the performance issues. In special cases, the Python process could start to spin out of control and consume lots of CPU. Again I found a bug in the socket layer and had to monkey patch the socket module to fix it. Note that the bug has been reported to the Python developers but the fix has not made it into all releases yet. It also turns out that GMail IMAP doesn't want to ingest back some of the emails it originally contained and it spits them out in error, so I created a quarantine area to stage these emails. Most of the time, these are advertising emails that contain bad characters. I also added new features such as Chats backup. Chats are now stored like emails and can be restored in a special label, gmvault-chats, as we are not allowed to write in Chats. GMail IMAP is like its own little world as it is full of specific behaviours that do not necessarily follow the IMAP philosophy. It is compliant with IMAP but Gmail engineers had to make some choices, for example on how to map IMAP folders to labels. This will also deserve a blog post as I have some questions for the GMail team. I hope that v1.7-beta is in the stable state I would have liked to have from the beginning. Note that v1.0 was not at all disastrous but I could have done a little bit better for a launch.
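
The quarantine idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the upload callable and the directory layout are made up for the example and do not come from the Gmvault code.

```python
import os
import shutil
from typing import Callable, Iterable, List


def restore_with_quarantine(
    email_paths: Iterable[str],
    upload: Callable[[str], None],
    quarantine_dir: str,
) -> List[str]:
    """Upload each email and move the ones the server rejects to quarantine_dir."""
    os.makedirs(quarantine_dir, exist_ok=True)
    quarantined: List[str] = []
    for path in email_paths:
        try:
            upload(path)  # placeholder for the real IMAP upload
        except Exception:
            # A real client would catch its specific IMAP error here.
            target = os.path.join(quarantine_dir, os.path.basename(path))
            shutil.move(path, target)
            quarantined.append(target)
    return quarantined
```

Moving the rejected files aside keeps the restore going for the remaining emails, and the quarantined ones stay available for a later look.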

So what should I do next time I release a new product?

Lessons Learned

  • Launch early: I said that v1.7-beta was the version I wanted to build but I am really happy to have released v1.0 early even if it needed more work. First, the users' response told me whether I should invest more time in Gmvault, and the tool's success lifted me and pushed me to improve it. I also fixed many little issues and imperfections reported by users and could not have found so many of them even with the testing methods I applied. I am also a perfectionist and have in the past waited so long that in the end I lost interest in some projects and never released them. Of course you should reach a reasonable stage for your software (for me it is 90-95% ready); I am not telling you to launch when it is barely working. Also, don't be afraid to launch: if it is useful, you will get some feedback from the users, and at worst they will ignore it.
  • Focus on the software deployment: You should spend a good chunk of the development time building the packaging for deploying the software. This is the first part of the software users are going to see, so the Windows installer, Linux deb ... should be easy to install, clean and shiny if possible, as well as easy to uninstall. You do not want to be intrusive and let users think that you do not want them to uninstall your tool.
  • Spend lots of time building the marketing: If there was one thing I would like to do, it is to focus more on how to sell the product. I have realised that it is not enough to have a product translating a neat idea into well executed software (execution is still important though); you need to know how to sell it. I still have a lot to learn in that field but now I am convinced that you also need to spend a heck of a lot of time building your web site. For v2.0 I will spend much more time doing that for sure. Like anybody, I would prefer to solve technical issues than to write English explaining how to use Gmvault and selling its merits (not so easy), but you really really need to do it or find somebody able to do it.
  • Launch in the right community: I advertised Gmvault on Hacker News because I knew it was a popular destination for specialists not afraid of running scripts and CLI tools. Releasing in this community also allows you to reach a larger group, as the HackerNews page is scrutinised by the more general tech web sites (PCWorld, Slashdot, ...). It is the centre of the tech web :-) and the HackerNews crowd is a great community of professionals that will not shy away and will frankly tell you what is missing and what is ok regarding your product.
  • Do not mess with your test suite and don't be pretentious: To be quick and because I was too pretentious, I deactivated part of my tests and I will not do it again. Because of that, in one of my intermediate versions I destroyed one of the features (encrypting stored emails), but luckily it was not used by lots of users yet. Under pressure I neglected the testing and paid for it.

This is it. To enjoy Gmvault v1.7-beta, please go there.

Do not hesitate to contact me via Github, Twitter, email, Gmvault-Users forum ... if you want to report a problem or would like to suggest a nice feature.

How to add Google Calendar iframes in a MoinMoin Page

We use a simple MoinMoin wiki at work in our team in order to record some best practices, information related to problems and so on.

I wanted to maintain a calendar of events related to the team and Google Calendar is the right tool for that, but I didn't want to give the team a new URL to track.

So I looked at how to embed it in a MoinMoin page. For that, Google Calendar advises embedding an iframe, which can also show multiple calendars at once (see picture).

So I did a little GSearch and ended up on the HTML.py macro for MoinMoin.

It is a pretty dumb macro, as all it does is pass the macro argument to the HTML engine that is going to render it.

The issue is that the macro is only compatible with versions prior to 1.6.

So I modified it to be compatible with version 1.6 onwards.

It is available on github under https://github.com/gaubert/geekomotion/blob/master/HTML.py

Get the file and drop it under [MY_MOIN_MOIN_ROOT]/data/plugin/macro or in [PYTHON_DISTRIB]/lib/pythonx.x/site-packages/MoinMoin/macro.

If it works you should have a handy page like that.

Just one additional comment: when you copy the iframe tag from the Google Calendar page (see below), remove all quotes from the iframe attributes, otherwise the MoinMoin macro system will not work.

If you have <iframe src="https://www.google.com/calendar/embed?src=koa1pe0jc7195mb51carkmtpak%40group.calendar.google.com&ctz=Europe/Berlin" style="border: 0" width="800" height="600" frameborder="0" scrolling="no"></iframe>

You should insert the following HTML macro:

 

<<HTML("iframe src=https://www.google.com/calendar/embed?src=koa1pe0jc7195mb51carkmtpak%40group.calendar.google.com&ctz=Europe/Berlin style=border: 0 width=800 height=600 frameborder=0 scrolling=no></iframe>")>>
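
If you have several calendars to embed, removing the quotes by hand gets tedious. Here is a tiny Python helper that does it; the function name is made up for the example and it only removes the quote characters, so the result still has to be placed inside the HTML macro as shown above.

```python
def strip_attribute_quotes(iframe_tag: str) -> str:
    """Remove the double and single quotes around the iframe attribute values."""
    return iframe_tag.replace('"', "").replace("'", "")


if __name__ == "__main__":
    # example.org is a placeholder URL, replace it with your calendar URL.
    tag = '<iframe src="https://example.org/cal" width="800" height="600"></iframe>'
    print(strip_attribute_quotes(tag))
```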

 


Enjoy !!!