Friday, March 12, 2021

Importing mail from Google Takeout into another Gmail account

I had a Google Takeout export of my old university Google account, which I wanted to import into my personal Gmail. Unfortunately, Google does not provide an easy way to do this. I decided to use this as an opportunity to refresh my Haskell skills and, in particular, to familiarize myself with the Pipes library. The idea was to process an arbitrarily large mailbox in fixed memory without explicitly relying on Haskell's laziness. I will not bother you with the details of my Haskell implementation. You can take a look at the code yourself, but take it with a grain of salt: I am not a practicing Haskell programmer. For me, this language is more of a lingua franca used in the academic functional programming community. Instead, I want to tell you about a few quirks and bugs of the Gmail IMAP interface that I discovered along the way.

First of all, Google Takeout conveniently exports all your mail in the standard mbox format as one huge file. Interestingly, this file also contains your chat transcripts, disguised as RFC822-formatted messages, but without the Message-Id field. They can be distinguished by the presence of the "Chat" label in the X-GM-Labels header field. I chose to discard them.
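
For illustration, here is a minimal Python sketch of this filtering step. It is not taken from the Haskell script. It assumes that the label header holds a comma-separated list, and the header name is a parameter because the exact spelling should be checked against your own export. The function names are hypothetical.

```python
import mailbox
from typing import Iterator


def has_label(labels_value: str, wanted: str) -> bool:
    """Return True if a comma-separated label list contains `wanted`."""
    labels = [label.strip() for label in labels_value.split(",")]
    return wanted in labels


def non_chat_messages(
    mbox_path: str,
    label_header: str = "X-GM-Labels",  # name as given in the post; verify it in your export
    chat_label: str = "Chat",
) -> Iterator[mailbox.mboxMessage]:
    """Yield the messages of an mbox file one by one, skipping chat transcripts."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # Assumption: labels are stored as a comma-separated list.
            labels_value = str(message.get(label_header, ""))
            if not has_label(labels_value, chat_label):
                yield message
    finally:
        box.close()
```

Because `non_chat_messages` is a generator, only one parsed message needs to be held at a time, although Python's `mailbox.mbox` still scans the file once up front to record where each message starts.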

Speaking of labels: IMAP has folders, while Gmail has labels. The Gmail IMAP interface maps labels to folders. To assign multiple labels to a single message, I use IMAP's STORE command together with Gmail's X-GM-LABELS IMAP extension.
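
As a rough sketch of what such a STORE command looks like, the Python snippet below builds the parenthesized label list and sends it with the standard `imaplib` module. The helper names are hypothetical, and the sketch only covers plain ASCII user labels: system labels such as `\Inbox` and non-ASCII label names need extra care that is not shown here.

```python
import imaplib
from typing import Iterable


def quote_imap_string(value: str) -> str:
    """Format `value` as an IMAP quoted string (backslash and quote escaped)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def labels_argument(labels: Iterable[str]) -> str:
    """Build the parenthesized label list that follows +X-GM-LABELS."""
    return "(" + " ".join(quote_imap_string(label) for label in labels) + ")"


def add_labels(conn: imaplib.IMAP4, uid: str, labels: Iterable[str]) -> None:
    """Add `labels` to the message with the given UID in the selected folder."""
    # Sends: UID STORE <uid> +X-GM-LABELS ("label1" "label2" ...)
    status, data = conn.uid("STORE", uid, "+X-GM-LABELS", labels_argument(labels))
    if status != "OK":
        raise RuntimeError(f"STORE failed for UID {uid}: {data!r}")
```

Using `+X-GM-LABELS` adds to the labels a message already has, so a single call can attach every label at once.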

The standard IMAP protocol (without the UIDPLUS extension) does not give you the unique ID of a message you have just appended. I need this unique ID to assign labels. So, after successfully appending a message, I immediately look it up by its Message-Id header field. At this point, I hit two bugs in Gmail's IMAP implementation:

Bug #1

When a Message-Id contains the '%' character, the IMAP SEARCH command fails to find the message. This is a screenshot, taken from the Gmail web interface, of such a message present in my mailbox:



But the following IMAP command fails to locate it:
UID SEARCH HEADER Message-ID "<D7DA993B.B30DC%psztxa@exmail.nottingham.ac.uk>"
Some people observed the same problem with the "!" character and speculated that Gmail splits the Message-Id into "words" before indexing. The proposed workaround simulates this split, searches on the conjunction of several parts of the Message-Id, and then verifies that the found message has the correct Message-Id by fetching it. Instead, I chose to implement a more lightweight solution using Gmail's X-GM-RAW extension to IMAP's SEARCH command, which allows searching with the Gmail search syntax. In particular, for the example used above, one can use
 
"Rfc822msgid:<D7DA993B.B30DC%psztxa@exmail.nottingham.ac.uk>" 
 
search query, which successfully finds the message with "%" in Message-Id.
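
A minimal Python sketch of this lookup, using the standard `imaplib` module, might look as follows. The function names are hypothetical, the query prefix is spelled as in the example above, and the code assumes a folder has already been selected on the connection.

```python
import imaplib
from typing import List, Tuple


def quote_imap_string(value: str) -> str:
    """Format `value` as an IMAP quoted string (backslash and quote escaped)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def message_id_search_args(message_id: str) -> Tuple[str, str]:
    """Build the SEARCH arguments for an X-GM-RAW lookup by Message-Id."""
    # Prefix spelled as in the post's example query.
    return ("X-GM-RAW", quote_imap_string("Rfc822msgid:" + message_id))


def find_uids_by_message_id(conn: imaplib.IMAP4, message_id: str) -> List[bytes]:
    """Return the UIDs of messages whose Message-Id matches `message_id`."""
    status, data = conn.uid("SEARCH", *message_id_search_args(message_id))
    if status != "OK":
        raise RuntimeError(f"SEARCH failed: {data!r}")
    # imaplib returns one space-separated bytes string of UIDs (possibly empty).
    return data[0].split() if data and data[0] else []
```

An empty result means the message was not found, so a caller can treat that as a failed append rather than silently skipping the labels.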

Bug #2

Another Message-Id-related bug I stumbled upon is more surprising. It turns out that, when indexing messages, Gmail strips square brackets from Message-Ids! If you look at the screenshot from the web interface, you can see that the Message-Id shown in the header does not match the actual field value below! To search for such messages, I had to strip the square brackets from the search string.
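
In Python, the normalization amounts to something like the sketch below. The function name and the sample Message-Id in the usage note are made up for illustration.

```python
def strip_square_brackets(message_id: str) -> str:
    """Remove '[' and ']' from a Message-Id, leaving everything else intact."""
    return message_id.translate(str.maketrans("", "", "[]"))
```

For a hypothetical Message-Id such as `<[abc]-123@example.com>`, the search string becomes `<abc-123@example.com>`, while identifiers without square brackets pass through unchanged.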


Conclusion

The result of this exercise is not a production-ready script, and other problems may occur during an import, but I was able to successfully import my mailbox of about 32K messages. You are welcome to hack my script for your own needs. I've submitted a pull request to the HaskellNet library with my bugfixes and changes to support Gmail's IMAP extensions.
 
P.S. Both bugs were reported to Google: #183687621 and #183677218.