Our inboxes are flooded every day with an uninterrupted stream of email, much of which can be classified as spam. For this reason it is essential that services like Gmail have sufficiently effective weapons to identify this unwanted content and filter incoming mail, to the benefit of both security and user experience.
Gmail is stronger against spam thanks to RETVec
The battle against spam is endless and constantly evolving, and Google has just illustrated the effectiveness of Gmail's latest innovation: the technology in use has made a 38% improvement in spam text identification possible.
Services such as Gmail, but also Google Play and YouTube, Google explains, rely on text classification models to recognize potentially harmful content (such as phishing attacks, inappropriate comments, and scams). These texts are especially difficult for machine learning models to classify, because attackers use a range of techniques, including homoglyphs (characters that look like real letters), invisible characters, keyword stuffing, and other "adversarial text manipulations" designed precisely to evade classifiers.
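To see why such manipulations are effective, here is a minimal sketch (not Google's code, and the keyword filter is a deliberately naive stand-in) showing how a homoglyph substitution can defeat exact-match filtering: the Cyrillic letters below look identical to their Latin counterparts but have different Unicode code points.

```python
# Map of Latin letters to visually identical Cyrillic homoglyphs
# (hypothetical minimal set for illustration).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic а, е, о

def disguise(word: str) -> str:
    """Replace Latin letters with look-alike Cyrillic characters."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in word)

def naive_filter(text: str) -> bool:
    """A toy filter that flags mail containing an exact blocked phrase."""
    return "free money" in text.lower()

original = "free money"
spoofed = disguise(original)

print(naive_filter(original))   # True  -> caught
print(naive_filter(spoofed))    # False -> evades the exact-match filter
print(original == spoofed)      # False, despite looking identical on screen
```

The two strings render indistinguishably to a human reader, yet compare as different byte sequences, which is exactly the gap a classifier robust at the character level has to close.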
The Mountain View giant's countermove is RETVec (Resilient & Efficient Text Vectorizer), a new approach developed by Google Research (and released as open source) that helps models achieve state-of-the-art classification performance while reducing computational costs, all while supporting "every language and all UTF-8 characters without the need for text preprocessing." In short, a system suited to use on mobile devices, on the web, and in large-scale use cases.
In the case of Gmail, RETVec made it possible to recognize spam 38% more effectively, while also reducing the false positive rate by 19.4% and Tensor Processing Unit (TPU) usage by as much as 83%.
Google explains that RETVec makes these important improvements possible thanks to a particularly lightweight word embedding model (on the order of 200k parameters), which allows the size of the Transformer model to be reduced with equal or better performance, and further makes it possible to split the computation between host and TPU efficiently (in terms of both network and memory).