Online advertising has been the engine that powers online content creation for decades. The modern online advertising ecosystem has evolved to optimize for conversions (i.e., clicks and purchases). This practice often leads to persistent, privacy-invasive collection of browsing and behavioural data through an infrastructure of third-party trackers, which follows an individual across seemingly unrelated websites and feeds analytics models. Over the last decade, users have increasingly pushed back against these privacy-violating practices, both through adoption of products such as Brave that automatically block third-party trackers, and through legislative and regulatory actions such as GDPR.
However, while it’s possible to suppress tracking, many users are still interested in receiving relevant recommendations. As a result, we have seen a number of academic proposals for privacy-preserving advertising systems, such as Privad and Adnostic. Brave Ads, however, was the first such system to be deployed in production. Introduced in April 2019, Brave Ads provide Brave’s current 18 million monthly active users the choice to opt into privacy-preserving advertising and to get rewarded for their attention.
Why do things this way?
While privacy researchers and advocates often focus on preventing certain behaviors, it’s important to realize the tremendous opportunities presented by client-side machine learning. Indeed, in the context of a web browser, the browser itself sees the entirety of a user’s interaction with the web, which includes both “basic” unauthenticated browsing such as reading the news as well as obviously privacy-sensitive activities, such as reading the user’s emails or social media interactions through a web interface.
This means that with the right approach, one can design client-side machine learning models for ad matching that are more accurate than what is achieved through third-party tracking, which ultimately leads to fairly few clicks. This shows in the click-through rates: Brave Ads achieve 9% compared with 2% for conventional ad-tech, a lift of 7 percentage points.
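The two click-through rates quoted above can be compared in two ways, and it helps to keep them apart. The short sketch below computes both the absolute lift (in percentage points) and the relative lift (as a ratio) from the 9% and 2% figures in the text. The function names are illustrative only.

```python
def absolute_lift_points(ctr_a: float, ctr_b: float) -> float:
    """Difference between two click-through rates, in percentage points."""
    return (ctr_a - ctr_b) * 100.0


def relative_lift(ctr_a: float, ctr_b: float) -> float:
    """How many times higher ctr_a is than ctr_b."""
    return ctr_a / ctr_b


brave_ctr = 0.09         # 9% click-through rate quoted in the text
conventional_ctr = 0.02  # 2% click-through rate quoted in the text

points = absolute_lift_points(brave_ctr, conventional_ctr)
ratio = relative_lift(brave_ctr, conventional_ctr)
print(f"absolute lift: {points:.1f} percentage points")
print(f"relative lift: {ratio:.1f}x")
```

Read this way, the 7 figure is a difference in percentage points, while in relative terms the quoted rate is several times that of conventional ad-tech.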
When it comes to data sources, we may consider several options, only some of which are currently used by Brave:
- Pages that the user visits: these reveal both long- and short-term user interests. If we build a model based on the contents of the pages that the user visited over the last two months, we will get a pretty good sense of their overall interests, such as sports, politics or agriculture.
- Search queries: search intent is immensely powerful, as it often indicates direct and immediate interests of the user. This is what stands behind the effectiveness of search-based advertising, the major driver behind the success of search engines like Google in the marketplace.
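To make the long-term versus short-term distinction concrete, here is a minimal sketch of how interest profiles could be derived from page visits that stay on the device. It assumes each visit has already been assigned a category by some classifier; the function name, the category labels and the window lengths are hypothetical, not a description of Brave's implementation.

```python
from collections import Counter
from datetime import datetime, timedelta


def interest_profile(visits, now, window):
    """Share of visits per category within `window` before `now`.

    `visits` is a list of (timestamp, category) pairs held only on the device.
    """
    recent = [category for ts, category in visits if now - window <= ts <= now]
    counts = Counter(recent)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


now = datetime(2020, 6, 1, 12, 0)
visits = [
    (now - timedelta(days=45), "sports"),
    (now - timedelta(days=30), "sports"),
    (now - timedelta(days=10), "politics"),
    (now - timedelta(hours=2), "agriculture"),
    (now - timedelta(minutes=20), "agriculture"),
]

# Hypothetical windows: roughly two months for long-term interests,
# a few hours for what the user is looking at right now.
long_term = interest_profile(visits, now, timedelta(days=60))
short_term = interest_profile(visits, now, timedelta(hours=6))
```

The same visit log yields a broad profile over the long window and a narrow one over the short window, which is the difference the first bullet describes.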
In summary, it is entirely possible to both train and apply machine learning models based on user behavior on their local device; what is far from trivial is how to create ML models so that different users inform each other’s client-side models.
Figure 1: Current snapshot of the Brave Ads system. Advertisers create and run campaigns (1) that are disseminated through the Brave Ads Server to all browsers in a given country (2). Non-campaign specific ML models are trained offline and published to browsers for on-device inference (3) when matching ads based on browsing behaviour on sites of publishers and content creators (4) and (5). Ad notifications are reported via a privacy preserving protocol (6).
Brave Ads System Overview
Key functions of an ads platform involve the delivery, matching and reporting of ads and their effectiveness, as measured, for example, in clicks or conversions. At Brave we have to enable all this in a privacy-respecting way which, if taken to the limit, could look like what we have proposed in Themis. To achieve this vision we need to divide and conquer, and thus the current state of the Brave Ads platform is more of a stepping stone than a final design, as outlined below.
Figure 1 describes the high-level flow of an ad making its way onto a user’s device. It all starts with advertisers buying inventory on the Brave Ads server. While conventional ad-tech platforms would offer up bids on ad opportunities via real-time auctions, Brave currently sells directly to advertisers.
Once bought, an advertiser’s ads are added to a catalog that is delivered to the end user in two steps. First, the catalog has to be distributed to a user’s devices. To keep the dissemination of ads private, the browser doesn’t request ads in real time but periodically downloads a catalog with a subset of all available ads. Currently this is done by creating a version of the catalog for each country. The second step of delivery happens in the browser and involves a basic user attention model to select and serve an ad at a non-intrusive time.
Besides showing ads at the right time, it is paramount to select only ads that are relevant to a user. Matching in a privacy-preserving setting is considerably harder than in conventional ad-tech, since it rules out most data-mining-based approaches. To this end we use a combination of general-purpose and user-specific on-device machine learning, as detailed in later sections of this post.
Finally, we need to report on performance metrics like impressions, clicks and conversions. To enable unlinkable and private reporting, we developed a protocol based on Privacy Pass. The mechanism is described in detail on our Wiki.
With a high-level overview of the system’s key notions in place, let’s dive deeper into private ad matching and how we employ on-device machine learning to boost performance.
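The first delivery step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not Brave's actual code: the `Ad` type, the function names and the refresh interval are all hypothetical. The point it shows is that the catalog is keyed only by country and fetched on a schedule, so a download reveals nothing about what the user is browsing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Ad:
    ad_id: str
    segment: str
    countries: tuple


def build_country_catalogs(ads):
    """Server side: group ads into one catalog per country."""
    catalogs = defaultdict(list)
    for ad in ads:
        for country in ad.countries:
            catalogs[country].append(ad)
    return dict(catalogs)


def catalog_is_stale(last_fetch_s, now_s, refresh_interval_s):
    """Client side: refresh on a fixed schedule, never per page visit."""
    return now_s - last_fetch_s >= refresh_interval_s


def fetch_catalog(catalogs, country):
    """The request carries only the country, so every browser in that
    country receives the same catalog."""
    return catalogs.get(country, [])


ads = [
    Ad("ad-1", "sports", ("US", "GB")),
    Ad("ad-2", "gardening", ("GB",)),
]
catalogs = build_country_catalogs(ads)

REFRESH_INTERVAL_S = 6 * 60 * 60  # hypothetical: every six hours
if catalog_is_stale(last_fetch_s=0, now_s=7 * 60 * 60,
                    refresh_interval_s=REFRESH_INTERVAL_S):
    local_catalog = fetch_catalog(catalogs, "GB")
```

Selecting which ad to show, and when, then happens entirely against `local_catalog` in the browser.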
Private Ad Matching
Matching can come in many flavours and is mainly divided into context-based and behaviour-based matching. The next section briefly introduces both concepts and explains how advertisers can run campaigns under both regimes.
Strictly speaking, contextual advertising only considers the user-independent context of a given ad. For conventional in-page ads, that would involve characteristics of the website displaying the ad, such as the topic of its content. Since Brave uses page-independent system notifications to serve ads, we use a short-term summary of a user’s browsing history to establish the relevant context. An upside of serving ads independently of a single page is that it increases brand safety. The downside of the contextual approach is that it might not reflect a user’s commercial intent to purchase an advertised product or service. This is where behavioural matching comes into play.
In contrast to contextual matching, the behavioural approach doesn’t care much for the immediate context of a user’s browsing session but focuses more on patterns that emerge from a user’s long-term browsing behaviour, like frequently visited websites, online searches or attention spent on different types of content. Behavioural advertising commonly distinguishes between interest and intent, where purchase intent is especially valuable for advertisers who want to reach audiences that are “in market” for their product or service. Behavioural targeting traditionally relies on third-party data collection: users are tracked through third-party audience segments and re-targeted with ads as they browse and begin their purchasing journey. Brave innovates by being first to market with a privacy-preserving purchase intent mechanism that leverages locally available data that never leaves the device.
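As a rough illustration of contextual matching against a short-term summary, the sketch below counts the categories of the most recently visited pages and picks the catalog ad whose segment matches best. The function names, the `last_n` cut-off and the example segments are assumptions made for this example and are not taken from Brave's implementation.

```python
from collections import Counter


def short_term_context(recent_page_categories, last_n=20):
    """Summarise the last `last_n` visited pages as category counts."""
    return Counter(recent_page_categories[-last_n:])


def pick_contextual_ad(catalog, context):
    """Return the id of the ad whose segment is most frequent in `context`.

    `catalog` is a list of (ad_id, segment) pairs. Returns None when no
    ad matches the current context; ties go to the earlier catalog entry.
    """
    best_ad, best_score = None, 0
    for ad_id, segment in catalog:
        score = context.get(segment, 0)
        if score > best_score:
            best_ad, best_score = ad_id, score
    return best_ad


catalog = [("ad-1", "sports"), ("ad-2", "travel"), ("ad-3", "finance")]
recent = ["news", "travel", "travel", "sports", "travel"]

context = short_term_context(recent)
chosen = pick_contextual_ad(catalog, context)
```

Because the summary spans several pages rather than the one currently open, the chosen ad is not tied to any single publisher's content, which is the brand-safety upside mentioned above.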
To understand what a user might be interested in buying we employ a set of heuristics that are informed by consumer research and industry knowledge in search and keyword marketing. The challenge of this approach is that we have to answer questions like “what do users search for when they want to buy a new smartphone” a priori, without observing any actual user data.
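A keyword heuristic of this kind might look like the sketch below. The rules here are invented for illustration: the signal words, segment names and product keywords are hypothetical stand-ins for rules that would in practice be derived from consumer research and keyword marketing knowledge. A query is treated as showing purchase intent for a segment only if it contains both a commercial signal word and a product keyword.

```python
import re

# Hypothetical rule set, written a priori rather than learned from user data.
INTENT_SIGNALS = {"buy", "price", "deal", "deals", "best", "cheap",
                  "review", "reviews", "compare"}
SEGMENT_KEYWORDS = {
    "smartphones": {"smartphone", "iphone", "android", "phone"},
    "laptops": {"laptop", "notebook", "ultrabook"},
}


def tokenize(query):
    """Lower-case the query and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", query.lower()))


def purchase_intent_segments(query):
    """Segments for which the query suggests purchase intent (sorted)."""
    tokens = tokenize(query)
    if not tokens & INTENT_SIGNALS:
        return []  # no commercial signal, so no intent is inferred
    return sorted(
        segment
        for segment, keywords in SEGMENT_KEYWORDS.items()
        if tokens & keywords
    )
```

For example, `purchase_intent_segments("best android phone 2020")` matches the smartphone segment, whereas `purchase_intent_segments("how to reset my phone")` matches nothing because it lacks a commercial signal word. The evaluation runs locally, so the query never has to leave the device.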
To summarise, at Brave we offer advertisers two private means of matching: contextual advertising to match users based on interest, and behavioural advertising to match users based on purchase intent.
Local Machine Learning
Machine learning is a key component in enabling private matching on the Brave Ads platform. Figure 2 gives a high-level overview of how this is done. Every step in the pipeline up to publishing the models on our model serving environment is offline, in the sense that it doesn’t involve the browser. The models are trained on publicly available data sources like Common Crawl, where one of the key challenges is generating accurate labels at scale. After QA, models are periodically published and downloaded by the browser. The browser then uses these “general purpose” models to make predictions on ad relevance for a given user’s browsing history. It is worth mentioning that we never try to infer interest on login-based websites like your email or social media.
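The on-device half of this pipeline can be sketched as follows. Everything concrete here is an assumption for illustration: the model is a toy bag-of-words linear classifier serialised as JSON, and the host list used to skip login-based sites is a hypothetical placeholder. The text does not specify the model format Brave publishes or how login-based sites are detected.

```python
import json
import re
from urllib.parse import urlparse

# Stand-in for a model file published by the offline pipeline.
MODEL_JSON = json.dumps({
    "version": 1,
    "weights": {
        "sports": {"match": 1.2, "league": 1.0, "goal": 0.8},
        "gardening": {"soil": 1.1, "seeds": 1.0, "compost": 0.9},
    },
})

# Hypothetical list of login-based sites that are never classified.
LOGIN_BASED_HOSTS = {"mail.example.com", "social.example.com"}


def is_login_based(url):
    return urlparse(url).hostname in LOGIN_BASED_HOSTS


def classify_page(url, text, model):
    """Return the best-scoring segment for a page, or None.

    Inference runs locally; neither the URL nor the text is sent anywhere.
    """
    if is_login_based(url):
        return None
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {
        segment: sum(weights.get(token, 0.0) for token in tokens)
        for segment, weights in model["weights"].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


model = json.loads(MODEL_JSON)  # in practice, downloaded periodically
segment = classify_page(
    "https://news.example.org/article",
    "Late goal decides the league match",
    model,
)
```

Since the model is trained on public data and shipped identically to every browser, only the inference step touches the user's browsing history.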
Figure 2: High-level ML Pipeline to train and publish models for on-device inference.
Looking into the Future
There is a lot of room for improvement when it comes to understanding a user’s interest and intent. One of the most obvious limitations of the current system is its ignorance of a user’s latent preferences. Imagine a user who is enthusiastic about home & gardening but never browses related content on their work laptop. In that case the local user models would never predict interest for that segment. We thus need a way of exploring segments independent of the overt browsing behaviour.
One class of algorithms for tackling such exploration-based problems is called multi-armed bandits. We have already conducted research and published work on making such bandits private. We are currently working on implementing a first version of said bandits in the browser.
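For readers unfamiliar with bandits, here is a minimal epsilon-greedy sketch in which each arm is an ad segment and the reward is whether the user engaged with an ad from it. This is a plain textbook bandit, not the private variant referred to above, and the class name, segment labels and reward encoding are assumptions for the example.

```python
import random


class EpsilonGreedyBandit:
    """Textbook epsilon-greedy bandit over a fixed set of arms."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in self.arms}
        self.value = {arm: 0.0 for arm in self.arms}  # running mean reward

    def select(self):
        # Explore a random segment with probability epsilon,
        # otherwise exploit the segment with the best estimate so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.value[arm])

    def update(self, arm, reward):
        # Incremental update of the mean reward observed for this arm.
        self.pulls[arm] += 1
        self.value[arm] += (reward - self.value[arm]) / self.pulls[arm]


segments = ["sports", "politics", "home & gardening"]
bandit = EpsilonGreedyBandit(segments, epsilon=0.1, seed=0)

# Reward 1.0 for a click, 0.0 for an ignored ad (hypothetical encoding).
bandit.update("sports", 0.0)
bandit.update("home & gardening", 1.0)
next_segment = bandit.select()
```

The exploration step is what lets a segment such as home & gardening be tried even when the local browsing history never suggests it.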