Model Training and Testing

LLM Fine-tuning for Anti-Tracking in Web Browsers

Welcome to the third session of our course: LLM Fine-tuning for Anti-Tracking in Web Browsers.

In this series, we are exploring how to leverage Large Language Models (LLMs) to improve online privacy by detecting and blocking trackers embedded in websites. By the end of this series, you’ll have built a complete browser extension powered by a fine-tuned language model that can identify tracking URLs with remarkable accuracy.

Today’s session focuses on model training and testing.

Don’t miss the video demo – it walks you step by step through the training and testing scripts in action, complete with practical tips for optimizing your model’s performance in a real-world setting. You’ll also find an invitation to participate in our training challenge and see if you can make it to the top of the leaderboard.

The rest of this article explores the foundations of BERT.

Model Training and Testing
Fine-Tuning a Lightweight Transformer to Detect Online Trackers

In the previous session, we generated a synthetic dataset that simulates the diverse nature of real-world web traffic. We now possess a structured and labeled dataset containing URLs categorized as either trackers or non-trackers. This dataset forms the foundation of our next major step: building and fine-tuning a machine learning model that can accurately classify new, unseen URLs.

In this session, we will explore the journey from understanding what BERT models are, to fine-tuning a compact version of BERT specifically for our anti-tracking task. We will also discuss the rationale behind choosing a lightweight model variant, the fine-tuning process itself, and how we evaluate the model’s performance to ensure it meets real-world requirements.


1. Understanding BERT and the Rise of Transformers

Before we dive into the specifics of model training, it is important to appreciate the paradigm shift that Transformer models, particularly BERT, introduced to the world of Natural Language Processing (NLP).

BERT, which stands for Bidirectional Encoder Representations from Transformers, was introduced by Google Research in 2018. Unlike previous models that processed text either from left-to-right or right-to-left, BERT reads text in both directions simultaneously. This bidirectional attention allows it to capture much richer semantic relationships within text, understanding how each part of a sentence relates to the rest in a deeply contextualized way.

For our use case — detecting trackers — this ability to understand context is incredibly valuable. Trackers often disguise themselves behind seemingly benign URLs. A model that can holistically read and interpret the structure, components, and subtle hints within a URL is far better equipped to recognize malicious tracking attempts than a simple keyword-matching system.

However, full-sized BERT models come with a cost: they are very large, computationally expensive, and slow for real-time applications like browser extensions. This brings us to the need for more efficient alternatives.


2. Choosing DistilBERT: Balancing Performance and Resources

Given the constraints of our application — where speed and model size are critical for a smooth browsing experience — we opt for DistilBERT instead of using the original BERT model.

DistilBERT is a compressed, lighter version of BERT that was developed through a process known as knowledge distillation. Essentially, a smaller “student” model learns to mimic the behavior of the larger “teacher” BERT model. The result is a model that:

  • Retains around 97% of BERT’s language understanding capabilities,

  • Is 60% faster at inference,

  • And has about 40% fewer parameters.

This significant reduction in size and increase in speed comes with only a minimal sacrifice in performance. For a task like URL classification, where inputs are short and the vocabulary is simpler than that of full sentences or paragraphs, DistilBERT offers an excellent balance between accuracy and efficiency. It allows us to build a responsive anti-tracking system that can run comfortably within the resource constraints of a browser extension.
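
If you want to verify the size difference yourself, the following minimal sketch (assuming the Hugging Face transformers library and the standard bert-base-uncased and distilbert-base-uncased checkpoints) compares their raw parameter counts:

```python
# Compare the parameter counts of BERT and DistilBERT (illustrative sketch).
from transformers import AutoModel

def count_params(model) -> int:
    # Total number of parameters in the model.
    return sum(p.numel() for p in model.parameters())

bert = AutoModel.from_pretrained("bert-base-uncased")
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

print(f"BERT:       {count_params(bert):,} parameters")        # roughly 110M
print(f"DistilBERT: {count_params(distilbert):,} parameters")  # roughly 66M, about 40% fewer
```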


3. The Concept of Fine-Tuning

Now that we have selected a model, it is important to understand what fine-tuning means.

BERT models, including DistilBERT, are originally pre-trained on massive general text corpora like Wikipedia and BookCorpus. They are not trained for specific tasks like spam detection, sentiment analysis, or, in our case, tracker identification. However, the magic of transfer learning allows us to take these pretrained models and fine-tune them on smaller, task-specific datasets.

Fine-tuning involves slightly adjusting the internal weights of the pre-trained model so that it becomes specialized for a new task. Importantly:

  • We do not need to retrain from scratch.

  • The pre-trained model already knows how to process and understand text at a fundamental level.

  • Fine-tuning simply adapts this knowledge to a new, narrower domain — in our case, understanding the nuances that distinguish tracking URLs from non-tracking URLs.

During fine-tuning, a small classification layer (often just a fully connected dense layer) is added on top of the model. This new layer is trained alongside the rest of the network to map the learned representations to the desired output labels — here, tracker or non-tracker.
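
With the Hugging Face transformers library, attaching this classification head takes a single call; the sketch below is a minimal illustration, and the actual train_model.py may differ in its details:

```python
# Load pre-trained DistilBERT with a freshly initialized binary classification head.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,  # two output classes: tracker vs. non-tracker
)
```

The pre-trained encoder weights are loaded as-is, while the new classification layer starts from random weights and is learned during fine-tuning.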


4. Preparing for Training

Before starting the fine-tuning process, several preparatory steps are essential. These are already coded for you in the script train_model.py available in the GitHub repository:

  1. Dataset Loading:
    We begin by reading in the training_data.csv file generated earlier. Each row contains a URL and its corresponding label.

  2. Tokenization:
    Transformers like BERT and DistilBERT do not operate directly on raw text. Instead, text must be tokenized, i.e., split into smaller, meaningful pieces known as tokens. The tokenizer associated with DistilBERT will break down URLs into subtokens that the model can understand.
    Despite URLs not being conventional natural language, tokenizers trained on English often handle them surprisingly well, splitting at slashes, dots, and special characters.

  3. Dataset Formatting:
    The tokenized URLs, along with their labels, are organized into a format compatible with training. We ensure that padding and truncation are handled correctly:

  • Padding ensures that all sequences in a batch have the same length.

  • Truncation limits overly long inputs, though in practice, URLs rarely exceed a reasonable token count.

  4. Training-Testing Split:
    It is standard practice to split the training data into two parts (a short code sketch of these preparation steps follows this list):

  • A training set used to adjust model weights.

  • A testing set used to periodically assess the model’s performance during training, helping prevent overfitting.
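
The following sketch illustrates these preparation steps; it assumes training_data.csv has url and label columns (with labels encoded as 0 for non-tracker and 1 for tracker), which may differ from the exact layout used in train_model.py:

```python
# Minimal data preparation sketch (column names and label encoding are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

df = pd.read_csv("training_data.csv")  # dataset loading
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# The English subword tokenizer splits URLs at dots, slashes, and other
# special characters into subtokens the model already knows.
print(tokenizer.tokenize("https://ads.example-tracker.com/pixel?id=123"))

# Hold out part of the data for evaluation during training.
train_df, eval_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42)

# Pad and truncate so every sequence in a batch has the same length.
train_enc = tokenizer(train_df["url"].tolist(), padding=True, truncation=True, max_length=64)
eval_enc = tokenizer(eval_df["url"].tolist(), padding=True, truncation=True, max_length=64)
```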


5. The Fine-Tuning Process

Once the preparations are complete, we initiate the fine-tuning procedure. Here’s how it proceeds conceptually:

  1. Model Initialization:
    We load the pre-trained DistilBERT model with an added classification head suitable for binary classification.

  2. Training Configuration:
    Several hyperparameters are defined:

  • Epochs: How many times the model sees the entire dataset.

  • Batch Size: How many samples are processed before the model updates its parameters.

  • Learning Rate: Controls how big the updates to model weights are after each batch.

  • Optimizer: We typically use AdamW, a variant of the Adam optimizer that is particularly effective for Transformer models.

  3. Training Loop:
    During training:

  • The model predicts the label (tracker or non-tracker) for each input URL.

  • The predicted label is compared with the true label.

  • The loss function (cross-entropy loss) measures how far off the predictions are.

  • Gradients are calculated and used to update model weights to minimize loss.

Training proceeds in batches over multiple epochs until the model achieves satisfactory performance on the held-out evaluation set.
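
Continuing the sketch from the previous section, the Hugging Face Trainer API bundles the configuration and training loop described above into a few lines. The hyperparameter values below are illustrative and not necessarily those used in train_model.py:

```python
# Fine-tuning sketch with the Hugging Face Trainer (hyperparameters are illustrative).
import torch
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

class URLDataset(torch.utils.data.Dataset):
    """Wraps the tokenized encodings and integer labels from the preparation step."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, list(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = URLDataset(train_enc, train_df["label"])
eval_dataset = URLDataset(eval_enc, eval_df["label"])

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=3,              # epochs: passes over the full dataset
    per_device_train_batch_size=32,  # batch size: samples per weight update
    learning_rate=5e-5,              # step size of each weight update
    weight_decay=0.01,               # the Trainer uses AdamW by default
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()  # minimizes cross-entropy loss batch by batch
```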


6. Saving the Fine-Tuned Model

Upon successful training and evaluation, the train_model.py script saves the model to disk for future use.
The following artifacts are stored inside the final_model/ directory:

  • pytorch_model.bin: The fine-tuned weights.

  • config.json: Metadata describing the model architecture.

  • tokenizer.json: The tokenizer settings necessary to prepare future inputs.

  • Additional supporting files such as special_tokens_map.json and vocab.txt.

Saving the model ensures that we do not have to retrain from scratch every time.

Later, our API server will load this saved model and serve real-time predictions to the browser extension.
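
Conceptually, saving and later reloading come down to a pair of calls; the following is a minimal sketch, assuming the final_model/ directory mentioned above:

```python
# Persist the fine-tuned model and tokenizer for later use.
model.save_pretrained("final_model")
tokenizer.save_pretrained("final_model")

# Later, e.g. in the API server, reload both from disk:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("final_model")
tokenizer = AutoTokenizer.from_pretrained("final_model")
```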


7. Evaluating the Model

Once training is complete, evaluating the model is crucial. A model that simply memorizes training examples is useless in practice. Therefore, evaluation is conducted on the separate testing set (testing_data.csv), which contains URLs the model has never seen during training. You can run the script test_model.py (available in the GitHub repository) to test the model.

Key evaluation metrics include:

  • Accuracy: The proportion of total correct predictions out of all predictions. While useful, accuracy alone can be misleading if the classes are imbalanced.

  • Precision: Out of all URLs predicted as trackers, how many were truly trackers? High precision means few false positives.

  • Recall: Out of all actual trackers in the dataset, how many did we successfully detect? High recall means few false negatives.

  • F1-Score: The harmonic mean of precision and recall. It provides a balanced metric when dealing with imbalanced classes.

For privacy-related applications like anti-tracking, high recall is especially important. Missing a tracker (a false negative) could lead to a privacy breach. At the same time, high precision is important too — we don’t want to wrongly block legitimate resources and break websites.
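
These metrics are easy to compute with scikit-learn. The sketch below (assuming testing_data.csv uses the same url/label columns and 0/1 labels as the training data; test_model.py may be organized differently) runs the saved model over the test set and reports all four numbers:

```python
# Evaluation sketch: compare model predictions against the held-out labels.
import pandas as pd
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("final_model")
tokenizer = AutoTokenizer.from_pretrained("final_model")
model.eval()

test_df = pd.read_csv("testing_data.csv")
enc = tokenizer(test_df["url"].tolist(), padding=True, truncation=True,
                max_length=64, return_tensors="pt")

# For a large test set you would iterate in batches; one pass keeps the sketch short.
with torch.no_grad():
    logits = model(**enc).logits
preds = logits.argmax(dim=-1).numpy()

precision, recall, f1, _ = precision_recall_fscore_support(
    test_df["label"], preds, average="binary")
print(f"accuracy={accuracy_score(test_df['label'], preds):.3f} "
      f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```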


8. Summary

In this episode, we learned:

  • how to fine-tune a lightweight, efficient Transformer model to detect online trackers based on URL structure,
  • the reasoning behind selecting DistilBERT and how the fine-tuning process works in depth,
  • how to properly evaluate and save our trained model.

In the next session, we will set up a server to call our fine-tuned model for inference.

Tried it yourself? Share your results with us to get featured on the leaderboard! You can email your scores and model details to webmail@digitalmunich.com. We’d love to see how your model performs.


Next Episode Preview: Setting up the server for inference.
