Setting up the Server
LLM Fine-tuning for Anti-Tracking in Web Browsers
Welcome to the fourth session of our course: LLM Fine-tuning for Anti-Tracking in Web Browsers.
In this series, we are exploring how to leverage Large Language Models (LLMs) to improve online privacy by detecting and blocking trackers embedded in websites. By the end of this series, you’ll have built a complete browser extension powered by a fine-tuned language model that can identify tracking URLs with remarkable accuracy.
Today’s session focuses on setting up the server.
Check out the video for a quick, hands-on demo. For more theory and background, read on.
Serving the Fine-Tuned Model Through a Lightweight Flask API
Having fine-tuned and optimized our anti-tracking model, we are now ready to make it accessible to external systems, in particular our browser extension. To do this, we will create a lightweight API server that hosts the model and responds to requests in real time.
This is a crucial step. Without an API, the browser extension would have no way of querying the model to classify URLs dynamically. By the end of this episode, we will have a working backend that takes a URL as input and returns a prediction about whether it is a tracker or not.
1. Why an API Server Is Needed
When building an application that uses machine learning models, there are typically two main ways to integrate the model:
Embed the model directly into the application (e.g., compile it into the extension).
Serve the model through a backend server that the application communicates with over a network.
In our case, serving the model through an API server is the better choice for several reasons:
Model Size: Even an optimized Transformer model is too large to embed directly in a browser extension.
Flexibility: By decoupling the model from the extension, we can update, retrain, or optimize the model independently without needing to update the extension itself.
Scalability: Multiple clients (different users or browser tabs) can share the same server, reducing resource duplication.
Security: Hosting the model on a local server (during development) or a trusted remote server (in production) allows better control over access.
Thus, our goal is to create a lightweight but efficient API server that acts as a bridge between the browser extension and the machine learning model.
2. Choosing Flask for the API
There are many frameworks available for building APIs in Python, such as Django, FastAPI, and Flask.
For our project, we choose Flask, and the reasons are straightforward:
Lightweight: Flask is a micro-framework, meaning it provides only what is necessary to build simple applications without extra overhead.
Quick to set up: A basic Flask API can be created in just a few lines of code.
Flexible: It allows us to customize request handling easily.
Well-documented and widely used: Plenty of tutorials and help are available if needed.
Given that our task is relatively simple – load a model and classify incoming URLs – Flask is the perfect fit.
3. Overview of the Server Design
The server will consist of a single Flask application with the following components:
Model Loading: When the server starts, it loads the fine-tuned DistilBERT model and tokenizer into memory.
Prediction Endpoint: A POST HTTP endpoint (/predict) that accepts a JSON body containing a URL and returns whether it is a tracker.
In simple terms: the server will wake up, load the model once, and then wait for the browser extension to send URL classification requests.
4. Key Steps in Building the API Server
Let’s now walk through what each part of the API server will do.
a) Import Required Libraries
First, we import Flask and its request handling utilities, along with the Hugging Face Transformers library to load the model and tokenizer.
This forms the skeleton of our application.
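As a rough sketch, assuming the whole server lives in a single app.py (the file we run later in this episode), the opening of the script might look like the following. The AutoTokenizer and AutoModelForSequenceClassification classes are one common way to load a saved checkpoint, so adjust the imports to match your own setup:

```python
# app.py - skeleton of the anti-tracking API server (illustrative sketch)
from flask import Flask, request, jsonify
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

app = Flask(__name__)
```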
b) Load the Model and Tokenizer
When the server script runs, it immediately loads the fine-tuned DistilBERT model and its associated tokenizer from the final_model/ directory.
Loading at startup ensures that the model is held in memory and not reloaded for every incoming request — a crucial optimization for speed.
This step may take a few seconds initially, but afterward, all predictions will be fast.
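Continuing the app.py sketch, the startup loading step could look like this; final_model/ is the directory produced in the previous session, and the Auto* classes are an assumption about how you saved the checkpoint:

```python
# Load the fine-tuned model and tokenizer once, at startup,
# so they stay in memory for the lifetime of the server.
MODEL_DIR = "final_model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()  # inference mode (disables dropout)
```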
c) Define the Prediction Endpoint
The core of the server is the /predict endpoint:
It accepts only POST requests.
It expects a JSON object in the request body with a key "url".
It tokenizes the input URL.
It passes the tokenized input to the model.
It interprets the model’s output logits and determines:
is_tracker: true if the predicted class is 1, else false.
confidence: a float representing the probability of the URL being a tracker.
Finally, it sends a well-structured JSON response back to the client (the browser extension).
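Putting those steps together, a sketch of the endpoint (continuing the app.py skeleton above; the exact preprocessing should mirror whatever you used during fine-tuning) might be:

```python
@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"url": "https://..."}
    data = request.get_json(silent=True)
    if not data or "url" not in data:
        return jsonify({"error": "JSON body with a 'url' key is required"}), 400

    # Tokenize the URL and run it through the model
    inputs = tokenizer(data["url"], truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]

    # Class 1 is assumed to be the "tracker" class, as in the training session
    return jsonify({
        "is_tracker": bool(torch.argmax(probs) == 1),
        "confidence": float(probs[1]),
    })
```

Returning an HTTP 400 for malformed requests keeps the error handling on the extension side simple and predictable.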
d) Running the Server
Once all routes are defined, the server is started by running app.py.
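In Flask this is typically done with the standard __main__ guard at the bottom of app.py; a minimal sketch, using Flask’s built-in development server:

```python
# Start Flask's built-in development server when app.py is run directly
if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```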
bash
(tracking-llm-venv) llm-anti-track% python app.py
By default, Flask servers run on localhost at port 5000. This means the browser extension will send HTTP requests to http://localhost:5000/predict during development.
In production, you could deploy this server on cloud platforms like AWS, Azure, or even a private server, depending on the needs and security requirements.
5. Testing the API
After starting the server, it is essential to test it independently before integrating it with the browser extension.
This can be done using command-line tools like curl.
Here’s an example of testing with curl; run it in a new terminal window:
curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{"url":"https://tracking-site.com/pixel.gif"}'
If everything works correctly, the server should respond with a JSON object similar to:
{
"is_tracker": true,
"confidence": 0.98
}
This confirms that:
The server is properly receiving and interpreting requests,
The model is correctly classifying URLs,
The API is formatting the response in a clean, predictable way.
Once this basic communication is verified, we are ready to move to the final integration step: building the browser extension itself.
Try it out yourself: Run the curl command with various known trackers and non-trackers and see how your model responds.
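If you prefer Python over curl, a small throwaway script can do the same job. This sketch assumes the requests package is installed, and the example URLs are only illustrative:

```python
# test_predict.py - quick manual test of the /predict endpoint
import requests

TEST_URLS = [
    "https://tracking-site.com/pixel.gif",    # looks like a tracking pixel
    "https://example.com/static/style.css",   # ordinary static asset
]

for url in TEST_URLS:
    response = requests.post("http://localhost:5000/predict", json={"url": url})
    print(url, "->", response.json())
```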
6. Best Practices and Potential Improvements
While the basic Flask server will work well for development and small-scale deployment, here are some practices to consider if scaling up:
Concurrency: Use production-ready servers like Gunicorn with Flask if handling many simultaneous requests.
Caching: Frequently seen URLs can be cached temporarily to avoid redundant model inferences.
Security: Always validate and sanitize inputs more strictly if deploying publicly.
Monitoring: Logging server activity helps in diagnosing issues during real-world use.
For this tutorial series, a simple Flask server is sufficient, but keep these enhancements in mind if you plan to expand your project in the future.
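As an illustration of the caching idea above, one simple approach is to move the inference code into a helper function and memoize it with functools.lru_cache. This is only a sketch: classify_url is a hypothetical refactor of the /predict logic, not code from the episode.

```python
# Sketch of a simple in-process cache for repeated URLs.
# classify_url is a hypothetical helper that wraps the model call.
from functools import lru_cache

@lru_cache(maxsize=10_000)
def classify_url(url: str) -> tuple[bool, float]:
    inputs = tokenizer(url, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    return bool(torch.argmax(probs) == 1), float(probs[1])
```

The /predict handler would then simply call classify_url, and repeated requests for the same URL would skip the model entirely.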
7. Summary
In this episode, we built a critical bridge between our machine learning model and the real world. You learned:
- How to load the fine-tuned model into a Flask server and keep it in memory
- How to expose a /predict endpoint so that external clients can query the model over HTTP
With the API now live and functional, we are ready to tackle the final engineering piece: developing the browser extension that will call this API and provide real-time privacy protection to users.
Next Episode Preview: Developing the Complete Browser Extension.
You can revisit the model training & testing session here.