Author: Giga Alqeeq

  • Data Classification

    Data Classification

    Data classification involves defining and categorizing information based on its type, sensitivity, and value to the organization. This process allows for more effective management, protection, and utilization of data. By identifying data as confidential, sensitive, internal, or public, organizations can implement appropriate security controls, access restrictions, and handling procedures to safeguard confidentiality, integrity, and availability.

    Furthermore, classification enhances operational efficiency by making it easier for authorized users to locate and access information while ensuring compliance with regulatory requirements and internal policies. Organizations typically create their own classification models and categories to align with their business objectives, regulatory obligations, and risk tolerance. This enables them to prioritize resources and protect their most critical or valuable information.

    Content-Based Classification

    This approach examines the actual content of files (payload) to determine sensitivity, often identifying patterns such as credit card numbers or PII (personally identifiable information).

    • Techniques: Uses automated scanning, pattern matching, algorithms, or machine learning to scan text.
    • Pros: Highly accurate because it analyzes data content directly.
    • Cons: Can be resource-intensive.
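The pattern-matching technique above can be sketched roughly as follows. This is a minimal illustration, not a production scanner: the regular expression, the `Restricted` / `Not flagged` labels, and the function names are assumptions made for the example. The Luhn checksum is a standard way to reduce false positives when searching for card numbers.

```python
import re

# Assumed pattern: 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return candidate card numbers in the text that also pass the Luhn check."""
    found = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


def classify(text: str) -> str:
    """Flag the text as Restricted if it appears to contain a card number."""
    return "Restricted" if find_card_numbers(text) else "Not flagged"


sample = "Order paid with card 4111 1111 1111 1111 on Monday."
print(classify(sample))                      # Restricted
print(classify("Meeting tomorrow at 10am"))  # Not flagged
```

Every document has to be read in full, which is where the resource cost mentioned above comes from.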

    Context-Based Classification

    This approach analyzes the surrounding circumstances (the metadata) rather than the data itself to infer sensitivity.

    • Techniques: Evaluates application (e.g., Salesforce, Jira), location (e.g., specific file paths), creator, or time of creation.
    • Pros: Fast and efficient, often used in DLP (Data Loss Prevention) tools.
    • Cons: May miss sensitive data if the context is deceptive.
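A rough sketch of the idea, assuming a hypothetical rule set: the directory names, application names, creator account, and the `Confidential` / `Unlabeled` labels below are invented for illustration. Only metadata is inspected; the file content is never read.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class FileContext:
    """Metadata describing a file; the content itself is not inspected."""
    path: str
    application: str
    creator: str


# Hypothetical rules, invented for this example.
SENSITIVE_DIRS = {"hr", "finance", "legal"}
SENSITIVE_APPS = {"Salesforce"}
SENSITIVE_CREATORS = {"payroll-service"}


def classify_by_context(ctx: FileContext) -> str:
    """Infer a label from location, application, and creator only."""
    parts = {part.lower() for part in PurePosixPath(ctx.path).parts}
    if parts & SENSITIVE_DIRS:
        return "Confidential"
    if ctx.application in SENSITIVE_APPS or ctx.creator in SENSITIVE_CREATORS:
        return "Confidential"
    return "Unlabeled"


report = FileContext("/shares/finance/q3/report.xlsx", "Excel", "alice")
notes = FileContext("/tmp/notes.txt", "Notepad", "bob")
print(classify_by_context(report))  # Confidential
print(classify_by_context(notes))   # Unlabeled
```

The second file would stay unlabeled even if it held card numbers, which is the weakness described in the cons above.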

    User-Based Classification

    This approach relies on human judgment, where creators or users manually select a classification label for a file at creation or modification.

    • Techniques: Manual tagging prompts that ask the user to classify the data (e.g., Public, Confidential).
    • Pros: Highly accurate for understanding business value, as the creator knows the data’s true purpose.
    • Cons: Subjective, inconsistent, and prone to user error or negligence.

    Military Classification Scheme

    • Top Secret
      • Data requires the highest degree of protection, and disclosure of it would cause exceptionally grave damage to national security
      • Policy for conducting intelligence
    • Secret
      • Disclosure of it would cause serious damage to national security
      • Indications of weakness
    • Confidential
      • Disclosure of it would cause damage to national security
      • Intelligence reports
    • Sensitive
      • Data is not classified, and disclosure of it would cause limited damage to national security
      • For Official Use Only (FOUO)
      • Limited Official Use (LOU)
      • Official Use Only (OUO)
    • Unclassified
      • Data is not classified and non-sensitive

    Commercial Classification Scheme

    • Restricted
      • Highly sensitive data; access is restricted to specific individuals or authorized third parties (disclosure would lead to permanent damage)
      • Examples:
        • SSN
        • Credit cards
        • Criminal Record
        • Medical info
        • Biometric data
    • Confidential
      • Sensitive data that is shared team-wide; disclosure would harm the organization's operations
      • Examples:
        • Vendor contracts
        • Employee salaries
        • Names, addresses, and dates
    • Sensitive
      • Low-sensitivity data that is shared organization-wide but must not be disclosed outside the organization
      • Examples:
        • Internal policies
        • Internal user guides
        • Organizational charts
        • Project documents
        • Project documents
    • Public
      • Information that can be disclosed to anyone
      • Examples:
        • Public API documents
        • Job titles and names
        • Open API Data
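One way to make the commercial scheme usable in code is to treat the labels as an ordered scale. The sketch below assumes a "strictest label wins" rule for material that combines several sources; that rule and the names `Label` and `combined_label` are assumptions for illustration, not part of the scheme itself.

```python
from enum import IntEnum


class Label(IntEnum):
    """Commercial labels ordered from least to most sensitive."""
    PUBLIC = 0
    SENSITIVE = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def combined_label(labels: list[Label]) -> Label:
    """Return the strictest label among the parts (assumed rule)."""
    return max(labels)


# A report that mixes project documents (Sensitive) and salaries (Confidential)
report = [Label.SENSITIVE, Label.CONFIDENTIAL]
print(combined_label(report).name)  # CONFIDENTIAL
```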
  • Data States

    Data States

    Data states refer to the different conditions in which data exists, encompassing both structured and unstructured information. They are typically divided into three categories: at rest, in use, and in transit.

    Data at Rest

    Data stored on physical or digital media that is not actively being processed or transmitted.

    • Examples: Databases, File servers, Cloud storage, Backups, Endpoint devices  
    • Security Controls:
      • Encryption: Full disk, file-level, and database encryption to protect confidentiality.  
      • Access Controls: Role-Based Access Control (RBAC) and the principle of least privilege.  
      • Data Loss Prevention (DLP): Identifies and protects sensitive stored data.  
      • Integrity Controls: Hashing and checksums to detect unauthorized modifications. 
      • Availability Controls: Backups, redundancy, and disaster recovery plans.  
      • Cloud Access Security Broker (CASB): Enforces policies for cloud-stored data.  
      • Mobile Device Management (MDM): Secures data on mobile endpoints (e.g., remote wipe, enforced encryption).

    Example

    echo # prints text to standard output
    "Qeeqbox" # the exact string being printed
    > # redirects output into a file (overwrites file if it exists)
    file.txt # destination file

    echo "Qeeqbox" > file.txt

    ls # list directory contents
    -l # use long listing format (permissions, owner, size, date)
    file.txt # the specific file to display info about

    ls -l file.txt
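The integrity control listed above (hashing and checksums) can be sketched with Python's standard `hashlib` module. The example below is an illustration only: it works on a temporary copy of `file.txt` rather than the real file, and the "tampered" content is made up.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "file.txt"
    target.write_text("Qeeqbox\n")             # same content the echo command writes
    baseline = sha256_of(target)               # record a known-good digest
    target.write_text("Qeeqbox (tampered)\n")  # simulate an unauthorized change
    current = sha256_of(target)

print("modified:", current != baseline)  # modified: True
```

Storing the baseline digest separately from the data lets a later comparison reveal modification of the stored file.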

    Data in Use

    Data actively accessed, processed, or modified by users or applications, typically in memory (RAM).

    • Examples: Editing documents, Running applications, Processing transactions  
    • Security Controls:
      • Access Controls & Authentication: Ensures only authorized users or processes can access data.  
      • Privileged Access Management (PAM): Monitors and restricts administrative access.  
      • Rights Management (Digital Rights Management/Information Rights Management): Controls usage (e.g., restricts copy, print, and forwarding).  
      • Endpoint Security: Endpoint Detection and Response (EDR) and antivirus solutions to detect malicious activity during use.  
      • Data Masking/Tokenization: Protects sensitive data during processing.  
      • Session Controls: Implement timeouts, re-authentication, and continuous monitoring.  
      • DLP (Endpoint): Prevents unauthorized actions, such as copying to USB devices.  

    Note: Traditional encryption does not fully protect data in use since it must be decrypted in memory. Advanced methods like confidential computing exist but are not yet standard.

    Example

    nano # open the nano text editor
    file.txt # target file to open or create

    nano file.txt

    ps aux # list all running processes with details
    | # pipe sends output of left command to right command
    grep nano # filter results to only lines containing “nano”

    ps aux | grep nano
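Two of the controls listed above, data masking and tokenization, can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and the in-memory dictionary stands in for what would normally be a protected token vault.

```python
import secrets


def mask_card_number(number: str, visible: int = 4) -> str:
    """Replace all but the last `visible` digits with asterisks."""
    digits = [char for char in number if char.isdigit()]
    hidden = max(len(digits) - visible, 0)
    return "*" * hidden + "".join(digits[hidden:])


def tokenize(value: str, vault: dict[str, str]) -> str:
    """Swap a sensitive value for a random token; the vault keeps the mapping."""
    token = secrets.token_hex(8)
    vault[token] = value
    return token


vault: dict[str, str] = {}
card = "4111 1111 1111 1111"

print(mask_card_number(card))  # ************1111
token = tokenize(card, vault)
print(vault[token] == card)    # True
```

Masking is one-way for display purposes, while tokenization lets an authorized system look the original value up again.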

    Data in Transit

    Data that is transmitted between systems, networks, or users.

    • Examples: Emails, Web traffic, File transfers, API communications  
    • Security Controls:
      • Encryption in Transit: Utilize TLS/SSL (HTTPS), secure email encryption, and VPNs.  
      • Secure Protocols: Use SFTP and SSH instead of insecure protocols like FTP and Telnet.  
      • DLP (Network): Monitors and blocks unauthorized data exfiltration.  
      • CASB: Controls data movement to and from cloud services.  
      • Integrity Controls: Use digital signatures to verify authenticity and prevent tampering.  
      • Network Security Monitoring: Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) to detect attacks and anomalies.  
      • Rights Management (DRM/IRM): Maintains usage restrictions after sharing.  

    Example

    curl # Run the curl command-line download tool
    https://qeeqbox.com/dummy.txt # URL of the file to download
    -o file.txt # Save the downloaded content as "file.txt"

    curl https://qeeqbox.com/dummy.txt -o file.txt
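The same download can be done from Python while making the transport requirements explicit. The sketch below uses only the standard library; the minimum TLS version and the hypothetical `fetch` helper that rejects non-HTTPS URLs are choices made for this example. The call is left commented out so the snippet does not need network access.

```python
import ssl
import urllib.request

# Default context: verifies the server certificate and hostname.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocol versions


def fetch(url: str) -> bytes:
    """Download a URL over TLS only; plain HTTP is rejected."""
    if not url.startswith("https://"):
        raise ValueError("refusing to send data over an unencrypted connection")
    with urllib.request.urlopen(url, context=context, timeout=10) as response:
        return response.read()


print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True

# data = fetch("https://qeeqbox.com/dummy.txt")
```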
  • Data Visualization

    Data Visualization

    Data visualization is the process of translating data into a visual context, that is, a graphical representation of the data. It matters because it lets businesses see relationships and patterns in the data, and it makes large datasets coherent, accessible, and understandable.


    Line Chart

    A line chart is a graphical representation used to track changes in data over time. It displays data points connected by straight lines, making it easy to visualize trends, patterns, and fluctuations. Line charts are commonly used for time-series data, such as stock prices, temperature changes, website traffic, or sales performance.

    Example

    from datetime import datetime, timedelta # Import tools to work with dates and time differences
    from random import randint # Import function to generate random numbers
    import matplotlib.pyplot as plt # Import Matplotlib for plotting graphs
    x = [datetime.now() + timedelta(hours=i) for i in range(24)] # Create 24 timestamps (one per hour starting now)
    y = [randint(0, i) for i, _ in enumerate(x)] # Generate random values based on index position
    plt.plot(x, y) # Plot the x (time) and y (random values) data
    plt.show() # Display the graph

    from datetime import datetime, timedelta
    from random import randint
    import matplotlib.pyplot as plt
    x = [datetime.now() + timedelta(hours=i) for i in range(24)]
    y = [randint(0,i) for i,_ in enumerate(x)]
    plt.plot(x,y)
    plt.show()


    You can also plot multiple lines like this
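A minimal sketch of two lines on the same axes, reusing the hourly timestamps from the example above. The series names and the 0 to 10 value range are assumptions for illustration. The figure is saved to a file here; in an interactive session `plt.show()` can be used instead.

```python
from datetime import datetime, timedelta
from random import randint

import matplotlib.pyplot as plt

x = [datetime.now() + timedelta(hours=i) for i in range(24)]  # 24 hourly timestamps
y_1 = [randint(0, 10) for _ in x]  # first series of random values
y_2 = [randint(0, 10) for _ in x]  # second series of random values

fig, ax = plt.subplots()
ax.plot(x, y_1, label="Series 1")  # each call to plot() adds one line
ax.plot(x, y_2, label="Series 2")
ax.legend(loc="upper right")       # distinguish the two lines
fig.savefig("multi_line.png")      # or plt.show() in an interactive session
```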


    Scatter Chart

    A scatter plot is a graphical representation in which each value in a dataset is plotted as a dot. It is used to visualize the relationship or correlation between two variables. The position of each dot along the x-axis and y-axis corresponds to the values of the two variables. Scatter plots are useful for identifying patterns, trends, clusters, and outliers in data.

    Example

    from datetime import datetime, timedelta # Import date/time tools (not used in this example)
    import numpy as np # Import NumPy for generating random data
    import matplotlib.pyplot as plt # Import Matplotlib for plotting
    x_1 = np.random.randint(low=20, high=50, size=20) # Generate 20 random x-values for Day 1
    y_1 = np.random.randint(low=25, high=120, size=20) # Generate 20 random y-values for Day 1
    x_2 = np.random.randint(low=20, high=50, size=20) # Generate 20 random x-values for Day 2
    y_2 = np.random.randint(low=25, high=70, size=20) # Generate 20 random y-values for Day 2
    plt.scatter(x_1, y_1) # Create scatter plot for Day 1 data
    plt.scatter(x_2, y_2) # Create scatter plot for Day 2 data
    plt.legend(labels=['Day 1', 'Day 2'], loc='upper right') # Add legend to distinguish datasets
    plt.show() # Display the scatter plot

    from datetime import datetime, timedelta
    import numpy as np
    import matplotlib.pyplot as plt
    x_1 = np.random.randint(low=20,high=50, size=20)
    y_1 = np.random.randint(low=25,high=120, size=20)
    x_2 = np.random.randint(low=20,high=50, size=20)
    y_2 = np.random.randint(low=25,high=70, size=20)
    plt.scatter(x_1,y_1)
    plt.scatter(x_2,y_2)
    plt.legend(labels=['Day 1', 'Day 2'], loc='upper right')
    plt.show()



    Bar Chart

    A bar chart is a graphical representation in which values are depicted as vertical or horizontal bars. The length of each bar corresponds to the magnitude of the value it represents, making it easy to compare different categories or groups. Bar charts are commonly used to display discrete data, such as sales by product, population by region, or survey results.

    Example

    from datetime import datetime, timedelta # Import date/time tools (not used in this example)
    import matplotlib.ticker as mticker # Import ticker module to control axis ticks
    import numpy as np # Import NumPy for handling arrays
    import matplotlib.pyplot as plt # Import Matplotlib for plotting
    x = np.array(["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]) # Days of the week
    y = np.array([20, 10, 5, 5, 8, 1, 1]) # Malware counts per day
    plt.bar(x, y) # Create a bar chart
    plt.gca().yaxis.set_major_locator(mticker.MultipleLocator(5)) # Set y-axis ticks at intervals of 5
    plt.xlabel('Day') # Label x-axis
    plt.ylabel('Malware Count') # Label y-axis
    plt.show() # Display the bar chart

    from datetime import datetime, timedelta
    import matplotlib.ticker as mticker
    import numpy as np
    import matplotlib.pyplot as plt
    x = np.array(["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"])
    y = np.array([20,10, 5, 5, 8, 1, 1])
    plt.bar(x,y)
    plt.gca().yaxis.set_major_locator(mticker.MultipleLocator(5))
    plt.xlabel('Day')
    plt.ylabel('Malware Count')
    plt.show()



    Maps

    Maps are a type of data visualization used to display geographic data. You can plot points, lines, or areas on a map to show locations, routes, or spatial patterns. Tools like Plotly provide built-in integration with OpenStreetMap, allowing you to create interactive maps without needing an access token. Maps are useful for visualizing data such as population distribution, weather patterns, travel routes, or incidents across different locations.

    Example

    import plotly.express as px # Import Plotly Express for interactive plotting
    from random import uniform # Import uniform to generate random floating-point numbers
    temp_list = [] # Initialize empty list to store random coordinates
    for i in range(5): # Loop 5 times
        temp_list.append({'lat': round(uniform(-90, 90), 5), 'lon': round(uniform(-180, 180), 5)}) # Append a dictionary with random latitude (-90 to 90) and longitude (-180 to 180)
    fig = px.scatter_mapbox(temp_list, lat="lat", lon="lon", zoom=3) # Create an interactive scatter map using the generated coordinates
    fig.update_layout(mapbox_style="open-street-map", margin={"r":0,"t":0,"l":0,"b":0}) # Set the map style and remove extra margins
    fig.show()  # Display the interactive map

    import plotly.express as px
    from random import uniform

    temp_list = []

    for i in range(5):
        temp_list.append({'lat':round(uniform( -90,  90), 5),'lon':round(uniform(-180, 180), 5)})

    fig = px.scatter_mapbox(temp_list, lat="lat", lon="lon", zoom=3)
    fig.update_layout(mapbox_style="open-street-map", margin={"r":0,"t":0,"l":0,"b":0})
    fig.show()


    You can also add lines between dots

    Example

    import plotly.graph_objects as go # Import Plotly Graph Objects for more customizable plots
    fig = go.Figure(go.Scattermapbox( # Create a scatter map with markers connected by lines
        mode="markers+lines", # Show both points (markers) and connecting lines
        lat=[45.6280, 38.9072], # Latitude coordinates of the points
        lon=[-122.6615, -77.0369], # Longitude coordinates of the points
        marker={'size': 10} # Set the size of the markers
    ))
    fig.update_layout(mapbox_style="open-street-map", margin={"r":0, "t":0, "l":0, "b":0}) # Set map style and remove extra margins
    fig.show() # Display the interactive map

    import plotly.graph_objects as go

    fig = go.Figure(go.Scattermapbox(
        mode = "markers+lines",
        lat = [45.6280, 38.9072],
        lon = [-122.6615, -77.0369 ],
        marker = {'size': 10}))

    fig.update_layout(mapbox_style="open-street-map",margin={"r":0,"t":0,"l":0,"b":0})
    fig.show()


  • Deep Learning

    Deep Learning (DL)

    Deep Learning (DL) is a subset of Machine Learning that utilizes artificial neural networks with multiple layers to identify complex patterns in data. Unlike traditional machine learning methods, which often depend on manually engineered features, deep learning models automatically learn hierarchical representations directly from raw data. This capability makes them particularly effective for processing unstructured data, such as images, audio, text, and video. Deep learning is widely applied in various fields, including image recognition, speech processing, natural language processing, autonomous systems, and cybersecurity, where large-scale and complex data need to be analyzed efficiently.

    Process

    • Input (raw data)
    • Hidden layers (learn low-level -> high-level features automatically)
    • Output (prediction / classification)
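The three stages above can be sketched as a single forward pass in NumPy. This is only an illustration of how data flows through the layers: the weights are random and untrained, and the layer sizes (2 inputs, 4 hidden units, 1 output) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def relu(z: np.ndarray) -> np.ndarray:
    """ReLU activation: negative values become zero."""
    return np.maximum(z, 0)


# Input: one sample with two raw features
x = np.array([[2.0, 3.0]])

# Hidden layer: 2 inputs -> 4 units (random, untrained weights)
w_hidden = rng.normal(size=(2, 4))
b_hidden = np.zeros(4)

# Output layer: 4 hidden units -> 1 prediction
w_out = rng.normal(size=(4, 1))
b_out = np.zeros(1)

hidden = relu(x @ w_hidden + b_hidden)  # learned features would live here
output = hidden @ w_out + b_out         # final prediction

print(hidden.shape, output.shape)  # (1, 4) (1, 1)
```

Training, as in the Keras examples that follow, is what adjusts the weights so that the output becomes meaningful.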

    Example (Addition)

    import numpy as np # For numerical operations and generating random data
    from tensorflow.keras.models import Sequential # For building a sequential neural network
    from tensorflow.keras.layers import Dense # Fully connected (dense) neural network layer
    from tensorflow.keras.callbacks import EarlyStopping # Stop training early if model stops improving

    # Generate random input data
    x = np.random.randint(0, 500, size=(1000,2)) # 1000 samples, 2 features each (random integers 0-499)
    y = x[:, 0] + x[:, 1] # Target is sum of two features

    # Build a simple neural network
    model = Sequential() # Initialize sequential model
    model.add(Dense(32, input_shape=(2,), activation='relu')) # Hidden layer with 32 neurons, ReLU activation
    model.add(Dense(1)) # Output layer with 1 neuron (predict sum)

    # Compile the model
    model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mae']) # Use MAE loss and Adam optimizer

    # Train the model
    model.fit(
        x, y, # Training data and targets
        validation_split=0.2, # Use 20% of data for validation
        batch_size=32, # Batch size for training
        epochs=100, # Maximum number of epochs
        verbose=1, # Show progress
        callbacks=[EarlyStopping(monitor='val_loss', patience=5)] # Stop early if validation loss doesn't improve for 5 epochs
    )

    # Predict on new data
    print(model.predict(np.array([[0.2, 10], [50, 1]]))) # Predict sum for two new samples

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.callbacks import EarlyStopping

    x = np.random.randint(0, 500, size=(1000,2))
    y = x[:, 0] + x[:, 1]

    model = Sequential()
    model.add(Dense(32, input_shape=(2,), activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mae'])

    # ~1000 samples, batch size 32 (hyperparameter)
    # For fixed validation, use train_test_split instead of validation_split
    model.fit(x, y, validation_split=0.2, batch_size=32, epochs=100, verbose=1, callbacks=[EarlyStopping(monitor='val_loss', patience=5)])

    print(model.predict(np.array([[0.2, 10], [50, 1]])))

    Example (Multiplication)

    import numpy as np # For creating and handling arrays
    from tensorflow.keras.models import Sequential # For building a sequential neural network
    from tensorflow.keras.layers import Dense # Fully connected (dense) neural network layer
    from tensorflow.keras.callbacks import EarlyStopping # Stop training early if validation loss stops improving

    # Generate random input data
    x = np.random.randint(0, 10, size=(1000,2)) # 1000 samples, each with 2 features (integers 0-9)
    y = x[:, 0] * x[:, 1] # Target = multiplication of the two features

    # Build the neural network
    model = Sequential() # Initialize sequential model
    model.add(Dense(64, input_shape=(2,), activation='relu')) # First hidden layer with 64 neurons, ReLU activation
    model.add(Dense(64, activation='relu')) # Second hidden layer with 64 neurons, ReLU activation
    model.add(Dense(1)) # Output layer with 1 neuron (predict the product)

    # Compile the model
    model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mae']) # MAE loss for regression, Adam optimizer

    # Train the model
    model.fit(
        x, y, # Training data and targets
        validation_split=0.2, # Use 20% of data for validation
        batch_size=32, # Batch size
        epochs=100, # Maximum number of epochs
        verbose=1, # Show progress bar
        callbacks=[EarlyStopping(monitor='val_loss', patience=5)] # Stop early if validation loss does not improve for 5 epochs
    )

    # Predict new data
    print(model.predict(np.array([[2, 3]]))) # Predict the product of 2*3

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.callbacks import EarlyStopping

    x = np.random.randint(0, 10, size=(1000,2))
    y = x[:, 0] * x[:, 1]

    model = Sequential()
    model.add(Dense(64, input_shape=(2,), activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mae'])

    # ~1000 samples, batch size 32 (hyperparameter)
    # For fixed validation, use train_test_split instead of validation_split
    model.fit(x, y, validation_split=0.2, batch_size=32, epochs=100, verbose=1, callbacks=[EarlyStopping(monitor='val_loss', patience=5)])

    print(model.predict(np.array([[2, 3]])))

    Predicting Suspicious Emails (phishing)

    import numpy as np # Numerical operations (used here for the label array)
    from tensorflow.keras.models import Sequential # Sequential model (stack layers linearly)
    from tensorflow.keras.layers import Dense # Fully connected (dense) neural network layers
    from sklearn.feature_extraction.text import CountVectorizer # Converts text into numeric feature vectors (bag-of-words)
    emails = [
        "Click here to reset your password", # Likely phishing example
        "Your invoice is attached", # Likely safe example
        "Verify your bank account immediately", # Likely phishing example
        "Meeting tomorrow at 10am", # Likely safe example
    ]
    labels = np.array([1, 0, 1, 0]) # Target labels as a NumPy array: 1 = phishing, 0 = safe
    vectorizer = CountVectorizer() # Initialize text vectorizer (bag-of-words model)
    features = vectorizer.fit_transform(emails).toarray() # Learn vocabulary + convert emails into numeric feature matrix
    model = Sequential() # Create a sequential neural network model
    model.add(Dense(32, input_shape=(features.shape[1],), activation='relu')) # Input layer + first hidden layer (32 neurons)
    model.add(Dense(16, activation='relu')) # Second hidden layer (16 neurons)
    model.add(Dense(1, activation='sigmoid')) # Output layer (1 neuron for binary classification, sigmoid = probability)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) # Configure model training settings
    model.fit(features, labels, epochs=50, verbose=0) # Train the model for 50 iterations (epochs), no training output shown
    new_emails = vectorizer.transform([
        "Your account will be locked, click here", # Suspicious/phishing-like message
        "Lunch tomorrow?" # Normal/safe message
    ]).toarray() # Convert new emails into the same feature format
    prediction = model.predict(new_emails) > 0.5 # Predict probabilities and convert to True/False using threshold 0.5
    print("Phishing predictions (True=Phishing, False=Safe):", prediction) # Display prediction results

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from sklearn.feature_extraction.text import CountVectorizer

    emails = [
        "Click here to reset your password",
        "Your invoice is attached",
        "Verify your bank account immediately",
        "Meeting tomorrow at 10am",
    ]

    labels = np.array([1, 0, 1, 0])  # 1 = phishing, 0 = safe

    vectorizer = CountVectorizer()
    features = vectorizer.fit_transform(emails).toarray()

    model = Sequential()
    model.add(Dense(32, input_shape=(features.shape[1],), activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))  
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(features, labels, epochs=50, verbose=0)

    new_emails = vectorizer.transform([
        "Your account will be locked, click here",
        "Lunch tomorrow?"
    ]).toarray()

    prediction = model.predict(new_emails) > 0.5
    print("Phishing predictions (True=Phishing, False=Safe):", prediction)

    Predicting Suspicious Files (Malware)

    import numpy as np # Library for numerical operations and arrays
    from tensorflow.keras.models import Sequential # Sequential model to stack layers
    from tensorflow.keras.layers import Dense # Fully connected neural network layers
    x = np.random.randint(0, 100, size=(1000, 3)) # Generate 1000 samples, each with 3 random features (0–99)
    y = (x[:,0] + x[:,1] + x[:,2] > 150).astype(int) # Label: 1 (malware) if sum > 150, else 0 (safe)
    model = Sequential() # Initialize the neural network model
    model.add(Dense(32, input_shape=(3,), activation='relu')) # Input layer + first hidden layer (32 neurons, ReLU activation)
    model.add(Dense(16, activation='relu')) # Second hidden layer (16 neurons)
    model.add(Dense(1, activation='sigmoid')) # Output layer (1 neuron, sigmoid for binary classification)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) # Configure model with optimizer, loss, and accuracy metric
    model.fit(x, y, epochs=50, batch_size=32, verbose=0)  # Train the model for 50 epochs with batch size of 32
    new_files = np.array([[60, 50, 50], [10, 5, 15]]) # New data samples to classify (each has 3 features)
    prediction = model.predict(new_files) > 0.5 # Predict probabilities and convert to True/False using threshold 0.5
    print("Malware predictions (True=Malware, False=Safe):", prediction) # Print classification results

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    x = np.random.randint(0, 100, size=(1000, 3))
    y = (x[:,0] + x[:,1] + x[:,2] > 150).astype(int)  # 1 = malware, 0 = safe

    model = Sequential()
    model.add(Dense(32, input_shape=(3,), activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(x, y, epochs=50, batch_size=32, verbose=0)

    new_files = np.array([[60, 50, 50], [10, 5, 15]])
    prediction = model.predict(new_files) > 0.5
    print("Malware predictions (True=Malware, False=Safe):", prediction)
  • Machine Learning

    Machine Learning (ML)

    Machine Learning (ML) is a branch of artificial intelligence that allows systems to learn from data and enhance their performance on tasks without needing explicit programming. ML algorithms examine data to detect patterns and relationships, which can then be utilized for making predictions, classifications, or decisions. These techniques are commonly applied in areas such as fraud detection, recommendation systems, and predictive analytics. Unlike traditional programming, ML focuses on data-driven learning and can handle both structured and unstructured data.

    Process

    • Training
      • Input data
      • Feature extraction (manual in traditional ML, automatic in deep learning)
      • Model learning
    • Prediction (Inference)
      • New input data
      • Apply trained model
      • Output prediction or classification

    Data Splitting

    • Training set: Used to train the model
    • Validation set: Used to tune and evaluate during training
    • Test set: Used to evaluate final performance on unseen data
    • A common split is 70% / 20% / 10%, but this may vary.
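A minimal sketch of a 70% / 20% / 10% split using only the standard library. The function name, the fixed seed, and the use of `range(100)` as stand-in data are assumptions for illustration; libraries such as scikit-learn provide their own splitting helpers.

```python
import random


def split_dataset(samples, train=0.7, validation=0.2, seed=42):
    """Shuffle the samples and cut them into train / validation / test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    train_end = round(len(shuffled) * train)
    validation_end = train_end + round(len(shuffled) * validation)
    return (
        shuffled[:train_end],                # used to train the model
        shuffled[train_end:validation_end],  # used to tune during training
        shuffled[validation_end:],           # held back for the final evaluation
    )


train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))  # 70 20 10
```

Shuffling before cutting avoids a split that simply follows the order in which the data was collected.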

    Example

    import numpy as np # For handling arrays
    from sklearn.feature_extraction.text import CountVectorizer # Convert text to numeric feature vectors
    from sklearn.ensemble import RandomForestClassifier # Machine learning model for classification

    # Input texts (simulated messages) and labels
    texts = np.array([
        'Click at this link', # Suspicious / phishing-like message
        'Click at this link to download', # Suspicious
        'Click here to transfer money', # Suspicious
        'My name is Jone', # Normal / safe message
        'How are you' # Normal / safe message
    ])
    labels = np.array([1, 1, 1, 0, 0]) # 1 = positive/suspicious, 0 = negative/normal
    tags = np.array(["negative", "positive"]) # Labels for display

    # Extract features from text using Bag-of-Words
    count_vectorizer = CountVectorizer(min_df=1) # Convert text to word frequency vectors
    features = count_vectorizer.fit_transform(texts).toarray() # Learn vocabulary and convert texts to array

    # Train Random Forest classifier
    random_forest_classifier = RandomForestClassifier() # Initialize model
    random_forest_classifier.fit(features, labels) # Train model on features and labels

    # Predict new text
    features = count_vectorizer.transform(['How are you']) # Convert new text to feature vector
    prediction = random_forest_classifier.predict(features) # Predict label (0 or 1)
    print(prediction, tags[prediction]) # Print numeric prediction and human-readable tag

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestClassifier

    #Input
    texts = np.array(['Click at this link', 'Click at this link to download', 'Click here to transfer money', 'My name is Jone', 'How are you'])
    labels = np.array([1, 1, 1, 0, 0])
    #0 = negative
    #1 = positive
    tags = np.array(["negative","positive"])

    #Extract Features
    count_vectorizer = CountVectorizer(min_df=1)
    features = count_vectorizer.fit_transform(texts).toarray()

    #Train
    random_forest_classifier = RandomForestClassifier()
    random_forest_classifier.fit(features, labels)

    #Predict
    features = count_vectorizer.transform(['How are you'])
    prediction = random_forest_classifier.predict(features)
    print(prediction, tags[prediction])
  • Natural Language Processing

    Natural Language Processing

    Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and interact with human language in a meaningful way. It combines linguistics, computer science, and machine learning to process text and speech, allowing machines to analyze syntax, semantics, and context in written or spoken language. NLP is used for tasks such as sentiment analysis, language translation, chatbots, information extraction, and text summarization. While NLP focuses on understanding and interpreting language, rather than predicting future events, it forms the foundation for applications that require machines to comprehend and respond to human communication in a natural, human-like manner.


    Text Pre-Processing

    A popular Python module for NLP is nltk (the Natural Language Toolkit). It can be used to enhance threat detection and response.

    Install

    pip3 # Python package installer for Python 3
    install # Command that tells pip to install a package
    nltk # The Natural Language Toolkit library (used for NLP tasks)

    pip3 install nltk

    Run this in Python

    import nltk # Imports the Natural Language Toolkit (NLP library) into your Python script
    nltk.download('all') # Downloads all available NLTK datasets, models, and corpora

    import nltk
    nltk.download('all')

    Breaking Sentences Into Words

    You can break unstructured natural-language text into smaller chunks of information (tokens, which can later be converted into numerical structures for machine learning) using a tokenizer. For example, the word_tokenize() method breaks a sentence into words and punctuation.

    Example

    from nltk.tokenize import word_tokenize # Imports the word_tokenize function from NLTK’s tokenize module
    print(word_tokenize("Please follow this link.")) # Tokenizes (splits) the sentence into individual words and punctuation, then prints the resulting list

    from nltk.tokenize import word_tokenize
    print(word_tokenize("Please follow this link."))

    Output

    ['Please', 'follow', 'this', 'link', '.']

    Finding Common Words

    You can count how often each token occurs in a text using the FreqDist class.

    Example

    from nltk.probability import FreqDist # Imports FreqDist class to calculate word frequency distribution
    from nltk.tokenize import word_tokenize # Imports the word_tokenize function to split text into tokens
    tokens = word_tokenize("Please follow this link.") # Tokenizes the sentence into individual words and punctuation marks
    FreqDist(tokens).tabulate() # Creates a frequency distribution of the tokens and displays the counts in a formatted table

    from nltk.probability import FreqDist
    from nltk.tokenize import word_tokenize
    tokens = word_tokenize("Please follow this link.")
    FreqDist(tokens).tabulate()

    Output

     Please follow    this    link       . 
          1       1       1       1       1 

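    Finding Sentence Parts

    If you want to tag the words in a sentence as nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions, interjections, etc., you can use the pos_tag() function. You can review all the tags using nltk.help.upenn_tagset(). Note that the example below tags each token on its own, so the tagger has no sentence context; this is why follow is tagged NN (noun) in the output.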

    Example

    from nltk import pos_tag # Imports the part-of-speech (POS) tagging function
    from nltk.tokenize import word_tokenize # Imports the tokenizer to split text into words
    tokens = word_tokenize("Please follow this link.") # Splits the sentence into individual tokens (words and punctuation)
    for token in tokens: # Loops through each token
        print(pos_tag([token])) # Tags the token with its part of speech and prints it

    from nltk import pos_tag
    from nltk.tokenize import word_tokenize
    tokens = word_tokenize("Please follow this link.")
    for token in tokens:
        print(pos_tag([token]))

    Output

    [('Please', 'VB')]
    [('follow', 'NN')]
    [('this', 'DT')]
    [('link', 'NN')]
    [('.', '.')]

    Normalizing Words

    If you want to normalize a word, you can use stemming (the PorterStemmer class) or lemmatization (the WordNetLemmatizer class). Stemming strips the suffix from a word using fixed rules, so the result is not always a dictionary word, whereas lemmatization replaces a word with its dictionary base form (its lemma). Search engines typically use these techniques to return results that include all relevant forms of a word; e.g., if you search for cars, you also get results for car. Bots use them to understand the overall meaning of a sentence.

    Example

    from nltk.stem import PorterStemmer # Imports the Porter Stemmer algorithm for word stemming
    for item in ["test", "tests", "testing", "tested"]: # Loops through each word in the list
        print(item, ": ", PorterStemmer().stem(item)) # Applies stemming to each word and prints the original word along with its stemmed (root) form

    from nltk.stem import PorterStemmer
    for item in ["test","tests","testing","tested"]:
        print(item, ": ",PorterStemmer().stem(item))

    Output

    test :  test
    tests :  test
    testing :  test
    tested :  test

    Example

    from nltk.stem import WordNetLemmatizer # Imports the WordNet lemmatizer (uses vocabulary + morphology rules)
    for item in ["test", "tests", "testing", "tested"]: # Loops through each word in the list
        print(item, ": ", WordNetLemmatizer().lemmatize(item)) # Lemmatizes each word (treated as a noun by default) and prints the original word with its lemma

    from nltk.stem import WordNetLemmatizer
    for item in ["test","tests","testing","tested"]:
        print(item, ": ", WordNetLemmatizer().lemmatize(item))

    Output

    test :  test
    tests :  test
    testing :  testing
    tested :  tested

    Example

    from nltk.stem import WordNetLemmatizer # Imports the WordNet lemmatizer
    from nltk.corpus import wordnet # Imports WordNet corpus (provides POS constants)
    from nltk import word_tokenize, pos_tag # Imports tokenizer and POS tagger
    mapped = {
        "V": wordnet.VERB, # Maps POS tags starting with 'V' to VERB
        "J": wordnet.ADJ, # Maps POS tags starting with 'J' to ADJECTIVE
        "R": wordnet.ADV # Maps POS tags starting with 'R' to ADVERB
    }
    tokens = word_tokenize("caring") # Tokenizes the word
    for token, tag in pos_tag(tokens): # Tags the token with its Penn Treebank POS tag (e.g., VBG, NN, JJ)
        tag = mapped.get(tag[0], wordnet.NOUN) # Looks at the first letter of the POS tag; if it exists in the mapped dictionary, uses the corresponding WordNet POS, otherwise defaults to NOUN
        print(token, WordNetLemmatizer().lemmatize(token, tag)) # Lemmatizes the token using the mapped POS

    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet
    from nltk import word_tokenize, pos_tag

    mapped = {
        "V": wordnet.VERB,
        "J": wordnet.ADJ,
        "R": wordnet.ADV
    }

    tokens = word_tokenize("caring")
    for token, tag in pos_tag(tokens):
        tag  = mapped.get(tag[0], wordnet.NOUN)
        print(token, WordNetLemmatizer().lemmatize(token, tag))

    Part-Of-Speech

    POS stands for Part-Of-Speech, which is a grammatical category assigned to each word in a sentence. POS tagging tells you whether a word is a noun, verb, adjective, adverb, etc., based on its role in the sentence. The Penn Treebank tags returned by pos_tag() are:

    CC Coordinating conjunction
    CD Cardinal number
    DT Determiner
    EX Existential there 
    FW Foreign word
    IN Preposition or subordinating conjunction
    JJ Adjective
    JJR Adjective, comparative
    JJS Adjective, superlative
    LS List item marker
    MD Modal
    NN Noun, singular or mass
    NNS Noun, plural
    NNP Proper noun, singular
    NNPS Proper noun, plural
    PDT Predeterminer
    POS Possessive ending
    PRP Personal pronoun
    PRP$ Possessive pronoun
    RB Adverb
    RBR Adverb, comparative
    RBS Adverb, superlative
    RP Particle
    SYM Symbol
    TO to
    UH Interjection
    VB Verb, base form
    VBD Verb, past tense
    VBG Verb, gerund or present participle
    VBN Verb, past participle
    VBP Verb, non-3rd person singular present
    VBZ Verb, 3rd person singular present
    WDT Wh-determiner
    WP Wh-pronoun
    WP$ Possessive wh-pronoun
    WRB Wh-adverb
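
    As a rough illustration of how these tags can be grouped, the sketch below collapses a Penn Treebank tag into a coarse category by looking at its prefix. The function name coarse_pos and the category labels are hypothetical and are not part of nltk.

```python
# Hypothetical helper (not part of nltk): groups Penn Treebank tags by prefix.
COARSE = {
    "NN": "noun",
    "VB": "verb",
    "JJ": "adjective",
    "RB": "adverb",
    "PRP": "pronoun",
}

def coarse_pos(tag):
    """Return a coarse category for a Penn Treebank tag, or 'other'."""
    # Check longer prefixes first so the most specific prefix wins.
    for prefix in sorted(COARSE, key=len, reverse=True):
        if tag.startswith(prefix):
            return COARSE[prefix]
    return "other"

print(coarse_pos("VBG"))   # gerund or present participle
print(coarse_pos("NNPS"))  # proper noun, plural
print(coarse_pos("DT"))    # determiner, not in the table above
```

    Matching on a prefix keeps the table short, since every verb tag starts with VB and every noun tag starts with NN.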

    Remove Stop Words

    If you want to remove stop words (very common words such as this, is, or the) from a sentence, you can compare each token with the NLTK stopwords list.

    Example

    from nltk.tokenize import word_tokenize # Import the word tokenizer
    from nltk.corpus import stopwords # Import stopwords list
    tokens = word_tokenize("Please follow this link.") # Tokenize sentence into words
    stop_words = set(stopwords.words('english')) # Get the set of English stopwords
    filtered = [w for w in tokens if w.lower() not in stop_words] # Filter out tokens that are stopwords (case-insensitive)
    print(filtered) # Print the filtered words

    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    tokens = word_tokenize("Please follow this link.")
    stop_words = set(stopwords.words('english'))
    filtered = [w for w in tokens if w.lower() not in stop_words]
    print(filtered)

    Output

    ['Please', 'follow', 'link', '.']

    Example #1

    You can clean text using regex and nltk

    import re # Import regular expressions for pattern-based text cleaning
    from nltk.corpus import stopwords # Import list of common English stopwords
    stop_words = set(stopwords.words('english')) # Build the stopword set once instead of once per word
    def clean_text(text):
        text = text.lower() # Convert all letters to lowercase so that 'This' and 'this' are treated the same
        text = re.sub(r'\d+', ' ', text) # Remove all digits/numbers by replacing them with a space
        text = re.sub(r'[^\w\s]', ' ', text) # Remove punctuation by replacing anything that is NOT a word character or whitespace with a space
        text = " ".join(w for w in text.split() if w not in stop_words) # Remove stopwords (common words like 'the', 'is', 'this')
        return text # Return the cleaned text
    print(clean_text("Please follow this link.")) # Expected output: please follow link

    import re
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))

    def clean_text(text):
        text = text.lower()
        text = re.sub(r'\d+', ' ', text)
        text = re.sub(r'[^\w\s]', ' ', text)
        text = " ".join(w for w in text.split() if w not in stop_words)
        return text

    print(clean_text("Please follow this link."))

    Output

    please follow link

    Example #2

    If you want to check a phishing email for misspelled words, you can compare each word against the English word list that ships with the nltk module.

    import nltk # Import NLTK library
    words = set(nltk.corpus.words.words()) # Load the set of valid English words from the NLTK corpus
    sentence = "Please followw this link." # Example sentence to check
    errors = [] # List to store words not found in the dictionary (possible typos)
    for w in nltk.wordpunct_tokenize(sentence): # Tokenize the sentence into words and punctuation
        if w.isalpha() and w.lower() not in words: # Skip non-alphabetic tokens (punctuation, numbers); flag words missing from the dictionary
            errors.append(w) # Word is likely a typo
    print("Error(s):", len(errors)) # Print the number of errors found

    import nltk
    words = set(nltk.corpus.words.words())
    sentence = "Please followw this link."
    errors = []
    for w in nltk.wordpunct_tokenize(sentence):
        if w.isalpha() and w.lower() not in words:
            errors.append(w)
    print("Error(s):", len(errors))

    Output

    Error(s): 1
  • Web Scraping Prevention

    Web Scraping Prevention Techniques

    Many websites prohibit web scraping and use anti-scraping measures to block automated data extraction. These protections can make it challenging and time-consuming to scale scraping activities. For instance, if a script sends requests too frequently (like once every second), the website may block those requests or display a message asking the user to slow down or try again later.

    Fingerprinting

    Fingerprinting is a technique used to identify and track clients based on detailed technical information such as IP addresses, user-agent strings, browser versions, operating systems, screen resolutions, installed fonts, and even hardware characteristics. By combining these signals, websites can create a unique “fingerprint” for each visitor. If multiple requests appear to originate from the same fingerprint in an automated pattern, the system can flag or block them, even if the IP address changes.

    Example

    from http.server import BaseHTTPRequestHandler, HTTPServer # import base classes for HTTP server
    from time import time # import time function for request timing
    requests = {} # dictionary to store request history per fingerprint

    class CustomHandler(BaseHTTPRequestHandler): # define request handler class
        def do_GET(self): # handle GET requests
            now = time() # current timestamp
            ip = self.client_address[0] # get client IP address
            user_agent = self.headers.get("User-Agent", "") # browser info
            accept_lang = self.headers.get("Accept-Language", "") # language preference
            encoding = self.headers.get("Accept-Encoding", "") # compression support
            fingerprint = f"{ip}|{user_agent}|{accept_lang}|{encoding}" # create a simple fingerprint using IP + headers
            requests[fingerprint] = [t for t in requests.get(fingerprint, []) if now - t < 10] # keep only requests from last 10 seconds for this fingerprint
            requests[fingerprint].append(now) # log current request time

            if len(requests[fingerprint]) > 5: # if too many requests in time window, block client
                self.send_response(429) # HTTP status: Too Many Requests
                self.send_header('Content-type', 'text/plain') # response type
                self.end_headers() # finish HTTP headers
                self.wfile.write(f"Fingerprint:{fingerprint} - Too many requests...".encode("utf-8")) # send blocked message with fingerprint info
            else:
                self.send_response(200) # HTTP OK
                self.send_header('Content-type', 'text/plain') # response type
                self.end_headers() # finish headers
                self.wfile.write(f"Fingerprint:{fingerprint} - Server Running...".encode("utf-8")) # send normal response with fingerprint info

            return # end request handling

    HTTPServer(("", 8080), CustomHandler).serve_forever() # start server on port 8080 and run forever

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from time import time
    requests = {}

    class CustomHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            now = time()
            ip = self.client_address[0]
            user_agent = self.headers.get("User-Agent", "")
            accept_lang = self.headers.get("Accept-Language", "")
            encoding = self.headers.get("Accept-Encoding", "")
            fingerprint = f"{ip}|{user_agent}|{accept_lang}|{encoding}"
            requests[fingerprint] = [t for t in requests.get(fingerprint, []) if now - t < 10]
            requests[fingerprint].append(now)

            if len(requests[fingerprint]) > 5:
                self.send_response(429)
                self.send_header('Content-type', 'text/plain')
                self.end_headers()
                self.wfile.write(f"Fingerprint:{fingerprint} - Too many requests...".encode("utf-8"))
            else:
                self.send_response(200)
                self.send_header('Content-type', 'text/plain')
                self.end_headers()
                self.wfile.write(f"Fingerprint:{fingerprint} - Server Running...".encode("utf-8"))

            return

    HTTPServer(("", 8080), CustomHandler).serve_forever()

    Authentication

    Authentication systems require users to verify their identity before accessing content. This is often achieved through login pages, API keys, or session tokens. By requiring users to authenticate, websites can better control who accesses their data and monitor usage per account. This also allows them to enforce limits on a per-user basis rather than per IP address, making scraping more challenging. 

    Example

    from http.server import BaseHTTPRequestHandler, HTTPServer # import basic HTTP server classes
    api_keys = {"Example-6C324086-6B3B-48D5-9FEE-4A30C66B70CC": {"ip": "", "user": ""}} # dictionary mapping valid API keys to optional metadata

    class CustomHandler(BaseHTTPRequestHandler): # define request handler class
        def do_GET(self): # handle GET requests
            api_key = self.headers.get("X-API-Key", "") # extract API key from request headers
            if api_key not in api_keys: # check if API key is invalid or missing
                self.send_response(401) # return HTTP 401 Unauthorized
                self.send_header('Content-type', 'text/plain') # set response content type
                self.end_headers() # finish HTTP headers
                self.wfile.write(b"Authentication required") # send authentication error message
            else: # if API key is valid
                self.send_response(200) # return HTTP 200 OK
                self.send_header('Content-type', 'text/plain') # set response content type
                self.end_headers() # finish HTTP headers
                self.wfile.write(b"Server Running...") # send success response message
            return # end request handling

    HTTPServer(("", 8080), CustomHandler).serve_forever() # start server on port 8080 and run forever

    from http.server import BaseHTTPRequestHandler, HTTPServer
    api_keys = {"Example-6C324086-6B3B-48D5-9FEE-4A30C66B70CC": {"ip": "", "user": ""}}

    class CustomHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            api_key = self.headers.get("X-API-Key", "")
            if api_key not in api_keys:
                self.send_response(401)
                self.send_header('Content-type', 'text/plain')
                self.end_headers()
                self.wfile.write(b"Authentication required")
            else:
                self.send_response(200)
                self.send_header('Content-type', 'text/plain')
                self.end_headers()
                self.wfile.write(b"Server Running...")
            return

    HTTPServer(("", 8080), CustomHandler).serve_forever()
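
    To illustrate the per-account limits mentioned in this section, the following is a minimal sketch of a sliding-window limiter keyed by API key rather than by IP address. The names allow_request, WINDOW_SECONDS, and MAX_REQUESTS are assumptions chosen for this example, and the limits are arbitrary.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # assumed window length
MAX_REQUESTS = 5      # assumed maximum requests per key per window

_history = defaultdict(deque)  # api_key -> timestamps of recent requests

def allow_request(api_key, now):
    """Return True if this key is still under its limit at time `now` (seconds)."""
    recent = _history[api_key]
    # Drop timestamps that have fallen out of the window.
    while recent and now - recent[0] >= WINDOW_SECONDS:
        recent.popleft()
    if len(recent) >= MAX_REQUESTS:
        return False
    recent.append(now)
    return True

# Six requests within one second: the sixth one is rejected.
results = [allow_request("key-a", t * 0.1) for t in range(6)]
print(results)
# A different key has its own budget.
print(allow_request("key-b", 0.5))
```

    In a handler like the one above, allow_request(api_key, time()) could be called after the key check, returning 429 when it yields False.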

    Challenges (CAPTCHA)

    CAPTCHA tests are designed to differentiate humans from bots. They may involve identifying distorted text, selecting images, solving puzzles, or performing simple interactive tasks. Since most automated scripts struggle with these challenges, CAPTCHA serves as an effective barrier to prevent large-scale scraping or automated form submissions. 

    Example

    from http.server import BaseHTTPRequestHandler, HTTPServer # HTTP server framework
    from random import randint # generate random numbers for CAPTCHA
    from uuid import uuid4 # generate unique session ID for each CAPTCHA
    captcha_db = {} # store captcha_id -> correct answer mapping

    class Handler(BaseHTTPRequestHandler): # request handler class
        def do_GET(self): # handle GET requests (show CAPTCHA page)
            random_a = randint(1, 10) # first random number
            random_b = randint(1, 10) # second random number
            captcha_id = str(uuid4()) # create unique ID for this CAPTCHA session
            captcha_db[captcha_id] = str(random_a + random_b) # store correct answer on server
            self.send_response(200) # HTTP 200 OK
            self.send_header("Content-type", "text/html") # response is HTML page
            self.end_headers() # finish headers
            # send HTML form to user
            self.wfile.write(f"""
            <html>
            <body>
                <h3>CAPTCHA: What is {random_a} + {random_b}?</h3>
                <form method="POST">
                    <input name="answer" type="text">
                    <input type="hidden" name="captcha_id" value="{captcha_id}">
                    <input type="submit" value="Submit">
                </form>

            </body>
            </html>
            """.encode())

        def do_POST(self): # handle form submission
            length = int(self.headers.get('Content-Length', 0)) # get size of request body (0 if the header is missing)
            data = self.rfile.read(length).decode() # read and decode form data
            fields = dict(x.split("=", 1) for x in data.split("&") if "=" in x) # parse form fields
            user_answer = fields.get("answer", "") # user submitted answer
            captcha_id = fields.get("captcha_id", "") # session id from form
            correct_answer = captcha_db.pop(captcha_id, None) # fetch and remove the stored answer (single-use); None if the ID is unknown
            self.send_response(200) # HTTP OK
            self.send_header("Content-type", "text/plain") # plain text response
            self.end_headers() # finish headers
            if correct_answer is not None and user_answer == correct_answer: # pass only for a known CAPTCHA with the right answer
                self.wfile.write(b"CAPTCHA passed") # success message
            else:
                self.wfile.write(b"CAPTCHA failed") # failure message

    HTTPServer(("", 8080), Handler).serve_forever() # start server on port 8080

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from random import randint
    from uuid import uuid4
    captcha_db = {}

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            random_a = randint(1, 10)
            random_b = randint(1, 10)
            captcha_id = str(uuid4())
            captcha_db[captcha_id] = str(random_a + random_b)
            self.send_response(200)
            self.send_header("Content-type", "text/html")
            self.end_headers()
           
            self.wfile.write(f"""
            <html>
            <body>
                <h3>CAPTCHA: What is {random_a} + {random_b}?</h3>
                <form method="POST">
                    <input name="answer" type="text">
                    <input type="hidden" name="captcha_id" value="{captcha_id}">
                    <input type="submit" value="Submit">
                </form>

            </body>
            </html>
            """.encode())

        def do_POST(self):
            length = int(self.headers.get('Content-Length', 0))
            data = self.rfile.read(length).decode()
            fields = dict(x.split("=", 1) for x in data.split("&") if "=" in x)
            user_answer = fields.get("answer", "")
            captcha_id = fields.get("captcha_id", "")
            correct_answer = captcha_db.pop(captcha_id, None)
            self.send_response(200)
            self.send_header("Content-type", "text/plain")
            self.end_headers()
            if correct_answer is not None and user_answer == correct_answer:
                self.wfile.write(b"CAPTCHA passed")
            else:
                self.wfile.write(b"CAPTCHA failed")

    HTTPServer(("", 8080), Handler).serve_forever()

    Dynamic Content

    Dynamic content is generated at runtime rather than being fixed in the HTML source. This often involves JavaScript rendering, API calls, or asynchronous data loading. Since the content is not directly present in the initial page source, simple HTML-only scraping tools cannot easily extract the data without simulating a real browser environment. 

    from http.server import BaseHTTPRequestHandler, HTTPServer # HTTP server framework
    from datetime import datetime # used to generate dynamic runtime timestamp

    class CustomHandler(BaseHTTPRequestHandler): # request handler class
        def do_GET(self): # handle GET requests
            if self.path == "/": # main webpage route
                self.send_response(200) # HTTP 200 OK
                self.send_header('Content-type', 'text/html') # response is HTML page
                self.end_headers() # finish headers
                self.wfile.write(b"""
                <html>
                <body>
                    <h1>Server Running...</h1>
                    <div id="data">Loading...</div>
                    <script>
                        setTimeout(() => { // wait 10 seconds before loading data
                            fetch("/data") // request dynamic backend endpoint
                            .then(r => r.text()) // convert response to text
                            .then(t => document.getElementById("data").innerText = t); // update page content
                        }, 10000); // 10000ms delay (10 seconds)
                    </script>
                </body>
                </html>
                """)
                return # stop processing this request

            if self.path == "/data": # dynamic data endpoint
                self.send_response(200) # HTTP OK
                self.send_header('Content-type', 'text/plain') # plain text response
                self.end_headers() # finish headers
                self.wfile.write(f"Dynamic Content Loaded: {datetime.now().strftime('%m-%d-%Y %I:%M %p')}".encode()) # write the dynamic content
                return # end request

    HTTPServer(("", 8080), CustomHandler).serve_forever() # start server on port 8080

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from datetime import datetime

    class CustomHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/":
                self.send_response(200)
                self.send_header('Content-type', 'text/html')
                self.end_headers()
                self.wfile.write(b"""
                <html>
                <body>
                    <h1>Server Running...</h1>
                    <div id="data">Loading...</div>
                    <script>
                        setTimeout(() => { // wait 10 seconds before loading data
                            fetch("/data") // request dynamic backend endpoint
                            .then(r => r.text()) // convert response to text
                            .then(t => document.getElementById("data").innerText = t); // update page content
                        }, 10000);// 10000ms delay (10 seconds)
                    </script>
                </body>
                </html>
                """)
                return

            if self.path == "/data":
                self.send_response(200)
                self.send_header('Content-type', 'text/plain')
                self.end_headers()
                self.wfile.write(f"Dynamic Content Loaded: {datetime.now().strftime('%m-%d-%Y %I:%M %p')}".encode())
                return

    HTTPServer(("", 8080), CustomHandler).serve_forever()

    Randomized Identifiers

    Websites often change element IDs, class names, or API endpoints dynamically. This prevents scrapers from relying on fixed selectors to locate data. For instance, a product price element might have a different ID each time the page loads. This forces scrapers to constantly adapt and makes automation less reliable. 

    from http.server import BaseHTTPRequestHandler, HTTPServer # import HTTP server classes
    from random import randint # used to generate random IDs

    class CustomHandler(BaseHTTPRequestHandler): # define request handler
        def do_GET(self): # handle GET requests
            self.send_response(200) # send HTTP 200 OK status
            self.send_header('Content-type', 'text/html') # response is HTML
            self.end_headers() # finish headers
            random_id = f"id_{randint(1000,9999)}" # generate random element ID each request
            # send HTML response to client
            self.wfile.write(f"""
            <html>
                <body>
                    <div id="{random_id}">Gas Price is: $5.99 per gallon</div>
                </body>
            </html>
            """.encode())

    HTTPServer(("", 8080), CustomHandler).serve_forever() # start server on port 8080

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from random import randint 

    class CustomHandler(BaseHTTPRequestHandler): 
        def do_GET(self):
            self.send_response(200)
            self.send_header('Content-type', 'text/html') 
            self.end_headers()
            random_id = f"id_{randint(1000,9999)}"
            self.wfile.write(f"""
            <html>
                <body>
                    <div id="{random_id}">Gas Price is: $5.99 per gallon</div>
                </body>
            </html>
            """.encode()) 

    HTTPServer(("", 8080), CustomHandler).serve_forever()

    User Behavior Analysis

    User behavior analysis focuses on how users interact with a website over time. Typical human behavior includes pauses, scrolling, clicks, and irregular timing, while bots tend to generate consistent, fast, and repetitive request patterns. Websites use machine learning or rule-based systems to detect anomalies, such as extremely fast navigation, identical click paths, or repetitive page access patterns, and subsequently restrict or block suspicious activity.
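
    A rule-based check of this kind can be sketched as follows. This is only an illustration: the function name looks_automated and all thresholds are assumptions, and a real system would combine many more signals than request timing.

```python
from statistics import mean, pstdev

def looks_automated(timestamps, min_requests=5, max_mean_gap=1.0, max_jitter=0.05):
    """Flag request timing that is fast and unusually regular.

    All thresholds are illustrative assumptions, not tuned values.
    """
    if len(timestamps) < min_requests:
        return False  # not enough data to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Bots tend to produce short gaps with very little variation.
    return mean(gaps) <= max_mean_gap and pstdev(gaps) <= max_jitter

bot = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]       # one request every 0.5 s
human = [0.0, 3.2, 4.1, 9.8, 10.4, 17.0]   # irregular pauses
print(looks_automated(bot))
print(looks_automated(human))
```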


    Honeypots

    Honeypots are hidden elements embedded in a webpage that are either invisible or irrelevant to normal users (such as hidden links or form fields). Bots that blindly follow all available elements may end up interacting with these traps. Once triggered, the system can flag the behavior as automated and take action such as blocking the IP address, logging the activity, or redirecting the user. 
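
    As a minimal sketch of a form-field honeypot, the example below adds a hidden input that people never see and treats any submission that fills it in as automated. The field name website and the function name is_bot_submission are assumptions made for this illustration.

```python
from urllib.parse import parse_qs

# Form shown to visitors; the "website" field is hidden with CSS, so people leave it empty.
FORM_HTML = """
<form method="POST">
    <input name="email" type="text">
    <input name="website" type="text" style="display:none" tabindex="-1" autocomplete="off">
    <input type="submit" value="Submit">
</form>
"""

def is_bot_submission(body):
    """Return True if the hidden honeypot field was filled in."""
    fields = parse_qs(body, keep_blank_values=True)
    return any(value.strip() for value in fields.get("website", []))

print(is_bot_submission("email=a%40example.com&website="))
print(is_bot_submission("email=a%40example.com&website=http%3A%2F%2Fspam.example"))
```

    The server can then log or block the client whenever is_bot_submission() returns True.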

  • Web Scraping

    Web Scraping

    Data Scraping

    Data scraping is the process of extracting information from a target source and saving it into a file for further use. This target could be a website, an application, or any digital platform containing structured or unstructured data. The main goal of data scraping is to collect large amounts of data efficiently without manual copying, making it easier for organizations or individuals to gather the information they need for analysis or reporting.

    The process often involves using automated tools or scripts, such as web crawlers, bots, or specialized scraping frameworks. These tools navigate the target source, locate the desired data, and extract it in a structured format such as CSV, JSON, or Excel. Depending on the source, data scraping may require overcoming challenges such as dynamic content, login requirements, or anti-bot measures. It is a technical process that requires careful handling to ensure accuracy and efficiency.
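
    The last step, saving extracted data in a structured format, can be sketched with the Python standard library. The records below are hypothetical placeholders standing in for data that a scraper has already collected.

```python
import csv
import io
import json

# Hypothetical records, standing in for data a scraper has already extracted.
records = [
    {"product": "Widget", "price": "9.99"},
    {"product": "Gadget", "price": "19.50"},
]

# CSV: one row per record, with a header row.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["product", "price"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: the same records as a list of objects.
json_text = json.dumps(records, indent=2)

print(csv_text)
print(json_text)
```

    To write to disk instead of memory, the io.StringIO buffer can be replaced with a file opened using open(path, "w", newline="").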

    While data scraping focuses on data collection, the extracted information is often analyzed in a subsequent process called data mining. For example, a web crawler may scrape product details, prices, and reviews from e-commerce websites, and the collected data can then be analyzed to identify trends, patterns, or insights. By separating extraction from analysis, organizations can efficiently manage raw data and transform it into actionable intelligence, making data scraping a crucial first step in many data-driven workflows.


    Web Scraping

    Web Scraping is the automated process of extracting data from websites by using software tools or scripts to collect information directly from web pages. Websites can contain either static content, which is fixed in the page’s HTML and generally easier to scrape, or dynamic content, which is generated using JavaScript and may require more advanced tools or browser automation to access. Web scraping is commonly used for data collection, research, price monitoring, market analysis, and cybersecurity investigations. However, it is important to follow ethical and legal guidelines when scraping data, including reviewing the website’s terms of service and robots.txt file to ensure that scraping is permitted, as unauthorized data extraction may violate policies or laws.
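
    Checking robots.txt can be automated with the urllib.robotparser module from the Python standard library. The sketch below parses a hypothetical robots.txt from a string so that it runs offline; the rules, the user agent name my-scraper, and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site serves its own file at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules from text instead of fetching them

print(parser.can_fetch("my-scraper", "https://example.com/public/page.html"))
print(parser.can_fetch("my-scraper", "https://example.com/private/data.html"))
```

    Against a live site, set_url() followed by read() loads the real file instead of the string used here. Note that robots.txt only states the site's crawling preferences; the terms of service still need to be reviewed separately.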


    Manual Web Scraping

    Manual web scraping is the process of extracting data from webpages by hand, without using any scraping tools. It is convenient for very small amounts of content, but it becomes impractical when the data is large or needs to be scraped often. One of the great benefits of manual scraping is human review; every data point is checked by the person who scrapes it.


    Manual Web Scraping (Example #1)

    Getting all the URLs from this wiki page

    Right-click on the page and choose View Page Source

    Search the page source for href attributes (href defines the target of a hyperlink), click Highlight All, and copy them one by one. This takes a very long time; instead, you can paste the page source into a text editor and search with a regex such as href=["'](?<link>.*?)['"] or (?<=href=")[^"]*

    Save them into a file

    href="/w/load.php?lang=en&amp;modules=codex-search-styles%7Cext.cite.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cjquery.makeCollapsible.styles%7Cskins.vector.icons%2Cstyles%7Cwikibase.client.init&amp;only=styles&amp;skin=vector-2022"
    href="/w/load.php?lang=en&amp;modules=ext.gadget.SubtleUpdatemarker%2CWatchlistGreenIndicators&amp;only=styles&amp;skin=vector-2022"
    href="/w/load.php?lang=en&amp;modules=site.styles&amp;only=styles&amp;skin=vector-2022"
    href="//upload.wikimedia.org"
    href="//en.m.wikipedia.org/wiki/Malware"
    href="/w/index.php?title=Malware&amp;action=edit"
    href="/static/apple-touch/wikipedia.png"
    href="/static/favicon/wikipedia.ico"
    href="/w/opensearch_desc.php"
    href="//en.wikipedia.org/w/api.php?action=rsd"
    href="https://en.wikipedia.org/wiki/Malware"
    href="https://creativecommons.org/licenses/by-sa/4.0/deed.en"
    href="/w/index.php?title=Special:RecentChanges&amp;feed=atom"
    href="//meta.wikimedia.org"
    href="//login.wikimedia.org"
    ...
    ...
    ...
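
    The regex step above can also be scripted. This is a small sketch using Python's re module on a made-up HTML snippet (SAMPLE_HTML and extract_links are illustrative names). Note that Python spells named groups as (?P<link>...) rather than the (?<link>...) form that many text editors accept.

```python
import re

# Made-up page source used only for illustration
SAMPLE_HTML = (
    '<link rel="icon" href="/static/favicon/wikipedia.ico">'
    "<a href='//meta.wikimedia.org'>Meta</a>"
    '<a href="https://en.wikipedia.org/wiki/Malware">Malware</a>'
)

# Same idea as the editor regex, written with Python's named-group syntax
LINK_PATTERN = re.compile(r"""href=["'](?P<link>.*?)['"]""")

def extract_links(html):
    """Return the value of every href attribute found in the HTML text."""
    return [match.group("link") for match in LINK_PATTERN.finditer(html)]

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML):
        print(link)
```

    A regex is adequate for a quick one-off extraction, but it does not understand HTML structure, so an HTML parser is usually the safer choice for anything larger.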

    Automated Web Scraping

    Automated scraping uses tools that fetch the content and save it into files. Python is heavily used for web scraping, with modules such as beautifulsoup and pandas that support both scraping and data mining.


    Automated Web Scraping (Example #1)

    The beautifulsoup module is good for getting all the URLs from a webpage. This method of scraping is limited: it works well with static content, but it cannot retrieve dynamic content or take a screenshot of the website.

    Install beautifulsoup4 and lxml using the pip command

    from bs4 import BeautifulSoup # Import BeautifulSoup for HTML parsing
    from requests import get # Import get() to send HTTP requests
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"} # Mimic a real browser
    response = get("https://en.wikipedia.org/wiki/Main_Page", headers=headers) # Send GET request with the defined headers
    print(response.status_code) # Print HTTP status code (200 = OK)
    soup = BeautifulSoup(response.text, 'html.parser') # Parse HTML content
    for item in soup.find_all(href=True): # Loop through all tags containing an href attribute
        print(item['href']) # Print the link URL

    from bs4 import BeautifulSoup
    from requests import get
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"}
    response = get("https://en.wikipedia.org/wiki/Main_Page", headers=headers)
    print(response.status_code)
    soup = BeautifulSoup(response.text, 'html.parser')
    for item in soup.find_all(href=True):
        print(item['href'])

    Output

    href="/w/load.php?lang=en&amp;modules=codex-search-styles%7Cext.cite.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cjquery.makeCollapsible.styles%7Cskins.vector.icons%2Cstyles%7Cwikibase.client.init&amp;only=styles&amp;skin=vector-2022"
    href="/w/load.php?lang=en&amp;modules=ext.gadget.SubtleUpdatemarker%2CWatchlistGreenIndicators&amp;only=styles&amp;skin=vector-2022"
    href="/w/load.php?lang=en&amp;modules=site.styles&amp;only=styles&amp;skin=vector-2022"
    href="//upload.wikimedia.org"
    href="//en.m.wikipedia.org/wiki/Malware"
    href="/w/index.php?title=Malware&amp;action=edit"
    href="/static/apple-touch/wikipedia.png"
    href="/static/favicon/wikipedia.ico"
    href="/w/opensearch_desc.php"
    href="//en.wikipedia.org/w/api.php?action=rsd"
    href="https://en.wikipedia.org/wiki/Malware"
    href="https://creativecommons.org/licenses/by-sa/4.0/deed.en"
    href="/w/index.php?title=Special:RecentChanges&amp;feed=atom"
    href="//meta.wikimedia.org"
    href="//login.wikimedia.org"
    ...
    ...
    ...
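
    Many of the hrefs listed above are relative (/w/index.php...) or protocol-relative (//upload.wikimedia.org), so they usually have to be resolved against the page URL before they can be requested. A minimal sketch with the standard library follows. BASE_URL and to_absolute are illustrative names, and the base is assumed to be the page the links were scraped from.

```python
from html import unescape
from urllib.parse import urljoin

# Assumed address of the page the links were collected from
BASE_URL = "https://en.wikipedia.org/wiki/Malware"

def to_absolute(href, base=BASE_URL):
    """Decode HTML entities such as &amp; and resolve the link against the base URL."""
    return urljoin(base, unescape(href))

if __name__ == "__main__":
    for href in ("//upload.wikimedia.org",
                 "/static/favicon/wikipedia.ico",
                 "/w/index.php?title=Malware&amp;action=edit"):
        print(to_absolute(href))
```

    The unescape() call is only needed when the hrefs were copied from raw page source; an HTML parser normally decodes entities in attribute values already.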

    Automated Web Scraping (Example #2)

    The pandas module is good for getting all the tables within a page. As in the previous example, this method of scraping is limited: it works well with static content, but it cannot retrieve dynamic content or take a screenshot of the website.

    Install pandas and lxml using the pip command

    # bash /Applications/Python*/Install\ Certificates.command # macOS command to install SSL certificates if needed
    import pandas as pd # Import pandas for data handling and HTML table parsing
    import ssl # Import SSL module to handle HTTPS settings
    ssl._create_default_https_context = ssl._create_unverified_context # Disable SSL certificate verification (useful when encountering certificate errors)
    tables = pd.read_html("https://goblackbears.com/sports/baseball/stats") # Read all HTML tables from the given URL into a list of DataFrames
    for i, table in enumerate(tables): # Loop through each table with its index
        print("Table %s\n" % i, table.head()) # Print table index and first 5 rows

    import pandas as pd
    tables = pd.read_html("https://goblackbears.com/sports/baseball/stats")
    for i, table in enumerate(tables):
        print("Table %s\n" % i,table.head())

    Output

    Table 0
         0                                                  1
    0 NaN  This article has multiple issues. Please help ...
    1 NaN  This article needs to be updated. Please help ...
    2 NaN  This article needs additional citations for ve...
    Table 1
         0                                                  1
    0 NaN  This article needs to be updated. Please help ...
    Table 2
         0                                                  1
    0 NaN  This article needs additional citations for ve...
    Table 3
          Virus  ...                                              Notes
    0     1260  ...   First virus family to use polymorphic encryption
    1       4K  ...  The first known MS-DOS-file-infector to use st...
    2      5lo  ...                            Infects .EXE files only
    3  Abraxas  ...  Infects COM file. Disk directory listing will ...
    4     Acid  ...  Infects COM file. Disk directory listing will ...

    [5 rows x 9 columns]
    Table 4
          vteMalware topics                                vteMalware topics.1
    0   Infectious malware  Comparison of computer viruses Computer virus ...
    1          Concealment  Backdoor Clickjacking Man-in-the-browser Man-i...
    2   Malware for profit  Adware Botnet Crimeware Fleeceware Form grabbi...
    3  By operating system  Android malware Classic Mac OS viruses iOS mal...
    4           Protection  Anti-keylogger Antivirus software Browser secu...

    Automated Web Scraping (Example #3)

    One of the best web scraping techniques is using a headless browser, which is a browser that runs without a graphical user interface (GUI). Headless browsers were originally used for automated quality assurance tests but are now also used for scraping. The two main benefits of using a headless browser are rendering dynamic content and behaving like a human browsing a website.

    The following scripts will not run on Google Colab

    Scrape using Firefox (with geckodriver setup)

    1. Install the latest Firefox version
    2. Install selenium using the pip command
    3. Download the geckodriver from here (The Firefox application version has to match the webdriver version)
    4. Extract the geckodriver and note the location (E.g., /scrape/geckodriver)

    from selenium import webdriver # Import Selenium WebDriver
    options = webdriver.firefox.options.Options() # Create Firefox options object
    options.add_argument("--headless") # Run Firefox in headless mode (no GUI)
    service = webdriver.firefox.service.Service(r'path to the geckodriver') # Specify the local path to geckodriver executable
    browser = webdriver.Firefox(options=options, service=service) # Launch Firefox with the specified options
    browser.get('https://www.google.com') # Open Google homepage
    # print(browser.find_element(By.XPATH, "/html/body").text) # (Optional) Print the full page text
    browser.save_screenshot("screenshot_using_firefox.png") # Save a screenshot of the loaded page
    browser.close() # Close the browser window
    browser.quit() # End the WebDriver session

    from selenium import webdriver
    options = webdriver.firefox.options.Options()
    options.add_argument("--headless")
    service = webdriver.firefox.service.Service(r'path to the geckodriver')
    browser = webdriver.Firefox(options=options, service=service)
    browser.get('https://www.google.com')
    #print(browser.find_element(By.XPATH, "/html/body").text)
    browser.save_screenshot("screenshot_using_firefox.png")
    browser.close()
    browser.quit()

    Scrape using Firefox (without geckodriver setup)

    1. Install the latest Firefox version
    2. Install selenium and webdriver-manager using the pip command

    from selenium import webdriver # Import Selenium WebDriver
    from webdriver_manager.firefox import GeckoDriverManager # Automatically download/manage GeckoDriver
    options = webdriver.firefox.options.Options() # Create Firefox options object
    options.add_argument("--headless") # Run Firefox in headless (no GUI) mode
    service = webdriver.firefox.service.Service(GeckoDriverManager().install()) # Set up GeckoDriver service
    browser = webdriver.Firefox(options=options, service=service) # Launch Firefox with specified options
    browser.get('https://www.google.com') # Open Google homepage
    # print(browser.find_element(By.XPATH, "/html/body").text) # (Optional) Print full page text
    browser.save_screenshot("screenshot_using_firefox.png") # Capture a screenshot of the page
    browser.close() # Close the browser window
    browser.quit() # End the WebDriver session

    from selenium import webdriver
    from webdriver_manager.firefox import GeckoDriverManager
    options = webdriver.firefox.options.Options()
    options.add_argument("--headless")
    service = webdriver.firefox.service.Service(GeckoDriverManager().install())
    browser = webdriver.Firefox(options=options, service=service)
    browser.get('https://www.google.com')
    #print(browser.find_element(By.XPATH, "/html/body").text)
    browser.save_screenshot("screenshot_using_firefox.png")
    browser.close()
    browser.quit()

    Scrape using Chrome (with chromedriver setup)

    1. Install the latest Chrome version
    2. Install selenium using the pip command
    3. Download the ChromeDriver from here (The chrome web browser version has to match the webdriver version)
    4. Extract the ChromeDriver and note the location (E.g., /scrape/chromedriver)

    from selenium import webdriver # Import Selenium WebDriver
    options = webdriver.chrome.options.Options() # Create Chrome options object
    options.add_argument('--headless') # Run Chrome in headless (no GUI) mode
    options.add_argument('--no-sandbox') # Disable sandbox (required in containers/VMs)
    options.add_argument('--disable-dev-shm-usage') # Prevent shared memory issues
    service = webdriver.chrome.service.Service(r'path to the chromedriver') # Specify the local path to chromedriver
    browser = webdriver.Chrome(options=options, service=service) # Launch Chrome with specified options
    browser.get('https://www.google.com') # Open Google homepage
    browser.save_screenshot("screenshot_using_chrome.png") # Take a screenshot of the loaded page
    browser.close() # Close the browser window
    browser.quit() # End the WebDriver session

    from selenium import webdriver
    options = webdriver.chrome.options.Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    service = webdriver.chrome.service.Service(r'path to the chromedriver')
    browser = webdriver.Chrome(options=options, service=service)
    browser.get('https://www.google.com')
    #print(browser.find_element(By.XPATH, "/html/body").text)
    browser.save_screenshot("screenshot_using_chrome.png")
    browser.close()
    browser.quit()

    Scrape using Chrome (without chromedriver setup)

    1. Install the latest Chrome version
    2. Install selenium and webdriver-manager using the pip command

    from selenium import webdriver # Import Selenium WebDriver
    from webdriver_manager.chrome import ChromeDriverManager # Automatically download/manage ChromeDriver
    options = webdriver.chrome.options.Options() # Create Chrome options object
    options.add_argument('--headless') # Run Chrome in headless (no GUI) mode
    options.add_argument('--no-sandbox') # Disable sandbox (required in some environments)
    options.add_argument('--disable-dev-shm-usage') # Avoid shared memory issues in containers
    service = webdriver.chrome.service.Service(ChromeDriverManager().install()) # Set up ChromeDriver service
    browser = webdriver.Chrome(options=options, service=service) # Launch Chrome with specified options
    browser.get('https://www.google.com') # Open Google homepage
    browser.save_screenshot("screenshot_using_chrome.png") # Capture a screenshot of the page
    browser.close() # Close the browser
    browser.quit() # End the WebDriver session

    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
    options = webdriver.chrome.options.Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    service = webdriver.chrome.service.Service(ChromeDriverManager().install())
    browser = webdriver.Chrome(options=options, service=service)
    browser.get('https://www.google.com')
    #print(browser.find_element(By.XPATH, "/html/body").text)
    browser.save_screenshot("screenshot_using_chrome.png")
    browser.close()
    browser.quit()

    Automated Web Scraping (Example #4 – Best Option)

    You can run this one in Google Colab

    Install the latest Chrome version

    !apt update # Update the package list from repositories
    !apt install libu2f-udev libvulkan1 # Install dependencies required by Google Chrome
    !wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb # Download the Google Chrome .deb package
    !dpkg -i google-chrome-stable_current_amd64.deb # Install the Chrome package manually
    !apt --fix-broken install # Fix missing dependencies caused by dpkg install
    !pip install selenium webdriver-manager # Install Selenium and Chrome driver manager via pip

    !apt update
    !apt install libu2f-udev libvulkan1
    !wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
    !dpkg -i google-chrome-stable_current_amd64.deb
    !apt --fix-broken install 
    !pip install selenium webdriver-manager

    Scrape the website

    from selenium import webdriver # Import Selenium WebDriver
    from webdriver_manager.chrome import ChromeDriverManager # Automatically manage ChromeDriver
    from selenium.webdriver.common.by import By # Import locator strategies (e.g., XPATH)
    options = webdriver.chrome.options.Options() # Create Chrome options object
    options.add_argument('--headless') # Run Chrome without a visible window
    options.add_argument('--no-sandbox') # Disable sandbox (needed in containers/Colab)
    options.add_argument('--disable-dev-shm-usage') # Prevent shared memory issues
    service = webdriver.chrome.service.Service(ChromeDriverManager().install()) # Install and configure ChromeDriver service
    browser = webdriver.Chrome(options=options, service=service) # Launch Chrome with defined options
    browser.get('https://www.google.com') # Open Google homepage
    # print(browser.find_element(By.XPATH, "/html/body").text) # (Optional) Print page text using XPath
    browser.save_screenshot("screenshot_using_chrome.png") # Save a screenshot of the loaded page
    browser.close() # Close the browser window
    browser.quit() # End the WebDriver session

    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.common.by import By 
    options = webdriver.chrome.options.Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    service = webdriver.chrome.service.Service(ChromeDriverManager().install())
    browser = webdriver.Chrome(options=options, service=service)
    browser.get('https://www.google.com')
    #print(browser.find_element(By.XPATH, "/html/body").text)
    browser.save_screenshot("screenshot_using_chrome.png")
    browser.close()
    browser.quit()

    If you want to wait until a website loads, you can use the sleep function

    from selenium import webdriver # Import Selenium WebDriver
    from webdriver_manager.chrome import ChromeDriverManager # Automatically manage ChromeDriver
    from selenium.webdriver.common.by import By # Import locator strategies (e.g., XPATH)
    from time import sleep # Import sleep function
    options = webdriver.chrome.options.Options() # Create Chrome options object
    options.add_argument('--headless') # Run Chrome without a visible window
    options.add_argument('--no-sandbox') # Disable sandbox (needed in containers/Colab)
    options.add_argument('--disable-dev-shm-usage') # Prevent shared memory issues
    service = webdriver.chrome.service.Service(ChromeDriverManager().install()) # Install and configure ChromeDriver service
    browser = webdriver.Chrome(options=options, service=service) # Launch Chrome with defined options
    browser.get('https://us.shop.battle.net/en-us') # Open the Battle.net shop homepage
    sleep(10) # Wait 10 seconds
    # print(browser.find_element(By.XPATH, "/html/body").text) # (Optional) Print page text using XPath
    browser.save_screenshot("screenshot_using_chrome.png") # Save a screenshot of the loaded page
    browser.close() # Close the browser window
    browser.quit() # End the WebDriver session

    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.common.by import By 
    from time import sleep
    options = webdriver.chrome.options.Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    service = webdriver.chrome.service.Service(ChromeDriverManager().install())
    browser = webdriver.Chrome(options=options, service=service)
    browser.get('https://us.shop.battle.net/en-us')
    sleep(10)
    #print(browser.find_element(By.XPATH, "/html/body").text)
    browser.save_screenshot("screenshot_using_chrome.png")
    browser.close()
    browser.quit()
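
    A fixed sleep either waits longer than needed or not long enough. Selenium offers explicit waits (WebDriverWait) for this purpose; the sketch below only illustrates the underlying polling idea in plain Python, so wait_until and its parameters are hypothetical names and not part of Selenium's API.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # Stop as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)  # Short pause before checking again

if __name__ == "__main__":
    start = time.monotonic()
    # Stand-in condition: becomes true after roughly one second
    wait_until(lambda: time.monotonic() - start > 1.0, timeout=5.0, poll_interval=0.1)
    print("condition met")
```

    With a browser object, the condition could be a function that checks whether an expected element is present, so the script continues as soon as the page is ready instead of always waiting the full 10 seconds.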
  • TinyDB

    TinyDB

    TinyDB is a document-oriented database written in pure Python. You will need to download and install it using the pip command

    Install

    pip # Python's package manager
    install # The pip command that downloads and installs libraries from PyPI (Python Package Index)
    tinydb # A lightweight Python NoSQL database library

    pip install tinydb

    Create a Database

    The TinyDB() constructor connects to the local database file, or creates a new one if the file does not exist

    from tinydb import TinyDB # Import the TinyDB class from the tinydb module
    db = TinyDB('database.json') # Create (or open) a TinyDB database stored in a JSON file named 'database.json'; if the file doesn't exist, TinyDB creates it automatically

    from tinydb import TinyDB
    db = TinyDB('database.json')

    List All Tables

    You can list all tables using the .tables() method. A table must contain data; otherwise, it will not be shown

    from tinydb import TinyDB # Import the TinyDB class from the tinydb module
    db = TinyDB('database.json') # Create (or open) a TinyDB database stored in a JSON file named 'database.json'; if the file doesn't exist, TinyDB creates it automatically
    db.tables() # List all tables in the TinyDB database

    from tinydb import TinyDB
    db = TinyDB('database.json')
    db.tables()

    Output

    {'_default'}

    Create a Table

    TinyDB supports tables, although you do not need to use them. To create a table, use the .table() method

    from tinydb import TinyDB # Import the TinyDB class from the tinydb module
    db = TinyDB('database.json') # Create (or open) a TinyDB database stored in a JSON file named 'database.json'; if the file doesn't exist, TinyDB creates it automatically
    table = db.table('users') # Access (or create if it doesn't exist) a table named 'users' in the TinyDB database

    from tinydb import TinyDB
    db = TinyDB('database.json')
    table = db.table('users')

    Delete Table

    You can delete a table, along with all the data within it, using the .drop_table() method

    from tinydb import TinyDB # Import the TinyDB class from the tinydb module
    db = TinyDB('database.json') # Create (or open) a TinyDB database stored in a JSON file named 'database.json'; if the file doesn't exist, TinyDB creates it automatically
    db.drop_table('users') # Delete the entire table named 'users' from the TinyDB database
    print(db.tables()) # Show all tables

    from tinydb import TinyDB
    db = TinyDB('database.json')
    db.drop_table('users')
    print(db.tables())

    Output

    {'_default'}

    Insert Data

    To add new data, use the .insert() method

    from tinydb import TinyDB # Import the TinyDB class from the tinydb module
    db = TinyDB('database.json') # Create (or open) a TinyDB database stored in a JSON file named 'database.json'; if the file doesn't exist, TinyDB creates it automatically
    db.drop_table('users') # Delete the entire table named 'users' from the TinyDB database
    table = db.table('users') # Access (or create if it doesn't exist) a table named 'users' in the TinyDB database
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"}) # Insert a new record (dictionary) into the 'users' table
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"}) # Insert a new record (dictionary) into the 'users' table

    from tinydb import TinyDB
    db = TinyDB('database.json')
    db.drop_table('users')
    table = db.table('users')
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"})
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"})

    Output


    Fetching Results

    To fetch items from the database, use the .all() method

    from tinydb import TinyDB # Import the TinyDB class from the tinydb module
    db = TinyDB('database.json') # Create (or open) a TinyDB database stored in a JSON file named 'database.json'; if the file doesn't exist, TinyDB creates it automatically
    db.drop_table('users') # Delete the entire table named 'users' from the TinyDB database
    table = db.table('users') # Access (or create if it doesn't exist) a table named 'users' in the TinyDB database
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"}) # Insert a new record (dictionary) into the 'users' table
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"}) # Insert a new record (dictionary) into the 'users' table
    print(table.all()) # Retrieve and print all records from the 'users' table

    from tinydb import TinyDB
    db = TinyDB('database.json')
    db.drop_table('users')
    table = db.table('users')
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"})
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"})
    print(table.all())

    Output

    [{'id': 1, 'user': 'john', 'hash': 'e66860546f18'}, {'id': 2, 'user': 'jane', 'hash': 'cdbbcd86b35e', 'car': 'ford'}]

    Find Data

    You can fetch specific records using the .search() method

    from tinydb import TinyDB, where # Import the TinyDB class and the where() query helper from the tinydb module
    db = TinyDB('database.json') # Create (or open) a TinyDB database stored in a JSON file named 'database.json'; if the file doesn't exist, TinyDB creates it automatically
    db.drop_table('users') # Delete the entire table named 'users' from the TinyDB database
    table = db.table('users') # Access (or create if it doesn't exist) a table named 'users' in the TinyDB database
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"}) # Insert a new record (dictionary) into the 'users' table
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"}) # Insert a new record (dictionary) into the 'users' table
    results = table.search(where('user') == 'jane') # Search the 'users' table for all records where the 'user' field equals 'jane'
    print(results) # Print the list of matching records

    from tinydb import TinyDB, where
    db = TinyDB('database.json')
    db.drop_table('users')
    table = db.table('users')
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"})
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"})
    results = table.search(where('user') == 'jane')
    print(results)

    Output

    [{'id': 2, 'user': 'jane', 'hash': 'cdbbcd86b35e', 'car': 'ford'}]

    Update Data

    You can update data by using the .update() method

    from tinydb import TinyDB, where # Import the TinyDB class and the where() query helper from the tinydb module
    db = TinyDB('database.json') # Create (or open) a TinyDB database stored in a JSON file named 'database.json'; if the file doesn't exist, TinyDB creates it automatically
    db.drop_table('users') # Delete the entire table named 'users' from the TinyDB database
    table = db.table('users') # Access (or create if it doesn't exist) a table named 'users' in the TinyDB database
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"}) # Insert a new record (dictionary) into the 'users' table
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"}) # Insert a new record (dictionary) into the 'users' table
    table.update({'car': 'jeep'}, where('user') == 'jane') # In all records of the 'users' table where 'user' is 'jane', set the 'car' field to 'jeep'
    print(table.all()) # Retrieve and print all records from the 'users' table

    from tinydb import TinyDB, where
    db = TinyDB('database.json')
    db.drop_table('users')
    table = db.table('users')
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"})
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"})
    table.update({'car': 'jeep'}, where('user') == 'jane')
    print(table.all())

    Output

    [{'id': 1, 'user': 'john', 'hash': 'e66860546f18'}, {'id': 2, 'user': 'jane', 'hash': 'cdbbcd86b35e', 'car': 'jeep'}]

    Delete Specific Data

    You can delete data by using the .remove() method

    from tinydb import TinyDB, where # Import the TinyDB class and the where() query helper from the tinydb module
    db = TinyDB('database.json') # Create (or open) a TinyDB database stored in a JSON file named 'database.json'; if the file doesn't exist, TinyDB creates it automatically
    db.drop_table('users') # Delete the entire table named 'users' from the TinyDB database
    table = db.table('users') # Access (or create if it doesn't exist) a table named 'users' in the TinyDB database
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"}) # Insert a new record (dictionary) into the 'users' table
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"}) # Insert a new record (dictionary) into the 'users' table
    table.remove(where('user') == 'jane') # Remove all records in the 'users' table where 'user' is 'jane'
    print(table.all()) # Retrieve and print all records from the 'users' table

    from tinydb import TinyDB, where
    db = TinyDB('database.json')
    db.drop_table('users')
    table = db.table('users')
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"})
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"})
    table.remove(where('user') == 'jane')
    print(table.all())

    Output

    [{'id': 1, 'user': 'john', 'hash': 'e66860546f18'}]

    Delete All Data

    You can delete all the data within a table using the .drop_table() method. To remove every table in the database at once, TinyDB also provides the .drop_tables() method

    from tinydb import TinyDB # Import the TinyDB class from the tinydb module
    db = TinyDB('database.json') # Create (or open) a TinyDB database stored in a JSON file named 'database.json'; if the file doesn't exist, TinyDB creates it automatically
    db.drop_table('users') # Delete the entire table named 'users' from the TinyDB database
    print(db.tables()) # Retrieve and print all tables

    from tinydb import TinyDB
    db = TinyDB('database.json')
    db.drop_table('users')
    print(db.tables())

    Output

    {'_default'}

    User Input (NoSQL Injection)

    A threat actor can construct a malicious query and use it to perform an unauthorized action

    from tinydb import TinyDB, Query # Import the TinyDB and Query classes from the tinydb module
    temp_user = input("Enter username: ") # Prompt the user to enter a username
    temp_hash = input("Enter password: ") # Prompt the user to enter a password (normally the password would be hashed first; that step is omitted here)
    db = TinyDB('database.json') # Create (or open) a TinyDB database stored in a JSON file named 'database.json'; if the file doesn't exist, TinyDB creates it automatically
    db.drop_table('users') # Delete the entire table named 'users' from the TinyDB database
    table = db.table('users') # Access (or create if it doesn't exist) a table named 'users' in the TinyDB database
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"}) # Insert a new record (dictionary) into the 'users' table
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"}) # Insert a new record (dictionary) into the 'users' table
    if len(temp_hash) == 12: # Check if hash value length is 12
        results = table.search(Query().user.search(temp_user) & Query().hash.search(temp_hash)) # Search the table for records where the 'user' field matches temp_user and the 'hash' field matches temp_hash, both using regex search
        print(results) # Print all results

    from tinydb import TinyDB, Query
    temp_user = input("Enter username: ")
    temp_hash = input("Enter password: ")
    db = TinyDB('database.json')
    db.drop_table('users')
    table = db.table('users')
    table.insert({"id": 1,"user": "john","hash": "e66860546f18"})
    table.insert({"id": 2,"user": "jane","hash": "cdbbcd86b35e", "car":"ford"})
    if len(temp_hash) == 12:
        results = table.search(Query().user.search(temp_user) & Query().hash.search(temp_hash))
        print(results)

    Malicious statement

    If a user enters [a-zA-Z0-9]+ as the username and a 12-character pattern, such as the same [a-zA-Z0-9]+, as the password, the input passes the length check and both users, john and jane, are returned. When TinyDB evaluates Query().user.search(temp_user), it does not search literally for [a-zA-Z0-9]+. Instead, it treats the input as a regex pattern, which matches any username that contains letters or numbers. The same applies to the hash field.

    [a-zA-Z0-9]+ matches john -> True, retrieve this user
    [a-zA-Z0-9]+ matches jane -> True, retrieve this user

    Output

    [{'id': 1, 'user': 'john', 'hash': 'e66860546f18'}, {'id': 2, 'user': 'jane', 'hash': 'cdbbcd86b35e', 'car': 'ford'}]
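
    The root cause is that user input is interpreted as a regular expression. The following is a minimal sketch of the difference using only Python's re module; the USERS list mirrors the records above, and regex_lookup and exact_lookup are illustrative helpers, not TinyDB functions. In TinyDB itself, an equality test such as where('user') == 'jane', as used in the Find Data section, compares values literally.

```python
import re

USERS = [
    {"id": 1, "user": "john", "hash": "e66860546f18"},
    {"id": 2, "user": "jane", "hash": "cdbbcd86b35e", "car": "ford"},
]

def regex_lookup(user, pw_hash):
    """Vulnerable: both inputs are interpreted as regex patterns."""
    return [r for r in USERS
            if re.search(user, r["user"]) and re.search(pw_hash, r["hash"])]

def exact_lookup(user, pw_hash):
    """Safer: both inputs are compared literally."""
    return [r for r in USERS if r["user"] == user and r["hash"] == pw_hash]

payload = "[a-zA-Z0-9]+"  # Exactly 12 characters, so it passes the length check

if __name__ == "__main__":
    print(len(regex_lookup(payload, payload)))  # Number of records the payload matches as a regex
    print(len(exact_lookup(payload, payload)))  # Number of records it matches as literal text
    # If pattern matching is really needed, re.escape() neutralizes metacharacters
    print(len(regex_lookup(re.escape(payload), re.escape(payload))))
```

    Exact comparison is usually the right choice for credentials; escaping is a fallback for cases where substring search on user input is a genuine requirement.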
  • Non-Relational Databases

    Non-Relational Databases

    Non-relational databases, often called NoSQL databases, are designed to store data in a more flexible format compared to relational databases. They can handle structured, semi-structured, and unstructured data, making them ideal for modern applications that deal with diverse data types. Instead of tables with fixed rows and columns, non-relational databases use user-defined models such as documents, key-value pairs, wide columns, or graphs. This flexibility allows developers to easily adapt the database to changing requirements without redesigning the entire schema.

    Non-relational databases organize data according to the chosen data model. For example, document databases like MongoDB store data as JSON-like documents, while key-value stores like Redis store data as key-value pairs. Graph databases, on the other hand, focus on relationships between data points, making them ideal for social networks or recommendation systems. Unlike relational databases, non-relational databases often do not enforce strict schemas or relationships, allowing rapid development and the handling of large-scale, dynamic datasets.

    Non-relational databases are widely used in applications that require high scalability, performance, and flexibility, such as big data analytics, real-time web applications, and content management systems. They can efficiently manage large volumes of diverse data and are often horizontally scalable, meaning they can distribute data across multiple servers. Popular non-relational databases include MongoDB, Cassandra, Redis, and Neo4j, each optimized for specific use cases. Their ability to handle various data types and adapt to changing requirements makes them a critical component in modern data architectures.

    Example

    A collection of two documents with different key-value pairs (the second document has an extra "car" field that the first lacks):

    [
      {
        "id": 1,
        "user": "john",
        "hash": "e66860546f18"
      },
      {
        "id": 2,
        "user": "jane",
        "hash": "cdbbcd86b35e",
        "car": "ford"
      }
    ]
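
    A minimal Python sketch of how such a collection might be queried when documents do not share the same fields. The find helper is a hypothetical stand-in for a document database's query API, and the third document is invented purely for illustration.

```python
collection = [
    {"id": 1, "user": "john", "hash": "e66860546f18"},
    {"id": 2, "user": "jane", "hash": "cdbbcd86b35e", "car": "ford"},
]

_MISSING = object()  # sentinel so an absent field never equals a real value


def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion.

    A document that lacks a requested field simply does not match;
    no error is raised, because there is no fixed schema to violate.
    """
    return [
        doc for doc in collection
        if all(doc.get(field, _MISSING) == value for field, value in criteria.items())
    ]


print([doc["user"] for doc in find(collection, car="ford")])  # ['jane']

# Adding a document with a brand-new field needs no schema change.
# This document is hypothetical and not part of the original example.
collection.append({"id": 3, "user": "sam", "email": "sam@example.com"})
print([doc["user"] for doc in find(collection, email="sam@example.com")])  # ['sam']
```

    The query on "car" skips the first document instead of failing, which is the practical meaning of a flexible schema.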

    Non-Relational Databases Pros and Cons

    • Pros of Non-Relational Databases (NoSQL)
      • Flexible Schema
        • No fixed tables or columns; can store structured, semi-structured, and unstructured data.
        • Easy to adapt to changing application requirements without redesigning the database.
      • High Scalability
        • Designed for horizontal scaling across multiple servers, ideal for handling large datasets.
      • Performance
        • Optimized for high-throughput reads/writes, making them suitable for real-time applications.
      • Diverse Data Models
        • Support for documents (MongoDB), key-value pairs (Redis), wide columns (Cassandra), and graphs (Neo4j) lets each use case pick the model that fits it best.
      • Rapid Development
        • Lack of strict schema enforcement allows faster development cycles.
      • Big Data and Analytics
        • Well-suited for large-scale, dynamic datasets and big data applications.
    • Cons of Non-Relational Databases
      • Lack of Standardization
        • No universal query language like SQL; each database has its own API or query syntax.
      • Data Consistency Challenges
        • Many NoSQL systems favor availability and partition tolerance over strict consistency (see the CAP theorem), so reads may briefly return stale data until replicas converge.
      • Complex Relationships
        • Difficult to enforce relationships between datasets compared to relational databases.
      • Limited Transaction Support
        • ACID transactions may be limited or unavailable in some NoSQL databases.
      • Tooling and Expertise
        • Smaller ecosystem compared to mature RDBMS products; may require specialized knowledge.
      • Data Duplication
        • Denormalization is common, which can increase storage requirements and complicate updates.
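
    The data duplication drawback can be illustrated with a small Python sketch. The orders, the city field, and the update_city helper are all hypothetical and exist only to show why denormalized copies complicate updates.

```python
# Hypothetical denormalized orders: each one embeds a copy of the customer's city.
orders = [
    {"order_id": 101, "user": "jane", "city": "Austin"},
    {"order_id": 102, "user": "jane", "city": "Austin"},
    {"order_id": 103, "user": "john", "city": "Boston"},
]


def update_city(orders, user, new_city):
    """Rewrite every embedded copy of the city; return how many documents changed."""
    touched = 0
    for order in orders:
        if order["user"] == user:
            order["city"] = new_city
            touched += 1
    return touched


touched = update_city(orders, "jane", "Denver")
print(touched)  # 2
```

    In a normalized relational design the city would live in a single customer row, so the same change would touch one record. Here every copy must be found and rewritten, and any copy that is missed leaves the data inconsistent.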