Build a security analytics data warehouse with BigQuery and Redpanda

Learn how to run powerful streaming data and analysis for a cybersecurity use case

Redpanda Data
15 min read · Mar 11, 2024

Author: Artem Oppermann

Google’s BigQuery is a fully managed, serverless data warehouse that delivers exceptional speed and scalability — two of the most important characteristics in data analytics applications, especially those that handle security data.

You can use BigQuery’s SQL query processing and massive parallel processing capabilities to quickly analyze extensive data sets. Beyond basic analytics, BigQuery is a robust data warehousing solution that supports informed decision-making through aggregation and reporting.

Additionally, you can use BigQuery ML to build and deploy machine learning models directly within the database using SQL commands. This enables predictive analytics and anomaly detection for proactive threat detection in cybersecurity.
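As a quick illustration (a sketch only, with placeholder project, data set, and model names rather than anything built in this tutorial), you could train an unsupervised k-means model on network log features and then surface unusual records with ML.DETECT_ANOMALIES:

-- Sketch: train a k-means model on network log features
-- (my_project and the table below are placeholders)
CREATE OR REPLACE MODEL `my_project.security_analytics_data.network_anomaly_model`
OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
SELECT bytes_sent, protocol, flag
FROM `my_project.security_analytics_data.security-events`
WHERE type = 'network_log';

-- Flag the most unusual records as potential anomalies
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `my_project.security_analytics_data.network_anomaly_model`,
  STRUCT(0.02 AS contamination),
  (SELECT bytes_sent, protocol, flag
   FROM `my_project.security_analytics_data.security-events`
   WHERE type = 'network_log'));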

In this tutorial, you’ll integrate Google BigQuery with Redpanda (a performant C++ drop-in replacement for Apache Kafka®) to build a simple security analytics data warehouse that provides real-time data streaming and advanced analytics for effective cybersecurity threat detection and analysis.

What is BigQuery?

BigQuery is an enterprise data warehouse from Google Cloud that’s built for agility, scalability, and performance. You can query massive data sets at high speed using standard SQL syntax.

BigQuery is a fully managed platform, meaning it eliminates the need for infrastructure management to give users a seamless experience that can automatically scale to meet the demands of their data. Plus, BigQuery has a serverless architecture that enables analysts and engineers to focus on extracting value from data rather than managing hardware and software.

The use cases for BigQuery span across various industries and applications. However, one of its primary roles is in security analytics. Companies in the field of cybersecurity need to process and analyze logs, network data, and threat intelligence in real time. The data processing capabilities of BigQuery allow these companies to perform complex joins, window functions, and time series analysis at lightning-fast speed. This rapid analysis capacity is crucial to detect anomalies, understand threat patterns, and respond to incidents with minimal delay.
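To make that concrete, a single windowed query can, for example, compute a rolling sum of bytes sent per source IP over the previous hour to highlight sudden traffic spikes. The sketch below uses placeholder project and table names, with columns mirroring the event schema used later in this tutorial:

-- Sketch: rolling one-hour traffic volume per source IP (placeholder names)
SELECT
  source_ip,
  timestamp,
  bytes_sent,
  SUM(bytes_sent) OVER (
    PARTITION BY source_ip
    ORDER BY UNIX_SECONDS(timestamp)
    RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW
  ) AS bytes_sent_last_hour
FROM `my_project.security_logs.network_events`
WHERE type = 'network_log';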

Furthermore, BigQuery is integrated with Google’s data and machine learning ecosystem. By combining the strength of BigQuery with machine learning and AI, companies can develop advanced models to predict and detect cyber threats before they have an impact. Additionally, as a data warehousing solution, BigQuery can be used to store and analyze historical data, which is important for trend analysis, compliance reporting, and long-term strategic planning in cybersecurity.

Now that you understand the technologies, let’s dive into the tutorial so you can practice using them.

Building a security analytics data warehouse with Redpanda and BigQuery

The following tutorial demonstrates how to build a security analytics data warehouse with Redpanda and BigQuery using a hypothetical scenario of a cybersecurity company trying to enhance its threat detection capabilities. The company plans to integrate the real-time streaming capabilities of Redpanda with the data storage and analytics of Google BigQuery.

To accurately simulate this scenario, you’ll first generate your own synthetic data to represent security event streams from various sources. Then, you’ll create the necessary topic in Redpanda, which will act as a conduit for your streaming data. Next, you’ll set up Kafka Connect to bridge Redpanda and Google BigQuery, making sure the sink connector reads from the topic that holds the security event data.

You’ll then produce the data to the Redpanda topic, from where it’s ingested into BigQuery. In the end, you’ll view and analyze the streamed data in BigQuery. Here’s a quick overview of what you’ll be implementing.

Workflow of the security analytics application

Prerequisites

To complete this tutorial, you’ll need the following:

  • A Python virtual environment created and activated: All Python-related commands should be run in this environment.
  • Python 3.11 or higher: The code is written in Python, so you need to have Python installed on your system. You can download it from Python’s official website.
  • Docker: You should have a recent version of Docker installed on your machine (this tutorial uses Docker 24.0.6).
  • Redpanda: Redpanda is an event streaming platform compatible with Apache Kafka. You need to have it installed and running. Follow the instructions on the Redpanda website to install and start a Redpanda cluster.

1. Setting up the folder structure

To proceed with the setup, you should create the following folder structure on your system:

redpanda_bigquery
├── configuration
├── plugins

Here, redpanda_bigquery is the main directory that contains the subdirectories configuration and plugins. This structure will organize your configuration files and the plugins necessary for integrating Redpanda with BigQuery.

2. Generating the data set

In this section, you’ll create dummy security event data to simulate a variety of sources, such as network logs, intrusion detection system (IDS) alerts, and authentication records. For this, create a Python script called generate_data.py in the redpanda_bigquery directory and add the following Python code to programmatically generate a series of events with attributes relevant to each type:

import json
import random
from datetime import datetime, timedelta

# Constants for the types of data points
NETWORK_LOG = "network_log"
IDS_ALERT = "ids_alert"
AUTH_RECORD = "auth_record"

# Generate a random IP address
def generate_ip():
    return f"{random.randint(0, 255)}.{random.randint(0, 255)}.{random.randint(0, 255)}.{random.randint(0, 255)}"

# Generate a random timestamp within the last 24 hours
def generate_timestamp():
    return (datetime.now() - timedelta(minutes=random.randint(0, 1440))).isoformat()

# Generate a consistent data structure for network log
def generate_network_log():
    return {
        "type": NETWORK_LOG,
        "timestamp": generate_timestamp(),
        "source_ip": generate_ip(),
        "destination_ip": generate_ip(),
        "bytes_sent": random.randint(1, 65535),
        "protocol": random.choice(["TCP", "UDP", "ICMP"]),
        "flag": random.choice(["SYN", "ACK", "RST", "FIN"])
    }

# Generate a consistent data structure for IDS alert
def generate_ids_alert():
    return {
        "type": IDS_ALERT,
        "timestamp": generate_timestamp(),
        "source_ip": generate_ip(),
        "alert_type": random.choice(["Malware", "Unauthorized Access", "DDoS", "Data Leak"]),
        "severity": random.choice(["Low", "Medium", "High"])
    }

# Generate a consistent data structure for authentication record
def generate_auth_record():
    return {
        "type": AUTH_RECORD,
        "timestamp": generate_timestamp(),
        "user": f"user{random.randint(1, 100)}",
        "auth_method": random.choice(["Password", "Biometrics", "Token"]),
        "success": random.choice([True, False])
    }

# Generate the dummy data with consistent structure for each type
def generate_dummy_data(file_path, num_records=1000):
    with open(file_path, 'w') as file:
        for _ in range(num_records // 3):
            # Generate and write network log data
            network_log = generate_network_log()
            file.write(json.dumps(network_log) + "\n")

            # Generate and write IDS alert data
            ids_alert = generate_ids_alert()
            file.write(json.dumps(ids_alert) + "\n")

            # Generate and write authentication record data
            auth_record = generate_auth_record()
            file.write(json.dumps(auth_record) + "\n")

# Specify the file path and generate the data
file_path = 'security_analytics_data.json'
generate_dummy_data(file_path)

# Print the path of the generated file
print(file_path)

The provided Python code generates mock data points. The data set consists of 999 total entries, divided evenly across three types of security analytics data: network log entries, IDS alert entries, and authentication record entries.

The network log entries (333 entries) simulate network traffic logs and contain details such as timestamps, source and destination IP addresses, bytes sent, the communication protocol used (TCP, UDP, ICMP), and the TCP flag status (SYN, ACK, RST, FIN). For example:

{
  "type": "network_log",
  "timestamp": "2023-11-07T12:18:17.810401",
  "source_ip": "3.147.192.253",
  "destination_ip": "234.105.239.182",
  "bytes_sent": 5138,
  "protocol": "TCP",
  "flag": "RST"
}

The IDS alert entries (333 entries) represent alerts from an intrusion detection system and include a timestamp, the source IP address of the potential threat, the type of alert (e.g., Malware, Unauthorized Access, DDoS, Data Leak), and the severity level (Low, Medium, High). For example:

{
  "type": "ids_alert",
  "timestamp": "2023-11-07T13:23:17.810542",
  "source_ip": "33.52.72.239",
  "alert_type": "DDoS",
  "severity": "High"
}

The authentication record entries (333 entries) mimic authentication attempts and include a timestamp, a user identifier, the method of authentication used (Password, Biometrics, Token), and a Boolean indicating whether the authentication attempt was successful. For example:

{
  "type": "auth_record",
  "timestamp": "2023-11-07T03:57:17.810583",
  "user": "user50",
  "auth_method": "Biometrics",
  "success": false
}

To execute the script and generate the data, open a terminal in the redpanda_bigquery directory where the generate_data.py file is saved, then run the following command in the Python virtual environment that you created beforehand:

python generate_data.py

When executed, this script writes the synthesized events, formatted as JSON objects, into a file named security_analytics_data.json in the redpanda_bigquery directory.

3. Setting up BigQuery

You'll now set up your BigQuery environment to receive and store streaming data. Refer to the Redpanda blog post on setting up BigQuery for guidance.

During the setup of BigQuery, you'll be asked to create a data set and a table and to define a schema for that table. To stay consistent with this tutorial, name the data set security_analytics_data and the table security-events, and use the same schema as below:

[
  {
    "name": "success",
    "type": "BOOLEAN",
    "mode": "NULLABLE"
  },
  {
    "name": "alert_type",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "user",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "severity",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "flag",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "protocol",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "source_ip",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "bytes_sent",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "auth_method",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "timestamp",
    "type": "TIMESTAMP",
    "mode": "NULLABLE"
  },
  {
    "name": "destination_ip",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "type",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
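If you prefer defining the table in SQL rather than through the console, the schema above roughly corresponds to the following DDL sketch (replace pandaq-404414, the project name used later in this tutorial, with your own project ID):

-- DDL equivalent of the schema above (adjust the project ID to your own)
CREATE TABLE `pandaq-404414.security_analytics_data.security-events` (
  success BOOL,
  alert_type STRING,
  user STRING,
  severity STRING,
  flag STRING,
  protocol STRING,
  source_ip STRING,
  bytes_sent INT64,
  auth_method STRING,
  timestamp TIMESTAMP,
  destination_ip STRING,
  type STRING
);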

4. Creating a topic in Redpanda

You'll also need to create a Redpanda topic called security-events. Later, you'll produce the generated data set to this topic. To create the topic, execute the following command in the terminal:

rpk topic create security-events

If you are running Redpanda in Docker, make sure to execute the command inside the container:

docker exec -it redpanda-0 rpk topic create security-events

In this case, redpanda-0 is the name of the running Docker container.

5. Setting up Redpanda to connect to BigQuery

The following sections focus on connecting Redpanda with Google BigQuery to enable data flow between them. This involves setting up Kafka Connect, a tool from Kafka's suite that facilitates data integration between Kafka (or Redpanda, in this case) and a variety of external systems like databases and cloud services.

The setup has two main components: configuring Kafka Connect to interface with Redpanda and setting up a sink connector for BigQuery. This will establish a pipeline that allows data to be streamed from Redpanda to BigQuery for analysis.

6. Setting up Kafka Connect

Kafka Connect was designed to enable data flow between Kafka and other systems to ensure robust and scalable data streaming. Kafka Connect is used to establish connections to databases, search engines, and various cloud storage services. Furthermore, its adaptability allows it to be used effectively with Redpanda.

Kafka Connect uses source and sink connectors. Source connectors are used to channel data into Kafka from external systems. On the other hand, sink connectors transmit the data from Kafka to other systems (in this case, BigQuery).

To set up Kafka Connect, you must download the Apache Kafka Connect package. For this tutorial, download the Kafka 3.1.0 binary package. Save the downloaded binary in the redpanda_bigquery folder.

7. Configuring a Kafka Connect cluster

Next, you must initiate a Kafka Connect cluster. To do so, download the BigQuery sink connector binaries and unzip the contents of the kafka-connect-bigquery folder into the redpanda_bigquery/plugins directory.

Following this, create a configuration file called connect.properties and save it in the redpanda_bigquery/configuration directory. Paste the following configuration into the connect.properties file:

# Addresses of the Kafka broker
bootstrap.servers=localhost:19092

# Path where the plugins should be stored
plugin.path=redpanda_bigquery/plugins

# JSON schemas are enabled
key.converter.schemas.enable=true
value.converter.schemas.enable=true

# The cluster-level converters are used when the connectors don't define any converters
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

# Specify where the topic offset configs are kept
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000

In this file, the bootstrap.servers parameter must be assigned to the address of the Redpanda cluster (in this case, localhost:19092). This links the Kafka Connect cluster to the Redpanda cluster. plugin.path has to point to the path where the binaries of the connector were saved (in this case, redpanda_bigquery/plugins).

Your redpanda_bigquery directory should now reflect the following structure:

redpanda_bigquery
├── configuration
│ ├── connect.properties
├── plugins
│ ├── kafka-connect-bigquery
│ │ ├── api-common-1.10.0.jar
│ │ ├── ...
│ │ └── threetenbp-1.4.4.jar
└── kafka_2.13-3.1.0

You’ve now configured the Kafka Connect cluster to integrate with Redpanda.

8. Configuring the BigQuery sink connector

Create a configuration file called bigquery-sink-connector.properties and save the file in the directory redpanda_bigquery/configuration. Paste the following configuration into the file to configure the sink connector to connect with BigQuery:

name=bigquery-sink-connector

connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector

# Specify if value converter contains a schema
value.converter.schemas.enable=false

# This is the converter for the value
value.converter=org.apache.kafka.connect.json.JsonConverter

# This is the converter for the key
key.converter=org.apache.kafka.connect.storage.StringConverter

topics=security-events

# The name of the BigQuery project to write to
project=pandaq-404414

# The name of the BigQuery data set to write to (leave the .*= at the beginning and enter your
# data set after it)
datasets=.*=security_analytics_data

defaultDataset=security_analytics_data

# The location of a BigQuery service account or user JSON credentials file,
# or service account or user credentials in JSON format (non-escaped JSON blob)
keyfile=<PATH OF YOUR BIGQUERY KEY FILE>

# 'FILE' if keyfile is a credentials file, 'JSON' if it's a credentials JSON
keySource=FILE

# Whether to automatically sanitize topic names before using them as table names
# If not enabled, topic names will be used directly as table names
sanitizeTopics=false

# Automatically create BigQuery tables if they don't already exist
autoCreateTables=false

# Whether or not to automatically update BigQuery schemas
autoUpdateSchemas=false

schemaRetriever=com.wepay.kafka.connect.bigquery.retrieve.IdentitySchemaRetriever

# The maximum number of records to buffer per table before temporarily halting the flow of new records, or -1 for unlimited buffering
bufferSize=100

maxWriteSize=100

tableWriteWait=1000

timestamp=UTC

bigQueryPartitionDecorator=false

In this configuration, you’ll have to input some information according to your unique project configuration:

  • topics is the name of the Redpanda topic (in this tutorial, it's security-events)
  • project is the name of the project created during the BigQuery setup (in this tutorial, it's pandaq-404414)
  • datasets and defaultDataset refer to the data set you created in BigQuery during setup (here, it's called security_analytics_data)
  • keyfile is the path of the downloaded BigQuery security credentials, obtained when setting up BigQuery

By configuring the file, you’ll enable the Kafka Connect cluster to recognize and utilize the Redpanda streams, effectively bridging them with BigQuery for data ingestion and analysis.

9. Starting the Kafka Connect cluster

Open a terminal on your system and navigate to the redpanda_bigquery/configuration directory. Within this directory, execute the command to start the Kafka Connect cluster with your specified settings:

../kafka_2.13-3.1.0/bin/connect-standalone.sh connect.properties bigquery-sink-connector.properties

This process activates the connectors with the configurations set in the connect.properties and bigquery-sink-connector.properties files, establishing the data pipeline between Redpanda and BigQuery.

10. Producing data to the Redpanda topic

You’ve successfully set up the Redpanda cluster and created a topic named security-events. Additionally, you've configured a connection between Redpanda and BigQuery.

With these preparations complete, you’re ready to begin producing the generated dummy data to the Redpanda topic. This data will subsequently be sent to BigQuery, allowing you to simulate a real-world data pipeline and analyze the flow and processing of information as it would occur in a live environment.

But first, you must install the Confluent Kafka Python library (this tutorial uses version 2.2.0), which provides the Producer class used in the following code. You need to install the library in the Python virtual environment that you created beforehand. To install it, open a terminal and execute the following command in your virtual environment:

pip install confluent-kafka

After that, create a Python file called produce.py in the redpanda_bigquery directory and paste in the code below to produce the data to the Redpanda topic:

import json
from confluent_kafka import Producer

# Configuration for connecting to Redpanda
config = {
    'bootstrap.servers': 'localhost:19092',  # Replace with your Redpanda server address
    'client.id': 'security-analytics-producer'
}

# Create a producer instance
producer = Producer(config)

# Topic to produce to
topic_name = 'security-events'

# Callback function to check if message delivery was successful
def delivery_report(err, msg):
    if err is not None:
        print(f'Message delivery failed: {err}')
    else:
        print(f'Message delivered to {msg.topic()} [{msg.partition()}]')

# Function to produce messages
def produce_message(data):
    # Trigger any available delivery report callbacks from previous produce() calls
    producer.poll(0)

    # Asynchronously produce a message; the delivery report callback will be
    # triggered once the message has been successfully produced or failed
    producer.produce(topic=topic_name, value=json.dumps(data), callback=delivery_report)

# Produce messages from the data file
def produce_data_from_file(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            record = json.loads(line.strip())
            produce_message(record)

    # Wait for any outstanding messages to be delivered and delivery report callbacks to be triggered
    producer.flush()

# Path to your data file
file_path = 'security_analytics_data.json'  # Replace with your actual file path

# Start producing data to Redpanda topic
produce_data_from_file(file_path)

This Python script serves as a data producer for the Redpanda topic security-events. It establishes a connection to a Redpanda server using a given address and creates a Kafka producer instance. The core function produce_message sends messages to the Redpanda topic, with each message being the JSON-serialized version of the data.

The produce_data_from_file function reads from a file containing the security analytics data, line by line, deserializing each line from JSON format and passing it to the produce_message function to be sent to the topic. Finally, the script calls produce_data_from_file, starting the process of reading the data from security_analytics_data.json and producing it to the specified Redpanda topic.

To run the script, open a terminal in the redpanda_bigquery directory and use the following command:

python produce.py

Upon executing this command, the records from security_analytics_data.json are produced to the specified Redpanda topic. At the same time, the BigQuery sink connector streams the same data into BigQuery, which you'll verify in the following section.

11. Examining the data in BigQuery

Now that the data has been successfully uploaded to the corresponding table in BigQuery, you can run queries against that table. To do this, go into your BigQuery console and click Query > In new tab. In this tab, you can now perform some SQL queries to examine the data.

12. Getting an overview of the data

To get a basic overview of the uploaded data, you can run the following SQL query:

SELECT * FROM `pandaq-404414.security_analytics_data.security-events` LIMIT 10;

This SQL query selects all columns from the security-events table within the security_analytics_data data set of the pandaq-404414 project. It limits the result set to the first ten rows, providing a quick snapshot of the top ten records from the specified table. The result should look like this:

The first ten records from the table
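To double-check that the generated records arrived and are split evenly across the three event types, you can also run a quick count by type. Assuming every message was delivered exactly once, each type should show 333 records:

SELECT type, COUNT(*) AS record_count
FROM `pandaq-404414.security_analytics_data.security-events`
GROUP BY type
ORDER BY record_count DESC;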

13. Identifying large data transfers

To detect large data transfers, which can be useful for identifying potential data exfiltration, you can use the query below:

SELECT *
FROM `pandaq-404414.security_analytics_data.security-events`
WHERE bytes_sent > 64000
ORDER BY bytes_sent DESC;

The query selects all records, filtering for entries where the bytes_sent value exceeds 64,000 bytes. The results are ordered in descending order based on the bytes_sent value, showing the largest transfers at the top:

Largest data transfers

14. Seeing the distribution of authentication methods

To see the distribution of different authentication methods, you can use the following query:

SELECT auth_method, COUNT(*) AS Count
FROM `pandaq-404414.security_analytics_data.security-events`
GROUP BY auth_method;

This SQL query counts the number of occurrences for each unique authentication method. It groups the records based on different auth_method values and displays the count for each. The result should look like this:

Distribution of different authentication methods
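A natural follow-up is to compare failure rates rather than raw counts, since an authentication method with an unusually high failure rate can hint at credential-stuffing attempts. For example:

SELECT
  auth_method,
  COUNT(*) AS attempts,
  COUNTIF(success = false) AS failures,
  ROUND(COUNTIF(success = false) / COUNT(*), 2) AS failure_rate
FROM `pandaq-404414.security_analytics_data.security-events`
WHERE type = 'auth_record'
GROUP BY auth_method;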

15. Viewing a summary of IDS alerts

Insights into IDS alerts can be gained through the following query:

SELECT alert_type, severity, COUNT(*) AS Count
FROM `pandaq-404414.security_analytics_data.security-events`
GROUP BY alert_type, severity
ORDER BY Count DESC;

This SQL query aggregates records from the security-events table, grouping them by alert_type and severity. It counts the number of occurrences for each combination of alert type and severity and orders the results in descending order based on this count:

Summary of IDS alerts
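You can take this a step further and rank source IPs by the number of high-severity alerts they triggered. With the randomly generated IPs in this data set most counts will be 1, but against real logs this is a quick way to shortlist hosts for investigation:

SELECT source_ip, COUNT(*) AS high_severity_alerts
FROM `pandaq-404414.security_analytics_data.security-events`
WHERE type = 'ids_alert' AND severity = 'High'
GROUP BY source_ip
ORDER BY high_severity_alerts DESC
LIMIT 10;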

16. Finding failed authentication attempts

To find details on failed authentication attempts, you can use the SQL query below:

SELECT *
FROM `pandaq-404414.security_analytics_data.security-events`
WHERE success = false
LIMIT 10;

This SQL query retrieves the first ten records, filtering to include only those records where the success field is false. The output should look as follows:

Failed authentication attempts
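To spot possible brute-force activity, you can also group the failed attempts by user and surface the accounts with the most failures:

SELECT user, COUNT(*) AS failed_attempts
FROM `pandaq-404414.security_analytics_data.security-events`
WHERE type = 'auth_record' AND success = false
GROUP BY user
ORDER BY failed_attempts DESC
LIMIT 10;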

Conclusion

Using Redpanda and Google BigQuery allows for efficient real-time data streaming and analysis, which is particularly valuable for cybersecurity.

Redpanda provides a Kafka-compatible platform for immediate data ingestion and distribution, while BigQuery offers a serverless architecture for fast and scalable data processing and analysis. This integration is a suitable solution for companies that aim to enhance their cybersecurity capabilities through rapid data analysis and threat detection.

For a deeper dive, the complete code of this tutorial is available via this GitHub repository, which contains all necessary scripts and configurations. You can also explore the Redpanda documentation and the Redpanda blog.

For community support and lighthearted discussions, the Redpanda Community on Slack is an excellent place to connect with experts and enthusiasts!
