Wednesday, November 26, 2025

Generator Expression vs List Comprehension in Python

 When handling large datasets in Python, both performance and memory usage are key concerns. Python offers two powerful tools for creating sequences: list comprehensions and generator expressions. They look similar in code, but they behave very differently.

This article explains both concepts clearly and provides a complete code example along with sample outputs.


What Is a List Comprehension?

A list comprehension creates all results immediately and stores them in memory.

Example:

lst = [i * i for i in range(100)]

How It Works

  • Python evaluates the entire expression.
  • Every squared value is computed.
  • All results are stored inside a list.
  • Requires enough memory to hold the entire list.

Effects

  • Fast when you need repeated access.
  • High memory usage for large ranges (easily in GBs).
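The memory difference is easy to observe with sys.getsizeof. A minimal sketch (note that getsizeof reports only the container object's own size, not the integers it references):

```python
import sys

# A list comprehension materializes every value up front...
lst = [i * i for i in range(100_000)]

# ...while a generator expression stores only its iteration state.
gen = (i * i for i in range(100_000))

print(sys.getsizeof(lst))  # hundreds of kilobytes
print(sys.getsizeof(gen))  # a couple hundred bytes, independent of the range
```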


What Is a Generator Expression?

A generator expression does not create or store all results. It produces one value at a time only when needed.

Example:

gen = (i * i for i in range(100))

How It Works

  • No values are computed immediately.
  • When you iterate over it (for example, using sum()), Python generates one value at a time.
  • Only one item exists in memory at any given moment.

Effects

  • Very low memory usage.
  • Ideal for large datasets.
  • Can be slightly slower per item than iterating a list you already hold in memory, though avoiding one huge allocation often makes it faster overall.
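The laziness described above can be seen directly with next(), which pulls one value at a time from the generator:

```python
gen = (i * i for i in range(5))

# Nothing is computed until a value is requested; each next() produces one item.
first = next(gen)   # 0
second = next(gen)  # 1
third = next(gen)   # 4

# The remaining values (9 and 16) can still be consumed by sum() or a loop.
rest = sum(gen)     # 25
print(first, second, third, rest)
```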


Complete Code Example

import time
import sys

# Generator expression
start_time = time.time()
gen = (i * i for i in range(100000000))
gen_sum = sum(gen)
gen_time = time.time() - start_time
print(f"Generator sum: {gen_sum}")
print(f"Generator runtime: {gen_time:.4f} seconds")
print(f"Generator memory: {sys.getsizeof(gen)} bytes")

# List comprehension
start_time = time.time()
lst = [i * i for i in range(100000000)]
lst_sum = sum(lst)
lst_time = time.time() - start_time
print(f"List sum: {lst_sum}")
print(f"List runtime: {lst_time:.4f} seconds")
print(f"List memory: {sys.getsizeof(lst)} bytes")

Example Output

Below is a typical output you would see (numbers will vary depending on your system):

Generator sum: 333333328333333350000000
Generator runtime: 4.8123 seconds
Generator memory: 112 bytes
List sum: 333333328333333350000000
List runtime: 12.5478 seconds
List memory: 800000112 bytes

What This Output Shows

  • Both produce the same sum, so the logic is identical.
  • The generator uses only 112 bytes in this run, regardless of how large the range is; a generator object stores iteration state, not values.
  • The list uses roughly 800 MB because it stores all 100 million values.
  • Time difference:
    • Generator: values are computed one at a time and consumed immediately by sum(), so nothing accumulates.
    • List: takes much longer here because it must build and store all 100 million elements before summing.

When to Use List Comprehension

Choose a list comprehension when:

  • You need all results stored.
  • You want to access values multiple times.
  • Memory is not a concern.
  • You need fast random access.

Typical use cases:

  • Preparing training data for ML models
  • Filtering small or medium datasets
  • Performing multiple operations on the same data
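The "access values multiple times" point is the key practical difference: a list can be traversed any number of times, but a generator is exhausted after a single pass. A small sketch:

```python
lst = [i * i for i in range(10)]
lst_total = sum(lst)   # 285 — the list is still intact afterwards
lst_max = max(lst)     # 81 — a second pass works fine

gen = (i * i for i in range(10))
gen_total = sum(gen)   # 285 — this consumes the generator
leftover = max(gen, default=None)  # None — the generator is already exhausted
print(lst_total, lst_max, gen_total, leftover)
```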


When to Use Generator Expression

Choose a generator when:

  • You are working with massive datasets.
  • Memory is limited.
  • You only need each value once.
  • You want streaming-like processing.

Use cases:

  • Reading large log files
  • Streaming data from a database or API
  • Processing big data line by line
  • Feeding values to sum(), min(), max(), or loops without storing them
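For example, log lines can be fed straight into sum() without ever holding the whole file in memory. In this sketch io.StringIO stands in for a real file handle; in real code it would be open("server.log"):

```python
import io

# StringIO keeps the example self-contained; a real file object works the same way.
log = io.StringIO("ERROR disk full\nINFO ok\nERROR timeout\n")

# Count error lines one at a time, without building an intermediate list.
error_count = sum(1 for line in log if line.startswith("ERROR"))
print(error_count)  # 2
```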


Summary Table

Feature      | List Comprehension                                        | Generator Expression
-------------|-----------------------------------------------------------|----------------------------------------------------
Memory usage | High; stores all elements in memory.                      | Very low; generates one value at a time.
Evaluation   | Eager; computes all values immediately.                   | Lazy; computes values only when needed.
Speed        | Fast when reusing data multiple times.                    | Efficient for single-pass operations.
Storage      | Stores a complete list in memory.                         | Stores no data; holds only iteration logic.
Use cases    | Small and medium datasets, repeated access, ML preprocessing. | Large datasets, streaming, memory-sensitive operations.
Syntax       | [expr for item in iterable]                               | (expr for item in iterable)

Final Thoughts

Both list comprehensions and generator expressions are essential tools for writing efficient Python code. If you need to store all results and reuse them, choose a list comprehension. If you want to save memory and process large datasets efficiently, a generator expression is the right choice.


Assisted by ChatGPT

Monday, November 24, 2025

Tests in DBT

 

Understanding Tests in DBT

Testing is one of the most important features of dbt. It helps ensure that your data models are correct, reliable, and ready for downstream analytics. In dbt, tests are written using SQL and YAML, and they run automatically as part of your pipeline. This article explains the two main types of tests in dbt: singular tests and generic tests. We will also look at how to create reusable macros and how to configure tests using YAML files.


What Are Tests in dbt?

dbt tests are small SQL queries that check the quality of your data. Tests return either:

  • Pass: when the query returns zero rows

  • Fail: when the query returns one or more rows

A failing test indicates that something is wrong with the data and needs attention.
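dbt also ships with built-in generic tests (not_null, unique, accepted_values, relationships) that follow the same pass/fail rule. A typical schema.yml entry applying two of them to a hypothetical id column looks like this:

```yaml
version: 2

models:
  - name: sales
    columns:
      - name: id
        tests:
          - not_null
          - unique
```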


1. Singular Tests in dbt

A singular test is the simplest type of test in dbt. It is a SQL file that contains a query designed to catch bad data.

How a Singular Test Works

  • You write a SQL query.

  • The query should return only the rows that violate your expectation.

  • If the query returns rows, the test fails.

Example: Check for Negative Values

Suppose you want to ensure that the column amount never contains negative values.

Create a file inside:

tests/test_negative_amounts.sql

Inside this file:

select *
from {{ ref('sales') }}
where amount < 0

If any row in the sales model has a negative amount, this test will fail.

When To Use Singular Tests

  • When the logic is complex

  • When the condition is unique to a specific model

  • When the validation cannot be generalized


2. Generic Tests in dbt

A generic test is a reusable test that can be applied to multiple models and multiple columns. Instead of writing SQL again and again, you create a macro once and use it everywhere.

Why Use Generic Tests?

  • Saves time

  • Reduces repeated code

  • Ensures consistent testing across the project


3. Creating a Generic Test Macro

Generic tests are stored inside:

macros/tests/

Example Macro: Check Non-Negative Values

Create a file:

macros/tests/non_negative.sql

Add the following:

{% test non_negative(model, column_name) %}

select *
from {{ model }}
where {{ column_name }} < 0

{% endtest %}

Explanation:

  • test non_negative is the name of the test.

  • model and column_name are variables.

  • The SQL returns rows where the condition is violated.


4. Using YAML to Apply Generic Tests to Models

Generic tests are applied in the YAML file of your model.
dbt looks for a schema.yml file inside the model directory.

Example:

models/sales/schema.yml

Inside the YAML file:

version: 2

models:
  - name: sales
    columns:
      - name: amount
        tests:
          - non_negative

This applies the generic non_negative test to the amount column.


5. Passing Variables to Generic Tests

You can also create tests where additional conditions are passed as variables.

Example: Test Threshold

Suppose you want to create a test that checks whether a value exceeds a threshold.

Macro:

{% test value_above_threshold(model, column_name, threshold) %}

select *
from {{ model }}
where {{ column_name }} > {{ threshold }}

{% endtest %}

YAML:

version: 2

models:
  - name: sales
    tests:
      - value_above_threshold:
          column_name: score
          threshold: 90

This allows the same macro to be used with different threshold values on different models.
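Generic tests can also take configuration alongside their arguments. As a sketch, dbt's severity config downgrades a failure to a warning, which is useful while a data issue is being investigated:

```yaml
version: 2

models:
  - name: sales
    tests:
      - value_above_threshold:
          column_name: score
          threshold: 90
          config:
            severity: warn
```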

6. Benefits of Using YAML for dbt Tests

Using a YAML file helps in:

  • Clear organization of tests

  • Easier maintenance

  • Version-controlled configuration

  • Quick visibility of all tests related to a model

YAML also ensures that tests stay close to the model definitions, making the project easier to read and understand.


7. Summary

dbt testing is powerful yet simple. Here is a quick summary:

Singular Tests

  • Written as SQL queries

  • Best for complex or model-specific checks

  • Fail if the query returns rows

Generic Tests

  • Created using macros

  • Reusable across multiple models and columns

  • Configured using YAML

  • Accept variables for flexibility

Using both types effectively makes your dbt project more reliable and maintainable.


Content assisted by ChatGPT

Friday, November 14, 2025

Creating a kernel in a UV project for jupyter notebooks in VS Code


This guide explains how to set up a dedicated Jupyter kernel for your uv project that uses the project's own virtual environment (.venv).



1. Install ipykernel as a Development Dependency

First, ensure that your project has ipykernel installed so that it can create kernels for Jupyter.

uv add --dev ipykernel

  • --dev: installs it as a development dependency.

  • ipykernel: the package that allows creating Jupyter kernels.


2. Create a Jupyter Kernel for Your Project's .venv

Run the following command to register a Jupyter kernel that points to the project's virtual environment:

uv run ipython kernel install --user --env VIRTUAL_ENV=$(pwd)/.venv --name=project 

Explanation of the command:

  • uv run ipython kernel install: runs the IPython kernel installation inside the uv environment.

  • --user: installs the kernel only for the current user.

  • --env VIRTUAL_ENV=$(pwd)/.venv: points the kernel to use your project's virtual environment located at .venv.

  • --name=project: the name of the kernel that will appear in Jupyter.


3. Reload VS Code (if using VS Code)

After creating the kernel, reload your VS Code window to make it available:

  1. Press Ctrl+Shift+P (or Cmd+Shift+P on Mac) to open the Command Palette.

  2. Type Reload Window and hit Enter.


4. Select the Kernel in Jupyter Notebook or VS Code

  1. Open your Jupyter notebook.

  2. Click on the kernel name (top right corner) or Select Kernel.

  3. Choose the newly created kernel project.


Notes

  • Make sure your .venv is already created and has Python installed.

  • Using this method ensures that the notebook runs in the same environment as your uv project.

  • This is especially useful when working with isolated dependencies for different projects.
