Wednesday, November 26, 2025

Generator Expression vs List Comprehension in Python

 When handling large datasets in Python, both performance and memory usage are key concerns. Python offers two powerful tools for creating sequences: list comprehensions and generator expressions. They look similar in code, but they behave very differently.

This article explains both concepts clearly and provides a complete code example along with sample outputs.


What Is a List Comprehension?

A list comprehension creates all results immediately and stores them in memory.

Example:

lst = [i * i for i in range(100)]

How It Works

  • Python evaluates the entire expression.
  • Every squared value is computed.
  • All results are stored inside a list.
  • Requires enough memory to hold the entire list.

Effects

  • Fast when you need repeated access.
  • High memory usage for large ranges (easily in GBs).
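The memory difference is easy to observe with sys.getsizeof. A minimal sketch (note that getsizeof reports only the container object's own size, not the integers it references):

```python
import sys

# A list comprehension materializes every value up front...
lst = [i * i for i in range(100_000)]

# ...while a generator expression stores only its iteration state.
gen = (i * i for i in range(100_000))

print(sys.getsizeof(lst))  # hundreds of kilobytes
print(sys.getsizeof(gen))  # a couple hundred bytes, independent of the range
```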


What Is a Generator Expression?

A generator expression does not create or store all results. It produces one value at a time only when needed.

Example:

gen = (i * i for i in range(100))

How It Works

  • No values are computed immediately.
  • When you iterate over it (for example, using sum()), Python generates one value at a time.
  • Only one item exists in memory at any given moment.

Effects

  • Very low memory usage.
  • Ideal for large datasets.
  • Can be slightly slower per item than iterating a list you already hold in memory, though avoiding one huge allocation often makes it faster overall.
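The laziness described above can be seen directly with next(), which pulls one value at a time from the generator:

```python
gen = (i * i for i in range(5))

# Nothing is computed until a value is requested; each next() produces one item.
first = next(gen)   # 0
second = next(gen)  # 1
third = next(gen)   # 4

# The remaining values (9 and 16) can still be consumed by sum() or a loop.
rest = sum(gen)     # 25
print(first, second, third, rest)
```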


Complete Code Example

import time
import sys

# Generator expression
start_time = time.time()
gen = (i * i for i in range(100000000))
gen_sum = sum(gen)
gen_time = time.time() - start_time
print(f"Generator sum: {gen_sum}")
print(f"Generator runtime: {gen_time:.4f} seconds")
print(f"Generator memory: {sys.getsizeof(gen)} bytes")

# List comprehension
start_time = time.time()
lst = [i * i for i in range(100000000)]
lst_sum = sum(lst)
lst_time = time.time() - start_time
print(f"List sum: {lst_sum}")
print(f"List runtime: {lst_time:.4f} seconds")
print(f"List memory: {sys.getsizeof(lst)} bytes")

Example Output

Below is a typical output you would see (numbers will vary depending on your system):

Generator sum: 333333328333333350000000
Generator runtime: 4.8123 seconds
Generator memory: 112 bytes
List sum: 333333328333333350000000
List runtime: 12.5478 seconds
List memory: 800000112 bytes

What This Output Shows

  • Both produce the same sum, so the logic is identical.
  • The generator uses only 112 bytes in this run, regardless of how large the range is; a generator object stores iteration state, not values.
  • The list uses roughly 800 MB because it stores all 100 million values.
  • Time difference:
    • Generator: values are computed one at a time and consumed immediately by sum(), so nothing accumulates.
    • List: takes much longer here because it must build and store all 100 million elements before summing.

When to Use List Comprehension

Choose a list comprehension when:

  • You need all results stored.
  • You want to access values multiple times.
  • Memory is not a concern.
  • You need fast random access.

Typical use cases:

  • Preparing training data for ML models
  • Filtering small or medium datasets
  • Performing multiple operations on the same data
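The "access values multiple times" point is the key practical difference: a list can be traversed any number of times, but a generator is exhausted after a single pass. A small sketch:

```python
lst = [i * i for i in range(10)]
lst_total = sum(lst)   # 285 — the list is still intact afterwards
lst_max = max(lst)     # 81 — a second pass works fine

gen = (i * i for i in range(10))
gen_total = sum(gen)   # 285 — this consumes the generator
leftover = max(gen, default=None)  # None — the generator is already exhausted
print(lst_total, lst_max, gen_total, leftover)
```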


When to Use Generator Expression

Choose a generator when:

  • You are working with massive datasets.
  • Memory is limited.
  • You only need each value once.
  • You want streaming-like processing.

Use cases:

  • Reading large log files
  • Streaming data from a database or API
  • Processing big data line by line
  • Feeding values to sum(), min(), max(), or loops without storing them
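For example, log lines can be fed straight into sum() without ever holding the whole file in memory. In this sketch io.StringIO stands in for a real file handle; in real code it would be open("server.log"):

```python
import io

# StringIO keeps the example self-contained; a real file object works the same way.
log = io.StringIO("ERROR disk full\nINFO ok\nERROR timeout\n")

# Count error lines one at a time, without building an intermediate list.
error_count = sum(1 for line in log if line.startswith("ERROR"))
print(error_count)  # 2
```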


Summary Table

Feature      | List Comprehension                                        | Generator Expression
-------------|-----------------------------------------------------------|----------------------------------------------------
Memory usage | High; stores all elements in memory.                      | Very low; generates one value at a time.
Evaluation   | Eager; computes all values immediately.                   | Lazy; computes values only when needed.
Speed        | Fast when reusing data multiple times.                    | Efficient for single-pass operations.
Storage      | Stores a complete list in memory.                         | Stores no data; holds only iteration logic.
Use cases    | Small and medium datasets, repeated access, ML preprocessing. | Large datasets, streaming, memory-sensitive operations.
Syntax       | [expr for item in iterable]                               | (expr for item in iterable)

Final Thoughts

Both list comprehensions and generator expressions are essential tools for writing efficient Python code. If you need to store all results and reuse them, choose a list comprehension. If you want to save memory and process large datasets efficiently, a generator expression is the right choice.


Assisted by ChatGPT

Monday, November 24, 2025

Tests in DBT

 

Understanding Tests in DBT

Testing is one of the most important features of dbt. It helps ensure that your data models are correct, reliable, and ready for downstream analytics. In dbt, tests are written using SQL and YAML, and they run automatically as part of your pipeline. This article explains the two main types of tests in dbt: singular tests and generic tests. We will also look at how to create reusable macros and how to configure tests using YAML files.


What Are Tests in dbt?

dbt tests are small SQL queries that check the quality of your data. Tests return either:

  • Pass: when the query returns zero rows

  • Fail: when the query returns one or more rows

A failing test indicates that something is wrong with the data and needs attention.
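dbt also ships with built-in generic tests (not_null, unique, accepted_values, relationships) that follow the same pass/fail rule. A typical schema.yml entry applying two of them to a hypothetical id column looks like this:

```yaml
version: 2

models:
  - name: sales
    columns:
      - name: id
        tests:
          - not_null
          - unique
```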


1. Singular Tests in dbt

A singular test is the simplest type of test in dbt. It is a SQL file that contains a query designed to catch bad data.

How a Singular Test Works

  • You write a SQL query.

  • The query should return only the rows that violate your expectation.

  • If the query returns rows, the test fails.

Example: Check for Negative Values

Suppose you want to ensure that the column amount never contains negative values.

Create a file inside:

tests/test_negative_amounts.sql

Inside this file:

select *
from {{ ref('sales') }}
where amount < 0

If any row in the sales model has a negative amount, this test will fail.

When To Use Singular Tests

  • When the logic is complex

  • When the condition is unique to a specific model

  • When the validation cannot be generalized


2. Generic Tests in dbt

A generic test is a reusable test that can be applied to multiple models and multiple columns. Instead of writing SQL again and again, you create a macro once and use it everywhere.

Why Use Generic Tests?

  • Saves time

  • Reduces repeated code

  • Ensures consistent testing across the project


3. Creating a Generic Test Macro

Generic tests are stored inside:

macros/tests/

Example Macro: Check Non-Negative Values

Create a file:

macros/tests/non_negative.sql

Add the following:

{% test non_negative(model, column_name) %}

select *
from {{ model }}
where {{ column_name }} < 0

{% endtest %}

Explanation:

  • test non_negative is the name of the test.

  • model and column_name are variables.

  • The SQL returns rows where the condition is violated.


4. Using YAML to Apply Generic Tests to Models

Generic tests are applied in the YAML file of your model.
dbt looks for a schema.yml file inside the model directory.

Example:

models/sales/schema.yml

Inside the YAML file:

version: 2

models:
  - name: sales
    columns:
      - name: amount
        tests:
          - non_negative

This applies the generic non_negative test to the amount column.


5. Passing Variables to Generic Tests

You can also create tests where additional conditions are passed as variables.

Example: Test Threshold

Suppose you want to create a test that checks whether a value exceeds a threshold.

Macro:

{% test value_above_threshold(model, column_name, threshold) %}

select *
from {{ model }}
where {{ column_name }} > {{ threshold }}

{% endtest %}

YAML:

version: 2

models:
  - name: sales
    tests:
      - value_above_threshold:
          column_name: score
          threshold: 90

This allows the same macro to be used with different threshold values on different models.
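Generic tests can also take configuration alongside their arguments. As a sketch, dbt's severity config downgrades a failure to a warning, which is useful while a data issue is being investigated:

```yaml
version: 2

models:
  - name: sales
    tests:
      - value_above_threshold:
          column_name: score
          threshold: 90
          config:
            severity: warn
```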

6. Benefits of Using YAML for dbt Tests

Using a YAML file helps in:

  • Clear organization of tests

  • Easier maintenance

  • Version-controlled configuration

  • Quick visibility of all tests related to a model

YAML also ensures that tests stay close to the model definitions, making the project easier to read and understand.


7. Summary

dbt testing is powerful yet simple. Here is a quick summary:

Singular Tests

  • Written as SQL queries

  • Best for complex or model-specific checks

  • Fail if the query returns rows

Generic Tests

  • Created using macros

  • Reusable across multiple models and columns

  • Configured using YAML

  • Accept variables for flexibility

Using both types effectively makes your dbt project more reliable and maintainable.


Content assisted by ChatGPT

Friday, November 14, 2025

Creating a kernel in a UV project for jupyter notebooks in VS Code


This guide explains how to set up a dedicated Jupyter kernel for your uv project that uses the project's own virtual environment (.venv).



1. Install ipykernel as a Development Dependency

First, ensure that your project has ipykernel installed so that it can create kernels for Jupyter.

uv add --dev ipykernel

  • --dev: installs it as a development dependency.

  • ipykernel: the package that allows creating Jupyter kernels.


2. Create a Jupyter Kernel for Your Project's .venv

Run the following command to register a Jupyter kernel that points to the project's virtual environment:

uv run ipython kernel install --user --env VIRTUAL_ENV=$(pwd)/.venv --name=project 

Explanation of the command:

  • uv run ipython kernel install: runs the IPython kernel installation inside the uv environment.

  • --user: installs the kernel only for the current user.

  • --env VIRTUAL_ENV=$(pwd)/.venv: points the kernel to use your project's virtual environment located at .venv.

  • --name=project: the name of the kernel that will appear in Jupyter.


3. Reload VS Code (if using VS Code)

After creating the kernel, reload your VS Code window to make it available:

  1. Press Ctrl+Shift+P (or Cmd+Shift+P on Mac) to open the Command Palette.

  2. Type Reload Window and hit Enter.


4. Select the Kernel in Jupyter Notebook or VS Code

  1. Open your Jupyter notebook.

  2. Click on the kernel name (top right corner) or Select Kernel.

  3. Choose the newly created kernel project.


Notes

  • Make sure your .venv is already created and has Python installed.

  • Using this method ensures that the notebook runs in the same environment as your uv project.

  • This is especially useful when working with isolated dependencies for different projects.
