0. Brieflow <> Brieflow Analysis

Overview

Brieflow and brieflow-analysis are closely related repositories built together and, usually, both used for a screen analysis. We distinguish these repositories like so:

brieflow: code to process OPS data on a large scale
brieflow-analysis: notebooks, files, and scripts that are used during a brieflow run

Both of these work together to run modules for steps like preprocessing, SBS, phenotype, etc. Let’s take a closer look:

Brieflow

Components

Brieflow has the following components:

lib: Brieflow library code used for performing Brieflow processing. Organized into module-specific, shared, and external code.
rules: Snakemake rule files for each module. Used to organize processses within each module with inputs, outputs, parameters, and script file location.
scripts: Python script files for processes called by rules. Organized into module-specific and shared code.
targets: Snakemake files used to define inputs and their mappings for each module
Snakefile: Main Snakefile used to call modules.

One of the simplest examples for this is read calling during the SBS step. We can approach this from a top -> down perspective to understand what is going on.

In the main Snakefile we tell Snakemake to include the rules and targets for the entire SBS module:

if "sbs" in config and len(sbs_wildcard_combos) > 0:

    # Include target and rule files
    include: "targets/sbs.smk"
    include: "rules/sbs.smk"

Snakemake first looks at the targets to see what we want produced. The read calling output file is specified here:

"call_reads": [
    SBS_FP
    / "tsvs"
    / get_filename(
        {"plate": "{plate}", "well": "{well}", "tile": "{tile}"}, "reads", "tsv"
    ),
],

Snakemake then looks through the rules to see what needs to be run to produce this file. We need to run the following rule to get the call_reads output:

rule call_reads:
    input:
        SBS_OUTPUTS["extract_bases"],
        SBS_OUTPUTS["find_peaks"],
    output:
        SBS_OUTPUTS_MAPPED["call_reads"],
    params:
        call_reads_method=config["sbs"]["call_reads_method"]
    script:
        "../scripts/sbs/call_reads.py"

This process takes 2 inputs, produces 1 output, and passes one param to the script to do so. 4) Snakemake loads the script for this rule:

from lib.sbs.call_reads import call_reads

# load bases data
bases_data = pd.read_csv(snakemake.input[0], sep="\t")

# load peaks data
peaks_data = imread(snakemake.input[1])

# call reads
reads_data = call_reads(
    bases_data=bases_data,
    peaks_data=peaks_data,
    method=snakemake.params.call_reads_method,
)

# save reads data
reads_data.to_csv(snakemake.output[0], index=False, sep="\t")

This is a very simple script that loads data, calls a function, and saves data. 5) Finally, snakemake accesses the library code that we use here:

def call_reads(
    bases_data,
    peaks_data=None,
    correction_only_in_cells=True,
    normalize_bases_first=True,
    method="median",
):
    """Call reads for in situ sequencing data.

    Call reads by compensating for channel cross-talk and calling the base
    with the highest corrected intensity for each cycle.

    Args:
    bases_data : pandas DataFrame
        Table of base intensity for all candidate reads, output of extract_bases.
...

Project Structure

Brieflow is built on top of Snakemake. We follow the Snakemake structure guidelines with some exceptions. The Brieflow project structure is as follows:

workflow/
├── lib/ - Brieflow library code used for performing Brieflow processing. Organized into module-specific, shared, and external code.
├── rules/ - Snakemake rule files for each module. Used to organize processses within each module with inputs, outputs, parameters, and script file location.
├── scripts/ - Python script files for processes called by modules. Organized into module-specific and shared code.
├── targets/ - Snakemake files used to define inputs and their mappings for each module. 
└── Snakefile - Main Snakefile used to call modules.

Brieflow runs as follows:

A user configure parameters in Jupyter notebooks to use the Brieflow library code correctly for their data.
A user runs the main Snakefile with bash scripts (locally or on an HPC).
The main Snakefile calls module-specific snakemake files with rules for each process.
Each process rule calls a script.
Scripts use the Brieflow library code to transform the input files defined in targets into the output files defined in targets.

Brieflow Analysis

The analysis repo holds the files neccessary for configuring and running brieflow. In the case of the read calling function above we:

Run the 2.configure_sbs_params.ipynb notebook.
Set the CALL_READS_METHOD parameter in this notebook.

# Define parameters for extracting bases
CALL_READS_METHOD = "median"

Save this parameter to the config file at the end of the notebook:

config["sbs"] = {
    ...
    "call_reads_method": CALL_READS_METHOD,
    ...
}

# Write the updated configuration back with markdown-style comments
with open(CONFIG_FILE_PATH, "w") as config_file:
    # Write the introductory markdown-style comments
    config_file.write(CONFIG_FILE_HEADER)

    # Dump the updated YAML structure, keeping markdown comments for sections
    yaml.dump(config, config_file, default_flow_style=False, sort_keys=False)

This parameter gets passed to the snakemake rule during a run (see above)

rule call_reads:
...
    params:
        call_reads_method=config["sbs"]["call_reads_method"]
...

Reproducibility and Modularity

Brieflow and brieflow-analysis are built for reproducibility and modularity. This setup enables researchers to

work on a specific process within brieflow (ex, make siloed changes to call_reads)
develop versions of brieflow that can be used across multiple brieflow-analysis repositories (ex, a custom_screen branch for brieflow can be used in multiple brieflow-analysis repos)
track exact differences between the main branch of brieflow and a custom branch
host an entire screen analysis on GitHub for reproducibility