Just Enough Python

for R users to write advanced Snakemake workflows

Author
Published

2022-11-14

Modified

2022-11-21

Snakemake is a workflow management tool implemented as extra magic on top of Python. If you want to take your Snakemake workflows from beginner to advanced, learning a little Python goes a long way.

This is just enough Python to understand and apply some cool tricks in your Snakemake workflows.

The Python, R, and Snakemake logos linked together.1

prereqs
  • Basic Snakemake knowledge. You’ve gone through the official tutorial. Maybe you’ve even built your own snakemake workflow.
  • Minimal or no python knowledge, aside from what you may have gleaned from the snakemake docs.
  • Basic R knowledge. Tidyverse experience is helpful but not required.

variables

Here are the basic variable types in Python and the equivalent type in R:

basic types

Python R
string character
integer integer
float numeric
boolean logical
None NULL
Python
# string
name = "Kelly"
favorite_color = 'green'
# float
height_inches = 64.0
# int
num_plants = 14
# bool
likes_cats = True
# none
invalid = None
R
# character
name <- "Kelly"
favorite_color <- 'green'
# numeric
height_inches <- 64
# integer
num_plants <- 14L
# logical
likes_cats <- TRUE
# null
invalid = NULL

A key difference is that in R, all of the above objects are vectors of length 1.

type casting

You can convert variables to different types:

Python
int(32.0)
32
float(32)
32.0
str(32)
'32'
bool(0)
False
bool(1)
True
bool(None)
False

The equivalent R code:

R
as.integer(32)
[1] 32
as.numeric(32L)
[1] 32
as.character(32)
[1] "32"
as.logical(0)
[1] FALSE
as.logical(1)
[1] TRUE
as.logical(NULL)
logical(0)

What similarities and differences do you notice when type-casting in Python vs. R?

Naming variables

The biggest difference in variable names between R and Python: no dots are allowed in Python variable names. All other rules from R apply. Python variable names must start with an alphabetic character and they can contain alphanumeric characters and underscores.

Just like in R, it is a good idea not to name a variable the same thing as a reserved keyword or global function.

Python keywords
help("keywords")

Here is a list of the Python keywords.  Enter any keyword to get more help.

False               break               for                 not
None                class               from                or
True                continue            global              pass
__peg_parser__      def                 if                  raise
and                 del                 import              return
as                  elif                in                  try
assert              else                is                  while
async               except              lambda              with
await               finally             nonlocal            yield

formatting strings

You can add strings together to build new ones like this:

Python
firstname = "Marie"
lastname = "Curie"
middle_init = 'S'
prefix = "Madame"

fullname = prefix + " " + firstname + " " + middle_init + ". " + lastname
print(fullname)
Madame Marie S. Curie

The print function takes any number of arguments and will insert a space between items by default:

Python
print(prefix, firstname, middle_init + ".", lastname)
Madame Marie S. Curie

But there’s a better way!

f-Strings in Python work just like glue() in R, where curly braces in the string are interpreted as code. They’re much easier to read & write:

Python
f"{prefix} {firstname} {middle_init}. {lastname}"
'Madame Marie S. Curie'

This is equivalent to:

R
firstname <- "Marie"
lastname <- "Curie"
middle_init <- 'S'
prefix <- "Madame"
glue::glue("{prefix} {firstname} {middle_init}. {lastname}")
Madame Marie S. Curie

container types

Here are a few of the most useful container types and their equivalents in R:

Python R
list vector (all same type) or list (multiple types)
dictionary named list
DataFrame (pandas) data.frame (base) or tibble (tidyverse)

lists

Python lists are most similar to lists in R. Python lists can also be compared to vectors in R, but unlike vectors, lists can contain objects of different types.

Python
fruit = ['apple','banana','grape']
my_list = ['pizza', 'fries', 3, -2.5, 'donuts']
fruit + my_list
['apple', 'banana', 'grape', 'pizza', 'fries', 3, -2.5, 'donuts']
fruit.append('mango')
fruit
['apple', 'banana', 'grape', 'mango']
Indexing

Indexing is 0-based in Python. You can index lists and strings.

Python
numbers = [1,2,3,4]
len(numbers) # length
4
numbers[0] # first item
1
numbers[3] # last item
4
numbers[-1] # last item
4

You might find 0-based indexing unintuitive if you’re used to working in languages like R that use 1-based indexing, but this is “not mad, just differently sane!”2

Here’s an illustration that shows how 0-based & 1-based indexing systems compare in the context of DNA coordinates3:

DNA indexing comparison

Python
dna = "TACGTCA"
print(dna[0])
T
print(dna[6])
A

We can try to write equivalent code in R…

R
dna <- "TACGTCA"
print(dna[1])
[1] "TACGTCA"
print(dna[7])
[1] NA

But this doesn’t work! In the above R code, dna is a vector of length 1 with “TACGTCA” as the item at index 1. You can’t grab individual characters out of it. We have to break up the string into a vector with every character as a separate item in R.

R
dna <- c('T', 'A', 'C', 'G', 'T', 'C', 'A')
print(dna[[1]])
[1] "T"
print(dna[[7]])
[1] "A"

dictionaries

Dictionaries consist of key-value pairs where you use the key to access the value. Keys can be strings, integers, floats, or booleans. Values can be anything!

Python
foods = {'vegetable': ['carrot', 'eggplant'], 
         'fruit': 'cherry'}
foods['fruit']
'cherry'
foods['vegetable']
['carrot', 'eggplant']
foods['vegetable'][0]
'carrot'

Dictionaries can be nested:

Python
hyperparams = {'glmnet': {'alpha': [0, 0.5, 1],
                          'lambda': [0.01, 0.1, 0, 1]
                          },
                'rf': {'mtry': [128, 256, 512]}
               }
hyperparams['glmnet']['alpha'][1]
0.5

Get a list of keys in the dictionary with:

Python
hyperparams.keys()
dict_keys(['glmnet', 'rf'])
Snakemake config

Say you have a Snakemake workflow for training machine learning models. You want to be able to specify different datasets, ML methods, and random seeds, so you use f-Strings to fill in these variables:

Snakefile
dataset = "OTU"
method = "rf"
seed = 1001

rule train_ml_model:
    input:
        R="workflow/scripts/train_ml.R",
        csv=f"data/{dataset}.csv"
    output:
        model=f"results/{dataset}/runs/{method}_{seed}_model.Rds"
    script:
        "scripts/train_ml.R"

You can improve this by creating a configuration file in YAML format:

config/otu.yml
dataset: 'OTU'
method: 'rf'
seed: 1001

Then specify your config file at the top of your Snakefile. Snakemake parses your YAML config files to a dictionary called config.

Snakefile
configfile: "config/otu.yml"

dataset = config["dataset"]
method = config['rf']
seed = config['seed']

rule train_ml_model:
    input:
        R="workflow/scripts/train_ml.R",
        csv=f"data/{dataset}.csv"
    output:
        model=f"results/{dataset}/runs/{method}_{seed}_model.Rds"
    script:
        "scripts/train_ml.R"

When you run this workflow, it uses config/otu.yml by default:

sh
snakemake -n
rule train_ml_model:
    input: workflow/scripts/train_ml.R, data/OTU.csv
    output: results/OTU/runs/rf_1001_model.Rds
    jobid: 0
    reason: Missing output files: results/OTU/runs/rf_1001_model.Rds
    resources: tmpdir=/var/folders/9n/glrhxtfx453gl68sc1gzq0mc0000gr/T

Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
train_ml_model        1              1              1
total                 1              1              1

Now you can change the values of these variables by editing the config file, or by creating multiple config files and specifying them on the command line:

config/genus.yml
dataset: 'genus'
method: 'glmnet'
seed: 1001
sh
snakemake -n --configfile config/genus.yml
rule train_ml_model:
    input: workflow/scripts/train_ml.R, data/genus.csv
    output: results/genus/runs/glmnet_1001_model.Rds
    jobid: 0
    reason: Missing output files: results/genus/runs/glmnet_1001_model.Rds
    resources: tmpdir=/var/folders/9n/glrhxtfx453gl68sc1gzq0mc0000gr/T

Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
train_ml_model        1              1              1
total                 1              1              1

See this example in context here.

conditionals

Think of these as functions that return a boolean.

operator meaning same in R
== equal
!= not equal
< less than
> greater than
<= less than or equal to
>= greater than or equal to
or or (inclusive) |
and and &
in in %in%
not not !
3 > 2
True
True or False
True

if-else statements

No curly braces here! But indentation does matter.

Python
if len(my_list) < 10:
    list_description = "This list is short"
elif len(my_list) < 15:
    list_description = "This list is medium length"
else:
    list_description = "This list is long!"
print(list_description)
This list is short

inline if-else statement

When you create a variable with an if-else statement, you can do it all on one line:

list_description = "short" if len(my_list) < 10 else "not short"

This is equivalent to:

if len(my_list) < 10:
    list_description = "short"
else:
    list_description = "not short"

You can’t make it a one-liner if you need to use elif.

nested if-else blocks

Python
num = 1
if num >= 0:
    if num == 0:
        print("zero")
    else:
        print("positive")
else:
    print("negative")
positive

for loops

If a variable has a length (i.e. you can run len(myvar)), then you can iterate over it in a for loop.

Python
postdocs = ['Allison Mason', 'Courtney Armour', 'Sarah Lucas']
for person in postdocs:
    print(f"{person}, PhD")
Allison Mason, PhD
Courtney Armour, PhD
Sarah Lucas, PhD

range

Use range() to loop over a range of numbers. If you give range() two arguments, range starts counting from the first argument and stops just before the second argument, stepping by 1. In other words, the stop argument is exclusive.

Python
num_bottles = 3
for i in range(1, num_bottles+1):
    print(i, " bottles of pop on the wall")
1  bottles of pop on the wall
2  bottles of pop on the wall
3  bottles of pop on the wall

If you give range just one argument, it starts counting from zero and uses the argument as the stop value:

Python
num_bottles = 3
for i in range(num_bottles+1):
    print(i, " bottles of pop on the wall")
0  bottles of pop on the wall
1  bottles of pop on the wall
2  bottles of pop on the wall
3  bottles of pop on the wall

You can also use range() to create lists:

Python
mylist = list(range(3))
mylist
[0, 1, 2]

You can use range with three arguments: start, stop, and step:

Python
list(range(2, 11, 2))
[2, 4, 6, 8, 10]

Give the step argument a negative value and make start greater than stop to move backwards:

Python
list(range(4, -7, -2))
[4, 2, 0, -2, -4, -6]

You might think it’s weird that range stops just before the stop value. But it works nicely when you give it the length of the list – now you have the list indices!

Python
numbers = [1,2,3,4]
for i in range(len(numbers)):
    print(f"index {i} is {numbers[i]}")
index 0 is 1
index 1 is 2
index 2 is 3
index 3 is 4

iterate over dictionaries

Python
problems = {1: "naming things", 2: "understanding binary", 3: "off-by-one errors"}
print('There are only 10 hard problems in computer science:')
There are only 10 hard problems in computer science:
for num in problems:
    print(f"\t {num} {problems[num]}")
     1 naming things
     2 understanding binary
     3 off-by-one errors

list comprehensions

When you want to make a for loop that creates a new list, dictionary, or set, use a comprehension. I’m only going to cover list comprehensions below, but you can learn about dictionary and set comprehensions here.

To build a list with a for loop, you can do this:

Python
squares = []
for i in range(4):
    squares.append(i**2)
squares
[0, 1, 4, 9]

but a list comprehension is much sleeker and performs faster:

Python
squares = [i**2 for i in range(4)]

This is roughly equivalent to using *apply functions in R:

R
sapply(0:3, function(x) {x**2})
[1] 0 1 4 9

You can continue nesting for x in y statements to build complex lists like this:

Python
[f"{x}{y}{z}" for x in range(1,3) for y in range(2,4) for z in range(3,5)]
['123', '124', '133', '134', '223', '224', '233', '234']

Snakemake expand

Let’s say you have a Snakemake workflow with a rule that should run multiple times with different parameters. For example, you may want to train machine learning models with different ML methods, datasets, and random seeds. You define these as wildcards in a rule to train each ML model:

Snakefile
rule train_ml_model:
    input:
        R="workflow/scripts/train_ml.R",
        csv="data/{dataset}.csv"
    output:
        model="results/{dataset}/runs/{method}_{seed}_model.Rds"
    script:
        "scripts/train_ml.R"

Notice that we’re not using f-Strings within the rule because we want Snakemake to recognize these as wildcards and fill in different values for them with each run of the train_ml_model rule.

To tell Snakemake what values to assign for these wildcards, you can create lists with the values you want and write a target rule at the top of your Snakefile.

Here, you could use an f-String and a list comprehension to create the list of target files.

Python
datasets = ['OTU', 'genus']
methods = ['rf', 'glmnet', 'svmRadial']
seeds = range(1000, 1005)
[f"results/{dataset}/runs/{method}_{seed}_model.Rds" for dataset in datasets for method in methods for seed in seeds]
['results/OTU/runs/rf_1000_model.Rds', 'results/OTU/runs/rf_1001_model.Rds', 'results/OTU/runs/rf_1002_model.Rds', 'results/OTU/runs/rf_1003_model.Rds', 'results/OTU/runs/rf_1004_model.Rds', 'results/OTU/runs/glmnet_1000_model.Rds', 'results/OTU/runs/glmnet_1001_model.Rds', 'results/OTU/runs/glmnet_1002_model.Rds', 'results/OTU/runs/glmnet_1003_model.Rds', 'results/OTU/runs/glmnet_1004_model.Rds', 'results/OTU/runs/svmRadial_1000_model.Rds', 'results/OTU/runs/svmRadial_1001_model.Rds', 'results/OTU/runs/svmRadial_1002_model.Rds', 'results/OTU/runs/svmRadial_1003_model.Rds', 'results/OTU/runs/svmRadial_1004_model.Rds', 'results/genus/runs/rf_1000_model.Rds', 'results/genus/runs/rf_1001_model.Rds', 'results/genus/runs/rf_1002_model.Rds', 'results/genus/runs/rf_1003_model.Rds', 'results/genus/runs/rf_1004_model.Rds', 'results/genus/runs/glmnet_1000_model.Rds', 'results/genus/runs/glmnet_1001_model.Rds', 'results/genus/runs/glmnet_1002_model.Rds', 'results/genus/runs/glmnet_1003_model.Rds', 'results/genus/runs/glmnet_1004_model.Rds', 'results/genus/runs/svmRadial_1000_model.Rds', 'results/genus/runs/svmRadial_1001_model.Rds', 'results/genus/runs/svmRadial_1002_model.Rds', 'results/genus/runs/svmRadial_1003_model.Rds', 'results/genus/runs/svmRadial_1004_model.Rds']

Here’s what your Snakefile looks like now.

Snakefile
datasets = ['OTU', 'genus']
methods = ['rf', 'glmnet', 'svmRadial']
seeds = range(1000, 1005)

rule targets:
    input:
        [f"results/{dataset}/runs/{method}_{seed}_model.Rds" for dataset in datasets for method in methods for seed in seeds]

rule train_ml_model:
    input:
        R="workflow/scripts/train_ml.R",
        csv=f"data/{dataset}.csv"
    output:
        model="results/{dataset}/runs/{method}_{seed}_model.Rds"
    script:
        "scripts/train_ml.R"

The list comprehension gets harder to read the more for statements you add to it. So Snakemake provides a function expand() to clean this up:

Snakefile
datasets = ['OTU', 'genus']
methods = ['rf', 'glmnet', 'svmRadial']
seeds = range(1000, 1005)

rule targets:
    input:
        expand("results/{dataset}/runs/{method}_{seed}_model.Rds", 
                dataset = datasets, method = methods, seed = seeds)

rule train_ml_model:
    input:
        R="workflow/scripts/train_ml.R"
    output:
        model="results/{dataset}/runs/{method}_{seed}_model.Rds"
    script:
        "scripts/train_ml.R"

Using expand() creates the exact same list as the list comprehension, but it’s much easier to read and add variables to.

functions

You can define your own functions in R:

R
square <- function(x) {
    return(x^2)
}

square(2)
[1] 4

and in Python:

Python
def square(x):
    return x**2

square(2)
4

anonymous functions

Sometimes you want to write a simple function that you only use once. They’re so inconsequential you don’t even want to give them a name.

We’ve already used one in R4 inside sapply() above:

R
function(x) { x**2 }
function(x) { x**2 }

In Python you use the lambda keyword:

Python
lambda x: x**2
<function <lambda> at 0x10ce49940>

Before the colon, list the arguments of the function. After the colon, compute the value that the function should return.

lambda in Snakemake

When writing Snakemake workflows, lambda functions are useful for defining input files based on the wildcards in output files. Consider a workflow where you have several rules that plot figures for a manuscript. When you initially conduct the analysis, you don’t know how the figures will be ordered in the manuscript. Once you begin drafting the manuscript, you decide that your diversity plot will be figure 1 and your error rates plot will be figure 2. You also decide to convert the figures to a different format, so the conversion step seems like a good opportunity to rename the figures. Initially you write this workflow:

Snakefile
rule targets:
    input:
        "paper/paper.pdf"

rule convert_figure_1:
    input:
        tiff='figures/diversity.tiff'
    output:
        png="paper/figure_1.png"
    shell:
        """
        convert {input.tiff} {output.png}
        """

rule convert_figure_2:
    input:
        tiff='figures/error_rates.tiff'
    output:
        png="paper/figure_2.png"
    shell:
        """
        convert {input.tiff} {output.png}
        """

rule render_paper:
    input:
        Rmd="paper/paper.Rmd",
        R="workflow/scripts/render_rmd.R",
        figures=['paper/figure_1.png', 'paper/figure_2.png']
    output:
        pdf="paper/paper.pdf"
    script:
        "scripts/render_rmd.R"

The rules convert_figure_1 and convert_figure_2 are a bit repetitive; they only differ by the input filenames and the figure numbers in the output filenames. Maybe this isn’t so bad with only two figures, but you might actually have 5-10 figures for a full scientific paper. We can reduce the repetitive code with a few tricks:

  1. Create a dictionary figures_dict that maps the figure numbers to the descriptive figure file names.
  2. Use a single rule to convert figures called convert_tiff_to_png, using a lambda function to get the input figure filenames based on the final figure numbers.
Snakefile
figures_dict = {'1': 'diversity', '2': 'error_rates'}

rule targets:
    input:
        "paper/paper.pdf"

rule convert_tiff_to_png:
    input:
        tiff=lambda wildcards: f"figures/{figures_dict[wildcards.fig_num]}.tiff"
    output:
        png="paper/figure_{fig_num}.png"
    shell:
        """
        convert {input.tiff} {output.png}
        """

rule render_paper:
    input:
        Rmd="paper/paper.Rmd",
        R="workflow/scripts/render_rmd.R",
        figures=['paper/figure_1.png', 'paper/figure_2.png']
    output:
        pdf="paper/paper.pdf"
    script:
        "scripts/render_rmd.R"

This lambda function is equivalent to:

def get_fig_name_from_num(wildcards):
    return figures_dict[wildcards.fig_num]

This works because Snakemake allows you to define a function that takes the rule’s wildcards and returns a list of input filenames, rather than literally listing the input filenames as before. This greatly reduces the repetitiveness of the code and makes it easier to maintain.

We can improve this Snakefile even further by replacing the list of figures in render_paper with a call to expand():

Snakefile
figures_dict = {'1': 'diversity', '2': 'error_rates'}

rule targets:
    input:
        "paper/paper.pdf"

rule convert_tiff_to_png:
    input:
        tiff=lambda wildcards: f"figures/{figures_dict[wildcards.fig_num]}.tiff"
    output:
        png="paper/figure_{fig_num}.png"
    shell:
        """
        convert {input.tiff} {output.png}
        """

rule render_paper:
    input:
        Rmd="paper/paper.Rmd",
        R="workflow/scripts/render_rmd.R",
        figures=expand(rules.convert_tiff_to_png.output.png, 
                       fig_num = figures_dict.keys())
    output:
        pdf="paper/paper.pdf"
    script:
        "scripts/render_rmd.R"

You can take a look at this full example in context here. Also see the Snakemake docs for more about input functions and params functions.

recap

Snakemake concepts covered

  • Use f-Strings for human-readable string formatting.
  • Config files are imported as dictionaries and allow you to change workflow parameters without modifying code.
  • Snakemake’s expand() function is a readable way to build lists just like list comprehensions.
  • Lambda functions help define Snakemake input files based on wildcards. You can also use them to define params based on wildcards and/or output files.

resources

Footnotes

  1. The Python logo by the Python Software Foundation is licensed under GNU GPL v2. The Snakemake logo by Johannes Köster is licensed under CC BY-SA 4.0. The R logo by Hadley Wickham and others at RStudio is licensed under CC BY-SA 4.0.↩︎

  2. Greg Wilson quoting Terry Pratchett in The Tidynomicon: https://tidynomicon.github.io/tidynomicon/↩︎

  3. from biostars: https://www.biostars.org/p/84686/↩︎

  4. Read more about anonymous functions in Hadley Wickham’s book Advanced R: http://adv-r.had.co.nz/Functional-programming.html#anonymous-functions↩︎