Automated PCR Primer Design
Last month I ran my very first PCR, attempting (and eventually failing) to amplify a region flanking a specific SNP. This meant designing a set of PCR primers, a process that turned out to include a surprising amount of gene-sequence copy-and-paste and waiting for NCBI’s servers to run my job. Being a compulsive tool builder, I had to streamline this process, and see if it could all be done locally on my own computer.
Designing PCR primers
When designing PCR primer pairs, several properties should be satisfied to increase the likelihood of successfully amplifying the region of interest. Primers cannot be too short or too long. GC content should not be too low, but definitely not too high. The melting temperature of both primers should be nearly identical. The Primer3 program considers all of theese factors. You feed it a sequence including your region of interest, and it will spit out multiple primer pairs for you too choose from. To ensure that your chosen primer pair does not inadvertently align with another region, you can run a BLAST search against whatever genome you are working with.
There are multiple web-based tools that integrate Primer3 and BLAST, e.g. NCBI’s Primer-BLAST, but I found them slightly frustrating to work with. First, and this was specific to my case, where I was interested in amplifying a specific SNP, there was no way to say “design primers that include this SNP”. Instead, you had to manually find the sequence (or accession number), and figure out the position of the SNP. Second, the primer design and BLAST search runs as a job on NCBI’s server. Sometimes it is moderately fast, sometimes it is slow. This makes tweaking parameters cumbersome, since the feedback loop is prohibitively slow.
Automating primer design
The question then is: can we run Primer3 and BLAST locally, and will it be fast enough? As it turns out, the answers are yes and yes. The process involves:
- Ask the user for an SNP RS number.
- Download the SNP metadata.
- Download the sequence flanking the SNP.
- Run Primer3 against the sequence.
- Run BLAST against the primer pairs.
The Biopython package can download SNP metadata and sequences from NCBI’s databases:
# SNP RS number
= "429353"
snp_id
# SNP metadata
= Entrez.esummary(db="snp", id=snp_id, retmode="xml")
handle = Entrez.read(handle)
record
# Accession number
= record["DocumentSummarySet"]["DocumentSummary"][0]["ACC"]
acc
# Chromosome position
= int(
chrpos "DocumentSummarySet"]["DocumentSummary"][0]["CHRPOS_SORT"]
record[
)
# Sequence flanking (+/- 1000 bp) the SNP
= Entrez.efetch(
handle ="nucleotide",
dbid=acc,
=chrpos - 1000,
seq_start=chrpos + 1000,
seq_stop="fasta",
rettype="text",
retmode
)= SeqIO.read(handle, "fasta") seq_record
Next we run Primer3 from Python using primer3-py:
= primer3.bindings.design_primers(
primers ={
seq_args"SEQUENCE_ID": "PRIMER",
"SEQUENCE_TEMPLATE": str(seq_record.seq),
"SEQUENCE_TARGET": [1000, 1],
},={
global_args"PRIMER_OPT_SIZE": 20,
"PRIMER_MIN_SIZE": 18,
"PRIMER_MAX_SIZE": 25,
"PRIMER_OPT_TM": 60.0,
"PRIMER_MIN_TM": 57.0,
"PRIMER_MAX_TM": 63.0,
"PRIMER_MIN_GC": 20.0,
"PRIMER_MAX_GC": 80.0,
"PRIMER_MAX_POLY_X": 100,
"PRIMER_PRODUCT_SIZE_RANGE": [250, 500]
}, )
To run BLAST, we need to install the BLAST command-line tools, then download the human reference genome:
update_blastdb.pl --decompress human_genome
This takes a while, and will take up ~ 1.5G of disk space. The blastn
command-line tool reads sequences from a FASTA file, and writes results to an XML file. From Python we can write the primer pairs to a FASTA file:
= []
primer_seqs
for i in range(5):
= SeqRecord(
left_seq "PRIMER_LEFT"][i]["SEQUENCE"]),
Seq(primers[id=f"Primer Pair {i + 1} Left",
)= SeqRecord(
right_seq "PRIMER_RIGHT"][i]["SEQUENCE"]),
Seq(primers[id=f"Primer Pair {i + 1} Right",
)
primer_seqs.append(left_seq)
primer_seqs.append(right_seq)
"primer_pairs.fasta", "fasta") SeqIO.write(primer_seqs,
Then, run blastn
:
blastn \
-query primer_seqs.fasta \
-db GCF_000001405.39_top_level \
-outfmt 5 \
-out blast_results.xml \
-task blastn-short \
-perc_identity 95 \
-strand both \
-max_target_seqs 10 \
-num_threads 4
On my MacBook Air M2 this takes approximately 5 seconds, much faster than the several minutes it may take for NCBI to run a BLAST search. The XML file written by blastn
can be parsed with Biopython:
= Blast.parse(blast_results_file) records
Each individual primer corresponds to a single query. We can inspect the result of the first query:
>>> records[0]
=20)
Query: Query_1 (length1 Left
Primer Pair ---- ----- ----------------------------------------------------------
Hits: # # HSP ID + description
---- ----- ----------------------------------------------------------
0 56 gi|568815579|ref|NC_000019.10| Homo sapiens chromosome...
1 77 gi|568815581|ref|NC_000017.11| Homo sapiens chromosome...
2 52 gi|568815578|ref|NC_000020.11| Homo sapiens chromosome...
3 139 gi|568815587|ref|NC_000011.10| Homo sapiens chromosome...
4 190 gi|568815593|ref|NC_000005.10| Homo sapiens chromosome...
5 99 gi|568815584|ref|NC_000014.9| Homo sapiens chromosome ...
6 79 gi|568815580|ref|NC_000018.10| Homo sapiens chromosome...
7 38 gi|568815576|ref|NC_000022.11| Homo sapiens chromosome...
8 3 gi|568815531|ref|NT_187651.1| Homo sapiens chromosome ...
9 254 gi|568815596|ref|NC_000002.12| Homo sapiens chromosome...
We can take a look at the first hit:
>>> records[0][0]
Query: Query_11 Left
Primer Pair |568815579|ref|NC_000019.10| (length=58617616)
Hit: gi19, GRCh38.p13 Primary Assembly
Homo sapiens chromosome ---- -------- --------- ------ --------------- ---------------------
HSPs: # E-value Bit score Span Query range Hit range
---- -------- --------- ------ --------------- ---------------------
0 0.0081 40.14 20 [0:20] [44908581:44908601]
1 7.8 30.23 15 [3:18] [51029314:51029299]
2 31 28.25 14 [2:16] [9296443:9296457]
3 31 28.25 14 [3:17] [31178323:31178309]
4 1.2e+02 26.26 13 [0:13] [29087873:29087886]
5 1.2e+02 26.26 13 [2:15] [34134193:34134206]
6 1.2e+02 26.26 13 [0:13] [35972210:35972223]
7 1.2e+02 26.26 13 [0:13] [51178062:51178075]
8 1.2e+02 26.26 13 [3:16] [56957621:56957634]
9 1.2e+02 26.26 13 [4:17] [57327775:57327788]
10 1.2e+02 26.26 13 [5:18] [21716673:21716660]
The hit is on chromosome 19, as expected, since the SNP is in the APOE gene, which is located on chromosome 19.
Putting it all together
This is all very well, but you might object, that running all of this in a Python console is not that much better than using the online web interface, and you would be absolutely right. Luckily, we can put it all in a Marimo notebook. Marimo notebooks can run in an app mode, making them perfect for small, purpose-built apps such as this:
The notebook is available on GitHub.
Addendum
This was my first foray into biology-related programming. None of this was terribly difficult to do, but I was surprised at the state of the ecosystem. I had expected an array of well-integrated, coherent packages, but instead I ran into a number of somewhat abandoned libraries, and a lot of duct tape. The scverse packages for single-cell omics look (at least on the surface) like a move in the right direction. Is there any work like this for the broader ecosystem, or am I simply missing something obvious as a newcomer? If you have any insight into this, I would love to hear from you.