Sara does Bioinformatics: k-Mer Composition

In this problem we are given a DNA string and are to return the counts of all 256 possible 4-mers in the string, with the 4-mers ordered alphabetically.

Sample Dataset
>Rosalind_6431
CTTCGAAAGTTTGGGCCGAGTCTTACAGTCGGTCTTGAAGCAAAGTAACGAACTCCACGGCCCTGACTACCGAACCAGTTGTGAGTACTCAACTGGGTGAGAGTGCAGTCCCTATTGAGTTTCCGAGACTCACCGGGATTTTCGATCCAGCCTCAGTCCAGTCTTGTGGCCAACTCACCAAATGACGTTGGAATATCCCTGTCTAGCTCACGCAGTACTTAGTAAGAGGTCGCTGCAGCGGGGCAAGGAGATCGGAAAATGTGCTCTATATGCGACTAAAGCTCCTAACTTACACGTAGACTTGCCCGTGTTAAAAACTCGGCTCACATGCTGTCTGCGGCTGGCTGTATACAGTATCTA
CCTAATACCCTTCAGTTCGCCGCACAAAAGCTGGGAGTTACCGCGGAAATCACAG

Expected Output
4 1 4 3 0 1 1 5 1 3 1 2 2 1 2 0 1 1 3 1 2 1 3 1 1 1 1 2 2 5 1 3 0 2 2 1 1 1 1 3 1 0 0 1 5 5 1 5 0 2 0 2 1 2 1 1 1 2 0 1 0 0 1 1 3 2 1 0 3 2 3 0 0 2 0 8 0 0 1 0 2 1 3 0 0 0 1 4 3 2 1 1 3 1 2 1 3 1 2 1 2 1 1 1 2 3 2 1 1 0 1 1 3 2 1 2 6 2 1 1 1 2 3 3 3 2 3 0 3 2 1 1 0 0 1 4 3 0 1 5 0 2 0 1 2 1 3 0 1 2 2 1 1 0 3 0 0 4 5 0 3 0 2 1 1 3 0 3 2 2 1 1 0 2 1 0 2 2 1 2 0 2 2 5 2 2 1 1 2 1 2 2 2 2 1 1 3 4 0 2 1 1 0 1 2 2 1 1 1 5 2 0 3 2 1 1 2 2 3 0 3 0 1 3 1 2 3 0 2 1 2 2 1 2 3 0 1 2 3 1 1 3 1 0 1 1 3 0 2 1 2 2 0 2 1 1

Regardless of the given sequence, the possible 4-mers are always the same and we can generate them with the following code:

import itertools

nt = 'ACGT'  # Listed alphabetically so the k-mers come out in the right order without sorting

# itertools.product yields tuples such as ('A', 'A', 'A', 'C'); join each one into a string
kmers = []
for combo in itertools.product(nt, repeat=4):
    kmers.append(''.join(combo))

This gives us a list of all 256 possible 4-mers in alphabetical order. Note that itertools.product emits its results in the order the letters are listed in nt, so if nt is not entered alphabetically you will need to sort kmers.
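To see why the order of nt matters, here is a small check that uses 2-mers to keep the output short. The helper name all_kmers is only for illustration and is not part of the code above:

```python
import itertools

def all_kmers(alphabet, k):
    """Return every k-mer over `alphabet`, in the order itertools.product emits them."""
    return [''.join(p) for p in itertools.product(alphabet, repeat=k)]

sorted_input = all_kmers('ACGT', 2)    # letters given alphabetically
unsorted_input = all_kmers('TGCA', 2)  # same letters, reversed order

print(sorted_input[:5])    # ['AA', 'AC', 'AG', 'AT', 'CA']
print(unsorted_input[:5])  # ['TT', 'TG', 'TC', 'TA', 'GT']

# Alphabetical input is already sorted; the reversed input needs an explicit sort
print(sorted_input == sorted(sorted_input))    # True
print(sorted(unsorted_input) == sorted_input)  # True

print(len(all_kmers('ACGT', 4)))  # 256
```

So with nt = 'ACGT' the list can be used as is, while any other letter order needs a call to sorted() before counting.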

Now we can extract the sequence from the FASTA file and use a regex to find all occurrences of each k-mer. Remember to wrap the pattern in a lookahead, (?=...), so that overlapping k-mers are counted too:

import re
from Bio import SeqIO

record = SeqIO.read('sampledata.fasta', 'fasta')
sequence = str(record.seq)  # Convert the Seq object to a plain string once

A = []
for k in kmers:
    # The lookahead (?=...) is zero-width, so overlapping matches are all found
    pattern = re.compile(r'(?=(' + k + '))')
    occurrence = len(pattern.findall(sequence))
    A.append(occurrence)
print(*A, sep=' ')
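The effect of the lookahead is easiest to see on a toy string. The sketch below uses a made-up sequence (not the sample dataset) and compares a plain pattern with the lookahead version. As a cross-check it also counts with a sliding window and collections.Counter, which is a different technique from the regex approach used in this post; window_counts is a hypothetical helper name:

```python
import re
from collections import Counter

toy = 'AAAAA'  # made-up sequence; 'AA' starts at positions 0, 1, 2 and 3

# A plain pattern consumes the characters it matches, so overlaps are skipped
plain = len(re.findall('AA', toy))

# A lookahead matches at a position without consuming anything
lookahead = len(re.findall('(?=AA)', toy))

def window_counts(seq, k):
    """Count every length-k substring by sliding a window one position at a time."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

print(plain)                        # 2
print(lookahead)                    # 4
print(window_counts(toy, 2)['AA'])  # 4
```

If you prefer the sliding-window version, the final list could be built as [counts[k] for k in kmers] with counts = window_counts(sequence, 4), since a Counter returns 0 for k-mers that never occur.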

Thursday, 4 August 2016
