Another Bioinformatics website

List of helpful Linux commands to process FASTQ files from NGS experiments

ngs_analysisHere I’ll summarize some Linux commands that can help us to work with millions of DNA sequences from New Generation Sequencing (NGS).

A file storing biological sequences with extension ‘.fastq’ or ‘.fq’ is a file in FASTQ format, if it is also compressed with GZIP  the suffix will be ‘.fastq.gz’ or ‘.fq.gz’. A FASTQ file usually contain millions of sequences and takes up dozens of Gigabytes in a disk. When these files are compressed with GZIP their sizes are reduced in more than 10 times (ZIP format is less efficient).

In the next lines I’ll show you some commands to deal with compressed FASTQ files, with minor changes they also can be used with uncompressed ones and FASTA format files.

To start, let’s compress a FASTQ file in GZIP format:

> gzip reads.fq

The resulting file will be named ‘reads.fq.gz’ by default.

If we want to check the contents of the file we can use the command ‘less’ or ‘zless’:

> less reads.fq.gz
> zless reads.fq.gz

And to count the number of sequences stored into the file we can count the number of lines and divide by 4:

> zcat reads.fq.gz | echo $((`wc -l`/4))
  256678360

If the file is in FASTA format, we will count the number of sequences like this:

> grep -c "^>" reads.fa

Also we can count how many times appear an specific subsequence:

> zgrep -c 'ATGATGATG' reads.fq.gz
    398065    
> zcat reads.fq.gz | awk '/ATGATGATG/ {nlines = nlines + 1} END {print nlines}'
    398065

Or extract examples of sequences that contain the subsequence:

> zcat reads.fq.gz | head -n 20000 | grep  --no-group-separator -B1 -A2 ATGATGATG

For FASTA files:

> zcat reads.fa.gz | head -n 10000 | grep  --no-group-separator -B1 ATGATGATG

Or extract  sequences that contain some pattern in the header:

> zcat reads.fa.gz | grep  --no-group-separator -B1 -A2 'OS=Homo sapiens'

Sometimes we will be interested in extracting a small part of the big file to use it for testing our processing methods, ex. the first 1000 sequences (4000 lines):

> zcat reads.fq.gz | head -4000 > test_reads.fq
> zcat reads.fq.gz | head -4000 | gzip > test_reads.fq.gz

Or extract a range of lines (1000001-1000004):

   > zcat reads.fq.gz | sed -n '1000001,1000004p;1000005q' > lines.fq

If we want to extract random sequences (1000):

> cat reads.fq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("n");} else { printf("X#&X");} }' | shuf | head -1000 | sed 's/X#&X/n/g' > reads.1000.fq

Or from FASTA file:

> cat reads.fa | awk '{if ((NR%2)==0)print prev"X#&X"$0;prev=$0;}' | shuf | head -1000 | sed 's/X#&X/n/g' > reads.1000.fa

A more complex but sometimes useful Perl one-liner to have statistics about the length of the sequences:

> perl -ne '{ $c++; if(($c-2)%4==0){$lens{length($_)}++} }; END { print "len\tseqs\n"; while(my($key,$value)=each %lens){print "$key\t$value\n"} }' reads.fastq

Or with AWK:

> awk '{ c++; if((c-2)%4==0){ ++a[length()] } } END{ print "len\tseqs"; for (i in a) print i"\t"a[i]}' reads.fastq

We can be interested in knowing how much disk space is using the compressed file:

> gzip --list reads.fq.gz 
   compressed  uncompressed  ratio  uncompressed_name
  18827926034   1431825024  -1215.0%    reads.fq

It seems that GZIP expands instead of compressing, but it is not true, the reason is that the ‘–list’ option doesn’t work correctly with files bigger than 2GB. The solution is to use a slower but more reliable method to know the real size of the compressed file:

 > zcat reads.fq.gz | wc --bytes
      61561367168

Also can be interesting to split the file in several, smaller ones, with for e.x. 1 million of sequences each (4 millions of lines in a FASTQ file):

> zcat reads.fq.gz | split -d -l 4000000 - reads.split    > gzip reads.split*    > rename 's/.split(d+)/.$1.fq/' reads.split*

And later we can join the files again:

> zcat reads.*.fq.gz | gzip > reads.fq.g

My original article in Spanish can be read here:

http://bioinfoperl.blogspot.com/2015/01/algunos-comandos-utiles-linux-ficheros-fastq-ngs.html

Some of these commands and extra ideas can be found here:

http://darrenjw.wordpress.com/2010/11/28/introduction-to-the-processing-of-short-read-next-generation-sequencing-data/

3 Comments

  1. Robert Mukiibi

    great and helpful commands.
    Thank you

  2. circulosmeos

    Maybe `gztool` can be useful (https://github.com/circulosmeos/gztool): it can extract parts of a gzip file without uncompressing it from the start as it is usually the case with any other commands.It does so at the expense of creating an index file (it is created at the same time compression or decompression occurs, so there’s no wasted time) which is ~0.3%/gzip or much less if gztool itself compresses the data.

  3. edoardomagherini

    Hi,
    thank you for the information.
    But what is the difference between .fastq and .fq file?

Leave a Reply to Robert Mukiibi Cancel reply

© 2024 Sixth researcher

Theme by Anders NorenUp ↑