C++: FASTA Format

In bioinformatics, long character strings are often encoded in a format called FASTA. A FASTA file can contain several strings, each introduced by a header line that begins with a “>” character followed by the string's name.

Write a program that reads a FASTA file such as:

>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED

And prints the following output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Note that a high-quality implementation will not hold the entire file in memory at once; real FASTA files can be multiple gigabytes in size.

#include <fstream>
#include <iostream>
#include <string>

int main( int argc, char **argv ){
	if( argc <= 1 ){
		std::cerr << "Usage: "<<argv[0]<<" [infile]" << std::endl;
		return 1; // Non-zero exit status signals failure
	}

	std::ifstream input(argv[1]);
	if( !input ){ // The file could not be opened
		std::cerr << "Error opening '"<<argv[1]<<"'. Bailing out." << std::endl;
		return 1;
	}

	std::string line, name, content;
	// Test the stream itself rather than .good(): a final line without a
	// trailing newline sets eofbit but must still be processed.
	while( std::getline( input, line ) ){
		if( line.empty() || line[0] == '>' ){ // A blank line or an identifier marker ends the current entry
			if( !name.empty() ){ // Print out what we read from the last entry
				std::cout << name << ": " << content << std::endl;
				name.clear();
			}
			if( !line.empty() ){
				name = line.substr(1); // Drop the leading '>'
			}
			content.clear();
		} else if( !name.empty() ){
			if( line.find(' ') != std::string::npos ){ // Invalid sequence (no spaces allowed): discard this entry
				name.clear();
				content.clear();
			} else {
				content += line;
			}
		}
	}
	if( !name.empty() ){ // Print out what we read from the last entry
		std::cout << name << ": " << content << std::endl;
	}

	return 0;
}

Output:
Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
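
The program above never loads the whole file, but it does buffer one complete sequence in content before printing it, which can still be large for a multi-gigabyte record. A minimal sketch of a stricter streaming variant follows, in line with the note about file size in the task description. The function names stream_fasta and fasta_to_text are hypothetical and not part of the original solution. The sketch also assumes the input is well formed: because each sequence line is written as soon as it is read, it cannot discard an entry that later turns out to contain a space, and it skips blank lines instead of treating them as the end of an entry.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original program): copies FASTA records
// from `in` to `out` as "name: sequence". Each sequence line is written as
// soon as it is read, so only one input line is held in memory at a time.
// Assumption: the input is well formed; no validation is performed.
void stream_fasta(std::istream& in, std::ostream& out) {
    std::string line;
    bool in_record = false;  // true once a header has been written
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            if (in_record) {
                out << '\n';  // finish the previous record's output line
            }
            out << line.substr(1) << ": ";  // name without the leading '>'
            in_record = true;
        } else if (in_record) {
            out << line;  // append this piece of the sequence immediately
        }
        // Lines that appear before the first header are ignored.
    }
    if (in_record) {
        out << '\n';  // finish the last record
    }
}

// Hypothetical convenience wrapper for small inputs and quick checks only:
// it holds both the input and the output in memory.
std::string fasta_to_text(const std::string& fasta) {
    std::istringstream in(fasta);
    std::ostringstream out;
    stream_fasta(in, out);
    return out.str();
}
```

To use it on a file, open a std::ifstream as in the program above and call stream_fasta(input, std::cout). The trade-off is that validation has to happen before output or be dropped, since text already written cannot be taken back.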

Content is available under GNU Free Documentation License 1.2.