In bioinformatics, long character strings are often encoded in a format called FASTA. A FASTA file can contain several strings, each identified by a name marked by a “>” character at the beginning of the line.
Write a program that reads a FASTA file such as:
>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED
And prints the following output:
Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Note that a high-quality implementation will not hold the entire file in memory at once; real FASTA files can be multiple gigabytes in size.
#include <fstream>
#include <iostream>
#include <string>

int main( int argc, char **argv ){
    if( argc <= 1 ){
        std::cerr << "Usage: " << argv[0] << " [infile]" << std::endl;
        return -1;
    }

    std::ifstream input(argv[1]);
    if( !input.good() ){
        std::cerr << "Error opening '" << argv[1] << "'. Bailing out." << std::endl;
        return -1;
    }

    std::string line, name, content;
    // Test the stream itself rather than .good(), so that a final line
    // without a trailing newline is still processed.
    while( std::getline( input, line ) ){
        if( line.empty() || line[0] == '>' ){ // Identifier marker (or blank line)
            if( !name.empty() ){ // Print out what we read from the last entry
                std::cout << name << ": " << content << std::endl;
                name.clear();
            }
            if( !line.empty() ){
                name = line.substr(1); // Drop the leading '>'
            }
            content.clear();
        } else if( !name.empty() ){
            if( line.find(' ') != std::string::npos ){ // Invalid sequence--no spaces allowed
                name.clear();
                content.clear();
            } else {
                content += line; // Concatenate the lines of a multi-line sequence
            }
        }
    }
    if( !name.empty() ){ // Print out what we read from the last entry
        std::cout << name << ": " << content << std::endl;
    }

    return 0;
}

Output:
Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED