regex - Finding, and then efficiently replacing with the reverse in Java


I'm working in Java with a large .txt file database of proteins. The proteins have a general structure, but not one uniform enough to hard-code "take this from startIndex to endIndex, reverse, and replace". The only true uniformity is that they are delimited by >, e.g.:

...WERINWETI>gi|230498 [Bovine Albumin]ADFIJWOENAONFOAIDNFKLSADNFATHISDATFDAIFJ>sp|234235 (human) AGP1 QWIQWONOQWNROIWQRNOQWIRNSWELLE>gi|... , and so on.

As you can see, although the actual protein sequences (the long chains of capitals) are uniform in that they are chains of capitals, the preceding description can be pretty much anything (a lot of the time there is no space between the description and the sequence). My program needs to copy the original text to a new file, then go through it, add an r- after each > (e.g. ...EERFDS>r-gi|23423...) and reverse only the chain of capitals. After that process is complete, I need to append the result to the end of the original text.

I have completed the r- function, and have completed the reversal and appending as well, but it's not efficient enough. The databases receiving this treatment are massive, and my program takes too long. In fact I have no idea how long it takes, because I have never let it finish. I waited 1 hour and ended it. Here is my algorithm for the reversal using regex (the built-in Pattern class), which is the computationally intensive part:

    // Split the r-dashed text into one string per protein record.
    Pattern regexsplit = Pattern.compile(">");
    String[] splits = regexsplit.split(rdash.toString());
    StringBuilder rdashedited = new StringBuilder();
    // A protein sequence is a run of at least five capital letters.
    Pattern regexprotein = Pattern.compile("[A-Z]{5,}");
    for (int splitindex = 1; splitindex < splits.length; splitindex++) {
        Matcher rdashmatcher = regexprotein.matcher(splits[splitindex]);
        rdashmatcher.find();
        // Reverse the first match and substitute it back into the record.
        StringBuffer reverser = new StringBuffer(rdashmatcher.group());
        rdashedited.append(rdashmatcher.replaceAll(reverser.reverse().toString()) + ">");
    }
    System.out.println(">" + rdashedited);

So, I split rdash (a StringBuilder that contains all the original proteins with the >r- put in, but which hasn't gone through the reversal yet) into individual proteins and add them to a String array. Then I go through each string in the array, look for chains of capital letters at least 5 letters long, add the match to a StringBuffer, reverse it, and replace the forward version with the reverse. Note that this algorithm works as intended for smaller text files.

Is there a more powerful regex that would eliminate the need for splitting and traversing the array? When I tried, the replaceAll() call replaced all the downstream proteins with the reverse of the first protein in the set. I checked, for fun, with System.out.println(rdashmatcher.groupCount()) and it printed 0 for each of the proteins in the set. Can anyone help me with a more efficient/powerful regex? It's a fairly new concept to me, but it reminds me of vectorizing in MATLAB (only with letters).
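Both observations can be reproduced in a small sketch, assuming the same pattern as in the question (the class and method names here are hypothetical). groupCount() reports the number of capturing groups in the pattern, not the number of matches, and "[A-Z]{5,}" contains no parentheses. replaceAll() substitutes one fixed string for every match, so reversing only the first match overwrites all the others.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Same pattern as in the question: no parentheses, so no capturing groups.
    static final Pattern PROTEIN = Pattern.compile("[A-Z]{5,}");

    // Mimics the failing attempt: reverse only the first match,
    // then hand that single string to replaceAll().
    static String replaceAllWithFirstReversed(String text) {
        Matcher m = PROTEIN.matcher(text);
        if (!m.find()) {
            return text; // nothing to reverse
        }
        String reversed = new StringBuilder(m.group()).reverse().toString();
        return m.replaceAll(reversed); // every match gets the same replacement
    }

    public static void main(String[] args) {
        String text = ">gi|1 ABCDEF>sp|2 UVWXYZ";
        System.out.println(PROTEIN.matcher(text).groupCount());  // prints 0
        System.out.println(replaceAllWithFirstReversed(text));   // prints >gi|1 FEDCBA>sp|2 FEDCBA
    }
}
```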

I threw 10,000,000 records (which came to ~379 MB of text files) at this and it took 1:06 minutes (4-core Athlon, a few years old).

The big if tree handles the ends, where it only gets half of an element because the delimiter is in the middle of the element.

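    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public void readproteins(BufferedReader br, BufferedWriter bw) throws IOException {
        Pattern regexsplit = Pattern.compile(">");
        // Group 1: the description; group 2: the sequence (five or more capitals).
        Pattern proteinpattern = Pattern.compile("(.*?)([A-Z]{5,})");
        Matcher m;
        // Stream the input one '>'-delimited element at a time.
        Scanner s = new Scanner(br);
        s.useDelimiter(regexsplit);

        while (s.hasNext()) {
            StringBuffer sb = new StringBuffer();
            String protein = s.next();
            m = proteinpattern.matcher(protein);
            if (m.find())
                // Reversed sequence, then ">r-" for the next element, then the description in front.
                sb.append(m.group(2)).reverse().append(">r-").insert(0, m.group(1));
            else
                sb.append(protein);
            // Assumed: the statement here is incomplete in the source; writing sb is the evident intent.
            bw.write(sb.toString());
        }
        bw.flush();
        bw.close();
    }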
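For text that is already held in memory (as rdash is in the question), the split can also be avoided entirely. The following is a minimal sketch under that assumption, not the code that produced the timing above: Matcher.appendReplacement() and Matcher.appendTail() allow a different replacement for each match, so the r- tag and the reversal can be done in one pass. The names SinglePassReverser, RECORD and tagAndReverse are hypothetical. Records that contain no run of five or more capitals, and any text before the first >, are left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinglePassReverser {
    // '>' delimiter, a lazily matched description (group 1),
    // then the first run of five or more capitals (group 2).
    private static final Pattern RECORD = Pattern.compile(">([^>]*?)([A-Z]{5,})");

    static String tagAndReverse(String db) {
        Matcher m = RECORD.matcher(db);
        // StringBuffer is used because the StringBuilder overloads only exist from Java 9.
        StringBuffer out = new StringBuffer(db.length() + 64);
        while (m.find()) {
            String reversed = new StringBuilder(m.group(2)).reverse().toString();
            String replacement = ">r-" + m.group(1) + reversed;
            // quoteReplacement keeps any '$' or '\' in a description literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(tagAndReverse(">gi|1 [Bovine Albumin]ADFIJW>sp|2 (human) AGP1 QWIQWO"));
    }
}
```

As with the pattern in the question, a description that itself contains five or more consecutive capitals would be mistaken for the sequence. For inputs too large to hold in memory, a streaming approach such as the Scanner loop is the safer choice.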
