Regex: finding matches, and then efficiently replacing them with their reverse, in Java
I'm working in Java with a large .txt file database of proteins. The proteins have a general structure, but not one uniform enough to hard-code "take from startIndex to endIndex, reverse, and replace". The only true uniformity is that they are delimited by >, e.g.:
...WERINWETI>gi|230498 [bovine albumin]ADFIJWOENAONFOAIDNFKLSADNFATHISDATFDAIFJ>sp|234235 (human) agp1 QWIQWONOQWNROIWQRNOQWIRNSWELLE>gi|... and so on.
As you can see, although the actual protein sequences (the long chains of capitals) are uniform in that they are chains of capitals, the preceding description can be pretty much anything (a lot of the time there is no space between the description and the sequence). My program needs to copy the original text to a new file, then go through it, add an r- after each > (e.g. ...EERFDS>r-gi|23423...) and reverse only the chain of capitals. After that process is complete, I need to append the result to the end of the original text.
I have completed the r- function, and I have completed the reversal and appending as well, but it is not efficient enough. The databases receiving this treatment are massive, and my program takes too long. In fact, I have no idea how long it takes, because I have never let it finish: I waited one hour and ended it. Here is my algorithm for the reversal using regex (the built-in Pattern class), which is the computationally intensive part:
    Pattern regexSplit = Pattern.compile(">");
    String[] splits = regexSplit.split(rDash.toString());
    StringBuilder rDashEdited = new StringBuilder();
    Pattern regexProtein = Pattern.compile("[A-Z]{5,}");
    // Index 0 is the text before the first '>', so the loop starts at 1.
    for (int splitIndex = 1; splitIndex < splits.length; splitIndex++) {
        Matcher rDashMatcher = regexProtein.matcher(splits[splitIndex]);
        rDashMatcher.find();
        StringBuffer reverser = new StringBuffer(rDashMatcher.group());
        // replaceAll() puts the same reversed string in place of every match in this split.
        rDashEdited.append(rDashMatcher.replaceAll(reverser.reverse().toString()) + ">");
    }
    System.out.println(">" + rDashEdited);

So I split rDash (a StringBuilder that contains all the original proteins with the >r- put in, but which hasn't gone through the reversal yet) into the individual proteins and add them to a String array. Then I go through each string in the array, look for chains of capital letters 5 letters or longer, add the match to a StringBuffer, reverse it, and replace the forward version with the reverse. Note that this algorithm works as intended for smaller text files.

Is there a more powerful regex that would eliminate the need for splitting and traversing the array? When I tried, the replaceAll() call replaced all downstream proteins with the reverse of the first protein in the set. I checked, just for fun, with System.out.println(rDashMatcher.groupCount()) and it printed 0 for each of the proteins in the set. Can anyone help me with a more efficient/powerful regex? It's a fairly new concept to me, but it reminds me of vectorizing in MATLAB (only with letters).
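A note on the replaceAll() observation above: replaceAll(String) substitutes one fixed replacement string for every match, so every later run receives the reverse of the first one, and groupCount() reports the number of capturing groups in the pattern, which is 0 for a pattern with no parentheses. A minimal sketch of a per-match replacement using Matcher.appendReplacement and appendTail, assuming the text fits in memory; the class name ReverseRuns, the method reverseRuns and the sample string are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReverseRuns {
    // Runs of five or more capitals, as described in the question.
    private static final Pattern PROTEIN = Pattern.compile("[A-Z]{5,}");

    // Reverses every run of capitals in the text, one match at a time,
    // in a single left-to-right pass with no splitting.
    static String reverseRuns(CharSequence text) {
        Matcher m = PROTEIN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String reversed = new StringBuilder(m.group()).reverse().toString();
            // quoteReplacement keeps '$' and '\' literal; harmless for pure capitals.
            m.appendReplacement(out, Matcher.quoteReplacement(reversed));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample in the shape shown in the question.
        String sample = ">r-gi|230498 [bovine albumin]ADFIJWOEN>r-sp|234235 (human) agp1 QWIQWONOQ";
        System.out.println(reverseRuns(sample));
    }
}
```

For the sample in main this should print >r-gi|230498 [bovine albumin]NEOWJIFDA>r-sp|234235 (human) agp1 QONOWQIWQ. Unlike the split-based loop, it reverses every qualifying run, including one that happens to occur inside a description.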
I threw 10,000,000 records (which came to a ~379 MB text file) at this and it took 1:06 minutes (4-core Athlon, a few years old).
The big if tree handles the ends, where there is only half a record because the delimiter is in the middle of the element.
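    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public void readProteins(BufferedReader br, BufferedWriter bw) throws IOException {
        Pattern regexSplit = Pattern.compile(">");
        // Lazy prefix (the description), then the first run of five or more capitals.
        Pattern proteinPattern = Pattern.compile("(.*?)([A-Z]{5,})");
        Matcher m;
        Scanner s = new Scanner(br);
        s.useDelimiter(regexSplit);
        while (s.hasNext()) {
            StringBuffer sb = new StringBuffer();
            String protein = s.next();
            m = proteinPattern.matcher(protein);
            if (m.find())
                // Reversed sequence, then the delimiter plus r-, then the description put back in front.
                sb.append(m.group(2)).reverse().append(">r-").insert(0, m.group(1));
            else
                sb.append(protein);
            // Assumed: the source shows only a stray ");" at this point.
            bw.write(sb.toString());
        }
        bw.flush();
        bw.close();
    }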
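To try the record-at-a-time idea on a small input without touching the file system, something along these lines may help. It is only a sketch under a few assumptions: it reads characters directly instead of going through Scanner, writes >r- in place of each > it actually sees, keeps any text that follows the sequence, and lets a description span line breaks (Pattern.DOTALL). The class name StreamingReverser, the transform methods and the sample string are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StreamingReverser {
    // Lazy prefix (the description), then the first run of five or more capitals.
    private static final Pattern RECORD =
            Pattern.compile("(.*?)([A-Z]{5,})", Pattern.DOTALL);

    // Copies in to out one record at a time: each '>' becomes ">r-" and the
    // first capital run of every record is reversed. Only one record is held in memory.
    static void transform(Reader in, Writer out) throws IOException {
        StringBuilder record = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '>') {
                writeRecord(record, out);
                out.write(">r-");
                record.setLength(0);
            } else {
                record.append((char) c);
            }
        }
        writeRecord(record, out); // text after the last '>'
    }

    private static void writeRecord(CharSequence record, Writer out) throws IOException {
        Matcher m = RECORD.matcher(record);
        if (m.lookingAt()) {
            out.write(m.group(1));
            out.write(new StringBuilder(m.group(2)).reverse().toString());
            out.write(record.subSequence(m.end(), record.length()).toString());
        } else {
            out.write(record.toString()); // no sequence found: copy unchanged
        }
    }

    // Convenience wrapper for in-memory strings.
    static String transform(String text) {
        StringWriter out = new StringWriter();
        try {
            transform(new StringReader(text), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transform("WERIN>gi|230498 [bovine albumin]ADFIJWOEN>sp|234235 (human) agp1 QWIQWONOQ"));
    }
}
```

For real files the Reader would normally be a BufferedReader and the Writer a BufferedWriter, since this sketch reads one character per call.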