C# - Getting Matched HTML Value with Regex
OK, to start: I know I should not be using regex to parse HTML. It's not reliable, not 100% safe, etc. However, this is as much a learning exercise for regex as anything else.
So my example uses the BBC website: http://www.bbc.co.uk/sport/football/premier-league/table.
The project is parsing the tbody of the first table. I am trying to search for elements so that only the rows matching a search value are returned. For example, given the search "manc", I want the tr tags for Manchester City and Manchester United (matched on the URL).
What I have so far is <tr\b[^>]*>(.*?)manc(.*?)</tr>
It matches from the first tr to the closing tr after Man City, and then returns the expected result for Man Utd. Can anyone point out where I've gone wrong with the regex?
Edit: source (trimmed)
<tbody id="trc-20-118996114-3"> <tr id="team-138824012" class="team first"> <td class="statistics"></td> <td class='position'> <span class='moving-up'>moving up</span> <span class='position-number'>1</span> </td> <td class="team-name"> <a href='http://www.bbc.co.uk/sport/football/teams/arsenal'>arsenal</a> </td> <td class="played">0</td> <td class="home-won"> <span>0</span> </td> <td class="home-drawn">0</td> <td class="home-lost">0</td> <td class="home-for">0</td> <td class="home-against">0</td> <td class="away-won"> <span>0</span> </td> <td class="away-drawn">0</td> <td class="away-lost">0</td> <td class="away-for">0</td> <td class="away-against">0</td> <td class="goal-difference">0</td> <td class="points">0</td> <td class="last-10-games"> <ol> <li class="win" title="win"> <span>win</span> </li> <li class="draw" title="draw"> <span>draw</span> </li> <li class="draw" title="draw"> <span>draw</span> </li> <li class="draw" title="draw"> <span>draw</span> </li> <li class="loss" title="loss"> <span>loss</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="loss" title="loss"> <span>loss</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="win last" title="win"> <span>win</span> </li> </ol> </td> <td class="status"> <a class="report" href="http://www.bbc.co.uk/sport/0/football/17973141">report</a> </td> </tr> <tr id="team-137316633" class="team"> <td class="statistics"></td> <td class='position'> <span class='moving-up'>moving up</span> <span class='position-number'>2</span> </td> <td class="team-name"> <a href='http://www.bbc.co.uk/sport/football/teams/aston-villa'>aston villa</a> </td> <td class="played">0</td> <td class="home-won"> <span>0</span> </td> <td class="home-drawn">0</td> <td class="home-lost">0</td> <td class="home-for">0</td> <td class="home-against">0</td> <td class="away-won"> <span>0</span> </td> <td class="away-drawn">0</td> <td class="away-lost">0</td> 
<td class="away-for">0</td> <td class="away-against">0</td> <td class="goal-difference">0</td> <td class="points">0</td> <td class="last-10-games"> <ol> <li class="loss" title="loss"> <span>loss</span> </li> <li class="draw" title="draw"> <span>draw</span> </li> <li class="draw" title="draw"> <span>draw</span> </li> <li class="loss" title="loss"> <span>loss</span> </li> <li class="draw" title="draw"> <span>draw</span> </li> <li class="loss" title="loss"> <span>loss</span> </li> <li class="draw" title="draw"> <span>draw</span> </li> <li class="draw" title="draw"> <span>draw</span> </li> <li class="loss" title="loss"> <span>loss</span> </li> <li class="loss last" title="loss"> <span>loss</span> </li> </ol> </td> <td class="status"> <a class="report" href="http://www.bbc.co.uk/sport/0/football/17973120">report</a> </td> </tr> <tr id="team-137318151" class="team"> <td class="statistics"></td> <td class='position'> <span class='moving-down'>moving down</span> <span class='position-number'>7</span> </td> <td class="team-name"> <a href='http://www.bbc.co.uk/sport/football/teams/manchester-city'>man city</a> </td> <td class="played">0</td> <td class="home-won"> <span>0</span> </td> <td class="home-drawn">0</td> <td class="home-lost">0</td> <td class="home-for">0</td> <td class="home-against">0</td> <td class="away-won"> <span>0</span> </td> <td class="away-drawn">0</td> <td class="away-lost">0</td> <td class="away-for">0</td> <td class="away-against">0</td> <td class="goal-difference">0</td> <td class="points">0</td> <td class="last-10-games"> <ol> <li class="win" title="win"> <span>win</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="loss" title="loss"> <span>loss</span> </li> <li class="draw" title="draw"> <span>draw</span> </li> <li 
class="draw" title="draw"> <span>draw</span> </li> <li class="win last" title="win"> <span>win</span> </li> </ol> </td> <td class="status"> <a class="report" href="http://www.bbc.co.uk/sport/0/football/17973148">report</a> </td> </tr> <tr id="team-137318152" class="team"> <td class="statistics"></td> <td class='position'> <span class='moving-down'>moving down</span> <span class='position-number'>8</span> </td> <td class="team-name"> <a href='http://www.bbc.co.uk/sport/football/teams/manchester-united'>man utd</a> </td> <td class="played">0</td> <td class="home-won"> <span>0</span> </td> <td class="home-drawn">0</td> <td class="home-lost">0</td> <td class="home-for">0</td> <td class="home-against">0</td> <td class="away-won"> <span>0</span> </td> <td class="away-drawn">0</td> <td class="away-lost">0</td> <td class="away-for">0</td> <td class="away-against">0</td> <td class="goal-difference">0</td> <td class="points">0</td> <td class="last-10-games"> <ol> <li class="win" title="win"> <span>win</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="loss" title="loss"> <span>loss</span> </li> <li class="draw" title="draw"> <span>draw</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="loss" title="loss"> <span>loss</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="win" title="win"> <span>win</span> </li> <li class="win last" title="win"> <span>win</span> </li> </ol> </td> <td class="status"> <a class="report" href="http://www.bbc.co.uk/sport/0/football/17973162">report</a> </td> </tr> </tbody>
The problem is that your regular expression is too broad. You're asking for:
<tr\b[^>]*>(.*?)manc(.*?)</tr>
Let's simplify it a little bit:
<tr>.*?manc.*?</tr>
So you're saying: OK, I need to match a tr, followed by anything, then manc, then anything, and then a closing tr. So of course what happens is that the regex starts at the first tr and goes: OK, I've got a tr, let me keep matching until I find manc. In the meantime, it has passed a bunch of other tr tags. The regex doesn't care.
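To see this concretely, here is a minimal C# sketch that runs the original pattern against a cut-down version of the table. The three-row `Html` string and the `OvermatchDemo` class are made up for illustration, and `RegexOptions.Singleline` is an assumption so that `.` also matches line breaks between rows.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OvermatchDemo
{
    // Hypothetical three-row sample, cut down from the table in the question.
    public const string Html =
        "<tr id=\"team-138824012\" class=\"team first\"><td class=\"team-name\"><a href='http://www.bbc.co.uk/sport/football/teams/arsenal'>arsenal</a></td></tr>\n" +
        "<tr id=\"team-137318151\" class=\"team\"><td class=\"team-name\"><a href='http://www.bbc.co.uk/sport/football/teams/manchester-city'>man city</a></td></tr>\n" +
        "<tr id=\"team-137318152\" class=\"team\"><td class=\"team-name\"><a href='http://www.bbc.co.uk/sport/football/teams/manchester-united'>man utd</a></td></tr>";

    // The pattern from the question, unchanged.
    public const string Original = @"<tr\b[^>]*>(.*?)manc(.*?)</tr>";

    public static MatchCollection Run()
    {
        // Singleline lets '.' cross the line breaks between rows.
        return Regex.Matches(Html, Original, RegexOptions.Singleline);
    }

    public static void Main()
    {
        foreach (Match m in Run())
        {
            // Count how many opening tr tags ended up inside this single match.
            int rows = Regex.Matches(m.Value, @"<tr\b").Count;
            Console.WriteLine($"match at index {m.Index} spans {rows} row(s)");
        }
    }
}
```

With this sample, the first match swallows two rows (Arsenal and Man City) because the lazy `(.*?)` keeps growing across the Arsenal row's `</tr>` until it reaches the first "manc". The second match covers only the Man Utd row, which is the behaviour described in the question.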
Try this, which uses a negative lookahead so that the text before manc can never run past a closing tr:
<tr>(?:(?!</tr>).)*manc.+?</tr>
Or, I guess, in your example:
<tr\b[^>]*>(?:(?!</tr>).)*manc.+?</tr>
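As a rough sketch of how that pattern could be used from C#, the snippet below wraps it in a helper that returns every row containing the search term. `RowSearch`, `FindRows`, and the `Sample` string are hypothetical names for illustration. Passing the term through `Regex.Escape` and adding `RegexOptions.Singleline` and `RegexOptions.IgnoreCase` are my assumptions, not part of the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowSearch
{
    // Hypothetical three-row sample, cut down from the table in the question.
    public const string Sample =
        "<tr id=\"team-138824012\" class=\"team first\"><td class=\"team-name\"><a href='http://www.bbc.co.uk/sport/football/teams/arsenal'>arsenal</a></td></tr>\n" +
        "<tr id=\"team-137318151\" class=\"team\"><td class=\"team-name\"><a href='http://www.bbc.co.uk/sport/football/teams/manchester-city'>man city</a></td></tr>\n" +
        "<tr id=\"team-137318152\" class=\"team\"><td class=\"team-name\"><a href='http://www.bbc.co.uk/sport/football/teams/manchester-united'>man utd</a></td></tr>";

    // Returns every complete <tr>...</tr> whose content contains the term.
    public static List<string> FindRows(string html, string term)
    {
        // (?:(?!</tr>).)* consumes one character at a time, but only while
        // the text ahead is not "</tr>", so it cannot leave the current row.
        string pattern =
            @"<tr\b[^>]*>(?:(?!</tr>).)*" + Regex.Escape(term) + @".+?</tr>";

        var rows = new List<string>();
        foreach (Match m in Regex.Matches(
                     html, pattern,
                     RegexOptions.Singleline | RegexOptions.IgnoreCase))
        {
            rows.Add(m.Value);
        }
        return rows;
    }

    public static void Main()
    {
        // Prints the Man City row and the Man Utd row, one match each.
        foreach (string row in FindRows(Sample, "manc"))
        {
            Console.WriteLine(row);
        }
    }
}
```

One thing to keep in mind: the trailing `.+?` needs at least one character between the term and `</tr>`. If a search term could sit directly before the closing tag, `.*?` would be the safer choice.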