php - Get all IMDb IDs from a page URL with preg_match_all()


How can I get all the IMDb IDs from a page? For example, I want all the IDs here. In the page, the URLs are of this format:

http://www.imdb.com/title/tt0948470/ 

I need all the IDs from the page using preg_match_all(). Can anyone help me?

Okay, I'm giving you some cooked-up code, so let me explain it:

  1. Obtain the HTML source.
  2. Parse all <a> href attributes.
  3. Test each value against a regular expression to see if it matches.
  4. If it matches, extract the ID from the link and store it in a way that avoids duplicates.
  5. Done.

Example/Demo:

    // initialize
    $ids   = array();
    $url   = 'http://www.imdb.com/movies-coming-soon/'; # url
    $expr  = '//a/@href';                               # these attributes
    $regex = '(/title/(tt\d{5,7})/)u';                  # matching regex
    $match = 1;                                         # take group 1

    // process: IDs are stored as array keys, so duplicates collapse
    foreach ((new DOMXPath(@DOMDocument::loadHTMLFile($url)))->query($expr) as $obj)
        preg_match($regex, $obj->value, $matches)
          && $ids[$matches[$match]] = 0;
    $ids = array_keys($ids);

    // output
    print_r($ids);

(Notes: You tagged the question php5; the current stable PHP 5 release is 5.4, and so is the example. If you configure your PHP 5 version with the cURL wrappers, this code will use cURL.)
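Since the question asks for preg_match_all() specifically, here is a minimal sketch of that route, assuming the page HTML is already in a string. The $html sample is made up for illustration; for a live page it could come from file_get_contents($url), provided allow_url_fopen is enabled. Unlike the DOM approach, this matches the pattern anywhere in the source, not only inside <a> href attributes.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<a href="/title/tt0948470/">A</a>'
      . '<a href="http://www.imdb.com/title/tt1234567/?ref_=x">B</a>'
      . '<a href="/title/tt0948470/">A again</a>'
      . '<a href="/name/nm0000001/">Not a title</a>';

// Collect every /title/tt.../ occurrence; group 1 holds the bare ID.
preg_match_all('~/title/(tt\d{5,7})/~', $html, $matches);

// Drop duplicates and reindex the result from 0.
$ids = array_values(array_unique($matches[1]));

print_r($ids);
```

Here array_unique() does the job that the array keys do in the DOM example.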

Edit: For lower PHP versions:

    ...
    // process
    $xp = new DOMXPath(@DOMDocument::loadHTMLFile($url));
    foreach ($xp->query($expr) as $obj)
    ...
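On newer PHP versions it is the static call that gets in the way: calling DOMDocument::loadHTMLFile() statically is deprecated in PHP 7 and an error in PHP 8. A rough sketch of the same steps with a DOMDocument instance follows. The function name extractImdbIds() and the sample markup are my own, not part of the code above, and it takes an HTML string so it can be tried without network access.

```php
<?php
// Hypothetical helper: returns the unique IMDb title IDs linked from $html.
function extractImdbIds(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $ids = array();
    $xp  = new DOMXPath($doc);
    foreach ($xp->query('//a/@href') as $attr) {
        if (preg_match('~/title/(tt\d{5,7})/~', $attr->value, $m)) {
            $ids[$m[1]] = true; // array keys are unique, so duplicates collapse
        }
    }

    return array_keys($ids);
}

// Made-up sample markup for illustration.
$sample = '<html><body>'
        . '<a href="/title/tt0948470/">One</a>'
        . '<a href="/title/tt0948470/?ref=x">One again</a>'
        . '<a href="/title/tt1234567/">Two</a>'
        . '</body></html>';

print_r(extractImdbIds($sample));
```

To work on a live page, calling $doc->loadHTMLFile($url) on the instance in place of loadHTML($html) should behave like the original example.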

Edit 2: Seeing that IMDb tags its markup, it's possible to retrieve only the actual movie entries of the list rather than all title links on the page.

This requires a little improvement in the XPath expression used. Because the parsing is more intelligent, duplicates do not exist and there is no need to remove them:

    // initialize
    $ids   = array();
    $url   = 'http://www.imdb.com/movies-coming-soon/'; # url
    $expr  = '//*[@itemtype="http://schema.org/Movie"]
                    //a[@itemprop="url"]/@href';        # these attributes
    $regex = '(/title/(tt\d{5,7})/)u';                  # matching regex
    $match = 1;                                         # take group 1

    // process: each movie entry appears once, so a plain list is enough
    $xp = new DOMXPath(@DOMDocument::loadHTMLFile($url));
    foreach ($xp->query($expr) as $obj)
        preg_match($regex, $obj->value, $matches)
            && $ids[] = $matches[$match];
