algorithm - What is the fastest (realistic) storage implementation to retrieve similar entries?
I read about BK-trees (Burkhard-Keller trees) some months ago; they are said to be a method for storing entries so that you can later read them out again by a distance metric. In my case I want to retrieve entries by similarity.
However, these BK-trees do not seem very fast to me. When I tried an implementation and looked at its output, it had to travel through a large part of the tree as soon as I allowed longer distances (I tried Levenshtein distance and allowed up to 6 edits).
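To make the traversal behaviour concrete, here is a minimal BK-tree sketch (all names are my own, not from any particular library): nodes keep children keyed by distance, and a search with tolerance t only has to visit children whose edge distance lies in [d - t, d + t] by the triangle inequality. With a large t, that interval covers most children, which is exactly why the tree walk gets expensive.

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, dist):
        self.dist = dist
        self.root = None  # node = (word, {distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.dist(word, node[0])
            if d in node[1]:
                node = node[1][d]      # descend along the existing edge
            else:
                node[1][d] = (word, {})  # attach as a new child
                return

    def search(self, query, max_dist):
        # triangle inequality: only children with edge distance
        # in [d - max_dist, d + max_dist] can contain matches
        results, stack = [], [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.dist(query, word)
            if d <= max_dist:
                results.append((d, word))
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)
```

Usage: build a tree from a word list and call `search("book", 1)` to get all entries within one edit. Try `max_dist=6` on a realistic vocabulary and you can count how many nodes survive the pruning interval.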
The fastest implementation (if it is only about speed) would of course store the distance from each entry to each other entry in a table and look them up directly, but the overhead would be enormous.
That is why I added "realistic" to the title. It is okay to require more memory, but the implementation should still be realistic and usable (I do not know enough about such techniques to say where exactly "realistic" ends, but I guess there is a border somewhere).
Is there anything faster than BK-trees available, or are BK-trees the top of the mountain (yet)?
Scenario
I do not have a real use case; the scenario is as follows: I have 1 million entries, each of which has a distance to every other (defined by a distance function). Given one entry, I want to know either:
- which 5 entries best match the given entry
- which other entries (independent of their number) have a distance lower than or equal to a given threshold
The database does not matter.
I guess in the end the best algorithm would handle both queries?
Another tree-based nearest-neighbour structure for metric spaces is the cover tree: http://en.wikipedia.org/wiki/cover_tree. It claims to be practical, and http://www.cs.waikato.ac.nz/ml/weka/ has picked it up, so it seems to be. However, exact nearest-neighbour search seems to be hard, with trees or anything else, because there are a number of proposals floating around for approximate nearest neighbour, which would presumably be pretty silly if the exact problem weren't hard. You can see one for edit distance at http://people.csail.mit.edu/indyk/edit.ps.
Another way to do approximate nearest-neighbour search is to hope that the nearest neighbour shares a contiguous section of characters with the query string. For the strings in the database, chop them into contiguous k-long substrings and build a table that supports exact matching. For a query string, consider its k-long contiguous substrings, look these up by exact match, and then compute the edit distance only to the database strings found by that exact search over k-long substrings.