elasticsearch - Why do two identical documents score differently?


I'm figuring out the Tire gem (I'm new to Elasticsearch and Lucene) and trying things out. I need (probably non-trivial) scoring, so I'm trying to get a grip on that. I have read what I could find on the web about the scoring formula, and I'm trying to match it against what I found in an explained query.

If I read the figures correctly, the two documents with the title "foo foo foo foo" have different scores, which is not what I intended. I guess I'm missing a step during or after indexing, but I cannot figure it out.

Below is my code. I'm not going about it the way the Tire DSL is intended, because I want to figure things out; things may become more Tire-ish at a later time.

require 'tire'
require 'pp'

class Model
  INDEX = 'myindex'
  TYPE = 'company'

  class << self
    def delete_index
      Tire.index(INDEX) { delete }
    end

    def create_mapping
      Tire.index INDEX do
        create mappings: {
          TYPE => {
            properties: {
              title: { type: 'string' }
            }
          }
        }
      end
    end

    def refresh_index
      Tire.index INDEX do
        refresh
      end
    end
  end

  def initialize(attributes = {})
    @attributes = attributes.merge(:_id => object_id) # use the object_id as id, for testing
  end

  def _type
    TYPE
  end

  def id
    object_id.to_s # convert to a string because Tire compares against object_id!
  end

  def index
    item = self
    Tire.index INDEX do
      store item
    end
  end

  def to_indexed_json
    @attributes.to_json
  end

  ENTITIES = [
    new(title: "foo foo foo foo"),
    new(title: "foo"),
    new(title: "bar"),
    new(title: "foo bar"),
    new(title: "xxx"),
    new(title: "foo foo foo foo"),
    new(title: "foo foo"),
    new(title: "foo bar baz")
  ]

  QUERIES = {
    :foo => { query_string: { query: "foo" } },
    :all => { match_all: {} }
  }

  def self.custom_explained_search(q)
    Tire.search(Model::INDEX, :wrapper => Model, :explain => true) do |search|
      search.query do |query|
        # bypass the query DSL and set the raw query hash directly
        query.send :instance_variable_set, :@value, q
      end
    end
  end
end

class Tire::Results::Collection
  # expose the id, explanation and title of every raw hit
  def explained
    @response["hits"]["hits"].map do |hit|
      {
        "_id" => hit["_id"],
        "_explanation" => hit["_explanation"],
        "title" => hit["_source"]["title"]
      }
    end
  end
end

Model.delete_index
Model.create_mapping
Model::ENTITIES.each(&:index)
Model.refresh_index
s = Model.custom_explained_search(Model::QUERIES[:foo])
pp s.results.explained

The printed result is this:

[{"_id"=>"2169251840",
  "_explanation"=>
   {"value"=>0.54932046,
    "description"=>"fieldweight(_all:foo in 0), product of:",
    "details"=>
     [{"value"=>1.4142135,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>1.4142135, "description"=>"tf(phrasefreq=2.0)"},
         {"value"=>1.0, "description"=>"allpayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>0.5, "description"=>"fieldnorm(field=_all, doc=0)"}]},
  "title"=>"foo foo foo foo"},
 {"_id"=>"2169251720",
  "_explanation"=>
   {"value"=>0.54932046,
    "description"=>"fieldweight(_all:foo in 1), product of:",
    "details"=>
     [{"value"=>0.70710677,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>0.70710677, "description"=>"tf(phrasefreq=0.5)"},
         {"value"=>1.0, "description"=>"allpayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>1.0, "description"=>"fieldnorm(field=_all, doc=1)"}]},
  "title"=>"foo"},
 {"_id"=>"2169250520",
  "_explanation"=>
   {"value"=>0.48553526,
    "description"=>"fieldweight(_all:foo in 2), product of:",
    "details"=>
     [{"value"=>1.0,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>1.0, "description"=>"tf(phrasefreq=1.0)"},
         {"value"=>1.0, "description"=>"allpayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>0.625, "description"=>"fieldnorm(field=_all, doc=2)"}]},
  "title"=>"foo foo"},
 {"_id"=>"2169251320",
  "_explanation"=>
   {"value"=>0.44194174,
    "description"=>"fieldweight(_all:foo in 1), product of:",
    "details"=>
     [{"value"=>0.70710677,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>0.70710677, "description"=>"tf(phrasefreq=0.5)"},
         {"value"=>1.0, "description"=>"allpayload(...)"}]},
      {"value"=>1.0, "description"=>"idf(_all:  foo=1)"},
      {"value"=>0.625, "description"=>"fieldnorm(field=_all, doc=1)"}]},
  "title"=>"foo bar"},
 {"_id"=>"2169250380",
  "_explanation"=>
   {"value"=>0.27466023,
    "description"=>"fieldweight(_all:foo in 3), product of:",
    "details"=>
     [{"value"=>0.70710677,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>0.70710677, "description"=>"tf(phrasefreq=0.5)"},
         {"value"=>1.0, "description"=>"allpayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>0.5, "description"=>"fieldnorm(field=_all, doc=3)"}]},
  "title"=>"foo bar baz"},
 {"_id"=>"2169250660",
  "_explanation"=>
   {"value"=>0.2169777,
    "description"=>"fieldweight(_all:foo in 0), product of:",
    "details"=>
     [{"value"=>1.4142135,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>1.4142135, "description"=>"tf(phrasefreq=2.0)"},
         {"value"=>1.0, "description"=>"allpayload(...)"}]},
      {"value"=>0.30685282, "description"=>"idf(_all:  foo=1)"},
      {"value"=>0.5, "description"=>"fieldnorm(field=_all, doc=0)"}]},
  "title"=>"foo foo foo foo"}]
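As a sanity check on the figures above, here is a small sketch that multiplies the three factors listed under each fieldweight node (tf, idf and fieldnorm) and compares the product with the reported score. The numbers are copied from the printed result; the names ROWS and RECOMPUTED are only illustrative and not part of Tire or Elasticsearch.

```ruby
# Factors copied from the explain output above:
# [title, tf, idf, fieldnorm, reported score]
ROWS = [
  ["foo foo foo foo", 1.4142135,  0.7768564,  0.5,   0.54932046],
  ["foo",             0.70710677, 0.7768564,  1.0,   0.54932046],
  ["foo foo",         1.0,        0.7768564,  0.625, 0.48553526],
  ["foo bar",         0.70710677, 1.0,        0.625, 0.44194174],
  ["foo bar baz",     0.70710677, 0.7768564,  0.5,   0.27466023],
  ["foo foo foo foo", 1.4142135,  0.30685282, 0.5,   0.2169777]
]

# Recompute each score as tf * idf * fieldnorm.
RECOMPUTED = ROWS.map { |_title, tf, idf, norm, _reported| tf * idf * norm }

# Print the recomputed product next to the reported score for comparison.
ROWS.zip(RECOMPUTED).each do |(title, _tf, idf, _norm, reported), product|
  puts format("%-17s idf=%.8f product=%.8f reported=%.8f", title, idf, product, reported)
end
```

Read this way, the two "foo foo foo foo" rows share the same tf and fieldnorm values; only the idf factor differs between them (0.7768564 with foo=4 versus 0.30685282 with foo=1).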

Am I reading the figures wrong? Or am I misusing Tire? Maybe I'm missing a "reindex the whole collection" step?

AFAIK, if no explicit sorting field is defined, sorting defaults to (a variant of) tf*idf (http://en.wikipedia.org/wiki/tf*idf).

Literally: term frequency * inverse document frequency.

From Wikipedia:

Term frequency (term count): the term count in a given document is the number of times a given term appears in that document.

Inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

In your case, the "term frequency" component of the sorting would result in "foo foo foo foo" scoring higher than the other docs when searching for 'foo'.
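To make the two definitions concrete, here is a minimal sketch that applies them literally to the titles from the question. It assumes whitespace tokenization and the natural logarithm; the method names are made up for illustration, and this is the textbook formula rather than the variant Lucene computes (the explain output in the question also multiplies in a fieldnorm factor).

```ruby
# Titles copied from the ENTITIES list in the question.
TITLES = [
  "foo foo foo foo", "foo", "bar", "foo bar",
  "xxx", "foo foo foo foo", "foo foo", "foo bar baz"
]

# Term frequency: how many times the term appears in one document.
def term_frequency(term, doc)
  doc.split.count(term)
end

# Inverse document frequency: log of (total documents / documents containing the term).
# Assumes the term occurs in at least one document.
def inverse_document_frequency(term, docs)
  containing = docs.count { |doc| doc.split.include?(term) }
  Math.log(docs.size.to_f / containing)
end

# Textbook tf*idf weight of a term in one document of the collection.
def tf_idf(term, doc, docs)
  term_frequency(term, doc) * inverse_document_frequency(term, docs)
end

TITLES.uniq.each do |title|
  puts format("%-17s tf=%d tf*idf=%.4f", title, term_frequency("foo", title), tf_idf("foo", title, TITLES))
end
```

Under these assumptions "foo foo foo foo" gets the largest weight for 'foo', because its term count is the highest while the idf factor is the same for every document in the collection.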

Moreover, regarding the effect you see when changing the IDs: I'm not sure, but I'm guessing it has to do with ES storing docs ordered by their IDs internally (I'm not sure about that)...

If that's the case, two documents having the same sort score would be sorted based on ID as a tiebreaker. You can of course define multiple sorts to change this behavior (e.g. sort=sorta+desc, sortb+desc; in that case sortb is used as a tiebreaker for docs that score the same on sorta).
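The tiebreaker idea can be illustrated in plain Ruby, independent of Elasticsearch. This is only an in-memory sketch with hypothetical hits and field names; it does not show how ES orders documents internally, which the guess above leaves open.

```ruby
# Hypothetical hits: two share the same score, so the second sort key decides their order.
HITS = [
  { id: "c", score: 0.5 },
  { id: "a", score: 0.5 },
  { id: "b", score: 0.9 }
]

# Sort by score descending, then by id ascending as the tiebreaker.
SORTED = HITS.sort_by { |hit| [-hit[:score], hit[:id]] }

SORTED.each { |hit| puts "#{hit[:id]} #{hit[:score]}" }
```

The hit with the highest score comes first, and the two hits that tie on score are ordered by their id.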

