Whoosh doesn't search for short words like "C#"
I am using Whoosh to index 200,000 books, and I have encountered some problems with it. The Whoosh query parser returns NullQuery for words like "C#" and "C++" that have meta-characters in them, and also for some other short words. These words are used in the title and body of some documents, so I am not using the KEYWORD type for them. I guess the problem is in the analysis or query-parsing phase of searching or indexing, but I can't touch my data blindly. Can anyone help me correct this issue? Thanks.
I fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements. Here is the regex pattern:
'\w+[#+.\w]*'
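To see what this pattern changes at the token level, here is a minimal sketch that uses only Python's re module rather than Whoosh itself. It assumes the default StandardAnalyzer expression is \w+(\.?\w+)*; the helper raw_matches is hypothetical and only lists what each regex matches, before any lowercasing, stop-word, or minsize filtering.

```python
import re

# Assumption: this is the default token pattern of Whoosh's StandardAnalyzer,
# shown here only for comparison.
DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")
# Pattern quoted in the post above.
CUSTOM_PATTERN = re.compile(r"\w+[#+.\w]*")


def raw_matches(pattern, text):
    """Hypothetical helper: return every substring the pattern matches."""
    return [m.group(0) for m in pattern.finditer(text)]


text = "Learning C# and C++"
default_matches = raw_matches(DEFAULT_PATTERN, text)
custom_matches = raw_matches(CUSTOM_PATTERN, text)

print(default_matches)  # ['Learning', 'C', 'and', 'C']
print(custom_matches)   # ['Learning', 'C#', 'and', 'C++']
```

With the assumed default pattern, "C#" and "C++" both shrink to the single character "C", which a minimum token size of 2 would then discard. That would leave nothing to search for and is consistent with the NullQuery described in the question.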
This pattern makes tokenizing of the fields work, and searching goes well. But when I use queries like "some query++*" or "some##*", the parsed query is a single Every query, just like '*'. I found that this is not related to my analyzer and is Whoosh's default behavior. So here is my new question: is this behavior correct, or is it a bug?
Note: removing WildcardPlugin from the query parser solves this problem, but I need WildcardPlugin too.
I am now using the following code:
from whoosh import analysis, fields
from whoosh.util import rcompile

# For matching words like '.net', 'c++' and 'c#'
word_pattern = rcompile(r'(\.|[\w]+)(\.?\w+|#|\+\+)*')
# I don't need words shorter than two characters, so I don't change the minsize default
analyzer = analysis.StandardAnalyzer(expression=word_pattern)
... in the schema:
... title = fields.TEXT(analyzer=analyzer), ...
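As a quick check of the expression used for word_pattern, the sketch below runs it through Python's re module on a few sample strings. It does not involve Whoosh, so it shows only the raw regex matches; the helper name raw_matches and the sample strings are made up for illustration.

```python
import re

# Same expression as word_pattern in the post, compiled with the stdlib re module.
WORD_PATTERN = re.compile(r"(\.|[\w]+)(\.?\w+|#|\+\+)*")


def raw_matches(text):
    """Hypothetical helper: return every substring WORD_PATTERN matches."""
    return [m.group(0) for m in WORD_PATTERN.finditer(text)]


names = raw_matches("c# c++ .net asp.net")
sentence = raw_matches("I use C++.")

print(names)     # ['c#', 'c++', '.net', 'asp.net']
print(sentence)  # ['I', 'use', 'C++', '.']
```

Note that a sentence-final period is matched as a token of its own. With the default minimum token size of 2, such single-character matches would presumably be filtered out, which is one reason to leave that default alone when using this expression.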
This solves the first problem, yes. But my main problem is in searching. I don't want to let users search using the Every query or *. But when I parse queries like c++*, I end up with an Every(*) query. I know there is some problem, but I can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
schema = whoosh.fields.Schema(
    name=whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
    # ...
)
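The effect of minsize can be imitated without Whoosh. The sketch below is a rough stand-in, not Whoosh's own code: it assumes the default expression is \w+(\.?\w+)*, lowercases each match, and drops matches shorter than minsize. Stop-word removal is left out, and the function name analyze is hypothetical.

```python
import re

# Assumption: default token pattern of Whoosh's StandardAnalyzer.
DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")


def analyze(text, minsize=2):
    """Rough imitation: lowercase each match and drop those shorter than minsize."""
    return [
        m.group(0).lower()
        for m in DEFAULT_PATTERN.finditer(text)
        if len(m.group(0)) >= minsize
    ]


with_default = analyze("C# programming")
with_minsize_1 = analyze("C# programming", minsize=1)

print(with_default)    # ['programming']
print(with_minsize_1)  # ['c', 'programming']
```

Under these assumptions, minsize=1 keeps the "c" but the "#" is still not part of the token, so "C#" and "C" would end up as the same term. Keeping the symbol as well would need a custom expression in addition to minsize.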