Whoosh doesn't search for short words like "C#"


I am using Whoosh to index about 200,000 books, and I have encountered some problems with it. The Whoosh query parser returns NullQuery for words like "C#" and "C++" that have meta-characters in them, and also for some other short words. These words are used in the title and body of some documents, so I am not using the KEYWORD type for them. I guess the problem is in the analysis or query-parsing phase of searching or indexing, but I can't touch my data blindly. Can anyone help me correct this issue? Thanks.

I fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements. Here is the regex pattern:

'\w+[#+.\w]*'
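To see what this pattern actually captures, here is a minimal sketch that uses only Python's `re` module rather than Whoosh itself. The `tokenize` helper is hypothetical and only mimics a regex tokenizer followed by lowercasing; it assumes the tokenizer takes the whole match as the token.

```python
import re

# The pattern from the post, written as a raw string.
word_pattern = re.compile(r'\w+[#+.\w]*')

def tokenize(text):
    """Hypothetical helper: whole regex matches, lowercased."""
    return [m.group(0).lower() for m in word_pattern.finditer(text)]

tokens = tokenize("Learning C# and C++ with .NET 4.5")
print(tokens)
# ['learning', 'c#', 'and', 'c++', 'with', 'net', '4.5']
```

Note that a token must start with a word character here, so the leading dot of ".NET" is lost.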

This makes tokenizing of the fields succeed, and searching goes well. But when I use queries like "some query++*" or "some##*", the parsed query is a single Every query, just as for '*'. I found that this is not related to my analyzer and is Whoosh's default behavior. So here is my new question: is this behavior correct, or is it a bug?

Note: removing WildcardPlugin from the query parser solves this problem, but I also need WildcardPlugin.


Now I am using the following code:

from whoosh import analysis, fields
from whoosh.util import rcompile

# For matching words like '.net', 'c++' and 'c#'
word_pattern = rcompile(r'(\.|[\w]+)(\.?\w+|#|\+\+)*')
# I don't need words shorter than 2 characters, so I don't change the minsize default
analyzer = analysis.StandardAnalyzer(expression=word_pattern)

... and in the schema:

... title = fields.TEXT(analyzer=analyzer), ...
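As a rough check of the second pattern, the sketch below runs it through Python's `re` module directly. The `raw_tokens` helper is hypothetical and is not Whoosh code; it assumes the tokenizer uses the whole match as the token and leaves out the analyzer's stop-word and minimum-size filtering.

```python
import re

# Same expression as above, compiled with the standard library.
word_pattern = re.compile(r'(\.|[\w]+)(\.?\w+|#|\+\+)*')

def raw_tokens(text):
    """Hypothetical helper: whole regex matches, lowercased, no filtering."""
    return [m.group(0).lower() for m in word_pattern.finditer(text)]

sample = raw_tokens("I use .NET, C++ and C#.")
print(sample)
# ['i', 'use', '.net', 'c++', 'and', 'c#', '.']
```

The leading dot of ".NET" is kept this time. The sentence-ending period also becomes a token of its own, which a minimum token size of 2 would then discard along with "i".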

This solves my first problem, yes. But my main problem is in searching. I don't want to let users search using an Every query or *. But when I parse queries like c++*, I end up with an Every(*) query. I know there is some problem, but I can't figure out what it is.

I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.

import whoosh.analysis
import whoosh.fields

schema = whoosh.fields.Schema(
    name=whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
    # ...
)
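The effect of minsize can be sketched without Whoosh. The `analyze` function below is a hypothetical stand-in for an analyzer (tokenize, lowercase, drop short tokens), and `DEFAULT_PATTERN` is what I believe the default StandardAnalyzer expression to be; treat both as assumptions and check them against your Whoosh version. Stop-word removal is left out.

```python
import re

# Assumed default token pattern of StandardAnalyzer (verify for your version).
DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")

def analyze(text, pattern=DEFAULT_PATTERN, minsize=2):
    """Hypothetical stand-in: tokenize, lowercase, drop tokens shorter than minsize."""
    tokens = (m.group(0).lower() for m in pattern.finditer(text))
    return [t for t in tokens if len(t) >= minsize]

default_result = analyze("C#")             # "c" is too short, nothing is left
relaxed_result = analyze("C#", minsize=1)  # "c" survives, but "#" is still lost
custom_result = analyze("C#", pattern=re.compile(r"\w+[#+.\w]*"))

print(default_result, relaxed_result, custom_result)
# [] ['c'] ['c#']
```

Under these assumptions, an empty token list is what would lead to a NullQuery. Lowering minsize brings back the one-letter token "c", while keeping "#" or "++" in the token still needs a custom expression.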
