java - Efficient Way to find most similar value -


i have value such color, , list of string : {colour,color,main color, main colour, theme, brand, subject ..... etc}

i similar string , except searched string itself. in example expect colour. (not color)

i sorting list using following rules , ranked rules :

  1. filter same value
  2. check upper lower cases
  3. delete whitespaces. trim
  4. using levenshtein distance
  5. string order : main color = color main
  6. check acronym : hp - hewlett packard

it takes lot of time go on list of 1000 relevant candidates. have lots of candidates check.

any other efficient way?

original code:

public static list findsimilarity(string word, list candidates) {     list recommendations = new arraylist();     if (!word.equals("")) {         (string candidate : candidates) {             if (!word.equals(candidate)) { //1. same token , lower/upper cases , ignore white spaces                 if (stringutils.deletewhitespace(word).equalsignorecase(stringutils.deletewhitespace(candidate))) {                     recommendations.add(candidate);                 }                 //2. same tokens diff order                 else if (candidate.split(" ").length == word.split("     ").length) {                     string[] candidatearr = candidate.split(" ");                     string[] wordarr = word.split(" ");                     boolean status = true;                     sortignorecase icc = new sortignorecase();                     arrays.sort(candidatearr, icc);                     arrays.sort(wordarr, icc);                     (int = 0; < candidatearr.length; i++) {                         if (!(candidatearr[i] == null ? wordarr[i] == null : wordarr[i].equalsignorecase(candidatearr[i])))                             status = false;                     }                      if (status) {                         recommendations.add(candidate);                     }                 }             }         }         //3. distance between words         if (recommendations.size() == 0) {             (string candidate : candidates) {                 if (!word.equals(candidate)) {                     string[] candidatearr = candidate.split(" ");                     string[] wordarr = word.split(" ");                     //check acronym                     if ((wordarr.length == 1 && candidatearr.length > 1) || (wordarr.length > 1 && candidatearr.length == 1)) {                         string acronym = "";                         if (wordarr.length > candidatearr.length) {                             (string tmp : wordarr) {                                 if (!tmp.equals("")) {                                     acronym = acronym + tmp.substring(0, 1);                                 }                             }                              if (acronym.equalsignorecase(candidatearr[0])) {                                 recommendations.add(candidate);                             }                         } else {                             (string tmp : candidatearr) {                                 if (!tmp.equals("")) {                                     acronym = acronym + tmp.substring(0, 1);                                 }                             }                              if (acronym.equalsignorecase(wordarr[0])) {                                 recommendations.add(candidate);                             }                         }                     }                 }             }         }          if (recommendations.size() == 0) {             (string candidate : candidates) {                 if (!word.equals(candidate)) {                     int dist = 0;                     string check = "";                     if (word.length() > candidate.length()) {                         check = candidate;                     } else {                         check = word;                     }                     if (check.length() <= 3) {                         dist = 0;                     } else if (check.length() > 3 && check.length() <= 5) {                         dist = 1;                     } else if (check.length() > 5) {                         dist = 2;                     }                      if (stringutils.getlevenshteindistance(word, candidate) <= dist) {                         //if(levenshtein.distance(word,candidate) <= dist){                         recommendations.add(candidate);                     }                 }             }         }          if (recommendations.size() == 0) {             (string candidate : candidates) {                 if (!word.equals(candidate)) {                     string[] candidatearr = candidate.split(" ");                     string[] wordarr = word.split(" ");                      (string cand : candidatearr) {                         (string wor : wordarr) {                             if (cand.equals(wor) && cand.length() > 4) {                                 recommendations.add(candidate);                              }                         }                     }                 }             }//for             if (recommendations.size() > 4) {                 recommendations.clear();             }         }          //4. low priority - starts         if (recommendations.size() == 0) {             (string candidate : candidates) {                 if (!word.equals(candidate)) {                     if (candidate.startswith(word) || word.startswith(candidate)) {                         recommendations.add(candidate);                     }                 }             }             if (recommendations.size() > 4) {                 recommendations.clear();             }         }          //5. low priority - contain word         if (recommendations.size() == 0) {             (string candidate : candidates) {                 if (!word.equals(candidate)) {                     if (candidate.contains(word) || word.contains(candidate)) {                         recommendations.add(candidate);                     }                 }             }             if (recommendations.size() > 4) {                 recommendations.clear();             }         }     }     return recommendations; } 

thanks, m.

edited

i wrapped answer given bohemian context of original code better understanding.

the line .map(term -> arrays.stream(term.split(" ")).sorted().collect(collectors.joining(" "))) splits multi-word terms, sorts, , joins again eliminate permutations of same words. answer permutational equality challenge of terms "main color" , "color main".

however, not make sense catch business requirements of task in context of question. answer have got outline of solution. problem of efficiency addressed. might need more stages in pipeline, that's different story. strength of approach stages detached, can ask questions , seek each stage independently.

public static string findsimilarity(string word, list<string> candidateslist) {      // populating set distinct values of input terms     set<string> candidates = candidateslist.stream()             .map(string::tolowercase)             .map(term -> arrays.stream(term.split(" ")).sorted().collect(collectors.joining(" "))) // eliminates permutations             .collect(collectors.toset());      map<string, integer> cache = new concurrenthashmap<>();      return candidates.parallelstream()             .map(string::trim)                     // add more mappers if needed             .filter(s -> !s.equalsignorecase(word))                     // add more filters if needed             .min((a, b) -> integer.compare(                     cache.computeifabsent(a, k -> getlevenshteindistance(word, k)),                     cache.computeifabsent(b, k -> getlevenshteindistance(word, k))))             .get(); // closest match } 

Comments

Popular posts from this blog

sql - VB.NET Operand type clash: date is incompatible with int error -

SVG stroke-linecap doesn't work for circles in Firefox? -

python - TypeError: Scalar value for argument 'color' is not numeric in openCV -