java - Efficient Way to find most similar value -
i have value such color, , list of string : {colour,color,main color, main colour, theme, brand, subject ..... etc}
i similar string , except searched string itself. in example expect colour. (not color)
i sorting list using following rules , ranked rules :
- filter same value
- check upper lower cases
- delete whitespaces. trim
- using levenshtein distance
- string order : main color = color main
- check acronym : hp - hewlett packard
it takes lot of time go on list of 1000 relevant candidates. have lots of candidates check.
any other efficient way?
original code:
public static list findsimilarity(string word, list candidates) { list recommendations = new arraylist(); if (!word.equals("")) { (string candidate : candidates) { if (!word.equals(candidate)) { //1. same token , lower/upper cases , ignore white spaces if (stringutils.deletewhitespace(word).equalsignorecase(stringutils.deletewhitespace(candidate))) { recommendations.add(candidate); } //2. same tokens diff order else if (candidate.split(" ").length == word.split(" ").length) { string[] candidatearr = candidate.split(" "); string[] wordarr = word.split(" "); boolean status = true; sortignorecase icc = new sortignorecase(); arrays.sort(candidatearr, icc); arrays.sort(wordarr, icc); (int = 0; < candidatearr.length; i++) { if (!(candidatearr[i] == null ? wordarr[i] == null : wordarr[i].equalsignorecase(candidatearr[i]))) status = false; } if (status) { recommendations.add(candidate); } } } } //3. distance between words if (recommendations.size() == 0) { (string candidate : candidates) { if (!word.equals(candidate)) { string[] candidatearr = candidate.split(" "); string[] wordarr = word.split(" "); //check acronym if ((wordarr.length == 1 && candidatearr.length > 1) || (wordarr.length > 1 && candidatearr.length == 1)) { string acronym = ""; if (wordarr.length > candidatearr.length) { (string tmp : wordarr) { if (!tmp.equals("")) { acronym = acronym + tmp.substring(0, 1); } } if (acronym.equalsignorecase(candidatearr[0])) { recommendations.add(candidate); } } else { (string tmp : candidatearr) { if (!tmp.equals("")) { acronym = acronym + tmp.substring(0, 1); } } if (acronym.equalsignorecase(wordarr[0])) { recommendations.add(candidate); } } } } } } if (recommendations.size() == 0) { (string candidate : candidates) { if (!word.equals(candidate)) { int dist = 0; string check = ""; if (word.length() > candidate.length()) { check = candidate; } else { check = word; } if (check.length() <= 3) { dist = 0; } else if (check.length() > 3 && check.length() <= 5) { dist = 1; } else if (check.length() > 5) { dist = 2; } if (stringutils.getlevenshteindistance(word, candidate) <= dist) { //if(levenshtein.distance(word,candidate) <= dist){ recommendations.add(candidate); } } } } if (recommendations.size() == 0) { (string candidate : candidates) { if (!word.equals(candidate)) { string[] candidatearr = candidate.split(" "); string[] wordarr = word.split(" "); (string cand : candidatearr) { (string wor : wordarr) { if (cand.equals(wor) && cand.length() > 4) { recommendations.add(candidate); } } } } }//for if (recommendations.size() > 4) { recommendations.clear(); } } //4. low priority - starts if (recommendations.size() == 0) { (string candidate : candidates) { if (!word.equals(candidate)) { if (candidate.startswith(word) || word.startswith(candidate)) { recommendations.add(candidate); } } } if (recommendations.size() > 4) { recommendations.clear(); } } //5. low priority - contain word if (recommendations.size() == 0) { (string candidate : candidates) { if (!word.equals(candidate)) { if (candidate.contains(word) || word.contains(candidate)) { recommendations.add(candidate); } } } if (recommendations.size() > 4) { recommendations.clear(); } } } return recommendations; }
thanks, m.
edited
i wrapped answer given bohemian context of original code better understanding.
the line .map(term -> arrays.stream(term.split(" ")).sorted().collect(collectors.joining(" ")))
splits multi-word terms, sorts, , joins again eliminate permutations of same words. answer permutational equality challenge of terms "main color" , "color main".
however, not make sense catch business requirements of task in context of question. answer have got outline of solution. problem of efficiency addressed. might need more stages in pipeline, that's different story. strength of approach stages detached, can ask questions , seek each stage independently.
public static string findsimilarity(string word, list<string> candidateslist) { // populating set distinct values of input terms set<string> candidates = candidateslist.stream() .map(string::tolowercase) .map(term -> arrays.stream(term.split(" ")).sorted().collect(collectors.joining(" "))) // eliminates permutations .collect(collectors.toset()); map<string, integer> cache = new concurrenthashmap<>(); return candidates.parallelstream() .map(string::trim) // add more mappers if needed .filter(s -> !s.equalsignorecase(word)) // add more filters if needed .min((a, b) -> integer.compare( cache.computeifabsent(a, k -> getlevenshteindistance(word, k)), cache.computeifabsent(b, k -> getlevenshteindistance(word, k)))) .get(); // closest match }
Comments
Post a Comment