WikiGalaxy

Personalize

Introduction to String Algorithms

String Matching Algorithms

String matching algorithms are used to find a substring within a string. They are fundamental in text processing and search engines.

Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm is efficient for string matching as it preprocesses the pattern to determine where mismatches can occur, reducing the number of comparisons.

Rabin-Karp Algorithm

Rabin-Karp uses hashing to find any one of a set of pattern strings in a text. It is particularly useful when dealing with multiple pattern searches.

Boyer-Moore Algorithm

Boyer-Moore is efficient for large alphabets as it skips sections of the text, thus reducing the number of comparisons needed.


public class KMPAlgorithm {
  void KMPSearch(String pat, String txt) {
    int M = pat.length();
    int N = txt.length();
    int lps[] = new int[M];
    int j = 0;
    computeLPSArray(pat, M, lps);
    int i = 0;
    while (i < N) {
      if (pat.charAt(j) == txt.charAt(i)) {
        j++;
        i++;
      }
      if (j == M) {
        System.out.println("Found pattern at index " + (i - j));
        j = lps[j - 1];
      } else if (i < N && pat.charAt(j) != txt.charAt(i)) {
        if (j != 0)
          j = lps[j - 1];
        else
          i = i + 1;
      }
    }
  }
  void computeLPSArray(String pat, int M, int lps[]) {
    int len = 0;
    int i = 1;
    lps[0] = 0;
    while (i < M) {
      if (pat.charAt(i) == pat.charAt(len)) {
        len++;
        lps[i] = len;
        i++;
      } else {
        if (len != 0) {
          len = lps[len - 1];
        } else {
          lps[i] = len;
          i++;
        }
      }
    }
  }
}
    

String Searching with KMP

KMP algorithm finds occurrences of a pattern within a text efficiently by precomputing the longest prefix which is also a suffix (LPS) array.

Console Output:

Found pattern at index 10

Suffix Array Construction

Basic Concept

A suffix array is an array of integers providing the starting indices of suffixes of a string sorted in lexicographical order. It is useful in many string processing applications.

Applications

Suffix arrays are used in pattern matching, data compression, and bioinformatics for efficient searching and analysis.


import java.util.Arrays;
public class SuffixArray {
  private String text;
  private Integer[] suffixArr;
  public SuffixArray(String text) {
    this.text = text;
    this.suffixArr = new Integer[text.length()];
    for (int i = 0; i < text.length(); i++) {
      suffixArr[i] = i;
    }
    Arrays.sort(suffixArr, (a, b) -> text.substring(a).compareTo(text.substring(b)));
  }
  public Integer[] getSuffixArray() {
    return suffixArr;
  }
  public static void main(String[] args) {
    SuffixArray sa = new SuffixArray("banana");
    System.out.println(Arrays.toString(sa.getSuffixArray()));
  }
}
    

Efficient Searching

With a suffix array, searching for a substring becomes a binary search problem, making it efficient for large texts.

Console Output:

[5, 3, 1, 0, 4, 2]

Trie Data Structure

Overview

A trie is a tree-like data structure that stores a dynamic set of strings, where the keys are usually strings. It is used for efficient retrieval of a key in a dataset of strings.

Operations

Tries support operations like insert, search, and delete efficiently. They are widely used in autocomplete features and spell checkers.


class TrieNode {
  TrieNode[] children = new TrieNode[26];
  boolean isEndOfWord;
  TrieNode() {
    isEndOfWord = false;
    for (int i = 0; i < 26; i++)
      children[i] = null;
  }
}
public class Trie {
  private TrieNode root;
  public Trie() {
    root = new TrieNode();
  }
  public void insert(String key) {
    int level;
    int length = key.length();
    int index;
    TrieNode pCrawl = root;
    for (level = 0; level < length; level++) {
      index = key.charAt(level) - 'a';
      if (pCrawl.children[index] == null)
        pCrawl.children[index] = new TrieNode();
      pCrawl = pCrawl.children[index];
    }
    pCrawl.isEndOfWord = true;
  }
  public boolean search(String key) {
    int level;
    int length = key.length();
    int index;
    TrieNode pCrawl = root;
    for (level = 0; level < length; level++) {
      index = key.charAt(level) - 'a';
      if (pCrawl.children[index] == null)
        return false;
      pCrawl = pCrawl.children[index];
    }
    return (pCrawl != null && pCrawl.isEndOfWord);
  }
  public static void main(String args[]) {
    String keys[] = {"the", "a", "there", "answer", "any", "by", "bye", "their"};
    Trie trie = new Trie();
    for (int i = 0; i < keys.length; i++)
      trie.insert(keys[i]);
    System.out.println(trie.search("the")); // true
    System.out.println(trie.search("these")); // false
  }
}
    

Efficient Storage and Retrieval

Tries provide efficient storage and retrieval by minimizing the number of character comparisons needed to locate a string.

Console Output:

true false

Longest Common Subsequence (LCS)

Definition

The longest common subsequence is the longest sequence that appears in both strings in the same order but not necessarily consecutively. It is a classic computer science problem with applications in diff tools and bioinformatics.

Dynamic Programming Approach

The LCS problem can be solved using dynamic programming by building a matrix to store solutions of subproblems.


public class LCS {
  int lcs(char[] X, char[] Y, int m, int n) {
    int L[][] = new int[m + 1][n + 1];
    for (int i = 0; i <= m; i++) {
      for (int j = 0; j <= n; j++) {
        if (i == 0 || j == 0)
          L[i][j] = 0;
        else if (X[i - 1] == Y[j - 1])
          L[i][j] = L[i - 1][j - 1] + 1;
        else
          L[i][j] = Math.max(L[i - 1][j], L[i][j - 1]);
      }
    }
    return L[m][n];
  }
  public static void main(String[] args) {
    LCS lcs = new LCS();
    String s1 = "AGGTAB";
    String s2 = "GXTXAYB";
    System.out.println("Length of LCS is " + lcs.lcs(s1.toCharArray(), s2.toCharArray(), s1.length(), s2.length()));
  }
}
    

Optimal Substructure

The LCS problem exhibits optimal substructure, allowing it to be broken down into smaller, manageable subproblems.

Console Output:

Length of LCS is 4

Edit Distance (Levenshtein Distance)

Concept

Edit distance measures how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. Operations include insertion, deletion, or substitution of a single character.

Applications

Edit distance is used in spell checking, DNA sequencing, and natural language processing to determine similarity between strings.


public class EditDistance {
  int min(int x, int y, int z) {
    if (x <= y && x <= z) return x;
    if (y <= x && y <= z) return y;
    return z;
  }
  int editDist(String str1, String str2, int m, int n) {
    int dp[][] = new int[m + 1][n + 1];
    for (int i = 0; i <= m; i++) {
      for (int j = 0; j <= n; j++) {
        if (i == 0)
          dp[i][j] = j;
        else if (j == 0)
          dp[i][j] = i;
        else if (str1.charAt(i - 1) == str2.charAt(j - 1))
          dp[i][j] = dp[i - 1][j - 1];
        else
          dp[i][j] = 1 + min(dp[i][j - 1], dp[i - 1][j], dp[i - 1][j - 1]);
      }
    }
    return dp[m][n];
  }
  public static void main(String args[]) {
    EditDistance ed = new EditDistance();
    String str1 = "sunday";
    String str2 = "saturday";
    System.out.println("Edit Distance is " + ed.editDist(str1, str2, str1.length(), str2.length()));
  }
}
    

Dynamic Programming Solution

The edit distance problem can be solved using dynamic programming by filling a matrix with solutions to subproblems, ensuring optimal results.

Console Output:

Edit Distance is 3

logo of wikigalaxy

Newsletter

Subscribe to our newsletter for weekly updates and promotions.

Privacy Policy

 • 

Terms of Service

Copyright © WikiGalaxy 2025