
Trie樹,即字典樹,又稱單詞查找樹或鍵樹,是一種樹形結構,典型應用是用于統計,排序和保存大量的字符串(但不僅限于字符串),所以經常被搜索引擎系統用于文本詞頻統計。它的優點是:利用字符串的公共前綴來減少查詢時間,最大限度地減少無謂的字符串比較,查詢效率比哈希樹高。
(資料圖)
Trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. It is one of those data-structures that can be easily implemented.
最大限度地減少無謂的字符串比較,查詢效率比較高。核心思想是空間換時間,利用字符串的公共前綴來降低查詢時間的開銷以達到提高效率的目的。
插入、查找的時間復雜度均為O(N),其中N為字符串長度。空間復雜度是26^n級別的,非常龐大(可采用雙數組實現改善)。以字符串”hi”和”經海路”的數據為例:
Java的數據結構定義:
@Datapublic class TrieTreeNode { private Character data; private Mapchildren; private boolean isEnd; // 前綴,冗余信息,可選 private String prefix; // 后綴,冗余信息,可選 private String suffix;}
如果只是處理26個英文字符,data可以通過children數組是否為空來判斷。如果處理程序,默認children為空來判斷是否為最后一個節點,則isEnd字段可以省略。前綴prefix和suffix可以在數據處理的時候,方便拿到當前節點前綴和后綴信息,如果不需要可以去除。
public class KeyWord implements Serializable { /** * 關鍵詞內容 */ private String word;//其他省略}
public class KWSeeker { /** * 所有的關鍵詞 */ private SetsensitiveWords; /** * 關鍵詞樹 */ private Map wordsTree = new ConcurrentHashMap (); /** * 最短的關鍵詞長度。用于對短于這個長度的文本不處理的判斷,以節省一定的效率 */ private int wordLeastLen = 0;//其他處理方法,省略}
/** * 將指定的詞構造到一棵樹中。 * * @param tree 構造出來的樹 * @param word 指定的詞 * @param KeyWord 對應的詞 * @return */public static MapmakeTreeByWord(Map tree, String word, KeyWord KeyWord) { if (StringUtils.isEmpty(word)) { tree.putAll(EndTagUtil.buind(KeyWord)); return tree; } String next = word.substring(0, 1); Map nextTree = tree.get(next); if (nextTree == null) { nextTree = new HashMap (); } // 遞歸構造樹結構 tree.put(next, makeTreeByWord(nextTree, word.substring(1), KeyWord)); return tree;}
對關鍵詞字符串,逐個字符進行構建。
/** * 構造、生成詞庫樹。并返回所有敏感詞中最短的詞的長度。 * * @param sensitiveWords 詞庫 * @param wordsTree 聚合詞庫的樹 * @return 返回所有敏感詞中最短的詞的長度。 */public int generalTree(SetsensitiveWords, Map wordsTree) { if (sensitiveWords == null || sensitiveWords.isEmpty() || wordsTree == null) { return 0; } wordsTreeTmp.clear(); int len = 0; for (KeyWord kw : sensitiveWords) { if (len == 0) { len = kw.getWordLength(); } else if (kw.getWordLength() < len) { len = kw.getWordLength(); } AnalysisUtil .makeTreeByWord(wordsTreeTmp, StringUtils.trimToEmpty(kw.getWord()), kw); } wordsTree.clear(); wordsTree.putAll(wordsTreeTmp); return len;}
/** * 將文本中的關鍵詞提取出來。 */public Listprocess(Map wordsTree, String text, AbstractFragment fragment, int minLen) { // 詞的前面一個字 String pre = null; // 詞匹配的開始位置 int startPosition = 0; // 返回結果 List rs = new ArrayList (); while (true) { try { if (wordsTree == null || wordsTree.isEmpty() || StringUtils.isEmpty(text)) { return rs; } if (text.length() < minLen) { return rs; } String chr = text.substring(0, 1); text = text.substring(1); Map nextWord = wordsTree.get(chr); // 沒有對應的下一個字,表示這不是關鍵詞的開頭,進行下一個循環 if (nextWord == null) { pre = chr; continue; } List keywords = Lists.newArrayList(); KeyWord kw = AnalysisUtil.getSensitiveWord(chr, pre, nextWord, text, keywords); if (keywords == null || keywords.size() == 0) { // 沒有匹配到完整關鍵字,下一個循環 pre = chr; continue; } for (KeyWord tmp : keywords) { // 同一個word多次出現記錄在一起 SensitiveWordResult result = new SensitiveWordResult(startPosition, tmp.getWord()); int index = rs.indexOf(result); if (index > -1) { rs.get(index).addPosition(startPosition, tmp.getWord()); } else { rs.add(result); } } // 從text中去除當前已經匹配的內容,進行下一個循環匹配 // 這行注釋了,避免"中國人",導致"國人"。搜索不出來,逐個字符遍歷 // text = text.substring(kw.getWordLength() - 1); pre = kw.getWord().substring(kw.getWordLength() - 1, kw.getWordLength()); continue; } finally { if (pre != null) { startPosition = startPosition + pre.length(); } } }}/** * 查詢文本開頭的詞是否在詞庫樹中,如果在,則返回對應的詞,如果不在,則返回null。return 返回找到的最長關鍵詞 * * @param append 追加的詞 * @param pre 詞的前一個字,如果為空,則表示前面沒有內容 * @param nextWordsTree 下一層樹 * @param text 剩余的文本內容 * @param keywords 返回的keywords,可能多個 * @return 返回找到的最長關鍵詞 */public static KeyWord getSensitiveWord(String append, String pre, Map nextWordsTree, String text, List keywords) { if (nextWordsTree == null || nextWordsTree.isEmpty()) { return null; } Map endTag = nextWordsTree.get(EndTagUtil.TREE_END_TAG); // 原始文本已到末尾 if (StringUtils.isEmpty(text)) { // 如果有結束符,則表示匹配成功,沒有,則返回null if (endTag != null) { keywords.add(checkPattern(getKeyWord(append, endTag), pre, null)); return checkPattern(getKeyWord(append, endTag), pre, null); } else { return null; } } String next = text.substring(0, 1); String suffix = text.substring(0, 1); Map nextTree = nextWordsTree.get(next); // 沒有遇到endTag,繼續匹配 if (endTag == null) { if (nextTree != null && nextTree.size() > 0) { // 沒有結束標志,則表示關鍵詞沒有結束,繼續往下走。 return getSensitiveWord(append + next, pre, nextTree, text.substring(1), keywords); } // 如果沒有下一個匹配的字,表示匹配結束! return null; } else { // endTag , 添加關鍵字 KeyWord tmp = getKeyWord(append, endTag); keywords.add(checkPattern(tmp, pre, suffix)); } // 有下一個匹配的詞則繼續匹配,一直取到最大的匹配關鍵字 KeyWord tmp = null; if (nextTree != null && nextTree.size() > 0) { // 如果大于0,則表示還有更長的詞,繼續往下找 tmp = getSensitiveWord(append + next, pre, nextTree, text.substring(1), keywords); if (tmp == null) { // 沒有更長的詞,則就返回這個詞。在返回之前,先判斷它是模糊的,還是精確的 tmp = getKeyWord(append, endTag); } return checkPattern(tmp, pre, suffix); } // 沒有往下的詞了,返回該關鍵詞。 return checkPattern(getKeyWord(append, endTag), pre, suffix);}
思路是對某個字符串text,逐個字符ch,獲取ch對應的詞庫樹的children,然后獲取匹配到的單個或多個結果,將匹配到的關鍵詞在text中的開始和結束下標進行記錄,如后續需要html標記,或者字符替換可直接使用。如果未能在詞庫樹中找到對應的ch的children,或者詞庫樹的children未能匹配到去除ch的子字符串,則繼續循環。具體可再詳細讀一下代碼。
Redis實現了不定長壓縮前綴的radix tree,用在集群模式下存儲slot對應的的所有key信息。
/* Representation of a radix tree as implemented in this file, that contains * the strings "foo", "foobar" and "footer" after the insertion of each * word. When the node represents a key inside the radix tree, we write it * between [], otherwise it is written between (). * * This is the vanilla representation: * * (f) "" * \ * (o) "f" * \ * (o) "fo" * \ * [t b] "foo" * / \ * "foot" (e) (a) "foob" * / \ * "foote" (r) (r) "fooba" * / \ * "footer" [] [] "foobar" * * However, this implementation implements a very common optimization where * successive nodes having a single child are "compressed" into the node * itself as a string of characters, each representing a next-level child, * and only the link to the node representing the last character node is * provided inside the representation. So the above representation is turned * into: * * ["foo"] "" * | * [t b] "foo" * / \ * "foot" ("er") ("ar") "foob" * / \ * "footer" [] [] "foobar" * * However this optimization makes the implementation a bit more complex. * For instance if a key "first" is added in the above radix tree, a * "node splitting" operation is needed, since the "foo" prefix is no longer * composed of nodes having a single child one after the other. This is the * above tree and the resulting node splitting after this event happens: * * * (f) "" * / * (i o) "f" * / \ * "firs" ("rst") (o) "fo" * / \ * "first" [] [t b] "foo" * / \ * "foot" ("er") ("ar") "foob" * / \ * "footer" [] [] "foobar" * * Similarly after deletion, if a new chain of nodes having a single child * is created (the chain must also not include nodes that represent keys), * it must be compressed back into a single node. * */#define RAX_NODE_MAX_SIZE ((1<<29)-1)typedef struct raxNode { uint32_t iskey:1; /* Does this node contain a key? */ uint32_t isnull:1; /* Associated value is NULL (don"t store it). */ uint32_t iscompr:1; /* Node is compressed. */ uint32_t size:29; /* Number of children, or compressed string len. */ unsigned char data[];} raxNode;typedef struct rax { raxNode *head; uint64_t numele; uint64_t numnodes;} rax;typedef struct raxStack { void **stack; /* Points to static_items or an heap allocated array. */ size_t items, maxitems; /* Number of items contained and total space. */ void *static_items[RAX_STACK_STATIC_ITEMS]; int oom; /* True if pushing into this stack failed for OOM at some point. */} raxStack;
如Redis源碼中的注釋所寫,RAX進行了一些優化,并不會將一個字符串直接按照每個字符進行樹的構建,而是在Insert有沖突時節點分割處理,在Delete時如果子節點和父節點都只有一個,則需要進行合并操作。對于RAX有興趣的同學,可以看一下rax.h、rax.c的相關代碼。
Linux radix樹最廣泛的用途是用于內存管理,結構address_space通過radix樹跟蹤綁定到地址映射上的核心頁,該radix樹允許內存管理代碼快速查找標識為dirty或writeback的頁。Linux radix樹的API函數在lib/radix-tree.c中實現。Linux基數樹(radix tree)是將指針與long整數鍵值相關聯的機制,它存儲有效率,并且可快速查詢,用于指針與整數值的映射(如:IDR機制)、內存管理等。
struct radix_tree_node { unsigned int path; unsigned int count; union { struct { struct radix_tree_node *parent; void *private_data; }; struct rcu_head rcu_head; }; /* For tree user */ struct list_head private_list; void __rcu *slots[RADIX_TREE_MAP_SIZE]; unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];};
關于Linux內核使用Radix Tree的具體代碼,有興趣的同學可以繼續深入。
Trie樹在單詞搜索、統計、排序等領域有大量的應用。文章從基礎概念到具體的臟話過濾的應用、Redis的RAX和Linux內核的Radix Tree對Trie樹做了介紹。數據結構和算法是程序高性能的基礎,本文拋磚引玉,希望大家對Trie樹有所了解,并在未來開發過程實踐和應用Trie樹解決中類似情景的問題。