How Devo indexes data
One of the biggest challenges faced by big data solutions is not collecting and storing large amounts of data but quickly finding the "needles in the haystack". To accelerate searches for very specific information, Devo uses a unique two-level system for indexing data: a tag index and a token index.
The tag index is the primary index used to locate the requested data across multiple data nodes. This index is used for every query and every time you call upon a table in the Finder.
The primary function of the tag index is to identify the files that contain the events you are requesting.
Every query calls a data table, and every table is associated internally with (usually) one or (sometimes) more tags. Because each event is saved in a file whose path identifies the complete event tag, the domain, and the date, this index can quickly isolate the files containing the requested data.
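The file selection described above can be sketched in a few lines of Python. This is only an illustration: the path layout `<domain>/<tag>/<YYYY-MM-DD>.log`, the function names, and the sample tags are assumptions made for the example, not Devo's actual storage format.

```python
from datetime import date
from typing import NamedTuple


class FileKey(NamedTuple):
    domain: str
    tag: str
    day: date


def parse_path(path: str) -> FileKey:
    # Hypothetical layout: <domain>/<tag>/<YYYY-MM-DD>.log
    domain, tag, filename = path.split("/")
    day = date.fromisoformat(filename.rsplit(".", 1)[0])
    return FileKey(domain, tag, day)


def select_files(paths, domain, tags, start, end):
    """Keep only the files whose path matches the domain, one of the
    table's tags, and the requested date range (inclusive)."""
    selected = []
    for path in paths:
        key = parse_path(path)
        if key.domain == domain and key.tag in tags and start <= key.day <= end:
            selected.append(path)
    return selected


paths = [
    "acme/firewall.cisco.asa/2024-01-01.log",
    "acme/firewall.cisco.asa/2024-01-05.log",
    "acme/web.apache.access/2024-01-01.log",
    "other/firewall.cisco.asa/2024-01-01.log",
]
matches = select_files(
    paths, "acme", {"firewall.cisco.asa"}, date(2024, 1, 1), date(2024, 1, 2)
)
```

The point of the sketch is that no file has to be opened to decide whether it is relevant: the path alone carries the tag, the domain, and the date.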
This index sometimes works in combination with the token index.
The second index, the token index, contains all tokens identified in the data saved across all data nodes. An internal indexing task runs regularly to scan recently ingested events, identify all of their tokens, and add them to this index. This is an inverted index, meaning that every token is mapped to the individual events in which it has been found.
The primary function of the token index is to identify the events that contain the data you seek within the files already identified by the tag index.
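As a rough illustration of what an inverted index looks like, the following Python sketch maps each token to the set of events that contain it. The event IDs and sample events are invented, and the tokenizer is deliberately simplified to runs of alphanumeric characters; it is not Devo's indexing task.

```python
import re
from collections import defaultdict


def build_token_index(events):
    """Map every token to the IDs of the events in which it appears."""
    index = defaultdict(set)
    for event_id, raw in events.items():
        # Simplified tokenizer: any run of ASCII letters and digits.
        for token in re.findall(r"[A-Za-z0-9]+", raw):
            index[token].add(event_id)
    return index


# Hypothetical events, keyed by an invented event ID.
events = {
    1: "access denied src=10.0.1.2",
    2: "access granted user=alice",
}
index = build_token_index(events)
```

With this structure, finding every event that contains the token alice is a single dictionary lookup (`index["alice"]`) rather than a scan of every event.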
So, what's a token?
A token is simply a string of alphanumeric characters delimited by non-alphanumeric ASCII characters (symbols and spaces) in the raw event as it was delivered to Devo. Devo also recognizes as tokens any values that match the IPv4, IPv6, or MAC address-like data formats. Therefore, Devo identifies as tokens not only 10.0.1.2 and aa:bb:cc:dd but also their component parts (10, 0, 1, 2, aa, bb, cc, and dd), because these component parts are delimited by ASCII symbols (the periods and colons).
Here's an excerpt from a firewall event. In green are all of the substrings identified as tokens. In blue is highlighted a complete standard IP address, also recognized and indexed as a token.
Since almost all raw data sent to Devo uses spaces or other ASCII symbols as separators between field values, the first segment of an event (up to the first ASCII symbol) is also identified as a token, such as the token access in the example above.
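The token rules above can be approximated with a short Python sketch. The regular expressions are assumptions chosen to reproduce the examples in this article (10.0.1.2 and aa:bb:cc:dd); IPv6 recognition is omitted for brevity, and Devo's real tokenizer may differ.

```python
import re

# Runs of ASCII letters and digits, delimited by anything else.
ALNUM = re.compile(r"[A-Za-z0-9]+")
# Assumed pattern for a dotted IPv4 address kept as one token.
IPV4 = re.compile(r"(?<![\w.])\d{1,3}(?:\.\d{1,3}){3}(?![\w.])")
# Assumed pattern for MAC address-like values: colon-separated hex pairs.
MAC_LIKE = re.compile(r"(?<![\w:])[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2})+(?![\w:])")


def tokenize(raw):
    """Return the set of tokens found in one raw event."""
    tokens = set(ALNUM.findall(raw))
    tokens.update(IPV4.findall(raw))
    tokens.update(MAC_LIKE.findall(raw))
    return tokens


tokens = tokenize("access src=10.0.1.2 mac=aa:bb:cc:dd")
```

For this hypothetical event, the result contains the component parts (access, src, 10, 0, 1, 2, mac, aa, bb, cc, dd) as well as the complete values 10.0.1.2 and aa:bb:cc:dd.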
When is the token index used?
Once the tag index has located the relevant files in Devo's data nodes, the token index may be used to find the events you're looking for more quickly. Whenever a query launches a search for a string, the query engine determines whether the token index should be consulted. It does this based upon the LINQ operation used and how the search string is formatted. However, only those operations designed to identify string values can trigger the use of this index.
In addition, there are three ways that matches can be found in the token index.
- By searching for an entire token.
- By searching for tokens that begin with a specified prefix.
- By searching for tokens that end with a specified suffix.
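The three kinds of match can be illustrated with a small Python sketch over a sorted list of tokens. The class and method names are hypothetical, and the technique shown (binary search over sorted tokens, plus a second list of reversed tokens for suffix lookups) is one common way to support such queries, not a description of Devo's internals.

```python
import bisect


class TokenLookup:
    def __init__(self, tokens):
        self._tokens = set(tokens)
        self._sorted = sorted(self._tokens)
        # Reversed tokens turn a suffix search into a prefix search.
        self._reversed = sorted(t[::-1] for t in self._tokens)

    def exact(self, token):
        """Match an entire token (case-sensitive)."""
        return token in self._tokens

    def with_prefix(self, prefix):
        """Tokens that begin with the given prefix."""
        return self._scan(self._sorted, prefix)

    def with_suffix(self, suffix):
        """Tokens that end with the given suffix."""
        hits = self._scan(self._reversed, suffix[::-1])
        return sorted(t[::-1] for t in hits)

    @staticmethod
    def _scan(sorted_tokens, prefix):
        # All strings sharing a prefix are contiguous in sorted order.
        start = bisect.bisect_left(sorted_tokens, prefix)
        found = []
        for token in sorted_tokens[start:]:
            if not token.startswith(prefix):
                break
            found.append(token)
        return found


lookup = TokenLookup(["banana", "band", "bandana", "cabana", "Banana"])
```

Here `lookup.with_prefix("ban")` finds banana, band, and bandana, while `lookup.with_suffix("ana")` finds every token ending in ana, including Banana.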
These are the LINQ operations that always use the token index, regardless of how the search string is formatted:
|Operation Name||Case sensitivity||Description|
This operation assumes that the string to search for is a token and therefore always uses the token index. It is a case-sensitive operation, however, so searching for Banana is not the same as searching for banana.
However, if the optional left-extended and right-extended Boolean arguments are used,
will return events where a token ends with
|Starts with (||Case-sensitive|
This operation assumes that the string to search for is the beginning of a token and therefore always uses the token index. Like
This returns events that contain tokens that start with the specified string.
|Ends with (||Case-sensitive|
This operation assumes that the string to search for is the end of a token and therefore always uses the token index. Like
This returns events that contain tokens that end with the specified string.
Since these operations look for an exact match of the string to search for, they always use the token index. While
These return events that contain tokens that exactly match the specified string (either regarding or disregarding case).
|Equal - case insensitive (||Case-insensitive|
These are the LINQ operations that sometimes use the token index, depending upon how the search string is formatted:
|Operation Name||Case sensitivity||Description|
These operations will use the token index if the search string contains all or part of a token. That is to say, if the search string contains an alphanumeric string bounded on the left and/or right by a non-alphanumeric ASCII symbol.
To illustrate this, let's look at how the following query filter would be handled by the query engine.
The token index will be used to accelerate the search for:
The results will be the subset of the events identified by the token index search that contain the full search string.
|Contains - case insensitive (||Case-insensitive|
Is in (
|Is in - case insensitive (||Case-insensitive|
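The rule for operations that only sometimes use the token index can be sketched as follows. This Python example is an interpretation of the rule stated above, under the assumption that an alphanumeric run bounded on both sides is an entire token, one bounded only on the left is the beginning of a token, and one bounded only on the right is the end of a token. The function name and the sample search strings are hypothetical.

```python
import re


def plan_token_lookups(search):
    """List the token index lookups that could narrow down a
    contains-style search, as (kind, text) pairs."""
    plans = []
    for match in re.finditer(r"[A-Za-z0-9]+", search):
        bounded_left = match.start() > 0
        bounded_right = match.end() < len(search)
        if bounded_left and bounded_right:
            kind = "entire"   # delimited on both sides: a whole token
        elif bounded_left:
            kind = "prefix"   # delimited on the left: a token starts with it
        elif bounded_right:
            kind = "suffix"   # delimited on the right: a token ends with it
        else:
            continue          # no delimiter at all: the index cannot help
        plans.append((kind, match.group()))
    return plans


plan = plan_token_lookups("ser=adm")
```

For the search string ser=adm, the sketch plans a suffix lookup for ser and a prefix lookup for adm. The events returned by those lookups would still need to be checked for the full search string, which matches the two-step behavior described above. A search string with no delimiter, such as banana, produces no lookups.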
Learn more about the LINQ filter operations mentioned in this article.