Going from natural language to structured Elasticsearch queries

OakNorth stores a lot of structured information on companies in an ES index, giving users an elegant way to search through a rich company database. However, our credit analysts frequently want to filter their searches on certain field properties (e.g. the operating country of a company).

While ES exposes neat programmatic ways of achieving this, that is lost on our non-technical users, who just want to get some quick info for a credit analysis. Asking them to compose a REST request to ES for each such query would be overkill. To help them with this filtering, we developed a very simple but useful tool that translates their free-text queries (e.g. “fast food restaurants in Germany”) into a proper REST request to ES, filtering search results by appropriately identified fields (e.g. country = “Germany”) from our index.
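For concreteness, the request body we aim to produce for that example query might look something like the sketch below; the field names (‘description’, ‘country’) are illustrative rather than our actual index mapping.

# Illustrative ES request body for 'fast food restaurants in Germany'.
# The field names ('description', 'country') are made up for this example.
target_request = {
    'query': {
        'bool': {
            'must': {'match': {'description': 'fast food restaurants'}},
            'filter': [{'term': {'country': 'Germany'}}],
        }
    }
}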

Preprocessing natural language

A first step towards a workable solution is normalising many of the ambiguities that come with natural language. This is the purpose of our preprocessing step. We use https://github.com/zalandoresearch/flair for this purpose, since it is a highly flexible, state-of-the-art NLP library. It is conveniently pre-trained on well-curated corpora and can be used out of the box for basic (yet non-trivial) NLP tasks such as POS tagging and NER.

With that in mind, our preprocessing aims to replace certain linguistic constructs with pre-defined tags. This removes a lot of the variability inherent to natural language and makes downstream tasks more manageable. As a first step, we replace all identified cardinals with a special <cd> placeholder tag. This is easily achieved using the provided POS tagger from flair. Secondly, we replace all mentions of locations with <gpe> and <loc> placeholders to indicate countries and regions respectively. This is done with the provided ner-ontonotes tagger from flair, and the tags match their respective characterisations for this model.
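A minimal sketch of this preprocessing step is shown below, assuming flair’s pre-trained ‘pos’ and ‘ner-ontonotes’ models; note that the tag-access API differs slightly between flair versions.

# A minimal sketch of the preprocessing step, assuming flair's pre-trained
# 'pos' and 'ner-ontonotes' models. Newer flair releases use get_label
# instead of get_tag to read a token's predicted tag.
from flair.data import Sentence
from flair.models import SequenceTagger

pos_tagger = SequenceTagger.load('pos')
ner_tagger = SequenceTagger.load('ner-ontonotes')

def preprocess(query: str) -> str:
    sentence = Sentence(query)
    pos_tagger.predict(sentence)
    ner_tagger.predict(sentence)

    replacements = []
    # Cardinals (Penn Treebank POS tag 'CD') become a <cd> placeholder.
    for token in sentence:
        if token.get_tag('pos').value == 'CD':
            replacements.append((token.text, '<cd>'))
    # Countries (GPE) and regions (LOC) found by the OntoNotes NER model
    # become <gpe> and <loc> placeholders respectively.
    for span in sentence.get_spans('ner'):
        if span.tag == 'GPE':
            replacements.append((span.text, '<gpe>'))
        elif span.tag == 'LOC':
            replacements.append((span.text, '<loc>'))

    for original, tag in replacements:
        query = query.replace(original, tag)
    return query

print(preprocess('fast food restaurants in Germany'))
# -> 'fast food restaurants in <gpe>'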

We now have a much tamer input query that we can (fairly) easily convert to an ES request. We do this using a popular computer science construct called a Context Free Grammar (CFG). Let’s dive in.

From Context Free Grammars to ES requests

A CFG is a fundamental computer science concept: a set of rules (productions) that describe how a speaker can construct sentences, which in theory allows a computer to parse natural language.

The nltk library in Python provides a convenient interface for defining such constructs in terms of strings depicting construction rules. Let’s show how it works by means of an example. The grammar below might be used to construct phrases from a famous World of Warcraft YouTube clip.

from nltk.grammar import CFG

# Multi-word phrases are split into single-word terminals so that the
# grammar matches input tokenised with str.split().
raid_grammar = CFG.fromstring('''
S -> MOVE S | SPELL S | ALERT S | MOVE | SPELL | ALERT
MOVE -> CHAR LOC
SPELL -> 'More' 'Dots' | 'DPS' 'slowly'
LOC -> 'Centre' | 'into' 'the' 'Whelps'
CHAR -> 'Lee' | 'Mogrus' | 'Foresight'
ALERT -> 'Whelps!' | CHAR '50' 'dkp' 'MINUS' | 'There' 'is' 'no' 'aggro' 'reset' | 'Do' 'not' 'stand' 'next' 'to' 'other' 'people'
''')

A raid leader starts from the S symbol, having said nothing yet. He can then choose to issue a move, spell or alert instruction, either finishing with that instruction (e.g. S → ALERT) or issuing an instruction and leaving the option open for a further one (e.g. S → ALERT S). If we observe the chain of instructions ‘Lee Centre Foresight 50 dkp MINUS’, we can ascertain that the leader might have produced these instructions using the following rules:

S → MOVE S
MOVE → CHAR LOC
CHAR → 'Lee'
LOC → 'Centre'
S → ALERT  (N.B. this is the S following the MOVE rule in the first row)
ALERT → CHAR '50' 'dkp' 'MINUS'
CHAR → 'Foresight'

Fortunately, nltk already provides parsers that, given a CFG and a tokenised sentence, return the parse trees of that sentence under the grammar.

from nltk import ChartParser

parser = ChartParser(raid_grammar)
example_sentence = 'Lee Centre Foresight 50 dkp MINUS'

# There can be more than one way to parse a given sentence, so iterate over all such ways.
for tree in parser.parse(example_sentence.split()):
    # Name and shame people that got 50 dkp minus, using the returned parse tree.
    for alert in tree.subtrees(lambda t: t.label() == 'ALERT'):
        for char in alert.subtrees(lambda t: t.label() == 'CHAR'):
            print(f'{char[0]} got 50 dkp MINUS!')

Extracting field information

We use such a CFG to parse our preprocessed input query and extract relevant field information from it, mapping it to the proper fields in our ES index. As an example, we could use rules of the following form to parse a location indicator.

LOC → COUNTRY | REGION | L COUNTRY | L REGION
COUNTRY → <gpe>
REGION → <loc>
L → <prepositions or other language constructs that can indicate locations, e.g. ‘in’>

These rules can then conveniently capture the fact that we only care about companies in Germany in our example “fast food restaurants in Germany” query. These production rules are only possible because our preprocessing step replaces potential countries or regions with a tame tag that we can use.
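In nltk notation, and assuming preprocessing has already replaced ‘Germany’ with <gpe>, a runnable fragment of these rules might look like the following (the prepositions besides ‘in’ are illustrative):

from nltk import ChartParser
from nltk.grammar import CFG

# The first rule's left-hand side (LOC) becomes the start symbol.
location_grammar = CFG.fromstring('''
LOC -> COUNTRY | REGION | L COUNTRY | L REGION
COUNTRY -> '<gpe>'
REGION -> '<loc>'
L -> 'in' | 'near' | 'within'
''')

# After preprocessing, 'in Germany' has become 'in <gpe>'.
for tree in ChartParser(location_grammar).parse('in <gpe>'.split()):
    print(tree)  # (LOC (L in) (COUNTRY <gpe>))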

Searching for keywords

A full grammar that can parse a preprocessed query for field information about companies can thus be generated without much headache. However, that is not the complete story: in addition to fields, our grammar also has to handle a free-text component (“fast food restaurants” in our case). It would be quite complicated to generate a set of production rules that adequately captures this part of the query, since that would essentially reduce to a CFG of the English language. Fortunately, for our given task we don’t actually need such a strong model, but rather a simple set of production rules that identifies words as being the free text of our query. These words can then be thrown into ES as a free-text search. A set of production rules that can parse our example query could look as follows:

CT -> 'fast' | 'food' | 'restaurants' | CT 'fast' | CT 'food' | CT 'restaurants'

CT is a production rule that simply generates text: it can produce either a word from our vocabulary followed by more text, or just a single word from our vocabulary. Of course, in order to generate all required words from all companies, we’d need a much larger set of production rules (one for each word). At first sight this looks infeasible; however, these rules can be programmatically populated in Python by formatting the string that defines our CFG. We first collect all words present in our company descriptions, then run a for loop over them to generate the above production rules. This might look as follows:

f'CT -> {"|".join(["{kw} CT | {kw}".format(kw=repr(kw)) for kw in vocab])}'
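Putting the pieces together, a toy version of the full grammar for our example might look as follows; the top-level S rule and the three-word vocab are illustrative:

from nltk import ChartParser
from nltk.grammar import CFG

# A toy vocabulary; in practice this is collected from company descriptions.
vocab = ['fast', 'food', 'restaurants']
ct_rule = f'CT -> {"|".join(["{kw} CT | {kw}".format(kw=repr(kw)) for kw in vocab])}'

company_grammar = CFG.fromstring('\n'.join([
    "S -> CT LOC | CT",  # free text, optionally followed by a location
    "LOC -> COUNTRY | REGION | L COUNTRY | L REGION",
    "COUNTRY -> '<gpe>'",
    "REGION -> '<loc>'",
    "L -> 'in'",
    ct_rule,
]))

parser = ChartParser(company_grammar)
for tree in parser.parse('fast food restaurants in <gpe>'.split()):
    print(tree)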

Since we expect each query to cover only a small subset of our vocab, parsing remains practical despite the apparent size of our grammar.
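Finally, here is a hedged sketch of how a parse tree could be turned into an ES request body. The helper name, the field names and the mapping from placeholders back to the original entity text (recorded during preprocessing) are all assumptions for illustration:

def build_es_request(tree, placeholder_values):
    """Turn a parse tree into an ES request body (illustrative sketch).

    placeholder_values maps tags back to the original entity text recorded
    during preprocessing, e.g. {'<gpe>': 'Germany'}.
    """
    filters, keywords = [], []
    for subtree in tree.subtrees():
        if subtree.label() == 'COUNTRY':
            filters.append({'term': {'country': placeholder_values['<gpe>']}})
        elif subtree.label() == 'REGION':
            filters.append({'term': {'region': placeholder_values['<loc>']}})
        elif subtree.label() == 'CT':
            # CT is recursive; take only the terminal directly under this node.
            keywords.extend(child for child in subtree if isinstance(child, str))
    return {
        'query': {
            'bool': {
                'must': {'match': {'description': ' '.join(keywords)}},
                'filter': filters,
            }
        }
    }

# Using the tree parsed above:
# build_es_request(tree, {'<gpe>': 'Germany'})
# filters on country = 'Germany' and searches for 'fast food restaurants'.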

Conclusion

We have presented a simple yet effective algorithm that translates free-text queries into structured ES requests. It works with the help of CFGs, which conveniently map words in a free-text query to relevant fields or correctly identify them as free text. It is thus possible to construct an ES query that correctly differentiates between filtering conditions and free-text search, and hopefully returns results closer to what the credit analyst intended to find.
