A Generative Model for Finding Industry-Specific Keywords

Nov 19, 2019

Company descriptions provide a wealth of information regarding a company’s business model and can be a very useful source of information for finding company peers. However, owing to the nature of natural language, equivalent concepts can be expressed in many different ways, which can confound peer suggestions and lead to missed peers.

To alleviate this problem, it seems useful to attempt to obtain a list of industry-specific keywords, which can be used to enhance a company’s description and hopefully overcome differences of expression and meaning when suggesting peers.

Below, we describe a generative statistical model that can be used to find industry-specific keywords, while simultaneously accounting for common, uninformative background words that naturally occur in company descriptions.


Annotated Latent Beta Allocation (ALBA): a simpler form of LDA (Latent Dirichlet Allocation)


The model assumes that words are generated from unigram language models: either a generic background model φ_BG that every document is allowed to sample from, or an industry-specific model φ_I that only documents pertaining to that industry can sample from. Whether a word is sampled from the background model or the industry model is governed by a document-specific parameter θ_d: a single number giving, for a document d, the probability that a word is drawn from the industry-specific model rather than the background model.

Therefore, to generate a document d, we first sample θ_d from a Beta prior parametrised by α. Then, for each word w, we flip a biased coin whose probability of success is θ_d, obtaining an indicator z_w. If z_w is zero, we sample the word from φ_BG; otherwise we sample it from φ_m[d], where m[d] denotes the industry that document d belongs to. We proceed in this fashion until all words have been sampled. Note that this model is very similar to the popular LDA model.
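As a concrete illustration, here is a minimal Python sketch of that generative story; the toy vocabulary, the φ vectors and the prior parameters below are made-up placeholders rather than fitted values.

import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and unigram language models (placeholder probabilities).
vocab = ["the", "company", "provides", "cloud", "software", "vehicles", "engines"]
phi_bg = np.array([0.30, 0.25, 0.25, 0.05, 0.05, 0.05, 0.05])   # background topic
phi_industry = {                                                  # one topic per industry
    "software":   np.array([0.02, 0.03, 0.05, 0.40, 0.40, 0.05, 0.05]),
    "automotive": np.array([0.02, 0.03, 0.05, 0.05, 0.05, 0.40, 0.40]),
}
alpha = (2.0, 2.0)  # parameters of the Beta prior over theta_d

def generate_document(industry, n_words=20):
    """Sample one toy document according to the generative process above."""
    theta_d = rng.beta(*alpha)                  # document-specific mixing weight
    words = []
    for _ in range(n_words):
        z_w = rng.random() < theta_d            # biased coin flip
        phi = phi_industry[industry] if z_w else phi_bg
        words.append(rng.choice(vocab, p=phi))  # draw a word from the chosen topic
    return theta_d, words

theta, doc = generate_document("automotive")
print(round(theta, 2), doc)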

In plate notation, the model looks as follows:

Inference

The above model is fairly simple, and we could immediately do inference on it using e.g. MCMC. However, given the volume of data we have, and the fact that MCMC loses its stationarity guarantees when trained in a mini-batch fashion, we elected to use variational inference. As shown below, careful treatment of the E and M steps allows us to train this model in an embarrassingly parallel fashion without losing any mathematical guarantees of convergence. We also present an implementation of variational inference and show that it converges to apparently sensible topics on real company data.

We want to maximise the objective L = P(X | φ, α), where X is our collection of documents in bag-of-words format and α parametrises the Beta prior over θ_d.

We can do this using the EM algorithm, where at the E-step we aim to find a distribution Q(z, θ) such that the lower bound

E_Q[ log P(X, z, θ | φ, α) ] − E_Q[ log Q(z, θ) ]

on log L is maximised, with the complete-data log-likelihood of the words given by

log P(X, z | θ, φ) = Σ_d Σ_{i=1..V} count_d(i) ( [z_di = 1] (log θ_d + log φ_m[d](i)) + [z_di = 0] (log(1 − θ_d) + log φ_BG(i)) ).

Here [z_di = t] indicates whether word i of document d comes from topic t (which can be either the background model, t = 0, or the industry-specific model, t = 1), V is the size of the vocabulary, and count_d(i) is the count of word i in document d.

We can assume a factorised form Q(z, θ) = q(z) q(θ) and then use the mean-field approximation to compute q(z) and q(θ) iteratively.
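For reference, the mean-field coordinate updates take the usual form (a standard result, stated here for completeness):

log q(z) = E_q(θ)[ log P(X, z, θ | φ, α) ] + const,
log q(θ) = E_q(z)[ log P(X, z, θ | φ, α) ] + const.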

Thus, we get

q(θ_d) = Beta( α_1 + Σ_{i=1..V} count_d(i) γ_di(1),  α_0 + Σ_{i=1..V} count_d(i) γ_di(0) ),

which is a beta distribution (writing the prior as α = (α_1, α_0)). We use γ_di(t) to denote E_q[z_di = t], which can be interpreted as our belief that word i of document d came from topic t = 0, 1, i.e. the background or the industry-specific topic respectively.

Similarly,

γ_di(1) ∝ φ_m[d](i) exp( ψ(a_d) − ψ(a_d + b_d) ),   γ_di(0) ∝ φ_BG(i) exp( ψ(b_d) − ψ(a_d + b_d) ),

where a_d and b_d are the current parameters of q(θ_d) and ψ is the digamma function.

Thus, the E-step can be implemented as follows:

Randomly initialise q(θ)
While the change in q(θ) is >= 1e-5 do:
    Update q(z) for each document and word
    Update q(θ) for each document

Unsurprisingly, these equations are very similar to vanilla LDA updates.
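To make these updates concrete, here is a minimal NumPy sketch of the E-step for a single document, following the factorisation above; the function name and the argument names (counts, phi_bg, phi_ind, alpha) are illustrative assumptions rather than a fixed API.

import numpy as np
from scipy.special import digamma

def e_step_document(counts, phi_bg, phi_ind, alpha, tol=1e-5, max_iter=100):
    """Mean-field E-step for one document.

    counts  : (V,) word counts of the document
    phi_bg  : (V,) background unigram model (assumed strictly positive)
    phi_ind : (V,) unigram model of the document's industry (assumed strictly positive)
    alpha   : (alpha_1, alpha_0) Beta prior, industry side first
    Returns the Beta parameters (a_d, b_d) of q(theta_d) and gamma,
    a (V, 2) array of responsibilities (column 0 = background, column 1 = industry).
    """
    a_d, b_d = alpha                                                # initialise q(theta_d) at the prior
    for _ in range(max_iter):
        # Update q(z): responsibilities of each word type under the two topics.
        log_w1 = np.log(phi_ind) + digamma(a_d) - digamma(a_d + b_d)   # industry topic
        log_w0 = np.log(phi_bg) + digamma(b_d) - digamma(a_d + b_d)    # background topic
        gamma1 = 1.0 / (1.0 + np.exp(log_w0 - log_w1))
        gamma = np.stack([1.0 - gamma1, gamma1], axis=1)

        # Update q(theta_d) from the expected topic counts.
        a_new = alpha[0] + counts @ gamma[:, 1]
        b_new = alpha[1] + counts @ gamma[:, 0]
        converged = abs(a_new - a_d) < tol and abs(b_new - b_d) < tol
        a_d, b_d = a_new, b_new
        if converged:
            break
    return (a_d, b_d), gamma

Because these updates only touch one document at a time, the E-step can be run over documents independently, which is what makes the embarrassingly parallel training mentioned above possible.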

In the M-step, we use the q(z) and q(θ) computed in the E-step to maximise

E_{q(z) q(θ)}[ log P(X, Z, θ | φ, α) ]

with respect to φ, under the constraints that Σ_i φ_t(i) = 1 (summing over the vocabulary) for every topic t, and φ_t(i) ≥ 0 for all i and t.

Thus, we obtain the following update:

φ_I(i) ∝ Σ_{d : m[d] = I} count_d(i) γ_di(1) for each industry topic I, and φ_BG(i) ∝ Σ_d count_d(i) γ_di(0) for the background topic,

which is in fact identical to the vanilla LDA update, except that the sums run only over the documents belonging to a given industry (e.g. only companies whose industry is automotive are considered in the sum for the automotive industry topic). Note that all documents are considered when estimating the background topic, so it naturally converges to containing the words that are common across documents.
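A corresponding sketch of the M-step, again with illustrative names; docs is assumed to be an iterable of (counts, industry, gamma) triples produced by the E-step.

import numpy as np

def m_step(docs, industries, vocab_size, eps=1e-12):
    """Re-estimate the unigram models from the E-step responsibilities."""
    phi_bg = np.full(vocab_size, eps)                    # small epsilon avoids zero probabilities
    phi_ind = {ind: np.full(vocab_size, eps) for ind in industries}

    for counts, industry, gamma in docs:
        phi_bg += counts * gamma[:, 0]                   # every document feeds the background topic
        phi_ind[industry] += counts * gamma[:, 1]        # only same-industry documents feed this topic

    phi_bg /= phi_bg.sum()                               # normalise so each phi sums to one
    for ind in phi_ind:
        phi_ind[ind] /= phi_ind[ind].sum()
    return phi_bg, phi_ind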
