Company descriptions provide a wealth of information about a company's business model and can be a very useful source for finding company peers. However, due to the nature of natural language, equivalent concepts can be expressed in many different ways. This can confound peer suggestion and lead to missed peers.
To alleviate this problem, it is useful to obtain a list of industry-specific keywords, which can be used to enrich a company's description and overcome such differences of expression when suggesting peers.
Below, we describe a generative statistical model that finds industry-specific keywords while simultaneously accounting for the common, uninformative background words that naturally occur in company descriptions.
Annotated Latent Beta Allocation (ALBA): a simpler form of LDA (Latent Dirichlet Allocation)
The model assumes that words are generated from unigram language models: either a generic background model 𝜙𝐵𝐺 that every document may sample from, or an industry-specific model 𝜙𝐼 that only documents belonging to that industry may sample from. Which model a word is drawn from is governed by a document-specific parameter 𝜃𝑑: a single number giving, for document 𝑑, the probability that a word is sampled from the industry-specific model rather than the background model.
Therefore, to generate a document 𝑑, we first sample 𝜃𝑑 from some Beta prior parametrised by 𝛼. Then, for each word 𝑤, we flip a biased coin whose chance of success is 𝜃𝑑, obtaining an indicator 𝑧𝑤. Then if 𝑧𝑤 is zero, we sample a word from 𝜙𝐵𝐺, else we sample a word from 𝜙𝑚[𝑑], where 𝑚[𝑑] indicates which industry 𝑑 belongs to. We proceed in this fashion until we’ve sampled all words. Note that this model is very similar to the popular LDA algorithm.
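As a sanity check, the generative story above can be sketched in a few lines of Python. The vocabulary size, unigram models, and Beta parameters below are toy values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and illustrative unigram models: phi_bg is the shared
# background model, phi_ind[k] the model for industry k (hypothetical values).
V = 6                                    # vocabulary size
phi_bg = np.array([0.3, 0.3, 0.2, 0.1, 0.05, 0.05])
phi_ind = {"auto": np.array([0.0, 0.05, 0.05, 0.1, 0.4, 0.4])}

def generate_document(industry, n_words, alpha=(2.0, 2.0)):
    """Sample one document following the ALBA generative story."""
    # theta_d ~ Beta(alpha): probability a word comes from the industry model.
    theta_d = rng.beta(alpha[0], alpha[1])
    words = []
    for _ in range(n_words):
        z = rng.random() < theta_d       # biased coin flip, indicator z_w
        phi = phi_ind[industry] if z else phi_bg
        words.append(rng.choice(V, p=phi))
    return words

doc = generate_document("auto", 20)
```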
In plate notation, the algorithm would look as follows:
The above model is fairly simple, and we could immediately do inference on it using, e.g., MCMC. However, given the volume of data we have, and the fact that MCMC loses its guarantees of stationarity when trained in a mini-batch fashion, we elected to use variational inference instead. As will be shown below, careful treatment of the E and M steps allows us to train this model in an embarrassingly parallel fashion without losing any mathematical guarantees of convergence. We also present an implementation of variational inference and show that it converges to apparently sensible topics on real company data.
We want to maximise the objective 𝐿 = 𝑃(𝐗 | 𝜙, 𝛼), where 𝐗 is our corpus of documents in bag-of-words format and 𝛼 parametrises the Beta prior over 𝜃𝑑.
We can do this by using the EM algorithm, where at the E-step we aim to find the distribution 𝑄(𝐳, 𝜃) for which the lower bound

    𝔼𝑄[log 𝑃(𝐗, 𝐳, 𝜃 | 𝜙, 𝛼)] − 𝔼𝑄[log 𝑄(𝐳, 𝜃)],
    log 𝑃(𝐗, 𝐳, 𝜃 | 𝜙, 𝛼) = ∑𝑑 ( log 𝑝(𝜃𝑑 | 𝛼) + ∑𝑖=1..𝑉 𝑐𝑜𝑢𝑛𝑡𝑑(𝑖) ∑𝑡∈{0,1} [𝑧𝑑𝑖 = 𝑡] (log 𝑃(𝑡 | 𝜃𝑑) + log 𝜙𝑡(𝑖)) )

is maximised. Here [𝑧𝑑𝑖 = 𝑡] indicates whether word 𝑖 of document 𝑑 comes from topic 𝑡 (the background model for 𝑡 = 0, or the industry-specific model for 𝑡 = 1, i.e. 𝜙₀ = 𝜙𝐵𝐺 and 𝜙₁ = 𝜙𝑚[𝑑]), 𝑉 is the size of the vocabulary, and 𝑐𝑜𝑢𝑛𝑡𝑑(𝑖) is the count of word 𝑖 in document 𝑑.
We can assume a factorised form 𝑄(𝐳, 𝜃) = 𝑞(𝐳) 𝑞(𝜃) and then use the mean-field approximation to compute 𝑞(𝐳) and 𝑞(𝜃) iteratively.
Thus, we get

    𝑞(𝑧𝑑𝑖 = 𝑡) ∝ 𝜙𝑡(𝑖) · exp(𝔼𝑞(𝜃𝑑)[log 𝑃(𝑡 | 𝜃𝑑)])

and

    𝑞(𝜃𝑑) ∝ 𝑝(𝜃𝑑 | 𝛼) · exp( ∑𝑖 𝑐𝑜𝑢𝑛𝑡𝑑(𝑖) (𝛾𝑑𝑖(1) log 𝜃𝑑 + 𝛾𝑑𝑖(0) log(1 − 𝜃𝑑)) ),

which is a Beta distribution with shape parameters

    𝑎𝑑 = 𝛼₁ + ∑𝑖 𝑐𝑜𝑢𝑛𝑡𝑑(𝑖) 𝛾𝑑𝑖(1),    𝑏𝑑 = 𝛼₀ + ∑𝑖 𝑐𝑜𝑢𝑛𝑡𝑑(𝑖) 𝛾𝑑𝑖(0),

where 𝛼 = (𝛼₁, 𝛼₀) are the parameters of the Beta prior. We use 𝛾𝑑𝑖(𝑡) to denote 𝔼𝑞[𝑧𝑑𝑖 = 𝑡], which can be interpreted as our belief that word 𝑖 of document 𝑑 came from topic 𝑡 = 0, 1, i.e. either the background or the industry-specific topic respectively.
Thus, the E-step can be implemented as follows:
1. Randomly initialise 𝑞(𝜃).
2. While the change in 𝑞(𝜃) is ≥ 1e-5:
   a. Update 𝑞(𝑧) for each document and word.
   b. Update 𝑞(𝜃) for each document.
Unsurprisingly, these equations are very similar to vanilla LDA updates.
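A minimal per-document E-step along the lines above might look as follows. The toy inputs, the uniform Beta prior, the tolerance, and the numerical digamma approximation (standing in for e.g. scipy.special.digamma, to keep the sketch dependency-free) are all our own illustrative choices:

```python
import numpy as np
from math import lgamma

def digamma(x, h=1e-6):
    # Numerical digamma via a central difference of log-gamma; an illustrative
    # stand-in for scipy.special.digamma.
    return (lgamma(x + h) - lgamma(x - h)) / (2.0 * h)

def e_step_doc(counts, phi_bg, phi_ind, alpha=(1.0, 1.0), tol=1e-5, max_iter=200):
    """Mean-field E-step for a single document.

    counts  : (V,) word counts of the document
    phi_bg  : (V,) background unigram model
    phi_ind : (V,) unigram model of the document's industry
    Returns gamma (V, 2) responsibilities and the Beta parameters (a_d, b_d).
    """
    a_d, b_d = 1.0, 1.0                       # neutral initialisation of q(theta_d)
    for _ in range(max_iter):
        # q(z_di = t) ∝ phi_t(i) * exp(E[log P(t | theta_d)])
        log_w1 = np.log(phi_ind + 1e-12) + digamma(a_d) - digamma(a_d + b_d)
        log_w0 = np.log(phi_bg + 1e-12) + digamma(b_d) - digamma(a_d + b_d)
        m = np.maximum(log_w0, log_w1)        # normalise in log space for stability
        g1, g0 = np.exp(log_w1 - m), np.exp(log_w0 - m)
        gamma1 = g1 / (g0 + g1)               # belief each word is industry-specific
        # q(theta_d) = Beta(a_d, b_d): prior plus expected counts
        a_new = alpha[0] + np.sum(counts * gamma1)
        b_new = alpha[1] + np.sum(counts * (1.0 - gamma1))
        converged = abs(a_new - a_d) < tol and abs(b_new - b_d) < tol
        a_d, b_d = a_new, b_new
        if converged:
            break
    return np.stack([1.0 - gamma1, gamma1], axis=1), a_d, b_d

# Toy document over a 3-word vocabulary: word 0 looks generic, word 2 industry-specific.
counts = np.array([3.0, 0.0, 1.0])
gamma, a_d, b_d = e_step_doc(counts,
                             phi_bg=np.array([0.8, 0.1, 0.1]),
                             phi_ind=np.array([0.1, 0.1, 0.8]))
```

Note that each document's E-step touches only that document's counts and its own (𝑎𝑑, 𝑏𝑑), which is what makes the E-step embarrassingly parallel across documents.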
In the M-step, we use the 𝑞(𝐳) and 𝑞(𝜃) computed in the E-step to maximise

    𝔼𝑄[log 𝑃(𝐗, 𝐳, 𝜃 | 𝜙, 𝛼)] = ∑𝑑 ∑𝑖=1..𝑉 𝑐𝑜𝑢𝑛𝑡𝑑(𝑖) (𝛾𝑑𝑖(0) log 𝜙𝐵𝐺(𝑖) + 𝛾𝑑𝑖(1) log 𝜙𝑚[𝑑](𝑖)) + const

w.r.t. 𝜙, under the restriction that ∑𝑖=1..𝑉 𝜙𝑡(𝑖) = 1 for all 𝑡 and 𝜙𝑡(𝑖) ≥ 0 for all 𝑖 and 𝑡.
Thus, we obtain the following update:

    𝜙𝐼(𝑖) ∝ ∑𝑑:𝑚[𝑑]=𝐼 𝑐𝑜𝑢𝑛𝑡𝑑(𝑖) 𝛾𝑑𝑖(1),    𝜙𝐵𝐺(𝑖) ∝ ∑𝑑 𝑐𝑜𝑢𝑛𝑡𝑑(𝑖) 𝛾𝑑𝑖(0),

which is in fact identical to the vanilla LDA update, with the exception that the industry sums run only over documents belonging to that industry (e.g. only companies whose industry is automotive are considered when estimating the automotive topic). Note that all documents contribute to the background topic, so it naturally converges to words that occur across all documents.
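The M-step update can be sketched as follows, assuming documents arrive as a bag-of-words count matrix with one integer industry label per document; the matrices below are toy data for illustration:

```python
import numpy as np

def m_step(counts, industries, gammas, n_industries):
    """Re-estimate the background and per-industry unigram models.

    counts     : (D, V) bag-of-words count matrix
    industries : (D,) industry index m[d] of each document
    gammas     : (D, V) responsibilities gamma_di(1) from the E-step
    Returns phi_bg (V,) and phi_ind (n_industries, V).
    """
    V = counts.shape[1]
    # Background topic: every document contributes its gamma_di(0) mass.
    bg = np.sum(counts * (1.0 - gammas), axis=0)
    phi_bg = bg / bg.sum()
    # Industry topics: only documents belonging to that industry contribute.
    phi_ind = np.zeros((n_industries, V))
    for k in range(n_industries):
        mask = industries == k
        ind = np.sum(counts[mask] * gammas[mask], axis=0)
        phi_ind[k] = ind / ind.sum() if ind.sum() > 0 else np.full(V, 1.0 / V)
    return phi_bg, phi_ind

# Two toy documents over a 3-word vocabulary, one per industry.
counts = np.array([[2.0, 1.0, 0.0],
                   [0.0, 1.0, 2.0]])
industries = np.array([0, 1])
gammas = np.array([[0.1, 0.5, 0.9],
                   [0.2, 0.5, 0.8]])
phi_bg, phi_ind = m_step(counts, industries, gammas, n_industries=2)
```

Because each industry topic only reads the rows of its own documents, and the background sum is a simple reduction over all rows, this step also parallelises naturally across document shards.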