Company descriptions provide a wealth of information about a company's business model and can be a very useful source for finding company peers. However, natural language allows equivalent concepts to be expressed in many different ways, which can confound peer suggestion and lead to missed peers.
To alleviate this problem, it seems useful to obtain a list of industry-specific keywords, which can be used to enrich a company's description and hopefully overcome such differences of expression when suggesting peers.
Below, we describe a generative statistical model that finds industry-specific keywords while simultaneously accounting for the common, uninformative background words that naturally occur in company descriptions.
Annotated Latent Beta Allocation (ALBA) - a simpler form of LDA (Latent Dirichlet Allocation)
It assumes that words are generated from unigram language models: either a generic background model φ_B that every document is allowed to sample from, or an industry-specific model φ_i that only documents pertaining to industry i can sample from. Whether a word is drawn from the background model or the industry model is governed by a document-specific parameter θ_d, a single number giving the probability that a word of document d comes from the industry-specific model.
Therefore, to generate a document d, we first sample θ_d from some Beta prior parametrised by α. Then, for each word n, we flip a biased coin whose chance of success is θ_d, obtaining an indicator z_n. If z_n is zero, we sample a word from φ_B; otherwise we sample a word from φ_{t[d]}, where t[d] indicates which industry d belongs to. We proceed in this fashion until we've sampled all words. Note that this model is very similar to the popular LDA algorithm.
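As a concrete illustration, this generative story can be sketched in a few lines of NumPy (the function name, the toy vocabulary, and the industry model below are purely illustrative, not part of the original implementation):

```python
import numpy as np

def generate_document(n_words, alpha, phi_B, phi_industry, rng):
    """Sample one document under the generative story sketched above.

    alpha: the two Beta prior shape parameters for theta_d.
    phi_B, phi_industry: unigram distributions over the vocabulary.
    """
    theta_d = rng.beta(alpha[0], alpha[1])  # document-specific mixing weight
    words = []
    for _ in range(n_words):
        z_n = rng.random() < theta_d                    # biased coin flip
        phi = phi_industry if z_n else phi_B            # z_n = 0 -> background
        words.append(int(rng.choice(len(phi), p=phi)))  # sample a word id
    return words

# Toy example: words 0-1 are "background", words 2-3 are industry-specific.
rng = np.random.default_rng(0)
phi_B = np.array([0.5, 0.5, 0.0, 0.0])
phi_auto = np.array([0.0, 0.0, 0.5, 0.5])
doc = generate_document(20, (1.0, 1.0), phi_B, phi_auto, rng)
```

With a uniform Beta(1, 1) prior, roughly half the sampled words will come from each model on average, but any given document can lean heavily one way depending on its sampled θ_d.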
In plate notation, the algorithm would look as follows:
The above model is fairly simple and we could immediately do inference on it using e.g. MCMC. However, given the volume of data we have, and the fact that MCMC loses its stationarity guarantees if trained in a mini-batch fashion, we elected to use variational inference. As shown below, careful treatment of the E and M steps allows us to train this model in an embarrassingly parallel fashion without losing any mathematical guarantees of convergence. We also present an implementation of variational inference and demonstrate that it converges to sensible-looking topics on real company data.
We want to maximise the objective L = p(D | φ, α), where D is our set of documents in bag-of-words format and α is the prior over θ_d.
We can do this by using the EM algorithm, where at the E-step we aim to find the distribution q(z, θ) such that the evidence lower bound

F(q) = ∑_d E_q[log p(θ_d | α)] + ∑_d ∑_{w=1}^{V} count_d(w) ∑_{k=0,1} q(z_dw = k) (E_q[log p(z_dw = k | θ_d)] + log φ_k(w)) − E_q[log q(z, θ)]

is maximised. Here [z_dw = k] indicates whether word w of document d comes from topic k (which can be either the background model, k = 0, or the industry-specific model, k = 1, so that for document d, φ_0 is shorthand for φ_B and φ_1 for φ_{t[d]}), V is the size of the vocab, and count_d(w) is the count of word w in document d.
We can assume a factorised form q(z, θ) = q(z) q(θ) and then use the mean-field approximation to compute q(z) and q(θ) iteratively.
Thus, we get

γ_dw(1) ∝ φ_{t[d]}(w) · exp(ψ(a_d) − ψ(a_d + b_d))
γ_dw(0) ∝ φ_B(w) · exp(ψ(b_d) − ψ(a_d + b_d))

and q(θ_d) = Beta(a_d, b_d), a beta distribution whose shape parameters a_d, b_d are

a_d = α_1 + ∑_w count_d(w) γ_dw(1)
b_d = α_0 + ∑_w count_d(w) γ_dw(0),

where ψ denotes the digamma function. We use γ_dw(k) to denote q(z_dw = k), which can be interpreted as our belief that word w of document d came from topic k = 0, 1, either the background or the industry-specific topic respectively.
Thus, the E-step can be implemented as follows:
Randomly initialise q(z)
While the change in q(z) is ≥ 1e-5, do:
Update q(z) for each document and word
Update q(θ) for each document
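The per-document part of this loop can be sketched in NumPy as follows. This is a minimal illustration, not the original implementation: the function name is ours, and the small digamma helper merely avoids a SciPy dependency (scipy.special.digamma would work equally well).

```python
import math
import numpy as np

def digamma(x):
    """Digamma via recurrence plus an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def e_step_document(counts, phi_B, phi_ind, alpha, n_iter=100, tol=1e-5):
    """Mean-field updates for a single document.

    counts: (V,) word counts; alpha: the two Beta prior shape parameters.
    Returns (gamma, a_d, b_d): gamma[w] is the belief that word w came
    from the industry topic; (a_d, b_d) parametrise q(theta_d).
    """
    gamma = np.full(len(counts), 0.5)
    a_d = b_d = 0.0
    for _ in range(n_iter):
        # q(theta_d) = Beta(a_d, b_d): prior plus expected topic counts.
        a_d = alpha[0] + counts @ gamma            # industry pseudo-counts
        b_d = alpha[1] + counts @ (1.0 - gamma)    # background pseudo-counts
        # q(z_dw) is proportional to phi_k(w) * exp(E[log mixing weight]).
        log_ind = np.log(phi_ind + 1e-12) + digamma(a_d) - digamma(a_d + b_d)
        log_bg = np.log(phi_B + 1e-12) + digamma(b_d) - digamma(a_d + b_d)
        new_gamma = 1.0 / (1.0 + np.exp(log_bg - log_ind))
        if np.max(np.abs(new_gamma - gamma)) < tol:
            gamma = new_gamma
            break
        gamma = new_gamma
    return gamma, a_d, b_d

# A word favoured by the industry model gets attributed to the industry topic.
counts = np.array([5.0, 0.0, 5.0])
phi_B = np.array([0.9, 0.05, 0.05])
phi_ind = np.array([0.05, 0.05, 0.9])
gamma, a_d, b_d = e_step_document(counts, phi_B, phi_ind, alpha=(1.0, 1.0))
```

Because each document's updates touch only that document's counts, this inner loop is what makes the E-step embarrassingly parallel across documents.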
Unsurprisingly, these equations are very similar to vanilla LDA updates.
In the M-step, we use the q(z) and q(θ) computed in the E-step to maximise

E_q[log p(D, z, θ | φ, α)] = const + ∑_d ∑_{w=1}^{V} count_d(w) (γ_dw(0) log φ_B(w) + γ_dw(1) log φ_{t[d]}(w))

w.r.t. φ, under the restriction that ∑_w φ_k(w) = 1 for all k and φ_k(w) ≥ 0 for all k and w.
Thus, we obtain the following update:

φ_t(w) ∝ ∑_{d : t[d] = t} count_d(w) γ_dw(1)
φ_B(w) ∝ ∑_d count_d(w) γ_dw(0),

which is in fact identical to the vanilla LDA update, except that each industry topic's sum runs only over the documents belonging to that industry (e.g. only companies whose industry is automotive are considered when estimating the automotive topic). Note that all documents contribute to the background topic, so it naturally converges to words that occur across all documents.
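A sketch of this M-step in NumPy (function name and array layout are our own choices for illustration):

```python
import numpy as np

def m_step(counts, industries, gammas, n_industries):
    """Re-estimate topic-word distributions from the E-step beliefs.

    counts:     (D, V) bag-of-words matrix.
    industries: (D,) industry id t[d] for each document.
    gammas:     (D, V) belief that each word came from the industry topic.
    """
    # Background topic: every document contributes its background mass.
    bg_mass = (counts * (1.0 - gammas)).sum(axis=0)
    phi_B = bg_mass / bg_mass.sum()
    # Industry topics: only documents belonging to that industry contribute.
    phi_ind = np.zeros((n_industries, counts.shape[1]))
    for t in range(n_industries):
        in_t = industries == t
        mass = (counts[in_t] * gammas[in_t]).sum(axis=0)
        phi_ind[t] = mass / mass.sum()
    return phi_B, phi_ind

# Two documents, two industries, two-word vocabulary.
counts = np.array([[4.0, 0.0], [0.0, 4.0]])
industries = np.array([0, 1])
gammas = np.full((2, 2), 0.5)
phi_B, phi_ind = m_step(counts, industries, gammas, n_industries=2)
```

Since the update is just a normalised sum of per-document statistics, the sums can be accumulated independently per worker and combined at the end, which is what permits mini-batch or parallel training without losing convergence guarantees.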