19-11-2019

Company descriptions provide a wealth of information regarding a company’s business model and can be a very useful source of information for finding company peers. However, due to the nature of natural language, certain equivalent concepts can get expressed in many different ways. This can confound peer suggestions and lead to potentially missed peers.

To alleviate this problem, it seems useful to attempt to obtain a list of industry-specific keywords, which can be used to enhance a company’s description and hopefully overcome differences of expression and meaning when suggesting peers.

Below, we describe a generative statistical model that can be used to find industry-specific keywords, while simultaneously accounting for the common, uninformative background words that naturally occur in company descriptions.

**Annotated Latent Beta Allocation (ALBA): a simpler form of LDA (Latent Dirichlet Allocation)**


The model assumes that words are generated from unigram language models: either a generic background model *𝜙_{𝐵𝐺}* that every document is allowed to sample from, or an industry-specific model *𝜙_{𝑡}*, where *𝑡* is the industry annotated for that document.

Therefore, to generate a document *𝑑*, we first sample a mixing proportion *𝜃_{𝑑}* from a Beta prior parametrised by *𝛼* = (*𝛼*₁, *𝛼*₂). Then, for each word position *𝑖*, we draw an indicator *𝑧_{𝑑𝑖}* ~ Bernoulli(*𝜃_{𝑑}*) and sample the word from the industry topic *𝜙_{𝑡(𝑑)}* if *𝑧_{𝑑𝑖}* = 1, or from the background topic *𝜙_{𝐵𝐺}* otherwise.
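The generative process described above can be sketched in a few lines of Python. The vocabulary, topic distributions, and prior values below are purely illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values here are illustrative assumptions):
vocab = ["car", "engine", "dealer", "company", "provides", "services"]
phi_bg = np.array([0.05, 0.05, 0.05, 0.30, 0.30, 0.25])   # background topic
phi_auto = np.array([0.40, 0.30, 0.25, 0.02, 0.02, 0.01])  # "automotive" topic
alpha = (2.0, 2.0)  # Beta prior over theta_d

def generate_document(n_words, phi_industry, phi_background, alpha):
    """Sample one document under the ALBA generative process."""
    theta_d = rng.beta(*alpha)        # industry-vs-background proportion
    words = []
    for _ in range(n_words):
        z = rng.random() < theta_d    # z_di ~ Bernoulli(theta_d)
        phi = phi_industry if z else phi_background
        words.append(vocab[rng.choice(len(vocab), p=phi)])
    return words

print(generate_document(8, phi_auto, phi_bg, alpha))
```

Each document mixes its own industry topic with the shared background topic, with the mix governed by its private *𝜃_{𝑑}*.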

In plate notation, the algorithm would look as follows:

**Inference**

The above model is fairly simple and we can immediately do inference on it using e.g. MCMC. However, given the volume of data we have and the fact that MCMC loses its stationarity guarantees if trained in a mini-batch fashion, we elected to use variational inference. As will be shown below, careful treatment of the E and M steps allows us to train this model in an embarrassingly parallel fashion without losing any mathematical guarantees of convergence. We also provide an implementation of variational inference and show that it converges to apparently sensible topics on real company data.

We want to maximise the objective *𝐿* = *𝑃*(**𝐗**|*𝜙*, *𝛼*), where **𝐗** is our set of documents in bag-of-words format and *𝛼* is the Beta prior over *𝜃_{𝑑}*.

We can do this by using the EM algorithm, where at the E-step we aim to find a distribution *𝑄*(**𝐳**, *𝜃*) such that the evidence lower bound

$$\mathcal{L}(Q) = \mathbb{E}_{Q}\left[\log P(\mathbf{X}, \mathbf{z}, \theta \mid \phi, \alpha)\right] - \mathbb{E}_{Q}\left[\log Q(\mathbf{z}, \theta)\right]$$

is maximised, where $z_{di} = 1$ indicates that word *𝑖* of document *𝑑* was drawn from the document's industry topic, and $z_{di} = 0$ that it was drawn from the background topic.

We can assume a factorised form *𝑄*(**𝐳**, *𝜃*) = *𝑞*(**𝐳**)*𝑞*(*𝜃*) and then use the mean-field approximation to compute *𝑞*(**𝐳**) and *𝑞*(*𝜃*) iteratively.

Thus, we get

$$q(z_{di} = 1) \;\propto\; \phi_{t(d)}(w_{di})\, \exp\!\left(\mathbb{E}_{q(\theta)}[\log \theta_d]\right), \qquad q(z_{di} = 0) \;\propto\; \phi_{BG}(w_{di})\, \exp\!\left(\mathbb{E}_{q(\theta)}[\log (1 - \theta_d)]\right),$$

and *𝑞*(*𝜃_{𝑑}*), being conjugate to the Bernoulli indicators, is a Beta distribution. We use *𝛾_{𝑑𝑖}*(*𝑘*) to denote *𝑞*(*𝑧_{𝑑𝑖}* = *𝑘*).

Similarly,

$$q(\theta_d) = \mathrm{Beta}(a_d, b_d), \qquad a_d = \alpha_1 + \sum_i \gamma_{di}(1), \qquad b_d = \alpha_2 + \sum_i \gamma_{di}(0),$$

where *𝑎_{𝑑}*, *𝑏_{𝑑}* are the variational parameters of the Beta, and the required expectations are $\mathbb{E}[\log \theta_d] = \psi(a_d) - \psi(a_d + b_d)$ and $\mathbb{E}[\log(1 - \theta_d)] = \psi(b_d) - \psi(a_d + b_d)$, with $\psi$ the digamma function.

Thus, the E-step can be implemented as follows:

1. Randomly initialise *𝑞*(*𝜃*).
2. While the change in *𝑞*(*𝜃*) is ≥ 1e-5:
    1. Update *𝑞*(**𝐳**) for each document and word.
    2. Update *𝑞*(*𝜃*) for each document.
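The loop above can be sketched for a single document as follows. This is a minimal sketch, assuming `phi_t` and `phi_bg` are vocabulary-sized probability arrays and `alpha` is the pair of Beta prior parameters; the function name and signature are illustrative:

```python
import numpy as np
from scipy.special import digamma

def e_step(word_ids, phi_t, phi_bg, alpha, tol=1e-5, max_iter=100):
    """Mean-field updates for q(z) and q(theta_d) in one document.

    word_ids : indices of the document's tokens into the vocabulary
    phi_t    : industry topic, P(w | industry)
    phi_bg   : background topic, P(w | background)
    alpha    : (alpha_1, alpha_2), Beta prior over theta_d
    """
    a_d, b_d = float(alpha[0]), float(alpha[1])  # init q(theta_d) at the prior
    for _ in range(max_iter):
        # q(z_di = 1) ∝ phi_t(w_di) * exp(E[log theta_d]), and similarly for z_di = 0
        log_w1 = np.log(phi_t[word_ids]) + digamma(a_d) - digamma(a_d + b_d)
        log_w0 = np.log(phi_bg[word_ids]) + digamma(b_d) - digamma(a_d + b_d)
        gamma = 1.0 / (1.0 + np.exp(log_w0 - log_w1))  # gamma_di(1), responsibilities
        # q(theta_d) = Beta(a_d, b_d) with the updated expected counts
        a_new = alpha[0] + gamma.sum()
        b_new = alpha[1] + (1.0 - gamma).sum()
        converged = abs(a_new - a_d) < tol and abs(b_new - b_d) < tol
        a_d, b_d = a_new, b_new
        if converged:
            break
    return gamma, (a_d, b_d)
```

Because each document's *𝑞*(**𝐳**) and *𝑞*(*𝜃_{𝑑}*) depend only on that document, this step parallelises trivially across documents.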

Unsurprisingly, these equations are very similar to vanilla LDA updates.

In the M-step, we use the *𝑞*(**𝐳**) and *𝑞*(*𝜃*) computed in the E-step to maximise

$$\mathbb{E}_{q(\mathbf{z})\,q(\theta)}\left[\log P(\mathbf{X}, \mathbf{Z}, \theta)\right]$$

w.r.t. *𝜙*, under the restriction that $\sum_{w=1}^{V} \phi_t(w) = 1$ for each topic *𝑡*, where *𝑉* is the vocabulary size.

Thus, we obtain the following update:

$$\phi_t(w) \;\propto\; \sum_{d\,:\,t(d) = t} \sum_i \gamma_{di}(1)\, \mathbb{1}[w_{di} = w], \qquad \phi_{BG}(w) \;\propto\; \sum_{d} \sum_i \gamma_{di}(0)\, \mathbb{1}[w_{di} = w],$$

which is in fact identical to the vanilla LDA update, with the exception that the sums run only over documents annotated with the given industry (e.g. only companies whose industry is automotive are considered when we estimate the automotive topic). A thing to note is that all documents are considered when we're estimating the background topic, so it will naturally converge to containing words that occur across all documents.
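The M-step update can be sketched as an expected-count accumulation followed by normalisation. Names and shapes here are illustrative assumptions (`gammas` holds the per-document responsibilities *𝛾_{𝑑𝑖}*(1) from the E-step, `industries` the annotated industry index *𝑡*(*𝑑*) of each document):

```python
import numpy as np

def m_step(docs, gammas, industries, n_topics, vocab_size):
    """Re-estimate phi from expected counts.

    docs       : list of arrays of word ids (one per document)
    gammas     : list of arrays, gamma_di(1) = q(z_di = 1), aligned with docs
    industries : industry index t(d) for each document
    """
    phi = np.zeros((n_topics, vocab_size))  # industry topics
    phi_bg = np.zeros(vocab_size)           # background topic
    for word_ids, gamma, t in zip(docs, gammas, industries):
        # only documents annotated with industry t contribute to phi[t]
        np.add.at(phi[t], word_ids, gamma)
        # every document contributes to the background topic
        np.add.at(phi_bg, word_ids, 1.0 - gamma)
    # normalise so each topic sums to 1 over the vocabulary
    phi /= phi.sum(axis=1, keepdims=True)
    phi_bg /= phi_bg.sum()
    return phi, phi_bg
```

Since the expected counts are simple sums over documents, they can be accumulated independently per shard and added together, which is what makes the M-step embarrassingly parallel as well.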