Fun with Language Models

jrai jrai · Posted 07/16/2020


ML NLP CrowdCent

The field of Natural Language Processing has recently gone through a language modeling renaissance, and if you haven’t heard about it yet, you should pay attention, because it’s about to change the world. At its core, a language model has a very simple objective: learn to predict the most appropriate next word in a sentence or document (or some slight variation of that task). A language model (LM) typically learns to do this by ingesting an enormous amount of text data, often in the form of Wikipedia entries, blogs, and popular webpages. This post is not a full deep dive into the architectures or training processes of language models. It is an overview of recent advancements in the field, followed by some fun text generation experiments that use them. You’ll have to bear with some high-level explanations before getting to the fun parts, but they’re coming. For a more technical read on the topic and on LM implementations, we suggest starting with a post by Jay Alammar.

Language model applications in the wild

Given that a language model’s goal is to predict the next best word in a sentence or document, it’s no surprise that applications like word and sentence completion on smartphones have been in production for some time now. The language models behind basic autocomplete are very light and only learn the statistical properties of text by remembering the last word or two that they have seen. One of the major recent leaps in language modeling has been in model architectures. More complex neural network architectures that use recurrence and attention allow language models to capture long-range dependencies far better. Capturing long-range dependencies means that a model can remember words and concepts from very early in a document just as well as it remembers those that appear later.
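To make the "remembering the last word" idea concrete, here is a minimal sketch of a bigram autocomplete model in Python. The tiny corpus and the function names `train_bigram` and `suggest` are made up for illustration; this is not how any particular smartphone keyboard is implemented.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each other word."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows

def suggest(follows, prev_word, k=3):
    """Return up to k of the most frequent next words after prev_word."""
    counts = follows.get(prev_word.lower(), Counter())
    return [word for word, _ in counts.most_common(k)]

# Toy corpus (hypothetical); a real system would use far more text.
corpus = [
    "the stock is cheap",
    "the stock is expensive",
    "the stock is cheap today",
    "the management team is great",
]
model = train_bigram(corpus)
print(suggest(model, "is"))     # most frequent continuation comes first
print(suggest(model, "stock"))
```

A model like this only ever looks one word back, which is exactly the limitation that recurrence and attention are meant to overcome.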


The other major leap that has led us to this point in NLP is in how we teach these language models: a pre-training step followed by a fine-tuning step, together with the idea of transfer learning.


Now that we have complex model architectures that allow us to learn long-range dependencies and richer contextual relationships, the thinking goes that we should feed a model based on this architecture as much data and compute as possible. This step pre-trains the model to be able to perform out-of-the-box, rudimentary reading comprehension and text generation with samples of near human quality, while maintaining coherence over more than a page of text.


Once we have a pre-trained model that understands some universal properties of language, we can use it to initialize a model to fine-tune. The fine-tuning step ingests a new corpus that is more closely related to whatever downstream task you’d like to solve.

Transfer Learning

These language models can then be turned into rich classifiers or regressors with only a few small modifications. This is the step that we call transfer learning. We can take all of the information learned while pre-training and fine-tuning on next-word prediction and transfer it to initialize another model that can assist in all types of applications like language translation, question answering, sentiment classification, summarization, and anything else for which you might have labeled data. In a sense, you can also think of the fine-tuning step as a type of transfer learning, because the weights learned in the pre-training step are used to initialize the fine-tuned model.
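As a rough illustration of the transfer step, the sketch below keeps a set of "pre-trained" word vectors frozen and trains only a small new classification head on top of them. Everything here is a toy assumption: the vectors in `PRETRAINED_EMBEDDINGS` are hand-written stand-ins for weights a real language model would learn, and averaging word vectors stands in for a real neural encoder.

```python
import math

# Stand-in for weights learned during pre-training (hypothetical toy values).
PRETRAINED_EMBEDDINGS = {
    "great": [1.0, 0.2], "strong": [0.8, 0.1], "disciplined": [0.9, 0.3],
    "terrible": [-1.0, 0.2], "misled": [-0.9, 0.1], "crumbling": [-0.8, 0.3],
}

def encode(text):
    """Frozen 'encoder': average the pre-trained vectors of the known words."""
    vecs = [PRETRAINED_EMBEDDINGS[w] for w in text.lower().split()
            if w in PRETRAINED_EMBEDDINGS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, epochs=200, lr=0.5):
    """Train only a new logistic-regression head; the encoder is never updated."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = encode(text)
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(head, text):
    """Probability that the text speaks positively about management."""
    w, b = head
    return sigmoid(sum(wi * xi for wi, xi in zip(w, encode(text))) + b)

# Tiny labeled set (hypothetical): 1 = positive, 0 = negative.
labeled = [
    ("management is great and disciplined", 1),
    ("strong operators", 1),
    ("management misled investors", 0),
    ("the organization is crumbling", 0),
]
head = train_head(labeled)
print(predict(head, "a terrible team"), predict(head, "a great team"))
```

Note that "terrible" never appears in the labeled examples. The head can still score it, because the knowledge about that word lives in the transferred vectors rather than in the small labeled set.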

These developments have moved the applications of language models from the field of Natural Language Processing (NLP) to one that looks more like Natural Language Understanding (NLU), and the pace is still accelerating. Machines with human-level understanding of language are not quite here yet, but recent results (see some papers below regarding ELMo, ULMFiT, GPT and GPT-2, BERT, and XLNet) are allowing the creation of production systems capable of document-level understanding that wasn’t possible even a year ago.


Although not all language models are inherently generative, many are (or can easily be tweaked to become generative). In our case, we will show a few generative examples using models called ULMFiT and GPT-2 Medium—both of these models are generative right out of the box, although various incremental generation methods can be used to get closer to “human level” writing. The generative process is still very much an active area of research in the NLP community. Interestingly, we can use the model’s generation capabilities to illustrate the ideas of pre-training and fine-tuning.

Pre-trained generation examples

For example, a study by Roberts et al. has shown that runners as a group improved their running time by 8% over six weeks compared with time-matched controls. Other studies have used screen time for a sample of female marathoners and found that multiple sessions of quick recovery (24-48 minutes, with no resting) resulted in significant improvements in fatigue and mental performance during some trials.

This type of muscular training is known as tai chi-type training, or true “tai chi-type training.” That term refers to those who have completed what is known as Masterclass Tai Chi or teaching Tai Chi-style. Some residents report extremely good physical results. That said, these types of efforts need to be pursued consistently and appropriately to achieve the greatest possible effect.

(Generated with OpenAI’s default GPT-2 Medium model and nucleus sampling)

The pre-trained language model has an uncanny ability to generate coherent text excerpts. The model has excellent grammar and contextual awareness of a wide variety of fields because it was trained on such a general-purpose dataset.
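The sample above was generated with nucleus sampling. As a minimal sketch of the idea: at each step, keep only the smallest set of most-likely next words whose probabilities add up to at least some threshold p, then sample from that set. The probabilities below are invented for illustration, and the function names are our own, not part of any GPT-2 tooling.

```python
import random

def nucleus_tokens(probs, p=0.9):
    """Return the smallest set of most-likely tokens whose total probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    return kept

def nucleus_sample(probs, p=0.9, rng=random):
    """Sample one token from the nucleus, weighted by its original probability."""
    kept = nucleus_tokens(probs, p)
    tokens = [token for token, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-word distribution; a real model produces one per step.
next_word_probs = {"stock": 0.5, "company": 0.3, "market": 0.15, "banana": 0.05}
rng = random.Random(0)  # seeded only to make the demo repeatable
samples = [nucleus_sample(next_word_probs, p=0.9, rng=rng) for _ in range(5)]
print(samples)
```

With p = 0.9 the unlikely word "banana" falls outside the nucleus and can never be sampled, which is how this method keeps some randomness while avoiding the long tail of odd choices.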

Fine-tuned generation examples

Remember that we can now fine-tune this general model by feeding it a large corpus of new, domain-specific text. In our case, we want to understand the financial domain better, so we will use a dataset of investment write-ups posted to online communities and forums. If we wanted to understand the field of medicine better, we might have fed the model medical journals instead. After training for a few days on investment write-ups and asking our language model to give us some completely fabricated text, here is a handpicked sample:

I recommend a long position in Grizzly Oil & Gas Ltd. (TSX: Grizz) for up to 2.0x upside and worst-case downside of 10% downside.

Please note that the following recommendation contains certain AUVs that may not be suitable for all investors. We have chosen to focus the write-up on the TSX Venture Exchange as the primary vehicle for dissemination of the message of this investment is an exchange of ideas among investors. All investors are encouraged to conduct their own due-diligence and research the company’s shares on the websites of its primary subsidiaries.

Summary Thesis

Grizzly is a Canadian E&P company with interests in the US,Australia and Papua New Guinea. The company’s operating segments include offshore oil production, gas exploration and production, and natural gas exploration and production. Unlike many publicly traded equities, Grizzly’s financials and reserves are not closely scrutinized by any Bloomberg or Bloomberg Chemical type of coverage.

Grizzly has traded down from $36 per share in Oct 2012 as fears about the outlook for the Canadian oil sands have weighed down the stock. The stock is now back to over $30 per share as fears of over-leveraged and impaired balance sheets have dissipated.

In our opinion, a better, more focused balance sheet is currently obscured by legacy investments. At current valuation levels, investors are receiving Grizzly’s $0.89 per share cash balance in the form of a promissory note and getting a portfolio of producing properties almost for free. In addition, ~$2 per share legacy investments are payable to Grizzly from third parties who will use the proceeds to repay legacy debt.

With a clear, albeit delayed, catalyst, we believe Grizzly shares represent a great investment and are likely to benefit from a number of our criteria.


The market has rewarded Canadian E&P exporters with a premium valuation due to the previously mentioned over-capitalized balance sheet. Investors have paid almost 16x P/E for similar Canadian player Syncrude Energy (TSX: SNE), and we believe Grizzly shares offer an attractive mean to measure a Canadian E&P multiple. On a trailing basis, the stock trades at a 0.75x Price / Book Value versus Canadian peers at 1.09x. We also looked at peers on EV/EBITDA and P/Book and both showed substantial discounts to peers.

(Generated with fine-tuned GPT-2 Medium model and nucleus sampling)

If you pay close attention, you’ll find that this simulated investment thesis is nonsense, but we’d say it’s remarkable nonsense nonetheless. Grizzly Oil & Gas Ltd. does not exist, and neither do any of the details in this write-up, including the fabricated company’s ticker (TSX: Grizz) and financial data points (e.g. “Grizzly has traded down from $36 per share in Oct…”). In fact, the model has generated an entire thesis about a fabricated company! It’s not rooted in reality, but it’s a pretty good indication that a language model can learn the key drivers behind the investment valuation process.

In the above example, we asked the model to give us any text. If we prompt the language model with “seed text,” we can nudge the model’s context towards what we want it to talk about, and the model will fill in the rest on its own. To start, let’s see if the model inherently knows what qualities or traits are common to a good management team. We’ll seed the model with: “The management team is great.”

“The management team is great”

They have a long record of execution in a difficult and fast-changing business. They are strong operators who make decisions based on profitability and returns. They are disciplined on capital allocation, and always look to reinvest in attractive ways. They do have a few mistakes, namely on fixed-price contracts, and have made many changes to the way they manage their businesses over the last decade that have adversely affected their business over time, but they are now leaders in the industry.

(so good, in fact, that they bought their own shares on the open market a few weeks ago). They’ve done a great job consolidating the industry and raising ARPU while limiting capital to invest in the business. We believe that the most important decision MPS has been able to make has been to raise prices consistently while delivering significant savings to their customers.

(Generated with custom, fine-tuned GPT-2 Medium model and nucleus sampling)

Ok, now let’s flip the script and prompt the model with: “The management team is terrible.”

“The management team is terrible”

And arguably, they should be focusing on improving operations instead of mucking it up and losing key people. Direct sales to retail group, which is the group that manages Vistaprint, should also be sold because it is doing less good than management thinks.

Richard sets the tone. He is brash, brassy, and loud. I found him to be one of the most arrogant CEOs I have ever seen. He has a way about him that says he is not so nice. More than once I have had the impression that he is very aggressive and direct, and that he could easily fire anyone that came in to oversee the company or the company’s business. I don’t think Richard was the only one who came in with a dull understanding of the operating fundamentals of a growing company, and simultaneously had no ability to direct the operations of the business profitably.

They have misled shareholders and have misled analysts and investors as well. There is an ongoing lawsuit against them and a shareholder lawsuit from another financial institution that went to trial in 2015, and there has been a period where the truth has come out. In general, the organization is crumbling.

(Generated with custom, fine-tuned GPT-2 Medium model and nucleus sampling)

We find that not only does the model know what a good or bad management team is, but it can also fabricate examples of why that management team might be good or bad. Finally, let’s say we want to see what the model thinks potential catalysts for stocks are. We can feed the model the seed text “Catalysts” and see how it completes the passage. Here are a few examples:


– Investor awareness with regard to the risk of dividend pay out; management has been very patient to share repurchases; this seems to have been working well

– Dividend payout and distribution in 1H; I expect share distributions to continue this year at roughly 7%, and if this happens they will likely increase their dividend pay out

– Consistent high return capital allocation and an efficient use of the balance sheet

– Rising yield and save for buyback

– An increase in growth capital returns to shareholders

– Successful execution of the JV with APU

– Spin-off of electronic monitoring company RYN

– Realization of cost synergies, significant gross margin expansion, and FCF generation

– Takeout

1) Earnings Release on 2/10/10 (4Q 2010) – While very difficult to predict earnings from KSU, we believe that investors will be much more willing to pay up for an earnings release that ends the noise that has obscured just how great the financial performance has been (versus previous earnings); in particular, in particular, we believe analysts will focus just as much ranging from a low of $3.25 to a high of over $5.00.

2) Sale of Elpida to Chiquita – In fact, we think the company may feel that it may have little choice but to sell Elpida and face a life of attrition in a market that does not support the tonnage needed to fulfill the plant expansion plans.

3) Bidding War – Under these conditions, we believe that KSU may well be able to find buyers that are more amenable than in the past to the idea of a zero-price sales process, especially given the presence of activist investors in the shareholder base.

4) Sale to a Major E&P – Even if we assume that the non-core assets are worth zero, we do not believe KSU would have much of a problem attracting a major E&P to develop the assets (we would note that, as mentioned above, we would be somewhat skeptical of a sale of Elpida to a major E&P given the culture of competing with them that exists in that industry and the potential for anti-trust issues).

(Generated with fine-tuned GPT-2 Medium model and nucleus sampling)

It looks like the model has already learned that dividend payouts, share repurchases, and many other very plausible scenarios are possible catalysts for an investment thesis. As a bonus, the model also knows how and when it’s appropriate to create bullet points and/or numbered lists.


Language models can become multimodal models when combined with non-language data (aka multiple modalities of data). In our financial domain, we can incorporate raw financials and structured data (cash flow statements, employee ratings, media/news sentiment, etc.) to be combined or concatenated with the encoded representations of online investment write-ups when learning an otherwise standard neural language model.
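As a simple sketch of what "combined or concatenated" can look like in practice, the snippet below joins a document's encoded representation with a one-hot position type and a couple of numeric fields into a single feature vector. The field names, the numbers, and the `combine_features` helper are all hypothetical; the encoding is a stand-in for whatever a real language model encoder would output.

```python
POSITION_TYPES = ["long", "short"]  # assumed categories for illustration

def one_hot(value, vocabulary):
    """Encode a categorical value as a list of 0.0/1.0 flags."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

def combine_features(text_encoding, position_type, financials):
    """Concatenate the text encoding with structured metadata into one vector."""
    return list(text_encoding) + one_hot(position_type, POSITION_TYPES) + list(financials)

# Hypothetical inputs: a 3-dimensional document encoding, the position type,
# and two structured financial values (price/book and P/E).
text_encoding = [0.12, -0.4, 0.9]
features = combine_features(text_encoding, "long", [0.75, 16.0])
print(features)  # [0.12, -0.4, 0.9, 1.0, 0.0, 0.75, 16.0]
```

A downstream model then sees both what the write-up says and the structured context around it. Real systems usually normalize the numeric fields and may learn embeddings for the categories instead of one-hot flags.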

That said, compared to online investment write-ups and for the purposes of this explanatory post, generating the lyrics for songs may serve as a more concrete—and fun—example of the power of multimodal language models. In the following examples for generating meta-data dependent song lyrics, our meta-data will be two discrete pieces of information: the song’s title and the song’s genre. This model was trained on a corpus of roughly one million songs.

When we supply the language model with only a genre, our model can also generate a unique title of its own, and if we don’t supply the language model with a genre or a song title, the model will automatically generate both! In this case, we’ll provide no information to the model and see how it does:

Genre: Southern Rock

Title: First Day (Album Version)


Miles and miles is so long I can’t turn it off

And it ain’t over yet, lord this is not

There isn’t a thing I can’t prove

If i don’t get the time

I don’t wanna fall asleep at night

There must be something more inna the time of my life

On the day that I die

Someone should have done me better

But I got something on my mind you know

(Generated with fine-tuned AWD-LSTM/ULMFiT with beam-search generation)

Not too bad for an already outdated robot. The model wrote a “southern rock” song all on its own. Notice how the model also decided, unprompted, that this song is the “(Album Version).” In fact, everything you see here is learned behavior, including any notion of song structure (e.g. intro tags), rhyme, meter, and theme, even though the model has never heard a word.
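The lyrics above were generated with beam search rather than sampling. As a minimal sketch: instead of committing to the single most likely next word at each step, beam search keeps the few best partial sequences and extends all of them. The toy probability table `NEXT` is invented for illustration; a real model would compute these probabilities from the text so far.

```python
import math

# Toy next-word probabilities (hypothetical).
NEXT = {
    "<s>": {"feel": 0.6, "now": 0.4},
    "feel": {"it": 0.5, "the": 0.5},
    "now": {"i": 0.9, "you": 0.1},
    "it": {"</s>": 1.0},
    "the": {"</s>": 1.0},
    "i": {"</s>": 1.0},
    "you": {"</s>": 1.0},
}

def beam_search(next_probs, beam_width=2, max_len=5):
    """Return the best sequences found, as (log-probability, tokens) pairs."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished; carry forward
                continue
            for word, prob in next_probs[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [word]))
        # Keep only the beam_width highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    return beams

best_score, best_tokens = beam_search(NEXT, beam_width=2)[0]
greedy_score, greedy_tokens = beam_search(NEXT, beam_width=1)[0]
print(best_tokens, round(math.exp(best_score), 2))
print(greedy_tokens, round(math.exp(greedy_score), 2))
```

With a beam width of 1 this reduces to greedy decoding, which starts with "feel" because it is the likeliest first word. A width of 2 finds the sequence starting with "now", which has a higher overall probability.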

With the use of certain model architectures, we can even create lyrics for genres that have never existed, but that we can imagine might be quite cool…like “reggae disco” or “electronic rap.” Here’s an example of a language model trying to write song lyrics for the user-supplied genre “oldies metal.”

Genre: Oldies Metal

Title: Pure Damnation

Feel the sparkle in your eyes

Savage in tragedy

Feel it right within you

Feel it in your blood

Cause I know what you’re dreamin’ of

Now I’ve got something I can tell you

Now I know what you need

You’ve got just what you need

(Generated with fine-tuned AWD-LSTM/ULMFiT with beam-search generation)

Cool. Fun. You can tell that the language model tries to blend the two genres together with words that would typically only belong to one or the other. Belonging to the Oldies genre would be words like: pure, sparkle, and dreamin’. In the Metal camp we see words such as: damnation, savage, and blood.

Remember, we’re also able to do this with investment write-ups. Like songs, investment write-ups have associated meta-data. However, unlike genre and title of songs, for an investment write-up we might include stock ticker, industry, author, position type (long vs. short), and/or other more complex financial data points.


At CrowdCent, we use language modeling as one of many tools to enable our investment decision-making pipelines. We can turn unstructured language data into machine readable features and use a document’s encoded representation along with other meta-data in order to make optimized decisions with information sourced directly from the wisdom of the crowd. Although generating text is cool, using these language models to learn classification tasks (like whether to buy or sell a stock based on an investment write-up) can be particularly productive as well! In several years, with the continued convergence of compute capacity, improved model architectures, and cleaner/larger datasets, we will be astonished by the capabilities of “simple” language models.

Even currently—with language models alone—we can learn whether an investment write-up discusses company management in a positive or negative way, whether there is any mention of share repurchases, or whether a write-up’s thesis is a result of a special situation. Theoretically, with enough accurate data we can learn to classify any number of ideas in a document. When combined with other data, these insights can be used to systematically determine whether to invest in a proposed security. Further down the road, we believe these tools can extend to extracting value from any sort of engaged online community (not just investment communities) all while returning that extracted value back to the userbase.

Further Reading:

Original Papers:


Special thanks to Carlos Castro, Justin Plumley, and Steve Yang.
