Table of Contents
- Summaries are more than text-to-text
- Extraction is naive, abstraction is groundless
- Natural language processing still struggles with long documents
- How companies approach summaries
- Human-written book summaries
- Automatic summarization of medium-sized content
- Startup Case Study: Summari
- Beyond Text: Audio and Video Summaries
And what opportunities does that difficulty create for startups?
Text summarization is arguably one of the least-hyped tasks in natural language processing (NLP). Shrinking an article is far less novel than GPT-3 automatically generating startup ideas. Yet despite its low profile, text summarization is far from solved, especially in industry.
The rudimentary APIs offered by big companies like Microsoft leave plenty of room for smaller companies to tackle different angles of summarization, and no clear winner has emerged yet. This article explores why text summarization remains a challenge.
Summaries are more than text-to-text
Intuitively, summarization can be viewed as a text-to-text conversion with lossy compression. Realistically, summarization is a (text + context)-to-text problem.
Summarizing means shortening a long text, discarding irrelevant information in the process. But what counts as important information? It is very difficult to make a universal judgment, because the answer depends heavily on the domain the text discusses, the target audience, and the goals of the summary itself.
For example, consider scientific papers on COVID-19. Should a summary use biological jargon, or should it be understandable to laypeople? Should it be a dry enumeration of the main findings, or concise and suspenseful, enticing the reader to read the entire article?
In other words, what constitutes a good summary depends on the context. Summarization is not just a text-to-text transformation, but rather a (text + context)-to-text problem. General-purpose summarization APIs like Microsoft's Azure Cognitive Service adhere to the naive text-to-text definition.
Current generic summarization APIs do not let you specify anything about the desired output other than its length. For this reason, they are impractical for applications where nuances of context can make or break a product.
Extraction is naive, abstraction is groundless
In the field of summarization, a well-established distinction is between extractive summaries (i.e., excerpts from the original text) and abstractive summaries (i.e., newly generated text). Extractive summarizers pick out important sentences from the original text, so they tend to reflect the gist faithfully (unless misleading sentences are cherry-picked). However, extractive summaries struggle to capture the content as a whole, and any information in the sentences they leave out is lost entirely.
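To make the extractive idea concrete, here is a minimal sketch of a classic frequency-based extractive summarizer. It is an illustration only: real systems use far more sophisticated sentence scoring, and the regex tokenization, the tiny stopword list, and the scoring rule are all simplifying assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "that", "it"}

def extractive_summary(text: str, k: int = 2) -> str:
    """Score each sentence by the document-wide frequency of its words
    (stopwords excluded), then return the k highest-scoring sentences
    in their original order -- the core idea behind classic extractive
    summarization."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence: str) -> int:
        # Stopwords score 0 because they were excluded from `freq`.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Keep the k best sentences, restored to document order
    # (assumes sentences are distinct, for simplicity).
    top = sorted(sorted(sentences, key=score, reverse=True)[:k],
                 key=sentences.index)
    return " ".join(top)
```

Note how such a summarizer can only copy sentences verbatim: off-topic sentences are dropped, but everything not selected is lost, exactly as described above.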
Abstractive summarization, on the other hand, mimics the way humans summarize: we combine ideas from the original sentences into new sentences that tell the overall story from a higher-level perspective. However, abstractive summarization is technically difficult because it requires a generative model like the GPT family. Such models are currently plagued by hallucinations, so their outputs may contradict the facts or be unsupported by the source text.
Since fidelity to the input text is non-negotiable, this lack of grounding is a major obstacle for summarization. Preventing hallucinations is an active area of research, but no solution yet guarantees a given level of quality. This is why the abstractive summarization API provided by Google is still experimental (i.e., not officially supported like the products offered through Google Cloud Platform).
Natural language processing still struggles with long documents
By definition, a summary assumes a long input text; if the text is not long, there is no need for a summary at all. However, natural language processing still struggles to handle long documents.
The dominant model architecture, the Transformer, caps the number of input tokens at a few thousand. Documents exceeding this limit must be split into fragments that are summarized separately, and the final result is simply a concatenation of independent sub-summaries.
For document types divided into more-or-less independent sections, such as news articles, this approach may work. But for documents with complex dependencies between chapters, such as novels, it breaks down. A character's arc, for instance, often spans the entire book; because the sub-summaries are simply concatenated (and unaware of each other), the character's trajectory in the summary is also fragmentary. No single sentence can concisely capture their journey.
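The split-and-stitch pipeline described above can be sketched in a few lines. This is a deliberately naive illustration: the whitespace "tokenizer", the 512-token window, and the injected `summarize_chunk` callback are all stand-ins for a real tokenizer and model.

```python
def split_into_chunks(tokens, max_tokens=512):
    """Split a token list into consecutive fragments that each fit
    within the model's context window."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

def summarize_long_document(text, summarize_chunk, max_tokens=512):
    """Naive long-document pipeline: split, summarize each fragment
    independently, then concatenate the sub-summaries. Each call to
    `summarize_chunk` sees only its own fragment, which is exactly why
    dependencies that span fragments (e.g. a character's arc) get lost."""
    tokens = text.split()  # crude whitespace tokenization, for illustration
    chunks = split_into_chunks(tokens, max_tokens)
    sub_summaries = [summarize_chunk(" ".join(chunk)) for chunk in chunks]
    return " ".join(sub_summaries)
```

Because `summarize_chunk` never sees more than one fragment, no amount of per-chunk quality can recover relationships between distant chapters.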
How companies approach summaries
Since there is no universal solution for summarization, companies approach it in very different ways, ranging from manual to fully automated, depending on the nature of the document in question.
Human-written book summaries
Companies like Blinkist and 12min hire human experts to create high-quality summaries of non-fiction books that can be read in 15 minutes or less. While this approach ensures high-quality content, it won’t help readers with very specific tastes, as it won’t scale beyond human-curated bestseller lists.
Automatic summarization of medium-sized content
Summarizing medium-sized content such as blog posts, news articles, research papers, and internal documents is likely to benefit from automation, but remains a labor-intensive task. Each use case, defined by an input domain (news, legal, medical, etc.) and an output format (bullet points, highlights, paragraph summaries, etc.), requires its own training dataset and, potentially, its own trained model.
In a recent Google AI blog post announcing auto-generated summaries in Google Docs, Google explains that it collected a clean training set focused on a specific input domain and enforced a consistent style:
[…] Early versions of our corpus (for automatic summarization) contained various types of documents and summaries written in various ways. Academic abstracts, for example, were generally long and detailed, while executive summaries were concise and punchy.
Such corpora were therefore inconsistent and highly variable. Trained on so many different kinds of documents and summaries, the model was easily confused and struggled to learn the relationships between them.
We therefore carefully cleaned and filtered the fine-tuning data to contain training examples that represented a more consistent and cohesive definition of summarization. Despite the reduced amount of training data, this led to higher-quality models. (Excerpt from the Google AI blog post)
The text summarization feature developed by Google uses the summarization model PEGASUS. The model's innovation is a pre-training objective called GSG (Gap Sentences Generation): selected sentences of an unlabeled news article or web document are masked out, and the model must generate those missing sentences from the remaining, unmasked text.
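A minimal sketch of how a GSG training pair could be constructed follows. It is illustrative only: PEGASUS selects which sentences to mask by an importance heuristic, uses its own mask tokens, and operates on model tokens rather than raw strings; here the gap indices and the `<mask>` string are simplifying assumptions.

```python
import re

MASK = "<mask>"  # placeholder mask token, assumed for illustration

def make_gsg_example(document, gap_indices):
    """Build one Gap Sentences Generation (GSG) pre-training pair:
    the selected 'gap' sentences are replaced by a mask token in the
    model input, and concatenated to form the generation target.
    `gap_indices` is a set of sentence positions to mask."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    model_input = " ".join(MASK if i in gap_indices else s
                           for i, s in enumerate(sentences))
    target = " ".join(s for i, s in enumerate(sentences) if i in gap_indices)
    return model_input, target
```

Generating whole missing sentences from their surrounding document is, in effect, a summarization-shaped task, which is why this objective transfers well to fine-tuned summarizers.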
To implement PEGASUS in Google Docs, Google prepared fine-tuning data as quoted above to improve the quality of the summaries.
Google has also set out various goals for its experimental abstractive summarization API. They are consistent with the company's advice above and focus on narrower applications, as follows.
The image above lists the tasks performed by the abstractive summarization model Google is developing. There are seven such tasks:
List of tasks performed by the experimental abstractive summarization model under development at Google
Startup Case Study: Summari
(This article is not sponsored by Summari or any other featured product)
The lack of a robust model for summarization in arbitrary contexts presents a huge opportunity for startups, which can laser-focus on niche areas that big tech companies don't find attractive.
Summari, for example, is on a mission to let Internet consumers sift through articles before committing to reading them in full. In an earlier interview, the company's founder said he was disappointed with the state of the art in summarization and initially chose to offer human-written summaries.
Unfortunately, we did not get the quality we expected from AI technology. We believe a good summary has an artistry to it. A good summary requires more than just picking and copying phrases from the text: it takes a deeper understanding, which, at least for now, requires humans. (Ed Shrager, founder of Summari, Ness Labs interview)
According to Shrager, all content suffers from the "content consumption paradox": we only know its value after we have consumed it. This paradox is especially inconvenient for consumers living in an age of content overload. The movie industry, for example, solved this problem with trailers. Shrager founded Summari to create movie-trailer-like summaries of any content, so that consumers know its value before consuming it.
Shrager says summaries also help teams save time. For example, if you ask 10 team members to read a document that takes 10 minutes to read, the team spends a total of 100 minutes comprehending it. If you instead have them read a one-minute summary, the team spends just 10 minutes in total.
About a year after that interview, Summari now offers a Chrome extension that generates highlights for almost any website with text content. The human-written summaries arguably yielded a clean training set, allowing the company to build models that automate and scale its original mission. You can read a summary of this article, generated by Summari, at this link.
Beyond Text: Audio and Video Summaries
Compared to audio and video, text is arguably the easier modality to summarize. It is no surprise that state-of-the-art models and industry practices for audio and video lag behind those for text.
For example, little automation exists for shortening podcasts. One common practice is for podcasters to publish short YouTube clips of episodes (see Joe Rogan's channel, for example); this amounts to manual extractive summarization. Blinkist, meanwhile, works directly with podcast creators to produce shortened versions of episodes called "shortcasts", which corresponds to manual abstractive summarization.
However, automation is on the horizon. Startups like Snackable aim to automatically extract and stitch together key segments from audio and video files, for now in a purely extractive manner. Given advances in techniques for manipulating and generating audio and video, it may be only a matter of time before abstractive summarization becomes possible in these modalities too.
Text summarization is difficult because it is highly context-dependent. It is therefore nearly impossible to converge on a single universal solution, or to rely on an omnipotent GPT-style model to produce correct summaries in every situation. This fragmentation of solutions gives startups an opportunity to invest in clean training sets focused on very specific use cases, complementing what big tech companies offer.