Gemini 2.0 Flash and Gemini 1.5 Flash come with a 1 million token context window, and Gemini 1.5 Pro comes with a 2 million token context window. Historically, large language models (LLMs) were significantly limited by the amount of text (or tokens) that could be passed to the model at one time. The Gemini 1.5 long context window, with near-perfect retrieval (>99%), has unlocked many new use cases and developer paradigms.
The code you have already written for scenarios such as text generation or multimodal input will work as-is with long context.
The basic way to use a Gemini model is to pass information (context) to the model, which then generates a response. An analogy for the context window is short-term memory: just as a person's short-term memory can hold only a limited amount of information, so can a generative model.
Most generative models created in the last few years could only process 8,000 tokens at a time. Newer models pushed this further by accepting 32,000 or even 128,000 tokens. Gemini 1.5 was the first model to accept 1 million tokens, and Gemini 1.5 Pro now accepts 2 million tokens.
In practice, 1 million tokens would look like:
1. 50,000 lines of code (at the standard 80 characters per line)
2. All the text messages you have sent in the last 5 years
3. 8 average-length English novels
4. Transcripts of over 200 average-length podcast episodes
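Because billing and usage limits are expressed in tokens, it can be useful to check how many tokens a piece of content occupies before sending it. Below is a minimal sketch using the google.generativeai Python SDK; the API key placeholder and file name are hypothetical.

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")  # assumes an API key is available
    model = genai.GenerativeModel("gemini-1.5-pro")

    # Read a large document and count the tokens it would occupy in the context window.
    with open("reference_material.txt", "r", encoding="utf-8") as f:
        document = f.read()

    token_info = model.count_tokens(document)
    print("Total tokens:", token_info.total_tokens)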
While models can accept ever-larger contexts, much of the conventional wisdom about using large language models assumes this inherent limitation, which, as of 2024, is no longer the case.
Some common strategies to deal with the limitation of smaller context windows include:
1. Arbitrarily removing old messages/text from the context window as new text comes in (a minimal sketch of this appears after this list)
2. Summarizing the previous content and replacing it with the summary when the context window gets close to full
3. Using RAG with semantic search to move data out of the context window and into a vector database
4. Using deterministic or generative filters to remove certain text/characters from the prompt to save tokens
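As an illustration of the first strategy, the sketch below trims the oldest messages until the history fits an assumed token budget. The 8,000-token budget and the plain-string message format are assumptions for illustration, not part of any SDK.

    import google.generativeai as genai

    MAX_CONTEXT_TOKENS = 8_000  # assumed budget for an older, small-context model

    def trim_history(model: genai.GenerativeModel, messages: list[str]) -> list[str]:
        """Drop the oldest messages until the remaining history fits the token budget."""
        while len(messages) > 1:
            total = model.count_tokens("\n".join(messages)).total_tokens
            if total <= MAX_CONTEXT_TOKENS:
                break
            messages = messages[1:]  # discard the oldest message
        return messages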
While many of these techniques are still relevant in certain scenarios, the default starting point is now simply to put all of the tokens into the context window. Because Gemini models were purpose-built with a long context window, they are much more capable of in-context learning. For example, given only instructional materials in context (a 500-page reference grammar, a dictionary, and roughly 400 additional parallel sentences), Gemini 1.5 Pro and Gemini 1.5 Flash can learn to translate from English to Kalamang with quality similar to a person who learned from the same materials. Kalamang is a language of western New Guinea with fewer than 200 speakers, so it has almost no presence online.
This example highlights how you can start to think about what is possible with the long context and in-context learning capabilities of Gemini models.
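In practice, this amounts to placing the entire reference corpus directly in the prompt rather than retrieving snippets. A minimal sketch is shown below; the file names and the sentence to translate are hypothetical placeholders.

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-pro")

    # Load all of the teaching materials into a single prompt instead of retrieving snippets.
    grammar = open("kalamang_grammar.txt", encoding="utf-8").read()
    dictionary = open("kalamang_dictionary.txt", encoding="utf-8").read()
    parallel_sentences = open("parallel_sentences.txt", encoding="utf-8").read()

    response = model.generate_content([
        "Use the following reference materials to translate the final sentence.",
        grammar,
        dictionary,
        parallel_sentences,
        "Translate into Kalamang: 'The children are playing by the river.'",
    ])
    print(response.text)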
Although the standard use case for most generative models is still text input, the Gemini 1.5 model family enables a new paradigm of multimodal use cases. These models natively understand text, video, audio, and images, and the Gemini API accepts multimodal file types for convenience.
Text has proven to be the layer of intelligence underpinning much of the momentum around LLMs. As mentioned earlier, many of the practical limitations of LLMs came from not having a large enough context window for certain tasks. This led to the rapid adoption of retrieval-augmented generation (RAG) and other techniques that dynamically provide the model with relevant contextual information. Now, with ever-larger context windows (currently up to 2 million tokens with Gemini 1.5 Pro), new techniques are becoming available that unlock new use cases.
Some emerging and standard use cases for text-based long context include:
1. Summarizing large corpora of text
Previous summarization approaches with smaller-context models required a sliding window or another technique to keep state about earlier sections as new tokens were passed to the model
2. Question answering
In the past, with a limited amount of context and low factual recall in models, this was only feasible with RAG
3. Agentic workflows
Text is how agents keep state about what they have done and what they still need to do; without enough information about the world and the agent's goals, the agent's reliability is limited
Many-shot in-context learning is one of the most unique capabilities unlocked by long context models. Research has shown that taking the common "one-shot" or "few-shot" pattern, where the model is given one or a few examples of a task, and scaling it up to hundreds, thousands, or even hundreds of thousands of examples can lead to entirely new model capabilities. This many-shot approach has also been shown to perform similarly to models fine-tuned for a specific task. For use cases where a Gemini model's performance is not yet sufficient for a production rollout, you can experiment with the many-shot approach. As you will see later in the long context optimization section, context caching makes this kind of high-input-token workload much more economical, and in some cases even lowers latency.
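Below is a minimal sketch of many-shot prompting, assuming you already have a labeled dataset on hand; the example reviews, labels, and classification task are hypothetical.

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    # Hypothetical labeled examples; a many-shot prompt may contain hundreds or thousands of these.
    examples = [
        ("The battery died after two days.", "negative"),
        ("Setup took less than five minutes.", "positive"),
        # ... many more examples ...
    ]

    shots = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
    prompt = (
        "Classify the sentiment of the final review, following the labeled examples.\n\n"
        f"{shots}\n\nReview: The screen scratches far too easily.\nSentiment:"
    )

    response = model.generate_content(prompt)
    print(response.text)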
Video content has long been limited in usefulness by the difficulty of accessing the media itself. It was hard to skim through content, transcripts often failed to capture the nuance of a video, and most tools could not process image, text, and audio together. With Gemini 1.5, the long-context text capabilities carry over to the ability to reason about and answer questions on multimodal inputs with sustained performance. When tested on the needle-in-a-haystack video problem with 1 million tokens, Gemini 1.5 Flash achieved over 99.8% recall of the video within the context window, and 1.5 Pro reached state-of-the-art performance on the Video-MME benchmark.
Some emerging and standard use cases for video long context include:
1. Video Q&A
2. Video memory, as shown in Google's Project Astra
3. Video subtitles
4. Video recommendation system, enriching existing metadata through new multimodal understanding
5. Video personalization, by looking at a corpus of video data and its associated metadata and then removing the parts of videos that are not relevant to the viewer
6. Video content moderation
7. Real-time video processing
When working with videos, it is important to consider how they are processed into tokens, as this affects billing and usage limits. For more information on how to prompt with video files, see the prompting guide.
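As a rough illustration, uploading a video with the File API and then prompting against it might look like the sketch below; the file name and the question asked are placeholders.

    import time
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # Upload the video via the File API and wait for it to finish processing.
    video_file = genai.upload_file(path="lecture_recording.mp4")
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)

    model = genai.GenerativeModel("gemini-1.5-pro")
    response = model.generate_content([
        video_file,
        "Summarize the key points of this lecture and note when each topic is introduced.",
    ])
    print(response.text)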
The Gemini 1.5 models were the first natively multimodal large language models able to understand audio. Historically, a typical developer workflow involved stringing together multiple domain-specific models, such as a speech-to-text model and a text-to-text model, in order to process audio. This added latency from the multiple round-trip requests and reduced performance, which is usually attributed to the disconnected architecture of a multi-model setup.
On standard audio haystack evaluations, Gemini 1.5 Pro finds the hidden audio in 100% of tests, and Gemini 1.5 Flash finds it in 98.7% of tests. Gemini 1.5 Flash accepts up to 9.5 hours of audio in a single request, and Gemini 1.5 Pro accepts up to 19 hours of audio using its 2 million token context window. Additionally, on a test set of 15-minute audio clips, Gemini 1.5 Pro achieves a word error rate (WER) of roughly 5.5%, much lower than dedicated speech-to-text models, and without the added complexity of extra input segmentation and pre-processing.
Some emerging and standard use cases for audio long context include:
1. Real-time transcription and translation
2. Podcast/Video Q&A
3. Meeting transcription and summarization
4. Voice Assistant
For more information on how to use audio files for prompts, see the prompts guide.
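A minimal sketch of prompting with an audio file, analogous to the video example above; the file name and instruction are placeholders.

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # Audio files are uploaded through the same File API used for video.
    audio_file = genai.upload_file(path="team_meeting.mp3")

    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content([
        audio_file,
        "Transcribe this meeting and summarize the action items.",
    ])
    print(response.text)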
When working with long context and the Gemini 1.5 models, the primary optimization is context caching. Beyond the previous impossibility of processing a large number of tokens in a single request, the other main constraint was cost. If you have a "chat with your data" app where a user uploads 10 PDF files, a video, and some work documents, you previously had to use more complex retrieval-augmented generation (RAG) tools or frameworks to process these requests, and pay significant fees for the tokens moved into the context window. Now, you can cache the files the user uploads and pay to store them per hour. With Gemini 1.5 Flash, for example, the input/output cost per request is roughly 1/4 of the standard input/output cost, so if the user chats with their data enough, this becomes a large saving for you as the developer.
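Below is a minimal sketch of context caching with the google.generativeai Python SDK; the model version string, TTL, and file name are assumptions you would adapt to your own setup.

    import datetime
    import google.generativeai as genai
    from google.generativeai import caching

    genai.configure(api_key="YOUR_API_KEY")

    # Upload the large document once and cache it for an hour.
    document = genai.upload_file(path="annual_report.pdf")
    cache = caching.CachedContent.create(
        model="models/gemini-1.5-flash-001",  # caching requires an explicit model version
        display_name="annual-report-cache",
        contents=[document],
        ttl=datetime.timedelta(hours=1),
    )

    # Subsequent questions reuse the cached tokens at a reduced input rate.
    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
    response = model.generate_content("What were the main revenue drivers this year?")
    print(response.text)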
In various sections of this guide, we discuss how Gemini 1.5 models achieve high performance on various needle-in-a-haystack retrieval evaluations. These tests use the most basic setup, where there is only a single needle to find. If instead you are looking for multiple "needles" or specific pieces of information, the model does not perform with the same accuracy, and performance can vary considerably with the context. This matters because there is an inherent trade-off between retrieving the right information and cost: a single query gives you roughly 99% accuracy, but you pay the input token cost every time you send it. So to retrieve 100 pieces of information, each at 99% performance, you would likely need to send 100 requests. This is a good example of where context caching can significantly reduce the cost of using Gemini models while keeping performance high.
As a general rule, if you don't need to pass tokens to the model, it is best not to pass them. That said, if you have a large block of tokens containing certain information and want to ask about that information, the model is very good at extracting it (in many cases with up to 99% accuracy).
Gemini 1.5 Pro achieves 100% recall at 530,000 tokens and over 99.7% recall at up to 1 million tokens.
If you have a similar set of tokens/context that you want to reuse many times, context caching can help reduce the costs associated with asking questions about that information.
All developers now have access to context windows of 2 million tokens using Gemini 1.5 Pro.
Any given request, regardless of its size, incurs some latency, but in general the longer the query, the higher the latency (time to first token).
Performance figures are cited in different parts of this guide, but overall, Gemini 1.5 Pro performs better on most long-context use cases.