9 Challenges to Unlock Global Knowledge

Starting with Arabic Documents

In today's digital age, information drives progress and innovation. The vast expanse of hard-to-find knowledge—buried in documents—is invaluable to researchers, scholars, and professionals across various fields. From historical manuscripts to contemporary reports, these documents hold the potential to unlock new insights, foster cultural understanding, inform more effective decision-making, and lead to scientific breakthroughs.

Unfortunately, extracting and organizing such knowledge is fraught with challenges. Take Arabic documents for example: The unique complexities of the Arabic language, including its rich linguistic structure and script variations, present significant obstacles. Additionally, the scarcity of advanced tools and technologies tailored to the Arabic script compounds these difficulties, leaving much of this critical knowledge inaccessible. We’ve faced such challenges while working on AI-powered solutions that serve many languages and hundreds of millions of customers across the globe.

Moreover, finding information is challenging in any language. Modern search systems expand search terms with synonyms (e.g., adding “vehicle” to a search for “car”), which becomes prohibitive in a language like Arabic that has 348 words for “lion”. Expansion can also be counterproductive because a word's meaning changes with context (e.g., “bank” in “river bank”). Searches may also fail, or take many attempts, when the user cannot specify key terms precisely (e.g., looking for the figure “1,931,512” while only recalling that it was close to two million, or looking for “books by a Chicago professor who visited Cairo in the middle of last century”).

For English and similar languages, modern search systems have evolved to use generative AI effectively to improve the search experience and go beyond finding needles in haystacks. Generative AI models, on their own, are remarkably good at conversing and reasoning in natural language; however, they are confined to the knowledge in the source material used to develop them. In addition, that material teaches these models patterns to recognize and follow; it does not give them precise recall of what they read. To make effective use of language models in search, we connect them to external knowledge sources, similar to asking students questions in open-book exams. Techniques such as Retrieval-Augmented Generation (RAG) blend the best of retrieval-based and generative AI techniques, enabling search systems to match knowledge to queries in more intuitive ways.

Retrieval-Augmented Generation

RAG finds indexed documents relevant to a search query, then passes them as context to a generative large language model, such as Google's Gemini, to synthesize a response. The results are grounded in the indexed corpus, richly informative, and contextually relevant. Thanks to the reasoning and language understanding capabilities of such AI models, they can perform tasks on the fly (e.g., extracting information from multiple documents and tabulating a comparison) and can carry a conversation: no more pages of blue links. Adding RAG to conversational AI assistants greatly improves search experiences, offering more intuitive responses and addressing users' needs more efficiently.
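The retrieve-then-generate flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline behind the assistant: word-overlap cosine similarity stands in for an embedding model, the corpus, the document ids, and the function names retrieve and build_prompt are all hypothetical, and the call to the language model is left out.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus: document ids and texts are made up for illustration.
CORPUS = {
    "report.pdf#p3": "GDP growth in manufacturing reached 7 percent in 2020",
    "weather.pdf#p1": "Rainfall in 2020 was below average",
    "memo.pdf#p1": "The cafeteria menu changes every Monday",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=2):
    """Return the k (doc_id, text) pairs most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        corpus.items(),
        key=lambda item: cosine(q, Counter(tokenize(item[1]))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, passages):
    """Assemble a grounded prompt; a real system would send this to an LLM."""
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the sources below and cite them by id.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )

question = "manufacturing growth in 2020"
print(build_prompt(question, retrieve(question, CORPUS)))
```

The grounding comes from restricting the model to the retrieved passages and asking it to cite them by id; swapping the word-overlap score for embedding similarity changes the ranking quality but not the overall shape of the flow.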

A high-level diagram of how our multilingual RAG-powered AI assistant works.

The diagram above shows a corpus of PDF files because they are the epitome of unstructured documents (as opposed to ones that strictly adhere to a known schema) and text content in PDFs—especially in Arabic—is notoriously hard for software tools to read. Parsing Arabic text from PDF files presents several technical challenges, some of which are unique to the Arabic language and script, while others are common issues faced in parsing text from PDFs in general. Here are the main difficulties:

1. Complex Script Rendering

Arabic is a complex script with characters that change shape depending on their position in a word (isolated, initial, medial, or final). This complexity can pose a challenge for text extraction tools, which may incorrectly interpret these variations. For example, here are two errors highlighted in red:

وســجلت اقتصــادات دول أمريــكا الالتينيــة والكاريبــي تراجع ً ـا نسـبته 7.0 فـي المئـة فـي عـام 2020م
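One common symptom of positional shaping is that an extractor emits Unicode presentation-form code points (the shaped glyph variants) instead of the base letters, so the extracted word no longer matches a typed query. The sketch below folds such code points back to base letters with NFKC normalization from Python's standard library. The function names are hypothetical, and this addresses only that one symptom; it would not repair reordered or detached characters like those in the sentence above.

```python
import unicodedata

def has_presentation_forms(text: str) -> bool:
    """True if any character lies in Arabic Presentation Forms-A or -B."""
    return any(
        "\uFB50" <= ch <= "\uFDFF" or "\uFE70" <= ch <= "\uFEFC"
        for ch in text
    )

def to_base_letters(text: str) -> str:
    """Fold shaped glyph code points back to base letters using NFKC."""
    return unicodedata.normalize("NFKC", text)

# The word for "Arabic" stored as initial/final glyph forms,
# as some PDF extractors emit it.
shaped = "\uFECB\uFEAE\uFE91\uFEF2"
plain = to_base_letters(shaped)

print(has_presentation_forms(shaped), has_presentation_forms(plain))
print(plain == "\u0639\u0631\u0628\u064A")
```

NFKC also folds other compatibility characters (full-width digits, for instance), so it may be more aggressive than some corpora want.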

2. Right-to-Left Writing Direction

Arabic is written from right to left, the opposite of the many languages that are written from left to right. This can cause issues with text extraction tools that are primarily designed for left-to-right languages, leading to problems with text ordering and alignment.
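To make the ordering problem concrete, here is a small sketch using only Python's standard library. base_direction applies the "first strong character" idea from the Unicode Bidirectional Algorithm to guess a line's direction. visual_to_logical is a crude, hypothetical heuristic that assumes an extractor emitted the characters of a right-to-left line in left-to-right visual order; it reverses the line and then restores the digit and Latin runs. It ignores diacritics and multi-word Latin phrases, so treat it as an illustration rather than a full bidi implementation.

```python
import re
import unicodedata

def base_direction(text):
    """Guess direction from the first strong character (L, R, or AL)."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return None  # no strong character found (digits, punctuation only)

# Runs of ASCII digits/letters, optionally joined by '.' or ','.
_LTR_RUN = re.compile(r"[0-9A-Za-z](?:[0-9A-Za-z.,]*[0-9A-Za-z])?")

def visual_to_logical(line):
    """Heuristic: reverse a visually ordered RTL line, then un-reverse LTR runs."""
    reversed_line = line[::-1]
    return _LTR_RUN.sub(lambda m: m.group()[::-1], reversed_line)

# "year 2020" in logical order, and the same line as a naive extractor
# might emit it (characters in left-to-right visual order).
logical = "\u0639\u0627\u0645 2020"
visual = "2020 \u0645\u0627\u0639"

print(base_direction(logical))
print(visual_to_logical(visual) == logical)
```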

3. Ligatures and Diacritics

Arabic uses ligatures, where two or more characters are combined into a single glyph, and diacritics, which are small markings that change the pronunciation and meaning of words. For example: “جِبِلًّا” is a combination of the following characters and markings “ج ِب ِل ّ ً ا”. Extracting these accurately can be challenging, as PDFs may not always store them as separate entities from their base characters; extracting such text from images in PDFs is even more challenging. Arabic also features "word ligatures" that encode frequently used words and phrases as a single character; searching for a word inside such a ligature has to consider its expanded form. For example: ﷽ is a single character (U+FDFD) in Unicode, the commonly used text encoding standard.

4. Encoding Issues

The way Arabic text is encoded in a PDF can vary; such variance can lead to Mojibake, the garbled text most of us have experienced when working with documents written in Arabic. For example:

Ø§Ù„Ø¥Ø¹Ù„Ø§Ù† Ø§Ù„Ø¹Ø§Ù„Ù…Ù‰ Ù„ØÙ‚ÙˆÙ‚ Ø§Ù„Ø¥Ù†Ø³Ø§Ù†

Nonetheless, the text cleaning step in our ingestion pipeline takes care of messy data and files saved using mismatched encoding standards; the text above is ingested correctly as:

الإعلان العالمي لحقوق الإنسان
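The garbling shown above is what UTF-8 bytes look like when they are decoded with a single-byte Western encoding. A minimal sketch of the reverse round trip follows; the function name repair_mojibake is hypothetical, and real cleaning steps need to handle more cases (mixed encodings, bytes lost in extraction) than this does.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as cp1252 or Latin-1.

    Returns the input unchanged when neither round trip succeeds, which is
    what happens for text that was decoded correctly in the first place.
    """
    for codec in ("cp1252", "latin-1"):
        try:
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text

# Simulate the failure: encode Arabic as UTF-8, decode with the wrong codec.
original = "الإعلان"
garbled = original.encode("utf-8").decode("latin-1")

print(repair_mojibake(garbled) == original)
```

Trying cp1252 before Latin-1 matters because the two differ in the 0x80 to 0x9F range, which Arabic UTF-8 byte sequences use heavily.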

5. Font Issues

It’s very common for PDFs to use custom, decorative fonts. If the extraction tool doesn't adapt to these, it can lead to incorrect character representations, missing text, or even crashes when loading the text. Certain features specific to Arabic, such as Tatwil or Kashida for text justification, can complicate text extraction. These features might confuse standard text extraction tools and downstream tasks as well (e.g., text understanding). Here is an example that combines missing text (spaces, in this case) and Tatwil issues, highlighted in red:

شـــهدت مؤشـــرات الأداء للأنشـــطة الاقتصاديـــةنموًاملحوظًاحيثتشـــيرالبيانات إلـــى أن النمـــو الأكبر كان في نشـــاط الصناعات التحويليـــة
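Of the two problems in this example, the Tatwil one is the easier to handle, because the elongation character is a single code point (U+0640) that carries no meaning of its own. The sketch below removes it and reports how much text was saved; the function names are hypothetical, and restoring the missing spaces would need a separate, language-aware step that is not shown here.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the elongation character

def strip_tatweel(text: str) -> str:
    """Remove purely decorative elongation characters."""
    return text.replace(TATWEEL, "")

def savings(text: str) -> dict:
    """Compare size before and after cleaning, in characters and UTF-8 bytes."""
    cleaned = strip_tatweel(text)
    return {
        "chars_before": len(text),
        "chars_after": len(cleaned),
        "utf8_bytes_saved": len(text.encode("utf-8")) - len(cleaned.encode("utf-8")),
    }

# First word of the example above: three tatweels inside a four-letter word.
word = "\u0634\u0640\u0640\u0640\u0647\u062F\u062A"
print(strip_tatweel(word))
print(savings(word))
```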

Moreover, decorative characters (such as Tatwil) increase the costs and latencies of storing, retrieving, and processing text — bloating search systems and hiking up bills. Recurring costs add up every time a document with decorative characters is indexed, passed to an embedding model (less frequent), retrieved as a match for a query, passed to a large language model to process (more frequent), and so on. To reduce latency and cost, our RAG system cleans up text at ingestion, effectively compressing it without any loss. We also use AI models (for ingestion, processing, and post-processing) that are more efficient at understanding Arabic: They break down a sentence into fewer processing units (tokens) than most models do with Arabic text (again, without any loss). Since many commercial providers of Generative AI models charge by the token, and measure model speed using tokens/second, reducing the number of tokens is highly desirable at scale. Below we show an example of savings in characters and tokens for a single PDF file (a financial report) of 182 pages:

6. Non-Standard Text Layers

PDFs can contain text in different layers, such as in images, annotations, or forms. Extracting text from these layers requires additional processing; Optical Character Recognition (OCR) might be needed for text within images. Monta AI utilizes OCR when ingesting such elements; for example:

ChatGPT Plus, despite its astounding capabilities, failed to answer the same question (due to the need for OCR):

7. Formatting and Layout Issues

PDFs often use complex layouts and formatting, which can include tables, columns, forms, footnotes, and embedded images. Understanding the intended structure of the document when parsing it can be a difficult task, especially when the layout is non-linear or interrupted. For example, consider extracting data from complex tables (highlights in screenshots below were added to help readers of this blogpost find relevant information in the referenced PDFs):

Layout-aware ingestion also enables Monta AI to answer questions when the text is flowing from one column to another and across multiple pages while interrupted by a table — all at the same time:

Layout understanding matters, as shown by the inadequate answer ChatGPT Plus gave for the same query:

Monta AI’s assistant correctly answers questions about information (e.g., a radio-button selection) in a form thanks to understanding form controls and the custom reformatting steps in ingestion:

When asked the same question, ChatGPT Plus gave a verbose yet unsatisfying explanation of why it couldn’t determine the answer:

8. Chunking

To ingest a knowledge base of PDFs in preparation for search, the documents need to be broken down into smaller pieces in a process known as chunking. Determining chunk boundaries effectively is a challenging task: make chunks too small and context gets lost as atomic units break across multiple chunks; make them too big and context gets diluted with irrelevant, confusing information. Simplistic methods do not work well. For example, even splitting text into sentences is hard: telltale signs such as punctuation are insufficient, periods in abbreviations get mistaken for sentence endings, and so on. To understand the flow and boundaries of atomic units conducive to effective search, our RAG system takes cues from the meanings of various textual and non-textual elements in a PDF (it is both language-aware and layout-aware).
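For contrast with the language-aware and layout-aware approach described above, here is a sketch of a simple baseline: split on sentence punctuation while skipping a small, hypothetical abbreviation list, then pack whole sentences into chunks with a one-sentence overlap. The names, the abbreviation list, and the size limits are illustrative assumptions; this baseline knows nothing about tables, columns, or page breaks.

```python
import re

# Hypothetical abbreviation list; a real one would be language-specific.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "e.g", "i.e", "vs"}

def split_sentences(text):
    """Split on . ! ? and the Arabic question mark, skipping known abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?؟]+(?:\s+|$)", text):
        words = text[start:match.start()].split()
        last = words[-1].lower().lstrip("(\"'") if words else ""
        if match.group().startswith(".") and last in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def chunk(text, max_chars=200, overlap=1):
    """Greedily pack whole sentences into chunks, repeating `overlap` sentences."""
    chunks, current = [], []
    for sentence in split_sentences(text):
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(split_sentences("Dr. Smith arrived. He left early."))
print(chunk("A one. B two. C three.", max_chars=14, overlap=1))
```

The overlap keeps a sentence of shared context between neighboring chunks, which softens, but does not solve, the boundary problem described above.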

9. Generative-AI Issues

To generate answers given relevant evidence and citations, a large language model (LLM) is used to understand and generate Arabic text. Capable LLMs take more time and cost more to produce high-quality responses. As mentioned earlier, our RAG system uses AI models that are better at understanding Arabic and do so more efficiently, speeding up processing time and saving costs. Moreover, LLMs tend to be too verbose and produce walls of text to answer questions that are better answered as a table, a list, or a short commentary on an inline PDF page that has the answer, all of which are formats our RAG system automatically understands and produces based on the query and given context.

Monta AI

To effectively parse PDFs, Monta AI combines an ensemble of specialized techniques developed to handle the complexities of the Arabic script and language, as well as the general challenges of extracting text from PDFs. Our systems go beyond extracting text to understand the content of PDFs using cues such as layout and metadata, which helps the overall solution provide highly accurate answers in an intuitive format. The AI assistant decides which relevant context to use to respond to queries, while citing and displaying embedded pages used to formulate the response.

Conclusion

The journey to unlock the vast reservoirs of knowledge contained within Arabic documents is fraught with unique challenges, ranging from the linguistic intricacies of the Arabic language to the technical hurdles of parsing text from PDFs. Yet, the advent of technologies like multilingual RAG-powered AI assistants heralds a new era of possibilities. By combining multilingual embedding models and LLMs with specialized techniques for dealing with complex scripts and document formats, we are on the cusp of making this wealth of information easily accessible to a global audience. This endeavor is not just about overcoming technical obstacles; it's a catalyst for bridging cultures, expanding knowledge, and fostering a deeper understanding across the diverse tapestry of humanity. As we continue to refine these technologies and methods, the promise of seamlessly extracting and leveraging the rich insights hidden in Arabic documents becomes much closer to a tangible reality, lighting the way for future innovations and discoveries for all languages around the globe.
