Engineering August 13, 2025 Monta AI

9 Challenges to Unlock Global Knowledge

Hard-to-find knowledge in documents is invaluable but difficult to extract, especially for complex languages like Arabic. Traditional search and generative AI both have significant limitations in accessing this information. RAG connects AI models to external knowledge sources, enabling more effective search and knowledge retrieval.

Unlocking Knowledge from Arabic Documents: Building Advanced RAG Systems

Starting with Arabic Documents

In today’s digital age, information drives progress and innovation. The vast expanse of hard-to-find knowledge—buried in documents—is invaluable to researchers, scholars, and professionals across various fields. From historical manuscripts to contemporary reports, these documents hold the potential to unlock new insights, foster cultural understanding, inform more effective decision-making, and lead to scientific breakthroughs.

Unfortunately, extracting and organizing such knowledge is fraught with challenges. Take Arabic documents for example: The unique complexities of the Arabic language, including its rich linguistic structure and script variations, present significant obstacles. Additionally, the scarcity of advanced tools and technologies tailored to the Arabic script compounds these difficulties, leaving much of this critical knowledge inaccessible. We’ve faced such challenges while working on AI-powered solutions that serve many languages and hundreds of millions of customers across the globe.

The Search Challenge in Arabic

Language-Specific Search Complexity

Search Challenge	Example	Impact
Synonym Expansion	Arabic has 348 words for “lion”	Term expansion becomes prohibitive
Context Sensitivity	“River bank” vs. financial “bank”	Meanings change in different contexts
Imprecise Queries	Looking for “1,931,512” but remembering “close to two million”	Exact term specification required
Fuzzy Memory	“Books by a Chicago professor who visited Cairo in the middle of last century”	Complex, multi-faceted search criteria

Finding information is challenging in any language. Modern search systems expand search terms with synonyms (e.g., “vehicle” when searching for “car”), which becomes prohibitive in a language like Arabic that has 348 words for “lion”. Term expansion can also be counterproductive, because a word’s meaning changes with context (e.g., the “bank” in “river bank” versus a financial “bank”). Searches may also fail, or take many attempts, when the user cannot specify key terms precisely.
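To make the cost of synonym expansion concrete, here is a minimal sketch of naive query expansion. The synonym lists and the `expand_query` helper are hypothetical and for illustration only; the point is that the number of expanded queries is the product of the number of options per term.

```python
from itertools import product
from math import prod


def expand_query(terms, synonyms):
    """Return every query variant obtained by swapping in synonyms."""
    # Each term contributes itself plus its known synonyms.
    options = [[term] + synonyms.get(term, []) for term in terms]
    return [" ".join(choice) for choice in product(*options)]


# Hypothetical English synonym lists, for illustration only.
synonyms = {"car": ["vehicle", "automobile"], "cheap": ["affordable"]}
variants = expand_query(["cheap", "car"], synonyms)
print(len(variants), variants)  # 2 options x 3 options = 6 variants

# One term with 348 options, combined with two terms that have 5 and 3 options:
print(prod([348, 5, 3]))  # 5220 variants for a three-word query
```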

The Evolution of Search with Generative AI

For English and similar languages, modern search systems have evolved to use generative AI effectively to improve the search experience and go beyond finding needles in haystacks.

Generative AI Capabilities and Limitations

Capability	Strength	Limitation
Natural Conversation	Understands and responds in natural language	Confined to training data knowledge
Reasoning	Can analyze and synthesize information	Pattern recognition, not precise recall
Context Understanding	Grasps complex queries	Needs external knowledge for accuracy

On their own, generative AI models converse and reason in natural language remarkably well; however, they are confined to the knowledge in the source material used to develop them. In addition, that material teaches the models patterns to recognize and follow rather than giving them precise recall of what they read.

Solution: To make effective use of language models in search, we connect them to external knowledge sources, similar to asking students questions in open-book exams. Techniques such as Retrieval-Augmented Generation (RAG) blend the best of retrieval-based and generative AI techniques, enabling search systems to match knowledge to queries in more intuitive ways.

Retrieval-Augmented Generation (RAG)

RAG finds indexed documents relevant to a search query, then passes them as context to a generative large language model, such as Google’s Gemini, which synthesizes a response. The results are grounded in the indexed corpus, richly informative, and contextually relevant.
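The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration, not our production pipeline: retrieval here uses word-count cosine similarity rather than an embedding model, the corpus is made up, and the final call to a language model is left out (the sketch stops at the prompt that would be sent).

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector; \\w+ also matches Arabic letters in Python 3."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=2):
    """Return the k passages most similar to the query."""
    q = vectorize(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, vectorize(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Assemble the grounded prompt that would be passed to an LLM."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return (
        "Answer using only the sources below and cite them.\n"
        f"{context}\nQuestion: {query}"
    )


# Made-up corpus, for illustration only.
corpus = [
    "The Nile is the longest river in Africa.",
    "Cairo is the capital of Egypt.",
    "Manufacturing grew faster than any other sector.",
]
question = "What is the capital of Egypt?"
top = retrieve(question, corpus, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Numbering the passages in the prompt is what lets the model cite its sources in the answer.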

RAG Benefits

Feature	Benefit
Grounded Responses	Answers backed by actual documents
Rich Information	Comprehensive, synthesized insights
Contextual Relevance	Answers tailored to specific queries
On-the-fly Tasks	Extract, compare, tabulate information dynamically
Conversational	Natural dialogue instead of blue links

Thanks to the reasoning and language understanding capabilities of such AI models, they can perform tasks on-the-fly (e.g., extracting information from multiple documents and tabulating a comparison) and can carry a conversation — no more pages of blue links. Adding RAG to conversational AI assistants greatly improves search experiences, offering more intuitive responses and addressing users’ needs more efficiently.

[Diagram: Retrieval-Augmented Generation with a knowledge corpus of documents]

The PDF Challenge: Why Arabic Text Extraction Is Hard

The diagram above shows a corpus of PDF files for two reasons: PDFs are the epitome of unstructured documents (as opposed to ones that strictly adhere to a known schema), and text content in PDFs, especially in Arabic, is notoriously hard for software tools to read.

Parsing Arabic text from PDF files presents several technical challenges, some of which are unique to the Arabic language and script, while others are common issues faced in parsing text from PDFs in general.

1. Complex Script Rendering

Character Shape Variations

Position	Shape Example	Challenge
Isolated	Different form	Extraction tools may misinterpret
Initial	Different form	Character shape changes
Medial	Different form	Contextual variations
Final	Different form	Position-dependent rendering

Arabic is a complex script with characters that change shape depending on their position in a word (isolated, initial, medial, or final). This complexity can pose a challenge for text extraction tools, which may incorrectly interpret these variations.
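One common way this shows up: some PDFs store the positional glyphs themselves (Unicode "presentation forms") instead of the underlying letters, so the same letter appears as four different code points. A minimal sketch, assuming the extracted text contains such presentation forms, uses Unicode NFKC normalization from Python's standard library to fold them back to the base letter. The letter ain (ع) is used as the example.

```python
import unicodedata

# The four presentation forms of the letter ain, written as code points.
AIN_FORMS = {
    "isolated": "\uFEC9",
    "final": "\uFECA",
    "initial": "\uFECB",
    "medial": "\uFECC",
}


def to_base_letters(text):
    """Fold positional presentation forms back to their base letters."""
    return unicodedata.normalize("NFKC", text)


for position, glyph in AIN_FORMS.items():
    base = to_base_letters(glyph)
    print(position, f"U+{ord(glyph):04X}", "->", f"U+{ord(base):04X}")
```

After normalization all four forms compare equal to U+0639, so a search for the plain letter matches regardless of how the PDF stored it.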

Example of Extraction Errors:

وســجلت اقتصــادات دول أمريــكا الالتينيــة والكاريبــي تراجع ً ـا نسـبته 7.0 فـي المئـة فـي عـام 2020م

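(Errors were highlighted in red in the original text. They include the misordered letters in “الالتينيــة”, which should read “اللاتينية”, and the diacritic detached from its word in “تراجع ً ـا”.)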

2. Right-to-Left Writing Direction

Directional Challenges

Aspect	Arabic	Most Languages	Impact
Writing Direction	Right-to-Left (RTL)	Left-to-Right (LTR)	Text ordering issues
Tool Design	Rarely optimized	Primary focus	Alignment problems
Mixed Content	RTL + LTR numbers/English	Consistent direction	Complex parsing

Arabic is written from right to left, the opposite of most languages. Text extraction tools designed primarily for left-to-right languages can therefore produce incorrect text ordering and alignment, particularly when a line mixes Arabic with numbers or English.
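To illustrate why mixed content is hard, the following sketch splits a line into directional runs using the bidirectional class that Unicode assigns to each character. It is a simplification for illustration (the full Unicode Bidirectional Algorithm has many more rules), and the helper names are hypothetical.

```python
import unicodedata
from itertools import groupby

RTL_CLASSES = {"R", "AL"}        # right-to-left letters, including Arabic
LTR_CLASSES = {"L", "EN", "AN"}  # Latin letters and digits, laid out left to right


def token_direction(token):
    """Direction of a token, decided by its first strongly typed character."""
    for ch in token:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "RTL"
        if bidi in LTR_CLASSES:
            return "LTR"
    return "neutral"


def directional_runs(text):
    """Group consecutive tokens that share a direction."""
    tokens = text.split()
    return [(d, list(group)) for d, group in groupby(tokens, key=token_direction)]


# Tokens are listed in logical (reading) order.
tokens = ["سجلت", "تراجعا", "7.0", "في", "عام", "2020"]
runs = directional_runs(" ".join(tokens))
for direction, words in runs:
    print(direction, words)
```

A tool that ignores these runs and emits characters in visual order will reverse either the Arabic words or the numbers.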

3. Ligatures and Diacritics

Character Composition Complexity

Element	Example	Storage Challenge
Base Characters	ج ب ل ا	Individual glyphs
Diacritics	ِ ِ ّ ً	Separate markings
Combined Form	جِبِلًّا	Single visual unit
Ligatures	Multiple chars → one glyph	Not stored separately

Arabic uses ligatures, where two or more characters are combined into a single glyph, and diacritics, which are small markings that change the pronunciation and meaning of words.

Example: “جِبِلًّا” is a combination of the following characters and markings: ج ِب ِل ّ ً ا

Challenge: Extracting these accurately is difficult because PDFs do not always store them as entities separate from their base characters; extracting such text from images in PDFs is even harder.
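When the text is stored as proper Unicode, the composition can be inspected directly. The sketch below, using only Python's standard library, separates base letters from diacritics for the example word and expands a lam-alef ligature into its two letters. Whether diacritics should be stripped for indexing is a design choice; the `strip_diacritics` helper is shown as one hypothetical option, not as a description of our pipeline.

```python
import unicodedata

# The example word, spelled out as code points:
# jeem, kasra, beh, kasra, lam, shadda, fathatan, alef
word = "\u062C\u0650\u0628\u0650\u0644\u0651\u064B\u0627"


def split_marks(text):
    """Separate base characters from combining marks (diacritics)."""
    base = [c for c in text if not unicodedata.combining(c)]
    marks = [c for c in text if unicodedata.combining(c)]
    return base, marks


def strip_diacritics(text):
    """Drop combining marks, keeping only base letters."""
    return "".join(c for c in text if not unicodedata.combining(c))


base, marks = split_marks(word)
print(len(base), "base letters,", len(marks), "diacritics")
print(strip_diacritics(word))

# A lam-alef ligature stored as one code point expands to two letters.
ligature = "\uFEFB"
expanded = unicodedata.normalize("NFKC", ligature)
print(len(ligature), "->", len(expanded))
```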

4. Encoding Issues

The way Arabic text is encoded in a PDF can vary; such variance can lead to Mojibake, the garbled text most of us have experienced when working with documents written in Arabic.

Encoding Problem Example

State	Text Display
Corrupted (Mojibake)	`Ø§Ù„Ø¥Ø¹Ù„Ø§Ù† Ø§Ù„Ø¹Ø§Ù„Ù…Ù‰ Ù„ØÙ‚ÙˆÙ‚ Ø§Ù„Ø¥Ù†Ø³Ø§Ù†`
Correctly Processed	الإعلان العالمي لحقوق الإنسان

Our Solution: The text cleaning step in our ingestion pipeline takes care of messy data and files saved using mismatched encoding standards; the text above is ingested correctly.
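The most common cause of Mojibake like the example above is UTF-8 bytes being decoded with a single-byte Western codec. A minimal sketch of reversing that specific mistake follows; it is an illustration of the idea rather than our cleaning step, and `repair_mojibake` is a hypothetical helper that only handles the two codecs listed.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded with a single-byte codec."""
    for wrong_codec in ("cp1252", "latin-1"):
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(wrong_codec).decode("utf-8")
        except UnicodeError:
            continue  # this codec cannot explain the garbling; try the next
    return text  # already clean, or garbled in a way this sketch cannot undo


original = "الإعلان"
garbled = original.encode("utf-8").decode("cp1252")  # simulate the mistake
print(garbled)
print(repair_mojibake(garbled))
```

Text that is already clean Arabic cannot be encoded with either codec, so it is returned unchanged.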

5. Font Issues

Custom Font Challenges

Issue	Description	Impact
Custom Fonts	PDFs use decorative, non-standard fonts	Incorrect character representation
Missing Fonts	Fonts not available to extraction tool	Missing text or crashes
Tatwil/Kashida	Text justification feature in Arabic	Confuses extraction tools
Missing Spaces	Combined with decorative elements	Lost word boundaries

It’s very common for PDFs to use custom, decorative fonts. If the extraction tool doesn’t adapt to these, it can lead to incorrect character representations, missing text, or even crashes when loading the text.

Example combining missing spaces and Tatwil:

شـــهدت مؤشـــرات الأداء للأنشـــطة الاقتصاديـــةنموًاملحوظًاحيثتشـــيرالبيانات إلـــى أن النمـــو الأكبر كان في نشـــاط الصناعات التحويليـــة

Cost Impact of Decorative Characters

Process Stage	Impact of Decorative Characters	Frequency
Indexing	Increased storage costs	One-time per document
Embedding	Higher processing costs	Less frequent
Retrieval	Slower matching	Per query
LLM Processing	Increased latency and cost	More frequent

Moreover, decorative characters (such as Tatwil) increase the costs and latencies of storing, retrieving, and processing text — bloating search systems and hiking up bills. Recurring costs add up every time a document with decorative characters is indexed, passed to an embedding model, retrieved as a match for a query, passed to a large language model to process, and so on.

Our Solution: To reduce latency and cost, our RAG system cleans up text at ingestion, efficiently removing decorative characters while preserving meaning.
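As a rough illustration of the savings, the sketch below removes the Tatwil character (U+0640) and measures how much shorter the text becomes. It covers only this one decorative character and is not a description of our full cleaning step; the sample string is constructed for the example.

```python
TATWIL = "\u0640"  # the Arabic elongation character


def remove_tatwil(text):
    """Remove elongation characters, which carry no meaning."""
    return text.replace(TATWIL, "")


# Two words, each stretched with three Tatwil characters.
decorated = "ش" + TATWIL * 3 + "هدت" + " " + "مؤش" + TATWIL * 3 + "رات"
clean = remove_tatwil(decorated)

chars_saved = len(decorated) - len(clean)
bytes_saved = len(decorated.encode("utf-8")) - len(clean.encode("utf-8"))
print(clean)
print(f"{chars_saved} characters and {bytes_saved} UTF-8 bytes saved")
```

Because the cleaned text is what gets indexed, embedded, retrieved, and sent to the LLM, the saving applies at every one of those stages.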

[Image: Savings from cleaning text]

6. Non-Standard Text Layers

PDF Layer Complexity

Layer Type	Content	Extraction Method
Standard Text	Searchable text layer	Direct extraction
Images	Text embedded in images	OCR required
Annotations	Comments, highlights	Specialized parsing
Forms	Form fields and data	Form-aware extraction

PDFs can contain text in different layers, such as in images, annotations, or forms. Extracting text from these layers requires additional processing; Optical Character Recognition (OCR) might be needed for text within images.
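A simple way to decide when OCR is needed is to fall back to it whenever a page's text layer is empty or suspiciously short. The sketch below shows that decision only. The `Page` type, the `min_chars` threshold, and the `fake_ocr` stand-in are all hypothetical; a real system would plug in an actual PDF parser and OCR engine.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    text_layer: str          # text stored directly in the PDF, possibly empty
    image: Optional[bytes]   # rendered page image, if one is available


def extract_page_text(page: Page, ocr: Callable[[bytes], str], min_chars: int = 20) -> str:
    """Use the text layer when it looks usable; otherwise fall back to OCR."""
    text = page.text_layer.strip()
    if len(text) >= min_chars or page.image is None:
        return text
    return ocr(page.image).strip()


def fake_ocr(image: bytes) -> str:
    """Stand-in for a real OCR engine, for illustration only."""
    return "نص مستخرج من صورة"


scanned = Page(text_layer="", image=b"...")
digital = Page(text_layer="This page has a proper searchable text layer.", image=None)
print(extract_page_text(scanned, fake_ocr))
print(extract_page_text(digital, fake_ocr))
```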

Monta AI vs. ChatGPT Plus: OCR Capabilities

System	OCR Support	Result
Monta AI	✅ Advanced OCR	Successfully extracts text from images
ChatGPT Plus	❌ Limited	Fails on image-embedded text

Monta AI utilizes OCR when ingesting such elements:

[Image: Monta AI response using text extracted from different layers]

ChatGPT Plus, despite its astounding capabilities, failed to answer the same question (due to the need for OCR):

[Image: ChatGPT Plus response]

7. Formatting and Layout Issues

PDFs often use complex layouts and formatting, which can include tables, columns, forms, footnotes, and embedded images. Understanding the intended structure of the document when parsing it can be a difficult task, especially when the layout is non-linear or interrupted.

Complex Layout Challenges

Layout Element	Challenge	Example Use Case
Tables	Multi-cell data extraction	Financial reports, statistics
Multi-column	Reading order determination	Newspapers, academic papers
Forms	Field identification and extraction	Applications, surveys
Footnotes	Reference linking	Research documents
Embedded Images	Context understanding	Mixed-media reports
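Reading order in a multi-column page can be sketched as a sort over text blocks: first by column, then from top to bottom. For Arabic, the rightmost column comes first. The `Block` type, the equal-width column assumption, and the coordinate convention (y grows downward) are simplifications for illustration; real layouts need column detection and handling of elements that span columns.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x: float   # left edge of the block
    y: float   # top edge of the block, growing downward
    text: str


def reading_order(blocks, page_width, columns=2, rtl=True):
    """Order blocks column by column, rightmost column first for RTL pages."""
    col_width = page_width / columns

    def key(block):
        col = min(int(block.x // col_width), columns - 1)
        if rtl:
            col = columns - 1 - col  # flip so the right column sorts first
        return (col, block.y)

    return [b.text for b in sorted(blocks, key=key)]


blocks = [
    Block(50, 100, "left-top"),
    Block(350, 100, "right-top"),
    Block(350, 400, "right-bottom"),
    Block(50, 400, "left-bottom"),
]
order = reading_order(blocks, page_width=600)
print(order)
```

Sorting the same blocks with `rtl=False` yields the order an LTR-oriented tool would produce, which scrambles an Arabic page.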

Example: Complex Table Extraction

(Highlights in screenshots below were added to help readers of this blogpost find relevant information in the referenced PDFs)

[Image: Complex layouts and formatting, example 1]

[Image: Complex layouts and formatting, example 2]

Layout-Aware Processing Capabilities

Multi-page, Multi-column Flow:

Layout-aware ingestion enables Monta AI to answer questions when the text is flowing from one column to another and across multiple pages while interrupted by a table — all at the same time:

[Image: Layout-aware ingestion]

Comparison: Monta AI vs. ChatGPT Plus

Feature	Monta AI	ChatGPT Plus
Layout Understanding	✅ Advanced	❌ Limited
Multi-column Flow	✅ Accurate	❌ Fails
Form Extraction	✅ Precise	❌ Cannot determine

Layout understanding matters, as shown by the incomplete answer ChatGPT Plus gave for the same query:

[Image: ChatGPT Plus answer to the same query]

Form Control Understanding

Monta AI’s assistant correctly answers questions about information (e.g., a radio-button selection) in a form thanks to understanding form controls and the custom reformatting steps in ingestion:

[Image: Monta AI’s assistant correctly answering a question about a form selection]

When asked the same question, ChatGPT Plus gave a verbose—yet dissatisfying—explanation as to why it couldn’t determine the answer:

[Image: ChatGPT Plus response explaining why it could not determine the answer]

8. Chunking

Before a knowledge base of PDFs can be searched, the documents must be broken down into smaller pieces in a process known as chunking.

Chunking Strategy Challenges

Approach	Chunk Size	Problem
Too Small	Individual sentences	Context gets lost, atomic units break
Too Large	Multiple paragraphs	Context diluted with irrelevant info
Simplistic (period-based)	Variable	Abbreviation periods mistaken for sentence ends
Optimal (Monta AI)	Semantic boundaries	Preserves meaning and context

The Challenge: Chunk boundaries are hard to get right. Make chunks too small and context gets lost as atomic units break across multiple chunks; make them too big and context gets diluted with irrelevant, confusing information.

Why Simple Methods Fail:

Method	Problem
Sentence splitting	Periods in abbreviations mistaken for sentence ends
Fixed character count	Breaks mid-thought or mid-word
Paragraph-based	Paragraphs may be too long or too short

Our Solution: To understand the flow and boundaries of atomic units conducive to effective search, our RAG system takes cues from the meanings of various textual and non-textual elements in a PDF (language-aware and layout-aware).
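The failure of period-based splitting, and one small step beyond it, can be sketched as follows. This is a deliberately simple illustration and not our semantic chunker: it only protects a hypothetical list of abbreviations and then packs whole sentences into chunks under an assumed size limit, without any layout or meaning cues.

```python
ABBREVIATIONS = {"Dr.", "e.g.", "i.e.", "No."}  # hypothetical list; extend per language
SENTENCE_ENDINGS = (".", "!", "?", "؟")


def split_sentences(text):
    """Split on sentence-ending punctuation, but not after known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(SENTENCE_ENDINGS) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def pack_chunks(sentences, max_chars=60):
    """Pack whole sentences into chunks, never breaking inside a sentence."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Dr. Salem visited Cairo in 1950. He wrote two books. Both are in Arabic."
naive = text.split(". ")          # breaks after "Dr"
sentences = split_sentences(text)
chunks = pack_chunks(sentences)
print(len(naive), "naive pieces vs", len(sentences), "sentences")
print(chunks)
```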

9. Generative-AI Issues

To generate answers given relevant evidence and citations, a large language model (LLM) is used to understand and generate Arabic text.

LLM Performance Trade-offs

Aspect	Challenge	Our Solution
Arabic Understanding	Most LLMs weaker in Arabic	Specialized Arabic-optimized models
Processing Cost	Capable LLMs are expensive	Efficient model selection
Response Time	Quality models take longer	Performance optimization
Output Format	Often too verbose	Context-aware formatting

Capable LLMs take more time and cost more to produce high-quality responses. Our RAG system uses AI models that understand Arabic better and more efficiently, which shortens processing time and lowers costs.

Intelligent Output Formatting

Query Type	Optimal Format	Traditional LLM Output
Comparison	Table	Wall of text
Multiple items	Bullet list	Long paragraphs
Simple fact	Inline citation + PDF page	Verbose explanation

Moreover, LLMs tend to be verbose, producing walls of text for questions that are better answered with a table, a list, or a short commentary on an inline PDF page that contains the answer. Our RAG system automatically selects and produces these formats based on the query and the given context.
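The mapping in the table above can be approximated with a small routing function. The keyword list and the `choose_format` helper below are hypothetical and far cruder than a model-driven decision; the sketch only illustrates that the output format is chosen from the query and the retrieved context rather than left to the LLM's default.

```python
COMPARISON_CUES = ("compare", "versus", " vs ", "difference")  # assumed keywords


def choose_format(query, num_items):
    """Pick an answer format from the query wording and the number of items found."""
    q = f" {query.lower()} "
    if any(cue in q for cue in COMPARISON_CUES):
        return "table"
    if num_items > 1:
        return "list"
    return "inline"  # short answer with an inline citation


print(choose_format("Compare GDP growth in 2019 and 2020", 2))
print(choose_format("Which sectors grew last year?", 3))
print(choose_format("What was the growth rate in 2020?", 1))
```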

Monta AI’s Comprehensive Solution

Technical Capabilities Summary

Challenge Area	Monta AI Solution	Competitive Advantage
Script Rendering	Shape-aware character recognition	Handles all positional variants
RTL Processing	Bidirectional text handling	Correct ordering and alignment
Ligatures & Diacritics	Advanced character composition	Accurate extraction
Encoding	Robust normalization pipeline	Handles Mojibake and mismatches
Font Issues	Adaptive font handling + Tatwil removal	Cost-efficient processing
OCR	Advanced image text extraction	Surpasses ChatGPT Plus
Layout	Multi-column, table, form-aware	Complex document understanding
Chunking	Semantic boundary detection	Optimal context preservation
Arabic LLMs	Specialized Arabic models	Faster, cheaper, more accurate

To effectively parse PDFs, Monta AI combines an ensemble of specialized techniques developed to handle the complexities of the Arabic script and language, as well as the general challenges of extracting text from PDFs.

Our Differentiators:

🔍 Beyond Text Extraction - Understand content using layout and metadata cues
🎯 Intelligent Context Selection - Choose relevant information for responses
📄 Source Citation - Display embedded pages used to formulate answers
💡 Format Optimization - Automatic selection of tables, lists, or prose based on query
⚡ Cost Efficiency - Reduced latency and processing costs
🌐 Arabic Expertise - Purpose-built for Arabic script complexities

Our systems go beyond extracting text to understand the content of PDFs using cues such as layout and metadata, helping the entire solution provide highly accurate answers in an intuitive format. The AI assistant decides which relevant context to use when responding to queries, and it cites and displays the embedded pages used to formulate each response.

Conclusion

The journey to unlock the vast reservoirs of knowledge contained within Arabic documents is fraught with unique challenges, ranging from the linguistic intricacies of the Arabic language to the technical hurdles of parsing text from PDFs. Yet, the advent of technologies like multilingual RAG-powered AI assistants heralds a new era of possibilities.

The Path Forward

From	To
Inaccessible Knowledge	Unlocked insights from Arabic documents
Technical Obstacles	Seamless extraction and processing
Limited Understanding	Deep cultural and linguistic comprehension
Isolated Information	Connected, searchable knowledge bases

By combining multilingual embedding models and LLMs with specialized techniques for dealing with complex scripts and document formats, we are on the cusp of making this wealth of information easily accessible to a global audience.

As we continue to refine these technologies and methods, the promise of seamlessly extracting and leveraging the rich insights hidden in Arabic documents moves closer to a tangible reality, lighting the way for future innovations and discoveries for all languages around the globe.

Ready to Unlock Your Arabic Documents?

Looking to unearth insights from your Arabic documents? Partner with Monta AI today to get tailored AI solutions that will elevate your business to new heights!