Post by Yuvraj Singh Bhadoria
ML Engineer @ BofA | LLM & RAG Systems | 52%→88% Retrieval Accuracy | LangChain · LangGraph · PyTorch | Ex-PolicyBazaar
Just read "Multimodal OCR: Parse Anything from Documents" and can't stop thinking about it.

Traditional OCR tools extract text from PDFs and call it a day. This paper goes way further: it parses entire documents into structured representations, including charts, diagrams, UI elements, and icons. The model (dots.mocr, just 3B parameters) outputs SVG code instead of pixels, which means you can actually edit and re-render what it extracts.

A few things that stood out:
- It ranks second only to Gemini 3 Pro on the OCR Arena leaderboard
- Beats Gemini on image-to-SVG reconstruction tasks
- Sets a new state-of-the-art on olmOCR-Bench (83.9)

The implications are pretty significant. If every document can be converted into editable, structured data, that's a massive unlock for document understanding, data extraction, and multimodal training pipelines.

Would love to hear from people working in document AI: is this the direction the field is heading?