Let AI Instantly Tame, Vectorize, and Search Your Documents

Realize the full potential of your unstructured data with the world's most advanced document database. "Gensim on Steroids."



Unstructured data, tamed

  • Looking to make sense of large collections of contracts, customer feedback, CVs, emails or patents?
  • Struggling to understand document structure and automatically categorize sections?
  • Want powerful semantic vector search and clustering?
  • Need a flexible, open on-prem solution your data science team can build on?
  • Had it with AI hype, shoddy engineering and lacklustre support?

A typical business keeps much of its data in unstructured form: contracts, financial reports, customer support cases, resumes, patents…

ScaleText is a solution that taps into that potential with automated content analysis, document section indexing and semantic queries. No more low-recall keywords and costly manual labelling.

Using advanced machine learning algorithms, ScaleText implements state-of-the-art workflows for format conversions, content segmentation, categorization, entity detection and semantic vector modeling. The resulting metadata is indexed as a database service, allowing fast document inserts, similarity queries and clustering requests that scale to hundreds of millions of documents.

And the best part? ScaleText is not a black box. It doesn’t lock power users in by prescribing pre-packaged algorithms. Powerful REST APIs let you plug your own analysis pipelines into its highly scalable architecture. ScaleText works with your data science team, not against it.
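To make the idea of a pluggable analysis pipeline concrete, here is a minimal sketch in plain Python. It does not use the ScaleText API: the function names (run_pipeline, normalize, count_words, tag_contracts) and the document dictionary layout are hypothetical, chosen only to show how custom steps could be chained.

```python
def run_pipeline(document, steps):
    """Pass a document dict through each step in order; each step returns an updated dict."""
    for step in steps:
        document = step(document)
    return document

def normalize(doc):
    # Collapse runs of whitespace (including newlines) into single spaces.
    return {**doc, "text": " ".join(doc["text"].split())}

def count_words(doc):
    # Attach a simple piece of metadata: the number of whitespace-separated tokens.
    return {**doc, "word_count": len(doc["text"].split())}

def tag_contracts(doc):
    # Hypothetical custom step: a crude keyword rule standing in for a real classifier.
    label = "contract" if "agreement" in doc["text"].lower() else "other"
    return {**doc, "category": label}

result = run_pipeline(
    {"text": "This  Agreement is made\nbetween two parties."},
    [normalize, count_words, tag_contracts],
)
print(result)
```

Because every step takes and returns the same kind of dictionary, a team could swap the keyword rule for its own trained model without touching the other steps.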

From the makers of Gensim. Why reinvent the wheel and develop costly in-house solutions when you can join forces with the world’s foremost team of experts?


Extract metadata and semantic vectors from documents.


Index, query and cluster unstructured documents at scale.


Build new applications grounded in cutting-edge NLP and IR.


Who is ScaleText for?

Organizations & businesses

Utilize knowledge implicit in your data

  • Search and group similar content, using adaptive semantic models
  • Route documents based on automatically discovered metadata and themes
  • Avoid manual tagging errors and reduce annotation costs
  • Base your analytics on content-driven insights
  • Use domain-tuned analysis pipelines to automate your workflows

Data scientists & ML consultants

Gain a competitive edge

  • Employ a robust and scalable platform to analyze, index and search documents
  • Assemble plug-and-play pipelines to build cutting-edge NLP solutions
  • Integrate efficiently using a local installation with clean REST and Python APIs
  • Enhance pipelines with your own custom machine learning models and incremental training
  • Avoid the R&D costs of building and maintaining a scalable semantic engine

Enjoy seeing all your data available for analysis, no matter the format or volume

Request free DEMO

ScaleText Technology


Adaptive Vector Representations

Modern machine learning techniques represent content as multi-dimensional vectors. ScaleText comes with flexible domain-customized vector models out of the box, and then gradually adapts to your data.
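As a rough illustration of what "content as vectors" means, the sketch below turns each text into a toy bag-of-words vector and ranks documents by cosine similarity to a query. This is an assumption-laden stand-in, not ScaleText's actual models: real semantic vectors are learned and dense, and the helper names (to_vector, cosine) and sample texts are made up for the example.

```python
import math
from collections import Counter

def to_vector(text):
    """Toy bag-of-words vector: maps each lowercase token to its count."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as token-count mappings."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_u * norm_v)

# Made-up sample documents, for illustration only.
docs = {
    "contract": "the supplier shall deliver the goods within thirty days",
    "invoice": "the supplier shall invoice the goods within thirty days",
    "resume": "experienced python developer seeking a data science role",
}

query = to_vector("supplier shall deliver goods")
ranked = sorted(docs, key=lambda name: cosine(query, to_vector(docs[name])), reverse=True)
print(ranked)
```

Word counts only match exact tokens; the point of semantic vector models is that related texts end up close together even when they share few words.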

Nuggets of Content

Documents may come in various formats and sizes, from short tweets to a 100-page scanned PDF report. ScaleText implements powerful domain-specific format converters and segmentation algorithms that split unstructured text into “nuggets”: meaningful semantic units to index and retrieve.
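For a feel of what segmentation into nuggets involves, here is a deliberately simple sketch that splits text on blank lines and caps each piece at a fixed number of words. It is only an assumption about the general shape of the task: the function name split_into_nuggets and the max_words parameter are hypothetical, and ScaleText's domain-specific segmentation is not described at this level of detail.

```python
import re

def split_into_nuggets(text, max_words=50):
    """Split text on blank lines, then cap each piece at max_words words."""
    nuggets = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        words = paragraph.split()
        # Long paragraphs are cut into consecutive chunks of at most max_words words.
        for start in range(0, len(words), max_words):
            nuggets.append(" ".join(words[start:start + max_words]))
    return nuggets

sample = (
    "Clause 1. Payment is due in 30 days.\n\n"
    "Clause 2. Either party may terminate with notice."
)
nuggets = split_into_nuggets(sample)
for nugget in nuggets:
    print(nugget)
```

Each nugget can then be vectorized and indexed on its own, so a query can return the relevant clause rather than the whole contract.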

Industry Focused Applications

ScaleText’s flexible architecture supports domain-focused applications: from organizing company contracts, routing support requests automatically, filtering job candidates and searching patents for prior art, to enabling legal e-discovery and financial report analysis.

Robust Index Management

Each ScaleText installation is multi-tenant, supporting multiple users and indexes out of the box. Its database capabilities include distributed indexing, index and pipeline versioning, continuous model updates and reindexing with zero downtime.

On-Premises With Stellar Support

Clients happy with our prepackaged pipelines will enjoy a hands-off managed version of ScaleText, configured and run on our servers via a REST API. For power users, ScaleText offers a self-hosted Docker deployment that allows pipeline customizations and additional command-line capabilities. Each installation comes with detailed API documentation, including examples and expert tips.

Providing industry excellence since 2011


As artificial intelligence leaders, the mission of RARE Technologies is to bridge the gap between research excellence and robust engineering.

Our R&D consulting, unique corporate training, and Incubator programmes strive to democratize machine learning and bring innovation from the classroom to the boardroom.

You might know us as the makers of Gensim and other open source Data Science tools.

RNDr. Radim Řehůřek, PhD


"More than a decade of intensive R&D in Artificial Intelligence at RARE TECHNOLOGIES has shaped what is the most advanced tool for text discovery to date. We are excited to bring our technology to the market and help organizations tap into the value of their unstructured data."

Trust the companies that trust RARE

Tim Budden

Director of Data Science

"RARE is great at sharpening up a problem definition, planning a realistic approach to solving it and then delivering an effective solution on a timeline. We were impressed by their experience and deep machine learning knowledge."

James Bradley

Program Manager Advanced Analytics

"RARE Technologies created a fantastic tool for us at Autodesk for helping to extract quality insights from hard to analyse unstructured data."