Hands on with: MeiliSearch - a next-generation search engine for the modern web



In this post, I’m going to review the MeiliSearch repository, which describes itself as a “Lightning Fast, Ultra Relevant and Typo-Tolerant Search Engine”. There were a couple of things that caught my eye with this project...

Quick intro: What is the “Hands on with X” series?

This is a series of blog posts in which we focus on open source technologies encountered in the course of our research for new development projects, or while browsing the latest news from a development environment close to our minds in our day-to-day work. One day I realized that my GitHub account is full of starred repositories containing interesting tools, databases, libraries and other cutting-edge technologies. I was always staring at them on GitHub, saving them for later review, but never got a chance to really use them. That’s when the idea for this blog post series was born. I had a feeling that a lot of users are in the same situation: they starred a repository because they might use it in the future and were curious how it works, but never got around to it. So for the next couple of posts I’m going to focus on reviewing and using interesting and popular repositories I’ve found on GitHub over the past couple of years.



Motivations for reviewing MeiliSearch

It’s a search engine, and I’m not a heavy user of search engines in my day-to-day projects. I know Elasticsearch and Sphinx are among the most popular ones. I also have some experience with tsvector in PostgreSQL, which allowed me to build simple search-engine features within a PostgreSQL database. But that’s it, so I’m curious how the new MeiliSearch project accommodates itself in this long-inhabited environment.

It’s written in Rust, which is interesting because Rust is very close to bare metal. It’s also a new technology, and most of the search engines we have are written in C++ or Java. I feel that a fresh perspective from a totally new language might be worth a try.

Since it’s Rust, it compiles to a single, portable binary. That means no more long hours of building a project, which makes deploying it along with any other project a breeze.

It claims to offer a “search as-you-type experience”, which means it is able to return results fast enough to answer EVERY keystroke. Interesting, given that the Readme file doesn’t mention the reference dataset the authors used to support that claim. Of course this is not possible for a dataset of arbitrary size, and for smaller data sets it’s trivial, so I wanted to learn how far MeiliSearch can hold that claim, given that I’ll be using a bigger-than-usual dataset.

Github repository and Readme

The GitHub repository lives at https://github.com/meilisearch/MeiliSearch. As of the beginning of September 2020, the repository has over 9k stars and over 30 contributors. Commits are pushed more or less weekly. The tool seems to have a small community around it, so it’s definitely not a dead project. In addition, Meili raised 1.5M euros in their first funding round (https://blog.meilisearch.com/meili-fundraise/), which shows they are determined to develop the product and compete with big players. That is also what they claim on their website: they want to be an alternative to search engine APIs like Algolia.

What is more important for us developers is the documentation and its clarity. The readme page is short and compact, with only relevant information, which is a plus, because they also have full-blown documentation hosted as a separate website at https://docs.meilisearch.com/.

The readme contains all of the basic information needed to start using the engine, which means it has recipes for building from source as well as for downloading a compiled binary or using a Docker image. In addition, there are examples of how to index a first collection and make queries. All of that is compact enough that following it step by step should take no more than 10-15 minutes.

The documentation describes the main concepts behind the engine, as well as guides ranging from installation and configuration up to its most advanced features. It is well written, but after reading it I was left with the feeling that this is still an immature project with a long road ahead of it.

The data set for testing purposes

Of course, testing a search engine with a small data set doesn’t make much sense, since today’s hardware easily keeps datasets weighing a couple hundred megabytes in memory. So I took a different approach: I decided to find a data set that consists mostly of text and would be useful in real-world scenarios. My choice went to a Kaggle dataset: the Cornell University arXiv index (https://www.kaggle.com/Cornell-University/arxiv). As per Wikipedia, arXiv is an open-access repository of electronic preprints (known as e-prints) approved for posting after moderation, but not full peer review. It consists of scientific papers in the fields of mathematics, physics, astronomy, electrical engineering, computer science, quantitative biology, statistics, mathematical finance and economics, which can be accessed online. The dataset hosted on Kaggle is just a fraction of the whole arXiv: it contains an index of publications with information such as author, title, category and a short excerpt. The dataset format is JSON and it weighs about 2.7 GB; all you need to do is download it and unzip it.
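If you use the Kaggle CLI, fetching the dump can look like this (a sketch assuming a configured Kaggle API token; the dataset slug is taken from the URL above):

$ kaggle datasets download -d Cornell-University/arxiv
$ unzip arxiv.zip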

Kicking it off - Download binary and start the server

Now that I have a data set, we can start testing MeiliSearch. The installation is pretty straightforward. I used the ready-to-use bash script (of course, I reviewed the script first, as I know these curl-into-shell installations are basically an attack vector).

$ curl -L https://install.meilisearch.com | sh
$ ./meilisearch
Server is listening on: http://127.0.0.1:7700

That was very smooth; it went through without any problems. A point for MeiliSearch.

Prepare the data to be ingested

In order to search the data, we first have to create an index (think of it as a database table) in the MeiliSearch instance and ingest the data into it. For index creation I used:

curl -i -X POST 'http://127.0.0.1:7700/indexes' -H 'Content-Type: application/json' --data '{ "name": "arxiv", "uid": "arxiv" }'

After the command completed, I tried to feed the index with raw data, but there are a couple of constraints on the format. The first is that each payload must be less than 10 megabytes; the second is that the JSON should actually be a JSON array where each element is a separate document identified by a unique id field. I tried the good old split, awk and sed Unix tools at first, but after some time I gave up and switched to Node.js.

You can check the whole script below. It’s not very sophisticated, but it does the job.
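A minimal sketch of such a script (assuming the arXiv dump is newline-delimited JSON, the axios HTTP client is installed, MeiliSearch listens on the default address, and the dump file name matches the Kaggle download; the id-normalizing step is my assumption about handling arXiv ids):

// ingest.js - split the arXiv dump into ~1000-document batches
// and feed them to a local MeiliSearch instance one at a time.
const fs = require('fs');
const readline = require('readline');
const axios = require('axios');

const MEILI = 'http://127.0.0.1:7700';
const BATCH_SIZE = 1000;

// Poll the updates endpoint until MeiliSearch reports the batch
// as processed, so we don't just pile payloads onto its queue.
async function waitForUpdate(updateId) {
  for (;;) {
    const { data } = await axios.get(`${MEILI}/indexes/arxiv/updates/${updateId}`);
    if (data.status === 'processed') return;
    if (data.status === 'failed') throw new Error(`update ${updateId} failed`);
    await new Promise((resolve) => setTimeout(resolve, 500));
  }
}

async function sendBatch(docs) {
  const { data } = await axios.post(`${MEILI}/indexes/arxiv/documents`, docs);
  await waitForUpdate(data.updateId);
}

async function main() {
  const lines = readline.createInterface({
    input: fs.createReadStream('arxiv-metadata-oai-snapshot.json'),
    crlfDelay: Infinity,
  });

  let batch = [];
  let total = 0;
  for await (const line of lines) {
    if (!line.trim()) continue;
    const doc = JSON.parse(line); // one JSON document per line
    // arXiv ids such as "0704.0001" contain dots, which MeiliSearch
    // does not allow in primary keys, so normalize them (assumption).
    doc.id = String(doc.id).replace(/\./g, '-');
    batch.push(doc);
    if (batch.length >= BATCH_SIZE) {
      await sendBatch(batch);
      total += batch.length;
      console.log(`indexed ${total} documents`);
      batch = [];
    }
  }
  if (batch.length > 0) await sendBatch(batch);
  console.log('done');
}

main().catch(console.error);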

Indexing first collection

The script above does basically two things. It first reads the file contents line by line and builds a payload containing about 1000 documents. It then sends the payload to MeiliSearch; in return we get an “updateId”, an identifier we can later use to ask our MeiliSearch instance whether the indexation operation for that batch has finished. If the batch is finished, we can resume the file consumption, assemble another batch and send it to MeiliSearch. And here comes the first surprise, as the documentation doesn’t clearly lay that out and I had to figure out how it works. It seems that MeiliSearch has an internal queue that accepts any number of data payloads for ingestion and processes the queue at its own pace. Makes sense to me, but as it turned out later, it’s not so lovely.
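For reference, the status check itself is a plain GET against the updates endpoint; the response carries a status field such as "enqueued" or "processed" (a sketch against the v0.x API):

$ curl 'http://127.0.0.1:7700/indexes/arxiv/updates/0'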

Indexing speed

Buffering incoming data has a lot of advantages: with a buffer you can avoid overwhelming your processing units with work while still maintaining the pace. In addition, you can scale your capacity by adding more workers; that’s where buffers shine, because you can quickly consume the backlog by throwing more resources at it. Unfortunately, the MeiliSearch indexing queue is processed by only one worker, even on a multi-core CPU. That means you can throw a lot of data at it for indexing, but it will take basically the same amount of time as if you sent it one payload at a time.

It actually gets worse, because the indexation time increases as your data set grows. That is not a bad thing as long as the growth is linear and not too significant. Unfortunately, with MeiliSearch it’s exactly the opposite: the indexation time grows exponentially, and even with small data sets it gets to a point where indexing 3 GB of data would take an unknown amount of time (by my calculations). I had high hopes when downloading a 2.7 GB data set to index with MeiliSearch; unfortunately, I was only able to index 115 MB (yes, megabytes), which is about 80,000 lines from the file. To give you some context: the first 20,000 items were indexed in less than 1 minute, the next 20,000 items took about 4 minutes, and reaching 88,000 items took 10 minutes. The arXiv data set has almost 1,800,000 items.

That being said, the indexation time is too long, and I didn’t want to spend the next few days waiting for it to finish. I did spend some more time trying to find a way to speed up the indexing process. I looked through the documentation and the issues on GitHub, and it turns out I was not the only one with this problem; https://github.com/meilisearch/MeiliSearch/issues/876 highlights it. Until the indexing process can be parallelized at least to some degree (by using additional CPU cores), using MeiliSearch with bigger datasets seems cumbersome.

Making search queries

So far I had indexed about 88,000 documents. Not much, but given that this is the number of items in a medium-sized e-commerce shop, with just about the same amount of detail per item (title, category, excerpt), I was pleased. For now, MeiliSearch exposes a REST API for interacting with it. It also comes with a predefined Web UI that has a single text input; the results are shown as you type the query. I decided to go with the Web UI, since the REST API is nothing new and the Web UI actually uses it. As a side note, what would be cool is a set of different interfaces; I’m thinking here particularly about XML (SOAP) and gRPC (Protobuf), both of which are still extensively used. That might not be required for now, but it would surely be a nice feature. Getting back to the querying: it’s incredibly fast, like really fast. To get a better feeling for it, take a look at the movie below.
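For those who prefer the raw API, every keystroke in the Web UI boils down to a call like this (a sketch; the q parameter carries the query and limit caps the number of hits):

$ curl 'http://127.0.0.1:7700/indexes/arxiv/search?q=quantum+entanglement&limit=5'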

As you can see, the Web UI doesn’t use any debouncing, which means that the moment you press a key, a query is sent to MeiliSearch and the results are returned immediately. Typing a 5-letter word sends 5 queries, all of which return a result in less than 10 ms (milliseconds), which is quite a good result. Of course, there will always be a network round trip, but I did not account for it in my tests, as I wanted them to be very synthetic. The only longer response I received (about 100 ms) was when I typed the first 1-3 letters. That is because a lot of documents match such short phrases, but that’s not a problem. The only issue I’ve found is that typing long phrases (about 20-30 characters) makes MeiliSearch respond very slowly or even freeze. I suppose this is because the internal search algorithms are not yet optimized and the authors still have to work out that specific scenario.
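As an aside, if you query the API yourself and would rather not send a request on every keystroke, a classic debounce does the trick (a generic browser-side sketch; the selector and the 200 ms delay are arbitrary choices):

// Wrap a function so it only fires after the caller pauses for delayMs.
function debounce(fn, delayMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

const input = document.querySelector('#search'); // illustrative selector

// Fire the search only after the user stops typing for 200 ms.
const search = debounce((query) => {
  fetch(`http://127.0.0.1:7700/indexes/arxiv/search?q=${encodeURIComponent(query)}`)
    .then((res) => res.json())
    .then((data) => console.log(data.hits)); // hits holds the matching documents
}, 200);

input.addEventListener('input', (event) => search(event.target.value));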

Frontend integration

To integrate MeiliSearch into our application, we can use the Instant MeiliSearch repository (https://github.com/meilisearch/instant-meilisearch/), which enables developers to integrate it without reinventing the wheel. It’s based on Algolia’s open source InstantSearch library, which comes with a lot of predefined behaviour that we can customize to our needs. It’s very cool stuff and can save a lot of time, especially when building a custom search experience. On the other hand, we still have full power to build a fully customized integration. Exposing MeiliSearch to the public creates unnecessary risk from malicious users, so I advise everybody to at least put a proxy in front of it when exposing MeiliSearch to the client directly, or to communicate with it only on the backend side.
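Wiring it up can be as small as the sketch below (based on the repository’s README; the exact import and call signature vary between versions, so treat this as an approximation):

import instantsearch from 'instantsearch.js';
import { instantMeiliSearch } from '@meilisearch/instant-meilisearch';

// Point InstantSearch.js at a MeiliSearch instance instead of Algolia.
const search = instantsearch({
  indexName: 'arxiv',
  searchClient: instantMeiliSearch('http://127.0.0.1:7700'),
});

// Standard InstantSearch widgets render the input box and the results.
search.addWidgets([
  instantsearch.widgets.searchBox({ container: '#searchbox' }),
  instantsearch.widgets.hits({ container: '#hits' }),
]);

search.start();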

Would I recommend MeiliSearch?

Of course. I personally would use it for smaller projects where the data set is not big. It’s certainly a great fit for building a product catalogue or another backend that powers a UI with a lot of filters. Thanks to blazingly fast query execution, it offers a great user experience.

Conclusions

MeiliSearch looks very promising, and I’m sure it has a bright future ahead. It’s still a young project, and a lot of features are missing (parallel indexing, high availability, sharding). I believe that if the development team keeps up the pace, we can expect MeiliSearch to blossom over the next couple of months. The MeiliSearch developers focused on blazing-fast performance; now they should work on the features that will attract bigger users.

One more thing to note: when creating a MeiliSearch index, you have to set a primary key, which is a unique value identifying a document across the whole index. The primary key can be an integer or a string value composed of alphanumeric characters, hyphens (-) and underscores (_).
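For example, the primary key could have been set explicitly at index creation time (a sketch; by default MeiliSearch tries to infer it from a field containing "id"):

$ curl -i -X POST 'http://127.0.0.1:7700/indexes' -H 'Content-Type: application/json' --data '{ "name": "arxiv", "uid": "arxiv", "primaryKey": "id" }'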

Another thing to note, besides the indexing process being single-threaded right now, is resource usage: 88,000 documents (about 115 MB of raw data) take about 800 megabytes of RAM, while the index stored on disk takes about 2.2 GB. So we are dealing with roughly an 8x increase in data size to keep it in memory, and about a 20x increase to keep it on disk.

However, the startup time is very low and the process is responsive immediately after start, making it possible to start querying right away (probably thanks to the OS buffer cache).

Thanks for reading this article. I hope you liked it and will come back soon!



Author: Peter

I've been a backend programmer for over 10 years now, with hands-on experience in Golang and Node.js as well as other technologies, DevOps and architecture. I share my thoughts and knowledge on this blog.