Keywords: Elasticsearch | reindex API | data copying | index management | query filtering
Abstract: This article shows how to use Elasticsearch's built-in _reindex API to copy documents that meet specific criteria into a new index. It covers the basic reindexing request, filtering the source with a query, and an alternative external tool, with code examples for each.
Introduction
Elasticsearch is a search engine that stores documents in indices. A common data-management task is copying a subset of those documents into a new index for archiving, testing, or reorganization. For instance, suppose an index named "movies" holds movie documents with fields such as title, director, and year, and you want to copy all movies from a particular year, say 1972, into a new index called "70sMovies". This article shows how to do that with Elasticsearch's built-in _reindex API.
Using the _reindex API
Introduced in Elasticsearch 2.3, the _reindex API copies documents between indices on the server side, so no client has to read the documents and send them back. It reads each document's _source, which must therefore be enabled on the source index. A request can copy an entire index or only the documents that match a query, which gives precise control over what is transferred.
Basic Reindexing Operation
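To copy every document from one index to another, send a request that names only a source and a destination. This example copies the "movies" index into a new index named "all_movies_copy"; the name is lowercase because Elasticsearch rejects index names that contain uppercase letters.
POST /_reindex
{
  "source": {
    "index": "movies"
  },
  "dest": {
    "index": "all_movies_copy"
  }
}
This request copies all documents from "movies" to "all_movies_copy". If the destination index does not exist, Elasticsearch creates it automatically (unless automatic index creation is disabled). However, _reindex does not copy the source index's settings or mappings, so an auto-created destination gets default settings and dynamically inferred mappings. To control mappings, shard counts, or replicas, create the destination index before reindexing.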
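The same request can be built and sent from a script. The sketch below is one way to do that in Python using only the standard library. The helper names (check_index_name, build_reindex_body, post_json) are illustrative and not part of any Elasticsearch client. It checks the destination name first, since Elasticsearch rejects index names containing uppercase letters, which is why the destination is spelled "all_movies_copy" here. The check covers only two of the naming rules and is not a complete validator.

```python
import json
import urllib.request


def check_index_name(name):
    """Partial check of Elasticsearch index-name rules (not exhaustive)."""
    if name != name.lower():
        raise ValueError(f"index name must be lowercase: {name!r}")
    if name.startswith(("-", "_", "+")):
        raise ValueError(f"index name must not start with '-', '_' or '+': {name!r}")
    return name


def build_reindex_body(source_index, dest_index):
    """Build the JSON body for POST /_reindex that copies every document."""
    return {
        "source": {"index": source_index},
        "dest": {"index": check_index_name(dest_index)},
    }


def post_json(url, body):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Assuming an unsecured node at http://localhost:9200 (an assumption; adjust the URL and add authentication as needed), calling post_json("http://localhost:9200/_reindex", build_reindex_body("movies", "all_movies_copy")) would send the request and return the parsed response.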
Filtering Data with Queries
For more selective copying, add a query to the source configuration. To copy only movies from 1972, use a term query on the "year" field. Because Elasticsearch rejects index names that contain uppercase letters, the destination is written as "70s_movies" rather than "70sMovies".
POST /_reindex
{
  "source": {
    "index": "movies",
    "query": {
      "term": {
        "year": 1972
      }
    }
  },
  "dest": {
    "index": "70s_movies"
  }
}
In this example, the query restricts the source to documents whose "year" field equals 1972, and _reindex copies only those documents to the destination index. A term query matches exact values, so the request works as written when "year" is mapped as a numeric or keyword field. Because the filter is applied while the source is read, non-matching documents are never fetched or indexed.
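A filtered request is worth checking after it runs. The sketch below, in Python, builds the same term-filtered body and reads the counts from a _reindex response. The function names are hypothetical, and the sample response is invented for illustration; the fields it reads (created, updated, failures) are part of the documented _reindex response.

```python
import json


def build_filtered_reindex_body(source_index, dest_index, field, value):
    """Build a POST /_reindex body that copies only documents where field == value."""
    return {
        "source": {
            "index": source_index,
            "query": {"term": {field: value}},
        },
        "dest": {"index": dest_index},
    }


def summarize_reindex_response(response):
    """Return (documents written, list of failures) from a _reindex response dict."""
    written = response.get("created", 0) + response.get("updated", 0)
    failures = response.get("failures", [])
    return written, failures


if __name__ == "__main__":
    # Destination name is lowercase: Elasticsearch rejects uppercase in index names.
    body = build_filtered_reindex_body("movies", "70s_movies", "year", 1972)
    print(json.dumps(body, indent=2))

    # Made-up response for illustration only; real values depend on your data.
    sample_response = {"total": 3, "created": 3, "updated": 0, "failures": []}
    written, failures = summarize_reindex_response(sample_response)
    if failures:
        print(f"{len(failures)} documents failed to copy")
    else:
        print(f"copied {written} documents")
```

Comparing the written count with the response's total field, or with a count of matching documents in the source, is a quick way to confirm that the filter selected what was intended.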
Alternative Method: elasticsearch-dump
While the _reindex API is the recommended approach for most use cases, external tools such as elasticsearch-dump (run as the elasticdump command) can also move data between indices. The tool dumps and loads Elasticsearch data and can transfer mappings as well as documents. Unlike _reindex, it runs as a separate client process that reads documents out of the cluster and writes them back in. For example, the following command copies the 1972 movies into "70s_movies" (lowercase, as Elasticsearch requires). The --searchBody option carries the same term query used with _reindex; without it, the command copies every document.
elasticdump \
  --input=http://localhost:9200/movies \
  --output=http://localhost:9200/70s_movies \
  --type=data \
  --searchBody='{"query":{"term":{"year":1972}}}'
For most selective-copy scenarios, though, the _reindex API is preferable: it needs no extra tooling, and the copy runs inside the cluster.
Conclusion
The _reindex API in Elasticsearch offers a robust way to copy data between indices, with the added flexibility of query-based filtering. It lets users manage their data efficiently, for example by archiving a specific subset such as the movies from a particular year. Refer to the official Elasticsearch documentation for the latest features and best practices.