Combine documents with other data in Azure Search
by Frans Lytzen | 30/01/2017
Azure Search has built-in support for indexing documents in blob storage, which works great. But what if you want to combine the documents with other data, for example if you are building a recruitment system and want to search on, say, location and CV content at the same time?
TL;DR: Indexers in Azure Search use a "create or update" approach; as long as your different indexers use the same id, they can all write data to the same document in the index.
Terminology in Azure Search
A data source is a definition in Azure Search of somewhere an indexer can read data from; it's sort of like a connection string. An indexer is a process in Azure Search that reads data from a data source and writes it into an index, which is the searchable collection of documents you query against.
Scenario
In my case I had candidates' CVs as files in Azure Blob Storage and structured data about the candidates in Azure SQL, and I wanted to be able to search across both at the same time. I considered a few options:
- I could extract the content of the files myself and write code to combine it with the candidate data. But extracting content from files is not a trivial problem.
- I could let Azure Search index the files, then write code to extract the content back out of Azure Search and write it back to another index. But that seemed like a very long-winded solution.
- I asked Microsoft if I could somehow access the content-extraction feature directly. But you can't.
The solution
Leverage Blob Storage together with a structured data store (such as Azure SQL or DocumentDB). If you store your files in Azure Blob Storage and your structured data in Azure SQL or DocumentDB, you can run the Azure Search blob indexer and the Azure Search SQL or DocumentDB indexer against the same index. The only trick is to make sure that the unique key defined for each document matches between the blob and the structured data store. For example, if you choose to use the Base64-encoded filename as the key, you need to make sure the matching record in your structured data store also contains that value. Because indexers use a merge-or-upload approach, the blob indexer will, say, take a file from Blob Storage and insert it as a new document in Azure Search, and the structured-data indexer will then find the existing document with the same key and update it with its own fields.
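Conceptually, once both indexers have run, a single search document contains fields from both sources. As a purely hypothetical illustration (made-up values, and assuming the blob indexer's default "content" field for extracted text), a merged document might look like this:

{
  "id" : "123",
  "Name" : "Jane Smith",
  "Type" : "Developer",
  "Thingiemajics" : [ "red", "white", "blue" ],
  "content" : "...text extracted from the CV file..."
}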
Implementation
All the requests below are HTTP POSTs of JSON to the Azure Search REST API. Each request needs two headers:
- Content-Type : application/json
- api-key : [an admin key for your Azure Search instance]
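For example, creating a data source is a POST to the datasources endpoint (indexes and indexers work the same way against their own endpoints). The service name is a placeholder, and 2016-09-01 was the current API version at the time of writing:

POST https://[service-name].search.windows.net/datasources?api-version=2016-09-01
Content-Type: application/json
api-key: [admin key]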
Create the Index
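The exact index definition depends on your data. As a minimal sketch, it needs a key field plus the fields the two indexers will write to, something like this ("content" is the field the blob indexer writes extracted text into by default):

POST https://[service-name].search.windows.net/indexes?api-version=2016-09-01

{
  "name" : "candidates",
  "fields" : [
    { "name" : "id", "type" : "Edm.String", "key" : true },
    { "name" : "Name", "type" : "Edm.String", "searchable" : true },
    { "name" : "Type", "type" : "Edm.String", "searchable" : true, "filterable" : true },
    { "name" : "Thingiemajics", "type" : "Collection(Edm.String)", "filterable" : true },
    { "name" : "content", "type" : "Edm.String", "searchable" : true }
  ]
}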
Create the data sources
Time to start posting JSON (see above).
This tells Azure Search how to access your data:

{
  "name" : "blobcvs",
  "type" : "azureblob",
  "credentials" : { "connectionString" : "XXX" },
  "container" : { "name" : "cvs" }
}
The second data source points at the candidate data; for an azuresql data source, the container name is the table (or view) to read from:

{
  "name" : "candidates",
  "type" : "azuresql",
  "credentials" : { "connectionString" : "XXX" },
  "container" : { "name" : "Candidates" }
}
Create the indexers
This tells Azure Search to take the data in the SQL database specified in the SQL data source and create a document in Azure Search for each row. Azure Search automatically matches fields with the same names; I've got an id field in the index as well as Name, Type and Thingiemajics columns in SQL. The only one that is a bit special is Thingiemajics: I'm storing an array of tag values in that SQL column in the format ["red", "white", "blue"], and the mapping function below tells Azure Search to turn them into individual tags that can be filtered on individually. See the docs for more details.

{
  "name" : "candidateindexer",
  "dataSourceName" : "candidates",
  "targetIndexName" : "candidates",
  "fieldMappings" : [
    { "sourceFieldName" : "Thingiemajics",
      "mappingFunction" : { "name" : "jsonArrayToStringCollection" } }
  ]
}
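Once that mapping is in place, individual tags can be matched with a collection filter. A hypothetical query for candidates tagged 'red':

GET https://[service-name].search.windows.net/indexes/candidates/docs?api-version=2016-09-01&search=*&$filter=Thingiemajics/any(t: t eq 'red')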
Before I create the indexer for the files, let me take a little detour. If you remember, the original premise was that for this to work, the two different indexers need to use the same id for data about the same candidate. The SQL indexer in my example uses the database id of the candidate, so we need to ensure that when Azure Search indexes the CV for a candidate it uses that same id. By default, Azure Search will use the filename as the key, which is obviously no good in this situation. The way I solved this was to add a custom metadata property to the blob when I uploaded it to Azure Blob Storage, something like this:
using (var fileStream = System.IO.File.OpenRead(file))
{
    await blob.UploadFromStreamAsync(fileStream);
}
// Attach the candidate's database id to the blob as a custom
// metadata property, then persist it to blob storage
blob.Metadata.Add("mykey", identifier);
await blob.SetMetadataAsync();

Here I have called it "mykey", but it could be called anything.
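As an aside, you can instead populate blob.Metadata before calling UploadFromStreamAsync; the storage SDK then sends the metadata along with the upload itself, saving the extra SetMetadataAsync round trip.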
The most important thing here is that I tell Azure Search to take the metadata property "mykey" and map it to the "id" field in my index. That's all that is required to ensure the contents of the CV end up in the same search document as the rest of the candidate's information.

{
  "name" : "cvindexer",
  "dataSourceName" : "blobcvs",
  "targetIndexName" : "candidates",
  "fieldMappings" : [
    { "sourceFieldName" : "mykey", "targetFieldName" : "id" }
  ],
  "parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
}
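An indexer runs as soon as it is created (and on a schedule, if you give it one), but you can also trigger it and check on its progress explicitly:

POST https://[service-name].search.windows.net/indexers/cvindexer/run?api-version=2016-09-01
GET https://[service-name].search.windows.net/indexers/cvindexer/status?api-version=2016-09-01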
Notes
You should also note that this works because the two data sources do not share any field names; if both indexers write to the same field, whichever one runs last will win and overwrite the data from the other.
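The payoff is that a single query can now combine full-text search over the CV content with filters on the structured candidate data. A hypothetical example, searching the CVs for "azure" while filtering on a tag:

GET https://[service-name].search.windows.net/indexes/candidates/docs?api-version=2016-09-01&search=azure&$filter=Thingiemajics/any(t: t eq 'red')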