Discover how Inpute partnered with Ireland's Central Statistics Office (CSO) to intelligently capture and validate a variety of printed store till receipts. The unstructured nature of data being captured made this project particularly challenging.
The CSO carries out extensive research on economic and social activities in Ireland. It found that capturing and collating data on larger projects was becoming increasingly difficult. Data accuracy was a major concern as they needed to capture and index 100,000 pieces of unformatted till receipt data that varied hugely from store to store.
Leveraging expertise in intelligent data capture and forms recognition, the Inpute team built a solution to automate and intelligently extract data from a variety of document types in variable locations based on key words and phrases.
“The solution has helped to transform the survey processing operation... The system doesn't require a template for each receipt type. Instead, it can intelligently search the content, extract the relevant data and ignore the irrelevant items.”
Central Statistics Office
The household budget survey (HBS) is among the largest and most important of the data collection programmes which the Central Statistics Office in Ireland carries out. Every five years a random sample of 10,000 households is polled about its expenditure patterns. The aim is to determine in detail the pattern of household expenditure in order to update the consumer price index.
Inpute was tasked with devising a system which would automate and streamline the processing of 100,000 pieces of highly varied, unformatted data and deliver substantial cost and resource savings.
John O’Reilly of IT corporate systems at the CSO explains that each household member over the age of 16 is asked to maintain a detailed diary of their expenditure over a two-week period.
Participants were encouraged to return till receipts to the CSO in lieu of entering handwritten detail in the diary booklet. This reduced the burden on respondents, while also enhancing the accuracy of the information collected. “It does however add another layer of complexity to the data processing operation,” says O’Reilly.
Information from the expenditure diaries was already being captured by a form recognition software solution which Inpute had implemented some years earlier. In the context of structured templates like the diaries, it worked perfectly. As the system stood, however, it was not a viable option for automatically capturing data from till receipts. Prior to this, 2015 receipts data was manually keyed in by a team of data entry operators over a period of months.
Variety is the issue here. Till receipts vary hugely from store to store. Totals, discounts, dates and numbering sequences appear in different places on different receipts. Identical products are described in different ways, while many receipts also carry marketing messages unrelated to the underlying price data. In short, no two till receipts are the same, and the paper itself is invariably of poor quality, fading quickly and creasing easily.
“A solution which could deal with the unstructured nature of the till receipts seemed highly unlikely,” says John O’Reilly.
Following a competitive tendering process, Inpute was selected as the preferred solution provider. “They proposed the introduction of software with advanced character recognition capabilities, designed to transform information from unstructured documents, such as receipts, into machine readable data.”
Integrating with existing systems was a key deliverable for the CSO. It was vital that the intelligent capture solution which Inpute provided worked with the existing form recognition software which would continue to capture the diary data. These twin data sets – receipts and diaries – needed to be linked and directly traceable to the source documents.
Inpute began working closely with the CSO to agree the specification, and then set about tailoring their intelligent capture software so that it merged seamlessly with the CSO’s existing systems. Implementation went very smoothly; minor issues which arose during the UAT phase were quickly and effectively dealt with.
“The solution has helped to transform the survey processing operation,” says John O’Reilly. “The system doesn’t require a template for each receipt type. Instead, it can intelligently search the content, extract the relevant data and ignore the irrelevant items. Now, instead of keying every line item, you only have to correct those items flagged by the software. The recognition capabilities of the solution meant that the resources needed to process the survey were significantly reduced. It’s definitely a much more efficient process.”
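To make the template-free idea concrete, here is a minimal sketch in Python. It is not Inpute's software, whose internals are not described here; the patterns, the `Receipt` class and the `parse_receipt` function are all hypothetical names chosen for illustration. It assumes the receipt has already been converted to text by OCR, searches each line for keywords rather than fixed positions, skips marketing text, and flags anything it cannot classify for an operator to review.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical keyword patterns; a production system would need far richer rules.
TOTAL_RE = re.compile(r"\b(?:total|balance due|amount due)\b\D*(\d+[.,]\d{2})", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})\b")
ITEM_RE = re.compile(r"^(?P<desc>.+?)\s+(?P<price>\d+[.,]\d{2})$")
NOISE_RE = re.compile(r"thank you|loyalty|points|www\.|visit us", re.IGNORECASE)


@dataclass
class Receipt:
    total: Optional[str] = None
    date: Optional[str] = None
    items: List[Tuple[str, str]] = field(default_factory=list)
    flagged: List[str] = field(default_factory=list)  # lines needing manual review


def parse_receipt(text: str) -> Receipt:
    """Classify each OCR line by content, not by its position on the receipt."""
    receipt = Receipt()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or NOISE_RE.search(line):
            continue  # ignore blank lines and marketing messages
        match = TOTAL_RE.search(line)
        if match:
            receipt.total = match.group(1)
            continue
        match = DATE_RE.search(line)
        if match:
            receipt.date = match.group(1)
            continue
        match = ITEM_RE.match(line)
        if match:
            receipt.items.append((match["desc"], match["price"]))
        else:
            receipt.flagged.append(line)  # unrecognised: leave for an operator
    return receipt


SAMPLE = """SUPERSTORE DUBLIN
12/03/2016 14:22
MILK 2L      1.89
BREAD        2.15
TOTAL   4.04
Thank you for shopping with us!
"""

if __name__ == "__main__":
    print(parse_receipt(SAMPLE))
```

In this sketch the operator only sees the `flagged` lines, which mirrors the workflow described above of correcting flagged items rather than keying every line.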
Feedback from survey staff has been very positive, with ease of use being the key benefit. An image of the receipt is presented side-by-side with a table of what the software has read, making the interface easy to navigate and correct.
“The project was delivered on time and within budget,” says John O’Reilly. “The capabilities of the software coupled with the edits and checks which Inpute built into the system have contributed to the overall data quality. The system is now more flexible and robust.”
He concludes: “Inpute’s solution has transformed our data processing operation for the better. We were impressed with the quality of the product and the support provided by the Inpute team.”
This was a fascinating project: extracting and validating data based on keywords or phrases from documents that vary in size from A4 down to till receipts, with the data in variable locations on each document. So many use cases could benefit from this type of solution.
CEO, Inpute