Associative Database Search

Databases are efficient systems for storing and retrieving data, and by now they shape our everyday life in subtle and less subtle ways. Where and why can things go wrong, and what can be done about it?

In many business applications it is imperative to identify client data in letters, forms, invoices, or electronic messages by matching the input against the data stored in a central database. Identifying a client’s database entry gives access to further information about the client (such as a bank account), so that the necessary operations can be performed on the data (for example, paying the total sum into that bank account). This is called associative access, because it relates partial information about a record to its full content.
Consider for example the following four records, part of a large client database. Note that the fields are separated by the “;” symbol.

4;ABUS KRANSYSTEME GMBH GUMMERSBACH;ABUS KRANSYSTEME GMBH;; SONNENWEG 1;51647 GUMMERSBACH;GUMMERSBACH;;51601 GUMMERSBACH;Postfach 100162;;;061236080;315568220;N;320017;DEUTSCHE;;4023800;N;BLZ 51070021;UA1-01

5;ANCRA JUNGFALK GMBH ENGEN;ANCRA JUNGFALK GMBH;;RICHARD-STOCKER-STR. 19;78234 ENGEN;ENGEN;;78230 ENGEN;Postfach 1309;;DE811575164;02261370; 341920403;N;;DEUTSCHE;;100800;N;BLZ 38470091;UA1-01

6;ACTECH GMBH FREIBERG;ACTECH GMBH;;AM ST.NICLAS-SCHACHT 13;09599 FREIBERG;FREIBERG;; ;;;;;316131929;N;50023;DEUTSCHE;;949768;N;BLZ 69270038;UA1-017

7;AUTOHS.MOLTKE-GARAGE GMBH STUTTGT.;AUTOHAUS;MOLTKE-GARAGE GMBH;ESPERANTOSTR. 8A;70197 STUTTGART;STUTTGART;; ;;;DE171187036; 03731781179;313798340;N;;DEUTSCHE;;210047729;N;BLZ 87030670;UA1-01
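
As a minimal sketch of how such a semicolon-separated record could be read, the following Python snippet splits record 7 into fields and reports which positions are empty. The function name `parse_record` is hypothetical, and fields are addressed by position only, since the field names are not given here.

```python
import csv
import io

# Record 7 from the sample above, copied verbatim.
RAW = (
    "7;AUTOHS.MOLTKE-GARAGE GMBH STUTTGT.;AUTOHAUS;MOLTKE-GARAGE GMBH;"
    "ESPERANTOSTR. 8A;70197 STUTTGART;STUTTGART;; ;;;DE171187036; "
    "03731781179;313798340;N;;DEUTSCHE;;210047729;N;BLZ 87030670;UA1-01"
)


def parse_record(line):
    """Split one ';'-separated line into a list of stripped field strings."""
    row = next(csv.reader(io.StringIO(line), delimiter=";"))
    return [field.strip() for field in row]


fields = parse_record(RAW)

# Positions (0-based) of fields that are empty or contain only whitespace.
empty_positions = [i for i, field in enumerate(fields) if not field]

print("record number:", fields[0])
print("street field: ", fields[4])
print("empty field positions:", empty_positions)
```

Stripping whitespace before testing for emptiness matters here, because some of the sample fields contain a single blank rather than nothing at all.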

While the record structure is in general well defined, each record might have empty fields, and some addresses or place names might be abbreviated (like Stuttgt. instead of Stuttgart). In practice, such records are often duplicated, reflecting a change in address, bank account, etc.
The input document might also contain spelling and typing errors, as well as OCR-generated errors if it was received on paper. Continuing the above example, the input document might contain an address block which, after scanning, OCR, and further pre-processing, looks as follows:

ESPARANT0 STR. 80 70137 STUTTGT

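Compared with record 7, the street name is misspelled (ESPARANT0 instead of ESPERANTO), the house number reads 80 instead of 8A, the postal code reads 70137 instead of 70197, and the city name is abbreviated. The goal is to find record number 7, correct the errors, identify the fields, and pass this information to the next station of the workflow.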
In most cases the input document contains some of the record data, but certainly not in the field-organized structure present in the database; in other cases (forms, web forms), a field structure can be enforced. The task is therefore to find the best match between an input consisting of either unstructured or semi-structured text (forms) and a database record. Both can contain orthographic errors, missing parts, or both.
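
To make the notion of an orthographic error concrete, the sketch below measures how far a noisy input string is from a stored value. It uses the Levenshtein edit distance, chosen here only as a common stand-in; nothing in this text says which measure is actually used. The names `edit_distance` and `similarity` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Noisy strings from the scanned address block vs. values stored in record 7.
pairs = [
    ("70137", "70197"),
    ("STUTTGT", "STUTTGART"),
    ("ESPARANT0 STR.", "ESPERANTOSTR."),
]
for noisy, stored in pairs:
    print(f"{noisy!r} vs {stored!r}: "
          f"distance {edit_distance(noisy, stored)}, "
          f"similarity {similarity(noisy, stored):.2f}")
```

A measure like this tolerates substitutions and missing characters, but it compares one string against one candidate at a time, so on its own it does not scale to a large database.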

C;A;R;E, the content addressable record extraction module, is LCI’s solution to this problem. C;A;R;E takes as input raw text in ASCII or Unicode format, or a CSV-format file as shown above. After creating its own index, C;A;R;E can answer queries consisting of structured or unstructured fields. Much like a search engine, C;A;R;E then provides a ranked list of the best-matching records in the database. If the query contained no field information, C;A;R;E can also provide, on request, a map between the input strings and the matching record fields.
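
The internals of the C;A;R;E index are not described here. Purely to illustrate the index-then-rank workflow, the sketch below builds an inverted index over character trigrams and ranks records by the fraction of query trigrams they contain. The class `TrigramIndex`, the helper `trigrams`, and the scoring rule are all assumptions made for this example, and only the name and address fields of the four sample records are indexed to keep it short.

```python
import re
from collections import Counter, defaultdict


def trigrams(text):
    """Set of character trigrams over the alphanumeric tokens of text.
    Each token is padded with '_' so word boundaries are represented."""
    grams = set()
    for token in re.findall(r"[A-Z0-9]+", text.upper()):
        padded = f"_{token}_"
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


class TrigramIndex:
    """Inverted index: trigram -> set of record ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, record_id, text):
        for gram in trigrams(text):
            self.postings[gram].add(record_id)

    def query(self, text, top=3):
        """Return up to `top` (record_id, score) pairs, best first.
        The score is the share of query trigrams found in the record."""
        grams = trigrams(text)
        votes = Counter()
        for gram in grams:
            for record_id in self.postings.get(gram, ()):
                votes[record_id] += 1
        return [(rid, hits / len(grams)) for rid, hits in votes.most_common(top)]


# Name and address fields taken from the sample records (shortened).
index = TrigramIndex()
index.add(4, "ABUS KRANSYSTEME GMBH;SONNENWEG 1;51647 GUMMERSBACH")
index.add(5, "ANCRA JUNGFALK GMBH;RICHARD-STOCKER-STR. 19;78234 ENGEN")
index.add(6, "ACTECH GMBH;AM ST.NICLAS-SCHACHT 13;09599 FREIBERG")
index.add(7, "AUTOHAUS;MOLTKE-GARAGE GMBH;ESPERANTOSTR. 8A;70197 STUTTGART")

ranked = index.query("ESPARANT0 STR. 80 70137 STUTTGT")
for record_id, score in ranked:
    print(f"record {record_id}: score {score:.2f}")
```

With these four records, the noisy address block ranks record 7 first, because most of its trigrams survive the OCR errors and the abbreviation. Only records sharing at least one trigram with the query are touched, which is what makes an index of this kind cheaper than comparing the query against every record.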

One of the main goals of the AMASS project is to enable content-based associative databases like C;A;R;E to run much faster by making effective use of the capabilities of modern hardware.