Databases are efficient systems for storing and retrieving data, and by now
they shape our everyday life in ways both subtle and obvious. Where and why
can things go wrong, and what can be done about it?
In many business applications it is imperative to identify client data in
letters, forms, invoices, or electronic messages by matching the input
against the data stored in a central database. Identifying a client’s
database entry gives access to further information about the client (such as
the bank account) and allows the necessary operations to be performed on
that data (for example, paying the total sum to that bank account). This is
called associative access, because it relates partial information about a
record to the record's full content.
Consider, for example, the following four records, part of a large client
database. The fields are separated by the “;” symbol.
4;ABUS KRANSYSTEME GMBH GUMMERSBACH;ABUS KRANSYSTEME GMBH;;SONNENWEG 1;51647 GUMMERSBACH;GUMMERSBACH;;51601 GUMMERSBACH;Postfach 100162;;;061236080;315568220;N;320017;DEUTSCHE;;4023800;N;BLZ 51070021;UA1-01
5;ANCRA JUNGFALK GMBH ENGEN;ANCRA JUNGFALK GMBH;;RICHARD-STOCKER-STR. 19;78234 ENGEN;ENGEN;;78230 ENGEN;Postfach 1309;;DE811575164;02261370;341920403;N;;DEUTSCHE;;100800;N;BLZ 38470091;UA1-01
6;ACTECH GMBH FREIBERG;ACTECH GMBH;;AM ST.NICLAS-SCHACHT 13;09599 FREIBERG;FREIBERG;; ;;;;;316131929;N;50023;DEUTSCHE;;949768;N;BLZ 69270038;UA1-017
7;AUTOHS.MOLTKE-GARAGE GMBH STUTTGT.;AUTOHAUS;MOLTKE-GARAGE GMBH;ESPERANTOSTR. 8A;70197 STUTTGART;STUTTGART;; ;;;DE171187036; 03731781179;313798340;N;;DEUTSCHE;;210047729;N;BLZ 87030670;UA1-01
While the record structure is in general well defined, each record may have
empty fields, and some addresses or place names may be abbreviated (such as
Stuttgt. instead of Stuttgart). In practice such records are also often
duplicated, reflecting a change of address, bank account, etc.
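
To make the record layout concrete, here is a minimal sketch that splits one of the example records into fields and lists the empty ones. The helper name `parse_record` is hypothetical and not part of C;A;R;E; the sketch only assumes the “;” separator described above and says nothing about what each field position means.

```python
# Record 7 from the example, joined onto a single line.
RECORD = (
    "7;AUTOHS.MOLTKE-GARAGE GMBH STUTTGT.;AUTOHAUS;MOLTKE-GARAGE GMBH;"
    "ESPERANTOSTR. 8A;70197 STUTTGART;STUTTGART;; ;;;DE171187036; 03731781179;"
    "313798340;N;;DEUTSCHE;;210047729;N;BLZ 87030670;UA1-01"
)


def parse_record(line, sep=";"):
    """Split one record into fields, stripping surrounding whitespace.

    Fields that are empty or contain only blanks come back as ''.
    """
    return [field.strip() for field in line.split(sep)]


fields = parse_record(RECORD)

# Positions of the empty fields, which a matcher has to tolerate.
empty_positions = [i for i, field in enumerate(fields) if not field]

print(fields[0], "|", fields[4], "|", fields[5])
print("empty field positions:", empty_positions)
```

Stripping whitespace treats a field holding a single blank the same as a truly empty one, which seems a reasonable choice for data like the above.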
The input document may also contain spelling and typing errors, as well as
OCR-generated errors if it was received on paper. Continuing the example
above, the input document might contain an address block that, after
scanning, OCR, and further pre-processing, looks as follows:
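ESPARANT0 STR. 80 70137 STUTTGT
Errors are marked in red in the original. Compared with record 7, they
include ESPARANT0 for ESPERANTO, 80 for 8A, and 70137 for 70197. The goal is
to find record number 7, correct the errors, identify the fields, and pass
this information to the next station of the workflow.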
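
The first step, finding record 7 despite the errors, can be illustrated with a small sketch. It is not C;A;R;E's method: it swaps in the generic string similarity from Python's standard `difflib` module and compares the noisy address block with the address part of each example record. The names `ADDRESSES`, `similarity`, and `rank` are assumptions made for this illustration.

```python
import difflib

# Address parts (street, postal code, city) of the four example records.
ADDRESSES = {
    4: "SONNENWEG 1 51647 GUMMERSBACH",
    5: "RICHARD-STOCKER-STR. 19 78234 ENGEN",
    6: "AM ST.NICLAS-SCHACHT 13 09599 FREIBERG",
    7: "ESPERANTOSTR. 8A 70197 STUTTGART",
}

# The address block as it looks after scanning and OCR.
QUERY = "ESPARANT0 STR. 80 70137 STUTTGT"


def similarity(a, b):
    """Similarity in [0, 1] based on difflib's matching blocks."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def rank(query, addresses):
    """Return (record_id, score) pairs, best match first."""
    scored = [(rid, similarity(query, text)) for rid, text in addresses.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank(QUERY, ADDRESSES)
best_id = ranking[0][0]
print("best matching record:", best_id)
```

Comparing the query against every record in turn only works for tiny examples, since the cost grows with the size of the database. A large client database needs some form of index.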
In most cases the input document contains some of the record data, but
certainly not in the field-organized structure of the database; in other
cases (forms, web forms) a field structure can be enforced. The task is
therefore to find the best match between a database record and an input
consisting of either unstructured text or semi-structured text (forms).
Both may contain orthographic errors, missing parts, or both.
C;A;R;E, the content addressable record extraction module, is LCI’s
solution to this problem. C;A;R;E takes as input raw text in ASCII or
Unicode format, or a CSV-format file as shown above.
After creating its own index, C;A;R;E can answer queries consisting of
structured or unstructured fields. Much like a search engine, C;A;R;E then
returns a ranked list of the best-matching records in the database. If the
query contained no field information, C;A;R;E can, on request, also provide
a map between the input strings and the matching record fields.
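
The text does not describe how C;A;R;E builds its index. As a purely illustrative sketch of the "index first, then ranked list" idea, the following uses a common generic technique: an inverted index over character trigrams, with candidates ranked by the Dice coefficient. The class `TrigramIndex` and all other names are hypothetical.

```python
from collections import Counter, defaultdict


def trigrams(text):
    """Set of character trigrams of the upper-cased, whitespace-normalized text."""
    s = " ".join(text.upper().split())
    return {s[i:i + 3] for i in range(len(s) - 2)}


class TrigramIndex:
    """Inverted index: trigram -> ids of the records containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.sizes = {}  # record id -> number of distinct trigrams

    def add(self, rid, text):
        grams = trigrams(text)
        self.sizes[rid] = len(grams)
        for gram in grams:
            self.postings[gram].add(rid)

    def query(self, text, k=3):
        """Return up to k (record_id, score) pairs, best match first."""
        grams = trigrams(text)
        hits = Counter()
        for gram in grams:
            for rid in self.postings.get(gram, ()):
                hits[rid] += 1
        # Dice coefficient: shared trigrams relative to both set sizes.
        scored = [(rid, 2 * n / (len(grams) + self.sizes[rid]))
                  for rid, n in hits.items()]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:k]


# Name and address parts of the four example records.
RECORDS = {
    4: "ABUS KRANSYSTEME GMBH SONNENWEG 1 51647 GUMMERSBACH",
    5: "ANCRA JUNGFALK GMBH RICHARD-STOCKER-STR. 19 78234 ENGEN",
    6: "ACTECH GMBH AM ST.NICLAS-SCHACHT 13 09599 FREIBERG",
    7: "AUTOHAUS MOLTKE-GARAGE GMBH ESPERANTOSTR. 8A 70197 STUTTGART",
}

index = TrigramIndex()
for rid, text in RECORDS.items():
    index.add(rid, text)

ranked = index.query("ESPARANT0 STR. 80 70137 STUTTGT")
print(ranked)
```

Only records that share at least one trigram with the query are scored, so the work per query depends on the posting lists touched rather than on the total number of records.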
One of the main goals of the AMASS project is to enable associative
databases based on content analysis, like C;A;R;E, to run much faster by
making effective use of the possibilities offered by modern hardware.