Search and Destroy: 5 Pro Tips for Extracting What You Want in Ephesoft Transact

During World War II, German submarines were causing the Allies a multitude of problems by lurking around shipping routes and attacking Allied ships. It was the job of US destroyers to protect these ships and to seek out and destroy the German submarines. To do this, the US Navy had to be clever and outsmart its opponent! Building extraction logic is very similar. You have to study the enemy (in this case, documents that have unstructured data). Below are 5 pro tips that I use when building extraction logic within Ephesoft Transact.

Ephesoft Transact has a number of extraction tools available, but in this article I will only talk about the most popular one: Key Value Extraction. Key Value Extraction works by locating anchor words (keys) on the page and then running a regex against a zone positioned relative to the key.

1. The bigger the zone, the more specific the regex

When building key value extraction, a good rule of thumb is this: the larger the zone, the more focused your regular expression has to be. For example, look at the extraction zone below. I have a fairly large red zone and a value pattern of .+, which matches anything. This will pull back the ticket number of 41068, but it will also bring back a bunch of other data that I don’t want. The answer is to make the zone much smaller or to tighten the value regular expression. Changing the regex from .+ to [0-9]{4,6} will capture only the data I am looking for.

2017-04-17_1001

2017-04-17_0959
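To make the difference between the two value patterns concrete, here is a minimal Java sketch. It does not use Ephesoft at all; it only runs the two regular expressions from the tip against a made-up block of zone text. The class name, the method name, and the sample text are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZoneRegexDemo {
    // Hypothetical OCR text inside a large value zone (not taken from a real batch).
    static final String SAMPLE_ZONE = "41068 Truck 12\nMaterial: Gravel\nNet 18.4 tons";

    // Collects every match of the value pattern within the zone text.
    static List<String> matches(String valueRegex, String zoneText) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(valueRegex).matcher(zoneText);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The catch-all pattern returns every line in the zone.
        System.out.println(matches(".+", SAMPLE_ZONE));
        // The focused pattern returns only runs of 4 to 6 digits.
        System.out.println(matches("[0-9]{4,6}", SAMPLE_ZONE));
    }
}
```

With this sample text, the catch-all pattern returns all three lines, while the focused pattern returns only 41068. Note that [0-9]{4,6} would also match the first six digits of a longer number, so a real rule may need word boundaries as well.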

2. Working with multiple extraction rules, use the weight feature

I think many people aren’t aware of how important weights are when building multiple extraction rules for a single index field. I see a lot of projects that do not take advantage of the weight feature. Take a look at the example below, which shows several extraction rules for an invoice number.

2017-04-17_1007

I have 4 different extraction rules built out. For the most part, they all have very similar regular expressions. The main difference between these rules is their weights and how I have laid out the extraction zones.

For example, the first rule’s zone looks like this:

2017-04-17_1010

Its main function is to look for invoice numbers that sit to the right of the key word. It has a weight of 1.

My second rule looks like this:

2017-04-17_1013.png

It searches for the invoice number below the key word. It has a weight of 0.9.

So you can begin to see that this forms a bit of a decision tree for how Ephesoft picks the actual extraction rule. The weight affects the value’s confidence score. Transact runs all the extraction rules, evaluates which one produces the highest confidence score, and picks that rule.
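The selection step can be sketched in a few lines of Java. This is only an illustration of the idea, not Transact’s actual scoring code: it assumes the final score is simply the raw confidence multiplied by the weight, and the class, field names, and sample values are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class WeightedRuleDemo {
    // One candidate value produced by one extraction rule (hypothetical model).
    static final class Candidate {
        final String ruleName;
        final String value;
        final double rawConfidence;
        final double weight;

        Candidate(String ruleName, String value, double rawConfidence, double weight) {
            this.ruleName = ruleName;
            this.value = value;
            this.rawConfidence = rawConfidence;
            this.weight = weight;
        }

        // Assumption: the weight scales the raw confidence linearly.
        double score() {
            return rawConfidence * weight;
        }
    }

    // Returns the candidate with the highest weighted score, or null if there are none.
    static Candidate pickBest(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null || c.score() > best.score()) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("right-of-key", "INV-1001", 90.0, 1.0),
                new Candidate("below-key", "2017", 95.0, 0.9));
        Candidate best = pickBest(candidates);
        System.out.println(best.ruleName + " -> " + best.value);
    }
}
```

In this made-up case the below-key rule has the higher raw confidence, but its lower weight lets the right-of-key rule win, which is the decision-tree effect described above.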

3. Negative/positive lookaheads and lookbehinds

This one may require you to brush up on your regex skills, but it is very valuable for building out complex extraction rules. I use RegexBuddy for building some of the more complex regular expressions. Let’s say you are working with the example below. You can see that we have two dates on the document: the invoice date and the due date. I want to capture the invoice date and not the date it’s due. If I use the key word “DATE”, it will match, but it will match in two places.

2017-04-17_1019.png

However, if I change my key to match the word “DATE”, but not when the word “DUE” comes before it, then it will capture the invoice date only. Written as a regular expression, it looks like this:

(?<!DUE )DATE

Or you could use the regex builder to build this out:

2017-04-17_1028.png
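Here is a small Java check of the negative lookbehind, run outside of Ephesoft. The regular expression is the one from this tip; the sample line, class name, and method name are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindKeyDemo {
    // Hypothetical line of OCR text containing both dates.
    static final String SAMPLE_LINE = "INVOICE DATE 04/17/2017 DUE DATE 05/17/2017";

    // Returns the start index of every place the key pattern matches.
    static List<Integer> keyPositions(String keyRegex, String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(keyRegex).matcher(text);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Plain key: matches both the invoice date label and the due date label.
        System.out.println(keyPositions("DATE", SAMPLE_LINE));
        // Negative lookbehind: skips any DATE that is preceded by "DUE ".
        System.out.println(keyPositions("(?<!DUE )DATE", SAMPLE_LINE));
    }
}
```

The plain key matches twice, while the lookbehind version matches only the first “DATE”, the one that belongs to the invoice date.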

4. You don’t have to always look forward.

Many people assume that you have to use the anchor that is adjacent to the data value, but you don’t. In some cases it may be better to use another keyword as the key. Take the example below. I am trying to extract the “interest rate”, but I know that the words “interest rate” are in small text and are sometimes difficult to OCR. The word “property” below it is in large, bold text. This is a much better anchor word to use, looking up from that key word to the value. OCR will be much more reliable.

2017-04-17_1035.png
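Geometrically, “looking up” from a key just means placing the value zone above the key’s bounding box. The Java sketch below shows that arithmetic only. It assumes page coordinates with the origin at the top left (y grows downward), and the Box class, method name, and sample numbers are hypothetical, not part of Ephesoft.

```java
class ZoneAboveKeyDemo {
    // Simple bounding box: (x0, y0) is the top-left corner, (x1, y1) the bottom-right.
    static final class Box {
        final int x0, y0, x1, y1;

        Box(int x0, int y0, int x1, int y1) {
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }

        @Override
        public String toString() {
            return "(" + x0 + "," + y0 + ")-(" + x1 + "," + y1 + ")";
        }
    }

    // Builds a value zone that sits directly above the key, `height` pixels tall,
    // widened by `widen` pixels on each side and clamped at the page edges.
    static Box zoneAbove(Box key, int height, int widen) {
        int left = Math.max(0, key.x0 - widen);
        int top = Math.max(0, key.y0 - height);
        return new Box(left, top, key.x1 + widen, key.y0);
    }

    public static void main(String[] args) {
        Box key = new Box(400, 900, 700, 950); // where the bold anchor word was found
        System.out.println(zoneAbove(key, 200, 50));
    }
}
```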

5. Find a particular pattern anywhere on the page without a key

This is another technique I use all the time, and it works great. You don’t have to use a key word as your key; a pattern will work as well. Let’s pretend we are trying to find a unique pattern on the document, like a social security number.

As you can see below I have the same pattern for both the key and the value. I also have my key zone on top of my value zone. This essentially performs a pattern match across the entire document to find the social security number.

2017-04-17_1050_001.png

2017-04-17_1050.png
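The same idea can be tried in plain Java. The post does not spell out the exact pattern, so the social security number regex below is an assumption, as are the class name, method name, and sample page text. The key and the value use the same pattern, and the value is read from the very text the key matched, which mimics placing the value zone on top of the key zone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternAsKeyDemo {
    // Assumed SSN-style pattern: three digits, two digits, four digits, dash separated.
    static final String SSN_PATTERN = "\\b\\d{3}-\\d{2}-\\d{4}\\b";

    // Uses the same pattern as key and value: every key hit is also the extracted value.
    static List<String> findAnywhere(String pattern, String pageText) {
        Pattern compiled = Pattern.compile(pattern);
        List<String> values = new ArrayList<>();
        Matcher key = compiled.matcher(pageText);
        while (key.find()) {
            // The value zone overlaps the key zone, so re-apply the pattern to the key text.
            Matcher value = compiled.matcher(key.group());
            if (value.find()) {
                values.add(value.group());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String page = "Applicant: Jane Doe\nSSN 123-45-6789\nPhone 555-123-4567";
        System.out.println(findAnywhere(SSN_PATTERN, page));
    }
}
```

With this sample page, only 123-45-6789 is returned; the phone number has the wrong digit grouping and is ignored.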

I hope these 5 pro extraction tips help you find some golden nuggets.

Until next time!

Unconventional Capture – Image Capture with Ephesoft?

One of the things I have enjoyed most in my time with Ephesoft is solving a problem that presents itself as a difficult one, because it forces me to think creatively. It may sound odd to think creatively when working with the 1s and 0s within software, but it’s really important to think out of the box when designing a solution.

I recently worked with a customer that wanted to capture data using Ephesoft, which of course is an intelligent document capture technology designed to OCR documents and capture data from them. This customer was in the aerospace industry and was spending a lot of time looking through huge 100+ page CMMs (Component Maintenance Manuals). These CMMs have many smaller individual parts and components on each page. The customer has an application in which they would like to store the manufacturer’s diagrams without going through the pain of opening and searching through a 100+ page PDF file. While working on this project, it became clear that they did not want to capture OCR text but rather the actual images within a document. How does one handle that? Below is an example of the type of image they wanted to capture. You can see it is a diagram of a particular part that belongs to a larger component.

2017-03-01_1147.png

Below is a short video that showcases how Ephesoft can solve this problem of embedded images within a CMM PDF.

Since Ephesoft is the system of choice when it comes to OCR and Advanced Data Capture I decided that we would need a few customizations in order to achieve the image capture that they were looking for. Ephesoft has a very nice scripting interface that allows one to write small custom scripts and place them within a capture process.

Below is an example of the script that I created to enable zone image capture. This script restructures the zone of the “Image” document level field so that it stretches from the keyword field up to the top of the page.

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;

import com.ephesoft.dcma.script.IJDomScript;

/**
 * @author Chris MacWilliams
 * @version 1.0
 */
public class ScriptExtraction implements IJDomScript {

 private static final String BATCH_LOCAL_PATH = "BatchLocalPath";
 private static final String BATCH_INSTANCE_ID = "BatchInstanceIdentifier";
 private static final String EXT_BATCH_XML_FILE = "_batch.xml";
 private static String ZIP_FILE_EXT = ".zip";

 public Object execute(Document documentFile, String methodName, String docIdentifier) {
 Exception exception = null;
 try {
 System.out.println("************* Inside ScriptTemplate scripts.");
 System.out.println("************* Start execution of the ScriptTemplate scripts.");

 if (null == documentFile) {
 System.out.println("Input document is null.");
 }

 methodTemplate(documentFile);
 System.out.println("************* End execution of the ScriptExtraction scripts.");
 //Set isWrite to true to have changes to document written out.
 boolean isWrite = false;
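 // Write the document object to the XML file. writeToXML(...) is assumed to be the helper from the
 // script template this class was built from; it is not shown in this listing.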
 if (isWrite)
 {
 writeToXML(documentFile);
 System.out.println("************* Successfully write the xml file for the ScriptExtraction scripts.");
 System.out.println("************* End execution of the ScriptExtraction scripts.");
 }
 } catch (Exception e) {
 System.out.println("************* Error occurred in scripts." + e.getMessage());
 exception = e;
 e.printStackTrace();
 }
 return exception;

 }

 /**
 * Basic scripting template that will traverse through a document list and its document level fields.
 * Includes instantiated string variables containing the most relevant information from batch xml tags.
 * Users of this script should change the class name for the appropriate module/plugin, as well as the
 * this method's name.
 */

 private void methodTemplate(Document document)
 {
 //Gets the document root element.
 Element docRoot = document.getRootElement();
 int pagecount =0;
 //Get and traverse through documents list.
 List<Element> docList = docRoot.getChild("Documents").getChildren("Document");

 for (Element doc : docList)
 {
 //String variables containing information regarding the individual documents in the documents list.
 String docId = doc.getChildText("Identifier");
 String docType = doc.getChildText("Type");
 //Flag to check if we were able to find the extracted data.
 boolean isKeywordPresent = false;
 System.out.println(docId);
 //this will store the y coordinate
 String KeyWordy1Cord ="";
 //Get and traverse through document level fields list.
 List<Element> dlfList = doc.getChild("DocumentLevelFields").getChildren("DocumentLevelField");

 for (Element dlf: dlfList)
 {
 //String variables containing the name of the document level field, and the value in that field.
 //Note: If there is an empty tag in the batch xml, such as <Value/>, the text for that tag will be an empty string "", not null.

 String dlfName = dlf.getChildText("Name");
 String dlfValue = dlf.getChildText("Value");

 System.out.println(dlfName);
 //check if the document level field is keyWord
 if (dlfName.equalsIgnoreCase("keyWord")) {
 if(dlfValue != null) {
 System.out.println("FOUND KEY WORD");
 //Element Cords =new Element("CoordinatesList");
 Element Cord = dlf.getChild("CoordinatesList").getChild("Coordinates");
 //Element x0 = Cord.getChild("x0");
 //Element y0 = Cord.getChild("y0");
 //Element x1 = Cord.getChild("x1");
 Element y1 = Cord.getChild("y1");
 KeyWordy1Cord= y1.getText();
 System.out.println(" Cordy1 " +KeyWordy1Cord);
 isKeywordPresent=true;
 }
 }

 if (dlfName.equalsIgnoreCase("image")) {
 if(isKeywordPresent){
 if(dlfValue == null) {
 Element ValueTag = new Element("Value");
 ValueTag.setText("Test");
 dlf.addContent(ValueTag);

 Element Cords =new Element("CoordinatesList");
 //Element Cord = new Element("CoCoordinates");
 Element page = new Element("Page");
 Element ForceReview = new Element("ForceReview");
 dlf.addContent(new Element("Confidence").setText("100.0"));
 dlf.addContent(Cords.addContent(new Element("Coordinates")));

 Element Cord = dlf.getChild("CoordinatesList").getChild("Coordinates");

 Element x0 = new Element("x0");
 Element y0 = new Element("y0");
 Element x1 = new Element("x1");
 Element y1 = new Element("y1");

 x0.setText("100");
 y0.setText("100");
 x1.setText("3200");
 y1.setText(KeyWordy1Cord);
 System.out.println(" Set Keyword1 "+ KeyWordy1Cord);
 page.setText("PG" + pagecount);
 ForceReview.setText("false");

 Cord.addContent(x0);
 Cord.addContent(y0);
 Cord.addContent(x1);
 Cord.addContent(y1);

 dlf.addContent(page);
 dlf.addContent(ForceReview);

 }

 }
 }
 }
 //Move on to the next page identifier once all fields of this document are processed.
 pagecount++;
 }
 }
}

I also created a second script that takes the zones in the image document level field and crops the page image down into smaller image segments. It also moves the files to an export directory, as seen in the video above.

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;

import com.ephesoft.dcma.script.IJDomScript;
import java.io.BufferedReader;
import java.io.InputStreamReader;

/**
* @author Chris MacWilliams
* @version 1.0
*/
public class ScriptExport implements IJDomScript {

private static final String BATCH_LOCAL_PATH = "BatchLocalPath";
private static final String BATCH_INSTANCE_ID = "BatchInstanceIdentifier";
private static final String EXT_BATCH_XML_FILE = "_batch.xml";
private static String ZIP_FILE_EXT = ".zip";

public Object execute(Document documentFile, String methodName, String docIdentifier) {
Exception exception = null;
try {
System.out.println("************* Inside ScriptTemplate scripts.");
System.out.println("************* Start execution of the ScriptTemplate scripts.");

if (null == documentFile) {
System.out.println("Input document is null.");
}

methodTemplate(documentFile);

//Set isWrite to true to have changes to document written out.
boolean isWrite = false;
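// Write the document object to the XML file. writeToXML(...) is assumed to be the helper from the
// script template this class was built from; it is not shown in this listing.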
if (isWrite)
{
writeToXML(documentFile);
System.out.println("************* Successfully write the xml file for the ScriptExtraction scripts.");
System.out.println("************* End execution of the ScriptExtraction scripts.");
}
} catch (Exception e) {
System.out.println("************* Error occurred in scripts." + e.getMessage());
exception = e;
}
// Return the captured exception (null on success) so a failure is not silently swallowed.
return exception;
}

/**
* Basic scripting template that will traverse through a document list and its document level fields.
* Includes instantiated string variables containing the most relevant information from batch xml tags.
* Users of this script should change the class name for the appropriate module/plugin, as well as the
* this method's name.
*/

private void methodTemplate(Document document)
{
//Gets the document root element.
Element docRoot = document.getRootElement();
//String variables containing information regarding the current batch class executing this script.
String batchInstanceId = docRoot.getChildText("BatchInstanceIdentifier");
//Export Path
String ExportPath = "C:\\Ephesoft\\SharedFolders\\final-drop-folder\\ProductCatalog\\"+ batchInstanceId;
new File(ExportPath).mkdirs();

//Get and traverse through documents list.
List<Element> docList = docRoot.getChild("Documents").getChildren("Document");
for (Element doc : docList)
{
//String variables containing information regarding the individual documents in the documents list.
String docId = doc.getChildText("Identifier");
String docType = doc.getChildText("Type");

Element batchInstanceID = document.getRootElement().getChild(BATCH_INSTANCE_ID);
String batchInstanceIdentifier = batchInstanceID.getText();

//Get and traverse through document level fields list.
List<Element> dlfList = doc.getChild("DocumentLevelFields").getChildren("DocumentLevelField");
for (Element dlf: dlfList)
{
//String variables containing the name of the document level field, and the value in that field.
//Note: If there is an empty tag in the batch xml, such as <Value/>, the text for that tag will be an empty string "", not null.
String dlfName = dlf.getChildText("Name");
String dlfValue = dlf.getChildText("Value");

if(dlfName.equalsIgnoreCase("image")) {
if(dlfValue !=null) {
Element cordlist = dlf.getChild("CoordinatesList");

if (cordlist !=null) {
//get page value
String pageID = dlf.getChild("Page").getText();
String identifier = pageID.substring(2);

Element cords = cordlist.getChild("Coordinates");
Element cordsx0 = cords.getChild("x0");
Element cordsy0 = cords.getChild("y0");
Element cordsx1 = cords.getChild("x1");
Element cordsy1 = cords.getChild("y1");

System.out.println("document level field name "+ dlfName);
System.out.println(" x0 "+ cordsx0.getText());
System.out.println(" y0 "+ cordsy0.getText());
System.out.println(" x1 "+ cordsx1.getText());
System.out.println(" y1 "+ cordsy1.getText());

//Crop box size: the width comes from the x coordinates and the height from the y coordinates,
//each padded by 100 pixels. ImageMagick expects the geometry as WIDTHxHEIGHT+X+Y.
int boxWidth = (Integer.parseInt(cordsx1.getText()) - Integer.parseInt(cordsx0.getText())) + 100;
int boxHeight = (Integer.parseInt(cordsy1.getText()) - Integer.parseInt(cordsy0.getText())) + 100;
int offsetx0 = Integer.parseInt(cordsx0.getText()) - 50;
int offsety0 = Integer.parseInt(cordsy0.getText()) - 50;

Element pages = doc.getChild("Pages");
Element pageNode = pages.getChild("Page");
String DisplayFileName = pageNode.getChildText("DisplayFileName");
//Path to Image Magick Executable
String IMpath ="C:\\Ephesoft\\Dependencies\\ImageMagick\\";
String Imexe = "convert.exe";
String IMOpperator = "-crop";

String BatchLocalPath = document.getRootElement().getChildText("BatchLocalPath");
String PathSeperator = "\\";
String whiteSpace = " ";

//Command line command that passes in the image path and the coordinates to crop
String command = IMpath + Imexe + whiteSpace+ BatchLocalPath + PathSeperator + batchInstanceIdentifier
+ PathSeperator + DisplayFileName + whiteSpace + IMOpperator + " " + boxWidth + "x" + boxHeight + "+"
+ offsetx0 + "+" + offsety0 + " " + ExportPath + PathSeperator +"PAGE"+ identifier +".png";

System.out.println(command);

try {
Process p = Runtime.getRuntime().exec(command);
p.waitFor();
System.out.println("Exit code: "+p.exitValue());

}
catch(IOException e1) {
e1.printStackTrace();
}
catch(InterruptedException e2) {
System.out.println(e2.getMessage());
}

System.out.println("Done ");
}
}
}
}
}
}
}
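The crop step boils down to turning the zone coordinates into an ImageMagick geometry string (WIDTHxHEIGHT+X+Y) and passing it to the -crop operator. The standalone Java sketch below isolates that arithmetic. It is a variation on the script above, not a copy: it clamps the offsets at zero and builds the command as an argument list, which is an assumption about how one might avoid problems with spaces in paths. The class name, method names, and file paths are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class CropCommandDemo {
    // Builds an ImageMagick geometry string (WIDTHxHEIGHT+X+Y) for a zone,
    // adding `padding` pixels on every side and clamping the offsets at zero.
    static String cropGeometry(int x0, int y0, int x1, int y1, int padding) {
        int width = (x1 - x0) + 2 * padding;
        int height = (y1 - y0) + 2 * padding;
        int offsetX = Math.max(0, x0 - padding);
        int offsetY = Math.max(0, y0 - padding);
        return width + "x" + height + "+" + offsetX + "+" + offsetY;
    }

    // Assembles the command as separate arguments so paths containing spaces stay intact.
    static List<String> cropCommand(String convertExe, String inputImage,
                                    String geometry, String outputImage) {
        return Arrays.asList(convertExe, inputImage, "-crop", geometry, outputImage);
    }

    public static void main(String[] args) {
        String geometry = cropGeometry(100, 100, 3200, 1500, 50);
        List<String> command = cropCommand("convert.exe", "page0.tif", geometry, "PAGE0.png");
        // Only printed here; new ProcessBuilder(command).start() would actually run it.
        System.out.println(command);
    }
}
```

Passing the list to ProcessBuilder, instead of handing one long string to Runtime.exec, keeps each path as a single argument even when a folder name contains spaces.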

As you can see, with a few customizations and scripts you can really extend Ephesoft’s functionality to solve a unique problem. Feel free to leave comments below.

Cmac @ Ephesoft Innovate 2016

File_000.jpeg

If you were unable to attend Ephesoft Innovate 2016, you missed out on my talk on Advanced Extraction with Ephesoft. Well, it’s not too late. See the full talk below and learn about all the powerful extraction tools that Ephesoft Transact 4.1 has to offer. Read more about Ephesoft at ephesoft.com

See what these extraction tools mean to your business

  • Field Extraction Tools
    • Paragraph Extraction
    • Wrapped Extraction
    • Data Conversion
    • Advanced Barcode Extraction
  • Table Extraction Improvements
    • Cross-Section Extraction
    • NEW UI Layout and functionality
  • Batch Class Extraction Rule Management
    • Global Batch Class Extraction Management
    • NEW Testing interface UI
  • Linux Fixed Form Extraction
  • Machine Learning Extraction
    • Automatic User Based Input Extraction

Ephesoft and Alfresco

These days it is not enough to just have a document management system. You need the power of a capture system, the organization of a document management system, and a workflow engine to drive the process. When these three pieces work together, you really start to see some business automation improvements.

See my video on how to do this with Ephesoft and Alfresco.

Ephesoft is integrated with Alfresco via the CMIS protocol. This protocol is not limited to Alfresco; many other content management systems and capture systems have begun to adopt the standard.