Evans Training & Consulting

 

Software to create solutions. Training to build confidence. Consulting to help you succeed.

The Ins and Outs of OCR Searching in Summation
By Cheryl Evans - Evans Training & Consulting | June 25, 2009 at 10:01 AM EDT | No Comments

Recently, on the Summation Users List, we got into a discussion about the pros and cons of searching OCR from the Case Explorer tree vs the OCRBase tab. A user said that the most effective way to search a case is from the Case Explorer tree; I disagreed. Another user asked me to explain, because when he searches from the OCRBase tab, he is only searching the currently displayed document. As explained below, his process isn't correct. I'm an advocate for searching from the OCRBase tab, and for searching OCR from the Case Explorer only if you understand the trade-offs.

I thought reprinting my post to the group here might be helpful. I've edited it to remove some of the personal references to the person asking the initial question.

This is how I explain it in my training classes. My training manual has an 8-page appendix entitled “Everything you Thought You Knew about OCR but Didn’t” or something along those lines. (No, it’s not for sale.)

 

(The person asking the initial question complained that when he searches in the OCRBase tab, he's only searching the currently displayed document. That's not correct, and here's why.) When the OCRBase tab is active, you have four options: Search, Fuzzy Search, All Fuzzy and Search All. When you hit the Enter key after you enter your search term, you are only searching the current document, which I bet is what most users who haven't been trained on this particular view do. Searching just the current document is handy when you want to find exactly where in that document your search term appears, but it doesn’t retrieve the entire set of documents that respond to your search.

 

Instead, you want to hit the Search All button to run a search for your exact search term. If the Search All button’s not appearing on your toolbar (which can happen in versions after 2.8), try right-clicking on a gray area of the toolbar (to the right of the Help button works) and select Reset Toolbars. If that doesn’t force the Search All button to appear, you might try monkeying with the resolution on your screen (not my favorite solution since I often present with projectors that force a lower resolution and hide the darn button on this toolbar). Alternatively, go to the Search menu and select “Quick Search All OCRBase.”

 

But one of the reasons why I prefer searching from the OCRBase tab is the ready availability of Fuzzy Search All (or you can Fuzzy Search just the current document). (By the way, I’m not at all a fan of how Summation renamed these toolbar buttons. It seemed much clearer in the 2.7.x version.) OCR, as you know, is inherently imperfect, no matter what claims a vendor makes to you (in my opinion). Although you can use asterisks to surround your search term when you conduct a Quick Search, you can’t use asterisks in the middle of a term to account for misinterpretation of characters (e.g., you’re searching for CAT but the OCR sometimes interpreted that A as an O or a U or….). That’s where Fuzzy Search All comes in. You’re presented with a list of the terms in the OCRBase that are spelled close to the way your search term is spelled.
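If it helps to see the difference outside of Summation, here is a rough Python sketch. It is only an analogy: the term list is invented, and Python's `difflib` similarity score stands in for whatever matching Summation actually uses, which I'm not claiming to reproduce.

```python
from difflib import SequenceMatcher
from fnmatch import fnmatch

# Invented list of terms as an OCR engine might have read them.
ocr_terms = ["cat", "cot", "cut", "cats", "dog"]

def wildcard_hits(pattern, terms):
    """Return the terms matching a pattern with leading/trailing asterisks."""
    return [t for t in terms if fnmatch(t, pattern)]

def fuzzy_hits(term, terms, threshold):
    """Return the terms whose similarity to `term` is at least `threshold` (0.0 to 1.0)."""
    return [t for t in terms if SequenceMatcher(None, term, t).ratio() >= threshold]

# Asterisks around the term find "cat" and "cats" but miss the misread "cot" and "cut".
print(wildcard_hits("*cat*", ocr_terms))

# A similarity threshold (an assumption standing in for fuzzy search) picks them up.
print(fuzzy_hits("cat", ocr_terms, 0.65))
```

The point of the sketch is simply that a surrounding wildcard can't repair a character that was misread in the middle of a word, while a similarity-based list can surface it.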

 

Here’s an example in the Franc v Morris sample case. With the OCRBase tab active, enter the term INTERROGATORY (doesn’t matter if it’s capped or not). For purposes of this illustration, don’t use any asterisks. Then hit Search All (or select Quick Search All OCRBase from the Search menu). You get back no hits, right? Now, hit the All Fuzzy tool. You get a list of possible hits. You can change the “fuzziness” percentage, going as low as 65% or as high as 99%. The lower the fuzziness percentage, the more terms you’ll see in that list. Select all of the terms, unless you've happened to lower the fuzziness to 65% and got “interceptor” as one of your choices. If so, select all the terms and then de-select “interceptor.” One document is located in that search vs the “0” documents you found without using fuzzy.
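To picture what changing the fuzziness percentage does, here is a small Python sketch. Everything in it is an assumption for illustration: the vocabulary is made up (it is not the actual OCR in the Franc v Morris case), and `difflib`'s ratio is a stand-in for Summation's own fuzziness calculation, so the percentages won't line up exactly with what you see on screen.

```python
from difflib import SequenceMatcher

# Invented vocabulary showing how OCR might have garbled the term.
ocr_vocabulary = ["interrogatorv", "lnterrogatory", "interceptor", "deposition"]

def candidate_terms(term, vocabulary, fuzziness):
    """List vocabulary terms that are at least `fuzziness` similar to `term`."""
    return [w for w in vocabulary if SequenceMatcher(None, term, w).ratio() >= fuzziness]

# An exact search finds nothing because the term is never spelled correctly.
exact = [w for w in ocr_vocabulary if w == "interrogatory"]

# A high setting returns a short list; a low setting returns a longer one.
strict = candidate_terms("interrogatory", ocr_vocabulary, 0.90)
loose = candidate_terms("interrogatory", ocr_vocabulary, 0.65)

# Mimic de-selecting an unwanted term before running the search.
selected = [w for w in loose if w != "interceptor"]

print(exact, strict, loose, selected, sep="\n")
```

In this toy vocabulary, the lower setting pulls in "interceptor" along with the genuine misreads, which is why you review the list before running the search.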

 

Now here’s where just being able to search the document currently displayed on your screen comes in. If you were to hit “Search”, you wouldn’t find any hits in the document because it’s searching for the original search term (“interrogatory”). But if you hit the “Fuzzy Search” tool, you’ll be able to jump from hit to hit within that one document. Why? Because nowhere in that document is the term “interrogatory” spelled correctly, so you have to rely on the list of Fuzzy Search terms that was used to conduct the search that found it.

 

You’ll also notice that after you run an OCRBase search from the OCRBase tab, clicking on the Column tab displays the retrieved records in the Column view. (In fairness, any time you run a search, even from the Case Explorer, the Column view will reflect the results of your search.) Now, if you’ve done any coding in the database, you can use Subset searching to narrow the results. Once you’ve done that, flip back to the OCRBase tab and you will see only those documents that (1) came back as a result of the original fuzzy search and then (2) were narrowed down when you used subset searching in the Column view. Cool, yes? While you can run a search in the Case Explorer and flip to the Column view to do subset searching, you can’t flip back to the Case Explorer and narrow the search results report as you can by flipping to the OCRBase tab.
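Conceptually, subset searching is just narrowing one result set by a second condition. Here is a minimal Python sketch of that idea; the record IDs and DOCTYPE values are hypothetical, and this is not how Summation stores or searches anything internally.

```python
# Hypothetical record IDs returned by a fuzzy OCR search.
ocr_hits = {"DOC001", "DOC004", "DOC007"}

# Hypothetical coding: record ID -> DOCTYPE value.
coding = {
    "DOC001": "Letter",
    "DOC004": "Pleading",
    "DOC007": "Pleading",
    "DOC009": "Pleading",  # coded as a pleading but not in the OCR results
}

def subset_search(current_results, coding, doctype):
    """Keep only records already in the results whose DOCTYPE matches."""
    return {doc_id for doc_id in current_results if coding.get(doc_id) == doctype}

narrowed = subset_search(ocr_hits, coding, "Pleading")
print(sorted(narrowed))
```

Notice that DOC009 never appears: a subset search only narrows what the original search returned, it doesn't add records.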

 

So here’s why I believe the Case Explorer view searching for OCRBase can lead to problems. The person asking me to explain my position said he can search the “entire database” from the Case Explorer. I’d bet that what he's doing is selecting in the Case Explorer “Core Database”, “OCRBase” and possibly other items like transcripts, transcript notes, etc. So humor me here. Go to the Case Explorer and de-select everything but OCRBase. You have both a “Search” tool and a “Fuzzy Search” tool on the toolbar, right? Now, if you’re using a version earlier than 2.9.x and you select Core Database as well as OCRBase, the “Fuzzy Search” tool grays out. That’s because the Core Database can’t be fuzzy-searched. But that also means that you’re not using fuzzy search on the OCRBase either, which I just demonstrated would be a problem. If you’re using 2.9.x, the darn Fuzzy Search tool doesn’t gray out, but if you click it, nothing happens. The developers just messed up the graying out of that button. If fuzzy search were actually available for that selection, you’d be presented with a list of alternate terms.

 

Now, I know what’s going to happen: you (or someone) will run a Quick Search with both the Core Database and OCRBase selected and get a single hit. What’s important to understand is that you didn’t get the hit because you were searching INTERROGATORY in the OCR. You got it because the term INTERROGATORY appears in the coding of the record. (See the DOCTYPE field for the record you retrieved.) That won’t always happen, especially if you’re relying primarily on OCR to get your search results.
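Here is a tiny Python sketch of why that single hit can be misleading. The record below is hypothetical (the field values and OCR text are invented), and the function is only an illustration of exact-term matching, not Summation's search engine.

```python
# Hypothetical record: coded DOCTYPE plus OCR text containing a misread term.
record = {
    "DOCTYPE": "INTERROGATORY",
    "ocr_text": "Plaintiff's first set of lnterrogatorv requests",
}

def hit_sources(term, record):
    """Report whether an exact-term hit came from the coding, the OCR text, or both."""
    needle = term.lower()
    sources = []
    coded_values = [str(v).lower() for k, v in record.items() if k != "ocr_text"]
    if any(needle in value for value in coded_values):
        sources.append("coding")
    if needle in record["ocr_text"].lower():
        sources.append("ocr")
    return sources

print(hit_sources("INTERROGATORY", record))
```

In this made-up record the exact term matches only the coding. Take the coding away and the same search returns nothing, even though the document plainly concerns interrogatories.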

 

The inclination from the Case Explorer view is to “search all of the database” meaning not just the OCR (the contents of the documents) but the Core Database (the coding) as well. When you do that, you lose the fuzzy search option, which I think can be dangerous when relying on searching the OCR.

 

Now, if you have perfect OCR with no misspelled words (either by the OCR software or by the author of the document), then you don’t need fuzzy searching. But if you have that, I’d sure like to get the name of the OCR vendor you use because I’ve never, ever seen a perfect set of OCR’d documents in all the years I’ve been doing this (more than I care to count).

 

Happy searching!

 

Knitted Brows
By Cheryl Evans - Evans Training & Consulting | June 02, 2009 at 07:00 PM EDT | No Comments

I've been teaching software for over 16 years and using it for much longer. Both teaching and using the software are now second nature to me. One of the reasons I don't like to conduct web-based training is that I can't see when a student isn't "getting it" or feels like a "deer in the headlights." When I'm in front of a group of students, I can usually spot when those students get into trouble or become overwhelmed and then step in to walk them through it.

Recently, I took up knitting. I'm not a complete klutz with my hands, having played the piano since I was a little girl and being pretty proficient with a keyboard. But I'm apparently a klutz with knitting needles. Believe me, I never expected to feel so awkward with a pair of knitting needles in my hands! I drop stitches without even realizing it. The whole process of knitting seems puzzling to me. It's hard for me to put 2 and 2 (or knit and purl) together in order to understand how to get myself out of a bind. In fact, the folks at the yarn shop have become accustomed to me walking in with a mess in my hands for them to help me fix.

I realize, as I slowly move towards basic proficiency in this art, that this feeling of inadequacy, of klutziness, is how many software students feel. This is even more of an issue with paralegals who have shifted into litigation from another specialty. Learning litigation support software when you're still learning litigation jargon? Well, it's like someone is trying to teach you in another language. What seems obvious to those of us who have been using the program for a long time is just the opposite to those who are new to the program. Even more puzzling for a new user of software is, yes, how to get out of a jam you've gotten into. Just like my knitting "messes."

This experience is a much welcomed wake-up call to me to be even more sensitive to how students feel being in a classroom filled with others who know the jargon, may know more about the program, or feel more comfortable with computers.

This is one of the reasons I'm glad I can offer on-line, as-needed support to my clients via GoToMeeting. If you get stuck, please don't spin your wheels (or, like I frequently do in knitting, give up and start over). Let's schedule a session to help you get over that hump.
 
I'd like to hear your feelings about the effectiveness of on-line training vs. classroom training. Please send me an email or write a comment here.
 
And now it's time to get back to a little knitting and purling...or was that purling and knitting?

Best wishes,
Cheryl

Welcome to ETC's Blog
By Cheryl Evans - Evans Training & Consulting | May 31, 2009 at 10:54 PM EDT | 1 comment

May 31, 2009


Thanks for dropping by to read (and hopefully contribute to) my blog. I'm hoping to use this space to convey information about new products, events, problem resolutions -- and just plain old random thoughts.

SUMMATION ISSUES -- SEE COMMENTS FOR AN UPDATE ON THESE PROBLEMS
TALLY PROBLEM: It's recently come to my attention that there may be a problem with Summation's useful Tally feature. If you use Tally (which is one of my favorite features) to convey information to your attorneys or to opposing counsel about your keyword search results, please drop me an email so that I can walk you through some of the issues. Summation is working hard to resolve this in an upcoming release (a hotfix, if we're lucky), and I'll report here on the results. This problem goes back at least as far as 2.7.2. Thanks to my client Connie for uncovering the issue.

LONG "TO" LISTS IN EMAILS: Another issue that has come to light is one you may have run into without realizing it, and it may not be limited to Summation. Summation has a field size limit of 32K (roughly 32,000 characters). When the body of an email exceeds this size, Summation puts the text of the email body into a file called EMBODY.TXT, which can then be searched via the eDocs view. Of course, when this is done, if you're used to searching emails via the Column view, you won't be picking up those emails that have exceeded the field limit. (TIP: Once emails are loaded into your system, conduct a search in Column view for DOCLINK CON *EMBODY.TXT. If your search returns records, you'll know those exceeded the body field limitation and will not be returned with hits when you search using the Column view. You'll have to use the eDocs view to locate hits within those emails.)
The issue that we discovered recently was that, particularly when dealing with email populations from very large corporations, the "TO" (or BCC or CC) fields might exceed that 32K limit if everyone in the corporation is an addressee. In those instances, the addressee list could be thousands of employees which could well exceed 32K. Summation is also working to resolve this issue. I'll report here on the progress.
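To get a feel for how quickly an addressee list can blow past that limit, here is a back-of-the-envelope Python sketch. The assumptions are mine: I'm reading 32K as 32 * 1024 characters, the addresses are made up, and the "; " separator is just a guess at how a list might be stored. It isn't a Summation utility.

```python
# Assumed reading of the 32K field limit, in characters.
FIELD_LIMIT = 32 * 1024

def exceeds_field_limit(addresses, separator="; "):
    """Return True if the joined addressee list is longer than the assumed limit."""
    joined = separator.join(addresses)
    return len(joined) > FIELD_LIMIT

# Made-up addressee lists: a typical email and a company-wide distribution.
small_list = [f"user{i}@example.com" for i in range(50)]
large_list = [f"user{i:05d}@example.com" for i in range(3000)]

print(exceeds_field_limit(small_list))
print(exceeds_field_limit(large_list))
```

With addresses of about 20 characters each, it only takes somewhere around 1,500 recipients to pass the assumed limit, which a company-wide distribution can easily do.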

Best wishes,
Cheryl


 Phone: 480.899.5588    Email: Info@evanstraining.com
© 2006-2009 by ETC Enterprises, Inc. dba Evans Training & Consulting.
All rights reserved.