Like a Black Cat Fading into the Night: A Guide to Textual Search Engine Tricks

Files and emails generally offer clear, readable text. But some tricks can turn what should be readable words into something more like a black cat melting into the night, explains Elizabeth Thede, director of sales at dtSearch Corp.

Most files and emails – Word, Excel, Access, PowerPoint, OneNote, PDF, Outlook/Exchange, etc. – come in easily readable text when you view them in their native apps. But some tricks can make the text less like bright words on the page and more like a black cat disappearing into the dark. This article details several of these text tricks and shows how an enterprise search engine like dtSearch® (as opposed to an Internet search engine like Google) can bring them to light.

Before we enlighten our chat noir with text tips, it’s essential to understand how everything fits together under normal circumstances. Let’s say you’re sifting through thousands or even millions of files and emails to see if they contain one or more of the following terms: costume, candy, or cauldron. If you have unlimited time, you can individually retrieve each file in its native application and analyze the text. We will call this the eyeball method. You can also deploy an enterprise search engine to instantly search terabytes.

An enterprise search engine can search terabytes instantly once it has indexed the data. The index itself is just an internal tool that records each unique word and number in the data, along with its location. Indexing is not difficult for the end user; just tell the search engine which folders to cover, and the search engine’s indexer takes care of the rest.
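To make the idea of an index concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not how dtSearch or any particular product stores its index; the names build_index and search, and the sample file names, are hypothetical.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")  # words and numbers


def build_index(docs):
    """Map each unique token to the (document, word position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, match in enumerate(TOKEN.finditer(text.lower())):
            index[match.group()].append((doc_id, position))
    return index


def search(index, term):
    """Return the sorted names of documents containing the term (case-insensitive)."""
    return sorted({doc_id for doc_id, _ in index.get(term.lower(), [])})


# Hypothetical sample data standing in for indexed folders.
docs = {
    "a.txt": "Candy in the cauldron",
    "b.txt": "A costume and candy",
}
index = build_index(docs)
```

Once the index exists, a lookup such as search(index, "candy") only consults the index and never rereads the documents, which is why indexed search stays fast as the data grows.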

When building its index, the search engine approaches each file in its binary format, avoiding the need to retrieve each in the native application. Looking directly at a file in binary format, it can be difficult to make out a single word in the sea of binary codes. To process the text inside, the search engine must first determine the applicable parsing specification. Parsing specifications can span hundreds of pages and vary widely between file formats. It is therefore essential to match the right specification to the right file type.

In addition to its much faster speed, indexed search can go far beyond the eyeball method in the types of searches it allows. In fact, indexed search makes available more than 25 different search functions, including phrase, Boolean, and proximity searches: (candy corn or black cat) and (cauldron with 12 witch costumes) and not Christmas. For multilingual text, indexed search supports any of the hundreds of international languages that Unicode supports. Beyond word searches, a search engine can also search for specific numbers and numeric ranges, as well as dates and date ranges, covering different date formats like 10/31/22 vs. October 31 2022. A search engine can even flag any credit card numbers in the data.
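As an illustration of the last point, one common way to flag likely credit card numbers is to pick out runs of 13 to 19 digits and keep those that pass the Luhn checksum. The sketch below assumes that approach; the article does not say how any particular search engine does it, and the names CANDIDATE, luhn_valid, and find_card_numbers are hypothetical.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text):
    """Return digit strings in text that look like card numbers and pass Luhn."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

For example, the widely used test number 4111 1111 1111 1111 passes the checksum, while an arbitrary run such as 1234 5678 9012 3456 does not, so only the first would be flagged.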

Finally, unlike the eyeball method, indexed search allows simultaneous multi-user queries across a network or online. Run from an “on-premises” web server or from the cloud, such as on Azure or AWS, an online search can run stateless, with no built-in limit on the number of concurrent search threads. Concurrent searching can continue even when an index is updated to reflect files that have been added, deleted, or modified since the last index build, so there is no “downtime”.

Now on to our black cat tricks.

Tip #1: Black text on black background, white text on white background

Talk about a black cat melting into the night. This type of text is very difficult to spot when viewing a file in its native application. However, since a search engine crawls files in their binary formats, black-on-black text and the like are on par with any other text for a search engine.
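A small sketch shows why styling does not hide text from a parser. It uses HTML as a stand-in for office formats, since Python's standard library can parse it directly; the class TextExtractor and the function extract_text are hypothetical names, and a real indexer would handle many more formats.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect all character data, ignoring colors and other styling."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# The second paragraph is black on black, invisible in a browser.
page = (
    "<p>Visible note</p>"
    '<p style="color:#000;background:#000">hidden cauldron</p>'
)
```

Here extract_text(page) returns both the visible and the black-on-black paragraph, because the extractor reads the underlying text rather than the rendered page.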

Tip #2: Deeply Buried Metadata

A native app view of files can hide some metadata, so it can take a lot of clicks to even know it’s there. But all metadata is fully apparent in the binary format view of a file that a search engine sees in indexing, and therefore fully searchable.

Tip #3: Multi-Level Nested File Structure

Sometimes the files don’t show up as standalone items. You can have a Word document with an Excel spreadsheet buried inside, and the same duo can be part of a larger nested structure, like an email with a ZIP or RAR attachment. When viewing an embedded file in the native application of the external file, sometimes only a fraction of the embedded file is visible by default. But when a search engine browses files in binary format, it sees everything. Additionally, a search engine may even allow you to individually copy a file from a larger ZIP or RAR archive or an individual email from a larger email archive.
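The nesting described above can be sketched with Python's standard zipfile module, which handles ZIP only (RAR and email archives would need other tools). The helper names make_zip and walk_archive are hypothetical, and this is a simplified illustration rather than any product's method.

```python
import io
import zipfile


def make_zip(files):
    """Build an in-memory ZIP from a {name: bytes} mapping (for the demo only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()


def walk_archive(data, prefix=""):
    """Yield (path, bytes) for every file, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            payload = zf.read(info)
            path = prefix + info.filename
            if zipfile.is_zipfile(io.BytesIO(payload)):
                # Nested archive: recurse instead of treating it as opaque.
                yield from walk_archive(payload, path + "/")
            else:
                yield path, payload


inner = make_zip({"budget.txt": b"candy"})
outer = make_zip({"note.txt": b"cauldron", "inner.zip": inner})
```

Because .docx and .xlsx files are themselves ZIP containers, the same walk would also descend into them and expose their internal parts, which is one reason an indexer can see content that the outer file's native application shows only partially.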

Tip #4: Files with incompatible extensions

It’s all too easy to save an Access database with an Excel .XLSX extension or a Word file with a .PDF extension. However, a search engine can directly access the binary format to determine the correct file type, bypassing the file extension entirely.
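One common way to determine a file type from its binary content is to check the leading "magic" bytes. The sketch below assumes that technique and covers only a handful of well-known signatures; the names SIGNATURES, EXPECTED_EXTENSIONS, sniff_type, and extension_mismatch are hypothetical, and a production indexer would rely on far more detailed format checks.

```python
import os

# (offset, magic bytes, label) for a few well-known formats.
SIGNATURES = [
    (0, b"%PDF-", "pdf"),
    (0, b"PK\x03\x04", "zip-container"),  # .zip, and also .docx/.xlsx/.pptx
    (0, b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc/.xls/.ppt
    (4, b"Standard Jet DB", "access-mdb"),
    (4, b"Standard ACE DB", "access-accdb"),
]

EXPECTED_EXTENSIONS = {
    "pdf": {".pdf"},
    "zip-container": {".zip", ".docx", ".xlsx", ".pptx"},
    "ole2": {".doc", ".xls", ".ppt", ".msg"},
    "access-mdb": {".mdb"},
    "access-accdb": {".accdb"},
}


def sniff_type(data):
    """Guess the container type from leading bytes, ignoring the file name."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return "unknown"


def extension_mismatch(filename, data):
    """Return True when the content is recognized but the extension does not fit."""
    detected = sniff_type(data)
    ext = os.path.splitext(filename)[1].lower()
    return detected != "unknown" and ext not in EXPECTED_EXTENSIONS[detected]
```

Note that ZIP-based Office formats share one signature, so telling a .docx from an .xlsx requires looking inside the container; this sketch stops at the container level.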

Tip #5: PDF containing only images

Sometimes a PDF can have what looks like regular text. But then when you try to copy and paste it, you get no text. (This example is the opposite of the others, in that the text might be clearly visible in a PDF viewer like Adobe Reader, but not otherwise accessible.) However, a search engine can flag image-only PDFs when it builds its index, so you know you have to run them through an OCR program like Adobe Acrobat to turn them into “searchable image” PDFs.

Tip #6: Typos

Diving into that treat bag can lead to sticky fingers, which leads to even more typos than usual. Fuzzy search, adjustable from 1 to 10, makes it possible to find a word even if it is misspelled. So if Halloween is mistakenly typed as Hallomeen in an email, a search engine can still find it in a search for Halloween with a low level of fuzziness.
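A rough way to picture fuzzy matching is edit distance: the number of single-character insertions, deletions, or substitutions needed to turn one word into another. The sketch below uses that measure as a stand-in; the parameter max_edits is hypothetical and is not the 1-to-10 fuzziness scale mentioned above, whose exact definition the article does not give.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def fuzzy_find(query, words, max_edits=1):
    """Return the words within max_edits of the query, ignoring case."""
    q = query.lower()
    return [w for w in words if edit_distance(q, w.lower()) <= max_edits]
```

With one allowed edit, a search for Halloween would match Hallomeen (a single substituted letter) but not Christmas.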

Keep an eye out for this black cat and happy Halloween!

