Bret W. Lester

Collections and "initial-letter Hacks"

I recently released v2.6 of WOL. The major feature in that release was Collections (like folders that you can put your saved content into).

Collections are a first step toward a more comprehensive feature set. I don’t feel comfortable disclosing the details at the moment for a number of reasons, principal among them being how much the plans are likely to change.

Anyway, along with Collections, the JavaScript text-extraction engine behind WOL has received some improvements and bug fixes.

The main improvement to the text extractor is better support for articles that begin with a big upper-case initial letter (a drop cap).

There is a CSS property for this purpose called “initial-letter”, but browser support is currently lacking, so publishers must resort to clever approaches to achieve the desired aesthetic. Many go as far as using an image in place of the first letter.

The problem with clever approaches is that they are confusing to AI. Specifically, WebOutLoud’s text extraction was having a hard time identifying paragraphs containing initial-letter hacks as actual paragraphs.

Fortunately, as is often the case with such things, there are patterns in how the hacks are implemented across a multitude of publishers, and those patterns provide a hook for overcoming them with a general-purpose algorithm.
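To give a rough idea of what matching such patterns can look like, here is a minimal sketch. It is not WOL's actual code. The node shape (plain objects with `tag`, `className`, `alt`, and `children`), the class-name regex, and the function names `initialLetterOf`, `textOf`, and `extractParagraphText` are all assumptions made up for illustration. It covers only two hacks: an image standing in for the first letter, and a styled wrapper around a single letter.

```javascript
// Hypothetical sketch: none of these names come from WOL's real engine.
// A node is either a string (text) or a plain object:
//   { tag, className, alt, children }

// Assumed class names that publishers might use for a drop cap.
const DROP_CAP_CLASS = /drop-?cap|initial|first-?letter|big-?letter/i;

// Concatenates all text found under a node.
function textOf(node) {
  if (typeof node === "string") return node;
  return (node.children || []).map(textOf).join("");
}

// Returns the single letter a node stands in for, or null if the node
// does not look like an initial-letter hack.
function initialLetterOf(node) {
  if (!node || typeof node !== "object") return null;

  // Pattern 1: an image replaces the first letter; the letter is in alt.
  if (node.tag === "img") {
    const alt = (node.alt || "").trim();
    return /^\p{L}$/u.test(alt) ? alt : null;
  }

  // Pattern 2: a styled inline wrapper around exactly one letter.
  if (node.tag === "span" && DROP_CAP_CLASS.test(node.className || "")) {
    const text = textOf(node).trim();
    return /^\p{L}$/u.test(text) ? text : null;
  }

  return null;
}

// Rebuilds a paragraph's text, folding a leading hack back into the
// first word so the paragraph reads as ordinary prose.
function extractParagraphText(paragraph) {
  const children = paragraph.children || [];

  // Skip whitespace-only text nodes before the first real child.
  let i = 0;
  while (
    i < children.length &&
    typeof children[i] === "string" &&
    children[i].trim() === ""
  ) {
    i++;
  }

  const letter = i < children.length ? initialLetterOf(children[i]) : null;
  const raw =
    letter === null
      ? textOf(paragraph)
      : letter + children.slice(i + 1).map(textOf).join("");

  // Collapse runs of whitespace left over from the markup.
  return raw.replace(/\s+/g, " ").trim();
}
```

Given a paragraph whose first child is an image with a one-letter `alt`, `extractParagraphText` returns the paragraph with that letter restored at the front. A real extractor would need to handle many more variations than these two.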

To make a long story short, WOL v2.6 does a better job of extracting paragraphs containing an initial-letter hack, which in most cases should prevent the horrible user experience of listening to an article with its first paragraph missing.

§

Listen to documents and web articles like this one using lifelike text-to-speech. Try WebOutLoud free.
