
OCR (Optical Character Recognition) – a technical overview

Optical Character Recognition (OCR) is software that reads the text in an image and converts it into an editable text file. An OCR system comprises an optical scanner for capturing the text and specialized software for converting the image to text. This software eases the reading of complicated letters, books, and journals. OCR can read text in a wide variety of fonts. Unfortunately, it does not offer strong support for handwritten text. This article covers the technology behind OCR, its essential elements, the principles it is based on, and how to convert a scanned image to an editable file using OCR.

Technology Behind Optical Character Recognition

One of the best-known OCR products today is ABBYY FineReader. OCR generally works on three basic principles: integrity, purposefulness, and adaptability. It is easy to use and consists of three steps: scanning a document, recognizing it, and storing it in the right format (RTF, XLS, PDF, or TXT). How does OCR recognize text? It first breaks each document page into elements such as blocks of text, images, and tables. Lines are then partitioned into words, and words into characters. Once the characters have been singled out, the software compares them with a set of pattern images. It then advances diverse hypotheses about what each character is. Based on these hypotheses, the program analyzes different ways of subdividing lines into words and words into characters. After processing a large number of such probabilistic hypotheses, the software makes its decision and presents you with the recognized text. Modern OCR software such as ABBYY FineReader provides dictionary support for 45+ languages. This enables secondary analysis of text elements at the word level. It ensures more accurate analysis and recognition of documents and assists in extracting text from complex documents. These three basic principles give OCR software the reliability and intelligence it needs to approach human-like recognition.

Essential Element for OCR

The essential elements are scanning and recognition. Both elements involve various procedures.

Recognition: images captured with a scanner or a digital camera can be recognized by OCR and converted to text. A digital camera needs good lighting for its images to be recognized by OCR. Modern OCR software such as ABBYY FineReader has dependable recognition technology aimed at processing camera images. It is built to counter image distortion at the edges and so improve recognition.

Scanning: the software can scan to two types of output, PDF and Word. Scanning to PDF preserves layout accuracy: the scanned document retains its original appearance on screen, like a virtual photocopy. You can click on a single word or listen to the whole document. Scanning to Word is done for flexibility, giving you the power to edit the text and change its layout.

Converting Scanned Images to Editable File

Converting scanned images from the source to an editable file using OCR involves several steps.

Step 1:

Detect the direction of the text. A scanned image is never perfectly aligned, so you need to rotate it slightly until the text lines are 100% horizontal.
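One common way to estimate the rotation is a projection profile, sketched below in Python as an illustration rather than as the method any particular OCR product uses. It assumes the page is a list of rows of 0 (white) and 1 (black) pixels, approximates a small rotation by shifting each column vertically, and picks the candidate angle that concentrates the black pixels into the fewest rows. The function names and the list of candidate angles are hypothetical.

```python
import math


def row_profile_score(image, angle_deg):
    """Score how sharply black pixels line up in rows after undoing a skew.

    Each column x is shifted vertically by round(x * tan(angle)), which
    approximates a small rotation. The score is the sum of squared row
    counts, so it is highest when the text lines are horizontal.
    """
    height = len(image)
    width = len(image[0])
    slope = math.tan(math.radians(angle_deg))
    counts = [0] * height
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                new_y = y - round(x * slope)
                if 0 <= new_y < height:
                    counts[new_y] += 1
    return sum(c * c for c in counts)


def estimate_skew(image, candidates):
    """Return the candidate angle (in degrees) with the best profile score."""
    return max(candidates, key=lambda angle: row_profile_score(image, angle))
```

Calling `estimate_skew(image, [-5, 0, 5])` returns the candidate that best levels the text; a real system would try a finer range of angles and then rotate the image accordingly.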

Step 2:

Discover whether the text is set in a single column or a double column.

Step 3:

In every column, locate the “baseline” position of each consecutive text line. Double-column text needs to be rearranged into a single column, a “long ribbon of printed characters”. The format used is black and white.

Step 4:

Split this ribbon into individual characters by recognizing vertical stripes of white pixels. Each “token” is a rectangular mini-image of black and white pixels. Where two tokens are separated by more than the average white space, add a “space” token.
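The splitting described in this step can be sketched in a few lines of Python. This is a minimal illustration, assuming the ribbon is a list of equal-length rows of 0 (white) and 1 (black) pixels; the function name and the return format are hypothetical.

```python
def split_into_tokens(ribbon):
    """Split a binary text line into character tokens.

    Returns a list whose items are either (start_col, end_col) column
    ranges, with end_col exclusive, or the string "space".
    """
    width = len(ribbon[0])
    # A column is blank when every pixel in it is white.
    blank = [all(row[c] == 0 for row in ribbon) for c in range(width)]

    # Collect runs of consecutive non-blank columns.
    spans = []
    start = None
    for c in range(width):
        if not blank[c] and start is None:
            start = c
        elif blank[c] and start is not None:
            spans.append((start, c))
            start = None
    if start is not None:
        spans.append((start, width))

    if len(spans) < 2:
        return spans

    # Width of the white stripe between each pair of neighbouring tokens.
    gaps = [spans[i + 1][0] - spans[i][1] for i in range(len(spans) - 1)]
    average_gap = sum(gaps) / len(gaps)

    tokens = [spans[0]]
    for gap, span in zip(gaps, spans[1:]):
        if gap > average_gap:
            tokens.append("space")  # wider than average: a word boundary
        tokens.append(span)
    return tokens
```

Real text often has characters that touch or break apart, so production systems refine this simple column test, but the idea of cutting at white stripes is the same.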

Step 5:

Go through the tokens one by one, comparing each with the pixel templates of known characters (letters, numbers, etc.). Compute the distance between the token and each of the selected templates, and select the character with the shortest distance as the right one. Ideally, though, this step does not return a single character for each token but a set of probabilities.
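As a rough sketch of this comparison, the Python below measures the distance between a token and each template as the number of differing pixels (the Hamming distance) and turns the distances into probabilities. The choice of distance, the exponential weighting, and the `sharpness` parameter are assumptions made for illustration; the tiny 3x3 templates in the usage note are invented as well.

```python
import math


def hamming_distance(a, b):
    """Count the pixels that differ between two equal-sized binary images."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def match_token(token, templates, sharpness=1.0):
    """Return {character: probability} for one token.

    A smaller distance to a template gives a larger weight; the weights
    are normalized so that they sum to 1.
    """
    weights = {
        ch: math.exp(-sharpness * hamming_distance(token, tpl))
        for ch, tpl in templates.items()
    }
    total = sum(weights.values())
    return {ch: w / total for ch, w in weights.items()}
```

For example, with `templates = {"I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]], "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]]}` and a token that differs from the "I" template by a single pixel, `match_token` assigns most of the probability to "I" while still keeping "L" as a candidate for the next step.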

Step 6:

Combine these probabilities with those of a language model, which is always specific to a language, e.g. English. For example, after the preceding letters “who”, the next token may be equally likely to be the letter “I” or the digit “1” based on shape alone. However, the language model will lean toward the letter “I” rather than “1”. OCR that lacks this step typically produces many nonsensical, language-free words, whereas OCR with a language model can produce a good transcript even from a blurred image (for example, one captured with an out-of-focus camera, or a page printed on both sides where the text on the back shows through).
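The effect of the language model can be illustrated with a small Python sketch. It assumes two inputs that are not part of the original description: the shape-based probabilities from the previous step and a table of how likely each character is in the current context. Multiplying the two is one simple way to combine them, and all numbers in the usage note are made up.

```python
def pick_character(shape_probs, language_probs):
    """Choose the character with the best combined score.

    shape_probs: {character: probability from template matching}
    language_probs: {character: probability given the preceding text}
    """
    scores = {
        ch: p * language_probs.get(ch, 0.0)
        for ch, p in shape_probs.items()
    }
    if sum(scores.values()) == 0:
        # The language model knows none of the candidates: use shape only.
        return max(shape_probs, key=shape_probs.get)
    return max(scores, key=scores.get)
```

With `shape_probs = {"I": 0.5, "1": 0.5}` the shapes alone cannot decide, but a hypothetical `language_probs = {"I": 0.9, "1": 0.1}` tips the choice toward the letter.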

To convert your scanned images into an editable format, you can use our online OCR Conversion tool.

 

XML: what is it? History and Uses

There is a wide variety of file extensions, and new ones pop up every day. To an IT expert, the XML file extension is nothing new, but to a newcomer it might sound like an extension for malware. It is not malware; it is one of the most widely used file formats. XML stands for Extensible Markup Language: a markup language for structuring and annotating data so that it is understandable by both humans and machines. XML files are therefore used to describe, transport, structure, and store data. XML was designed for general use across the Internet. Dr. Charles Goldfarb, who was involved in the development of the markup languages that led to XML, says that XML is a kind of holy grail of computing because it solves the problem of universal data exchange between dissimilar systems.

Background

The history of XML goes back to around 1970, when three IBM employees, Charles Goldfarb, Ed Mosher, and Ray Lorie, introduced a new technique for structuring technical documents, called GML, in which tags were used to structure the data. The name consisted of the initials of their surnames: Goldfarb, Mosher, and Lorie. Later, Goldfarb introduced the term Generalized Markup Language to better reflect the use of GML. In 1986, ISO adopted a standard derived from it, SGML (Standard Generalized Markup Language). SGML is not an actual markup language; rather, it provides a specification for defining other languages. HTML, which specifies the tags for web pages, is an application of SGML.

Although HTML was very popular from the start, it had many problems that caused headaches, especially for web developers. Some aspects of presentation were left to the discretion of end users, whereas page designers wanted to fix their designs. The competition between Internet Explorer and Netscape caused the standards to diverge and created more problems for web developers. These problems and the lack of a standard were causing the original idea of web pages and web services to drift away, because the interpretation of web content was no longer uniform.

Because of these problems, it was widely accepted that HTML was too limited and SGML too complicated, so there was demand for something better. In the 1990s, a large group of well-known people in the industry collaborated, mainly by email and teleconference, and came up with XML. Like SGML, it is a specification for defining other markup languages. It was developed with the aims of general-purpose usability, stability, conciseness, formality, and a minimum of optional features. Its use on the Internet is only one of its applications. With XML, it is easy to specify and store any data so that it can be imported and processed by any application on any platform.

Uses

XML has a wide variety of uses. Some of them are as follows.

In general, XML provides a standardized way to store, access, and display data.

In web searches and other web tasks, XML makes it easier to get the desired results because it describes the kind of data stored in the file.

The most popular use of XML is in web development. With XML, it is easier to develop a wide variety of interactive pages that customers can customize. Data is stored once in XML and can then be presented to different users with different styles of viewing and processing.

XML is also becoming popular for data storage among enterprises. With XML, it is easy to exchange data across different platforms. Business processes are more connected on a global scale than ever. If data hubs use standardized XML, businesses can interact with each other and with customers with ease. Many industries have created systems for storing information in a standardized form using XML for better interaction within the industry. Finance, health care, the sciences, and the music industry are just a few that use XML for standard data storage. In the publishing industry, for example, XML is used across document publishers, and XML is also the basis for applications from both Microsoft and Google.
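The idea of storing data once and presenting it in different ways can be sketched with Python's standard `xml.etree.ElementTree` module. The catalog below, its element names, and the two view functions are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# A small, made-up XML document: the data is stored once.
CATALOG_XML = """<catalog>
  <book id="b1"><title>OCR Basics</title><price>12.50</price></book>
  <book id="b2"><title>XML in Practice</title><price>20.00</price></book>
</catalog>"""


def load_books(xml_text):
    """Parse the catalog into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": book.get("id"),
            "title": book.findtext("title"),
            "price": float(book.findtext("price")),
        }
        for book in root.findall("book")
    ]


def as_title_list(books):
    """One view of the data: a comma-separated list of titles."""
    return ", ".join(b["title"] for b in books)


def as_price_table(books):
    """Another view of the same data: one 'id<TAB>price' line per book."""
    return "\n".join(f"{b['id']}\t{b['price']:.2f}" for b in books)
```

Both views are produced from the same parsed document, so changing the XML once updates every presentation that reads from it.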

Layout

There are two ways to define the layout of data in an XML document: DTD (Document Type Definition) and XML Schema. DTD is inherited from SGML, whereas XML Schema is written using XML syntax. DTD has some limitations: it treats all data in a document as strings and cannot apply specific rules to specific data. That makes it unsuitable for some applications, which is why the XML Schema language was introduced.

You can use our online XML Converter to convert XML files into other formats or you can also convert your PDF files into XML using our PDF to XML Converter.

Portable Document Format (PDF); the Evolution of a Format and its History

Before PDF, it was very difficult to read data in various formats on different devices. Every operating system was based on different formats. To eliminate this problem, John Warnock and his team came together on a project named ‘Camelot’. The goal of this project was to create a uniform format that could be accessed on any operating system. The idea of a single format that worked on any system was conceived to advance business technology. Such a format meant that offices could go digital, with documents stored in it instead of on paper.

The first PDF; IPS

The format was first mentioned at the Seybold conference in San Jose in 1991. At the time it was named IPS, i.e. Interchange PostScript. It was officially announced at Comdex Fall in 1992 and received a Best of Comdex award. Acrobat, Adobe's set of tools for creating and viewing PDF files, was launched in 1993. Although the format already featured bookmarks and internal links, it was at first of no use to the publishing market because RGB was the only supported color space. The name of the original project was replaced by Carousel, the code name of the future Acrobat software. This name remained and was used as a file type on the Macintosh.

PDF 1.1

Acrobat 2 was introduced to the market in November 1994. It supported PDF 1.1, which added support for external links, article threads, security features, device-independent color, and notes. Acrobat 2 itself was also improved, with a new architecture for Acrobat Exchange to support plug-ins and a feature for searching PDF files. After its launch, this new and improved PDF format was promoted and popularized by Adobe itself as well as by the US government, which distributed its forms and papers digitally as PDF files. Adobe started shipping its product in 1995. At the same time, it also introduced PDF support in many of its other products, such as FrameMaker 5.0.

PDF 1.2; time of the Press Market

Acrobat 3, along with PDF 1.2, was launched in 1996. It included many features for the press community and greatly enhanced the Adobe software, such as support for OPI 1.3 and the CMYK color space and the preservation of spot colors. At this time, the internet was also growing in popularity. Adobe saw this as an opportunity to take the market and released a plug-in for viewing PDF files in the Netscape browser. Meanwhile, throughout 1997 and 1998, Acrobat 3 was enhanced with many extensions for the prepress community, such as the plug-ins PitStop and CheckUp from Enfocus Software and CrackerJack from Lantana.

PDF/X-1 and PDF 1.3

Based on PDF 1.2, a reliable standard format known as PDF/X-1 was launched by the prepress community in 1998. It was designed to support blind transfers, with fonts and high-resolution images embedded in the file.

In 1999, Adobe introduced PDF 1.3 along with Acrobat 4. They were designed to keep up with modern prepress technology and the requirements of the market. PDF 1.3 included the OPI 2.0 specification, a new color space known as DeviceN, and annotations. Page size support was improved, along with integration with Microsoft Office. This made it easier to work with Adobe Acrobat, and the software came to be seen as user-friendly rather than confusing.

PDF 1.4; an illustrative partner

In mid-2000, Adobe for the first time released a PDF version, 1.4, with Illustrator 9 rather than Acrobat. Illustrator 9 came with the unique feature of supporting PDF 1.4 and transparency, although the full specification of PDF 1.4 was not revealed at the time.

PDF 1.4 was properly released with Acrobat 5 in 2001, and its full specification was published. Along with transparency support, the new version brought improved file security and printing quality. JavaScript support was also added. The company also took a new step toward the advancement of digital documents: ‘Tagged PDF’ was introduced in this version, which means that detailed information describing the structure of the document can be part of the PDF.

PDF of the future

In 2003, PDF 1.5 was launched with improved features such as image compression, layer support, and enhanced tagged PDF.

Two years later, in 2005, Adobe released PDF 1.6, which had new features such as PDF container files, file embedding support, and 3D embedding support, as well as enhancements to older features.

PDF 1.7 was an improvement over the previous version, with a few new features. One milestone for PDF 1.7 was becoming an ISO standard, which happened in 2008.

Since PDF had now become an ISO standard, Adobe could not release a new version of PDF on its own. Thus, it stuck with Acrobat and its extensions, which acted as improvements to PDF 1.7. However, ISO is planning to release PDF 2.0 with minor adjustments to PDF 1.7 and some new features of its own.

At FreeFileConvert, we support conversion of PDF files into various formats, so you can always use our online converter if there’s a need to change the format of your PDF files.

How to choose a data compression format

If you work on a computer and are used to managing big chunks of data, the most difficult choice to make when thinking about how to share or send them is picking the right data compression format. It may sound like an exaggeration to inexperienced ears, but the compression format you use can have a powerful impact on the performance of your work.

So it is time to compress some files: what format should you use? There are many criteria you could take into account. Compression ratio is one of them, but not the only one, as many tend to believe. The ease with which you can use one format rather than another is very important too. Downloading and learning to use third-party software can be tiring and irritating, so in this guide we will analyze the most common data compression formats: ZIP, RAR, and 7z. Hopefully, after reading this, you will know which one to pick, considering that each of us has different needs when it comes to compressing data.

ZIP
The Zip archiving format is most likely the most common and mainstream one that you will find on a Microsoft Windows system. It has many good features, and in recent years its developers have introduced several interesting improvements to the format, such as big recovery records that can rebuild accidentally missing data, strong and safe encryption, and a better compression ratio. However, these features are not what has kept Zip so popular. Two other factors did that.

  • Zip compression is undoubtedly quite fast. If you have a massive amount of data to compress, you will probably end up choosing Zip over other options because it is faster, and the fact that it does not provide the best compression will probably not affect you at all.
  • Also, Zip support is almost universal. No matter the operating system of your computer, whether it is Linux or Windows, Zip is the ideal choice when you have to send big data via email. You never know what operating system the other person has, but Zip will solve this issue.

RAR
RAR is another archive format that is quite common. It was introduced by WinRAR for Windows platforms, and it can be used on Linux too, even if only as an extractor. It is probably the strongest alternative to the Zip format, since it also has very good encryption, an even better compression ratio, and error recovery capabilities. For these reasons, RAR is very popular among those who need a way to distribute files across the web.

7z
Last but not least, there is 7z. It is modern and open source, sporting the highest compression ratio out there, and when put to work against its fellow archive formats Zip and RAR, it proves to be better in many cases. Because achieving such a high compression ratio takes time, it is not the best option if you need a quick job, unless the computer you are using has a modern multi-core CPU.
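To get a feel for the trade-off between speed and compression ratio on your own data, a small Python experiment like the one below may help. It is only a rough sketch: it uses the standard `zlib` module (DEFLATE, the method most commonly used inside Zip files) and the standard `lzma` module (the LZMA family that 7z uses by default, although this module writes .xz streams rather than .7z archives). RAR is left out because Python's standard library has no RAR compressor. The function name is hypothetical.

```python
import lzma
import time
import zlib


def compare_codecs(data):
    """Compress `data` with DEFLATE and LZMA and report ratio and time.

    Returns {name: {"ratio": compressed/original, "seconds": elapsed}}.
    A lower ratio means better compression.
    """
    codecs = (
        ("deflate", lambda d: zlib.compress(d, 6), zlib.decompress),
        ("lzma", lzma.compress, lzma.decompress),
    )
    results = {}
    for name, compress, decompress in codecs:
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Make sure the round trip gives back the original bytes.
        assert decompress(packed) == data
        results[name] = {
            "ratio": len(packed) / len(data),
            "seconds": elapsed,
        }
    return results
```

Running `compare_codecs(open("some_file", "rb").read())` on a few typical files of your own (the file name here is a placeholder) gives a more honest answer than any general rule, since both the ratio and the timing depend heavily on the kind of data.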

In a nutshell
If you are not an expert in this field, this entire article might have sounded slightly difficult. However, going through our arguments, one thing is clear: it depends on what needs you have, what you hope to achieve by compressing your files, and how much time you can dedicate to waiting for your big files to be compressed. Hopefully this guide has helped you navigate this jungle of formats.

Our Archive Converter understands a lot of compression formats, so if you can’t open an archive or compressed file on your computer because you don’t have the appropriate software, you can always use it to convert your files to the popular zip format.