PDF and Its Structure
PDF stands for Portable Document Format. It is an open standard for document exchange. A PDF file contains both text and binary data. When a PDF file is viewed using a text editor, one can see only the raw objects which form the contents and structure of the PDF file.
A PDF file is structured in a hierarchical manner. This structure defines the order in which a PDF viewer application reads the contents and draws them on the screen.
To better understand the structure of a PDF file, we need to consider it in four parts: objects, file structure, document structure, and content streams. The following paragraphs look at each of these parts in turn.
A PDF file is composed of a small set of basic types of data objects. Together, these basic objects form the data structure of a PDF document. This part of the syntax also covers the character set used to write the objects and the other syntactic elements. The basic types define both the properties of the objects and the syntax used to write them.
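To make the idea of basic object types concrete, here is a minimal sketch that writes Python values in PDF object syntax. It assumes the basic types listed in the PDF specification (null, booleans, numbers, literal strings, names, arrays, and dictionaries) and leaves out streams and indirect references. The `Name` class and the `to_pdf_syntax` function are hypothetical helpers invented for this illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """Hypothetical wrapper that marks a string as a PDF name such as /Type."""
    value: str


def to_pdf_syntax(obj) -> str:
    """Serialize a Python value into (simplified) PDF object syntax."""
    if obj is None:
        return "null"
    if isinstance(obj, bool):  # test bool first: bool is a subclass of int
        return "true" if obj else "false"
    if isinstance(obj, int):
        return str(obj)
    if isinstance(obj, float):
        # PDF has no exponent notation, so use fixed-point and trim zeros.
        return format(obj, "f").rstrip("0").rstrip(".")
    if isinstance(obj, Name):
        return "/" + obj.value
    if isinstance(obj, str):
        # Literal strings are parenthesized; escape backslashes and parentheses.
        escaped = obj.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(obj, list):
        return "[" + " ".join(to_pdf_syntax(item) for item in obj) + "]"
    if isinstance(obj, dict):
        entries = " ".join(
            f"{to_pdf_syntax(key)} {to_pdf_syntax(value)}" for key, value in obj.items()
        )
        return "<< " + entries + " >>"
    raise TypeError(f"unsupported type: {type(obj).__name__}")


page = {Name("Type"): Name("Page"), Name("MediaBox"): [0, 0, 612, 792]}
print(to_pdf_syntax(page))  # << /Type /Page /MediaBox [0 0 612 792] >>
```

The output is the kind of raw text one sees when opening a PDF file in a text editor: a dictionary whose keys are names and whose values are other basic objects.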
The second part is the file structure. It defines how the basic objects are stored in the PDF file and how they are later accessed or updated. The file structure is independent of the semantics of the objects; it is responsible only for organizing and updating them.
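The text above does not name the sections of a file, so the following sketch assumes the conventional layout from the PDF specification: a header, a body of numbered objects, a cross-reference table of byte offsets, and a trailer. The helper names `build_minimal_pdf` and `read_startxref` are hypothetical, and the resulting file has no pages; it only illustrates how objects are stored and located.

```python
def build_minimal_pdf(objects: list) -> bytes:
    """Assemble header, body, cross-reference table and trailer.

    `objects` holds the raw bytes of each object; object 1 is assumed
    to be the document catalog.
    """
    out = bytearray(b"%PDF-1.7\n")                 # header
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))                   # byte offset of this object
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)    # cross-reference table
    out += b"0000000000 65535 f \n"                # entry 0: head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset        # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


def read_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    position = data.rindex(b"startxref")
    return int(data[position + len(b"startxref"):].split()[0])


pdf = build_minimal_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
])
offset = read_startxref(pdf)
print(pdf[offset:offset + 4])  # b'xref'
```

Note that neither function looks inside the objects. This mirrors the point above: the file structure only records where objects live, not what they mean.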
The document structure describes how the basic objects are grouped together to form the various components of the PDF file, such as pages, annotations, and form fields. In other words, this part describes the semantics of the components of the PDF file.
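As an illustration of how basic objects are grouped into components, the sketch below models a tiny document as plain Python dictionaries and walks from the catalog through the page tree to the individual pages. The catalog and page tree arrangement follows the PDF specification; the object table, the use of integers in place of indirect references, and the function names are assumptions made for this example.

```python
# Hypothetical in-memory object table: object number -> dictionary.
# Integers under "Pages", "Kids" and "Parent" stand in for indirect references.
objects = {
    1: {"Type": "Catalog", "Pages": 2},
    2: {"Type": "Pages", "Kids": [3, 4], "Count": 3},
    3: {"Type": "Page", "Parent": 2},
    4: {"Type": "Pages", "Kids": [5, 6], "Parent": 2, "Count": 2},
    5: {"Type": "Page", "Parent": 4},
    6: {"Type": "Page", "Parent": 4},
}


def collect_pages(table: dict, node_number: int) -> list:
    """Return the object numbers of all pages below a node, in document order."""
    node = table[node_number]
    if node["Type"] == "Page":
        return [node_number]          # leaf node: a single page
    pages = []
    for kid in node["Kids"]:          # intermediate node: recurse into children
        pages.extend(collect_pages(table, kid))
    return pages


def page_numbers(table: dict, catalog_number: int = 1) -> list:
    """Start at the catalog and follow its /Pages entry into the page tree."""
    return collect_pages(table, table[catalog_number]["Pages"])


print(page_numbers(objects))  # [3, 5, 6]
```

The same dictionaries could be stored in any order in the file. It is the /Type, /Pages, and /Kids entries that give them their roles in the document.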
A content stream is a sequence of instructions that describes the appearance of a page or other graphical entity. These instructions are also represented as objects, but they are conceptually distinct from the objects that represent the document structure.
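Content stream instructions are written in postfix form: the operands come first and the operator follows. The sketch below groups the tokens of a small text-drawing stream into (operator, operands) pairs. It is a deliberately simplified tokenizer that assumes only names, numbers, literal strings without nested parentheses, and alphabetic operators; arrays, hexadecimal strings, and inline images are not handled. The function name `parse_content_stream` is hypothetical.

```python
import re

# Simplified token grammar; alternatives are tried in the order listed.
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)        # literal string without nested parentheses
  | /[^\s/()\[\]<>]*            # name, e.g. /F1
  | [+-]?(?:\d+\.?\d*|\.\d+)    # integer or real number
  | [A-Za-z*]+                  # operator keyword, e.g. Tf or T*
""", re.VERBOSE)


def parse_content_stream(stream: str) -> list:
    """Group tokens into (operator, operands) pairs in stream order."""
    instructions = []
    operands = []
    for match in TOKEN.finditer(stream):
        token = match.group()
        if token[0] in "(/+-." or token[0].isdigit():
            operands.append(token)                  # collect until an operator appears
        else:
            instructions.append((token, operands))  # operator consumes the operands
            operands = []
    return instructions


stream = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
for operator, operands in parse_content_stream(stream):
    print(operator, operands)
# BT []
# Tf ['/F1', '12']
# Td ['72', '720']
# Tj ['(Hello)']
# ET []
```

Here BT and ET begin and end a text object, Tf selects a font and size, Td moves the text position, and Tj shows a string. Each operand is an ordinary basic object, while the operators are specific to content streams.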