PDF (Portable Document Format)

Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images. Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images, and other information needed to display it.

PDF has its roots in “The Camelot Project”, initiated by Adobe co-founder John Warnock in 1991. It was standardized as ISO 32000 in 2008. The latest edition, ISO 32000-2:2020, was published in December 2020. PDF files may contain a variety of content besides flat text and graphics, including logical structuring elements, interactive elements such as annotations and form fields, layers, rich media (including video content), three-dimensional objects using U3D or PRC, and various other data formats.

History

Adobe Systems made the PDF specification available free of charge in 1993. In the early years, PDF was popular mainly in desktop publishing workflows and competed with several other formats, including DjVu, Envoy, Common Ground Digital Paper, Farallon Replica, and even Adobe’s own PostScript format.

PDF was a proprietary format controlled by Adobe until it was released as an open standard on July 1, 2008, and published by the International Organization for Standardization as ISO 32000-1:2008. Control of the specification then passed to an ISO committee of industry experts. In 2008, Adobe published a Public Patent License to ISO 32000-1, granting royalty-free rights to all Adobe-owned patents necessary to make, use, sell, and distribute PDF-compliant implementations.

PDF 1.7, the sixth edition of the PDF specification that became ISO 32000-1, includes some proprietary technologies defined only by Adobe, such as Adobe XML Forms Architecture (XFA) and the JavaScript extension for Acrobat, which are referenced by ISO 32000-1 as normative and indispensable for the full implementation of the ISO 32000-1 specification. These proprietary technologies are not standardized, and their specifications are published only on Adobe’s website. Many of them are not supported by popular third-party implementations of PDF.

ISO published ISO 32000-2 in 2017, available for purchase, replacing the free specification provided by Adobe. In December 2020, the second edition of PDF 2.0, ISO 32000-2:2020, was published, with clarifications, corrections, and critical updates to normative references (ISO 32000-2 does not include any proprietary technologies as normative references). The PDF Association made ISO 32000-2 available for free download in April 2023.

Technical details

A PDF file is often a combination of vector graphics, text, and bitmap graphics. The basic types of content in a PDF are: typeset text stored as content streams (i.e., not encoded in plain text); vector graphics for illustrations and designs that consist of shapes and lines; raster graphics for photographs and other types of images; and other multimedia objects.

In later revisions, a PDF document can also support links (inside the document or to a web page), forms, JavaScript (initially available as a plug-in for Acrobat 3.0), or any other type of embedded content that can be handled using plug-ins. PDF combines three technologies:

– A declarative equivalent of a subset of the PostScript page description programming language, used to generate the layout and graphics.

– A font-embedding and font-replacement system that allows fonts to travel with the documents.

– A structured storage system that bundles these elements and any associated content into a single file, with data compression where appropriate.

PostScript language

PostScript is a page description language run in an interpreter to generate an image. It can handle graphics and has standard features of programming languages such as branching and looping.

PDF is a subset of PostScript, simplified to remove such control flow features, while graphics commands remain. PostScript was originally designed for a drastically different use case: the transmission of one-way linear print jobs in which the PostScript interpreter would collect a series of commands until it encountered the showpage command, then execute all the commands to render a page to a printing device.

PostScript was not intended for long-term storage and real-time interactive rendering of electronic documents, so there was no need to support scrolling back to previous pages. Thus, any given page in a PostScript file could be rendered only as the cumulative result of executing all preceding commands to draw all previous pages (any of which could affect subsequent pages) plus the commands to draw that particular page, and there was no easy way to bypass that process to skip around to different pages.

Traditionally, to go from PostScript to PDF, a source PostScript file (that is, an executable program) is used as the basis for generating PostScript-like PDF code. This is done by applying standard compiler techniques such as loop unrolling, inlining, and removing unused branches, resulting in code that is purely declarative and static. The result is then packaged into a container format, together with all necessary dependencies for correct rendering (external files, graphics, or fonts to which the document refers), and compressed.

As a document format, PDF has several advantages over PostScript. It contains only static declarative PostScript code that can be processed as data and does not require a full program interpreter or compiler. This avoids the complexity and security risks of an engine with such a high level of complexity. Like Display PostScript, PDF has supported transparent graphics since version 1.4, while standard PostScript does not.

PDF enforces the rule that the code for any particular page cannot affect any other pages. That rule is strongly recommended for PostScript code too, but there it has to be implemented explicitly (see, e.g., the Document Structuring Conventions), as PostScript is a full programming language that allows for greater flexibility and is not limited to the concepts of pages and documents. All data required for rendering is included within the file itself, improving portability.

In summary, compared to PostScript, PDF offers lower complexity and fewer security risks, support for transparent graphics, encapsulation of all rendering data within the file itself, and improved portability. Its disadvantages are a loss of flexibility, limitation to a single use case, and sometimes larger file sizes, although compression mitigates the latter. PDF 1.6 and later support the embedding of interactive 3D documents: 3D drawings can be embedded using U3D or PRC and various other data formats.

File format

A PDF file is organized using ASCII characters, except for certain elements that may have binary content. The file starts with a header containing a magic number (as a readable string) and the version of the format, for example %PDF-1.7. The format is a subset of the COS (“Carousel” Object Structure) format. A COS tree file consists primarily of objects, of which there are nine types:

– Boolean values, representing true or false

– Real numbers

– Integers

– Strings, enclosed within parentheses ((…)) or represented as hexadecimal within single angle brackets (<…>). Strings may contain 8-bit characters.

– Names, starting with a forward slash (/)

– Arrays, ordered collections of objects enclosed within square brackets ([…])

– Dictionaries, collections of objects indexed by names, enclosed within double angle brackets (<<…>>)

– Streams, usually containing large amounts of optionally compressed binary data, preceded by a dictionary and enclosed between the stream and endstream keywords

– The null object

Comments, which may contain 8-bit characters, are introduced with the percent sign (%).
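The object syntax described above can be illustrated with a small serializer. This is only a sketch: the `Name` class and the `to_pdf` function are hypothetical helpers invented for this example, streams are left out, and the string escaping covers only backslashes and parentheses.

```python
class Name(str):
    """Hypothetical marker type: a Python string to be written as a PDF name."""


def to_pdf(value):
    """Serialize a Python value into simplified PDF object syntax."""
    if value is None:
        return "null"
    if isinstance(value, bool):          # bool must be tested before int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):         # PDF reals have no exponent notation
        text = f"{value:.4f}".rstrip("0")
        return text + "0" if text.endswith(".") else text
    if isinstance(value, Name):          # Name must be tested before str
        return "/" + value
    if isinstance(value, str):           # literal string in parentheses
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, bytes):         # hexadecimal string in angle brackets
        return "<" + value.hex().upper() + ">"
    if isinstance(value, list):
        return "[" + " ".join(to_pdf(item) for item in value) + "]"
    if isinstance(value, dict):          # keys are written as names
        inner = " ".join(f"/{key} {to_pdf(item)}" for key, item in value.items())
        return "<< " + inner + " >>"
    raise TypeError(f"unsupported type: {type(value).__name__}")


page = {"Type": Name("Page"), "MediaBox": [0, 0, 612, 792]}
print(to_pdf(page))
```

The dictionary keys here are plain Python strings written out as names, which keeps the sketch short but skips the escaping that unusual name characters would need.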

Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number and, if they reside in the document root, are defined between the obj and endobj keywords. Beginning with PDF version 1.5, indirect objects (except other streams) may also be located in special streams known as object streams (marked /Type /ObjStm). This technique allows standard stream filters to be applied to non-stream objects, reduces the size of files that have large numbers of small indirect objects, and is especially useful for Tagged PDF. Object streams do not support specifying an object’s generation number (other than 0).
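As a rough illustration of how indirect objects appear in a file, the sketch below scans a byte string for "object number, generation number, obj … endobj" definitions. The function name `find_indirect_objects` is made up for this example, and the regular expression is deliberately naive: it would be confused by stream data that happens to contain the endobj keyword, and it ignores object streams.

```python
import re

# Matches "<object number> <generation number> obj ... endobj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def find_indirect_objects(data: bytes) -> dict:
    """Map (object number, generation number) to the raw object body."""
    return {
        (int(number), int(generation)): body.strip()
        for number, generation, body in OBJ_RE.findall(data)
    }


sample = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
)
objects = find_indirect_objects(sample)
print(sorted(objects))
```

In the sample, `2 0 R` inside the first object is an indirect reference to the object defined as `2 0 obj`.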

An index table, also called the cross-reference table, is located near the end of the file and gives the byte offset of each indirect object from the start of the file. This design allows efficient random access to the objects in the file and also allows small changes to be made without rewriting the entire file (incremental update). Before PDF version 1.5, the table always used a special ASCII format, was marked with the xref keyword, and followed the main body of indirect objects.

Version 1.5 introduced optional cross-reference streams, which have the form of a standard stream object, possibly with filters applied. Such a stream may be used instead of the ASCII cross-reference table and contains the offsets and other information in binary format. The format is flexible in that it allows the integer widths to be specified (using the /W array), so that, for example, a document smaller than 64 KiB may dedicate only 2 bytes to object offsets.
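To make the role of the /W array more concrete, the following sketch splits already-decoded cross-reference stream data into entries. The function name `decode_xref_entries` is an assumption of this example; the code assumes any stream filter has already been removed and reads each field as a big-endian integer.

```python
def decode_xref_entries(data: bytes, widths):
    """Split decoded cross-reference stream bytes into (type, field2, field3) tuples.

    `widths` is the /W array, e.g. [1, 2, 1]: the number of bytes used
    for each of the three fields of every entry.
    """
    entry_length = sum(widths)
    entries = []
    for start in range(0, len(data) - entry_length + 1, entry_length):
        fields, position = [], start
        for width in widths:
            fields.append(int.from_bytes(data[position:position + width], "big"))
            position += width
        if widths[0] == 0:
            fields[0] = 1   # an absent type field defaults to type 1 (in use)
        entries.append(tuple(fields))
    return entries


# Three entries with /W [1 2 1]: one free entry and two in-use objects.
raw = bytes([0, 0x00, 0x00, 255,
             1, 0x00, 0x0F, 0,
             1, 0x01, 0x2C, 0])
print(decode_xref_entries(raw, [1, 2, 1]))
```

With a 2-byte second field, the largest offset that can be stored is 65535, which is why this width only suits files smaller than 64 KiB.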

A PDF file’s footer includes the following information:

– The startxref keyword, which is followed by an offset to the cross-reference stream object or the cross-reference table’s start (starting with the xref keyword).

– The end-of-file marker (%%EOF).

If a cross-reference stream is not being used, the footer is preceded by the trailer keyword followed by a dictionary containing information that would otherwise be in the cross-reference stream object’s dictionary, such as a reference to the root object of the tree structure (/Root), the number of indirect objects in the cross-reference table (/Size), and other optional information.
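The overall layout (header, indirect objects, cross-reference table, trailer, footer) can be sketched by assembling a tiny file by hand and then locating its cross-reference table the way a reader would, starting from the end. Both function names are invented for this example, and the three objects form only a bare skeleton of a one-page document; this is not a complete or validated PDF writer.

```python
def build_minimal_pdf() -> bytes:
    """Assemble header, body, classic xref table, trailer and footer."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))                  # byte offset of this object
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"               # entry for object 0 (free)
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset       # each entry is 20 bytes long
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)


def read_startxref(data: bytes) -> int:
    """Return the offset that follows the last startxref keyword."""
    tail = data[-1024:]
    index = tail.rfind(b"startxref")
    if index < 0:
        raise ValueError("startxref keyword not found")
    return int(tail[index + len(b"startxref"):].split()[0])


pdf = build_minimal_pdf()
position = read_startxref(pdf)
print(pdf[position:position + 4])
```

Reading from the end like this is what allows a viewer to find any object without scanning the whole file.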

Using a stack-based methodology akin to PostScript, one or more content streams specify the text, vectors, and images produced within each page.
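The stack-based style of a content stream can be sketched with a toy reader: operands are collected until an operator keyword appears, and the operator then consumes them. The tokenizer below is a simplification assumed for this example (no arrays, hexadecimal strings, nested parentheses, or inline images), and `run_content_stream` is a hypothetical name.

```python
import re

TOKEN_RE = re.compile(
    r"\((?:\\.|[^\\()])*\)"      # literal string without nested parentheses
    r"|/[^\s/()\[\]<>]+"         # name, such as /F1
    r"|[+-]?\d*\.?\d+"           # integer or real number
    r"|[A-Za-z*]+"               # operator keyword, such as Tf or T*
)
NUMBER_RE = re.compile(r"[+-]?\d*\.?\d+")


def run_content_stream(text):
    """Return a list of (operator, operands) pairs in execution order."""
    operands, calls = [], []
    for token in TOKEN_RE.findall(text):
        if token.startswith("("):
            operands.append(token[1:-1])          # string operand
        elif token.startswith("/"):
            operands.append(token)                # name operand
        elif NUMBER_RE.fullmatch(token):
            operands.append(float(token))         # numeric operand
        else:
            calls.append((token, operands))       # operator takes the operands
            operands = []
    return calls


stream = "BT /F1 12 Tf 72 720 Td (Hello World) Tj ET"
for operator, args in run_content_stream(stream):
    print(operator, args)
```

In this sample, Tf selects a font and size, Td moves the text position, and Tj shows a string, all between the BT and ET operators that delimit a text object.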

There are two layouts for PDF files: non-linearized (not “optimized”) and linearized (“optimized”). Non-linearized PDF files can be smaller than their linearized counterparts, but they are slower to access because the data needed to assemble each page is scattered throughout the file. Linearized PDF files (also called “optimized” or “web optimized” files) are structured so that they can be viewed in a web browser plug-in without waiting for the entire file to download, since all objects required to display the first page are arranged at the beginning of the file. PDF files can be linearized using Adobe Acrobat or QPDF.

Imaging model

With the noteworthy exception of transparency, added in version 1.4, the basic architecture of PDF’s graphics representation is similar to that of PostScript.

A device-independent Cartesian coordinate system is used by PDF graphics to represent the surfaces of pages. A matrix can be used in a PDF page description to skew, rotate, or scale graphical objects. One of the most essential ideas in PDF is the graphics state, which comprises a set of graphical properties that can be changed, stored, and retrieved inside a page description. Version 2.0 of PDF has twenty-five graphics state attributes, such as:

– The current transformation matrix (CTM), which determines the coordinate system

– The clipping path

– The color space

– The alpha constant, which is a key component of transparency

– Black point compensation control (introduced in version 2.0)
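The effect of a transformation matrix can be shown with a few lines of arithmetic. PDF writes a matrix as six numbers [a b c d e f]; the sketch below assumes the usual row-vector convention, in which a point (x, y) maps to (a·x + c·y + e, b·x + d·y + f). The helper names are invented for this example.

```python
def multiply(m, n):
    """Return the product m × n of two matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def apply(m, x, y):
    """Transform the point (x, y) by the matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


translate = (1, 0, 0, 1, 100, 50)   # move the origin to (100, 50)
scale = (2, 0, 0, 2, 0, 0)          # double both axes

# A matrix concatenated later is applied to points first, so scaling after
# translating gives scale × translate as the combined matrix.
ctm = multiply(scale, translate)
print(apply(ctm, 1, 1))
```

The point (1, 1) is first doubled to (2, 2) and then shifted by the translation, giving (102, 52).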

Vector Graphics

As in PostScript, vector graphics in PDF are constructed with paths. Paths are usually composed of lines and cubic Bézier curves, but they can also be constructed from the outlines of text. Unlike PostScript, PDF does not allow a single path to mix text outlines with lines and curves. Paths can be stroked, filled, filled and then stroked, or used for clipping. Strokes and fills can use any color set in the graphics state, including patterns. PDF supports several types of patterns, such as tiling patterns and shading patterns.
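A cubic Bézier segment is defined by a start point, two control points, and an end point. As a brief sketch (the function name is an assumption of this example), a point on the curve for a parameter t between 0 and 1 can be evaluated with the standard Bernstein form:

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bézier curve at parameter t (0 <= t <= 1)."""
    u = 1.0 - t
    return tuple(
        u**3 * a + 3 * u * u * t * b + 3 * u * t * t * c + t**3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


start, control1, control2, end = (0, 0), (0, 1), (1, 1), (1, 0)
print(cubic_bezier(start, control1, control2, end, 0.5))
```

The curve passes through the start and end points, while the two control points only pull it toward them.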

Raster Images

Raster images in PDF (called Image XObjects) are represented by dictionaries with an associated stream. The dictionary describes the properties of the image, and the stream contains the image data. Images are frequently filtered to reduce their size. PDF supports several filters, including the general-purpose ASCII85Decode and FlateDecode and the image-specific DCTDecode and JPXDecode.
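Two of the general-purpose filters can be approximated with the Python standard library: FlateDecode corresponds to zlib compression, and ASCII85Decode to Ascii85 text encoding terminated by "~>". The sketch below round-trips some data through the filter chain [/ASCII85Decode /FlateDecode]; the function names are invented here, and predictors and other filter parameters are ignored.

```python
import base64
import zlib


def encode_stream(raw: bytes) -> bytes:
    """Compress with zlib, then wrap in Ascii85 with the ~> end marker."""
    return base64.a85encode(zlib.compress(raw)) + b"~>"


def decode_stream(encoded: bytes) -> bytes:
    """Undo the filters in the order listed: ASCII85Decode, then FlateDecode."""
    body = encoded.strip()
    if body.endswith(b"~>"):
        body = body[:-2]                      # drop the end-of-data marker
    return zlib.decompress(base64.a85decode(body))


raw = b"BT /F1 12 Tf (Hello) Tj ET " * 20
encoded = encode_stream(raw)
print(len(raw), len(encoded))
```

A reader applies the filters in the order they are listed, so a writer has to apply the corresponding encodings in the reverse order, as `encode_stream` does.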

Text

Text in PDF is represented by text elements in page content streams. A text element specifies that characters should be drawn at certain positions, and the characters are specified using the encoding of a selected font resource. A font object in PDF is a description of a digital typeface; it may either describe the characteristics of a typeface or include an embedded font file. Fourteen typefaces, known as the standard 14 fonts, have a special significance in PDF documents. The encoding mechanisms in PDF were designed for Type 1 fonts, and the rules for applying them to TrueType fonts are complex.

Transparency

The original imaging model of PDF was opaque, similar to PostScript: each object drawn on the page completely replaced anything previously marked in the same location. In version 1.4, the imaging model was extended to support transparency, so that newly marked objects interact with previously marked ones to produce blending effects. The transparency extensions are based on the concepts of transparency groups, blending modes, shape, and alpha, and they are closely aligned with the features of Adobe Illustrator version 9.

The concept of a transparency group in the PDF specification is independent of the existing notions of “group” or “layer” in applications such as Adobe Illustrator. Those groupings reflect logical relationships among objects that are meaningful when editing those objects, but they are not part of the imaging model.

Logical structure and accessibility

A tagged PDF includes document structure and semantics information that enables reliable text extraction and accessibility, and it allows page content to be extracted and reused for other purposes. Tagged PDF is optional and is not required for PDF files intended only for print. As of 2021, support for tagged PDF among consuming devices, including assistive technology, is uneven. ISO 32000-2, however, includes an improved discussion of tagged PDF, which is expected to facilitate further adoption. PDF/UA, an ISO-standardized subset of PDF targeted at accessibility, was first published in 2012.

PDF version 1.5 introduced optional content groups (OCGs), also called layers, which are sections of content in a PDF document that can be selectively viewed or hidden. This capability is useful in various contexts, including multi-language documents, CAD drawings, layered artwork, and maps. Structurally, an Optional Content Properties Dictionary is added to the document root; it contains an array of OCGs together with the display status of each group.

A PDF file may be encrypted for security, and it may be digitally signed. PDF 2.0 defines 256-bit AES encryption as the standard for PDF 2.0 files. Digital signatures, which provide secure authentication, are described in ISO 32000-2. PDF files may also contain embedded DRM restrictions that limit printing, editing, and copying. However, these restrictions depend on the reader software to obey them, so the security they provide is limited. The standard security handler uses two passwords: a user password, which encrypts the file and prevents opening, and an owner password, which specifies operations that should be restricted even after the document has been decrypted.
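The restricted operations are recorded as bit flags in an integer stored with the encryption settings (the /P entry of the encryption dictionary). As a hedged sketch, the following decodes four of the commonly cited flags, assuming the bit positions used by the standard security handler (bits are counted from 1 at the low-order end); the dictionary and function names are invented for this example.

```python
# Assumed bit positions (1-based, low-order first) for four permissions.
PERMISSION_BITS = {"print": 3, "modify": 4, "copy": 5, "annotate": 6}


def decode_permissions(p: int) -> dict:
    """Report which operations a permissions integer allows."""
    return {
        name: bool(p & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


# The value is a signed 32-bit integer, so it is often negative.
print(decode_permissions(-44))
```

Because the flags are only advisory to the viewer, a tool that ignores them can still perform the "restricted" operations once the file is decrypted.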

Beginning with version 1.5, usage rights (UR) signatures are used to enable additional interactive features that are not available by default in a particular PDF viewer. These include saving modified forms, importing and exporting form data, and submitting form data, among others. PAdES (PDF Advanced Electronic Signatures) is a set of restrictions and extensions to PDF that makes it suitable for advanced electronic signatures.

PDF files can contain file attachments, which processors may access, open, and save to a local filesystem. PDF files can also contain two types of metadata: the Document Information Dictionary, a set of key/value fields such as author, title, and creation date, and Metadata Streams, which use the Extensible Metadata Platform (XMP) to add XML-based extensible metadata. A PDF file can also specify display settings, such as page layout and zoom level, which affect how PDF viewers such as Adobe Reader display the document.

Multimedia

A rich media PDF is a file that contains interactive content, either embedded in it or accessible through links. This interactive content could include buttons, audio, video, or graphics. For example, an interactive PDF could serve as an online catalog for an e-commerce company: products could be shown on the PDF pages along with images, links to the website, and buttons that let customers place orders directly from the document. Rich media PDFs are useful for product presentations, training, and other multimedia-rich content.

Forms

Interactive Forms is a mechanism for adding forms to PDF files. PDF currently supports two different methods for integrating data and forms, both of which coexist in the PDF specification:

1. AcroForms (also known as Acrobat forms): introduced in the version 1.2 format specification and included in all later PDF specifications. AcroForms support elements such as text boxes and radio buttons, as well as code such as JavaScript. In addition to the standard PDF action types, interactive forms (AcroForms) support submitting, resetting, and importing data. For example, the “submit” action transmits the names and values of selected interactive form fields to a specified uniform resource locator (URL). Interactive form field names and values may be submitted in different formats, depending on the settings of the action’s ExportFormat, SubmitPDF, and XFDF flags.

2. XML Forms Architecture (XFA) forms: introduced in the version 1.5 format specification. XFA forms are incompatible with AcroForms, and XFA was deprecated from PDF with version 2.0.

HTML Form format

Form data can be submitted in HTML Form format. HTML 2.0 has been supported since PDF version 1.2, and support for the HTML 4.01 Specification was added in version 1.5.

Forms Data Format (FDF)

The Forms Data Format (FDF) is based on PDF: it uses the same syntax and has essentially the same file structure, but it is much simpler than PDF, since the body of an FDF document consists of only one required object. FDF was first defined in 1996 and has been part of the PDF specification since version 1.2, up to and including ISO 32000-2:2017.

FDF can be used to submit form data to a server, receive the response, and incorporate it into the interactive form. It can also be used to export form data to stand-alone files that can be imported back into the corresponding PDF interactive form.
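To give a feel for how small an FDF file can be, the sketch below writes one from a dictionary of field names and values. It is an illustration under simplifying assumptions rather than a complete writer: `build_fdf` and `escape` are hypothetical helpers, only flat text fields are handled, and no cross-reference table is emitted.

```python
def escape(text: str) -> str:
    """Escape the characters that are special inside a literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_fdf(fields: dict) -> str:
    """Return a minimal FDF document holding the given field values."""
    entries = " ".join(
        f"<< /T ({escape(name)}) /V ({escape(value)}) >>"
        for name, value in fields.items()
    )
    return (
        "%FDF-1.2\n"
        "1 0 obj\n"
        f"<< /FDF << /Fields [ {entries} ] >> >>\n"   # the single required object
        "endobj\n"
        "trailer\n"
        "<< /Root 1 0 R >>\n"
        "%%EOF\n"
    )


fdf = build_fdf({"name": "Ada", "note": "a (test)"})
print(fdf)
```

Each field is a small dictionary in which /T holds the field name and /V its value, using the same object syntax as PDF itself.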

XML Forms Data Format (XFDF)

The XML Forms Data Format (XFDF) is an external specification that has been supported since version 1.5, when it replaced the “XML” form submission format defined in version 1.4. XFDF is the XML version of the Forms Data Format (FDF), but it implements only a subset of FDF containing forms and annotations. Some entries in the FDF dictionary (Status, Encoding, JavaScript, Pages keys, EmbeddedFDFs, Differences, and Target) do not have XFDF equivalents. In addition, unlike FDF, XFDF does not allow the spawning or addition of new pages based on the given data.

The XFDF specification is referenced (but not included) in the version 1.5 specification and later editions; it is described separately in the XML Forms Data Format Specification. XFDF conforms to the XML standard and can be used in the same way as FDF: form data is submitted to a server, modified, and then sent back so that the new form data can be imported into an interactive form. XFDF can also be used to export form data to stand-alone files that can be imported back into the corresponding PDF interactive form.
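Because XFDF is plain XML, a simple form-data file can be produced with a generic XML library. The sketch below is a minimal example that assumes the commonly used XFDF namespace and the fields/field/value element layout; `build_xfdf` is a hypothetical helper, and annotations and nested fields are not covered.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"   # namespace assumed for XFDF documents


def build_xfdf(fields: dict) -> str:
    """Return an XFDF document with one <field> element per form field."""
    ET.register_namespace("", XFDF_NS)            # emit it as the default namespace
    root = ET.Element(f"{{{XFDF_NS}}}xfdf")
    container = ET.SubElement(root, f"{{{XFDF_NS}}}fields")
    for name, value in fields.items():
        field = ET.SubElement(container, f"{{{XFDF_NS}}}field", {"name": name})
        ET.SubElement(field, f"{{{XFDF_NS}}}value").text = value
    return ET.tostring(root, encoding="unicode")


xfdf = build_xfdf({"city": "Paris"})
print(xfdf)
```

The same name and value pairs that FDF stores in /T and /V entries appear here as the name attribute and the value element.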

As of August 2019, ISO 19444-1:2019 – Document management — XML Forms Data Format — Part 1: Use of ISO 32000-2 (XFDF 3.0) is the ISO/IEC standard for XFDF 3.0. This standard acts as ISO 32000-2’s normative reference.

PDF

Since version 1.4, the entire document can be submitted, rather than only individual form fields and their values.

AcroForms can keep form field values in external stand-alone files containing key-value pairs. These external files may use the Forms Data Format (FDF) or the XML Forms Data Format (XFDF).

Usage rights (UR) signatures define rights to import form data files in FDF, XFDF, and text (CSV/TSV) formats, and to export form data files in FDF and XFDF formats.

With PDF 1.5, Adobe Systems introduced a proprietary format for forms, Adobe XML Forms Architecture (XFA).

XFA forms are not compatible with the AcroForms feature defined in ISO 32000, and most PDF processors do not handle XFA content. The XFA specification was referenced from ISO 32000-1/PDF 1.7 as an external proprietary specification and was entirely deprecated from PDF with ISO 32000-2 (PDF 2.0).

Modifications to the content

In November 2019, researchers from Ruhr University Bochum and Hackmanit GmbH published attacks on digitally signed PDFs. They showed how to change the visible content of a signed PDF without invalidating the signature. By abusing implementation flaws, the attack worked against 21 out of 22 desktop PDF viewers and 6 out of 8 online validation services.

At the same conference, they also showed how to exfiltrate the plaintext of encrypted content in PDFs.

In 2021, they presented new attacks on PDFs, dubbed “shadow attacks”, which abuse the flexibility of features provided in the PDF specification.

Jens Müller presented an overview of security issues in PDFs, covering denial of service, information disclosure, data manipulation, and arbitrary code execution attacks.

PDF files can be infected with viruses, Trojan horses, and other malware. They may contain hidden JavaScript code that exploits vulnerabilities in PDF readers, hidden objects that are executed when the file is opened, or, less commonly, code that launches malware directly.

PDF attachments carrying viruses were first discovered in 2001. The virus, named OUTLOOK.PDFWorm or Peachy, used Microsoft Outlook to send itself as an attached Adobe PDF file. It was activated with Adobe Acrobat but not with Acrobat Reader.

New vulnerabilities are regularly found in different versions of Adobe Reader, prompting the company to issue security fixes. Other PDF readers are also susceptible. An aggravating factor is that a PDF reader can be configured to start automatically whenever a web page has an embedded PDF file, which provides a vector for attack. If a malicious web page contains an infected PDF file that takes advantage of a flaw in the PDF reader, the system may be compromised even if the browser is secure.

Some of these vulnerabilities result from the PDF standard allowing documents to be scripted with JavaScript. Disabling JavaScript execution in the PDF reader can help mitigate such exploits, although it does not protect against flaws in other parts of the reader software. Security experts say that JavaScript is not essential for a PDF reader and that the security benefit of disabling it outweighs any compatibility issues caused.

One way to avoid exploits in PDF files is to have a local or web service convert the files to another format before viewing them. On March 30, 2010, security researcher Didier Stevens reported an Adobe Reader and Foxit Reader exploit that runs a malicious executable if the user allows it to launch when asked.

Printing

Raster image processors (RIPs) are used to convert PDF files into a raster format suitable for imaging onto paper and other media in printers, digital production presses, and prepress. This process is known as rasterization. RIPs capable of processing PDF files natively include the Adobe PDF Print Engine from Adobe Systems, and Jaws and the Harlequin RIP from Global Graphics.

In 1993, the Jaws raster image processor from Global Graphics became the first shipping prepress RIP to interpret PDF natively without conversion to another format. In 1997, the company released an upgrade to its Harlequin RIP with the same capability.

Agfa-Gevaert introduced Apogee, the first prepress workflow system based on PDF, in 1997. This was a significant turning point for the use of PDF in the printing industry. Many commercial offset printers now accept press-ready PDF files as the principal print source, with a preference for the PDF/X-1a subset and its variants. Submitting press-ready PDF files removes the need to collect native working files, which was frequently troublesome and time-consuming.

In 2006, PDF was widely accepted as the standard print job format at the Open Source Development Labs Printing Summit. It is supported as a print job format by the Common Unix Printing System, and desktop application projects such as GNOME, KDE, Firefox, Thunderbird, LibreOffice, and OpenOffice have switched to emitting print jobs in PDF.

Additionally, some desktop printers support direct PDF printing: they can interpret PDF data without external help, which streamlines the printing process for consumers.