Linking and loading

November 1, 2011

Compiled programs are prepared for running by a combination of software called the linker and the loader. The boundary between the two is blurry, and Unix systems traditionally have a single program, called ld, doing both.

Why are we talking about linking-loaders at all? Recalling the discussion of what an operating system is, the linking-loader would be part of the development tools. These tools depend heavily on the operating system, and must have knowledge of the operating system's internals, but they are perhaps not core elements of an operating system. Most operating systems texts don’t cover the topic of linking-loaders.

However, Windows NT’s loader is part of the operating system. The loader must make considerable use of the virtual memory system. Too narrow a definition of the operating system might allow for better theoretical discussion, but is simply not useful in practical terms. For better or worse, under operating systems I include the utilities of the “greater operating system” as well as the core development tools.

According to Levine’s book, Linkers and Loaders, the combined linker-loader has three functions:

  • Loading: the actual copying of the program and its preinitialized data segments into memory;
  • Relocation: making address translations so that the code or data occupy particular locations in memory;
  • Symbol resolution: the coordination and connection between locations and symbols among separately compiled code segments.

As the name suggests, loading is tightly associated with loaders. Symbol resolution is always associated with linkers. Relocation can variously be considered a linker’s job, or a loader’s job. As will be plain by the end of this lecture, this is all very arbitrary. Reality is very messy when it comes to linker-loaders.

Loading: The copying of a program and its data into memory has a few more subtasks than one might assume. In the simplest of operating systems, this can be a straight copy, almost a mirror image, of a file into physical memory. However, in any operating system of consequence, virtual memory must be set up to receive the various segments of the program, and initialization tasks must be handled. The source for the load might come from multiple files, such as loadable libraries, or from various sections of a single file.

That description used some of the standard terminology of linking-loaders. In the layout of virtual memory, various spans have distinct purposes. These intervals are called segments. In Intel architectures, there are segment registers on the CPU to keep track of segments.

Files that contain loadable images have sections, which are more numerous and more varied than segments, and are more like the appendices of a book. Sections and segments correspond to one another: a loadable image file will have code sections, areas of the file that contain the text that should be associated with, or copied into, the code segments set aside in virtual memory.

To load code, a range of virtual memory is set aside and given its properties by a data element that Microsoft calls the Virtual Address Descriptor, or VAD. In Linux, the range is called either a memory region or a memory area, and it is described by a vm_area_struct structure. The VAD might state that an interval, say from 0x8000 to 0xafff in virtual addresses, is read-only (text) and maps to a certain set of bytes in a certain file: the loadable source for the text. The VAD might also set aside a range of virtual memory for data, mark that range as read-write, and initialize the data in that range in accordance with a data section in a loadable source. Unlike text, however, the association of the virtual address range with the file section holds only at start-up. Text sections remain associated with the file throughout the run of the program.
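As a rough illustration of what such a descriptor records, here is a minimal Python sketch. The `Region` class, its field names, the file name `prog.bin`, and the file offsets are all hypothetical; they are not taken from the Windows or Linux structures, which hold considerably more. The text range 0x8000 to 0xafff is the one from the paragraph above, and marking it "r-x" (readable and executable, not writable) is an assumption about what "read-only (text)" would mean in practice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for one VAD / memory-region descriptor."""
    start: int                          # first virtual address in the range
    end: int                            # last virtual address (inclusive)
    perms: str                          # "r-x" for text, "rw-" for data
    backing_file: Optional[str] = None  # loadable source, if any
    file_offset: int = 0                # where the section starts in the file
    init_only: bool = False             # True: file supplies initial contents only

    def contains(self, addr: int) -> bool:
        return self.start <= addr <= self.end


def find_region(regions, addr):
    """Return the region covering addr, or None if the address is unmapped."""
    for region in regions:
        if region.contains(addr):
            return region
    return None


# Text stays tied to the file for the whole run; data only at start-up.
address_space = [
    Region(0x8000, 0xAFFF, "r-x", "prog.bin", 0x1000),
    Region(0xB000, 0xBFFF, "rw-", "prog.bin", 0x4000, init_only=True),
]

text_region = find_region(address_space, 0x8123)
data_region = find_region(address_space, 0xB010)
unmapped = find_region(address_space, 0xC000)
```

Looking up an address that no region covers returns `None` here; in a real system that case would be reported as a fault.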

Relocation: A compiler works on one piece of the overall program at a time. Each piece is compiled against assumed addresses, for both its code and its data. When pieces of code are brought together, it is likely that some of them must be moved: the assumed starting address for the code in two pieces will be the same, and in the end only one piece of code can occupy that address.
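The collision can be made concrete with a small Python sketch. The piece names, sizes, and the assumed starting address 0x80000 are invented for illustration, and the layout policy (pack the pieces end to end, with no alignment or padding) is a simplifying assumption rather than what any particular linker does.

```python
ASSUMED_START = 0x80000  # hypothetical address every piece was compiled for


def lay_out(pieces, base):
    """Give each (name, size) piece its own non-overlapping start address."""
    placement = {}
    addr = base
    for name, size in pieces:
        placement[name] = addr
        addr += size  # the next piece begins where this one ends
    return placement


pieces = [("main.o", 0x400), ("util.o", 0x250)]
placement = lay_out(pieces, ASSUMED_START)

# How far each piece moved from the address it assumed at compile time.
deltas = {name: start - ASSUMED_START for name, start in placement.items()}
```

With these numbers the first piece stays where it assumed it would be, and the second moves up by the size of the first; that displacement is what every address inside the moved piece has to be adjusted by.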

Relocation maps out how the memory will be arranged and translates code segments into non-overlapping intervals. The linker makes this possible by providing reloc records: for each and every address written into the code, whether the target of a code jump or of a data load or store, there is a record of where that address sits, with additional information on the kind of address written. These records are also called fixups, and the act of going through the collection of reloc records and modifying each address referred to, so as to account for the translation of the segment, is also called a fixup.

For instance, a program might be compiled so that some while loop ends with a jump to location 0x80800. The jump itself is at location 0x80880. A reloc pointing to location 0x80880 will be placed in the linker output file. If the program must be moved upwards by 0x10000 bytes, the reloc will cause the jump instruction to be rewritten as a jump to 0x90800; the jump itself will appear in memory at location 0x90880.
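The same arithmetic can be written out as a short Python sketch. It is a simplified model under stated assumptions: the segment's assumed base of 0x80000 and its size are invented, the reloc is taken to point directly at a 4-byte little-endian address field, and real instruction encodings and reloc formats are more involved.

```python
import struct

ASSUMED_BASE = 0x80000  # hypothetical base address the code was compiled for
DELTA = 0x10000         # the program is moved upwards by this many bytes

# The segment's bytes, indexed by offset from its base.
image = bytearray(0x1000)

# The jump at 0x80880 holds its target, 0x80800, as a stored address.
struct.pack_into("<I", image, 0x80880 - ASSUMED_BASE, 0x80800)

# One reloc record: the (assumed) location of a field that holds an address.
relocs = [0x80880]


def apply_fixups(image, relocs, assumed_base, delta):
    """Add delta to every address field that a reloc record points at."""
    for where in relocs:
        offset = where - assumed_base
        (target,) = struct.unpack_from("<I", image, offset)
        struct.pack_into("<I", image, offset, target + delta)


apply_fixups(image, relocs, ASSUMED_BASE, DELTA)

jump_offset = 0x80880 - ASSUMED_BASE
new_location = ASSUMED_BASE + DELTA + jump_offset
(new_target,) = struct.unpack_from("<I", image, jump_offset)
print(hex(new_location), hex(new_target))  # prints: 0x90880 0x90800
```

Note that the bytes of the image are not shifted within the buffer; only the stored address changes. The jump's new location follows from the segment's new base.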

Symbol resolution: Separately compiled pieces of code make reference to each other. Certainly, functions written in one file are called by code written in another file. At compile time, the compiler must know the signature of the called function: the return type and the number and type of arguments. Knowing these things, the compiler can emit code for the call, and handle the return values, without knowing the exact code of the called function.

In order for the code to run, however, the call must eventually be made to the address of the called function. After the code is assembled and relocated, the call instruction must be fixed up as well: it is rewritten so that the address field, left at zero by the compiler, now holds the proper address of the called function.

Additional reloc records point to the location of the call instruction, and each such record is annotated with the symbol name. These are unresolved references, and the linker endeavors to resolve them. Code providing symbols has a section of its linker output file listing all the external symbols the file exports. The record for an exported symbol gives the symbol’s name and the unrelocated address of the symbol in the text section (or in the data section, for an exported variable).
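A two-pass sketch in Python shows how exported symbols and unresolved references might be matched up. The dictionary layout, the file names, the addresses, and the function name `resolve` are all hypothetical; actual object file formats store this information in their own symbol and relocation tables.

```python
def resolve(objects, deltas):
    """Match unresolved references against exported symbols.

    objects: name -> {"exports": {symbol: unrelocated address},
                      "refs": [(unrelocated call-site address, symbol)]}
    deltas:  name -> how far that piece was moved during relocation
    """
    # Pass 1: collect every exported symbol at its relocated address.
    symtab = {}
    for name, obj in objects.items():
        for symbol, addr in obj["exports"].items():
            if symbol in symtab:
                raise ValueError(f"duplicate symbol: {symbol}")
            symtab[symbol] = addr + deltas[name]

    # Pass 2: for each call site, look up the address it should call.
    patches = {}
    for name, obj in objects.items():
        for site, symbol in obj["refs"]:
            if symbol not in symtab:
                raise KeyError(f"unresolved reference: {symbol}")
            patches[site + deltas[name]] = symtab[symbol]
    return symtab, patches


objects = {
    "main.o": {"exports": {"main": 0x80000}, "refs": [(0x80010, "helper")]},
    "util.o": {"exports": {"helper": 0x80020}, "refs": []},
}
deltas = {"main.o": 0x0, "util.o": 0x400}

symtab, patches = resolve(objects, deltas)
```

Here `patches` maps each relocated call site to the address that should replace the zero the compiler left behind. A symbol that no file exports raises an error, which corresponds to the linker reporting an unresolved reference.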

posted in CSC521 by admin
