18.01.2020
Queues in Data Structures (PDF)

As is well known, the popularization of parallel computers, such as multi-core systems, has posed the challenge of parallelizing programs, and the data structures supporting them, to take advantage of that processing power. It is also known that lock-based data structures can severely limit the amount of parallelism in a program. Lock-free programming techniques are known to deliver increased parallelism and scalability in scenarios of high contention and sharing of data among multiple threads. It is also agreed that this is a challenging area with a lot of active research.

Data compression applications are processor-intensive; therefore, they are interesting examples where the benefit from parallelization can be significant. Data compressors and decompressors are important not only by themselves, but also when used as part of bigger processes that compress or decompress on the fly. The program bzip2 is a very popular tool for data compression and decompression on multiple platforms such as Unix, Linux, and Windows.

The bzip2 program is based on the Burrows-Wheeler Transform (BWT) algorithm for data compression. The BWT algorithm works over blocks of text instead of over an entire file.

It does not process the input file sequentially on a character-by-character basis; instead, it processes each block of text as a single unit. BWT applies a reversible transformation to a block of text to form a new block that contains the same characters but is easier to process by simple compression algorithms, such as Huffman coding. Basically, this transformation generates all the rotations of the characters in the original sequence (block) S, and then sorts all the generated sequences. An important observation is that the sequence T (for Tails), formed by taking the last character of each sequence in the sorted collection, forms a permutation of the original sequence. By keeping that sequence T and the position I of the original sequence in the sorted collection, one can eventually reproduce the original sequence.
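The forward transformation just described can be sketched in a few lines of C++. This is a naive version that materializes and sorts every rotation (real implementations use suffix sorting), shown only to make the rotate-sort-tail idea concrete:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Forward BWT over one block: build all rotations of S, sort them,
// and keep the last column T plus the index I of the original row.
std::pair<std::string, std::size_t> bwt(const std::string& block) {
    std::vector<std::string> rot;
    for (std::size_t i = 0; i < block.size(); ++i)
        rot.push_back(block.substr(i) + block.substr(0, i));
    std::sort(rot.begin(), rot.end());

    std::string tails;        // the sequence T
    std::size_t index = 0;    // the position I of the original sequence
    for (std::size_t i = 0; i < rot.size(); ++i) {
        tails += rot[i].back();
        if (rot[i] == block) index = i;
    }
    return {tails, index};
}
```

For example, `bwt("banana")` yields T = "nnbaaa" with I = 3: the repeated characters of the input end up adjacent in T, which is what makes the output easy for a locally-adaptive coder.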

The described transformation is illustrated in Fig. 1 (BWT Compressing Transformation Example). The sorting and tailing operations tend to put the occurrences of each character in the generated sequence T together, or close to each other (like the two h's and the two e's in the shown example). Thus, T will be easy to compress with a simple locally-adaptive compression algorithm, such as Huffman or arithmetic coding.

The serial implementation of bzip2 works by splitting the original file into blocks; it then compresses each block independently and sequentially, and writes the compressed blocks to the destination file. The program bzip2 is based on the libbzip2 library, which provides API functions for the required tasks. In particular, there is a function that compresses an in-memory data block: it runs the compression code within the calling thread and leaves the compressed block in memory.

The serial bzip2 implementation of course does not take advantage of parallel hardware such as multi-core and many-core computers. It just runs a single thread that continuously reads a block, compresses it, and writes the compressed block.

Since the compression part is the most time-consuming stage of the process, there is a clear opportunity to improve performance by parallelizing this stage. There already exist some parallel implementations of bzip2. Basically, they fall into two categories. Some of them do lock-based fine-grained parallelization of certain operations involved in the BWT algorithm, such as rotating or sorting.

The data dependencies and the contention due to locks prevent them from scaling well on hardware with many cores. The other category includes approaches that do coarse-grained parallelization, for example, parallelizing the compression of entire blocks. These use some sort of lock-based queue as a buffer between the reading, compressing, and writing stages.


They suffer from the contention problem inherent to lock-based designs and, consequently, do not scale well as the number of processors grows.

In this work, we address the problem of building an efficient parallel implementation of the bzip2 data compressor using a lock-free implementation of the queue data structure. The idea is to have a reader thread, many compressor threads running in parallel, and a writer thread. The reader thread reads blocks from the input file and stores them in an input buffer.

The compressors use the in-memory compression function provided by the API to compress the blocks and then put them in an output buffer. The writer thread takes compressed blocks and writes them to the output file in the correct sequential order.

Both the input and output buffers are implemented as lock-free queues to improve parallel processing, avoiding the contention problem of lock-based data structures. We tag the read blocks with their sequence numbers. The compressor threads then compress these blocks in parallel, in an out-of-order fashion, and the blocks are finally reordered before being written to the output file. This is similar to how pipelined out-of-order execution works in modern hardware architectures. We also use a two-buffer-output strategy for reordering the blocks before writing them to the output file, so that the compression stage proceeds completely in parallel. Our approach is based on the observation that the reading and writing stages are inherently serial, while compression is the most time-consuming stage. As the number of compressor threads increases, the throughput of the compression phase grows, and the stress shifts to the queues (buffers), causing contention on them.

The use of a lock-free queue implementation is worthwhile here, as it can alleviate the contention and improve the performance of this stage. The complete approach is detailed in the “Parallelizing bzip2” section.

The rest of this paper is organized as follows: the next section presents the related work; the “Parallelizing bzip2” section describes our approach; the “Experimental design” section describes the setup of our experiments; the obtained results, and how our approach compares to previous approaches, are presented in the “Experimental results” section; finally, some conclusions and future work are presented.

Related work

Most algorithms for data compression are designed for sequential processing. Both dictionary-based approaches, like the Zip family, and statistical approaches, like arithmetic and Huffman encodings, are essentially sequential. Even the current implementation of the popular bzip2 algorithm runs sequentially.

Isakov implemented a parallel version of bzip2 for symmetric multiprocessors called BZIP2SMP. Basically, this implementation does a fine-grained parallelization of some of the processes involved in the bzip2 algorithm, such as RLE (run-length encoding), sorting, and bit-storing.

This approach is reported to achieve good speedup, but that has not been systematically documented. Besides, it does not take advantage of any lock-free data structure, as we do in our work.

An experiment was conducted and documented by Jannesari et al.

as part of a software engineering class. Student teams were assigned the task of parallelizing bzip2 in a team competition. Most of the teams tried out some form of fine-grained parallelization; however, many parts of the sequential code were not amenable to parallelization due to data dependencies and tricky optimizations for faster sequential execution.

Gilchrist proposed and implemented a parallel version of bzip2 named PBZIP. Basically, this implementation parallelizes the compression stage of the process, compressing many blocks in parallel. He uses a blocking FIFO queue as a buffer between the reading thread and the compressing threads, and another between the compressing threads and the writing thread. This implementation, and similar ones, work fine but use lock-based queues, suffering from the contention problem inherent to them.

There are several proposals for implementing lock-free queues. In particular, the proposal of Michael & Scott was adopted in the lock-free queue implementation included in the BOOST library. In our work, we follow an approach similar to Gilchrist’s, but we use the Michael & Scott lock-free queue implementation from the BOOST library, instead of mutual-exclusion queues, as explained in the following sections.

Parallelizing bzip2

The starting point is the serial implementation of bzip2.

We then use Gilchrist’s implementation, pbzip2, as a reference point for our parallel implementations. Both the serial bzip2 and Gilchrist’s implementations were explained before. For our implementations we adopted coarse-grained parallelization, where multiple compressor threads compress independent blocks in parallel. We tried out various approaches looking for better results: lock-based one-buffer-output, lock-based two-buffer-output, and a lock-free implementation.

Lock-based one-buffer-output

Although blocks of data can be processed in parallel, they must be written in the same order they were read, to reconstruct the original data correctly. To guarantee this, we add a sequence number to each block at the moment of reading, and we use a modified output queue that only accepts a block if its sequence number is one more than that of the last block accepted. Using this mechanism, the writer thread only receives ordered blocks when it dequeues, and its work is really simplified: just pick up a block and write it to the file.

The downside of this approach is that it forces the compressor threads to wait for the moment when they can enqueue a block. For example, if the compressor that handles block 7 finishes, it must wait until block 6 is enqueued before enqueuing block 7, because the special queue does not allow block 7 to get in.

To measure the impact of using this special queue, we compare its performance against an implementation that does not consider the output ordering of the blocks; let us call it the non-ordered implementation.

The non-ordered implementation produces corrupt compressed files and therefore has no practical application, but it works as a baseline against which to compare the performance of the ordered implementation. We compared the performance of these two approaches and found that the ordered implementation is 1.3X slower than the non-ordered implementation. The cost of having compressors waiting in line to enqueue their blocks is clearly significant.

Lock-based two-buffer-output

Another approach to getting correctly ordered blocks in the compressed file is to allow the compressors to enqueue their blocks in the output queue in any order, and let the writer thread worry about the ordering of the blocks. We call this approach two-buffer-output, because the writer uses a secondary buffer to store some blocks.

When the writer is ready to write a block to the file, it searches for the next sequential block in its secondary buffer. If the block is found, it is taken out of the secondary buffer and written to the file. If the block is not found in the secondary buffer, the writer takes a block from the output queue. If this block is the next sequential block, it is written to the file. If not, the block is stored in the secondary buffer and another block is taken from the output queue.
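The writer’s search-and-stash loop can be sketched as follows. The secondary buffer is modelled here as a map keyed by sequence number, the queue as a plain `std::queue`, and payloads are elided; the names are illustrative, not from the real sources:

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <vector>

// Writer-side reordering in the two-buffer-output scheme: blocks arrive
// from the output queue in any order; out-of-order blocks are parked in a
// private secondary buffer and emitted strictly in sequence order.
std::vector<std::size_t> write_in_order(std::queue<std::size_t> output_queue,
                                        std::size_t nblocks) {
    std::map<std::size_t, std::size_t> secondary;  // seq -> block (payload elided)
    std::vector<std::size_t> written;              // stands in for the output file
    std::size_t next = 0;                          // next block to write
    while (written.size() < nblocks) {
        auto it = secondary.find(next);
        if (it != secondary.end()) {               // found in the secondary buffer
            written.push_back(it->second);
            secondary.erase(it);
            ++next;
        } else {                                   // pull from the output queue
            std::size_t seq = output_queue.front();
            output_queue.pop();
            if (seq == next) { written.push_back(seq); ++next; }
            else             { secondary[seq] = seq; }   // park it for later
        }
    }
    return written;
}
```

Note that the secondary buffer is touched only by the writer thread, so this loop needs no synchronization of its own.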

This process is repeated until the next sequential block is found and written to the file. Figure 2 exemplifies the two-output-buffer approach: (a) the writer (W) needs to write block 7 to the file. (b) Since the block is not in the secondary buffer, it takes blocks from the output queue and puts them in the secondary buffer. (c) Block 7 is found in the output queue and is written to the file.

(d) Now the writer must write block 8, which is already in the secondary buffer.

Figure 2. Two Buffers Approach

The two-output-buffer approach has several advantages. First, the secondary buffer is only used by the writer thread, so it does not need to be protected from other threads (no locking or multi-threading worries).

Second, this mechanism keeps the output queue as empty as possible, so compressor threads do not get stalled because of a full queue. Finally, the compressor threads can enqueue in any order, so they do not have to wait. These advantages are reflected in the performance: the two-output-buffer implementation performs just as well as the non-ordered implementation.

Lock-free implementation

For our parallel bzip2 implementation we replace the lock-based queue with a lock-free queue and use it for both the reading buffer (between the reader and the compressors) and the writing buffer (between the compressors and the writer). We keep our two-output-buffer strategy, but now combined with a lock-free queue. There are several designs of lock-free queues.
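To make the flavour of such designs concrete, here is a deliberately simplified Michael & Scott style queue built on std::atomic. Unlike a production version (such as the one in BOOST), it omits the modification counters that guard against the ABA problem and does no safe memory reclamation, so retired nodes are simply leaked:

```cpp
#include <atomic>

// Simplified Michael & Scott lock-free queue: a singly-linked list with a
// dummy node; Head and Tail advance via compare-and-swap (CAS).
template <typename T>
class MSQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
    };
    std::atomic<Node*> head_, tail_;
public:
    MSQueue() {
        Node* dummy = new Node;          // Head always points to a dummy node
        head_.store(dummy);
        tail_.store(dummy);
    }
    void enqueue(T v) {
        Node* n = new Node;
        n->value = v;
        for (;;) {
            Node* tail = tail_.load();
            Node* next = tail->next.load();
            if (next != nullptr) {       // Tail is lagging: help advance it
                tail_.compare_exchange_weak(tail, next);
                continue;
            }
            Node* expected = nullptr;
            if (tail->next.compare_exchange_weak(expected, n)) {
                // Swing Tail to the new node; failure is fine, another
                // thread will help (Tail may point to the second-to-last node).
                tail_.compare_exchange_weak(tail, n);
                return;
            }
        }
    }
    bool dequeue(T& out) {
        for (;;) {
            Node* head = head_.load();
            Node* next = head->next.load();
            if (next == nullptr) return false;   // only the dummy: queue empty
            out = next->value;
            if (head_.compare_exchange_weak(head, next))
                return true;             // old dummy is leaked in this sketch
        }
    }
};
```

A failed CAS simply retries, so no thread ever blocks holding a lock; this is what removes the contention bottleneck of the mutex-based queues.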


In this work we used the lock-free queue implementation included in the BOOST library, which is based on the design of Michael & Scott. This queue is implemented as a singly-linked list with Head and Tail pointers, where Head always points to a dummy node at the beginning of the list, and Tail points to either the last or the second-to-last node in the list. The algorithm uses compare-and-swap with modification counters to avoid the ABA problem. We selected this queue because it is based on a well-documented design and has a well-engineered implementation.

As Figure 3 shows, the main program uses the following components: two lock-free queues (readqueue and writequeue), one reader thread and one writer thread (fr, fw), and n compressor threads (c).

Figure 3. Compressing algorithm

The algorithm proceeds as follows: in the starting phase, it creates the components and starts the threads (reader, compressors, and writer). During the working phase, all the threads work in parallel, mediated by the queues. The finishing phase begins when the reader finishes.

At this point, all the compressors are notified to stop as soon as the reading queue is empty. When they all finish, the writer thread is similarly notified to stop when the writing queue is empty. The algorithm finishes when the writer thread stops.

Experimental design

We used a 64-core shared-memory computer (Opteron 6272 x4) to compare the performance of the various versions of the parallel bzip2 algorithm.

We designed experiments to compare the performance of the following implementations: Gilchrist’s lock-based implementation, our lock-based two-buffer-output implementation, and our lock-free implementation. Our lock-based one-buffer-output implementation was not considered for the final experiment, as it was significantly outperformed by our two-buffer-output implementation.

All the programs were compiled with the gcc 4.6.3 compiler. For the experiments we compressed a 3.1 GB data file using a block size of 900 KB. Reducing the block size (from 900 KB to 100 KB in this case) marginally improves the performance of all the parallel approaches, but it negatively affects the compression ratio. Since the block size reduction has a similar effect on the different approaches and negatively affects the compression ratio, it was discarded as a parameter for comparison.

Each of the compared implementations was run with the following numbers of processors: 1, 2, 4, 8, 16, 24, 32, 40, 48, 56, and 64. All the experiments were measured as wall clock times in milliseconds.

The time data for each implementation were then used to obtain the speedup for each number of processors (compared to the corresponding one-processor, i.e., serial, configuration). These speedup data are presented in the graphs.

Experimental results

The following subsections summarize the results of the experiments in both settings, using the lock-based and the lock-free implementations.

Lock-based implementations

Our two-buffer-output lock-based implementation gets similar results to Gilchrist’s. This was foreseeable, as the two implementations are very similar, both using lock-based queues. The clock time decreases as the number of processors grows, as shown in Graph 2 (Lock-Based Speedup). Both our two-buffer-output implementation and Gilchrist’s implementation show a good speedup, which is 99% of the ideal speedup using 8 processors and around 88% at 16 cores.

After that, their performance starts to decay.

Lock-free implementation

The lock-free implementation uses lock-free queues. Beyond that, it is similar to the lock-based implementations.

All phases (reading, compressing, and writing) of the program run in parallel, so both I/O and compressing times are considered. Different block sizes were tried out (such as 100 KB and 900 KB). This strategy scales very well up to around 16 processors.

It performed at 99% of the optimal speedup for 8 processors and at 90.44% of the optimal speedup for 16 processors, as shown in the corresponding graph.