Alejandro Gonzalez Lazaro (alejand2) and Majd Almudhry (malmudhr)
We implemented the mesh simplification algorithm of Michael Garland and Paul Heckbert on the GHC machines’ multicore CPUs and GPUs. The two implementations required different parallelization strategies, which we compare in this report.
Credit: https://github.com/CMU-Graphics/Scotty3D/blob/main/assignments/A2/simplify.md
This project parallelizes a mesh simplification algorithm developed at CMU by Michael Garland and Paul Heckbert. From the 15-462 assignment, the pseudocode for the algorithm looks like the following:
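In C++-flavoured pseudocode, the loop looks roughly like this (a sketch only; type and helper names are illustrative, not the exact Scotty3D interface):

    // Sketch only: compute quadrics, build edge records, then greedily collapse
    // the cheapest edge until the target face count is reached.
    void simplify(HalfedgeMesh& mesh, size_t target_faces) {
        for (Face f : mesh.faces())      face_quadric[f]   = plane_quadric(f);
        for (Vertex v : mesh.vertices()) vertex_quadric[v] = sum_adjacent_face_quadrics(v);

        PriorityQueue<EdgeRecord> queue;                     // ordered by collapse cost
        for (Edge e : mesh.edges()) queue.insert(make_record(vertex_quadric, e));

        while (mesh.faces().size() > target_faces) {
            EdgeRecord best = queue.pop_min();
            queue.remove_records_touching(best.edge);        // neighbours will be re-scored
            Vertex v = mesh.collapse_edge(best.edge, best.optimal_position);
            vertex_quadric[v] = best.quadric;                // quadric for the new vertex
            for (Edge e : v.adjacent_edges()) queue.insert(make_record(vertex_quadric, e));
        }
    }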
Here we will go over some of the key parts of this algorithm.
Key data structures: The halfedge mesh data structure is made up of lists of four element types. The Halfedge itself stores references to its twin, the next halfedge around its face, and the vertex, edge, and face it belongs to. The other three elements are:
Edge: contains a pointer to a halfedge. Fig 2: Credit: https://github.com/CMU-Graphics/Scotty3D/blob/main/assignments/A2/halfedge.md
Face: contains a pointer to a halfedge. Fig 3: Credit: https://github.com/CMU-Graphics/Scotty3D/blob/main/assignments/A2/halfedge.md
Vertex: contains the position of the vertex and a pointer to a halfedge. Fig 4: Credit: https://github.com/CMU-Graphics/Scotty3D/blob/main/assignments/A2/halfedge.md
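As a rough illustration, the element types can be thought of as the following C++ structs (a simplified sketch of the standard halfedge layout, not the exact Scotty3D definitions):

    struct Halfedge; struct Edge; struct Face; struct Vertex;
    struct Vec3 { float x, y, z; };

    struct Halfedge { Halfedge* twin; Halfedge* next; Vertex* vertex; Edge* edge; Face* face; };
    struct Edge     { Halfedge* halfedge; };                  // one of its two halfedges
    struct Face     { Halfedge* halfedge; };                  // any halfedge bordering the face
    struct Vertex   { Vec3 position; Halfedge* halfedge; };   // any outgoing halfedge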
Key Operations on Data Structures:
The Collapse Diamond structure: When we collapse an edge, the mesh area around the edge (bolded blue in the figures below) needs to be modified. Components of the halfedge data structure marked with a red cross need to be deleted, and the magenta arrows represent references that need to be switched to other components; a small sketch of walking this diamond follows the figures.
Fig 6: Collapse Diamond
Fig 7: Collapse Diamond after Edge Collapse
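Using the element structs sketched earlier, walking the Collapse Diamond amounts to visiting the two triangles on either side of the edge (this sketch assumes a triangle mesh; names are illustrative):

    #include <vector>

    // Collect the six halfedges of the two triangles adjacent to edge e
    // (the core of the Collapse Diamond shown in the figures above).
    std::vector<Halfedge*> diamond_halfedges(Edge* e) {
        std::vector<Halfedge*> out;
        Halfedge* sides[2] = { e->halfedge, e->halfedge->twin };
        for (Halfedge* h : sides) {
            out.push_back(h);
            out.push_back(h->next);
            out.push_back(h->next->next);   // returns to h after one more step (triangles)
        }
        return out;
    }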
Unordered Maps: For each face in the mesh, a quadric matrix is computed and stored once; then, for each vertex, a quadric matrix is computed by summing the quadric matrices of the faces surrounding it. These matrices are used to compute the cost of collapsing each edge and the optimal position for the replacement vertex. Whenever a new vertex is created, a new quadric matrix is also added to this dictionary.
Edge Record Priority Queue: Every time we want to collapse an edge, we pop the cheapest edge record off the top of this queue, which orders edges by cost. Before collapsing an edge, we remove the records of every surrounding edge, and after the collapse we insert new records for the edges surrounding the vertex that replaced the collapsed edge.
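For reference, the per-face quadric matrix and the edge-collapse cost it feeds into can be computed roughly as follows (a minimal sketch using flat float arrays rather than a matrix/vector library):

    // The quadric of a face with unit normal n passing through point p is K = q q^T,
    // where q = (n.x, n.y, n.z, -dot(n, p)) is the face's plane equation.
    void face_quadric(const float n[3], const float p[3], float K[16]) {
        float q[4] = { n[0], n[1], n[2], -(n[0]*p[0] + n[1]*p[1] + n[2]*p[2]) };
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                K[4*i + j] = q[i] * q[j];
    }

    // The cost of collapsing an edge to position pos is v^T Q v, with Q the sum of the
    // two endpoint quadrics and v the position in homogeneous coordinates (w = 1).
    float quadric_error(const float Q[16], const float pos[3]) {
        float v[4] = { pos[0], pos[1], pos[2], 1.0f };
        float cost = 0.0f;
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                cost += v[i] * Q[4*i + j] * v[j];
        return cost;
    }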
Algorithm's Inputs and Outputs: The algorithm takes in a halfedge mesh data structure and outputs another halfedge mesh data structure (these are generally stored in js3d files, but loading and saving those files is outside the scope of our algorithm). Which parts are computationally expensive and could benefit from parallelization? The computation of quadrics per face, vertex, and edge can be parallelized, and edge collapses can be performed in parallel as long as they do not conflict with other parts of the mesh that are being collapsed simultaneously.
Workload Breakdown.
Intrinsically, the halfedge mesh exposes a large amount of parallelism: different parallel units can work on elements that no other unit needs to access or modify. For instance, outside of the Collapse Diamond area described above, another parallel unit can collapse any edge that does not touch any element of a different unit's Collapse Diamond. This amounts to good data parallelism if conflicts are detected and handled appropriately. Moreover, there is potential for temporal locality while computing an edge collapse, but not as much spatial locality, since the halfedge mesh is stored as a set of references between elements.
Fig 8: Mesh Partitioning for MPI
When we partition our mesh, we traverse every edge and try to assign to one partition all of the data in that edge's Collapse Diamond, setting that partition as the owner of each element. If some of that data is already owned by another partition, we still add the element to the partition but do not update its owner; it functions as a "ghost cell". Any edge linked to ghost cells becomes part of the "conflict zone", where collapsing cannot be done without communication because some of the required data is owned by someone else. The contents of each partition are determined by iterating over every edge and carrying out a breadth-first search over neighbouring edges, which keeps each partition compact and maximizes the ratio of partition area to conflict-zone perimeter.
In our current implementation, the master process first computes the partitions and asynchronously sends the halfedge, edge, vertex, and face arrays to each process. Each process (including the master) then takes its partition data and begins simplification: for edges in the conflict-free zone, we first check the Collapse Diamond, and if no conflicts with data owned by other partitions are found, we commit to collapsing the edge. After each process has collapsed its target number of edges, the partitions are communicated back to the master process asynchronously. The master process is assigned its edges last, so it has less work to do than the other processes and is ready to receive data from them as it arrives. Each process sends each of its arrays individually and asynchronously, so the master can update the main mesh as data comes in rather than waiting at the end for everyone to send their data.
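The ownership test each process runs before committing to a collapse can be sketched roughly as follows, assuming each partition stores the owning rank of every element it holds (names are illustrative, not our exact code):

    #include <cstdint>
    #include <vector>

    struct Ownership {
        // One entry per local element: the rank that owns it. Ghost cells keep a foreign owner.
        std::vector<int> halfedge_owner, edge_owner, vertex_owner, face_owner;
    };

    // The diamond_* vectors hold the indices of the elements in the edge's Collapse Diamond.
    bool can_collapse_locally(const Ownership& own, int my_rank,
                              const std::vector<uint32_t>& diamond_halfedges,
                              const std::vector<uint32_t>& diamond_edges,
                              const std::vector<uint32_t>& diamond_vertices,
                              const std::vector<uint32_t>& diamond_faces) {
        for (uint32_t h : diamond_halfedges) if (own.halfedge_owner[h] != my_rank) return false;
        for (uint32_t e : diamond_edges)     if (own.edge_owner[e]     != my_rank) return false;
        for (uint32_t v : diamond_vertices)  if (own.vertex_owner[v]   != my_rank) return false;
        for (uint32_t f : diamond_faces)     if (own.face_owner[f]     != my_rank) return false;
        return true;   // the whole diamond is local, so the collapse needs no communication
    }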
Adapting original data structures for parallelization:
As we have already mentioned, the halfedge data structure is built from a web of references between the different elements of the mesh. The problem with this design is that both of the parallelization models we chose require memory transfers between devices that do not share an address space, so simply transferring the original data structures would render all of those pointers useless. We therefore changed all of the base data structures so that, instead of element pointers (ElementRef), they store uint32_t indices into the arrays of halfedges, faces, vertices, and edges, turning every previous pointer dereference into an array access. With this, transferring each of the arrays over MPI and between CPU and GPU became possible. For both the CUDA and MPI implementations, we begin by translating the loaded halfedge mesh into this array form. The conversion is not included in our initialization time: the mesh is loaded into the original reference-based halfedge data structure, but our timed region assumes it is already in the array format.
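Concretely, the pointer fields sketched earlier become uint32_t indices into four flat arrays, roughly like this (field names illustrative):

    #include <cstdint>
    #include <vector>

    struct HalfedgeIdx { uint32_t twin, next, vertex, edge, face; };
    struct EdgeIdx     { uint32_t halfedge; };
    struct FaceIdx     { uint32_t halfedge; };
    struct VertexIdx   { float px, py, pz; uint32_t halfedge; };

    struct MeshArrays {
        std::vector<HalfedgeIdx> halfedges;
        std::vector<EdgeIdx>     edges;
        std::vector<FaceIdx>     faces;
        std::vector<VertexIdx>   vertices;
    };

    // A dereference like h->twin->vertex becomes two array accesses:
    //   mesh.halfedges[mesh.halfedges[h].twin].vertex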
The following are some of the additional modifications we made per implementation:
MPI: In our initial MPI implementation we used a simple broadcast of the partition arrays followed by a synchronous receive of all the finished partition data. To improve this, we changed the broadcast and gather stages to use asynchronous sends and receives. This way, the master process does not need to wait for the other processes to receive their partitions before it starts to work, and it can do useful work (rebuilding the final mesh) as it receives data asynchronously from the other processes. These changes brought the communication time down somewhat, but the biggest bottleneck in our design remained the initial partitioning stage. For instance, here is our data before optimizing the partitioning stage, for a sphere consisting of 20480 faces:
This speedup came after we restructured the code and removed some inefficiencies. The partitioning step is inherently serial, but some of the later steps after the breadth-first-search assignment could be parallelized. We attempted to parallelize those later steps, but between the extra communication and the adaptations they required, the time delta was not consistently better, so we kept the whole partitioning serial.
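The asynchronous send/receive pattern described above looks roughly like the following (a sketch only; buffer sizing, tags, and the merge step are simplified, and names are illustrative):

    #include <mpi.h>
    #include <cstdint>
    #include <vector>

    // Master posts non-blocking sends of each partition's flattened arrays, then works on
    // its own partition; the caller waits on reqs (e.g. MPI_Waitall) once its work is done.
    void scatter_partitions(const std::vector<std::vector<uint32_t>>& partition_data,
                            std::vector<MPI_Request>& reqs) {
        for (int rank = 1; rank < (int)partition_data.size(); rank++) {
            MPI_Request r;
            MPI_Isend(partition_data[rank].data(), (int)partition_data[rank].size(),
                      MPI_UINT32_T, rank, /*tag=*/0, MPI_COMM_WORLD, &r);
            reqs.push_back(r);
        }
    }

    // Master consumes finished partitions as they arrive instead of waiting for all of them.
    void gather_results(int num_ranks, std::vector<std::vector<uint32_t>>& results) {
        std::vector<MPI_Request> reqs;
        for (int rank = 1; rank < num_ranks; rank++) {
            results[rank].resize(1 << 20);   // sizes would normally be exchanged first
            MPI_Request r;
            MPI_Irecv(results[rank].data(), (int)results[rank].size(),
                      MPI_UINT32_T, rank, /*tag=*/1, MPI_COMM_WORLD, &r);
            reqs.push_back(r);
        }
        for (size_t done = 0; done < reqs.size(); done++) {
            int idx;
            MPI_Waitany((int)reqs.size(), reqs.data(), &idx, MPI_STATUS_IGNORE);
            // Merge partition (idx + 1) into the master mesh while other receives are in flight.
        }
    }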
CUDA: However, as we moved to bigger meshes we again ran out of shared memory. A better approach would be to partition the mesh, similar to how it is done with MPI, and have every thread block handle a different partition, but we were not able to fully test this approach yet.
One issue we faced throughout this project is what appears to be a bug in our sequential version of the algorithm that we could not track down. Our theory is that a single edge collapse produces a valid mesh, but some reference update is not quite right, so if we later collapse edges around an area we had collapsed before, the error compounds and eventually produces an invalid mesh. As a result, in some cases we were not able to generate valid meshes, and we often found our code running into infinite loops while traversing the mesh due to these inconsistencies. Here we will discuss some of the improvements we began to implement or planned for our parallelization but that were hindered by this bug.
To test our implementation we generated sphere meshes with different face counts and simplified them at different ratios. We compare our results against both our original sequential code and the single-core performance of our parallel code, measuring performance in triangles collapsed per second. As the following results show, partitioning can be a bottleneck. However, partitioning only has to be done once, and its cost could be amortized by reusing the partitions when a mesh needs to be simplified multiple times at multiple resolutions (for example, when the mesh is seen on screen from several different distances); hence we also show the speedup without the partitioning cost.
MPI
Sample renders from the 20480-face sphere at a 0.2 ratio
CUDA
MPI
We can see throughout our results that the main limitation on speedup was the partitioning step, since it is inherently sequential. However, this step is very useful for avoiding conflicts during edge collapse: our code currently searches for edges outside the conflict zone, so seeing near-linear speedup shows that we are able to find non-conflicting edges more often than not. The next optimization effort should therefore focus on other partitioning methods, such as the kd-tree method we had planned, to reduce that overhead. Still, we can see that larger meshes and lower simplification ratios (smaller meshes at the end) are able to amortize the extra partitioning cost and achieve better speedup. The following is an example breakdown of execution time for the 20480-face, 0.05-ratio, 8-core case; as we can see, partitioning takes more than half of the execution time:
We can see that choosing a multicore processor was a good way to speed up mesh simplification. However, using MPI instead of shared memory may not be the best fit for consumer-targeted multiprocessors (our intended use case), as it required a large amount of reworking of the data structures to make them transferable between address spaces, and it made conflict-zone collapses more complicated to implement than simply taking locks around conflict zones would have been. It would be interesting to try this algorithm with OpenMP instead of OpenMPI.
CUDA
There are two main problems with the CUDA implementation that might explain why the speedup is far from ideal. The first is the small problem size: only smaller meshes with high ratios worked with CUDA because of the cost-computation bug described above, which makes it harder to draw conclusions about how effective this approach is. The second is inherent to the algorithm chosen here: after the kernel finishes its first phase, in which it chooses a single edge to collapse, every other thread in the block remains idle while it waits for that one thread to finish collapsing the edge. Clearly there is a better way to utilize the GPU's resources. Looking at algorithms such as the relaxed priority queue approach we discussed in our milestone report would be interesting; we did not take that approach because, given our problems with the sequential implementation, we thought a simpler approach would be easier to debug and analyze. The following tables show the execution time of the two main phases of the CUDA kernel, in clock cycles.
We can see from these results that we need to find a way to parallelize the second phase further or find work to overlap with it during that time; the simplified kernel sketch below illustrates where that idle time comes from.
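A stripped-down sketch of this kind of two-phase kernel (a sketch only, not our actual code; struct and helper names are illustrative assumptions and the collapse itself is stubbed out):

    #include <cfloat>

    struct EdgeRecord { float cost; };
    struct DeviceMesh { /* device pointers to the flat halfedge/edge/vertex/face arrays */ };

    // Stub: the real version rewires the Collapse Diamond references (omitted here).
    __device__ void collapse_edge(DeviceMesh mesh, int edge_index) {}

    constexpr int BLOCK_SIZE = 256;   // launch with blockDim.x == BLOCK_SIZE

    __global__ void simplify_kernel(EdgeRecord* records, int num_edges,
                                    DeviceMesh mesh, int num_collapses) {
        __shared__ float best_cost[BLOCK_SIZE];
        __shared__ int   best_edge[BLOCK_SIZE];
        int tid = threadIdx.x;

        for (int c = 0; c < num_collapses; c++) {
            // Phase 1 (parallel): each thread scans a strided slice of the edge records,
            // then a shared-memory tree reduction finds the block-wide cheapest edge.
            float cost = FLT_MAX; int edge = -1;
            for (int e = tid; e < num_edges; e += blockDim.x)
                if (records[e].cost < cost) { cost = records[e].cost; edge = e; }
            best_cost[tid] = cost; best_edge[tid] = edge;
            __syncthreads();
            for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
                if (tid < stride && best_cost[tid + stride] < best_cost[tid]) {
                    best_cost[tid] = best_cost[tid + stride];
                    best_edge[tid] = best_edge[tid + stride];
                }
                __syncthreads();
            }

            // Phase 2 (serial): a single thread collapses the chosen edge while every
            // other thread in the block sits idle at the barrier below.
            if (tid == 0 && best_edge[0] >= 0) {
                collapse_edge(mesh, best_edge[0]);
                records[best_edge[0]].cost = FLT_MAX;   // never pick this edge again
            }
            __syncthreads();
        }
    }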
Schedule
Week 0 (March 24-31): (DONE)
Get the Scotty 3D build system compiling CUDA and MPI and start sequential implementation.
Week 1 (April 1-7): (DONE)
Finish sequential implementation
Week 2 (April 8-14): (DONE)
Sketch out the algorithms for both CUDA and MPI and start implementing them.
Data structure reworking
MPI Partitioning
Week 3.1 (April 15-17):
Implement edge record pass for MPI (Alejandro) (DONE)
Start implementing the “relaxed priority queues” for CUDA (Majd)
Week 3.2 (April 18-21):
Implement collapsing pass for MPI (Alejandro) (DONE for non-conflicting edges only)
Finish implementing the “relaxed priority queues” for CUDA (Majd)
Week 4.1 (April 22-24):
Implement gather and rebuild from partitions for MPI (Alejandro) (MOVED)
Implement the locking part for CUDA (Majd)
Week 4.2 (April 25-28):
Implement gather and rebuild from partitions for MPI (Alejandro) (DONE)
Add collapsing for MPI with conflicts (Alejandro) (DONE)
Partitioning algorithm with CUDA (Majd)
Week 5.1 (April 28-May 1):
MPI alternate mesh partitioning with KD-trees (Alejandro)
Optimization/Data collection CUDA (Majd)
Optimization/Data collection MPI (Alejandro) (DONE)
Edge collapsing for partitioning algorithm (Majd & Alejandro)
Week 5.2 (May 2-May 5):
Poster and report (Alejandro & Majd)
https://docs.google.com/document/d/13vQRTEH8dcyKOkdxwu2RBkG24wPOiYFiHQDjRZhcpcI/edit?usp=sharing
https://docs.google.com/document/d/1yHcyYbR6AOp56O6Ecy0u2nk4vsgWCoEUkRCMBGLWxfk/edit?usp=sharing
https://docs.google.com/document/d/17UkdbUonFj9XmygbsjkXXH3oLB5dvahuskbuaYc_oZU/edit?usp=sharing