15418_mesh_project

The Application of CUDA and Open MPI in Mesh Simplification

Alejandro Gonzalez Lazaro (alejand2) and Majd Almudhry (malmudhr)

SUMMARY:

We will implement the mesh simplification algorithm by Michael Garland and Paul Heckbert on the GHC machines’ multicore processors and GPUs. The two implementations require different strategies that we will compare in this paper.

Figure: Meshes (credit: https://github.com/CMU-Graphics/Scotty3D/blob/main/assignments/A2/simplify.md)

BACKGROUND:

This project parallelizes the mesh simplification algorithm developed at CMU by Michael Garland and Paul Heckbert. From the 15-462 assignment, the pseudocode for the algorithm is as follows:

  1. Compute a quadric (a 4×4 error matrix; its details are on the page listed in the resources) for each face; see the sketch after this list.
  2. Compute an initial quadric for each vertex by adding up the quadrics at all the faces touching that vertex.
  3. For each edge, create an Edge_Record (a data structure) and add it to one global queue.
  4. Until a target number of triangles is reached, collapse the best/cheapest edge (as determined by the queue) and set the quadric at the new vertex to the sum of the quadrics at the endpoints of the original edge.
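
As a concrete illustration of step 1, here is a minimal sketch of computing a face's fundamental quadric from its plane equation, in the spirit of Garland and Heckbert; the Vec3 and Quadric types and the dense 4×4 storage are our own illustrative choices, not the Scotty3D representation.

```cpp
#include <array>
#include <cmath>

struct Vec3 { double x, y, z; };

static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

// Symmetric 4x4 quadric, stored as a full 4x4 matrix for clarity.
using Quadric = std::array<std::array<double, 4>, 4>;

// Fundamental quadric K = p * p^T for the plane p = [a, b, c, d] of the
// triangle (v0, v1, v2), where (a, b, c) is the unit normal and d places
// the plane through v0.
Quadric face_quadric(const Vec3& v0, const Vec3& v1, const Vec3& v2) {
    Vec3 n = cross(sub(v1, v0), sub(v2, v0));
    double len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    double p[4] = {n.x / len, n.y / len, n.z / len, 0.0};
    p[3] = -(p[0] * v0.x + p[1] * v0.y + p[2] * v0.z);  // d = -n . v0
    Quadric K{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            K[i][j] = p[i] * p[j];
    return K;
}
```

The vertex quadrics of step 2 are then element-wise sums of the quadrics of the faces touching each vertex, and in Garland and Heckbert's method the cost stored in each Edge_Record comes from evaluating the sum of the endpoint quadrics at the candidate collapse position.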

Here we will go over some of the key parts of this algorithm.

Key data structures: The Halfedge Mesh data structure is made up of lists of 4 different Elements: Halfedges, Vertices, Edges, and Faces.
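
For reference, a minimal sketch of how the four element types refer to one another (field names are illustrative, and each element carries more data, such as positions and quadrics, in the real implementation):

```cpp
// Illustrative, pointer-based view of the four halfedge mesh elements.
struct Vertex; struct Edge; struct Face;

struct Halfedge {
    Halfedge* twin;   // opposite halfedge across the shared edge
    Halfedge* next;   // next halfedge around the same face
    Vertex*   vertex; // vertex this halfedge leaves from
    Edge*     edge;   // full edge this halfedge belongs to
    Face*     face;   // face this halfedge borders
};
struct Vertex { Halfedge* halfedge; /* plus position, quadric, ... */ };
struct Edge   { Halfedge* halfedge; /* plus collapse cost, ... */ };
struct Face   { Halfedge* halfedge; /* plus face quadric, ... */ };
```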

Key Operations on Data Structures: The central operation is the edge collapse, which merges the two endpoints of an edge into a single vertex and removes the (typically two) triangular faces adjacent to it. The set of elements an edge collapse reads or modifies (the edge, its two endpoints, the adjacent faces, and the surrounding halfedges) forms what we call the Collapse Diamond.

Algorithm’s Inputs and Outputs: The algorithm takes in a halfedge mesh data structure and outputs another halfedge mesh data structure (these are generally stored in js3d files, but loading and saving these files is outside the scope of our algorithm). What is computationally expensive and could benefit from parallelization? The computation of quadrics per face, vertex, and edge can be parallelized, and the collapse of edges can be parallelized when there are no conflicts with other parts of the mesh being collapsed simultaneously.
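
As a sketch of the first kind of parallelism, summing face quadrics into vertex quadrics maps naturally onto one CUDA thread per face that scatters its quadric into its three vertices with atomics; the array layout and the 10-float symmetric quadric storage here are illustrative assumptions, not our actual kernels.

```cuda
// Hypothetical sketch: one thread per face adds that face's quadric into
// the quadrics of its three vertices (scatter with atomics).
__global__ void accumulate_vertex_quadrics(const float* face_quadrics,   // 10 floats per face (upper triangle)
                                           const uint3* face_verts,      // 3 vertex indices per face
                                           float*       vertex_quadrics, // 10 floats per vertex, zero-initialized
                                           int          num_faces) {
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= num_faces) return;
    unsigned int v[3] = {face_verts[f].x, face_verts[f].y, face_verts[f].z};
    for (int i = 0; i < 3; ++i)
        for (int k = 0; k < 10; ++k)
            atomicAdd(&vertex_quadrics[10 * v[i] + k], face_quadrics[10 * f + k]);
}
```

A gather formulation (one thread per vertex walking its incident faces through the halfedge structure) avoids the atomics at the cost of more irregular memory access.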

Workload Breakdown.

Intrinsically, the halfedge mesh exposes a large amount of parallelism: different parallel units can work on elements that no other unit needs to access or modify. For instance, another parallel unit can collapse any edge that does not touch any of the elements in the first unit's Collapse Diamond (described above). This can amount to good data parallelism if conflicts are detected and handled appropriately. Moreover, there is potential for temporal locality while computing an edge collapse, but not as much spatial locality, since the halfedge mesh is stored as a set of references between elements.
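
A minimal sketch of the kind of conflict test this implies, assuming each worker knows which elements it owns (the CollapseDiamond container and the per-element owner arrays are illustrative, not our exact structures):

```cpp
#include <cstdint>
#include <vector>

// Indices of every element a single edge collapse would read or write.
struct CollapseDiamond {
    std::vector<uint32_t> halfedges, vertices, edges, faces;
};

// A worker may collapse an edge on its own only if every element in the
// edge's Collapse Diamond is owned by that worker.
bool can_collapse_locally(const CollapseDiamond& d,
                          const std::vector<int>& halfedge_owner,
                          const std::vector<int>& vertex_owner,
                          const std::vector<int>& edge_owner,
                          const std::vector<int>& face_owner,
                          int my_id) {
    for (uint32_t h : d.halfedges) if (halfedge_owner[h] != my_id) return false;
    for (uint32_t v : d.vertices)  if (vertex_owner[v]   != my_id) return false;
    for (uint32_t e : d.edges)     if (edge_owner[e]     != my_id) return false;
    for (uint32_t f : d.faces)     if (face_owner[f]     != my_id) return false;
    return true;
}
```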

APPROACH

PARTITIONS

Fig 8: Mesh Partitioning for MPI

When we partition the mesh, we traverse every edge and try to assign all the data in its Collapse Diamond to that partition, setting the owner of each such element to the partition. If some of that data is already owned by another partition, we still add the element to the partition but do not update its owner; it functions as a "ghost cell". Any edge linked to a ghost cell becomes part of the "conflict zone", where collapsing cannot be done without communication because some of the data there is owned by someone else. The data in each partition is gathered by iterating over edges with a breadth-first search from a seed edge, which maximizes the ratio of partition area to conflict-zone perimeter.

In our current implementation, the master process first computes the partitions and asynchronously sends the halfedge, edge, vertex, and face arrays to each process. Each process (including the master) then takes its partition data and begins simplification. Edges in the conflict-free zone are handled by first checking the Collapse Diamond; if no conflicts with data owned by other partitions are found, we commit to collapsing the edge.

After each process collapses its target number of edges, the partitions are communicated back to the master process asynchronously. The master process is assigned edges last, so it has less work to do than the other processes and is ready to receive each piece of data as it arrives. Each array is sent from each process individually and asynchronously so that the master can update the main mesh as data comes in, rather than waiting at the end for everyone to send their data.
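
A minimal sketch of the asynchronous exchange described above, assuming each element array has already been flattened into a contiguous uint32_t buffer (the function, tags, and buffer types are illustrative, not our exact code):

```cpp
#include <mpi.h>
#include <cstdint>
#include <vector>

// Post non-blocking sends of one partition's four arrays so the sender can
// immediately move on to the next partition instead of waiting.
void send_partition_async(int dest,
                          const std::vector<uint32_t>& halfedges,
                          const std::vector<uint32_t>& edges,
                          const std::vector<uint32_t>& vertices,
                          const std::vector<uint32_t>& faces,
                          std::vector<MPI_Request>& requests) {
    const std::vector<uint32_t>* arrays[4] = {&halfedges, &edges, &vertices, &faces};
    for (int tag = 0; tag < 4; ++tag) {  // the tag identifies which array the message carries
        MPI_Request req;
        MPI_Isend(arrays[tag]->data(), (int)arrays[tag]->size(), MPI_UINT32_T,
                  dest, tag, MPI_COMM_WORLD, &req);
        requests.push_back(req);
    }
}
// The receiver posts matching MPI_Irecv calls; the sender eventually calls
// MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE).
```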

Adapting original data structures for parallelization:

As we have already mentioned, the halfedge data structure is built from a series of references to the different elements around the mesh. The problem with this design is that the two parallelization models we chose require memory transfers between devices that do not share a memory system, so simply transferring the original data structures across devices would render all those pointers useless. As such, we changed all of the base data structures so that instead of storing element pointers (ElementRef) they store uint32_t indexes into arrays of halfedges, faces, vertices, and edges; any previous element dereference becomes an array access. With this, transferring each of the arrays with MPI and between CPU and GPU becomes possible. For both the CUDA and MPI implementations, we begin by translating the loaded halfedge mesh into our array version. This conversion is not included in our initialization time: the mesh is loaded into the original reference-based halfedge data structure, but our timed tests assume it is already in the array format. The following are some of the additional modifications we made per implementation:
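
A minimal sketch of what this flattened representation looks like (field names and the INVALID sentinel are illustrative; the real elements carry more data, such as quadrics):

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t INVALID = 0xFFFFFFFFu;  // stands in for a null ElementRef

// Pointer-free elements: every former ElementRef is now an index into one of
// the four arrays below, so the arrays can be shipped over MPI or copied to
// the GPU and the "references" remain valid on the other side.
struct HalfedgeFlat { uint32_t twin, next, vertex, edge, face; };
struct VertexFlat   { uint32_t halfedge; float x, y, z; };
struct EdgeFlat     { uint32_t halfedge; };
struct FaceFlat     { uint32_t halfedge; };

struct MeshArrays {
    std::vector<HalfedgeFlat> halfedges;
    std::vector<VertexFlat>   vertices;
    std::vector<EdgeFlat>     edges;
    std::vector<FaceFlat>     faces;
};

// A former chain of dereferences like h->twin->vertex becomes:
//   uint32_t v = mesh.halfedges[mesh.halfedges[h].twin].vertex;
```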

CUDA optimizations:

However, we again ran out of shared memory as we moved to bigger meshes. A better approach would be to partition the mesh, similarly to how it is done with MPI, and have every block handle a different partition. However, we were not able to fully test this approach.
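
A minimal sketch of the per-block partition idea, assuming each partition's edge records fit in one block's shared memory (the EdgeRecord layout, the offsets array, and the kernel itself are illustrative; as noted, we did not get this approach working end to end):

```cuda
struct EdgeRecord {
    unsigned int edge;  // index of the edge in the global edge array
    float        cost;  // quadric error of collapsing this edge
};

// One block per partition: stage that partition's edge records in shared
// memory, process them locally, then write the results back to global memory.
// Launch: simplify_partitions<<<num_partitions, threads, max_records_per_partition * sizeof(EdgeRecord)>>>(...)
__global__ void simplify_partitions(const EdgeRecord* records,
                                    const int*        part_offsets, // partition p owns [part_offsets[p], part_offsets[p+1])
                                    EdgeRecord*       out_records) {
    extern __shared__ EdgeRecord local[];
    int begin = part_offsets[blockIdx.x];
    int end   = part_offsets[blockIdx.x + 1];

    // cooperative copy of this partition's records into shared memory
    for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
        local[i - begin] = records[i];
    __syncthreads();

    // (selecting and applying conflict-free collapses within this partition
    //  would go here; omitted in this sketch)

    // write the partition's records back out
    for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
        out_records[i] = local[i - begin];
}
```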

One issue we faced throughout this project is that there appears to be a bug in our sequential version of the algorithm that we could not track down. Our theory is that a single edge collapse produces a valid mesh, but some reference update is not quite right, so if we collapse edges around an area that was collapsed before, the error compounds and eventually produces an invalid mesh. As a result, in some cases we were not able to generate valid meshes, and we often found our code running into infinite loops while traversing the mesh due to these inconsistencies. Here we describe some of the improvements we began to implement or planned for our parallel versions but that were hindered by this bug.

RESULTS

To test our implementation, we generated sphere meshes with different face counts and simplified them at different ratios. We compare our results against both our original sequential code and the single-core performance of our parallel code, measuring performance in triangles collapsed per second. As all of the following results show, partitioning can be a bottleneck. However, partitioning only has to be done once, and its cost can be amortized when the same mesh is simplified multiple times at multiple resolutions (for example, when the mesh is seen on screen from several different distances); hence we also show the speedup without the partitioning cost.

DISCUSSION

RESOURCES

SCHEDULE:

Week 0 (March 24-31): (DONE)

Get the Scotty 3D build system compiling CUDA and MPI and start sequential implementation.

Week 1 (April 1-7): (DONE)

Finish sequential implementation

Week 2 (April 8-14): (DONE)

Sketch out the algorithms for both CUDA and MPI and start implementing them.

Data structure reworking

MPI Partitioning

Week 3.1 (April 15-17):

Implement edge record pass for MPI (Alejandro) (DONE)

Start implementing the “relaxed priority queues” for CUDA (Majd)

Week 3.2 (April 18-21):

Implement collapsing pass for MPI (Alejandro) (DONE only with non-conflicts)

Finish implementing the “relaxed priority queues” for CUDA (Majd)

Week 4.1 (April 22-24):

Implement gather and rebuild from partitions for MPI (Alejandro) (MOVED)

Implement the locking part for CUDA (Majd)

Week 4.2 (April 25-28):

Implement gather and rebuild from partitions for MPI (Alejandro) (DONE) Add collapsing for MPI with conflicts (Alejandro) (DONE)

Partitioning algorithm with CUDA (Majd)

Week 5.1 (April 28-May 1):

MPI alternate mesh partitioning with KD-trees (Alejandro)

Optimization/Data collection CUDA (Majd)

Optimization/Data collection MPI (Alejandro) (DONE)

Edge collapsing for partitioning algorithm (Majd & Alejandro)

Week 5.2 (May 2-May 5):

Poster and report (Alejandro & Majd)

PROPOSAL URL:

https://docs.google.com/document/d/13vQRTEH8dcyKOkdxwu2RBkG24wPOiYFiHQDjRZhcpcI/edit?usp=sharing

PROJECT MILESTONE REPORT URL:

https://docs.google.com/document/d/1yHcyYbR6AOp56O6Ecy0u2nk4vsgWCoEUkRCMBGLWxfk/edit?usp=sharing

FINAL REPORT URL:

https://docs.google.com/document/d/17UkdbUonFj9XmygbsjkXXH3oLB5dvahuskbuaYc_oZU/edit?usp=sharing