How should you build a single Point Cloud dataset from multiple frames of depth data?


I am desperately stuck and would appreciate some help and guidance from the community. Apologies in advance for the lengthy post; I hope you persevere in reading through it!

What I am trying to achieve
The ultimate goal of my application is to analyse surfaces, looking for anomalies and then performing calculations on those anomalies in order to make decisions. I first tried using the Scanner app to simply scan a surface with a defect and then look at the resultant mesh; however, the mesh just doesn’t have the resolution I need. For this analysis I am going to need a dense Point Cloud of the surface, and therefore need to roll my own app.

My first attempt…
…was kind of successful, in that I was able to use code posted in these forums to extract the data from a depth frame and save it to a Point Cloud file. The 640x480 resolution depth image gives a potential maximum of 307,200 points (assuming none are NaN). However, this does not give me a sufficiently detailed Point Cloud dataset.

My second attempt…
…was a complete failure. I am capturing multiple frames of depth data (from different angles) and then processing them into separate Point Cloud files. When I load these Point Cloud files into an application they all appear on top of each other and create a total mess. I am assuming this is because the extraction routine creates the 3D points in the camera’s frame of reference (using the intrinsics posted in these forums), so the coordinates come out the same for every depth frame.

So, I think I need to apply some form of transformation to the 3D points of each processed depth frame so they are correctly positioned relative to the camera’s viewpoint. Trouble is, I don’t know where to start:

  1. What role does STDepthFrame.glProjectionMatrix play in applying transformations to the depth frame data, or does it?
  2. What role does STDepthFrame.colorCameraPoseInDepthCoordinateFrame play in applying transformations to the depth data, or does it?
  3. I tried to check these values for each frame I captured, but they never changed - why not?
  4. Is it because I need to use the STTracker in order for the STDepthFrame.colorCameraPoseInDepthCoordinateFrame to be updated for each frame based on how the camera has moved over time?

Apologies for all the questions but I’m really struggling to figure all this out. I am thinking that I need to take the following approach:

  1. Capture a depth frame along with the camera pose matrix
  2. Capture another depth frame along with the camera pose matrix
  3. Calculate the difference in camera poses from 1 and 2 to get a transformation matrix
  4. Apply the transformation matrix calculated to the depth data from the second frame so that it is positioned correctly, relative to the first depth frame

I just don’t know how to get the necessary data from the structure sensor to do the calculations!
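
In code terms, here is how I imagine steps 3 and 4 might look (an untested sketch; I’m assuming the poses arrive as GLKMatrix4 camera-to-world transforms, which may be wrong, and `relativeTransform`/`transformPoint` are just names I made up):

```swift
import GLKit

// Untested sketch: given two camera poses (assumed camera-to-world),
// compute the transform that maps points from frame 2's camera space
// into frame 1's camera space.
func relativeTransform(pose1: GLKMatrix4, pose2: GLKMatrix4) -> GLKMatrix4 {
    var invertible = true
    let pose1Inverse = GLKMatrix4Invert(pose1, &invertible)
    // frame 2 camera space -> world space -> frame 1 camera space
    return GLKMatrix4Multiply(pose1Inverse, pose2)
}

func transformPoint(_ p: GLKVector3, by m: GLKMatrix4) -> GLKVector3 {
    // Use a 4-component vector with w = 1 so the matrix's translation
    // column is applied, not just the rotation.
    let v = GLKMatrix4MultiplyVector4(m, GLKVector4MakeWithVector3(p, 1))
    return GLKVector3Make(v.x, v.y, v.z)
}
```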


Assuming you’ve started with the Scanner sample, the tracker option kSTTrackerTrackAgainstModelKey is set to TRUE. As long as tracking is OK each depth frame received, along with the current camera pose from the tracker, is passed to the mapper in -processDepthFrame for integration into the model.

I think that’s all the information you need to breadboard your algorithm.
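
For instance, something along these lines (a sketch only; `capturedFrames` and `capture` are hypothetical names, and I’m assuming you leave the sample’s tracking loop intact — you may also need to copy the depth data if the SDK reuses frame buffers):

```swift
import GLKit

// Sketch: wherever the Scanner sample already has both the depth frame
// and the tracker's current camera pose, keep a copy of the pair
// instead of only handing them to the mapper.
var capturedFrames: [(depth: STDepthFrame, pose: GLKMatrix4)] = []

func capture(depthFrame: STDepthFrame, cameraPose: GLKMatrix4) {
    capturedFrames.append((depth: depthFrame, pose: cameraPose))
    // ... the sample would normally also pass both to the mapper here
}
```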



I do not want to build a model, so I do not want to track against one; I therefore think I need to use the tracker in unbounded mode.

What I need to understand is what the mapper is doing internally to turn the depth frame into a 3D mesh. Presumably it processes the depth values into points, transforms them into the correct position and orientation, and then builds (or updates) the 3D mesh. I want to do everything except build the mesh: I just want to grab the raw depth values of all frames and have them positioned correctly in 3D space.

The SDK documentation does not provide enough detail to understand what the camera pose does, or how to use its values within our application.

How do the glProjectionMatrix and the camera pose matrices relate to each other? It would be good to see an example block of code that applies the matrix transform to the depth values. Do you know of any example code?



The best tracking will be achieved when initialized to track against the model, but you don’t have to use the model (mesh). Take the camera poses and depth frames in -processDepthFrame as mentioned above and you should have what you need. The camera poses should be referenced to the upper corner of the scanning volume. It might be helpful to add a label and print out the camera pose to get a feel for what’s happening.
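
A quick way to do that print-out (GLKit stores matrices column-major, so the translation component lives in m30, m31, m32):

```swift
import GLKit

// Print a camera pose so you can watch it change between frames.
// The translation (m30, m31, m32) should move as the sensor moves.
func dumpPose(_ pose: GLKMatrix4) {
    NSLog("pose: \(NSStringFromGLKMatrix4(pose))")
    NSLog("translation: \(pose.m30), \(pose.m31), \(pose.m32)")
}
```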



Hi Jim,

I’m still struggling to get this to work; I fear my poor grasp of the math needed for 3D work is hampering my ability to achieve a result.

I have done as you suggested and am using -processDepthFrame in the Scanner app (Swift conversion by Christopher Worley) to get the depth data and associated camera pose. I use the code posted here, which gives me an x, y, and z for each pixel that has a depth value.

I’m coming unstuck trying to apply some form of transform to the x, y, and z so that when several depth frames are processed together, the points are correctly located in 3D space no matter what angle I view the scene from. I have tried, perhaps naively, the following:

let vec = GLKVector3Make(x, y, z)
let vecTrans = GLKMatrix4MultiplyVector3(cameraPose, vec)

But if I bring my processed depth frames together, all of the points end up on top of each other and the scene looks a mess. Do I need to use the glProjectionMatrix value from the depth frame in some way?

Would really appreciate some further assistance here. For what it’s worth, here is my function to write the depth frame data to a file:

func exportDepthFrame(toFile: String, depthFrame: STDepthFrame, cameraPose: GLKMatrix4) {
    NSLog("Exporting depth frame...")
    // Scale the VGA intrinsics to the depth buffer dimensions
    let _fx = Float(VGA_F_X) / Float(VGA_COLS) * Float(depthFrame.width)
    let _fy = Float(VGA_F_Y) / Float(VGA_ROWS) * Float(depthFrame.height)
    let _cx = Float(VGA_C_X) / Float(VGA_COLS) * Float(depthFrame.width)
    let _cy = Float(VGA_C_Y) / Float(VGA_ROWS) * Float(depthFrame.height)
    // Get the depth data as an array we can iterate over
    let pointer: UnsafeMutablePointer<Float> = UnsafeMutablePointer(mutating: depthFrame.depthInMillimeters)
    let theCount = Int(depthFrame.width * depthFrame.height)
    let distanceArray = Array(UnsafeBufferPointer(start: pointer, count: theCount))
    // Create an array to hold the PointCloud items
    var pointCloudData: [PointCloudItem] = []
    for r in 0..<depthFrame.height {
        for c in 0..<depthFrame.width {
            let pointIndex = r * depthFrame.width + c
            let depth = distanceArray[Int(pointIndex)]
            if !depth.isNaN {
                // Back-project the pixel through the pinhole model
                // (c is the column -> x axis, r is the row -> y axis)
                let x = depth * (Float(c) - _cx) / _fx
                let y = depth * (_cy - Float(r)) / _fy
                let z = depth
                let vec = GLKVector3Make(x, y, z)
                let vecTrans = GLKMatrix4MultiplyVector3(cameraPose, vec)
                // Add the item into the PointCloud dataset array
                pointCloudData.append(PointCloudItem(x: vecTrans.x, y: vecTrans.y, z: vecTrans.z))
            }
        }
    }
    // Export the processed depth frame to the file
    let depthFrameSaveUrl = getDocumentsSaveUrl(forFile: toFile)
    // Header line for output
    var outputData = "x,y,z\n"
    // Loop over the PointCloud array and append each element to the output
    for item in pointCloudData {
        outputData += "\(item.x),\(item.y),\(item.z)\n"
    }
    do {
        try outputData.write(to: depthFrameSaveUrl, atomically: true, encoding: .utf8)
        NSLog("Finished exporting depth frame")
    } catch {
        // Log the error
        NSLog("Error writing point cloud frame: \(error.localizedDescription)")
    }
}



Hi colin,

You need to apply a camera transform to move between camera space and world space. As for your question:

Do I need to use the glProjectionMatrix value from the depth frame in some way?



Check out this neat projection visualization demo to make the math feel less unapproachable:

See also:

The above is more relevant.



Of course, 10 out of 10 recommend putting a printout of the entire PCL Documentation under your pillow:

Especially the bit at the end titled More about transformations. Big hint: isn’t it strange that camera pose and depth calls always co-occur with GLKMatrix4, and mesh calls with GLKVector3?
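
To make that hint concrete with a toy example (not SDK-specific): GLKMatrix4MultiplyVector3 applies only the rotation part of a 4x4 matrix, so a pose’s translation is silently dropped; promoting the point to a GLKVector4 with w = 1 brings the translation into play.

```swift
import GLKit

// A pure-translation pose makes the difference obvious.
let pose = GLKMatrix4MakeTranslation(0.5, 0.0, 1.0)
let point = GLKVector3Make(1.0, 2.0, 3.0)

// Treated as a direction (w = 0): translation ignored, stays (1, 2, 3).
let asVec3 = GLKMatrix4MultiplyVector3(pose, point)

// Treated as a position (w = 1): result is (1.5, 2.0, 4.0).
let asVec4 = GLKMatrix4MultiplyVector4(pose, GLKVector4MakeWithVector3(point, 1))
```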

I would ignore the color information for now even though it is a part of the Scanner app. You are only looking at depth clouds atm, correct?



Hi Charlie,

Thank you very much for the pointers. I have only just seen your post as I have not checked for a while, and the forum does not seem to be sending me emails for some reason.

Is there a code sample missing from your post? You have an apparent break denoting sample code in C++, but then no code fragment.



@colin.henderson It’s in the link!


Thanks for the clarification Charlie.

If I have followed everything from the links you sent me, in particular the visualisation, do I need to create the ModelView matrix and then invert it to get the transform from camera space to world space, then use that to multiply the 4-component vector?

If so, am I correct in assuming:

View Matrix = Camera Pose
Model Matrix = glProjectionMatrix
ModelView Matrix = View Matrix x Model Matrix
World Matrix = Inverted ModelView Matrix

Or am I completely wrong?




Were you able to make this work? Wondering if I could fork off your project, because I’m doing something very similar and currently researching it.