Data Preparation: We voxelized the lidar point cloud and used bird's-eye-view and cylindrically projected images as model input.
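The bird's-eye-view step can be sketched as a simple occupancy projection of the point cloud onto a ground-plane grid. This is a minimal illustration; the ranges and the 0.1 m cell resolution are assumed values, not parameters from the original pipeline.

```python
import numpy as np

def bev_occupancy(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), res=0.1):
    """Project lidar points (N, 3) onto a bird's-eye-view occupancy grid.

    Illustrative sketch; ranges and resolution are assumed, not the
    values used in the original pipeline.
    """
    # Keep only points inside the chosen x/y extent.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    # Convert metric coordinates to integer grid indices.
    xi = ((pts[:, 0] - x_range[0]) / res).astype(int)
    yi = ((pts[:, 1] - y_range[0]) / res).astype(int)
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    grid = np.zeros((h, w), dtype=np.float32)
    grid[xi, yi] = 1.0  # mark cells containing at least one point
    return grid
```

A real pipeline would typically stack several height slices and a density channel rather than a single occupancy map.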
Base Model: We used the Aggregate View Object Detector (AVOD), a two-stage detection network, with slight modifications. The first stage proposes regions that are likely to contain an object, and the second stage refines those proposals into more accurate predictions.
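The two-stage flow above can be sketched structurally: stage 1 keeps the top-scoring anchors as proposals, and stage 2 applies learned per-proposal refinements. This is a toy sketch of the control flow only; in AVOD both stages are learned networks, and the arrays here stand in for their outputs.

```python
import numpy as np

def two_stage_detect(anchors, scores, refine_offsets, k=2):
    """Toy two-stage detection flow.

    Stage 1 (proposal): keep the k most object-like anchors.
    Stage 2 (refinement): shift each kept proposal by its predicted offset.
    Structural sketch only; not AVOD's actual learned heads.
    """
    # Stage 1: rank anchors by objectness score, keep the top k.
    keep = np.argsort(scores)[::-1][:k]
    proposals = anchors[keep]
    # Stage 2: refine each surviving proposal.
    return proposals + refine_offsets[keep]
```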
Model Training: We extracted features with convolutional layers from the processed images and lidar data, and fused feature-map crops from both modalities based on shared anchor boxes. For each crop we performed objectness classification and regressed the anchor-box offsets and orientation.
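The crop-and-fuse step can be illustrated as follows: a region is cropped from each view's feature map at its projected anchor box, resampled to a common size, and fused element-wise. This is a hypothetical sketch; the 7x7 crop size and mean fusion are assumptions (AVOD-style fusion can also use other resize and fusion operators).

```python
import numpy as np

def crop_and_fuse(bev_feat, img_feat, bev_box, img_box, out=(7, 7)):
    """Crop an anchor's region from each view and fuse by element-wise mean.

    Boxes are (row0, col0, row1, col1) in feature-map coordinates.
    Nearest-neighbour index sampling stands in for a proper crop-and-resize.
    Illustrative assumption, not the exact original implementation.
    """
    def crop_resize(feat, box):
        r0, c0, r1, c1 = box
        # Sample a fixed out-sized grid of indices across the box.
        rows = np.linspace(r0, r1 - 1, out[0]).astype(int)
        cols = np.linspace(c0, c1 - 1, out[1]).astype(int)
        return feat[np.ix_(rows, cols)]

    a = crop_resize(bev_feat, bev_box)   # crop from the BEV feature map
    b = crop_resize(img_feat, img_box)   # crop from the image feature map
    return (a + b) / 2.0                 # element-wise mean fusion
```

The fused crop would then feed the objectness and box-regression heads described above.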
Experimentation: We experimented with different cylindrical-projection schemes and mappings from 3D points to image coordinates, and with different sets of anchor boxes for model training.
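One such mapping from 3D points to cylindrical image coordinates can be sketched as below: azimuth maps to columns and elevation to rows. The angular resolutions are illustrative assumptions, not the values actually evaluated.

```python
import numpy as np

def cylindrical_project(points, h_res=np.deg2rad(0.2), v_res=np.deg2rad(0.4)):
    """Map 3D lidar points (N, 3) to (u, v) pixel indices on a cylinder.

    Angular resolutions are assumed for illustration.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    azimuth = np.arctan2(y, x)                 # angle around the sensor
    elevation = np.arctan2(z, np.hypot(x, y))  # angle above the horizon
    u = (azimuth / h_res).astype(int)          # column index
    v = (elevation / v_res).astype(int)        # row index
    return u, v
```

Varying `h_res` and `v_res` (and the vertical field of view) is one natural axis of the projection experiments described above.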
Infrastructure and Hardware: We performed data engineering and model training on a local workstation with 20 CPU cores, 128 GB of RAM, and an NVIDIA TITAN RTX GPU.