Research on indoor scene modeling has made considerable progress in recent years, particularly in multi-view and single-view modeling frameworks, which have improved robots' perception of their environment. However, two shortcomings remain: (1) multi-view modeling methods require long preprocessing times and an offline optimization step after modeling is complete, so they cannot meet the modeling requirements under specific conditions; (2) single-view modeling algorithms mainly output voxels, so modeling quality is low and much information is lost. As a result, scene details cannot be accurately characterized, and it is difficult to meet the requirements of robot interaction. To address these deficiencies, this paper proposes an indoor scene modeling method based on template replacement. First, the three-dimensional point cloud scene is preprocessed to segment individual objects whose point clouds are incomplete, and virtual scanning is then used to sample points on each object's surface and compute the corresponding normal vectors and curvatures. Next, an octree is used to store the normal vector and curvature information separately. A convolutional neural network (CNN) then extracts high-dimensional feature vectors, which are compared by Euclidean distance with the features of the three-dimensional objects in the database to obtain a retrieval sequence. Finally, the most similar objects are selected from the sequence and registered to the scanned scene with the iterative closest point (ICP) method to complete the scene optimization. The proposed network model is tested on two benchmark data sets and shows good performance.
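The retrieval and registration steps described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this paper: the function names are hypothetical, the feature vectors stand in for CNN outputs, and the ICP shown is a plain point-to-point variant with brute-force nearest-neighbour search, which is an assumption made only for illustration.

```python
import numpy as np


def rank_by_euclidean_distance(query, database):
    """Rank database feature vectors by Euclidean distance to the query.

    query: (d,) feature vector; database: (n, d) feature matrix.
    Returns (indices sorted from most to least similar, distances).
    """
    dists = np.linalg.norm(database - query, axis=1)
    return np.argsort(dists), dists


def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~ src @ R.T + t.

    src and dst are (n, 3) arrays of corresponding points (SVD-based fit).
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation.
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def icp(src, dst, iters=20):
    """Point-to-point ICP: align src (n, 3) to dst (m, 3).

    Uses brute-force nearest neighbours, so it only suits small clouds.
    Returns the transformed copy of src.
    """
    cur = src.copy()
    for _ in range(iters):
        # Squared distances between every current point and every target point.
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        matched = dst[d2.argmin(axis=1)]  # closest target point per source point
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
    return cur
```

In this sketch the first index returned by `rank_by_euclidean_distance` would select the template model, whose sampled points would then be passed to `icp` together with the segmented scan. Like any ICP variant, it needs a reasonable initial pose to converge to the correct alignment.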