Abstract:
This research proposes a novel crowd estimation technology to help authorities make the
right decisions in times of crisis. Deep learning models have tackled the challenges of crowd
estimation with excellent results. In particular, the use of single-column Fully Convolutional
Networks (FCNs) has increased in recent years. A typical architecture that meets these
characteristics is the autoencoder. However, this model presents an intrinsic difficulty: the search for
the optimal dimensionality of the latent space. To alleviate this difficulty, we propose a
dual architecture consisting of two cascaded autoencoders. The first autoencoder is responsible
for carrying out the masked reconstruction of the original images, whereas the second obtains
crowd maps from the outputs of the first one. Our architecture improves the localization of people
and crowds on Focal Inverse Distance Transform (FIDT) maps, resulting in more accurate count
estimates than those obtained with a single-autoencoder architecture. Specifically, to
evaluate the model on the localization task, we used two decision thresholds (𝜎 = 4 and 𝜎 = 8).
For these thresholds, respectively, our model increased Precision by 36 (from 27.11% to 63.11%)
and 46.8 (from 37.26% to 84.06%) percentage points, the Recall metric by 3.05 (from 54.56%
to 57.61%) and 1.75 (from 74.98% to 76.73%) percentage points, and F1-Score by 24.02 (from
36.22% to 60.24%) and 30.45 (from 49.78% to 80.23%) percentage points. For the counting
task, the Dual Reconstructive Autoencoder (DRA) model decreased MAE and RMSE by 88.5%
and 75.18%, respectively, relative to the Single Autoencoder (SA) model (MAE: 121.73 for SA
versus 13.92 for DRA; RMSE: 127.61 for SA versus 31.67 for DRA).
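
The figures reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' evaluation code: the helper names `f1_score` and `relative_reduction` are hypothetical, and it assumes that F1-Score is the usual harmonic mean of Precision and Recall and that the MAE and RMSE decreases are relative to the SA baseline.

```python
# Illustrative check of the abstract's reported figures.
# Helper names are hypothetical; this is not the authors' evaluation code.

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    return 2 * precision * recall / (precision + recall)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100 * (baseline - improved) / baseline


# (precision, recall, reported F1), all in percent, copied from the abstract.
reported = {
    "SA,  sigma=4": (27.11, 54.56, 36.22),
    "DRA, sigma=4": (63.11, 57.61, 60.24),
    "SA,  sigma=8": (37.26, 74.98, 49.78),
    "DRA, sigma=8": (84.06, 76.73, 80.23),
}

for name, (p, r, f1_reported) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.2f}, reported = {f1_reported}")

# Counting errors copied from the abstract: (SA value, DRA value).
for metric, (sa, dra) in {"MAE": (121.73, 13.92), "RMSE": (127.61, 31.67)}.items():
    print(f"{metric}: relative reduction = {relative_reduction(sa, dra):.2f}%")
```

Under these assumptions, the recomputed F1 values agree with the reported ones to within rounding of the published Precision and Recall.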