Data Analysis in the Me-Mind project

Article by UniPi, Department of Computer Science. Author: Fernando De Nitto

Introduction

The purpose of the ME-MIND project is to enable small and medium-sized cultural and creative industries to evaluate their local impact through the collection, manipulation, cleaning and subsequent analysis of data. The two use cases of the project, the Estonian National Museum and the Internet Festival, differ greatly in their nature, their territorial extension, the way they are disseminated and the time span in which they operate. In this post we describe the main characteristics of the two use cases, which help to explain the different approaches used in the data analysis, as well as the challenges and opportunities we faced while working with them.

The use cases

The Estonian National Museum, based in Tartu, is the reference point for the entire country of Estonia regarding Estonian history, customs, and traditions. In addition to the permanent exhibition, the museum hosts temporary exhibitions and events every year. The museum is also an avant-garde model in terms of the logistics and technologies offered to visitors, who can, for example, interact with electronic ink devices to obtain information in multiple languages.

The Internet Festival is a four-day Italian national event based in Pisa, organised by Fondazione Sistema Toscana every year in October. The festival is made up of hundreds of events scattered across various locations within the city of Pisa. Although the events deal with substantially different themes, every year they are all linked to a keyword that acts as the common thread of the entire edition.

Data collection: similarities, differences, challenges and opportunities in the two use cases

The profound diversity of the two use cases defines the challenges of the data analysis and, more generally, represents a challenge for the project, which must find a common pattern to propose to other small and medium-sized cultural and creative organisations.

The collection of the data itself was the first challenge of the data analysis for both use cases. Initially, the two use cases collected their internal data, such as financial data, data on their attendees and anything else that could be extracted from Excel files or internal databases. In the case of the Estonian National Museum, we also extracted technical data from the museum devices, which record the interactions of individual tickets with the devices. Furthermore, we collected external data. In the case of the Internet Festival, we obtained external data from various enterprises, bodies and organisations operating in the city of Pisa, while other data came from the questionnaires distributed to the audience during the 2021 festival edition. Finally, for both use cases, we adopted data extraction techniques, in particular on the TripAdvisor platform. Once all the data were collected, we reflected on which analyses were the most useful and, consequently, what deductions could be made. Three main questions guided our data analysis:

Is it possible to get useful information about the visitors using questionnaires?
Is it possible to correlate internal data with external data in order to better observe how the two use cases are impacting their local environment?
Can purely technical data say something about visitor behaviour within a museum?

Although the answer to all three questions is yes, it is worth looking in detail at the difficulties, the limits and, most importantly, the lessons we learned. The same data analyses were performed for both use cases, except where substantial differences in the nature of the use cases prevented it. A subset of the results is shown below in the form of examples, which also illustrate some of the difficulties encountered.

Internet Festival

For the Internet Festival, the most important data analysis concerned the data collected from external sources, such as:

Hotel attendance data, extracted from the TripAdvisor platform
Data on the flow of car traffic, provided by the Tuscany Region
The questionnaires filled in by the participants in the event
Attendance data provided by the museums of the city of Pisa

The most important statistical indices of centrality and variability (mean, median, variance and standard deviation) were calculated for the quantifiable attributes of interest across all the datasets. Mean and median were often of great help in estimating the central tendency of the data for different periods within the years under review. The statistical indices calculated for the reference attributes were then compared with the values of the same attributes during the period of the Internet Festival, and the percentage variation between the general trend (given by the centrality indices) and the values during the festival period was calculated. The results allowed us to evaluate the presence of trends linked to the festival period. In almost all cases the resulting trend was positive: the data show that the quantities taken into consideration grow when the event takes place. For example, as shown in the following figure, over the period from 2016 to 2021 we can see an increase in hotel reviews in the city of Pisa during the Internet Festival week.

Weekly Reviews of Hotels in Pisa from 2016 to 2021
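The comparison between the general trend and the festival period can be sketched in a few lines of Python. This is only a minimal illustration of the idea: the function name `percentage_variation` and the weekly counts are hypothetical and are not project data.

```python
from statistics import mean, median

def percentage_variation(baseline, festival_value, index=median):
    """Percentage change of the festival-period value relative to the
    central tendency (median by default, or mean) of the baseline periods."""
    centre = index(baseline)
    if centre == 0:
        raise ValueError("baseline central value is zero; variation undefined")
    return (festival_value - centre) / centre * 100.0

# Hypothetical weekly counts of hotel reviews (illustrative only).
ordinary_weeks = [40, 50, 45, 55, 60]
festival_week = 60

print(percentage_variation(ordinary_weeks, festival_week))        # against the median
print(percentage_variation(ordinary_weeks, festival_week, mean))  # against the mean
```

A positive result indicates that the festival week sits above the usual level. Using the median rather than the mean makes the baseline less sensitive to a few exceptional weeks.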

To observe tourism trends, we scraped data from TripAdvisor rather than using the open data provided by public bodies, because the level of granularity of the open data was not sufficient for our analysis. Another interesting example is given by the questionnaires. As shown in the next figure, they made it possible to identify the gender distribution of the visitors who participated in the 2021 edition of the Internet Festival.

Visitors’ Gender Distribution during Internet Festival 2021

The last example concerns the data obtained from the traffic sensor at the entrance of the city of Pisa. The dataset, provided by the Tuscany Region, gave us the opportunity to evaluate the presence of trends in the flow of traffic during the week of the Internet Festival. In this case, only private cars were taken into consideration, not commercial vehicles.

Results of Traffic Sensor Data Analysis

The most significant difficulty in this type of data analysis was the manipulation of the various datasets, which differed both in the information they contained and in the technology used to produce them. For example, each museum in the city of Pisa uses its own management system, so it was necessary to homogenise all their datasets in order to obtain a single dataset that could be used for analysis (as already described in the article “Acquiring data from external organisations”).
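As an illustration of what homogenising such datasets can involve, the sketch below maps rows from two differently structured sources onto one common schema. The source names, column names and date formats are invented for the example and do not describe the actual systems used by the museums.

```python
from datetime import datetime

# Hypothetical column mappings: each source system names its fields differently.
FIELD_MAPS = {
    "museum_a": {"date": "giorno", "visitors": "ingressi", "date_format": "%d/%m/%Y"},
    "museum_b": {"date": "visit_date", "visitors": "tickets_sold", "date_format": "%Y-%m-%d"},
}

def homogenise(source, rows):
    """Convert rows from one source system into the common schema
    {museum, date, visitors}."""
    spec = FIELD_MAPS[source]
    unified = []
    for row in rows:
        unified.append({
            "museum": source,
            "date": datetime.strptime(row[spec["date"]], spec["date_format"]).date(),
            "visitors": int(row[spec["visitors"]]),
        })
    return unified

# Illustrative rows only.
museum_a_rows = [{"giorno": "08/10/2021", "ingressi": "120"}]
museum_b_rows = [{"visit_date": "2021-10-08", "tickets_sold": 95}]

combined = homogenise("museum_a", museum_a_rows) + homogenise("museum_b", museum_b_rows)
print(combined)
```

Keeping the mapping in a single table means that adding a new source only requires a new entry, not new conversion code.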

Estonian National Museum

For the Estonian National Museum use case, we focused on analysing the technical data obtained from the internal logs of the interactive devices installed inside the museum. Museum visitors can use their ticket to view a lot of information in different languages, and the devices record every interaction for each individual ticket. Each record in the dataset refers to a single interaction of a single ticket with a particular device. One difficulty with this part of the data analysis was the format of the data itself: most of the technical data recorded by the devices was in the form of codes. Apart from the timestamps, most of the codes were unusable for data analysis purposes as they stood. Fortunately, thanks to the correspondence tables provided by the Estonian National Museum, it was possible to derive the following data from the purely technical data:

Date and time of the interactions of the single ticket with a specific device.
Language used by the single ticket in interacting with the device.
Position within the museum visited during the interaction of a ticket with a specific device.
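Decoding a raw log record with correspondence tables amounts to a series of lookups. The sketch below assumes a hypothetical record layout and invented codes. The real device logs and the museum's tables are not reproduced here.

```python
from datetime import datetime

# Hypothetical correspondence tables; the real codes and labels differ.
LANGUAGE_CODES = {"01": "Estonian", "02": "English", "03": "French"}
DEVICE_POSITIONS = {"D17": "Hall A", "D42": "Hall B"}

def decode_record(raw):
    """Turn one raw log record into the derived fields:
    timestamp, language used and position in the museum."""
    return {
        "ticket": raw["ticket"],
        "timestamp": datetime.fromisoformat(raw["ts"]),
        "language": LANGUAGE_CODES.get(raw["lang"], "unknown"),
        "position": DEVICE_POSITIONS.get(raw["device"], "unknown"),
    }

# Illustrative record only.
raw_record = {"ticket": "T1", "ts": "2021-06-01T10:15:00", "lang": "03", "device": "D17"}
print(decode_record(raw_record))
```

Unknown codes are mapped to "unknown" instead of raising an error, so that unusable records can be counted and inspected later.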

Once these data were derived, for each ticket, it was possible to calculate and derive information about the presumed behaviour of visitors inside the museum. In particular, it was possible to extrapolate the following information:

The time spent by a single ticket inside the museum.
The most requested and visited topics within the permanent exhibition of the museum.
The path inside the museum of every single ticket.

The extrapolated information is, of course, only an estimate, because everything described is limited to the use of the interactive devices inside the museum. For example, the time a visitor spent inside the museum could be greater than our estimate, but it cannot be less: a visitor may have continued the visit without interacting with other devices after the last recorded interaction. The same applies to the route followed by a single ticket. For the analysis of the time spent inside the museum, it was possible to calculate not only general metrics on all visitors but also metrics for each language used. Language is itself an approximate indicator of the origin of visitors. The following figure shows the distribution of time spent inside the museum by French-speaking visitors.

Me-Mind data analysis - Histogram of time spent distribution for French language at ENM
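The lower-bound estimate of the time spent can be sketched as the difference between the last and the first recorded interaction of each ticket. The function name and the sample timestamps below are illustrative assumptions, not project code or data.

```python
from datetime import datetime

def time_spent_lower_bound(interactions):
    """Map each ticket to (last interaction - first interaction).

    `interactions` is an iterable of (ticket, timestamp) pairs. The result
    is a lower bound: the visit may have started earlier or continued
    after the last recorded interaction."""
    first_last = {}
    for ticket, ts in interactions:
        if ticket not in first_last:
            first_last[ticket] = (ts, ts)
        else:
            earliest, latest = first_last[ticket]
            first_last[ticket] = (min(earliest, ts), max(latest, ts))
    return {ticket: latest - earliest
            for ticket, (earliest, latest) in first_last.items()}

# Illustrative interactions only.
sample = [
    ("T1", datetime(2021, 6, 1, 10, 0)),
    ("T1", datetime(2021, 6, 1, 10, 45)),
    ("T2", datetime(2021, 6, 1, 11, 0)),
]
print(time_spent_lower_bound(sample))
```

A ticket with a single interaction yields a duration of zero, which shows how conservative the estimate is.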

The analysis of the topics that received the most interactions within the museum is shown (partially) in the following figure. The data were grouped by single device, and for each device the unique interactions during the whole year covered by the dataset were counted.

Me-Mind data analysis - Attractions and devices at ENM
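Grouping by device and counting unique interactions could look like the sketch below. It assumes that "unique" means one count per ticket per device, which is an interpretation and not something the dataset description confirms. The identifiers are invented.

```python
from collections import defaultdict

def unique_tickets_per_device(interactions):
    """Count how many distinct tickets interacted with each device.

    `interactions` is an iterable of (ticket, device) pairs; repeated
    interactions of the same ticket with the same device count once."""
    seen = defaultdict(set)
    for ticket, device in interactions:
        seen[device].add(ticket)
    return {device: len(tickets) for device, tickets in seen.items()}

# Illustrative interactions only.
sample = [("T1", "D17"), ("T1", "D17"), ("T2", "D17"), ("T2", "D42")]
counts = unique_tickets_per_device(sample)

# Devices ordered from most to least visited.
for device, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(device, n)
```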

Finally, by cross-referencing the ticket and device data, we calculated the path that each ticket followed inside the museum. Dividing the tickets by language made it possible to show how behaviour within the museum differs according to the language used to query the devices. An example of these paths, divided by language, is shown in the following figure designed by our partner Domestic Data Streamers (see our page Experience data).

Me-Mind data analysis: data visualisation of visitors' path according to their language
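The reconstruction of a path can be sketched as sorting the interactions of each ticket by time and reading off the positions in order. The code below is a simplified illustration: it assumes one language per ticket, and all names and values are hypothetical.

```python
from collections import Counter, defaultdict

def ticket_paths(records):
    """Build the ordered path of positions for each ticket.

    `records` is an iterable of (ticket, timestamp, position) tuples.
    Consecutive interactions at the same position are collapsed."""
    by_ticket = defaultdict(list)
    for ticket, ts, position in records:
        by_ticket[ticket].append((ts, position))
    paths = {}
    for ticket, visits in by_ticket.items():
        path = []
        for _, position in sorted(visits, key=lambda visit: visit[0]):
            if not path or path[-1] != position:
                path.append(position)
        paths[ticket] = tuple(path)
    return paths

def paths_by_language(paths, ticket_language):
    """Count how often each path occurs for each language."""
    grouped = defaultdict(Counter)
    for ticket, path in paths.items():
        grouped[ticket_language[ticket]][path] += 1
    return grouped

# Illustrative records only; timestamps are simplified to integers.
records = [("T1", 3, "Hall C"), ("T1", 1, "Hall A"), ("T1", 2, "Hall A"),
           ("T2", 1, "Hall B")]
languages = {"T1": "French", "T2": "English"}

paths = ticket_paths(records)
print(paths)
print(dict(paths_by_language(paths, languages)))
```

Counts of this kind are only the input for a visualisation, which remains the most readable way to compare routes across languages.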

One difficulty in analysing the data of this use case was the presence of some records with values out of the ordinary, certainly due to a few incorrect recordings by the devices. These records were simply discarded after calculating confidence intervals in line with the context of the use case.
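The exact intervals used are not detailed here. As one possible illustration, the sketch below keeps only the values that lie within a chosen number of standard deviations from the mean. The threshold `k` and the sample durations are assumptions made for the example.

```python
from statistics import mean, stdev

def discard_out_of_range(values, k=2.0):
    """Keep only the values within mean +/- k sample standard deviations.

    This is one simple way to define an acceptance interval; the
    threshold k is an assumption and should be chosen per use case."""
    centre = mean(values)
    spread = stdev(values)
    lower, upper = centre - k * spread, centre + k * spread
    return [v for v in values if lower <= v <= upper]

# Illustrative visit durations in minutes; 600 simulates a faulty recording.
durations = [30, 35, 40, 45, 50, 38, 42, 600]
print(discard_out_of_range(durations))
```

Because extreme values inflate the standard deviation themselves, an interval based on the median or on quantiles may be more robust when the faulty records are numerous.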

Conclusion

It’s worth saying a few words about the difficulties we encountered during the data collection and analysis process, in order to provide some suggestions for speeding up the entire data analysis process and making it scalable.

1. Try to make your internal datasets homogeneous from the beginning

In the analysis of the internal data provided by the two use cases, one of the major difficulties was the profound difference in the formats of the data. While the nature of the data is impossible to change, it is possible to think from the outset about the format the data should have. Having the same data format, for example Excel sheets or JSON, speeds up data analysis and any further manipulation. A common format also makes the process scalable, so that more complex data analyses can be performed. Different data formats require a conversion process, which is not always trivial. Standardising the data format from the beginning will therefore facilitate the internal data analysis process.

2. Turn everything into a source of data

In the previous paragraphs we have seen how, in the case of the Estonian National Museum, we were able to extrapolate information on the behaviour of museum visitors from purely technical data recorded by the devices inside the museum. This result suggests an important piece of advice for the entire cultural industries chain: use everything you have available as a source of data. For example, when organising an event, the registration process can be turned into a data acquisition process in order to directly evaluate trends among the participants or obtain other information. In this way it is possible to collect data indirectly and obtain part of the information you would otherwise get with questionnaires.

3. Never underestimate networking with other organisations in the area

The collection of external data around the Internet Festival led us to get in touch with multiple organisations in the Tuscan territory. In some cases, getting the data from these organisations was difficult, and in others impossible. In most cases, however, we obtained a lot of data from organisations willing not only to provide the data but also to view the results of our analysis. Establishing strong relationships with other organisations brings multiple advantages.

4. Don’t confuse public data with open data

Finally, do not confuse public data with open data. Data that is publicly available (free) is not always open data. A fundamental characteristic of open data is usability, which depends on the specific case: data that can be used for one purpose may not be usable for another. Our advice is to evaluate immediately whether the open data you have available are usable for your purposes. If they are not, you have to find valid alternatives.

5. If you can’t read it, try to view it!

As regards the analysis of ticket routes, the visualisation of the data (designed by Domestic Data Streamers) was fundamental in obtaining useful information. The numerical data alone could not be of any help, since ordinary statistical measures cannot explain the spatial movement of a visitor inside the museum. The suggestion in this case is not to give up if some data are incomprehensible because of the large number of entries: sometimes colours explain the data better than numbers! The difficulties explained and solved so far, although interesting within the single use case, represent a general lesson applicable to any context: every kind of data, if properly used, is a gold mine of information!
