From: Integrating multiple data sources for learning analytics—review of literature
Publication | Issue addressed in publication | Types of data sources | Types of data | Data sources integrated? | Data integration approach (manual or automatic) | Methods used | Number of participants whose records were analyzed |
---|---|---|---|---|---|---|---|
Lopez Guarin, Guzman, and Gonzalez (2015) | Predict the loss of academic status at a certain time | Student information system (× 2) | Student background information (× 2), performance test data, final grades | Yes | Automatic—first join admissions data sets into one table. Then join with academic information | Decision trees, naive Bayes | 1532 students |
Park, Yu, and Jo (2016) | Classify blended learning courses in a Korean higher education institution | LMS (× 2) | Activity log, course data | Yes | Automatic (most likely—not explicitly stated). Combine course data and log data on course ID (anonymized) | Latent class analysis | N/A (Records regarding 612 courses which were found suitable for analysis) |
Thompson, Kennedy-Clark, Wheeler, and Kelly (2014) | Automatic part-of-speech tagging of text to identify the types of micro-events that learners enact and to determine whether learners complete functions that are crucial for task success | Corpora (× 2) (both are mini-corpora of collaborative problem-based learning activities) | Text (× 2) | No (data are analyzed separately) | Data are not integrated | Part-of-speech tagger trained on Penn Treebank corpora. Visualization of timing and speaker for each utterance in one mini-corpus. | Corpus 1: 6 dyads in total (12 students + teacher). Corpus 2: four postgraduate students |
Zheng, Bender, and Nadershahi (2017) | Data were extracted from tools to provide data on faculty’s application of digital tools and to assess the impact of the lecture annotation tool on students’ learning behavior | LMS, lecture annotation tool | Activity log data (× 2) | No (data are analyzed separately) | Data are not integrated | N/A | N/A |
Pardos and Kao (2015) | Bayesian network analysis to assess students’ current and prior knowledge for problems in a MOOC (with visualization); and visualization of course structure (not based on preceding analysis) | MOOC (× 2) | Activity log, student background information (possibly more) | No, platform can currently only integrate edX MOOC data with other edX MOOC data. Platform also supports Coursera | Automatic for integrating edX MOOC data with other edX data (for Coursera MOOC data this is not addressed). Approach: use HarvardX tool to integrate different types of edX files into one CSV file (loosely based on xAPI). For visualizations: read CSV file(s) into memory | Bayesian network analysis, visualization | N/A |
Liu et al. (2017) | Examine use of an adaptive system through analysis of usage patterns | Student information system, adaptive platform, LMS, performance test | Student background information, activity log, performance test data (× 3) | Yes | N/A (publication explicitly mentions combination of data, yet does not specify how) | Spearman correlation, visualizations, regression analyses | 128 first-year students who entered a pharmacy program |
Raca, Tormey, and Dillenbourg (2016) | Compare student behaviors (levels of movement) and connect with attention (self-reported) | Video, questionnaire | Video-derived data, questionnaire data | Yes | N/A | Descriptive statistics (e.g., mean, percentage), correlations | 56 bachelor level students |
Di Mitri et al. (2017) | Predict learners’ performance during self-regulated learning | Physiological signals wristband, software tracking tool, questionnaire, weather information | Physiological arousal data, software category, questionnaire data, location data, weather data | Yes | Automatic. A tool (Learning Pulse Server) imports data from different APIs and stores events in a Learning Record Store (xAPI format) | Linear mixed effects models | 9 PhD students (the multimodal data set originally contained approximately 10,000 records) |
Ochoa et al. (2018) | Provide automatic feedback on oral presentation skills | Video, audio, presentation slide | Video-derived data, audio-derived data, presentation slide-derived data | No (data are analyzed separately) | Data/data sources are not integrated | Various classification algorithms (e.g., random forest) | 83 engineering students |
Hutt et al. (2017) | Detect mind wandering during a lecture using eye tracking | Eye tracker, questionnaire | Eye tracker data, questionnaire data | Yes | N/A | Bayesian network classifier | 32 undergraduate students from a Canadian university |
Jayaprakash, Moody, Lauría, Regan, and Baron (2014) | Detect students who are in academic difficulty | LMS, student information system | Activity log data, partial course grades, course data, student background information (× 2) | Yes | Automatic. Uses Pentaho Business Intelligence Data Integration (ETL approach) | Logistic regression, support vector machines, J48, naive Bayes | 15,150 undergraduate students |
Rodríguez-Triana, Prieto, Martínez-Monés, Asensio-Pérez, and Dimitriadis (2018) | Identify deviations between the desired learning state (based on learning design) and the actual state in blended/CSCL scenarios | LMS, wiki, online writing application, attendance list, human observation, instructional design information, questionnaire | Activity log data (× 2), attendance information, teacher comments, instructional design information, questionnaire data | Yes | Automatic (at least in part). Third-party tools were integrated into virtual learning environment (GLUE) | N/A (three binary classifiers were built to identify deviations between desired learning state and actual state) | 165 students |
Gray, McGuinness, Owende, and Hofmann (2016) | Predict at-risk students | Student information system, questionnaire, exam results | Student background information, questionnaire data, GPA | Yes | N/A | Correlations, t test/ANOVA. Classification (e.g., naive Bayes, decision trees) | 1207 first-year students (records from 2010 to 2012) |
Wang, Paquette, and Baker (2014) | Identify career path for MOOC learners | MOOC, organization member information | Student background information, questionnaire, organization member information | Yes (partly, questionnaire is analyzed separately) | N/A (most likely manual) | Chi-square, descriptive statistics | N/A (536 MOOC participants answered questionnaire) |
Mangaroska, Vesin, and Giannakos (2019) | Predict student performance | E-learning portal (× 2), Integrated Development Environment (IDE) | Performance test data, activity log data (× 3) | Yes | Automatic. System collects and aggregates data from different sources. Data are integrated in a Learning Record Store | Descriptive statistics, Spearman correlation, linear regressions, visualization | 21 (one teacher and 20 computer science students) |
Villano, Harrison, Lynch, and Chen (2018) | Examine the relationship between student retention and an early alert system (controlling for a number of variables) | Student information system, early alert system | Student background information, final grades, workload, school data (e.g., location, fee), early alert system data | Yes | Automatic. University collects and integrates data from different IT systems in a data warehouse | Survival analysis | N/A (16,142 records captured from 2011 to 2013 were analyzed) |
Wong, Kwong, and Pegrum (2018) | Examine whether an augmented reality app for integrity and ethics can help change students’ perspectives on these subject matters | AR platform, LMS | Activity log data, text (× 2) | No (data are analyzed separately) | Data/data sources are not integrated | Descriptive statistics, text analysis, visualization | N/A (1259 students participated, but not all participants’ data were included in the subsequent analyses) |
Sandoval, Gonzalez, Alarcon, Pichara, and Montenegro (2018) | Prediction of students who are at risk of failing classes | Student information system, LMS | Student background information, final grades, activity log data | Yes | Automatic. Extract data from data sources and encrypt, then re-codify some of the attributes into similar types before integrating in a relational database | Linear regressions, random forest | 21,314 students (over three semesters) |
Sun, Xie, and Anderman (2018) | Examine the effect of self-regulation on academic achievement in flipped classrooms | Questionnaire, LMS | Questionnaire data, performance test data, partial course grades | Yes | Manual. Combine grades obtained from instructors with survey data | Structural equation modeling, multi-level regression | 151 US undergraduate students |
Giannakos, Sharma, Pappas, Kostakos, and Velloso (2019) | Examine whether including physiological sensing data provides advantages for predicting skill acquisition (and more generally for the design of learning technologies) | Eye tracker, physiological signals wristband, EEG cap, video, game | Eye tracker data, physiological arousal data, EEG data, video-derived data, activity log data, performance test data | Yes | Automatic. The features for each data source were extracted separately, then data were integrated using R | LASSO regression, random forest, ANOVA | 17 participants from a major European university |
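
Several rows in the table describe automatic integration as a sequence of key-based joins, for example joining admissions data sets into one table and then joining the result with academic information, or combining course data and log data on a course ID. The following is a minimal sketch of such a two-step join, not the method of any reviewed publication. The `inner_join` helper, the `student_id` key, and all column names and values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Record = Dict[str, object]


def inner_join(left: List[Record], right: List[Record], key: str) -> List[Record]:
    """Inner-join two lists of records on a shared key column."""
    # Index the right-hand table by key so each lookup is constant time.
    index: Dict[object, List[Record]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined: List[Record] = []
    for row in left:
        # Rows without a match in the right-hand table are dropped.
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Hypothetical toy data (invented for illustration).
admissions_background = [
    {"student_id": 1, "age": 18},
    {"student_id": 2, "age": 19},
    {"student_id": 3, "age": 18},
]
admissions_tests = [
    {"student_id": 1, "entry_score": 71},
    {"student_id": 2, "entry_score": 64},
]
academic_records = [
    {"student_id": 1, "final_grade": 4.1},
    {"student_id": 2, "final_grade": 2.9},
    {"student_id": 4, "final_grade": 3.5},
]

# Step 1: join the admissions data sets into one table.
admissions = inner_join(admissions_background, admissions_tests, "student_id")
# Step 2: join the combined admissions table with the academic information.
integrated = inner_join(admissions, academic_records, "student_id")

for row in integrated:
    print(row)
```

An inner join keeps only students present in every source, so students 3 and 4 drop out here. Whether unmatched records should be dropped or kept with missing values (an outer join) is a design choice that depends on the analysis; the table does not state which the reviewed studies used.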