Joint attention behaviour in remote collaborative problem solving: exploring different attentional levels in dyadic interaction

The current article describes an exploratory study that focussed on joint attention behaviour—the basis of interaction predicting productive collaboration—to better understand collaborative problem solving, particularly its social aspects during remote dyadic interaction. The study considered joint attention behaviour as a socio-linguistic phenomenon and relied on detailed qualitative interaction analysis on event-related measures of multiple observational data (i.e. log files, eye-tracking data). The aim was to illustrate and exemplify how the diverse attentional levels of joint attention behaviour (i.e. monitoring, common, mutual and shared attention) delineated by Siposova and Carpenter (Cognition 89:260–274, 2019) were achieved in remote collaborative problem solving in dyads, including the underlying basis of joint attention behaviour (i.e. individual attention experience). The results made visible the complex functioning of the social aspects of remote collaborative problem solving and provided preliminary insights into how the hierarchical and nested levels of ‘jointness’ and common knowledge were achieved in this context. The analysis reproduced all the theorised attentional levels as both isolated and parallel individualistic attention experiences whilst acknowledging the restrictions of the remote interaction environment and the specific task structures.


Introduction
The current article describes an exploratory study focussing on joint attention behaviour (JAB; e.g. Carpenter & Liebal, 2012; Eilan, 2005; Eilan, Hoerl, McCormack, & Roessler, 2005; Mundy, 2013, 2018; Mundy & Newell, 2007; O'Madagain & Tomasello, 2019; Siposova & Carpenter, 2019; Tomasello, 1995) in dyadic interaction (i.e. interactions between two participants). By focussing on JAB, the aim is to better understand collaborative problem solving (CPS), especially its social aspects during remote CPS. Based on the socio-cognitive approach to learning, CPS is seen to lie in a two-dimensional space of social and cognitive domains that intermingle in the processes of attention can be an interesting proxy. That is, for example, for evaluating the quality of social interactions, as well as a basis for further analysing the data, such as by qualitative means.
In addition to gaze following, JAB includes the coordination aspect of joint attention and the sharing of attention (Carpenter & Liebal, 2012; Tomasello, 1995). In JAB's richest definition, individuals must equally recognise that they are attending to the same thing (O'Madagain & Tomasello, 2019; Siposova & Carpenter, 2019; Tomasello, 1995). Thus, following Carpenter and Liebal (2012), it is only communication that 'turns mutually experienced event into interaction, into something joint' (p. 168). Appropriately, to be successful, CPS not only necessitates the lower attentional levels that can be found by analysing visual joint attention (see, e.g. Liu et al., 2021; Schneider & Pea, 2013, 2014; Schneider et al., 2016, 2018) but also requires considering joint attention to 'internal', mental content (O'Madagain & Tomasello, 2019). This represents 'the ability to focus together in the conversation on the content of our mental states' (O'Madagain & Tomasello, 2019, p. 1). By the contents of the mental states, O'Madagain and Tomasello (2019) meant, for example, the contents of any thoughts, plans, beliefs or reasons. Achieving visual joint attention to external content is considered a perceptual phenomenon. However, in joint attention to mental content, it is the linguistic exchanges that are perceptible, and when attending to those exchanges by monitoring one another's attention and the partner's reaction to these communicative acts, the partners jointly attend to mental contents (O'Madagain, 2016; O'Madagain & Tomasello, 2019).
There are multiple definitions of and ways to use the term 'joint attention', varying from visual joint attention to joint attention to mental contents (O'Madagain & Tomasello, 2019). Siposova and Carpenter (2019) argued that joint attention should not be considered a single state or binary event (i.e. there is or is not jointness). Instead, it should be viewed as a process comprising various, hierarchically nested, and closely connected but distinct phenomena that can be discovered in the related literature, all referred to as joint attention (see Eilan et al., 2005; Mundy, 2018; Seemann, 2012a). At the surface level, definitions may sound similar, but when elaborated on in more detail, significant differences can emerge among them. Accordingly, Siposova and Carpenter (2019) have developed a spectrum of 'jointness', described as 'a typology of social attention and social knowledge' (p. 261) that aims to cover the diversity of the definitions that all include the notion of a triadic relationship between self, other and an object of attention. The typology also defines distinctive levels of knowledge related to the different levels of jointness as individual, common, mutual or shared knowledge. Moreover, according to Siposova and Carpenter (2019), these levels are distinctive in terms of the participant's perspective (i.e. second- and third-person perspectives; see also, e.g. Moore & Barresi, 2017) and the type of knowledge related to a particular attentional level. They also differ in terms of the level of dependency between partners, as well as the level of experience (i.e. individual or jointly created). An essential precondition for each of the four levels of social attention is the individual's ability to engage in individual attention. This refers to the situation in which an individual is attending to something in the environment with a first-person perspective.
Joint attention (whether to external entities, situations or involving communicative acts) is closely connected to collaboration and reflective reasoning with others (O'Madagain & Tomasello, 2019), representing the core elements of CPS. Accordingly, this study takes the typology of jointness by Siposova and Carpenter (2019) as a promising conceptual 'lens' to better understand and exemplify CPS process diversity, particularly regarding social aspects of remote CPS.
When focussing on CPS processes, the study takes the unique properties of the remote, game-like CPS assessment environment (Assessment and Teaching of 21st Century Skills [ATC21S]; e.g. Care, Griffin, & Wilson, 2018; Care, Scoular, & Griffin, 2016; Scoular et al., 2017) as its point of departure. ATC21S was one of the pioneering international projects in exploring CPS competency for assessment and teaching purposes (e.g. Care et al., 2018; Care et al., 2016; Griffin, McGaw, & Care, 2012; Scoular et al., 2017). The CPS tasks of the ATC21S environment have been designed for dyads following a comprehensive CPS framework by Hesse, Care, Buder, Sassenberg, and Griffin (2015; see also Care et al., 2016; Scoular & Care, 2020; Scoular et al., 2017). The framework of CPS covers both social and cognitive elements of the CPS construct (cognitive, social and regulatory aspects), and it amalgamates theoretical knowledge from social psychology and problem solving. In brief, the framework involves three main strands of social elements (i.e. participation, perspective taking, social regulation) and two main strands of cognitive elements (i.e. task regulation, knowledge building), which are all further divided into sub-elements (19 elements in total; Hesse et al., 2015). In CPS, the social elements are related to how participants coordinate and communicate with one another (e.g. Clark & Brennan, 1991; Richardson, Dale, & Kirkham, 2007), which is considered particularly important in synchronous collaboration (Baker, 2015), the context of this study. In addition, coordination is fundamental in establishing mutual knowledge or common ground (e.g. Clark & Brennan, 1991). Yet, according to Barron (2000), this can be challenging for the partners in problem-solving discussions because of the often new and indefinite goals, different ideas and terms, as well as their relations.
The social aspects are also related to how the partners regulate and resolve differences among the collaborating participants (e.g. Hadwin, Järvelä, & Miller, 2018). The cognitive elements, in turn, are related to how effectively and efficiently participants solve the problem (e.g. Mayer, 1992, 1998). The designed ATC21S tasks, based on the framework, both enhance and require CPS elements to occur (e.g. Care et al., 2016; Hesse et al., 2015; Scoular et al., 2017). Thus, the tasks aim to encourage the student to collaborate with another student, and the collaborative tasks are designed to stimulate and elicit the social and cognitive elements of the framework. To succeed in CPS task completion, the tasks require varied knowledge, expertise and skills, both in terms of social and cognitive processes. Taken together, the underlying objective of CPS and the task designs are related to bringing about the continued attempts of participants to acquire a shared understanding of a problem or challenge (Roschelle & Teasley, 1995). This can, via engaging in the peer- or group-level process (e.g. Sinha, Kempler Rogat, Adams-Wiggins, & Hmelo-Silver, 2015), produce learning. According to Dillenbourg, Lemaignan, Sangin, Nova, and Molinari (2016), this can be referred to as 'the upper class of collaborative learning' (p. 228), which requires a high level of joint attention, for example, to a task-related object or an aspect of the problem (Baker, 2015).
To better understand JAB in dyadic interactions during CPS, the gaze behaviour of the partners is a significant resource. To examine gaze patterns, in eye-tracking studies, the predominant focus has been on the overall looking times at predefined areas of interest (AOIs; spatial information as 'where' questions; see, e.g. de Leeuw, Segers, & Verhoeven, 2016; Hautala et al., 2019; Liu et al., 2021). Moreover, to study the 'when' question of eye gazing, cross-recurrence plots (e.g. Marwan & Kurths, 2002; Richardson & Dale, 2005) have been commonly used to study joint attention to external contents, such as visual joint attention or gaze alignment. Cross-recurrence is a general measure that quantifies the similarity or the coupling between two dynamical systems (Nüssli, Jermann, Sangin, & Dillenbourg, 2013; Richardson & Dale, 2005). When studying collaborative learning in remote and co-located dual eye-tracking situations, the cross-recurrence plots (Jermann, Mullins, Nüssli, & Dillenbourg, 2011) and augmented cross-recurrence plots (Schneider et al., 2016, 2018), for example, have been particularly capable of visualising the temporal evolution of gaze behaviour in achieving visual joint attention. Yet, when analysing joint attention to mental contents, often related to the higher attentional levels of JAB, including the contents of social interaction as the primary source (e.g. Falck-Ytter, Bölte, & Gredebäck, 2013; Holler & Kendrick, 2015), this analytic approach may not be sufficient. Therefore, exploring JAB as a socio-linguistic phenomenon and combining 'where' participants look with 'when' they look at the AOIs (i.e. the timing of gazing) in the interactional sequences (see Korkiakangas, 2018) is more suitable here. Thus, event-related measures focussing on the interactional organisation of gaze are more informative about what makes some instances of gazing 'social' (e.g. Dindar, Korkiakangas, Laitila, & Kärnä, 2017; Korkiakangas, 2018; Tuononen, Korkiakangas, Laitila, & Kärnä, 2016).
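To make the notion of cross-recurrence more concrete, the following minimal sketch builds a categorical cross-recurrence matrix from two AOI-coded gaze sequences and reads off the recurrence rate at a given lag. It illustrates the general technique only and is not the analysis performed in this study; the AOI labels, the sample sequences and the function names are hypothetical.

```python
from typing import List, Sequence


def cross_recurrence(a: Sequence[str], b: Sequence[str]) -> List[List[int]]:
    """Return matrix R with R[i][j] == 1 when partner A's AOI at sample i
    equals partner B's AOI at sample j, and 0 otherwise."""
    return [[1 if ai == bj else 0 for bj in b] for ai in a]


def recurrence_at_lag(matrix: List[List[int]], lag: int) -> float:
    """Proportion of recurrent points on the diagonal where j - i == lag.
    A positive lag means B looks at an AOI `lag` samples after A did."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    hits = 0
    total = 0
    for i in range(n_rows):
        j = i + lag
        if 0 <= j < n_cols:
            total += 1
            hits += matrix[i][j]
    return hits / total if total else 0.0


# Hypothetical AOI-coded gaze samples (one label per sample) for two partners.
gaze_a = ["chat", "clown", "clown", "balls", "chat"]
gaze_b = ["clown", "chat", "clown", "clown", "balls"]

crp = cross_recurrence(gaze_a, gaze_b)
lag0 = recurrence_at_lag(crp, 0)  # simultaneous looking at the same AOI
lag1 = recurrence_at_lag(crp, 1)  # B follows A by one sample
```

In this invented example, the partners rarely look at the same AOI at the same moment, whereas B's gaze matches A's gaze one sample later at every point. A measure of this kind captures gaze coupling but says nothing about the content of the accompanying chat, which is why an event-related, qualitative reading is needed for the higher attentional levels.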
In the current study, with the challenging dynamic scene of the remote environment, there are multiple eye-gaze behaviours linked to JAB, such as gazing at the chat window, the actionable artefacts and the instructions. Although gaze is not similarly organised into sequences as verbal interaction is, it is organised according to the actions it performs (e.g. Chepinchikj, 2020). Therefore, in the current study, it is expected that focussing on the gaze patterns in parallel with the interactional sequences of the communicating partners identified from the log data will help us go beyond these sequences and better identify behaviours related to JAB here.
To explore and identify behaviours related to JAB in remote CPS and related meaningful events, qualitative interaction analysis (e.g. Valde, 2017) is applied based on multiple observational data (log files, eye-tracking data). The remote ATC21S environment utilised here includes dynamic stimuli (i.e. actionable artefacts) and a chat property designed for free-flowing written interaction in dyads as the communication affordance. Whilst the automatically generated log files as chat and actions of interacting dyads incorporate multiple pieces of information from joint processes (Graesser et al., 2018), to make visible the typology of jointness as defined by Siposova and Carpenter (2019), the eye-gaze patterns of the individual partners are also identified as significant.
Background: a typology of 'jointness' in joint attention behaviour
As a recent viewpoint to better understand the complexity related to JAB, Siposova and Carpenter (2019) proposed a typology of social attention and social knowledge, understood as a process of closely connected yet distinct phenomena (for the diversity of definitions of joint attention, see, e.g. Eilan et al., 2005; Mundy, 2018; Seemann, 2012a). The typology comprises four attentional 'states' or 'levels' (basic components of JAB): monitoring, common, mutual and shared. All four include the notion of a triadic relationship between self, other and an object of attention (for an overview on attentional states, see Fig. 1). The typology also defines distinctive levels of knowledge related to the different levels of jointness as individual, common, mutual or shared knowledge. Moreover, according to Siposova and Carpenter (2019), these levels are distinctive in terms of the participant's perspective (i.e. second- and third-person perspectives; see also Moore & Barresi, 2017) and the type of knowledge related to a particular attentional level. They also differ in terms of the level of dependency between partners and level of experience (i.e. individual or jointly created).
An essential precondition for each of the four levels of social attention is the individual's ability to engage in individual attention. This refers to the situation in which an individual is attending to something in the environment with a first-person perspective; if compared to all the other levels in the scale of jointness (from monitoring to sharing of attention), this type of interaction is not triadic but dyadic (i.e. a relationship between self and an object of attention); thus, the knowledge level is also individual.
The first level in the spectrum of jointness is called monitoring attention (Siposova & Carpenter, 2019). This refers to a situation in which an individual takes an observer's perspective on a second individual involved, and in this way, attends to the same matter as the partner. At this level, the participants have individual knowledge of the situation, and their attention levels are independent. At the same time, an individual has knowledge that the other participant is paying attention to the same object or situation. Nevertheless, although both individual participants simultaneously monitor each other's attention to the object or situation, they still assess the attention and knowledge states of the other participants individually. Often, monitoring behaviour is observable, such as turning one's gaze or bodily orientation, but such behaviour can also be present without easily noticeable actions. At this level, the knowledge type is individual in nature.
At the second level, common attention, two individual participants take an observer's perspective, and nearly simultaneously, attend to what the other is focussed on (Siposova & Carpenter, 2019). Here, individuals not only attend to the same object or situation but also attend to each other's attention to the object or situation. Engaging in common attention requires the object of attention to be pronounced and marked; that is to say, the participants can both assume that they are attending to the same object or situation. In addition, they have a reason to consider other participants' attention; for example, they have a predefined common goal to be achieved, and in this respect, for both participants, the other individual's attention is relevant. As Siposova and Carpenter (2019) pointed out, 'under these conditions individuals could know they are attending to each other's attention without any contact or communication' (p. 262). Thus, the dependency of the other at this attention level is based on the awareness that they are both engaging in the same attention processes. Yet, notably, the evaluation of whether they are in common attention is based on an individual's perspective, and thus, it may not be correct. The knowledge level is defined as common.
According to Siposova and Carpenter (2019), at the third and fourth levels of social attention, mutual and shared attention, the observer's attitude towards the other and their attention no longer exists (i.e. a third-person experience), but the experience is based on direct commitment to the other, where the participants are both senders and receivers of the information (i.e. a second-person experience; see also Zahavi, 2015). Through direct social interaction, each participant becomes a 'constituent part' of the experience of the other (Zahavi, 2015), and attention to an object or situation is coloured by mutual awareness of each other's attention (Siposova & Carpenter, 2019). This bidirectional nature makes the experience different if compared with monitoring and common attention levels that are individualistic (Siposova & Carpenter, 2019). Thus, in mutual attention, the participants are more or less simultaneously attending to the same object or situation but not necessarily communicating intentionally (Siposova & Carpenter, 2019). If compared with common attention, at this level, their experience is co-created and the type of knowledge is mutual.
The fourth level of social attention, shared attention, meets the qualifications of mutual attention, but this level also requires the participants to deliberately communicate with each other about the object or situation and/or the fact that they are sharing attention to it (Siposova & Carpenter, 2019). Thus, what makes shared attention different if compared with mutual attention is its intentional nature. Shared attention is characterised by behaviours in which individual participants verify to each other that they are attending to the same object or situation; such behaviours are not necessarily verbal actions. The behaviours can also take the form of 'communicative' and sharing looks (Carpenter & Liebal, 2012) or gestures, such as pointing and showing (Siposova & Carpenter, 2019). Here, the type of knowledge is shared.
To conclude, both the precondition of JAB (individual attention) and the lower attention levels (monitoring and common) include third-person perspectives; that is, the participants are individually attending to something or to the same thing. The two higher attention levels (mutual and shared attention) include a second-person relation, which means that the participants are jointly attending to the same thing (for an overview of the sliding scale of jointness, as Siposova & Carpenter, 2019, call it, see Fig. 1).
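As a compact summary, the typology described above can be written down as a small lookup structure. The sketch below is one possible encoding, assuming the attribute values stated in this section; the class, field and function names are hypothetical and are not part of Siposova and Carpenter's (2019) typology.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    INDIVIDUAL = 0  # precondition of JAB, not itself a level of social attention
    MONITORING = 1
    COMMON = 2
    MUTUAL = 3
    SHARED = 4


@dataclass(frozen=True)
class LevelProfile:
    relation: str                    # "dyadic" (self, object) or "triadic" (self, other, object)
    perspective: str                 # "first", "third" or "second" person
    knowledge: str                   # type of knowledge at this level
    intentional_communication: bool  # deliberate communication required?


# Values follow the prose description in this section. Individual attention is
# coded here with the first-person perspective named in its own description.
PROFILES = {
    Level.INDIVIDUAL: LevelProfile("dyadic", "first", "individual", False),
    Level.MONITORING: LevelProfile("triadic", "third", "individual", False),
    Level.COMMON: LevelProfile("triadic", "third", "common", False),
    Level.MUTUAL: LevelProfile("triadic", "second", "mutual", False),
    Level.SHARED: LevelProfile("triadic", "second", "shared", True),
}


def is_jointly_created(level: Level) -> bool:
    """Only the second-person levels involve a co-created experience."""
    return PROFILES[level].perspective == "second"


joint_levels = [lvl.name for lvl in Level if is_jointly_created(lvl)]
```

Ordering the levels as integers mirrors the 'sliding scale' reading of the typology, in which each level presupposes the ones below it.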

Research questions
In this study, relying on multiple observational data, remote CPS processes in dyads are studied in relation to one of the central elements in social interaction, JAB, and its different attentional levels. The following questions are posed:
1. How are the different attentional levels of 'jointness' and common knowledge in JAB achieved in dyadic interaction in a remote CPS context?
2. Are some attentional levels more evident or valuable with regard to productive CPS processes?

Participants and procedure
This study was an explorative pilot pertaining to a 4-year project investigating CPS with process-orientation and multimodal data. The data were collected in a live eye-tracking situation (e.g. Dindar et al., 2017; Korkiakangas, 2018) from two student dyads (one all-male, one all-female dyad) recruited from an initial teacher education programme in a Finnish university. The students knew each other before the recorded CPS session. Participation in the study was voluntary, and in return for their input, participants were rewarded with a cinema ticket.

Eye-tracking setup
During the experiment, the members of the dyads were physically situated in separate cognitive labs. Whilst completing the CPS tasks in dyads, their eye movements were recorded with desktop eye trackers (screen-based; SensoMotoric Instruments [SMI] RED 250 Mobile). The stimuli were presented on an HP Zbook 15 G2 laptop (15.6 inch display) with a 1920×1080 resolution, and a chin rest at a 60-cm viewing distance was used. A (13-point) calibration was conducted prior to the experiment and before each task. The completion of CPS tasks took approximately 40 min.
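For readers who wish to relate on-screen distances to visual angle, the reported setup allows a rough conversion. The sketch below is a back-of-the-envelope calculation, assuming square pixels (so that the aspect ratio follows from the 1920×1080 resolution) and a gaze line perpendicular to the screen centre; the function name is hypothetical, and the result was not reported in the study.

```python
import math


def pixels_per_degree(diagonal_inch: float, res_x: int, res_y: int,
                      distance_cm: float) -> float:
    """Approximate number of horizontal pixels covered by one degree of
    visual angle at the screen centre."""
    diagonal_cm = diagonal_inch * 2.54
    # With square pixels, physical width relates to the diagonal as res_x
    # relates to the pixel diagonal.
    width_cm = diagonal_cm * res_x / math.hypot(res_x, res_y)
    px_per_cm = res_x / width_cm
    # Extent on the screen subtended by one degree, centred on the line of sight.
    cm_per_degree = 2 * distance_cm * math.tan(math.radians(0.5))
    return px_per_cm * cm_per_degree


# Values taken from the setup description: 15.6 inch, 1920x1080, 60 cm.
ppd = pixels_per_degree(15.6, 1920, 1080, 60.0)
```

Under these assumptions, one degree of visual angle corresponds to roughly 58 pixels, which gives a sense of scale when judging how far apart two AOIs must be to be told apart reliably.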

Context and task
As a game-like (e.g. Squire et al., 2003;Zagal, Rick, & Hsi, 2006), 'dual-space' interaction space (Zemel & Koschmann, 2013), the ATC21S environment encompasses a chat property as a free-form, synchronous interface and a space with actionable artefacts that have either a symmetrical or asymmetrical outlook for the individuals (see Fig. 2). In a symmetrical task, stimulus content and actionable artefacts are equal for the partners, whereas in an asymmetrical task, the dyad is given a unique subset of resources for problem solving. Alternatively, the screen view can be identical, whilst the ability to move certain objects or scroll the bars is divided between the partners. The success of one student depends on the behaviour of the other and the reactions offered (Care et al., 2016). In the experiment, students completed two CPS tasks (i.e. 'Laughing Clowns', 'Plant Growth'). In this paper, the focus is on the 'Laughing Clowns' symmetrical task (e.g. Care, Griffin, Scoular, Awwal, & Zoanetti, 2015). In this task, without advance explanation, each student is presented with a clown machine and 12 balls to be shared between the students (see Fig. 2). The screen views of students A and B are mirror images of each other, where both can view which balls are being used by the partner but cannot see how (i.e. the drop position of the ball in the clown's mouth or the exit point when it comes out). In other words, the visual information that is transmitted in real time is only the number of balls used by the partner and the location of the ball being used. The trajectory of the ball when in use by the partner is not visible to the other student. The students must place the balls into the clown's mouth whilst the mouth is moving to determine the rule governing the direction the balls will take (entry: left, middle, right; and exit: positions 1, 2, 3). The students' goal is to determine whether their clown machines work in the same way. 
They are to do this via discussion, and then they are expected to individually mark the outcome on their respective play spaces in the CPS environment. To accomplish this, the dyad needs to share information and discuss the rules, negotiating how many balls they should each use. In this regard, communication via the chat interface is central to success in this task. 'Laughing Clowns' is a content-free CPS task, which means that the task is not aligned with any curriculum content and does not require any previous content knowledge. In the task design, the following CPS behaviours are observed regarding the theorised CPS construct: interaction, audience awareness, responsibility initiative, resource management and relationships (see Care et al., 2015). In Table 1, all the task-related CPS elements and indicated behaviours are briefly described. Then, as an exemplar, one of the central CPS elements (i.e. interaction) is described in more detail.
Interaction is a fundamental social skill observed in this task, with assessment based on how participants demonstrate their ability to interact with their partners (e.g. presence of chat before allowing the partner to make a move). In the context of this task, this skill is considered crucial because the participants are required to share the 12 balls allocated to them. It is thought that they will benefit by corresponding on how to best utilise them in their dyad. Failure to do this may mean that not everyone has enough resources (i.e. balls) to trial their machine so they can jointly reach a conclusion about the mechanism of how the machines work. Interaction can also be observed at various proficiency levels (i.e. from low to high) within the dyad. It is expected that, from the beginning, proficient collaborators will be aware of the necessity of interaction to both coordinate their activities and promote collaboration for successful resolution of the problem (i.e. in this case, being able to test each machine to reach a conclusion on whether the mechanics are similar or different).

Log files
The dataset incorporated two types of recorded observational data (i.e. log files and gaze data). The automatically generated log files from the online environment served as the primary interaction data, consisting of multiple, time-stamped information of the CPS sessions (see Table 2). In short, the log file comprised the interaction data from the free-form chat (i.e. 'raw data') and reflected any activities attempted on screen individually or jointly, including some non-activities (e.g. moving the mouse, hovering over a button without clicking it, etc.). All captured information was recorded as a sequence of activities in the order in which they occurred, including the time of the occurrence and the details of the involved participants and the task they were undertaking (includes stage of the task as in task page number).
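The kind of record described above can be pictured as a time-stamped event with a participant, a task, a task page and either a chat message or an action. The following sketch reads such records from CSV text and orders them chronologically. The column names, the event categories and the sample rows (including the dates) are invented for illustration and do not reproduce the actual ATC21S log format.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class LogEvent:
    timestamp: datetime
    student: str   # e.g. "A" or "B"
    task: str
    page: int      # stage of the task as a page number
    kind: str      # hypothetical categories: "chat" or "action"
    detail: str    # chat text or a description of the action


def parse_log(text: str) -> List[LogEvent]:
    """Parse CSV log text into events sorted by time of occurrence."""
    reader = csv.DictReader(io.StringIO(text))
    events = [
        LogEvent(
            timestamp=datetime.fromisoformat(row["timestamp"]),
            student=row["student"],
            task=row["task"],
            page=int(row["page"]),
            kind=row["kind"],
            detail=row["detail"],
        )
        for row in reader
    ]
    return sorted(events, key=lambda e: e.timestamp)


# Invented sample rows, deliberately out of order.
SAMPLE = """timestamp,student,task,page,kind,detail
2020-01-01T10:00:05,B,clowns,1,action,drop ball left
2020-01-01T10:00:01,A,clowns,1,chat,try the left side
"""

events = parse_log(SAMPLE)
chat_events = [e for e in events if e.kind == "chat"]
```

Keeping chat and actions in one chronologically ordered stream is what allows the later analysis to read them as a single interactional sequence.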

Gaze data and their prior analyses
The log file data were accompanied with the individual students' gaze data recorded in the CPS sessions in dyads. Here, gaze is defined as 'the act of directing the eyes towards a location in the visual world' (Hessels, 2020, p. 856); gaze is always seen as focussed somewhere or on something. In the study, because of the large amount of data collected via eye trackers that can capture the eye movements 30-60 times per second, a quantitative prior analysis was first conducted to reduce and visualise the data for the qualitative interaction analysis. To do this, behavioural and gaze analysis (SMI BeGaze™) eye-tracking software was used for automatically segmenting the eye-tracking data into gaze fixations and computing scan path visualisations. Gaze fixations are the time periods when the eyes maintain gaze on a single location and allow seeing which part of the screen the participant looks at and for how long. In scan path visualisations, the gaze positions and eye events are plotted on a stimulus video. Computations of gaze fixations are based on their coordinates and duration, and for the computations, SMI BeGaze software uses a dispersion-based spatial algorithm (e.g. Blignaut, 2009;Salvucci & Goldberg, 2000). In the paper, the scan path video exports are used for qualitatively analysing the eye movements of individuals during CPS (answering 'where' and 'when' questions), interpreted in relation to the log file data (for a screen capture of a video export as a scan path view, as well as the AOIs of the symmetrical Laughing Clowns task, see Fig. 3).
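To illustrate what a dispersion-based fixation algorithm does, the sketch below implements a generic dispersion-threshold procedure in the spirit of Salvucci and Goldberg (2000). It is not the SMI BeGaze implementation, whose internals and parameter values are not described here; the thresholds, the sample data and all names are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, float]            # (time_ms, x_px, y_px)
Fixation = Tuple[float, float, float, float]   # (start_ms, end_ms, mean_x, mean_y)


def _dispersion(window: Sequence[Sample]) -> float:
    """Dispersion as (max x - min x) + (max y - min y) over the window."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(samples: Sequence[Sample], max_dispersion_px: float,
                     min_duration_ms: float) -> List[Fixation]:
    fixations: List[Fixation] = []
    n = len(samples)
    start = 0
    while start < n:
        # Grow an initial window that spans at least the minimum duration.
        end = start
        while end < n and samples[end][0] - samples[start][0] < min_duration_ms:
            end += 1
        if end >= n:
            break  # not enough data left for another fixation
        if _dispersion(samples[start:end + 1]) <= max_dispersion_px:
            # Extend the window while the points stay close together.
            while end + 1 < n and _dispersion(samples[start:end + 2]) <= max_dispersion_px:
                end += 1
            window = samples[start:end + 1]
            mean_x = sum(s[1] for s in window) / len(window)
            mean_y = sum(s[2] for s in window) / len(window)
            fixations.append((window[0][0], window[-1][0], mean_x, mean_y))
            start = end + 1
        else:
            start += 1  # slide the window forward by one sample
    return fixations


# Invented gaze samples: two clusters separated by a jump across the screen.
samples = [
    (0, 100, 100), (20, 102, 101), (40, 101, 99),
    (60, 99, 100), (80, 100, 102), (100, 101, 100),
    (120, 500, 300), (140, 502, 301), (160, 501, 299),
    (180, 499, 300), (200, 500, 302), (220, 501, 300),
]
fixations = detect_fixations(samples, max_dispersion_px=30, min_duration_ms=80)
```

With these invented samples, the procedure yields two fixations, one per cluster. The resulting coordinates and durations are the kind of output that the scan path visualisations plot on the stimulus video.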

Data analysis: focussing on event-related measures
To better understand JAB in terms of the typology of 'jointness' (Siposova & Carpenter, 2019) and search for related behaviours during CPS, a qualitative interaction analysis was applied (e.g. Valde, 2017). To form a meaningful event regarding JAB in the context of CPS, the focus was on the structure of the interaction and how interactional actions were related to each other (Valde, 2017). The analysis was tailored to a remote interaction context and combined multimodal data for analysis as log files and eye-gaze data.
The analysis included the prior analysis phase of the raw gaze data utilising the SMI BeGaze software, explained in the previous section, and two interrelated main phases of manual qualitative coding. In the first phase of the qualitative analysis, to search for meaningful events in terms of JAB, viewed in relation to the CPS construct, the focus was on understanding the structure of interaction in pair-level log data as traces of verbal interaction between the participants (i.e. chat) and manipulating artefacts (i.e. actions). The aim was to systematically review the full log file data corpus in this regard. The log file for a dyad was analysed (including identification, coding and interpretation) by multiple (i.e. at least two) researchers. Observations were compared to eliminate any The analysis was grounded on the following basic principles in terms of the basic structure and organisation of interaction (e.g. Schegloff, 2007). Typically, interactional actions are organised as sequences and have a particular organisation in 'adjacency pairs'. That is, an initiating action (e.g. a question, proposal) makes a responsive action (e.g. an answer, uptake) pertinent, which is expected to occur in the sequentially following position (Tuononen et al., 2016). In triadic interactions related to JAB, a partner may initiate interaction by directing the other partner's attention to something; in this context, the partner refers to an object of attention, for example, an artefact in the collaborative workspace. In the ATC21S environment, the objects of attention can be abstract (e.g. numerical problems); alternatively, they can be manipulated objects on the screen (artefacts) that are explicitly present in the environment (see Andrist, Ruis, & Williamson Shaeffer, 2018).
In the analysis, the initiating and responsive actions included verbal interaction (chat), as well as different combinations of 'chat' and 'action' (e.g. situations in which a student asks the partner to take some action, and subsequently, gives feedback on the result(s) of the action). Therefore, a concept of 'reference-action sequence', as described by Andrist et al. (2018), was viewed as applicable here for more detailed categorising. Reference-action sequences point to 'short interactions between collaborators in which one person indicates an object in the collaborative workspace that another person is supposed to manipulate in some way' (Andrist et al., 2018, p. 339). In the current study, these types of sequences were first coded as 'initiating-responding' utterances, but for clarification, they were further defined as 'reference-action' sequences. For an example of a reference-action sequence, see Table 3.
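The coding in this study was carried out manually by the researchers. Purely as an illustration of the structure of a reference-action sequence, the sketch below flags candidate sequences in an event stream: a chat message by one partner, followed within a time window by an action of the other partner, optionally followed by a further chat message from the initiator. The event format, the time window and the matching rule are hypothetical simplifications, and any candidate would still require qualitative interpretation.

```python
from typing import List, NamedTuple, Optional, Sequence


class Event(NamedTuple):
    time_s: float
    student: str
    kind: str   # "chat" or "action"
    text: str


class Candidate(NamedTuple):
    reference: Event            # initiating chat message
    action: Event               # partner's responsive action
    feedback: Optional[Event]   # initiator's follow-up chat, if any


def find_candidates(events: Sequence[Event], max_gap_s: float = 30.0) -> List[Candidate]:
    ordered = sorted(events, key=lambda e: e.time_s)
    found: List[Candidate] = []
    for i, ref in enumerate(ordered):
        if ref.kind != "chat":
            continue
        action: Optional[Event] = None
        feedback: Optional[Event] = None
        for later in ordered[i + 1:]:
            if later.time_s - ref.time_s > max_gap_s:
                break
            if action is None:
                # First look for an action by the other partner.
                if later.kind == "action" and later.student != ref.student:
                    action = later
            elif later.kind == "chat" and later.student == ref.student:
                # Then look for a follow-up message by the initiator.
                feedback = later
                break
        if action is not None:
            found.append(Candidate(ref, action, feedback))
    return found


# Invented example events.
events = [
    Event(0, "A", "chat", "drop a ball on the left"),
    Event(6, "B", "action", "drop ball: left"),
    Event(11, "A", "chat", "thanks, mine did the same"),
    Event(15, "B", "chat", "ok"),
]
candidates = find_candidates(events)
```

A filter of this kind can only propose where to look; whether a flagged stretch of interaction actually constitutes a reference-action sequence depends on what the partners wrote and did.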
In the second phase of the analysis, the selected interaction events from the log data were identified from the scan path visualisations and these events were analysed in greater depth (on a frame-by-frame basis) in accordance with the gaze behaviours (i.e. fixations) whilst completing the task. This phase made visible the location and the order of the gaze cursor at specific AOIs during these selected events. (For an example of a coded location and order of gaze viewed in accordance with the specified AOIs, see Fig. 3 and Table 4; the short example here relates to an excerpt from a broader interactional sequence of shared attentional experience). To recap, the overarching aim of the second phase of the analysis was to systematically locate gaze behaviours related to achieving the attentional experiences of the different levels of JAB (i.e. monitoring, common, mutual and shared) during CPS.
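The location and order of gaze at the AOIs during a selected event can be pictured as an ordered list of AOI visits. The sketch below derives such a list from fixations that fall within an event's time window. The AOI rectangles, the fixation values and the names are invented for illustration; in the study itself, this step was performed by frame-by-frame inspection of the scan path videos.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]       # (left, top, right, bottom) in px
Fixation = Tuple[float, float, float, float]   # (start_ms, end_ms, x_px, y_px)


def aoi_of(x: float, y: float, aois: Dict[str, Rect]) -> Optional[str]:
    """Return the name of the first AOI containing the point, if any."""
    for name, (left, top, right, bottom) in aois.items():
        if left <= x <= right and top <= y <= bottom:
            return name
    return None


def aoi_sequence(fixations: Sequence[Fixation], aois: Dict[str, Rect],
                 window_start_ms: float, window_end_ms: float) -> List[Tuple[str, float]]:
    """Ordered (aoi, dwell_ms) visits within the window. Assumes fixations are
    in chronological order; fixations outside every AOI are skipped, and
    consecutive fixations on the same AOI are merged into one visit."""
    visits: List[Tuple[str, float]] = []
    for start, end, x, y in fixations:
        if end < window_start_ms or start > window_end_ms:
            continue
        name = aoi_of(x, y, aois)
        if name is None:
            continue
        duration = end - start
        if visits and visits[-1][0] == name:
            visits[-1] = (name, visits[-1][1] + duration)
        else:
            visits.append((name, duration))
    return visits


# Hypothetical AOIs on a 1920x1080 screen and invented fixations.
AOIS: Dict[str, Rect] = {
    "chat": (0, 800, 1920, 1080),
    "clown": (600, 100, 1300, 700),
    "balls": (0, 100, 500, 700),
}
fixations = [
    (0, 200, 900, 900), (250, 400, 950, 400), (450, 600, 1000, 420),
    (650, 900, 200, 300), (950, 1200, 800, 950), (5000, 5200, 900, 400),
]
visits = aoi_sequence(fixations, AOIS, window_start_ms=0, window_end_ms=2000)
```

Read alongside the log entries for the same time window, a visit list such as chat, clown, balls, chat shows where each partner was looking before and after a message was sent.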
It was assumed that, when analysed for consistency, the two data types (log files, gaze data) would allow for a better understanding of CPS respecting the sliding scale of JAB. This was based on the underlying analysis of its interactional structures and organisation, viewed in relation to theorised attentional levels that differ in terms of the following: (a) the participants' perspectives (i.e. second- and third-person perspectives), (b) the type of knowledge involved in a particular attentional level, (c) the level of dependency between partners and (d) experience level (whether individual or jointly created; Siposova & Carpenter, 2019). In addition, the aim was to recognise and separate dyadic interaction related to individual attention levels from triadic interactions during CPS (for an overview of the phases of analysis, see Table 5).
To conclude, the line-by-line analysis of the interactional structure and related organisation of eye gaze in dyads (or in some points, the lack thereof) served as an analytical tool to fully grasp the situation of inquiry to delimit and exemplify the different attentional levels and their underlying basis from these data, considered in relation to the notions linked to the theorised CPS construct. Notably, in the 'Results' section, the attentional levels during CPS are presented as isolated, descriptive behavioural sequences showcasing the different attentional levels, notated with the raw log and gaze data views and abstracted from a certain dyad or individual.

Results
In terms of the identified attentional levels and their underlying basis, having combined the information embedded in both activity logs and gaze data, the analysis resulted in illustrative examples that exemplify the spectrum of jointness as different attentional levels (Siposova & Carpenter, 2019) during remote CPS in dyads. The attentional levels and type of knowledge involved are presented from individual and monitoring attention (Fig. 4) to common attention (Fig. 5) and from mutual attention to shared attention (Fig. 6).
Individual and monitoring attention during collaborative problem solving
Figure 4 exemplifies an individual's (student B) monitoring attention in the CPS situation, observed from the onset of the Laughing Clowns task. As is typical of a monitoring situation (Siposova & Carpenter, 2019), the participants had individual knowledge of the situation and evidence of the partner (e.g. acquired through the given instructions). In this situation, via the screen, student B attended to what the partner (student A) was attending to. Whilst dragging and dropping a ball, student B took an observer's perspective on the actions of student A. There were no communications yet, but there were noticeable changes in the behaviour of student B, such as frequently monitoring the screen and the interaction property; these actions were visible in the gaze data view. In this case, the participants did not have the same attentional level: whilst student B was at the monitoring attention level, student A was simultaneously at the individual attention level (Siposova & Carpenter, 2019), concentrating on reviewing the instructions and testing the machine individually without any monitoring or communication via the chat property.
Note. The example includes selected information from the log data, namely student ID, task ID, page, role, raw data, code (i.e. initiating) and sub-code (i.e. reference-action), as well as the number of identified AOIs, in the Laughing Clowns task.

Common attention during collaborative problem solving
As in monitoring attention, in common attention (see Fig. 5), the experience was primarily individual, but it differed from the previous attentional levels (i.e. individual and monitoring; Siposova & Carpenter, 2019). Although working in parallel during CPS task completion and without systematically communicating over their related goals, students A and B had the following characteristics: (a) they had an established joint objective, acquired via task instructions, and (b) based on the first point, it could be assumed that their attention was relevant to their partner (Siposova & Carpenter, 2019). Here, the dyad was engaging in the same CPS situation. Whilst they depended on the attention of the partner, their evaluation of the situation (common attention or not) was individual. Their dependency was anchored to their (individual-level) awareness of attending the same problem-solving session as their co-student. In the chat, the partners shared their notions of individually manipulating the artefacts as follows: student B wrote, 'The first ball went into L', and continued to test another ball without any further negotiation; immediately afterward, student A wrote, 'Same and the head was left'. The communication was based on reporting parallel efforts that relied on individual partners testing the task-specific properties; they were not yet truly attending to the situation of inquiry together.

From mutual attention to shared attention during collaborative problem solving
In the current example (see Fig. 6), at the onset of the task, the partner's presence was verbally acknowledged; that is, an 'attention contact' was made (Gomez, 2005). This can also be referred to here as a sign of mutual attention experience: In the remote environment, only verbal signs are available and required; the eye contact or even mutual touch typical of mutual attention in a physical environment is not possible. According to Siposova and Carpenter (2019), the mutual attention experience, if compared with previous attentional levels, is co-created: To achieve the experience, both partners must engage in these processes together, and in this sense, their knowledge of the situation is also mutual. When proceeding with the task, both students' communicative exchanges about the task and the task properties were intentional and bidirectional. In addition, over the course of CPS, the partners co-created their experiences by constantly sending and receiving information and negotiating how to solve the problem together (except for the first ball thrown by student B without first consulting the partner, but student B came back to the issue later; see Fig. 6): The partners engaged in 'doing together' as a shared attention experience (Siposova & Carpenter, 2019; Zahavi, 2015). Compared with the mutual attention experience from the previous attentional level, the partners co-created their experience (here, by exploring the available artefacts), whereas the communicative exchange also allowed them to align, for example, the goals concerning their object of attention. Accordingly, their knowledge of the current situation was shared (they both acknowledged having two rows of six balls). The gaze data examples represented in Fig. 6 are from an episode that included the moment when the dyad explored the available artefacts and communicated on whether they both had two rows of six balls.
Fig. 6 Illustration of sliding from a mutual to shared attention experience (students A and B perspectives) whilst completing the 'Laughing Clowns' task. The example includes simultaneous moments from (a) the log data, combined with (b) and (c) individual-level screen captures from the eye-tracking video exports.

Discussion
This paper described an exploratory study that focussed on JAB, the basis of interaction that predicts productive collaboration, to better understand CPS, particularly its social aspects in remote dyadic interaction. The study aimed to advance our earlier understanding of the theorised CPS in this regard by applying the following approaches: (a) the comprehensive theoretical framework of JAB by Siposova and Carpenter (2019) and (b) thorough qualitative inquiry, relying on a rich set of data. These data allowed for zooming into fine temporal organisation of social interaction, including the eye movements of individuals during CPS processes in dyads. The aim was to arrive at illustrations that would exemplify how the diverse attentional states of monitoring, common, mutual and shared, and individual attention experience, were achieved in dyads in the remote CPS environment (ATC21S), as described by Siposova and Carpenter (2019).
How did the different attentional levels and types of common knowledge in JAB materialise in this study? When focussing on the remote sequential interaction in student dyads, all the attentional levels defined by Siposova and Carpenter (2019) were recognised from the empirical data as third-person, individualistic attention experiences (monitoring and common attention), including the precondition of JAB (individual attention experience), and as second-person relations (mutual and shared attention). The examples, if compared with the theorised typology of jointness, can represent different strength levels of each because the attentional levels of the composite of jointness are, to some extent 'prototypical', as Siposova and Carpenter (2019) pointed out.
When focussing on the examples related to the third-person perspective (i.e. individual, monitoring), the results gave empirical evidence of detached attention experiences, encompassing autonomous actions of individuals, or at the common attention level, parallel processes to solve the given CPS task. Although the student dyads were initially confronted with the social aspects of CPS via the task designs (Scoular et al., 2017), here, the general task-related and interactional organisation did not properly reflect the task-specific CPS elements of the Laughing Clowns task (i.e. interaction, audience awareness, responsibility initiative, resource management and relationships; Care et al., 2015), for example, at the common attention level. The students did not systematically build on each other's contributions but proceeded with trial-error actions and iterations based on these actions (see also Davis et al., 2015).
The examples relating to the second-person perspective (i.e. mutual and shared), for instance, showed the significance of making the 'attention contact' (Gomez, 2005) in mutual attention experience as straightforward acknowledgement of the partner's presence. This can ensure initial sensing of certainty that attention is joint (Siposova & Carpenter, 2019). As students operate in a remote environment, the possibilities of how to gather information about the partner are rather limited (here, to the chat property or observing how the artefacts are being manipulated). In this regard, the verbal acknowledgement of their partner influences the achievement of the attentional state, and thus, it can favour direct processing of a collaborative task (see also Baker, 2015).
At the shared attention level, it was observed that both members of the dyad adopted an engaged approach towards each other to solve the CPS task together and showed interaction that was well coordinated and symmetrical (e.g. Andrist et al., 2018; Pöysä-Tarhonen et al., 2017; Miles, Lumsden, Flannigan, Allsop, & Marie, 2017). Accordingly, coordinating interactional sequences and attention can ensure that collaborative activities 'flow easily and intelligibly' (Andrist et al., 2018, p. 339). In productive CPS, participants are expected to explore the social space by acknowledging their partners and asking questions, as well as by sharing information and resources in the remote environment (e.g. Scoular et al., 2017). These principles resonate well with the defined features of mutual and shared attentional levels in which both members of a dyad are considered senders and receivers of information simultaneously (e.g. Siposova & Carpenter, 2019; Zahavi, 2015).
Taken together, as Siposova and Carpenter (2019) argued (see also De Jaegher, Di Paolo, & Gallagher, 2010), there is a substantial difference in the quality of the sociocognitive processes that occur when a participant is adopting a third-person perspective compared with second-person relations. That is, the primary way of understanding what a partner requires is interacting and experiencing together with the partner. Whereas in the third-person perspectives of individual and common attention, two individuals 'meet in the middle' (Siposova & Carpenter, 2019, p. 262; see also Carpenter & Liebal, 2012), in the second-person relations of mutual and shared attention, a 'meeting of minds' occurs, and the partners are truly attending with each other towards the shared goal or object of attention (Gallotti & Frith, 2013; Siposova & Carpenter, 2019).
Even if the different attentional levels are described as separate in the scale of jointness, in real-world situations, the distinct levels can emerge differently in terms of intensity (Siposova & Carpenter, 2019). That is, even if we are engaged, for example, in a rich second-person relation, the intensity of the relation may still vary in different situations. It can be considered that, in the relatively short-term sessions of completing the CPS tasks together, the second-person relations (mutual and shared) lie more on the 'left' side of the sliding scale of common attentional experience (see Siposova & Carpenter, 2019). Although the dual-space remote environment is designed to create bidirectional contacts through instructions and design choices that signal those features for the participating students, the interaction is based only on written communication. Even if the chat is free-flowing and informal, its textual nature affects the nature of the communication: information that is otherwise conveyed non-verbally (e.g. intonation, facial expressions) must be presented as text or through textual paralinguistic cues (see Paolillo & Zelenkauskaite, 2013). Therefore, it is expected that, for some students, this can hamper or restrict the ways in which they communicate and share their understanding whilst exploring the problem space. However, in prolonged versions of everyday interaction (for example, between friends), the intensity of engagement is typically different (Siposova & Carpenter, 2019).
Yet, it should be noted here that, in longer moments of working together (e.g. on a shared problem), if it is seen to require richer experiences of joint attention, the concept of collaboration is often applicable to cover only certain phases of the groupwork (Baker, 2015). According to Baker (2015), for a given duration, there will normally be periods in which participants are not attending to each other or to the joint task.
Along with the varying intensity of each attentional level, there is a continuum of jointness between and within attentional levels (Siposova & Carpenter, 2019). In the current study, it was witnessed how a lower level attention experience can be a foundation for a higher level (Siposova & Carpenter, 2019). For example, sliding through the short moment of mutual attention experience of verbally acknowledging each other's presence was critical in achieving the shared attention experience between the partners. Although a minimal example of a second-person attentional experience is described here, it includes bidirectional contacts and indicates openness for engagement (Siposova & Carpenter, 2019).
Are some attentional levels more evident or valuable if seen in relation to productive CPS processes? As described by Siposova and Carpenter (2019), attentional levels can come about through bottom-up processes (i.e. automatic, reflexive shifts of attention to a salient stimulus) or top-down processes (i.e. in an active, goal-oriented manner; see also Kaplan & Hafner, 2006). In the current (see also Pöysä-Tarhonen et al., 2020) and previous studies (see Pöysä-Tarhonen et al., 2017), equivalent, contrasting 'strategies' in the CPS processes have been observed. Typically, a salient stimulus can generate shifts on the scale of jointness (Siposova & Carpenter, 2019). When focussing on the onset of the task (the Laughing Clowns), it seems that, for some, the salient stimulus (i.e. the moving head of the clown and the balls) may be too salient and eye-catching at the expense of reading the instructions or connecting with the partner, which they are expected to do first in the task design (in the Laughing Clowns task, one or more balls can be used by an individual participant before reading the instructions and realising that the number of balls is limited and they are shared; see Care et al., 2015). This type of behaviour may cause 'sticking' to the individual attention level (Siposova & Carpenter, 2019), being unconcerned with the other participant's presence or connecting with the partner, as also evidenced in this study. However, as soon as both participants fully realise that the goal is shared (here, the task is to solve the problem of whether their machines work similarly), and by design, the partner's contribution is requisite to solve the problem, this awareness of dependency can push the dyad 'right' on the scale of jointness (Siposova & Carpenter, 2019).
It can be assumed that, especially at the lower attentional levels when participants are following top-down processes in achieving JAB, the limited perceptual space of the CPS environment and the short time interval typical of sharing the appearance of the object and communication of the objects can have constructive influence in creating shifts in the attentional states during CPS. Although the 'bottom-up' cases here are described in accordance with lower attentional levels, in higher attentional levels when the stimulus is salient, participants can also impulsively shift their attention to it. Yet, what is different from the lower attention levels is that they are verbally sharing their attention (Siposova & Carpenter, 2019).
According to Siposova and Carpenter (2019), when the 'top-down' processes of achieving attentional states are involved, from the onset (e.g. in achieving shared attention), a participant can intentionally direct the other participant's attention to something they can jointly focus on and communicate about, including checking that the partner has noticed it. Through communication, they can also confirm that the attention is shared. Accordingly, despite the possible imbalance between participants' motivations and interests, through top-down processes, individual commitments can trigger social obligations, and through communication, create joint goals and joint commitments (Siposova & Carpenter, 2019; Siposova, Tomasello, & Carpenter, 2018). In terms of achieving sharedness, the quality of the behaviours, especially how explicit or detailed the communication is, can facilitate and sustain these processes (Siposova & Carpenter, 2019). In line with this, in the theorised CPS composite (e.g. Care et al., 2016; Hesse et al., 2015; Scoular et al., 2017), the 'ideal' productive processes of joint problem solving resemble the reciprocal communicative processes and comprise elements that relate to qualities of the top-down process that can enhance achieving higher attentional levels in dyads. In terms of the social elements, to be productive, CPS requires sensitivity to the partner's co-presence and benefits from common ground and shared meanings created during the reciprocal interaction in the dyad (e.g. Baker, 2015; Baker et al., 1999). However, it should be noted that we do not yet know if one of these two approaches (i.e. 'bottom-up' or 'top-down') can rise above the other at a larger scale in terms of achieving higher attentional levels in dyadic interaction during CPS processes, not only in terms of the CPS process qualities but also the higher CPS outcomes of individuals.

Limitations and future prospects
When considering the methods used in this study, to fully understand social connotations as attentional levels of JAB during CPS, gaze data alone did not provide sufficient contextual details of the real interactions between participants. However, especially in the lower attention levels, such as in the monitoring and individual attention conditions, the gaze data view was beneficial. It indicated essential moment(s), composed without writing or moving artefacts. Thus, not only did it reinforce the interpretations of the participant's individual orientation levels, which were only partially visible in the log data view, but it also evidenced the attentional state 'behind' the log data view in terms of the monitoring attention level. However, as remote social interaction has its unique properties, the 'richness' of attention experience or the indicative behaviours at different attentional levels may not be identical (or cannot be fully attained remotely) when compared with the behaviours in face-to-face situations, as defined in Siposova and Carpenter (2019). Therefore, we need a further understanding of the specific 'functioning' definitions of JAB in this context of remote CPS.
Since the current explorative study includes only a few cases, as a next step, the focus of investigation will be on a larger population of students. The aim is to increase our understanding, for example, of the diverse aforementioned 'behavioural strategies', whether they are more unintentional or intentional (i.e. bottom-up and top-down processes), in achieving JAB during CPS, as seen in relation to the CPS process outcomes. In addition, with a larger population of students, it can be considered whether certain attentional levels are linked with higher or lower CPS skill levels acquired from the ATC21S environment. In the process analysis, the aim can be to shed light on the precise timing of the eye movements and interaction sequences during CPS by quantitatively analysing how gaze and verbal interaction intercouple in dyads (e.g. Nüssli et al., 2013). In this study, the focus was on exemplifying isolated behavioural sequences initially identified from the log files, supplemented with the gaze data. Next, the aim is to uncover, for example, longer behavioural sequences of interaction in CPS and to search for more evidence of the principles that can account for a better understanding of JAB in remote CPS, as well as in terms of knowledge levels related to JAB. This can be done by analysing what behaviours preceded and followed identified attentional levels (see Siposova & Carpenter, 2019), as well as by looking at longer and more diverse problem-solving task types (e.g. symmetrical and asymmetrical tasks). Furthermore, it will be particularly promising to bring together the information embedded in the log files with other types of computed visualisations based on the raw eye-tracking data, such as 'scarf plots' (e.g. Jarick & Kingstone, 2015; Yang & Wacharamanotham, 2018) or 'sequence charts', both of which show gaze transitions among AOIs on timelines and thus combine the 'where' and 'when' questions on eye events.
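As a rough indication of what a scarf plot requires as input, the Python sketch below merges consecutive fixations on the same AOI into timeline segments, each of which would become one coloured bar. The `(start_ms, end_ms, aoi)` tuple format and the AOI names are assumptions made for illustration only; they are not taken from the study's data or from any eye-tracking software export.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical input: one tuple per fixation, already labelled with an AOI name.
LabelledFixation = Tuple[float, float, str]   # (start_ms, end_ms, aoi)
Segment = Tuple[str, float, float]            # (aoi, start_ms, end_ms)


def scarf_segments(fixations: List[LabelledFixation]) -> List[Segment]:
    """Merge runs of consecutive fixations on the same AOI into segments.

    Each segment spans from the start of the first fixation in the run to
    the end of the last one, which answers both 'where' (the AOI) and
    'when' (the time span) for one participant.
    """
    ordered = sorted(fixations)  # tuples sort by start time first
    segments: List[Segment] = []
    for aoi, run in groupby(ordered, key=lambda f: f[2]):
        run_list = list(run)
        segments.append((aoi, run_list[0][0], run_list[-1][1]))
    return segments
```

Computing the segments separately for each student and placing the two timelines one above the other, alongside the time-stamped chat and action events, would be one way to inspect gaze and log data together.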

Conclusion
The present study has provided preliminary insights into how the hierarchical and nested levels of 'jointness' and common knowledge are achieved in dyadic interaction in remote CPS. Via empirical examples, the study reproduced the basic ideas of the different attentional levels of JAB theorised by Siposova and Carpenter (2019) whilst acknowledging the restrictions of the remote interaction environment and the predefined CPS task structures and interaction features of the CPS environment (see Care et al., 2016). As first insights, these outcomes can stimulate thinking about how to support participants, for example, via task designs in achieving high attentional levels during remote CPS.
To acquire stronger evidence of the multiple attentional levels of JAB during remote CPS processes, the eye-tracking data demonstrated its usefulness in making the 'invisible visible' when the shifts and timing of gaze can also be considered in accordance with the multiple information embedded in the log data (Graesser et al., 2018). However, whilst our study has advanced our understanding of the complex functioning of the social elements during remote CPS, it also points to certain limitations of the approach. Using larger samples and varied methods, further research is required to gain insights into more definite operational definitions and a deeper understanding of behavioural sequences of JAB (Siposova & Carpenter, 2019) related to productive CPS in remote contexts.