---
author: Julian Dehne
title: "Delab Trees"
subtitle: "A python library to analyze conversation trees"
bibliography: references.bib
logo: tutorial/img/icon_delabtress.png
image: tutorial/img/icon_delabtress.png
image-alt: logo
citation: 
    type: "document"
    title: |
        "delab-trees, a python library to analyze conversation trees"
    issued: 2024    
    publisher: GESIS – Leibniz Institute for the Social Sciences 
    URL: https://github.com/juliandehne/delab-trees
execute:
  freeze: auto   
---

## At a glance

By the end of this tutorial, you will be able to:

- Analyze the integrity of a social media conversation
- Use network analysis to extract longer reply paths that might represent actual deliberation
- Use network analysis to show which author is the most central in the discussion

## Table of Contents

[Introduction](#introduction)

[Set-up](#set-up)

[Tool application](#tool-application)

[Conclusion](#conclusion)


## Introduction {#introduction}

### Description
- This notebook introduces the Python library delab_trees and uses a few examples to show how it helps when working with social media data.

### Target Audience

- This library is intended for advanced computational social science (CSS) researchers with a solid background in network analysis and Python.
- Motivated intermediate learners can use parts of the tooling as a black box to obtain the conversation pathways they later use in their research.

### Prerequisites

Before you begin, you should be familiar with the following technologies:

- Python
- NetworkX
- pandas

## Set-up {#set-up}

- This tutorial requires Python 3.9 or newer.
- The library installs all of its dependencies automatically. Just run:

```bash
pip install delab_trees
```

## Social Science Use Cases

This learning resource is useful if you have encountered one of the following use cases:

- deleted posts in your social media data
- interest in author interactions on social media
- huge numbers of conversation trees (scalability)
- discussion mining (finding actual argumentation sequences in social media)


## Sample Input and Output Data 

Example data for Reddit and Twitter is available at https://github.com/juliandehne/delab-trees/raw/main/delab_trees/data/dataset_[reddit|twitter]_no_text.pkl (choose either `reddit` or `twitter`).
The data contains structure only. IDs, text, links, and other information that would breach the confidentiality of the academic
access have been omitted.

The trees are loaded from tables like this:

|    |   tree_id |   post_id |   parent_id | author_id   | text        | created_at          |
|---:|----------:|----------:|------------:|:------------|:------------|:--------------------|
|  0 |         1 |         1 |         nan | james       | I am James  | 2017-01-01 01:00:00 |
|  1 |         1 |         2 |           1 | mark        | I am Mark   | 2017-01-01 02:00:00 |
|  2 |         1 |         3 |           2 | steven      | I am Steven | 2017-01-01 03:00:00 |
|  3 |         1 |         4 |           1 | john        | I am John   | 2017-01-01 04:00:00 |
|  4 |         2 |         1 |         nan | james       | I am James  | 2017-01-01 01:00:00 |
|  5 |         2 |         2 |           1 | mark        | I am Mark   | 2017-01-01 02:00:00 |
|  6 |         2 |         3 |           2 | steven      | I am Steven | 2017-01-01 03:00:00 |
|  7 |         2 |         4 |           3 | john        | I am John   | 2017-01-01 04:00:00 |

This dataset contains two conversational trees with four posts each.

Currently, conversation tables must be imported as a pandas DataFrame. The following example builds the first tree from the table above:

```python
import os
import sys
import warnings

import pandas as pd

try:  # NumPy >= 1.25 keeps its warning classes in numpy.exceptions
    from numpy.exceptions import VisibleDeprecationWarning
except ImportError:  # older NumPy releases
    from numpy import VisibleDeprecationWarning

from delab_trees import TreeManager

# show which conda environment is active
print(f"Active conda environment: {os.getenv('CONDA_DEFAULT_ENV')}")

# show the Python version (3.9 or newer is required)
print(f"Python version: {sys.version}")

# suppress NumPy's VisibleDeprecationWarning
warnings.filterwarnings("ignore", category=VisibleDeprecationWarning)

# the interesting code: one conversation (tree_id 1) with four posts
d = {'tree_id': [1] * 4,
     'post_id': [1, 2, 3, 4],
     'parent_id': [None, 1, 2, 1],
     'author_id': ["james", "mark", "steven", "john"],
     'text': ["I am James", "I am Mark", "I am Steven", "I am John"],
     "created_at": [pd.Timestamp('2017-01-01T01'),
                    pd.Timestamp('2017-01-01T02'),
                    pd.Timestamp('2017-01-01T03'),
                    pd.Timestamp('2017-01-01T04')]}
df = pd.DataFrame(data=d)
manager = TreeManager(df)
# the table holds a single tree, so random() returns that tree
test_tree = manager.random()
test_tree
```

```
Active conda environment: testtrees3
Python version: 3.9.20 (main, Oct  3 2024, 07:38:01) [MSC v.1929 64 bit (AMD64)]
loading data into manager and converting table into trees...
100%|██████████| 1/1 [00:05<00:00,  5.27s/it]
<delab_trees.delab_tree.DelabTree at 0x1f55bc50250>
```



Note that the tree structure is based on each row's parent_id matching another row's post_id.
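
To make the parent_id/post_id relationship concrete, the following minimal sketch rebuilds the first example tree with plain pandas and NetworkX. It does not use delab_trees and is not meant to mirror the library's internals; the names `posts` and `reply_graph` are chosen only for this illustration.

```python
import networkx as nx
import pandas as pd

# The first tree from the table above (text and timestamps left out).
posts = pd.DataFrame({
    "post_id": [1, 2, 3, 4],
    "parent_id": [None, 1, 2, 1],
    "author_id": ["james", "mark", "steven", "john"],
})

# One node per post.
reply_graph = nx.DiGraph()
reply_graph.add_nodes_from(posts["post_id"].tolist())

# One edge parent -> reply for every post that has a parent.
replies = posts.dropna(subset=["parent_id"])
reply_graph.add_edges_from(
    zip(replies["parent_id"].astype(int).tolist(), replies["post_id"].tolist())
)

# The root is the only post without a parent, i.e. with in-degree 0.
roots = [node for node, degree in reply_graph.in_degree() if degree == 0]
print("root posts:", roots)
print("reply edges:", sorted(reply_graph.edges()))
print("is a rooted tree:", nx.is_arborescence(reply_graph))
```

If a parent_id pointed to a post_id that is missing from the table, the graph would fall apart into several components and would no longer be a single rooted tree.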

You can now analyze the reply tree's basic metrics:

```python
import warnings

try:  # NumPy >= 1.25 keeps its warning classes in numpy.exceptions
    from numpy.exceptions import VisibleDeprecationWarning
except ImportError:  # older NumPy releases
    from numpy import VisibleDeprecationWarning

from delab_trees.delab_tree import DelabTree
from delab_trees.test_data_manager import get_test_tree

# suppress only VisibleDeprecationWarning
warnings.filterwarnings("ignore", category=VisibleDeprecationWarning)

test_tree: DelabTree = get_test_tree()
assert test_tree.average_branching_factor() > 0

print("number of posts in the conversation: ", test_tree.total_number_of_posts())
```

```
loading data into manager and converting table into trees...
100%|██████████| 1/1 [00:06<00:00,  6.23s/it]
number of posts in the conversation:  4
```






## Tool application {#tool-application}

### Use Case 1: Analyze the integrity of the social media conversation

For this use case, we use the provided sample data, which is anonymized but still comes from real conversations:

```python
from delab_trees.test_data_manager import get_test_manager

manager = get_test_manager()
manager.describe()
```


Conversations in social media data are often not valid trees. To check whether all conversations are valid trees, simply call:

```python
manager.validate(break_on_invalid=False, verbose=False)
```
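
As a rough illustration of what "valid tree" means here, the sketch below checks two common problems with pandas alone: a conversation without exactly one root, and replies whose parent is missing (for example because the parent post was deleted). The helper `find_structural_problems` is hypothetical and is not part of delab_trees. It is not how `validate` is implemented, and it does not detect cycles.

```python
import pandas as pd


def find_structural_problems(posts: pd.DataFrame) -> list[str]:
    """Return readable problems that stop one conversation from being a tree."""
    problems = []
    post_ids = set(posts["post_id"].tolist())
    is_root = posts["parent_id"].isna()

    # A tree has exactly one post without a parent.
    if is_root.sum() != 1:
        problems.append(f"expected exactly one root post, found {int(is_root.sum())}")

    # A reply whose parent is not in the table usually points to a deleted post.
    orphans = posts[~is_root & ~posts["parent_id"].isin(post_ids)]
    for post_id in orphans["post_id"].tolist():
        problems.append(f"post {post_id} replies to a missing parent")
    return problems


intact = pd.DataFrame({"post_id": [1, 2, 3, 4], "parent_id": [None, 1, 2, 1]})
# Same conversation after post 2 has been deleted: post 3 loses its parent.
damaged = pd.DataFrame({"post_id": [1, 3, 4], "parent_id": [None, 2, 1]})

print(find_structural_problems(intact))
print(find_structural_problems(damaged))
```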


### Use Case 2: Extract Pathways


::: {.columns}
::: {.column width="50%"}
![Photo of marked Pathways](tutorial/img/conversation02.png){#fig-conversationpath width="25%"}
:::
::: {.column width="50%"}
By analogy with offline conversations, we are interested in longer reply chains, as depicted in @fig-conversationpath. Here, the nodes are the posts, and each edge, read from top to bottom, means that the lower post replies to the upper one. The root of the tree is the original post of the online conversation. Every online forum and social media thread can be modeled this way because every post except the root has exactly one parent, which is the defining property of a rooted tree.
:::
:::

The marked path is one of many pathways that can be written down like the transcript of a group discussion. Pathways are defined as all paths in a tree that start at the root and end in a leaf (a node without children). This approach filters linear reply chains out of social media data (see @Wang2008; @Nishi2016). Such chains can be considered the online equivalent of real-life discussions.
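
The definition of a pathway can be sketched with NetworkX on the first example tree from the input table. This is only an illustration of the root-to-leaf idea under the assumption that edges point from a post to its reply. It is independent of the library's own flow computation.

```python
import networkx as nx

# First example tree: post 2 and post 4 reply to post 1, post 3 replies to post 2.
tree = nx.DiGraph([(1, 2), (2, 3), (1, 4)])

# The root has no incoming edge, leaves have no outgoing edge.
root = next(node for node, degree in tree.in_degree() if degree == 0)
leaves = [node for node, degree in tree.out_degree() if degree == 0]

# In a tree there is exactly one path from the root to each leaf.
pathways = [nx.shortest_path(tree, root, leaf) for leaf in leaves]
print(pathways)

# Keep only the longer reply chains, here those with at least three posts.
long_pathways = [path for path in pathways if len(path) >= 3]
print(long_pathways)
```

The tree has two leaves and therefore two pathways. Only the chain through posts 1, 2, and 3 passes the length filter.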

To have a larger dataset available, we load the provided dataset and compute the conversation flows for each tree.

```python
from delab_trees.delab_tree import DelabTree
from delab_trees.test_data_manager import get_social_media_trees

# get the sample trees
social_media_tree_manager = get_social_media_trees()

# compute the flows of every tree and collect them in one flat list
flow_list = []
tree: DelabTree = None  # type hint for the loop variable

for tree_id, tree in social_media_tree_manager.trees.items():
    # assumption: with as_list=True this returns a list of flows,
    # where each flow is a list of posts
    flows = tree.get_conversation_flows(as_list=True)
    flow_list.extend(flows)

print(len(flow_list), "flows were found")

# now we are only interested in flows with 7 or more posts
min_flow_length = 7
filtered_flows = [flow for flow in flow_list if len(flow) >= min_flow_length]

print(len(filtered_flows), "flows with length >=", min_flow_length, "were found")
```



### Use Case 3: Compute the centrality of authors in the conversation


```python
from delab_trees.delab_tree import DelabTree
from delab_trees.test_data_manager import get_test_tree

test_tree: DelabTree = get_test_tree()
author_metrics = test_tree.get_author_metrics()  # returns a map with author ids as keys
for author_id, metrics in author_metrics.items():
    print("centrality of author {} is {}".format(author_id, metrics.betweenness_centrality))
```


The result shows that only mark is central, in the sense that he both receives a reply and replies to someone else. The measure becomes more informative in bigger trees.
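
To see why mark stands out, the same idea can be reproduced with NetworkX alone. The sketch below assumes an author graph in which an edge points from the replying author to the author being answered, built from the first example tree in the input table. The library may construct its author graph differently, so the exact scores are not guaranteed to match.

```python
import networkx as nx

# (replying author, answered author) pairs from the first example tree.
replies = [("mark", "james"), ("steven", "mark"), ("john", "james")]
author_graph = nx.DiGraph(replies)

# Betweenness centrality counts how often an author lies on the shortest
# reply path between two other authors.
centrality = nx.betweenness_centrality(author_graph)
for author, score in sorted(centrality.items()):
    print(f"centrality of author {author} is {score:.3f}")
```

Only mark sits between two other authors (steven replies to mark, who replies to james), so he is the only author with a score above zero.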

## Library Documentation

For an overview of the available functions, see the [library functions section of the README](https://github.com/juliandehne/delab-trees/blob/main/README.md#library-functions).

## Conclusion 
You should now be able to analyze social media conversation trees effectively. If you have any questions, send me an email. I am happy to help!

I would also be happy to hear from anyone who is interested in doing research and writing a publication with this library!


## Exercises or Challenges

Learning exercises are forthcoming. For now, click the BinderHub link at the top to open a notebook in JupyterLab, where you can experiment with the code.

## FAQs

This section will be filled in as more people use the library.
