# Sequential Data Definition

## Source data
We address the problem of learning on discrete event sequences generated by real-world users.
### Raw table data
Lifestream data can be represented as a table where rows are events and columns are event attributes.
Columns can be of the following data types:
- `user_id` - an id used to collect events into sequences. We assume there are many users in the dataset, each with an associated sequence of events. An event can be linked to only one user.
- `event_time` - a timestamp used to order events within a sequence. Date-time features can be extracted from the timestamp. If a timestamp is not available, you can use any data type that defines the order.
- feature fields - describe the properties of events. They can be numerical, categorical, or any type that can be converted to a feature vector.
Credit card transaction history is an example of lifestream data:
| client_id | date_time | mcc_code | amount |
|---|---|---|---|
| A0001 | 2021-03-01 12:00:00 | 6011 | 1000.00 |
| A0001 | 2021-03-01 12:15:00 | 4814 | 12.05 |
| A0001 | 2021-03-04 10:00:00 | 5411 | 2312.99 |
| A0001 | 2021-03-04 10:00:00 | 5411 | 199.99 |
| E0123 | 2021-02-05 13:10:00 | 6536 | 12300.00 |
| E0123 | 2021-03-05 12:04:00 | 6536 | 12300.00 |
| E0123 | 2021-04-05 11:22:00 | 6536 | 12300.00 |
In this example there are two users (clients) and therefore two sequences: the first contains 4 events, the second contains 3.
We sort the events by `date_time` for each user to ensure the correct event order.
Each event (transaction) is described by the categorical field `mcc_code`, the numerical field `amount`, and the time field `date_time`.
These fields make it possible to distinguish events, vectorize them, and use them as features.
pytorch-lifestream supports this data format and provides the tools to process it through the pipeline.
Data can be a `pandas.DataFrame` or a `pyspark.DataFrame`.
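As a quick illustration (plain pandas, not a pytorch-lifestream API), the example table above can be constructed like this:

```python
import pandas as pd

# Raw lifestream data: one row per event, grouped by client_id,
# ordered by date_time, described by mcc_code and amount.
df = pd.DataFrame({
    'client_id': ['A0001', 'A0001', 'A0001', 'A0001', 'E0123', 'E0123', 'E0123'],
    'date_time': pd.to_datetime([
        '2021-03-01 12:00:00', '2021-03-01 12:15:00',
        '2021-03-04 10:00:00', '2021-03-04 10:00:00',
        '2021-02-05 13:10:00', '2021-03-05 12:04:00', '2021-04-05 11:22:00',
    ]),
    'mcc_code': [6011, 4814, 5411, 5411, 6536, 6536, 6536],
    'amount': [1000.00, 12.05, 2312.99, 199.99, 12300.00, 12300.00, 12300.00],
})
```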
### Data collected in lists
Table data should be converted to a format that is more convenient for feeding a neural network. The steps are:

- Feature field transformation: encoding categorical features, normalizing amounts, imputing missing values. This works like sklearn fit-transform preprocessors.
- Grouping all events by `user_id` and sorting them by `event_time`. The flat table of events becomes a set of users with their event collections.
- Splitting events by feature field. Features are stored as 1d arrays; the sequence order is kept.
The previous example can then be presented as follows (feature transformation omitted for readability):
```python
[
    {
        'client_id': 'A0001',
        'date_time': ['2021-03-01 12:00:00', '2021-03-01 12:15:00',
                      '2021-03-04 10:00:00', '2021-03-04 10:00:00'],
        'mcc_code': [6011, 4814, 5411, 5411],
        'amount': [1000.00, 12.05, 2312.99, 199.99],
    },
    {
        'client_id': 'E0123',
        'date_time': ['2021-02-05 13:10:00', '2021-03-05 12:04:00', '2021-04-05 11:22:00'],
        'mcc_code': [6536, 6536, 6536],
        'amount': [12300.00, 12300.00, 12300.00],
    },
]
```
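One way to get from the raw table to this structure is a plain pandas group-by, sketched below. pytorch-lifestream ships its own preprocessors for this step; the snippet only illustrates the idea, using the `df` built earlier:

```python
# Sort by event time, group by user, and collect every feature
# column into a list; each user becomes one record.
records = (
    df.sort_values('date_time')
      .groupby('client_id')
      .agg(list)
      .reset_index()
      .to_dict(orient='records')
)
# records[0]['mcc_code'] == [6011, 4814, 5411, 5411]
```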
This is the main input data format in pytorch-lifestream. The following is supported:

- conversion from a raw table to collected lists, both for `pandas.DataFrame` and `pyspark.DataFrame`
- fast and efficient storage in parquet format (see the sketch after this list)
- compatible `torch.utils.data.Dataset` and `torch.utils.data.DataLoader`
- in-memory augmentations and transformations
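For example, the collected records can be written to parquet with plain pandas (a sketch, assuming the pyarrow engine, which handles list-valued columns natively; the library provides its own parquet tooling):

```python
# Store and reload the collected-lists layout; list columns
# round-trip through parquet without flattening.
pd.DataFrame(records).to_parquet('transactions.parquet')
records_back = pd.read_parquet('transactions.parquet').to_dict(orient='records')
```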
## Dataset
pytorch-lifestream provides multiple `torch.utils.data.Dataset` implementations.
A dataset item presents the information of a single user and can be a combination of:

- `record` - a dictionary where keys are feature names and values are 1d tensors with feature sequences, similar to the data collected in lists
- `id` - identifies the sequence
- `target` - the target value for supervised learning
Code example:

```python
dataset = SomeDataset(params)
X = dataset[0]  # a single user's item, e.g. a dict of 1d feature tensors
```
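A minimal sketch of such a dataset (`SomeDataset` and `params` above are placeholders; the in-memory class below, including its name, is assumed for illustration only and is not a library class):

```python
import torch
from torch.utils.data import Dataset

class MemoryRecordDataset(Dataset):  # hypothetical name, for illustration
    """Serves one user's record as a dict of 1d feature tensors."""
    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        return {
            'mcc_code': torch.tensor(rec['mcc_code'], dtype=torch.long),
            'amount': torch.tensor(rec['amount'], dtype=torch.float),
        }

dataset = MemoryRecordDataset(records)
X = dataset[0]  # {'mcc_code': tensor([6011, 4814, 5411, 5411]), 'amount': tensor([...])}
```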
## DataLoader
The main feature of the pytorch-lifestream dataloader is the customized `collate_fn` provided to the `torch.utils.data.DataLoader` class.
`collate_fn` collects the individual record dictionaries into a batch.
Usually `collate_fn` pads and packs the sequences into 2d tensors of shape `(B, T)`, where `B` is the number of samples and `T` is the maximum sequence length.
Each feature is packed separately.
The output is of type `PaddedBatch`, which collects the packed sequences together with their lengths.
`PaddedBatch` is compatible with all pytorch-lifestream modules.
Input and output example:
```python
# input
batch = [
    {'cat1': [0, 1, 2, 3], 'amnt': [10, 20, 10, 10]},
    {'cat1': [3, 1], 'amnt': [13, 6]},
    {'cat1': [1, 2, 3], 'amnt': [10, 4, 10]},
]

# output
batch = PaddedBatch(
    payload={
        'cat1': [
            [0, 1, 2, 3],
            [3, 1, 0, 0],
            [1, 2, 3, 0],
        ],
        'amnt': [
            [10, 20, 10, 10],
            [13, 6, 0, 0],
            [10, 4, 10, 0],
        ],
    },
    seq_len=[4, 2, 3],
)
```
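A simplified sketch of such a `collate_fn` (a stand-in for illustration, not the library's actual implementation; `PaddedBatchSketch` mimics `PaddedBatch` with a plain namedtuple):

```python
from collections import namedtuple

import torch

# Stand-in for pytorch-lifestream's PaddedBatch type.
PaddedBatchSketch = namedtuple('PaddedBatchSketch', ['payload', 'seq_len'])

def collate_fn(batch):
    # All features of one record share the same sequence length.
    seq_len = torch.tensor([len(next(iter(rec.values()))) for rec in batch])
    max_len = int(seq_len.max())
    payload = {}
    for name in batch[0]:
        # Pad each feature separately into a (B, T) tensor.
        padded = torch.zeros(len(batch), max_len,
                             dtype=torch.as_tensor(batch[0][name]).dtype)
        for i, rec in enumerate(batch):
            seq = torch.as_tensor(rec[name])
            padded[i, :len(seq)] = seq
        payload[name] = padded
    return PaddedBatchSketch(payload=payload, seq_len=seq_len)

out = collate_fn([
    {'cat1': [0, 1, 2, 3], 'amnt': [10, 20, 10, 10]},
    {'cat1': [3, 1], 'amnt': [13, 6]},
    {'cat1': [1, 2, 3], 'amnt': [10, 4, 10]},
])
# out.payload['cat1'].shape == (3, 4); out.seq_len -> tensor([4, 2, 3])
```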