Wednesday, 11 September 2013

Efficient matricization of a pandas dataframe

My first StackOverflow question.
So I have a Pandas DataFrame that looks sort of like this:
String1  String2  String3  value
word1    word2    word3      5.6
word4    word5    word6    123.4
...
This kind of DataFrame comes from a very long processing chain based on a
huge amount of text. (As a side note, I am getting close to memory limits
and am considering HDFStores now.)
Now I would like to do linear algebra operations based on a conversion of
this table into a (Sparse?)Panel or some other kind of efficient data
structure that fills in the blanks with 0s. That is, I would like to
create a table whose rows are String3s and whose columns are String1 x
String2 pairs, and then do linear algebra operations on the rows. However,
I would also like to be able to do the same thing with any other column,
i.e., take String1 as the rows and make columns out of String2 x String3.
I've been experimenting with Panels and pivot tables, but they don't seem
to be quite right, and they often run out of memory.
What's the right way to do this with Pandas or in Python (2.7) in general?
Edited to add this example:
The output table is going to look like this:
String1String2  (word1,word2)  (word1,word5)  (word4,word2)  (word4,word5)  ...
String3
word3                     5.6              0              0              0  ...
word6                       0              0              0          123.4  ...
The number of columns is basically going to be |String1| x |String2|.
Alternatively, String3 as columns and String1String2 as rows would be fine
as well, since I can perform the operations on the column series.
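
For concreteness, here is a minimal sketch of the conversion described above. It is not a confirmed solution: it assumes SciPy is available and that a scipy.sparse matrix (rather than a Panel) is an acceptable target, and that the key columns contain no missing values. The helper name matricize and its parameters are hypothetical. Each string column is mapped to integer codes with pandas.factorize, and the (String1, String2) pair is turned into a single column index over the full |String1| x |String2| cross product, so the blanks are implicit zeros that are never stored.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def matricize(df, row_col, col_a, col_b, value_col="value"):
    """Hypothetical helper: rows = row_col values, columns = (col_a, col_b) pairs."""
    # Map each distinct string to an integer code; sort=True gives sorted labels.
    r_codes, r_labels = pd.factorize(df[row_col], sort=True)
    a_codes, a_labels = pd.factorize(df[col_a], sort=True)
    b_codes, b_labels = pd.factorize(df[col_b], sort=True)

    # Position of the pair (a, b) in the full |col_a| x |col_b| cross product.
    n_b = len(b_labels)
    c_codes = a_codes.astype(np.int64) * n_b + b_codes

    shape = (len(r_labels), len(a_labels) * n_b)
    values = np.asarray(df[value_col], dtype=float)
    # Only the observed cells are stored; duplicate (row, column) entries
    # are summed when the COO matrix is converted to CSR.
    mat = sparse.coo_matrix((values, (r_codes, c_codes)), shape=shape).tocsr()
    return mat, r_labels, a_labels, b_labels


# The two example rows from the table above.
df = pd.DataFrame({
    "String1": ["word1", "word4"],
    "String2": ["word2", "word5"],
    "String3": ["word3", "word6"],
    "value": [5.6, 123.4],
})

mat, rows, a_labels, b_labels = matricize(df, "String3", "String1", "String2")
# Column j corresponds to the pair (a_labels[j // len(b_labels)], b_labels[j % len(b_labels)]).
# mat.toarray() -> [[5.6, 0, 0, 0], [0, 0, 0, 123.4]]
```

Calling matricize(df, "String1", "String2", "String3") instead would give the other orientation, with String1 as the rows. Since the result is a CSR matrix, row-wise operations such as mat[0].dot(mat[1].T) work without ever building the dense table.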
