[LabPlot] General usage questions

Chrismettal · August 20, 2023, 6:28pm

Background

This is less of an issue and more of a collection of questions that I am unable to answer with the documentation alone. None of these points are meant as nagging or complaining about missing features. I genuinely want to understand whether these features are implemented and I just don’t understand how they work, or whether they are not implemented yet. For any feature on this list that is not implemented I’ll happily open separate feature request tickets, but for the moment I feel most of these come down to user error on my part, so I would like to start with a single thread.

My background in data visualization is mostly with SAS JMP in a professional environment, or Plotly at home, though I am mostly rooted in programming and not data viz myself.

Datasets I analyze are mostly related to product quality. My “spreadsheet” might contain a few thousand lines (one per product) with about 50 criteria each. The criteria might be Pass/Fail, Integers, Floats, Bitmasks or full strings.

Points

How do I drop entire rows of data from spreadsheets?
- It makes total sense to me to be able to highlight a column, open up the drop values dialog and for example drop 0 values to NULL. However, I would like to be able to drop the whole row instead of just that value.
- For my background use case, this means I would like to drop all products (rows) from my dataset that are missing valid values in certain criteria (columns)
Can I create subsets of Data?
- This means I would like to create a new spreadsheet out of data from an existing spreadsheet, containing only the rows that meet certain criteria.
Sort spreadsheet by column sorts only that one column, disassociating the data value from the rest of data values
- Is this intended? If every row is one data set for one product, sorting a column means that column no longer has any connection to the source data
- I would have expected the whole spreadsheet to be sorted BY that column, instead of only that column
Does the Box Plot support splitting boxes by categorical data?
- My theoretical dataset has one column containing float values, and another containing a Pass/Fail string
- I would like to create a box plot with 2 boxes, where one box represents product that passed, while one represents failed products
Related to the last one, can categorical data be used to colorize data points individually in other plots?
- A point cloud, for example, might have points colored red if a Pass/Fail criterion has failed in another column of that datapoint
In the plot view, is it possible to see the value of a datapoint underneath the mouse?
- The only way I have figured out yet is to get out the X/Y cursors to approximate datapoint values
During data import from a CSV file, can I specify the data type of each column during import manually?
- In an example dataset, I might have a bitmask column as an 8-character hexadecimal string. If a dataset does not have any of the letters A-F in that column, the column is automatically interpreted as an integer, making it useless as categorical data.
- 00A0000 → gets interpreted correctly as a string, usable as categorical data
- 0090000 → gets interpreted incorrectly as a decimal integer

Labplot Version

Release build 
Aug 11 2023, 23:07:12
System: KDE Flatpak runtime
Locale: English,United States (Decimal point '.', Group separator ',')
Number settings: Decimal point '.', Group separator ',', Exponential 'e', Zero digit '0', Percent '%', Positive/Negative sign '+'/'-' (Updated on restart)
Architecture: x86_64-little_endian-lp64
Kernel: linux 6.4.10-arch1-1
C++ Compiler: GNU 12.2.0
C++ Compiler Flags: -O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fno-omit-frame-pointer -fno-operator-names -fno-exceptions -Wall -Wextra -Wcast-align -Wchar-subscripts -Wformat-security -Wno-long-long -Wpointer-arith -Wundef -Wnon-virtual-dtor -Woverloaded-virtual -Werror=return-type -Werror=init-self -Wvla -Wdate-time -Wsuggest-override -Wlogical-op -Wall -Wextra -Wundef -Wpointer-arith -Wunreachable-code -Wunused -Wdeprecated-declarations -fno-omit-frame-pointer -fstack-protector -fexceptions -std=c++11 -O2 -Wcast-align -Wswitch-enum -fvisibility=default -pedantic -Wzero-as-null-pointer-constant

dlaska · August 21, 2023, 3:30pm

Chrismettal,

My general workflow relevant to some of your questions would be something like this:

Duplicate the spreadsheet of interest (select the spreadsheet in the Project Explorer and use the shortcut Ctrl+D or invoke the relevant action from the context menu).
Create a new column ‘Filter’ in the duplicated spreadsheet.
RMB on the Filter column’s header > Generate Data > Function Values and define your criteria, e.g. if column x is greater than 0 OR column y is greater than 0, set Filter to 1, otherwise set it to 0: if(or(greaterThan(x; 0);greaterThan(y; 0));1;0).
Select all columns (Ctrl+A) and RMB > Sort > Selected Columns (e.g. Ascending, Together, Leading Column: Filter).
Select all rows for which ‘Filter’ equals 0.
RMB on the row headers > Remove (selected) rows.

The above workflow applies to your questions related to dropping entire rows, creating subsets of data and sorting data.
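
For readers who think in code, the logic of the steps above can be sketched in plain Python. This is not LabPlot code and does not use any LabPlot API; the column names, the sample values and the helper `filter_value` are made up for illustration. It only shows why sorting all columns together with a leading column keeps each row intact, and how the Filter column then selects the subset.

```python
# Plain-Python illustration of the filter / sort / remove workflow.
# Not LabPlot code: the rows and column names below are hypothetical.

rows = [
    {"x": 0.0, "y": 0.0, "label": "A"},
    {"x": 1.5, "y": 0.0, "label": "B"},
    {"x": 0.0, "y": 2.0, "label": "C"},
    {"x": -1.0, "y": -3.0, "label": "D"},
]


def filter_value(row):
    """Mirror of if(or(greaterThan(x; 0);greaterThan(y; 0));1;0)."""
    return 1 if (row["x"] > 0 or row["y"] > 0) else 0


# Add the 'Filter' column to every row.
for row in rows:
    row["Filter"] = filter_value(row)

# Sort all columns together, ascending, with 'Filter' as the leading column.
# Whole rows are moved, so every value stays attached to its original row.
rows_sorted = sorted(rows, key=lambda r: r["Filter"])

# Remove the rows for which 'Filter' equals 0; what remains is the subset.
subset = [r for r in rows_sorted if r["Filter"] == 1]

print([r["label"] for r in subset])
```

In this sketch the sort is not strictly needed to build the subset; in the spreadsheet it serves to group the rows to be removed into one contiguous block that is easy to select.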

Unfortunately, LabPlot doesn’t support using text expressions in the definitions of functions yet, so it’s not possible to define a function like this one: if(equal(x,“woman”);1;0).

Hope this helps!

asemke · August 21, 2023, 8:17pm

How do I drop entire rows of data from spreadsheets?
The merge request “Draft: Resolve "Handle missing values"” (!350, Education / LabPlot on GitLab) is pending. Once it is merged, you can try out this feature in the next nightly build.
Can I create subsets of Data?
Not directly, but see the reply from dlaska.
Sort spreadsheet by column sorts only that one column, disassociating the data value from the rest of data values
It is possible either to sort a single column, by selecting “Sort” from the context menu of that column, or to sort all columns together, by selecting “Sort” from the context menu of the spreadsheet (not of a column).
So, just do an RMB-click somewhere in the spreadsheet, select “Sort” and sort all columns together by specifying the leading column that should be used to define the sort order.
Does the Box Plot support splitting boxes by categorical data?
This is not possible yet. The data is organized and consumed in LabPlot column-wise. To obtain what you need, you would have to work with two numeric columns, “Pass” and “Fail”, containing the values for each criterion. If you have 50 criteria, this won’t be a feasible approach for you, though.
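
As a rough sketch of that reshaping step, done outside LabPlot before import, the following plain Python groups one numeric column into one list per category. The sample values and the helper `split_by_category` are hypothetical and only illustrate the idea of turning a value column plus a Pass/Fail column into separate “Pass” and “Fail” columns.

```python
# Illustrative only: made-up measurements with a Pass/Fail result per product.
values = [1.2, 3.4, 2.2, 5.1, 0.9]
results = ["Pass", "Fail", "Pass", "Fail", "Pass"]


def split_by_category(values, categories):
    """Group the numeric values into one list per distinct category."""
    groups = {}
    for value, category in zip(values, categories):
        groups.setdefault(category, []).append(value)
    return groups


columns = split_by_category(values, results)

# Each resulting list could be written out as its own numeric column,
# so that each column feeds one box of the box plot.
for name, column in columns.items():
    print(name, column)
```

The resulting columns generally have different lengths, which is one reason this approach becomes tedious when it has to be repeated for many criteria.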

The remaining points are not possible yet either, but are already documented in our backlog. To solve the problem with the bitmask data being interpreted as integer and not as text, you can export/generate this data initially as quoted text - so, just use “0090000” instead of 0090000 in the file being imported into LabPlot.
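
If the file cannot be regenerated at the source, a small preprocessing script can add the quotes. The sketch below is plain Python, not part of LabPlot; the helper `quote_column`, the column name `bitmask` and the sample data are assumptions made for illustration. It also assumes a simple comma-separated file with a header row and no commas or quotes inside the other columns, and that plain double quotes are what the import treats as text, as described above.

```python
import csv
import io


def quote_column(csv_text, column_name):
    """Return CSV text in which every value of one column is wrapped in double quotes."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    index = header.index(column_name)
    lines = [",".join(header)]
    for row in reader:
        # Escape embedded quotes by doubling them, then wrap the value in quotes.
        row[index] = '"' + row[index].replace('"', '""') + '"'
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical input: the second bitmask contains no letters A-F and would
# otherwise look like a decimal integer.
source = "id,bitmask\n1,00A0000\n2,0090000\n"
quoted = quote_column(source, "bitmask")
print(quoted)
```

Only the chosen column is quoted, so the remaining numeric columns can still be detected as numbers on import.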

Chrismettal · August 23, 2023, 9:43am

Thanks for your answers!

@dlaska this workflow does indeed clear up some of my questions in one go, though it does make me feel that some of these, especially the “sort whole list” part, could be communicated to the user better.
For example, when highlighting a single column to sort, it might be intuitive for the popup to present a radio button for choosing between sorting this column only and sorting the whole dataset.
It does help me continue my work with LabPlot as is however, so thank you!

@asemke that MR looks interesting indeed. It is a bit wooden-hammer-y to only allow dropping rows with incomplete data (though certainly necessary), but if the drop function exists, it might be possible to add a radio button to the “drop values” dialog as well, asking if the entire row should be dropped.

For the boxplot I do think I understand how LabPlot loads the data for boxes now. Makes sense for most applications I assume. Splitting by categorical or other data is a thing that is often very helpful to correlate stuff, and might be useful for other plot types as well (Like colorizing points in a cloud from the same column by another categorical value).
I might look into that between other projects when I understand the source a bit more.

asemke · August 23, 2023, 12:34pm

@Chrismettal as to “Drop Values” - we have this functionality already for column values. From the context menu of a column select “Modify Data” and then “Drop” or “Mask”. In the dialog the user can define the condition to drop/mask the values. This functionality only touches the values in the column and doesn’t remove any rows. With the new MR it will be possible to remove complete rows in case one of the values is missing - with this it will be possible to clean up the spreadsheet quickly and to keep only the relevant rows.

We can also think about extending this logic and implementing something like “drop all rows if the specified condition is met by one of the values in the row”, but here I’m not sure this is really useful, since the values in a row across multiple columns usually hold different data and even different data types, which cannot be described by one single condition the way dropping/masking values within the same column can.

Splitting by categorical data is very useful, agreed. And it’s actually needed, or can be used, for all visualizations: produce a scatter plot or histogram, etc. for every single group in the data, and things like that. We need to implement it, yes.

Chrismettal · August 23, 2023, 12:42pm

we have this functionality already for column values

This is what I was referencing. A selector to choose to drop values in that column vs dropping the entire row (Combining the two mentioned functions) might be very intuitive.

Splitting by categorical data is very useful, agree.

Glad to hear I am on the right track then.
I’ll try to explore some options for how this could be implemented.

dlaska · August 24, 2023, 6:54am

@Chrismettal Thank you for your constructive feedback and the eagerness to contribute, if time permits. Yes, the current solution for sorting columns needs to be streamlined in the way you’ve outlined. Sorting multiple columns or a whole spreadsheet (with multiple sorting keys) should be possible, even if you select a single column. This is important and it’s on our TODO list.