Reading Excel files directly from your companies OneDrive

pschlunder · July 2017

Nowadays many corporate Excel files tend to be online. This seems quite natural, when you consider the wide range of benefits: ease of sharing and collaboration, access from any pc with Internet connection, user right management and constant backups, just to name a few. Wouldn't it be great to include data from those files directly into your Data Mining process, even without downloading the whole file? Now you can! Introducing the new Read Excel Online Operator. It is part of the latest update of the Spreadsheet Table Extraction extension (download link). Example Excel Online Sheet Above you can see an Excel Sheet stored in a OneDrive for Business instance. Its filename is 'Customer Data.xlsx', it contains one sheet called 'Sheet1' and note, that the 4th column is not given a name, yet. Now let's access this sheet. Open up RapidMiner and search the Marketplace (found in the Extensions menu item) for Spreadsheet Table Extraction. After the installation is complete, you will find a new Operator, called Read Excel Online in your Operator view. Drag it into your process and fill in the parameters. For the sheet shown above it will look like this:

Example process extracting data from OneDrive

The File Path needed is the relative path in your OneDrive for Business. Unfortunately, Microsoft's API does not support private OneDrive accounts, yet. As you can see, my Excel Sheet is located directly in the upper most folder of my OneDrive for Business. If I would move it into a folder named 'Data', I would need to provide 'Data/Customer Data.xlsx' as the File Path. Note, that no leading slash is used.

The Sheet Name is set per default to Microsoft's current default sheet name. Make sure, that you provide the name of the sheet you intend to access. If you don't provide a Cell Range the whole data containing range will be used. Leading rows and columns that are empty, will be skipped. Providing a Cell Range is done using the A1 notation used in Excel. Selecting all cells starting from the 2nd column (B) and the 3rd row up to the 8th column (H) and the 10th row would require to provide 'B3:H10' as the Cell Range.

Since your OneDrive for Business is an internal resource, you also need to verify, that you have access to the Excel Sheet. Therefore you need a so called authentication token. You can get one by visiting the Microsoft Graph Explorer and logging in with your OneDrive credentials (often equivalent to your Microsoft Account, e.g. Office 365). After having logged in, copy the URL from the address bar into the Auth Token field and the Operator will extract the token information automatically.

Executing the process results in the ExampleSet shown below:Resulting ExampleSet from the extraction process

As you can see, all data was extracted successfully. Empty cells were filled up with missing values and the 4th column, which had no name, was given a generic name (here: Attribute 4). And the best part about it, the file was never completely downloaded to your machine. Only data present in the cells you wanted to access was send to your machine. So now you are able to extract only the data of interest while for example facing a large excel document with many sheets, without creating lots of temporary versions of the file on your machine.

In the rare case, that your Excel Sheet only contains data without any column names, check the No column names given option to generate generic attribute names instead of using the first data row as column names.

Advanced Integration

If you have documents stored in Microsoft SharePoint sites, you can use the SharePoint Connector extension (download link) to obtain a list of existing files and load them directly into your process. Therefore have a look at this articles explaining the usage of the extension, and extract the IDs of the desired files and the site itself as described there. Using these information you can provide them via Macros to the Read Excel Online Operator to directly access the data from your SharePoint even without downloading the files. Checking the Read from SharePoint parameter enables you to enter the SharePoint Site ID as well as the SharePoint File ID.

Read from SharePoint option

An example process using both extensions for direct access to SharePoint data in Excel Sheets is given below:

<process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="sharepoint_connector:list_files_sharepoint" compatibility="0.1.000" expanded="true" height="68" name="List SharePoint Files" width="90" x="246" y="238">
        <parameter key="SharePoint Site" value="SharepointTesting"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="238">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="filename.contains.\.xlsx"/>
        </list>
      </operator>
      <operator activated="true" class="extract_macro_from_annotation" compatibility="7.5.003" expanded="true" height="68" name="Extract Macro from Annotation" width="90" x="782" y="238">
        <parameter key="macro" value="siteId"/>
        <parameter key="annotation" value="sharepointSiteId"/>
      </operator>
      <operator activated="true" class="concurrency:loop_values" compatibility="7.5.003" expanded="true" height="82" name="Loop Values" width="90" x="983" y="238">
        <parameter key="attribute" value="sharepointId"/>
        <parameter key="iteration_macro" value="fileId"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="spreadsheet_table_extraction:read_excel_online" compatibility="0.2.001" expanded="true" height="68" name="Read Excel Online" width="90" x="447" y="85">
            <parameter key="Read from SharePoint" value="true"/>
            <parameter key="SharePoint Site ID" value="%{siteId}"/>
            <parameter key="SharePoint File ID" value="%{fileId}"/>
          </operator>
          <connect from_op="Read Excel Online" from_port="example set" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <description align="center" color="orange" colored="true" height="201" resized="true" width="223" x="381" y="29">Obtaining the SharePoint site and file IDs from previously set macros&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;Make sure to check the Sheet Name and to provide a valid token.</description>
        </process>
      </operator>
      <connect from_op="List SharePoint Files" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Extract Macro from Annotation" to_port="object"/>
      <connect from_op="Extract Macro from Annotation" from_port="object" to_op="Loop Values" to_port="input 1"/>
      <connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <description align="left" color="orange" colored="false" height="322" resized="true" width="403" x="100" y="38">#1 Login to https://developer.microsoft.com/en-us/graph/graph-explorer and copy the URL from your browsers address bar into the &amp;quot;Auth Token&amp;quot; field.&lt;br&gt;#2 Go to your SharePoint and choose a site you want to access.&lt;br&gt;#3 Fill in the parameters 'SharePoint URL' with the URL you used to access your SharePoint and fill in the parameter 'SharePoint Site' with the name of the site you want to access.</description>
      <description align="center" color="orange" colored="false" height="322" resized="true" width="219" x="513" y="37">Select only entries containing Excle Sheets (ending with .xlsx).</description>
      <description align="center" color="orange" colored="true" height="321" resized="true" width="185" x="737" y="37">Extract the SharePoint site ID from the annotated ExampleSet and provide it as a macro called 'siteId'</description>
      <description align="center" color="orange" colored="true" height="321" resized="true" width="199" x="928" y="37">Loop over all remaining entries (only the ones being Excel Sheets) to access the SharePoint file IDs from the 'sharepointId' column and providing it via a macro called 'fileId'</description>
    </process>
  </operator>
</process>

Happy Mining,

Philipp for the RapidMiner Research Team

Acknowledgments

The extensions are developed as part of “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German ministry of education and research (BMBF).

Montse · June 2019

Hi @pschlunder ,

I have some problems trying to read an Excel file from my Business OneDrive.
Could you specify with more detail how to obtain the Auth Token?

I have logged in into Microsoft Graph Explorer but the URL I have copied is not correct. When I run the process, RapidMiner tells me "The URL you provided is currently not supported for access token extraction. Please use the URL of the Graph Explorer as explained in the operators documentation".

The steps I have done:
1. I Log in into Microsoft Graph Explorer
2. I have proved the Get API "all the items in my drive" and I can view in JSON format all the items
3. Wich URL I have to copy and put into Auth Token property from Read Excel Online operator? These are the options I have tried
a) the URL that appears in Google Chrome bar: https://developer.microsoft.com/en-us/graph/graph-explorer#
b) the URL that appears into JSON format (when I call all the items in my drive) with the title: @odata.context
c) the URL that appears into JSON format (when I call all the items in my drive) in one of my folders with the title: webUrl
4. With all of them, I can't access the file.

Thanks in advance.
Best regards,
Montse

ey1 · June 2019

Hi Montse,

There has been a change in the way Microsoft exposes the URLs in the browser address bar when accessing the Graph Explorer. This means that you have to fetch the access token in an alternative way. Therefore, in near future, we will update the documentation of "SharePoint Connector" extension and that of "Read Excel Online" operator of the "Spreadsheet Table Extraction" extension. Since the current code works fine - provided you have the access token, I am listing the steps which you need to follow to use the "Read Excel Online" operator:

1) Please login to Microsoft Azure Portal at: https://portal.azure.com/#home

2) Now, you need to create an App, which will be used to automatically access your Microsoft resources (such as Excel Online files from your Business OneDrive) from third party software like RapidMiner Studio. There are several sub-steps:

2.1) Click "Azure Active Directory", then click "App registrations", then "New registration". Give the App a name, provide a redirect URI (https://localhost) for type Web.

2.2) During App creation, you will get a CLIENT_SECRET, which is a hash string. Please write this in your secure file as you cannot access it later.

2.3) During App creation, enable the two check-boxes for "Access tokens" and "ID tokens".

2.4) After App creation, you need to grant it API-level permissions. Click on "Api permissions" link in the left panel. It will show you "Microsoft Graph" API permissions hierarchy. Click it. Now you see a new panel to your right which shows two kind of permissions: "Delegated permissions" and "Application permissions". Select appropriate permissions for the former. For the latter, grant atleast these permissions:

- Application.ReadWrite.All

- Application.ReadWrite.OwnedBy

- Device.ReadWriteAll

- Directory.ReadAll

- Directory.ReadWrite.All

- Domain.ReadWrite.All

- User.Read.All

2.5) After the App has been created, it will be seen in the list under "Owned applications" link. Click on the name of your App. You will see some properties of this App including:

- Application (client) ID: This is a unique identity of your App.

- Directory (tenant) ID: This is a unique identity of your SharePoint / OneDriver user. It will remain the same for all Apps you create.

2.6) If you are not holding Administrator rights for your SharePoint / OneDrive account, you will need to get your App approved from your Administrator. You can ensure that by opening the following URL in your browser and follow the message you receive from Microsoft server:

https://login.microsoftonline.com/{YOUR_DIRECTORY_TENANT_ID}/adminconsent?client_id={YOUR_APPLICATION_CLIENT_ID}

2.7) Now, that your App is approved by your Administrator, you will invoke a POST call on Graph API that will return you the access_token. You can do this using CURL command line utility or Postman - as you like. The method URL is:

URL = "https://login.microsoftonline.com/{YOUR_DIRECTORY_TENANT_ID}/oauth2/token"

Provide following parameters for CURL:

curl -X POST \

https://login.microsoftonline.com/{YOUR_DIRECTORY_TENANT_ID}/oauth2/token \

-H 'cache-control: no-cache' \

-H 'content-type: application/x-www-form-urlencoded' \

-H 'postman-token: ea6acf4b-b46c-d354-6d51-c992ad498d57' \

-d 'client_id={YOUR_APPLICATION_CLIENT_ID}&resource=https%3A%2F%2Fgraph.microsoft.com&grant_type=client_credentials&client_secret={YOUR_CLIENT_SECRET}'

If you prefer to use Postman instead, then add the following content-type to your Request (e.g. as 'application/x-www-form-urlencoded' data) using Postman interface:

client_id : {YOUR_APPLICATION_CLIENT_ID}

resource : https://graph.microsoft.com

grant_type: "client_credentials"

client_secret : {YOUR_CLIENT_SECRET}

3) In the Response of above POST call, you will get your "access_token". Please copy it and enter it to the "Auth Token" property of "List SharePoint Files" operator of "SharePoint Connector" extension. Provide other mandatory parameters like your "SharePoint Site" and "SharePoint URL".

4) The output of "List Sharepoint Files" operator is an ExampleSet. This shows you all folders and files that are available in your SharePoint / OneDrive account. It also gives you a "sharepointId" of each file or folder. Click on the Annotations view and you will see the "sharepointSiteId" as an annotation. Please note down these two properties.

5) Finally, you have all the parameters to configure the "Read Excel Onilne" operator. Copy the "access_token" in "Auth Token" parameter and enter your "Sheet Name". Click the checkbox "Read from SharePoint" to see two dependent parameters namely "SharePoint Site ID" and "SharePoint File ID". You have retrieved these two parameters in Step 4. Copy these as parameter values. Execute the operator and your Excel Online sheet is output as an ExampleSet.

If you further wish to download the Excel Online file, you can connect the output of "List SharePoint Files" operator to the "Download from SharePoint" operator, provide a "Download Path" and execute.

In the next update of "Spreadsheet Table Extraction" extension, we will try to automate Step 2.7, but the App creation steps would remain manual for now.

Hope this helps.

Best Regards,

Edwin Yaqub

Montse · June 2019

Thank you Edwin. I will prove the steps you have told me

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Reading Excel files directly from your companies OneDrive

Advanced Integration

Acknowledgments

Comments