Mining source code files

confusedMonMon · April 2019

Hi there,
I'm new to the mining world and what I'm looking for is mining source code files, i.e files written in programming languages. I thought since source codes are textual data then I can find some text mining tool to mine them, and picked RapidMiner as it is one of the most famous text mining tools. Unfortunately, it couldn't read such files. Am I missing something here? do you have any advice on how to mine such files?
Many thanks

yyhuang · April 2019

Hi @confusedMonMon,

if you have source codes files, saying .sql, .c, .py files, you would need to read document operator from text processing extension.

Image: https://us.v-cdn.net/6030995/uploads/editor/rg/vnqv69nkz9az.png

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="9.2.001" expanded="true" height="68" name="Open File" width="90" x="112" y="34">
        <parameter key="resource_type" value="URL"/>
        <parameter key="url" value="https://raw.githubusercontent.com/Marcnuth/AnomalyDetection/master/anomaly_detection/anomaly_detect_vec.py"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="380" y="34">
        <parameter key="extract_text_only" value="true"/>
        <parameter key="use_file_extension_as_type" value="true"/>
        <parameter key="content_type" value="txt"/>
        <parameter key="encoding" value="SYSTEM"/>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Read Document" to_port="file"/>
      <connect from_op="Read Document" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

SGolbert · April 2019

Hi @confusedMonMon

here is an example of the use of the Text Mining extension with the source code of two Python scripts:

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
<context>
    <input/>
    <output/>
    <macros/>
</context>
<operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
        <parameter key="text" value="# Self Organizing Map

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Credit_Card_Applications.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
X = sc.fit_transform(X)

# Training the SOM
from minisom import MiniSom
som = MiniSom(x = 10, y = 10, input_len = 14, sigma = 1.0, learning_rate = 0.5)
som.random_weights_init(X)
som.train_random(data = X, num_iteration = 200)

# Visualizing the results
from pylab import bone, pcolor, colorbar, plot, show
bone()
pcolor(som.distance_map().T)
colorbar()
markers = ['o', 's']
colors = ['r', 'g']
for i, x in enumerate(X):
    w = som.winner(x)
    plot(w[0] + 0.5,
         w[1] + 0.5,
         markers[y[i]],
         markeredgecolor = colors[y[i]],
         markerfacecolor = 'None',
         markersize = 10,
         markeredgewidth = 2)
show()

# Finding the frauds
mappings = som.win_map(X)
frauds = np.concatenate((mappings[(8,1)], mappings[(6,8)]), axis = 0)
frauds = sc.inverse_transform(frauds)"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document (2)" width="90" x="112" y="187">
        <parameter key="text" value="# Recurrent Neural Network



# Part 1 - Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the training set
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv')
training_set = dataset_train.iloc[:, 1:2].values

# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)

# Creating a data structure with 60 timesteps and 1 output
X_train = []
y_train = []
for i in range(60, 1258):
    X_train.append(training_set_scaled[i-60:i, 0])
    y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)

# Reshaping
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1)) 
# The last "1" corresponds to the number of dataset, for example to include
# another related stock in the predictions



# Part 2 - Building the RNN

# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

# Initialising the RNN
regressor = Sequential()

# Adding the first LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (X_train.shape[1], 1)))
regressor.add(Dropout(0.2))

# Adding a second LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))

# Adding a third LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))

# Adding a fourth LSTM layer and some Dropout regularisation
regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.2))

# Adding the output layer
regressor.add(Dense(units = 1))

# Compiling the RNN
regressor.compile(optimizer = 'adam', loss = 'mean_squared_error') # Regression problem

# Fitting the RNN to the Training set
regressor.fit(X_train, y_train, epochs = 100, batch_size = 32)



# Part 3 - Making the predictions and visualising the results

# Getting the real stock price of 2017
dataset_test = pd.read_csv('Google_Stock_Price_Test.csv')
real_stock_price = dataset_test.iloc[:, 1:2].values

# Getting the predicted stock price of 2017
dataset_total = pd.concat((dataset_train['Open'], dataset_test['Open']), axis = 0)
inputs = dataset_total[len(dataset_total) - len(dataset_test) - 60:].values
inputs = inputs.reshape(-1,1)
inputs = sc.transform(inputs)
X_test = []
for i in range(60, 80):
    X_test.append(inputs[i-60:i, 0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
predicted_stock_price = regressor.predict(X_test)
predicted_stock_price = sc.inverse_transform(predicted_stock_price) # undo normalization

# Visualising the results
plt.plot(real_stock_price, color = 'red', label = 'Real Google Stock Price')
plt.plot(predicted_stock_price, color = 'blue', label = 'Predicted Google Stock Price')
plt.title('Google Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('Google Stock Price')
plt.legend()
plt.show()
"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="collect" compatibility="9.2.001" expanded="true" height="103" name="Collect" width="90" x="246" y="34">
        <parameter key="unfold" value="false"/>
      </operator>
      <operator activated="true" class="loop_collection" compatibility="9.2.001" expanded="true" height="82" name="Loop Collection" width="90" x="447" y="34">
        <parameter key="set_iteration_macro" value="false"/>
        <parameter key="macro_name" value="iteration"/>
        <parameter key="macro_start_value" value="1"/>
        <parameter key="unfold" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:remove_document_parts" compatibility="8.1.000" expanded="true" height="68" name="Remove Document Parts" width="90" x="380" y="34">
            <parameter key="deletion_regex" value="#(.*?)\n"/>
          </operator>
          <connect from_port="single" to_op="Remove Document Parts" to_port="document"/>
          <connect from_op="Remove Document Parts" from_port="document" to_port="output 1"/>
          <portSpacing port="source_single" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">remove comments</description>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="648" y="34">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="Term Frequency"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="false"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="447" y="34">
            <parameter key="max_length" value="2"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">tokenize, do n-grams and count frequencies<br/></description>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Collect" to_port="input 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Collect" to_port="input 2"/>
      <connect from_op="Collect" from_port="collection" to_op="Loop Collection" to_port="collection"/>
      <connect from_op="Loop Collection" from_port="output 1" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
</operator>
</process>

You would have to work quite a bit in the subprocess of the Process Documents operator to get something useful.

Regards,

Sebastian

SGolbert · April 2019

Hi,

you can search GitHub repos by language:

https://help.github.com/en/articles/searching-for-repositories#search-by-language

Whether you do it through web crawling or directly with the GitHub API, I leave it for you to find out

Regards,

Sebastian

confusedMonMon · April 2019

Thany you @yyhuang and @SGolbert. Now I'm able to read the source code files I have. I also finished the RapidMiner tutorials which mainly used structured text files and got my head around it. However, since the source code files are unstructured textual data and what I'm looking for is analyzing some programming language features usage based on mining the files. Is there any advice/recommendations on the components/operators I should use considering that (1) the data is unstructured text (source code) and (2) I'm not going to analyze any natural language phrases and will skip the comments parts of the source code files. Thank you

SGolbert · April 2019

Hi confusedMonMon,

Can you tell us what problem you are trying to solve? The approach will vary depending on that.

I think that in general you can work with the text mining extension. The Text and Web Mining Course is a great introduction, but AFIK it hasn't been made open yet (@Knut-RM). Depending on the complexity of the problem, you can build a good model by counting word frequencies and using n-grams.

Regards,

Sebastian

IngoRM · April 2019

Hi,

Just wanted to mention that the text and web mining course is now public as well:

https://academy.rapidminer.com/courses/text-and-web-mining-with-rapidminer

Hope this helps,
Ingo

confusedMonMon · April 2019

Thanks @SGolbert. I'm currently working on the text processing extension. I want to (1) exclude some sentences and paragraphs that start and/or end with a certain character (e.g. /*, //, #,...) from processing. Also, I want to (2) look for a predefined list of words and/or phrases that have a specific pattern in the documents to be detected and compared to others. Any suggestions to start with?

confusedMonMon · April 2019

Hi @SGolbert. I couldn't find the "Text mining" extension. Maybe you mean web mining? would it help since I already have the source code files locally? Thanks

sgenzer · April 2019

hi @confusedMonMon I think @SGolbert meant to say the "Text Processing" extension. I make the same mistake all the time.

Image: https://us.v-cdn.net/6030995/uploads/editor/sy/8lakp4m6ja9q.png

confusedMonMon · April 2019

Thanks @sgenzer and @SGolbert. Makes sense now.

confusedMonMon · April 2019

Hi @SGolbert . I have a further question. How to set the text directories in Process Documents from Files operator automatically instead of manually? Is there any way to do so? This is my attempt:

<br>

SGolbert · April 2019

Hi @confusedMonMon

I cannot parse your process, can you paste it again with the right format?

Regards,

Sebastian

confusedMonMon · April 2019

Thanks @SGolbert I managed to make it work now.

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Mining source code files

Best Answers

Answers