added hlbpr assignments 1-7
All checks were successful
build-latex / build (push) Successful in 7m23s
This commit is contained in:
parent
ce6b7790b3
commit
0acdffd3ae
6 changed files with 1412 additions and 0 deletions
hlbpr/assignment2/main.tex: 353 lines added (new file)
\documentclass[a4paper]{article}
%\usepackage[singlespacing]{setspace}
\usepackage[onehalfspacing]{setspace}
%\usepackage[doublespacing]{setspace}
\usepackage{geometry} % Required for adjusting page dimensions and margins
\usepackage{amsmath,amsfonts,stmaryrd,amssymb,mathtools,dsfont} % Math packages
\usepackage{tabularx}
\usepackage{colortbl}
\usepackage{amsthm}
\usepackage{enumerate} % Custom item numbers for enumerations
\usepackage{enumitem}
\usepackage{subcaption}
\usepackage{float}
\usepackage[table,xcdraw]{xcolor}
\usepackage{tikz-qtree}
\usepackage{forest}
\usepackage{pgfplots}
\usepackage{changepage,titlesec,fancyhdr} % For styling Header and Titles
\usepackage{diagbox}
\usepackage{xfrac}

\usepackage[ruled]{algorithm2e} % Algorithms

\usepackage[framemethod=tikz]{mdframed} % Allows defining custom boxed/framed environments

\usepackage{listings} % File listings, with syntax highlighting

\usepackage[ddmmyyyy]{datetime}

\pagestyle{fancy}
\renewcommand{\headrulewidth}{0.5pt} % Adjust the rule width if desired
\renewcommand{\headrule}{
	\makebox[\textwidth]{\rule{1.0\textwidth}{0.5pt}}
}

\geometry{
	paper=a4paper, % Paper size, change to letterpaper for US letter size
	top=3cm, % Top margin
	bottom=3cm, % Bottom margin
	left=2.5cm, % Left margin
	right=2.5cm, % Right margin
	headheight=25pt, % Header height
	footskip=1.5cm, % Space from the bottom margin to the baseline of the footer
	headsep=1cm, % Space from the top margin to the baseline of the header
	%showframe, % Uncomment to show how the type block is set on the page
}
\lstset{
	language=C++,
	basicstyle=\ttfamily\small,
	numbers=left,
	numberstyle=\tiny,
	stepnumber=1,
	numbersep=5pt,
	backgroundcolor=\color{white},
	showspaces=false,
	showstringspaces=false,
	showtabs=false,
	frame=single,
	rulecolor=\color{black},
	tabsize=2,
	captionpos=b,
	breaklines=true,
	breakatwhitespace=false,
	keywordstyle=\color{blue},
	commentstyle=\color{purple},
	stringstyle=\color{red}
}
\lhead{Badan, 7418190\\Kneifel, 8071554}
\chead{\bfseries{\vspace{0.5\baselineskip}HL-BPR Praktikum SS25\\Blatt 02}}
\rhead{Wolf, 8019440\\Werner, 7987847}
\fancyheadoffset[R]{0cm}
\begin{document}

\section{Exercise 1.1}
General ideas and plans:
\begin{itemize}
	\item use a flip list to store which cells need to be inverted (the column and row indices of each cell)
	\item use nested vectors to store the cells (the game board)
	\item compute a 3x3 neighbor grid for each cell before applying the given rules, while respecting edge wrapping (edges and corners connect to the opposite side of the game board)
	\item use a custom library for a pleasant user interface
\end{itemize}
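The data layout above can be sketched as a minimal skeleton (the member names match the listings in this report; the skeleton itself is illustrative, not our full class):

\begin{lstlisting}[caption={Sketch of the data layout (illustrative)}]
#include <tuple>
#include <vector>

struct GameOfLife {
    int dimX = 0, dimY = 0;                          // board dimensions
    std::vector<std::vector<bool>> board;            // nested vectors: the game board
    std::vector<std::tuple<int, int>> flipList;      // (x, y) indices of cells to invert
    std::vector<std::tuple<int, int>> lastFlipList;  // kept for stability detection
    std::vector<std::tuple<int, int>> secondTolastFlipList;
};
\end{lstlisting}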
\begin{figure}[H]
	\centering
	\includegraphics[width=1\linewidth]{classdiagram.png}
	\caption{UML class diagram}
\end{figure}
\clearpage

\section{Exercise 1.2}
\subsection{Evolve}
This is our main game loop; the given game rules are implemented via the following nested if statements.
\begin{lstlisting}[caption={Game rules}]
// Rules of the game as stated on Wikipedia
if (board[j][i] == true) {
    if (liveCellCount < 2) {
        flipList.push_back(make_tuple(j, i)); // (x, y)
    } else if (liveCellCount > 3) {
        flipList.push_back(make_tuple(j, i));
    }
} else {
    if (liveCellCount == 3) {
        flipList.push_back(make_tuple(j, i));
    }
}
\end{lstlisting}
After deciding which cells need to be flipped (killed or born), we invert their current state:
\begin{lstlisting}[caption={Changing cell states}]
// Flipping the flipList elements
for (tuple<int, int> coords : flipList) {
    int j = get<0>(coords);
    int i = get<1>(coords);

    board[j][i] = !board[j][i]; // inverting the bool
}

secondTolastFlipList = lastFlipList;
lastFlipList = flipList;
flipList.clear();

print();
\end{lstlisting}
\clearpage
\subsection{load}
This function loads a world from a file. It reads the world's dimensions from the first two lines and stores the rest in a nested vector; it also validates the file by making sure that it only contains 1s and 0s.
\begin{lstlisting}[caption={main loop of the load function}]
string line;
// Iterator over the matrix rows (to address the sublists)
auto sublist_It = matrix.begin();
while (getline(file, line) && sublist_It != matrix.end()) {
    // skip the first 2 lines (assuming linenumb starts at 0),
    // as those only store the dimensions
    linenumb++;
    if (linenumb <= 2) continue;
    stringstream str_s(line);
    int val;
    // this loop checks that only 1s and 0s are present (except in the first 2 lines)
    while (str_s >> val) {
        if (val != 0 && val != 1) {
            cerr << "Error: only 1s and 0s allowed";
            return {};
        }
        sublist_It->push_back(val != 0); // adding the bools to the sublists
    }
    ++sublist_It; // advance to the next sublist
}
\end{lstlisting}
\clearpage
\subsection{save}
The save function writes the current state of the game to a text file. The dimensions of the board/matrix are written as the first two lines; after that, each subvector of the nested vector is written as one line of the file.
\begin{lstlisting}
ofstream outfile(saveName);
if (!outfile) {
    return 1;
}
else {
    outfile << board.size() << endl;          // writing the x dimension
    outfile << board.begin()->size() << endl; // writing the y dimension
}
// writing the elements from the list to the file
for (const auto& sublist : board) {
    for (bool elem : sublist) {
        outfile << elem << " ";
    }
    outfile << endl;
}
\end{lstlisting}
\subsection{isstable}
This function checks whether any changes have been made since the last round of the game by comparing the current flip list with the last and second-to-last ones. If one of them matches perfectly, the game is stable; otherwise false is returned.
\begin{lstlisting}
bool is_stable() {
    if (flipList == lastFlipList) {
        return true;
    }
    else if (flipList == secondTolastFlipList) {
        return true;
    }
    return false;
}
\end{lstlisting}
\clearpage
\subsection{print}
This function prints the current board state using Unicode symbols to make it more readable.
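As an illustration, a print routine of this kind could look as follows; the symbol choice (a full block for living cells, a middle dot for dead ones) is an assumption, not necessarily the exact characters our implementation uses:

\begin{lstlisting}[caption={Illustrative sketch of a Unicode print routine}]
#include <ostream>
#include <vector>

// Prints the board to the given stream using Unicode symbols
void printBoard(const std::vector<std::vector<bool>>& board, std::ostream& os) {
    for (const auto& row : board) {
        for (bool cell : row) {
            os << (cell ? "\u2588" : "\u00B7"); // full block / middle dot
        }
        os << '\n';
    }
}
\end{lstlisting}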
\subsection{Additional Functions}
\subsubsection{getNeighboredCells}
This function creates a 3x3 grid of all the neighbors of a given cell, with the given cell at the center of the grid (position 1,1). The grid is used in getNeighborMatchCount to figure out how many living or dead neighbors a cell has.
\begin{lstlisting}
vector<vector<bool>> getNeighboredCells(int coordX, int coordY) {
    vector<vector<bool>> neighborGrid(3, vector<bool>(3, false));

    for (int dimYCounter = -1; dimYCounter <= 1; ++dimYCounter) {
        for (int dimXCounter = -1; dimXCounter <= 1; ++dimXCounter) {
            // Wrap around using modulo
            int ni = (coordY + dimYCounter + dimY) % dimY;
            int nj = (coordX + dimXCounter + dimX) % dimX;

            neighborGrid[dimYCounter + 1][dimXCounter + 1] = board[ni][nj];
        }
    }
    return neighborGrid;
}
\end{lstlisting}
\clearpage
\subsubsection{getNeighborMatchCount}
This function counts how many neighbors of a cell match a given state (dead or alive). If the cell itself also matches the state in question, we subtract one from the match count so that the cell is not counted as its own neighbor.
\begin{lstlisting}
// Counts how many neighbors match the state we are looking for (dead or alive)
int getNeighborMatchCount(const vector<vector<bool>>& neighborGrid, bool state) {
    int matchCount = 0;

    for (const vector<bool>& row : neighborGrid) {
        for (bool cell : row) {
            if (cell == state) {
                matchCount += 1;
            }
        }
    }
    // subtract one if the center cell itself matches the state in question
    if (neighborGrid[1][1] == state) {
        matchCount -= 1;
    }
    return matchCount;
}
\end{lstlisting}
\subsubsection{getTerminalSize}
This function tries to detect the dimensions of the terminal; if this fails, we ask the user to enter the dimensions manually.
\begin{lstlisting}
void getTerminalSize() {
    // Try to get terminal dimensions
    if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &w) == -1) {
        std::cerr << "Unable to determine terminal size. Please enter dimensions manually:\n";
        std::cout << "Terminal width (characters): ";
        std::cin >> terminalDimX;
        std::cout << "Terminal height (rows): ";
        std::cin >> terminalDimY;
    } else {
        terminalDimX = w.ws_col;
        terminalDimY = w.ws_row;
    }
}
\end{lstlisting}
\clearpage

\section{Exercise 1.4}
DISCLAIMER: Because the run time for 2000 generations was too high, we used 10 generations. The real times would be approx. 200 times higher.

The run times when compiled in DEBUG mode:
\begin{itemize}
	\item Flag O0: 3m56.856s
	\item Flag O1: 4m58.583s
	\item Flag O2: 5m22.436s
	\item Flag O3: 5m20.077s
\end{itemize}
The run times when compiled in RELEASE mode:
\begin{itemize}
	\item Flag O0: 3m48.844s
	\item Flag O3: 5m27.411s
\end{itemize}
Normally one would expect the run time to drop as the optimization level goes up. For this particular program that does not seem to be the case.
The flag O0 does not change the run time, as no optimizations are applied.

The flag O1 improves efficiency by deleting variables that are not used or by factoring expressions into simpler terms. But trying to optimize source code that cannot be optimized can even increase the run time, which is the case here.

The flags O2 and O3 optimize the program by a much larger margin, but these optimizations do not take effect here, as the code would have to be written with them in mind. Flag O2 tries to decrease the run time by combining variables or unrolling loops. Flag O3 optimizes in a more aggressive manner, but those optimizations are not usable with this source code and can even increase the run time.

The differences between flags O0 and O3 (in both modes) are huge. The compiler tries to optimize code using various techniques (examples above), but before the compiler can optimize anything, the source code has to be written with the higher optimization levels in mind. \\ \\
The times above were measured on a slower computer. Running the program on a much faster computer gives these times:
\begin{lstlisting}[caption={Test on a different machine}]
❯ g++ -O0 -g main-test.cpp -o main-test && ./main-test
Loop took 451 milliseconds.
❯ g++ -O1 -g main-test.cpp -o main-test && ./main-test
Loop took 149 milliseconds.
❯ g++ -O2 -g main-test.cpp -o main-test && ./main-test
Loop took 145 milliseconds.
❯ g++ -O3 -g main-test.cpp -o main-test && ./main-test
Loop took 139 milliseconds.
❯ g++ -O4 -g main-test.cpp -o main-test && ./main-test
Loop took 134 milliseconds.
\end{lstlisting}
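Timings like the ones above can be taken with \verb|std::chrono|; the following sketch (with a stand-in loop body, not our actual workload) illustrates the measurement:

\begin{lstlisting}[caption={Illustrative timing helper}]
#include <chrono>

// Times a dummy loop and returns the elapsed milliseconds
long long timeLoopMs() {
    auto start = std::chrono::steady_clock::now();

    volatile long long sum = 0; // volatile so -O3 cannot remove the loop entirely
    for (long long i = 0; i < 10000000; ++i) sum = sum + i;

    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}
\end{lstlisting}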
The times measured here are more in line with what we expected from the optimizations. The computer used here, or rather the compiler run on it, was better able to exploit the optimizations of the flags O1, O2 and O3. These optimizations include, for example, the removal of unused code, constant folding, and vectorization (with flags O2 and O3).


\section*{Exercise 1.5}

We cut/extended (with zeros) our matrix so that the desired dimensions are maintained. The files with the other dimensions, which can be run, are inside the directory.
\begin{center}
\begin{tikzpicture}
\begin{axis}[
    width=12cm,
    height=8cm,
    xlabel={Grid Size ($x \times y$)},
    ylabel={Run-time [ms]},
    title={Simulation time for 100 generations},
    xtick=data,
    xticklabels={{10×10}, {20×20}, {100×100}, {1000×1000}, {10000×10000}},
    ymode=log,
    log basis y=10,
    ymin=1,
    ymax=100000000,
    grid=major,
    enlargelimits=0.1,
    tick label style={font=\small},
    label style={font=\small},
    title style={font=\small\bfseries},
]
\addplot[
    mark=*,
    color=blue,
    thick
] coordinates {
    (1, 1032)
    (2, 1110)
    (3, 3627)
    (4, 266628)
    (5, 100000000)
};
\end{axis}
\end{tikzpicture}
\end{center}

The run for the 10,000x10,000 matrix took so long that we only used an estimated value (approx. 27 h).
The estimated time is calculated from how long the program was busy with the 1000x1000 matrix times 100, i.e. approx. 100,000,000 ms = 27 h.

\end{document}
hlbpr/assignment3/main.tex: 262 lines added (new file)
\documentclass[a4paper]{article}
%\usepackage[singlespacing]{setspace}
\usepackage[onehalfspacing]{setspace}
%\usepackage[doublespacing]{setspace}
\usepackage{geometry} % Required for adjusting page dimensions and margins
\usepackage{amsmath,amsfonts,stmaryrd,amssymb,mathtools,dsfont} % Math packages
\usepackage{tabularx}
\usepackage{colortbl}
\usepackage{amsthm}
\usepackage{enumerate} % Custom item numbers for enumerations
\usepackage{enumitem}
\usepackage{subcaption}
\usepackage{float}
\usepackage[table,xcdraw]{xcolor}
\usepackage{tikz-qtree}
\usepackage{forest}
\usepackage{changepage,titlesec,fancyhdr} % For styling Header and Titles
\usepackage{diagbox}
\usepackage{xfrac}

\usepackage[ruled]{algorithm2e} % Algorithms

\usepackage[framemethod=tikz]{mdframed} % Allows defining custom boxed/framed environments

\usepackage{listings} % File listings, with syntax highlighting

\usepackage[ddmmyyyy]{datetime}

\pagestyle{fancy}
\renewcommand{\headrulewidth}{0.5pt} % Adjust the rule width if desired
\renewcommand{\headrule}{
	\makebox[\textwidth]{\rule{1.0\textwidth}{0.5pt}}
}

\geometry{
	paper=a4paper, % Paper size, change to letterpaper for US letter size
	top=3cm, % Top margin
	bottom=3cm, % Bottom margin
	left=2.5cm, % Left margin
	right=2.5cm, % Right margin
	headheight=25pt, % Header height
	footskip=1.5cm, % Space from the bottom margin to the baseline of the footer
	headsep=1cm, % Space from the top margin to the baseline of the header
	%showframe, % Uncomment to show how the type block is set on the page
}
\lstset{
	language=C++,
	basicstyle=\ttfamily\small,
	numbers=left,
	numberstyle=\tiny,
	stepnumber=1,
	numbersep=5pt,
	backgroundcolor=\color{white},
	showspaces=false,
	showstringspaces=false,
	showtabs=false,
	frame=single,
	rulecolor=\color{black},
	tabsize=2,
	captionpos=b,
	breaklines=true,
	breakatwhitespace=false,
	keywordstyle=\color{blue},
	commentstyle=\color{purple},
	stringstyle=\color{red}
}
\lhead{Badan, 7418190\\Kneifel, 8071554}
\chead{\bfseries{\vspace{0.5\baselineskip}HL-BPR Praktikum SS25\\Blatt 03}}
\rhead{Wolf, 8019440\\Werner, 7987847}
\fancyheadoffset[R]{0cm}
\begin{document}

\section*{Exercise 2.1: Learning more about Neural Networks}

\subsection*{Depth of a network}
A neural network consists of 3 different classes of layers:
\begin{itemize}
	\item Input layer: accepts the raw data
	\item Hidden layers: these layers are responsible for processing the given data
	\item Output layer: returns an ``answer'' for the processed data
\end{itemize}
The depth is the total number of layers, but the input layer is not counted. If there are 6 hidden layers, the depth of the neural network is 7 (6 hidden layers plus the output layer).

\subsection*{Width of a layer}
The width of a layer refers to the width of a hidden layer: it is the number of neurons in one layer. A neuron is the smallest computing unit; it processes the input it receives, weights it, and sums it up. The activation function is then applied to this sum to decide what to pass on.

\subsection*{Training vs. Testing}
Training and testing are two different phases in the learning of a neural network:
\subsubsection*{Training}
During training, the neural network is given a lot of sample data. The network adjusts the weights within its layers in order to produce correct outputs.
\subsubsection*{Testing}
In testing, the skills learned during training are applied to a new set of data. This makes it possible to check how well the learning generalizes.

\subsection*{batch size}
A batch is a packet of data, i.e. a part of the large data set that is passed to the neural network. The packets are passed to the network one after another until the entire data set has been processed.

The batch size is the size of one packet. For a network that classifies images, for example, this could be 50 images per batch.

\subsection*{epoch}
An epoch is one pass through the entire data set, i.e. until each batch has been processed.
Generally, several epochs are run to support the learning of the network.

\subsection*{feed forward}
The term feed forward describes a concept of data transmission in which the information in a network is passed ``straight ahead'' in the direction of the output layer (through each hidden layer, of course). Information cannot be sent in the other direction.

\subsection*{backpropagation}
Backpropagation is precisely the concept with which a neural network ``learns''.
It (usually) consists of three steps: using the loss function to calculate how wrong the network is; calculating how much each weight needs to be changed; and finally changing the weights.

\subsection*{loss}
``Loss'' here refers to a function that calculates how wrong the network is. For a normalized loss, the values lie between 0 and 1.

\subsection*{learning rate}
The learning rate roughly describes the step size with which gradient descent moves toward the minimum of the loss function.
The ultimate goal is to find the lowest point of the loss function in order to keep the error rate as low as possible.
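As a toy illustration of the step size, consider one gradient-descent step on the one-dimensional loss $L(w) = (w-3)^2$ (a loss chosen by us purely for illustration):

\begin{lstlisting}[caption={One gradient-descent step on a toy loss (illustrative)}]
// One step on L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
double gradientStep(double w, double learningRate) {
    double grad = 2.0 * (w - 3.0);
    return w - learningRate * grad;
}
\end{lstlisting}

With a moderate learning rate the iterates approach the minimum at $w = 3$; with a learning rate that is too large, a single step overshoots it.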

\section*{Exercise 2.3}
Our shuffle function works by first creating a vector filled with ints from 0 to the size of our input matrix; we then shuffle this vector to randomize its order. \\
We use this randomized list of ints as the indices for our new vectors, i.e. \verb|new_vector[0] = old_vector[first randomized index]|. We make sure that both the labels and the inputFeatures are sorted the same way by reordering them in the same for loop, using the same iterator for the indices.
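The index-shuffling described above can be sketched as follows (the function and variable names here are illustrative, not our actual identifiers):

\begin{lstlisting}[caption={Sketch of shuffling features and labels together}]
#include <algorithm>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Reorders features and labels with the same random permutation
void shuffleTogether(std::vector<std::vector<double>>& features,
                     std::vector<int>& labels, std::mt19937& rng) {
    std::vector<std::size_t> idx(features.size());
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., n-1
    std::shuffle(idx.begin(), idx.end(), rng);

    std::vector<std::vector<double>> newFeatures;
    std::vector<int> newLabels;
    for (std::size_t i : idx) {            // same index for both vectors
        newFeatures.push_back(features[i]);
        newLabels.push_back(labels[i]);
    }
    features = std::move(newFeatures);
    labels = std::move(newLabels);
}
\end{lstlisting}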

\section*{Exercise 2.5: Learning more about Neural Networks}
In this experiment, four configurations of a multilayer perceptron (MLP) were tested by varying both the network architecture (with or without a hidden layer) and the learning rate, using the values 0.01 and 0.001. Each configuration was trained for ten epochs, and performance was measured based on training and testing accuracy, loss, and the time taken per epoch.

The first configuration, which used no hidden layer and a learning rate of 0.01, delivered the most consistent and high-performing results. The training accuracy increased steadily across epochs, ultimately reaching 88.24\%, while the testing accuracy peaked at 90.46\%. Both training and testing losses decreased progressively and ended at 11.76 and 7.88, respectively. This model demonstrated rapid convergence, stable generalization, and required negligible training time per epoch. It was the most efficient and effective configuration in this set of experiments.

The second configuration, which also excluded a hidden layer but used a smaller learning rate of 0.001, performed slightly worse in terms of speed but remained competitive in accuracy. The testing accuracy gradually improved over the epochs, reaching a high of 89.55\%. The training accuracy similarly increased to 87.75\% by the final epoch. Although convergence was slower compared to the first configuration, loss values still declined steadily over time. This model was more stable but slightly less performant in both accuracy and efficiency than its counterpart using the higher learning rate.

The third configuration introduced a hidden layer and used a learning rate of 0.001. This version significantly increased training time, taking approximately 58 seconds per epoch, but did not yield a substantial improvement in accuracy. The highest training accuracy achieved was 86.03\%, while testing accuracy fluctuated and peaked at 86.54\%. However, testing accuracy showed instability, dropping to 74.72\% at one point before recovering. Testing loss followed a similar pattern of inconsistency. While this model showed some learning ability, the increased complexity and training time were not justified by a notable improvement in performance, and the results suggest signs of overfitting or insufficient optimization.

The final configuration, which combined a hidden layer with a high learning rate of 0.01, performed the worst. Training accuracy decreased steadily to 16.47\%, and testing accuracy never surpassed 30\%, dropping to 20.00\% by the tenth epoch. Both training and testing losses were erratic and remained extremely high throughout the training process, ending at 29.92 and 20.30, respectively. These results indicate that the model failed to converge and possibly diverged due to an excessively high learning rate that destabilized the training process when paired with a deeper network.

In summary, the model without a hidden layer and a learning rate of 0.01 provided the best combination of speed, stability, and accuracy. Reducing the learning rate to 0.001 improved stability but slightly hindered convergence speed. Adding a hidden layer did not provide measurable benefits under either learning rate and introduced significant training time and instability. The combination of a hidden layer with a high learning rate led to complete training failure. Therefore, for this task and dataset, a simple architecture without hidden layers and a moderate learning rate is clearly the most effective approach.

\begin{table}[h!]
\centering
\begin{tabular}{|l|c|c|c|c|}
\hline
\textbf{Configuration} & \textbf{Train Acc.} & \textbf{Test Acc.} & \textbf{Train Loss} & \textbf{Test Loss} \\
\hline
No Hidden Layer ($0.0001$) Epoch 1 & $84.42\%$ & $88.16\%$ & $14.20$ & $0.42$\\
No Hidden Layer ($0.0001$) Epoch 2 & $87.03\%$ & $84.39\%$ & $11.83$ & $0.50$\\
No Hidden Layer ($0.0001$) Epoch 3 & $87.37\%$ & $84.55\%$ & $11.52$ & $0.50$\\
No Hidden Layer ($0.0001$) Epoch 4 & $87.71\%$ & $88.21\%$ & $11.22$ & $0.38$\\
No Hidden Layer ($0.0001$) Epoch 5 & $87.75\%$ & $86.76\%$ & $11.13$ & $0.44$\\
No Hidden Layer ($0.0001$) Epoch 6 & $88.04\%$ & $88.59\%$ & $10.95$ & $0.42$\\
No Hidden Layer ($0.0001$) Epoch 7 & $88.06\%$ & $89.60\%$ & $10.91$ & $0.36$\\
No Hidden Layer ($0.0001$) Epoch 8 & $88.13\%$ & $85.70\%$ & $10.83$ & $0.47$\\
No Hidden Layer ($0.0001$) Epoch 9 & $88.21\%$ & $88.01\%$ & $10.73$ & $0.41$\\
No Hidden Layer ($0.0001$) Epoch 10 & $88.30\%$ & $89.23\%$ & $10.67$ & $0.36$\\
\hline
\end{tabular}
\caption{Results for MLP with no hidden layer using learning rate $0.0001$ over 10 epochs}
\label{tab:example_table}
\end{table}

\begin{table}[h!]
\centering
\begin{tabular}{|l|c|c|c|c|}
\hline
\textbf{Configuration} & \textbf{Train Acc.} & \textbf{Test Acc.} & \textbf{Train Loss} & \textbf{Test Loss} \\
\hline
1 Hidden Layer ($0.0001$) Epoch 1 & $88.66\%$ & $90.03\%$ & $0.43$ & $1.23$ \\
1 Hidden Layer ($0.0001$) Epoch 2 & $93.14\%$ & $90.05\%$ & $0.25$ & $1.11$ \\
1 Hidden Layer ($0.0001$) Epoch 3 & $94.39\%$ & $90.50\%$ & $0.21$ & $1.02$ \\
1 Hidden Layer ($0.0001$) Epoch 4 & $94.98\%$ & $90.86\%$ & $0.18$ & $0.96$ \\
1 Hidden Layer ($0.0001$) Epoch 5 & $95.53\%$ & $90.71\%$ & $0.16$ & $0.92$ \\
1 Hidden Layer ($0.0001$) Epoch 6 & $95.89\%$ & $90.30\%$ & $0.15$ & $0.88$ \\
1 Hidden Layer ($0.0001$) Epoch 7 & $96.28\%$ & $90.69\%$ & $0.14$ & $0.85$ \\
1 Hidden Layer ($0.0001$) Epoch 8 & $96.42\%$ & $90.22\%$ & $0.13$ & $0.82$ \\
1 Hidden Layer ($0.0001$) Epoch 9 & $96.73\%$ & $90.37\%$ & $0.12$ & $0.79$ \\
1 Hidden Layer ($0.0001$) Epoch 10 & $96.89\%$ & $91.10\%$ & $0.11$ & $0.78$ \\
\hline
\end{tabular}
\caption{Results for MLP with 1 hidden layer using learning rate $0.0001$ over 10 epochs}
\label{tab:one_hidden_layer_results}
\end{table}

\begin{table}[h!]
\centering
\begin{tabular}{|l|c|c|c|c|}
\hline
\textbf{Configuration} & \textbf{Train Acc.} & \textbf{Test Acc.} & \textbf{Train Loss} & \textbf{Test Loss} \\
\hline
No Hidden Layer ($0.001$) Epoch 1 & $84.40\%$ & $88.26\%$ & $15.41$ & $2.30$ \\
No Hidden Layer ($0.001$) Epoch 2 & $86.98\%$ & $84.40\%$ & $12.89$ & $3.44$ \\
No Hidden Layer ($0.001$) Epoch 3 & $87.46\%$ & $84.19\%$ & $12.43$ & $3.30$ \\
No Hidden Layer ($0.001$) Epoch 4 & $87.64\%$ & $86.21\%$ & $12.25$ & $2.92$ \\
No Hidden Layer ($0.001$) Epoch 5 & $87.90\%$ & $88.43\%$ & $11.97$ & $2.65$ \\
No Hidden Layer ($0.001$) Epoch 6 & $87.99\%$ & $86.72\%$ & $11.92$ & $2.83$ \\
No Hidden Layer ($0.001$) Epoch 7 & $88.10\%$ & $87.49\%$ & $11.79$ & $2.81$ \\
No Hidden Layer ($0.001$) Epoch 8 & $88.06\%$ & $89.48\%$ & $11.83$ & $2.35$ \\
No Hidden Layer ($0.001$) Epoch 9 & $88.13\%$ & $90.04\%$ & $11.76$ & $2.31$ \\
No Hidden Layer ($0.001$) Epoch 10 & $88.24\%$ & $88.28\%$ & $11.64$ & $2.77$ \\
\hline
\end{tabular}
\caption{Results for MLP without hidden layer using learning rate $0.001$ over 10 epochs}
\label{tab:no_hidden_001_results}
\end{table}

\begin{table}[h!]
\centering
\begin{tabular}{|l|c|c|c|c|}
\hline
\textbf{Configuration} & \textbf{Train Acc.} & \textbf{Test Acc.} & \textbf{Train Loss} & \textbf{Test Loss} \\
\hline
No Hidden Layer ($0.01$) Epoch 1 & $84.51\%$ & $88.46\%$ & $15.46$ & $9.17$ \\
No Hidden Layer ($0.01$) Epoch 2 & $86.98\%$ & $89.04\%$ & $13.01$ & $8.78$ \\
No Hidden Layer ($0.01$) Epoch 3 & $87.48\%$ & $86.93\%$ & $12.51$ & $10.58$ \\
No Hidden Layer ($0.01$) Epoch 4 & $87.61\%$ & $86.89\%$ & $12.38$ & $10.25$ \\
No Hidden Layer ($0.01$) Epoch 5 & $87.81\%$ & $87.79\%$ & $12.19$ & $9.45$ \\
No Hidden Layer ($0.01$) Epoch 6 & $88.08\%$ & $88.67\%$ & $11.91$ & $9.04$ \\
No Hidden Layer ($0.01$) Epoch 7 & $88.21\%$ & $88.18\%$ & $11.78$ & $9.29$ \\
No Hidden Layer ($0.01$) Epoch 8 & $88.14\%$ & $88.24\%$ & $11.84$ & $9.67$ \\
No Hidden Layer ($0.01$) Epoch 9 & $88.17\%$ & $86.39\%$ & $11.82$ & $11.26$ \\
No Hidden Layer ($0.01$) Epoch 10 & $88.24\%$ & $90.46\%$ & $11.76$ & $7.88$ \\
\hline
\end{tabular}
\caption{Results for MLP without hidden layer using learning rate $0.01$ over 10 epochs}
\label{tab:no_hidden_01_results}
\end{table}

\begin{table}[h!]
\centering
\begin{tabular}{|l|c|c|c|c|}
\hline
\textbf{Configuration} & \textbf{Train Acc.} & \textbf{Test Acc.} & \textbf{Train Loss} & \textbf{Test Loss} \\
\hline
With Hidden Layer ($0.01$) Epoch 1 & $19.35\%$ & $21.07\%$ & $25.01$ & $28.79$ \\
With Hidden Layer ($0.01$) Epoch 2 & $22.70\%$ & $30.89\%$ & $22.75$ & $16.24$ \\
With Hidden Layer ($0.01$) Epoch 3 & $20.69\%$ & $20.83\%$ & $25.92$ & $20.99$ \\
With Hidden Layer ($0.01$) Epoch 4 & $23.66\%$ & $23.52\%$ & $18.50$ & $14.71$ \\
With Hidden Layer ($0.01$) Epoch 5 & $23.09\%$ & $27.55\%$ & $19.51$ & $19.52$ \\
With Hidden Layer ($0.01$) Epoch 6 & $21.19\%$ & $13.87\%$ & $21.72$ & $21.55$ \\
With Hidden Layer ($0.01$) Epoch 7 & $18.65\%$ & $19.25\%$ & $24.79$ & $19.33$ \\
With Hidden Layer ($0.01$) Epoch 8 & $18.16\%$ & $19.16\%$ & $25.58$ & $23.32$ \\
With Hidden Layer ($0.01$) Epoch 9 & $17.59\%$ & $17.43\%$ & $26.96$ & $25.20$ \\
With Hidden Layer ($0.01$) Epoch 10 & $16.47\%$ & $20.00\%$ & $29.92$ & $20.30$ \\
\hline
\end{tabular}
\caption{Results for MLP with one hidden layer using learning rate $0.01$ over 10 epochs}
\label{tab:with_hidden_01_results}
\end{table}

\end{document}

hlbpr/assignment4/main.tex (new file, 152 lines)

\documentclass[a4paper]{article}
%\usepackage[singlespacing]{setspace}
\usepackage[onehalfspacing]{setspace}
%\usepackage[doublespacing]{setspace}
\usepackage{geometry} % Required for adjusting page dimensions and margins
\usepackage{amsmath,amsfonts,stmaryrd,amssymb,mathtools,dsfont} % Math packages
\usepackage{tabularx}
\usepackage{colortbl}
\usepackage{amsthm}
\usepackage{enumitem}
\usepackage{subcaption}
\usepackage{float}
\usepackage[table,xcdraw]{xcolor}
\usepackage{tikz-qtree}
\usepackage{forest}
\usepackage{changepage,titlesec,fancyhdr} % For styling header and titles
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0.5pt} % Adjust the line width if desired
\renewcommand{\headrule}{
\makebox[\textwidth]{\rule{1.0\textwidth}{0.5pt}}
}
\usepackage{diagbox}
\usepackage{xfrac}

\usepackage{enumerate} % Custom item numbers for enumerations

\usepackage[ruled]{algorithm2e} % Algorithms

\usepackage[framemethod=tikz]{mdframed} % Allows defining custom boxed/framed environments

\usepackage{listings} % File listings, with syntax highlighting
\lstset{
basicstyle=\ttfamily, % Typeset listings in monospace font
}

\usepackage[ddmmyyyy]{datetime}

\geometry{
paper=a4paper, % Paper size, change to letterpaper for US letter size
top=3cm, % Top margin
bottom=3cm, % Bottom margin
left=2.5cm, % Left margin
right=2.5cm, % Right margin
headheight=25pt, % Header height
footskip=1.5cm, % Space from the bottom margin to the baseline of the footer
headsep=1cm, % Space from the top margin to the baseline of the header
%showframe, % Uncomment to show how the type block is set on the page
}
\lhead{Badan, 7418190\\Kneifel, 8071554}
\chead{\bfseries{\vspace{0.5\baselineskip}HL-BPR Praktikum SS25\\Blatt 04}}
\rhead{Wolf, 8019440\\Werner, 7987847}
\fancyheadoffset[R]{0cm}

\begin{document}

\section*{Exercise 4.1: Fast Element-Wise Unary Operations}
The results in the console look as follows:
\begin{lstlisting}[caption={Output of Matrix.cpp}]
Time scalar: 83.2429 ms
Time SIMD: 24.0281 ms, speed up 3.4644
SIMD and scalar results are the same.
\end{lstlisting}
Although these results vary a little, the general speedup is between \texttt{3.4} and \texttt{3.8}.\\\\
On first thought, the expected speedup would be $4\times$, because \texttt{F32vec4} uses the \texttt{\_\_m128} datatype family, and a float being 32 bits in size results in \texttt{4 x 32b = 128b}.\\\\
More accurately, the runtime would be \[\mathcal{O}\left(\left\lfloor\frac{n}{4}\right\rfloor + (n \bmod 4)\right)\] since, unless $4 - (n \bmod 4)$ dummy elements are inserted to allow one more SIMD-wise computation, the last few elements need to be computed in a scalar manner.\\
In this case, our matrix with \texttt{N = 1000} has $1000000$ entries. Therefore:\\
\(
N = 1000000 \overset{\wedge}{=} 83.2429\,\text{ms}
\)\\
\(
N = 1 \overset{\wedge}{=} 8.32429 \cdot 10^{-5}\,\text{ms}
\)
Based on that, the runtime for \texttt{N = 1000000} should have been\\
$T = \left(\left\lfloor\frac{1000000}{4}\right\rfloor + (1000000 \bmod 4)\right) \cdot 8.32429 \cdot 10^{-5}\,\text{ms}$
$\approx 20.81\,\text{ms}$\\\\
The actual result is quite close to this theoretical prediction. Other factors influencing the time are, for example, parts of the function that take a constant amount of time, as well as random delays and uncertainties during process execution that change from run to run.
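The packet/tail split behind this estimate can be sketched in plain C++ (a stand-in for the intrinsics code, not the assignment's actual implementation; the element-wise operation is assumed to be sqrt and the names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Process floor(n/4) packets of 4 elements "vector-wise" and the
// remaining n mod 4 elements with a scalar tail, as described above.
std::vector<float> sqrtChunked(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    const std::size_t n = in.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {              // floor(n/4) packet steps
        for (std::size_t k = 0; k < 4; ++k)   // stands in for one SIMD op
            out[i + k] = std::sqrt(in[i + k]);
    }
    for (; i < n; ++i)                        // scalar tail: n mod 4 elements
        out[i] = std::sqrt(in[i]);
    return out;
}
```

The packet loop models the SIMD path; the tail loop is why the cost model above contains the $n \bmod 4$ term.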
\section*{Exercise 4.2}
In this experiment, several implementations of an algorithm computing the roots of a quadratic equation were evaluated, comparing a traditional scalar version with four different SIMD variants labeled SIMD1 through SIMD4. The scalar version served as the baseline for both correctness and performance, completing in 325.824 milliseconds.\\
\\
SIMD1 has by far the fastest execution time in the whole line-up:
\begin{verbatim}
Scalar   325.824 ms      1.00x
SIMD1      0.285 ms   1142.65x
\end{verbatim}
In contrast, SIMD2 exhibited an extreme decline in performance, with a total execution time of 1492.1 milliseconds, significantly slower than even the scalar version:
\begin{verbatim}
Scalar   325.824 ms      1.00x
SIMD2     1492.1 ms      0.22x
\end{verbatim}
SIMD3 offered a more balanced result, achieving a moderate improvement over the scalar baseline by completing in 180.083 milliseconds:
\begin{verbatim}
Scalar   325.824 ms      1.00x
SIMD3    180.083 ms      1.81x
\end{verbatim}
SIMD4, however, followed a similar trend to SIMD2, performing poorly with an execution time of 1105.47 milliseconds, which is still noticeably slower than the scalar process:
\begin{verbatim}
Scalar   325.824 ms      1.00x
SIMD4    1105.47 ms      0.29x
\end{verbatim}
The different SIMD usages lead to very different results. Only SIMD1 and SIMD3 are actually faster than our scalar approach, while SIMD2 and SIMD4 are slower. The huge disparity between SIMD1 and the rest suggests that the benefit depends heavily on how the code is implemented and whether SIMD is the correct tool for the use case.
\section*{Exercise 4.3}
\subsection*{Integer version}
\begin{lstlisting}
// post processing
sumI = static_cast<unsigned char>(
    (temp_sumI & 0xFF) ^
    ((temp_sumI >> 8) & 0xFF) ^
    ((temp_sumI >> 16) & 0xFF) ^
    ((temp_sumI >> 24) & 0xFF)
);
\end{lstlisting}
We first use \verb|reinterpret_cast| to convert our string pointer into a pointer to int. We use this int pointer in the sum function with $N/4$ iterations, because an integer holds 4 bytes while a char only holds one, so one int can fit 4 of our chars.\\
\\
Lastly we have to post-process to convert our result back to an unsigned char. To do this we use a \verb|static_cast|, and we XOR the 4 bytes of the int with each other, as that is the part that we leave out of the sum function.
\subsection*{SIMD}
After we convert our type via \verb|reinterpret_cast|, we use the sum function with $N/16$ iterations, as one float holds 4 bytes (i.e.\ 4 chars) and one of our vectors holds 4 floats ($4 \times 4 = 16$).
\begin{lstlisting}
// converting into ints for further processing
int* iptr = reinterpret_cast<int*>(&temp_sumV);

// post processing:
// going through all the ints in the vector,
// performing XOR operations to get it into single-byte format
int finalint = 0;
for (int i = 0; i < 4; ++i) {
    finalint = finalint ^
        (iptr[i] & 0xFF) ^
        ((iptr[i] >> 8) & 0xFF) ^
        ((iptr[i] >> 16) & 0xFF) ^
        ((iptr[i] >> 24) & 0xFF);
}
\end{lstlisting}
We do our post-processing by first converting to integers, so that we can use a similar technique to the previous version, only that this time we XOR the 4 bytes of all 4 integers in the vector with each other.
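The byte-folding step can be illustrated in isolation. The following sketch is not the assignment's code: it uses XOR as the combine operation (for which the word-wise and byte-wise results agree exactly), and the function names and the multiple-of-4 length assumption are ours:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Byte-wise XOR checksum: the reference result.
unsigned char xorChecksumBytewise(const std::string& s) {
    unsigned char c = 0;
    for (unsigned char ch : s) c ^= ch;
    return c;
}

// Word-wise variant: combine 4 chars per step, then fold the 4 bytes
// of the accumulator into one, as in the listings above.
unsigned char xorChecksumWordwise(const std::string& s) {
    assert(s.size() % 4 == 0);  // assumed: length is a multiple of 4
    std::uint32_t acc = 0;
    for (std::size_t i = 0; i < s.size(); i += 4) {
        std::uint32_t w;
        std::memcpy(&w, s.data() + i, 4);  // avoids alignment/aliasing issues
        acc ^= w;                          // 4 chars combined per step
    }
    // post-processing: fold the 4 bytes of the accumulator into one
    return static_cast<unsigned char>(
        (acc & 0xFF) ^ ((acc >> 8) & 0xFF) ^
        ((acc >> 16) & 0xFF) ^ ((acc >> 24) & 0xFF));
}
```

Note that with \texttt{+} as the combine operation (as in the assignment) carries can cross byte boundaries, so the folding there is part of the intended checksum definition rather than an exact equivalence.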
\section*{Exercise 4.4}

The affine transformation is a calculation based on the function $y = A \cdot x + b$.
With the help of SIMD we can reduce the runtime to a bit over half. Here you can see the difference: using MatVecMul, i.e.\ the scalar version, the runtime for each epoch is just under 67 seconds.
The runtime is reduced by almost 30 seconds if you use the SIMD version of the MatVecMul function. The speedup is calculated as: $
\text{Speedup} := \frac{T_{\text{without SIMD}}}{T_{\text{with SIMD}}} = \frac{67\,\text{s}}{37\,\text{s}} \approx 1.81 $. In conclusion, with the help of SIMD the function runs approx.\ 1.8 times faster than without SIMD.
\end{document}

hlbpr/assignment5/main.tex (new file, 163 lines)

\documentclass[a4paper]{article}
%\usepackage[singlespacing]{setspace}
\usepackage[onehalfspacing]{setspace}
%\usepackage[doublespacing]{setspace}
\usepackage{geometry} % Required for adjusting page dimensions and margins
\usepackage{amsmath,amsfonts,stmaryrd,amssymb,mathtools,dsfont} % Math packages
\usepackage{tabularx}
\usepackage{colortbl}
\usepackage{amsthm}
\usepackage{enumitem}
\usepackage{subcaption}
\usepackage{float}
\usepackage[table,xcdraw]{xcolor}
\usepackage{tikz-qtree}
\usepackage{forest}
\usepackage{changepage,titlesec,fancyhdr} % For styling header and titles
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0.5pt} % Adjust the line width if desired
\renewcommand{\headrule}{
\makebox[\textwidth]{\rule{1.0\textwidth}{0.5pt}}
}
\usepackage{diagbox}
\usepackage{xfrac}

\usepackage{enumerate} % Custom item numbers for enumerations

\usepackage[ruled]{algorithm2e} % Algorithms

\usepackage[framemethod=tikz]{mdframed} % Allows defining custom boxed/framed environments

\usepackage{listings} % File listings, with syntax highlighting
\lstset{
basicstyle=\ttfamily, % Typeset listings in monospace font
}

\usepackage[ddmmyyyy]{datetime}

\geometry{
paper=a4paper, % Paper size, change to letterpaper for US letter size
top=3cm, % Top margin
bottom=3cm, % Bottom margin
left=2.5cm, % Left margin
right=2.5cm, % Right margin
headheight=25pt, % Header height
footskip=1.5cm, % Space from the bottom margin to the baseline of the footer
headsep=1cm, % Space from the top margin to the baseline of the header
%showframe, % Uncomment to show how the type block is set on the page
}
\lhead{Badan, 7418190\\Kneifel, 8071554}
\chead{\bfseries{\vspace{0.5\baselineskip}HL-BPR Praktikum SS25\\Blatt 05}}
\rhead{Wolf, 8019440\\Werner, 7987847}
\fancyheadoffset[R]{0cm}

\begin{document}
\section*{Exercise 5.1}
We use a parallel vector to calculate our result; we can use the sqrt function directly, as it is predefined for SIMD vectors.

\section*{Exercise 5.2}
In this task, we looked at three different memory layouts: AOS (array of structs), SOA (struct of arrays) and AOSOA (array of structs of arrays).
We performed 4 different types of calculations: the three layouts already mentioned and one scalar calculation. \\
First, compiled with the AVX (Advanced Vector Extensions) flag:
\begin{lstlisting}
g++ quadratic_equation.cpp -O3 -fno-tree-vectorize -mavx -o quadratic_equation.out && ./quadratic_equation.out
Time scalar: 2799.81 ms.
Time stdx AOS: 5112.58 ms, speed up 0.547632.
Time stdx SOA: 9.90828 ms, speed up 282.572.
Time stdx AOSOA: 1546.49 ms, speed up 1.81042.
SIMD using AOS and scalar results are the same.
SIMD using SOA and scalar results are the same.
SIMD using AOSOA and scalar results are the same.
\end{lstlisting}

Then, compiled with the SSE (Streaming SIMD Extensions) flag:
\begin{lstlisting}
make run-sse
g++ quadratic_equation.cpp -O3 -fno-tree-vectorize -msse -o quadratic_equation.out && ./quadratic_equation.out
Time scalar: 2777.7 ms.
Time stdx AOS: 6700.41 ms, speed up 0.414557.
Time stdx SOA: 12.0778 ms, speed up 229.984.
Time stdx AOSOA: 2054.76 ms, speed up 1.35184.
SIMD using AOS and scalar results are the same.
SIMD using SOA and scalar results are the same.
SIMD using AOSOA and scalar results are the same.
\end{lstlisting}
There are almost no differences in the scalar version.
For the AOS, SOA and AOSOA calculations, the AVX runtime is just under 75 percent of the SSE runtime.
\subsection*{AOS}
In general you would expect the runtime to decrease by roughly the vector width when using SIMD logic. With both compiler flags, however, the runtime almost doubles instead.
The reason for this could be memory traffic: at runtime, the data is stored twice, once to keep the data in a vector and then again to copy exactly this data into a SIMD vector. \\
That takes even more time than just calculating with the scalar version.
\subsection*{SOA}
By using \verb|reinterpret_cast|, the original data can be loaded directly into a SIMD vector without copying it twice through memory.

The calculation of the function $a \cdot x^2 + b \cdot x + c = 0$ is now possible with 8 elements (AVX) or 4 elements (SSE) simultaneously (reason: 256-bit register size for AVX, and only 128-bit register size for SSE).

The calculations can thus be performed in a very short time.

\subsection*{AOSOA}
Perhaps the reason why AOSOA is slower overall than SOA is prefetching. During one iteration of the for-loop, 8 elements are calculated simultaneously. The cache can prefetch, i.e.\ load the next elements into the cache in advance, so that the next iteration of the for-loop can be computed faster. However, this only works if you have a long contiguous run of one member (here: member a, for example), as the next elements then lie directly next to each other in memory. Only SOA is arranged in this way. AOSOA has the theoretically perfect arrangement for SIMD calculations, but you would have to jump around in memory in order to get the same prefetching effects as with SOA, and jumping within memory is not efficient. This is why SOA works faster than AOSOA.
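The three layouts discussed above can be sketched like this (illustrative type names, not the assignment's actual structs; the packet width W = 8 assumes one AVX packet of floats):

```cpp
#include <array>
#include <cassert>
#include <vector>

// AOS: one struct per equation, structs stored back to back.
struct CoeffsAOS { float a, b, c; };
using AOS = std::vector<CoeffsAOS>;

// SOA: one long contiguous array per member.
struct SOA {
    std::vector<float> a, b, c;
};

// AOSOA: packets of W values per member, packets stored back to back.
constexpr std::size_t W = 8;                      // one AVX packet of floats
struct Packet { std::array<float, W> a, b, c; };
using AOSOA = std::vector<Packet>;
```

In SOA all values of one member are contiguous (long runs the prefetcher likes); in AOSOA only W values of a member are contiguous before the layout switches to the next member.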
\section*{Exercise 5.3}
After correctly implementing the SIMD version of Newton's method, we get the following terminal output:
\begin{verbatim}
Scalar part:
Results are correct!
Time: 14.695 ms
SIMD part:
Results are the same!
Time: 4.12138 ms
Speed up: 3.56555
\end{verbatim}
The expected speedup, as already discussed in exercise 4.1 of the previous assignment, should be $4\times$, since we use $4 \cdot 32$ bits (size of a float) $= 128$ bits, which is the size of the \texttt{\_\_m128} datatype. Due to non-parallelizable parts of the code and general random factors, this speedup cannot quite be achieved in practice. This approach iterates scalarly over the SIMD packets but not over the elements themselves, and therefore satisfies the constraints set by the exercise.\\
When iterating over all the elements individually, the speedup would be significantly smaller, but the preferred approach was the first one implemented.\\
The implementation also yields the same results as the given scalar implementation.
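The packet-wise iteration scheme can be illustrated with a plain 4-float lane array standing in for one \texttt{\_\_m128} packet. This is a sketch under the assumption that the method solves $x^2 - a = 0$, i.e.\ computes square roots; the actual assignment code may target a different function:

```cpp
#include <cassert>
#include <cmath>

// One Newton update, x <- (x + a/x)/2, applied to all 4 lanes per step;
// the lane loop stands in for a single SIMD instruction.
void newtonSqrt4(const float a[4], float out[4], int iters = 30) {
    float x[4];
    for (int l = 0; l < 4; ++l)
        x[l] = a[l] > 0.f ? a[l] : 1.f;      // simple positive start values
    for (int k = 0; k < iters; ++k)
        for (int l = 0; l < 4; ++l)          // one "packet" operation
            x[l] = 0.5f * (x[l] + a[l] / x[l]);
    for (int l = 0; l < 4; ++l) out[l] = x[l];
}
```

Iterating over packets like this keeps the scalar loop count at $n/4$, which is where the ideal $4\times$ estimate above comes from.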
\section*{Exercise 5.4}
We use the built-in gather function of the Vc library to gather our values from the input.

\begin{lstlisting}
tmp.gather(input, index);
\end{lstlisting}

For the masked gather we create a mask and pass it to our gather function.

\begin{lstlisting}
// gather with masking
float_v tmp2;
// Gather data with indices "index" from the array "input" into
// float_v tmp2, if the value of "input" is larger than 0.5. Uses:
// void gather(const float *array, const uint_v &indexes, const float_m &mask)

// Creating a mask with the condition input[index[i]] > 0.5
float_v inputValues;
inputValues.gather(input, index);
float_m mask = inputValues > 0.5f;

// Masked gather: only gather where mask is true
tmp2.gather(input, index, mask);
\end{lstlisting}

As for the scatter, we make use of the built-in scatter function and create a mask to use with it.

\begin{lstlisting}
float_m scatter_mask = tmp < 0.5f;
tmp.scatter(output, index, scatter_mask);
\end{lstlisting}

\end{document}

hlbpr/assignment6/main.tex (new file, 353 lines)

\documentclass[a4paper]{article}
%\usepackage[singlespacing]{setspace}
\usepackage[onehalfspacing]{setspace}
%\usepackage[doublespacing]{setspace}
\usepackage{geometry} % Required for adjusting page dimensions and margins
\usepackage{amsmath,amsfonts,stmaryrd,amssymb,mathtools,dsfont} % Math packages
\usepackage{tabularx}
\usepackage{colortbl}
\usepackage{amsthm}
\usepackage{enumitem}
\usepackage{subcaption}
\usepackage{float}
\usepackage[table,xcdraw]{xcolor}
\usepackage{tikz-qtree}
\usepackage{forest}
\usepackage{changepage,titlesec,fancyhdr} % For styling header and titles
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0.5pt} % Adjust the line width if desired
\renewcommand{\headrule}{
\makebox[\textwidth]{\rule{1.0\textwidth}{0.5pt}}
}
\usepackage{diagbox}
\usepackage{xfrac}

\usepackage{enumerate} % Custom item numbers for enumerations

\usepackage[ruled]{algorithm2e} % Algorithms

\usepackage[framemethod=tikz]{mdframed} % Allows defining custom boxed/framed environments

\usepackage{listings} % File listings, with syntax highlighting
\lstset{
basicstyle=\ttfamily, % Typeset listings in monospace font
}

\usepackage[ddmmyyyy]{datetime}

\geometry{
paper=a4paper, % Paper size, change to letterpaper for US letter size
top=3cm, % Top margin
bottom=3cm, % Bottom margin
left=2.5cm, % Left margin
right=2.5cm, % Right margin
headheight=25pt, % Header height
footskip=1.5cm, % Space from the bottom margin to the baseline of the footer
headsep=1cm, % Space from the top margin to the baseline of the header
%showframe, % Uncomment to show how the type block is set on the page
}
\lhead{Badan, 7418190\\Kneifel, 8071554}
\chead{\bfseries{\vspace{0.5\baselineskip}HL-BPR Praktikum SS25\\Blatt 06}}
\rhead{Wolf, 8019440\\Werner, 7987847}
\fancyheadoffset[R]{0cm}

\begin{document}
\section*{Exercise 6.1}
\subsection*{i}
There is an assumption to be made when answering this problem: the work required to turn a program into multi-threaded code is not negligible. \\
A loop whose iterations are independent from each other fulfills the use case for multithreading. \\
The biggest issue with multithreading is the data dependency between tasks. If two tasks are independent from each other and can be executed in any order, this type of problem fulfills every condition needed for multithreading to work better than the sequential code. The parallel runtime will always be less than or equal to that of the original code.\\
\\
There is still a discussion to be made: if, as mentioned in the question, the loop is so short that the time invested in parallelising the code outweighs the time gained by the parallelisation, it is not worth it. \\
\\
\subsection*{ii}
Concurrency uses one core and interleaves tasks, so several tasks make progress and finish around the same time. While this resembles the behaviour of parallelism, the actual total runtime is essentially the same as running the tasks one after another.
\\
Parallelism uses multiple cores to work on the different tasks at the same time. Unlike concurrency, parallelism gets more work done in the same time frame, usually taking as long as the longest chain of dependent tasks in one thread. The easiest examples are data collection and management. Assume the program is required to collect numerical data from 8 different sources and output all collected data after a day, and that collecting the data from one source takes one day. Using concurrency on a single core, it would take 8 days to output the data of a single day. \\
Parallelisation can collect data from each source independently at the same time, with 8 threads for example. In practice, however, even if a computer can keep 8 threads running at the same time, not all of them are available for pure computation: the operating system, data management and keeping the threads in check also need computing time, so only about 6 of the 8 threads do pure computation. So while in a perfect world the 8 threads would reduce the 8 days of work to 1 day, in truth the realistic result is closer to $8/6 \approx 1{.}3$ days. This is a good result, just not a perfect one.

\subsection*{iii}
Dependencies between tasks can lead to race conditions. Simply put, if two tasks first need write access to store two different pieces of information in the same memory cell, and then read from that same cell, information loss can occur depending on the order, which leads to errors. With this in mind, you have to control when data is read and written, and this synchronisation reduces the speedup.
\\
\\
Read access does not lead to problems: if multiple tasks only read, the order of execution does not matter, since the tasks are by nature independent. \\
\\
Write access is problematic because information is lost once a memory cell is overwritten; the program then needs to properly order the different tasks so that no error happens.
\\
\\
A race condition is an unstable state that occurs when tasks with data dependencies are parallelised. This state requires careful planning and preparation and should be avoided. It is also one reason why the achievable speedup of a program is reduced.
\\
\subsection*{iv}

Suppose we have an array with \(n\) elements and \(n\) CPU cores available. We want to find the sum of all elements using multiple threads in parallel, without using OpenMP's built-in reduction feature.\\ \\

The main idea is to use a \textbf{parallel pairwise reduction} method. This means that in each step, threads add pairs of elements at the same time, cutting the number of elements to sum in half.\\ \\

At the start, each thread works on one element. In the first step, threads with even indices add their element to the next element. This reduces the problem size from \(n\) to \(n/2\). In the next step, threads with indices that are multiples of 4 add their element to the one two positions away, halving the size again. We keep doing this, doubling the distance each time, until only one element is left — the total sum.\\ \\

Since the number of elements is cut in half each step, the total number of steps is \(\log_2 n\). Because all additions in each step happen at the same time across the cores, and assuming the overhead (like thread management) takes constant time, the total running time is \(O(\log_2 n)\).
\\ \\
This method avoids problems because each thread works on separate elements without interfering with others. After all steps are done, the final sum is stored in the first element of the array.\\ \\
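The stride-doubling scheme described above can be sketched as follows (illustrative names; the pragma parallelises the independent pair additions of one step when built with \texttt{-fopenmp} and is ignored otherwise):

```cpp
#include <cassert>
#include <vector>

// Pairwise reduction: stride doubles each step; the pairs within one
// step touch disjoint slots, so they can run in parallel safely.
long long pairwiseSum(std::vector<long long> a) {
    const long long n = static_cast<long long>(a.size());
    for (long long stride = 1; stride < n; stride *= 2) {
        #pragma omp parallel for
        for (long long i = 0; i < n; i += 2 * stride) {
            if (i + stride < n)
                a[i] += a[i + stride];   // each pair is independent
        }
    }
    return n > 0 ? a[0] : 0;             // result ends up in element 0
}
```

The outer loop runs $\lceil\log_2 n\rceil$ times, matching the $O(\log_2 n)$ bound above; the guard \texttt{i + stride < n} also handles sizes that are not powers of two.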
\subsection*{v}
The CPU model is an Intel Core i5-9400 CPU @ 2.90GHz.\\
We have 6 of those cores, and each core supports 2 threads, meaning we have 12 threads.\\
In this case the best number of threads to use would be 10. We need 2 threads for the operating system and thread management, since ignoring them could either significantly lower the speedup or even lead to an error.

\section*{Exercise 6.2}
\subsection*{Uncontrolled Multithreading}

Each thread attempts to print at the same time. Since access to the standard output is not synchronized, the output may appear jumbled due to race conditions:\\

\begin{verbatim}
Hello, World! from thread 0
HelHello, World! from thread 2
lo, World! from thread 1
Hello, WorHello, World! from thread 3
ld! from thread 4
\end{verbatim}

\subsection*{Using \texttt{\#pragma omp critical}}

To ensure that only one thread prints at a time, we can use a critical section.\\

This forces each thread to wait its turn to enter the critical section, resulting in clean, non-interleaved output (the thread order may still vary between runs):\\

\begin{verbatim}
Hello, World! from thread 0
Hello, World! from thread 1
Hello, World! from thread 2
Hello, World! from thread 3
Hello, World! from thread 4
\end{verbatim}

\subsection*{Conclusion}

Without synchronization, thread outputs can overlap and produce unreadable results. Using \texttt{\#pragma omp critical} prevents such issues by allowing only one thread to access the output stream at a time.

\section*{Exercise 6.3}
\subsection*{Bug 1}
Part of code before:
\lstinline{#pragma omp parallel private(n) num_threads(N_THREADS)} \\
Part of code after:
\lstinline{#pragma omp parallel firstprivate(n) num_threads(N_THREADS)} \\

With \lstinline{private(n)}, each thread gets its own uninitialized copy of the variable n; inside the parallel region, n could therefore hold whatever happens to lie in memory (a number, a letter, etc.).
One solution here is to change the \lstinline{private(n)} to a \lstinline{firstprivate(n)}, so that the \lstinline{#pragma} definition copies the initialized value of n into each thread. However, the program also runs if the \lstinline{private(n)} is omitted completely.

\subsection*{Bug 2}
Part of code before:
\begin{lstlisting}
tmp = 0;
#pragma omp parallel num_threads(N_THREADS)
{
    #pragma omp for
    for (int i = 1; i < n; ++i) {
        tmp += i;
        output_parallel[i] = static_cast<float>(tmp) / i;
    }
}
\end{lstlisting}
Part of code after:
\begin{lstlisting}
tmp = 0;
#pragma omp parallel num_threads(N_THREADS)
{
    #pragma omp for ordered
    for (int i = 1; i < n; ++i) {
        #pragma omp ordered
        {
            tmp += i;
            output_parallel[i] = static_cast<float>(tmp) / i;
        }
    }
}
\end{lstlisting}
The extra \lstinline{ordered} clause still lets the threads work independently, but executes the marked block in the order of the loop index \lstinline{i}. As the running value of \lstinline{tmp} depends on all previous iterations, it is important that the threads don't get in each other's way, so each iteration updates \lstinline{tmp} strictly in order. (Note that both statements have to be inside the ordered block, since the second one reads \lstinline{tmp}.)

\subsection*{Bug 3}
Part of code before:
\begin{lstlisting}
#pragma omp parallel firstprivate(n) num_threads(N_THREADS)
{
    #pragma omp for nowait
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic
        sum += input[i];
    }
\end{lstlisting}
Part of code after:
\begin{lstlisting}
#pragma omp parallel firstprivate(n) num_threads(N_THREADS)
{
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic
        sum += input[i];
    }
\end{lstlisting}
The \lstinline{nowait} clause removes the implicit barrier at the end of the parallel for loop, so threads would continue with the following code before all additions to \lstinline{sum} are finished. The \lstinline{atomic} is still needed so that different threads don't override each other's additions, but the implicit barrier must also stay in place, which is why the \lstinline{nowait} has to be removed.

\subsection*{Bug 4}
Part of code before:
\begin{lstlisting}
#pragma omp atomic
sum += local_sum;

#pragma omp for
for (int i = 0; i < n; ++i) {
    output_parallel[i] = input[i] / sum;
}
\end{lstlisting}
Part of code after:
\begin{lstlisting}
#pragma omp atomic
sum += local_sum;

#pragma omp barrier
#pragma omp for
for (int i = 0; i < n; ++i) {
    output_parallel[i] = input[i] / sum;
}
\end{lstlisting}
Before the change, a thread may compute \lstinline{output_parallel[i]} before all threads have added their \lstinline{local_sum}, i.e.\ with an incomplete \lstinline{sum}.
By including a barrier, all threads must be done with the accumulation of \lstinline{sum}, so in the next step each thread can calculate \lstinline{output_parallel[i]} with the whole/real \lstinline{sum}.

\section*{Exercise 6.4}
Basically both functions do the same thing and return the same value, an estimate of pi. The scalar version is not included here.
\subsection*{i) OpenMP without Reduction-Statement}
Code of function (without comments):
\begin{lstlisting}
double calculatePiThreadSums()
{
    double sum = 0.0;
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        double local_sum = 0.0;
        #pragma omp for
        for(size_t i = 1; i <= NUM_STEPS; ++i){
            double x = (i-0.5) * STEP;
            local_sum += 4.0 / (1.0 + x * x);
        }
        #pragma omp critical
        sum += local_sum;
    }
    return sum * STEP;
}
\end{lstlisting}
Each thread accumulates its own \lstinline{local_sum}, so the partial sums can be calculated without the threads interfering with each other; only the final addition into \lstinline{sum} is serialised.

With the use of OpenMP the operations do not have to be performed one after the other; instead, 4 iterations can be calculated simultaneously (one on each thread).
It was to be expected that a function that calculates the same thing but is parallelised with OpenMP would be faster. The speedup here is between 2.7 and 3.2 compared to the scalar version (the speedups differ between runs).

\subsection*{ii) OpenMP with Reduction-Statement}
|
||||
Code of function (without comments):
|
||||
\begin{lstlisting}
|
||||
double calculatePiReduction()
{
    double sum = 0.0;
    #pragma omp parallel reduction(+:sum) num_threads(NUM_THREADS)
    {
        #pragma omp for
        for (size_t i = 1; i <= NUM_STEPS; ++i) {
            double x = (i - 0.5) * STEP;
            sum += 4.0 / (1.0 + x * x);
        }
    }
    return sum * STEP;
}
\end{lstlisting}
This function is structured like the ``thread-local'' function, with the difference that the handling of the per-thread partial sums is implemented internally in the logic of OpenMP.

Depending on how large the overhead is, this function may run just as fast as (possibly even slower than) the ``thread-local'' function. In our runs, however, the speedup compared to the scalar version is about 3.7 to 4.9.
\section*{Exercise 6.5}

We take the SIMD version of our matrix computation and parallelize the second for loop with the \lstinline{parallel for} pragma, so each thread computes the vectorized inner loop for a subset of the rows. This yields a speedup of approximately 7 (tested on an i5-1235U with 16\,GB RAM). We expected a bigger speedup, as this is not even double the SIMD-only version (approximate speedup of 4). When testing on a better machine, however, we get a speedup of 11, because that CPU has more cores and can therefore complete more computations at once.
\\
We use the \lstinline{parallel for} pragma because we need to parallelize a for loop. If we parallelized the last for loop instead of the second one, we would split the individual square-root operations among the threads. This can cause huge overhead, and in our tests it actually slows the function down by a factor of ten compared to the scalar version.
\begin{lstlisting}
TStopwatch timerOMP;
for (int ii = 0; ii < NIter; ++ii) {
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; j += Vc::float_v::Size) {
            Vc::float_v& aVec = reinterpret_cast<Vc::float_v&>(a[i][j]);
            Vc::float_v& cVec = reinterpret_cast<Vc::float_v&>(c_omp[i][j]);
            cVec = Vc::sqrt(aVec);
        }
    }
}
\end{lstlisting}
\clearpage
\section*{Exercise 6.6}

While optimizing the code from Assignment 2, we tried an approach using a single 1D vector with relative indexing for each new step. That approach was written in a way that made it impossible to multi-thread, so due to time constraints the OpenMP version uses the slightly less optimized scalar code base instead. The following are the recorded times for the provided \textbf{100x100} \textit{p67\_snark\_loop.txt} file:
\begin{itemize}
\item \textbf{Scalar Implementation from Assignment 2}:
\begin{lstlisting}
Loop took 4183 milliseconds.
\end{lstlisting}
\item \textbf{OpenMP-applied Implementation with Assignment 2 code base}:
\begin{lstlisting}
Loop took 1364 milliseconds.
\end{lstlisting}
\item \textbf{Heavily Optimized Scalar Implementation with 1D Vector approach}:
\begin{lstlisting}
Loop took 1024 milliseconds.
\end{lstlisting}
\end{itemize}
The results show a roughly 3x gain for the OpenMP version, hinting at a lot of scalar overhead. The code was tested with 16 threads on 8 physical cores, so a perfectly parallelizable implementation should be about 8 times as fast. Looking at the code, we can see why that is not the case in this implementation:

\begin{lstlisting}
void Game::evolve() {
    // define variables up front to allow threads to work independently
    size_t rows = board.size();
    if (rows == 0) return;
    size_t cols = board[0].size();

    // Use thread-local flip lists to avoid race conditions
    #pragma omp parallel
    {
        vector<tuple<int, int>> threadFlipList;

        #pragma omp for collapse(2) schedule(static)
        for (size_t i = 0; i < rows; ++i) {
            for (size_t j = 0; j < cols; ++j) {
                // [...] loop contents
            }
        }

        // Safely merge local flip lists into the shared one
        #pragma omp critical
        flipList.insert(
            flipList.end(), threadFlipList.begin(), threadFlipList.end()
        );
    }

    // Flip the marked cells in parallel
    #pragma omp parallel for
    for (size_t k = 0; k < flipList.size(); ++k) {
        int i = std::get<0>(flipList[k]);
        int j = std::get<1>(flipList[k]);
        board[i][j] = !board[i][j];
    }

    secondTolastFlipList = lastFlipList;
    lastFlipList = flipList;
    flipList.clear();

    if (!exercise4Enabled) {
        print();
    }
}
\end{lstlisting}
After initializing the dimensions and other constants used by all threads, the function uses thread-local flip lists inside \texttt{\#pragma omp parallel} to avoid race conditions. We then thread the nested loop with \texttt{\#pragma omp for collapse(2) schedule(static)}, which merges both loops into a single iteration space and splits the iterations evenly between the threads. \texttt{\#pragma omp critical} ensures that only a single thread at a time merges its local flip list into the global one. Finally, the flipping loop is executed in a parallelized manner (\texttt{\#pragma omp parallel for}).\\\\
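The gap between the observed speedup of roughly 3 and the ideal 8 can be framed with Amdahl's law (a standard estimate; the serial fraction below is inferred, not measured): if a fraction $p$ of the runtime is parallelizable over $n$ cores, the speedup is bounded by
\[
S(n) = \frac{1}{(1-p) + \frac{p}{n}}.
\]
For $n = 8$, an observed $S \approx 3$ corresponds to $p \approx 0.76$, i.e. roughly a quarter of the runtime (the critical-section merge and the remaining bookkeeping) stays effectively serial.\\\\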
The different cmake project directories are named \texttt{GameOfLife\_unchanged} (unchanged),\\ \texttt{GameOfLife\_scalar\_optimized} (1D vector scalar optimization for reference) and \texttt{GameOfLife\_omp} (the OMP version). All are prebuilt and are to be executed from the project directory, not the build directory, if the default snark loop file should be used. To rebuild, empty the build folder, run the cmake and make commands, then run the executable from the parent directory.
\end{document}
% hlbpr/assignment7/main.tex
\documentclass[a4paper]{article}
%\usepackage[singlespacing]{setspace}
\usepackage[onehalfspacing]{setspace}
%\usepackage[doublespacing]{setspace}
\usepackage{geometry} % Required for adjusting page dimensions and margins
\usepackage{amsmath,amsfonts,stmaryrd,amssymb,mathtools,dsfont} % Math packages
\usepackage{tabularx}
\usepackage{colortbl}
\usepackage{listings}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{enumerate}
\usepackage{enumitem}
\usepackage{subcaption}
\usepackage{float}
\usepackage[table,xcdraw]{xcolor}
\usepackage{tikz-qtree}
\usepackage{forest}
\usepackage{changepage,titlesec,fancyhdr} % For styling Header and Titles
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0.5pt} % Adjust the line width if desired
\renewcommand{\headrule}{
\makebox[\textwidth]{\rule{1.0\textwidth}{0.5pt}}
}
\usepackage{amsmath}
\pagestyle{fancy}
\usepackage{diagbox}
\usepackage{xfrac}

\usepackage{enumerate} % Custom item numbers for enumerations

\usepackage[ruled]{algorithm2e} % Algorithms

\usepackage[framemethod=tikz]{mdframed} % Allows defining custom boxed/framed environments

\usepackage{listings} % File listings, with syntax highlighting
\lstset{
basicstyle=\ttfamily, % Typeset listings in monospace font
}

\usepackage[ddmmyyyy]{datetime}

\geometry{
paper=a4paper, % Paper size, change to letterpaper for US letter size
top=3cm, % Top margin
bottom=3cm, % Bottom margin
left=2.5cm, % Left margin
right=2.5cm, % Right margin
headheight=25pt, % Header height
footskip=1.5cm, % Space from the bottom margin to the baseline of the footer
headsep=1cm, % Space from the top margin to the baseline of the header
%showframe, % Uncomment to show how the type block is set on the page
}
\lhead{Badan, 7418190\\Kneifel, 8071554}
\chead{\bfseries{\vspace{0.5\baselineskip}HL-BPR Praktikum SS25\\Blatt 06}}
\rhead{Wolf, 8019440\\Werner, 7987847}
\fancyheadoffset[R]{0cm}

\begin{document}
\section*{Exercise 7.1}

Here we use the \lstinline{tbb::parallel_for} function to parallelize our for loop over the index \lstinline{i}, which runs from 0 to N.

\begin{lstlisting}
TStopwatch timerITBB;
for (int ii = 0; ii < NIter; ii++)
    tbb::parallel_for(0, N, [](int i) {
        for (int j = 0; j < N; j += Vc::float_v::Size) {
            Vc::float_v& aVec = reinterpret_cast<Vc::float_v&>(a[i][j]);
            Vc::float_v& cVec = reinterpret_cast<Vc::float_v&>(c_tbb[i][j]);
            cVec = f(aVec);
        }
    });
timerITBB.Stop();
\end{lstlisting}
\section*{Exercise 7.2}
\subsection*{The class}

The class needed for this calculation does not strictly require all 4 constructors, but it is most efficient if they are implemented manually.

\begin{lstlisting}
// default constructor
PiCalc() : sum(0.0) {}

// main constructor
PiCalc(double s_u, double s) :
    sum(s_u), step(s) {}

// copy constructor
PiCalc(const PiCalc& other_sum) : sum(other_sum.sum), step(other_sum.step) {}

// split constructor
PiCalc(const PiCalc& other_sum, tbb::split) : sum(0.0), step(other_sum.step) {}
\end{lstlisting}
3 out of 4 of these constructors are (if not implemented manually) generated automatically by the compiler, since instances of the class cannot be created without them. Only the \lstinline{split constructor} has to be implemented manually, so that TBB can split the calculation across free threads.
Each split-off body is initialized with \lstinline{sum = 0.0} so its local accumulation starts fresh. \lstinline{tbb::parallel_reduce} then automatically uses the \lstinline{join} method to combine the local sums into the total sum (the sum of all local sums).
\subsection*{Parallelizing the loop}

\begin{lstlisting}
PiCalc body(0.0, step);
tbb::parallel_reduce(tbb::blocked_range<int>(0, num_steps), body);
pi = step * body.getSum();
\end{lstlisting}
The parallelization of the loop happens through the \lstinline{parallel_reduce} function. There are different overloads of this function, taking 2, 3 or 4 arguments.

We first define an instance of \lstinline{PiCalc}, as this overload of \lstinline{parallel_reduce} returns void. The function takes two arguments: a \lstinline{range} and a \lstinline{body} (an instance of a class that performs the calculation). The \lstinline{range} is defined with \lstinline{blocked_range}, which lets \lstinline{parallel_reduce} split the whole range into blocks so that no miscalculations between threads can happen. Finally, \lstinline{getSum()} returns the accumulated sum, so the calculation can be finished.

\subsection*{Runtime}

The scalar version (using just one thread) takes around 2626.27\,ms. With the most efficient parallelization the runtime drops to around 764.774\,ms, a speedup of 3.434, so roughly three and a half times faster.
|
||||
\clearpage
|
||||
\section*{Exercise 7.3}
|
||||
After implementing the required parts of the code, we get the following output:
|
||||
\begin{lstlisting}
|
||||
Scalar counter: 489566 Time: 61.3961
|
||||
TBB atomic counter: 489566 Time: 14.3781
|
||||
TBB mutex counter: 489566 Time: 42.5551
|
||||
\end{lstlisting}
|
||||
\subsubsection*{i) Atomic Counter}
\textit{Please explain why an atomic counter is necessary here and explain its behavior.}\\\\
An atomic counter is required in order to avoid race conditions. These can occur when multiple threads try to increment the counter simultaneously, which would lead to wrong results.\\
An atomic counter, however, performs each increment as a single indivisible operation (as the name \textit{atomic} implies). This makes the counting thread-safe without race conditions, and no locking is required.
\subsubsection*{ii) Mutex}
\textit{In which scope do you have to initialize your mutex and why?}\\\\
The mutex must be shared by all threads: it needs to be defined at least in the outermost scope in which the threading happens. This is necessary because all threads require access to the same mutex in order to lock and unlock it. Otherwise two threads might lock different mutexes at the same time while simultaneously incrementing the shared counter, possibly resulting in a race condition.\\\\
\textit{Additionally, why is the mutex locked but never unlocked explicitly?}\\\\
The used implementation is a RAII variant: it locks automatically on object creation and unlocks automatically when the lock object goes out of scope. This avoids some boilerplate code and makes the code safer, since we can no longer double-lock, double-unlock, or forget to unlock in the first place.\\\\
\textbf{Runtime Analysis}\\
Given these explanations, we can analyse our results further.\\
First of all, each implementation returns the same count, meaning all zeros are counted correctly in every variant.\\
Finally, we compare the durations. The scalar approach is the slowest, as expected. The TBB mutex version comes second: the locking and unlocking overhead keeps it well behind the atomic counter, which is roughly 4 times faster than the scalar version.
\end{document}
|
Loading…
Add table
Add a link
Reference in a new issue